Commit 5d64ec7d authored by Marc Feger
Add Critical Assessment
parent fe176caa
\newpage
\section{Conclusion}
Overall, \citet{Rendle19} concludes that the last \textit{five years} of \textit{research} on the \textit{MovieLens10M-dataset} have not really produced any new findings. Although the presented experiment followed the \textit{best practice} of the \textit{community}, the \textit{simplest matrix-factorization} methods were clearly able to beat the reported results. Thus, the authors support the thesis that \textit{finding} and \textit{evaluating valid} and \textit{sharp baselines} is \textit{not trivial}. \textit{Empirical data} have to be collected, since there is \textit{no formal evidence} in the field of \textit{recommender systems} that would make the methods comparable. From the \textit{numerical evaluation} the authors identify the \textit{rating of a work} in a \textit{scientific context} as a \textit{major problem}. Here, a \textit{publication} is classified as \textit{not worth publishing} if it merely achieves \textit{better results with old methods}. Rather, most papers aim to \textit{distinguish themselves} from the others by using new methods that beat the old ones. In this way, \textit{baselines} are \textit{not questioned} and the \textit{community} is steered in the wrong direction, as its work competes against \textit{insufficient baselines}.
This problem was not solved during the \textit{Netflix-Prize} merely because of the \textit{horrendous prize money}. Rather, it turns out that the \textit{insights} gained there are more \textit{profound} and can be transferred to the \textit{MovieLens10M-dataset}. Thus, the \textit{MovieLens10M-dataset} yielded \textit{new techniques} but \textit{no new elementary knowledge}.
\newpage
\section{Critical Assessment}
With this paper, \citet{Rendle19} addresses the highly experienced reader. The paper's simple structure convinces through the clear and direct way in which the problem is identified. Additionally, the paper can be seen as an \textit{addendum} to the \textit{Netflix-Prize}.
The problem addressed by \citet{Rendle19} is already known from other fields such as \textit{information-retrieval} and \textit{machine learning}. For example, \citet{Armstrong09} described the phenomenon observed by \citet{Rendle19} in the context of \textit{information-retrieval systems}, namely that \textit{too weak baselines} are used. He also observes that \textit{experiments} are \textit{misinterpreted} because indicators such as \textit{statistical significance} are \textit{misunderstood}. In addition, \citet{Armstrong09} notes that the \textit{information-retrieval community} lacks an adequate overview of results; in this context, he proposes a central collection of evaluation results, reminiscent of the \textit{Netflix-Leaderboard}. \citet{Lin19} likewise observed the problem of \textit{too weak baselines} in comparisons against \textit{neural networks}. The observation that purely empirical evaluation leads to \textit{too weak baselines} is also not unknown in the field of \textit{recommender systems}. \citet{Ludewig18} already observed the same problem for \textit{session-based recommender systems}. Such systems only work with data generated during a \textit{session} and try to predict the next \textit{user} selection. They also managed to achieve better results using \textit{session-based matrix-factorization}, which was inspired by the work of \citet{Rendle09} and \citet{Rendle10}. The authors attribute the problem to the fact that \textit{scientific work} is spread over \textit{too many datasets} and \textit{different evaluation measures}. In addition, \citet{Dacrema19} take up the problem addressed by \citet{Lin19} and show that \textit{neural approaches} to the \textit{recommender-problem} can also be beaten by the simplest methods. They see the main problem in the \textit{reproducibility} of publications and suggest a \textit{rethinking} of how results are \textit{verified} in this field. Furthermore, they do not refrain from taking a closer look at \textit{matrix-factorization} in this context.
In view of the listed work, it is not unknown that in some subject areas \textit{baselines} are \textit{too weak} and lead to \textit{stagnant development}. Especially when considering that \textit{information-retrieval} and \textit{machine learning} are the \textit{cornerstones} of \textit{recommender systems}, it is not surprising to observe similar phenomena there. Nevertheless, the work published by \citet{Rendle19} stands out from the others. Using the insights gained during the \textit{Netflix-Prize}, it underlines the problem of the \textit{lack of standards} and \textit{unity} for \textit{scientific experiments} that runs through the works mentioned above.
In contrast to those works, \citet{Rendle19} does not only recognize the problem for the \textit{MovieLens10M-dataset} in combination with \textit{matrix-factorization}. Rather, the problem is lifted one level higher. Thus, the paper succeeds in gaining a global and reflective, yet distanced, view of the \textit{best practice} in the field of \textit{recommender systems}.
Besides calling for \textit{uniform standards}, \citet{Rendle19} criticizes the way the \textit{scientific community} thinks. \citet{Rendle19} recognizes the \textit{publication-bias} addressed by \citet{Sterling59}. The so-called \textit{publication-bias} describes the problem that there is a \textit{statistical distortion} of the available evidence within a \textit{scientific topic area}, since only successful or modern papers are published. \citet{Rendle19} clearly abstracts this problem from the presented experiment. The authors see the problem in the fact that a scientific paper is subject to a \textit{pressure to perform} which is based on the \textit{novelty} of such a paper. This thought can also be transferred to the \textit{file-drawer-problem} described by \citet{Rosenthal79}: many \textit{scientists}, out of concern about not meeting \textit{publication standards} such as \textit{novelty} or the question of the \textit{impact on the community}, do not submit their results at all and prefer to \textit{keep them in a drawer}. Although these problems are not directly addressed, they can be abstracted thanks to the detailed presentation. In contrast to the other works, a deliberate or unintentional abstraction and naming of concrete and comprehensible problems is thus achieved.
Nevertheless, criticism must also be made of the work published by \citet{Rendle19}. Despite the high standard of the work, the problems mentioned above can be identified but are not directly addressed by the authors. The work of \citet{Rendle19} even lacks an embedding in the context above. Thus, only the experienced reader who is familiar with the problems addressed by \citet{Armstrong09}, \citet{Sterling59} and \citet{Rosenthal79} becomes aware of the contextual and historical embedding and value of the work. In contrast, \citet{Lin19}, published in the same period, succeeds in embedding itself in the contextual problem and in the previous work. Moreover, it is questionable whether the problem addressed can actually lead to a change in \textit{long-established thinking}, especially if one takes into account that many scientists are also investigating the \textit{transferability} of new methods to the \textit{recommender problem}. Thus, the call for research into \textit{better baselines} must be viewed from two perspectives. On the one hand, \textit{too weak baselines} can lead to a false understanding of new methods. On the other hand, this could merely turn numerical evaluation into a competitive race for the best method, as was the case with the \textit{Netflix-Prize}. However, in the spirit of \citet{Sculley18}, it should always be remembered that \textit{"the goal of science is not wins, but knowledge"}.
As the authors \citet{Rendle} and \citet{Koren} were significantly \textit{involved} in this competition, the points mentioned above gain credibility from the experience they acquired there. With their results they support the simple but not trivial statement that finding good \textit{baselines} requires an \textit{immense effort} and that this has to be \textit{promoted} much more in a \textit{scientific context}. This implies a change in the \textit{long-established thinking} about the evaluation of scientific work. At this point it is questionable whether it is possible to change existing thinking, especially because the scientific sector, unlike the industrial sector, cannot provide financial motivation due to limited resources. On the other hand, the individual focus of each work must also be taken into account. Thus, it is \textit{questionable} whether the \textit{scientific sector} is able to unite behind a \textit{common goal} on the scale that \textit{Netflix} achieved during the competition.
It should be clearly emphasized that it is immensely important to use sharp \textit{baselines} as guidelines. However, in a \textit{scientific context} the \textit{goal} is not as \textit{precisely defined} as it was in the \textit{Netflix-Prize}. Rather, a large part of published work is aimed at investigating whether new methods such as \textit{neural networks} are applicable to the \textit{recommender problem} at all.
Regarding the results, however, it has to be said that they clearly support a \textit{rethinking} even if this should only concern a \textit{small part} of the work.
On the website \textit{Papers with Code}\footnote{\url{https://paperswithcode.com/sota/collaborative-filtering-on-movielens-10m}} the \textit{public leaderboard} regarding the results obtained on the \textit{MovieLens10M-dataset} can be viewed. The source analysis of \textit{Papers with Code} also identifies the results given by \citet{Rendle19} as leading.
In addition, \textit{future work} should focus on a more \textit{in-depth source analysis} which, besides the importance of the \textit{MovieLens10M-dataset} for the \textit{scientific community}, also examines whether and to what extent \textit{other datasets} are affected by this phenomenon.
Due to its recent publication in spring \textit{2019}, this paper has not yet been cited frequently, so time will tell what impact it will have on the \textit{community}. Nevertheless, \citet{Dacrema19} have already observed similar problems for \textit{top-n recommenders} based on this paper. According to this, \citet{Rendle} seems to have recognized an elementary and previously overlooked problem and made it public.
This is strongly reminiscent of the so-called \textit{Artificial-Intelligence-Winter (AI-Winter)}, in which \textit{stagnation} in the \textit{development} of \textit{artificial intelligence} occurred due to inflated expectations and other contributing factors. Overall, the paper has the potential to \textit{counteract} such \textit{stagnation} in development and thus \textit{prevent} a \textit{winter for recommender systems}.
@@ -8,7 +8,7 @@ Each of the \textit{users} in $\mathcal{U}$ gives \textit{ratings} from a set $\
In the following, the two main approaches of \textit{collaborative-filtering} and \textit{content-based} \textit{recommender systems} will be discussed. In addition, it is explained how \textit{matrix-factorization} can be integrated into the two ways of thinking.
\subsection{Content-Based}
\textit{Content-based} \textit{recommender systems (CB)} work directly with \textit{feature vectors}. Such a \textit{feature vector} can, for example, represent a \textit{user profile}. In this case, the \textit{profile} contains information about the \textit{user's preferences}, such as \textit{genres}, \textit{authors}, \textit{etc}. The aim is to create a \textit{model} of the \textit{user} which best represents their preferences. Different \textit{learning algorithms} from the field of \textit{machine learning} are used to learn or create these \textit{models}; the most prominent are \textit{tf-idf}, \textit{Bayesian learning}, \textit{Rocchio's algorithm} and \textit{neural networks} \citep{Lops11, Dacrema19, DeKa11}. Altogether, the built and learned \textit{feature vectors} are compared with each other. Based on their closeness, similar \textit{features} can be used to generate \textit{missing ratings}. Figure \ref{fig:cb} shows a sketch of the general operation of \textit{content-based recommenders}.
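To make the notion of \textit{closeness} concrete, the following is a minimal sketch; the symbols $\mathbf{p}_u$ for the \textit{user-profile vector} and $\mathbf{q}_i$ for the \textit{item feature vector} are introduced here for illustration only and are not taken from the original text. Two such \textit{feature vectors} can, for instance, be compared via their cosine similarity,
\[
\mathrm{sim}(u, i) = \cos(\mathbf{p}_u, \mathbf{q}_i) = \frac{\mathbf{p}_u^{\top}\mathbf{q}_i}{\|\mathbf{p}_u\|\,\|\mathbf{q}_i\|},
\]
so that \textit{items} whose vectors lie close to the \textit{user profile} become candidates for filling in the \textit{missing ratings}.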
\subsection{Collaborative-Filtering}
Unlike the \textit{content-based recommender}, the \textit{collaborative-filtering recommender (CF)} not only considers individual \textit{users} and \textit{feature vectors}, but rather a \textit{like-minded neighborhood} of each \textit{user}.
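As a minimal illustration of how such a \textit{neighborhood} can be used (again, the notation $N(u)$, $\bar{r}_u$ and $\mathrm{sim}(u,v)$ is introduced here purely for illustration and is not part of the original text), a classical \textit{user-based} prediction averages the mean-centered \textit{ratings} of the most similar \textit{users}:
\[
\hat{r}_{ui} = \bar{r}_u + \frac{\sum_{v \in N(u)} \mathrm{sim}(u, v)\,\bigl(r_{vi} - \bar{r}_v\bigr)}{\sum_{v \in N(u)} \bigl|\mathrm{sim}(u, v)\bigr|},
\]
where $N(u)$ denotes the \textit{like-minded neighborhood} of \textit{user} $u$ and $\bar{r}_u$ the average \textit{rating} of that \textit{user}.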
@@ -51,15 +51,6 @@ editor = {P.B. Kantor and F. Ricci and L. Rokach and B. Shapira},
publisher={Springer},
doi = {10.1007/978-0-387-85820-3_4}
}
@article{Kor09,
author = {Yehuda Koren and
Robert Bell and
@@ -189,3 +180,106 @@ journal = {Proceedings of KDD Cup and Workshop}
year={2019},
volume={abs/1911.07698}
}
@inproceedings{Armstrong09,
author = {Armstrong, Timothy and Moffat, Alistair and Webber, William and Zobel, Justin},
title = {Improvements That Don't Add Up: Ad-Hoc Retrieval Results Since 1998},
booktitle = {Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM '09)},
year = {2009},
month = {11},
pages = {601--610},
doi = {10.1145/1645953.1646031}
}
@article{Lin19,
author = {Lin, Jimmy},
title = {The Neural Hype and Comparisons Against Weak Baselines},
year = {2019},
issue_date = {January 2019},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {52},
number = {2},
issn = {0163-5840},
url = {https://doi.org/10.1145/3308774.3308781},
doi = {10.1145/3308774.3308781},
journal = {SIGIR Forum},
month = jan,
pages = {40–51},
numpages = {12}
}
@article{Ludewig18,
author = {Ludewig, Malte and Jannach, Dietmar},
title = {Evaluation of Session-based Recommendation Algorithms},
journal = {CoRR},
volume = {abs/1803.09587},
year = {2018},
url = {http://arxiv.org/abs/1803.09587},
archivePrefix = {arXiv},
eprint = {1803.09587},
timestamp = {Mon, 13 Aug 2018 16:46:25 +0200},
biburl = {https://dblp.org/rec/bib/journals/corr/abs-1803-09587},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
@inproceedings{Rendle09,
author = {Rendle, Steffen and Freudenthaler, Christoph and Gantner, Zeno and Schmidt-Thieme, Lars},
title = {BPR: Bayesian Personalized Ranking from Implicit Feedback},
booktitle = {Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI 2009)},
year = {2009}
}
@inproceedings{Rendle10,
author = {Rendle, Steffen and Freudenthaler, Christoph and Schmidt-Thieme, Lars},
title = {Factorizing Personalized Markov Chains for Next-Basket Recommendation},
year = {2010},
isbn = {9781605587998},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/1772690.1772773},
doi = {10.1145/1772690.1772773},
booktitle = {Proceedings of the 19th International Conference on World Wide Web},
pages = {811–820},
numpages = {10},
keywords = {basket recommendation, markov chain, matrix factorization},
location = {Raleigh, North Carolina, USA},
series = {WWW ’10}
}
@article{Dacrema19,
author = {Dacrema, Maurizio Ferrari and
Cremonesi, Paolo and Jannach, Dietmar},
title = {Are We Really Making Much Progress? {A} Worrying Analysis of Recent
Neural Recommendation Approaches},
journal = {CoRR},
volume = {abs/1907.06902},
year = {2019},
url = {http://arxiv.org/abs/1907.06902},
archivePrefix = {arXiv},
eprint = {1907.06902},
timestamp = {Tue, 23 Jul 2019 10:54:22 +0200},
biburl = {https://dblp.org/rec/bib/journals/corr/abs-1907-06902},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
@article{Sterling59,
ISSN = {01621459},
URL = {http://www.jstor.org/stable/2282137},
abstract = {There is some evidence that in fields where statistical tests of significance are commonly used, research which yields nonsignificant results is not published. Such research being unknown to other investigators may be repeated independently until eventually by chance a significant result occurs-an "error of the first kind"-and is published. Significant results published in these fields are seldom verified by independent replication. The possibility thus arises that the literature of such a field consists in substantial part of false conclusions resulting from errors of the first kind in statistical tests of significance.},
author = {Theodore D. Sterling},
journal = {Journal of the American Statistical Association},
number = {285},
pages = {30--34},
publisher = {American Statistical Association, Taylor \& Francis, Ltd.},
title = {Publication Decisions and Their Possible Effects on Inferences Drawn from Tests of Significance--Or Vice Versa},
volume = {54},
year = {1959}
}
@article{Rosenthal79,
title = {The File Drawer Problem and Tolerance for Null Results},
author = {Rosenthal, Robert},
journal = {Psychological Bulletin},
volume = {86},
number = {3},
pages = {638--641},
year = {1979}
}
@inproceedings{Sculley18,
title={Winner's Curse? On Pace, Progress, and Empirical Rigor},
author={D. Sculley and Jasper Snoek and Alexander B. Wiltschko and Ali Rahimi},
booktitle={ICLR},
year={2018}
}
@@ -76,6 +76,7 @@ A Study on Recommender Systems}
\input{recommender}
\input{baselines}
\input{conclusion}
\input{critical_assessment}
\newpage
\bibliography{references}
\bibliographystyle{plainnat}