diff --git a/baselines.tex b/baselines.tex
index 71e2dfd494c9ec64fcf7d6f9172229c9f40b1eff..20a22216d8486a54a36b9e1358359895bcbba815 100644
--- a/baselines.tex
+++ b/baselines.tex
@@ -3,6 +3,7 @@ This section reviews the \textit{main part} of the work represented by \citet{Re
 \subsection{Motivation and Background}
 As in many other fields of \textit{data science}, a valid \textit{benchmark dataset} is required for the proper execution of experiments. In the field of \textit{recommender systems}, the best-known \textit{datasets} are the \textit{Netflix} and \textit{MovieLens} datasets. This section introduces both \textit{datasets} and shows the relationship of \citet{Koren}, one of the authors of this paper, to the \textit{Netflix Prize}, in addition to the existing \textit{baselines}.
 \subsubsection{Netflix Prize}
+\label{sec:netflix}
 The topic of \textit{recommender systems} was first widely promoted and made known by the \textit{Netflix Prize}. On \textit{October 2nd, 2006}, the competition announced by \textit{Netflix} began with the \textit{goal} of beating Netflix's own \textit{recommender system Cinematch}, which achieved an \textit{RMSE} of \textit{0.9514}, by at least \textit{10\%}. In total, the \textit{Netflix dataset} was divided into three parts that can be grouped into two categories: \textit{training} and \textit{qualification}. In addition to a \textit{probe dataset} provided for \textit{training} the algorithms, two further datasets were held back to qualify the winners. The \textit{quiz dataset} was used to calculate the \textit{score} of the \textit{submitted solutions} on the \textit{public leaderboard}, whereas the \textit{test dataset} was used to determine the \textit{actual winners}. Each of these parts contained around \textit{1,408,000 ratings} and had \textit{similar statistical properties}. Splitting the data in this way ensured that an improvement could not be achieved by \textit{simple hill-climbing algorithms}. It took a total of \textit{three years} and \textit{several hundred models} until the team \textit{``BellKor's Pragmatic Chaos''} was declared the \textit{winner} on \textit{September 21st, 2009}. They had managed to achieve an \textit{RMSE} of \textit{0.8554} and thus an \textit{improvement} of \textit{0.096}. Such a result is remarkable, considering that it took \textit{one year} of work and intensive research to reduce the \textit{RMSE} from \textit{0.8712 (Progress Prize 2007)} to \textit{0.8616 (Progress Prize 2008)}.
@@ -56,5 +57,5 @@ But where do these \textit{weak baselines} come from?
 In response, the authors see two main points. The first is \textit{reproducibility}. This is generally understood to mean the \textit{repetition} of an \textit{experiment} with the aim of \textit{obtaining} the \textit{specified results}. In most cases, the \textit{code} provided by the authors of a paper is taken and checked. Not only during the \textit{Netflix Prize} was this a common way to compare competing methods, improve one's own, and generally arrive at \textit{stronger baselines}. However, the authors do not consider the \textit{simple repetition} of an experiment for the purpose of achieving the same results to be sufficient. Such a \textit{repetition} only provides information about the results achieved by a specific setup; it does not provide deeper insight into the method or its general quality.
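+For reference, all of these comparisons rest on a single error measure. Writing $r_{ui}$ for the known rating of user $u$ for item $i$, $\hat{r}_{ui}$ for the corresponding prediction, and $n$ for the number of held-out ratings, the \textit{RMSE} reported throughout is
+\[
+  \mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{(u,i)}\bigl(r_{ui}-\hat{r}_{ui}\bigr)^{2}},
+\]
+so the \textit{10\%} improvement demanded in the \textit{Netflix Prize} (Section~\ref{sec:netflix}) corresponds to a target \textit{RMSE} of roughly $0.9\cdot 0.9514\approx 0.856$. Reaching such a number, however, says little about why a particular setup works, which is exactly the limitation of mere repetition described above.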
 This is not only a problem of \textit{recommender systems} but rather a general problem in the field of \textit{machine learning}. \textit{Indicators} such as \textit{statistical significance}, \textit{reproducibility} or \textit{hyperparameter search} are often regarded as \textit{proof} of the quality of an experiment. However, they only give information about one particular experiment, which may well have been performed with \textit{non-standard protocols}. The question of whether the method under consideration is applied and configured in a meaningful way is neglected. For example, \textit{statistical significance} is often taken as an \textit{indication} that \textit{method A} \textit{performs better} than \textit{method B}.
 \subsubsection{Inadequate validations}
-The authors do not doubt the relevance of such methods. They even consider them necessary but not meaningful enough for the general goodness of an experiment. Thus, their preparation, which takes up the above mentioned methods, shows that they can achieve meaningful results.
-Therefore the authors see the second point of criticism of the results obtained on the MovieLens10M data set as the wrong understanding of reliable experiments. The main reason given is the difference between scientific and industrial work. For example, during the Netflix Prize, which represents industrial work, audible sums were awarded for the best results. This had several consequences. Firstly, a larger community was addressed to work on the solution of the Recommender problem. On the other hand, the high number of competitors and the simplicity in the formulation of the task encouraged each participant to investigate the simplest methods in small steps. The small-step approach was also driven by the standardized guidelines for the evaluation of the methods given in section XY and by the public competition. Thus, a better understanding of the basic relationships could be achieved through the miniscule evaluation of hundreds of models. All in all, these insights led to well-understood and sharp baselines within a community that continuously worked towards a common goal over a total of three years. Such a motivation and such a target-oriented competitive idea is mostly not available in the scientific field. Thus, publications that achieve better results with old methods are considered unpublishable. Instead, experiments are not questioned and their results are simply transferred. In some cases experiments are repeated exactly as specified in the specifications. Achieving the same result is considered a valid baseline. According to the authors, such an approach is not meaningful and, by not questioning the one-off evaluations, leads to one-hit-wonders that distort the sharpness of the baselines. As a result, the MovieLens10M dataset shows that the main results of the last five years were measured against too weak baselines.
+The authors do not doubt the relevance of such methods. They even consider them \textit{necessary}, but \textit{not meaningful enough} to judge the \textit{overall quality} of an \textit{experiment}. Their own experimental setup, which applies the methods mentioned above, shows that meaningful results can be achieved with them.
+As the second point of criticism regarding the results obtained on the \textit{MovieLens10M dataset}, the authors therefore identify a \textit{flawed understanding} of \textit{reliable experiments}. The \textit{main reason} given is the \textit{difference} between \textit{scientific} and \textit{industrial work}. For example, during the \textit{Netflix Prize}, which represents \textit{industrial work}, \textit{considerable sums} were \textit{awarded} for the best results. This had several consequences. Firstly, a \textit{large community} was attracted to work on the \textit{recommender problem}. Secondly, the high number of \textit{competitors} and the \textit{simplicity} of the \textit{task} formulation encouraged each participant to investigate even the \textit{simplest methods} in \textit{small steps}. This \textit{small-step approach} was further driven by the \textit{standardized guidelines} for the \textit{evaluation} of the methods given in Section~\ref{sec:netflix} and by the \textit{public competition}. A better understanding of the \textit{basic relationships} could thus be achieved through the \textit{meticulous evaluation} of hundreds of models. All in all, these insights led to \textit{well-understood} and \textit{sharp baselines} within a \textit{community} that \textit{continuously} worked towards a \textit{common goal} over a total of three years. Such \textit{motivation} and such a \textit{goal-oriented competitive setting} are mostly absent in the \textit{scientific field}. There, publications that achieve \textit{better results} with \textit{old methods} are considered \textit{unpublishable}. Instead, experiments are \textit{not questioned} and their \textit{results} are \textit{simply carried over}. In some cases experiments are \textit{repeated exactly as specified}, and achieving the \textit{same result} is considered a \textit{valid baseline}. According to the authors, such an approach is \textit{not meaningful} and, by not questioning these \textit{one-off evaluations}, leads to \textit{one-hit wonders} that \textit{distort} the \textit{sharpness} of the \textit{baselines}. As a result, the \textit{MovieLens10M dataset} shows that the main results of the last \textit{five years} were \textit{measured} against too \textit{weak baselines}.
diff --git a/conclusion.tex b/conclusion.tex
index 47c04f12817c0dcfec4390acd518965c9b4ebb00..eb7f631ee93653c3ce623294570727d7c3c781bf 100644
--- a/conclusion.tex
+++ b/conclusion.tex
@@ -1 +1,8 @@
-\section{Conclusion}
\ No newline at end of file
+\newpage
+\section{Conclusion}
+Overall, Rendle et al. (2019) conclude that the last five years of research on the MovieLens10M dataset have not really produced any new findings. Although the presented experiment applied the community's best practices, even the simplest matrix factorization methods clearly beat the previously reported results. The authors thus support the thesis that finding and evaluating valid, sharp baselines is not trivial: since there is no formal proof in the field of recommender systems that makes methods comparable, empirical evidence has to be collected. Based on their numerical evaluation, the authors identify the way work is assessed in a scientific context as a major problem. A publication is classified as not worth publishing if it merely achieves better results with old methods. Instead, most papers aim to distinguish themselves from others by introducing new methods that beat the old ones. In this way, baselines are not questioned and the community is steered in the wrong direction, as its work competes against insufficient baselines. During the Netflix Prize, this problem was solved, and not only through the enormous prize money: the insights gained there turn out to be more profound and can be transferred to the MovieLens10M dataset. Thus, the work on MovieLens10M produced new techniques, but no fundamentally new knowledge.
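+To make concrete what is meant here by the ``simplest matrix factorization methods'', a typical form of such a baseline predicts a rating as
+\[
+  \hat{r}_{ui} = \mu + b_u + b_i + \mathbf{p}_u^{\top}\mathbf{q}_i,
+\]
+where $\mu$ is the global rating mean, $b_u$ and $b_i$ are user and item biases, and $\mathbf{p}_u$, $\mathbf{q}_i$ are $k$-dimensional latent factor vectors learned by minimizing the regularized squared error
+\[
+  \sum_{(u,i)}\bigl(r_{ui}-\hat{r}_{ui}\bigr)^{2}+\lambda\bigl(b_u^{2}+b_i^{2}+\|\mathbf{p}_u\|^{2}+\|\mathbf{q}_i\|^{2}\bigr)
+\]
+over the observed ratings. This is only a generic sketch of the model family; the exact variants, learning algorithms and hyperparameters evaluated by Rendle et al. may differ.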
+With this paper, Rendle et al. address the highly experienced reader. The simple structure of the paper and the clear, direct way in which the problem is identified are convincing. Additionally, the paper can be seen as an addendum to the Netflix Prize. As Rendle and Koren were significantly involved in this competition, the points mentioned above carry the weight of the experience they have gained. With their results they support the simple but non-trivial statement that finding good baselines requires immense effort, and that this effort has to be promoted much more strongly in a scientific context. This implies a change in the long-established thinking about the evaluation of scientific work, and it is questionable whether such a change is possible, especially because the scientific sector, unlike the industrial sector, cannot provide financial motivation due to limited resources. On the other hand, the individual focus of each piece of work must also be taken into account. It is therefore questionable whether the scientific community can unite behind a common goal on the scale that Netflix achieved during the competition.
+It should be clearly emphasized that it is immensely important to use sharp baselines as guidelines. However, in a scientific context the goal is not as precisely defined as it was in the Netflix Prize. Rather, a large part of the work is aimed at investigating whether new methods such as neural networks are applicable to the recommender problem at all.
+Regarding the results, however, it has to be said that they clearly support a rethinking, even if this may only concern a small part of the work. On the website ``Papers with Code'', the public leaderboard of results obtained on the MovieLens10M dataset can be viewed, and its source analysis also identifies the results reported by Rendle as leading.
+Due to its recent publication in spring 2019, this paper has not yet been cited frequently, so time will tell what impact it will have on the community. Nevertheless, XY has already observed similar problems for top-N recommenders based on this paper. Accordingly, Rendle seems to have recognized an elementary and previously overlooked problem and made it public. Overall, the paper has the potential to counteract the general hype, whose sole aim is to develop the best and only true model, and thus to prevent a winter for recommender systems.
+
diff --git a/submission.pdf b/submission.pdf
index 0c2a05059e6d2b6cf53c647bc893601b884a293a..ac4351a8af21bcc219a1ad4dacdbcc0de0772dc0 100644
Binary files a/submission.pdf and b/submission.pdf differ