diff --git a/baselines.tex b/baselines.tex index 852b2ee1001a59dce257943b99415081b496812d..bbfdff1b79a5d9d3179ed9c3561c9b92d31b6733 100644 --- a/baselines.tex +++ b/baselines.tex @@ -1,7 +1,7 @@ \section{On the Difficulty of Evaluating Baselines} This section reviews the main part of the work presented by \citet{Rendle19}. In addition to a detailed description and explanation of the experiments carried out and the observations gained from them, a short introduction is given regarding the driving motivation. \subsection{Motivation and Background} -As in many other fields of \textit{data-science}, a valid \textit{benchmark-dataset} is required for a proper execution of experiments. In the field of \textit{recommender systems}, the best known datasets are the \textit{Netflix-} and \textit{MovieLens-dataset}. This section introduces both datasets and shows the relationship of \citet{Koren}, one of the authors of this paper, to the \textit{Netflix-Prize}, in addition to the existing \textit{baselines}. +As in many other fields of \textit{data-science}, a valid \textit{benchmark-dataset} is required to conduct experiments properly. In the field of \textit{recommender systems}, the best known datasets are the \textit{Netflix-} and \textit{MovieLens-datasets}. This section introduces both datasets and shows the relationship of \citet{Koren}, one of the authors of this paper, to the \textit{Netflix-Prize}, in addition to the existing \textit{baselines}. \subsubsection{Netflix-Prize} \label{sec:netflix} The topic of \textit{recommender systems} was first properly promoted and made known by the \textit{Netflix-Prize}. On \textit{October 2nd, 2006}, the competition announced by \textit{Netflix} began with the \textit{goal} of beating its own \textit{recommender system Cinematch}, which had an \textit{RMSE} of \textit{0.9514}, by at least \textit{10\%}. @@ -10,7 +10,7 @@ It took a total of \textit{three years} and \textit{several hundred models} unti The \textit{co-author} of the present paper, \citet{Koren}, was significantly involved in the work of this team. Since the beginning of the event, \textit{matrix-factorization methods} have been regarded as promising approaches. Even with the simplest \textit{SVD} methods, \textit{RMSE} values of \textit{0.94} could be achieved by \citet{Kurucz07}. The \textit{breakthrough} came with \citet{Funk06}, who achieved an \textit{RMSE} of \textit{0.93} with his \textit{FunkSVD}. Based on this, more and more work was invested in research on simple \textit{matrix-factorization methods}. -Thus, \citet{Zh08} presented an \textit{ALS variant} with an \textit{RMSE} of \textit{0.8985} and \citet{Koren09} presented an \textit{SGD variant} with \textit{RMSE 0.8995}. +Thus, \citet{Zh08} presented an \textit{ALS variant} with an \textit{RMSE} of \textit{0.8985} and \citet{Koren09} presented an \textit{SGD variant} with \textit{RMSE 0.8998}. \textit{Implicit data} were also used. For example, \citet{Koren09} could also achieve an \textit{RMSE} of \textit{0.8762} by extending \textit{SVD++} with a \textit{time variable}. This was then called \textit{timeSVD++}. The \textit{Netflix-Prize} made it clear that even the \textit{simplest methods} are \textit{not trivial} and that a \textit{reasonable investigation} and \textit{evaluation} require an \textit{immense effort} from within the \textit{community}.
@@ -28,20 +28,20 @@ Before actually conducting the experiment, the authors took a closer look at the From the three aspects it can be seen that the models are fundamentally similar and that the main differences arise from different setups and learning procedures. Thus, the authors examined the two learning methods \textit{stochastic gradient descent} and \textit{bayesian learning} in combination with \textit{biased matrix-factorization} before conducting the actual experiment. For $b_u = b_i = 0$ this is equivalent to \textit{regulated matrix-factorization (RSVD)}. In addition, for $\alpha = \beta = 1$ the \textit{weighted regulated matrix-factorization (WR)} is equivalent to \textit{RSVD}. Thus, the only differences are explained by the different configurations of the methods. -To prepare the two learning procedures they were initialized with a \textit{gaussian normal distribution} $\mathcal{N}(\mu, 0.1^2)$. The value for the \textit{standard deviation} of \textit{0.1} is the value suggested by the \textit{factorization machine libFM} as the default. In addition, \citet{Rendle13} achieved good results on the \textit{Netflix-Prize-dataset} with this value. Nothing is said about the parameter $\mu$. However, it can be assumed that this parameter is around the \textit{global average} of the \textit{ratings}. This can be assumed because \textit{ratings} are to be \textit{generated} with the \textit{initialization}. +To prepare the two learning procedures they were initialized with a \textit{gaussian-distribution} $\mathcal{N}(\mu, 0.1^2)$. The value of \textit{0.1} for the \textit{standard deviation} is the default suggested by the \textit{factorization machine library libFM}. In addition, \citet{Rendle13} achieved good results on the \textit{Netflix-Prize-dataset} with this value. Nothing is said about the parameter $\mu$. However, it can be assumed that this parameter is around the \textit{global average} of the \textit{ratings}, since the \textit{initialization} is supposed to \textit{generate} plausible \textit{ratings}. -For both approaches the number of \textit{sampling steps} was then set to \textit{128}. Since \textit{SGD} has two additional \textit{hyperparameters} $\lambda, \gamma$ these were also determined. Overall, the \textit{MovieLens10M-dataset} was evaluated by a \textit{10-fold cross-validation} over a \textit{random global} and \textit{non-overlapping 90:10 split}. In each step, \textit{90\%} of the data was used for \textit{training} and \textit{10\%} of the data was used for \textit{evaluation} without overlapping. In each split, \textit{95\%} of the \textit{training data} was used for \textit{training} and the remaining \textit{5\%} for \textit{evaluation} to determine the \textit{hyperparameters}. The \textit{hyperparameter search} was performed as mentioned in \textit{section} \ref{sec:sgd} using the \textit{grid} $(\lambda \in \{0.02, 0.03, 0.04, 0.05\}, \gamma \in \{0.001, 0.003\})$. This grid was inspired by findings during the \textit{Netflix-Prize} \citep{Kor08, Paterek07}. In total the parameters $\lambda=0.04$ and $\gamma=0.003$ could be determined. Afterwards both \textit{learning methods} and their settings were compared. The \textit{RMSE} was plotted against the used \textit{dimension} $f$ of $p_u, q_i \in \mathbb{R}^f$. \textit{Figure} \ref{fig:battle} shows the corresponding results. +For both approaches the number of \textit{sampling steps} was then set to \textit{128}.
Since \textit{SGD} has two additional \textit{hyperparameters} $\lambda, \gamma$, these were also determined. Overall, the \textit{MovieLens10M-dataset} was evaluated by a \textit{10-fold cross-validation} over a \textit{random global} and \textit{non-overlapping 90:10 split}. In each step, \textit{90\%} of the data was used for \textit{training} and \textit{10\%} of the data was used for \textit{evaluation} without overlapping. In each split, \textit{95\%} of the \textit{training data} was used for \textit{training} and the remaining \textit{5\%} for \textit{evaluation} to determine the \textit{hyperparameters}. The \textit{hyperparameter search} was performed as mentioned in \textit{section} \ref{sec:sgd} using the \textit{grid} $(\lambda \in \{0.02, 0.03, 0.04, 0.05\}, \gamma \in \{0.001, 0.003\})$ and a $64$\textit{-dimensional embedding}. This grid was inspired by findings during the \textit{Netflix-Prize} \citep{Kor08, Paterek07}. In the end, the parameters $\lambda=0.04$ and $\gamma=0.003$ were selected. Afterwards both \textit{learning methods} and their settings were compared. The \textit{RMSE} was plotted against the used \textit{dimension} $f$ of $p_u, q_i \in \mathbb{R}^f$. \textit{Figure} \ref{fig:battle} shows the corresponding results. \input{battle} \newpage As a \textit{first intermediate result} of the preparation it can be stated that both \textit{SGD} and the \textit{gibbs-sampler} achieve better \textit{RMSE} values for increasing \textit{dimensional embedding}. -In addition, it can be stated that learning using the \textit{bayesian approach} is better than learning using \textit{SGD}. Even if the results could be different due to more efficient setups, it is still surprising that \textit{SGD} is worse than the \textit{bayesian approach}, although the \textit{exact opposite} was reported for \textit{MovieLens10M-dataset}. For example, \textit{figure} \ref{fig:reported_results} shows that the \textit{bayesian approach BPMF} achieved an \textit{RMSE} of \textit{0.8187} while the \textit{SGD approach Biased MF} performed better with \textit{0.803}. The fact that the \textit{bayesian approach} outperforms \textit{SGD} has already been reported and validated by \citet{Rendle13}, \citet{Rus08} for the \textit{Netflix-Prize-dataset}. Looking more closely at \textit{figures} \ref{fig:reported_results} and \ref{fig:battle}, the \textit{bayesian approach} scores better than the reported \textit{BPMF} and \textit{Biased MF} for each \textit{dimensional embedding}. Moreover, it even beats all reported \textit{baselines} and new methods. Building on this, the authors have gone into the detailed examination of the methods and \textit{baselines}. +In addition, it can be stated that learning using the \textit{bayesian approach} is better than learning using \textit{SGD}. Even if the results could be different due to more efficient setups, it is still surprising that \textit{SGD} is worse than the \textit{bayesian approach}, although the \textit{exact opposite} was reported for the \textit{MovieLens10M-dataset}. For example, \textit{figure} \ref{fig:reported_results} shows that the \textit{bayesian approach BPMF} achieved an \textit{RMSE} of \textit{0.8197} while the \textit{SGD approach Biased MF} performed better with \textit{0.803}. The fact that the \textit{bayesian approach} outperforms \textit{SGD} has already been reported and validated by \citet{Rendle13} and \citet{Rus08} for the \textit{Netflix-Prize-dataset}.
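To make the evaluation protocol described above more concrete, the following sketch illustrates one fold of the procedure: an outer \textit{90:10 split}, an inner \textit{95:5 split} for the \textit{hyperparameter search} over the stated grid, and a final evaluation on the held-out \textit{10\%}. It is only an illustration of the protocol as summarized in this section, not the authors' code; the training routine \texttt{train\_fn} is a hypothetical placeholder for \textit{biased matrix-factorization} learned with \textit{SGD}.
\begin{verbatim}
import itertools
import numpy as np

def rmse(pred, truth):
    return float(np.sqrt(np.mean((pred - truth) ** 2)))

def one_fold(ratings, train_fn, f=64, epochs=128, seed=0):
    # `ratings` is assumed to be an array of (user, item, rating) rows;
    # `train_fn(data, f, epochs, lam, gamma)` is a hypothetical trainer that
    # returns a model with a .predict(data) method.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(ratings))
    cut = int(0.9 * len(ratings))
    train, test = ratings[idx[:cut]], ratings[idx[cut:]]

    # Inner 95:5 split of the training data for the hyperparameter search.
    inner_cut = int(0.95 * len(train))
    inner_train, inner_val = train[:inner_cut], train[inner_cut:]

    # Grid from the text: lambda in {0.02, 0.03, 0.04, 0.05},
    # gamma in {0.001, 0.003}.
    best = None
    for lam, gamma in itertools.product([0.02, 0.03, 0.04, 0.05],
                                        [0.001, 0.003]):
        model = train_fn(inner_train, f=f, epochs=epochs, lam=lam, gamma=gamma)
        err = rmse(model.predict(inner_val), inner_val[:, 2])
        if best is None or err < best[0]:
            best = (err, lam, gamma)

    # Retrain on the full 90% with the selected hyperparameters and
    # evaluate once on the held-out 10%.
    _, lam, gamma = best
    final = train_fn(train, f=f, epochs=epochs, lam=lam, gamma=gamma)
    return rmse(final.predict(test), test[:, 2]), (lam, gamma)
\end{verbatim}
Repeating this fold ten times with disjoint test partitions corresponds to the \textit{10-fold cross-validation} mentioned above; under such a setup the reported choice of $\lambda=0.04$ and $\gamma=0.003$ would fall out of the inner search.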
Looking more closely at \textit{figures} \ref{fig:reported_results} and \ref{fig:battle}, the \textit{bayesian approach} scores better than the reported \textit{BPMF} and \textit{Biased MF} for each \textit{dimensional embedding}. Moreover, it even beats all reported \textit{baselines} and new methods. Building on this, the authors went on to examine the methods and \textit{baselines} in detail. \subsubsection{Experiment Implementation} -For the actual execution of the experiment, the authors used the knowledge they had gained from the preparations. They noticed already for the two \textit{simple matrix-factorization models SGD-MF} and \textit{Bayesian MF}, which were trained with an \textit{embedding} of \textit{512 dimensions} and over \textit{128 epochs}, that they performed extremely well. Thus \textit{SGD-MF} achieved an \textit{RMSE} of \textit{0.7720}. This result alone was better than: \textit{RSVD (0.8256)}, \textit{Biased MF (0.803)}, \textit{LLORMA (0.7815)}, \textit{Autorec (0.782)}, \textit{WEMAREC (0.7769)} and \textit{I-CFN++ (0.7754)}. In addition, \textit{Bayesian MF} with an \textit{RMSE} of \textit{0.7633} not only beat the \textit{reported baseline BPMF (0.8197)}. It also beat the \textit{best algorithm MRMA (0.7634)}. +For the actual execution of the experiment, the authors used the knowledge they had gained from the preparations. They noticed that already the two \textit{simple matrix-factorization models SGD-MF} and \textit{Bayesian MF}, which were trained with an \textit{embedding} of \textit{512 dimensions} over \textit{128 epochs}, performed extremely well. Thus, \textit{SGD-MF} achieved an \textit{RMSE} of \textit{0.7720}. This result alone was better than: \textit{RSVD (0.8256)}, \textit{Biased MF (0.803)}, \textit{LLORMA (0.7815)}, \textit{I-Autorec (0.782)}, \textit{WEMAREC (0.7769)} and \textit{I-CFN++ (0.7754)}. In addition, \textit{Bayesian MF} with an \textit{RMSE} of \textit{0.7633} not only beat the \textit{reported baseline BPMF (0.8197)} but also the \textit{best reported algorithm MRMA (0.7634)}. As the \textit{Netflix-Prize} showed, the use of \textit{implicit data} such as \textit{time} or \textit{dependencies} between \textit{users} or \textit{items} could immensely improve existing models. In addition to the two \textit{simple matrix factorizations}, \textit{table} \ref{table:models} shows the extensions of the authors regarding the \textit{bayesian approach}. \input{model_table} -As it turned out that the \textit{bayesian approach} gave more promising results, the given models were trained with it. For this purpose, the \textit{dimensional embedding} as well as the \textit{number of sampling steps} for the models were examined again. Again the \textit{gaussian normal distribution} was used for \textit{initialization} as indicated in \textit{section} \ref{sec:experiment_preparation}. \textit{Figure} \ref{fig:bayes_evaluation} shows the corresponding results. +Since the \textit{bayesian approach} turned out to give more promising results, the given models were trained with it. For this purpose, the \textit{dimensional embedding} as well as the \textit{number of sampling steps} for the models were examined again. Again the \textit{gaussian-distribution} was used for \textit{initialization} as indicated in \textit{section} \ref{sec:experiment_preparation}. \textit{Figure} \ref{fig:bayes_evaluation} shows the corresponding results.
\input{bayes_evaluation} \subsection{Observations} @@ -54,8 +54,8 @@ From \textit{figure} \ref{fig:corrected_results} the \textit{improved baselines} \input{corrected_results} \subsubsection{Reproducibility} But where do these \textit{weak baselines} come from? -In response, the authors see two main points. The first is \textit{reproducibility}. This is generally understood to mean the \textit{repetition} of an \textit{experiment} with the aim of \textit{obtaining} the \textit{specified results}. In most cases, the \textit{code} of the authors of a paper is taken and checked. Not only during the \textit{Netflix-Prize}, this was a common method to compare competing methods, improve one's own and generally achieve \textit{stronger baselines}. However, the authors do not consider the \textit{simple repetition} of the experiment for the purpose of achieving the same results to be appropriate. Thus, the \textit{repetition} of the experiment only provides information about the results achieved by a specific setup. However, it does not provide deeper insights into the method, nor into its general quality. This is not only a problem of \textit{recommender-systems} but rather a general problem in the field of \textit{machine learning}. Thus, \textit{indicators} such as \textit{statistical significance}, \textit{reproducibility} or \textit{hyperparameter search} are often regarded as \textit{proof} of the quality of an experiment. But they only give information about a certain experiment, which could be performed with \textit{non-standard protocols}. The question of whether the method being used is applied and configured in a meaningful way is neglected. Thus, \textit{statistical significance} is often taken as an \textit{indication} that \textit{method A} \textit{performs better} than \textit{method B}. +In response, the authors see two main points. The first is \textit{reproducibility}. This is generally understood to mean the \textit{repetition} of an \textit{experiment} with the aim of \textit{obtaining} the \textit{specified results}. In most cases, the \textit{code} of the authors of a paper is taken and checked. Not only during the \textit{Netflix-Prize} was this a common way to compare competing methods, improve one's own and generally achieve \textit{stronger baselines}. However, the authors do not consider the \textit{simple repetition} of the experiment for the purpose of achieving the same results to be appropriate. After all, the \textit{repetition} of the experiment only provides information about the results achieved by a specific setup. However, it does not provide deeper insights into the method, nor into its general quality. This is not only a problem of \textit{recommender systems} but rather a general problem in the field of \textit{machine learning}. Thus, \textit{indicators} such as \textit{statistical significance}, \textit{reproducibility} or \textit{hyperparameter search} are often regarded as \textit{proof} of the quality of an experiment. But they only give information about a certain experiment, which could be performed with \textit{non-standard protocols}. The question of whether the method being used is applied and configured in a meaningful way is neglected. Thus, \textit{statistical significance} is often taken as an \textit{indication} that \textit{method A} \textit{performs better} than \textit{method B}.
They even consider them \textit{necessary} but \textit{not meaningful enough} to judge the \textit{overall quality} of an \textit{experiment}. Thus, their preparation, which takes up the above-mentioned methods, shows that these can achieve meaningful results. -Therefore the authors see the second point of criticism of the results obtained on the \textit{MovieLens10M-dataset} as the \textit{wrong understanding} of \textit{reliable experiments}. The \textit{main reason} given is the \textit{difference} between \textit{scientific} and \textit{industrial work}. For example, during the\textit{ Netflix-Prize}, which represents \textit{industrial work}, \textit{audible sums} were \textit{awarded} for the best results. This had several consequences. Firstly, a \textit{larger community} was addressed to work on the solution of the \textit{recommender problem}. On the other hand, the high number of \textit{competitors} and the \textit{simplicity} in the formulation of the task encouraged each participant to investigate the \textit{simplest methods} in \textit{small steps}. The \textit{small-step approach} was also driven by the \textit{standardized guidelines} for the \textit{evaluation} of the methods given in \textit{section} \ref{sec:netflix} and by the \textit{public competition}. Thus, a better understanding of the \textit{basic relationships} could be achieved through the \textit{miniscule evaluation} of hundreds of models. All in all, these insights led to \textit{well-understood} and \textit{sharp baselines} within a \textit{community} that \textit{continuously} worked towards a \textit{common goal} over a total of \textit{three years}. Such a \textit{motivation} and such a \textit{target-oriented competitive idea} is mostly not available in the \textit{scientific field}. Thus, publications that achieve \textit{better results} with \textit{old methods} are considered \textit{unpublishable}. Instead, experiments are \textit{not questioned} and their \textit{results} are \textit{simply transferred}. In some cases experiments are \textit{repeated exactly as specified} in the instructions. Achieving the \textit{same result} is considered a \textit{valid baseline}. According to the authors, such an approach is \textit{not meaningful} and, by not questioning the \textit{one-off evaluations}, leads to \textit{one-hit-wonders} that \textit{distort} the \textit{sharpness} of the \textit{baselines}. Therefore, the \textit{MovieLens10M-dataset} shows that the main results of the last \textit{five years} were \textit{measured} against too \textit{weak baselines}. +Therefore, the authors see the second point of criticism of the results obtained on the \textit{MovieLens10M-dataset} in a \textit{wrong understanding} of \textit{reliable experiments}. The \textit{main reason} given is the \textit{difference} between \textit{scientific} and \textit{industrial work}. For example, during the \textit{Netflix-Prize}, which represents \textit{industrial work}, \textit{considerable sums} were \textit{awarded} for the best results. This had several consequences. Firstly, a \textit{larger community} was drawn to work on the \textit{recommender problem}. Secondly, the high number of \textit{competitors} and the \textit{simplicity} of the task formulation encouraged each participant to investigate the \textit{simplest methods} in \textit{small steps}.
The \textit{small-step approach} was also driven by the \textit{standardized guidelines} for the \textit{evaluation} of the methods given in \textit{section} \ref{sec:mf} and by the \textit{public competition}. Thus, a better understanding of the \textit{basic relationships} could be achieved through the \textit{meticulous evaluation} of hundreds of models. All in all, these insights led to \textit{well-understood} and \textit{sharp baselines} within a \textit{community} that \textit{continuously} worked towards a \textit{common goal} over a total of \textit{three years}. Such a \textit{motivation} and such a \textit{target-oriented competitive idea} are mostly absent in the \textit{scientific field}. Thus, publications that achieve \textit{better results} with \textit{old methods} are considered \textit{unpublishable}. Instead, experiments are \textit{not questioned} and their \textit{results} are \textit{simply transferred}. In some cases experiments are \textit{repeated exactly as specified} in the instructions. Achieving the \textit{same result} is considered a \textit{valid baseline}. According to the authors, such an approach is \textit{not meaningful} and, by not questioning the \textit{one-off evaluations}, leads to \textit{one-hit-wonders} that \textit{distort} the \textit{sharpness} of the \textit{baselines}. Therefore, the results on the \textit{MovieLens10M-dataset} show that the main results of the last \textit{five years} were \textit{measured} against too \textit{weak baselines}. diff --git a/critical_assessment.tex b/critical_assessment.tex index cbf577b80d507f7e387a7873556e72395174ce12..8341d9d80ae8851c968387963db9640deb3ab0a9 100644 --- a/critical_assessment.tex +++ b/critical_assessment.tex @@ -2,13 +2,13 @@ \section{Critical Assessment} With this paper \citet{Rendle19} addresses the highly experienced reader. The simple structure of the paper is convincing due to the clear and direct way in which the problem is identified. Additionally, the paper can be seen as an \textit{addendum} to the \textit{Netflix-Prize}. -The problem addressed by \citet{Rendle19} is already known from other topics like \textit{information-retrieval} and \textit{machine learning}. For example, \citet{Armstrong09} described the phenomenon observed by \citet{Rendle19} in the context of \textit{information-retrieval systems} that too \textit{weak baselines} are used. He also sees that \textit{experiments} are \textit{misinterpreted} by giving \textit{misunderstood indicators} such as \textit{statistical significance}. In addition, \citet{Armstrong09} also sees that the \textit{information-retrieval community} lacks an adequate overview of results. In this context, he proposes a collection of works that is reminiscent of the \textit{Netflix-Leaderboard}. \citet{Lin19} also observed the problem of \textit{baselines} for \textit{neural-networks} that are \textit{too weak}. Likewise, the actual observation that \textit{too weak baselines} exist due to empirical evaluation is not unknown in the field of \textit{recommender systems}. \citet{Ludewig18} already observed the same problem for \textit{session-based recommender systems}. Such systems only work with data generated during a \textit{session} and try to predict the next \textit{user} selection. They also managed to achieve better results using \textit{session-based matrix-factorization}, which was inspired by the work of \citet{Rendle09} and \citet{Rendle10}.
The authors see the problem in the fact that there are \textit{too many datasets} and \textit{different measures} of evaluation for \textit{scientific work}. In addition, \citet{Dacrema19} take up the problem addressed by \citet{Lin19} and shows that \textit{neural approaches} to solving the \textit{recommender-problem} can also be beaten by simplest methods. They see the main problem in the \textit{reproducibility} of publications and suggest a \textit{rethinking} in the \textit{verification} of results in this field of work. Furthermore, they do not refrain from taking a closer look at \textit{matrix-factorization} in this context. +The problem addressed by \citet{Rendle19} is already known from other topics like \textit{information-retrieval} and \textit{machine learning}. For example, \citet{Armstrong09} described the same phenomenon of too \textit{weak baselines} being used in the context of \textit{information-retrieval systems}. He also sees that \textit{experiments} are \textit{misinterpreted} by giving \textit{misunderstood indicators} such as \textit{statistical significance}. In addition, \citet{Armstrong09} also sees that the \textit{information-retrieval community} lacks an adequate overview of results. In this context, he proposes a collection of works that is reminiscent of the \textit{Netflix-Leaderboard}. \citet{Lin19} also observed the problem of \textit{baselines} for \textit{neural-networks} that are \textit{too weak}. Likewise, the actual observation that \textit{too weak baselines} exist due to empirical evaluation is not unknown in the field of \textit{recommender systems}. \citet{Ludewig18} already observed the same problem for \textit{session-based recommender systems}. Such systems only work with data generated during a \textit{session} and try to predict the next \textit{user} selection. They also managed to achieve better results using \textit{session-based matrix-factorization}, which was inspired by the work of \citet{Rendle09} and \citet{Rendle10}. The authors see the problem in the fact that there are \textit{too many datasets} and \textit{different measures} of evaluation for \textit{scientific work}. In addition, \citet{Dacrema19} take up the problem addressed by \citet{Lin19} and show that \textit{neural approaches} to solving the \textit{recommender-problem} can also be beaten by the simplest methods. They see the main problem in the \textit{reproducibility} of publications and suggest a \textit{rethinking} in the \textit{verification} of results in this field of work. Furthermore, they do not refrain from taking a closer look at \textit{matrix-factorization} in this context. Compared to the listed work, it is not unknown that in some subject areas \textit{baselines} are \textit{too weak} and lead to \textit{stagnant development}. Especially when considering that \textit{information-retrieval} and \textit{machine learning} are the \textit{cornerstones} of \textit{recommender systems}, it is not surprising to observe similar phenomena. Nevertheless, the work published by \citet{Rendle19} clearly stands out from the others. Using the insights gained during the \textit{Netflix-Prize}, it underlines the problem of the \textit{lack of standards} and \textit{unity} for \textit{scientific experiments} in the works mentioned above. In contrast to them, it does not merely recognize the problem for the \textit{MovieLens10M-dataset} in combination with \textit{matrix-factorization}.
Rather, the problem is brought one level higher. Thus, it succeeds in gaining a global and reflected but still distanced view of the \textit{best practice} in the field of \textit{recommender systems}. Besides calling for \textit{uniform standards}, \citet{Rendle19} criticizes the way the \textit{scientific community} thinks. \citet{Rendle19} recognizes the \textit{publication-bias} addressed by \citet{Sterling59}. The so-called \textit{publication-bias} describes the problem that there is a \textit{statistical distortion} of the published results within a \textit{scientific topic area}, since only successful or novel papers are published. \citet{Rendle19} clearly abstracts this problem from the presented experiment. The authors see the problem in the fact that a scientific paper is subject to a \textit{pressure to perform} which is based on the \textit{novelty} of such a paper. This thought can be transferred to the \textit{file-drawer-problem} described by \citet{Rosenthal79}. This describes the problem that many \textit{scientists}, out of concern about not meeting the \textit{publication standards} such as \textit{novelty} or the question of the \textit{impact on the community}, do not submit their results at all and prefer to \textit{keep them in a drawer}. Although the problems mentioned above are not directly addressed, they can be abstracted due to the detailed presentation. In this way, in contrast to the other works, an intended or unintended abstraction and naming of concrete and comprehensible problems is achieved. -Nevertheless, criticism must also be made of the work published by \citet{Rendle19}. Despite the high standard of the work, it must be said that the problems mentioned above can be identified but are not directly addressed by the authors. The work of \citet{Rendle19} even lacks an embedding in the context above. Thus, the experienced reader who is familiar with the problems addressed by \citet{Armstrong09}, \citet{Sterling59} and \citet{Rosenthal79} becomes aware of the contextual and historical embedding and value of the work. In contrast, \citet{Lin19}, published in the same period, succeeds in this embedding in the contextual problem and in the previous work. Moreover, it is questionable whether the problem addressed can actually lead to a change in \textit{long-established thinking}. Especially if one takes into account that many scientists are also investigating the \textit{transferability} of new methods to the \textit{recommender problem}. Thus, the call for research into \textit{better baselines} must be viewed from two perspectives. On the one hand, it must be noted that \textit{too weak baselines} can lead to a false understanding of new methods. On the other hand, it must also be noted that this could merely trigger the numerical evaluation in a competitive process to find the best method, as was the case with the \textit{Netflix-Prize}. However, in the spirit of \citet{Sculley18}, it should always be remembered that: \textit{"the goal of science is not wins, but knowledge"}. +Nevertheless, criticism must also be made of the work published by \citet{Rendle19}. Despite the high standard of the work, it must be said that the problems mentioned above can be identified but are not directly addressed by the authors. The work of \citet{Rendle19} even lacks an embedding in the context outlined above.
Thus, only the experienced reader who is familiar with the problems addressed by \citet{Armstrong09}, \citet{Sterling59} and \citet{Rosenthal79} becomes aware of the contextual and historical embedding and value of the work. In contrast, \citet{Lin19} and \citet{Dacrema19}, published in the same period, succeed in embedding their work in this contextual problem and in the previous literature. Moreover, it is questionable whether the problem addressed can actually lead to a change in \textit{long-established thinking}, especially if one takes into account that many scientists are also investigating the \textit{transferability} of new methods to the \textit{recommender problem}. Thus, the call for research into \textit{better baselines} must be viewed from two perspectives. On the one hand, it must be noted that \textit{too weak baselines} can lead to a false understanding of new methods. On the other hand, it must also be noted that this could merely trigger the numerical evaluation in a competitive process to find the best method, as was the case with the \textit{Netflix-Prize}. However, in the spirit of \citet{Sculley18}, it should always be remembered that \textit{``the goal of science is not wins, but knowledge''}. As the authors \citet{Rendle} and \citet{Koren} were significantly \textit{involved} in this competition, the points mentioned above are convincing given the experience they have gained. With their results they support the very simple but not trivial statement that finding good \textit{baselines} requires an \textit{immense effort} and that this has to be \textit{promoted} much more in a \textit{scientific context}. This implies a change in the \textit{long-established thinking} about the evaluation of scientific work. At this point it is questionable whether it is possible to change existing thinking. This should be considered especially because the scientific sector, unlike the industrial sector, cannot provide financial motivation due to limited resources. On the other hand, the individual focus of each work must also be taken into account. Thus, it is \textit{questionable} whether the \textit{scientific sector} is able to create such a large, unified community working towards a \textit{common goal} as \textit{Netflix} did during the competition. It should be clearly emphasized that it is immensely important to use sharp \textit{baselines} as guidelines. However, in a \textit{scientific context} the \textit{goal} is not as \textit{precisely defined} as it was in the \textit{Netflix-Prize}. Rather, a large part of the work is aimed at investigating whether new methods such as \textit{neural networks} etc. are applicable to the \textit{recommender problem}. @@ -16,7 +16,9 @@ Regarding the results, however, it has to be said that they clearly support a \t On the website \textit{Papers with Code}\footnote{\url{https://paperswithcode.com/sota/collaborative-filtering-on-movielens-10m}} the \textit{public leaderboard} regarding the results obtained on the \textit{MovieLens10M-dataset} can be viewed. The source analysis of \textit{Papers with Code} also identifies the results given by \citet{Rendle19} as leading. In addition, \textit{future work} should focus on a more \textit{in-depth source analysis} which, besides the importance of the \textit{MovieLens10M-dataset} for the \textit{scientific community}, also examines whether and to what extent \textit{other datasets} are affected by this phenomenon.
-Due to the recent publication in spring \textit{2019}, this paper has not yet been cited frequently. So time will tell what impact it will have on the \textit{community}. Nevertheless, \citet{Dacrema2019} has already observed similar problems for \textit{top-n-recommender} based on this paper. According to this, \citet{Rendle} seems to have recognized an elementary and unseen problem and made it public. +Due to the recent publication in spring \textit{2019}, this paper has not yet been cited frequently. So time will tell what impact it will have on the \textit{community}. +Nevertheless, \citet{Dacrema2019} were able to base their own work on this article and expand it. +According to this, \citet{Rendle} seems to have recognized an elementary and unseen problem and made it public. This is strongly reminiscent of the so-called \textit{Artificial-Intelligence-Winter (AI-Winter)} in which \textit{stagnation} in the \textit{development} of \textit{artificial intelligence} occurred due to too high expectations and other contributing factors. Overall the paper has the potential to \textit{counteract} the \textit{stagnation} in development and thus \textit{prevent} a \textit{winter for recommender systems}. diff --git a/recommender.tex b/recommender.tex index c843fda42d539bba98b5e8cbc8e00b6402b29f9f..4dd0e7575c1e3d71305483f1256ab0b75bc7e089 100644 --- a/recommender.tex +++ b/recommender.tex @@ -26,14 +26,14 @@ The core idea of \textit{matrix-factorization} is to supplement the not complete In the following, the four most classical \textit{matrix-factorization} approaches are described in detail. Afterwards, the concrete learning methods with which the vectors are learned are presented. In addition, the \textit{training data} for which a \textit{concrete rating} is available should be referred to as $\mathcal{B} = \lbrace(u,i) | r_{ui} \in \mathcal{R}\rbrace$. \subsubsection{Basic Matrix-Factorization} -The first and easiest way to solve \textit{matrix-factorization} is to connect the \textit{feature vectors} of the \textit{users} and the \textit{items} using the \textit{inner product}. The result is the \textit{user-item interaction}. In addition, the \textit{error} should be as small as possible. Therefore, $\min_{p_u, q_i}{\sum_{(u,i) \in \mathcal{B}} (r_{ui} - \hat{r}_{ui})^{2}}$ is defined as an associated \textit{minimization problem}. +The first and easiest way to solve \textit{matrix-factorization} is to combine the \textit{feature vectors} of the \textit{users} and the \textit{items} using the \textit{inner product}. The result is the \textit{user-item interaction}. In addition, the \textit{error} should be as small as possible. Therefore, $\min_{p_u, q_i}{\sum_{(u,i) \in \mathcal{B}} (r_{ui} - \hat{r}_{ui})^{2}}$ is defined as an associated \textit{minimization problem} \citep{Kor09}. \subsubsection{Regulated Matrix-Factorization}\label{subsec:rmf} This approach extends the \textit{basic matrix-factorization} by a \textit{regulation factor} $\lambda$ in the corresponding \textit{minimization problem}. Since $\mathcal{R}$ is only sparsely populated, the effect of \textit{overfitting} may occur due to learning from the few known values. The problem with \textit{overfitting} is that the generated \textit{ratings} are fitted too closely to these few known values. To counteract this, the magnitudes of the aforementioned vectors are taken into account. High magnitudes are punished by a factor $\lambda(\lVert q_i \rVert^2 + \lVert p_u \rVert^2)$ in the \textit{minimization problem}.
Overall, the \textit{minimization problem} $\min_{p_u, q_i}{\sum_{(u,i) \in \mathcal{B}} (r_{ui} - \hat{r}_{ui})^{2}} + \lambda(\lVert q_i \rVert^2 + \lVert p_u \rVert^2)$ is to be solved. -The idea is that especially large entries in $q_i$ or $p_u$ cause $\lVert q_i \rVert, \lVert p_u \rVert$ to become larger. Accordingly, $\lVert q_i \rVert$ and $\lVert p_u \rVert$ increases the larger its entries become. This value is then additionally punished by squaring it. Small values are rewarded and large values are penalized. Additionally the influence of this value can be regulated by $\lambda$. +The idea is that especially large entries in $q_i$ or $p_u$ cause $\lVert q_i \rVert, \lVert p_u \rVert$ to become larger. Accordingly, $\lVert q_i \rVert$ and $\lVert p_u \rVert$ increase the larger their entries become. This value is then additionally punished by squaring it. Small values are rewarded and large values are penalized. Additionally, the influence of this value can be regulated by $\lambda$ \citep{Kor09}. \subsubsection{Weighted Regulated Matrix-Factorization} -A \textit{regulation factor} $\lambda$ is introduced in analogy to \textit{regulated matrix-factorization}. Additional \textit{weights} $\alpha$ and $\beta$ are introduced to take into account the individual magnitude of a vector. The \textit{minimization problem} then corresponds to $\min_{p_u, q_i}{\sum_{(u,i) \in \mathcal{B}} (r_{ui} - \hat{r}_{ui})^{2}} + \lambda(\alpha\lVert q_i \rVert^2 + \beta\lVert p_u \lVert^2)$. +The \textit{weighted regulated matrix-factorization} builds on the \textit{regulated matrix-factorization}. Additional \textit{weights} $\alpha$ and $\beta$ are introduced to take into account the individual magnitude of each vector. The \textit{minimization problem} then corresponds to $\min_{p_u, q_i}{\sum_{(u,i) \in \mathcal{B}} (r_{ui} - \hat{r}_{ui})^{2}} + \lambda(\alpha\lVert q_i \rVert^2 + \beta\lVert p_u \rVert^2)$ \citep{Zh08}. \subsubsection{Biased Matrix-Factorization}\label{subsec:bmf} A major advantage of \textit{matrix-factorization} is the ability to model simple relationships according to the application. However, an excellent data source cannot always be assumed. Due to the \textit{natural interaction} of the \textit{users} with the \textit{items}, \textit{preferences} arise. Such \textit{preferences} lead to \textit{behaviour patterns} which manifest themselves in the form of a \textit{bias} in the data. A \textit{bias} is not bad overall, but it must be taken into account when modeling the \textit{recommender system}. @@ -42,7 +42,7 @@ In addition, the \textit{missing rating} is no longer determined only by the \te Furthermore, $b_u = \mu_u - \mu$ and $b_i = \mu_i - \mu$. Here $\mu_u$ denotes the \textit{average} of all \textit{assigned ratings} of the \textit{user} $u$. Similarly, $\mu_i$ denotes the \textit{average} of all \textit{received ratings} of an \textit{item} $i$. Thus $b_u$ indicates the \textit{deviation} of the \textit{average assigned rating} of a \textit{user} from the \textit{global average}. Similarly, $b_i$ indicates the \textit{deviation} of the \textit{average rating} of an \textit{item} from the \textit{global average}. For example, a \textit{user} with an \textit{average assigned rating} of $\mu_u = 4.2$ on a dataset with a \textit{global average} of $\mu = 3.5$ has a \textit{bias} of $b_u = 0.7$. -In addition, the \textit{minimization problem} can be extended by the \textit{bias}. Accordingly, the \textit{minimization problem} is then $\min_{p_u, q_i}{\sum_{(u,i) \in \mathcal{B}} (r_{ui} - \hat{r}_{ui})^{2}} + \lambda(\lVert q_i \rVert^2 + \lVert p_u \lVert^2 + b_u^2 + b_i^2)$.
Analogous to the \textit{regulated matrix-factorization}, the values $b_u$ and $b_i$ are penalized in addition to $\lVert q_i \rVert, \lVert p_u \rVert$. In this case $b_u, b_i$ are penalized more if they assume a large value and thus deviate strongly from the \textit{global average}. +In addition, the \textit{minimization problem} can be extended by the \textit{bias}. Accordingly, the \textit{minimization problem} is then $\min_{p_u, q_i}{\sum_{(u,i) \in \mathcal{B}} (r_{ui} - \hat{r}_{ui})^{2}} + \lambda(\lVert q_i \rVert^2 + \lVert p_u \rVert^2 + b_u^2 + b_i^2)$. Analogous to the \textit{regulated matrix-factorization}, the values $b_u$ and $b_i$ are penalized in addition to $\lVert q_i \rVert, \lVert p_u \rVert$. In this case $b_u, b_i$ are penalized more if they assume a large value and thus deviate strongly from the \textit{global average} \citep{Kor09}. \subsubsection{Advanced Matrix-Factorization}\label{subsec:amf} This section is intended to show that there are \textit{other approaches} to \textit{matrix-factorization}. diff --git a/references.bib b/references.bib index 5d03be60ba282ee8674adde31ee730145b78fdec..99164fe2e2e608cc433f3f9f3d8b9b469ad581df 100644 --- a/references.bib +++ b/references.bib @@ -173,6 +173,21 @@ pages = {}, title = {Improving regularized singular value decomposition for collaborative filtering}, journal = {Proceedings of KDD Cup and Workshop} } +@article{Dacrema19, + author = {Dacrema, Maurizio Ferrari and + Cremonesi, Paolo and Jannach, Dietmar}, + title = {Are We Really Making Much Progress? {A} Worrying Analysis of Recent + Neural Recommendation Approaches}, + journal = {CoRR}, + volume = {abs/1907.06902}, + year = {2019}, + url = {http://arxiv.org/abs/1907.06902}, + archivePrefix = {arXiv}, + eprint = {1907.06902}, + timestamp = {Tue, 23 Jul 2019 10:54:22 +0200}, + biburl = {https://dblp.org/rec/bib/journals/corr/abs-1907-06902}, + bibsource = {dblp computer science bibliography, https://dblp.org} +} @article{Dacrema2019, title={A Troubling Analysis of Reproducibility and Progress in Recommender Systems Research}, author={Maurizio Ferrari Dacrema and Simone Boglio and Paolo Cremonesi and Dietmar Jannach}, @@ -242,21 +257,6 @@ keywords = {basket recommendation, markov chain, matrix factorization}, location = {Raleigh, North Carolina, USA}, series = {WWW ’10} } -@article{Dacrema19, - author = {Dacrema, Maurizio Ferrari and - Cremonesi Paolo and Jannach Dietmar}, - title = {Are We Really Making Much Progress? {A} Worrying Analysis of Recent - Neural Recommendation Approaches}, - journal = {CoRR}, - volume = {abs/1907.06902}, - year = {2019}, - url = {http://arxiv.org/abs/1907.06902}, - archivePrefix = {arXiv}, - eprint = {1907.06902}, - timestamp = {Tue, 23 Jul 2019 10:54:22 +0200}, - biburl = {https://dblp.org/rec/bib/journals/corr/abs-1907-06902}, - bibsource = {dblp computer science bibliography, https://dblp.org} -} @article{Sterling59, ISSN = {01621459}, URL = {http://www.jstor.org/stable/2282137}, diff --git a/submission.pdf b/submission.pdf index 4c8e6f1a2a560f359284396d620d315374eab0bd..688a3d035bc0e6005a20bba8de28a6ab026e4658 100644 Binary files a/submission.pdf and b/submission.pdf differ
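To connect the \textit{biased matrix-factorization} objective from recommender.tex above with the \textit{SGD} learning procedure and the \textit{gaussian} initialization $\mathcal{N}(\mu, 0.1^2)$ discussed in the experiment preparation, the following minimal sketch shows the standard \textit{SGD} update rules in the style of \citet{Kor09}. It is only an illustration under the assumptions stated in the comments, not the implementation used by \citet{Rendle19}; in particular, a zero-mean initialization of the factors is assumed here, since the \textit{global average} already enters the prediction through $\mu$.
\begin{verbatim}
import numpy as np

def train_biased_mf_sgd(ratings, n_users, n_items, f=64, epochs=128,
                        lam=0.04, gamma=0.003, seed=0):
    # `ratings` is assumed to be a list of (u, i, r) triples with integer
    # user/item indices. The update rules follow the standard biased
    # matrix-factorization formulation; they are an illustration, not
    # necessarily the exact setup of the paper.
    rng = np.random.default_rng(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global rating average
    # Gaussian initialization with standard deviation 0.1 as in the text;
    # a zero mean for the factors is an assumption made here.
    p = rng.normal(0.0, 0.1, size=(n_users, f))
    q = rng.normal(0.0, 0.1, size=(n_items, f))
    b_u = np.zeros(n_users)
    b_i = np.zeros(n_items)

    for _ in range(epochs):
        for u, i, r in ratings:
            pred = mu + b_u[u] + b_i[i] + q[i] @ p[u]
            e = r - pred                        # prediction error
            # SGD steps on the regularized squared error.
            b_u[u] += gamma * (e - lam * b_u[u])
            b_i[i] += gamma * (e - lam * b_i[i])
            p_old = p[u].copy()
            p[u] += gamma * (e * q[i] - lam * p[u])
            q[i] += gamma * (e * p_old - lam * q[i])
    return mu, b_u, b_i, p, q
\end{verbatim}
For $b_u = b_i = 0$ (i.e., skipping the two bias updates) the same sketch reduces to the \textit{regulated matrix-factorization (RSVD)} mentioned in the experiment preparation, and a routine like this could serve as the hypothetical \texttt{train\_fn} in the protocol sketch given earlier.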