\section{On the Difficulty of Evaluating Baselines}
This section reviews the main part of the work presented by \citet{Rendle19}. In addition to a detailed description and explanation of the experiments carried out and the observations gained from them, a short introduction to the driving motivation is given.
\subsection{Motivation and Background}
As in many other fields of \textit{data-science}, a valid \textit{benchmark-dataset} is required for the proper execution of experiments. In the field of \textit{recommender systems}, the best-known datasets are the \textit{Netflix-} and \textit{MovieLens-datasets}. This section introduces both datasets and, in addition to the existing \textit{baselines}, describes the relationship of \citet{Koren}, one of the authors of the paper, to the \textit{Netflix-Prize}.
\subsubsection{Netflix-Prize}
\label{sec:netflix}
The topic of \textit{recommender systems} was first brought to broad attention by the \textit{Netflix-Prize}. On \textit{October 2nd, 2006}, the competition announced by \textit{Netflix} began with the \textit{goal} of beating its in-house \textit{recommender system Cinematch}, which achieved an \textit{RMSE} of \textit{0.9514}, by at least \textit{10\%}.
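Here, the \textit{RMSE} denotes the root mean squared error over a set $\mathcal{T}$ of held-out ratings,
\[
	\mathit{RMSE} = \sqrt{\frac{1}{|\mathcal{T}|} \sum_{(u,i) \in \mathcal{T}} \big(r_{ui} - \hat{r}_{ui}\big)^2},
\]
where $r_{ui}$ is the observed and $\hat{r}_{ui}$ the predicted rating of user $u$ for item $i$ (the symbol $\mathcal{T}$ is used only for this illustration). A \textit{10\%} improvement over \textit{Cinematch} therefore corresponds to an \textit{RMSE} of roughly $0.9 \cdot 0.9514 \approx 0.856$.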
In total, the \textit{Netflix-dataset} was divided into three parts that can be grouped into two categories: \textit{training} and \textit{qualification}. In addition to a \textit{probe-dataset} used during \textit{training} of the algorithms, two further datasets were withheld to qualify the winners. The \textit{quiz-dataset} was used to calculate the \textit{score} of the \textit{submitted solutions} on the \textit{public leaderboard}, whereas the \textit{test-dataset} was used to determine the actual winners. Each of these parts contained around \textit{1,408,000 ratings} and had similar statistical properties. By splitting the data in this way, it was ensured that an improvement could not be achieved by \textit{simple hill-climbing-algorithms}.
It took a total of \textit{three years} and \textit{several hundred models} until the team \textit{BellKor's Pragmatic Chaos} was declared the \textit{winner} on \textit{September 21st, 2009}. They had managed to achieve an \textit{RMSE} of \textit{0.8554} and thus an improvement of \textit{0.096}. Such a result is extraordinary, considering that it took \textit{one year} of work and intensive research to reduce the \textit{RMSE} from \textit{0.8712 (Progress Prize 2007)} to \textit{0.8616 (Progress Prize 2008)}.
The \textit{co-author} of the present paper, \citet{Koren}, was significantly involved in the work of this team. Since the beginning of the competition, \textit{matrix-factorization methods} were regarded as promising approaches. Even with the simplest \textit{SVD} methods, \citet{Kurucz07} could achieve \textit{RMSE} values of around \textit{0.94}.
The \textit{breakthrough} came with \citet{Funk06}, who achieved an \textit{RMSE} of \textit{0.93} with his \textit{FunkSVD}.
Based on this, more and more work was invested in research on simple \textit{matrix-factorization methods}.
Thus, \citet{Zh08} presented an \textit{ALS variant} with an \textit{RMSE} of \textit{0.8985} and \citet{Koren09} presented an \textit{SGD variant} with an \textit{RMSE} of \textit{0.8998}.
\textit{Implicit data} were also used. For example, \citet{Koren09} achieved an \textit{RMSE} of \textit{0.8762} by extending \textit{SVD++} with a \textit{time component}; the resulting model is called \textit{timeSVD++}.
The \textit{Netflix-Prize} made it clear that even the \textit{simplest methods} are \textit{not trivial} and that a \textit{reasonable investigation} and \textit{evaluation} require an \textit{immense effort} from within the \textit{community}.
\subsubsection{MovieLens}
In the \textit{non-commercial sector} of \textit{recommender systems}, the \textit{MovieLens10M-dataset} is used most often. It consists of \textit{10,000,054 ratings} and was published by the research group \textit{GroupLens} in \textit{2009} \citep{Harper15}. In most cases, a \textit{global} and \textit{random} \textit{90:10 split} of the data is used to evaluate the \textit{RMSE}: through \textit{random selection}, \textit{90\%} of the data is used for \textit{training} and the remaining \textit{10\%} for \textit{testing}. Over the last \textit{five years}, a large number of algorithms have been evaluated on this dataset and the results have been published at \textit{well-known conferences} such as \textit{ICML}, \textit{NeurIPS}, \textit{WWW}, \textit{SIGIR} and \textit{AAAI}. \textit{Figure} \ref{fig:reported_results} shows the \textit{results obtained} over the last \textit{five years} on the \textit{MovieLens10M-dataset}.
It can be clearly seen that the \textit{existing baselines} have been \textit{beaten} and that \textit{newer methods} have made \textit{steady progress}.
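To make the evaluation protocol concrete, such a \textit{global random 90:10 split} can be sketched as follows. This is a minimal illustration and not code from any of the reviewed works; the function name and the layout of \texttt{ratings} as an array of \textit{(user, item, rating)} rows are assumptions.
\begin{verbatim}
import numpy as np

def random_global_split(ratings, test_ratio=0.1, seed=42):
    # Shuffle all ratings globally and hold out 10% for testing.
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(ratings))
    n_test = int(len(ratings) * test_ratio)
    return ratings[perm[n_test:]], ratings[perm[:n_test]]
\end{verbatim}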
\input{reported_results}
\subsection{Experiment Realization}
As the \textit{Netflix-Prize} has shown, \textit{research} and \textit{validation} are \textit{complex} even for very \textit{simple methods}. Intensive work on investigating \textit{existing} and developing \textit{new reliable methods} was not only done during the \textit{Netflix-Prize}; the \textit{MovieLens10M-dataset} was used just as often. With their experiment, the authors \textit{question} whether the \textit{baselines} of \textit{MovieLens10M} are \textit{adequate} for the evaluation of new methods. To test this hypothesis, the authors transferred the findings from the \textit{Netflix-Prize} to the existing baselines of \textit{MovieLens10M}.
\subsubsection{Experiment Preparation}\label{sec:experiment_preparation}
Before actually conducting the experiment, the authors took a closer look at the given \textit{baselines}. In the process, they noticed some \textit{systematic overlaps}. These can be taken from the \textit{table} below.
\input{overlaps}
From the three aspects it can be seen that the models are fundamentally similar and that the main differences arise from different setups and learning procedures.
Thus, before conducting the actual experiment, the authors examined the two learning methods \textit{stochastic gradient descent} and \textit{Bayesian learning} in combination with \textit{biased matrix-factorization}. For $b_u = b_i = 0$ this is equivalent to \textit{regularized matrix-factorization (RSVD)}. In addition, for $\alpha = \beta = 1$ the \textit{weighted regularized matrix-factorization (WR)} is equivalent to \textit{RSVD}. The remaining differences are therefore explained solely by the different configurations of the methods.
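To make this relationship explicit, a biased matrix-factorization model of the common form predicts a rating as
\[
	\hat{r}_{ui} = \mu + b_u + b_i + p_u^\top q_i
\]
and is typically fitted by minimizing the regularized squared error
\[
	\sum_{(u,i) \in \mathcal{S}} \Big[ \big(r_{ui} - \hat{r}_{ui}\big)^2 + \lambda \big( \Vert p_u \Vert^2 + \Vert q_i \Vert^2 + b_u^2 + b_i^2 \big) \Big],
\]
where $\mathcal{S}$ denotes the set of observed ratings and $\mu$ a global bias; this notation is a sketch and not taken verbatim from \citet{Rendle19}. Setting $b_u = b_i = 0$ removes the bias terms and leaves only the regularized factorization of \textit{RSVD}.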
To prepare the two learning procedures, they were initialized with a \textit{Gaussian distribution} $\mathcal{N}(\mu, 0.1^2)$. The \textit{standard deviation} of \textit{0.1} is the default value suggested by the \textit{factorization-machine library libFM}. In addition, \citet{Rendle13} achieved good results on the \textit{Netflix-Prize-dataset} with this value. Nothing is said about the parameter $\mu$; however, it can be assumed to lie around the \textit{global average} of the \textit{ratings}, since the \textit{initialization} is meant to \textit{generate} plausible \textit{ratings}.
For both approaches, the number of \textit{sampling steps} was then set to \textit{128}. Since \textit{SGD} has two additional \textit{hyperparameters} $\lambda$ and $\gamma$, these had to be determined as well. Overall, the \textit{MovieLens10M-dataset} was evaluated by a \textit{10-fold cross-validation} over \textit{random, global} and \textit{non-overlapping 90:10 splits}: in each fold, \textit{90\%} of the data was used for \textit{training} and the remaining \textit{10\%} for \textit{evaluation}. Within each fold, \textit{95\%} of the \textit{training data} was used for \textit{training} and the remaining \textit{5\%} as a hold-out set to determine the \textit{hyperparameters}. The \textit{hyperparameter search} was performed as described in \textit{section} \ref{sec:sgd} using the \textit{grid} $\lambda \in \{0.02, 0.03, 0.04, 0.05\}$, $\gamma \in \{0.001, 0.003\}$ and a $64$\textit{-dimensional embedding}. This grid was inspired by findings from the \textit{Netflix-Prize} \citep{Kor08, Paterek07}. In total, the parameters $\lambda=0.04$ and $\gamma=0.003$ were determined. Afterwards, both \textit{learning methods} and their settings were compared, and the \textit{RMSE} was plotted against the \textit{dimension} $f$ of $p_u, q_i \in \mathbb{R}^f$. \textit{Figure} \ref{fig:battle} shows the corresponding results.
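The described setup can be sketched in \textit{Python}. The following is a simplified stand-in for illustration only and not the authors' implementation: a biased matrix-factorization trained with \textit{SGD}, using the \textit{Gaussian initialization} with standard deviation \textit{0.1} (zero mean is chosen here, since the exact $\mu$ is not specified) and the selected hyperparameters $\lambda = 0.04$ and $\gamma = 0.003$.
\begin{verbatim}
import numpy as np

def train_sgd_mf(ratings, f=64, epochs=128, lam=0.04, gamma=0.003, seed=0):
    # ratings: numpy array of (user, item, rating) rows with 0-based ids.
    # Plain Python loops: fine as a sketch, far too slow for MovieLens10M.
    rng = np.random.default_rng(seed)
    users = ratings[:, 0].astype(int)
    items = ratings[:, 1].astype(int)
    r = ratings[:, 2].astype(float)
    mu = r.mean()                                   # global average rating
    P = rng.normal(0.0, 0.1, (users.max() + 1, f))  # Gaussian init, sigma=0.1
    Q = rng.normal(0.0, 0.1, (items.max() + 1, f))
    b_u = np.zeros(users.max() + 1)
    b_i = np.zeros(items.max() + 1)
    for _ in range(epochs):
        for idx in rng.permutation(len(r)):
            u, i = users[idx], items[idx]
            e = r[idx] - (mu + b_u[u] + b_i[i] + P[u] @ Q[i])
            b_u[u] += gamma * (e - lam * b_u[u])
            b_i[i] += gamma * (e - lam * b_i[i])
            P[u], Q[i] = (P[u] + gamma * (e * Q[i] - lam * P[u]),
                          Q[i] + gamma * (e * P[u] - lam * Q[i]))
    return mu, b_u, b_i, P, Q

def rmse(model, ratings):
    # RMSE of the model on held-out ratings (unseen ids are not handled).
    mu, b_u, b_i, P, Q = model
    u = ratings[:, 0].astype(int)
    i = ratings[:, 1].astype(int)
    pred = mu + b_u[u] + b_i[i] + np.sum(P[u] * Q[i], axis=1)
    return float(np.sqrt(np.mean((ratings[:, 2] - pred) ** 2)))
\end{verbatim}
The \textit{grid search} then simply trains such a model on the inner \textit{95\%} split for every $(\lambda, \gamma)$ combination and keeps the pair with the lowest \textit{RMSE} on the held-out \textit{5\%}.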
\input{battle}
\newpage
As a \textit{first intermediate result} of the preparation, it can be stated that both \textit{SGD} and the \textit{Gibbs sampler} achieve better \textit{RMSE} values with an increasing \textit{dimensional embedding}.
In addition, it can be stated that learning using the \textit{Bayesian approach} performs better than learning using \textit{SGD}. Even if the results could be different with more elaborate setups, it is still surprising that \textit{SGD} is worse than the \textit{Bayesian approach}, although the \textit{exact opposite} was reported for the \textit{MovieLens10M-dataset}. For example, \textit{figure} \ref{fig:reported_results} shows that the \textit{Bayesian approach BPMF} achieved an \textit{RMSE} of \textit{0.8197}, while the \textit{SGD approach Biased MF} performed better with \textit{0.803}. The fact that the \textit{Bayesian approach} outperforms \textit{SGD} has already been reported and validated by \citet{Rendle13} and \citet{Rus08} for the \textit{Netflix-Prize-dataset}. Looking more closely at \textit{figures} \ref{fig:reported_results} and \ref{fig:battle}, the \textit{Bayesian approach} scores better than the reported \textit{BPMF} and \textit{Biased MF} for every \textit{dimensional embedding}. Moreover, it even beats all reported \textit{baselines} and new methods. Building on this, the authors examined the methods and \textit{baselines} in more detail.
\subsubsection{Experiment Implementation}
For the actual execution of the experiment, the authors used the knowledge gained from the preparations. They noticed that already the two \textit{simple matrix-factorization models SGD-MF} and \textit{Bayesian MF}, trained with an \textit{embedding} of \textit{512 dimensions} over \textit{128 epochs}, performed extremely well. \textit{SGD-MF} achieved an \textit{RMSE} of \textit{0.7720}. This result alone was better than \textit{RSVD (0.8256)}, \textit{Biased MF (0.803)}, \textit{LLORMA (0.7815)}, \textit{I-Autorec (0.782)}, \textit{WEMAREC (0.7769)} and \textit{I-CFN++ (0.7754)}. In addition, \textit{Bayesian MF} with an \textit{RMSE} of \textit{0.7633} not only beat the \textit{reported baseline BPMF (0.8197)}, it also beat the \textit{best reported algorithm MRMA (0.7634)}.
As the \textit{Netflix-Prize} showed, the use of \textit{implicit data} such as \textit{time} or \textit{dependencies} between \textit{users} or \textit{items} can improve existing models immensely. In addition to the two \textit{simple matrix factorizations}, \textit{table} \ref{table:models} shows the authors' extensions of the \textit{Bayesian approach}.
\input{model_table}
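To give an impression of how such implicit information enters a model, the well-known \textit{SVD++} predictor extends the biased factorization by a term built from the set $N(u)$ of items rated by user $u$, regardless of the rating value:
\[
	\hat{r}_{ui} = \mu + b_u + b_i + q_i^\top \Big( p_u + |N(u)|^{-\frac{1}{2}} \sum_{j \in N(u)} y_j \Big),
\]
where the $y_j \in \mathbb{R}^f$ are additional implicit item factors. \textit{timeSVD++} additionally makes the biases and user factors time-dependent, and the \textit{flipped} variant applies the same construction from the item side, i.e.\ using the set of users who rated an item. The formula shown here is the standard formulation and serves only as an illustration of the models listed in \textit{table} \ref{table:models}.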
Since the \textit{Bayesian approach} turned out to give more promising results, the given models were trained with it. For this purpose, the \textit{dimensional embedding} as well as the \textit{number of sampling steps} were examined again for these models. As indicated in \textit{section} \ref{sec:experiment_preparation}, the \textit{Gaussian distribution} was used for \textit{initialization}. \textit{Figure} \ref{fig:bayes_evaluation} shows the corresponding results.
\input{bayes_evaluation}
\subsection{Observations}
The first observation emerging from \textit{figure} \ref{fig:bayes_sampling_steps} is that an \textit{increase} in the number of \textit{sampling steps} with a \textit{fixed dimensional embedding} results in an \textit{improvement} in \textit{RMSE} for all models. Likewise, \textit{figure} \ref{fig:bayes_dimensional_embeddings} shows that an \textit{increase} in the \textit{dimensional embedding} at \textit{512 sampling steps} also leads to an \textit{improvement} in the \textit{RMSE} for all models. Thus, both the \textit{number of sampling steps} and the size of the \textit{dimensional embedding} contribute to the \textit{RMSE} of \textit{matrix-factorization models} when they are trained using the \textit{Bayesian approach}.
\subsubsection{Stronger Baselines}
As a second finding, the \textit{RMSE} values of the created models can be taken from \textit{figure} \ref{fig:bayes_dimensional_embeddings}. Several points can be addressed. Firstly, it can be seen that even the \textit{individual inclusion} of \textit{implicit knowledge} such as \textit{time} or \textit{user behaviour} leads to a significant \textit{improvement} in the \textit{RMSE}. For example, models like \textit{Bayesian timeSVD (0.7587)} and \textit{Bayesian SVD++ (0.7563)}, which each use a \textit{single type of implicit information}, beat the \textit{simple Bayesian MF} with its \textit{RMSE} of \textit{0.7633}. In addition, it shows that \textit{combining} \textit{implicit data} improves the \textit{RMSE} further: \textit{Bayesian timeSVD++} achieves an \textit{RMSE} of \textit{0.7523}. Finally, \textit{Bayesian timeSVD++ flipped} reaches an \textit{RMSE} of \textit{0.7485} by adding \textit{further implicit data}.
This leads to the third and most significant observation of the experiment. Firstly, the \textit{simple Bayesian MF} with an \textit{RMSE} of \textit{0.7633} already beat the best reported method \textit{MRMA} with an \textit{RMSE} of \textit{0.7634}. Furthermore, \textit{MRMA} could be surpassed by \textit{Bayesian timeSVD++ flipped} by \textit{0.0149} with respect to the \textit{RMSE}. Such a result is astonishing, as it took \textit{one year} during the \textit{Netflix-Prize} to reduce the leading \textit{RMSE} from \textit{0.8712 (Progress Prize 2007)} to \textit{0.8616 (Progress Prize 2008)}. Additionally, this result is remarkable as it \textit{challenges} the \textit{last five years} of research on the \textit{MovieLens10M-dataset}. Based on the results obtained, the \textit{authors} see the first problem with the \textit{results} achieved on the \textit{MovieLens10M-dataset} in the fact that new methods were \textit{compared against} too \textit{weak baselines}.
\textit{Figure} \ref{fig:corrected_results} shows the \textit{improved baselines} together with the \textit{results} of the \textit{new methods}.
\input{corrected_results}
\subsubsection{Reproducibility}
But where do these \textit{weak baselines} come from?
In response, the authors see two main points. The first is \textit{reproducibility}. This is generally understood to mean the \textit{repetition} of an \textit{experiment} with the aim of \textit{obtaining} the \textit{specified results}. In most cases, the \textit{code} of the authors of a paper is taken and checked. Not only during the \textit{Netflix-Prize} was this a common way to compare competing methods, improve one's own, and generally arrive at \textit{stronger baselines}. However, the authors do not consider the \textit{simple repetition} of an experiment for the purpose of achieving the same results to be sufficient. The \textit{repetition} of an experiment only provides information about the results achieved by one specific setup; it provides deeper insight neither into the method nor into its general quality. This is not only a problem of \textit{recommender systems} but rather a general problem in the field of \textit{machine learning}. \textit{Indicators} such as \textit{statistical significance}, \textit{reproducibility} or \textit{hyperparameter search} are often regarded as \textit{proof} of the quality of an experiment. But they only give information about one particular experiment, which may well have been performed with \textit{non-standard protocols}. The question of whether the method being used is applied and configured in a meaningful way is neglected. Thus, \textit{statistical significance} is often taken as an \textit{indication} that \textit{method A} \textit{performs better} than \textit{method B}.
\subsubsection{Inadequate Validations}
The authors do not doubt the relevance of such methods. They even consider them \textit{necessary}, but \textit{not meaningful enough} to establish the \textit{general quality} of an \textit{experiment}. Their own preparation, which makes use of the methods mentioned above, shows that these can indeed yield meaningful results.
Therefore, the authors see the second point of criticism of the results obtained on the \textit{MovieLens10M-dataset} in a \textit{wrong understanding} of \textit{reliable experiments}. The \textit{main reason} given is the \textit{difference} between \textit{scientific} and \textit{industrial work}. For example, during the \textit{Netflix-Prize}, which represents \textit{industrial work}, \textit{substantial sums} were \textit{awarded} for the best results. This had several consequences. On the one hand, a \textit{larger community} was attracted to work on the solution of the \textit{recommender problem}. On the other hand, the high number of \textit{competitors} and the \textit{simplicity} of the task formulation encouraged each participant to investigate the \textit{simplest methods} in \textit{small steps}. This \textit{small-step approach} was further driven by the \textit{standardized guidelines} for the \textit{evaluation} of the methods given in \textit{section} \ref{sec:mf} and by the \textit{public competition}. Thus, a better understanding of the \textit{basic relationships} could be achieved through the \textit{meticulous evaluation} of hundreds of models. All in all, these insights led to \textit{well-understood} and \textit{sharp baselines} within a \textit{community} that \textit{continuously} worked towards a \textit{common goal} over a total of \textit{three years}. Such a \textit{motivation} and such a \textit{target-oriented competitive spirit} are mostly absent in the \textit{scientific field}. Publications that achieve \textit{better results} with \textit{old methods} are considered \textit{unpublishable}. Instead, experiments are \textit{not questioned} and their \textit{results} are \textit{simply carried over}. In some cases, experiments are \textit{repeated exactly as specified} in the instructions, and achieving the \textit{same result} is considered a \textit{valid baseline}. According to the authors, such an approach is \textit{not meaningful} and, by not questioning the \textit{one-off evaluations}, leads to \textit{one-hit-wonders} that \textit{distort} the \textit{sharpness} of the \textit{baselines}. The \textit{MovieLens10M-dataset} thus shows that the main results of the last \textit{five years} were \textit{measured} against too \textit{weak baselines}.