\section{On the Difficulty of Evaluating Baselines}
This section reviews the \textit{main part} of the work presented by \citet{Rendle19}. In addition to a \textit{detailed description} and \textit{explanation} of the \textit{experiments} carried out and the \textit{observations} gained from them, a short introduction to the driving \textit{motivation} is given.
\subsection{Motivation and Background}
As in many other fields of \textit{data science}, valid \textit{benchmark datasets} are required for the proper execution of experiments. In the field of \textit{recommender systems}, the best-known \textit{datasets} are the \textit{Netflix} and \textit{MovieLens} datasets. This section introduces both \textit{datasets} and the existing \textit{baselines}, and shows the relationship of \citet{Koren}, one of the authors of the paper, to the \textit{Netflix-Prize}.
\subsubsection{Netflix-Prize}
The topic of \textit{recommender systems} was first brought to broad attention by the \textit{Netflix-Prize}. On \textit{October 2nd, 2006}, the competition announced by \textit{Netflix} began with the \textit{goal} of beating \textit{Cinematch}, Netflix's own \textit{recommender system} with an \textit{RMSE} of \textit{0.9514}, by at least \textit{10\%}.
In total, the \textit{Netflix dataset} was divided into three parts that can be grouped into two categories: \textit{training} and \textit{qualification}. In addition to a \textit{probe dataset} for \textit{training} the algorithms, two further datasets were retained to qualify the winners. The \textit{quiz dataset} was used to calculate the \textit{scores} of the \textit{submitted solutions} on the \textit{public leaderboard}, whereas the \textit{test dataset} was used to determine the \textit{actual winners}. Each of these parts contained around \textit{1,408,000 ratings} and had \textit{similar statistical properties}. By splitting the data in this way, it was ensured that an improvement could not be achieved by \textit{simple hill-climbing algorithms} on the leaderboard.
It took a total of \textit{three years} and \textit{several hundred models} until the team \textit{``BellKor's Pragmatic Chaos''} was declared the \textit{winner} on \textit{September 21st, 2009}. They had managed to achieve an \textit{RMSE} of \textit{0.8554} and thus an \textit{improvement} of \textit{0.096}. Such a result is remarkable, considering that it took \textit{one year} of work and intensive research to reduce the \textit{RMSE} from \textit{0.8712 (progress award 2007)} to \textit{0.8616 (progress award 2008)}.
A \textit{co-author} of the present paper, \citet{Koren}, was significantly involved in the work of this team. Since the beginning of the competition, \textit{matrix-factorization methods} were regarded as promising approaches. Even with the simplest \textit{SVD} methods, \citet{Kurucz07} achieved \textit{RMSE} values of around \textit{0.94}.
The \textit{breakthrough} came with \citet{Funk06}, who achieved an \textit{RMSE} of \textit{0.93} with his \textit{FunkSVD}.
Building on this, more and more work was invested in the research of simple \textit{matrix-factorization methods}.
Thus, \citet{Zh08} presented an \textit{ALS variant} with an \textit{RMSE} of \textit{0.8985}, and \citet{Koren09} presented an \textit{SGD variant} with an \textit{RMSE} of \textit{0.8995}.
\textit{Implicit feedback} was also used. For example, \citet{Koren09} achieved an \textit{RMSE} of \textit{0.8762} by additionally extending \textit{SVD++} with a \textit{time variable}; the resulting model was called \textit{timeSVD++}.
The \textit{Netflix-Prize} made it clear that even the \textit{simplest methods} are \textit{not trivial} and that a \textit{reasonable investigation} and \textit{evaluation} require an \textit{immense effort} from within the \textit{community}.
\subsubsection{MovieLens}
In the \textit{non-commercial sector} of \textit{recommender systems}, the \textit{MovieLens10M dataset} is used most frequently. It consists of \textit{10,000,054 ratings} and was published by the research group \textit{GroupLens} in \textit{2009} \citep{Harper15}. In most cases, a \textit{global} and \textit{random} \textit{90:10 split} of the data is used to evaluate the \textit{RMSE}: \textit{90\%} of the data, selected at \textit{random}, is used for \textit{training} and the remaining \textit{10\%} for \textit{testing}. Over the last \textit{five years}, a large number of \textit{algorithms} have been evaluated on this dataset and the results have been published at \textit{well-known conferences} such as \textit{ICML}, \textit{NeurIPS}, \textit{WWW}, \textit{SIGIR} and \textit{AAAI}. \textit{Figure} \ref{fig:reported_results} shows the \textit{results obtained} over the last \textit{five years} on the \textit{MovieLens10M dataset}.
It can be clearly seen that the \textit{existing baselines} have been \textit{beaten} and that \textit{newer methods} have made \textit{steady progress}.
\input{reported_results}
\subsection{Experiment Realization}
As the \textit{Netflix-Prize} has shown, \textit{research} and \textit{validation} are \textit{complex} even for very \textit{simple methods}. Intensive work on \textit{existing} and \textit{new reliable methods} was not only carried out during the \textit{Netflix-Prize}; the \textit{MovieLens10M dataset} was used just as often. With their \textit{experiment}, the authors question whether the \textit{baselines} reported for \textit{MovieLens10M} are \textit{adequate} for the evaluation of new methods. To test their hypothesis, the authors transferred all the findings from the \textit{Netflix-Prize} to the existing baselines of \textit{MovieLens10M}.
\subsubsection{Experiment Preparation}
Before actually conducting the experiment, the authors took a closer look at the given baselines. In the process, they noticed some \textit{systematic overlaps}, which can be seen in the \textit{table} below.
\input{overlaps}
From these three aspects it can be seen that the models are fundamentally similar and that the main differences arise from their different setups and learning procedures.
Thus, before conducting the actual experiment, the authors examined the two learning methods \textit{stochastic gradient descent (SGD)} and \textit{Bayesian learning} in combination with \textit{biased matrix-factorization}. For $b_u = b_i = 0$ this model is equivalent to \textit{regularized matrix-factorization (RSVD)}. In addition, for $\alpha = \beta = 1$ the \textit{weighted regularized matrix-factorization (WR)} is equivalent to \textit{RSVD}. Thus, the only differences are explained by the different settings of the methods.
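To make these equivalences concrete, a common formulation of \textit{biased matrix-factorization} is sketched below. Treating $\alpha$ and $\beta$ as weights on the user and item regularization terms is an assumption made here for illustration and does not necessarily match the authors' exact objective:
\begin{align}
	\hat{r}_{ui} &= \mu + b_u + b_i + \langle p_u, q_i \rangle \\
	\min_{p, q, b} \; &\sum_{(u,i)} \left( r_{ui} - \hat{r}_{ui} \right)^2 + \lambda \left( \alpha \lVert p_u \rVert^2 + \beta \lVert q_i \rVert^2 + b_u^2 + b_i^2 \right)
\end{align}
Setting $b_u = b_i = 0$ removes the bias terms and yields \textit{RSVD}, while $\alpha = \beta = 1$ reduces the weighted regularization to the plain \textit{RSVD} objective.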
To prepare the two learning procedures, their parameters were initialized from a \textit{Gaussian distribution} $\mathcal{N}(\mu, 0.1^2)$. The \textit{standard deviation} of 0.1 is the default value suggested by the \textit{factorization-machine library libFM}. In addition, \citet{Rendle13} achieved good results on the \textit{Netflix-Prize dataset} with this value. Nothing is said about the parameter $\mu$. However, it can be assumed that this parameter lies around the \textit{global average} of the \textit{ratings}, since the \textit{initialization} is supposed to \textit{generate} plausible \textit{ratings}.
For both approaches, the number of \textit{sampling steps} was then set to \textit{128}. Since \textit{SGD} has two additional \textit{hyperparameters} $\lambda$ and $\gamma$, these had to be determined as well. Overall, the \textit{MovieLens10M dataset} was evaluated by a \textit{10-fold cross-validation} over \textit{random}, \textit{global} and \textit{non-overlapping 90:10 splits}: in each fold, \textit{90\%} of the data was used for \textit{training} and the remaining \textit{10\%} for \textit{evaluation}. Within each fold, \textit{95\%} of the \textit{training data} was used for \textit{training} and the remaining \textit{5\%} as a \textit{holdout} to determine the \textit{hyperparameters}; a sketch of this protocol is given below. The \textit{hyperparameter search} was performed as described in \textit{section} \ref{sec:sgd} using the \textit{grid} $\lambda \in \{0.02, 0.03, 0.04, 0.05\}$, $\gamma \in \{0.001, 0.003\}$. This grid was inspired by findings from the \textit{Netflix-Prize} \citep{Kor08, Paterek07}. In total, the parameters $\lambda=0.04$ and $\gamma=0.003$ were determined. Afterwards, both \textit{learning methods} and their settings were compared by plotting the \textit{RMSE} against the \textit{dimension} $f$ of the embeddings $p_u, q_i \in \mathbb{R}^f$. \textit{Figure} \ref{fig:battle} shows the corresponding results.
\input{battle}
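The following \textit{Python} sketch illustrates the evaluation protocol described above, i.e.\ ten outer folds with a 90:10 split and an inner 95:5 holdout for the grid search. The function \texttt{train\_and\_rmse} is a hypothetical placeholder for training biased matrix-factorization with \textit{SGD} and returning the \textit{RMSE}; it is not part of the authors' code, and the sketch only illustrates the splitting and search logic.
\begin{verbatim}
import numpy as np

# Hypothetical placeholder: train biased matrix-factorization with SGD
# on `train` and return the RMSE on `evaluation`.
def train_and_rmse(train, evaluation, lam, gamma, f, steps=128):
    raise NotImplementedError

def cross_validate(ratings, f, n_folds=10, seed=0):
    """10-fold CV with an inner 95:5 holdout for the hyperparameter grid.

    `ratings` is assumed to be a NumPy array with one
    (user, item, rating) row per observation.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(ratings)), n_folds)
    grid = [(lam, gamma)
            for lam in (0.02, 0.03, 0.04, 0.05)
            for gamma in (0.001, 0.003)]
    scores = []
    for k in range(n_folds):
        test = ratings[folds[k]]          # 10% evaluation fold
        train_idx = np.concatenate(
            [folds[j] for j in range(n_folds) if j != k])
        train = ratings[train_idx]        # remaining 90%
        cut = int(0.95 * len(train))      # 95:5 inner split
        inner_train, holdout = train[:cut], train[cut:]
        # Grid search on the inner holdout only.
        best = min(grid, key=lambda p: train_and_rmse(
            inner_train, holdout, *p, f=f))
        # Retrain on the full 90% split with the selected hyperparameters.
        scores.append(train_and_rmse(train, test, *best, f=f))
    return float(np.mean(scores))
\end{verbatim}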
As a \textit{first intermediate result} of the preparation, it can be stated that both \textit{SGD} and the \textit{Gibbs sampler} achieve better \textit{RMSE values} with increasing \textit{embedding dimension}.
In addition, it can be stated that learning with the \textit{Bayesian approach} performs better than learning with \textit{SGD}. Even if the results could differ under more carefully tuned setups, it is still surprising that \textit{SGD} is worse than the \textit{Bayesian approach}, although the \textit{exact opposite} was reported for \textit{MovieLens10M}. For example, \textit{figure} \ref{fig:reported_results} shows that the \textit{Bayesian approach BPMF} achieved an \textit{RMSE} of \textit{0.8187} while the \textit{SGD approach Biased MF} performed better with \textit{0.803}. The fact that the \textit{Bayesian approach} outperforms \textit{SGD} had already been reported and validated by \citet{Rendle13} and \citet{Rus08} for the \textit{Netflix-Prize dataset}. Looking more closely at \textit{figures} \ref{fig:reported_results} and \ref{fig:battle}, the \textit{Bayesian approach} scores better than the reported \textit{BPMF} and \textit{Biased MF} for every \textit{embedding dimension}. Moreover, it even beats all reported baselines and newer methods. Building on this, the authors proceeded to a detailed examination of the methods and baselines.
\subsubsection{Experiment Implementation}
\subsection{Observations}
\subsubsection{Stronger Baselines}
\subsubsection{Reproducibility}
\subsubsection{Inadequate Validations}