\subsubsection{Stronger Baselines}
As a second finding, the \textit{RMSE values} of the evaluated models can be taken from \textit{figure} \ref{fig:bayes_dimensional_embeddings}. Several points can be addressed. Firstly, it can be seen that the \textit{individual inclusion} of \textit{implicit knowledge} such as \textit{time} or \textit{user behaviour} already leads to a significant \textit{improvement} in the \textit{RMSE}. For example, models like \textit{bayesian timeSVD (0.7587)} and \textit{bayesian SVD++ (0.7563)}, each of which uses a single source of implicit knowledge, beat the \textit{simple bayesian MF} with its \textit{RMSE} of \textit{0.7633}. Secondly, the \textit{combination} of \textit{implicit data} improves the \textit{RMSE} even further: \textit{bayesian timeSVD++} achieves an \textit{RMSE} of \textit{0.7523}. Finally, \textit{bayesian timeSVD++ flipped} reaches an \textit{RMSE} of \textit{0.7485} by adding \textit{even more implicit data}.
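For reference, the \textit{RMSE} compared here can be assumed to follow its standard definition over the held-out test ratings; the notation below is only a reminder of the usual form and not taken verbatim from the paper:
\[
    \mathrm{RMSE} = \sqrt{\frac{1}{|\mathcal{T}|} \sum_{(u,i) \in \mathcal{T}} \left(r_{ui} - \hat{r}_{ui}\right)^2},
\]
where $\mathcal{T}$ denotes the set of held-out user-item pairs, $r_{ui}$ the observed rating and $\hat{r}_{ui}$ the rating predicted by the respective model.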
This leads to the third and most significant observation of the experiment. Firstly, the \textit{simple bayesian MF} with an \textit{RMSE} of \textit{0.7633} already beats the best published method \textit{MRMA} with an \textit{RMSE} of \textit{0.7634}. Furthermore, \textit{MRMA} is surpassed by \textit{bayesian timeSVD++ flipped} by \textit{0.0149} with respect to the \textit{RMSE}. Such a result is astonishing, as it took \textit{one year} during the \textit{Netflix-Prize} to reduce the leading \textit{RMSE} from \textit{0.8712 (progress award 2007)} to \textit{0.8616 (progress award 2008)}. It is also remarkable because it \textit{challenges} the \textit{last 5 years} of research on the \textit{MovieLens10M-dataset}. Based on these results, the \textit{authors} identify a first problem with the \textit{results} reported on the \textit{MovieLens10M-dataset}: they were \textit{compared against} baselines that are too \textit{weak}.
\textit{Figure} \ref{fig:corrected_results} shows the \textit{improved baselines} and the \textit{results} of the \textit{new methods}.
\input{corrected_results}
\subsubsection{Reproducibility}
But where do these \textit{weak baselines} come from?
In response, the authors see two main points. The first is \textit{reproducibility}. This is generally understood to mean the \textit{repetition} of an \textit{experiment} with the aim of \textit{obtaining} the \textit{reported results}. In most cases, the \textit{code} of the authors of a paper is taken and checked. Not only during the \textit{Netflix-Prize} was this a common way to compare competing methods, improve one's own and generally achieve \textit{stronger baselines}. However, the authors do not consider the \textit{simple repetition} of an experiment for the purpose of achieving the same results to be sufficient. The \textit{repetition} of an experiment only provides information about the results achieved by one specific setup; it provides deeper insight neither into the method nor into its general quality. This is not only a problem of \textit{recommender-systems} but a general problem in the field of \textit{machine learning}. \textit{Indicators} such as \textit{statistical significance}, \textit{reproducibility} or \textit{hyperparameter search} are often regarded as \textit{proof} of the quality of an experiment. But they only give information about one particular experiment, which may even have been performed with \textit{non-standard protocols}. The question of whether the method under consideration is applied and configured in a meaningful way is neglected. Thus, \textit{statistical significance} is often taken as an \textit{indication} that \textit{method A} \textit{performs better} than \textit{method B}, as sketched below.
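To make this point concrete, the following is a minimal, purely illustrative sketch of what such a significance claim usually amounts to in practice: a paired test on the per-rating errors of two methods evaluated on the same test set. The toy data, the variable names and the chosen test are assumptions made for this example and are not taken from the paper.
\begin{verbatim}
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical per-rating absolute errors of two methods on the SAME test set.
rng = np.random.default_rng(0)
errors_a = np.abs(rng.normal(0.76, 0.30, size=10_000))  # "method A"
errors_b = errors_a * 0.98                               # "method B", slightly better by construction

# Paired t-test on the squared per-rating errors. A small p-value is commonly
# read as "method B performs significantly better than method A" -- but it says
# nothing about whether either method was configured or tuned in a sensible way.
t_stat, p_value = ttest_rel(errors_a ** 2, errors_b ** 2)

rmse_a = np.sqrt(np.mean(errors_a ** 2))
rmse_b = np.sqrt(np.mean(errors_b ** 2))
print(f"RMSE A = {rmse_a:.4f}, RMSE B = {rmse_b:.4f}, p-value = {p_value:.2e}")
\end{verbatim}
In line with the authors' criticism, the reported $p$-value only certifies a difference between these two particular runs under this particular protocol; it says nothing about the general quality of either method.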
\subsubsection{Inadequate Validations}
The authors do not doubt the relevance of such indicators. They even consider them necessary, but not sufficient to judge the general quality of an experiment. Their own elaboration, which takes up the above-mentioned methods, shows that meaningful results can nevertheless be achieved with them.
Therefore, the authors see the second point of criticism of the results obtained on the MovieLens10M-dataset in a flawed understanding of reliable experiments. The main reason given is the difference between scientific and industrial work. During the Netflix-Prize, which represents industrial work, considerable sums were awarded for the best results. This had several consequences. Firstly, a larger community was motivated to work on the solution of the recommender problem. Secondly, the high number of competitors and the simplicity of the task formulation encouraged each participant to investigate even the simplest methods in small steps. This small-step approach was further driven by the standardized guidelines for the evaluation of the methods given in section XY and by the public nature of the competition. In this way, a better understanding of the basic relationships could be achieved through the meticulous evaluation of hundreds of models. All in all, these insights led to well-understood and sharp baselines within a community that continuously worked towards a common goal over a total of three years. Such a motivation and such a goal-oriented competitive setting are mostly absent in the scientific field. Publications that achieve better results with old methods are considered unpublishable. Instead, existing experiments are not questioned and their results are simply carried over. In some cases experiments are repeated exactly as specified, and achieving the same result is considered a valid baseline. According to the authors, such an approach is not meaningful and, by not questioning these one-off evaluations, leads to one-hit-wonders that distort the sharpness of the baselines. As a result, the main findings of the last five years on the MovieLens10M-dataset were measured against baselines that are too weak.
\begin{figure}[!ht]
\centering
\includegraphics[scale=0.60]{Bilder/corrected_results.png}
\caption{\textit{Improved baselines} and \textit{new methods}}
\label{fig:corrected_results}
\end{figure}
The third approach is known as \textit{bayesian learning}.
The approaches shown in sections 2.4.1 to 2.4.4 in combination with this learning approach are also known as \textit{bayesian probabilistic matrix-factorization (BPMF)}. A detailed elaboration of the \textit{BPMF} and the \textit{gibbs-sampler} was written by \citet{Rus08}.
\subsection{Short Summary of Recommender Systems}
As the previous section clearly shows, the field of \textit{recommender systems} is versatile. Likewise, the individual approaches from the \textit{CB} and \textit{CF} areas can be assigned to distinct subject areas: \textit{CF} works primarily with \textit{graph-theoretical approaches}, while \textit{CB} uses methods from \textit{machine learning}. Of course there are \textit{overlaps} between the approaches. Such overlaps are mostly found in \textit{matrix-factorization}. In addition to \textit{classical matrix-factorization}, which is limited to \textit{simple matrix-decomposition}, approaches such as \textit{SVD++} and \textit{BPMF} work with methods from both \textit{CB} and \textit{CF}: \textit{SVD++} uses \textit{graph-based information}, while \textit{BPMF} uses classical approaches from \textit{machine learning}. Nevertheless, \textit{matrix-factorization} forms a separate part of the research field of \textit{recommender systems}, one that is strongly influenced by \textit{CB} and \textit{CF} ways of thinking. Figure \ref{fig:overview} finally shows a detailed overview of the different \textit{recommender-systems} and their dependencies.
\input{overview}
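To make the mentioned overlap concrete, the following is a minimal, purely illustrative sketch of classical matrix-factorization trained by stochastic gradient descent on explicit ratings. All names, hyperparameters and the toy data are assumptions made for this example; the sketch does not reproduce the exact formulations of \textit{SVD++} or \textit{BPMF} discussed above.
\begin{verbatim}
import numpy as np

def factorize(ratings, n_users, n_items, k=8, lr=0.01, reg=0.05, epochs=50, seed=0):
    """Plain matrix-factorization: approximate r_ui by the dot product p_u . q_i."""
    rng = np.random.default_rng(seed)
    p = rng.normal(0, 0.1, (n_users, k))   # user factors
    q = rng.normal(0, 0.1, (n_items, k))   # item factors
    for _ in range(epochs):
        for u, i, r in ratings:            # (user, item, rating) triples
            pu, qi = p[u].copy(), q[i].copy()
            err = r - pu @ qi
            p[u] += lr * (err * qi - reg * pu)
            q[i] += lr * (err * pu - reg * qi)
    return p, q

# Toy example: 3 users, 4 items, a handful of observed ratings.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 1, 4.0), (1, 2, 1.0), (2, 0, 4.0), (2, 3, 2.0)]
p, q = factorize(ratings, n_users=3, n_items=4)
print("predicted rating of user 2 for item 1:", round(float(p[2] @ q[1]), 2))
\end{verbatim}
Models such as \textit{SVD++} extend exactly this prediction rule with implicit-feedback terms, which is where the overlap between the \textit{CF} and \textit{CB} ways of thinking arises.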