    \section{A Study on Recommender Systems}
This section explains the basics of \textit{recommender systems} necessary for understanding the remainder of this paper. Besides the general definition of the \textit{recommender problem}, the corresponding solution approaches are presented, with the main focus on \textit{matrix factorization}.
    
    \subsection{Recommender Problem}
The \textit{recommender problem} is defined over two sets $\mathcal{U}$ and $\mathcal{I}$, where $\mathcal{U}$ represents the set of all \textit{users} and $\mathcal{I}$ the set of all \textit{items}.
Each \textit{user} in $\mathcal{U}$ gives \textit{ratings} from a set $\mathcal{S}$ of possible scores for the available \textit{items} in $\mathcal{I}$. The resulting \textit{rating-matrix} $\mathcal{R}$ is indexed by $\mathcal{U} \times \mathcal{I}$: the entry $r_{ui}$ denotes the \textit{rating} of \textit{user} $u \in \mathcal{U}$ for \textit{item} $i \in \mathcal{I}$. Since not every \textit{user} rates every \textit{item}, $\mathcal{R}$ is only partially filled. In the following, the subset of all \textit{users} who have rated a particular \textit{item} $i$ is referred to as $\mathcal{U}_i$. Similarly, $\mathcal{I}_u$ refers to the subset of \textit{items} that were rated by \textit{user} $u$. The aim of the \textit{recommender system} is to estimate the missing \textit{ratings} $\hat{r}_{ui}$ using a \textit{prediction-function} $p: \mathcal{U} \times \mathcal{I} \rightarrow \mathcal{S}$ \citep{DeKa11}. In the remainder of this work, different methods for determining $p(u,i)$ are presented.
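These definitions can be illustrated with a small sketch. The following Python snippet (toy data; the naive item-mean predictor is only an illustration, not one of the methods discussed later) encodes missing entries of $\mathcal{R}$ as \texttt{NaN}:

```python
import numpy as np

# Toy rating matrix R over users U = {0,1,2} and items I = {0,1,2,3}.
# np.nan marks missing ratings; the scores S are {1,...,5}.
R = np.array([
    [5.0, 3.0, np.nan, 1.0],
    [4.0, np.nan, np.nan, 1.0],
    [np.nan, 1.0, 5.0, 4.0],
])

def observed(R):
    """Return the set B = {(u, i) | r_ui is known}."""
    return {(u, i) for u, i in zip(*np.where(~np.isnan(R)))}

def predict_item_mean(R, u, i):
    """A naive prediction function p(u, i): the mean rating of item i
    over the subset U_i of users who rated it."""
    return float(np.nanmean(R[:, i]))

print(len(observed(R)))            # 8 known ratings
print(predict_item_mean(R, 0, 2))  # item 2 was rated only by user 2 -> 5.0
```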
    
In the following, the two main approaches, \textit{collaborative-filtering} and \textit{content-based} \textit{recommender systems}, are discussed. In addition, it is explained how \textit{matrix factorization} fits into both paradigms.
    
    \subsection{Content-Based}
\textit{Content-based recommender systems (CB)} work directly with \textit{feature vectors}. Such a \textit{feature vector} can, for example, represent a \textit{user profile} containing information about the \textit{user's preferences}, such as \textit{genres}, \textit{authors}, \textit{etc}. The goal is to create a \textit{model} of the \textit{user} that best represents these preferences. \textit{Learning algorithms} from the field of \textit{machine learning} are used to build such \textit{models}; the most prominent techniques are \textit{tf-idf}, \textit{bayesian learning}, \textit{Rocchio's algorithm} and \textit{neural networks} \citep{Lops11, Ferrari19, DeKa11}. The learned \textit{feature vectors} are then compared with one another; based on their closeness, similar \textit{features} can be used to generate \textit{missing ratings}. Figure \ref{fig:cb} shows a sketch of the general operation of \textit{content-based recommenders}.
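A minimal sketch of the content-based idea, assuming hypothetical binary item features and a user profile built as the mean of the liked items' feature vectors (cosine similarity stands in for the learned model):

```python
import numpy as np

# Hypothetical binary item features, e.g. genres: [action, comedy, drama]
items = {
    "A": np.array([1.0, 0.0, 1.0]),
    "B": np.array([1.0, 0.0, 0.0]),
    "C": np.array([0.0, 1.0, 0.0]),
}

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# User profile: average feature vector of the items the user liked.
liked = ["A", "B"]
profile = np.mean([items[i] for i in liked], axis=0)

# Rank the unseen items by similarity to the profile.
scores = {i: cosine(profile, v) for i, v in items.items() if i not in liked}
print(scores)  # "C" scores 0.0: it shares no features with the profile
```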
    
    \subsection{Collaborative-Filtering}
Unlike \textit{content-based recommenders}, \textit{collaborative-filtering recommenders (CF)} not only consider individual \textit{users} and \textit{feature vectors}, but rather a \textit{like-minded neighborhood} of each \textit{user}.
Missing \textit{user ratings} can then be inferred from this \textit{neighborhood}. It is assumed that a \textit{missing rating} of the considered \textit{user} $u$ for an unknown \textit{item} $i$ will be similar to the \textit{rating} of a \textit{user} $v$ as soon as $u$ and $v$ have rated some \textit{items} similarly. The similarity of the \textit{users} is determined by the \textit{community ratings}. This type of \textit{recommender system} is also known by the term \textit{neighborhood-based recommender} \citep{DeKa11}. The main focus of \textit{neighborhood-based methods} is on the application of iterative methods such as \textit{k-nearest-neighbours} or \textit{k-means}.
A \textit{neighborhood-based recommender} can be viewed from two angles. The first and best known problem is the so-called \textit{user-based prediction}: here, the \textit{missing ratings} of a considered \textit{user} $u$ are determined from his \textit{neighborhood} $\mathcal{N}_i(u)$, the subset of all \textit{users} who rate similarly to $u$ and have rated the \textit{item} $i$. The second problem is that of \textit{item-based prediction}: analogously, the similarity of the \textit{items} is determined by their received \textit{ratings}. This problem considers the \textit{neighborhood} $\mathcal{N}_u(i)$ of all \textit{items} rated by \textit{user} $u$ that were rated similarly to \textit{item} $i$. The similarity between the objects of a \textit{neighborhood} is determined by \textit{distance functions} such as \textit{mean-squared-difference}, \textit{pearson-correlation} or \textit{cosine-similarity}.
    Figure \ref{fig:cf} shows a sketch of the general operation of the \textit{collaborative-filtering recommender}.
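The user-based variant can be sketched as follows, assuming toy data and \textit{pearson-correlation} as the similarity measure (the $k$ parameter and the similarity-weighted average are illustrative choices):

```python
import numpy as np

R = np.array([
    [5.0, 3.0, 4.0, np.nan],   # user 0: rating for item 3 is missing
    [4.0, 2.0, 4.0, 5.0],
    [1.0, 5.0, 2.0, 1.0],
])

def pearson(u, v, R):
    """Pearson correlation over the items rated by both users."""
    mask = ~np.isnan(R[u]) & ~np.isnan(R[v])
    a, b = R[u, mask], R[v, mask]
    a, b = a - a.mean(), b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def predict_user_based(u, i, R, k=1):
    """Estimate r_ui as a similarity-weighted average over the k most
    similar users who have rated item i."""
    neighbours = [v for v in range(R.shape[0])
                  if v != u and not np.isnan(R[v, i])]
    neighbours.sort(key=lambda v: pearson(u, v, R), reverse=True)
    top = neighbours[:k]
    sims = np.array([pearson(u, v, R) for v in top])
    rates = np.array([R[v, i] for v in top])
    return float(sims @ rates / np.abs(sims).sum())

# With k=1 the prediction is the most similar user's rating.
print(predict_user_based(0, 3, R, k=1))
```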
    
    \input{content-based-collaborative-filtering-comparison}
    
    \subsection{Matrix-Factorization}\label{sec:mf}
The core idea of \textit{matrix factorization} is to complete the partially filled \textit{rating-matrix} $\mathcal{R}$. For this purpose the \textit{users} and \textit{items} are mapped to a joint \textit{latent feature space} with \textit{dimensionality} $f$. The \textit{user} is represented by the vector $p_u \in \mathbb{R}^{f}$ and the \textit{item} by the vector $q_i \in \mathbb{R}^{f}$. The \textit{missing ratings}, and thus the \textit{user-item interaction}, are then determined via the \textit{inner product} $\hat{r}_{ui}=q_i^Tp_u$ of the corresponding vectors \citep{Kor09}. In the following, the four most classical matrix-factorization approaches are described in detail. Afterwards, the concrete learning methods with which the vectors are learned are presented. In addition, the \textit{training data} for which a \textit{concrete rating} is available is referred to as $\mathcal{B} = \lbrace(u,i) \mid r_{ui} \in \mathcal{R}\rbrace$.
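A sketch of the latent-factor representation, with arbitrary random vectors standing in for learned ones:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, f = 3, 4, 2   # f: dimensionality of the latent space

P = rng.normal(size=(n_users, f))   # row u is the latent user vector p_u
Q = rng.normal(size=(n_items, f))   # row i is the latent item vector q_i

# A single estimated rating: the inner product q_i^T p_u ...
u, i = 0, 2
r_hat = float(Q[i] @ P[u])

# ... and the fully completed matrix R_hat = P Q^T in one step.
R_hat = P @ Q.T
assert np.isclose(R_hat[u, i], r_hat)
print(R_hat.shape)  # (3, 4): one estimate for every user-item pair
```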
    
    \subsubsection{Basic Matrix-Factorization}
The simplest way to apply \textit{matrix-factorization} is to connect the \textit{feature vectors} of the \textit{users} and \textit{items} using the \textit{inner product}; the result is the \textit{user-item interaction}. The resulting \textit{error} should be as small as possible, so $\min_{p_u, q_i}{\sum_{(u,i) \in \mathcal{B}} (r_{ui} - \hat{r}_{ui})^{2}}$ is defined as the associated \textit{minimization problem}.
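The objective can be evaluated directly. A minimal sketch with hand-picked toy vectors (not learned ones):

```python
import numpy as np

R = np.array([
    [5.0, np.nan],
    [np.nan, 1.0],
])
P = np.array([[1.0, 0.5], [0.2, 1.0]])   # latent user vectors p_u (rows)
Q = np.array([[2.0, 1.0], [0.5, 0.8]])   # latent item vectors q_i (rows)

def basic_loss(R, P, Q):
    """Sum of squared errors over the observed entries B."""
    loss = 0.0
    for u, i in zip(*np.where(~np.isnan(R))):
        r_hat = Q[i] @ P[u]              # inner product q_i^T p_u
        loss += (R[u, i] - r_hat) ** 2
    return float(loss)

# (5 - 2.5)^2 + (1 - 0.9)^2
print(basic_loss(R, P, Q))  # ~ 6.26
```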
    
    \subsubsection{Regulated Matrix-Factorization}\label{subsec:rmf}
This problem extends the \textit{basic matrix-factorization} by a \textit{regularization factor} $\lambda$ in the corresponding \textit{minimization problem}. Since $\mathcal{R}$ is sparsely populated, \textit{overfitting} may occur when learning from the few known values: the model fits the observed \textit{ratings} too closely and generalizes poorly. To counteract this, the magnitudes of the vectors are taken into account. High magnitudes are penalized by the term $\lambda(\lVert q_i \rVert^2 + \lVert p_u \rVert^2)$ in the \textit{minimization problem}. Overall, the \textit{minimization problem} $\min_{p_u, q_i}{\sum_{(u,i) \in \mathcal{B}} (r_{ui} - \hat{r}_{ui})^{2}} + \lambda(\lVert q_i \rVert^2 + \lVert p_u \rVert^2)$ is to be solved.
The idea is that particularly large entries in $q_i$ or $p_u$ cause $\lVert q_i \rVert, \lVert p_u \rVert$ to grow. Squaring these norms penalizes large values even further, so small values are rewarded and large values are punished, and the influence of this penalty can be regulated by $\lambda$.
    
    \subsubsection{Weighted Regulated Matrix-Factorization}
A \textit{regularization factor} $\lambda$ is introduced in analogy to \textit{regulated matrix-factorization}. Additional \textit{weights} $\alpha$ and $\beta$ are introduced to take the individual magnitude of each vector into account. The \textit{minimization problem} then corresponds to $\min_{p_u, q_i}{\sum_{(u,i) \in \mathcal{B}} (r_{ui} - \hat{r}_{ui})^{2}} + \lambda(\alpha\lVert q_i \rVert^2 + \beta\lVert p_u \rVert^2)$.
    
    \subsubsection{Biased Matrix-Factorization}\label{subsec:bmf}
A major advantage of \textit{matrix-factorization} is the ability to model simple relationships according to the application, since an excellent data source cannot always be assumed. Due to the \textit{natural interaction} of the \textit{users} with the \textit{items}, \textit{preferences} arise. Such \textit{preferences} lead to \textit{behaviour patterns} which manifest themselves in the form of a \textit{bias} in the data. In principle, a \textit{bias} is not bad, but it must be taken into account when modeling the \textit{recommender system}.
The most popular model that takes \textit{bias} into account is called \textit{biased matrix-factorization}.
Here, the \textit{missing rating} is no longer determined only by the \textit{inner product} of the two vectors $q_i$ and $p_u$; rather, the \textit{bias} is also considered. Accordingly, a \textit{missing rating} is calculated by $\hat{r}_{ui} = b_{ui} + q_i^Tp_u$, where $b_{ui}$ is the \textit{bias} of a \textit{user} $u$ and an \textit{item} $i$. The \textit{bias} is determined by $b_{ui}=\mu + b_u + b_i$. The parameter $\mu$ is the \textit{global average} of all \textit{ratings} $r_{ui} \in \mathcal{R}$.
    Furthermore, $b_u = \mu_u - \mu$ and $b_i = \mu_i - \mu$.
    Here $\mu_u$ denotes the \textit{average} of all \textit{assigned ratings} of the \textit{user} $u$. Similarly, $\mu_i$ denotes the \textit{average} of all \textit{received ratings} of an \textit{item} $i$.
    Thus $b_u$ indicates the \textit{deviation} of the \textit{average assigned rating} of a \textit{user} from the \textit{global average}. Similarly, $b_i$ indicates the \textit{deviation} of the \textit{average rating} of an item from the \textit{global average}.
In addition, the \textit{minimization problem} can be extended by the \textit{bias}. Accordingly, the \textit{minimization problem} is then $\min_{p_u, q_i}{\sum_{(u,i) \in \mathcal{B}} (r_{ui} - \hat{r}_{ui})^{2}} + \lambda(\lVert q_i \rVert^2 + \lVert p_u \rVert^2 + b_u^2 + b_i^2)$. Analogous to the \textit{regulated matrix-factorization}, the values $b_u$ and $b_i$ are penalized in addition to $\lVert q_i \rVert, \lVert p_u \rVert$. In this case $b_u, b_i$ are penalized more if they assume a large value and thus deviate strongly from the \textit{global average}.
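The bias terms can be computed directly from the data. A minimal sketch with a toy matrix, using zero latent vectors so that the prediction reduces to the pure bias term $b_{ui}$:

```python
import numpy as np

R = np.array([
    [5.0, 3.0, np.nan],
    [4.0, np.nan, 2.0],
    [np.nan, 1.0, 3.0],
])

mu = float(np.nanmean(R))       # global average of all known ratings
mu_u = np.nanmean(R, axis=1)    # average rating assigned by each user
mu_i = np.nanmean(R, axis=0)    # average rating received by each item
b_u = mu_u - mu                 # user deviation from the global average
b_i = mu_i - mu                 # item deviation from the global average

def predict_biased(u, i, p_u, q_i):
    """r_hat = mu + b_u + b_i + q_i^T p_u"""
    return float(mu + b_u[u] + b_i[i] + q_i @ p_u)

# With zero latent vectors the prediction is the pure bias term b_ui.
zero = np.zeros(2)
print(predict_biased(0, 2, zero, zero))  # 3.0 + 1.0 - 0.5 = 3.5
```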
    
    \subsubsection{Advanced Matrix-Factorization}\label{subsec:amf}
This section is intended to show that there are \textit{other approaches} to \textit{matrix-factorization}. For instance, \textit{implicit data} and \textit{temporal dynamics} can also be included.
It is not realistic to assume that a \textit{user} never changes his taste, whereas the properties of an \textit{item} remain largely constant. Therefore, \textit{missing ratings} can also be determined in a \textit{time-based} manner. A \textit{missing rating} is then determined by $\hat{r}_{ui}=\mu + b_i(t) + b_u(t) + q_i^{T}p_u(t)$ \citep{Kor09}.
As a second possibility, \textit{implicit influence} can be included. This can involve the \textit{items} a \textit{user} has interacted with. A \textit{missing rating} can then be determined by $\hat{r}_{ui}=\mu + b_i + b_u + q_i^{T}(p_u + |\mathcal{I}_u|^{-\frac{1}{2}}\sum_{j \in \mathcal{I}_u}{y_j})$, where $y_j \in \mathbb{R}^{f}$ describes the \textit{feature vector} of an \textit{item} $j \in \mathcal{I}_u$ rated by \textit{user} $u$. The corresponding \textit{minimization problems} can be adjusted as mentioned in the sections above \citep{Kor08}.
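A sketch of the implicit-feedback prediction with hand-picked toy values (all numbers are illustrative):

```python
import numpy as np

mu, b_u, b_i = 3.0, 0.2, -0.1       # global mean and bias terms
p_u = np.array([0.5, 0.5])          # latent user vector
q_i = np.array([1.0, 2.0])          # latent item vector

# Feature vectors y_j of the items in I_u that user u interacted with.
I_u = [0, 1, 2, 3]
Y = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])

# r_hat = mu + b_i + b_u + q_i^T (p_u + |I_u|^{-1/2} * sum_{j in I_u} y_j)
implicit = len(I_u) ** -0.5 * Y.sum(axis=0)   # |I_u| = 4, so factor 0.5
r_hat = float(mu + b_i + b_u + q_i @ (p_u + implicit))
print(r_hat)  # ~ 7.6
```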
    
    \subsection{Optimization and Learning}
An important point that has not yet been addressed is how the individual components $p_u, q_i, b_u, b_i$ are actually learned. In the following, the three most common methods are presented.
    
    \subsubsection{Stochastic Gradient Descent}
    \label{sec:sgd}
The best known and most common method in \textit{machine learning} is \textit{stochastic gradient descent (SGD)}. The goal of \textit{SGD} is to \textit{minimize} the \textit{error} of a given \textit{objective function}; the \textit{minimization problems} mentioned in section \ref{sec:mf} can serve as such \textit{objective functions}. In the field of \textit{recommender systems}, \citet{Funk06} presented a \textit{modified} variant of \textit{SGD} in the context of the \textit{Netflix Challenge}. \textit{SGD} can be applied to \textit{regulated matrix-factorization} with \textit{bias} as well as without \textit{bias}. The method can be described by the following pseudo code:
\begin{algorithm}\label{alg:sgd}
	\caption{SGD of Funk}
	\begin{algorithmic}[1]
		\REQUIRE training set $\mathcal{B}$, global mean $\mu$, initial standard deviation $\sigma^2$, regularization parameter $\lambda$, learning rate $\gamma$, feature dimensionality $f$, epochs to train $n_{epochs}$
		\STATE $\mathcal{P} \leftarrow \mathcal{N}(0, \sigma^2)^{|\mathcal{U}|\times f}$
		\STATE $\mathcal{Q} \leftarrow \mathcal{N}(0, \sigma^2)^{f\times |\mathcal{I}|}$
		\STATE $b_u \leftarrow 0 \;\forall u \in \mathcal{U}$, \quad $b_i \leftarrow 0 \;\forall i \in \mathcal{I}$
		\FOR{$epoch \in \lbrace 0, \cdots, n_{epochs}-1\rbrace$}
			\FOR{$(u,i) \in \mathcal{B}$}
				\STATE $\hat{r}_{ui} \leftarrow \mu + b_u + b_i + q_i^{T}p_u$
				\STATE $e_{ui} \leftarrow r_{ui} - \hat{r}_{ui}$
				\STATE $q_i \leftarrow q_i + \gamma(e_{ui}p_u -\lambda q_i)$
				\STATE $p_u \leftarrow p_u + \gamma(e_{ui}q_i - \lambda p_u)$
				\STATE $b_i \leftarrow b_i + \gamma(e_{ui}-\lambda b_i)$
				\STATE $b_u \leftarrow b_u + \gamma(e_{ui}-\lambda b_u)$
			\ENDFOR
		\ENDFOR
		\RETURN $\mathcal{P}, \mathcal{Q}, b_u, b_i$
	\end{algorithmic}
\end{algorithm}
    
At the beginning, the matrices $\mathcal{P}, \mathcal{Q}$ are filled with \textit{random numbers}. According to \citet{Funk06} this can be done by a \textit{gaussian-distribution}. Then, for each element in the \textit{training set}, the entries of the corresponding vectors $p_u \in \mathcal{P}, q_i \in \mathcal{Q}$ are updated on the basis of the \textit{error} that occurred in an \textit{epoch}. The parameters $\lambda, \gamma$ are introduced to avoid \textit{over}- and \textit{underfitting}. They can be determined using \textit{grid-search} in combination with \textit{k-fold cross-validation}: a \textit{grid} of possible parameters, consisting of the sets $\Lambda$ and $\Gamma$, is defined before the analysis. The \textit{grid-search} method then trains the algorithm under consideration with each possible pair $(\lambda \in \Lambda, \gamma \in \Gamma)$. The models trained in this way are then evaluated using \textit{k-fold cross-validation}: the data set is divided into $k$ equally large fragments, and each of the $k$ parts is used once as a test set while the remaining $(k-1)$ parts are used as training data. The average error over the $k$ \textit{folds} is entered into the \textit{grid}, so that the pair $(\lambda \in \Lambda, \gamma \in \Gamma)$ with the lowest \textit{error} can be determined.
In combination with sections \ref{subsec:rmf} and \ref{subsec:bmf}, this approach is also called \textit{Funk-SVD} or simply \textit{SVD} \citep{Rendle19}. The algorithm shown above can also be extended so that models like those in section \ref{subsec:amf} can be solved. The second method from section \ref{subsec:amf} is then also called \textit{SVD++}. A coherent \textit{SGD} approach was given by \citet{Kor11}.
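This training loop can be turned into a short runnable sketch. The following Python implementation uses toy data and untuned hyperparameters (all values are illustrative), and updates $p_u$ from the pre-update value of $q_i$:

```python
import numpy as np

def funk_sgd(R, f=2, gamma=0.01, lam=0.1, n_epochs=200, seed=0):
    """A minimal sketch of Funk-style SGD for biased matrix factorization.
    R is a dense matrix with np.nan marking missing ratings."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = rng.normal(0.0, 0.1, size=(n_users, f))
    Q = rng.normal(0.0, 0.1, size=(n_items, f))
    b_u = np.zeros(n_users)
    b_i = np.zeros(n_items)
    mu = float(np.nanmean(R))
    B = list(zip(*np.where(~np.isnan(R))))       # observed (u, i) pairs

    for _ in range(n_epochs):
        for u, i in B:
            r_hat = mu + b_u[u] + b_i[i] + Q[i] @ P[u]
            e = R[u, i] - r_hat
            q_old = Q[i].copy()                  # update both from old values
            Q[i] += gamma * (e * P[u] - lam * Q[i])
            P[u] += gamma * (e * q_old - lam * P[u])
            b_i[i] += gamma * (e - lam * b_i[i])
            b_u[u] += gamma * (e - lam * b_u[u])
    return P, Q, b_u, b_i, mu

R = np.array([
    [5.0, 3.0, np.nan, 1.0],
    [4.0, np.nan, np.nan, 1.0],
    [1.0, 1.0, np.nan, 5.0],
])
P, Q, b_u, b_i, mu = funk_sgd(R)
# Prediction for a known entry; should be close to the observed 1.0.
print(mu + b_u[0] + b_i[3] + Q[3] @ P[0])
```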
    
    \subsubsection{Alternating Least Square}
The second frequently used method is \textit{alternating least squares (ALS)}. In contrast to \textit{SGD}, the vectors $q_i, p_u$ are adjusted in \textit{two steps}. Since $q_i$ and $p_u$ are both unknown, the \textit{minimization problem} is \textit{non-convex}. The idea of \textit{ALS} is to fix one of the two vectors and solve for the other, so that each step involves only one unknown. The problem then becomes \textit{quadratic} and can be solved optimally. For this purpose the matrix $\mathcal{P}$ is filled with \textit{random numbers} at the beginning. These should be as small as possible and can be generated by a \textit{gaussian-distribution}. Then $\mathcal{P}$ is fixed and all $q_i \in \mathcal{Q}$ are recalculated by solving the resulting \textit{least-squares problem}. This step is then repeated with the roles reversed. \textit{ALS} terminates when a \textit{termination condition}, such as the \textit{convergence} of the error, is satisfied for both steps \citep{Zh08}.
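A minimal ALS sketch for the regularized problem, restricted to the observed entries (toy data and an untuned $\lambda$; a fixed iteration count stands in for a convergence check):

```python
import numpy as np

def als(R, f=2, lam=0.1, n_iters=20, seed=0):
    """Alternating least squares: fix Q and solve each p_u in closed form,
    then fix P and solve each q_i, repeating both steps."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = rng.normal(0.0, 0.1, size=(n_users, f))
    Q = rng.normal(0.0, 0.1, size=(n_items, f))
    mask = ~np.isnan(R)

    for _ in range(n_iters):
        for u in range(n_users):                 # step 1: update P, Q fixed
            idx = np.where(mask[u])[0]
            A = Q[idx].T @ Q[idx] + lam * np.eye(f)
            b = Q[idx].T @ R[u, idx]
            P[u] = np.linalg.solve(A, b)         # regularized least squares
        for i in range(n_items):                 # step 2: update Q, P fixed
            idx = np.where(mask[:, i])[0]
            A = P[idx].T @ P[idx] + lam * np.eye(f)
            b = P[idx].T @ R[idx, i]
            Q[i] = np.linalg.solve(A, b)
    return P, Q

R = np.array([
    [5.0, 3.0, np.nan],
    [4.0, np.nan, 1.0],
    [1.0, 1.0, 5.0],
])
P, Q = als(R)
print(np.round(P @ Q.T, 1))  # completed matrix; observed entries approximated
```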
    
    \subsubsection{Bayesian Learning}
The third approach is known as \textit{bayesian learning}. With this approach the so-called \textit{gibbs-sampler} is often used. The aim is to determine the \textit{joint distribution} of the vectors in $\mathcal{P}, \mathcal{Q}$. For this purpose the \textit{gibbs-sampler} is given an initialization of \textit{hyperparameters} to generate the \textit{initial distribution}. The \textit{joint distribution} of the vectors $q_i \in \mathcal{Q}, p_u \in \mathcal{P}$ is approximated via their \textit{conditional distributions}: in each \textit{epoch}, one variable is selected in an alternating fashion and a new value is drawn from its \textit{conditional distribution} given the current values of the remaining variables, which stay unchanged in that step.
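The alternating-sampling principle can be illustrated on a toy example that is much simpler than full \textit{BPMF}: a bivariate Gaussian whose conditional distributions are known in closed form (this sketch demonstrates only the Gibbs mechanism, not the BPMF model itself):

```python
import numpy as np

def gibbs_bivariate_normal(rho=0.8, n_samples=5000, burn_in=500, seed=0):
    """Gibbs sampling for (x, y) ~ N(0, [[1, rho], [rho, 1]]).
    Each step draws one variable from its conditional given the other:
    x | y ~ N(rho * y, 1 - rho^2), and symmetrically for y | x."""
    rng = np.random.default_rng(seed)
    x, y = 0.0, 0.0
    std = np.sqrt(1 - rho ** 2)
    samples = []
    for t in range(n_samples + burn_in):
        x = rng.normal(rho * y, std)   # sample x, y held fixed
        y = rng.normal(rho * x, std)   # sample y, x held fixed
        if t >= burn_in:               # discard the burn-in phase
            samples.append((x, y))
    return np.array(samples)

s = gibbs_bivariate_normal()
# The empirical correlation approximates the target rho = 0.8.
print(np.corrcoef(s[:, 0], s[:, 1])[0, 1])
```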
The matrix-factorization approaches presented above, combined with this learning approach, are also known as \textit{bayesian probabilistic matrix-factorization (BPMF)}. A detailed elaboration of \textit{BPMF} and the \textit{gibbs-sampler} was written by \citet{Rus08}.
    
    \subsection{Short Summary of Recommender Systems}
As the previous sections clearly show, the field of \textit{recommender systems} is versatile. The individual approaches from the \textit{CB} and \textit{CF} areas can largely be assigned to distinct subject areas: \textit{CF} tends to work with \textit{graph-theoretical approaches} while \textit{CB} uses methods from \textit{machine learning}. Of course there are \textit{overlaps} between the approaches, and such overlaps are mostly found in \textit{matrix-factorization}. In addition to \textit{classical matrix-factorization}, which is limited to \textit{simple matrix-decomposition}, approaches such as \textit{SVD++} and \textit{BPMF} combine methods from \textit{CB} and \textit{CF}: \textit{SVD++} uses \textit{graph-based information} while \textit{BPMF} uses classical approaches from \textit{machine learning}. Nevertheless, \textit{matrix-factorization} forms a separate part of the research field of \textit{recommender systems}, strongly influenced by both the \textit{CB} and \textit{CF} ways of thinking. Figure \ref{fig:overview} finally shows a detailed overview of the different \textit{recommender-systems} and their dependencies.
    
    \input{overview}