Recommendation Engine For Retail Marketing¶
Introduction¶
Recommender systems are among the most visible applications of data science. They predict the "rating" or "preference" that a user would give to an item. In retail, Amazon and eBay use them to personalize the web experience by recommending items to shoppers; Facebook uses them to recommend friends and pages to follow; Pandora suggests which songs to play; and Netflix recommends movies. A recommender system studies customer behavior, such as similarities between items purchased, past purchases, demographic information, search history, or movie reviews, and uses that information to predict other items those customers may like. The benefits of recommender systems include improved retention, increased sales, and stronger brand loyalty.
Some of the most popular types of recommender systems include the following:
Simple recommenders: This approach generates generalized recommendations for every user based on attributes such as movie popularity and genre. A weighted average score is a common choice for building a simple recommender for a movie dataset: the assumption is that an item that is popular overall is worth recommending to all users (a short sketch of the weighted rating follows this list).
Content-Based Filtering The recommendation of a product is based on its discrete set of descriptive qualities. This type of recommendation system focuses on the products themselves and recommends other products that have similar attributes. For example, a site may use a keyword system that suggests items whose descriptions share keywords with an item the user has previously purchased. Content-based filtering relies on the characteristics of the products themselves, so it does not need other users to interact with the products before making a recommendation.
Demographic-Based Filtering This approach uses demographic information about the user, such as age and gender, to make recommendations. Predictions are made by suggesting items selected by other users who fit the same demographic profile.
Utility-Based Filtering This approach elicits the utility that the user will get from the product. It can include information such as vendor reliability and availability to make sure it recommends products the user can easily obtain.
Knowledge-Based Filtering Recommendations are based on a user's known preferences and buying patterns. Since the system knows what the consumer purchased in the past, it can make recommendations based on what might fill those needs in the future.
Hybrid Filtering This type of system combines two or more different recommender techniques to create a more thorough recommending system.
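For the simple, popularity-weighted recommender mentioned in the first item above, one common formulation is the IMDB-style weighted rating $WR = \frac{v}{v+m}R + \frac{m}{v+m}C$. A minimal sketch on a toy movie table (the column names vote_count and vote_average are only illustrative):
import pandas as pd
# Toy movie table; values and column names are illustrative
movies = pd.DataFrame({'title': ['A', 'B', 'C'],
                       'vote_count': [1200, 40, 300],
                       'vote_average': [7.9, 9.1, 6.5]})
C = movies['vote_average'].mean()        # mean rating across all items
m = movies['vote_count'].quantile(0.70)  # minimum votes required to be listed
def weighted_rating(row, m=m, C=C):
    v, R = row['vote_count'], row['vote_average']
    return (v / (v + m)) * R + (m / (v + m)) * C
movies['score'] = movies.apply(weighted_rating, axis=1)
print(movies.sort_values('score', ascending=False))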
Collaborative Filtering¶
This type of recommender system uses the preferences or ratings of other users to make predictions for a given user. Collaborative filtering involves filtering for information or patterns using techniques involving collaboration among multiple agents, viewpoints, and data sources. Collaborative filtering can be classified into three main types: memory-based CF, model-based CF, and hybrid recommenders.
Memory based approach:¶
User-Item Collaborative Filtering: The similarity between two users x and y is calculated from the similarity in their rating patterns over items they have both rated ("users like you also bought item y"). An application of this user-based collaborative filtering is the Nearest Neighbor algorithm.
Item-Item Collaborative Filtering: Item-based collaborative filtering (users who bought x also bought y) proceeds in an item-centric manner:
- Build an item-item matrix determining relationships between pairs of items.
- Infer the tastes of the current user by examining the matrix and matching that user's data.
See, for example, the Slope One item-based collaborative filtering family; a toy example follows below.
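As a toy illustration of this item-centric approach (separate from the full implementation later in this notebook), an item-item similarity matrix can be built directly from a small user-item ratings table:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
# Toy user-item ratings (0 = not rated); all values are illustrative
ratings = pd.DataFrame(
    [[5, 3, 0, 1],
     [4, 0, 0, 1],
     [1, 1, 0, 5],
     [0, 0, 5, 4]],
    index=['u1', 'u2', 'u3', 'u4'],
    columns=['item_a', 'item_b', 'item_c', 'item_d'])
# Item-item similarity: cosine similarity between item columns
item_sim = pd.DataFrame(cosine_similarity(ratings.T),
                        index=ratings.columns, columns=ratings.columns)
# Items most similar to item_a (excluding itself)
print(item_sim['item_a'].drop('item_a').sort_values(ascending=False))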
Common measures of similarity between two users or items include the Pearson correlation and vector cosine given below:
The similarity between two users/items $x$, $y$ is measured by computing the Pearson correlation between users/items $x$, $y$ and is defined as:
\begin{equation*} \text{Similarity }(x,y)=\frac{\displaystyle \sum\limits_{i \in I_{x,y}}(r_{x,i}-\bar{r}_{x})(r_{y,i}-\bar{r}_{y})}{\sqrt{\displaystyle \sum\limits_{i \in I_{x,y}}(r_{x,i}-\bar{r}_{x})^{2}}\ \sqrt{\displaystyle \sum\limits_{i \in I_{x,y}}(r_{y,i}-\bar{r}_{y})^{2}}} \end{equation*}
where the $i \in I_{x,y}$ summations are over the items that both users $x$ and $y$ have rated, and $\bar{r}_{x}$ is the average rating of the co-rated items for user $x$.
The cosine-based approach defines the cosine-similarity (cosine of the angle between the two users/items vectors) between two users/items x and y as:
\begin{equation*} \text{Similarity }(x,y)= \text{Cosine }(\theta)=\frac{ \overrightarrow{x} \cdot \overrightarrow{y}}{||\overrightarrow {x}|| \cdot ||\overrightarrow{y}||}=\frac{\displaystyle \sum\limits_{i \in I_{x,y}}r_{x,i}r_{y,i}}{\sqrt{\displaystyle \sum\limits_{i \in I_{x} }r^{2}_{x,i}} \sqrt{\displaystyle \sum\limits_{i \in I_{y} }r^{2}_{y,i}}} \end{equation*}
For example, for the vectors $\overrightarrow{y}=\{a_{1},a_{2}\}$ and $\overrightarrow{x}=\{b_{1},b_{2}\}$, the cosine similarity between $\overrightarrow{y}$ and $\overrightarrow{x}$ is given by:
\begin{equation*} \text{Similarity }(x,y)= \text{Cosine }(\theta)=\frac{ \overrightarrow{x} \cdot \overrightarrow{y}}{||\overrightarrow {x}|| \cdot ||\overrightarrow{y}||}=\frac{a_{1}\cdot b_{1}+a_{2}\cdot b_{2}}{\sqrt{b_{1}^{2}+b_{2}^{2}} \sqrt{a_{1}^{2}+a_{2}^{2}}} \end{equation*}
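To make these similarity measures concrete, here is a minimal sketch (with illustrative ratings) computing the Pearson and cosine similarities of two users over their co-rated items:
import numpy as np
from scipy.stats import pearsonr
# Ratings given by users x and y to the same five co-rated items (illustrative)
r_x = np.array([5.0, 3.0, 4.0, 4.0, 2.0])
r_y = np.array([4.0, 3.0, 5.0, 3.0, 1.0])
# Pearson similarity: correlation of the mean-centered ratings
pearson_sim = pearsonr(r_x, r_y)[0]
# Cosine similarity: dot product over the product of the vector norms
cosine_sim = r_x.dot(r_y) / (np.linalg.norm(r_x) * np.linalg.norm(r_y))
print('Pearson: %.3f, Cosine: %.3f' % (pearson_sim, cosine_sim))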
Predictions can be made from the user_similarity and item_similarity matrices by applying the following formula for user-based CF:
In user-based approaches, the rating that user $u$ gives to item $i$ is calculated as an aggregation of similar users' ratings of the item:
$\hat{r}_{u,i}= \operatorname{agg}_{u^{\prime} \in U }\, r_{u^{\prime},i}$
where $U$ denotes the set of the top $N$ users most similar to user $u$ who rated item $i$. A typical aggregation function is:
\begin{equation*} \hat{r}_{u,i}= \bar{r_{u}} + \frac{\displaystyle \sum\limits_{u^{\prime} \in U }\text{Similarity }(u,u^{\prime})(r_{u^{\prime},i}-\bar{r}_{u^{\prime}})}{\displaystyle \sum\limits_{u^{\prime} \in U }\left|\text{Similarity }(u,u^{\prime})\right|} \end{equation*}
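A minimal NumPy sketch of this mean-centered aggregation, with illustrative similarity and rating values:
import numpy as np
r_bar_u = 3.8                                 # \bar{r}_u, the target user's mean rating
sims = np.array([0.9, 0.7, 0.4])              # Similarity(u, u') for the top-N neighbors
neighbor_ratings = np.array([5.0, 4.0, 2.0])  # r_{u',i}
neighbor_means = np.array([4.2, 3.5, 2.8])    # \bar{r}_{u'}
# \hat{r}_{u,i} = \bar{r}_u + sum(sim * (r - \bar{r}')) / sum(|sim|)
r_hat = r_bar_u + np.sum(sims * (neighbor_ratings - neighbor_means)) / np.sum(np.abs(sims))
print('Predicted rating: %.2f' % r_hat)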
Model based approach:¶
In the model-based approach, CF models are developed using data mining and machine learning algorithms to predict users' ratings of unrated items. These algorithms can be further broken down into several sub-types, for example matrix factorization based methods such as the SVD and NMF models used below.
%matplotlib inline
import csv
import codecs
import glob
import multiprocessing
import os, random, time, shutil
import random as r
import re
import requests
import numpy as np
import pandas as pd
import pandas_datareader.data as web
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import missingno as msno
from datetime import datetime, timedelta
from dateutil.parser import parse
from ast import literal_eval
from scipy import stats
from skimage import transform
from numpy import linalg as LA
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from sklearn.model_selection import train_test_split
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet
# Surprise is a Python scikit for building and analyzing recommender systems.
# In recent Surprise releases, cross_validate replaces the older evaluate() helper.
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate
# Hide warnings if there are any
import warnings
warnings.filterwarnings('ignore')
path='/PythonRecommenderSystem'
# Check current working directory.
retval = os.getcwd()
print ("Current working directory %s" % retval)
# Now change the directory
os.chdir( path )
# Check current working directory.
retval = os.getcwd()
print("Directory changed successfully %s" % retval)
# list files in current directory
#files = os.listdir(path)
#print(files)
The pet product reviews dataset used in this work comes from the Amazon product review dataset and can be found here. The full collection contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014.
This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).
- reviewerID - ID of the reviewer, e.g. A2SUAM1J3GNN3B
- asin - ID of the product, e.g. 0000013714
- reviewerName - name of the reviewer
- helpful - helpfulness rating of the review, e.g. 2/3
- reviewText - text of the review
- overall - rating of the product
- summary - summary of the review
- unixReviewTime - time of the review (unix time)
- reviewTime - time of the review (raw)
import json
petreviews = []
with open('reviews_Pet_Supplies_5.json', 'r') as f:
    for line in f:
        petreviews.append(json.loads(line))
df = pd.DataFrame(petreviews)
df.head(3)
Equivalently, the gzipped file can be read directly:
import pandas as pd
import gzip
import os
path='/Downloads'
# Check current working directory.
retval = os.getcwd()
print ("Current working directory %s" % retval)
# Now change the directory
os.chdir( path )
def parse(path):
    # yield one review record per line of the gzipped file
    g = gzip.open(path, 'rb')
    for l in g:
        yield eval(l)
def getDF(path):
    # build a DataFrame with one row per review
    i = 0
    df = {}
    for d in parse(path):
        df[i] = d
        i += 1
    return pd.DataFrame.from_dict(df, orient='index')
df = getDF('reviews_Pet_Supplies_5.json.gz')
df.info()
df.describe()
#df['reviewTime'].iloc[x][0:2]
# note: the slice positions below assume the fixed 'MM DD, YYYY' reviewTime format
month=[]
for i in range(df.shape[0]):
    month.append(df['reviewTime'].iloc[i][0:2])
# apply a function to a list
def Month(i):
    return df['reviewTime'].iloc[i][0:2]
Month=list(map(Month,range(df.shape[0])))
# apply a function to a list
def Day(i):
    return df['reviewTime'].iloc[i][3:5]
Day=list(map(Day,range(df.shape[0])))
# apply a function to a list
def Year(i):
    return df['reviewTime'].iloc[i][7:11]
Year=list(map(Year,range(df.shape[0])))
df['Month']= pd.DataFrame(Month).astype(str).astype(int)
df['Day']= pd.DataFrame(Day)
df['Day']=df['Day'].str.replace(',', '').astype(str).astype(int)
df['Year']= pd.DataFrame(Year).astype(str).astype(int)
df.head(3)
df.info()
Alternatively, if we convert the variable unixReviewTime, which is in unix time format (the number of seconds since January 01 1970 UTC), into a standard datetime, we can easily use pandas to extract the day, month and year from the date.
#
#convert unix time in seconds to date-time
df["Date_Time"] = pd.to_datetime(df["unixReviewTime"], unit='s')
df['year'] = df['Date_Time'].dt.year
df['month'] = df['Date_Time'].dt.month
df['day'] = df['Date_Time'].dt.day
df.head(3)
Check which columns have missing values¶
print(df.isnull().sum())
# equivalently
#check for missing values
#print(events.apply(lambda x: sum(x.isnull()),axis=0))
print('Number of rows = {} | Number of columns = {}'.format(df.shape[0], df.shape[1]))
# percent missing
print((df.isnull().sum()/df.shape[0])*100)
Visualize Unique Review Column¶
bg_color = (0.5, 0.5, 0.5)
sns.set(rc={"font.style":"normal",
"axes.facecolor":bg_color,
"axes.titlesize":30,
"figure.facecolor":bg_color,
"text.color":"black",
"xtick.color":"black",
"ytick.color":"black",
"axes.labelcolor":"black",
"axes.grid":False,
'axes.labelsize':30,
'figure.figsize':(10.0, 10.0),
'xtick.labelsize':25,
'ytick.labelsize':20})
#plt.rcParams.update(params)
sns.countplot(x=df["overall"], palette="Blues")
plt.xlabel('rating ')
plt.title('rating of the product ')
plt.show()
Visualize missing Values¶
There are no missing observations in the data.
msno.bar(df)
plt.show()
Check for correlation among variables¶
We use the seaborn heatmap function to visualize the correlations between the numeric variables. Common choices of cmap include inferno, viridis, YlGnBu, Blues, BuPu and Greens.
corr = df.select_dtypes(include=np.number).corr()
sns.set(rc={"font.style":"normal",
"axes.titlesize":30,
"text.color":"black",
"xtick.color":"black",
"ytick.color":"black",
"axes.labelcolor":"black",
"axes.grid":False,
'axes.labelsize':30,
'figure.figsize':(10.0, 10.0),
'xtick.labelsize':20,
'ytick.labelsize':20})
sns.heatmap(corr,annot = True, annot_kws={"size": 15},cmap="viridis")
plt.show()
Memory based Collaborative Filtering¶
Construct the user-to-item matrix
#the shape of a vector is equivalent to
# its length
n_users = df['reviewerID'].unique().shape[0]
n_items = df['asin'].unique().shape[0]
n_ratings = df['overall'].unique().shape[0]
print('Number of users = {} | Number of items = {}'.format(str(n_users), str(n_items)))
print ('%i Number of unique ratings' %n_ratings)
Model-based Collaborative Filtering¶
Singular Value Decomposition¶
If a matrix $A$ has a matrix of eigenvectors $P$ that is not invertible, then $A$ does not have an eigendecomposition. The singular value decomposition, in contrast, exists for any $m\times n$ matrix $A$ and is the factorization of $A$ into the product of three matrices
$A=UDV^{T}$.
where $U$ is an $m \times m$ matrix and $V$ is an $n \times n$ matrix, both with orthonormal columns, such that
$UU^{T}=I$
and
$VV^{T}=I$
The matrix $D$ is an $m \times n$ diagonal matrix with non-negative real numbers on the diagonal. The diagonal elements $\sigma_{i}$ of $D$ are the singular values of $A$. The singular values of a matrix $A$ are the square roots of the non-zero eigenvalues of the matrix $A^{T}A$, where $A^{T}$ denotes the conjugate transpose if $A$ has complex coefficients, or the transpose if it has real coefficients.
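As a quick check of this factorization, here is a small NumPy sketch (np.linalg.svd returns the singular values as a vector, which is expanded into the diagonal matrix $D$ below):
import numpy as np
rng = np.random.default_rng(0)
A = rng.random((4, 3))
# Full SVD: U is 4x4, D is 4x3 (diagonal), Vt is 3x3
U, s, Vt = np.linalg.svd(A, full_matrices=True)
D = np.zeros(A.shape)
D[:len(s), :len(s)] = np.diag(s)
# A is recovered (up to floating-point error) from U D V^T
print(np.allclose(A, U @ D @ Vt))
# The singular values are the square roots of the eigenvalues of A^T A
print(np.allclose(s**2, np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]))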
Common applications of SVD include Principal Component Analysis, computing the pseudoinverse, least squares fitting of data, multivariable control, matrix approximation, and determining the rank, range and null space of a matrix.
In matrix factorization based recommendation systems, the factorization is done on the user-item ratings matrix, typically keeping only a reduced number of latent dimensions (for example, $k = 50$ below). Matrix factorization can be understood as finding two matrices whose product approximates the original matrix. The $U$ matrix represents the feature vectors corresponding to the users in the hidden feature space and the $V$ matrix represents the feature vectors corresponding to the items in the hidden feature space.
The predictions can be made by taking a dot product of $U, S$ and $V^{T}$.
Each item can be represented by a vector $q_{i}$. Similarly each user can be represented by a vector $p_{u}$ such that the dot product of those 2 vectors is the unbiased expected rating
$\text{The prediction or expected rating:} \hspace{5mm} \hat{r}_{ui}= q_{i}^{T}p_{u}$
$q_{i}$ and $p_{u}$ can be found by minimizing the squared error difference between the observed rating and the expected rating of the user-item matrix.
The addition of a penalty term in the minimization equation reduces overfitting the training set and allows the model to generalize to unseen data. This is represented by a regularization parameter $\lambda$ multiplied by the square sum of the magnitudes of user and item vectors.
$ \min_{p_{u},q_{i}} \displaystyle \sum\limits_{r_{ui} \in R_{train} } \left( \hat{r}_{ui}-r_{ui} \right)^{2} +\lambda \left( \Vert q_{i} \Vert^{2} +\Vert p_{u}\Vert^{2}\right) $
The residual error between the predicted and actual value can be reduced by adding some characteristics from the dataset to the prediction model.For each user-item $(u,i)$ pair we can extract 3 parameters. $\mu$ which is the average ratings of all items, $b_{i}$ which is the average rating of item $i$ minus $\mu$ and $b_{u} $ which is the average rating given by user $u$ minus $\mu$ which makes the expected rating:
$\text{The prediction or expected rating:} \hspace{5mm} \hat{r}_{ui}= \mu +b_{u} +b_{i}+ q_{i}^{T}p_{u}$
$q_{i}$ and $p_{u}$ can be found by minimizing the squared error difference between the observed rating and the expected rating of the user-item matrix.
$ \min_{p_{u},q_{i},b_{u},b_{i}} \displaystyle \sum\limits_{r_{ui} \in R_{train} } \left( \hat{r}_{ui}-r_{ui} \right)^{2} +\lambda \left( b_{u}^{2} +b_{i}^{2}+\Vert q_{i} \Vert^{2} +\Vert p_{u}\Vert^{2}\right) $
The minimization can be performed by stochastic gradient descent or alternating least squares.
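As a rough illustration of the bias terms (Surprise's SVD estimates $\mu$, $b_{u}$, $b_{i}$ and the latent factors jointly and with regularization, so the values below are only simple unregularized analogues), the global mean and the user/item offsets can be computed directly from the df loaded above:
# Global mean rating mu and simple (unregularized) user/item offsets b_u, b_i
mu = df['overall'].mean()
b_u = df.groupby('reviewerID')['overall'].mean() - mu
b_i = df.groupby('asin')['overall'].mean() - mu
# Baseline estimate mu + b_u + b_i for one (user, item) pair
u, i = df['reviewerID'].iloc[0], df['asin'].iloc[0]
print('mu = %.2f, baseline estimate = %.2f' % (mu, mu + b_u[u] + b_i[i]))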
from surprise import SVD, KNNBasic, NormalPredictor
from surprise import Dataset, Reader, accuracy
from surprise.model_selection import KFold, cross_validate
from sklearn.model_selection import train_test_split
trainset, testset = train_test_split(df, test_size=0.20)
# A reader is still needed but only the rating_scale param is required.
# The Reader class is used to parse a file containing ratings.
reader = Reader(rating_scale=(1, 5))
# The columns must correspond to user id, item id and ratings (in that order).
traindata = Dataset.load_from_df(trainset[['reviewerID', 'asin', 'overall']], reader)
from surprise import SVD, NMF
from surprise.model_selection import cross_validate
algo = SVD()
# 5-fold cross-validation; cross_validate replaces the older evaluate()/split() API
# in recent Surprise releases.
cross_validate(algo, traindata, measures=['RMSE', 'MAE'], cv=5, verbose=True)
# Fit on the full training set before scoring the held-out test set below.
algo.fit(traindata.build_full_trainset())
#algo = NMF()
#evaluate(algo, traindata, measures=['RMSE', 'MAE'])
#algorithms = (SVD, KNNBasic, KNNWithMeans, NormalPredictor)
#pd.DataFrame.as_matrix(testset)
#df.values
predictions = algo.test(testset[['reviewerID', 'asin', 'overall']].values)
# Then compute RMSE
accuracy.rmse(predictions)
accuracy.mae(predictions)
A collaborative filtering algorithm based on Non-negative Matrix Factorization¶
Non-negative matrix factorization is a linear, non-negative approximate data representation. Let the data consist of $T$ measurements of $N$ non-negative scalar variables. For measurement vectors $v^{t}$, $t=1,\cdots,T$, a linear approximation of the data is given by:
$v^{t} \approx \displaystyle \sum\limits_{i=1}^{M}w_{i}h_{i}^{t}=Wh^{t}$
In matrix notation
$ V \approx WH $
where $W$ is an $N \times M$ matrix with the basis vectors $w_{i}$ as its columns, and each column of $H$ contains the coefficient vector $h^{t}$ corresponding to the measurement vector $v^{t}$. All three matrices have no negative elements.
Some common applications of NMF include astronomy, computer vision, document clustering, chemometrics, audio signal processing, recommender systems, and bioinformatics.
In order to construct recommendations, a user $\times$ item matrix is constructed such that every element is a rating where one exists.
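Before turning to Surprise's NMF implementation below, here is a minimal sketch of the same idea with scikit-learn's NMF on a small dense ratings matrix (all values illustrative); the product $WH$ serves as the matrix of predicted ratings:
import numpy as np
from sklearn.decomposition import NMF
# Toy user-item ratings matrix V (0 = unrated); values are illustrative
V = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4]], dtype=float)
# Factorize V ~ WH with M = 2 latent components
model = NMF(n_components=2, init='random', random_state=0, max_iter=500)
W = model.fit_transform(V)  # N x M user factors
H = model.components_       # M x n_items item factors
# Reconstructed (predicted) ratings, all non-negative
print(np.round(W @ H, 2))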
from surprise import SVD, KNNBasic, NormalPredictor
from surprise import Dataset, Reader, accuracy
from surprise.model_selection import KFold, cross_validate
from sklearn.model_selection import train_test_split
trainset, testset = train_test_split(df, test_size=0.20)
# A reader is still needed but only the rating_scale param is required.
# The Reader class is used to parse a file containing ratings.
# path to dataset file
#file_path = '/home/nico/.surprise_data/ml-100k/ml-100k/u.data' # change this
# As we're loading a custom dataset, we need to define a reader. In the
# movielens-100k dataset, each line has the following format:
# 'user item rating timestamp', separated by '\t' characters.
#reader = Reader(line_format='user item rating timestamp', sep='\t')
#data = Dataset.load_from_file(file_path, reader=reader)
#data.split(n_folds=5)
reader = Reader(rating_scale=(1, 5))
# The columns must correspond to user id, item id and ratings (in that order).
traindata = Dataset.load_from_df(trainset[['reviewerID', 'asin', 'overall']], reader)
from surprise import SVD, NMF
from surprise.model_selection import cross_validate
algo = NMF()
# 5-fold cross-validation; cross_validate replaces the older evaluate()/split() API.
cross_validate(algo, traindata, measures=['RMSE', 'MAE'], cv=5, verbose=True)
# Fit on the full training set before scoring the held-out test set below.
algo.fit(traindata.build_full_trainset())
#algo = NMF()
#evaluate(algo, traindata, measures=['RMSE', 'MAE'])
#algorithms = (SVD, KNNBasic, KNNWithMeans, NormalPredictor)
#pd.DataFrame.as_matrix(testset)
#df.values
predictions = algo.test(testset[['reviewerID', 'asin', 'overall']].values)
# Then compute RMSE
accuracy.rmse(predictions)
accuracy.mae(predictions)
pd.DataFrame(predictions).head()
We can obtain the rating prediction for a single user as below:
#uid = 'A2K2BWL8PYPV3S' # raw user id (as in the ratings file). They are **strings**!
#iid = 'B0042CVBNM' # raw item id (as in the ratings file). They are **strings**!
# get a prediction for specific users and items.
l=testset[['reviewerID', 'asin', 'overall']].values[0]
uid=l[0]
iid=l[1]
r=l[2]
pred = algo.predict(uid, iid, r, verbose=True )
SVD With scipy¶
Create User-Item Matrix
user_item = df.pivot(index = 'reviewerID', columns ='asin', values = 'overall').fillna(0)
user_item_matrix = user_item.to_numpy()  # as_matrix() was removed in recent pandas; to_numpy() replaces it
user_item_matrix[0:4,0:5]
Matrix Sparsity (Density)¶
Sparsity is the fraction of zero elements out of the total number of elements, i.e. one minus the density, where density is the fraction of non-zero elements. If most of the elements are non-zero, the matrix is considered dense.
$\text{sparsity} = \frac{\text{count of zero elements}}{ \text{total elements}} =1-\text{density}$
# several ways to find sparsity of a matrix
from numpy import count_nonzero
sparsity =(1.0 - count_nonzero(user_item_matrix) / user_item_matrix.size)*100
print('The sparsity of the data is {} %'.format(sparsity))
sparsity=round(1.0-len(df)/float(n_users*n_items),10)*100
print('The sparsity of the data is {} %'.format(sparsity))
sparsity = float(len((user_item_matrix).nonzero()[0]))
sparsity /= (user_item_matrix.shape[0] * user_item_matrix.shape[1])
sparsity = 1-sparsity
sparsity *= 100
print('The sparsity of the data is {} %'.format(sparsity))
sparsity = float(len((user_item_matrix).nonzero()[0]))
sparsity /= (user_item_matrix.shape[0] * user_item_matrix.shape[1])
sparsity *= 100
print ('percentage of user-items that have a rating: {:.5f}%'.format(sparsity))
# convert to sparse matrix (CSR method)
from scipy.sparse import csr_matrix
user_item_sparse=csr_matrix(user_item_matrix)
# reconstruct dense matrix
#user_item_sparse.todense()
Split user_item_matrix into train and test set¶
trainset_data, testset_data = train_test_split(user_item_sparse, test_size=0.20)
# densify the sparse training split for the demeaning step
A = trainset_data.toarray()
user_ratings_mean = np.mean(A, axis = 1)
A_demeaned = A - user_ratings_mean.reshape(-1, 1)
from scipy.sparse.linalg import svds
U, D, Vt = svds(A_demeaned, k = 50)
D = np.diag(D)
user_predicted_ratings = np.dot(np.dot(U, D), Vt) + user_ratings_mean.reshape(-1, 1)
user_predicted_ratings[0:2,0:5]
prediction = pd.DataFrame(user_predicted_ratings, columns = user_item.columns )
prediction.iloc[0:4,0:8]
from sklearn.metrics import mean_squared_error
from math import sqrt
def rmse(prediction, ground_truth):
    # compare only the entries that have an actual (non-zero) rating
    prediction = prediction[ground_truth.nonzero()].flatten()
    ground_truth = ground_truth[ground_truth.nonzero()].flatten()
    return sqrt(mean_squared_error(prediction, ground_truth))
The pairwise_distances function from sklearn can calculate the cosine similarity. The output ranges from 0 to 1 since the ratings are all positive.
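A short sketch of that route, using the user_item_matrix built earlier (pairwise_distances returns a cosine distance, so the similarity is one minus it; the *_sk names are only for this sketch):
from sklearn.metrics.pairwise import pairwise_distances
# Cosine distance -> cosine similarity (rows are users, columns are items)
user_similarity_sk = 1 - pairwise_distances(user_item_matrix, metric='cosine')
item_similarity_sk = 1 - pairwise_distances(user_item_matrix.T, metric='cosine')
print(user_similarity_sk[0:4, 0:4])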
Memory-Based Collaborative Filtering
This creates a validation set by selecting rows (users) that have 35 or more ratings, randomly choosing 15 of those ratings for the validation set, and setting those values to 0 in the training set.
def train_test_splitt(user_item_matrix):
    validation = np.zeros(user_item_matrix.shape)
    train = user_item_matrix.copy()
    for user in np.arange(user_item_matrix.shape[0]):
        # 35 seems to be best, it depends on sparsity of your user-item matrix
        if len(user_item_matrix[user,:].nonzero()[0])>=35:
            val_ratings = np.random.choice(user_item_matrix[user, :].nonzero()[0],
                                           size=15, #tweak this, 15 seems to be optimal
                                           replace=False)
            train[user, val_ratings] = 0
            validation[user, val_ratings] = user_item_matrix[user, val_ratings]
    print(validation.shape)
    return train, validation
# call the custom splitter defined above (not sklearn's train_test_split)
train, val = train_test_splitt(user_item_matrix)
def cos_similarity(user_item_matrix, kind='user', epsilon=1e-9):
    # epsilon -> small number for handling division-by-zero errors
    if kind == 'user':
        sim = user_item_matrix.dot(user_item_matrix.T) + epsilon
    elif kind == 'item':
        sim = user_item_matrix.T.dot(user_item_matrix) + epsilon
    # normalize by the product of the corresponding vector norms (the square
    # roots of the diagonal) so the result is a true cosine similarity
    norms = np.array([np.sqrt(np.diagonal(sim))])
    return sim / norms / norms.T
user_similarity = cos_similarity(train, kind='user')
item_similarity=cos_similarity(train, kind='item')
user_similarity[0:4,0:5]
def predict_nobias(user_item_matrix, similarity, kind='user'):
    if kind == 'user':
        user_bias = user_item_matrix.mean(axis=1)
        user_item_matrix = (user_item_matrix - user_bias[:, np.newaxis]).copy()
        pred = similarity.dot(user_item_matrix) / np.array([np.abs(similarity).sum(axis=1)]).T
        pred += user_bias[:, np.newaxis]
    elif kind == 'item':
        item_bias = user_item_matrix.mean(axis=0)
        user_item_matrix = (user_item_matrix - item_bias[np.newaxis, :]).copy()
        pred = user_item_matrix.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
        pred += item_bias[np.newaxis, :]
    return pred
user_prediction2 = predict_nobias(train, user_similarity, kind='user')
item_prediction2 = predict_nobias(train, item_similarity, kind='item')
user_prediction2[0:3,0:5]
from sklearn.metrics import mean_squared_error
from math import sqrt
def rmse(prediction, ground_truth):
    prediction = prediction[ground_truth.nonzero()].flatten()
    ground_truth = ground_truth[ground_truth.nonzero()].flatten()
    return sqrt(mean_squared_error(prediction, ground_truth))
print ('User-based bias-adjusted CF RMSE: %.2f' %rmse(user_prediction2, val))
print ('Item-based bias-adjusted CF RMSE: %.2f' %rmse(item_prediction2, val))
Model based collaborative filtering - matrix factorization (via SVD)¶
import scipy.sparse as sp
from scipy.sparse.linalg import svds
#get SVD components from train matrix. Choose k.
u, s, vt = svds(train, k = 50)# dimensionality for rank matrix
s_diag_matrix=np.diag(s)
X_pred = np.dot(np.dot(u, s_diag_matrix), vt)
print ('matrix-factorization CF RMSE: %.2f' %rmse(X_pred, val))
pd.DataFrame(X_pred, columns = user_item.columns).iloc[0:4,0:9]
Using Stochastic Gradient Descent method to update the above low rank matrices for the latent features (online learning).¶
Stochastic gradient descent (SGD), also known as incremental gradient descent, is an iterative method for optimizing a differentiable objective function; it is a stochastic approximation of gradient descent optimization.
The standard gradient descent algorithm updates the parameters $\theta$ of the objective $J(\theta)$ as,
$\theta=\theta-\eta \nabla_{\theta}E[J(\theta)] $
where $\eta$ is the step size (the learning rate in machine learning). The expectation in the above equation is approximated by evaluating the cost and gradient over the full training set. Stochastic gradient descent (SGD) replaces the expectation and computes the gradient of the parameters using only a single example, or a few randomly shuffled examples, from the training set. The new update is given by,
$\theta=\theta-\eta \nabla_{\theta}J(\theta;x^{(i)},y^{(i)}) $
with a pair $(x^{(i)},y^{(i)})$ from the training set.
#P is latent user feature matrix
#Q is latent item feature matrix
def prediction(P,Q):
    return np.dot(P.T,Q)
lmbda = 0.4  # Regularization parameter
k = 3  # tweak this parameter
m, n = train.shape  # Number of users and items
n_epochs = 100  # Number of epochs
eta = 0.01  # Learning rate
P = 3 * np.random.rand(k,m)  # Latent user feature matrix
Q = 3 * np.random.rand(k,n)  # Latent movie feature matrix
train_errors = []
val_errors = []
# Only consider items with ratings
users, items = train.nonzero()
for epoch in range(n_epochs):
    for u, i in zip(users, items):
        e = train[u, i] - prediction(P[:,u], Q[:,i])  # Calculate error for gradient update
        P[:,u] += eta * ( e * Q[:,i] - lmbda * P[:,u])  # Update latent user feature matrix
        Q[:,i] += eta * ( e * P[:,u] - lmbda * Q[:,i])  # Update latent item feature matrix
    train_rmse = rmse(prediction(P,Q), train)
    val_rmse = rmse(prediction(P,Q), val)
    train_errors.append(train_rmse)
    val_errors.append(val_rmse)
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(rc={"font.style":"normal",
#"axes.facecolor":bg_color,
"axes.titlesize":30,
#"figure.facecolor":bg_color,
"text.color":"black",
"xtick.color":"black",
"ytick.color":"black",
"axes.labelcolor":"black",
"axes.grid":True,
'axes.labelsize':20,
'figure.figsize':(10.0, 7.0),
'xtick.labelsize':25,
'legend.fontsize': 16,
'legend.handlelength': 2,
'ytick.labelsize':20})
plt.plot(range(n_epochs), train_errors, marker='o', label='Training Data');
plt.plot(range(n_epochs), val_errors, marker='v', label='Validation Data');
plt.xlabel('Number of Epochs');
plt.ylabel('RMSE');
plt.legend()
plt.grid()
plt.show()
Take a look at prediction vs. actual ratings¶
SGD_prediction=prediction(P,Q)
estimation= SGD_prediction[val.nonzero()]
ground_truth = val[val.nonzero()]
results=pd.DataFrame({'prediction':estimation, 'actual rating':ground_truth})
results.head()
Finding the Optimal k¶
Top-k Collaborative Filtering¶
The prediction error can be improved by considering only the top $k$ users who are most similar to the input user (or, similarly, the top $k$ items). That is, when we calculate the sums over $u^{\prime}$ in
\begin{equation*} \hat{r}_{u,i}= \frac{\displaystyle \sum\limits_{u^{\prime} \in U }\text{Similarity }(u,u^{\prime})r_{u^{\prime},i}}{\displaystyle \sum\limits_{u^{\prime} \in U }\left|\text{Similarity }(u,u^{\prime})\right|} \end{equation*}
the sum is taken over only the top $k$ most similar users. Parameter tuning can be used to find the value of $k$ that minimizes the RMSE on the test set.
def predict_topk(user_item_matrix, similarity, kind='user', k=40):
    pred = np.zeros(user_item_matrix.shape)
    if kind == 'user':
        for i in range(user_item_matrix.shape[0]):
            # indices of the k users most similar to user i
            top_k_users = np.argsort(similarity[:, i])[:-k-1:-1]
            for j in range(user_item_matrix.shape[1]):
                pred[i, j] = similarity[i, top_k_users].dot(user_item_matrix[top_k_users, j])
                pred[i, j] /= np.sum(np.abs(similarity[i, top_k_users]))
    if kind == 'item':
        for j in range(user_item_matrix.shape[1]):
            # indices of the k items most similar to item j
            top_k_items = np.argsort(similarity[:, j])[:-k-1:-1]
            for i in range(user_item_matrix.shape[0]):
                pred[i, j] = similarity[j, top_k_items].dot(user_item_matrix[i, top_k_items])
                pred[i, j] /= np.sum(np.abs(similarity[j, top_k_items]))
    return pred
from sklearn.metrics import mean_squared_error
from math import sqrt
def get_rmse(pred, actual):
    # Ignore zero (unrated) entries.
    pred = pred[actual.nonzero()].flatten()
    actual = actual[actual.nonzero()].flatten()
    return sqrt(mean_squared_error(pred, actual))
pred = predict_topk(train, user_similarity, kind='user', k=40)
print('Top-k User-based CF RMSE: {}'.format(str(get_rmse(pred, val))))
pred = predict_topk(train, item_similarity, kind='item', k=40)
print('Top-k Item-based CF RMSE: {}'.format(str(get_rmse(pred, val))))
k_array = [5, 15, 30, 50, 100, 200]
user_train_rmse = []
user_test_rmse = []
item_test_rmse = []
item_train_rmse = []
for k in k_array:
    user_pred = predict_topk(train, user_similarity, kind='user', k=k)
    item_pred = predict_topk(train, item_similarity, kind='item', k=k)
    user_train_rmse += [get_rmse(user_pred, train)]
    user_test_rmse += [get_rmse(user_pred, val)]
    item_train_rmse += [get_rmse(item_pred, train)]
    item_test_rmse += [get_rmse(item_pred, val)]
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
pal = sns.color_palette("viridis", 2)
plt.figure(figsize=(12, 8))
plt.plot(k_array, user_train_rmse, c=pal[0], label='User-based train', alpha=0.5, linewidth=5)
plt.plot(k_array, user_test_rmse, c=pal[0], label='User-based test', linewidth=5)
plt.plot(k_array, item_train_rmse, c=pal[1], label='Item-based train', alpha=0.5, linewidth=5)
plt.plot(k_array, item_test_rmse, c=pal[1], label='Item-based test', linewidth=5)
plt.legend(loc='best', fontsize=20)
plt.xticks(fontsize=16);
plt.yticks(fontsize=16);
plt.xlabel('K', fontsize=20);
plt.ylabel('RMSE', fontsize=20);