Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added the capability of incremental SVD to python-recsys which is "Folding-in" new users (or items) to the SVD model so they can receive recommendations instantly without re-building the model from scratch each time new users come to the system. #31

Open
wants to merge 7 commits into
base: master
Choose a base branch
from
1 change: 1 addition & 0 deletions AUTHORS
Original file line number Diff line number Diff line change
@@ -1 +1,2 @@
Oscar Celma (ocelma __at__ gmail __dot__ com), http://ocelma.net
Ibrahim Abou Elseoud (Ibrahim__dot__Elseoud__at__ gmail __dot__ com), for updating SVD model part
5 changes: 5 additions & 0 deletions CHANGELOG
Original file line number Diff line number Diff line change
Expand Up @@ -28,3 +28,8 @@ See: https://github.com/ocelma/python-recsys/commits/master

2011-10-08
* Added the whole project at github

2017-06-08
* Added updating the SVD model incrementally by folding-in
* Added a capability to split the dataset for train,test,foldin to facilitate testing the Fold-in implementation
* Added update to github
160 changes: 147 additions & 13 deletions README.rst
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,20 @@ python-recsys

A python library for implementing a recommender system.

Incremental SVD update for python-recsys
========================================
- python-recsys now supports incrementally adding new users or items instead of building the model from scratch for these new users or items via the folding-in technique which was mentioned in Sarwar et al.'s `paper`_ (Titled: Incremental Singular Value Decomposition Algorithms for Highly Scalable Recommender Systems), this latest commit is simply an implementation to it for python-recsys.

.. _`paper`: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.3.7894&rep=rep1&type=pdf

- A `Demonstration video is available`_ for this latest commit in form of a demo site built using the MEAN stack which uses the updated python-recsys as backend for the recommender which folds-in the website's user in to the SVD model and gets recommendations instantaneously instead of building the model from scratch.

.. _`Demonstration video is available`: https://youtu.be/tIvQxBfa2d4

- There is also an accompanying `bachelor thesis paper`_ (For those interested) which outlines the background, architecture and discusses the "Folding-in" approach.

.. _`bachelor thesis paper`: https://drive.google.com/file/d/0BylQe2cRVWE_RmZoUTJYSGZNaXM/view

Installation
============

Expand Down Expand Up @@ -57,20 +71,20 @@ Example

from recsys.algorithm.factorize import SVD
svd = SVD()
svd.load_data(filename='./data/movielens/ratings.dat',
sep='::',
svd.load_data(filename='./data/movielens/ratings.dat',
sep='::',
format={'col':0, 'row':1, 'value':2, 'ids': int})

2. Compute Singular Value Decomposition (SVD), M=U Sigma V^t:

::

k = 100
svd.compute(k=k,
min_values=10,
pre_normalize=None,
mean_center=True,
post_normalize=True,
svd.compute(k=k,
min_values=10,
pre_normalize=None,
mean_center=True,
post_normalize=True,
savefile='/tmp/movielens')

3. Get similarity between two movies:
Expand Down Expand Up @@ -111,10 +125,10 @@ Example
USERID = 1

svd.predict(ITEMID, USERID, MIN_RATING, MAX_RATING)
# Predicted value 5.0
# Predicted value 5.0

svd.get_matrix().value(ITEMID, USERID)
# Real value 5.0
# Real value 5.0

6. Recommend (non-rated) movies to a user:

Expand Down Expand Up @@ -152,7 +166,129 @@ Example
(4801, 5.4947999354188548),
(1131, 5.4941438045650068),
(2339, 5.4916048051511659)]


Example for incremental update
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1. Load Movielens dataset and prepare for training and testing:

::

import recsys.algorithm
recsys.algorithm.VERBOSE = True

from recsys.algorithm.factorize import SVD
from recsys.datamodel.data import Data

filename = “(your movielens file path here)”

#In movielens dataset, the user is at 0 so I made them the row (could keep it as above {'col':0, 'row':1, 'value':2, 'ids': int} but I changed order to emphasis a parameter in an upcoming function)
format = {'col':1, 'row':0, 'value':2, 'ids': int}

data = Data()
data.load(filename, sep='::', format=format)
#splits the dataset according to row or column (based on is_row=true or false) which causes there to be no overlap (of users for example) between train and foldin dataset
train, test, foldin = data.split_train_test_foldin(base=60,percentage_base_user=80,shuffle_data=True,is_row=True) #since users are in the row so is_row=true

# Returns: a tuple <Data, Data, Data> for train, test, foldin
# Prints: (If VERBOSE=True)
total number of tuples: 1000209
percentage of data for training: 48.0 % with 479594 tuples
percentage of data for testing: 20.0 % with 200016 tuples # 100-percentage_base_user per user (percentage of tuples which means the ratings since a user has many tuples(ratings))
percentage of data for foldin: 32.0 % with 320599 tuples
_____________
percentage of users for foldin: 40.0 % with 2416 users # 100-base= foldin (percentage of users)
percentage of users for training: 60.0 % with 3624 users #base for training (percentage of users)

2. Compute Singular Value Decomposition (SVD), M=U Sigma V^t:

::

svd = SVD()
svd.set_data(train)
svd.compute(k=100,
min_values=1,
pre_normalize=None,
mean_center=False,
post_normalize=True)

# Prints:
Creating matrix (479594 tuples)
Matrix density is: 3.7007%
Updating matrix: squish to at least 1 values
Computing svd k=14, min_values=1, pre_normalize=None, mean_center=False, post_normalize=False

3. "Foldin" those new users or items (update model instead of updating from scratch)

::

svd.load_updateDataBatch_foldin(data=foldin,is_row=True)

# Prints: (If VERBOSE=True)
before updating, M= (3624, 3576)
done updating, M= (6040, 3576) # Folds in all the new users (not previously in model)

4. Recommend (non-rated) movies to a NEW user
::

user_id=foldin[0][1] #returns userID which is in foldin dataset BUT not in train dataset
svd.recommend(user_id,is_row=True,only_unknowns=True) #The userID is in row and gets only the unrated (unknowns)

# Returns: <ITEMID, Predicted Rating>
[(1307, 3.6290483094468913),
(1394, 3.5741565545425957),
(1259, 3.5303836262378048),
(1968, 3.4565426585553927),
(2791, 3.3470277643217203),
(1079, 3.268283171487782),
(1198, 3.2381080336246675),
(593, 3.204915630088236),
(1270, 3.1859618303393233),
(2918, 3.1548530640630252)]

5. Recommend (non-rated) movies to a NEW user and validate not in base model (prior to folding-in)
::

# BEFORE running points 3 and 4 (prior to calling svd.load_updateDataBatch_foldin)

user_id=foldin[0][1] #returns userID which is in foldin dataset BUT not in train dataset

# Try block to validate that the userID is new and not in the base model
try:
print "Getting recommendation for user_id which was not in original model training set"
print "recommendations:",svd.recommend(user_id)
except Exception:
print "New user not in base model so in except block and will foldin the foldin dataset (update the model NOT calculate from scratch)"
svd.load_updateDataBatch_foldin(data=foldin,format=format,is_row=True,truncate=True,post_normalize=True)
print "recommendations:",svd.recommend(user_id,is_row=True,only_unknowns=True) #The userID is in row and get us only the unrated (unknowns)


# Prints:
Getting recommendation for user_id which was not in original model training set
recommendations: New user not in base model so in except block and will foldin the foldin dataset (update the model NOT calculate from scratch)
before updating, M= (3624, 3576)
done updating, M= (6040, 3576)
recommendations: [(1307, 3.6290483094468913), (1394, 3.5741565545425957), (1259, 3.5303836262378048), (1968, 3.4565426585553927), (2791, 3.3470277643217203), (1079, 3.268283171487782), (1198, 3.2381080336246675), (593, 3.204915630088236), (1270, 3.1859618303393233), (2918, 3.1548530640630252)]


6. Load previous SVD model and foldin NEW users from file then instantly get recommendations
::

format = {'col':1, 'row':0, 'value':2, 'ids': int}

svd = SVD()
#load base svd model
svd.load_model('SVDModel')

# load new users by their movie rating data file and use it to fold-in the users into the model (loads data and folds in)
svd.load_updateDataBatch_foldin(filename = 'newUsers.dat', sep='::', format=formate, is_row=True)

# gets recommendedations
print "recommendations:", svd.recommend(new_userID,is_row=True,only_unknowns=True)


- All the normal functionalities of python-recsys are compatible with the incremental update commit. The incremental update can even work if you load the model then foldin a new user or users or even items.

- Please note that preexisting users can't be folded-in only new users which aren't already in the svd model.

Documentation
~~~~~~~~~~~~~
Expand All @@ -168,10 +304,8 @@ To create the HTML documentation files from doc/source do:
cd doc
make html

HTML files are created here:
HTML files are created here:

::

doc/build/html/index.html


11 changes: 11 additions & 0 deletions recsys/algorithm/baseclass.py
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,15 @@ def __init__(self):
self._matrix_similarity = None #self-similarity matrix (only for the input Matrix rows)
self._matrix_and_data_aligned = False #both Matrix and Data contain the same info?

#new for update
self._updateData=Data()
self._singleUpdateMatrix=SparseMatrix()
#new for batch
self._batchDict={}
#new for foldin additional
self._singleAdditionalFoldin=SparseMatrix()


def __len__(self):
return len(self.get_data())

Expand Down Expand Up @@ -97,6 +106,8 @@ def load_data(self, filename, force=True, sep='\t', format={'value':0, 'row':1,
self._matrix_similarity = None

self._data.load(filename, force, sep, format, pickle)



def save_data(self, filename, pickle=False):
"""
Expand Down
Loading