CS 432/532 Web Science
Spring 2017
http://phonedude.github.io/cs532-s17/
Assignment #7
Due: 11:59pm April 6
The goal of this project is to apply the basic recommendation principles
we have learned to user-collected data. You will modify the code
given to you, which performs movie recommendations from the MovieLens
data sets.
The MovieLens data sets were collected by the GroupLens Research
Project at the University of Minnesota during the seven-month period
from September 19th, 1997 through April 22nd, 1998. We are using the
"100k dataset", available for download from:
http://grouplens.org/datasets/movielens/100k/
There are three files which we will use:
1. u.data: 100,000 ratings by 943 users on 1,682 movies. Each
user has rated at least 20 movies. Users and items are numbered
consecutively from 1. The data is randomly ordered. This is a tab
separated list of
user id | item id | rating | timestamp
The time stamps are unix seconds since 1/1/1970 UTC.
Example:
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
115 265 2 881171488
2. u.item: Information about the 1,682 movies. This is a
pipe-separated ("|") list of
movie id | movie title | release date | video release date | IMDb URL | unknown | Action | Adventure | Animation | Children's | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western
The last 19 fields are the genres: a 1 indicates the movie is of
that genre and a 0 indicates it is not. Movies can be in several genres
at once. The movie ids are the ones used in the u.data data set.
Example:
161|Top Gun (1986)|01-Jan-1986||http://us.imdb.com/M/title-exact?Top%20Gun%20(1986)|0|1|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0|0|0
162|On Golden Pond (1981)|01-Jan-1981||http://us.imdb.com/M/title-exact?On%20Golden%20Pond%20(1981)|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0
163|Return of the Pink Panther, The (1974)|01-Jan-1974||http://us.imdb.com/M/title-exact?Return%20of%20the%20Pink%20Panther,%20The%20(1974)|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0
3. u.user: Demographic information about the users. This is a
pipe-separated ("|") list of:
user id | age | gender | occupation | zip code
The user ids are the ones used in the u.data data set.
Example:
1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
4|24|M|technician|43537
5|33|F|other|15213
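The three record layouts above can be read with a few lines of Python.
The following is only a minimal sketch, not the PCI code: the function
names (parse_ratings, parse_item, parse_user) are made up for
illustration, and it runs on sample lines copied from the examples
above rather than on the real files.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Genre column names, in the order listed for u.item above.
GENRES = ["unknown", "Action", "Adventure", "Animation", "Children's",
          "Comedy", "Crime", "Documentary", "Drama", "Fantasy",
          "Film-Noir", "Horror", "Musical", "Mystery", "Romance",
          "Sci-Fi", "Thriller", "War", "Western"]


def parse_ratings(lines):
    """Build {user_id: {item_id: rating}} from tab-separated u.data lines."""
    prefs = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        user, item, rating, _timestamp = line.split("\t")
        prefs[int(user)][int(item)] = float(rating)
    return dict(prefs)


def parse_item(line):
    """Parse one pipe-separated u.item line into a dict."""
    fields = line.rstrip("\n").split("|")
    flags = fields[5:5 + len(GENRES)]
    return {
        "id": int(fields[0]),
        "title": fields[1],
        "release_date": fields[2],
        "imdb_url": fields[4],
        "genres": [g for g, f in zip(GENRES, flags) if f.strip() == "1"],
    }


def parse_user(line):
    """Parse one pipe-separated u.user line; the zip code stays a string."""
    uid, age, gender, occupation, zip_code = line.strip().split("|")
    return {"id": int(uid), "age": int(age), "gender": gender,
            "occupation": occupation, "zip": zip_code}


# Sample lines taken from the examples above.
prefs = parse_ratings(["196\t242\t3\t881250949", "186\t302\t3\t891717742"])
top_gun = parse_item(
    "161|Top Gun (1986)|01-Jan-1986||"
    "http://us.imdb.com/M/title-exact?Top%20Gun%20(1986)"
    "|0|1|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0|0|0")
user1 = parse_user("1|24|M|technician|85711")

# Timestamps are Unix seconds, so they convert directly to UTC datetimes.
rated_at = datetime.fromtimestamp(881250949, tz=timezone.utc)
```

When reading the real files, passing each open file object to these
functions line by line should work. Opening u.item may need an
explicit encoding such as latin-1; that is an assumption to check
against your copy of the data.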
The code for reading from the u.data and u.item files and creating
recommendations is described in the book Programming Collective
Intelligence. Feel free to modify the PCI code to answer the
following questions.
Questions (10 points).
1. Find 3 users who are closest to you in terms of age,
gender, and occupation. For each of those 3 users:
- what are their top 3 favorite films?
- bottom 3 least favorite films?
Based on the movie values in those 6 tables (3 users X (favorite +
least)), choose a user that you feel is most like you. Feel
free to note any outliers (e.g., "I mostly identify with user 123,
except I did not like ``Ghost'' at all").
This user is the "substitute you".
2. Which 5 users are most correlated to the substitute you? Which
5 users are least correlated (i.e., negative correlation)?
3. Compute ratings for all the films that the substitute you
has not seen. Provide a list of the top 5 recommendations for films
that the substitute you should see. Provide a list of the bottom
5 recommendations (i.e., films the substitute you is almost certain
to hate).
4. Choose your (the real you, not the substitute you) favorite and
least favorite film from the data. For each film, generate a list
of the top 5 most correlated and bottom 5 least correlated films.
Based on your knowledge of the resulting films, do you agree with
the results? In other words, do you personally like / dislike
the resulting films?
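The computations behind questions 1 through 4 can be sketched as
follows. This is a hedged illustration in the spirit of the PCI
approach (Pearson similarity and a similarity-weighted average), not
the book's code: all function names are hypothetical, the toy ratings
are invented, and the tie-breaking order in closest_users (gender,
then occupation, then age gap) is an arbitrary assumption you may
want to change.

```python
from math import sqrt


def closest_users(users, me, n=3):
    """Q1: rank users by gender match, occupation match, then age gap."""
    def key(u):
        return (u["gender"] != me["gender"],
                u["occupation"] != me["occupation"],
                abs(u["age"] - me["age"]))
    return sorted(users, key=key)[:n]


def sim_pearson(prefs, a, b):
    """Pearson correlation over the items that both a and b have rated."""
    shared = [i for i in prefs[a] if i in prefs[b]]
    n = len(shared)
    if n == 0:
        return 0.0
    xs = [prefs[a][i] for i in shared]
    ys = [prefs[b][i] for i in shared]
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs) *
               sum((y - my) ** 2 for y in ys))
    return cov / den if den else 0.0


def top_matches(prefs, who, n=5, reverse=True):
    """Q2: the n most correlated others (least, with reverse=False)."""
    scores = [(sim_pearson(prefs, who, other), other)
              for other in prefs if other != who]
    return sorted(scores, reverse=reverse)[:n]


def get_recommendations(prefs, who):
    """Q3: similarity-weighted mean rating for each item who has not rated."""
    totals, sim_sums = {}, {}
    for other in prefs:
        if other == who:
            continue
        sim = sim_pearson(prefs, who, other)
        if sim <= 0:  # ignore uncorrelated and negatively correlated users
            continue
        for item, rating in prefs[other].items():
            if item not in prefs[who]:
                totals[item] = totals.get(item, 0.0) + rating * sim
                sim_sums[item] = sim_sums.get(item, 0.0) + sim
    return sorted(((totals[i] / sim_sums[i], i) for i in totals),
                  reverse=True)


def transform_prefs(prefs):
    """Q4: flip {user: {item: r}} into {item: {user: r}}."""
    flipped = {}
    for user, ratings in prefs.items():
        for item, r in ratings.items():
            flipped.setdefault(item, {})[user] = r
    return flipped


# Invented toy ratings: B agrees with A, C rates the opposite way.
toy = {"A": {1: 5, 2: 3, 3: 1},
       "B": {1: 4, 2: 3, 3: 2, 4: 5},
       "C": {1: 1, 2: 3, 3: 5, 4: 1}}
most = top_matches(toy, "A", n=1)
least = top_matches(toy, "A", n=1, reverse=False)
recs = get_recommendations(toy, "A")
by_item = transform_prefs(toy)

# Demographic rows copied from the u.user example; "me" is hypothetical.
sample_users = [
    {"id": 1, "age": 24, "gender": "M", "occupation": "technician"},
    {"id": 2, "age": 53, "gender": "F", "occupation": "other"},
    {"id": 3, "age": 23, "gender": "M", "occupation": "writer"},
    {"id": 4, "age": 24, "gender": "M", "occupation": "technician"},
    {"id": 5, "age": 33, "gender": "F", "occupation": "other"},
]
me = {"age": 25, "gender": "M", "occupation": "technician"}
nearest_ids = [u["id"] for u in closest_users(sample_users, me)]
```

For question 4, calling top_matches on the flipped dictionary with a
movie id in place of a user id gives item-to-item correlations. The
bottom 5 recommendations for question 3 are the last five entries of
the sorted list.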
Extra credit
(3 points)
5. Rank the 1,682 movies according to the 1997/1998 MovieLens
data. Now rank the same 1,682 movies according to today's (March
2016) IMDb data (break ties based on # of users, for example: 7.2
with 10,000 raters > 7.2 with 9,000 raters).
Draw a graph, where each dot is a film (i.e., 1,682 dots). The
x-axis is the MovieLens ranking and the y-axis is today's IMDb
ranking.
What is Pearson's r for the two lists (along w/ the p-value)? Assuming
the two user bases are interchangeable (which might not be a good
assumption), what does this say about the attitudes about the films
after nearly 20 years?
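The ranking with tie-breaking and the correlation of the two rank
lists can be sketched as below. The ratings and rater counts are
invented for illustration, and rank_movies and pearson_r are
hypothetical helper names. The p-value step assumes SciPy is
installed and is skipped otherwise.

```python
from math import sqrt


def rank_movies(stats):
    """Map title -> rank (1 = best) from title -> (mean_rating, num_raters).

    Ties on mean rating are broken by the larger number of raters.
    """
    ordered = sorted(stats, key=lambda t: (-stats[t][0], -stats[t][1]))
    return {title: pos for pos, title in enumerate(ordered, start=1)}


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs) *
               sum((y - my) ** 2 for y in ys))
    return cov / den if den else 0.0


# Invented numbers: (mean rating, number of raters) per film.
old = {"A": (4.5, 300), "B": (4.5, 250), "C": (3.0, 100), "D": (2.0, 50)}
new = {"A": (8.1, 10000), "B": (7.2, 9000), "C": (7.2, 10000), "D": (5.0, 400)}

old_rank = rank_movies(old)
new_rank = rank_movies(new)  # C outranks B: same 7.2, more raters

titles = sorted(old)
xs = [old_rank[t] for t in titles]
ys = [new_rank[t] for t in titles]
r = pearson_r(xs, ys)

# Optional: SciPy's pearsonr also returns the two-sided p-value.
try:
    from scipy.stats import pearsonr
    r_scipy, p_value = pearsonr(xs, ys)
except ImportError:
    p_value = None
```

Note that Pearson's r computed on rank positions without ties is the
same quantity as Spearman's rank correlation, which may be worth
keeping in mind when interpreting the result.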
(3 points)
6. Repeat #5, but with IMDb data from approximately July 31, 2005. What
is the cumulative error (in days) from the desired target day of
July 31, 2005? For example, if 1 memento is from July 1, 2005 and
another memento is from July 31, 2006, then the cumulative error
for the two mementos is 30 days + 365 days = 395 days.
Note: the URIs in the MovieLens data redirect, so be sure to use
the final values as URI-Rs for the archives:
$ curl -i -L --silent "http://us.imdb.com/M/title-exact?Top%20Gun%20(1986)"
HTTP/1.1 301 Moved Permanently
Date: Wed, 16 Mar 2016 18:37:06 GMT
Server: Server
Location: http://www.imdb.com/M/title-exact?Top%20Gun%20(1986)
Content-Length: 260
Content-Type: text/html; charset=iso-8859-1
HTTP/1.1 302 Found
Date: Wed, 16 Mar 2016 18:37:06 GMT
Server: HTTPDaemon
X-Frame-Options: SAMEORIGIN
Cache-Control: private
Location: http://www.imdb.com/title/tt0092099/
Content-Type: text/plain
Set-Cookie: uu=BCYuNIAbuc9FDeWcqVNAaaXLjXbagPPhyTQbhxr8CTOkHFcqkeyRbKqvk_m6buuHjmHkufNf5z5S4WGfKlG6BPOhzgA-jcsRZ5Q7GW2MJP0wNI9AZMnd245Mw_xI6spRuK_VF2lSxUGPIRXy4d-NY-YwZkqTEZ8uTOXchLSvqBpgsDI;expires=Thu, 30 Dec 2037 00:00:00 GMT;path=/;domain=.imdb.com
Vary: Accept-Encoding,User-Agent
P3P: policyref="http://i.imdb.com/images/p3p.xml",CP="CAO DSP LAW CUR ADM IVAo IVDo CONo OTPo OUR DELi PUBi OTRi BUS PHY ONL UNI PUR FIN COM NAV INT DEM CNT STA HEA PRE LOC GOV OTC "
Content-Length: 0
HTTP/1.1 200 OK
Date: Wed, 16 Mar 2016 18:37:06 GMT
Server: Server
X-Frame-Options: SAMEORIGIN
Content-Security-Policy: frame-ancestors 'self' imdb.com *.imdb.com *.media-imdb.com withoutabox.com *.withoutabox.com amazon.com *.amazon.com amazon.co.uk *.amazon.co.uk amazon.de *.amazon.de translate.google.com images.google.com www.google.com www.google.co.uk search.aol.com bing.com www.bing.com
Content-Type: text/html;charset=UTF-8
Content-Language: en-US
Vary: Accept-Encoding,User-Agent
[deletia...]
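Two small pieces of question 6 can be sketched in Python: picking the
final redirect target out of saved "curl -i -L" output, and summing
the distance in days between each memento and the target date. Both
function names are hypothetical, and the header text is an abbreviated
copy of the output above.

```python
from datetime import date


def final_location(curl_output):
    """Return the last Location header value in the output, or None."""
    last = None
    for line in curl_output.splitlines():
        if line.lower().startswith("location:"):
            last = line.split(":", 1)[1].strip()
    return last


def cumulative_error_days(memento_dates, target):
    """Sum of absolute day differences between each memento and target."""
    return sum(abs((d - target).days) for d in memento_dates)


# Abbreviated headers from the Top Gun example above.
headers = """HTTP/1.1 301 Moved Permanently
Location: http://www.imdb.com/M/title-exact?Top%20Gun%20(1986)
HTTP/1.1 302 Found
Location: http://www.imdb.com/title/tt0092099/
HTTP/1.1 200 OK
Content-Type: text/html;charset=UTF-8
"""
uri_r = final_location(headers)

# The two mementos from the worked example: 30 days + 365 days.
target = date(2005, 7, 31)
mementos = [date(2005, 7, 1), date(2006, 7, 31)]
total = cumulative_error_days(mementos, target)
```

The memento dates themselves would come from whichever web archive
you query; how to obtain them is left to your own tooling.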