-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathEDA
205 lines (205 loc) · 15.3 KB
/
EDA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**If you lost points on the last checkpoint you can get them back by responding to TA/IA feedback** \n",
"\n",
"Update/change the relevant sections where you lost those points, make sure you respond on GitHub Issues to your TA/IA to call their attention to the changes you made here.\n",
"\n",
"Please update your Timeline... no battle plan survives contact with the enemy, so make sure we understand how your plans have changed."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# COGS 108 - Data Checkpoint"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Names\n",
"\n",
"- Leila Andrade\n",
"- Suhaib Chowdhury\n",
"- Yuang Cui\n",
"- Noah Kim\n",
"- Yvonne Ng"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Research Question"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What characteristics of any given Lego set contributes the most, the least or varying degree to its popularity amongst a population of online consumers worldwide? How accurately can these relationships derived from analyzing the characteristics amongst existing Lego sets predict the popularity of future sets? Some of the characteristics that we would like to explore further in the following sub-questions:\n",
"\n",
"- What LEGO sets have been inspired by movies that contribute to set popularity?\n",
"- How do varying number of pieces amongst sets influence their popularity?\n",
"- Does the color variability contribute to set popularity?\n",
"- Does price variability have a significant impact on set popularity?\n",
"- What impact do consumer ratings and reviews have on set popularity?\n",
"- How does the presence of LEGO Minifigures affect the set popularity?\n",
"- Does the designer of the LEGO set have any influence on its popularity?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Background and Prior Work"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"LEGO® Group was built brick by brick to become the powerhouse that it is today, reaching millions of consumers and accumulating billions in revenue each year. Increases in global market share and profit margins reflect their rise in popularity. In 2022, the LEGO Group experienced a whopping 12 percent growth in consumer sales which can be attributed to introduction of sets from themes such as LEGO® City, LEGO® Icons and LEGO® Technic™, as well as themes with intellectual property partners such as LEGO® *Star Wars*™ and LEGO® Harry Potter™ <a name=\"cite_ref-1\"></a>[<sup>1</sup>](#cite_note-1). All these LEGO sets have one thing in common - the brick - which comes in different shapes, sizes, and colors.\n",
"LEGO®s were not always synonymous with the versatile plastic pieces meant for play, exploration, art or more. Ole Kirk Kristiansen, a master carpenter from Billund, Denmark, began crafting wooden toys, and combined the Danish words “Leg Godt” for “play well” to form the renowned “LEGO®” in 1936 <a name=\"cite_ref-2\"></a>[<sup>2</sup>](#cite_note-2). Ole also launched the LEGO® bricks predecessors, Automatic Binding Bricks, however it was Godtfred Kirk Christiansen, his son, that catapulted LEGO® Group to new heights in the 1950s by introducing LEGO® System in Play and re-designing the LEGO® brick <a name=\"cite_ref-2\"></a>[<sup>2</sup>](#cite_note-2). The next few decades witness an explosion in available LEGO® sets due to the introduction of new pieces, themes, and entertainment as well as construction of stores, factories and theme parks. \n",
"\n",
"A 2016 analysis conducted by Mode data scientist Joel Carron, revealed some interesting findings from a 67-year timeline of LEGO® sets. For instance, LEGO® bricks are graying and large sets with more pieces continue to grow while small sets are relatively the same <a name=\"cite_ref-3\"></a>[<sup>3</sup>](#cite_note-3). Thus, we find it critical to analyze the colors and number of pieces across all types of sets. Another notable trend is that the number of sets released each year have increased except when the LEGO® group was battling against bankruptcy between the years of 2004 and 2009 <a name=\"cite_ref-3\"></a>[<sup>3</sup>](#cite_note-3). This is not surprising as the whole world was experiencing a financial crisis and many companies struggled to stay afloat. Hence, this period of financial uncertainty could have an impact on our data, and we need to be cognizant of the fact. \n",
"\n",
"Subsequently, a 2021 analysis by data enthusiast Yvonne Zhang that was inspired by the previous study disclosed more updated findings. For example, 9,625 sets out of 16,691 unique sets have themes <a name=\"cite_ref-4\"></a>[<sup>4</sup>](#cite_note-4). Simply put, themed LEGO® sets make-up 58 percent of all sets and could indicate a greater influence than generic sets. Zhang also emphasized that majority of sets were derived from LEGO® System theme before 1970, then shifted to Legoland theme in the 1970s and onto other themes such as Duplo, Universal Building, Bionicle from the 1980s to the 2010s <a name=\"cite_ref-4\"></a>[<sup>4</sup>](#cite_note-4). This goes to show that themes can come and go, but the bricks are here forever. LEGO® Group needs to continue developing and releasing new themes to keep capture the heart of their fans and the wallets of their consumers. Zhang reaffirms the upward trend for number of sets and number of pieces per set <a name=\"cite_ref-4\"></a>[<sup>4</sup>](#cite_note-4). This further indicates that the two quantitative characteristics could have a significant impact on measuring the popularity of future sets. Lastly, Zhang elaborated on how the bricks are graying by showing that black, white, light bluish gray, dark bluish gray, and red are the dominant colors in the past 72 years <a name=\"cite_ref-4\"></a>[<sup>4</sup>](#cite_note-4). When color tastes changes in society, then color palettes on products do the same. As a result, color cannot be overlooked when attempting to predict the popularity of a future set. \n",
"\n",
"A recent 2023 analysis by data analyst Tyran Christian provided the most extensive and eye-opening look into LEGO sets by using a data set from Rebrickable. Star Wars and Technic sets have the highest number of parts and have longevity amongst other favorites including Duplo, City, Creator, Ninjago with Marvel Super Heros being a rising theme <a name=\"cite_ref-5\"></a>[<sup>5</sup>](#cite_note-5). Here we can see that older themes remain strong in the market, yet new themes are able to breach the rankings. Another relationship to unpack would be the popularity of classic LEGO themes in comparison to licensed themes. Christian demonstrated that classic themes like Western, Vikings or Adventurers increased in value after their end-of-life similar to licensed themes such as Batman, Spider-Man and Indiana Jones despite assumption that licensed themes would perform better <a name=\"cite_ref-5\"></a>[<sup>5</sup>](#cite_note-5). Christian even recommended two ways to further his analysis: implement a predictive machine learning model and add LEGO Minifigures data <a name=\"cite_ref-5\"></a>[<sup>5</sup>](#cite_note-5). We plan to take his recommendations into account and develop these pathways. \n",
"\n",
"0 8630-1 Gold Hunt Agents 2008 352.0 3.0 \n",
"1 8631-1 Jetpack Pursuit Agents 2008 88.0 2.0 \n",
"2 8632-1 Swamp Raid Agents 2008 231.0 2.0 \n",
"3 8633-1 Speedboat Rescue Agents 2008 340.0 3.0 \n",
"4 8634-1 Turbocar Chase Agents 2008 498.0 3.0 \n",
"\n",
" Minifig_list newValue usedValue No_owned No_want \\\n",
"0 ['agt007', 'agt008', 'agt009'] ~$87.35 ~$22.75 3182.0 1188 \n",
"1 ['agt004', 'agt005'] ~$36.18 ~$10.11 4534.0 960 \n",
"2 ['agt003', 'agt006'] ~$51.25 ~$17.15 3564.0 1085 \n",
"3 ['agt001a', 'agt002', 'agt003'] ~$90.56 ~$35.70 2490.0 1239 \n",
"4 ['agt001', 'agt008', 'agt015'] ~$220.00 ~$52.70 2439.0 1433 \n",
"\n",
" Rating Num_ratings Num_reviews \\\n",
"0 4.0 95.0 11.0 \n",
"1 3.9 124.0 27.0 \n",
"2 3.9 106.0 18.0 \n",
"3 3.9 75.0 9.0 \n",
"4 4.3 76.0 9.0 \n",
"\n",
" Tags Designer \\\n",
"0 ['Agent Fuse', 'Gold Tooth', 'Henchman', '4X4'... ['Mark Stafford'] \n",
"1 ['Agent Chase', 'Saw Fist', 'Aircraft', 'Dr In... NaN \n",
"2 ['Agent Charge', 'Break Jaw', 'Boat', 'Crimina... ['Mark Stafford'] \n",
"3 ['Agent Chase', 'Agent Trace', 'Break Jaw', 'B... ['Raphael Pretesacque'] \n",
"4 ['Agent Chase', 'Henchman', 'Spy Clops', 'Airc... ['Raphael Pretesacque'] \n",
"\n",
" Launch_Date Exit_Date \n",
"0 1 Jul 2002 31 Dec 2003 \n",
"1 1 Jul 2002 31 Dec 2003 \n",
"2 1 Jul 2002 31 Dec 2003 \n",
"3 1 Jul 2002 31 Dec 2003 \n",
"4 1 Jul 2002 31 Dec 2003 "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Finally, review the updated dataset \n",
"\n",
"brickset.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Ethics & Privacy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our project necessitates careful consideration of ethical and privacy concerns throughout the data science lifecycle, from question formulation to data collection, analysis, and dissemination of results. One significant concern is the potential for sampling bias, where data may overrepresent LEGO enthusiasts and underrepresent casual consumers. Additionally, our data is likely to exhibit age bias, as the primary audience for LEGO sets (children) may not directly contribute data, skewing our dataset towards adult perspectives. Time bias is another factor, with older sets possibly having more accumulated data than newer ones.\n",
"\n",
"We must also navigate privacy and ethical issues related to using third-party data, particularly regarding user-generated content on platforms like Brickset, BrickEconomy, and others. These concerns include ensuring compliance with data privacy laws and the ethical treatment of information, especially when it pertains to minors. Our approach includes anonymizing any personal data and relying on aggregated or anonymized datasets provided by third-party sites, assuming they adhere to ethical standards and privacy laws.\n",
"\n",
"To mitigate biases and ethical risks, we plan to incorporate a diverse range of data sources, aiming for a dataset that as accurately as possible reflects the global LEGO set market. We will also transparently communicate the limitations and potential biases of our analysis to users, highlighting that our findings may not fully represent the broader consumer base. Ensuring data privacy, especially for children, will be paramount, requiring strict adherence to legal and ethical guidelines for data handling and analysis."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Team Expectations"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our team has reviewed the COGS108 Team Policies and have set forth the following expectations to ensure the successful completion of our project:\n",
"\n",
"- All members should respond to messages relevant to the project within 12 hours through iMessage, acknowledging the message with a response or at minimum, a 'thumbs up'.\n",
"- Members anticipating delays in their contributions must inform the group at least 3 days in advance, though earlier communication is encouraged.\n",
"- We will utilize Notion for organizing our group work and scheduling, and Zoom for our meetings, unless a need for flexibility necessitates a different platform.\n",
"- Should any group member have concerns or questions, they are encouraged to communicate openly with the group to discuss and resolve the issue."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Project Timeline Proposal"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is our team's specific project timeline, detailing our plan from initial discussions through to the final submission. This timeline incorporates both our collective efforts on the project and individual responsibilities for various components. Our aim is to follow this schedule closely, adjusting as necessary to ensure the timely and successful completion of our project.\n",
"\n",
"| Meeting Date | Meeting Time | Completed Before Meeting | Discuss at Meeting |\n",
"|---|---|---|---|\n",
"| 2/7 | 5 PM | Begin initial research and divide sections of the Project Proposal among team members (Shawn/Suhaib - data, Leena - Ethics/Privacy, Leila/Yvonne - RQ, Hypothesis, Background & Prior Work, Project Timeline) | Share and discuss initial research; suggest edits; set goals for the final draft of the project proposal |\n",
"| 2/10 | 3 PM | Prepare final draft sections of the project proposal incorporating feedback from the previous meeting | Review and finalize proposal sections; submit the final proposal; plan next steps and assign responsibilities |\n",
"| 2/17 | 3 PM | Import and wrangle data; begin drafting other project sections | Update the team on data wrangling progress; review work on other sections; adjust project plan as needed |\n",
"| 2/24 | 3 PM | Sub-Team 1: Leila, Noah, Shawn are working with BrickSet, BrickEconomy and BrickLink datasets. Sub-Team 2: Suhaib & Yvonne: working on Rebrickable and LEGO datasets. Both sub-teams wrangle data & write-up; plan for EDA | Review and finalize data wrangling; discuss initial EDA findings; plan for comprehensive EDA and analysis |\n",
"| 3/2 | 3 PM | Sub-teams work and finalize EDA for their datasets; cross-communicate and begin working on analysis | Finalize EDA; review analysis progress; discuss preliminary results and draft conclusions |\n",
"| 3/9 | 3 PM | Complete full draft of project; start planning & working on the video presentation | Discuss the complete draft; identify areas for improvement; finalize analysis and conclusions |\n",
"| 3/17 | 3 PM | Finalize all edits for project & presentation | Review final project; submit the completed project |"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}