This repository contains all the data science projects I have completed as part of my Dataquest "Data Scientist Career Path" training.
Please bear in mind that these projects were completed at different stages of my training, so each one reflects my level of knowledge at the time it was written.
For this project, we pretend we're working as a data analyst for a company that builds Android and iOS mobile apps. We make our apps available on Google Play and the App Store. We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means our revenue for any given app is mostly influenced by the number of users who use it: the more users who see and engage with the ads, the better.
In this project, we analyze data to help our developers understand what types of apps are likely to attract more users.
In this project, we work with a data set of submissions to the popular technology site Hacker News and we determine what the best time to publish a post is in order to receive the most comments.
Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented on, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.
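As a rough illustration of the "best time to post" question, the sketch below averages the number of comments per hour of creation. The row layout, the date format and the sample values are assumptions made for this example; they are not taken from the actual data set or notebook.

```python
from collections import defaultdict
from datetime import datetime

def average_comments_by_hour(posts):
    """Return {hour: mean number of comments} for (created_at, num_comments) pairs."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for created_at, num_comments in posts:
        # Assumed timestamp format: month/day/year hour:minute
        hour = datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
        totals[hour] += num_comments
        counts[hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in totals}

# Made-up sample rows, for illustration only.
sample_posts = [
    ("8/16/2016 9:55", 6),
    ("11/22/2015 13:43", 29),
    ("5/2/2016 9:14", 10),
    ("9/26/2016 15:02", 1),
]

averages = average_comments_by_hour(sample_posts)
best_hour = max(averages, key=averages.get)
```

With real data, the hours would then be ranked by their average to pick the most promising posting window.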
In this project, we work with a dataset of used cars from eBay Kleinanzeigen, a section of the German eBay website. We clean the data and analyze the included used car listings to gain some insight into the second-hand car market (most represented brands, price distribution, average mileage, ...).
In this project, we work with a dataset listing the job outcomes of students who graduated from college in the US between 2010 and 2012.
Using visualizations, we answer the following questions:
- Do students in more popular majors make more money?
- How many majors are predominantly male? Predominantly female?
- Which category of majors has the most students?
In this project, we work with exit surveys from employees of the Department of Education, Training and Employment (DETE) and the Technical and Further Education (TAFE) institute in Queensland, Australia.
We play the role of data analyst and answer the following questions to provide valuable insight to our stakeholders:
- Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction?
- What about employees who have been there longer?
- Are younger employees resigning due to some kind of dissatisfaction? What about older employees?
One of the most controversial issues in the U.S. educational system is the efficacy of standardized tests like the SAT, and whether they're unfair to certain groups.
The SAT, or Scholastic Aptitude Test, is an exam that U.S. high school students take before applying to college. Colleges take the test scores into account when deciding who to admit, so it's fairly important to perform well on it.
In this project, we investigate the correlations between SAT scores and demographics and search for trends that would confirm the unfairness of such a test.
While waiting for Star Wars: The Force Awakens to come out, the team at FiveThirtyEight became interested in answering some questions about Star Wars fans. In particular, they wondered: does the rest of America realize that The Empire Strikes Back is clearly the best of the bunch?
The team needed to collect data addressing this question. To do this, they surveyed Star Wars fans using the online tool SurveyMonkey. They received 835 total responses, which we will download from their GitHub repository.
In this project, we clean and explore the data set to gain insight on the data and answer two main questions:
- Which episode is the most viewed?
- Which episode is ranked the highest?
In this project, we work with data from the CIA World Factbook, a compendium of statistics about all of the countries on Earth. The Factbook contains demographic information such as each country's population, annual population growth rate as a percentage, total land and water area, ...
We use SQL in Jupyter Notebook to analyze data from this database and answer questions like the following:
- What country has the highest growth rate?
- Which countries have the highest ratio of water to land?
- Which countries will add the most people to their population next year?
- ...
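The questions above can be sketched with Python's built-in `sqlite3` module. The table name `facts`, its columns and the three fictional countries are assumptions made for this example; the real project queries the Factbook database from a Jupyter Notebook.

```python
import sqlite3

# In-memory database with a hypothetical schema and made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE facts (name TEXT, population INTEGER, "
    "population_growth REAL, area_land REAL, area_water REAL)"
)
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?, ?, ?)",
    [
        ("Country A", 1000000, 2.0, 500.0, 50.0),
        ("Country B", 5000000, 0.5, 200.0, 100.0),
        ("Country C", 200000, 4.0, 1000.0, 10.0),
    ],
)

# Country with the highest growth rate.
highest_growth = conn.execute(
    "SELECT name FROM facts ORDER BY population_growth DESC LIMIT 1"
).fetchone()[0]

# Countries ranked by their ratio of water to land.
water_to_land = conn.execute(
    "SELECT name, area_water / area_land AS ratio "
    "FROM facts ORDER BY ratio DESC"
).fetchall()

# Country expected to add the most people next year
# (population_growth is a percentage, hence the division by 100).
most_added = conn.execute(
    "SELECT name, population * population_growth / 100 AS added "
    "FROM facts ORDER BY added DESC LIMIT 1"
).fetchone()
```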
In this project, we practice using our SQL skills on a record store database.
Using the right queries, we answer the following business questions:
- What are the best three new albums to add to the store?
- Which employees have the highest sales performance?
- Which countries does the store sell the most to?
- Is buying only selected single tracks from record companies more profitable for the store than buying full new albums?
In this project, we pretend we work for a company that creates data science content, be it books, online articles, videos or interactive text-based platforms like Dataquest.
Our goal is to figure out what the best content to write about is.
We scour the internet to understand what people want to learn about in data science (as opposed to determining the most profitable content).
In October 2015, a data journalist named Walt Hickey analyzed movie ratings data and found strong evidence to suggest that Fandango's rating system was biased and dishonest (Fandango is an online movie ratings aggregator). He published his analysis in an article.
Fandango displays a 5-star rating system on their website, where the minimum rating is 0 stars and the maximum is 5 stars. Hickey found that there's a significant discrepancy between the number of stars displayed to users and the actual rating, which he was able to find in the HTML of the page.
In this project, we analyze more recent movie ratings data to determine whether there has been any change in Fandango's rating system or not after Hickey's publication.
In this project, we pretend we are working for an e-learning company that offers courses on programming. Most of our courses are on web and mobile development, but we also cover many other domains, like data science, game development, etc. We want to promote our product and we'd like to invest some money in advertisement.
Our goal in this project is to find out the two best markets to advertise our product in.
A medical institute that aims to prevent and treat gambling addictions wants to build a dedicated mobile app to help lottery addicts better estimate their chances of winning. The institute has a team of engineers that will build the app, but they need us to create the logical core of the app and calculate probabilities.
In this project, we contribute to the development of that app.
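The kind of probability calculation involved can be sketched as follows. The lottery format (6 numbers drawn from 49, without replacement) and the function name are assumptions made for this example; the source does not specify which lottery the app targets.

```python
from math import comb

def probability_of_matches(n_matches, n_picked=6, n_total=49):
    """Probability that exactly n_matches of the ticket's numbers are drawn.

    Assumes n_picked numbers are drawn without replacement out of n_total
    and that the ticket also holds n_picked distinct numbers.
    """
    favorable = comb(n_picked, n_matches) * comb(
        n_total - n_picked, n_picked - n_matches
    )
    return favorable / comb(n_total, n_picked)

# Chance of winning the big prize with a single ticket.
jackpot = probability_of_matches(6)
```

Under these assumptions there are `comb(49, 6)` = 13,983,816 possible draws, so a single ticket has one chance in 13,983,816 of matching all six numbers.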
In this project, we create a spam filter for SMS messages which has an accuracy of over 95%.
To do that, we use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that were already classified by humans.
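To give an idea of how a multinomial Naive Bayes filter works, here is a minimal sketch with Laplace smoothing, trained on four made-up messages. The function names, the tokenizer and the training messages are assumptions made for this example and are not the project's actual code or data.

```python
import math
import re
from collections import Counter

def tokenize(message):
    """Lowercase the message and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", message.lower())

def train(messages, alpha=1):
    """Count words per label; messages is a list of (label, text) pairs."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for label, text in messages:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, label_counts, vocabulary, alpha

def classify(model, text):
    """Return the label with the highest log-posterior score."""
    word_counts, label_counts, vocabulary, alpha = model
    total_messages = sum(label_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        n_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_messages)  # log prior
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # words never seen in training are ignored
            # Laplace-smoothed estimate of P(word | label)
            score += math.log(
                (word_counts[label][word] + alpha)
                / (n_words + alpha * len(vocabulary))
            )
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up training messages, for illustration only.
training = [
    ("spam", "WINNER! Claim your free prize now"),
    ("spam", "Free entry to win cash, text now"),
    ("ham", "Are we still meeting for lunch today"),
    ("ham", "I will call you when I get home"),
]
model = train(training)
```

Working with sums of logarithms instead of products of probabilities avoids numerical underflow on longer messages.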
Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture.
In this project, we look for patterns in the previous questions of the TV game show Jeopardy that could help us win.
In this project, we practice the machine learning workflow and implement a k-nearest neighbors algorithm to predict a car's market price using its attributes.
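A k-nearest neighbors regressor can be sketched in a few lines: find the `k` training rows closest to the query and average their prices. The function name, the two already-normalized features and the prices below are assumptions made for this example, not the project's data.

```python
import math

def knn_predict(train_rows, query, k=3):
    """Predict a price as the mean price of the k nearest training rows.

    train_rows is a list of (features, price) pairs; features are assumed
    to be normalized so that no single attribute dominates the distance.
    """
    by_distance = sorted(train_rows, key=lambda row: math.dist(row[0], query))
    nearest = by_distance[:k]
    return sum(price for _, price in nearest) / len(nearest)

# Made-up cars: (normalized horsepower, normalized weight) -> price.
cars = [
    ((0.10, 0.20), 7000),
    ((0.15, 0.25), 8000),
    ((0.20, 0.20), 9000),
    ((0.80, 0.90), 30000),
    ((0.90, 0.85), 35000),
]

predicted_price = knn_predict(cars, (0.12, 0.22), k=3)
```

In practice, the value of `k` and the set of features are usually chosen by comparing the prediction error on held-out data.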
Keywords: Data Analysis, Machine Learning, Linear Regression, Functions pipeline, Houses, Price prediction
In this project, we implement a linear regression model to predict house sale prices. To do that, we create a pipeline of functions that let us quickly iterate on different models.
Keywords: Data Analysis, Machine Learning, Linear Regression, Forecasting, Time Series, Stock Market, Index
In this project, we work with data from the S&P500 stock market index. We use historical data on this index to make predictions about future prices.
Predicting whether an index will go up or down will help us forecast how the stock market as a whole will perform. Since stocks tend to correlate with how well the economy as a whole is performing, it can also help us make economic forecasts.
Keywords: Data Analysis, Machine Learning, Linear Regression, Decision Tree, Random Forest, Bike Sharing, Demand Prediction
Many American cities have communal bike sharing stations where one can rent bicycles by the hour or day. Washington, D.C. is one of these cities. The District collects detailed data on the number of bicycles people rent by the hour and day.
In this project, we predict the total number of bikes people rented in a given hour. To accomplish this, we implement a few different machine learning models (a linear regression model, a single tree regressor and a random forest regressor) and evaluate their performance.
Keywords: Machine Learning, Deep Learning, K-Nearest Neighbors, Neural Network, Image Classification
In this project, we build models that can classify handwritten digits. We start by implementing a traditional K-Nearest Neighbors model and then implement a deep, feedforward neural network with increasing numbers of neurons and layers.
Keywords: Data Analysis, Data Visualisation, Machine Learning, Prediction, Binary Classification, Titanic
In this project, we tackle the Titanic competition, one of the most popular competitions from the Kaggle platform.
The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew. While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
For this challenge, we first run a thorough analysis of the dataset and then implement a few predictive models and test their accuracy at predicting the survival or death of passengers.
In this project, we pretend we own an online laptop store and want to build a way to answer a few different business questions about our inventory.
To do that, we create a class that represents our inventory with several methods to answer questions about it. We then improve these methods using data preprocessing and our knowledge of algorithmic time and space complexity to speed up their performance.
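The idea of trading a little preprocessing for faster queries can be sketched as below. The class layout, the method names and the sample rows are assumptions made for this example; they do not reproduce the project's actual class.

```python
from bisect import bisect_right

class Inventory:
    """Toy inventory built from (laptop_id, price) rows."""

    def __init__(self, rows):
        self.rows = list(rows)
        # Preprocessing: a dict gives average O(1) lookups by id...
        self.id_to_row = {row[0]: row for row in self.rows}
        # ...and a sorted price list allows O(log n) binary searches.
        self.prices_sorted = sorted(price for _, price in self.rows)

    def get_laptop_slow(self, laptop_id):
        """Linear scan, O(n): the baseline before preprocessing."""
        for row in self.rows:
            if row[0] == laptop_id:
                return row
        return None

    def get_laptop(self, laptop_id):
        """Dictionary lookup, O(1) on average."""
        return self.id_to_row.get(laptop_id)

    def count_within_budget(self, budget):
        """Number of laptops priced at or below budget, via binary search."""
        return bisect_right(self.prices_sorted, budget)

# Made-up rows, for illustration only.
inventory = Inventory([("A1", 1200), ("B2", 450), ("C3", 799), ("D4", 2100)])
```

The extra dictionary and sorted list cost O(n) additional memory, which is the space side of the time and space trade-off.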
In this project, we build a database for storing data related to crimes that occurred in Boston. We create a database with a table with appropriate datatypes for storing the data from the dataset. We create the table inside a schema. We also create the readonly and readwrite groups with the appropriate privileges. Finally, we create one user for each of these groups.
The grep command-line utility allows searching for textual data in all files from a given directory.
In this project, we implement a simplified version of the grep command-line utility to search for data in 54MB worth of scraped Wikipedia articles. We also make use of MapReduce to parallelize our processes.
Our main goals are:
- Searching for all occurrences of a string in all of the files
- Providing a case-insensitive option to the search
- Refining the result by providing the specific locations of the occurrences
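The three goals above can be sketched as a small map/reduce pipeline: a mapper searches one document, and a reducer merges the partial results. To keep the example self-contained, documents are in-memory `(name, text)` pairs and the map step runs in a thread pool; the function names, the sample documents and these choices are assumptions made for this example, not the project's implementation.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def make_mapper(pattern, ignore_case=False):
    """Build a mapper that finds a literal string in one (name, text) document."""
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(re.escape(pattern), flags)

    def mapper(document):
        name, text = document
        hits = []
        for line_number, line in enumerate(text.splitlines(), start=1):
            for match in regex.finditer(line):
                # Location = (1-based line number, 0-based column)
                hits.append((line_number, match.start()))
        return {name: hits} if hits else {}

    return mapper

def reducer(left, right):
    """Merge two partial result dictionaries."""
    merged = dict(left)
    merged.update(right)
    return merged

def grep(documents, pattern, ignore_case=False, workers=4):
    """Map every document in parallel, then reduce to a single dictionary."""
    mapper = make_mapper(pattern, ignore_case)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(mapper, documents))
    return reduce(reducer, partial_results, {})

# Made-up documents, for illustration only.
documents = [
    ("a.html", "Data science\nbig data"),
    ("b.html", "nothing here"),
    ("c.html", "DATA data"),
]
case_sensitive = grep(documents, "data")
case_insensitive = grep(documents, "data", ignore_case=True)
```

Each mapper only touches its own document, so the map step can be distributed freely; only the final merge needs to see all partial results.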