Dataquest Projects

This repository contains all the data science projects I have completed as part of my Dataquest "Data Scientist Career Path" training.

Please bear in mind that these projects were completed at different stages of my training, so each one reflects my level of knowledge at the time.

Profitable App Profiles for the Apple Store and Google Play Store Markets

Keywords : Data Analysis, Data Visualisation, Marketing

For this project, we pretend we're working as a data analyst for a company that builds Android and iOS mobile apps. We make our apps available on Google Play and the App Store. We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means our revenue for any given app is mostly influenced by the number of users who use it: the more users who see and engage with the ads, the better.

In this project, we analyze data to help our developers understand what types of apps are likely to attract more users.

Exploring Hacker News Posts

Keywords : Data Analysis, Data Visualisation, Media impact

In this project, we work with a data set of submissions to the popular technology site Hacker News and determine the best time to publish a post in order to receive the most comments.

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted on and commented on, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.
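As a rough illustration of the analysis described above, the sketch below groups posts by the hour they were created and averages their comment counts. The sample rows, the date format, and the function name are assumptions made for this example; they are not taken from the project notebook.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample rows: (created_at, num_comments). Not real Hacker News data.
posts = [
    ("8/4/2016 11:52", 6),
    ("1/26/2016 19:30", 29),
    ("6/23/2016 22:20", 1),
    ("9/30/2016 19:05", 11),
]

def average_comments_by_hour(rows, date_format="%m/%d/%Y %H:%M"):
    """Return {hour: average number of comments} for the given rows."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for created_at, num_comments in rows:
        hour = datetime.strptime(created_at, date_format).hour
        totals[hour] += num_comments
        counts[hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in totals}

averages = average_comments_by_hour(posts)
# The hour with the highest average is the candidate "best time to post".
best_hour = max(averages, key=averages.get)
print(best_hour, averages[best_hour])
```

With so few rows the result means nothing statistically; the point is only the group-then-average pattern.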

Exploring Ebay Car Sales Data

Keywords : Data Analysis, Data Visualisation

In this project, we work with a dataset of used cars from eBay Kleinanzeigen, a section of the German eBay website. We clean the data and analyze the included used car listings to gain some insight into the second-hand car market (most represented brands, price distribution, average mileage, etc.).

Visualizing College Majors Population, Popularity and Median Earning

Keywords : Data Analysis, Data Visualisation, Sociology

In this project, we work with a dataset listing the job outcomes of students who graduated from college in the US between 2010 and 2012.

Using visualizations, we answer the following questions:

Do students in more popular majors make more money?
How many majors are predominantly male? Predominantly female?
Which category of majors has the most students?

Cleaning and Analyzing Employee Exit Surveys

Keywords : Data Analysis, Data Visualisation, Human resources

In this project, we work with exit surveys from employees of the Department of Education, Training and Employment (DETE) and the Technical and Further Education (TAFE) institute in Queensland, Australia.

We play the role of data analyst and answer the following questions to provide valuable insight to our stakeholders:

Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction?
What about employees who have been there longer?
Are younger employees resigning due to some kind of dissatisfaction? What about older employees?

Analyzing NYC High School Data

Keywords : Data Analysis, Data Visualisation, Education, Sociology

One of the most controversial issues in the U.S. educational system is the efficacy of standardized tests like the SAT, and whether they're unfair to certain groups.

The SAT, or Scholastic Aptitude Test, is an exam that U.S. high school students take before applying to college. Colleges take the test scores into account when deciding who to admit, so it's fairly important to perform well on it.

In this project, we investigate the correlations between SAT scores and demographics and search for trends that would confirm the unfairness of such a test.

Star Wars Survey

Keywords : Data Analysis, Data Visualisation, Star Wars, Movie

While waiting for Star Wars: The Force Awakens to come out, the team at FiveThirtyEight became interested in answering some questions about Star Wars fans. In particular, they wondered: does the rest of America realize that The Empire Strikes Back is clearly the best of the bunch?

The team needed to collect data addressing this question. To do this, they surveyed Star Wars fans using the online tool SurveyMonkey. They received 835 total responses, which we will download from their GitHub repository.

In this project, we clean and explore the data set to gain insight on the data and answer two main questions:

Which episode is the most viewed?
Which episode is ranked the highest?

Analysing CIA Factbook Data Using SQL

Keywords : Data Analysis, SQL, Demographics, Countries

In this project, we work with data from the CIA World Factbook, a compendium of statistics about all of the countries on Earth. The Factbook contains demographic information such as each country's population, annual population growth rate as a percentage, total land and water area, etc.

We use SQL in Jupyter Notebook to analyze data from this database and answer questions like the following:

What country has the highest growth rate?
Which countries have the highest ratio of water to land?
Which countries will add the most people to their population next year?
...
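To make the last question concrete, here is a minimal sketch that runs a SQL query from Python using the standard `sqlite3` module. The table name `facts`, its columns, and the three fictional countries are assumptions made for this example; the actual Factbook database may be laid out differently.

```python
import sqlite3

# In-memory database with a hypothetical schema and made-up countries.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE facts (name TEXT, population INTEGER, population_growth REAL)"
)
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?)",
    [
        ("Freedonia", 1_000_000, 2.0),    # growth rate given as a percentage
        ("Borduria", 50_000_000, 0.5),
        ("Costaguana", 10_000_000, 3.0),
    ],
)

# People added next year = current population * growth rate / 100.
query = """
SELECT name, population * population_growth / 100.0 AS added_people
FROM facts
ORDER BY added_people DESC
LIMIT 2;
"""
rows = conn.execute(query).fetchall()
for name, added_people in rows:
    print(name, added_people)
```

The example shows that a large population with a low growth rate can still add more people than a small population with a high one.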

Answering Business Questions Using SQL

Keywords : Data Analysis, SQL, business, music, record store

In this project, we practice using our SQL skills on a record store database.

Using the right queries, we answer the following business questions:

What are the best three new albums to add to the store?
Which employees have the highest sales performance?
Which countries does the store sell the most to?
Is buying only selected single tracks from record companies more profitable for the store than buying full new albums?

Investigating Fandango Movie Ratings

Keywords : Data Analysis, Data Visualisation, Statistics, Movie ratings

In October 2015, a data journalist named Walt Hickey analyzed movie ratings data and found strong evidence to suggest that Fandango's rating system was biased and dishonest (Fandango is an online movie ratings aggregator). He published his analysis in an article.

Fandango displays a 5-star rating system on their website, where the minimum rating is 0 stars and the maximum is 5 stars. Hickey found that there's a significant discrepancy between the number of stars displayed to users and the actual rating, which he was able to find in the HTML of the page.

In this project, we analyze more recent movie ratings data to determine whether there has been any change in Fandango's rating system or not after Hickey's publication.

Finding the Best Markets to Advertise In

Keywords : Data Analysis, Data Visualisation, Statistics, Marketing, E-learning, Content creation

In this project, we pretend we are working for an e-learning company that offers courses on programming. Most of our courses are on web and mobile development, but we also cover many other domains, like data science, game development, etc. We want to promote our product and we'd like to invest some money in advertisement.

Our goal in this project is to find out the two best markets to advertise our product in.

Mobile App for Lottery Addiction

Keywords : Data Analysis, Statistics, App development, Lottery, Medical, Gambling addiction

A medical institute that aims to prevent and treat gambling addictions wants to build a dedicated mobile app to help lottery addicts better estimate their chances of winning. The institute has a team of engineers that will build the app, but they need us to create the logical core of the app and calculate probabilities.

In this project, we contribute to the development of that app.
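The kind of probability calculation such an app needs can be sketched as follows. The lottery format (6 numbers drawn without replacement from 49) and the function names are assumptions made for this example; the README does not state which lottery the project models.

```python
from math import comb

def one_ticket_probability(numbers_picked=6, pool_size=49):
    """Chance that a single ticket matches every drawn number."""
    return 1 / comb(pool_size, numbers_picked)

def probability_of_matches(matches, numbers_picked=6, pool_size=49):
    """Chance that a ticket matches exactly `matches` of the drawn numbers."""
    # Choose which drawn numbers are matched, then fill the rest of the
    # ticket from the numbers that were not drawn.
    favorable = comb(numbers_picked, matches) * comb(
        pool_size - numbers_picked, numbers_picked - matches
    )
    return favorable / comb(pool_size, numbers_picked)

print(f"Jackpot with one ticket: {one_ticket_probability():.10f}")
print(f"Exactly 3 matching numbers: {probability_of_matches(3):.6f}")
```

Under the assumed 6-from-49 format there are 13,983,816 possible tickets, which is the figure an app like this would need to put in terms users can relate to.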

Building a Spam Filter with Naive Bayes

Keywords : Data Analysis, Conditional probability, Naive Bayes, SMS, Spam filter

In this project, we create a spam filter for SMS messages which has an accuracy of over 95%.

To do that, we use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that were already classified by humans.
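For readers unfamiliar with the algorithm, here is a minimal sketch of multinomial Naive Bayes with Laplace (additive) smoothing. The four training messages, the function names, and the choice to ignore words outside the vocabulary are assumptions made for this example; it is not the project's code and does not reproduce its accuracy.

```python
import math
from collections import Counter

# Hypothetical toy training set, not the real SMS dataset.
training = [
    ("win a free prize now", "spam"),
    ("free entry win cash", "spam"),
    ("are we meeting for lunch", "ham"),
    ("see you at lunch tomorrow", "ham"),
]

def train(messages, alpha=1):
    """Count words per label and messages per label."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in messages:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, label_counts, vocabulary, alpha

def classify(text, model):
    """Return the label with the highest log posterior score."""
    word_counts, label_counts, vocabulary, alpha = model
    total_messages = sum(label_counts.values())
    scores = {}
    for label in word_counts:
        n_words = sum(word_counts[label].values())
        # Start from the log prior P(label).
        score = math.log(label_counts[label] / total_messages)
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # assumption: unseen words are ignored
            # Smoothed estimate of P(word | label).
            score += math.log(
                (word_counts[label][word] + alpha)
                / (n_words + alpha * len(vocabulary))
            )
        scores[label] = score
    return max(scores, key=scores.get)

model = train(training)
print(classify("free cash prize", model))
print(classify("lunch tomorrow", model))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied together.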

Winning Jeopardy

Keywords : Data Analysis, Statistics, Chi-squared test, Jeopardy, Game show

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture.

In this project, we look for patterns in the previous questions of the TV game show Jeopardy that could help us win.

Predicting Car Prices

Keywords : Data Analysis, Machine Learning, K-Nearest Neighbors, Cars, Price prediction

In this project, we practice the machine learning workflow and implement a k-nearest neighbors algorithm to predict a car's market price using its attributes.
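The idea behind k-nearest neighbors regression can be sketched in a few lines. The feature values (assumed to be already normalized to the 0 to 1 range), the prices, and the function name are hypothetical and only serve to illustrate the technique.

```python
import math

# Hypothetical cars: ((normalized horsepower, normalized weight), price).
cars = [
    ((0.20, 0.30), 10000),
    ((0.25, 0.35), 12000),
    ((0.80, 0.90), 30000),
    ((0.75, 0.85), 28000),
    ((0.50, 0.50), 18000),
]

def knn_predict(train_rows, query, k=3):
    """Predict a price as the mean price of the k closest cars."""
    # Sort training rows by Euclidean distance to the query features.
    by_distance = sorted(train_rows, key=lambda row: math.dist(row[0], query))
    nearest = by_distance[:k]
    return sum(price for _, price in nearest) / len(nearest)

print(knn_predict(cars, (0.22, 0.32), k=2))
```

Normalizing the features first matters because Euclidean distance would otherwise be dominated by whichever attribute has the largest raw scale.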

Predicting House Sale Prices

Keywords : Data Analysis, Machine Learning, Linear Regression, Functions pipeline, Houses, Price prediction

In this project, we implement a linear regression model to predict house sale prices. To do that, we create a pipeline of functions that let us quickly iterate on different models.
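A pipeline of functions of this kind might look like the sketch below. The three step names, the single-feature least-squares fit, the half-and-half train/test split, and the sample rows are all assumptions made for this example; the project itself may structure its pipeline differently.

```python
import math

# Hypothetical rows; prices follow price = 100 * area + 5000 exactly.
houses = [
    {"living_area": 1000, "sale_price": 105000},
    {"living_area": 1500, "sale_price": 155000},
    {"living_area": 2000, "sale_price": 205000},
    {"living_area": 2500, "sale_price": 255000},
    {"living_area": None, "sale_price": 180000},
    {"living_area": 3000, "sale_price": 305000},
    {"living_area": 1200, "sale_price": 125000},
]

def transform_features(rows):
    """Step 1: drop rows that contain missing values."""
    return [row for row in rows if all(v is not None for v in row.values())]

def select_features(rows, feature, target):
    """Step 2: keep one feature column and the target column."""
    return [row[feature] for row in rows], [row[target] for row in rows]

def fit_line(xs, ys):
    """Ordinary least squares for a single feature."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x

def train_and_test(rows, feature, target="sale_price"):
    """Step 3: fit on the first half, return RMSE on the second half."""
    xs, ys = select_features(rows, feature, target)
    half = len(xs) // 2
    slope, intercept = fit_line(xs[:half], ys[:half])
    errors = [(slope * x + intercept - y) ** 2 for x, y in zip(xs[half:], ys[half:])]
    return math.sqrt(sum(errors) / len(errors))

rmse = train_and_test(transform_features(houses), "living_area")
print(rmse)
```

Because each step is a separate function, trying a different cleaning rule or feature only requires changing one call, which is what makes quick iteration possible.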

Predicting the Stock Market

Keywords : Data Analysis, Machine Learning, Linear Regression, Forecasting, Time Series, Stock Market, Index

In this project, we work with data from the S&P500 stock market index. We use historical data on this index to make predictions about future prices.

Predicting whether an index will go up or down will help us forecast how the stock market as a whole will perform. Since stocks tend to correlate with how well the economy as a whole is performing, it can also help us make economic forecasts.

Predicting Bike Rentals

Keywords : Data Analysis, Machine Learning, Linear Regression, Decision Tree, Random Forest, Bike Sharing, Demand Prediction

Many American cities have communal bike sharing stations where one can rent bicycles by the hour or day. Washington, D.C. is one of these cities. The District collects detailed data on the number of bicycles people rent by the hour and day.

In this project, we predict the total number of bikes people rented in a given hour. To accomplish this, we implement a few different machine learning models (a linear regression model, a single tree regressor and a random forest regressor) and evaluate their performance.

Building A Handwritten Digits Classifier

Keywords : Machine Learning, Deep Learning, K-Nearest Neighbors, Neural Network, Image Classification

In this project, we build models that can classify handwritten digits. We start by implementing a traditional K-Nearest Neighbors model and then implement a deep, feedforward neural network with increasing numbers of neurons and layers.

Kaggle Titanic Competition

Keywords : Data Analysis, Data Visualisation, Machine Learning, Prediction, Binary Classification, Titanic

In this project, we tackle the Titanic competition, one of the most popular competitions from the Kaggle platform.

The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew. While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

For this challenge, we first run a thorough analysis of the dataset and then implement a few predictive models and test their accuracy at predicting the survival or death of passengers.

Building Fast Queries on a CSV

Keywords : Data Engineering, Algorithm Complexity, Speed Improvement, Business

In this project, we pretend we own an online laptop store and want to build a way to answer a few different business questions about our inventory.

To do that, we create a class that represents our inventory with several methods to answer questions about it. We then improve these methods using data preprocessing and our knowledge of algorithm time and space complexity to speed up their performance.
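The trade-off described above, spending time and memory on preprocessing once to make later queries faster, can be sketched like this. The CSV columns, the class and method names, and the sample laptops are hypothetical.

```python
import csv
import io
from bisect import bisect_right

# Hypothetical inventory; the real project reads a CSV file from disk.
CSV_TEXT = """id,name,price
1,Laptop A,500
2,Laptop B,750
3,Laptop C,1200
"""

class Inventory:
    def __init__(self, csv_file):
        self.rows = list(csv.DictReader(csv_file))
        for row in self.rows:
            row["price"] = int(row["price"])
        # Preprocessing: O(n) extra space buys O(1) lookups by id.
        self.id_to_row = {row["id"]: row for row in self.rows}
        # Preprocessing: a sorted price list enables binary search.
        self.sorted_prices = sorted(row["price"] for row in self.rows)

    def get_laptop_slow(self, laptop_id):
        """Linear scan: O(n) per query."""
        for row in self.rows:
            if row["id"] == laptop_id:
                return row
        return None

    def get_laptop_fast(self, laptop_id):
        """Dictionary lookup: O(1) on average per query."""
        return self.id_to_row.get(laptop_id)

    def count_affordable(self, budget):
        """Number of laptops priced at or below budget: O(log n)."""
        return bisect_right(self.sorted_prices, budget)

inventory = Inventory(io.StringIO(CSV_TEXT))
print(inventory.get_laptop_fast("2"))
print(inventory.count_affordable(750))
```

Both lookup methods return the same row; only the cost per query differs, which becomes noticeable once the inventory is large.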

Building a Database for Crime Reports

Keywords : Data Engineering, Database Creation, Database Management, Postgres, SQL

In this project, we build a database for storing data related to crimes that occurred in Boston. We create a database with a table with appropriate datatypes for storing the data from the dataset. We create the table inside a schema. We also create the readonly and readwrite groups with the appropriate privileges. Finally, we create one user for each of these groups.

Analyzing Wikipedia Pages

Keywords : Data Engineering, Processes Parallelization, MapReduce

The grep command-line utility allows searching for textual data in all files from a given directory.

In this project, we implement a simplified version of the grep command-line utility to search for data in 54 MB worth of scraped Wikipedia articles. We also make use of MapReduce to parallelize our processes.

Our main goals are:

Searching for all occurrences of a string in all of the files
Providing a case-insensitive option to the search
Refining the result by providing the specific locations of the occurrences
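The three goals above can be sketched with a small map and reduce pair. Everything here is an assumption made for illustration: the in-memory `documents` dictionary stands in for the scraped files, a thread pool stands in for whatever parallel executor the project uses, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical stand-in for files on disk: {file name: file content}.
documents = {
    "a.html": "Python is great.\nI like python.",
    "b.html": "No match here.",
    "c.html": "PYTHON again",
}

def map_search(item, target, case_sensitive=False):
    """Map step: find (line number, column) of each match in one file."""
    name, text = item
    needle = target if case_sensitive else target.lower()
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        haystack = line if case_sensitive else line.lower()
        start = haystack.find(needle)
        while start != -1:
            hits.append((line_number, start))
            start = haystack.find(needle, start + 1)
    return {name: hits} if hits else {}

def reduce_merge(left, right):
    """Reduce step: merge two partial result dictionaries."""
    merged = dict(left)
    merged.update(right)
    return merged

def grep(docs, target, case_sensitive=False, workers=4):
    """Run the map step in parallel, then fold the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(
            lambda item: map_search(item, target, case_sensitive), docs.items()
        )
        return reduce(reduce_merge, partials, {})

print(grep(documents, "python"))
print(grep(documents, "python", case_sensitive=True))
```

Line numbers are 1-based and columns are 0-based in this sketch. For CPU-bound work on large files, a process pool would likely be a better fit than threads.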


Antoine101/Dataquest-Projects
