This repository has been archived by the owner on Jan 3, 2018. It is now read-only.

Codit/analyzing-stackexchange-with-azure-data-lake

 
 

Analyzing StackExchange data with Azure Data Lake

This repository contains all the code & scripts for my 'Analyzing StackExchange data with Azure Data Lake' series.

This series takes you through storing StackExchange data in Data Lake Store, aggregating the user data from all the websites into one file, and gaining knowledge from it with Data Lake Analytics. After that, we'll use Power BI to visualize what we learned.

In the introduction I outlined the four major blocks of the series:

  1. Storing the data in Azure Data Lake Store or Azure Storage (post)
  2. Aggregating the data with Azure Data Lake Analytics
  3. Analyzing the data with Azure Data Lake Analytics
  4. Visualizing the data with Power BI
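
To make the aggregation block more concrete, here is a minimal local sketch of the idea: merge the `Users.xml` file of every site folder into one file. It is written in Python for illustration only; the series itself does this step with Azure Data Lake Analytics, and this is not the code from the repository. The function names, the CSV output format and the selected columns are assumptions, as is the dump layout of one `<row>` element per user with the values stored as attributes.

```python
import csv
import xml.etree.ElementTree as ET
from pathlib import Path


def iter_user_rows(users_xml, site):
    """Yield one dict per <row> element in a Users.xml file, tagged with its site."""
    # iterparse streams the file, so a large dump is not loaded into memory at once.
    for _event, elem in ET.iterparse(str(users_xml), events=("end",)):
        if elem.tag == "row":
            record = dict(elem.attrib)
            record["Site"] = site  # hypothetical extra column: the site folder name
            yield record
            elem.clear()  # drop the element's data once it has been consumed


def aggregate_users(dump_root, output_csv,
                    fields=("Site", "Id", "DisplayName", "Reputation")):
    """Merge every <site>/Users.xml under dump_root into one CSV; return the row count."""
    count = 0
    with open(output_csv, "w", newline="", encoding="utf-8") as handle:
        # Attributes not listed in `fields` are ignored; missing ones become "".
        writer = csv.DictWriter(handle, fieldnames=list(fields),
                                extrasaction="ignore", restval="")
        writer.writeheader()
        for users_xml in sorted(Path(dump_root).glob("*/Users.xml")):
            for record in iter_user_rows(users_xml, users_xml.parent.name):
                writer.writerow(record)
                count += 1
    return count
```

Calling `aggregate_users("dump", "all-users.csv")` on a hypothetical `dump` folder that contains one subfolder per site would write a single file with one line per user, prefixed with the site it came from.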

The blog post series is currently on hold, but you can browse all the scripts. They are based on an old SDK, so there might be compatibility issues.


Getting the StackExchange Data Dump

Stack Exchange has made the data from all of its websites available under a Creative Commons license. It includes data about users, posts, comments, votes, etc. for every single site.

We will use this data as our demo set because it reflects real-world data. It covers every Stack Exchange website, ranging from users and posts to comments, votes and more.

Here is an example of how the folder for coffee-stackexchange-com is structured:

+ coffee-stackexchange-com
	- Badges.xml
	- Comments.xml
	- PostHistory.xml
	- PostLinks.xml
	- Posts.xml
	- Tags.xml
	- Users.xml
	- Votes.xml
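
If you want to verify that a downloaded site folder matches this structure before uploading it, a small check like the following can help. It is a hedged sketch: the `missing_files` helper is hypothetical and not part of this repository, and it assumes every site folder holds exactly the eight files shown above.

```python
from pathlib import Path

# The eight files listed for coffee-stackexchange-com above.
EXPECTED_FILES = (
    "Badges.xml", "Comments.xml", "PostHistory.xml", "PostLinks.xml",
    "Posts.xml", "Tags.xml", "Users.xml", "Votes.xml",
)


def missing_files(site_folder):
    """Return the expected file names that are absent from a site folder."""
    folder = Path(site_folder)
    return [name for name in EXPECTED_FILES if not (folder / name).is_file()]
```

An empty list means the folder is complete; otherwise the list names the files that still need to be downloaded or extracted.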

You can find all the data here.

License

Licensed under the terms of the MIT license.

Languages

  • C# 68.2%
  • PowerShell 31.8%