This repository contains all the code & scripts for my 'Analyzing StackExchange data with Azure Data Lake' series.
This series takes you through the process of storing StackExchange data in Data Lake Store, aggregating the user data from all the websites into one file, and gaining knowledge from it with Data Lake Analytics. After that, we'll use Power BI to visualize the gained knowledge.
The introduction describes the four major blocks in the series:
- Storing the data in Azure Data Lake Store or Azure Storage (post)
- Aggregating the data with Azure Data Lake Analytics
- Analyzing the data with Azure Data Lake Analytics
- Visualizing the data with Power BI
The blog post series is currently on hold, but you can browse all the scripts. They are based on an old SDK, so there might be compatibility issues with current versions.
Stack Exchange has made the data from all their websites available under a Creative Commons license. It includes data about users, posts, comments, votes, etc. for every single site.
We will use this data as a demo set because it reflects real-world data.
Here is an example of how the folder for coffee-stackexchange-com is structured:
+ coffee-stackexchange-com
- Badges.xml
- Comments.xml
- PostHistory.xml
- PostLinks.xml
- Posts.xml
- Tags.xml
- Users.xml
- Votes.xml
You can find all the data here.
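To get a feel for the folder layout above before moving the data into Azure, the aggregation step can be approximated locally. The following is a minimal sketch in Python, not the Data Lake Analytics scripts from this repository. It assumes that each Users.xml holds one `<row>` element per user with attributes such as `Id`, `DisplayName` and `Reputation`; the function names `iter_users` and `aggregate_users` and the chosen output columns are hypothetical.

```python
import csv
import xml.etree.ElementTree as ET
from pathlib import Path


def iter_users(site_dir):
    """Yield one dict per <row> element found in <site_dir>/Users.xml."""
    site_dir = Path(site_dir)
    # iterparse reads the file incrementally instead of loading it in one go.
    for _event, elem in ET.iterparse(site_dir / "Users.xml", events=("end",)):
        if elem.tag == "row":  # assumption: one <row .../> per user
            # Copy the attributes and tag them with the site folder name.
            yield {"Site": site_dir.name, **elem.attrib}
            elem.clear()  # drop the attributes we no longer need


def aggregate_users(root_dir, out_csv,
                    fields=("Site", "Id", "DisplayName", "Reputation")):
    """Write the selected user fields of every site folder into one CSV file.

    Returns the number of user rows written.
    """
    count = 0
    with open(out_csv, "w", newline="", encoding="utf-8") as handle:
        # Attributes not listed in `fields` are ignored; missing ones stay empty.
        writer = csv.DictWriter(handle, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        for site_dir in sorted(Path(root_dir).iterdir()):
            if (site_dir / "Users.xml").is_file():
                for user in iter_users(site_dir):
                    writer.writerow(user)
                    count += 1
    return count
```

Assuming the site folders sit next to each other in a directory called `data` (a hypothetical name), `aggregate_users("data", "all-users.csv")` would produce a single file with one line per user across all sites.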
Licensed under the terms of the MIT license.