-
PROBLEM STATEMENT:
A strong and supportive community around a technology tool is an important factor when a tech company decides whether to adopt that tool over the alternatives. However, there is no obvious way to compare how active one tool's community is against another's.
-
SOLUTION:
This project builds an ETL pipeline that preprocesses Stack Overflow data and stores the results in a database, so that the network graph and growth trends of tag groups can be visualized in a Dash-based user interface that provides multiple functions to support analysis. The project showcases building an ETL pipeline for processing big data (250 GB), along with full-stack software and data analysis skills.
Temp Demo Website | Slides | Demo Video
- Use Ansible to provision EC2 and install Spark, Postgres, and the Dash UI. (Refer to the DevOp folder for further instructions.)
- Link the DNS of the EC2 instance to a public domain (not done yet).
- Tech Stack:
Upload the XML files downloaded from the Stack Exchange data dump to S3, then use a Spark cluster to preprocess the posts data and user data. After pre-computation, the result tables are stored in the database, where the front-end UI can access them. (Refer to the ETLPipeline folder.)
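To make the preprocessing step more concrete, here is a minimal sketch of the kind of aggregation it could perform, written in plain Python rather than Spark so it runs anywhere. It is not the code in the ETLPipeline folder. It assumes the `Posts.xml` layout from the Stack Exchange data dump (`row` elements with `PostTypeId`, `CreationDate`, and `Tags` attributes, where tags are encoded like `<python><pandas>`); the sample rows and the function names `parse_tags` and `summarize` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from itertools import combinations

# Hypothetical sample rows shaped like Posts.xml in the data dump (assumption).
SAMPLE_POSTS_XML = """<posts>
  <row Id="1" PostTypeId="1" CreationDate="2019-01-05T10:00:00.000" Tags="&lt;python&gt;&lt;pandas&gt;" />
  <row Id="2" PostTypeId="1" CreationDate="2019-02-11T08:30:00.000" Tags="&lt;python&gt;&lt;apache-spark&gt;&lt;pandas&gt;" />
  <row Id="3" PostTypeId="2" CreationDate="2019-02-12T09:00:00.000" />
  <row Id="4" PostTypeId="1" CreationDate="2020-03-01T12:00:00.000" Tags="&lt;apache-spark&gt;" />
</posts>"""


def parse_tags(raw):
    """Split a tag string such as '<python><pandas>' into ['python', 'pandas']."""
    return [tag for tag in raw.strip("<>").split("><") if tag]


def summarize(xml_text):
    """Return (yearly, edges) counters computed over question posts only.

    yearly maps (tag, year) to a question count (growth trend);
    edges maps (tag_a, tag_b) to a co-occurrence count (network graph).
    """
    yearly = Counter()
    edges = Counter()
    for row in ET.fromstring(xml_text).iter("row"):
        # PostTypeId "1" marks a question; answers carry no tags.
        if row.get("PostTypeId") != "1":
            continue
        tags = parse_tags(row.get("Tags", ""))
        year = int(row.get("CreationDate")[:4])
        for tag in tags:
            yearly[(tag, year)] += 1
        # Sort so each pair of tags maps to a single undirected edge key.
        for tag_a, tag_b in combinations(sorted(set(tags)), 2):
            edges[(tag_a, tag_b)] += 1
    return yearly, edges


if __name__ == "__main__":
    yearly, edges = summarize(SAMPLE_POSTS_XML)
    print(dict(yearly))
    print(dict(edges))
```

In a Spark job the same per-row logic would typically run as a map over the parsed rows followed by a grouped count, with the two result tables written to the database for the UI to query.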
- After installation and preprocessing, you can run the Dash app with
python app.py
in the App directory. The app lays out the user interface using Dash.
- The UI implements multiple functions. For a demo of each function, refer to this YouTube video. Function descriptions are shown below: