Summarizing a large question-and-answer collection from the Stack Overflow website using text pre-processing and descriptive statistics methods, built on the MapReduce framework.
- About
- Requirements
- Run the Application
- Monitor Hadoop Cluster by WebUI
- Technologies
- License
- Credits
- Contributors
This is a MapReduce application written in Java that runs on an Apache Hadoop cluster.
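To illustrate the MapReduce model the jobs follow, here is a small, self-contained word-count sketch. It uses plain Java streams rather than the Hadoop API, and the sample lines are made up for illustration; it is not code from this project:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class WordCountSketch {
    // Map phase: split each line into lowercase word tokens.
    // Reduce phase: group identical words and count them per key.
    public static Map<String, Long> wordCount(Stream<String> lines) {
        return lines
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\W+")))
                .filter(w -> !w.isEmpty())
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        // Hypothetical sample lines standing in for Stack Overflow question titles.
        Map<String, Long> counts = wordCount(Stream.of(
                "How to read a CSV file in Java",
                "Read a large file with Hadoop"));
        System.out.println(counts.get("read")); // 2
        System.out.println(counts.get("file")); // 2
    }
}
```

In a real Hadoop job the map and reduce steps run as separate `Mapper` and `Reducer` classes distributed across the cluster, but the grouping-by-key idea is the same.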
Docker Desktop is required (it is the easiest way to get Docker working on your laptop).
You can download Docker Desktop for Mac or Docker Desktop for Linux.
- First, start the Hadoop cluster and build the Java program with Maven. To do this, run:

  make ready

  This will pull the images from Docker Hub and start the required nodes. The process might take a few minutes if you are running it for the first time. After the pull is complete, it waits until all nodes are ready.
- After the cluster is up, start the GUI. To do that, simply run:

  make run
- Before you run the MapReduce jobs, you have to insert the input data into HDFS. Small samples of the input files are located in jobs/data. To insert them, use the file selector in the GUI and give any destination path for the upload, or run:

  make move-data

  to upload them into the default /input/ path.
- The full files can be downloaded from StackSample: 10% of Stack Overflow Q&A. Before you insert the full files, extract them into jobs/data and run:

  make preprocess

  to get them ready for the MapReduce jobs. The preprocessed files can be found at jobs/data/QuestionsPre.csv and jobs/data/AnswersPre.csv.
- To stop the nodes and delete the output files, run:

  make clean
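The exact transformations behind make preprocess are defined by the repository's own scripts. As a rough illustration only, pre-processing raw Q&A text typically means stripping HTML markup and normalizing whitespace before tokenization. The sketch below is an assumption about what such a step might look like, not the project's actual pipeline:

```java
public class PreprocessSketch {
    // Strip HTML tags and collapse whitespace. This is an assumed stand-in
    // for the real work done by `make preprocess`, not the project's code.
    public static String clean(String raw) {
        return raw
                .replaceAll("<[^>]+>", " ")   // drop HTML tags such as <p> and <code>
                .replaceAll("&amp;", "&")     // unescape a common HTML entity
                .replaceAll("\\s+", " ")      // collapse runs of whitespace
                .trim();
    }

    public static void main(String[] args) {
        String raw = "<p>How do I read a <code>CSV</code> file?</p>";
        System.out.println(clean(raw)); // How do I read a CSV file?
    }
}
```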
- Namenode: http://localhost:9870
- Datanode: http://localhost:9864
- Resourcemanager: http://localhost:8088
- Nodemanager: http://localhost:8042
- Historyserver: http://localhost:8188
Note: If you are redirected to a URL like http://119e8b128bd5:8042/ or http://resourcemanager:8088/, change the hostname to localhost (e.g. http://localhost:8042/) and it will work. This happens because Docker containers use their own internal hostnames and IPs, which are not resolvable from the host machine.
MapReduce on Stackoverflow Dataset is free software published under the MIT license. See LICENSE for details.
Docker Hadoop Cluster setup files are taken from @wxw-matt.