Summarizing a large question-and-answer collection from the Stack Overflow website using text pre-processing and descriptive statistics methods, built on the MapReduce framework.
- About
- Requirements
- Run the Application
- Monitor Hadoop Cluster by WebUI
- Technologies
- License
- Credits
- Contributors
This is a MapReduce application written in Java that runs on an Apache Hadoop cluster.
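To illustrate the MapReduce model the jobs follow, here is a small, self-contained word-count sketch. It uses plain Java streams rather than the Hadoop API, and the sample lines are made up for illustration; it is not code from this project:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class WordCountSketch {
    // Map phase: split each line into lowercase word tokens.
    // Reduce phase: group identical words and count them per key.
    public static Map<String, Long> wordCount(Stream<String> lines) {
        return lines
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\W+")))
                .filter(w -> !w.isEmpty())
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        // Hypothetical sample lines standing in for Stack Overflow question titles.
        Map<String, Long> counts = wordCount(Stream.of(
                "How to read a CSV file in Java",
                "Read a large file with Hadoop"));
        System.out.println(counts.get("read")); // 2
        System.out.println(counts.get("file")); // 2
    }
}
```

In a real Hadoop job the map and reduce steps run as separate `Mapper` and `Reducer` classes distributed across the cluster, but the grouping-by-key idea is the same.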
Docker Desktop is required (it is the easiest way to get Docker working on your laptop).
You can download Docker Desktop for Mac or Docker Desktop for Linux.
- First, start the Hadoop cluster and build the Java program with Maven. To do this, run:

  make ready

  This will pull the images from Docker Hub and start the required nodes. The process might take a few minutes if you are running it for the first time. After the pull is complete, it waits until all nodes are ready.
- After the cluster is up, start the GUI. To do that, simply run:

  make run
- Before you run the MapReduce jobs, you have to insert the input data into HDFS. Small samples of the input files are located in jobs/data. To insert them, use the file selector in the GUI and give any destination path for the upload, or run:

  make move-data

  to upload them into the default /input/ path.
- The full files can be downloaded from StackSample: 10% of Stack Overflow Q&A. Before you insert the full files, extract them into jobs/data and run:

  make preprocess

  to get them ready for the MapReduce jobs. The preprocessed files can be found at jobs/data/QuestionsPre.csv and jobs/data/AnswersPre.csv.
- To stop the nodes and delete the output files, run:

  make clean
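The exact transformations behind make preprocess are defined by the repository's own scripts. As a rough illustration only, pre-processing raw Q&A text typically means stripping HTML markup and normalizing whitespace before tokenization. The sketch below is an assumption about what such a step might look like, not the project's actual pipeline:

```java
public class PreprocessSketch {
    // Strip HTML tags and collapse whitespace. This is an assumed stand-in
    // for the real work done by `make preprocess`, not the project's code.
    public static String clean(String raw) {
        return raw
                .replaceAll("<[^>]+>", " ")   // drop HTML tags such as <p> and <code>
                .replaceAll("&amp;", "&")     // unescape a common HTML entity
                .replaceAll("\\s+", " ")      // collapse runs of whitespace
                .trim();
    }

    public static void main(String[] args) {
        String raw = "<p>How do I read a <code>CSV</code> file?</p>";
        System.out.println(clean(raw)); // How do I read a CSV file?
    }
}
```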
- Namenode: http://localhost:9870
- Datanode: http://localhost:9864
- Resourcemanager: http://localhost:8088
- Nodemanager: http://localhost:8042
- Historyserver: http://localhost:8188
Note: If you are redirected to a URL like http://119e8b128bd5:8042/ or http://resourcemanager:8088/, change the hostname to localhost (e.g. http://localhost:8042/) and it will work. This happens because Docker containers use their own internal hostnames and IPs, which are not resolvable from the host machine.
MapReduce on Stackoverflow Dataset is free software published under the MIT license. See LICENSE for details.
Docker Hadoop Cluster setup files are taken from @wxw-matt.