# Big-Data

## Team

- Guhan Kabbina
- Harshita Vidapanakal
- Hanuraag Baskaran
- Rohan M

## Projects

This repository contains source code for the following projects:

1. Analysis of Earth Surface Temperature using Spark
2. Implementation of the PageRank Algorithm with Embeddings for Wikipedia using Hadoop
3. Analysis of US Road Accident Data using Hadoop
4. Classification of Spam and Ham Emails using Spark Machine Learning



## Usage

**Step 1:** Run the script files in the `config` folder to install both Hadoop and Spark on your Linux machine.
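A minimal sketch of this step, assuming the installer script is named `install.sh` (the actual filename inside `config` may differ):

```bash
# Hypothetical script name -- substitute the actual installer in config/.
bash config/install.sh

# Verify the installation afterwards.
hadoop version
spark-submit --version
```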

**Step 2:** Run the requirements script files in the `config` folder to install all the libraries required by the projects in this repository.
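For example (the script name below is an assumption; use the requirements script that actually ships in `config`):

```bash
# Hypothetical filename -- installs the libraries used by all four projects.
bash config/requirements.sh
```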

**Step 3:** The data files required by each project are in the `data` folder. The data is pre-processed and only a sample of each dataset is stored; links to the full datasets are provided in `data/README.md`.

**Step 4:** The source code for all the projects is in the `src` folder. **Please read the documentation and the report to understand how the code works.**

**Step 5:** Run the script file in the `tools` folder that corresponds to the project you want to execute.
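For instance (the launcher name below is illustrative only; list the folder to find the real per-project scripts):

```bash
# List the per-project launchers, then run the one you need.
ls tools/

# Hypothetical filename -- substitute the actual script for your project.
bash tools/earth_surface_temperature.sh
```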

**Step 6:** Sample output for each project is stored in the `sample` folder.

**Step 7:** Pre-trained models for `Spam_Ham_Classification` are in the `build` folder; they can be used to classify emails with the test script `src/Spam_Ham/models/model_test.py`.
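A typical invocation might look like the following (the argument-free call is an assumption; check the script and the report for its actual interface):

```bash
# Classify emails using the bundled pre-trained models.
# No command-line arguments are assumed -- consult model_test.py for the real options.
python3 src/Spam_Ham/models/model_test.py
```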

## Conclusion

The performance analysis of the models used in the projects is provided in the `report/images` folder.

All additional details regarding the projects are provided in the `docs` folder.

Please raise a GitHub issue if you have any questions or suggestions.