This project addresses the classification problem for disaster tweets: given a tweet, predict whether it describes a real disaster.
- Python3
- Jupyter Notebook
- Machine Learning
- NLTK
- Pandas
- Sklearn
- Regex
- Vectorization
- Data loading
- Exploratory data analysis
- Preprocessing
- Fitting the model with parameter tuning
- Performance evaluation
I like pandas; my basic approach is to load the data into a DataFrame and then perform operations and explorations such as EDA.
```python
import pandas as pd

# Load the training data into a DataFrame
df = pd.read_csv("train.csv", engine="python", delimiter=",")
```
Dataset describe
Dataset info
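A quick way to get both views, assuming the `df` loaded above (in a notebook each call would typically sit in its own cell):

```python
# Summary statistics of the numeric columns
print(df.describe())

# Column dtypes and non-null counts (info() prints directly)
df.info()
```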
Top Locations used in Dataset
```python
import matplotlib.pyplot as plt
import seaborn as sns

# Tweet counts per location; plot the 30 most frequent
locations_vc = df["location"].value_counts()
sns.barplot(y=locations_vc[0:30].index, x=locations_vc[0:30], orient='h')
plt.title("Top 30 Locations")
plt.show()
```
Top Keywords used in Dataset
```python
# Tweet counts per keyword; plot the 30 most frequent
keyword_vc = df["keyword"].value_counts()
sns.barplot(y=keyword_vc[0:30].index, x=keyword_vc[0:30], orient='h')
plt.title("Top 30 Keywords")
plt.show()
```
Word Cloud of the Abbreviations Used
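A minimal sketch of how such a word cloud can be produced with the `wordcloud` package; the small `abbreviations` map here is a hypothetical stand-in for the full dictionary used in the project:

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Hypothetical sample; the project uses a much larger abbreviation dictionary
abbreviations = {"u": "you", "thx": "thanks", "b4": "before", "ppl": "people"}

wc = WordCloud(width=800, height=400, background_color="white")
wc.generate(" ".join(abbreviations.keys()))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```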
Preprocessing is the most important phase. Since we are dealing with NLP, it is a little different from numeric preprocessing techniques.
The filters used for preprocessing the tweets are as follows:
- URL removal
- HTML removal
- Non-ASCII removal
- Abbreviation replacement
- Mention removal
- Number removal
- Punctuation removal
- Stop word removal
The filters above were used to clean the text before vectorization. Below is a sketch of the cleaning step together with a before-and-after glimpse.
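A minimal sketch of these filters, assuming NLTK's English stop word list and a small hypothetical abbreviation map (the project's actual dictionary is larger):

```python
import re
import string
from nltk.corpus import stopwords  # requires nltk.download("stopwords")

ABBREVIATIONS = {"u": "you", "thx": "thanks", "b4": "before"}  # hypothetical sample
STOP_WORDS = set(stopwords.words("english"))

def clean_tweet(text):
    text = re.sub(r"https?://\S+|www\.\S+", "", text)                        # URLs
    text = re.sub(r"<.*?>", "", text)                                        # HTML tags
    text = text.encode("ascii", "ignore").decode()                           # non-ASCII
    text = " ".join(ABBREVIATIONS.get(w.lower(), w) for w in text.split())   # abbreviations
    text = re.sub(r"@\w+", "", text)                                         # mentions
    text = re.sub(r"\d+", "", text)                                          # numbers
    text = text.translate(str.maketrans("", "", string.punctuation))         # punctuation
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)  # stop words

print(clean_tweet("Forest fire near La Ronge Sask. Canada http://t.co/x @user"))
# -> Forest fire near La Ronge Sask Canada
```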
For vectorization of textual data, the two most popular methods are (a minimal sketch of both follows the list):
- Count vectorization
- TF-IDF vectorization
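Both are available in scikit-learn. A minimal sketch, assuming the cleaned tweets from the step above; the `max_features` cap is an assumption for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = df["text"].apply(clean_tweet)

# Count vectorization: raw token counts per tweet
count_vec = CountVectorizer(max_features=5000)  # vocabulary cap is an assumption
X_counts = count_vec.fit_transform(corpus)

# TF-IDF vectorization: counts reweighted by inverse document frequency
tfidf_vec = TfidfVectorizer(max_features=5000)
X_tfidf = tfidf_vec.fit_transform(corpus)
```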
A Random Forest classifier has been used for the demonstration.
```python
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators=1000, random_state=0)
```
The number of estimators can be chosen by testing different values and seeing which forest size gives the best result. For example, the model here was evaluated over multiple estimator sizes to determine which one gives the most accurate results; a sketch of such a sweep follows.
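A minimal sketch of that sweep, assuming the TF-IDF features from the vectorization step and the dataset's `target` column; the candidate sizes and the 80/20 split are assumptions:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hold out 20% of the training data for evaluation (split ratio is an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, df["target"], test_size=0.2, random_state=0)

# Compare held-out accuracy across candidate forest sizes
for n in (100, 300, 500, 1000):
    clf = RandomForestClassifier(n_estimators=n, random_state=0)
    clf.fit(X_train, y_train)
    print(n, accuracy_score(y_test, clf.predict(X_test)))

# Fit the chosen model and predict for the report below
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
```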
```python
from sklearn.metrics import classification_report, accuracy_score

print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
```
`n_estimators = 1000` gave the best result.