This project addresses the classification problem for disaster tweets: given a tweet, predict whether it describes a real disaster.
- Python3
- Jupyter Notebook
- Machine Learning
- NLTK
- Pandas
- Sklearn
- Regex
- Vectorization
- Data loading
- Exploratory data analysis
- Preprocessing
- Fitting the model with parameter tuning
- Performance evaluation
I like pandas; my basic approach is to load the data into a DataFrame and then perform operations and explorations such as EDA.
```python
import pandas as pd

# Load the training data into a DataFrame
df = pd.read_csv("train.csv", engine="python", delimiter=",")
```
Dataset describe
Dataset info
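A quick way to get both views, assuming the `df` loaded above (in a notebook each call would typically sit in its own cell):

```python
# Summary statistics of the numeric columns
print(df.describe())

# Column dtypes and non-null counts (info() prints directly)
df.info()
```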
Top Locations used in Dataset
```python
import matplotlib.pyplot as plt
import seaborn as sns

# Tweet counts per location; plot the 30 most frequent
locations_vc = df["location"].value_counts()
sns.barplot(y=locations_vc[0:30].index, x=locations_vc[0:30], orient='h')
plt.title("Top 30 Locations")
plt.show()
```
Top Keywords used in Dataset
```python
# Tweet counts per keyword; plot the 30 most frequent
keyword_vc = df["keyword"].value_counts()
sns.barplot(y=keyword_vc[0:30].index, x=keyword_vc[0:30], orient='h')
plt.title("Top 30 Keywords")
plt.show()
```
Word Cloud of the Abbreviations Used
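A minimal sketch of how such a word cloud can be produced with the `wordcloud` package; the small `abbreviations` map here is a hypothetical stand-in for the full dictionary used in the project:

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Hypothetical sample; the project uses a much larger abbreviation dictionary
abbreviations = {"u": "you", "thx": "thanks", "b4": "before", "ppl": "people"}

wc = WordCloud(width=800, height=400, background_color="white")
wc.generate(" ".join(abbreviations.keys()))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```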
Preprocessing is the most important phase. Since we are dealing with NLP, it is a little different from numeric preprocessing techniques.
The filters used for preprocessing the tweets are as follows:
- URL removal
- HTML removal
- Non-ASCII removal
- Abbreviation replacement
- Mention removal
- Number removal
- Punctuation removal
- Stop word removal
The filters above were used to clean the text before vectorization. Below is a sketch of the cleaning step together with a before-and-after glimpse.
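A minimal sketch of these filters, assuming NLTK's English stop word list and a small hypothetical abbreviation map (the project's actual dictionary is larger):

```python
import re
import string
from nltk.corpus import stopwords  # requires nltk.download("stopwords")

ABBREVIATIONS = {"u": "you", "thx": "thanks", "b4": "before"}  # hypothetical sample
STOP_WORDS = set(stopwords.words("english"))

def clean_tweet(text):
    text = re.sub(r"https?://\S+|www\.\S+", "", text)                        # URLs
    text = re.sub(r"<.*?>", "", text)                                        # HTML tags
    text = text.encode("ascii", "ignore").decode()                           # non-ASCII
    text = " ".join(ABBREVIATIONS.get(w.lower(), w) for w in text.split())   # abbreviations
    text = re.sub(r"@\w+", "", text)                                         # mentions
    text = re.sub(r"\d+", "", text)                                          # numbers
    text = text.translate(str.maketrans("", "", string.punctuation))         # punctuation
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)  # stop words

print(clean_tweet("Forest fire near La Ronge Sask. Canada http://t.co/x @user"))
# -> Forest fire near La Ronge Sask Canada
```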
For vectorization of textual data, the two most popular methods are (a minimal sketch of both follows the list):
- Count vectorization
- TF-IDF vectorization
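Both are available in scikit-learn. A minimal sketch, assuming the cleaned tweets from the step above; the `max_features` cap is an assumption for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = df["text"].apply(clean_tweet)

# Count vectorization: raw token counts per tweet
count_vec = CountVectorizer(max_features=5000)  # vocabulary cap is an assumption
X_counts = count_vec.fit_transform(corpus)

# TF-IDF vectorization: counts reweighted by inverse document frequency
tfidf_vec = TfidfVectorizer(max_features=5000)
X_tfidf = tfidf_vec.fit_transform(corpus)
```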
A Random Forest classifier has been used for the demonstration.
```python
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators=1000, random_state=0)
```
The number of estimators can be chosen by testing different values and seeing which forest size gives the best result. For example, the model here was evaluated over multiple estimator sizes to determine which one gives the most accurate results; a sketch of such a sweep follows.
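A minimal sketch of that sweep, assuming the TF-IDF features from the vectorization step and the dataset's `target` column; the candidate sizes and the 80/20 split are assumptions:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hold out 20% of the training data for evaluation (split ratio is an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, df["target"], test_size=0.2, random_state=0)

# Compare held-out accuracy across candidate forest sizes
for n in (100, 300, 500, 1000):
    clf = RandomForestClassifier(n_estimators=n, random_state=0)
    clf.fit(X_train, y_train)
    print(n, accuracy_score(y_test, clf.predict(X_test)))

# Fit the chosen model and predict for the report below
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
```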
```python
from sklearn.metrics import classification_report, accuracy_score

print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
```
`n_estimators = 1000` gave the best result.