Gender Classification based on Twitter Profile Data

This dataset is obtained from Kaggle.

As described:

This data set was used to train a CrowdFlower AI gender predictor. You can read all about the project here. Contributors were asked to simply view a Twitter profile and judge whether the user was a male, a female, or a brand (non-individual). The dataset contains 20,000 rows, each with a user name, a random tweet, account profile and image, location, and even link and sidebar color.

In this work, I implemented a notebook that goes through the following steps:

Understanding Dataset
Cleaning Dataset
Visualizing Dataset
Classification Modeling

Regarding scope, I mainly dropped features and records that did not seem to contribute to the target, for simplicity. There are two main types of features in this dataset: textual data (text and description) and non-textual data (the other features, both categorical and numerical).

For textual data:

I mainly cleaned it using regex and Python string methods.
I employed TF-IDF as a transformation to represent the textual data as feature vectors for the models.
I first trained the models on the text feature alone, then concatenated the description and retrained them. The results showed a significant improvement.
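
The textual steps above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the notebook's actual code: the regex patterns, the sample rows, and the helper names `clean_text` and `tfidf` are all assumptions made for the example. The TF-IDF here uses a smoothed idf, `ln((1 + n) / (1 + df)) + 1`, without vector normalization; a library implementation such as scikit-learn's `TfidfVectorizer` would normally be used in practice.

```python
import math
import re
from collections import Counter

# Assumed cleaning rules; the notebook's actual patterns may differ.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")


def clean_text(raw):
    """Lowercase, strip URLs and @mentions, and drop non-letter characters."""
    text = raw.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_ALPHA_RE.sub(" ", text)
    return " ".join(text.split())  # collapse repeated whitespace


def tfidf(documents):
    """Return one {term: weight} dict per document (smoothed idf, unnormalized)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: c * idf[t] for t, c in Counter(tokens).items()} for tokens in tokenized]


# Hypothetical rows standing in for the dataset's text and description columns.
rows = [
    {"text": "Loving the new phone! http://t.co/abc @friend", "description": "Tech fan"},
    {"text": "Great match today", "description": "Football fan, dad of 2"},
]

# Concatenate the tweet and the profile description into one document per user.
documents = [clean_text(r["text"] + " " + r["description"]) for r in rows]
vectors = tfidf(documents)
```

Terms that occur in every document (such as "fan" in this toy input) receive the lowest weight, while terms unique to one profile are weighted higher, which is the property that makes TF-IDF vectors useful as classifier features.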

For non-textual data:

I used label count encoding for categorical features (kudos to wrosinski), because they have a very large number of unique values, which ordinary one-hot encoding does not represent well.
All other numerical features are kept unchanged.
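
As a rough illustration of label count encoding, the sketch below replaces each category with its rank by frequency, so a high-cardinality column becomes a single integer column instead of thousands of one-hot columns. The function name `label_count_encode`, the sample values, and the choice to give the most frequent category rank 0 are assumptions for this example; the notebook's implementation may order or handle ties differently.

```python
from collections import Counter


def label_count_encode(values):
    """Map each category to its frequency rank (0 = most frequent)."""
    counts = Counter(values)
    # Sort by descending count; break ties alphabetically so the result is deterministic.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    mapping = {value: rank for rank, value in enumerate(ordered)}
    return [mapping[v] for v in values], mapping


# Hypothetical values standing in for a categorical column such as location.
locations = ["London", "NYC", "London", "Paris", "NYC", "London"]
encoded, mapping = label_count_encode(locations)
```

In a real train/test split, the mapping would be built on the training data only and then applied to the test data, with a fallback value for categories that were never seen during training.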

For more details, please refer to the .ipynb file. Any comments or feedback are welcome; I always love sharing knowledge and learning from you.

tranctan/Gender-Classification-based-on-Twritter-textual-data