A comprehensive baseline model for Gender Classification based on Twitter Profile Data

tranctan/Gender-Classification-based-on-Twritter-textual-data

Gender Classification based on Twitter Profile Data

The dataset was obtained from Kaggle.

As described:

This data set was used to train a CrowdFlower AI gender predictor. You can read all about the project here. Contributors were asked to simply view a Twitter profile and judge whether the user was a male, a female, or a brand (non-individual). The dataset contains 20,000 rows, each with a user name, a random tweet, account profile and image, location, and even link and sidebar color.

In this work, I implemented a notebook that goes through the following steps:

  1. Understanding Dataset
  2. Cleaning Dataset
  3. Visualizing Dataset
  4. Classification Modeling

Regarding scope: for simplicity, I dropped the features and records that did not seem to contribute to the target. The dataset contains two main types of features: textual data (text and description) and non-textual data (all other features, both categorical and numerical).

For textual data:

  • I mainly cleaned it using regex and Python string methods.
  • I employed TF-IDF as a transformation to represent the textual data as feature vectors for the models.
  • I first trained the models on the text feature alone, then concatenated the description and retrained them. The results improved significantly.
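The textual steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the regex patterns, the `clean_text` and `tfidf_vectors` helpers, and the two sample rows are all assumptions made for the example. The TF-IDF weighting is hand-rolled with the standard library so the sketch runs anywhere; it uses a smoothed IDF with L2 normalization, and a real pipeline would more likely rely on a library vectorizer.

```python
import math
import re
from collections import Counter

# Assumed cleaning rules (hypothetical; the notebook may use different ones).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")


def clean_text(raw):
    """Lowercase, drop URLs, mentions and non-letters, collapse whitespace."""
    text = raw.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", " ")       # keep the hashtag word, drop the symbol
    text = NON_ALPHA_RE.sub(" ", text)  # digits and punctuation become spaces
    return " ".join(text.split())


def tfidf_vectors(documents):
    """Return (vocabulary, L2-normalized TF-IDF rows) for cleaned documents."""
    tokenized = [doc.split() for doc in documents]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each token.
    df = Counter(tok for doc in tokenized for tok in set(doc))
    # Smoothed inverse document frequency.
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in vocab}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        row = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0  # guard empty docs
        vectors.append([v / norm for v in row])
    return vocab, vectors


# Hypothetical rows standing in for the dataset's text and description columns.
rows = [
    {"text": "Loving the new #Python release! https://t.co/abc123",
     "description": "Data scientist & coffee addict"},
    {"text": "@friend Big sale today, 50% off!!!",
     "description": "Official account of a shoe brand"},
]

# Concatenate the cleaned tweet text with the cleaned profile description.
combined = [
    clean_text(r["text"]) + " " + clean_text(r["description"]) for r in rows
]
vocab, vectors = tfidf_vectors(combined)
```

Each row of `vectors` can then be fed to a classifier. Concatenating the two fields before vectorizing gives them a shared vocabulary; vectorizing them separately and stacking the resulting matrices is an equally reasonable alternative.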

For non-textual data:

  • I used label count encoding for categorical features (kudos to wrosinski), because they have a very large number of unique values, which ordinary one-hot encoding does not represent well.
  • All other numerical features are kept unchanged.
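As a rough sketch of the categorical encoding, the snippet below assumes that "label count encoding" means replacing each category with the rank of its frequency, so a high-cardinality column becomes a single integer column instead of thousands of one-hot columns. The `label_count_encode` helper, the rank direction, and the sample color values are hypothetical and may differ from the notebook.

```python
from collections import Counter


def label_count_encode(values, ascending=False):
    """Map each category to the rank of its frequency in `values`.

    With ascending=False the most frequent category gets rank 0.
    Ties are broken alphabetically so the mapping is deterministic.
    """
    counts = Counter(values)
    ordered = sorted(
        counts,
        key=lambda cat: (counts[cat] if ascending else -counts[cat], str(cat)),
    )
    rank = {cat: i for i, cat in enumerate(ordered)}
    return [rank[v] for v in values], rank


# Hypothetical link-color values; the real column has far more unique entries.
link_colors = ["0084B4", "FFFFFF", "0084B4", "C0DEED", "0084B4", "FFFFFF"]
encoded, mapping = label_count_encode(link_colors)
print(encoded)  # [0, 1, 0, 2, 0, 1]
```

The mapping should be fitted on the training split only and then reused on validation and test data, with a fallback value for categories that were never seen during fitting.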

For more details, please refer to the .ipynb file. Any comments or feedback are welcome. I always love sharing knowledge and learning from you.
