(click the picture to access our website!)

MBTI Classification from Tweets

Made with Jupyter, Python, and Pandas · MIT License

Table of Contents:

1. Background and Motivation
2. Dataset
3. Structure
4. Technologies Used and Requirements for Use
5. Difficulties
6. Author Info

1. Background and Motivation

MBTI, or Myers-Briggs Type Indicator, is a way to classify personality types along four dimensions: introversion (I)/extraversion (E), sensing (S)/intuition (N), thinking (T)/feeling (F), and judging (J)/perceiving (P). Every person has a combination of four letters that determines their personality type, for a total of 16 personality types. The MBTI test usually takes at least 10 to 15 minutes to complete. Because of this long and tedious process, we wanted to create something that can determine someone's personality much more quickly. As a result, we decided to build a model that predicts a person's MBTI type from their tweets. This takes just a fraction of the time of the original MBTI test, allowing users to figure out their personality more efficiently. Personality tests are a popular way for people to get to know each other and make new friends. With people still suffering from the effects of social distancing during the COVID-19 pandemic, we thought that creating a personality test would help others meet people and build connections more easily.

2. Dataset

We decided to use the MBTI Personality Type Twitter Dataset from Kaggle as the data for our model. The dataset has almost 8,000 entries, each pairing tweets with their corresponding MBTI type.

Original Dataset
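
As a quick illustration (not code from this repo), the dataset can be loaded and inspected with pandas; the file name `mbti_tweets.csv` and the `text`/`label` column names are assumptions and should be adjusted to match the Kaggle download:

```python
import pandas as pd

# Hypothetical file and column names -- adjust to match the Kaggle CSV.
df = pd.read_csv("mbti_tweets.csv")

print(df.shape)                     # roughly 8,000 rows
print(df.head())
print(df["label"].value_counts())   # how the 16 MBTI types are distributed
```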

3. Structure

  • Meeting-Notes stores all of our past meeting notes
  • img contains the images included in our presentations
  • models contains our trained models
  • resources.md is a list of all the resources we referenced throughout this project

Note: the package versions listed in requirements.txt and imported in the code may not be the exact versions we used. The exact versioning is less important, though; all of the libraries we used are listed in the section below.

4. Technologies Used and Requirements for Use

For the data cleaning and model training portion of our project, we used Google Colab, a collaborative environment that lets multiple users work on the same Python notebook. We used NLTK, PyTorch, regex, and various other libraries in Colab to build our model, and we trained it using BERT with PyTorch. To create our website, we used Streamlit, a Python library for building web apps.

To see the requirements that we used, check out requirements.txt!
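
As an illustration only (not our actual app code), a minimal Streamlit page that takes tweet text and shows a prediction could look like the sketch below; `predict_mbti` is a hypothetical placeholder for the trained model:

```python
import streamlit as st

def predict_mbti(text: str) -> str:
    # Placeholder: in the real app this would tokenize the text
    # and run it through the fine-tuned BERT model.
    return "INTP"

st.title("MBTI Classification from Tweets")
tweet = st.text_area("Paste some tweets here:")

if st.button("Predict"):
    st.write(f"Predicted type: {predict_mbti(tweet)}")
```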

5. Difficulties

Since our project requires us to analyze tweets, the text in our dataset included tags, emojis, and other special characters, which our model training process required us to remove. It was difficult to figure out how to strip these special characters without modifying the rest of the text.


Data Cleaning Example
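
For example, a cleaning pass along the following lines strips mentions, URLs, and emoji/special characters while leaving ordinary words alone (a sketch, not the exact patterns we used):

```python
import re

def clean_tweet(text: str) -> str:
    text = re.sub(r"@\w+", "", text)                  # remove @mentions
    text = re.sub(r"http\S+|www\.\S+", "", text)      # remove URLs
    text = re.sub(r"#", "", text)                     # keep hashtag words, drop the '#'
    text = re.sub(r"[^A-Za-z0-9\s.,!?']", "", text)   # drop emojis / special characters
    return re.sub(r"\s+", " ", text).strip()

print(clean_tweet("Loving this! 😍 check it out https://t.co/xyz @friend #excited"))
# -> "Loving this! check it out excited"
```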

We also realized that some of the tweets weren't in English. This created problems with how we were tokenizing the text, so we had to decide between translating all of the non-English tweets into English or deleting them. After plotting the distribution of languages across the tweets, we saw that the vast majority were in English, so we decided it was best to remove the non-English tweets from our dataset.


Distribution of Languages
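
One way to do this filtering is with a language-detection library such as langdetect, sketched below; the `text` column name is a placeholder for whatever the dataset actually uses:

```python
import pandas as pd
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def is_english(text: str) -> bool:
    try:
        return detect(text) == "en"
    except LangDetectException:
        # Very short or empty tweets cannot be detected reliably.
        return False

# `df` with a hypothetical `text` column, as in the loading sketch above.
df = pd.read_csv("mbti_tweets.csv")
df = df[df["text"].apply(is_english)].reset_index(drop=True)
```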

Another hurdle we faced was figuring out how to train our model. We had trouble improving its accuracy, and it took many hours before we reached an acceptable accuracy for our output. We were able to fix this problem by modifying our model and testing various other models until we were satisfied with our results.
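
For reference, a BERT fine-tuning step in PyTorch has roughly the following shape (a generic sketch using the Hugging Face transformers wrappers, which may differ from our actual training code; the hyperparameters and label encoding are placeholders):

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# 16 MBTI types -> 16 output classes.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=16)

texts = ["just finished another side project, time to start three more"]
labels = torch.tensor([0])  # hypothetical integer-encoded MBTI label

batch = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**batch, labels=labels)   # returns loss and logits
outputs.loss.backward()
optimizer.step()
```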

6. Author Info
