1. Background and Motivation
2. Dataset
3. Structure
4. Technologies Used and Requirements for Use
5. Difficulties
6. Author Info
1. Background and Motivation

MBTI, or the Myers-Briggs Type Indicator, classifies personalities along four axes: introversion (I)/extraversion (E), sensing (S)/intuition (N), thinking (T)/feeling (F), and judging (J)/perceiving (P). Each person gets a combination of four letters, one from each pair, which determines their personality type, for a total of 16 personality types. The MBTI test usually takes at least 10 to 15 minutes to complete. Because of the long and tedious process of finishing the test, we wanted to create something that can determine someone's personality much more quickly. As a result, we decided to create a model that predicts a person's MBTI type based on their tweets. This takes just a fraction of the time of the original MBTI test, allowing users to figure out their personality more efficiently. Personality tests are a popular way for people to get to know each other and make new friends, and with people still suffering from the effects of social distancing due to the COVID-19 pandemic, we thought that creating a personality test would help others meet people more easily and build connections.
2. Dataset

We decided to use the MBTI Personality Type Twitter Dataset from Kaggle as the data for our model. The dataset has almost 8,000 entries, each containing tweets and their corresponding MBTI type.
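A minimal sketch of loading and inspecting the dataset with pandas; the file name and column names here are assumptions about the Kaggle download, not something this README specifies:

```python
import pandas as pd

# Load the Kaggle MBTI tweet dataset. The file name "mbti_tweets.csv"
# and the column names ("text" for the tweets, "label" for the MBTI
# type) are assumptions for illustration.
df = pd.read_csv("mbti_tweets.csv")

print(df.shape)                     # roughly 8,000 rows
print(df["label"].value_counts())  # distribution over the 16 MBTI types
```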
3. Structure

Meeting-Notes: stores all of our past meeting notes
img: contains the images included in our presentations
models: contains our trained models
resources.md: a list of all the resources we referenced throughout this project
4. Technologies Used and Requirements for Use

Note: the package versions listed in requirements.txt and imported in the code may not be the exact versions we used. The exact versioning is less important here, though; all of the libraries we used are listed in the section below.
For the data cleaning and model training portions of our project, we used Google Colab, a collaborative application that allows multiple users to write Python code in the same file. We utilized NLTK, PyTorch, regex, and various other libraries within Colab for our model, and we trained our model using BERT with PyTorch (see the sketch below). To create our website, we used Streamlit, a Python-based library for building web apps.
To see the requirements that we used, check out requirements.txt!
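The README only says we used BERT with PyTorch; one common way to get a pretrained BERT with a 16-way classification head is Hugging Face's transformers library, which this sketch assumes rather than our exact setup:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Pretrained BERT plus a fresh classification head with 16 labels,
# one per MBTI type. Using transformers here is an assumption; the
# README only states that BERT was trained with PyTorch.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=16
)

# Tokenize a single tweet and run a forward pass.
inputs = tokenizer("just finished my third coffee today",
                   return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 16]) -- one score per MBTI type
```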
5. Difficulties

Since our project requires us to analyze tweets, the text in our dataset included tags, emojis, and other special characters, which our model training process required us to remove. It was difficult to figure out how to strip these special characters without modifying the rest of the text.
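A sketch of the kind of cleaning this describes, using Python's re module; the exact patterns we used aren't recorded in this README, so these are illustrative:

```python
import re

def clean_tweet(text: str) -> str:
    """Strip tags, URLs, and emoji-like characters from a tweet while
    leaving ordinary words intact. Illustrative patterns only; the
    project's actual regexes may differ."""
    text = re.sub(r"@\w+", "", text)           # remove @mentions
    text = re.sub(r"#\w+", "", text)           # remove hashtags
    text = re.sub(r"https?://\S+", "", text)   # remove URLs
    text = re.sub(r"[^\x00-\x7F]+", "", text)  # drop emojis / non-ASCII
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace

print(clean_tweet("loving this weather ☀️ @friend see https://t.co/x #sunny"))
# -> "loving this weather see"
```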
We also realized that some of the tweets weren't in English. This created problems with how we were tokenizing the text, so we had to decide between translating all of the non-English tweets into English and deleting them. After plotting the distribution of tweet languages, we saw that the vast majority of the tweets were in English. As a result, we thought it would be best to remove all of the non-English tweets from our dataset.
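The README doesn't name the language-detection tool we used; langdetect is one common choice, so this filter is a sketch under that assumption (note that langdetect can misclassify very short strings):

```python
from langdetect import detect, LangDetectException

def is_english(text: str) -> bool:
    """True if langdetect labels the text as English. langdetect is an
    assumed choice; the README doesn't say which detector we used."""
    try:
        return detect(text) == "en"
    except LangDetectException:  # empty or undetectable text
        return False

tweets = ["what a great day to be outside",
          "quelle belle journée",  # French: dropped
          ""]                      # undetectable: dropped
print([t for t in tweets if is_english(t)])
```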
Another hurdle we faced was figuring out how to train our model. We had trouble improving the accuracy of our model, and it took many hours before the accuracy was high enough for our output. We were able to fix this problem by modifying our model and testing various other models until we were satisfied with our results.
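For concreteness, a minimal fine-tuning loop in the spirit of what this describes; the optimizer, learning rate, batch size, and epoch count are all illustrative assumptions, and `model`/`tokenizer` are the names from the BERT sketch above:

```python
import torch
from torch.utils.data import DataLoader

# Assumed inputs: `train_texts` (list of cleaned tweet strings) and
# `train_labels` (list of ints 0-15, one per MBTI type) from the
# cleaning steps, plus `model` and `tokenizer` from the earlier sketch.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loader = DataLoader(list(zip(train_texts, train_labels)),
                    batch_size=16, shuffle=True)

model.train()
for epoch in range(3):  # epoch count is an assumption
    for texts, labels in loader:  # texts: list of str, labels: tensor
        inputs = tokenizer(list(texts), return_tensors="pt",
                           truncation=True, padding=True)
        loss = model(**inputs, labels=labels).loss  # built-in CE loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```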