Author: Christopher Dieck
The business wants to determine whether an individual can be identified as an introvert, an extrovert, or neither from their answers to a questionnaire, for the purpose of creating targeted advertisements.
Our goal for this analysis is to identify which features of the dataset are most important in predicting whether someone can be identified as an introvert or an extrovert, to discover trends in the data that may be useful, and to build a machine learning model that can classify if someone is introverted, extroverted, or neither.
- Source:
- Original Questionnaire:
- Introvert/Extrovert ('IE')
- Label for being introverted, extroverted, or neither
The test contained 91 questions, presented one at a time in a random order. For each question, 3 values were recorded:
A - The user's selected response. 1=Disagree, 2=Slightly disagree, 3=Neutral, 4=Slightly agree, 5=Agree
I - The position of the question in the survey.
E - The time elapsed on that question in milliseconds.
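As a rough illustration of this layout, the sketch below assumes each question contributes three columns named `Q<n>A`, `Q<n>I`, and `Q<n>E`. That naming is an assumption, not something confirmed here, and the two-question frame is made up. It shows one way to keep only the selected-response columns.

```python
import pandas as pd

# Made-up two-question frame; the Q<n>A / Q<n>I / Q<n>E column names are an assumption.
df = pd.DataFrame({
    "Q1A": [5, 2], "Q1I": [17, 3], "Q1E": [4120, 2890],
    "Q2A": [1, 4], "Q2I": [54, 80], "Q2E": [3310, 5005],
    "IE": [1, 2],
})

# Keep only the selected-response columns (Q<n>A) as candidate features.
answer_cols = [c for c in df.columns if c.startswith("Q") and c.endswith("A")]
answers = df[answer_cols]
print(answer_cols)
```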
The text of the questions was:
{ "Q1" : "I would never audition to be on a game show."
"Q2" : "I am not much of a flirt."
"Q3" : "I have to psych myself up before I am brave enough to make a phone call."
"Q4" : "I would hate living with room mates."
"Q5" : "I mostly listen to people in conversations."
"Q6" : "I reveal little about myself."
"Q7" : "I spend hours alone with my hobbies."
"Q8" : "I prefer to eat alone."
"Q9" : "I have trouble finding people I want to be friends with."
"Q10" : "I prefer to socialize 1 on 1, than with a group."
"Q11" : "I sometimes speak so quietly people sometimes have trouble hearing me."
"Q12" : "I do not like to get my picture taken."
"Q13" : "I can keep a conversation going with anyone about anything."
"Q14" : "I want a huge social circle."
"Q15" : "I talk to people when waiting in lines."
"Q16" : "I act wild and crazy."
"Q17" : "I am a bundle of joy."
"Q18" : "I love excitement."
"Q19" : "I'd like to be in a parade."
"Q20" : "I am a flamboyant person."
"Q21" : "I am good at making impromptu speeches."
"Q22" : "I naturally emerge as a leader."
"Q23" : "I am spontaneous."
"Q24" : "I would enjoy being a sports team coach."
"Q25" : "I have a strong personality."
"Q26" : "I am excited by many different activities."
"Q27" : "I spend most of my time in fantasy worlds."
"Q28" : "I often feel lucky."
"Q29" : "I don't make eye contact when I talk with people."
"Q30" : "I have a monotone voice."
"Q31" : "I am a touchy feely person."
"Q32" : "I would like to try bungee jumping."
"Q33" : "I tend to be admired by others."
"Q34" : "I make big physical movements whenever I get excited."
"Q35" : "I am brave."
"Q36" : "I am always in the moment."
"Q37" : "I am involved with my community."
"Q38" : "I am good an entertaining children."
"Q39" : "I like formal occasions."
"Q40" : "I would have to be lost for a very long time before asking help."
"Q41" : "I do not care about sports."
"Q42" : "I prefer individual sports to team sports."
"Q43" : "My parents know nothing about my love life."
"Q44" : "I mostly listen to people in conversations."
"Q45" : "I never leave the door to my room open."
"Q46" : "I make a lot of hand motions when I talk."
"Q47" : "I take lots of pictures of my activities."
"Q48" : "When I was a child, I put on fake concerts and plays with my friends."
"Q49" : "I really like dancing."
"Q50" : "I would have difficulty describing myself to someone."
"Q51" : "My life would not make a good story."
"Q52" : "I am hesitant to give suggestions."
"Q53" : "I tire out quickly."
"Q54" : "I never tell people the important things about myself."
"Q55" : "I avoid going to unknown places."
"Q56" : "Going to the doctor is always awkward for me."
"Q57" : "I have not kept up with my old friends over the years."
"Q58" : "I have not been joyful for quite some time."
"Q59" : "I hate to ask for help."
"Q60" : "If I were to die, I would not want there to be a memorial for me."
"Q61" : "I hate shopping."
"Q62" : "I love to do impressions."
"Q63" : "I would be pleased if asked to speak at a funeral."
"Q64" : "I would never go to a dance club."
"Q65" : "I find it very hard to tell people I find them attractive."
"Q66" : "I hate people."
"Q67" : "I was an outcast in school."
"Q68" : "I would enjoy being a librarian."
"Q69" : "I am usually not single."
"Q70" : "I am able to stand up for myself."
"Q71" : "I would go surfing regularly if I lived on a beach."
"Q72" : "I have wanted to be a stand-up comedian."
"Q73" : "I am a high status person."
"Q74" : "I work out regularly."
"Q75" : "I laugh a lot."
"Q76" : "I like pranks."
"Q77" : "I am happy with my life."
"Q78" : "I am never at a loss for words."
"Q79" : "I feel healthy and vibrant most of the time."
"Q80" : "I love large parties."
"Q81" : "I am quiet around strangers."
"Q82" : "I don't talk a lot."
"Q83" : "I keep in the background."
"Q84" : "I don't like to draw attention to myself."
"Q85" : "I have little to say."
"Q86" : "I often feel blue."
"Q87" : "I am not really interested in others."
"Q88" : "I make people feel at ease."
"Q89" : "I don't mind being the center of attention."
"Q90" : "I start conversations."
"Q91" : "I talk to a lot of different people at parties." }
After the main question sequence, the following questions were asked on one final page (none of the following features were used in machine learning except for "IE"):
- age: "What is your age in years?"
- gender: "What is your gender?"
  - 1=Male
  - 2=Female
  - 3=Other
- engnat: "Is English your native language?"
  - 1=Yes
  - 2=No
- IE: "Do you identify as either an introvert or extravert?"
  - 1=Yes, introvert
  - 2=Yes, extravert
  - 3=No
On the final page, the users were also asked "Do you give accurate answers and can we store and use your data for research?". Only those who answered yes were recorded.
The following were determined from technical information:
- country: user's network location
- dateload: the time the user loaded the introduction page
- introelapse: the time spent in seconds on the introduction page
- testelapse: the time spent in seconds on the test questions
- surveyelapse: the time spent in seconds on the final page
Missing Values
- There were only two missing values total, and both were in the country column. In order to preserve the rest of the data in those rows I imputed the missing values with the most frequent value ("US") because it represented the overwhelming majority of the data.
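A minimal sketch of this kind of most-frequent-value imputation with pandas follows. The toy values are made up, and only the `country` column name comes from the description above.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset.
df = pd.DataFrame({"country": ["US", "US", np.nan, "GB", "US", np.nan]})

# Most frequent value in the column (mode() ignores missing values by default).
most_frequent = df["country"].mode().iloc[0]

# Fill the gaps instead of dropping the rows, so the other columns are preserved.
df["country"] = df["country"].fillna(most_frequent)
print(df["country"].value_counts())
```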
- I started by removing the columns that were not going to be useful for predictive analysis: all of the questions asked after the main questionnaire and the technical information, except for the target column. The final questions were never meant to be used for predictions, and the technical information contained inaccurate values, potentially because people left their computers while taking the questionnaire.
- I had initially replaced the ordinal-encoded target variable with string labels for easier data exploration, so I re-encoded it as ordinal values before modeling.
- I split the data into training and testing sets to validate my models and to test them on data they had never seen, to show that they can make predictions on new data.
- While all the features share the same range of 1 to 5, I still used a normalizer to scale the data because the responses are not normally distributed.
- Principal Component Analysis (PCA) was tried with each model for dimensionality reduction in order to improve model speed. It improved the score of one of the models, but it sacrificed accuracy on the others.
- The following three machine learning models were used to see which can best classify whether someone is an introvert, an extrovert, or neither:
  - Decision Tree Classifier
  - K Nearest Neighbors (KNN) Classifier
  - Logistic Regression Classifier
- Each model was evaluated using classification reports that identified precision, recall, f1-scores, and accuracy to get an overview of model performance and the amount of type 1 and type 2 error. Each was also tested once more using
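The steps above can be sketched end to end with scikit-learn. This is an illustration under stated assumptions, not the project's actual code: the data is random (so the scores are meaningless), and the `Q<n>A` column names, the 75/25 split, the choice of `Normalizer` as the scaler, the number of PCA components, and every hyperparameter are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the questionnaire: 10 answer columns on the 1-5 scale.
feature_cols = [f"Q{i}A" for i in range(1, 11)]
df = pd.DataFrame(rng.integers(1, 6, size=(300, 10)), columns=feature_cols)
df["age"] = rng.integers(18, 70, size=300)            # final-page question
df["testelapse"] = rng.integers(100, 2000, size=300)  # technical field
df["IE"] = rng.choice(["Introvert", "Extrovert", "Neither"], size=300)

# Drop the final-page and technical columns, keeping the target.
df = df.drop(columns=["age", "testelapse"])

# Map the string labels back to ordinal codes.
label_codes = {"Introvert": 1, "Extrovert": 2, "Neither": 3}
X = df[feature_cols]
y = df["IE"].map(label_codes)

# Hold out a test set the models never see during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Hyperparameters here are illustrative assumptions.
models = {
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "knn": KNeighborsClassifier(n_neighbors=15),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

results = {}
for name, model in models.items():
    for use_pca in (False, True):
        steps = [Normalizer()]
        if use_pca:
            steps.append(PCA(n_components=5))
        pipeline = make_pipeline(*steps, clone(model))
        pipeline.fit(X_train, y_train)
        y_pred = pipeline.predict(X_test)
        results[(name, use_pca)] = {
            "train_accuracy": pipeline.score(X_train, y_train),
            "test_accuracy": accuracy_score(y_test, y_pred),
            "report": classification_report(
                y_test, y_pred, labels=[1, 2, 3],
                target_names=list(label_codes), zero_division=0,
            ),
        }

print(results[("logistic_regression", False)]["report"])
```

Putting the scaler and PCA inside the pipeline means they are fit on the training split only, so nothing from the test set leaks into preprocessing.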
- There were several aspects of the dataset that potentially made it difficult to produce a model that could reach an accuracy above 75%:
  - A correlation heatmap showed that the large majority of the features had little correlation with the target variable.
  - The dataset was relatively small, with roughly 7,200 entries and few relevant features.
- The decision tree model performed moderately, with an accuracy of 71.2% on the training set and 71.8% on the testing set. A redeeming quality is its fairly high recall of 92% for identifying introverts.
- The KNN model performed slightly better, with 73.1% accuracy on the training set and 73.4% on the testing set. Although the accuracy gain was slight, recall for detecting extroverts improved by about 20%, while recall for those who identify as neither dropped by 4%.
- The KNN model using PCA performed about 1% better than the base model, which may not be worth losing the ability to interpret the original features. It also performed slightly worse on recall scores, but slightly better on precision scores.
- The logistic regression model performed the best out of all the models, with an accuracy of 74.9%. It had the highest recall scores for identifying extroverts and ambiverts (neither), but it performed slightly worse for identifying introverts specifically. It also had the best precision scores across the board.
- In this heatmap, positive correlations mean that as a person agrees more with the statement (higher rating), they also tend to identify as extroverted (higher number compared to introverted). If they disagree (lower rating), they tend to be introverted (lower number).
- Negative correlations mean that the more they agree with the statement they also tend to be introverted. If they disagree, they tend to be extroverted. Basically, it is an inverse relationship.
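To make the sign convention concrete, here is a small sketch on synthetic data. The column names and generated values are assumptions, and rows labeled 3 (neither) are left out to keep the toy example simple. One column is built to rise with `IE`, one to fall, and one is unrelated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500

# 1 = introvert, 2 = extrovert.
ie = rng.integers(1, 3, size=n)

df = pd.DataFrame({
    "IE": ie,
    # Agreement rises with IE, so the correlation is positive.
    "Q90A": np.clip(np.rint(2 + 1.5 * (ie - 1) + rng.normal(0, 1, n)), 1, 5),
    # Agreement falls with IE, so the correlation is negative.
    "Q83A": np.clip(np.rint(4 - 1.5 * (ie - 1) + rng.normal(0, 1, n)), 1, 5),
    # Unrelated to IE, so the correlation is near zero.
    "Q61A": rng.integers(1, 6, size=n),
})

# Column of the correlation matrix that a heatmap would show against the target.
corr_with_target = df.corr()["IE"].drop("IE").sort_values()
print(corr_with_target)
```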
After selecting the questions with the strongest correlations, we can see that Q91 ("I talk to a lot of different people at parties") has the strongest positive correlation with IE. The next strongest positive correlations are Q90 ("I start conversations") and Q89 ("I don't mind being the center of attention"), respectively. With some general knowledge about introverts and extroverts, it makes sense that these questions have the strongest positive correlations with our target, because extroverts typically enjoy, or are much more comfortable with, social activities involving strangers compared to introverts.
It is interesting to note that these questions also have strong positive correlations with each other, which shows that people who agree with one are likely to agree with the others if they are extroverted, or disagree if they are introverted.
The questions with the strongest negative correlations with our target are Q83 ("I keep in the background"), Q82 ("I don't talk a lot"), and Q81 ("I am quiet around strangers"), respectively. It makes sense for these questions to have negative correlations because introverts typically enjoy their time alone and may dislike social activities around new people, which these questions capture well. Similarly, these questions are strongly positively correlated with each other, which means they also receive consistent answers.
Analysis of Bar Charts:
Both of these barplots show that people who identify as introverted typically seem to either dislike or avoid starting or continuing conversations, especially with people they do not know.
The answers to these questions are also highly correlated to other questions regarding situations where the individual would be around talkative people, such as at parties. Across all of these types of questions, those who identify as introverts tend to not be as talkative as those who identify as extroverted.
Using these questions to get a quick idea of whether someone is an introvert or an extrovert could be extremely useful for anyone trying to market specifically to introverts or extroverts, because the answers give more information about what these groups are like.
It is also worth mentioning that those who identify as "Neither" (which could be described as an ambivert, or someone who is in between introverted and extroverted) answered fairly evenly across the board, as expected.
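The numbers behind bar charts like these can be tabulated as the share of each response within each self-identified group. The sketch below uses a tiny hand-made sample; the values and the `Q90A` column name are illustrative assumptions.

```python
import pandas as pd

# Tiny hand-made sample, not real survey data.
df = pd.DataFrame({
    "IE": ["Introvert"] * 4 + ["Extrovert"] * 4 + ["Neither"] * 4,
    "Q90A": [1, 2, 1, 3, 5, 4, 5, 4, 2, 3, 4, 3],
})

# Share of each response (1-5) within each group; each row sums to 1.
response_share = pd.crosstab(df["IE"], df["Q90A"], normalize="index")

# Mean agreement per group as a one-number summary.
mean_response = df.groupby("IE")["Q90A"].mean()

print(response_share)
print(mean_response)
```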
Based on these reports, the logistic regression model had the highest accuracy at roughly 75%, and I consider it the best model for that reason. Furthermore, while false positives and false negatives are not critical in this particular situation, having fewer false negatives is slightly better: when deciding how to market to a particular individual, a false negative means we miss an opportunity to advertise to someone who would respond well to the targeted advertisement.
A false positive, on the other hand, means we might send an advertisement to someone who might not be as likely to care about it, but there is a chance they could still like the product. In short, the cost of taking action is low, while the cost of missing the opportunity can be much higher.
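Since recall is the metric tied to false negatives, it can help to read it per class from a confusion matrix. The labels and predictions below are made up for illustration (1 = introvert, 2 = extrovert, 3 = neither).

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up true labels and predictions.
y_true = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
y_pred = [1, 1, 1, 2, 2, 2, 1, 3, 1, 1]

labels = [1, 2, 3]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# False negatives for a class: members of that class predicted as something else.
false_negatives = cm.sum(axis=1) - cm.diagonal()

# Per-class recall = true positives / (true positives + false negatives).
recall = recall_score(y_true, y_pred, labels=labels, average=None)

print(cm)
print(false_negatives, recall)
```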
- Using the logistic regression model, a business can predict whether someone is introverted, extroverted, or neither.
- Using this information, the company can change who it advertises to in order to increase sales and/or waste fewer resources on advertising. It can also be used to see which category the company's main audience tends to fall into, so that adjustments can be made depending on the desired outcomes.
- Additionally, it is highly recommended to exclude the same features that I excluded, for the reasons listed in the method section above. They are not suitable predictors and will most likely add a lot of noise.
The biggest limitation of this project was that the features were difficult to visualize in a meaningful way because of how many there are and how they are formatted. Aside from that, many of them have little correlation with the target, which means a lot of the data collected was unnecessary.
The next steps should be as follows:
- Further experimentation with hyperparameter tuning may prove useful, especially for increasing the recall score for the Extrovert and Neither classes.
- I could try using a boosting method such as XGBoost to potentially increase performance.
- I should cross-validate the model to ensure that it performs just as well on multiple test sets aside from the main one I used.
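As a starting point for the tuning and cross-validation steps, the sketch below uses scikit-learn on synthetic three-class data. The generated data, the grid of `C` values, the five folds, and the macro-averaged recall objective are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic three-class problem standing in for the questionnaire data.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=6, n_classes=3, random_state=0
)

# Stratified folds keep the class balance similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Cross-validated accuracy of an untuned model: one score per fold.
baseline_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="accuracy"
)

# Tune the regularization strength, optimizing recall averaged over classes
# so that weaker classes count as much as the strongest one.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    scoring="recall_macro",
    cv=cv,
)
search.fit(X, y)

print(baseline_scores.mean(), search.best_params_, search.best_score_)
```

Per-class recall could be targeted more directly with a custom scorer, but macro recall is a simple first objective.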
For any additional questions, please contact [email protected]