DeepMBTI: Predict Personalities, Just by Sample Writing
- The PowerPoint pitch deck contains a comprehensive discussion of the trials and errors of the machine learning models and the development of this project.
- This file can be accessed here.
Reasons why we selected our topic:
Myers-Briggs Type Indicator (MBTI) is one of the most prevalent personality tests in psychology. Its purpose is to classify psychological profiles based on the assumption that variations in everyday behavior are ordered, consistent and capable of being categorized. Currently, the MBTI test is performed by an online quiz.
However, these tests often include intrinsic biases because the subject is conscious of the quiz's processes and will answer accordingly. Therefore, we wanted to create a more objective process for testing MBTI. By inputting writing from the subject, we hope to train a machine learning model to predict their personality category through text analytics. Hopefully, this will eliminate input biases from the subject.
The Myers-Briggs Type Indicator (MBTI) is a taxonomy that divides everyone into 16 distinct personality types across four axes:
- Introversion (I) – Extraversion (E)
- Intuition (N) – Sensing (S)
- Thinking (T) – Feeling (F)
- Judging (J) – Perceiving (P)
The dataset contains ~ 8,600 observations (people), where each observation gives a person’s:
- Myers-Briggs personality type (as a 4-letter code)
- An excerpt containing the last 50 posts on their PersonalityCafe forum (each entry separated by “|||”)
This data was collected from the PersonalityCafe forum, as it provides a large selection of people and their MBTI personality types, as well as what they have written.
- Explore and analyze the data to see if any patterns can be detected in specific personality types and their writing styles. This explores the validity of the test in analyzing, predicting and categorizing behavior. These include:
- Length of the post (word count)
- Length of average sentences (word count)
- Use of stop words and punctuation (count)
- Create a Machine Learning capable of predicting the user's personality type based on their writing input.
- PostgreSQL
- AWS RDS
- Python
- Sklearn
- Bootstrap
- Flask
- HTML/CSS
AWS ADS
- Set up RDS instance on AWS, and connect it to PostgreSQL
Postgres
- Create new server with RDS as the host
- Create SQL table and import data in .csv format
SQLAlchemy
- Set up connection with Postgres database
- Reads in the table as a Pandas Dataframe
1. What does the data look like?
- A personality type count shows that the sample data is not very evenly distributed.
- Four data types, INFJ, INFP, INTJ and INTP have considerably larger samples (more than 1,000) than ESFJ, ESFP, ESTJ and ESTP (fewer than 250).
- The users from the forum tend to be IN?? type.
2. Is each dimension balanced?
- Comparing the sample sizes within each of the four axes, we can see that the E-I and N-S axes have extremely imbalanced sample sizes.
3. Does each group have some difference?
- We counted the use of urls, exclamation marks, question marks, word length and digits per post for each personality type.
- The use of exclamation marks and http links differ slightly among the groups. Other writing style counts tend to be fairly evenly distributed.
4. Are the four dimensions independent?
- There are 16 personality types, and this limits our choice of Machine Learning models.
- We proposed dividing the personality types into four axes (E-I, N-S, F-T and J-P).This way, our model only needs to output a binary result. This widened our selection of models considerably.
- Before going forward, we need to test the assumption that each of the four axes is independent from the others.
- A correlation matrix is produced, as shown in the figure on the left.
- Overall, the correlations’ absolute values are < 0.2, indicating low correlation.
1. What to clean up
- Used RegEx methodology to clean data that consisted of social media posts from over 8,000 unique identifiers with 50 posts each separated by 3 pipes ( ||| ).
- Reading the mood from text with ML is called “sentiment analysis.” However, before we could perform sentiment analysis, we had to create a column without the following: http strings; ||| strings; punctuation marks; underscores; numbers; one letter words; leftover white space.
- Then, we made everything lowercase.
2. The balance of information and noise
- We decided that because we are trying to predict personality types, we wanted the text as similar to the actual writing as possible; this is part art and part science.
- We then tokenized and performed TF-IDF vectorization for the cleaned text, and used all 17,000 features as model input.
1. Which models
- We tried five different models: Logistics Regression; Neural Network; Random Forest; Linear SVC; LinearSVC with KBInsDiscretizer.
- With further improvements, the Logistics Regression model produced the best result in terms of overall F1 score and accuracy.
2. How to train the model
- We will use
train_test_splitto split the data intoX_train,X_test, andy_all_train,y_all_test. X_trainandX_testinclude all 17,000 features that are vectorized from the cleaned text input.y_all_trainandy_all_testinclude four columns of target labels.- When training, we will run the model four times. In each iteration, we:
- Resample the data for each dimention to solve the class imbalance issue.
- Use the resampled
X_trainwith one column (target) from they_all_train. - Run the model four times to generate four labels for each test observation.
3. What are the target label(s)
- Each observation will have four initial labels, one from each of the following binaries:
- "E-I"
- "N-S"
- "T-F"
- "J-P"
1. Objective of the app
We want to let users try the model themselves by entering an example of their writing in the text box to receive a prediction of their personality type.
2. The Front-end content
- A text box where the user can input their writing. For better accuracy, there is a minimum requirement of 150 characters.
- A
predictbutton to click on when the user is ready to view the result. - A background introduction of Myers-Briggs Type Indicators with the meaning of each dimension.
- A brief bio of the creators and the background of this project.
3. The Back-end
- Workflow
- Model weights and other variables
- After running the logistics regression model, we used
pickleto save the weights of our four models. - Since we need to preprocess the user text input, we also needed to save the
vectorizervariable.
- After running the logistics regression model, we used
- Flask
- Import all four model weights and the
vectorizervariable. - Initiate a Flask app.
- Create a root route using our HTML template.
- Use a
POSTmethod to capture the user input in the text box. - Apply regex cleaning and vectorize the text into features.
- Input the features into four models and get predicted results for four dimensions.
- Convert 1 & 0 binary results to corresponding personality type letters.
- Output the four dimensions back to the frontend in the result section.
- Import all four model weights and the
- HTML
- At the text box section, we included
<form action="{{ url_for('predict')}}" method="POST">so it can be called from Flask. - In the result section, we included
{{ prediction_text }}so the prediction result can be reflected.
- At the text box section, we included



