fix formatting

macs30123-s23 · May 23, 2023 · 5772a55
1 parent 82b568a
commit 5772a55
Showing 1 changed file with 4 additions and 2 deletions.
diff --git a/README.md b/README.md
@@ -41,19 +41,21 @@ The project is divided into two main parts and uses PySpark:
 * **Fake News Prediction** - the data for both models were divided using an 80/20 train-test split
 
     * **Logistic Regression** 
+
     I chose logistic regression because it is relatively simple and interpretable and provides a probabilistic interpretation of classification results. I tuned the regularization parameter and the elastic net parameter via grid search with 5-fold cross-validation. The evaluator used was the BinaryClassificationEvaluator from PySpark with AUC-ROC as the evaluation metric. The test AUC and test accuracy were 0.9732 and 0.9217, respectively, indicating that fake news can be predicted well using logistic regression on a matrix of n-gram token counts from the count vectorizer.
 
     * **Gradient Boosted Tree Classifier** 
+
     The second model I chose was a gradient boosted tree classifier, since such models are generally considered accurate, stable, and highly interpretable. Additionally, unlike linear models such as logistic regression, tree-based models don’t assume that the classes are separated by a linear decision boundary. I tuned the maximum tree depth and the maximum number of iterations via grid search with 5-fold cross-validation. The evaluator used was the BinaryClassificationEvaluator from PySpark with AUC-ROC as the evaluation metric. The test AUC and test accuracy were 0.9724 and 0.9071, respectively. The test AUC is similar to that of the logistic regression model, but the test accuracy is slightly lower.
 
 * **LDA Topic Modeling**
 
     * **Results from the top 10 words from 10 topics**
 
-        * **Unreliable (fake) articles**
+        * **Unreliable (fake) articles:**
         <img src="output/fake_topics.png?raw=true"/>
 
-        * **Reliable (real) articles**
+        * **Reliable (real) articles:**
         <img src="output/real_topics.png?raw=true"/>
 
     From the images above, we can see that topics and frequent words from fake news articles tend to be polarizing and controversial (e.g., the 2016 presidential election in Topics 2 and 3). While topics and frequent words from real news articles overlap with those from fake news articles (Topics 1 and 3), the word clouds for Topics 1, 7, and 8 show that honorifics are commonly used. I chose the number of topics (10) arbitrarily; for future work, it would be better to select the number of topics in a principled way and to compare the frequently used words and topics from LDA more analytically.
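
The two classifiers in the README text above report nearly identical test AUC but different test accuracy. One possible reason, not verified against this project's predictions, is that AUC-ROC depends only on how the scores rank positives against negatives, while accuracy also depends on the decision threshold. The sketch below illustrates this with plain Python and made-up scores. The helper names `auc_roc` and `accuracy`, the 0.5 threshold, and the toy data are all assumptions for illustration; this is not how PySpark's BinaryClassificationEvaluator is implemented internally.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples classified correctly at a fixed threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def auc_roc(labels, scores):
    """AUC-ROC via its pairwise definition: the probability that a randomly
    chosen positive is scored above a randomly chosen negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = unreliable (fake), 0 = reliable (real).
labels = [1, 1, 1, 0, 0, 0]
# Model A: well-separated scores on both sides of 0.5.
scores_a = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
# Model B: same ranking of positives over negatives, but every score is below 0.5.
scores_b = [0.45, 0.40, 0.35, 0.30, 0.20, 0.10]

for name, scores in [("A", scores_a), ("B", scores_b)]:
    print(name, "AUC:", auc_roc(labels, scores), "accuracy:", accuracy(labels, scores))
```

Both toy models rank every positive above every negative, so their AUC is identical, yet model B misclassifies all positives at the 0.5 threshold. If this effect is at play in the project, comparing accuracy at a threshold tuned on validation data may narrow the gap between the two models.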