
### 2. Why Large-Scale Computing is Important for NLP (non-exhaustive list):
1. Text processing, that is, turning text into numeric input that a machine learning model can consume, benefits greatly from parallelization. In particular, tokenization, the act of breaking chunks of text into smaller subunits (e.g., words), is a necessary step that can be computationally expensive, especially when dealing with large documents.
2. Feature extraction such as obtaining n-grams from text can lead to extremely wide dataframes (high dimensions: a count vectorizer's output is as wide as the vocabulary, which can reach tens of thousands of terms), requiring substantial memory resources (see the sketch after this list).
3. Large language models (not used in this project, but applicable here to increase accuracy) have millions of parameters, which calls for even more compute-intensive resources.
4. Model fine-tuning often involves computationally expensive and time-consuming procedures such as hyperparameter tuning via grid search.
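
Items 1, 2, and 4 can be made concrete with a short PySpark sketch. This is an illustrative example only, not code from this repository: the column names (`text`, `label`), the toy DataFrame, and the grid values are all assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, NGram, CountVectorizer
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("fake-news-demo").getOrCreate()

# Hypothetical toy DataFrame with a raw `text` column and a binary `label` column.
df = spark.createDataFrame(
    [("breaking news about the economy", 0), ("shocking secret they hide", 1)],
    ["text", "label"],
)

# Item 1: tokenization, distributed across partitions by Spark.
tokenizer = RegexTokenizer(inputCol="text", outputCol="tokens", pattern="\\W+")
# Item 2: n-grams plus a count vectorizer; the feature vector is as wide as the vocabulary.
bigrams = NGram(n=2, inputCol="tokens", outputCol="bigrams")
vectorizer = CountVectorizer(inputCol="bigrams", outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[tokenizer, bigrams, vectorizer, lr])
pipeline_model = pipeline.fit(df)  # runs even on this toy frame

# Item 4: grid search retrains the whole pipeline for every parameter
# combination and fold, which is what makes it expensive without a cluster.
grid = (ParamGridBuilder()
        .addGrid(vectorizer.vocabSize, [1000, 10000])
        .addGrid(lr.regParam, [0.01, 0.1])
        .build())

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(labelCol="label"),
                    numFolds=3)
# cv_model = cv.fit(train_df)  # would be run on the full labeled dataset, not the toy frame
```

Each stage runs as distributed Spark jobs, so tokenization, vectorization, and the repeated fits of the grid search all scale out across a cluster.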

The data come from [this](https://www.kaggle.com/competitions/fake-news/data) Kaggle competition. The key file is ```train.csv```, a labeled dataset of 20,800 news articles. The ```test.csv``` file does not contain labels, so I excluded it from this project.
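
As a hedged illustration of how the file could be read into Spark (the path and CSV options below are assumptions, not taken from this repository):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fake-news-load").getOrCreate()

# Article bodies in train.csv are quoted and may span multiple lines,
# so multiLine and escape handling are typically needed (hypothetical path).
train_df = (spark.read
            .option("header", True)
            .option("multiLine", True)
            .option("escape", '"')
            .csv("data/train.csv"))

train_df.printSchema()
print(train_df.count())  # expected: 20,800 labeled articles
```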

#### Process
The project is divided into two main parts and uses PySpark:

1. Build a text cleaning and preprocessing pipeline
   1. Data cleaning
* **Unreliable (fake) articles:**
<img src="output/fake_topics.png?raw=true"/>
<br>

* **Reliable (real) articles:**
<img src="output/real_topics.png?raw=true"/>
