diff --git a/README.md b/README.md
index 1deb777..fe372b5 100644
--- a/README.md
+++ b/README.md
@@ -9,7 +9,7 @@ By developing accurate and effective methods to predict fake news, social scient
 ### 2. Why Large-Scale Computing is Important for NLP (non-exhaustive list):
 1. Text processing, that is, the act of turning text into a numeric input that a machine learning model can take in, would benefit greatly from parallelization. Namely, tokenization, the act of breaking down chunks of text into smaller subunits (e.g., words), is a necessary step that can be computationally expensive, especially when dealing with large documents.
-2. Feature extraction such as obtaining n-grams from text can lead to extremely wide dataframes (high dimensions--count vectorizers increase in folds of the length of the vocabulary size, which can be in the tens of thousands), requiring substantial memory resources.
+2. Feature extraction such as obtaining n-grams from text can lead to extremely wide dataframes (high dimensions - count vectorizers increase in folds of the length of the vocabulary size, which can be in the tens of thousands), requiring substantial memory resources.
 3. Large language models (not used in this project, but can be applied to increase accuracy) have millions of parameters, leading to the need for more compute-intensive resources.
 4. Model fine-tuning often involves computationally expensive and time-consuming procedures such as hyperparameter tuning via grid search.
 
@@ -23,7 +23,7 @@ By developing accurate and effective methods to predict fake news, social scient
 Data come from [this](https://www.kaggle.com/competitions/fake-news/data) Kaggle competition. The key file is ```train.csv```, which is a labeled dataset of 20,800 news articles. The ```test.csv``` file does not contain labels, so I excluded it from this project.
 
 #### Process
-The project is divided into two main parts and uses PySpark:
+The project is divided into two main parts and used PySpark:
 1. Build a text cleaning and preprocessing pipeline
    1. Data cleaning
@@ -51,7 +51,7 @@ The project is divided into two main parts and uses PySpark:
 * **Unreliable (fake) articles:**
-
+
 * **Reliable (real) articles:**
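
The preprocessing pipeline and grid-search tuning that the README text above refers to can be sketched in PySpark roughly as follows. This is a minimal illustration, not the repository's actual code: the column names (```text```, ```label```), the logistic regression classifier, the ```vocabSize``` cap, and the grid values are all assumptions.

```python
# Minimal sketch (assumed column names and model), not the project's actual code.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("fake-news").getOrCreate()

# train.csv: the labeled Kaggle dataset of 20,800 articles.
# Article bodies span multiple lines, so enable multiLine/escape handling.
df = spark.read.csv("train.csv", header=True, inferSchema=True,
                    multiLine=True, escape='"')

# Tokenization: split raw article text into word-level tokens
tokenizer = RegexTokenizer(inputCol="text", outputCol="tokens", pattern="\\W")
# Drop common stop words before counting terms
remover = StopWordsRemover(inputCol="tokens", outputCol="filtered")
# Count vectorizer: feature width grows with vocabulary size, so cap it
vectorizer = CountVectorizer(inputCol="filtered", outputCol="features",
                             vocabSize=10000)
# Assumed classifier; the README does not name the model used here
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[tokenizer, remover, vectorizer, lr])

# Grid search over regularization strength, the kind of compute-heavy
# fine-tuning step the README cites as a reason for large-scale resources
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1, 1.0])
        .build())
cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(labelCol="label"),
                    numFolds=3)
model = cv.fit(df)
```

Each added grid value multiplies the number of full pipeline fits across the cross-validation folds, which is why the README lists grid search among the compute-intensive steps.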