
### 2. Why Large-Scale Computing is Important for NLP (non-exhaustive list):
1. Text processing, that is, turning text into numeric input that a machine learning model can consume, benefits greatly from parallelization. In particular, tokenization, the act of breaking chunks of text into smaller subunits (e.g., words), is a necessary step that can be computationally expensive, especially when dealing with large documents.
2. Feature extraction such as obtaining n-grams from text can lead to extremely wide dataframes (high dimensions: a count vectorizer's output is as wide as the vocabulary, which can reach tens of thousands of terms), requiring substantial memory resources (see the sketch after this list).
3. Large language models (not used in this project, but applicable here to increase accuracy) have millions of parameters, which calls for even more compute-intensive resources.
4. Model fine-tuning often involves computationally expensive and time-consuming procedures such as hyperparameter tuning via grid search.
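
Items 1, 2, and 4 can be made concrete with a short PySpark sketch. This is an illustrative example only, not code from this repository: the column names (`text`, `label`), the toy DataFrame, and the grid values are all assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, NGram, CountVectorizer
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("fake-news-demo").getOrCreate()

# Hypothetical toy DataFrame with a raw `text` column and a binary `label` column.
df = spark.createDataFrame(
    [("breaking news about the economy", 0), ("shocking secret they hide", 1)],
    ["text", "label"],
)

# Item 1: tokenization, distributed across partitions by Spark.
tokenizer = RegexTokenizer(inputCol="text", outputCol="tokens", pattern="\\W+")
# Item 2: n-grams plus a count vectorizer; the feature vector is as wide as the vocabulary.
bigrams = NGram(n=2, inputCol="tokens", outputCol="bigrams")
vectorizer = CountVectorizer(inputCol="bigrams", outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[tokenizer, bigrams, vectorizer, lr])
pipeline_model = pipeline.fit(df)  # runs even on this toy frame

# Item 4: grid search retrains the whole pipeline for every parameter
# combination and fold, which is what makes it expensive without a cluster.
grid = (ParamGridBuilder()
        .addGrid(vectorizer.vocabSize, [1000, 10000])
        .addGrid(lr.regParam, [0.01, 0.1])
        .build())

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(labelCol="label"),
                    numFolds=3)
# cv_model = cv.fit(train_df)  # would be run on the full labeled dataset, not the toy frame
```

Each stage runs as distributed Spark jobs, so tokenization, vectorization, and the repeated fits of the grid search all scale out across a cluster.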

The data come from [this](https://www.kaggle.com/competitions/fake-news/data) Kaggle competition. The key file is ```train.csv```, a labeled dataset of 20,800 news articles. The ```test.csv``` file does not contain labels, so I excluded it from this project.
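
As a hedged illustration of how the file could be read into Spark (the path and CSV options below are assumptions, not taken from this repository):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fake-news-load").getOrCreate()

# Article bodies in train.csv are quoted and may span multiple lines,
# so multiLine and escape handling are typically needed (hypothetical path).
train_df = (spark.read
            .option("header", True)
            .option("multiLine", True)
            .option("escape", '"')
            .csv("data/train.csv"))

train_df.printSchema()
print(train_df.count())  # expected: 20,800 labeled articles
```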

#### Process
The project is divided into two main parts and uses PySpark:

1. Build a text cleaning and preprocessing pipeline
   1. Data cleaning
* **Unreliable (fake) articles:**
<img src="output/fake_topics.png?raw=true"/>
<br>

* **Reliable (real) articles:**
<img src="output/real_topics.png?raw=true"/>
