Text Retrieval and Relevance Ranking

Overview

This project implements various text retrieval and relevance ranking models for analyzing a large dataset of news articles. The assignment is split into two parts:

Part I focuses on parsing text data, building vocabulary, and ranking documents based on different models such as Bit Vector and TF-IDF.
Part II introduces more advanced techniques using Word2Vec for document relevance.

Vector Space Model

Word2Vec Model

The project is implemented in two Python files:

textretrieval.py: Contains implementations for Tasks 1-3 (Text Parsing, Bit Vector Model, TF-IDF Model).
Word2Vec-TFDF.py: Contains implementations for Task 4 (Word2Vec) and Task 5 (extra credit).

Dataset

The dataset used for this project is AG's News Topic Classification Dataset, specifically the test.csv file from this link. The dataset contains news articles, each with a title, description, and class. For this project, we focus on the "description" field.

Requirements

The project is built using Python 3, and the following libraries are required:

Pandas
NumPy
NLTK
Gensim (for Word2Vec)

To install the required dependencies, run:

pip install -r requirements.txt

Instructions

Part I: Text Parsing and Vector Space Models (`textretrieval.py`)

Task 1: Text Data Parsing and Vocabulary Selection
- Cleans and preprocesses the dataset by removing stop-words, punctuation, numbers, HTML tags, and excess whitespaces.
- Builds a vocabulary of the top 200 most frequent words from the dataset.
Task 2: Document Relevance with Bit Vector Model
- Implements a basic Vector Space Model (VSM) using a bit-vector representation.
- Computes relevance scores for documents based on the query.
Task 3: Document Relevance with TF-IDF Model
- Implements the TF-IDF model using Okapi-BM25 (without document length normalization).
- Ranks documents based on the relevance to the provided queries.

Part II: Word2Vec Model (`Word2Vec-TFDF.py`)

Task 4: Document Relevance with Word2Vec
- Uses Word2Vec to compute word relevance based on pre-trained word embeddings.
- Scores documents using the average log-likelihood of Word2Vec embeddings.
Task 5 (Extra Credit): TF-IDF with Document Length Normalization
- Extends the TF-IDF model from Task 3 by adding document length normalization.

How to Run

To execute the text parsing and relevance models:

Run textretrieval.py for Tasks 1-3:
```
python textretrieval.py
```
Run Word2Vec-TFDF.py for Tasks 4-5:
```
python Word2Vec-TFDF.py
```

Each script will process the dataset and output the top 5 most relevant documents and the bottom 5 least relevant documents based on the provided queries.

Queries Tested

The following queries are used to test the models:

Query 1: "olympic gold athens"
Query 2: "reuters stocks friday"
Query 3: "investment market prices"

The results for each query are printed to the console.

Output

For each model, the output includes:

Top 5 most relevant documents.
Bottom 5 least relevant documents.
Relevance scores for each document.

Contact

For any questions regarding the implementation, feel free to contact Letian Jiang.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Text Retrieval and Relevance Ranking

Overview

Vector Space Model

Word2Vec Model

Dataset

Requirements

Instructions

Part I: Text Parsing and Vector Space Models (`textretrieval.py`)

Part II: Word2Vec Model (`Word2Vec-TFDF.py`)

How to Run

Queries Tested

Output

Contact

Files

README.md

Latest commit

History

README.md

File metadata and controls

Text Retrieval and Relevance Ranking

Overview

Vector Space Model

Word2Vec Model

Dataset

Requirements

Instructions

Part I: Text Parsing and Vector Space Models (textretrieval.py)

Part II: Word2Vec Model (Word2Vec-TFDF.py)

How to Run

Queries Tested

Output

Contact

Part I: Text Parsing and Vector Space Models (`textretrieval.py`)

Part II: Word2Vec Model (`Word2Vec-TFDF.py`)