# N-gram Language Model Name Generator

## Overview

This project implements an N-gram language model for generating unique names based on patterns learned from a dataset of existing names. It includes a Streamlit web application for interactive exploration of the model's capabilities and visualization of its internal workings.

## Features

- N-gram language model implementation
- Hyperparameter tuning for optimal model performance
- Interactive web interface for name generation and model exploration
- Visualization of model probabilities using heatmaps
- Step-by-step name generation process breakdown

## Project Structure

```
.
├── app.py
├── data
│   ├── names.txt
│   ├── preprocess.py
│   ├── test.txt
│   ├── train.txt
│   └── val.txt
├── ngram.py
├── poetry.lock
└── pyproject.toml
```

- `app.py`: Streamlit web application for interacting with the model
- `ngram.py`: Core implementation of the N-gram language model; a C implementation is also available
- `data/`: Directory containing the dataset and preprocessing script
- `poetry.lock` & `pyproject.toml`: Poetry dependency management files

## Requirements

- Python 3.7+
- Poetry (for dependency management)

## Setup

1. Clone the repository:

   ```shell
   git clone https://github.com/goldenglorys/ngram-lm-in-python.git
   cd ngram-lm-in-python
   ```

2. Install dependencies using Poetry:

   ```shell
   poetry install
   ```

3. Activate the virtual environment:

   ```shell
   poetry shell
   ```

4. Preprocess the data (if needed):

   ```shell
   python data/preprocess.py
   ```

5. Run the Streamlit app:

   ```shell
   streamlit run app.py
   ```

6. Open your web browser and navigate to the URL provided by Streamlit (usually http://localhost:8501).

## Usage

1. In the Streamlit interface, adjust the hyperparameters:
   - Sequence Lengths: list of N-gram lengths to evaluate
   - Smoothings: list of smoothing values to try
   - Random Seed: seed for reproducibility
2. Click "Train Model and Generate Names" to start the process.
3. Explore the results:
   - View the best hyperparameters found
   - Read generated names
   - Analyze the model's performance metrics
   - Examine the probability heatmap
   - Watch the step-by-step name generation process
4. Experiment with different hyperparameters and observe how they affect the model's behavior and output.
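Conceptually, the hyperparameter search above is a plain grid search: train one model per (sequence length, smoothing) pair and keep the configuration with the lowest loss on the validation set. The snippet below is a minimal, self-contained illustration of that idea, not the app's actual code: it uses a character *unigram* model and toy name lists, and `train_unigram`, `avg_nll`, and the data are all hypothetical.

```python
import math
from collections import Counter

def train_unigram(names, smoothing):
    # Character unigram model with add-k smoothing over the training alphabet.
    counts = Counter(ch for name in names for ch in name)
    alphabet = sorted(counts)
    total = sum(counts.values()) + smoothing * len(alphabet)
    return {ch: (counts[ch] + smoothing) / total for ch in alphabet}

def avg_nll(model, names):
    # Average per-character negative log-likelihood on held-out names
    # (lower is better); unseen characters get a tiny floor probability.
    chars = [ch for name in names for ch in name]
    return -sum(math.log(model.get(ch, 1e-12)) for ch in chars) / len(chars)

# Toy data; the real app reads data/train.txt and data/val.txt instead.
train_names, val_names = ["anna", "ella"], ["alla"]
results = {s: avg_nll(train_unigram(train_names, s), val_names)
           for s in [0.01, 0.1, 1.0]}
best_smoothing = min(results, key=results.get)
```

The same loop extends to two dimensions by iterating over sequence lengths as well, which is what the grid search in the app does conceptually.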

## How It Works

1. **Data Preprocessing**: The `preprocess.py` script prepares the name dataset, splitting it into training, validation, and test sets.

2. **Model Training**: The N-gram model (`ngram.py`) is trained on the preprocessed data, learning the statistical patterns of character sequences in names.

3. **Hyperparameter Tuning**: The app performs a grid search over the specified sequence lengths and smoothing values to find the optimal configuration.

4. **Name Generation**: Using the trained model, new names are generated by sampling from the learned probability distributions.

5. **Visualization**: The app creates various visualizations to help users understand the model's internal workings and decision-making process.
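The core of steps 2 and 4 fits in a few lines. The bigram (N=2) sketch below is a simplified stand-in for `ngram.py`, not its actual code: it counts character pairs with `.` as a start/end marker, applies add-k smoothing, and samples new names from the resulting row distributions. The function names here are hypothetical.

```python
import random
from collections import defaultdict

def train_bigram(names, smoothing=1.0):
    # '.' marks both the start and the end of a name.
    alphabet = sorted({ch for name in names for ch in name} | {'.'})
    counts = defaultdict(lambda: defaultdict(float))
    for name in names:
        chars = ['.'] + list(name) + ['.']
        for a, b in zip(chars, chars[1:]):
            counts[a][b] += 1
    # Add-k smoothing, then normalise each row into a probability distribution.
    probs = {}
    for a in alphabet:
        row = [counts[a][b] + smoothing for b in alphabet]
        total = sum(row)
        probs[a] = dict(zip(alphabet, (c / total for c in row)))
    return probs, alphabet

def generate(probs, alphabet, rng, max_len=20):
    # Sample one character at a time until the end marker '.' is drawn.
    out, ch = [], '.'
    while len(out) < max_len:
        weights = [probs[ch][b] for b in alphabet]
        ch = rng.choices(alphabet, weights=weights)[0]
        if ch == '.':
            break
        out.append(ch)
    return ''.join(out)
```

The smoothed `probs` table is also exactly the matrix the app's heatmap visualizes: one row per context character, one column per next character.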

## Customization

- To use your own dataset, replace the contents of `data/names.txt` with your desired names (one per line) and run the preprocessing script.
- Modify the hyperparameter ranges in `app.py` to explore different model configurations.
- Extend `ngram.py` to implement additional language model features or alternative algorithms.
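If you swap in your own `data/names.txt`, the preprocessing step only needs to produce the three split files. One plausible way to do the split is a deterministic shuffle followed by an 80/10/10 partition; the actual logic lives in `data/preprocess.py` and may differ from this sketch:

```python
import random

def split_names(names, seed=42):
    # Shuffle deterministically, then split 80/10/10 into train/val/test.
    names = list(names)
    random.Random(seed).shuffle(names)
    n_train = int(0.8 * len(names))
    n_val = int(0.1 * len(names))
    return (names[:n_train],
            names[n_train:n_train + n_val],
            names[n_train + n_val:])
```

Seeding the shuffle keeps the split reproducible, so the validation loss stays comparable across runs with different hyperparameters.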

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Acknowledgments

- Inspired by Andrej Karpathy's work on N-gram models
- Built with Streamlit for an interactive web experience
- Visualization techniques adapted from various data science and machine learning resources

Happy name generating!