# N-gram Language Model Name Generator

## Overview

This project implements an N-gram language model for generating unique names based on patterns learned from a dataset of existing names. It includes a Streamlit web application for interactive exploration of the model's capabilities and visualization of its internal workings.

## Features

- N-gram language model implementation
- Hyperparameter tuning for optimal model performance
- Interactive web interface for name generation and model exploration
- Visualization of model probabilities using heatmaps
- Step-by-step name generation process breakdown

## Project Structure

```
.
├── app.py
├── data
│   ├── names.txt
│   ├── preprocess.py
│   ├── test.txt
│   ├── train.txt
│   └── val.txt
├── ngram.py
├── poetry.lock
└── pyproject.toml
```

- `app.py`: Streamlit web application for interacting with the model
- `ngram.py`: Core implementation of the N-gram language model; a C implementation is also available
- `data/`: Directory containing the dataset and preprocessing script
- `poetry.lock` & `pyproject.toml`: Poetry dependency management files

## Requirements

- Python 3.7+
- Poetry (for dependency management)

## Setup

1. Clone the repository:

   ```shell
   git clone https://github.com/goldenglorys/ngram-lm-in-python.git
   cd ngram-lm-in-python
   ```

2. Install dependencies using Poetry:

   ```shell
   poetry install
   ```

3. Activate the virtual environment:

   ```shell
   poetry shell
   ```

4. Preprocess the data (if needed):

   ```shell
   python data/preprocess.py
   ```

5. Run the Streamlit app:

   ```shell
   streamlit run app.py
   ```

6. Open your web browser and navigate to the URL provided by Streamlit (usually http://localhost:8501).

## Usage

1. In the Streamlit interface, adjust the hyperparameters:
   - Sequence Lengths: list of N-gram lengths to evaluate
   - Smoothings: list of smoothing values to try
   - Random Seed: seed for reproducibility
2. Click "Train Model and Generate Names" to start the process.
3. Explore the results:
   - View the best hyperparameters found
   - Read generated names
   - Analyze the model's performance metrics
   - Examine the probability heatmap
   - Watch the step-by-step name generation process
4. Experiment with different hyperparameters and observe how they affect the model's behavior and output.
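Conceptually, the hyperparameter search above is a plain grid search: train one model per (sequence length, smoothing) pair and keep the configuration with the lowest loss on the validation set. The snippet below is a minimal, self-contained illustration of that idea, not the app's actual code: it uses a character *unigram* model and toy name lists, and `train_unigram`, `avg_nll`, and the data are all hypothetical.

```python
import math
from collections import Counter

def train_unigram(names, smoothing):
    # Character unigram model with add-k smoothing over the training alphabet.
    counts = Counter(ch for name in names for ch in name)
    alphabet = sorted(counts)
    total = sum(counts.values()) + smoothing * len(alphabet)
    return {ch: (counts[ch] + smoothing) / total for ch in alphabet}

def avg_nll(model, names):
    # Average per-character negative log-likelihood on held-out names
    # (lower is better); unseen characters get a tiny floor probability.
    chars = [ch for name in names for ch in name]
    return -sum(math.log(model.get(ch, 1e-12)) for ch in chars) / len(chars)

# Toy data; the real app reads data/train.txt and data/val.txt instead.
train_names, val_names = ["anna", "ella"], ["alla"]
results = {s: avg_nll(train_unigram(train_names, s), val_names)
           for s in [0.01, 0.1, 1.0]}
best_smoothing = min(results, key=results.get)
```

The same loop extends to two dimensions by iterating over sequence lengths as well, which is what the grid search in the app does conceptually.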

## How It Works

1. **Data Preprocessing**: The `preprocess.py` script prepares the name dataset, splitting it into training, validation, and test sets.

2. **Model Training**: The N-gram model (`ngram.py`) is trained on the preprocessed data, learning the statistical patterns of character sequences in names.

3. **Hyperparameter Tuning**: The app performs a grid search over the specified sequence lengths and smoothing values to find the optimal configuration.

4. **Name Generation**: Using the trained model, new names are generated by sampling from the learned probability distributions.

5. **Visualization**: The app creates various visualizations to help users understand the model's internal workings and decision-making process.
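The core of steps 2 and 4 fits in a few lines. The bigram (N=2) sketch below is a simplified stand-in for `ngram.py`, not its actual code: it counts character pairs with `.` as a start/end marker, applies add-k smoothing, and samples new names from the resulting row distributions. The function names here are hypothetical.

```python
import random
from collections import defaultdict

def train_bigram(names, smoothing=1.0):
    # '.' marks both the start and the end of a name.
    alphabet = sorted({ch for name in names for ch in name} | {'.'})
    counts = defaultdict(lambda: defaultdict(float))
    for name in names:
        chars = ['.'] + list(name) + ['.']
        for a, b in zip(chars, chars[1:]):
            counts[a][b] += 1
    # Add-k smoothing, then normalise each row into a probability distribution.
    probs = {}
    for a in alphabet:
        row = [counts[a][b] + smoothing for b in alphabet]
        total = sum(row)
        probs[a] = dict(zip(alphabet, (c / total for c in row)))
    return probs, alphabet

def generate(probs, alphabet, rng, max_len=20):
    # Sample one character at a time until the end marker '.' is drawn.
    out, ch = [], '.'
    while len(out) < max_len:
        weights = [probs[ch][b] for b in alphabet]
        ch = rng.choices(alphabet, weights=weights)[0]
        if ch == '.':
            break
        out.append(ch)
    return ''.join(out)
```

The smoothed `probs` table is also exactly the matrix the app's heatmap visualizes: one row per context character, one column per next character.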

## Customization

- To use your own dataset, replace the contents of `data/names.txt` with your desired names (one per line) and run the preprocessing script.
- Modify the hyperparameter ranges in `app.py` to explore different model configurations.
- Extend `ngram.py` to implement additional language model features or alternative algorithms.
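If you swap in your own `data/names.txt`, the preprocessing step only needs to produce the three split files. One plausible way to do the split is a deterministic shuffle followed by an 80/10/10 partition; the actual logic lives in `data/preprocess.py` and may differ from this sketch:

```python
import random

def split_names(names, seed=42):
    # Shuffle deterministically, then split 80/10/10 into train/val/test.
    names = list(names)
    random.Random(seed).shuffle(names)
    n_train = int(0.8 * len(names))
    n_val = int(0.1 * len(names))
    return (names[:n_train],
            names[n_train:n_train + n_val],
            names[n_train + n_val:])
```

Seeding the shuffle keeps the split reproducible, so the validation loss stays comparable across runs with different hyperparameters.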

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Acknowledgments

- Inspired by Andrej Karpathy's work on N-gram models
- Built with Streamlit for an interactive web experience
- Visualization techniques adapted from various data science and machine learning resources

Happy name generating!