
Zvirevo: Shona Language Word Embeddings

Zvirevo is a project aimed at bridging the digital divide for Shona speakers through advanced natural language processing techniques. We've developed word embeddings for the Shona language using Word2Vec and FastText models, enabling powerful word similarity and analogy calculations.

Creators

This project was developed by:

Daisy Tsenesa
Ruvarashe Sadya

Key Features

- Word embeddings for the Shona language
- Trained using Word2Vec and FastText algorithms
- Supports word similarity calculations
- Enables analogy computations (e.g., "mambo" - "murume" + "mukadzi" = "mambokadzi")
- API deployed on Google Cloud Run for easy integration
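
To make the similarity and analogy features concrete, here is a minimal sketch of the underlying vector arithmetic. The three-dimensional vectors are invented purely for illustration; the real embeddings are learned from the corpus and have many more dimensions, and the function names are hypothetical rather than part of the project's code.

```python
import math

# Toy vectors, invented for illustration only (not real Zvirevo embeddings).
TOY_VECTORS = {
    "mambo": [0.9, 0.8, 0.1],
    "murume": [0.1, 0.9, 0.1],
    "mukadzi": [0.1, 0.1, 0.9],
    "mambokadzi": [0.9, 0.1, 0.9],
    "mwana": [0.2, 0.5, 0.5],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


print(analogy("mambo", "murume", "mukadzi", TOY_VECTORS))  # mambokadzi
```

Word similarity uses the same `cosine` score: the closer the value is to 1, the more similar the two words are in the embedding space.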

Data Sources

Our word embeddings were trained on a corpus of Shona text collected from various sources:

- Shona Dictionary: A comprehensive source of formal Shona text.
- Shona Language Corpus: Kwayedza news articles and randomly chosen websites covering a wide range of topics in Shona.
- Belebele Dataset: A multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants.
- VOA News: A collection of Shona news articles and short stories.
- Shona Bible and Quran

We are grateful to these sources for providing the rich textual data necessary for training our models.

Why Zvirevo?

Shona, spoken by over 80% of Zimbabwe's population, is underrepresented in modern NLP technologies. Zvirevo aims to change that by providing foundational NLP resources for Shona, paving the way for more advanced applications like chatbots and virtual assistants in the future.

Usage

Our API allows developers to easily integrate Shona language processing into their applications. Use it for:

- Improving search relevance for Shona content
- Enhancing machine translation systems
- Developing educational tools for Shona language learners
- Powering recommendation systems for Shona-language e-commerce platforms

API Documentation

The Zvirevo API is accessible at: https://word2vec-app-tk6uqeiatq-uc.a.run.app/

Endpoints

Word Similarity

Endpoint: /similar
Method: POST
Request Body:
```
{
  "word": "baba"
}
```

Response:

```
{
  "similar words": [["tete", 0.83], ["vamwene", 0.81]]
}
```
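
Client code may want to guard against the `similar words` value arriving either as a JSON array or as a string-encoded list; which form the deployed API returns is not confirmed here. The helper below is a small sketch that accepts both. The name `parse_similar` is hypothetical, and the string branch assumes the string holds valid JSON.

```python
import json


def parse_similar(payload):
    """Return a list of (word, score) tuples from a /similar response body."""
    value = payload.get("similar words", [])
    if isinstance(value, str):
        # Assumption: a string value holds a JSON-encoded list of pairs.
        value = json.loads(value)
    return [(str(word), float(score)) for word, score in value]


print(parse_similar({"similar words": [["tete", 0.83], ["vamwene", 0.81]]}))
```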

Word Analogy

Endpoint: /word-analogy
Method: POST

Request Body:

```
{
  "word1": "mambo",
  "word2": "murume",
  "word3": "mukadzi"
}
```

Response:
```
{
  "result": "mambokadzi"
}
```

Example Usage

Using Python with the requests library:

```
import requests

# No trailing slash here, so the endpoint paths below join without a double slash.
API_URL = "https://word2vec-app-tk6uqeiatq-uc.a.run.app"

# Word Similarity: POST a single word and print the JSON response
similarity_response = requests.post(f"{API_URL}/similar", json={
    "word": "murume"
})
print(similarity_response.json())

# Word Analogy: word1 - word2 + word3
analogy_response = requests.post(f"{API_URL}/word-analogy", json={
    "word1": "mambo",
    "word2": "murume",
    "word3": "mukadzi"
})
print(analogy_response.json())
```
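
If you would rather avoid a third-party dependency, the same requests can be built with the standard library. The sketch below only constructs the request object and does not send it; `build_post` is a hypothetical helper name, and it normalises slashes so that a base URL ending in `/` still yields a clean endpoint URL.

```python
import json
import urllib.request

BASE_URL = "https://word2vec-app-tk6uqeiatq-uc.a.run.app/"


def build_post(base_url, path, payload):
    """Build (but do not send) a JSON POST request for an API endpoint."""
    # Join base and path with exactly one slash between them.
    url = base_url.rstrip("/") + "/" + path.lstrip("/")
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_post(
    BASE_URL,
    "/word-analogy",
    {"word1": "mambo", "word2": "murume", "word3": "mukadzi"},
)
print(request.full_url)
```

To actually send it, pass the object to `urllib.request.urlopen(request, timeout=10)` and decode the response body with `json.loads`.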

Installation and Setup

Prerequisites

- Python 3.7+
- pip
- Docker (for containerization)
- Google Cloud SDK (for deployment to Google Cloud Run)

Local Development

Clone the repository:

```
git clone https://github.com/your-username/zvirevo_word-embeddings.git
cd zvirevo_word-embeddings
```

Create a virtual environment and activate it:

```
python -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
```

Install dependencies:
```
pip install -r requirements.txt
```
Run the application locally:
```
python app.py
```

The application should now be running on http://localhost:5000.

Deployment to Google Cloud Run

Build the Docker image:

```
docker build -t gcr.io/main-duality-431514-n3/word2vec-app .
```

Push the image to Google Container Registry:

```
docker push gcr.io/main-duality-431514-n3/word2vec-app
```

Deploy to Cloud Run:

```
gcloud run deploy --image gcr.io/main-duality-431514-n3/word2vec-app --platform managed
```

Follow the prompts to complete the deployment. Once finished, Google Cloud Run will provide a URL where your application is accessible.

Flask App Documentation

Features

- Retrieve similar words using Word2Vec or FastText models.
- Compute word relationships based on provided expressions.
- Send email notifications via a contact form.

Installation

Clone the repository:

```
git lfs install
git clone https://huggingface.co/dkt-py-bot/zvirevo_word2vec
cd zvirevo_word2vec
```

Create and activate a virtual environment:

```
python3 -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
```

Install the dependencies:
```
pip install -r requirements.txt
```
Download and place your pre-trained models:
- Place your Word2Vec model (w2v_shona2.model) and FastText model (ft_model1.model) in the project directory.
Configure Flask-Mail:
- Update the email configuration in the app.py file with your email credentials.
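
Rather than hard-coding credentials in app.py, one option is to read them from environment variables. The sketch below is an assumption about how that could look, not the project's actual configuration: the keys are Flask-Mail's standard config names, while the default server and port are placeholders.

```python
import os


def mail_settings(env=os.environ):
    """Collect Flask-Mail settings from environment variables.

    The default server and port are placeholders, not the project's real values.
    """
    return {
        "MAIL_SERVER": env.get("MAIL_SERVER", "smtp.example.com"),
        "MAIL_PORT": int(env.get("MAIL_PORT", "587")),
        "MAIL_USE_TLS": env.get("MAIL_USE_TLS", "true").lower() == "true",
        "MAIL_USERNAME": env.get("MAIL_USERNAME"),
        "MAIL_PASSWORD": env.get("MAIL_PASSWORD"),
    }


# In app.py this could be applied with: app.config.update(mail_settings())
print(sorted(mail_settings({}).keys()))
```

Keeping the password out of the source file also avoids committing it to the repository by accident.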

Usage

Run the Flask app:
```
flask run
```
Open your browser and navigate to:
```
http://127.0.0.1:5000
```

Endpoints

Home

URL: /
Method: GET
Description: Renders the home page.

Get Similar Words

URL: /get_similar_words
Method: POST
Description: Retrieves similar words using the selected model.

Request Body:

```
{
  "word": "example",
  "model_type": "word2vec",
  "top_n": 10
}
```

Response:

```
{
  "similar_words": [["word1", 0.85], ["word2", 0.80], ...]
}
```
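
Before sending a request to this endpoint, a client can check the body against the fields shown above. This is a hedged sketch, not the server's own validation: the function name is hypothetical, `"fasttext"` is an assumed value for `model_type` (only `"word2vec"` appears in the example), and the defaults mirror the example request.

```python
ALLOWED_MODELS = {"word2vec", "fasttext"}  # "fasttext" is an assumed value


def validate_similar_request(payload):
    """Return a cleaned request body or raise ValueError."""
    word = str(payload.get("word") or "").strip()
    if not word:
        raise ValueError("'word' is required")
    model_type = payload.get("model_type", "word2vec")
    if model_type not in ALLOWED_MODELS:
        raise ValueError(f"unknown model_type: {model_type!r}")
    top_n = int(payload.get("top_n", 10))
    if top_n < 1:
        raise ValueError("'top_n' must be at least 1")
    return {"word": word, "model_type": model_type, "top_n": top_n}


print(validate_similar_request({"word": "baba", "top_n": 5}))
```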

Compute Word Relationships

URL: /compute
Method: POST
Description: Computes word relationships based on provided expressions.

Request Body:

```
{
  "expression": "mwana - vakomana + vasikana",
  "model": "word2vec",
  "top_n": 5
}
```

Response:

```
{
  "result": [{"word": "mukadzi", "similarity": 0.85}, ...]
}
```
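
One plausible way to interpret an expression like the one above is to split it into words that are added and words that are subtracted. The sketch below is an assumption about the parsing step, not the app's actual code, and it expects operators to be separated by spaces. The two lists have the shape that Gensim's `KeyedVectors.most_similar(positive=..., negative=..., topn=...)` accepts.

```python
def parse_expression(expression):
    """Split 'a - b + c' into (positive, negative) word lists."""
    positive, negative = [], []
    sign = "+"  # the first word is treated as added
    for token in expression.split():
        if token in ("+", "-"):
            sign = token
        elif sign == "+":
            positive.append(token)
        else:
            negative.append(token)
    return positive, negative


positive, negative = parse_expression("mwana - vakomana + vasikana")
print(positive, negative)  # ['mwana', 'vasikana'] ['vakomana']
```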

Submit Contact Form

URL: /submit
Method: POST
Description: Sends an email notification with the contact form details.
Form Data:
- name
- email
- message
- subscribe

Response:

```
{
  "status": "success",
  "message": "Your message has been received and the email has been sent!"
}
```

Configuration

Email Configuration: Update the email server, port, username, and password in the app.py file.

Dependencies

- Flask
- Gensim
- Flask-Mail
- NumPy

Future Directions

While currently focused on word embeddings, we envision Zvirevo as a stepping stone towards more complex NLP tasks in Shona, including sentiment analysis, named entity recognition, and, eventually, full-fledged conversational AI.

Contributing

We welcome contributions! Email us at [email protected] and [email protected] for Daisy and Ruvarashe, respectively.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Join us in our mission to bring the Shona language into the digital age!
