Zvirevo is a project aimed at bridging the digital divide for Shona speakers through advanced natural language processing techniques. We've developed word embeddings for the Shona language using Word2Vec and FastText models, enabling powerful word similarity and analogy calculations.
This project was developed by:
- Daisy Tsenesa
- Ruvarashe Sadya
- Word embeddings for the Shona language
- Trained using Word2Vec and FastText algorithms
- Supports word similarity calculations
- Enables analogy computations (e.g., "mambo" - "murume" + "mukadzi" = "mambokadzi")
- API deployed on Google Cloud Run for easy integration
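The analogy feature above rests on vector arithmetic: subtract one word vector, add another, and pick the nearest remaining word by cosine similarity. The sketch below illustrates the idea with invented two-dimensional toy vectors. The values and the `analogy` helper are assumptions made for illustration only; they are not taken from the trained Zvirevo models.

```python
import math

# Invented 2-D toy vectors (axes: royalty, gender). Real embeddings are
# learned from text and have many more dimensions.
TOY_VECTORS = {
    "mambo": [1.0, 1.0],
    "murume": [0.0, 1.0],
    "mukadzi": [0.0, -1.0],
    "mambokadzi": [1.0, -1.0],
    "mwana": [0.2, 0.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy(a, b, c, vectors):
    """Return the word closest to vectors[a] - vectors[b] + vectors[c]."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # The three input words are excluded from the candidates.
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("mambo", "murume", "mukadzi", TOY_VECTORS))  # prints "mambokadzi"
```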
Our word embeddings were trained on a corpus of Shona text collected from various sources:
- Shona Dictionary: A comprehensive source of formal Shona text.
- Shona Language Corpus: Kwayedza news articles and randomly chosen websites covering a wide range of topics in Shona.
- Belebele Dataset: A multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants.
- VOA News: A collection of Shona news articles and short stories.
- Shona Bible and Quran
We are grateful to these sources for providing the rich textual data necessary for training our models.
Shona, spoken by over 80% of Zimbabwe's population, is underrepresented in modern NLP technologies. Zvirevo aims to change that by providing foundational NLP resources for Shona, paving the way for more advanced applications like chatbots and virtual assistants in the future.
Our API allows developers to easily integrate Shona language processing into their applications. Use it for:
- Improving search relevance for Shona content
- Enhancing machine translation systems
- Developing educational tools for Shona language learners
- Powering recommendation systems for Shona-language e-commerce platforms
The Zvirevo API is accessible at: https://word2vec-app-tk6uqeiatq-uc.a.run.app/
- Word Similarity
  - Endpoint: `/similar`
  - Method: POST
  - Request Body: `{ "word": "baba" }`
  - Response: `{ "similar words": "[["tete", 0.83], ["vamwene", 0.81]]" }`
- Word Analogy
  - Endpoint: `/word-analogy`
  - Method: POST
  - Request Body: `{ "word1": "mambo", "word2": "murume", "word3": "mukadzi" }`
  - Response: `{ "result": "mambokadzi" }`
Using Python with the requests library:
import requests

# No trailing slash here, so the f-strings below do not produce "//similar"
API_URL = "https://word2vec-app-tk6uqeiatq-uc.a.run.app"

# Word Similarity
similarity_response = requests.post(f"{API_URL}/similar", json={
    "word": "murume"
})
print(similarity_response.json())

# Word Analogy
analogy_response = requests.post(f"{API_URL}/word-analogy", json={
    "word1": "mambo",
    "word2": "murume",
    "word3": "mukadzi"
})
print(analogy_response.json())
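The sample response for `/similar` shows the list of neighbours wrapped in quotes, which suggests the value may arrive as a JSON-encoded string rather than as a list. The helper below is a minimal sketch that accepts either form. The function name `parse_similar` is hypothetical, and the exact response shape should be checked against the live API.

```python
import json


def parse_similar(payload):
    """Return [(word, score), ...] from a /similar response body.

    Assumes the list sits under "similar words" (or "similar_words") and
    may be either a real list or a JSON-encoded string.
    """
    raw = payload.get("similar words", payload.get("similar_words", []))
    if isinstance(raw, str):
        raw = json.loads(raw)  # decode the string form into a list
    return [(word, float(score)) for word, score in raw]


# Example with a hand-written payload (no network call is made here)
sample = {"similar words": '[["tete", 0.83], ["vamwene", 0.81]]'}
print(parse_similar(sample))
```

With the client code above, this would be called as `parse_similar(similarity_response.json())`.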
- Python 3.7+
- pip
- Docker (for containerization)
- Google Cloud SDK (for deployment to Google Cloud Run)
- Clone the repository:
    git clone https://github.com/your-username/zvirevo_word-embeddings.git
    cd zvirevo_word-embeddings
- Create a virtual environment and activate it:
    python -m venv venv
    source venv/bin/activate # On Windows, use `venv\Scripts\activate`
- Install dependencies:
    pip install -r requirements.txt
- Run the application locally:
    python app.py
  The application should now be running on http://localhost:5000.
- Build the Docker image:
    docker build -t gcr.io/main-duality-431514-n3/word2vec-app .
- Push the image to Google Container Registry:
    docker push gcr.io/main-duality-431514-n3/word2vec-app
- Deploy to Cloud Run:
    gcloud run deploy --image gcr.io/main-duality-431514-n3/word2vec-app --platform managed
Follow the prompts to complete the deployment. Once finished, Google Cloud Run will provide a URL where your application is accessible.
- Retrieve similar words using Word2Vec or FastText models.
- Compute word relationships based on provided expressions.
- Send email notifications via a contact form.
- Clone the repository:
    git lfs install
    git clone https://huggingface.co/dkt-py-bot/zvirevo_word2vec
    cd zvirevo_word2vec
- Create and activate a virtual environment:
    python3 -m venv venv
    source venv/bin/activate # On Windows, use `venv\Scripts\activate`
- Install the dependencies:
    pip install -r requirements.txt
- Download and place your pre-trained models:
  - Place your Word2Vec model (`w2v_shona2.model`) and FastText model (`ft_model1.model`) in the project directory.
- Configure Flask-Mail:
  - Update the email configuration in the `app.py` file with your email credentials.
- Run the Flask app:
    flask run
- Open your browser and navigate to:
    http://127.0.0.1:5000
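The setup steps above expect two model files in the project directory. A small pre-flight check can catch a missing file before the app starts. This is a sketch only: the `missing_models` helper is hypothetical and not part of the project, and it assumes the file names given in the steps above.

```python
from pathlib import Path

# File names taken from the setup steps; the dictionary keys are illustrative.
EXPECTED_MODELS = {
    "word2vec": "w2v_shona2.model",
    "fasttext": "ft_model1.model",
}


def missing_models(project_dir="."):
    """Return the expected model file names not found in project_dir."""
    root = Path(project_dir)
    return [name for name in EXPECTED_MODELS.values()
            if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_models()
    if missing:
        print("Missing model files:", ", ".join(missing))
    else:
        print("All model files found.")
```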
- URL: `/`
  - Method: GET
  - Description: Renders the home page.
- URL: `/get_similar_words`
  - Method: POST
  - Description: Retrieves similar words using the selected model.
  - Request Body: `{ "word": "example", "model_type": "word2vec", "top_n": 10 }`
  - Response: `{ "similar_words": [["word1", 0.85], ["word2", 0.80], ...] }`
- URL: `/compute`
  - Method: POST
  - Description: Computes word relationships based on provided expressions.
  - Request Body: `{ "expression": "mwana - vakomana + vasikana", "model": "word2vec", "top_n": 5 }`
  - Response: `{ "result": [{"word": "mukadzi", "similarity": 0.85}, ...] }`
- URL: `/submit`
  - Method: POST
  - Description: Sends an email notification with the contact form details.
  - Form Data: `name`, `email`, `message`, `subscribe`
  - Response: `{ "status": "success", "message": "Your message has been received and the email has been sent!" }`
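The `/compute` endpoint takes an expression such as "mwana - vakomana + vasikana". One plausible way to handle it is to split the expression into words to add and words to subtract, which maps directly onto the `positive` and `negative` arguments of Gensim's `most_similar`. The sketch below shows only the parsing step. Whether `app.py` parses expressions this way is an assumption, and `parse_expression` is a hypothetical name.

```python
import re


def parse_expression(expression):
    """Split 'a - b + c' into (positive_words, negative_words).

    A word is negative only when directly preceded by '-'; every other
    word, including the first, counts as positive.
    """
    tokens = re.findall(r"[+-]|[^\s+-]+", expression)
    positive, negative = [], []
    sign = "+"
    for token in tokens:
        if token in ("+", "-"):
            sign = token
        else:
            (positive if sign == "+" else negative).append(token)
            sign = "+"  # each word needs its own explicit '-'
    return positive, negative


positive, negative = parse_expression("mwana - vakomana + vasikana")
print(positive, negative)
```

With a loaded Gensim model, the two lists could then be passed as `model.wv.most_similar(positive=positive, negative=negative, topn=5)`.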
- Email Configuration: Update the email server, port, username, and password in the `app.py` file.
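As an alternative to hard-coding credentials in `app.py`, the mail settings could be read from environment variables so that passwords stay out of version control. The sketch below builds a dictionary using Flask-Mail's standard configuration keys. The environment variable names, the default values, and the `mail_config_from_env` helper are assumptions for illustration, not part of the project.

```python
import os


def mail_config_from_env(env=None):
    """Build Flask-Mail settings from environment variables.

    The variable names mirror the Flask-Mail config keys; the defaults
    are placeholders and should be replaced with real values.
    """
    env = os.environ if env is None else env
    return {
        "MAIL_SERVER": env.get("MAIL_SERVER", "smtp.example.com"),
        "MAIL_PORT": int(env.get("MAIL_PORT", "587")),
        "MAIL_USE_TLS": env.get("MAIL_USE_TLS", "true").lower() == "true",
        "MAIL_USERNAME": env.get("MAIL_USERNAME"),
        "MAIL_PASSWORD": env.get("MAIL_PASSWORD"),
    }


print(sorted(mail_config_from_env({}).keys()))
```

In a Flask app, the result could be applied with `app.config.update(mail_config_from_env())` before the `Mail` object is created.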
- Flask
- Gensim
- Flask-Mail
- Numpy
While currently focused on word embeddings, we envision Zvirevo as a stepping stone towards more complex NLP tasks in Shona, including sentiment analysis, named entity recognition, and, eventually, full-fledged conversational AI.
We welcome contributions! Email Daisy at [email protected] or Ruvarashe at [email protected].
This project is licensed under the MIT License - see the LICENSE file for details.
Join us in our mission to bring the Shona language into the digital age!