This repository provides tools for downloading and processing parallel English and Central Kurdish (ckb) sentence data from Tatoeba.
- Directly downloads data from Tatoeba's public datasets.
- Processes and consolidates the English and Central Kurdish sentences into a single dataset.
- Saves the prepared data as a tab-separated (TSV) file, a format most data tools can read.
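Because the output is plain TSV, it can also be read without pandas. The following is a minimal sketch using only the Python standard library. The file name `sample_eng_ckb.tsv`, the presence of a header row, and the sample rows are assumptions made for illustration; the actual file name and location used by the package are not documented here, so the example writes its own small file first and reads it back.

```python
import csv
import os
import tempfile

# Made-up sample rows. The column names follow the DataFrame columns
# described in this README; the values are illustrative only.
SAMPLE_ROWS = [
    {"eng_id": "1", "ckb_id": "101", "eng_sentence": "Hello.",
     "ckb_sentence": "سڵاو.", "eng_username": "alice", "ckb_username": "bob"},
    {"eng_id": "2", "ckb_id": "102", "eng_sentence": "Thank you.",
     "ckb_sentence": "سوپاس.", "eng_username": "alice", "ckb_username": "dana"},
]


def write_tsv(path, rows):
    """Write a list of dicts to a tab-separated file with a header row."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()), delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)


def read_tsv(path):
    """Read a tab-separated file with a header row into a list of dicts."""
    with open(path, encoding="utf-8", newline="") as f:
        return list(csv.DictReader(f, delimiter="\t"))


# Round trip through a temporary file (hypothetical file name).
sample_path = os.path.join(tempfile.mkdtemp(), "sample_eng_ckb.tsv")
write_tsv(sample_path, SAMPLE_ROWS)
loaded = read_tsv(sample_path)
print(len(loaded), "rows loaded")
```

Opening the file with `encoding="utf-8"` matters here, since Central Kurdish text is written in a non-ASCII script.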
Ensure Python 3 is installed, then install the package and its dependencies directly from GitHub with pip (no separate clone is needed):
pip install git+https://github.com/abdulbaseet-zahir/tatoeba-ckb.git
- Import the TatoebaCKB class:
from tatoeba_ckb import TatoebaCKB
- Create an instance of the class:
tatoeba_ckb = TatoebaCKB()
- Get the data:
data = tatoeba_ckb.get_data()
This will download the data if it's not already present, and return a pandas DataFrame containing the parsed data.
The DataFrame contains the following columns:
- eng_id: ID of the English sentence
- ckb_id: ID of the Central Kurdish sentence
- eng_sentence: The English sentence
- ckb_sentence: The Central Kurdish sentence
- eng_username: Username of the English sentence contributor
- ckb_username: Username of the Central Kurdish sentence contributor
Access and manipulate the data using pandas operations:
print(data.head()) # View the first few rows
print(data.shape) # Check the dimensions of the DataFrame
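A few more pandas operations that may be useful on this kind of table are sketched below. To keep the example runnable offline, it builds a small stand-in DataFrame with the columns listed above instead of calling get_data(); the rows and usernames are made up, and the real data will of course differ.

```python
import pandas as pd

# Stand-in for the DataFrame returned by get_data(); values are made up.
data = pd.DataFrame({
    "eng_id": [1, 2, 3],
    "ckb_id": [101, 102, 103],
    "eng_sentence": ["Hello.", "Thank you.", "Hello."],
    "ckb_sentence": ["سڵاو.", "سوپاس.", "سڵاو."],
    "eng_username": ["alice", "alice", "carol"],
    "ckb_username": ["bob", "dana", "bob"],
})

# Keep only the two sentence columns.
pairs = data[["eng_sentence", "ckb_sentence"]]

# Drop exact duplicate sentence pairs.
unique_pairs = pairs.drop_duplicates().reset_index(drop=True)

# Count how many pairs each Central Kurdish contributor provided.
per_contributor = data["ckb_username"].value_counts()

# Whitespace-separated word count of each English sentence.
eng_word_counts = data["eng_sentence"].str.split().str.len()

print(unique_pairs)
print(per_contributor)
print(eng_word_counts.tolist())
```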
Use the data for various tasks, such as:
- Building machine translation models
- Creating language learning resources
- Researching language patterns
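As one illustration of the machine translation use case, the sentence pairs can be split into training and validation sets. This is a minimal sketch rather than a recommended pipeline: `split_pairs`, its `valid_fraction` and `seed` parameters, and the sample rows are hypothetical and not part of this package. In practice the input would be the DataFrame returned by get_data().

```python
import pandas as pd


def split_pairs(df, valid_fraction=0.2, seed=0):
    """Split unique sentence pairs into (train, valid) DataFrames.

    Hypothetical helper for illustration; not provided by tatoeba_ckb.
    """
    pairs = df[["eng_sentence", "ckb_sentence"]].drop_duplicates()
    # Randomly pick the validation rows; a fixed seed keeps the split reproducible.
    valid = pairs.sample(frac=valid_fraction, random_state=seed)
    # Everything not selected for validation goes to training.
    train = pairs.drop(valid.index)
    return train.reset_index(drop=True), valid.reset_index(drop=True)


# Made-up stand-in data with the sentence columns described above.
sample = pd.DataFrame({
    "eng_sentence": ["Hello.", "Thank you.", "Yes.", "No.", "Water."],
    "ckb_sentence": ["سڵاو.", "سوپاس.", "بەڵێ.", "نەخێر.", "ئاو."],
})

train, valid = split_pairs(sample, valid_fraction=0.2, seed=0)
print(len(train), "training pairs,", len(valid), "validation pairs")
```

Deduplicating before splitting avoids the same pair appearing in both sets, which would otherwise inflate validation scores.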
- For more details on available methods, refer to the class documentation within the code.
- If you encounter any issues, please open an issue on the repository.
We welcome contributions to this project! Feel free to submit issues or pull requests.