This repository is part of the thesis project for the MA Linguistics (Human Language Technology). The project focuses on cleaning up Wordnet Bahasa by comparing automatically aligned dictionary data (OPUS) with hand-curated dictionary data (Wiktionary) using the multilingual sense intersection (MSI) methodology. The repository contains tools and scripts for performing sense alignment and evaluation with various datasets and algorithms, organized into subfolders of Jupyter notebooks and Python scripts.
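The core idea of multilingual sense intersection is that a candidate sense becomes more plausible when translations in several languages independently point to it. The repository's own implementation lives in the intersection-algorithm notebooks; the snippet below is only a minimal sketch of the idea, with made-up synset IDs and a simple "supported by at least two languages" threshold as assumptions:

```python
from collections import Counter

def intersect_senses(candidates_per_language, min_support=2):
    """Keep candidate synsets proposed by at least `min_support` languages.

    `candidates_per_language` maps a language code to the set of candidate
    synset IDs suggested by that language's translation (hypothetical IDs;
    the actual notebooks define their own conditions and thresholds)."""
    counts = Counter()
    for synsets in candidates_per_language.values():
        counts.update(synsets)
    return {sense for sense, n in counts.items() if n >= min_support}

# Fabricated candidates for one Malay word:
candidates = {
    "eng": {"bank.n.01", "bank.n.02"},
    "fra": {"bank.n.01"},
    "ita": {"bank.n.01"},
}
intersect_senses(candidates)  # {"bank.n.01"}
```

Only `bank.n.01` survives here, because it is the only synset suggested by more than one language.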
Due to limited storage space on GitHub, the complete repository, along with all of its data, can also be downloaded from this Google Drive.
The JSON file for the Wiktionary extraction, compiled by Ylonen (2022), can be downloaded from the kaikki.org website: navigate to the Download section. Once downloaded, move the file into the data folder and rename it kaikki.org-dictionary-English.json.
The opus folder contains data related to OPUS, a collection of parallel corpora. It is organized as follows:
- data stores all the files of the original parallel data from SourceForge.
- predictions_results stores prediction results of the system's sense alignment.
- raw_data_bible contains the Bible (Uedin) corpus.
- raw_data_opensubtitle contains the OpenSubtitles corpus.
- raw_data_qed contains the QED corpus.
- raw_data_tanzil stores the Tanzil corpus.
- 1. opus_extraction.ipynb parses data from the four different corpora.
- 2. intersection_algorithm_opus.ipynb maps senses using the intersection algorithm on the OPUS data.
- 3. opus_algorithm_based_conditions.ipynb applies five filtering conditions to the mapped senses.
- 4. opus_token_count.ipynb counts the number of tokens in the OPUS corpus for data analysis.
- 5. opus_languages_intersection_analysis.ipynb counts the number of language intersections in the OPUS corpus.
- 6. data_analysis.ipynb counts the number of senses suggested by one or two languages in the development set for data analysis.
- 7. error_analysis.ipynb counts senses suggested by certain languages based on evaluation set results for error analysis.
- 8. evaluation_set.ipynb generates matching between OPUS and the evaluation set and runs the intersection algorithm using the best condition (condition 5).
- classification_report_generator.py generates a classification report in the predictions_results folder.
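The internals of classification_report_generator.py are not shown in this README. As a rough illustration of what a per-label report involves, here is a self-contained sketch computing precision, recall, and F1; the label names B and I are borrowed from the goodness labels mentioned in the wn-msa-all section, and everything else is hypothetical:

```python
def per_label_report(gold, predicted):
    """Minimal per-label precision/recall/F1, a pure-Python stand-in for
    the kind of report the actual script produces (its real output format
    and metrics are defined in the script itself)."""
    labels = sorted(set(gold) | set(predicted))
    report = {}
    for label in labels:
        tp = sum(1 for g, p in zip(gold, predicted) if g == p == label)
        fp = sum(1 for g, p in zip(gold, predicted) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, predicted) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = (precision, recall, f1)
    return report

gold = ["B", "B", "I", "I"]
pred = ["B", "I", "I", "I"]
report = per_label_report(gold, pred)
```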
The wiktionary folder contains data related to the Wiktionary dataset. It is structured as follows:
- data stores all the files of the original parallel data from SourceForge.
- predictions_results stores prediction results of the system's sense alignment.
- 1. wiktionary_extraction.ipynb parses data from the Wiktionary JSON file downloaded from kaikki.org.
- 2. intersection_algorithm_wiktionary.ipynb maps senses using the intersection algorithm on the Wiktionary data.
- 3. wiktionary_algorithm_based_conditions.ipynb applies five filtering conditions to the mapped senses.
- 4. wiktionary_token_count.ipynb counts the number of tokens in the Wiktionary dataset for data analysis.
- 5. wiktionary_languages_intersection_analysis.ipynb counts the number of language intersections in the Wiktionary dataset.
- 6. POS_analysis.ipynb counts the number of parts of speech (POS) in Wiktionary for data analysis.
- 7. data_analysis.ipynb counts the number of senses suggested by one or two languages in the development set for data analysis.
- 8. evaluation_set.ipynb generates matching between Wiktionary and the evaluation set and runs the intersection algorithm using the best condition (condition 5).
- 9. error_analysis.ipynb counts senses suggested by certain languages based on evaluation set results for error analysis.
- classification_report_generator.py generates a classification report in the predictions_results folder.
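The kaikki.org dump is distributed as JSON Lines (one entry per line), so it can be streamed without loading the whole file into memory, which is how 1. wiktionary_extraction.ipynb can stay within modest RAM. The sketch below illustrates the pattern; the field names ("word", "pos", "translations", "code") follow the wiktextract schema but should be treated as assumptions and checked against a few lines of the downloaded file:

```python
import json

def iter_malay_translations(lines):
    """Yield (word, pos, Malay translation) triples from wiktextract
    JSON-Lines entries. Field names are assumptions based on the
    wiktextract schema; verify them against the actual dump."""
    for line in lines:
        entry = json.loads(line)
        for tr in entry.get("translations", []):
            if tr.get("code") == "ms":
                yield entry.get("word"), entry.get("pos"), tr.get("word")

# A fabricated sample entry standing in for one line of the dump;
# in practice, pass an open file handle instead of a list.
sample = ['{"word": "house", "pos": "noun", '
          '"translations": [{"code": "ms", "word": "rumah"}]}']
list(iter_malay_translations(sample))  # [("house", "noun", "rumah")]
```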
The wn-msa-all folder contains the main file and additional data from the Wordnet Bahasa maintainers. It is structured as follows:
- data stores all the files of the original parallel data from SourceForge.
- predictions_results stores prediction results of the system's sense alignment.
- 1. development_evaluation_sets.ipynb builds development and evaluation sets.
- 2. goodness_labels_extraction.ipynb extracts entries labelled B and I from the main data and attaches these goodness labels to the development and evaluation sets.
- 3. combined_datasets_on_eval.ipynb combines both Wiktionary and OPUS data and runs the combined dataset on the evaluation set using the best condition. Generates a classification report.
- 4. wn-msa_error_analysis.ipynb suggests senses for the main data using Wiktionary and OPUS data respectively for the final error analysis step. Takes 150 random samples from each data source.
- 5. error_analysis_hand-checked.ipynb obtains glosses of the 150 random samples and generates accuracy scores for both Wiktionary and OPUS for the last step of error analysis.
- senses.ipynb checks the senses, gloss, hypernyms, and hyponyms of a synset in WordNet for further data analysis.
- classification_report_generator.py generates a classification report in the predictions_results folder.
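1. development_evaluation_sets.ipynb builds the two held-out sets listed above. The README does not state the split ratio or seed, so the sketch below is only a generic illustration of a reproducible shuffle-and-split; the 50/50 ratio and seed 42 are assumptions, not the notebook's actual settings:

```python
import random

def split_dev_eval(items, dev_fraction=0.5, seed=42):
    """Reproducibly shuffle labelled items and split them into a
    development set and an evaluation set. Ratio and seed are
    illustrative assumptions, not taken from the notebook."""
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * dev_fraction)
    return shuffled[:cut], shuffled[cut:]

dev, evaluation = split_dev_eval(range(10))
```

Fixing the seed matters here: both the OPUS and the Wiktionary pipelines are evaluated against the same sets, so the split must be identical across runs.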
The requirements file lists all the packages needed to run the code.
To use this repository, first clone or download it to your local machine. Then, navigate to the respective folders based on your requirements. Each folder contains notebooks or scripts for specific tasks related to data analysis and intersection algorithms. Follow the instructions provided in the notebooks or scripts to execute the necessary operations.
If you have any questions or encounter any issues, please feel free to reach out to the project owner at [email protected].
The licence of the repository is provided in the licence file.
H. B. M. Noor, S. Sapuan, and F. Bond. Creating the open Wordnet Bahasa. In Proceedings of the 25th Pacific Asia Conference on Language, Information and Computation, pages 255–264, Singapore, Dec. 2011. Institute of Digital Enhancement of Cognitive Processing, Waseda University.
F. Bond and G. Bonansinga. Exploring Cross-Lingual Sense Mapping in a Multilingual Parallel Corpus, pages 56–61. 2015. ISBN 9788899200008. doi: 10.4000/books.aaccademia.1321.
F. Kratochvil and L. Morgado da Costa. Abui Wordnet: Using a toolbox dictionary to develop a wordnet for a low-resource language. In Proceedings of the First Workshop on NLP Applications to Field Linguistics, pages 54–63, Gyeongju, Republic of Korea, Oct. 2022. International Conference on Computational Linguistics.
J. Tiedemann. Parallel data, tools and interfaces in OPUS. In N. Calzolari, K. Choukri, T. Declerck, M. U. Dogan, B. Maegaard, J. Mariani, J. Odijk, and S. Piperidis, editors, LREC, pages 2214–2218. European Language Resources Association (ELRA), 2012. ISBN 978-2-9517408-7-7.
T. Ylonen. Wiktextract: Wiktionary as machine-readable structured data. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 1317–1325, Marseille, France, June 2022. European Language Resources Association.