This repository is part of the thesis project for the MA Linguistics (Human Language Technology). The project focuses on cleaning up Wordnet Bahasa by comparing automatically aligned dictionary data (OPUS) with hand-curated dictionary data (Wiktionary) using the multilingual sense intersection (MSI) methodology. The repository contains tools and scripts for performing sense alignment and evaluation with various datasets and algorithms, organized into subfolders of Jupyter notebooks and Python scripts.
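The core idea of multilingual sense intersection is that a candidate sense becomes more plausible when translations in several languages independently point to it. The repository's own implementation lives in the intersection-algorithm notebooks; the snippet below is only a minimal sketch of the idea, with made-up synset IDs and a simple "supported by at least two languages" threshold as assumptions:

```python
from collections import Counter

def intersect_senses(candidates_per_language, min_support=2):
    """Keep candidate synsets proposed by at least `min_support` languages.

    `candidates_per_language` maps a language code to the set of candidate
    synset IDs suggested by that language's translation (hypothetical IDs;
    the actual notebooks define their own conditions and thresholds)."""
    counts = Counter()
    for synsets in candidates_per_language.values():
        counts.update(synsets)
    return {sense for sense, n in counts.items() if n >= min_support}

# Fabricated candidates for one Malay word:
candidates = {
    "eng": {"bank.n.01", "bank.n.02"},
    "fra": {"bank.n.01"},
    "ita": {"bank.n.01"},
}
intersect_senses(candidates)  # {"bank.n.01"}
```

Only `bank.n.01` survives here, because it is the only synset suggested by more than one language.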
Due to limited storage space on GitHub, the complete repository, along with all of its data, can also be downloaded from this Google Drive.
The JSON file for the Wiktionary extraction, compiled by Ylonen (2022), can be downloaded from the kaikki.org website: navigate to the Download section. Once downloaded, move the file into the data folder and rename it kaikki.org-dictionary-English.json.
The opus folder contains data related to OPUS, a collection of parallel corpora. It is organized as follows:
- data stores all the files of the original parallel data from SourceForge.
- predictions_results stores prediction results of the system's sense alignment.
- raw_data_bible contains the Bible (Uedin) corpus.
- raw_data_opensubtitle contains the OpenSubtitles corpus.
- raw_data_qed contains the QED corpus.
- raw_data_tanzil stores the Tanzil corpus.
- 1. opus_extraction.ipynb parses data from the four different corpora.
- 2. intersection_algorithm_opus.ipynb maps senses using the intersection algorithm on the OPUS data.
- 3. opus_algorithm_based_conditions.ipynb applies five filtering conditions to the mapped senses.
- 4. opus_token_count.ipynb counts the number of tokens in the OPUS corpus for data analysis.
- 5. opus_languages_intersection_analysis.ipynb counts the number of language intersections in the OPUS corpus.
- 6. data_analysis.ipynb counts the number of senses suggested by one or two languages in the development set for data analysis.
- 7. error_analysis.ipynb counts senses suggested by certain languages based on evaluation set results for error analysis.
- 8. evaluation_set.ipynb generates matching between OPUS and the evaluation set and runs the intersection algorithm using the best condition (condition 5).
- classification_report_generator.py generates a classification report in the predictions_results folder.
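The internals of classification_report_generator.py are not shown in this README. As a rough illustration of what a per-label report involves, here is a self-contained sketch computing precision, recall, and F1; the label names B and I are borrowed from the goodness labels mentioned in the wn-msa-all section, and everything else is hypothetical:

```python
def per_label_report(gold, predicted):
    """Minimal per-label precision/recall/F1, a pure-Python stand-in for
    the kind of report the actual script produces (its real output format
    and metrics are defined in the script itself)."""
    labels = sorted(set(gold) | set(predicted))
    report = {}
    for label in labels:
        tp = sum(1 for g, p in zip(gold, predicted) if g == p == label)
        fp = sum(1 for g, p in zip(gold, predicted) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, predicted) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = (precision, recall, f1)
    return report

gold = ["B", "B", "I", "I"]
pred = ["B", "I", "I", "I"]
report = per_label_report(gold, pred)
```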
The wiktionary folder contains data related to the Wiktionary dataset. It is structured as follows:
- data stores all the files of the original parallel data from SourceForge.
- predictions_results stores prediction results of the system's sense alignment.
- 1. wiktionary_extraction.ipynb parses data from the Wiktionary JSON file downloaded from kaikki.org.
- 2. intersection_algorithm_wiktionary.ipynb maps senses using the intersection algorithm on the Wiktionary data.
- 3. wiktionary_algorithm_based_conditions.ipynb applies five filtering conditions to the mapped senses.
- 4. wiktionary_token_count.ipynb counts the number of tokens in the Wiktionary dataset for data analysis.
- 5. wiktionary_languages_intersection_analysis.ipynb counts the number of language intersections in the Wiktionary dataset.
- 6. POS_analysis.ipynb counts the number of parts of speech (POS) in Wiktionary for data analysis.
- 7. data_analysis.ipynb counts the number of senses suggested by one or two languages in the development set for data analysis.
- 8. evaluation_set.ipynb generates matching between Wiktionary and the evaluation set and runs the intersection algorithm using the best condition (condition 5).
- 9. error_analysis.ipynb counts senses suggested by certain languages based on evaluation set results for error analysis.
- classification_report_generator.py generates a classification report in the predictions_results folder.
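The kaikki.org dump is distributed as JSON Lines (one entry per line), so it can be streamed without loading the whole file into memory, which is how 1. wiktionary_extraction.ipynb can stay within modest RAM. The sketch below illustrates the pattern; the field names ("word", "pos", "translations", "code") follow the wiktextract schema but should be treated as assumptions and checked against a few lines of the downloaded file:

```python
import json

def iter_malay_translations(lines):
    """Yield (word, pos, Malay translation) triples from wiktextract
    JSON-Lines entries. Field names are assumptions based on the
    wiktextract schema; verify them against the actual dump."""
    for line in lines:
        entry = json.loads(line)
        for tr in entry.get("translations", []):
            if tr.get("code") == "ms":
                yield entry.get("word"), entry.get("pos"), tr.get("word")

# A fabricated sample entry standing in for one line of the dump;
# in practice, pass an open file handle instead of a list.
sample = ['{"word": "house", "pos": "noun", '
          '"translations": [{"code": "ms", "word": "rumah"}]}']
list(iter_malay_translations(sample))  # [("house", "noun", "rumah")]
```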
The wn-msa-all folder contains the main file and additional data from the Wordnet Bahasa maintainers. It is structured as follows:
- data stores all the files of the original parallel data from SourceForge.
- predictions_results stores prediction results of the system's sense alignment.
- 1. development_evaluation_sets.ipynb builds development and evaluation sets.
- 2. goodness_labels_extraction.ipynb extracts entries labelled B and I from the main data and attaches these goodness labels to the development and evaluation sets.
- 3. combined_datasets_on_eval.ipynb combines both Wiktionary and OPUS data and runs the combined dataset on the evaluation set using the best condition. Generates a classification report.
- 4. wn-msa_error_analysis.ipynb suggests senses for the main data using Wiktionary and OPUS data respectively for the final error analysis step. Takes 150 random samples from each data source.
- 5. error_analysis_hand-checked.ipynb obtains glosses of the 150 random samples and generates accuracy scores for both Wiktionary and OPUS for the last step of error analysis.
- senses.ipynb checks the senses, gloss, hypernyms, and hyponyms of a synset in WordNet for further data analysis.
- classification_report_generator.py generates a classification report in the predictions_results folder.
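1. development_evaluation_sets.ipynb builds the two held-out sets listed above. The README does not state the split ratio or seed, so the sketch below is only a generic illustration of a reproducible shuffle-and-split; the 50/50 ratio and seed 42 are assumptions, not the notebook's actual settings:

```python
import random

def split_dev_eval(items, dev_fraction=0.5, seed=42):
    """Reproducibly shuffle labelled items and split them into a
    development set and an evaluation set. Ratio and seed are
    illustrative assumptions, not taken from the notebook."""
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * dev_fraction)
    return shuffled[:cut], shuffled[cut:]

dev, evaluation = split_dev_eval(range(10))
```

Fixing the seed matters here: both the OPUS and the Wiktionary pipelines are evaluated against the same sets, so the split must be identical across runs.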
The requirements file lists all the packages needed to run the code.
To use this repository, first clone or download it to your local machine. Then, navigate to the respective folders based on your requirements. Each folder contains notebooks or scripts for specific tasks related to data analysis and intersection algorithms. Follow the instructions provided in the notebooks or scripts to execute the necessary operations.
If you have any questions or encounter any issues, please feel free to reach out to the project owner at [email protected].
The licence of the repository is provided in the licence file.
H. B. M. Noor, S. Sapuan, and F. Bond. Creating the open Wordnet Bahasa. In Proceedings of the 25th Pacific Asia Conference on Language, Information and Computation, pages 255–264, Singapore, Dec. 2011. Institute of Digital Enhancement of Cognitive Processing, Waseda University.
F. Bond and G. Bonansinga. Exploring Cross-Lingual Sense Mapping in a Multilingual Parallel Corpus, pages 56–61. 2015. ISBN 9788899200008. doi: 10.4000/books.aaccademia.1321.
F. Kratochvil and L. Morgado da Costa. Abui Wordnet: Using a toolbox dictionary to develop a wordnet for a low-resource language. In Proceedings of the First Workshop on NLP Applications to Field Linguistics, pages 54–63, Gyeongju, Republic of Korea, Oct. 2022. International Conference on Computational Linguistics.
J. Tiedemann. Parallel data, tools and interfaces in OPUS. In N. Calzolari, K. Choukri, T. Declerck, M. U. Dogan, B. Maegaard, J. Mariani, J. Odijk, and S. Piperidis, editors, LREC, pages 2214–2218. European Language Resources Association (ELRA), 2012. ISBN 978-2-9517408-7-7.
T. Ylonen. Wiktextract: Wiktionary as machine-readable structured data. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 1317–1325, Marseille, France, June 2022. European Language Resources Association.