Description

This repository is part of a thesis project for the MA in Linguistics (Human Language Technology). The project focuses on cleaning up Wordnet Bahasa by comparing automatically aligned dictionary data (OPUS) with hand-curated dictionary data (Wiktionary) using the multilingual sense intersection (MSI) methodology. The repository contains tools and scripts for sense alignment and evaluation across several datasets and algorithms. It is organized into subfolders, each holding Jupyter notebooks and Python scripts for one part of the process.
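As a rough illustration of the sense intersection idea, the sketch below keeps only the senses shared by a word and its aligned translations. It is a minimal sketch under stated assumptions, not the code used in this repository: the function name `sense_intersection` and the toy sense labels are hypothetical, and real inventories would come from the wordnet and dictionary data.

```python
from typing import Iterable, Set


def sense_intersection(candidate_senses: Iterable[Set[str]]) -> Set[str]:
    """Return the senses shared by every aligned word.

    Each item in candidate_senses is the set of sense IDs that one word
    (in one language) can express. Hypothetical helper for illustration.
    """
    sets = list(candidate_senses)
    if not sets:
        return set()
    return set.intersection(*sets)


# Toy, made-up sense inventories (not taken from Wordnet Bahasa).
english_bank = {"s-finance", "s-riverside"}
malay_tebing = {"s-riverside", "s-cliff"}

# Only the sense both words can express survives the intersection.
shared = sense_intersection([english_bank, malay_tebing])
print(shared)
```

An empty result means the aligned words share no candidate sense, which can flag a translation pair worth reviewing.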

Special Note:

Because of GitHub's storage limits, the complete repository, including its data, can also be downloaded from this Google Drive.

The JSON file of the Wiktionary extraction compiled by Ylonen (2022) can be downloaded from the kaikki.org website. Navigate to the Download section. Once the file is downloaded, store it in the data folder and rename it to kaikki.org-dictionary-English.json.
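Once the file is in place, it can be read entry by entry. The sketch below assumes the kaikki.org download is stored as one JSON object per line (JSON Lines), so it avoids loading the whole file into memory. The helper name `iter_entries` is hypothetical, and the `word` key is an assumption about the entry layout that should be checked against the downloaded file.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_entries(path) -> Iterator[dict]:
    """Yield one dictionary entry per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


if __name__ == "__main__":
    # Path follows the naming given above; adjust it if data lives elsewhere.
    path = Path("data") / "kaikki.org-dictionary-English.json"
    if path.exists():
        first = next(iter_entries(path), None)
        # "word" is an assumed key; inspect the entry if it is missing.
        print(first.get("word") if first else "file is empty")
    else:
        print(f"{path} not found")
```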

opus Folder

The opus folder contains data related to the OPUS corpus, a parallel data collection. It is organized as follows:

Subfolders:

Jupyter Notebooks and scripts:

wiktionary Folder

The wiktionary folder contains data related to the Wiktionary dataset. It is structured as follows:

Subfolders:

  • data: This directory stores all the files of the original parallel data from SourceForge.
  • predictions_results: This is where the prediction results of the system's sense alignment are stored.

Jupyter Notebooks and scripts:

wn-msa-all Folder

The wn-msa-all folder contains data related to the main file and additional data from Wordnet Bahasa maintainers. It is structured as follows:

Subfolders:

  • data: This directory stores all the files of the original parallel data from SourceForge.
  • predictions_results: This is where the prediction results of the system's sense alignment are stored.

Jupyter Notebooks and scripts:

This file contains all the information about the packages needed to run the code.

Usage

To use this repository, first clone or download it to your local machine. Then navigate to the folder that matches your task. Each folder contains notebooks or scripts for specific tasks related to data analysis and intersection algorithms. Follow the instructions in each notebook or script to run it.
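To get a quick overview after cloning, the sketch below lists the notebooks and Python scripts found directly inside each of the three top-level folders described in this README. It is only a convenience sketch: the helper name `list_runnables` is hypothetical, and it assumes the runnable files sit at the top level of each folder rather than in nested subfolders.

```python
from pathlib import Path

# Folder names as described in this README.
FOLDERS = ("opus", "wiktionary", "wn-msa-all")


def list_runnables(repo_root) -> dict:
    """Map each known folder to its sorted .ipynb and .py file names."""
    root = Path(repo_root)
    found = {}
    for name in FOLDERS:
        folder = root / name
        if folder.is_dir():
            found[name] = sorted(
                p.name for p in folder.iterdir()
                if p.is_file() and p.suffix in {".ipynb", ".py"}
            )
        else:
            found[name] = []  # folder missing in this checkout
    return found


if __name__ == "__main__":
    # Run from the repository root.
    for folder, files in list_runnables(".").items():
        print(folder, files)
```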

Contact

If you have any questions or encounter any issues, please feel free to reach out to the project owner at [email protected].

Licence of the repository.

References

N. H. B. M. Noor, S. Sapuan, and F. Bond. Creating the open Wordnet Bahasa. In Proceedings of the 25th Pacific Asia Conference on Language, Information and Computation, pages 255–264, Singapore, Dec. 2011. Institute of Digital Enhancement of Cognitive Processing, Waseda University.

F. Bond and G. Bonansinga. Exploring Cross-Lingual Sense Mapping in a Multilingual Parallel Corpus, pages 56–61. Jan. 2015. ISBN 9788899200008. doi: 10.4000/books.aaccademia.1321

F. Kratochvil and L. Morgado da Costa. Abui Wordnet: Using a toolbox dictionary to develop a wordnet for a low-resource language. In Proceedings of the first workshop on NLP applications to field linguistics, pages 54–63, Gyeongju, Republic of Korea, Oct. 2022. International Conference on Computational Linguistics.

J. Tiedemann. Parallel data, tools and interfaces in OPUS. In N. Calzolari, K. Choukri, T. Declerck, M. U. Dogan, B. Maegaard, J. Mariani, J. Odijk, and S. Piperidis, editors, LREC, pages 2214–2218. European Language Resources Association (ELRA), 2012. ISBN 978-2-9517408-7-7.

T. Ylonen. Wiktextract: Wiktionary as machine-readable structured data. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 1317–1325, Marseille, France, June 2022. European Language Resources Association.