Ingestion: more COLING workshops #4558

Merged: 7 commits, Feb 13, 2025
125 changes: 125 additions & 0 deletions data/xml/2025.bucc.xml
@@ -0,0 +1,125 @@
<?xml version='1.0' encoding='UTF-8'?>
<collection id="2025.bucc">
<volume id="1" ingest-date="2025-02-05" type="proceedings">
<meta>
<booktitle>Proceedings of the 18th Workshop on Building and Using Comparable Corpora (BUCC)</booktitle>
<editor><first>Serge</first><last>Sharoff</last></editor>
<editor><first>Ayla Rigouts</first><last>Terryn</last></editor>
<editor><first>Pierre</first><last>Zweigenbaum</last></editor>
<editor><first>Reinhard</first><last>Rapp</last></editor>
<publisher>Association for Computational Linguistics</publisher>
<address>Abu Dhabi, UAE</address>
<month>January</month>
<year>2025</year>
<url hash="0b1ce1f4">2025.bucc-1</url>
<venue>bucc</venue>
<venue>ws</venue>
</meta>
<frontmatter>
<url hash="dc8428d0">2025.bucc-1.0</url>
<bibkey>bucc-2025-1</bibkey>
</frontmatter>
<paper id="1">
<title>Bilingual resources for <fixed-case>M</fixed-case>oroccan <fixed-case>S</fixed-case>ign <fixed-case>L</fixed-case>anguage Generation and <fixed-case>S</fixed-case>tandard <fixed-case>A</fixed-case>rabic Skills Improvement of Deaf Children</title>
<author><first>Abdelhadi</first><last>Soudi</last></author>
<author><first>Corinne</first><last>Vinopol</last></author>
<author><first>Kristof</first><last>Van Laerhoven</last></author>
<pages>1–9</pages>
<abstract>This paper presents a set of bilingual Standard Arabic (SA)-Moroccan Sign Language (MSL) tools and resources to improve Moroccan Deaf children’s SA skills. An MSL Generator based on rule-based machine translation (MT) is described that enables users and educators of Deaf children, in particular, to enter Arabic text and generate its corresponding MSL translation in both graphic and video format. The generated graphics can be printed and imported into an Arabic reading passage. We have also developed MSL Clip and Create software that includes a bilingual database of 3,000 MSL signs and SA words, a Publisher for the incorporation of MSL graphic support into SA reading passages, and six Templates that create customized bilingual crossword puzzles, word searches, Bingo cards, matching games, flashcards, and fingerspelling scrambles. A crowdsourcing platform for MSL data collection is also described. A major social benefit of the development of these resources is in relation to equity and the status of deaf people in Moroccan society. More appropriate resources for the bilingual education of Deaf children (in MSL and SA) will lead to improved quality of educational services.</abstract>
<url hash="5b6841d1">2025.bucc-1.1</url>
<bibkey>soudi-etal-2025-bilingual</bibkey>
</paper>
<paper id="2">
<title>Harmonizing Annotation of <fixed-case>T</fixed-case>urkic Postverbial Constructions: A Comparative Study of <fixed-case>UD</fixed-case> Treebanks</title>
<author><first>Arofat</first><last>Akhundjanova</last></author>
<pages>10–17</pages>
<abstract>As the number of treebanks within the same language family continues to grow, the importance of establishing consistent annotation practices has become increasingly evident. In this paper, we evaluate various approaches to annotating Turkic postverbial constructions across UD treebanks. Our comparative analysis reveals that none of the existing methods fully capture the unique semantic and syntactic characteristics of these complex constructions. This underscores the need to adopt a balanced approach that can achieve broad consensus and be implemented consistently across Turkic treebanks. By examining the phenomenon and the available annotation strategies, our study aims to improve the consistency of Turkic UD treebanks and enhance their utility for cross-linguistic research.</abstract>
<url hash="a82b754a">2025.bucc-1.2</url>
<bibkey>akhundjanova-2025-harmonizing</bibkey>
</paper>
<paper id="3">
<title>Towards Truly Open, Language-Specific, Safe, Factual, and Specialized Large Language Models</title>
<author><first>Preslav</first><last>Nakov</last></author>
<pages>18</pages>
<abstract>First, we will argue for the need for fully transparent open-source large language models (LLMs), and we will describe the efforts of MBZUAI’s Institute on Foundation Models (IFM) towards that based on the LLM360 initiative. Second, we will argue for the need for language-specific LLMs, and we will share our experience from building Jais, the world’s leading open Arabic-centric foundation and instruction-tuned large language model, Nanda, our recently released open Hindi LLM, and some other models. Third, we will argue for the need for safe LLMs, and we will present Do-Not-Answer, a dataset for evaluating the guardrails of LLMs, which is at the core of the safety mechanisms of our LLMs. Fourth, we will argue for the need for factual LLMs, and we will discuss the factuality challenges that LLMs pose. We will then present some recent relevant tools for addressing these challenges developed at MBZUAI: (i) OpenFactCheck, a framework for fact-checking LLM output, for building customized fact-checking systems, and for benchmarking LLMs for factuality, (ii) LM-Polygraph, a tool for predicting an LLM’s uncertainty in its output using cheap and fast uncertainty quantification techniques, and (iii) LLM-DetectAIve, a tool for machine-generated text detection. Finally, we will argue for the need for specialized models, and we will present the zoo of LLMs currently being developed at MBZUAI’s IFM.</abstract>
<url hash="9e1099ae">2025.bucc-1.3</url>
<bibkey>nakov-2025-towards</bibkey>
</paper>
<paper id="4">
<title>Make Satire Boring Again: Reducing Stylistic Bias of Satirical Corpus by Utilizing Generative <fixed-case>LLM</fixed-case>s</title>
<author><first>Asli Umay</first><last>Ozturk</last></author>
<author><first>Recep Firat</first><last>Cekinel</last></author>
<author><first>Pinar</first><last>Karagoz</last></author>
<pages>19–35</pages>
<abstract>Satire detection is essential for accurately extracting opinions from textual data and combating misinformation online. However, the lack of diverse corpora for satire leads to the problem of stylistic bias which impacts the models’ detection performances. This study proposes a debiasing approach for satire detection, focusing on reducing biases in training data by utilizing generative large language models. The approach is evaluated in both cross-domain (irony detection) and cross-lingual (English) settings. Results show that the debiasing method enhances the robustness and generalizability of the models for satire and irony detection tasks in Turkish and English. However, its impact on causal language models, such as Llama-3.1, is limited. Additionally, this work curates and presents the Turkish Satirical News Dataset with detailed human annotations, with case studies on classification, debiasing, and explainability.</abstract>
<url hash="1d422b03">2025.bucc-1.4</url>
<bibkey>ozturk-etal-2025-make</bibkey>
</paper>
<paper id="5">
<title><fixed-case>BEIR</fixed-case>-<fixed-case>NL</fixed-case>: Zero-shot Information Retrieval Benchmark for the <fixed-case>D</fixed-case>utch Language</title>
<author><first>Ehsan</first><last>Lotfi</last></author>
<author><first>Nikolay</first><last>Banar</last></author>
<author><first>Walter</first><last>Daelemans</last></author>
<pages>36–45</pages>
<abstract>Zero-shot evaluation of information retrieval (IR) models is often performed using BEIR, a large and heterogeneous benchmark composed of multiple datasets, covering different retrieval tasks across various domains. Although BEIR has become a standard benchmark for the zero-shot setup, its exclusively English content reduces its utility for underrepresented languages in IR, including Dutch. To address this limitation and encourage the development of Dutch IR models, we introduce BEIR-NL by automatically translating the publicly accessible BEIR datasets into Dutch. Using BEIR-NL, we evaluated a wide range of multilingual dense ranking and reranking models, as well as the lexical BM25 method. Our experiments show that BM25 remains a competitive baseline, and is only outperformed by the larger dense models trained for retrieval. When combined with reranking models, BM25 achieves performance on par with the best dense ranking models. In addition, we explored the impact of translation on the data by back-translating a selection of datasets to English, and observed a performance drop for both dense and lexical methods, indicating the limitations of translation for creating benchmarks. BEIR-NL is publicly available on the Hugging Face hub.</abstract>
<url hash="20700e6b">2025.bucc-1.5</url>
<bibkey>lotfi-etal-2025-beir</bibkey>
</paper>
<paper id="6">
<title>Refining Dimensions for Improving Clustering-based Cross-lingual Topic Models</title>
<author><first>Chia-Hsuan</first><last>Chang</last></author>
<author><first>Tien Yuan</first><last>Huang</last></author>
<author><first>Yi-Hang</first><last>Tsai</last></author>
<author><first>Chia-Ming</first><last>Chang</last></author>
<author><first>San-Yih</first><last>Hwang</last></author>
<pages>46–56</pages>
<abstract>Recent works in clustering-based topic models perform well in monolingual topic identification by introducing a pipeline to cluster the contextualized representations. However, the pipeline is suboptimal in identifying topics across languages due to the presence of language-dependent dimensions (LDDs) generated by multilingual language models. To address this issue, we introduce a novel, SVD-based dimension refinement component into the pipeline of the clustering-based topic model. This component effectively neutralizes the negative impact of LDDs, enabling the model to accurately identify topics across languages. Our experiments on three datasets demonstrate that the updated pipeline with the dimension refinement component generally outperforms other state-of-the-art cross-lingual topic models.</abstract>
<url hash="f4072b71">2025.bucc-1.6</url>
<bibkey>chang-etal-2025-refining</bibkey>
</paper>
<paper id="7">
<title>The Role of Handling Attributive Nouns in Improving <fixed-case>C</fixed-case>hinese-To-<fixed-case>E</fixed-case>nglish Machine Translation</title>
<author><first>Adam</first><last>Meyers</last></author>
<author><first>Rodolfo Joel</first><last>Zevallos</last></author>
<author><first>John E.</first><last>Ortega</last></author>
<author><first>Lisa</first><last>Wang</last></author>
<pages>57–61</pages>
<abstract>Translating between languages with drastically different grammatical conventions poses significant challenges, not just for human interpreters but also for machine translation systems. In this work, we specifically target the translation challenges posed by attributive nouns in Chinese, which frequently cause ambiguities in English translation. By manually inserting the omitted particle ‘DE’ in news article titles from the Penn Chinese Discourse Treebank, we developed a targeted dataset to fine-tune Hugging Face Chinese to English translation models, specifically improving how this critical function word is handled. This focused approach not only complements the broader strategies suggested by previous studies but also offers a practical enhancement by specifically addressing a common error type in Chinese-English translation.</abstract>
<url hash="913f6b28">2025.bucc-1.7</url>
<bibkey>meyers-etal-2025-role</bibkey>
</paper>
<paper id="8">
<title>Can a Neural Model Guide Fieldwork? A Case Study on Morphological Data Collection</title>
<author><first>Aso</first><last>Mahmudi</last></author>
<author><first>Borja</first><last>Herce</last></author>
<author><first>Demian</first><last>Inostroza Améstica</last></author>
<author><first>Andreas</first><last>Scherbakov</last></author>
<author><first>Eduard H.</first><last>Hovy</last></author>
<author><first>Ekaterina</first><last>Vylomova</last></author>
<pages>62–72</pages>
<abstract>Linguistic fieldwork is an important component in language documentation and the creation of comprehensive linguistic corpora. Despite its significance, the process is often lengthy, exhaustive, and time-consuming. This paper presents a novel model that guides a linguist during the fieldwork and accounts for the dynamics of linguist-speaker interactions. We introduce a novel framework that evaluates the efficiency of various sampling strategies for obtaining morphological data and assesses the effectiveness of state-of-the-art neural models in generalising morphological structures. Our experiments highlight two key strategies for improving the efficiency: (1) increasing the diversity of annotated data by uniform sampling among the cells of the paradigm tables, and (2) using model confidence as a guide to enhance positive interaction by providing reliable predictions during annotation.</abstract>
<url hash="41e767d7">2025.bucc-1.8</url>
<bibkey>mahmudi-etal-2025-neural</bibkey>
</paper>
<paper id="9">
<title>Comparable Corpora: Opportunities for New Research Directions</title>
<author><first>Kenneth Ward</first><last>Church</last></author>
<pages>73–82</pages>
<abstract>Most conference papers present new results, but this paper will focus more on opportunities for the audience to make their own contributions. This paper is intended to challenge the community to think more broadly about what we can do with comparable corpora. We will start with a review of the history, and then suggest new directions for future research.</abstract>
<url hash="97658e06">2025.bucc-1.9</url>
<bibkey>church-2025-comparable</bibkey>
</paper>
<paper id="10">
<title><fixed-case>SELEXINI</fixed-case> – a large and diverse automatically parsed corpus of <fixed-case>F</fixed-case>rench</title>
<author><first>Manon</first><last>Scholivet</last></author>
<author><first>Agata</first><last>Savary</last></author>
<author><first>Louis</first><last>Estève</last></author>
<author><first>Marie</first><last>Candito</last></author>
<author><first>Carlos</first><last>Ramisch</last></author>
<pages>83–98</pages>
<abstract>The annotation of large text corpora is essential for many tasks. We present here a large automatically annotated corpus for French. This corpus is separated into two parts: the first from BigScience, and the second from HPLT. The annotated documents from HPLT were selected in order to optimise the lexical diversity of the final corpus SELEXINI. An analysis of the impact of this selection was carried out on syntactic diversity, as well as on the quality of the new words resulting from the HPLT part of SELEXINI. We have shown that despite the introduction of interesting new words, the texts extracted from HPLT are very noisy. Furthermore, increasing lexical diversity did not increase syntactic diversity.</abstract>
<url hash="2657e5c8">2025.bucc-1.10</url>
<bibkey>scholivet-etal-2025-selexini</bibkey>
</paper>
</volume>
</collection>