Ingested CL 48.2 and recent TACL (acl-org#2091)
* Updated file structure expected in `ingest_mitpress.py`; added `--year` flag
mjpost authored Aug 6, 2022
1 parent dda2cf1 commit c234b23
Showing 3 changed files with 234 additions and 18 deletions.
41 changes: 23 additions & 18 deletions bin/ingest_mitpress.py
@@ -9,16 +9,12 @@
Example usage: unpack the ZIP file from MIT press. You'll have something like this:
./taclv9-11082021/
-    ./xml/
-        tacl_a_00350.xml
-        tacl_a_00351.xml
-        tacl_a_00352.xml
-        ...
-    ./assets/
-        tacl_a_00350.pdf
-        tacl_a_00351.pdf
-        tacl_a_00352.pdf
-        ...
+    tacl_a_00350.xml
+    tacl_a_00350.pdf
+    tacl_a_00351.xml
+    tacl_a_00351.pdf
+    tacl_a_00352.xml
+    tacl_a_00352.pdf
Then, run
@@ -158,7 +154,7 @@ def get_abstract(xml_front_node: etree.Element) -> str:
if abstract is not None:
abstract_text = collapse_spaces("".join(abstract.itertext()))
# 2022/June abstracts all started with "Abstract "
-        if abstract_text.starts_with("Abstract "):
+        if abstract_text.startswith("Abstract "):
abstract_text = abstract_text[9:]
return abstract_text
else:
@@ -356,11 +352,14 @@ def main(args):
logging.info(f"Looks like a TACL ingest: {is_tacl}")

venue = TACL if is_tacl else CL # J for CL, Q for TACL.
-    try:
-        year = int(args.root_dir.name[-4:])
-    except ValueError:
-        logging.warning(f"Expected last four chars of {args.root_dir} to be a year")
-        sys.exit(-1)
+    year = args.year
+    if year is None:
+        try:
+            year = int(args.root_dir.name[-4:])
+        except ValueError:
+            logging.warning(f"Expected last four chars of {args.root_dir} to be a year")
+            logging.warning(f"Or you can use --year YYYY")
+            sys.exit(-1)
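The new fallback order (explicit flag first, then the directory-name suffix) can be exercised in isolation; a minimal sketch, with `detect_year` as a hypothetical helper name:

```python
from pathlib import Path
from typing import Optional

def detect_year(root_dir: Path, year: Optional[int] = None) -> int:
    """Prefer an explicitly supplied year (the new --year flag); otherwise
    parse the last four characters of the directory name,
    e.g. "taclv9-11082021" -> 2021."""
    if year is not None:
        return year
    try:
        return int(root_dir.name[-4:])
    except ValueError:
        raise SystemExit(
            f"Expected last four chars of {root_dir} to be a year, or pass --year YYYY"
        )
```

With a directory name that doesn't end in a year, only the explicit argument saves the run, which is exactly the case the flag was added for.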

collection_id = str(year) + "." + venue

@@ -379,12 +378,12 @@
previous_issue_info = None

papers = []
-    for xml in sorted(args.root_dir.glob("xml/*.xml")):
+    for xml in sorted(args.root_dir.glob("*.xml")):
papernode, issue_info, issue = process_xml(xml, is_tacl)
if papernode is None or papernode.find("title").text.startswith("Erratum: “"):
continue

-        pdf_path = xml.parent.parent / "assets" / xml.with_suffix(".pdf").name
+        pdf_path = xml.parent / xml.with_suffix(".pdf").name
if not pdf_path.is_file():
logging.error(f"Missing pdf for {pdf_path}")
sys.exit(1)
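In the new flat layout each PDF sits next to its XML with the same stem, so the lookup no longer climbs out of an `xml/` directory into `assets/`. A sketch of the pairing, with `paired_pdf` as a hypothetical helper name:

```python
from pathlib import Path

def paired_pdf(xml_path: Path) -> Path:
    # Flat layout: swap only the suffix, keeping the same directory,
    # e.g. tacl_a_00350.xml -> tacl_a_00350.pdf
    return xml_path.parent / xml_path.with_suffix(".pdf").name
```

Since the parent directory is unchanged, this is equivalent to `xml_path.with_suffix(".pdf")`.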
@@ -491,6 +490,12 @@ def sort_papers_by_page(paper_tuple):
default=anthology_path,
help="Root path of ACL Anthology Github repo. Default: %(default)s.",
)
+    parser.add_argument(
+        "--year",
+        type=int,
+        default=None,
+        help="The current year",
+    )
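The added flag behaves like any optional `int` argument: omitted, it stays `None` and the directory-name fallback applies. A minimal sketch of the parsing behavior, assuming the same `type`/`default` as above:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--year", type=int, default=None, help="The current year")

explicit = parser.parse_args(["--year", "2022"])  # flag given: parsed as int
fallback = parser.parse_args([])                  # flag omitted: stays None
```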
pdfs_path = os.path.join(os.environ["HOME"], "anthology-files")
parser.add_argument(
"--pdfs-dir",
94 changes: 94 additions & 0 deletions data/xml/2022.cl.xml
@@ -123,4 +123,98 @@
<bibkey>stanojevic-steedman-2022-erratum</bibkey>
</paper>
</volume>
<volume id="2">
<meta>
<booktitle>Computational Linguistics, Volume 48, Issue 2 - June 2022</booktitle>
<publisher>MIT Press</publisher>
<address>Cambridge, MA</address>
<month>June</month>
<year>2022</year>
</meta>
<paper id="1">
<title>Ethics Sheet for Automatic Emotion Recognition and Sentiment Analysis</title>
<author><first>Saif M.</first><last>Mohammad</last></author>
<doi>10.1162/coli_a_00433</doi>
<abstract>The importance and pervasiveness of emotions in our lives makes affective computing a tremendously important and vibrant line of work. Systems for automatic emotion recognition (AER) and sentiment analysis can be facilitators of enormous progress (e.g., in improving public health and commerce) but also enablers of great harm (e.g., for suppressing dissidents and manipulating voters). Thus, it is imperative that the affective computing community actively engage with the ethical ramifications of their creations. In this article, I have synthesized and organized information from AI Ethics and Emotion Recognition literature to present fifty ethical considerations relevant to AER. Notably, this ethics sheet fleshes out assumptions hidden in how AER is commonly framed, and in the choices often made regarding the data, method, and evaluation. Special attention is paid to the implications of AER on privacy and social groups. Along the way, key recommendations are made for responsible AER. The objective of the ethics sheet is to facilitate and encourage more thoughtfulness on why to automate, how to automate, and how to judge success well before the building of AER systems. Additionally, the ethics sheet acts as a useful introductory document on emotion recognition (complementing survey articles).</abstract>
<pages>239–278</pages>
<url hash="36db3140">2022.cl-2.1</url>
<bibkey>mohammad-2022-ethics-sheet</bibkey>
</paper>
<paper id="2">
<title>Domain Adaptation with Pre-trained Transformers for Query-Focused Abstractive Text Summarization</title>
<author><first>Md Tahmid Rahman</first><last>Laskar</last></author>
<author><first>Enamul</first><last>Hoque</last></author>
<author><first>Jimmy Xiangji</first><last>Huang</last></author>
<doi>10.1162/coli_a_00434</doi>
<abstract>The Query-Focused Text Summarization (QFTS) task aims at building systems that generate the summary of the text document(s) based on the given query. A key challenge in addressing this task is the lack of large labeled data for training the summarization model. In this article, we address this challenge by exploring a series of domain adaptation techniques. Given the recent success of pre-trained transformer models in a wide range of natural language processing tasks, we utilize such models to generate abstractive summaries for the QFTS task for both single-document and multi-document scenarios. For domain adaptation, we apply a variety of techniques using pre-trained transformer-based summarization models including transfer learning, weakly supervised learning, and distant supervision. Extensive experiments on six datasets show that our proposed approach is very effective in generating abstractive summaries for the QFTS task while setting a new state-of-the-art result in several datasets across a set of automatic and human evaluation metrics.</abstract>
<pages>279–320</pages>
<url hash="e41774df">2022.cl-2.2</url>
<bibkey>laskar-etal-2022-domain</bibkey>
</paper>
<paper id="3">
<title>Challenges of Neural Machine Translation for Short Texts</title>
<author><first>Yu</first><last>Wan</last></author>
<author><first>Baosong</first><last>Yang</last></author>
<author><first>Derek Fai</first><last>Wong</last></author>
<author><first>Lidia Sam</first><last>Chao</last></author>
<author><first>Liang</first><last>Yao</last></author>
<author><first>Haibo</first><last>Zhang</last></author>
<author><first>Boxing</first><last>Chen</last></author>
<doi>10.1162/coli_a_00435</doi>
<abstract>Short texts (STs) present in a variety of scenarios, including query, dialog, and entity names. Most of the exciting studies in neural machine translation (NMT) are focused on tackling open problems concerning long sentences rather than short ones. The intuition behind is that, with respect to human learning and processing, short sequences are generally regarded as easy examples. In this article, we first dispel this speculation via conducting preliminary experiments, showing that the conventional state-of-the-art NMT approach, namely, Transformer (Vaswani et al. 2017), still suffers from over-translation and mistranslation errors over STs. After empirically investigating the rationale behind this, we summarize two challenges in NMT for STs associated with translation error types above, respectively: (1) the imbalanced length distribution in training set intensifies model inference calibration over STs, leading to more over-translation cases on STs; and (2) the lack of contextual information forces NMT to have higher data uncertainty on short sentences, and thus NMT model is troubled by considerable mistranslation errors. Some existing approaches, like balancing data distribution for training (e.g., data upsampling) and complementing contextual information (e.g., introducing translation memory) can alleviate the translation issues in NMT for STs. We encourage researchers to investigate other challenges in NMT for STs, thus reducing ST translation errors and enhancing translation quality.</abstract>
<pages>321–342</pages>
<url hash="d9abf790">2022.cl-2.3</url>
<bibkey>wan-etal-2022-challenges</bibkey>
</paper>
<paper id="4">
<title>Annotation Curricula to Implicitly Train Non-Expert Annotators</title>
<author><first>Ji-Ung</first><last>Lee</last></author>
<author><first>Jan-Christoph</first><last>Klie</last></author>
<author><first>Iryna</first><last>Gurevych</last></author>
<doi>10.1162/coli_a_00436</doi>
<abstract>Annotation studies often require annotators to familiarize themselves with the task, its annotation scheme, and the data domain. This can be overwhelming in the beginning, mentally taxing, and induce errors into the resulting annotations; especially in citizen science or crowdsourcing scenarios where domain expertise is not required. To alleviate these issues, this work proposes annotation curricula, a novel approach to implicitly train annotators. The goal is to gradually introduce annotators into the task by ordering instances to be annotated according to a learning curriculum. To do so, this work formalizes annotation curricula for sentence- and paragraph-level annotation tasks, defines an ordering strategy, and identifies well-performing heuristics and interactively trained models on three existing English datasets. Finally, we provide a proof of concept for annotation curricula in a carefully designed user study with 40 voluntary participants who are asked to identify the most fitting misconception for English tweets about the Covid-19 pandemic. The results indicate that using a simple heuristic to order instances can already significantly reduce the total annotation time while preserving a high annotation quality. Annotation curricula thus can be a promising research direction to improve data collection. To facilitate future research—for instance, to adapt annotation curricula to specific tasks and expert annotation scenarios—all code and data from the user study consisting of 2,400 annotations is made available.1</abstract>
<pages>343–373</pages>
<url hash="a1ccdcf0">2022.cl-2.4</url>
<bibkey>lee-etal-2022-annotation</bibkey>
</paper>
<paper id="5">
<title>Assessing Corpus Evidence for Formal and Psycholinguistic Constraints on Nonprojectivity</title>
<author><first>Himanshu</first><last>Yadav</last></author>
<author><first>Samar</first><last>Husain</last></author>
<author><first>Richard</first><last>Futrell</last></author>
<doi>10.1162/coli_a_00437</doi>
<abstract>Formal constraints on crossing dependencies have played a large role in research on the formal complexity of natural language grammars and parsing. Here we ask whether the apparent evidence for constraints on crossing dependencies in treebanks might arise because of independent constraints on trees, such as low arity and dependency length minimization. We address this question using two sets of experiments. In Experiment 1, we compare the distribution of formal properties of crossing dependencies, such as gap degree, between real trees and baseline trees matched for rate of crossing dependencies and various other properties. In Experiment 2, we model whether two dependencies cross, given certain psycholinguistic properties of the dependencies. We find surprisingly weak evidence for constraints originating from the mild context-sensitivity literature (gap degree and well-nestedness) beyond what can be explained by constraints on rate of crossing dependencies, topological properties of the trees, and dependency length. However, measures that have emerged from the parsing literature (e.g., edge degree, end-point crossings, and heads’ depth difference) differ strongly between real and random trees. Modeling results show that cognitive metrics relating to information locality and working-memory limitations affect whether two dependencies cross or not, but they do not fully explain the distribution of crossing dependencies in natural languages. Together these results suggest that crossing constraints are better characterized by processing pressures than by mildly context-sensitive constraints.</abstract>
<pages>375–401</pages>
<url hash="9d6ae8f1">2022.cl-2.5</url>
<bibkey>yadav-etal-2022-assessing</bibkey>
</paper>
<paper id="6">
<title>Dual Attention Model for Citation Recommendation with Analyses on Explainability of Attention Mechanisms and Qualitative Experiments</title>
<author><first>Yang</first><last>Zhang</last></author>
<author><first>Qiang</first><last>Ma</last></author>
<doi>10.1162/coli_a_00438</doi>
<abstract>Based on an exponentially increasing number of academic articles, discovering and citing comprehensive and appropriate resources have become non-trivial tasks. Conventional citation recommendation methods suffer from severe information losses. For example, they do not consider the section header of the paper that the author is writing and for which they need to find a citation, the relatedness between the words in the local context (the text span that describes a citation), or the importance of each word from the local context. These shortcomings make such methods insufficient for recommending adequate citations to academic manuscripts. In this study, we propose a novel embedding-based neural network called dual attention model for citation recommendation (DACR) to recommend citations during manuscript preparation. Our method adapts the embedding of three semantic pieces of information: words in the local context, structural contexts,1 and the section on which the author is working. A neural network model is designed to maximize the similarity between the embedding of the three inputs (local context words, section headers, and structural contexts) and the target citation appearing in the context. The core of the neural network model comprises self-attention and additive attention; the former aims to capture the relatedness between the contextual words and structural context, and the latter aims to learn their importance. Recommendation experiments on real-world datasets demonstrate the effectiveness of the proposed approach. To seek explainability on DACR, particularly the two attention mechanisms, the learned weights from them are investigated to determine how the attention mechanisms interpret “relatedness” and “importance” through the learned weights. 
In addition, qualitative analyses were conducted to testify that DACR could find necessary citations that were not noticed by the authors in the past due to the limitations of the keyword-based searching.</abstract>
<pages>403–470</pages>
<url hash="e40d8ac3">2022.cl-2.6</url>
<bibkey>zhang-ma-2022-dual</bibkey>
</paper>
<paper id="7">
<title>On Learning Interpreted Languages with Recurrent Models</title>
<author><first>Denis</first><last>Paperno</last></author>
<doi>10.1162/coli_a_00431</doi>
<abstract>Can recurrent neural nets, inspired by human sequential data processing, learn to understand language? We construct simplified data sets reflecting core properties of natural language as modeled in formal syntax and semantics: recursive syntactic structure and compositionality. We find LSTM and GRU networks to generalize to compositional interpretation well, but only in the most favorable learning settings, with a well-paced curriculum, extensive training data, and left-to-right (but not right-to-left) composition.</abstract>
<pages>471–482</pages>
<url hash="1b28885c">2022.cl-2.7</url>
<bibkey>paperno-2022-learning</bibkey>
</paper>
<paper id="8">
<title>Boring Problems Are Sometimes the Most Interesting</title>
<author><first>Richard</first><last>Sproat</last></author>
<doi>10.1162/coli_a_00439</doi>
<abstract>In a recent position paper, Turing Award Winners Yoshua Bengio, Geoffrey Hinton, and Yann LeCun make the case that symbolic methods are not needed in AI and that, while there are still many issues to be resolved, AI will be solved using purely neural methods. In this piece I issue a challenge: Demonstrate that a purely neural approach to the problem of text normalization is possible. Various groups have tried, but so far nobody has eliminated the problem of unrecoverable errors, errors where, due to insufficient training data or faulty generalization, the system substitutes some other reading for the correct one. Solutions have been proposed that involve a marriage of traditional finite-state methods with neural models, but thus far nobody has shown that the problem can be solved using neural methods alone. Though text normalization is hardly an “exciting” problem, I argue that until one can solve “boring” problems like that using purely AI methods, one cannot claim that AI is a success.</abstract>
<pages>483–490</pages>
<url hash="c7da70a3">2022.cl-2.8</url>
<bibkey>sproat-2022-boring</bibkey>
</paper>
</volume>
</collection>