TEI/ParlaMint.odd.xml

<?xml version="1.0" encoding="utf-8"?>
<?xml-model href="tei_odds.rng" type="application/xml"
            schematypens="http://relaxng.org/ns/structure/1.0"?>
<TEI xmlns:rng="http://relaxng.org/ns/structure/1.0" 
     xmlns="http://www.tei-c.org/ns/1.0"
     xmlns:sch="http://purl.oclc.org/dsdl/schematron" 
     xmlns:eg="http://www.tei-c.org/ns/Examples"
     xmlns:egXML="http://www.tei-c.org/ns/Examples"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
     xmlns:tei="http://www.tei-c.org/ns/1.0"
     xml:lang="en" n="tei_clarin">
  
  <!-- To do:
       - need a better introduction and discussions of samples 
       - maybe even more constrained schema
  -->
  
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>The structure and encoding of ParlaMint corpora</title>
        <author>Tomaž Erjavec, tomaz.erjavec@ijs.si</author>
        <author>Matyáš Kopp, kopp@ufal.mff.cuni.cz</author>
        <author>Andrej Pančur, andrej.pancur@inz.si</author>
      </titleStmt>
      <publicationStmt>
        <publisher>CLARIN</publisher>
        <date>2024-12-03</date>
        <availability status="free">
          <p>This file is freely available and you are hereby authorised to copy, modify, and
          redistribute it in any way without further reference or permissions.</p>
        </availability>
        <pubPlace>
          <ref target="https://github.com/clarin-eric/ParlaMint">https://github.com/clarin-eric/ParlaMint</ref>
        </pubPlace>
      </publicationStmt>
      <sourceDesc>
        <p>Made on the basis of the <ref
        target="https://github.com/clarin-eric/ParlaMint">Parla-CLARIN</ref>
        recommendation. The (possibly modified) examples of encoding are taken from the
        <bibl xml:id="ParlaMint">ParlaMint corpus</bibl>.</p>
      </sourceDesc>
    </fileDesc>
    <encodingDesc>
      <projectDesc>
        <p>Research Infrastructure for Language Resources and Tools <ref
        target="https://www.clarin.eu/">CLARIN</ref>.</p>
      </projectDesc>
    </encodingDesc>
    <revisionDesc>
      <change when="2024-12-03">Tomaž Erjavec: Allow s/@ana.</change>
      <change when="2024-05-14">Tomaž Erjavec: Make person/sex obligatory.</change>
      <change when="2024-03-03">Tomaž Erjavec: Slightly edit the section of affiliations.</change>
      <change when="2024-02-08">Tomaž Erjavec: Change encoding of name so it allows more .ana elements.</change>
      <change when="2024-01-25">Tomaž Erjavec: Change encoding of org/state so it is TEI compliant.</change>
      <change when="2023-11-06">Tomaž Erjavec: Add info on semantic annotations and other small changes.</change>
      <change when="2023-10-09">Tomaž Erjavec: Allow multilingual names.</change>
      <change when="2023-08-22">Tomaž Erjavec: Change description of org/state.</change>
      <change when="2023-07-17">Tomaž Erjavec: Add org/@role='federatedState'.</change>
      <change when="2023-06-13">Tomaž Erjavec: Start work on section for MT.</change>
      <change when="2023-04-06">Tomaž Erjavec: allow orgName in affiliation.</change>
      <change when="2022-12-21">Tomaž Erjavec: reorganise, fix errors, add explanations.</change>
      <change when="2022-10-17">Tomaž Erjavec: add description for new XIncludes, reorganise</change>
      <change when="2022-10-04">Tomaž Erjavec: reorganise sect. on organizations, add state.</change>
      <change when="2022-07-27">Tomaž Erjavec: many changes to (explanations of) structure.</change>
      <change when="2022-05-18">Katja Meden: add ParlaMint examples in exemplums.</change>
      <change when="2022-05-16">Tomaž Erjavec: add section 3.5 on referring attributes.</change>
      <change when="2022-03-04">Matyáš Kopp: finish first pass through schemaSpecs.</change>
      <change when="2022-02-16">Tomaž Erjavec: first draft.</change>
    </revisionDesc>
  </teiHeader>
  <text>
    <front>
      <titlePage>
        <docTitle>
          <titlePart type="main">The structure and encoding of
          <ref target="https://github.com/clarin-eric/ParlaMint">ParlaMint corpora</ref></titlePart>
        </docTitle>
        <!--docEdition>v4.2</docEdition-->
        <docDate>2024-12-03</docDate>
        <!--byline>Tomaž Erjavec, Matyáš Kopp, Katja Meden</byline-->
      </titlePage>
      <p></p>
      <divGen type="toc"/>
    </front>
    <body>
      <div xml:id="chp-intro">
        <head>Introduction</head>
        <p>This document is meant to serve as a reference for the encoding of ParlaMint
        corpora of parliamentary proceedings. In order for the ParlaMint corpora to be
        interoperable (i.e. so that the same scripts can be used to process them), their
        structure is fairly rigid, both in terms of file names and folder structure, as well
        as their TEI XML encoding. This is not to say that all the corpora have to contain
        exactly the same information because we distinguish obligatory information, which
        all the corpora should contain, from that which is optional, and present only in the
        corpora for which it has been possible to gather it from the corpus sources.</p>

        <p>This document is a specialisation of <ref
        target="https://github.com/clarin-eric/parla-clarin">Parla-CLARIN</ref>, itself a
        customisation of the <ref
        target="https://www.tei-c.org/release/doc/tei-p5-doc/en/html/index.html">TEI
        Guidelines</ref>. But while Parla-CLARIN gives fairly general recommendations for
        encoding corpora of parliamentary proceedings, ParlaMint, as mentioned, is much
        stricter. This document gives very specific encoding recommendations without
        necessarily stating the reasons for their choice. It covers the overall structure
        of ParlaMint corpora, the metadata they contain, the encoding of transcriptions,
        and, for the linguistically annotated version, the encoding of word-level
        linguistic annotations, syntactic dependencies and named entities.</p>

        <p>The document is not meant as a tutorial on TEI or ParlaMint, but as a reference to
        elements, their nesting and attributes exemplified by snippets from the existing ParlaMint
        corpora. Other sources can help in understanding the encoding of ParlaMint corpora:
        <list>
          <item>The freely available paper:<lb/> Erjavec, T., Ogrodniczuk, M., Osenova,
          P. et al. The ParlaMint corpora of parliamentary proceedings.  Language
          Resources &amp; Evaluation (2022). <ref
          target="https://doi.org/10.1007/s10579-021-09574-0">https://doi.org/10.1007/s10579-021-09574-0</ref>.</item>
          
          <item>The <ref
          target="https://clarin-eric.github.io/parla-clarin/">Parla-CLARIN</ref>
          guidelines, which provide general guidelines for encoding parliamentary corpora
          in TEI; they also give links to the relevant chapters of the <ref
          target="https://www.tei-c.org/release/doc/tei-p5-doc/en/html">TEI Guidelines
          </ref>.</item>
          
          <item>Samples of ParlaMint corpora, available in the Data/ directory of the
          ParlaMint GitHub repository, esp. useful as they give the complete picture of a
          ParlaMint corpus; note that the samples in the <ref
          target="https://github.com/clarin-eric/ParlaMint/tree/main/Data">main
          branch</ref> are supposed to be publication-ready, while those in the <ref
          target="https://github.com/clarin-eric/ParlaMint/tree/data/Data">data
          branch</ref> are work in progress.
          </item>
        </list>
        </p>

        <p>The rest of these recommendations are structured as follows:
        <list>
          <item><ref target="#chp-overall">Chapter 2</ref> explains the overall XML
          structure of a ParlaMint corpus, and introduces the distinction between the
          corpus root and corpus components;</item>

          <item><ref target="#chp-general">Chapter 3</ref> explains some general requirements
          and the file-naming conventions a ParlaMint corpus has to meet; it also introduces
          the top level elements and their attributes and the main pointing attributes;</item>

          <item><ref target="#sec-metadata">Chapter 4</ref> concentrates on the structure and
          encoding of the corpus metadata, such as the title information, documenting the source
          of the corpus, taxonomies used etc.;</item>
          
          <item><ref target="#chp-speaker">Chapter 5</ref> explains how and what information
          must be encoded about the persons giving the speeches and the (political)
          organisations they belong to;</item>
          
          <item><ref target="#chp-transcript">Chapter 6</ref> treats the encoding of the
          transcripts, including speeches and transcriber notes;</item>

          <item><ref target="#chp-linguistic">Chapter 7</ref> details the addition of
          linguistic annotations to the corpus;</item>

          <item><ref target="#chp-validation">Chapter 8</ref> introduces scripts to finalise,
          validate and convert a ParlaMint corpus to other formats;</item>

          <item><ref target="#chp-contributing">Chapter 9</ref> gives instructions on how to
          contribute samples of a ParlaMint corpus to GitHub;</item>

          <item><ref target="#schema">Appendix A</ref> gives the formal specification of
          the Parla-CLARIN schema.</item>
        </list>
        </p>
      </div>

      <div xml:id="chp-overall">
        <head>Overall corpus structure</head>
        
        <div xml:id="chp-xml-struct">
          <head>XML structure</head>
          
          <p>The parliamentary proceedings of one country or autonomous region constitute one
          ParlaMint corpus, which is stored as one XML document, with <gi>teiCorpus</gi> as its
          top-level element. It is composed of a <gi>teiHeader</gi>, giving the metadata for the
          corpus as a whole (further detailed in the Section on <ref target="#sec-metadata">Corpus
          metadata</ref>), followed by a series of <gi>TEI</gi> elements that each contain one
          <term>corpus component</term>, as illustrated<note>Note that this is an illustrative
          example, i.e. a valid ParlaMint corpus would also need certain attributes to be defined
          on the illustrated elements. This holds for all the examples in this section.</note>
          below:
        
          <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-docstructure">
            <![CDATA[
<!-- Corpus root -->
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>...</teiHeader>
  <TEI>...</TEI> <!-- Corpus component -->
  <TEI>...</TEI> <!-- Corpus component -->
  ...            <!-- More corpus components -->
</teiCorpus>]]>
          </egXML>
          
          Each corpus component should contain at most the transcripts for <hi>one
          day</hi>, although several components can contain the transcript for the same
          day, e.g. for different (types of) meetings. How and if these further
          subdivisions into separate components are realised is dependent on the corpus, as
          the granularity of parliamentary proceedings corpora, not to mention the national
          rules of structuring the workings of the parliament, differ substantially.</p>
          
          <p>A corpus component will thus be rooted in the <gi>TEI</gi> element, which then
          contains its metadata in its own <gi>teiHeader</gi>, followed by the
          <gi>text</gi> element, which contains the transcription of the particular
          component, as illustrated below:
          
          <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-docstructure-comp">
            <TEI>
              <teiHeader>...</teiHeader>
              <text>...</text>
            </TEI>
          </egXML>
          </p>
        
          <p>The <gi>teiHeader</gi> of a corpus component (further detailed in the Section on
          <ref target="#sec-metadata">Corpus metadata</ref>) contains the metadata specific to
          this component, along with some redundant metadata about the provenance. The
          component-specific metadata should be unique in the corpus, i.e. it should distinguish
          the component from all the other components of the corpus.</p>

        </div>
        
        <div xml:id="sec-xinclude">
          <head>Use of XInclude</head>

          <p>The fact that a corpus is one XML document does not mean that it is also stored in
          one file. In fact, ParlaMint requires that each corpus component is stored in a
          separate file, with the <term>corpus root</term>, i.e. the top-level
          <gi>teiCorpus</gi>, also stored as one file.  Furthermore, some parts of the corpus
          root metadata are also stored in separate files.</p>
          
          <p>To enable one XML document to be composed of many files, we use the <ref
          target="https://www.w3.org/TR/xinclude/">XInclude</ref> mechanism, and the
          <term>corpus root</term> uses this mechanism (i.e. the <gi>include</gi>
          elements in the XInclude namespace) to include its <term>corpus component</term>
          files, so a corpus root will be in fact encoded similarly to the following
          example:
          
          <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-docstructure-root">
          <![CDATA[
<!-- Corpus root file -->
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>...</teiHeader>
  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
      href="2014/ParlaMint-NL_2014-04-16.xml"/>  <!-- Corpus component file -->
  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
      href="2014/ParlaMint-NL_2014-04-17.xml"/>  <!-- Corpus component file -->
  ...                                            <!-- More corpus component files -->
</teiCorpus>]]>
          </egXML>
          </p>
          <p>Apart from corpus components, some parts of the overall corpus metadata (i.e. the
          <gi>teiCorpus</gi> <gi>teiHeader</gi> element) are also stored as separate files, and
          hence also included in the corpus root using the same XInclude mechanism as explained
          above.</p>
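          <p>As a minimal sketch of how this could look, the example below shows a corpus root
          <gi>teiHeader</gi> that XIncludes its metadata files. The file names follow the
          conventions of the Section on <ref target="#sec-files">File names and directory
          structure</ref>, but the choice of the NL corpus and the placement of the included
          elements inside <gi>classDecl</gi> and <gi>particDesc</gi> are assumptions made for
          illustration only:

          <egXML xmlns="http://www.tei-c.org/ns/Examples">
          <![CDATA[
<!-- Illustrative sketch only: other obligatory header elements are elided -->
<teiHeader>
  <fileDesc>...</fileDesc>
  <encodingDesc>
    ...
    <classDecl>
      <!-- Common taxonomy, shared by all ParlaMint corpora -->
      <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
          href="ParlaMint-taxonomy-parla.legislature.xml"/>
      ...
    </classDecl>
  </encodingDesc>
  <profileDesc>
    ...
    <particDesc>
      <!-- Corpus-specific lists of organisations and speakers -->
      <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
          href="ParlaMint-NL-listOrg.xml"/>
      <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
          href="ParlaMint-NL-listPerson.xml"/>
    </particDesc>
  </profileDesc>
</teiHeader>]]>
          </egXML>
          </p>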
        </div>
      
        <div xml:id="sec-files">
          <head>File names and directory structure</head>

          <p>ParlaMint has strict rules on how to name the various files that constitute a
          corpus, and how to collect them in directories.</p>
          
          <p>The file names have the following structure:
          
          <list>
            <item>The corpus root file name should start with the string
            <code>ParlaMint-</code>, followed by the ISO
            3166 country (or autonomous region) code
            (cf. Section on <ref target="#sec-standard">Standard values</ref>),
            e.g. <code>ParlaMint-NL.xml</code> or <code>ParlaMint-ES-CT.xml</code>.</item>
            
            <item>For machine-translated corpora the ISO 639 code of
            the language (cf. Section on <ref target="#sec-standard">Standard
            values</ref>) should follow the country code, e.g. <code>ParlaMint-NL-en.xml</code>.
            </item>

            <item>A corpus component filename should start with the name of the root, followed
            by an underscore and the ISO 8601 formatted date of the transcript, for example
            <code>ParlaMint-IS_2015-01-21-54.xml</code>. In case a corpus component is further
            distinguished, so that there are several components with the same date, the
            corpus compilers are free to extend the file name by a hyphen and any suffix containing
            only ASCII letters and numbers and the hyphen character,
            e.g. <code>ParlaMint-NL_2018-10-30-eerstekamer-4.xml</code> or
            <code>ParlaMint-CZ_2016-04-13-ps2013-044-02-016-098.xml</code>.</item>

            <item>Certain metadata elements from the corpus root <gi>teiHeader</gi> are stored
            in separate files, in particular the list of speakers, <gi>listPerson</gi>,
            the list of political parties and other organisations, <gi>listOrg</gi>, and
            the ParlaMint structural and linguistic taxonomies, i.e. <gi>taxonomy</gi> elements.
            The file names for such metadata files start with the name of the corpus root,
            followed by a hyphen, and then the name of the element,
            e.g. <code>ParlaMint-BE-listPerson.xml</code>. Where there are more files for
            instances of the same element name, as is the case for taxonomies, the filename
            should end with another hyphen, followed by the ID of the particular element,
            e.g. <code>ParlaMint-BE-taxonomy-UD-SYN.xml</code>. Finally, some of the
            taxonomies are not corpus-specific, i.e. identical files are used by all
            ParlaMint corpora. In this case, the country or region code is omitted,
            e.g. <code>ParlaMint-taxonomy-parla.legislature.xml</code>.
            </item>

            <item>The file names of the corpus as a whole or corpus components that have
            been automatically converted from the source XML into some other format should
            have the same name as the corpus root or components, respectively, but with
            appropriate file extensions, e.g. <code>ParlaMint-IS_2015-01-21-54.txt</code>; this
            is further explained in the Section on <ref
            target="#sec-conversion">Conversions</ref>.</item>

            <item>As discussed in the Chapter on <ref target="#chp-linguistic">Linguistic
            annotation</ref> we distinguish the linguistically annotated version of the
            corpus from the <q>plain-text</q> one, with the linguistically annotated version
            having the additional suffix <code>.ana</code> on the corpus root and
            components, e.g. <code>ParlaMint-ES-CT.ana.xml</code> or
            <code>ParlaMint-IS_2015-01-21-54.ana.xml</code>.</item>
          </list>
          </p>

          <p>For distribution the complete XML corpus should be stored in a
          directory that has the same name prefix as the corpus root file. The directory
          then contains the corpus root file and its metadata files,
          while the corpus components should be in subdirectories, one per year, for example:
          
          <eg xml:id="exa-directories"><lb/>
            <code>ParlaMint-BE.TEI/ParlaMint-BE.xml</code><lb/>
            <code>ParlaMint-BE.TEI/ParlaMint-BE-listPerson.xml</code><lb/>
            <code>ParlaMint-BE.TEI/ParlaMint-BE-listOrg.xml</code><lb/>
            <code>ParlaMint-BE.TEI/ParlaMint-taxonomy-parla.legislature.xml</code><lb/>
            <code>ParlaMint-BE.TEI/ParlaMint-taxonomy-speaker_types.xml</code><lb/>
            <code>...</code><lb/>
            <code>ParlaMint-BE.TEI/2014/ParlaMint-BE_2014-06-19.xml</code><lb/>
            <code>ParlaMint-BE.TEI/2014/ParlaMint-BE_2014-06-30.xml</code><lb/>
            <code>ParlaMint-BE.TEI/2014/ParlaMint-BE_2014-07-17.xml</code><lb/>
            <code>...</code><lb/>
            <code>ParlaMint-BE.TEI/2015/ParlaMint-BE_2015-01-06-54.xml</code><lb/>
            <code>ParlaMint-BE.TEI/2015/ParlaMint-BE_2015-01-07-54.xml</code><lb/>
            <code>ParlaMint-BE.TEI/2015/ParlaMint-BE_2015-01-08-54.xml</code><lb/>
            <code>...</code><lb/>
          </eg>

          The linguistically annotated version of the corpus is stored
          separately, with the main directory and, as mentioned, the
          corpus root and component filenames having the
          additional suffix <code>.ana</code>, e.g.

          <eg xml:id="exa-directories.ana"><lb/>
            <code>ParlaMint-BE.TEI.ana/ParlaMint-BE.ana.xml</code><lb/>
            <code>ParlaMint-BE.TEI.ana/ParlaMint-BE-listPerson.xml</code><lb/>
            <code>ParlaMint-BE.TEI.ana/ParlaMint-BE-listOrg.xml</code><lb/>
            <code>ParlaMint-BE.TEI.ana/ParlaMint-taxonomy-parla.legislature.xml</code><lb/>
            <code>ParlaMint-BE.TEI.ana/ParlaMint-taxonomy-speaker_types.xml</code><lb/>
            <code>ParlaMint-BE.TEI.ana/ParlaMint-taxonomy-NER.xml</code><lb/>
            <code>ParlaMint-BE.TEI.ana/ParlaMint-taxonomy-UD.xml</code><lb/>
            <code>...</code><lb/>
            <code>ParlaMint-BE.TEI.ana/2014/ParlaMint-BE_2014-06-19.ana.xml</code><lb/>
            <code>ParlaMint-BE.TEI.ana/2014/ParlaMint-BE_2014-06-30.ana.xml</code><lb/>
            <code>ParlaMint-BE.TEI.ana/2014/ParlaMint-BE_2014-07-17.ana.xml</code><lb/>
            <code>...</code><lb/>
            <code>ParlaMint-BE.TEI.ana/2015/ParlaMint-BE_2015-01-06-54.ana.xml</code><lb/>
            <code>ParlaMint-BE.TEI.ana/2015/ParlaMint-BE_2015-01-07-54.ana.xml</code><lb/>
            <code>ParlaMint-BE.TEI.ana/2015/ParlaMint-BE_2015-01-08-54.ana.xml</code><lb/>
            <code>...</code><lb/>
          </eg>
          </p>
        </div>
      </div>
      
      <div xml:id="chp-general">
        <head>General requirements</head>
        
        <p>This section gives some general requirements a ParlaMint corpus has to meet,
        in particular those relating to the characters in a corpus, and the use of
        standards. It also details the structure of the file names of the ParlaMint root
        and component files, as well as the attributes expected on the <gi>teiCorpus</gi>
        and <gi>TEI</gi> tags.</p>
        
        <div xml:id="sec-chars">
          <head>Characters</head>
          
          <p>The corpus should be encoded in Unicode, using the UTF-8 character encoding, at
          least for European languages.  In cases where the original contains characters from
          the Unicode Private Use Area, these should, if possible, be given their closest
          Unicode equivalents or substituted by the Unicode replacement character
          U+FFFD. End-of-line hyphens, if present in the source files, should be removed, and
          the split words joined in order to enhance searching the corpus and to simplify
          linguistic processing.</p>
          
          <p>The following characters, esp. prevalent when the source documents were in Word or
          HTML, deserve special mention:
          
          <list>
            <item>TAB (U+0009) character helps the alignment of strings on successive
            lines.  As ParlaMint is not interested in preserving the layout, all TAB
            characters are substituted by space characters (U+0020).</item>
            
            <item>NO-BREAK SPACE (U+00A0) prevents, with some applications, an automatic
            line break at its position and also collapsing such consecutive characters
            into a single space. As the use of this character complicates (or breaks)
            further processing, esp. linguistic annotation, these characters should be 
            substituted by the normal space character (U+0020). The same holds for other
            variants of spaces (U+2000 - U+200A), which are, however, used much less
            frequently.</item>
            
            <item>NON-BREAKING HYPHEN (U+2011), similarly to NO-BREAK SPACE, prevents a
            line break, in this case following its position. With a similar reasoning as
            above, this character should be substituted by the normal hyphen character
            ('-', U+002D).</item>
            
            <item>SOFT HYPHEN (U+00AD) indicates that a word can be hyphenated at that
            point. Occurrences of this character should be removed from the corpus.</item>
          </list>
          </p>
          
          <p>Text-bearing elements should also not start or end with space characters, and
          sequences of whitespace characters should be changed into a single space.</p>
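          <p>The substitutions described in this section could be implemented, for example,
          with an XSLT 2.0 stylesheet along the following lines. This is only a sketch and not
          a script distributed with ParlaMint; it does not trim spaces at the start or end of
          text-bearing elements, and it does not join words split by end-of-line hyphens:

          <egXML xmlns="http://www.tei-c.org/ns/Examples">
          <![CDATA[
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy everything that is not handled below -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="text()">
    <!-- TAB and NO-BREAK SPACE to space, NON-BREAKING HYPHEN to '-',
         SOFT HYPHEN removed (it has no counterpart in the third argument) -->
    <xsl:variable name="mapped"
        select="translate(., '&#x9;&#xA0;&#x2011;&#xAD;', '  -')"/>
    <!-- Other space variants (U+2000 - U+200A) to space -->
    <xsl:variable name="spaced"
        select="replace($mapped, '[&#x2000;-&#x200A;]', ' ')"/>
    <!-- Sequences of whitespace to a single space -->
    <xsl:value-of select="replace($spaced, '\s+', ' ')"/>
  </xsl:template>

</xsl:stylesheet>]]>
          </egXML>
          </p>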
        </div>
        
        <div xml:id="sec-standard">
          <head>Standard values</head>
          
          <p>Whenever possible, ParlaMint uses standards for information coding. In
          particular, the following information must be standardised:
          <list>
            <item xml:id="iso3166">As the identity of a ParlaMint corpus is determined by the country or
            region of the particular parliament, its code appears in many places. For
            specifying these codes, the <ref
            target="https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes">ISO
            3166</ref> standard should be used, in particular <ref
            target="https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2">ISO 3166-1
            alpha-2</ref> for the two letter codes of the countries (for national
            parliaments) and <ref target="https://en.wikipedia.org/wiki/ISO_3166-2">ISO
            3166-2</ref> for the codes of country subdivisions (for parliaments of
            autonomous provinces). So, for example, the country code for Spain is "ES",
            while the code for the autonomous Basque community is "ES-PV". Note that we
            use the term <term>regional parliaments</term> for such cases.</item>
            
            <item>The codes for the languages used in the corpora (i.e. the possible
            values of the <att>xml:lang</att> attribute) should follow <ref
            target="https://tools.ietf.org/html/bcp47">BCP 47</ref> (cf. also <q><ref
            target="https://www.w3.org/International/questions/qa-when-xmllang">xml:lang
            in XML document schemas</ref></q>). Essentially, this means that the value for
            a language code should have two letters, following <ref
            target="https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes">ISO 639-1</ref>
            or, and only if a two letter code does not exist for a language, the
            three-letter <ref target="https://en.wikipedia.org/wiki/ISO_639-2">ISO
            639-2/T</ref> code. For example, the code for Basque is 'eu'. ParlaMint
            corpora will use at least two languages, i.e. the language that the
            transcriptions are written in, which we will call the <term>local
            language</term> and English, as the meta-language, which is (also) used in
            the metadata.
            </item>
            
            <item>Temporal, i.e. time-related information is typically stored in the
            <att>when</att>, <att>from</att> and <att>to</att> attributes of various
            elements. To specify a date or time as the value of these attributes,
            formatting according to the <ref
            target="https://en.wikipedia.org/wiki/ISO_8601">ISO 8601</ref> standard should
            be used, e.g. <val>2022-04-01</val> for the 1st of April 2022. More
            information on temporal attributes is given in the Section on <ref
            target="#sec-temporal">Temporal attributes</ref>.</item>
          </list>
          </p>
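          <p>The fragment below sketches how these standardised values might appear in
          practice; the elements and their content are hypothetical and serve only to show the
          format of the attribute values:

          <egXML xmlns="http://www.tei-c.org/ns/Examples">
            <!-- BCP 47 language codes: local language and English -->
            <orgName xml:lang="eu">Eusko Legebiltzarra</orgName>
            <orgName xml:lang="en">Basque Parliament</orgName>
            ...
            <!-- ISO 8601 date -->
            <date when="2022-04-01">1 April 2022</date>
          </egXML>
          </p>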
        </div>
        
        <div xml:id="sec-roottags">
          <head>Attributes of top-level elements</head>
          
          <p>The Chapter on <ref target="#chp-overall">Overall corpus structure</ref>
          introduced the top level elements of the corpus root file and of the component
          files (i.e. the <gi>teiCorpus</gi> and <gi>TEI</gi> elements), but did not
          elaborate on their attributes; these are presented in this section.</p>

          <p>The corpus root has three required attributes, as shown below:
          
          <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-teiCorpusRoot">
            <![CDATA[
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0" 
           xml:id="ParlaMint-FR" 
           xml:lang="fr">]]>
          </egXML>

          All three attributes can also be used on any other element, and are thus of
          special importance:
          <list>
            <item><att>xmlns</att> determines the namespace of the element, and this should
            always be the TEI namespace, i.e. <val>http://www.tei-c.org/ns/1.0</val>. Note
            that all lower level elements in the same file inherit this namespace, so it is
            not necessary (although it is not an error) for other elements to also define
            their namespace.</item>
            
            <item><att>xml:id</att> is an attribute from the (implicitly assumed) XML
            namespace, and gives the identifier for the corpus root or component.  The
            value of an ID should be unique in the corpus as a whole and should obey
            format requirements as defined by <ref
            target="https://www.w3.org/TR/xml-id/">W3C</ref>.
            
            For the corpus root, as well as for the components, it is required that this
            top level identifier is identical to the file name (without the file
            extension).

            The <att>xml:id</att> is a global attribute, so any element can have
            it. While this is not required, it is necessary for any element that
            is then referred to (via this same ID) by some other element, such
            as many elements in the <gi>teiHeader</gi>, as is explained in the
            Section on <ref target="#sec-metadata">Corpus metadata</ref>.

            The subordinate elements in the transcription that have an ID (such as
            utterances and segments), are recommended to have the top level <att>xml:id</att> as a
            prefix and to indicate the element name in the ID. For example, if the top
            level ID is <val>ParlaMint-GB_2021-01-06-lords</val>, the first utterance
            would have the ID <val>ParlaMint-GB_2021-01-06-lords.u1</val> and the first
            segment <val>ParlaMint-GB_2021-01-06-lords.seg1</val>. The number of the
            element should not have leading zeros.</item>
            
            <item><att>xml:lang</att> is also a global attribute and gives the language
            code of the text content of the element; for the corpus root this does not
            (just) mean the content of its TEI header, but primarily the textual content
            of its XIncluded components. The convention is that the language of the text
            content of an element is determined by the value of the nearest
            <att>xml:lang</att> attribute on the element itself or on its ancestors. In cases
            where the content is multilingual, the language code should be that of the majority
            language. When the proportion of the languages is about equal, then the
            <val>mul</val> code for multiple languages can also be used.</item>
          </list>
          </p>

          <p>A corpus component has the same three required attributes and,
          additionally, the <att>ana</att> attribute:
          
          <!-- The info in @ana should actually go in the
               TEI/teiHeader/profileDesc/textClass - but it is probably too late to change
               this now -->
          <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-teiCorpusComp">
            <![CDATA[
<TEI xmlns="http://www.tei-c.org/ns/1.0" 
     xml:id="ParlaMint-FR_2017-07-04-E1001" 
     xml:lang="fr" 
     ana="#parla.sitting #reference">]]>
          </egXML>

          The same as for the corpus root, the component also sets the TEI namespace, and
          gives the language of its textual content, while its <att>xml:id</att>, of
          course, identifies the particular component. The <att>ana</att> attribute is a
          pointing attribute, and we introduce these attributes in the next
          section.</p>
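          <p>The requirement that the top-level <att>xml:id</att> be identical to the file
          name can be checked automatically. The following Schematron sketch illustrates one
          possible way of doing so; it is not part of the ParlaMint schema, and it assumes an
          XSLT 2.0 query binding and that each file is validated separately, so that
          <code>base-uri()</code> returns the location of the file being checked:

          <egXML xmlns="http://www.tei-c.org/ns/Examples">
          <![CDATA[
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron"
            queryBinding="xslt2">
  <sch:ns prefix="tei" uri="http://www.tei-c.org/ns/1.0"/>
  <sch:pattern>
    <!-- The root element of a corpus root file or of a component file -->
    <sch:rule context="/tei:teiCorpus | /tei:TEI">
      <!-- File name without the directory path and the .xml extension -->
      <sch:let name="basename"
          value="replace(tokenize(base-uri(.), '/')[last()], '\.xml$', '')"/>
      <sch:assert test="@xml:id = $basename">
        The xml:id "<sch:value-of select="@xml:id"/>" should be identical to
        the file name "<sch:value-of select="$basename"/>".
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>]]>
          </egXML>
          </p>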
        </div>

        <div xml:id="sec-pointing">
          <head>Pointing attributes</head>

          <p>The ParlaMint encoding uses pointing attributes for a number of purposes,
          e.g. for references to taxonomy categories, to speaker metadata, or to
          linguistic categories.</p>

          <p>While a few elements have dedicated pointing attributes, there are three
          generally used ones. They share the characteristics that they are all used by a
          large number of different elements and that their value is a series of pointers,
          i.e. a white-space delimited sequence of references to the values of some
          <att>xml:id</att> attribute in the corpus or, in general, to a URI.
          The three attributes are:

          <list>
            <item><att>ana</att> serves to provide an analysis or to classify an element
            according to some pre-determined vocabulary. In ParlaMint the target element will
            typically be a category in a taxonomy, an event or date, or an organisation.</item>
            <item><att>corresp</att> points to items that correspond to the current
            element in some way, e.g. to associate the (URL of a) media file with a
            page break.
            </item>
            <item><att>ref</att> provides an explicit reference to the full definition or
            identity for the entity being named. In ParlaMint it is used e.g. for
            connecting a person's affiliation with a particular organisation. The value
            of this attribute is often, but not always, a URL, e.g. for associating a
            place name with its GeoNames URL.</item>
          </list>
          </p>

          <p>To illustrate, the example below gives some elements that contain one or more of these
          attributes:
            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-pointing">
              <meeting ana="#parla.upper #parla.term #LEG.18">18 Legislatura</meeting>
              ...
              <affiliation ref="#group.L-SP-PSd.Az" role="member" ana="#LEG.18" from="2018-03-27"/>
              ...
              <placeName ref="https://www.geonames.org/2523918">Palermo</placeName>
              ...
              <link ana="ud-syn:det" target="#ParlaMint-IT.seg1.2.6 #ParlaMint-IT.seg1.2.5"/>
            </egXML>
            
          The first example, the <gi>meeting</gi> element, is classified (the
          definitions are given in the relevant taxonomy) as a meeting of the upper house,
          in the scope of a parliamentary term, specifically the XVIII Legislative Term.
          The example with <gi>affiliation</gi> (again, the definitions are given in the
          elements with the pointed-to IDs) specifies that the (person that has this)
          affiliation is a member of the parliamentary group <q>Lega-Salvini
          Premier-Partito Sardo d'Azione</q> in the scope of the XVIII Legislative
          Term. The <gi>placeName</gi> example identifies Palermo by the URL of its
          entry in the GeoNames database. Finally, the <gi>link</gi> example
          illustrates a Universal Dependencies determiner syntactic link between two
          tokens. The link uses the TEI extended pointer syntax, further explained in the
          Section on <ref target="#sec-ana-prefixDef">Prefix definitions</ref>.</p>

          <p>It is often difficult to decide which of these attributes to use for a
          particular pointer, so the usage examples given with the relevant element
          should always be consulted.</p>
        </div>

        <div xml:id="sec-temporal">
          <head>Temporal attributes</head>

          <p>ParlaMint makes extensive use of temporal information, e.g. to determine when
          a session took place or the period when a certain person was an MP. As mentioned
          in the Section on <ref target="#sec-standard">Standard values</ref>, the <ref
          target="https://en.wikipedia.org/wiki/ISO_8601">ISO 8601</ref> format should be
          used to specify the dates or times.</p>
          
          <p>The following attributes are used to specify temporal information:

          <list>
            <item>The <att>when</att> attribute is used when the temporal information
            refers to a point in time, typically a date, and is used e.g. to give the date
            when the corpus was published, or when a change in the corpus was made.</item>
            
            <item>The <att>from</att> and <att>to</att> attributes give the starting and
            ending date or time of an interval, e.g. the time period the corpus covers, or
            the period when a person was an MP. If only one of the two attributes is
            present, then the assumption is that the interval extends back at least to the
            start (if <att>from</att> is missing) or forward at least to the end (if
            <att>to</att> is missing) of the time period that the particular ParlaMint
            corpus covers. Similarly, if both attributes are missing, the assumption is
            that the interval covers the complete time period of the ParlaMint corpus.</item>
            
          </list>
          </p>
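
          <p>A minimal sketch of these attributes, reusing values from examples given
          elsewhere in this document and combined here for illustration only: the first
          <gi>date</gi> gives a point in time, the second an interval, and the
          <gi>affiliation</gi>, having no <att>to</att>, is assumed to extend at least
          to the end of the period covered by the corpus:

          <egXML xmlns="http://www.tei-c.org/ns/Examples">
            <!-- a point in time -->
            <date when="2021-06-11">11. 6. 2021</date>
            ...
            <!-- a closed interval -->
            <date from="2014-08-01" to="2020-07-16">1.8.2014 - 16.7.2020</date>
            ...
            <!-- an interval with no stated end -->
            <affiliation ref="#group.L-SP-PSd.Az" role="member" from="2018-03-27"/>
          </egXML>
          </p>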
        </div>
        
      </div>
      
      <div xml:id="sec-metadata">
        <head>Corpus metadata</head>
        
        <p>As mentioned, <gi>teiCorpus</gi> and <gi>TEI</gi> elements contain the obligatory
        <gi>teiHeader</gi> element, which stores the metadata of the corpus root or
        component. In this section we explain and give examples of the required and optional
        metadata contained in the <gi>teiHeader</gi>, proceeding through its various
        elements and distinguishing which parts and what content are appropriate for
        the corpus root, and which for a corpus component.</p>

        <p>As a general remark, most metadata contains free text. ParlaMint requires
        that this text is given in English, to help researchers from other countries
        understand it, and recommends that it is also given in the local language in
        which the (main portion of the) parliamentary transcripts is written, so that
        local researchers can use it in their native tongue.</p>

        <p>A ParlaMint <gi>teiHeader</gi> contains three obligatory elements: the file
        description, <gi>fileDesc</gi>, the encoding description, <gi>encodingDesc</gi>, and
        the profile description, <gi>profileDesc</gi>, and an optional revision description,
        <gi>revisionDesc</gi>:
        
        <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-teiHeader">
          <teiHeader>
            <fileDesc>...</fileDesc>
            <encodingDesc>...</encodingDesc>
            <profileDesc>...</profileDesc>
            <revisionDesc>...</revisionDesc>
          </teiHeader>
        </egXML>
        Below we explain each of these elements in turn.
        </p>
        
        <div xml:id="sec-fileDesc">
          <head>File description</head>
          <p>The file description, <gi>fileDesc</gi>, is composed of five obligatory
          elements, namely the title statement, <gi>titleStmt</gi>, the edition statement,
          <gi>editionStmt</gi>, the extent, <gi>extent</gi>, the publication statement,
          <gi>publicationStmt</gi>, and the source description, <gi>sourceDesc</gi>:
          
          <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-fileDesc">
            <fileDesc>
              <titleStmt>...</titleStmt>
              <editionStmt>...</editionStmt>
              <extent>...</extent>
              <publicationStmt>...</publicationStmt>
              <sourceDesc>...</sourceDesc>
            </fileDesc>
          </egXML>
          </p>
          
          <div xml:id="sec-titleStmt">
            <head>Title statement</head>
            
            <p>The title statement, <gi>titleStmt</gi>, gives the title of the corpus root or
            component, along with the specification of the particular session(s) of the
            parliament contained, the persons responsible for compiling the corpus, and the
            funder(s) of the project.</p>

            <p>This structure is exemplified by the following <hi>corpus root</hi> title
            statement:
            
            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-titleStmtRoot">
              <titleStmt>
                <title type="main">Slovenski parlamentarni korpus ParlaMint-SI [ParlaMint]</title>
                <title type="main" xml:lang="en">Slovenian parliamentary corpus ParlaMint-SI [ParlaMint]</title>
                <title type="sub">Zapisi sej Državnega zbora Republike Slovenije, 7. in 8. mandat (2014 - 2020)</title>
                <title type="sub" xml:lang="en">Minutes of the National Assembly of the Republic of Slovenia, Term 7 and 8 (2014 - 2020)</title>
                <meeting n="7" corresp="#DZ" ana="#parla.lower #parla.term #DZ.7">7. mandat</meeting>
                <meeting n="8" corresp="#DZ" ana="#parla.lower #parla.term #DZ.8">8. mandat</meeting>
                <respStmt>
                  <persName ref="https://orcid.org/0000-0001-6143-6877">Andrej Pančur</persName>
                  <persName ref="https://orcid.org/0000-0002-1560-4099">Tomaž Erjavec</persName>
                  <resp>Kodiranje ParlaMint TEI XML</resp>
                  <resp xml:lang="en">ParlaMint TEI XML corpus encoding</resp>
                </respStmt>
                <funder>
                  <orgName>Raziskovalna infrastruktura CLARIN</orgName>
                  <orgName xml:lang="en">The CLARIN research infrastructure</orgName>
                </funder>
                <funder>
                  <orgName>Slovenska raziskovalna infrastruktura CLARIN.SI</orgName>
                  <orgName xml:lang="en">The Slovenian research infrastructure CLARIN.SI</orgName>
                </funder>
              </titleStmt>
            </egXML>
            
            The title statement starts with two titles (one main, the other subordinate), both
            in English and the local language, with the appropriate language code possibly
            inherited from a superordinate element. They are distinguished by the value
            <val>main</val> or <val>sub</val> of their <att>type</att> attribute and the
            value of their <att>xml:lang</att> attribute.</p>
            
            <p>The main title has a formulaic structure <q>&lt;Country name&gt;
            parliamentary corpus ParlaMint-&lt;Country code&gt; [ParlaMint]</q>, with an
            equivalent structure for the local language. Note that the corpus <q>stamp</q>
            in square brackets can also be <q>[ParlaMint.ana]</q> for the linguistically
            annotated version of the corpus (as explained in the Chapter on <ref
            target="#chp-linguistic">Linguistic annotation</ref>) or <q>[ParlaMint
            SAMPLE]</q> for corpus data samples, as available on the <ref
            target="https://github.com/clarin-eric/ParlaMint/">ParlaMint GitHub
            repository</ref>.</p>
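
            <p>Applying this formula to the Slovenian example above, the main titles of
            the linguistically annotated version of the corpus could be sketched as
            follows; these titles are derived here from the formula for illustration and
            are not copied from a released corpus:

            <egXML xmlns="http://www.tei-c.org/ns/Examples">
              <title type="main">Slovenski parlamentarni korpus ParlaMint-SI [ParlaMint.ana]</title>
              <title type="main" xml:lang="en">Slovenian parliamentary corpus ParlaMint-SI [ParlaMint.ana]</title>
            </egXML>
            </p>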

            <p>The subordinate title, in contrast to the main one, is free text, and usually
            formed on the basis of the source of the corpus. As with the main one, it should
            be given in both languages.</p>

            <p>After the titles comes the specification of the particular sessions that the
            corpus contains, encoded as <gi>meeting</gi> elements: the two meeting
            elements in the above example state that the ParlaMint-SI corpus contains the
            meetings of the 7th and 8th terms of the lower house of the National Assembly
            of the Republic of Slovenia. The <gi>meeting</gi> elements can give, as the
            value of their <att>n</att> attribute, the numbers of the meetings that the
            corpus covers, and their text content can give a free-text description of the
            meetings in the local language.</p>

            <p>The formal information on the meetings is given in the values of the
            <att>corresp</att> and <att>ana</att> attributes, which are pointing
            attributes, as already explained in the Section on <ref
            target="#sec-pointing">Pointing attributes</ref>. Here they refer
            to the definition of organisations further explained in the Section on <ref
            target="#sec-orgs">Organisations</ref> and the categories of taxonomy
            elements, further explained in the Section on the <ref
            target="#sec-classDecl">Class declaration</ref>. The value of the
            <att>corresp</att> attribute points to the governmental body that held the
            meeting (in this case the National Assembly
            of the Republic of Slovenia), while the <att>ana</att> attribute contains a
            space-delimited sequence of pointers: <val>#parla.lower</val> points to the
            definition of the lower house, <val>#parla.term</val> to the definition of a
            parliamentary term, and <val>#DZ.7</val> to the definition of the seventh
            term.
            <!-- This is silly, why two attributes, why DZ in one, lower in the other? 
                 Did we have a reason for such an encoding? Maybe ask Andrej
                 We could post an issue. -->
            </p>
            
            <p>Next come one or more responsibility statements, <gi>respStmt</gi>, each
            containing one or more person names, <gi>persName</gi>, with an optional
            <att>ref</att> attribute giving the URL where more information about the
            person can be found, and the responsibility element, <gi>resp</gi>, which
            states what the named persons are responsible for.</p>

            <p>In a similar manner, the <gi>funder</gi> elements give information on the
            organisations which have financially contributed to the compilation of the corpus,
            with the names of the organisations given in the <gi>orgName</gi> elements.</p>
            
            <p>A <hi>corpus component</hi> has a very similar title statement to the
            corpus root, except that certain elements specify the metadata of the
            component, rather than the complete corpus. It also contains some redundant
            metadata, in particular the responsibility statement and the funder, as
            illustrated in the example below:
            
            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-titleStmtComp">
              <titleStmt>
                <title type="main">Slovenski parlamentarni korpus ParlaMint-SI, izredna seja 59 [ParlaMint]</title>
                <title type="main" xml:lang="en">Slovenian parliamentary corpus ParlaMint-SI, Extraordinary Session 59 [ParlaMint]</title>
                <title type="sub">Zapisi sej Državnega zbora Republike Slovenije, 7. mandat, 59. izredna seja, 13.4.2018</title>
                <title type="sub" xml:lang="en">Minutes of the National Assembly of the Republic of Slovenia, Term 7, Extraordinary Session 59, 13.4.2018</title>
                <meeting n="59"
                         corresp="#DZ"
                         ana="#parla.lower #parla.meeting.extraordinary">Izredna</meeting>
                <meeting n="7" corresp="#DZ" ana="#parla.lower #parla.term #DZ.7">7. mandat</meeting>
                <respStmt>
                  <persName>Andrej Pančur</persName>
                  <resp>Kodiranje TEI</resp>
                  <resp xml:lang="en">TEI corpus encoding</resp>
                </respStmt>
                <funder>
                  <orgName>Raziskovalna infrastruktura CLARIN</orgName>
                  <orgName xml:lang="en">The CLARIN research infrastructure</orgName>
                </funder>
                <funder>
                  <orgName>Slovenska raziskovalna infrastruktura CLARIN.SI</orgName>
                  <orgName xml:lang="en">The Slovenian research infrastructure CLARIN.SI</orgName>
                </funder>
              </titleStmt>
            </egXML>
            
            In the example it can be seen that the main title of a corpus component is simply
            an extension of the corpus root title, as it also gives the name of the particular
            meeting that the component contains, while the subordinate title is, again, free
            text. Both titles must be unique in the complete corpus.</p>

            <p>The other difference is in the <gi>meeting</gi> elements, which here specify
            the particular meeting transcribed in the corpus component. In the example above,
            this is an extraordinary meeting of the lower house in the seventh term of the
            National Assembly of the Republic of Slovenia.
            </p>
          </div>
          
          <div xml:id="sec-editionStmt">
            <head>Edition statement</head>
            
            <p>ParlaMint corpora have their edition statement, <gi>editionStmt</gi>, both in
            the corpus root and components. As illustrated below, the only element it contains
            is <gi>edition</gi>:
            
            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-editionStmt">
              <editionStmt>
                <edition>3.0</edition>
              </editionStmt>
            </egXML>
            
            We use semantic versioning to specify the version of the corpus, i.e. a
            version number of the form major.minor, where a new major version means
            substantial changes to the corpus, while a new minor version is reserved for
            e.g. correcting errata or other minor changes. We do not use the patch number.
            It should be noted that, at least so far, all the ParlaMint corpora were
            released together, so that they are all of the same edition, i.e. have the
            same version number. At the time of writing,
            the latest version is 2.1, with the next one planned to be 3.0.</p>
          </div>

          <div xml:id="sec-extent">
            <head>Extents</head>
            
            <p>The <gi>extent</gi> element gives information on selected sizes of the
            complete corpus (in the corpus root) or of one corpus component, as illustrated
            below in the case of a corpus root extent:
            
            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-extent">
              <extent>
                <measure unit="speeches" quantity="75122" xml:lang="sl">75.122 govorov</measure>
                <measure unit="speeches" quantity="75122" xml:lang="en">75,122 speeches</measure>
                <measure unit="words" quantity="20190034" xml:lang="sl">20.190.034 besed</measure>
                <measure unit="words" quantity="20190034" xml:lang="en">20,190,034 words</measure>
              </extent>
            </egXML>
            
            ParlaMint requires two sizes to be given, namely the number of speeches and
            the number of words, each in both languages. The two sizes are distinguished
            by their <att>unit</att> attribute. The exact quantity is given in the
            <att>quantity</att> attribute, while the text content of <gi>measure</gi>
            gives the quantity together with the unit; if possible, the number here
            should contain the thousands separator appropriate for the language.</p>

            <p>It should be noted that both sizes are somewhat complex to compute and are
            inserted into the TEI headers in the finalisation of a corpus (cf. the Section on
            <ref target="#sec-final">Finalisation of corpora</ref>) by a common script,
            so it is not necessary to insert the extent in the process of developing a ParlaMint
            corpus.</p>
          </div>

          <div xml:id="sec-publicationStmt">
            <head>Publication statement</head>

            <p>The publication statement <gi>publicationStmt</gi> must appear in the corpus
            root as well as, in identical form, in the corpus components. As illustrated
            below, it contains information about the publisher of the corpus, the persistent
            identifier where the complete corpus can be found, under which licence it is
            distributed, and when it was released:
            
            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-publicationStmt">
              <publicationStmt>
                <publisher>
                  <orgName xml:lang="sl">Raziskovalna infrastruktura CLARIN</orgName>
                  <orgName xml:lang="en">CLARIN research infrastructure</orgName>
                  <ref target="https://www.clarin.eu/">www.clarin.eu</ref>
                </publisher>
                <idno type="URI" subtype="handle">http://hdl.handle.net/11356/1432</idno>
                <availability status="free">
                  <licence>http://creativecommons.org/licenses/by/4.0/</licence>
                  <p xml:lang="sl">To delo je ponujeno pod
                  <ref target="http://creativecommons.org/licenses/by/4.0/">Creative Commons Priznanje avtorstva 4.0
                  mednarodna licenca</ref>.</p>
                  <p xml:lang="en">This work is licensed under the
                  <ref target="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0
                  International License</ref>.</p>
                </availability>
                <date when="2021-06-11">11. 6. 2021</date>
              </publicationStmt>
            </egXML>
            
            The <gi>publisher</gi> is, at least for the corpora produced in the scope of the
            CLARIN ParlaMint project, the CLARIN research infrastructure, and the element
            also gives the home page of the infrastructure.

            The <q>identifier number</q> element, <gi>idno</gi>, specifies via its
            <att>type</att> and <att>subtype</att> attributes with fixed values
            <val>URI</val> and <val>handle</val> that the identifier is a handle, and
            contains the handle where the complete corpus corresponding to the specified
            version can be found.

            The <gi>availability</gi> element specifies, via its <gi>licence</gi> element,
            the fixed-value CC BY 4.0 URL, and in the following paragraphs gives a prose
            description of the licence, including its URL via the <att>target</att> attribute
            of <gi>ref</gi>. As usual, the textual information is given in both languages.

            Finally, the <gi>date</gi> gives the date of the release, where the
            <att>when</att> attribute gives the date in the ISO 8601 format, while the textual
            content can give it according to the conventions used in the local language.
            </p>
          </div>
          
          <div xml:id="sec-sourceDesc">
            <head>Source description</head>

            <p>The source description <gi>sourceDesc</gi> of the <hi>corpus root</hi> encodes
            the original digital source of the ParlaMint corpus in the <gi>bibl</gi> element,
            as shown in the following example:

            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-sourceDescRoot">
              <sourceDesc>
                <bibl>
                  <title type="main" xml:lang="sl">Zapisi sej Državnega zbora Republike Slovenije</title>
                  <title type="main" xml:lang="en">Minutes of the National Assembly of the Republic of Slovenia</title>
                  <idno type="URI">https://www.dz-rs.si</idno>
                  <date from="2014-08-01" to="2020-07-16">1.8.2014 - 16.7.2020</date>
                </bibl>
              </sourceDesc>
            </egXML>
            
            <!-- But what about if the corpus was made from a previous corpus, or if they
                 received the dump directly from the government? Have a look at the French
                 data! -->
            
            Apart from the bilingual <gi>title</gi>s, it should also give, in an <gi>idno</gi>
            with the <att>type</att> fixed to <val>URI</val>, the government URL from which the
            transcripts were originally harvested, while the dates of the earliest and latest
            transcript in the corpus are indicated by the <att>from</att> and <att>to</att>
            attributes of the <gi>date</gi> element. As usual, the values of these attributes
            should be according to ISO 8601, while the textual content can be formatted
            according to the local rules for writing dates.</p>

            <p>For <hi>corpus components</hi> the source description is very similar to the
            one for the corpus root, except that the <gi>title</gi> can be modified to
            constrain the description to the exact meeting the component contains.

            The <gi>date</gi> element must, of course, specify the exact date when the
            meeting took place.

            If the transcription of the meeting is available on the Web, the <gi>idno</gi>
            should give this URL.

            Furthermore, if the audio or video of the meeting is available, this information
            can be given in the <gi>recordingStmt</gi>, as illustrated in the example below:

            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-sourceDescComp">
              <sourceDesc>
                <bibl>
                  <title type="main" xml:lang="cs">Parlament České republiky, Poslanecká sněmovna</title>
                  <title type="main" xml:lang="en">Parliament of the Czech Republic, Chamber of Deputies</title>
                  <idno type="URI">https://www.psp.cz/eknih/2013ps/stenprot/044schuz/s044033.htm</idno>
                  <date when="2016-04-13">13.04.2016</date>
                </bibl>
                <recordingStmt>
                  <recording type="audio">
                    <media xml:id="ps2013-044-02-000-000.audio1" mimeType="audio/mp3" source="https://www.psp.cz/eknih/2013ps/audio/2016/04/13/2016041308580912.mp3" url="2013ps/audio/2016/04/13/2016041308580912.mp3"/>
                  </recording>
                </recordingStmt>
              </sourceDesc>
            </egXML>
            
            <!-- The exemplified optional audio (or video) recording information is meant for
            cases where the media file or, in general, URL encompasses the complete
            transcription of the corpus component; for cases when the media are further
            segmented, e.g. by speech, a different encoding is used, which is further
            explained in the Section <ref target="#sec-audio">Audio</ref>. -->
            </p>

            <p>As the example shows, the recording statement contains a <gi>recording</gi>
            element, which specifies the <att>type</att> of the recording (<val>audio</val> or
            <val>video</val>), and then contains a <gi>media</gi> element giving the ID of the
            file, its <att>mimeType</att>, the URL of the <att>source</att> of the recording
            (typically the official governmental site for parliamentary proceedings) and, in
            <att>url</att>, the location of the local (possibly processed) copy of the file.
            This can be a local file, even though it will not be distributed together with
            the ParlaMint corpus or, better, a Web-based file at a stable location.</p>
          </div>
        </div>
        
        <div xml:id="sec-encodingDesc">
          <head>Encoding description</head>
          
          <p>The encoding description <gi>encodingDesc</gi> of the <hi>corpus root</hi>
          contains the following elements:
          
          <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-encodingDescRoot">
            <encodingDesc>
              <projectDesc>...</projectDesc>
              <editorialDecl>...</editorialDecl>
              <tagsDecl>...</tagsDecl>
              <classDecl>...</classDecl>
            </encodingDesc>
          </egXML>
          </p>
          
          <p>In contrast, the encoding description of a <hi>corpus component</hi> contains
          only two elements, namely (and redundantly) the <gi>projectDesc</gi> and the
          <gi>tagsDecl</gi>.</p>
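
          <p>Following this description, the encoding description of a corpus component
          can be sketched as the skeleton below, which is given for illustration only and
          simply mirrors the corpus root example with the two root-only elements left out:

          <egXML xmlns="http://www.tei-c.org/ns/Examples">
            <encodingDesc>
              <projectDesc>...</projectDesc>
              <tagsDecl>...</tagsDecl>
            </encodingDesc>
          </egXML>
          </p>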

          <div xml:id="sec-projectDesc">
            <head>Project description</head>
            
            <p>The project description <gi>projectDesc</gi> of the <hi>corpus root</hi>
            contains a short description of the project in the scope of which the corpus was
            compiled:
            
            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-projectDesc">
              <projectDesc>
                <p xml:lang="sl">Glavni cilji projekta <ref
                target="https://www.clarin.eu/content/parlamint">ParlaMint</ref> so
                (1) izdelati večjezično množico na enak način kodiranih korpusov
                zapiskov parlamentarnih sej, (2) jezikoslovno označiti te korpuse; (3)
                narediti korpuse dostopne za prevzem in prek konkordančnikov; in (4)
                pripraviti primere uporabe korpusov v politologiji in digitalni
                humanistiki.</p>
                <p xml:lang="en">The <ref
                target="https://www.clarin.eu/content/parlamint">ParlaMint</ref>
                project aims to (1) create a multilingual set of uniformly encoded
                comparable corpora of parliamentary proceedings (2) process the
                corpora linguistically; (3) make the corpora available for download
                and through concordancers; and (4) build use cases in Political
                Sciences and Digital Humanities based on the corpus data.</p>
              </projectDesc>
            </egXML>

           The description above is written for the CLARIN ParlaMint project and the English
           language part can be used as is in the produced corpora for version 3.
            </p>
          </div>
          
          <div xml:id="sec-editorialDecl">
            <head>Editorial declaration</head>
            
            <p>The editorial declaration, <gi>editorialDecl</gi>, is used only in the
            corpus root and contains prose descriptions of the editorial decisions made in
            the process of compiling the corpus, along several dimensions, in
            particular which types, if any, of <gi>correction</gi>,
            <gi>normalization</gi>, <gi>quotation</gi>, <gi>hyphenation</gi>, and
            <gi>segmentation</gi> were performed on the source texts of the corpus. The
            example below illustrates the use of these elements:
            
            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-editorial">
              <editorialDecl>
                <correction>
                  <p xml:lang="en">No correction of source texts was performed.</p>
                </correction>
                <normalization>
                  <p xml:lang="en">Text has not been normalised, except for spacing.</p>
                </normalization>
                <hyphenation>
                  <p xml:lang="en">No end-of-line hyphens were present in the source.</p>
                </hyphenation>
                <quotation>
                  <p xml:lang="en">Quotation marks have been left in the text and are not explicitly marked up.</p>
                </quotation>
                <segmentation>
                  <p xml:lang="en">The texts are segmented into utterances (speeches) and segments (corresponding to paragraphs in the source transcription).</p>
                </segmentation>
              </editorialDecl>
            </egXML>
            </p>
          </div>

          <div xml:id="sec-tagsDecl">
            <head>Tags declaration</head>
            
            <p>The tags declaration, <gi>tagsDecl</gi>, gives the
            count of all the XML tags used in the data part (so, not in the TEI
            header) of the complete corpus (for the corpus root) or of an individual
            component (for a corpus component). To distinguish the TEI elements from the possible use of
            elements from other namespaces, a <gi>namespace</gi> element giving the
            TEI namespace in its <att>name</att> attribute is introduced
            first. Inside it, each TEI tag is listed in its own <gi>tagUsage</gi>
            element, with the attribute <att>gi</att> giving the name of the tag and
            <att>occurs</att> the number of occurrences, as shown in the following
            example:
            
            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-tagsDecl">
              <tagsDecl>
                <namespace name="http://www.tei-c.org/ns/1.0">
                  <tagUsage gi="text" occurs="414"/>
                  <tagUsage gi="body" occurs="414"/>
                  <tagUsage gi="div" occurs="414"/>
                  <tagUsage gi="head" occurs="826"/>
                  <tagUsage gi="u" occurs="75122"/>
                  <tagUsage gi="seg" occurs="280971"/>
                  <tagUsage gi="note" occurs="85525"/>
                  <tagUsage gi="gap" occurs="7897"/>
                  <tagUsage gi="vocal" occurs="1740"/>
                  <tagUsage gi="incident" occurs="37"/>
                  <tagUsage gi="kinesic" occurs="560"/>
                  <tagUsage gi="desc" occurs="10234"/>
                </namespace>
              </tagsDecl>
            </egXML>
            
            Note that, similarly to the extents (as explained in the Section on
            <ref target="#sec-extent">Extents</ref>), the tag usage is inserted into
            the TEI headers in the finalisation of a corpus (cf. the Section on <ref
            target="#sec-conversion">Validation and conversion</ref>) by a common
            script, so it is not necessary to compute it in the process of developing a
            ParlaMint corpus.</p>
          </div>
          
          <div xml:id="sec-classDecl">
            <head>Class declaration and taxonomies</head>
            
            <p>The class declaration, <gi>classDecl</gi> is used only in the corpus root and
            contains only definitions of some controlled vocabularies used in ParlaMint corpora.
            These vocabularies, possibly hierarchically organised, are encoded using the
            <gi>taxonomy</gi> element.</p>

            <p>The taxonomies themselves are stored in separate files, and are typically
            ParlaMint-wide, i.e. all corpora use the same taxonomies. The taxonomies are
            included in the document root with the XInclude directive, as illustrated
            below, for the case of the Czech corpus:
            
            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-classdecl-xinclude"><![CDATA[
<classDecl>
  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
      href="ParlaMint-taxonomy-parla.legislature.xml"/>    <!-- Common taxonomy on parliament legislature -->
  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
      href="ParlaMint-CZ-taxonomy-meeting.parts.xml"/>     <!-- CZ-specific taxonomy with additional categories for meetings -->
  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
      href="ParlaMint-taxonomy-speaker_types.xml"/>        <!-- Common taxonomy on types of speakers -->
  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
      href="ParlaMint-taxonomy-subcorpus.xml"/>            <!-- Common taxonomy on predetermined subcorpora -->
  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" 
      href="ParlaMint-taxonomy-politicalOrientation.xml"/> <!-- Common taxonomy on L-R political orientations of pol. parties -->
  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" 
      href="ParlaMint-taxonomy-CHES.xml"/>                 <!-- Common taxonomy on CHES variables of pol. parties -->
</classDecl>]]></egXML>
            As can be seen, five of the taxonomies are general ParlaMint taxonomies, while
            one is corpus specific, and is distinguished by including the country code
            <code>CZ</code> (followed by a hyphen) in the filename.</p>
            
            <p>To illustrate the structure of a taxonomy element, we give below the simplest
            common taxonomy (and include only the descriptions in English), which contains the
            categories that define the three subcorpora of a ParlaMint corpus:
            
            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-taxo">
              <classDecl>
                ...
                <taxonomy xml:id="ParlaMint-taxonomy-subcorpus" xml:lang="mul">
                  <desc xml:lang="en"><term>Subcorpora</term></desc>
                  <category xml:id="reference">
                    <catDesc xml:lang="en"><term>Reference</term>: reference subcorpus, until 2020-01-30</catDesc>
                  </category>
                  <category xml:id="covid">
                    <catDesc xml:lang="en"><term>COVID</term>: COVID subcorpus, from 2020-01-31 onwards, when WHO made the
                    formal declaration of PHEIC, i.e. the Public Health Emergency of International Concern for COVID-19</catDesc>
                  </category>
                  <category xml:id="war">
                    <catDesc xml:lang="en"><term>War</term>: War in Ukraine subcorpus, from 2022-02-24 onwards, i.e. from
                    Russia's full-scale invasion of Ukraine</catDesc>
                  </category>
                </taxonomy>
              </classDecl>
            </egXML>
            
            A <gi>taxonomy</gi> thus first describes, via <gi>desc</gi>, what it
            is a taxonomy of, and then lists the (possibly nested) categories in
            <gi>category</gi> elements. Crucial here are the values of their
            <att>xml:id</att> attributes, by which a category is referred to, e.g. via the
            <att>ana</att> attribute of some other element, as was already explained in the
            Section on <ref target="#sec-roottags">Attributes of top-level elements</ref>, in
            connection with classifying a corpus component via the <att>ana</att> attribute
            of its <gi>TEI</gi> element. The taxonomy category then bilingually glosses its
            meaning in its <gi>catDesc</gi> elements, which should always first contain the
            short name of the category, encoded in the <gi>term</gi> element.</p>
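            <p>As a minimal sketch of such a reference, and assuming a hypothetical corpus
            component with the identifier <val>ParlaMint-XX_2019-02-18</val> (chosen for
            illustration only, not taken from an actual corpus), a component that belongs to the
            reference subcorpus would point to the <val>reference</val> category of the
            taxonomy above as follows:

            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-taxo-ana-sketch">
              <TEI xml:id="ParlaMint-XX_2019-02-18" ana="#reference">
                ...
              </TEI>
            </egXML>

            The <att>ana</att> attribute can carry several such pointers, separated by
            spaces, one for each category that applies to the component.</p>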

            <p>ParlaMint requires several taxonomies to be defined in the class declaration
            of the corpus root (as well as additional ones for the linguistically annotated
            corpus, as further described in the Section on <ref target="#sec-ana-metadata">Linguistic metadata</ref>).
            As mentioned, these taxonomies are defined globally and available as part of the data on the
            <ref target="https://github.com/clarin-eric/ParlaMint/">ParlaMint GitHub
            repository</ref>, and there is a special procedure for modifying them, in
            particular for inserting translations into a new language.</p>

            <p>The five obligatory taxonomies are:
            
            <list>
              <item>The <hi>subcorpus</hi> taxonomy, already given in the example</item>
              
              <item>The taxonomy of <hi>speaker types</hi>, which distinguishes e.g. the
              chair of a meeting, ordinary speakers, and guest speakers.</item>
              
              <item>The <hi>legislature</hi> taxonomy, which gives the possible organisations
              of a parliament, and is by far the most complex one.</item>
              
              <item>The <hi>political orientation</hi> taxonomy, which gives the values of
              the (mostly) left-right political orientation of political parties and 
              parliamentary groups.</item>
              
              <item>The <hi>CHES variables</hi> taxonomy, which gives the variables of the 
              Chapel Hill Survey on political parties (cf. the Section on
              <ref target="#sec-parties">Political parties and parliamentary groups</ref>).</item>
              
            </list>
            </p>
            <p>Furthermore, there are two obligatory taxonomies which pertain to the linguistically analysed
            version of the corpus only, cf. the Section on <ref target="#sec-ana-taxo">Linguistic taxonomies</ref>.</p>
            
          </div>
        </div>

        <div xml:id="sec-profileDesc">
          <head>Profile description</head>
          
          <p>The profile description, <gi>profileDesc</gi>, is the third main division of the
          metadata provided by the TEI header. It contains a description of non-bibliographic
          aspects of the corpus, for example the list of speakers with their metadata. For
          the <hi>corpus root</hi>, it contains four elements, of which only the first,
          <gi>settingDesc</gi>, is used in corpus components. The elements are listed below:

          <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-profileDesc">
            <profileDesc>
              <settingDesc>...</settingDesc>
              <textClass>...</textClass>
              <particDesc>...</particDesc>
              <langUsage>...</langUsage>
            </profileDesc>
          </egXML>
          
          We explain the contents of each element in the following sections.</p>            
          
          <div xml:id="sec-settingDesc">
            <head>Setting description</head>
            
            <p>The setting description, <gi>settingDesc</gi>, is used by both the corpus root
            and corpus components, and contains only one element, <gi>setting</gi>, which
            then gives information on where and when the meetings included took place. The
            example below gives a typical corpus root setting description:
            
            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-settingDescRoot">
              <settingDesc>
                <setting>
                  <name type="place">Westminster</name>
                  <name type="city">London</name>
                  <name type="country" key="GB">U.K.</name>
                  <date from="2015-01-01" to="2021-03-31"/>
                </setting>
              </settingDesc>
            </egXML>
            
            As can be seen, the location of the meetings is given in <gi>name</gi> elements
            with the <att>type</att> attribute further specifying what kind of location this
            is, with <val>country</val> (or <val>region</val> for regional parliaments)
            additionally having the <att>key</att> attribute, which gives its ISO 3166 code, as
            explained in the Section on <ref target="#sec-standard">Standard values</ref>.
            
            For the corpus root it also contains the interval of the dates of the
            transcripts included in the corpus. Note that this date range is also given in
            the <gi>sourceDesc</gi>, as explained in the Section on the <ref
            target="#sec-sourceDesc">Source description</ref>.</p>
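            <p>For a regional parliament, the setting could be sketched as below. This is
            an illustration only and is not taken from an actual corpus: it assumes that the
            region is identified by its ISO 3166-2 subdivision code (here <val>ES-CT</val>
            for Catalonia), and it simply reuses the date range of the example above:

            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-settingDescRegion-sketch">
              <settingDesc>
                <setting>
                  <name type="city">Barcelona</name>
                  <name type="region" key="ES-CT">Catalonia</name>
                  <date from="2015-01-01" to="2021-03-31"/>
                </setting>
              </settingDesc>
            </egXML>
            </p>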
            
            <p>The setting description is also present in <hi>corpus components</hi>, and is
            very similar to the one in the corpus root, except that it can
            specify the location of the meeting more precisely and must give its exact date, as illustrated
            below:
            
            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-settingDescComp">
              <settingDesc>
                <setting>
                  <name type="place">Commons Chamber</name>
                  <name type="place">Westminster</name>
                  <name type="city">London</name>
                  <name type="country" key="GB">U.K.</name>
                  <date when="2019-02-18">February 18th, 2019</date>
                </setting>
              </settingDesc>
            </egXML>
            </p>
          </div>
          
          <div xml:id="sec-textClass">
            <head>Text class</head>
            
            <p>The text class, <gi>textClass</gi> groups information which describes the
            nature or topic of the corpus in terms of a standard classification scheme.  It
            is used only in the ParlaMint <hi>corpus root</hi>, where it contains the
            category reference, <gi>catRef</gi> element:
            
            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-textClass-root">
              <textClass>
                <catRef scheme="#parla.legislature"
                        target="#parla.bi #parla.lower #parla.upper"/>
              </textClass>
            </egXML>

            The category reference specifies in the value of the <att>scheme</att> attribute
            which scheme it uses, and this will always be a pointer to the ParlaMint-wide
            taxonomy on legislature, as further explained in the Section on the <ref
            target="#sec-classDecl">Class declaration</ref>. The <att>target</att>
            attribute then gives pointers to the types of legislature the country or
            region has and what the corpus contains; in the case above, the country has a
            bicameral parliament, and the corpus contains the transcriptions of both the
            upper and lower house sittings. In general, the options are:
            
            <list>
              <item><val>#parla.uni</val>: Unicameral parliament</item>
              <item><val>#parla.bi #parla.lower</val>: Bicameral parliament, lower house only</item>
              <item><val>#parla.bi #parla.upper</val>: Bicameral parliament, upper house only</item>
              <item><val>#parla.bi #parla.lower #parla.upper</val>: Bicameral parliament,
              both houses</item>
            </list>
            </p>            
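            <p>For instance, the following sketch, which merely combines the scheme pointer of
            the example above with the first option of the list, gives the text class of
            a corpus of a unicameral parliament:

            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-textClass-uni-sketch">
              <textClass>
                <catRef scheme="#parla.legislature"
                        target="#parla.uni"/>
              </textClass>
            </egXML>
            </p>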
          </div>
          
          <div xml:id="sec-particDesc">
            <head>Participant description</head>
            
            <p>The participant description, <gi>particDesc</gi>, gives information
            about the speakers whose speeches constitute the corpus transcripts, as well as
            information about the government, parliament, parliamentary groups of
            political parties and other <q>organisations</q> relevant to the affiliations
            of the speakers or to the corpus in general.
            The <gi>particDesc</gi> is a part of the TEI header of the <hi>corpus root</hi>
            and contains two dedicated types of
            lists, <gi>listOrg</gi> for organisations, and <gi>listPerson</gi> for the
            speakers, as shown below:
            
            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-particDesc">
              <particDesc>
                <listOrg>...</listOrg>
                <listPerson>...</listPerson>
              </particDesc>
            </egXML>
            
            While the above gives the XML structure of the participant description,
            ParlaMint separates the organisation and person list into separate files
            (cf. the Section on <ref target="#sec-files">File names and directory
            structure</ref>), so the actual encoding would be (for the CZ corpus) as
            follows:
            
            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-particDesc-XInclude">
              <particDesc>
                <![CDATA[
  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
      href="ParlaMint-CZ-listOrg.xml"/>
  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
      href="ParlaMint-CZ-listPerson.xml"/>]]>
              </particDesc>
            </egXML>
            </p>
            
            <p>Given the importance that ParlaMint gives to the information on speakers and
            their affiliations to (political) organisations, as well as the richness of this
            information, the content of <gi>listOrg</gi> and <gi>listPerson</gi> is further
            explained in a separate Chapter on <ref target="#chp-speaker">Speakers and their
            organisations</ref>.
            </p>
            
        </div>
          
          <div xml:id="sec-langUsage">
            <head>Language usage</head>
            
            <p>The language usage, <gi>langUsage</gi>, is the fourth and last element of the
            profile description of a corpus root and defines the languages that are used in
            the corpus. Typically the language usage will define (bilingually) only two
            languages, the local language and English, the latter being the language used in the metadata,
            for example:
            
            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-langUsage">
              <langUsage>
                <language ident="sl" xml:lang="sl">slovenski</language>
                <language ident="en" xml:lang="sl">angleški</language>
                <language ident="sl" xml:lang="en">Slovenian</language>
                <language ident="en" xml:lang="en">English</language>
              </langUsage>
            </egXML>
            </p>
            
            <p>In cases where the transcription contains more than one language, the
            percentage of their use can also be indicated in the <att>usage</att> attribute of
            the <gi>language</gi> elements, as illustrated in the example below:
            
            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-langusage2">
              <langUsage>
                <language ident="en" xml:lang="en">English</language>
                <language ident="en" xml:lang="nl">Engels</language>
                <language usage="45" ident="nl" xml:lang="en">Dutch</language>
                <language usage="45" ident="nl" xml:lang="nl">Nederlands</language>
                <language usage="55" ident="fr" xml:lang="en">French</language>
                <language usage="55" ident="fr" xml:lang="nl">Frans</language>
              </langUsage>
            </egXML>
            </p>
          </div>
        </div>
        
        <div xml:id="sec-revisionDesc">
          <head>Revision description</head>
          <p>The revision description, <gi>revisionDesc</gi> is the fourth, and last
          element of the TEI header. It is an optional element that can appear in the
          <hi>corpus root or component</hi>, and documents the revisions made in the corpus
          or component. Its structure is illustrated below:
          
          <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-revisionDesc">
            <revisionDesc>
              <change when="2021-06-11">
              <name>Tomaž Erjavec</name>: Finalized encoding.</change>
              <change when="2021-05-28">
              <name>Tomaž Erjavec</name>: Built corpus.</change>
            </revisionDesc>
          </egXML>
          The revision description consists of a series of <gi>change</gi> elements, with
          the attribute <att>when</att> giving the date of the change, and the content
          containing the <gi>name</gi> of the person responsible for the change, and a
          free-text description of the change.</p>
        </div>
      </div>
      
      <div xml:id="chp-speaker">
        <head>Speakers and their organisations</head>

        <p>ParlaMint places considerable emphasis on including in the corpora significant
        information about the persons giving the speeches contained in the transcriptions.
        This is why, even though this information is encoded in the <gi>particDesc</gi> element
        of the <gi>teiHeader</gi> of the corpus root (cf. the Section on <ref
        target="#sec-particDesc">Participant description</ref>), we treat it here in a separate
        Chapter. Below we first discuss the information on persons, including how they are
        affiliated with (political) organisations, and then explain the encoding of these
        organisations.</p>
        
        <div xml:id="sec-speakers">
          <head>Speakers</head>
          
          <p>The information on speakers is given in the <gi>listPerson</gi> element of
          the <gi>particDesc</gi> element (cf. the Section on <ref
          target="#sec-particDesc">Participant description</ref>). This element contains
          a series of <gi>person</gi> elements, each of which gives information on an
          individual speaker, as the example below illustrates:

          <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-speakers">
            <listPerson>
              <person xml:id="AccettoMatej">
                <persName>
                  <surname>Accetto</surname>
                  <forename>Matej</forename>
                </persName>
                <sex value="M"/>
              </person>
              ...
            </listPerson>
          </egXML>

          Each <gi>person</gi> must have an <att>xml:id</att> attribute, so that it can
          be referred to from the transcription. The person's name, <gi>persName</gi>,
          is further decomposed into the person's
          <gi>surname</gi>(s), <gi>forename</gi>(s) and possibly <gi>addName</gi>. A
          person's name can also change, typically because of marriage. In this case, the
          <gi>person</gi> should contain another (or, possibly, several)
          <gi>persName</gi> elements, each marked by the <att>from</att> and/or
          <att>to</att> temporal attributes, as shown in the following example:
          
          <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-persName">
            <person xml:id="GlawischnigAnna">
              <persName to="2016-06-01">
                <surname>Glawischnig</surname>
                <forename>Anna</forename>
              </persName>
              <persName from="2016-06-02">
                <surname>Glawischnig-Piesczek</surname>
                <forename>Anna</forename>
              </persName>
              ...
            </person>
          </egXML>
          
          Note that the <att>to</att>/<att>from</att> dates should neither overlap nor
          leave gaps between them, as this would mean that the person
          had either two names at once, or none.
          </p>

          <p>The person must also have the <gi>sex</gi> element, with the
          <att>value</att> attribute being one of the controlled values: <val>M</val> for
          male, <val>F</val> for female, <val>O</val> for other, <val>N</val> for none or
          <val>U</val> for unknown.</p>

          <p>The person element can also contain other optional information, e.g. the
          date and place of <gi>birth</gi> (and <gi>death</gi>), their official Web
          page, link(s) to Wikipedia, their VIAF identifier, or a photo, as illustrated below:

          <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-personextra">
            <person xml:id="SayeedaWarsi">
              <persName>
                <forename>Sayeeda</forename>
                <surname>Warsi</surname>
              </persName>
              <sex value="F"/>
              <birth when="1971-03-28">
                <placeName ref="https://www.geonames.org/2651286/">Dewsbury</placeName>
              </birth>
              <idno type="URI" subtype="contact">https://members.parliament.uk/member/3839/contact</idno>
              <idno type="URI" subtype="wikimedia">https://en.wikipedia.org/wiki/Sayeeda_Warsi,_Baroness_Warsi</idno>
              <idno type="URI" subtype="wikimedia" xml:lang="es">https://es.wikipedia.org/wiki/Sayeeda_Warsi</idno>
              <idno type="URI" subtype="viaf">http://viaf.org/viaf/33149912470406211798</idno>
              <figure>
                <graphic url="https://api.parliament.uk/photo/Paa3j0vS.jpg?crop=CU_1:1"/>
              </figure>
            </person>
          </egXML>
          </p>
	  
          <p>Finally, the name of the person, their place of birth etc. can also be written in several languages or scripts,
	  as illustrated below:

          <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-person-langs">
	    <person xml:id="PlevnelievRosen">
	      <persName xml:lang="bg">
		<forename>Росен</forename>
		<surname>Асенов</surname>
		<surname>Плевнелиев</surname>
	      </persName>
	      <persName xml:lang="en">
		<forename>Rosen</forename>
		<surname>Asenov</surname>
		<surname>Plevneliev</surname>
	      </persName>
	      <sex value="M"/>
	      <birth when="1964-05-14">
		<placeName>Гоце Делчев</placeName>
		<placeName xml:lang="bg-Latn">Gotse Delchev</placeName>
	      </birth>
	      <education>Висшия машинно-електротехнически институт в София, със специалност „изчислителна техника“</education>
	      <education xml:lang="bg-Latn">Visshiya mashinno-elektrotehnicheski institut v Sofiya, sas spetsialnost „izchislitelna tehnika“</education>
	      <occupation>политик</occupation>
	      <occupation xml:lang="bg-Latn">politik</occupation>
	      <affiliation from="2012-01-22" ref="#republic.Bulgaria" role="member" to="2017-01-21"/>
	      <affiliation from="2012-01-22" ref="#republic.Bulgaria" role="head" to="2017-01-21">
		<roleName xml:lang="bg">4-и президент на Република България</roleName>
		<roleName xml:lang="bg-Latn">4-i prezident na Republika Balgariya</roleName>
	      </affiliation>
	      <idno type="URI" subtype="wikimedia">https://bg.wikipedia.org/wiki/Росен_Плевнелиев</idno>
	      <idno type="URI" subtype="wikimedia">https://en.wikipedia.org/wiki/Rosen_Plevneliev</idno>
	    </person>
          </egXML>
	  </p>
	  
          <div xml:id="sec-affiliation">
            <head>Speaker affiliations</head>

            <p>An important element of a person is <gi>affiliation</gi>, which associates
            the speakers with organisations, i.e. it specifies who is a member of the
            government, parliament, a parliamentary group of political parties<note>Note
            that parliaments also have unaffiliated (or independent) MPs,
            who either belong to a special <q>unaffiliated</q> parliamentary group
            or do not belong to any parliamentary group. In the former case, an
            <q>unaffiliated</q> parliamentaryGroup organisation must be created, and such
            MPs are affiliated with it as members. In the latter case, they are simply
            not affiliated with any parliamentary group.</note>
            
            or political parties themselves, as well as who holds a relevant office,
            e.g. that they are the president, chairman, minister etc. in or of a given
            organisation.

            The following example shows the use of the <gi>affiliation</gi> element for
            specifying membership in organisations:
            
            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-affiliation">
              <person xml:id="BahŽibertAnja">
                <persName>
                  <forename>Anja</forename>
                  <surname>Žibert</surname>
                </persName>
                <sex value="F"/>
                <affiliation role="member"
                             ref="#parliamentSI"
                             from="2014-08-01"
                             to="2018-06-21"
                             ana="#DZ.7"/>
                <affiliation role="member"
                             ref="#parliamentSI"
                             from="2018-06-22"
                             ana="#DZ.8"/>
              </person>
            </egXML>
            
            The formal type of affiliation is given in the <att>role</att> attribute, in this
            case <val>member</val>. The <att>ref</att> attribute points to the ID of the
            organisation (cf. the Section on <ref target="#sec-orgs">Organisations</ref>) in
            which the person has the specified role. For MPs, as is the case above, this will
            be the ID of the parliament (cf. the Section on <ref target="#sec-parliament">The
            parliament organisations</ref>).</p>

            <p>The example above also shows the use of the (optional) classification attribute
            <att>ana</att>, which points to the specification of the legislative period
            in which the person was affiliated with the specified organisation. Such
            legislative periods are typically given as <gi>event</gi> elements inside the
            government or parliament organisations, as further explained in the Sections on
            the <ref target="#sec-government">government</ref> and <ref
            target="#sec-parliament">parliament</ref> organisations.</p>
            
            <p>The affiliation element can also have the usual <att>from</att> and
            <att>to</att> attributes, i.e. from and to when the person was affiliated
            with the organisation.</p>
            
            <p>The <att>role</att> attribute can have as its value one of the values
            given by the ParlaMint schema. For backward compatibility with ParlaMint I
            corpora, there are some roles that are only used by one corpus (cf. the
            definition of <att>role</att> for <gi>affiliation</gi>), but the main ones
            that should be used are:
            
            <list>
              
              <item><val>member</val> specifies that a person is a member of the
              organisation</item>
              
              <!--item><val>representative</val> specifies that a person represents an
                  organisation and it used to connect persons to parliamentary groups, i.e. to
                  <gi>org</gi> elements with <att>role</att> =
                  <val>parliamentaryGroup</val></item-->
              
              <item><val>head</val> is used for the lead person in an organisation, regardless
              of how this role is named in the specific country and for the specific
              organisation, i.e. it can be used for the queen (head of the <val>country</val>
              organisation), president (head of the <val>republic</val> organisation), prime
              minister (head of the <val>government</val> organisation), minister (head of a <val>ministry</val> organisation),
              chairperson (head of a <val>committee</val> organisation), etc.</item>
              
              <item><val>deputyHead</val> is the deputy head of the organisation, again,
              regardless of how this role is named in the specific country and for the specific
              organisation, i.e. it is used for the vice president, deputy chairperson,
              etc.</item>

              <item><val>minister</val> is used to indicate that the person is a minister
              in the government. Note that this only says that they are a minister, and
              when. To encode what they are a minister of, the ministry organisation needs
              to be defined, and the minister associated with it with the <val>head</val>
              role.</item>
            </list>
            </p>

            <p>Because the values of the <att>role</att> attribute are quite generic, ParlaMint also allows specifying the
            exact name of the affiliation role for a particular country or region using the
            <gi>roleName</gi> element in the content of <gi>affiliation</gi>, preferably both in the local language and in
            English, as illustrated in the following example, which specifies that somebody
            is a minister, and also what they are a minister of:
            
            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-roleName">
              <affiliation role="minister" ref="#GOV" from="2020-08-01">
                <roleName xml:lang="sl">Minister za obrambo</roleName>
                <roleName xml:lang="en">Minister of Defence</roleName>
              </affiliation>
              <affiliation role="head" ref="#MinistryOfDefence" from="2020-08-01">
                <roleName xml:lang="sl">Minister za obrambo</roleName>
                <roleName xml:lang="en">Minister of Defence</roleName>
              </affiliation>
            </egXML>
            </p>
            
            <p>Finally, it is also possible to specify the name of the organisation that the person is affiliated with
            inside the <gi>affiliation</gi> element, using the <gi>orgName</gi> element<note>Ideally, the
            organisation somebody is affiliated with is specified as an organisation, using the <gi>org</gi>
            element (cf. the Section on <ref target="#sec-orgs">Organisations</ref>), but if this is not the case, using
            <gi>orgName</gi> directly in the <gi>affiliation</gi> is an alternative encoding.</note>, again preferably
            both in the local language and in English, as illustrated in the following example, which specifies
            that somebody is a minister of a certain ministry:
            
            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-affiliation-orgName">
              <affiliation role="minister" ref="#GOV" from="2020-08-01">
                <orgName xml:lang="sl">Ministrstvo za obrambo</orgName>
                <orgName xml:lang="en">Ministry of Defence</orgName>
              </affiliation>
            </egXML>
            </p>
            
            <p>It should be noted that ParlaMint makes no assumptions about the
            connections between various roles, e.g. we do not assume that if somebody
            has a minister role in the government that they are also a member of the
            government. Therefore it is necessary to specify all the desired affiliations
            with their particular roles, e.g. both as <val>minister</val> and as
            <val>member</val>.</p>
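
            <p>A minimal sketch of such a double affiliation is given below. The
            <code>#GOV</code> reference and the date are reused from the examples above for
            illustration only and would have to match the organisations and terms actually
            present in a corpus:

            <egXML xmlns="http://www.tei-c.org/ns/Examples">
              <!-- Illustrative only: the same person is affiliated with the government twice,
                   once as a member and once as a minister -->
              <affiliation role="member" ref="#GOV" from="2020-08-01"/>
              <affiliation role="minister" ref="#GOV" from="2020-08-01"/>
            </egXML>
            </p>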

            <p>It is important to give correct roles to the affiliations that associate a
            person with organisations. We list the most common roles and how they should be
            encoded, emphasising the ones that are obligatory in ParlaMint:
            
            <list>
              
              <item>King or queen:
              <code>affiliation/@role="head"</code> →
              <code>org/@role="country"</code>
              </item>
              
              <item>President (as opposed to Prime minister):
              <code>affiliation/@role="head"</code> →
              <code>org/@role="republic"</code>
              </item>
              
              <item><emph>Prime minister</emph>
              (or other head of government):
              <code>affiliation/@role="head"</code> &amp;
              <code>affiliation/@role="member"</code> →
              <code>org/@role="government"</code>
              (cf. Section on <ref target="#sec-government">The government organisation</ref>)
              </item>
              
              <item>Deputy prime minister:
              <code>affiliation/@role="deputyHead"</code> &amp;
              <code>affiliation/@role="member"</code> →
              <code>org/@role="government"</code>
              </item>
              
              <item><emph>Minister</emph>:
              <code>affiliation/@role="minister"</code> &amp;
              <code>affiliation/@role="member"</code> →
              <code>org/@role="government"</code>
              <lb/>
              If ministries are defined, then also:
              <code>affiliation/@role="head"</code> &amp;
              <code>affiliation/@role="member"</code> →
              <code>org/@role="ministry"</code>
              </item>
              
              <item>Deputy minister:
              <code>affiliation/@role="deputyMinister"</code> &amp;
              <code>affiliation/@role="member"</code> →
              <code>org/@role="government"</code>
              <lb/>
              If ministries are defined, then also:
              <code>affiliation/@role="deputyHead"</code> &amp;
              <code>affiliation/@role="member"</code> →
              <code>org/@role="ministry"</code>
              </item>
              
              <item>Leader of parliamentary group:
              <code>affiliation/@role="head"</code> &amp; 
              <code>affiliation/@role="member"</code> →
              <code>org/@role="parliamentaryGroup"</code>
              (cf. Section on <ref target="#sec-parties">Political parties and parliamentary groups</ref>)
              </item>
              
              <item><emph>Member of parliamentary group</emph>:
              <code>affiliation/@role="member"</code> →
              <code>org/@role="parliamentaryGroup"</code>
              </item>
              
              <item>Leader of political party:
              <code>affiliation/@role="head"</code> &amp;
              <code>affiliation/@role="member"</code> →
              <code>org/@role="politicalParty"</code>
              </item>
              
              <item>Member of political party:
              <code>affiliation/@role="member"</code> →
              <code>org/@role="politicalParty"</code>
              </item>
              
              <item><emph>MP</emph>:
              <code>affiliation/@role="member"</code> →
              <code>org/@role="parliament"</code>
              (cf. Section on <ref target="#sec-parliament">The parliament organisations</ref>)
              </item>
              
            </list>
            </p>
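
            <p>As a hedged illustration of how the list above translates into markup, a prime
            minister who is also an MP could be given the affiliations sketched below. The
            organisation IDs (<code>#GOV</code>, <code>#parliament</code>) and all dates are
            hypothetical placeholders, not values taken from an actual corpus:

            <egXML xmlns="http://www.tei-c.org/ns/Examples">
              <!-- Hypothetical IDs and dates, for illustration only -->
              <!-- Prime minister: both head and member of the government organisation -->
              <affiliation role="head" ref="#GOV" from="2020-03-13"/>
              <affiliation role="member" ref="#GOV" from="2020-03-13"/>
              <!-- MP: member of the parliament organisation -->
              <affiliation role="member" ref="#parliament" from="2018-06-22"/>
            </egXML>
            </p>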

          </div>
        </div>

        <div xml:id="sec-orgs">
          <head>Organisations</head>
          
          <p>Information on the government, parliament, political parties or
          parliamentary groups of political parties, as well as other optional
          parliamentary structures (e.g. ministries, committees) is given in the corpus
          root <gi>listOrg</gi> element of the <gi>particDesc</gi> (cf. the Section on
          <ref target="#sec-particDesc">Participant description</ref>).  The
          <gi>listOrg</gi> element then contains a series of <gi>org</gi> elements, each
          giving information about one organisation. The organisations are followed by
          the <gi>listRelation</gi> element giving a list of relations between
          organisations:

          <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-listOrg">
            <listOrg>
              <org>...</org>
              <org>...</org>
              <org>...</org>
              ...
              <listRelation>...</listRelation>
            </listOrg>
          </egXML>
          </p>

          <p>We exemplify the structure of one organisation element by giving the basic
          information about a parliamentary group of political parties:
          
          <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-parliamentaryGroup">
            <org role="parliamentaryGroup" xml:id="group.LN-Aut">
              <orgName full="yes" xml:lang="it">Lega Nord e Autonomie</orgName>
              <orgName full="abb">LN-Aut</orgName>
              <event from="2013-03-15">
                <label xml:lang="en">existence</label>
              </event>
            </org>
          </egXML>
          
          First, each organisation must have an <att>xml:id</att> attribute, so that
          other elements (in particular, <gi>person</gi>s) can refer to it.
          The fact that the organisation is a
          parliamentary group of political parties is encoded in the <att>role</att>
          attribute, and we elaborate on this below.  The name of the
          group is given in the <gi>orgName</gi> element, which also uses the
          <att>full</att> attribute to distinguish between the full name of
          the group (value <val>yes</val>) and its abbreviated name (value
          <val>abb</val>).</p>
          
          <p>Organisations are also created and dissolved, and this information is encoded in the
          <gi>event</gi> element, which has as its <gi>label</gi> <val>existence</val>,
          and where the start of its existence is given in the <att>from</att> attribute;
          as there is no <att>to</att> attribute, this also means that the group still
          exists.</p>
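
          <p>For contrast, a sketch of how a dissolved organisation could be recorded is given
          below. The <att>to</att> date is invented purely for illustration and does not
          describe the actual group from the example above:

          <egXML xmlns="http://www.tei-c.org/ns/Examples">
            <!-- Illustrative only: the "to" date is a made-up value -->
            <event from="2013-03-15" to="2018-03-22">
              <label xml:lang="en">existence</label>
            </event>
          </egXML>

          Here the presence of the <att>to</att> attribute signals that the organisation no
          longer exists.</p>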

          <p>The example above gives the minimal required information about an
          organisation but ParlaMint also allows further data to be added, as exemplified by
          the following example:
          
          <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-parties">
            <org xml:id="party.PS" role="parliamentaryGroup">
              <orgName full="yes" xml:lang="sl">Pozitivna Slovenija</orgName>
              <orgName full="yes" xml:lang="en">Positive Slovenia</orgName>
              <orgName full="abb">PS</orgName>
              <event from="2011-10-22">
                <label xml:lang="en">existence</label>
              </event>
              <idno type="URI" subtype="wikimedia" xml:lang="sl">https://sl.wikipedia.org/wiki/Pozitivna_Slovenija</idno>
              <idno type="URI" subtype="wikimedia" xml:lang="en">https://en.wikipedia.org/wiki/Positive_Slovenia</idno>
            </org>
          </egXML>
          
          Here the full name of the organisation is also given in English, and there are two
          external links giving URLs to further information about the organisation. In the
          example above, these are the Slovene and English Wikipedia pages, as indicated
          by the appropriate values of the <att>xml:lang</att> attributes on the
          <gi>idno</gi> elements.</p>

          <p>Returning to the <att>role</att> attribute, it is the ParlaMint schema (cf. the
          Section on <ref target="#sec-validate">Validating ParlaMint corpora</ref>) that gives
          its set of allowed values. Currently, the list is quite long, as we left it up to the
          partners of ParlaMint I to determine the values. However, some values are common
          to all corpora and, for some of these, it is obligatory to have organisations with these roles
          in the organisation list of a corpus. Furthermore, it is recommended that the
          organisations are listed in the order given below, with the obligatory roles
          emphasised:

          <list rend="numbered">
            <item><val>country</val>: the country taken as an organisation, which can be
            used to specify the king or queen of the country as a <gi>person</gi> affiliated
            with the country organisation with the <att>role</att> <val>head</val>;</item>

            <item><val>republic</val>: the republic, which can be used to specify the
            president of the country as a <gi>person</gi> affiliated with the republic
            organisation with the <att>role</att> <val>head</val>;</item>

            <item><hi rend="bold"><val>government</val></hi>: the government of the country or
            region;</item>

            <item><val>ministry</val>: a particular ministry, which can be used to specify
            the minister as a <gi>person</gi> affiliated with the ministry 
            organisation with the <att>role</att> <val>head</val>;</item>

            <item><hi rend="bold"><val>parliament</val></hi>: the parliament, upper or lower house
            of the country or region;</item>

            <item><hi rend="bold"><val>parliamentaryGroup</val></hi>: a grouping of political
            parties for the purpose of acting as one in the government;</item>

            <item><val>politicalParty</val>: a political party.</item>

          </list>

          We discuss the obligatory types of organisations and political parties in the following
          sections.
          </p>
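
          <p>The recommended ordering can be sketched as follows for a corpus that defines only
          the obligatory organisations plus political parties. The <att>xml:id</att> attributes
          and the content of each <gi>org</gi> are elided here, and the number of organisations
          per role is an assumption made for illustration:

          <egXML xmlns="http://www.tei-c.org/ns/Examples">
            <listOrg>
              <!-- Obligatory roles first, in the order of the list above -->
              <org role="government">...</org>
              <org role="parliament">...</org>
              <org role="parliamentaryGroup">...</org>
              <org role="parliamentaryGroup">...</org>
              <!-- Optional: political parties -->
              <org role="politicalParty">...</org>
              ...
              <listRelation>...</listRelation>
            </listOrg>
          </egXML>
          </p>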
          
          <div xml:id="sec-government">
            <head>The government organisation</head>
            
            <p>The government organisation, distinguished by the <att>role</att>
            <val>government</val>, is required to be present in the <gi>listOrg</gi> of every
            corpus root and is, by convention, the first organisation in the list. It gives
            the ID and name of the country or region government, and also contains the list
            of specific governments by using the dedicated list element for events,
            <gi>listEvent</gi>, which then gives these governments as a series of
            <gi>event</gi> elements:
            
            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-government">
              <org xml:id="ParlaMint-SI-GOV" role="government">
                <orgName xml:lang="sl" full="yes">Vlada Republike Slovenije</orgName>
                <orgName xml:lang="en" full="yes">Government of the Republic of Slovenia</orgName>
                <event from="1990-05-16">
                  <label xml:lang="en">existence</label>
                </event>
                <idno type="URI" subtype="wikimedia" xml:lang="sl">https://sl.wikipedia.org/wiki/Vlada_Republike_Slovenije</idno>
                <idno type="URI" subtype="wikimedia" xml:lang="en">https://en.wikipedia.org/wiki/Government_of_Slovenia</idno>
                <listEvent>
                  <event xml:id="GOV.11" from="2013-03-20" to="2014-09-18">
                    <label xml:lang="sl">11. vlada Republike Slovenije (20. marec 2013 - 18. september 2014)</label>
                    <label xml:lang="en">11th Government of the Republic of Slovenia (20 March 2013 - 18 September 2014)</label>
                  </event>
                  ...
                  <event xml:id="GOV.14" from="2020-03-13">
                    <label xml:lang="sl">14. vlada Republike Slovenije (13. marec 2020 - danes)</label>
                    <label xml:lang="en">14th Government of the Republic of Slovenia (13 March 2020 - today)</label>
                  </event>
                </listEvent>
              </org>
            </egXML>
            </p>
          </div>
          
          <div xml:id="sec-parliament">
            <head>The parliament organisations</head>

            <p>The parliament organisations, distinguished by the <att>role</att> <val>parliament</val>,
            are also required to be present in the <gi>listOrg</gi> of every corpus root and
            are, by convention, the second and third organisation in the list. There will be
            two parliament organisations for bicameral parliaments, at least in cases where the
            corpus contains the transcripts of both the upper and the lower house. For
            unicameral ones, or if the transcripts cover only the lower house, there will
            be only one parliament organisation. Which of the three options (unicameral, lower
            house or upper house) the organisation
            encodes is determined by the value of the <att>ana</att> attribute, which refers
            to the appropriate ID of the category in the legislature taxonomy (cf. the
            Section on <ref target="#sec-classDecl">Class declaration</ref>). The
            <att>ana</att> attribute should also specify whether the parliament is national
            or regional, with the categories also specified in the legislature taxonomy. In
            short, the theoretically possible values of <att>ana</att> for the parliament
            organisations are:
            <list>
              <item><code>#parla.national #parla.uni</code>: national parliament, unicameral system</item>
              <item><code>#parla.national #parla.lower</code>: national parliament, lower house</item>
              <item><code>#parla.national #parla.upper</code>: national parliament, upper house</item>
              <item><code>#parla.regional #parla.uni</code>: regional parliament, unicameral system</item>
              <item><code>#parla.regional #parla.lower</code>: regional parliament, lower house</item>
              <item><code>#parla.regional #parla.upper</code>: regional parliament, upper house</item>
            </list>

            Otherwise, the structure of the parliament organisations is identical to the one
            for the government, i.e. it gives the ID and name of the parliament and encodes
            the successive parliaments, using the <gi>event</gi> element:
            
            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-parliament">
              <org ana="#parla.national #parla.lower"
                   role="parliament"
                   xml:id="be_federal_parliament">
                <orgName full="yes" xml:lang="nl">Federaal Parlement van België</orgName>
                <orgName full="yes" xml:lang="en">Belgian Federal Parliament</orgName>
                <event from="1831-02-07">
                  <label xml:lang="en">existence</label>
                </event>
                <listEvent>
                  <head xml:lang="nl">Zittingsperiode</head>
                  <head xml:lang="en">Legislative period</head>
                  <event to="2007-05-02" from="2003-06-05" xml:id="period_51">
                    <label xml:lang="nl">Zittingsperiode 51</label>
                    <label xml:lang="en">Legislative period 51</label>
                  </event>
                  ...
                  <event from="2019-06-20" xml:id="period_55">
                    <label xml:lang="nl">Zittingsperiode 55</label>
                    <label xml:lang="en">Legislative period 55</label>
                  </event>
                </listEvent>
              </org>
            </egXML>
            </p>
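
            <p>For a corpus covering both houses of a bicameral national parliament, the two
            parliament organisations could be sketched as below. The <att>xml:id</att>
            attributes and the content of each <gi>org</gi> are elided, and the order of the
            two houses is an assumption made for illustration:

            <egXML xmlns="http://www.tei-c.org/ns/Examples">
              <!-- Illustrative skeleton: one organisation per house, distinguished by @ana -->
              <org role="parliament" ana="#parla.national #parla.lower">...</org>
              <org role="parliament" ana="#parla.national #parla.upper">...</org>
            </egXML>
            </p>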
          </div>
          
          <!--div xml:id="sec-republic">
              <head>The "Republic" organisation</head>
              <p>ParlaMint also distinguished an organisation that has the <att>role</att>
              equal to <val>republic</val>. This organisation is primarily used to give the affiliation
              of the president of the republic to ...
              </p>
              </div-->
          
          <div xml:id="sec-parties">
            <head>Political parties and parliamentary groups</head>

            <p>In the scope of ParlaMint, very important organisations are political parties
            (distinguished by the <att>role</att> <val>politicalParty</val>) and, even more so,
            parliamentary groups that represent political parties in the parliament
            (distinguished by the <att>role</att> <val>parliamentaryGroup</val>). These
            organisations are linked to <gi>person</gi> elements (i.e. speakers) so that it is
            known which political party or parliamentary group the speaker belongs to or
            represents at a given moment in time, as further explained in the <ref
            target="#sec-affiliation">Section on Speaker affiliations</ref>.</p>

            <p>ParlaMint requires that a corpus must use parliamentary groups, while
            the use of political parties is optional. Note that if political parties are used,
            the corpus is also expected to encode which political parties constitute a parliamentary group;
            this is encoded via the <gi>relation</gi> element, as further explained in the
            Section on <ref target="#sec-relation">Relations between organisations</ref>.</p>

            <p>The introduction to this chapter already gave examples of how organisations are
            encoded in general, so we here only give examples of the encoding of the
            additional metadata that can also be associated with political parties or parliamentary groups,
            i.e. their political orientation on the left-to-right scale and the variables of the 
            <ref target="https://www.chesdata.eu/ches-europe">Chapel Hill Expert Surveys for Europe</ref>,
            CHES for short. This additional metadata is encoded in the <gi>state</gi> element(s), which
            should be the last element(s) in the <gi>org</gi>.</p>

            <div xml:id="sec-politicalOrientation">
              <head>Encoding political orientation</head>
              <p>Political orientation is encoded with the <gi>state</gi> element with the value of its
              <att>type</att> attribute equal to <val>politicalOrientation</val>. The nested <gi>state</gi>
              elements then give the type of the information, which can be either the (corpus) encoder or
              Wikipedia, its <att>source</att> (either a pointer to the ID of the person for encoder, or to
              the Wikipedia URL), and the reference to the category definition (defined in the
              <val>politicalOrientation</val> taxonomy) via the <att>ana</att> attribute, as is illustrated
              in the example below:
              
              <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-orientation">
                <org role="parliamentaryGroup" xml:id="MR">
                  <orgName full="abb">MR</orgName>
                  <orgName full="yes">Mouvement Réformateur</orgName>
                  <idno type="URI" subtype="wikimedia">https://en.wikipedia.org/wiki/Reformist_Movement</idno>
                  <state type="politicalOrientation">
                    <state type="encoder" source="#GrietDepoorter" ana="#orientation.CRR">
                      <note xml:lang="en">Orientation determined by encoder, using own knowledge of the parliamentary group.</note>
                    </state>
                    <state type="Wikipedia" source="https://en.wikipedia.org/wiki/Reformist_Movement" ana="#orientation.CR">
                      <note xml:lang="en">From 1992 the Reformist Movement (MR) consisted of: FDF, MCC, PRL and PFF. In September
                      2001, FDF decides to leave the alliance and chooses a new name, becoming DeFI.</note>
                    </state>
                  </state>
                </org>
              </egXML>
              Note also that a <gi>state</gi> may have a note that gives further free-text information about the orientation.</p>
            </div>
            
            <div xml:id="sec-CHES">
              <head>Encoding CHES variables</head>
              <p>The second type of metadata on organisations, in particular on political parties and parliamentary groups, comes
              from the <ref target="https://www.chesdata.eu/ches-europe">Chapel
              Hill Expert Surveys for Europe</ref> (CHES), either from the 1999-2019 edition, or from the 2019 edition. Here the
              top-level <gi>state</gi> element gives the <att>type</att> of the state, i.e. <val>CHES</val>, the URL of
              the CSV source for the information in the <att>source</att> attribute, the name of the political party in CHES (which typically differs from
              its name or ID in ParlaMint) in the <att>key</att> attribute, and the year span that the CHES information covers
              in the <att>from</att> and <att>to</att> attributes.</p>

              <p>Each subordinate <gi>state</gi> (of <att>type</att> <val>variable</val>) then encodes one CHES variable, which is
              given, via the <att>ana</att> attribute, as the reference to the appropriate category defined in the <val>CHES</val>
              taxonomy (cf. the Section on <ref target="#sec-classDecl">Class declaration and taxonomies</ref>).

              Finally, as CHES gives the values of its variables according to years,
              the third level of <gi>state</gi> (of <att>type</att> <val>value</val>) stores the
              periods of the variable together with its numeric value in the <att>n</att> attribute, as illustrated in the
              example below:
              
              <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-ches">
                <state type="CHES" key="MR" from="2002" to="2018" source="https://www.chesdata.eu/s/1999-2019_CHES_dataset_meansv3.csv">
                  <state type="variable" ana="#ches-lrgen">
                    <state type="value" from="2002" to="2005" n="6.35"/>
                    <state type="value" from="2006" to="2009" n="6.67"/>
                    <state type="value" from="2010" to="2013" n="7.0"/>
                    <state type="value" from="2014" to="2018" n="7.0"/>
                  </state>
                  <state type="variable" ana="#ches-lrecon">
                    <state type="value" from="2002" to="2005" n="7.3"/>
                    <state type="value" from="2006" to="2009" n="7.5"/>
                    <state type="value" from="2010" to="2013" n="7.62"/>
                    <state type="value" from="2014" to="2018" n="7.60"/>
                  </state>
                  ...
                </state>
              </egXML>
              </p>
            </div>
            
            <div xml:id="sec-relation">
              <head>Relations between organisations</head>
              
              <p>As mentioned, the relations between various organisations, in particular,
              which parliamentary groups of political parties are in the <hi>coalition</hi> or in
              <hi>opposition</hi> and when, are encoded in the final element of <gi>listOrg</gi>,
              namely <gi>listRelation</gi>, which then contains <gi>relation</gi> elements, as
              shown in the example below:
              
              <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-relations">
                <listRelation>
                  <relation name="coalition"
                            mutual="#parliamentaryGroup.MR #parliamentaryGroup.OpenVld"
                            from="2014-10-11"
                            to="2018-12-09"
                            ana="#period_54"/>
                  <relation name="opposition"
                            active="#parliamentaryGroup.Ecolo #parliamentaryGroup.cdH"
                            passive="#government.BE"
                            from="2014-10-11"
                            to="2018-12-09"
                            ana="#period_54"/>
                  <relation name="representing"
                            active="#parliamentaryGroup.MR"
                            passive="#politicalParty.CSSD.153 #politicalParty.ENO.1"
                            from="2013-10-29"
                            to="2017-10-26"/>
                </listRelation>
              </egXML>

              The type of relation is given in the <att>name</att> attribute.
              ParlaMint allows the following values of <att>name</att>:
              
              <list>
               <item><val>coalition</val>: the pointers to the organisations
               (i.e. parliamentary groups or political parties) are given in the <att>mutual</att>
               attribute (because a coalition is a mutual relation between its members);</item>
               
               <item><val>opposition</val>: the pointers to the organisations are given in
               the <att>active</att> attribute, as the organisations are in an active relation to the
               government, the pointer to which is given in the <att>passive</att> attribute;</item>
               
               <item><val>representing</val>: a parliamentary group representing one or more political
               parties in the parliament. The parliamentary group is given as the value of the
               <att>active</att> attribute, while the political parties are given as the value of
               the <att>passive</att> attribute;</item>
               
               <item><val>renaming</val>: the two organisations (typically political parties)
               referred to are essentially the same organisation, which has, however, been renamed at
               some point in time; the reference to the old organisation is given in the
               <att>passive</att> attribute, while the reference to the new one is given in the
               <att>active</att> attribute;</item>
               
               <item><val>successor</val>: an organisation (again, typically a political party), or
               several of them, ceased to exist, but a successor was created; as with renaming, the
               previous organisation is given as the value of the <att>passive</att> attribute,
               while the new one uses the <att>active</att> attribute.</item>
               
              </list>
              
              For relations it is typically also necessary to specify, with <att>from</att>
              and possibly <att>to</att>, when the relation was in force. Finally, it is
              possible, but not necessary, to also give, in the <att>ana</att> attribute, the legislative period and/or the
              government during which this particular coalition or opposition existed.
              </p>
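
              <p>The <val>renaming</val> and <val>successor</val> relations are not covered by the
              example above, so a minimal sketch is given here. All organisation IDs and dates are
              hypothetical, and the use of <att>from</att> to date these relations is an assumption
              based on the description above:

              <egXML xmlns="http://www.tei-c.org/ns/Examples">
                <listRelation>
                  <!-- Hypothetical: the party formerly known as OldName now goes by NewName -->
                  <relation name="renaming"
                            active="#politicalParty.NewName"
                            passive="#politicalParty.OldName"
                            from="2015-01-01"/>
                  <!-- Hypothetical: two dissolved parties are succeeded by a newly created one -->
                  <relation name="successor"
                            active="#politicalParty.Merged"
                            passive="#politicalParty.A #politicalParty.B"
                            from="2016-06-01"/>
                </listRelation>
              </egXML>

              In both cases the older organisation(s) appear in <att>passive</att> and the newer
              one in <att>active</att>.</p>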
            </div>
          </div>
        </div>
      </div>

      <div xml:id="chp-transcript">
        <head>Transcriptions</head>
        
        <p>The transcriptions are encoded in the <gi>text</gi> element of corpus components.  This
        element contains only the element <gi>body</gi>, which should then contain at least
        one division, <gi>div</gi>, as illustrated below:
        
        <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-div">
          <text ana="#reference">
            <body>
              <div type="commentSection">...</div>
              <div type="debateSection">...</div>
              <div type="debateSection">...</div>
              ...
            </body>
          </text>
        </egXML>
        
        As shown, the <gi>text</gi> element should be (as is the top level <gi>TEI</gi> element,
        as discussed in the Section on <ref target="#sec-roottags">Attributes of top-level
        elements</ref>) marked with the <att>ana</att> attribute as to which subcorpus the text
        belongs to, with the subcorpora themselves defined in the appropriate taxonomy (cf. the
        Section on <ref target="#sec-classDecl">Class declaration</ref>).</p>

        <div xml:id="chp-div">
          <head>Divisions</head>
          
          <p>A text body contains a series of divisions, <gi>div</gi>, in cases when the source
          document can be reliably split into sections, which is typically done on the basis of
          headings identified in the source. When this is not possible, the complete body will be
          just one division.</p>

          <p>In ParlaMint we have two types of divisions, which are distinguished by the value of
          their (required) <att>type</att> attribute. If its value is <val>debateSection</val>,
          then the division must contain at least one speech, while a division with the value
          <val>commentSection</val> must not contain any speeches, i.e. it contains transcriber
          (or other) comments only, e.g. the table of contents, references to laws etc.</p>
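
          <p>The difference between the two types can be sketched as follows. The headings and
          notes are invented for illustration and are not taken from an actual corpus:

          <egXML xmlns="http://www.tei-c.org/ns/Examples">
            <body>
              <!-- commentSection: headings and notes only, no utterances -->
              <div type="commentSection">
                <head>Contents</head>
                <note>1. Opening of the session</note>
                <note>2. Questions to the government</note>
              </div>
              <!-- debateSection: must contain at least one utterance -->
              <div type="debateSection">
                <head>Opening of the session</head>
                <u>...</u>
              </div>
            </body>
          </egXML>
          </p>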
          
          <p>Inside a <val>debateSection</val>-type division, the main elements of interest are
          speeches, encoded as the utterance element, <gi>u</gi>. However, this type of division
          can (and the <val>commentSection</val>-type must) also contain headings and notes by
          the transcribers that serve to structure and comment on the speeches, as well as page
          breaks, as illustrated below:
          
          <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-div-detail">
            <text ana="#reference">
              <body>
                <div type="debateSection">
                  <pb n="1"/>
                  <head>Child Poverty Unit</head>
                  <note>Question</note>
                  <note>Asked by</note>
                  <u>...</u>
                  ...
                  <pb n="2"/>
                  ...
                </div>
                <div type="debateSection">
                  <head>Trade Union Act 2016 (Political Funds)</head>
                  <note>Motion to Approve</note>
                  <note>Moved by</note>
                  <u>...</u>
                  ...
                </div>
                ...
              </body>
            </text>
          </egXML>
          
          It should be noted that the <gi>head</gi> element can appear only at the start of a
          division while notes can be interspersed with speeches anywhere inside a division, and
          can also appear inside speeches, i.e. inside <gi>u</gi> elements. It is also possible
          to specify the type of note, and, in fact, to use more precise elements than just
          <gi>note</gi>, which is further explained in the Section on
          <ref target="#sec-comments">Transcriber comments</ref>.</p>

          <p>Page breaks, <gi>pb</gi>, are optional elements and can appear both
          inside divisions and inside speeches or segments (as well as inside the <gi>note</gi>
          element, and, for the linguistically analysed version, inside sentences).
          They are used to preserve the page breaks from the digital source,
          possibly together with the page number, as the value of the <att>n</att>
          attribute. They can also point to the source of a particular page in the corpus
          via the <att>source</att> attribute and to their media file via the
          <att>corresp</att> attribute (cf. the Section on <ref
          target="#sec-sourceDesc">Source description</ref> and esp. the <ref
          target="#exa-sourceDescComp">Example on the source description of a component
          file</ref>), as illustrated in the example below:
          
          <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-pb">
            <div type="debateSection">
              <pb n="1" source="https://www.psp.cz/eknih/2013ps/stenprot/001schuz/s001001.htm"
                  corresp="#ps2013-001-01-000-000.audio1"/>
              <note type="speaker">Předsedající Miroslava Němcová</note>
              ...
            </div>
          </egXML>
          </p>
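
          <p>A page break can likewise fall in the middle of a paragraph. The following sketch
          assumes such a case; the position of the break and the page number are invented for
          illustration:

          <egXML xmlns="http://www.tei-c.org/ns/Examples">
            <u who="#DavidPrior" ana="#regular">
              <seg>I ask that the draft Regulations laid before the House<pb n="2"/> on 5 December be approved.</seg>
            </u>
          </egXML>
          </p>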
        </div>
        
        <div xml:id="sec-uterrance">
          <head>Utterances</head>
          <p>A speech is marked up using the <gi>u</gi> (utterance) element, as illustrated below:
          
          <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-speech">
            <u who="#DavidPrior" ana="#regular">
              <seg>I ask that the draft Regulations laid before the House on 5 December be approved.</seg>
              <seg>The relevant document is the 20th Report from the Legislation Committee.</seg>
            </u>
          </egXML>
          
          The most important attribute of an utterance is <att>who</att>, which gives the
          pointer to the <gi>person</gi> element containing the metadata of the speaker,
          which is discussed in the Section on <ref
          target="#sec-speakers">Speakers</ref>. Despite its importance in allowing analyses
          of speeches by speaker and their metadata, it does happen that the speaker of
          a speech cannot be determined; in such cases, the <att>who</att> attribute
          should be omitted.

          <!-- Not sure we should actually have this, because, what if the person is a #guest, 
               and we don't know who they are? Are they then marked as "#guest #unknown"?
               
          By convention the fact that the speaker is unknown is
          additionally marked by the value of <val>#unknown</val> in the <att>ana</att>
          attribute, which points to the appropriate speaker type taxonomy. </p>

          <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-speech">
            <u ana="#unknown">
              <seg>I will present to you an issue that I consider very important.</seg>
              ...
            </u>
          </egXML>
          -->
          </p>
          
          <p>The <gi>u</gi> element should also have the <att>ana</att> attribute giving a
          pointer to the typology of types of speakers, which is especially important for
          distinguishing the speeches of a session chair (who mostly speaks
          on procedural matters) from those of regular and, possibly, guest speakers. Note that we
          use the <val>#regular</val> value not only for MPs but for all other speakers
          that can regularly speak in a parliament, e.g. ministers, the PM, members of
          parliamentary commissions etc. There is also a special type of speaker, called
          <val>#interrupting</val>, which we discuss further in the Section on <ref
          target="#sec-interruptions">Interrupted utterances</ref>.</p>
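
          <p>A minimal sketch of how the speaker types might be distinguished on consecutive
          utterances is given below. It uses the <val>#chair</val>, <val>#regular</val> and
          <val>#guest</val> values of the speaker type taxonomy; the identifiers
          <val>#SessionChair</val> and <val>#InvitedExpert</val> are hypothetical and stand for
          whatever <gi>person</gi> identifiers the corpus defines:

          <egXML xmlns="http://www.tei-c.org/ns/Examples">
            <u who="#SessionChair" ana="#chair">...</u>
            <u who="#DavidPrior" ana="#regular">...</u>
            <u who="#InvitedExpert" ana="#guest">...</u>
          </egXML>
          </p>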
          
          <p>The utterances are then segmented using the <gi>seg</gi> element, which encodes the
          paragraphs of the source transcription. Even if the source files do not contain
          paragraph markings, each speech should contain at least one segment.</p>

          <p>Finally, an utterance (just as a division) can also contain transcriber comments
          (notes), as further detailed in the next section.</p>
        </div>
        
        <div xml:id="sec-comments">
          <head>Transcriber comments</head>
          
          <p>Transcriber comments give information on who spoke, what the time was,
          interruptions and the reason for them, what is happening in the chamber, results of voting, etc.
          While section headings can also be taken as a kind of transcriber comments, these
          serve to structure the transcription and are encoded as <gi>head</gi> elements, as
          explained at the start of this chapter, cf. the <ref
          target="#exa-div">Example</ref> there. Another type of transcriber comment treated separately
          is the presence of gaps in the transcript; these are treated in the
          Section on <ref target="#sec-gaps">Gaps</ref>.</p> 

          <p xml:id="para-hierarchy-comments">Apart from heads and gaps, transcriber comments
          are encoded using the
          <gi>note</gi> element or one of several so-called <q>incident</q> elements, as
          explained below. These elements can be placed directly inside <gi>div</gi>,
          <gi>u</gi>, <gi>seg</gi> or even <gi>s</gi> in the linguistically annotated version.
          They should be placed as far up the
          hierarchy as possible, i.e. if they appear at the start or end of a segment or utterance,
          they should be encoded before the start or, respectively, after the end of this segment or utterance.
          If possible, it is especially convenient not to have them inside <gi>seg</gi>
          (which contains text), as placing these elements there leads to mixed content,
          which is more difficult to process further, in particular when linguistically
          annotating the corpus. Similarly, it is also better to move them outside <gi>s</gi> elements.
          However, if a transcriber comment was placed in the middle
          of the text for good reasons, then it can be encoded inside the segment or sentence.
          Note also that
          utterances can be split on transcriber comments, as is explained in the Section on <ref
          target="#sec-interruptions">Interrupted utterances</ref>.</p>
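
          <p>The placement rule can be sketched as follows, assuming a transcriber comment that
          the source gives at the very end of a paragraph. Rather than leaving it as the last
          child of the first <gi>seg</gi>, it is encoded between the two segments, so that both
          segments contain text only. The text of the comment is invented for illustration:

          <egXML xmlns="http://www.tei-c.org/ns/Examples">
            <u who="#DavidPrior" ana="#regular">
              <seg>I ask that the draft Regulations laid before the House on 5 December be approved.</seg>
              <note>Applause from the chamber.</note>
              <seg>The relevant document is the 20th Report from the Legislation Committee.</seg>
            </u>
          </egXML>
          </p>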
            
          <div xml:id="sec-notes">
            <head>Notes</head>
            
            <p>In general, transcriber comments are encoded using <gi>note</gi>, which can be
            further qualified via its <att>type</att> attribute. We do not currently specify
            what the valid values of this attribute are. Some comments can also be encoded
            using more precise TEI elements, as further explained below. The following example
            gives typical transcriber comments:
            
            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-notes">
              <note type="speaker">The president, Dr. Milan Brglez:</note>
              ...
              <note type="vote-ayes">84 voted for the adoption of the measure.</note>
              ...
              <note type="vote-noes">2 voted against the adoption of the measure.</note>
              ...
              <note type="time">The session began at 10 o'clock.</note>
              ...
            </egXML>
            
            The first note simply gives the speaker of the utterance that would follow it, the
            second and third are notes on the voting results, while the fourth gives the time
            when the session started. Note that in this case we can also explicitly add the
            time when the session started, as in the following example:
            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-note-time">
              <note type="time">The session began at <time when="2016-04-13T10:00:00">10 o'clock</time>.</note>
            </egXML>
            Note that, in ParlaMint, the <att>when</att> attribute of <gi>time</gi> must
            contain not only the time, but the date as well, so that users of the corpus do
            not need to infer it.
            </p>
          </div>
          
          <div xml:id="sec-incidents">
            <head>Incidents</head>
            <p>Some types of transcriber comments, which we term <term>incidents</term>, can
            be encoded using more specific TEI elements. These elements can also be
            further qualified by the <att>type</att> attribute, with the values being
            determined by the ParlaMint schema. The three incident elements are:
            
            <list>
              <item><gi>vocal</gi> marks any vocalised but not necessarily lexical phenomenon,
              with the values of the (for this element) obligatory <att>type</att> attribute
              being, for example <val>interruption</val>, <val>laughter</val>,
              <val>murmuring</val> etc.</item>
              
              <item><gi>kinesic</gi> marks any communicative phenomenon, not necessarily vocalised,
              with the optional <att>type</att> values being e.g. <val>applause</val>,
              <val>laughter</val>, <val>gesture</val>.</item>
              <!-- laughter should surely be under vocal! -->
              
              <item><gi>incident</gi> marks any phenomenon or occurrence, not necessarily vocalised
              or communicative, with the optional <att>type</att> values being
              e.g. <val>break</val>, <val>sound</val>, <val>action</val>.</item>
            </list>
            
            The example below illustrates the use of these three elements:
            
            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-incidents">
              <vocal type="interruption">
                <desc>sounds from the chamber</desc>
              </vocal>
              ...
              <kinesic type="signal">
                <desc>signal for end of debate</desc>
              </kinesic>
              ...
              <incident type="action">
                <desc>minute of silence</desc>
              </incident>
            </egXML>
            
            As the example shows, the original content of the transcriber comment is retained
            in the <gi>desc</gi> element. Note that in cases when the agent of an incident is
            known, they can be specified in the optional <att>who</att> attribute, just as on
            utterances.</p>
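
            <p>As a sketch of the <att>who</att> attribute on an incident, the following
            assumes that the person causing the interruption has been identified; the
            identifier <val>#JeremyCorbyn</val> is used purely for illustration and would have
            to correspond to a <gi>person</gi> element in the corpus:

            <egXML xmlns="http://www.tei-c.org/ns/Examples">
              <vocal type="interruption" who="#JeremyCorbyn">
                <desc>Jeremy Corbyn: Traitor!</desc>
              </vocal>
            </egXML>
            </p>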
            
            <p>While the incidents must have at least one <gi>desc</gi> element, they can also
            have several, so that the description of the incident can be, if so desired,
            translated into English, as illustrated in the following example:
            
            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-incidents-descs">
              <vocal type="interruption">
                <desc xml:lang="sl">oglašanje iz dvorane</desc>
                <desc xml:lang="en">sounds from the chamber</desc>
              </vocal>
            </egXML>
            </p>
          </div>
        </div>

        <div xml:id="sec-gaps">
          <head>Gaps</head>

          <p>The transcribers can also note that a part of the speech was not transcribed,
          typically because it was not understood, sometimes also noting the reason why, such
          as that the microphone was not turned on, that there was noise in the chamber, or
          that the speaker was speaking too quietly. These notes can be encoded as the
          <gi>gap</gi> element, which is then also marked by
          <att>reason</att>=<val>inaudible</val>.  The original transcriber comment is left
          in the <gi>desc</gi> element, as illustrated below:
          
          <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-gap">
            ... I would further state that
            <gap reason="inaudible">
              <desc>speaker spoke too quietly, not understood</desc>
            </gap>
            and furthermore ...
          </egXML>
          </p>

          <p>Another reason for omitting a part of the transcription can be an editorial
          decision of the corpus compilers. The transcript can, for example, contain material
          that they do not want to include in the corpus, such as tables, or parts of the
          transcription that for technical reasons cannot be converted to text. In these
          cases, the reason given should be <val>editorial</val>, while the <gi>desc</gi>
          should contain what has been omitted, as illustrated below.
          
          <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-gap-editorial">
            <gap reason="editorial">
              <desc xml:lang="en">Table omitted</desc>
            </gap>
          </egXML>
          </p>
          
          <p>Sometimes a passage of the transcription is in a foreign language, and, esp. as
          the corpus is to be linguistically annotated, the passage is best left out of the
          transcription proper. This can be achieved by encoding it as a gap in the
          transcription with the reason <val>foreign</val>, while the <gi>desc</gi> should
          contain the omitted text. In this case, therefore, the description does not give the
          reason for the omission, but rather the text that has been omitted.

          The language of the foreign passage should be indicated on the <att>xml:lang</att>
          attribute of <gi>desc</gi>. If the language has not been identified, the ISO 639 code
          for undetermined language <q>und</q> can be used, while in cases where more than one
          language is used in such a passage, the <q>mul</q> code for multiple languages is
          used.  All languages used on <gi>desc</gi> should of course be documented in the
          <gi>langUsage</gi> element. An example is given below:

          <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-gap-foreign">
            <gap reason="foreign">
              <desc xml:lang="und">Huliniahuanngittunga</desc>
            </gap>
          </egXML>
          </p>
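
          <p>When the language of the passage is known, its code is given instead of
          <q>und</q>. The following is a sketch only: the French sentence is invented for
          illustration, and the code <val>fr</val> is assumed to be documented in the
          <gi>langUsage</gi> element of the corpus:

          <egXML xmlns="http://www.tei-c.org/ns/Examples">
            <gap reason="foreign">
              <desc xml:lang="fr">Merci beaucoup, Monsieur le Président.</desc>
            </gap>
          </egXML>
          </p>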
          
          <p>As with <ref target="#sec-incidents">incidents</ref>, gaps can also contain
          several descriptions so that they can be, if so desired, translated:
          
          <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-gap-descs">
            <gap reason="editorial">
              <desc xml:lang="de">Zitierte Druckfassung entfernt</desc>
              <desc xml:lang="en">Quoted printed matter omitted</desc>
            </gap>
          </egXML>
          </p>
        </div>
        
        <div xml:id="sec-interruptions">
          <head>Interrupted utterances</head>
          
          <p>A special case occurs when a transcription note states that somebody interrupted the
          speaker and gives the transcript of the interruption, possibly with who interrupted,
          with the main speaker then continuing with their speech, as in the following made-up
          snippet:
          
          <eg xml:id="exa-split-vocal-text">Boris Johnson: I propose a no-deal
          Brexit. /Jeremy Corbyn: Traitor!/ Because England does not want any dealings with
          the European Union.</eg>
          
          The standard manner in which such interruptions are encoded is using the default
          <gi>note</gi> element, or, much better, the <gi>vocal</gi> element, as explained in the
          Section on <ref target="#sec-incidents">Incidents</ref>, as below:

          <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-split-vocal">
            <u who="#BorisJohnson" ana="#regular">
              <seg>I propose a no-deal Brexit. <vocal type="interruption"><desc>Jeremy Corbyn:
              Traitor!</desc></vocal> Because England does not want any dealings with the European
              Union.</seg>
            </u>
          </egXML>

          This solution is relatively easy to implement and valid in ParlaMint; however, it
          has the disadvantage of leaving what is essentially a speech as the content of a
          comment. In cases where it is possible to consistently identify such <q>mini
          speeches</q>, an alternative and more useful encoding will turn this comment into
          a separate speech, and split the main utterance into two (or more) pieces. The
          example below illustrates how this is encoded:
          
          <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-splitu">
            <u who="#BorisJohnson" ana="#regular" xml:id="GB001.8.3" next="#GB001.8.5">I propose a no-deal Brexit.</u>
            <u who="#JeremyCorbyn" ana="#regular #interrupting" xml:id="GB001.8.4">Traitor!</u>
            <u who="#BorisJohnson" ana="#regular" xml:id="GB001.8.5" prev="#GB001.8.3">Because England does not want any dealings with the European Union.</u>
          </egXML>
          
          As can be seen, the split is indicated by the use of the <att>next</att> attribute
          on the first part of the split utterance and by the <att>prev</att> attribute of
          the next part of the split utterance, while the fact that an utterance interrupts
          another one is signaled by the addition of <val>#interrupting</val> to the
          <att>ana</att> attribute.  The values of the <att>next</att> and <att>prev</att>
          attributes are pointers to the identifiers of the next or previous
          part of the split utterance.<note>Note that, in general, the utterance can also be
          split in the middle of a sentence, which brings with it problems for automatic
          linguistic processing, as, ideally, the parts should be first joined, and only
          then processed.</note></p>

          <p>In the example the speaker of the interrupting speech has also been identified
          and marked in the <att>who</att> attribute; in cases where this is not possible,
          this attribute can be omitted.  As mentioned, the interrupting utterance should also have the
          value <val>#interrupting</val> in its <att>ana</att> attribute; this value comes
          from the appropriate category of the ParlaMint speaker type taxonomy. In case the
          speaker is identified, and their status can be determined, <att>ana</att> should
          also contain the type proper of the speaker, i.e. whether they are a
          <val>#chair</val>, <val>#regular</val> or <val>#guest</val> speaker.</p>
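
          <p>A sketch of the case where the interrupting speaker could not be identified is
          given below: the interrupting utterance has no <att>who</att> attribute and its
          <att>ana</att> contains only <val>#interrupting</val>. The identifiers are invented
          for illustration:

          <egXML xmlns="http://www.tei-c.org/ns/Examples">
            <u who="#BorisJohnson" ana="#regular" xml:id="GB001.9.3" next="#GB001.9.5">I propose a no-deal Brexit.</u>
            <u ana="#interrupting" xml:id="GB001.9.4">Traitor!</u>
            <u who="#BorisJohnson" ana="#regular" xml:id="GB001.9.5" prev="#GB001.9.3">Because England does not want any dealings with the European Union.</u>
          </egXML>
          </p>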
        </div>
      </div>
      
      <div xml:id="chp-linguistic">
        <head>Linguistic annotation</head>

        <p>This section introduces the ParlaMint linguistic annotation.
        An important note is that a linguistically annotated ParlaMint corpus is
        stored separately from its base (or <term>plain-text</term>) version, i.e. the
        version that has been discussed in the preceding sections.
        The encoding of the linguistically annotated version differs from the plain-text one in the following:
        
        <list>
          <item>All the corpus root and component file names have
          the extension <code>.ana.xml</code>.
          For example, if the plain-text root has the file name
          <code>ParlaMint-CZ.xml</code>, the linguistically annotated one should be
          <code>ParlaMint-CZ.ana.xml</code>; likewise, the component file
          <code>ParlaMint-CZ_2016-04-13.xml</code> becomes
          <code>ParlaMint-CZ_2016-04-13.ana.xml</code>.</item>
          
          <item>Because the file ID (i.e. the value of the top level element attribute
          <att>xml:id</att>, as explained in the Section on <ref
          target="#sec-roottags">Attributes of top-level elements</ref>) should be the same
          as the file name (cf. the Section on <ref target="#sec-files">File names and
          directory structure</ref>), the previous point also means that the linguistically
          annotated files should have the top level ID suffixed with <code>.ana</code>,
          e.g. <code>&lt;teiCorpus xml:id="ParlaMint-CZ.ana"&gt;</code></item>

          <item>The corpus stamp in the main title of the corpus root or components
          (cf. the Section on <ref target="#sec-titleStmt">Title statement</ref>) which is
          <code>[ParlaMint]</code> in the plain-text version, should be
          <code>[ParlaMint.ana]</code> for the linguistically annotated version.</item>
          
          <item>All the plain text of the utterance segments (i.e. the text immediately
          contained by the <gi>seg</gi> elements) of the plain-text version should be
          linguistically annotated on the specified levels, as is further explained in the
          following Section on <ref target="#sec-ana-markup">Linguistic
          markup</ref>.</item>
          
          <item>The linguistically annotated version of the corpus should also have some
          added metadata in the TEI header of the corpus root, which is detailed in the
          following Section on <ref target="#sec-ana-metadata">Metadata for linguistic
          annotation</ref>.</item>

        </list>
        </p>
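
        <p>The first three points can be sketched together for a component file of the
        linguistically annotated version. The identifier follows from the file name
        <code>ParlaMint-CZ_2016-04-13.ana.xml</code>; the wording of the title and the use of
        <att>type</att>=<val>main</val> are assumptions made for illustration, and only the
        <code>[ParlaMint.ana]</code> stamp is prescribed by the text above:

        <egXML xmlns="http://www.tei-c.org/ns/Examples">
          <TEI xml:id="ParlaMint-CZ_2016-04-13.ana">
            <teiHeader>
              <fileDesc>
                <titleStmt>
                  <title type="main">Czech parliamentary corpus ParlaMint-CZ, 2016-04-13 [ParlaMint.ana]</title>
                  ...
                </titleStmt>
                ...
              </fileDesc>
              ...
            </teiHeader>
            ...
          </TEI>
        </egXML>
        </p>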
        
        <div xml:id="sec-ana-markup">
          <head>Linguistic markup</head>

          <p>Linguistic annotation is added only to the text content of <gi>seg</gi> elements
          inside the speeches, i.e. <gi>u</gi> elements. For this text, ParlaMint requires the
          following additional markup to be present:
          <list>
            <item>tokens: what is a word, and what is punctuation, with preserved
            information on inter-token spaces;</item>
            
            <item>sentences: what is a sentence;</item>
            
            <item>lemmas: what is the base form of each word;</item>
            
            <item><ref target="https://universaldependencies.org/">Universal
            Dependencies</ref> (UD) part-of-speech and morphological features, and,
            optionally, part-of-speech tags from a different (local) tagset;</item>
            
            <item>named entities (NE): what is a name, categorised at least into the standard
            four NE classes;</item>
            
            <item>the UD dependency syntactic parse of the sentences;</item>
            
            <item>USAS semantic annotations on words and phrases but <hi>only</hi> for the
            machine translated corpora.</item>
          </list>
          
          Below, we explain the encoding of each of these levels.</p>
          
          <div xml:id="sec-ana-words">
            <head>Word-level annotation</head>
            
            <p>Basic linguistic annotation comprises tokenisation, sentence segmentation,
            part-of-speech tagging and lemmatisation, and this mark-up is illustrated in the
            example below:
            
            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-ana-word">
              <s>
                <w msd="UPosTag=DET|Case=Gen|Gender=Neut|Number=Sing|PronType=Dem" lemma="ta">Tega</w>
                <w msd="UPosTag=PRON|PronType=Prs|Reflex=Yes|Variant=Short" lemma="se">se</w>
                <w msd="UPosTag=PART" lemma="sploh">sploh</w>
                <w msd="UPosTag=AUX|Mood=Ind|Number=Sing|Person=1|Polarity=Neg|Tense=Pres|VerbForm=Fin" lemma="biti">nisem</w>
                <w msd="UPosTag=VERB|Aspect=Perf|Gender=Masc|Number=Sing|VerbForm=Part" lemma="zavesti" join="right">zavedel</w>
                <pc msd="UPosTag=PUNCT">.</pc>
              </s>
            </egXML>
            
            Sentences are marked up using the <gi>s</gi> element, words with the
            <gi>w</gi> element and punctuation symbols with the <gi>pc</gi> element. To
            retain the linguistically significant whitespace, the <att>join</att> attribute
            with the fixed value <val>right</val> is used, meaning there should be no whitespace
            to the right of the token. There can be (depending on the language and
            annotation tool used) an added complication with tokenisation, which is
            further taken up in the next Section on <ref target="#sec-ana-norm">Syntactic
            words</ref>.
            </p>
            
            <p>The base form, or lemma, of a word is given as the value of the
            <att>lemma</att> attribute, while punctuation characters, <gi>pc</gi>, do not
            have this attribute.</p>
            
            <p>The UD part-of-speech and morphological features are both packed in the
            <att>msd</att> attribute, with the part-of-speech having the <code>UPosTag</code>
            linguistic attribute, and the features separated by the vertical bar.</p>

            <p>ParlaMint also allows (but does not require) part-of-speech tags from some
            other tagset<note>These are typically tagsets developed and used for specific
            languages and can be found in the XPOS column of CoNLL-U files, which is the
            native format for UD treebanks.</note> to be added to the linguistic
            annotation. Where this information is encoded depends on the type of tagset.</p>

            <p>Tags from <term>synthetic tagsets</term>, such as the <ref
            target="https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html">Penn
            Treebank tagset</ref>, which have atomic tags that cannot always be decomposed
            into attribute-value pairs (e.g. the tag <q>TO</q> for the word <q>to</q>),
            should be encoded using the <att>pos</att> attribute on words and punctuation symbols, as
            shown in the example below:

            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-analysis-pos">
              <s>
                <w lemma="I" msd="UPosTag=PRON|Case=Nom|Number=Sing|Person=1|PronType=Prs" pos="PRP">I</w>
                <w lemma="support" msd="UPosTag=VERB|Mood=Ind|Tense=Pres|VerbForm=Fin" pos="VBP">support</w>
                <w lemma="the" msd="UPosTag=DET|Definite=Def|PronType=Art" pos="DT">the</w>
                <w lemma="amendment" msd="UPosTag=NOUN|Number=Sing" pos="NN" join="right">amendment</w>
                <pc msd="UPosTag=PUNCT" pos=".">.</pc>
              </s>
            </egXML>
            </p>

            <p>For <term>analytic tagsets</term>, where a part-of-speech tag can always be
            decomposed into a set of attribute-value pairs, the pointing attribute
            <att>ana</att> should be used. An example of such a collection of tagsets for
            various languages is given in the <ref
            target="http://nl.ijs.si/ME/V6/msd/html/">MULTEXT-East morphosyntactic
            specifications</ref>, and we give below an example that uses this tagset:

            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-analysis-ana">
              <s>
                <w ana="mte:Vmpr1p" lemma="prehajati" msd="UPosTag=VERB|Aspect=Imp|Mood=Ind|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin">Prehajamo</w>
                <w ana="mte:Sa" lemma="na" msd="UPosTag=ADP|Case=Acc">na</w>
                <w ana="mte:Ncnsa" join="right" lemma="odločanje" msd="UPosTag=NOUN|Case=Acc|Gender=Neut|Number=Sing">odločanje</w>
                <pc ana="mte:Z" msd="UPosTag=PUNCT" >.</pc>
              </s>
            </egXML>

            The <code>mte:</code> prefix is expanded, via the TEI extended pointer syntax
            as defined in the TEI header (cf. the Section on <ref
            target="#sec-ana-prefixDef">Prefix definitions</ref>), so that the
            value of such an <att>ana</att> attribute points to the expansion of the
            given tag into a feature structure. For example, the value <val>mte:Vmpr1p</val>
            would be expanded to
            <val>https://nl.ijs.si/ME/V6/msd/tables/msd-fslib2-sl.xml#Vmpr1p</val>, which
            then resolves to the feature structure below:
            
            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-analysis-fs">
              <fs xml:id="Vmpr1p" xml:lang="en" corresp="#Ggnspm">
                <f name="CATEGORY"><symbol value="Verb"/></f>
                <f name="Type"><symbol value="main"/></f>
                <f name="Aspect"><symbol value="progressive"/></f>
                <f name="VForm"><symbol value="present"/></f>
                <f name="Person"><symbol value="first"/></f>
                <f name="Number"><symbol value="plural"/></f>
              </fs>
            </egXML>
            </p>
          </div>
          
          <div xml:id="sec-ana-norm">
            <head>Syntactic words</head>
            
            <p>Certain frameworks, in particular the UD one (cf. their information on <ref
            target="https://universaldependencies.org/u/overview/tokenization.html">Tokenization
            and Word Segmentation</ref> and on <ref
            target="https://universaldependencies.org/format.html#words-tokens-and-empty-nodes">Words,
            Tokens and Empty Nodes</ref>), allow for tokens to be decomposed into several
            words, and it is these <term>syntactic words</term>, and not tokens, that are
            further annotated.</p>

            <p>To allow for such mismatches between word tokens and syntactic words, we use
            embedded empty words with associated <att>norm</att> attributes and the standard
            attributes with linguistic annotation. For example, Czech has the word <q>abyste</q>
            which is in UD decomposed into two syntactic words, <q>aby</q> and <q>byste</q>. This
            should be encoded as in the following example<note>Note that the example is rendered
            in three lines, however, the correct encoding in the corpus is actually in a single
            line, without any spaces between the elements, as otherwise the new line and
            indenting spaces are actually a part of the word <q>abyste</q>.</note>:

            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-ana-norm1">
              <w>abyste
              <w norm="aby" lemma="aby" msd="UPosTag=SCONJ"/>
              <w norm="byste" lemma="být" msd="UPosTag=AUX|Mood=Cnd|Number=Plur|Person=2|VerbForm=Fin"/>
              </w>
            </egXML>
            
            Note also that if such a multi-word token does not have a space following it,
            <code>join="right"</code> should be added to the top level word.</p>
            <!-- Check if everybody does it this way. 
                 Ideally, @join would be added also to the last syntactic word 
            -->
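
            <p>As a sketch of this case, assume that <q>abyste</q> is immediately followed by
            a comma. The multi-word token is then written on a single line, with
            <code>join="right"</code> on the top-level word; the context is invented for
            illustration:

            <egXML xmlns="http://www.tei-c.org/ns/Examples">
              <w join="right">abyste<w norm="aby" lemma="aby" msd="UPosTag=SCONJ"/><w norm="byste" lemma="být" msd="UPosTag=AUX|Mood=Cnd|Number=Plur|Person=2|VerbForm=Fin"/></w><pc msd="UPosTag=PUNCT">,</pc>
            </egXML>
            </p>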
            
            <p>While we do not have examples of such a practice yet, there could also be
            cases where two (or more) tokens correspond to one syntactic word. In such cases,
            it is the syntactic word that is on the top level, while the inner words are the
            actual tokens. To take an example from historical language, Slovene used to form
            the superlative form of adjectives with the word <q>naj</q> written separately
            (and often as <q>nar</q>), while in contemporary Slovene, the <q>naj</q> is a
            prefix of the adjective. This would be encoded as follows:
            
            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-ana-norm2">
              <w norm="najlepši" lemma="lep"><w>nar</w> <w>lepši</w></w>
            </egXML>
            In this case, if such a multi-token syntactic word does not have a space following
            it, <code>join="right"</code> should be added to the last token,
            i.e. <q>lepši</q>.</p>
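
            <p>As a minimal sketch of the two <att>join</att> conventions described in this
            section, assume (purely for illustration) that each of the two example words is
            immediately followed by a comma. The markup could then look along the following lines,
            with each word encoded on a single line so that no stray whitespace becomes part of it:

            <egXML xmlns="http://www.tei-c.org/ns/Examples">
              <!-- Hypothetical context: a comma directly follows each word -->
              <!-- Token with two syntactic words: join is given on the top level word -->
              <w join="right">abyste<w norm="aby" lemma="aby" msd="UPosTag=SCONJ"/><w norm="byste" lemma="být" msd="UPosTag=AUX|Mood=Cnd|Number=Plur|Person=2|VerbForm=Fin"/></w><pc>,</pc>
              <!-- Syntactic word with two tokens: join is given on the last token -->
              <w norm="najlepši" lemma="lep"><w>nar</w> <w join="right">lepši</w></w><pc>,</pc>
            </egXML>
            </p>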
            <!-- Ideally, @join would be added also to the top level word  -->
            
          </div>
          
          <div xml:id="sec-ner">
            <head>Named entities</head>
            
            <p>ParlaMint also requires annotation of Named Entities (NE), which should be
            categorised into the following four types:
            <list>
              <item><val>PER</val>: person</item>
              <item><val>LOC</val>: location</item>
              <item><val>ORG</val>: organisation</item>
              <item><val>MISC</val>: miscellaneous</item>
            </list>
            
            These types are also specified in a specialised taxonomy, as further explained in the
            Section on <ref target="#sec-ana-taxo">Linguistic taxonomies</ref>.</p>
            
            <p>The identified names and their types are marked up with the <gi>name</gi> element,
            the type being given as the value of its <att>type</att> attribute, as shown in the example below:
            
            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-ana-name">
              ...
              <w lemma="and" msd="UPosTag=CCONJ">and</w>
              <name type="ORG">
                <w lemma="Westminster" msd="UPosTag=PROPN|Number=Sing">Westminster</w>
                <w join="right" lemma="Hall" msd="UPosTag=PROPN|Number=Sing">Hall</w>
              </name>
              <w lemma="," msd="UPosTag=PUNCT" >,</w>
              ...
            </egXML>
            </p>

            <p>ParlaMint also supports more complex NE annotation schemes, such as the one <ref
            target="http://ufal.mff.cuni.cz/~strakova/cnec2.0/ne-type-hierarchy.pdf">used for
            Czech data</ref>, which introduces very detailed NE types and also allows for nested
            named entities. The example below gives such a case, where the top level and
            ParlaMint-compatible person name also contains nested names:

            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-analysis-name-complex">
              <name type="PER" ana="ne:p">
                <name ana="ne:pf"> 
                  <w>Františka</w> 
                </name> 
                <name ana="ne:ps"> 
                  <w>Laudáta</w> 
                </name>
              </name>
            </egXML>

            Here, the language-specific NE annotations are given as the value of the
            <att>ana</att> attribute. These values are pointers that use the TEI extended pointer syntax
            (cf. the Section on <ref target="#sec-ana-prefixDef">Prefix definitions</ref>) to refer to a
            corpus-specific taxonomy that defines these local NE types (for how ParlaMint
            linguistic taxonomies are defined, cf. the Section on <ref
            target="#sec-ana-taxo">Linguistic taxonomies</ref>).
            </p>
          </div>
          
          <div xml:id="sec-parses">
            <head>Syntactic parses</head>
            
            <p>Sentences are accompanied by a
            <ref target="https://universaldependencies.org/u/overview/syntax.html">Universal
            Dependencies</ref> parse.
            These analyses are encoded inside the mark-up of their sentence, although in a stand-off manner.
            This means that each token must be given an ID, while the syntactic analysis is stored in a link
            group element, <gi>linkGrp</gi>, containing a series of <gi>link</gi>
            elements.
            Each link is labelled with a dependency relation and joins two tokens.
            The example below illustrates the syntactic encoding:
            
            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-analysis-syn">
              <s xml:id="ParlaMint-GB_2021-01-06.seg393.8">
                <w xml:id="ParlaMint-GB_2021-01-06.seg393.8.1">I</w>
                <w xml:id="ParlaMint-GB_2021-01-06.seg393.8.2">support</w>
                <w xml:id="ParlaMint-GB_2021-01-06.seg393.8.3">the</w>
                <w join="right" xml:id="ParlaMint-GB_2021-01-06.seg393.8.4">amendment</w>
                <pc xml:id="ParlaMint-GB_2021-01-06.seg393.8.5">.</pc>
                <linkGrp targFunc="head argument" type="UD-SYN">
                  <link ana="ud-syn:nsubj" target="#ParlaMint-GB_2021-01-06.seg393.8.2 #ParlaMint-GB_2021-01-06.seg393.8.1"/>
                  <link ana="ud-syn:root" target="#ParlaMint-GB_2021-01-06.seg393.8 #ParlaMint-GB_2021-01-06.seg393.8.2"/>
                  <link ana="ud-syn:det" target="#ParlaMint-GB_2021-01-06.seg393.8.4 #ParlaMint-GB_2021-01-06.seg393.8.3"/>
                  <link ana="ud-syn:obj" target="#ParlaMint-GB_2021-01-06.seg393.8.2 #ParlaMint-GB_2021-01-06.seg393.8.4"/>
                  <link ana="ud-syn:punct" target="#ParlaMint-GB_2021-01-06.seg393.8.2 #ParlaMint-GB_2021-01-06.seg393.8.5"/>
                </linkGrp>
              </s>
            </egXML>
            
            The example shows that each token, as well as the sentence element, should be
            given an <att>xml:id</att> attribute, that the link group comes at the end of (but
            inside) the sentence, that <gi>linkGrp</gi> has two attributes,
            <att>targFunc</att> and <att>type</att>, with the fixed values <val>head
            argument</val> and <val>UD-SYN</val> respectively, and that it contains a series of empty
            <gi>link</gi> elements.</p>
            
            <p>The link elements then give, via the value of their <att>target</att>
            attribute, references to the head and argument tokens of the syntactic relation,
            which is specified in the <att>ana</att> attribute. By convention, the links are
            ordered so that the argument references follow the ordering of the tokens
            in the sentence, i.e. the second references of successive links should list all the tokens of the sentence in order.
            Note that for the top level <code>root</code> relation (of which there should be only one in the sentence),
            the head is the reference to the sentence ID.</p>
            
            <p>The relations themselves are pointers which use the <code>ud-syn:</code> prefix. This
            prefix is expanded, via the TEI extended pointer syntax as defined in the TEI header (cf. the Section
            on <ref target="#sec-ana-prefixDef">Prefix definitions</ref>), so that the
            value of such an <att>ana</att> attribute points to a category of the special UD
            syntactic taxonomy, which must be a part of the linguistically annotated version of
            the corpus; how to insert this taxonomy is specified in the Section on <ref
            target="#sec-ana-taxo">Linguistic taxonomies</ref>. There is one more detail to watch
            out for, namely, that UD allows the colon symbol <code>:</code> to appear in extended
            relations, e.g. <code>acl:relcl</code> for relative clause modifier. As we already use
            the colon for the extended pointer prefix, the colons in the relations should be
            changed to underscore, e.g. to <code>ud-syn:acl_relcl</code>. Note, however, that the
            relations specified in a <gi>link</gi> <att>ana</att> attribute are just pointers,
            and could have any value; it is the UD taxonomy that actually determines the correct
            value of the relation.</p>
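
            <p>To illustrate the colon substitution, a link for the <code>acl:relcl</code>
            relation might be encoded as in the following sketch. The two token IDs are
            hypothetical placeholders and do not refer to an actual sentence in the corpus:

            <egXML xmlns="http://www.tei-c.org/ns/Examples">
              <!-- Hypothetical IDs: the head comes first, the argument second -->
              <link ana="ud-syn:acl_relcl" target="#ParlaMint-XX_2021-01-06.seg1.1.2 #ParlaMint-XX_2021-01-06.seg1.1.5"/>
            </egXML>

            Here <val>ud-syn:acl_relcl</val> is expected to resolve to the category with the ID
            <code>acl_relcl</code> in the UD syntactic taxonomy.</p>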
          </div>

          <div xml:id="sec-semantics">
            <head>Semantic annotation</head>
            
            <p>The machine translated ParlaMint corpora (cf. the Section on <ref target="#sec-mt">Translations of corpora</ref>)
            have one more level of annotation, namely the semantic annotation of tokens and phrases (i.e. Multi-Word Expressions
            or MWEs) with <ref target="https://ucrel.lancs.ac.uk/usas/">USAS semantic tags</ref>.
            First, in order to be able to semantically tag MWEs, a new element was introduced inside sentences, namely the phrase,
            <gi>phr</gi><note>Because <gi>name</gi> and <gi>phr</gi> can give conflicting markup (i.e. crossing tags), the current
            script annotates phrases only where they are not related to names, i.e. not only conflicting markup, but also nestings of
            phr/name and name/phr are forbidden and such MWEs are not retained in the XML. Furthermore, due to a bug in the script,
            phrases adjacent to names are also not retained. We hope to introduce a better script and encoding in the future.</note>.
            The USAS semantic tags are then encoded in two attributes of tokens or MWEs, namely <att>function</att> and <att>ana</att>,
            as illustrated in the example below:

            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-analysis-sem">
              <w function="Z5" ana="sem:Z5">the</w>
              <w function="G1.1/S2mf,S9/S2mf" ana="sem:G1.1 sem:S2">Minister</w>
              <w function="Z5" ana="sem:Z5">of</w>
              <name type="ORG">
                <w function="I1" ana="sem:I1">Finance</w>
              </name>
              <w function="Z5" ana="sem:Z5">and</w>
              <phr type="sem" function="Z1mf,Z3c" ana="sem:Z1">
                <w function="Z1mf,Z3c" ana="sem:Z1">Deputy</w>
                <w lemma="Prime" function="Z1mf,Z3c" ana="sem:Z1">Prime</w>
                <w lemma="Minister" function="Z1mf,Z3c" ana="sem:Z1">Minister</w>
              </phr>
            </egXML>

            The first thing to note is that all the tokens inside an MWE receive the same semantic markup as their encompassing MWE
            <gi>phr</gi> element.
            Second, the <att>function</att> attribute gives the USAS tags exactly as output by the tool used for semantic tagging,
            which includes not only the tag computed to be the most appropriate for the given context, but also all the other lexically
            possible tags, with the comma being the separator.
            The first tag is then used to compute the value of the <att>ana</att> attribute. A conjunctive USAS tag (whose parts are
            delimited by the slash) is transformed into a series of references to the USAS taxonomy (cf. the Section on
            <ref target="#sec-ana-taxo">Linguistic taxonomies</ref>), and tag qualifiers are removed. For example, the tag
            <code>G1.1/S2mf</code> is transformed into <code>sem:G1.1 sem:S2</code>, i.e. the <code>mf</code> (for male and female)
            qualifiers are removed from <code>S2mf</code>. Note also that the prefix <code>sem</code>
            (cf. the Section on <ref target="#sec-ana-prefixDef">Prefix definitions</ref>) is used for pointing into the USAS taxonomy.
            With this set-up it is possible to encode the exact USAS tags as well as their ParlaMint categories, which give not only the
            tag, but also its gloss.</p>
            
          </div>
        </div>
        
        <div xml:id="sec-ana-metadata">
          <head>Metadata for linguistic annotation</head>

          <p>What kind of metadata a plain-text ParlaMint corpus should contain was explained in
          the Section on <ref target="#sec-metadata">Corpus metadata</ref> and in this section we
          detail what additions must be made to the metadata for the linguistically annotated
          version. Note that the <hi>changes</hi> for this version have already been explained at
          the start of this <ref target="#chp-linguistic">Chapter</ref>. In short, there are three
          additional parts that should be added to the <gi>teiHeader</gi> of the corpus root,
          namely a description of the tool(s) used to linguistically annotate the corpus, two
          additional taxonomies (one for named entities, and one for UD syntactic relations) and
          the definition of the prefix expansions for UD syntactic relations. These descriptions
          should also serve as the point of departure for those that want to introduce their own
          prefixes and taxonomies for defining additional and corpus-specific part-of-speech
          tagging schemes or named entity classes.</p>

          <div xml:id="sec-ana-appInfo">
            <head>Application information for linguistic processing</head>
            
            <p>As the linguistic analysis of a ParlaMint corpus will be performed by a tool, the
            information on which tool (or tools) has been used should be documented in the
            corpus root TEI header. This information is encoded in the <gi>appInfo</gi> element
            of the <gi>encodingDesc</gi>, as shown in the example below:

            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-ana-appInfo">
              <appInfo>
                <application version="1.0" ident="classla">
                  <label>CLASSLA</label>
                  <desc xml:lang="en">Linguistic processing performed with CLASSLA trained for
                  Slovene, available from <ref
                  target="https://github.com/clarinsi/classla">https://github.com/clarinsi/classla</ref>.</desc>
                </application>
              </appInfo>
            </egXML>
            
            The <gi>appInfo</gi> element contains, in general, a series of <gi>application</gi>
            elements, each one giving the information on one tool. The element gives the
            <att>version</att> number of the tool and specifies, via <att>ident</att>, an
            identifying code. It has two subordinate elements, with <gi>label</gi> giving the name
            of the tool and <gi>desc</gi> a short description of it, preferably with a pointer to
            the URL where it can be found or is at least documented.</p>
          </div>
          
          <div xml:id="sec-ana-taxo">
            <head>Linguistic taxonomies</head>

            <p>Some linguistic annotations have fixed vocabularies and these should
            be encoded as taxonomies in the TEI header of the linguistically analysed corpus
            root, similarly to other taxonomies, as discussed in the Section on the <ref
            target="#sec-classDecl">Class declaration</ref>.</p>

            <p>The first taxonomy is that of the Named Entity types, which has, apart from the
            translation of the category descriptions into the local language, a fixed structure, as follows:
            
            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-ana-taxonomy-NER">
              <taxonomy xml:id="ParlaMint-taxonomy-NER.ana">
                <desc xml:lang="en"><term>Named entities</term></desc>
                <category xml:id="PER">
                  <catDesc xml:lang="sl"><term>oseba</term></catDesc>
                  <catDesc xml:lang="en"><term>person</term></catDesc>
                </category>
                <category xml:id="LOC">
                  <catDesc xml:lang="sl"><term>lokacija</term></catDesc>
                  <catDesc xml:lang="en"><term>location</term></catDesc>
                </category>
                <category xml:id="ORG">
                  <catDesc xml:lang="sl"><term>organizacija</term></catDesc>
                  <catDesc xml:lang="en"><term>organisation</term></catDesc>
                </category>
                <category xml:id="MISC">
                  <catDesc xml:lang="sl"><term>drugo</term></catDesc>
                  <catDesc xml:lang="en"><term>miscellaneous</term></catDesc>
                </category>
              </taxonomy>
            </egXML>
            </p>

            <p>The second taxonomy to be inserted is the one for Universal Dependency
            relations. We currently do not use corpus specific taxonomies, even though different languages use
            different subsets of the UD syntactic relations, but rather a common taxonomy giving all the
            UD relations; the taxonomy has currently also not been localised, i.e. it is available in the English
            language only. Below we illustrate by giving a few relation definitions:

            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-ana-taxonomy-UD">
              <taxonomy xml:id="ParlaMint-taxonomy-UD-SYN.ana">
                <desc xml:lang="en">
                  <term>UD syntactic relations</term>
                </desc>
                <category xml:id="acl">
                  <catDesc xml:lang="en">
                  <term>acl</term>: Clausal modifier of noun (adjectival clause)</catDesc>
                </category>
                <category xml:id="cc_preconj">
                  <catDesc xml:lang="en">
                  <term>cc:preconj</term>: Preconjunct</catDesc>
                </category>
                <category xml:id="dep">
                  <catDesc xml:lang="en">
                  <term>dep</term>: Unspecified dependency</catDesc>
                </category>
                <category xml:id="punct">
                  <catDesc xml:lang="en">
                  <term>punct</term>: Punctuation</catDesc>
                </category>
                <category xml:id="root">
                  <catDesc xml:lang="en">
                  <term>root</term>: Root</catDesc>
                </category>
              </taxonomy>
            </egXML>
            
            The ID and description, <gi>desc</gi> of the <gi>taxonomy</gi> are fixed, and the
            <gi>category</gi> elements have the usual structure. Note that the ID of a category is
            identical to its name given in <gi>term</gi>, except that the colon, <code>:</code> in
            the official name of the relation must be substituted by the underscore,
            <code>_</code>, to enable correct referencing of these IDs, as discussed in the
            Section on <ref target="#sec-parses">Syntactic parses</ref>.</p>

            <p>The third taxonomy is only relevant for the machine translated corpora
            (cf. the Section on <ref target="#sec-mt">Translations of corpora</ref>) and gives the
            categories of the <ref target="https://ucrel.lancs.ac.uk/usas/">USAS semantic tags</ref>
            (for their encoding in the corpora cf. the Section on <ref target="#sec-semantics">Semantic annotation</ref>).
            This taxonomy is only available in the English language and is identical for all the machine translated corpora. 
            Below we illustrate its structure by giving the start of the taxonomy:

            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-ana-taxonomy-USAS">
              <taxonomy xml:id="ParlaMint-taxonomy-USAS.ana" xml:lang="en">
                <desc xml:lang="en"><term>USAS categories</term>: Semantic categories following the USAS Semantic tagset, ...</desc>
                <category xml:id="A1">
                  <catDesc><term>A1</term>: General And Abstract Terms</catDesc>
                  <category xml:id="A1.1.1">
                    <catDesc><term>A1.1.1</term>: General actions / making</catDesc>
                    <category xml:id="A1.1.1n">
                      <catDesc><term>A1.1.1-</term>: Inaction</catDesc>
                    </category>
                  </category>
                  <category xml:id="A1.1.2">
                    <catDesc><term>A1.1.2</term>: Damaging and destroying</catDesc>
                    <category xml:id="A1.1.2n">
                      <catDesc><term>A1.1.2-</term>: Fixing and mending</catDesc>
                    </category>
                  </category>
                  ...
                </category>
                ...
              </taxonomy>
            </egXML>

            The current USAS taxonomy covers a subset of all USAS semantic tags and was derived from the official list of 
            <ref target="https://ucrel.lancs.ac.uk/usas/semtags_subcategories.txt">USAS semantic subcategories</ref>.
            A category in the taxonomy can contain up to one positive (USAS = '+', taxonomy = 'p') or negative
            (USAS = '-', taxonomy = 'n') modifier. Other USAS modifiers (i.e. regex [mfnci%@]) are not retained.
            The taxonomy includes 455 categories, each with its USAS code and gloss.
            </p>
            
          </div>
          
          <div xml:id="sec-ana-prefixDef">
            <head>Prefix definitions</head>
            
            <p>Pointing attributes, such as <att>ana</att>, take as their value a series of
            references to the <att>xml:id</att> values of elements in an XML document. If the element is in
            the same document, then the reference consists of the hash character, <code>#</code>,
            prefixed to the particular ID, e.g. <val>#parla.uni</val>, and if it is in another
            XML document, then the hash is prefixed with the URL of that document, e.g.
            <val>https://nl.ijs.si/ME/V6/msd/tables/msd-fslib2-sl.xml#Vmpr1p</val>.</p>

            <p>Because the complete URL tends to be long, which is especially inconvenient
            when such references are given on every token in a corpus, TEI introduces the
            so-called extended pointer syntax, whereby the reference to an ID can be given in the
            form of a prefix, which is separated by a colon from the local part of the ID
            reference; the expansion of this prefix is determined via the <gi>prefixDef</gi>
            element in the <gi>encodingDesc</gi> of the TEI header.</p>

            <p>ParlaMint uses this
            mechanism for all linguistic annotations with a closed vocabulary, in particular
            for the Universal Dependencies syntactic relations, for the optional and
            corpus-specific analytical part-of-speech tags (cf. the Sections on <ref
            target="#sec-parses">Syntactic parses</ref> and <ref
            target="#sec-ana-words">Word-level annotation</ref>), and for semantic annotation in the machine translated corpora
            (cf. the Section on <ref target="#sec-semantics">Semantic annotation</ref>).
            The example below illustrates
            the prefix definitions for the obligatory UD syntactic relations and for the
            optional MULTEXT-East tags:
            
            <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-ana-listPrefix">
              <listPrefixDef>
                <prefixDef ident="ud-syn" matchPattern="(.+)" replacementPattern="#$1">
                  <p xml:lang="en">Private URIs with this prefix point to elements giving their name. In this document they are simply local references into the UD-SYN taxonomy categories in the corpus root TEI header.</p>
                </prefixDef>
                <prefixDef ident="mte"
                           matchPattern="(.+)"
                           replacementPattern="http://nl.ijs.si/ME/V6/msd/tables/msd-fslib-sl.xml#$1">
                  <p xml:lang="en">Private URIs with this prefix point to feature-structure elements defining the Slovenian MULTEXT-East Version 6 MSDs.</p>
                </prefixDef>
              </listPrefixDef>
            </egXML>
            
            The specialised element for listing prefix definitions, <gi>listPrefixDef</gi>, gives a
            series of prefix definitions, i.e. <gi>prefixDef</gi> elements. Each prefix definition
            defines its prefix as the value of the <att>ident</att> attribute, and then specifies
            a regular expression that matches the part of the ID reference after the prefix in its
            <att>matchPattern</att> attribute, and its substitution as the value of the
            <att>replacementPattern</att> attribute. The first prefix definition thus defines the
            <code>ud-syn</code> prefix, so for any ID reference with this prefix,
            e.g. <val>ud-syn:acl_relcl</val>, the part after the prefix (<code>acl_relcl</code>)
            is matched against <code>(.+)</code>, and the matched part
            (here the entire relation <code>acl_relcl</code>) is substituted into
            <code>#$1</code>, i.e. the hash character followed by the original value, so that
            <val>ud-syn:acl_relcl</val> gives <val>#acl_relcl</val>. This substitution is of
            course trivial, and hardly necessary, but was implemented so that all fixed-vocabulary
            linguistic analyses have the same treatment.</p>
            
            <p>More to the point is the second example, where very short ID references, such as
            <val>mte:Vmpr1p</val>, are expanded to
            <val>http://nl.ijs.si/ME/V6/msd/tables/msd-fslib-sl.xml#Vmpr1p</val>, as already
            explained in the Section on <ref target="#sec-ana-words">Word-level
            annotation</ref>.</p>

            <p>Finally, each prefix definition also contains a possibly bi-lingual paragraph
            explaining the definition.</p>
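
            <p>By analogy with the <code>ud-syn</code> prefix, a definition of the
            <code>sem</code> prefix used for semantic annotation could be sketched as follows. This
            is an illustration only: it assumes that the USAS taxonomy is available in the TEI
            header in such a way that a local reference suffices, and the wording of the explanatory
            paragraph is hypothetical:

            <egXML xmlns="http://www.tei-c.org/ns/Examples">
              <listPrefixDef>
                <!-- Assumed definition: sem:G1.1 would expand to the local reference #G1.1 -->
                <prefixDef ident="sem" matchPattern="(.+)" replacementPattern="#$1">
                  <p xml:lang="en">Private URIs with this prefix are local references into the USAS taxonomy categories.</p>
                </prefixDef>
              </listPrefixDef>
            </egXML>

            Under this assumption, the value <val>sem:G1.1 sem:S2</val> of an <att>ana</att>
            attribute would expand to <val>#G1.1 #S2</val>.</p>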
          </div>

        </div>
      </div>
      
      <div xml:id="sec-mt">
        <head>Translations of corpora</head>
        
        <p>The ParlaMint machine translated corpora are encoded similarly to corpora in their
        source language, i.e. they have an identically structured corpus root and components.
        The most obvious differences are the following:

        <list>
          <item>The filenames are extended with the target language code, as described in the
          Section on <ref target="#sec-files">Filenames</ref>, e.g. the linguistically analysed
          English translation of the Latvian corpus would have the root filename
          <code>ParlaMint-LV-en.ana.xml</code> and a component filename could be
          <code>ParlaMint-LV-en_2014-11-04.ana.xml</code>.</item>
          <item>The top-level <att>xml:id</att> of the corpus root and of components must also be changed
          accordingly, e.g. to <code>ParlaMint-LV-en.ana</code> or
          <code>ParlaMint-LV-en_2014-11-04.ana</code>.</item>
          <item>The stamp in the main titles should also be changed by adding the target language
          suffix, e.g. to <code>[ParlaMint-en.ana]</code>.</item>
          <item>The top-level <att>xml:lang</att> in the root and components should be set to
          the target language, e.g. <code>xml:lang="en"</code>.</item>
        </list>
        </p>
        
        <p>The structure of the <gi>text</gi>, including the transcriber comments and
        the linguistic analysis, is encoded in the same way as for the corpora in the source
        language. The two differences are that the aligned elements are linked
        to their corresponding elements in the source corpus, and that, to simplify processing, the
        transcriber comments are moved outside sentences in the (rare) cases where they appeared inside them
        in the original language corpus. In the current machine translated corpora,
        alignments are given for utterances, segments, sentences, and transcriber comments, which, furthermore,
        always have a 1-1 mapping to the corresponding source element. The alignment is therefore
        trivial, simply specifying the same <att>xml:id</att> value as that of the element in the source corpus,
        as illustrated in the following example:
        
        <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-mt">
          <div type="debateSection" xml:lang="en">
            <note type="speaker" xml:id="ParlaMint-LV_2019-01-31-PT13-516.ana.note1"
                  corresp="mt-src:ParlaMint-LV_2019-01-31-PT13-516.ana.note1">Head of the sitting.</note>
            <u who="#ĀboltiņaSolvita" xml:id="ParlaMint-LV_2014-11.u1"
               ana="#chair" xml:lang="en" corresp="mt-src:ParlaMint-LV_2014-11.u1">
              <seg xml:id="ParlaMint-LV_2014-11-04.seg1" xml:lang="en" corresp="mt-src:ParlaMint-LV_2014-11-04.seg1">
                <s xml:id="ParlaMint-LV_2014-11-04.s1" corresp="mt-src:ParlaMint-LV_2014-11-04.s1">
                  <w xml:id="ParlaMint-LV_2014-11-04.s1.t1" msd="UPosTag=PROPN|Number=Sing" lemma="Mr.">Mr.</w>
                  <w xml:id="ParlaMint-LV_2014-11-04.s1.t2" msd="UPosTag=PROPN|Number=Sing" lemma="President" join="right">President</w>
                  ...
                </s>
                ...
              </seg>
              ...
            </u>
            ...
          </div>
        </egXML>

        As can be seen above, the alignment is specified in the <att>corresp</att> attribute.
        The alignment reference makes use of the TEI extended pointer syntax (cf. also the Section on
        <ref target="#sec-ana-prefixDef">Prefix definitions</ref>) with the <val>mt-src</val> prefix,
        which must resolve to the correct component file of the corpus in the source language.</p>
        
        <p>In contrast to other linguistic annotations, which specify one prefix definition for the complete
        corpus (i.e. inside the corpus root <gi>teiHeader</gi>), the translated corpora should specify the
        prefix definition inside each corpus component, because the definition depends on the component file.
        For example, the prefix definition of the corpus component file <code>ParlaMint-LV-en_2014-11-04.ana.xml</code>
        should be as in the following example:
        
        <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-mt-prefixDef">
          <listPrefixDef>
            <prefixDef ident="mt-src"
                       matchPattern="(.+)"
                       replacementPattern="../../ParlaMint-LV.TEI.ana/2014/ParlaMint-LV_2014-11-04.ana.xml#$1">
              <p>Private URIs with this prefix point to aligned source elements of the MTed corpus.</p>
            </prefixDef>
          </listPrefixDef>
        </egXML>

        The assumption above is that the two corpora are located in the same directory
        (i.e. <code>./ParlaMint-LV.TEI.ana/</code> and <code>./ParlaMint-LV-en.TEI.ana/</code>), so that <att>corresp</att> values
        of the file <code>ParlaMint-LV-en.TEI.ana/2014/ParlaMint-LV-en_2014-11-04.ana.xml</code> with the <code>mt-src</code>
        prefix will point to
        <code>ParlaMint-LV-en.TEI.ana/2014/../../ParlaMint-LV.TEI.ana/2014/ParlaMint-LV_2014-11-04.ana.xml</code>,
        i.e. to <code>./ParlaMint-LV.TEI.ana/2014/ParlaMint-LV_2014-11-04.ana.xml</code>.</p>
        
        <p>The final addition to the translated corpora is the information on the application, <gi>application</gi>
        (cf. also the Section on <ref target="#sec-ana-appInfo">Application information for linguistic processing</ref>),
        i.e. on the program that was used to translate the corpora, as illustrated by the following example:
        
        <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-mt-appInfo">
          <application ident="EasyNMT" version="2.0">
            <label>EasyNMT (OPUS-MT model)</label>
            <desc>Translation to English done with EasyNMT
            (<ref target="https://github.com/UKPLab/EasyNMT">https://github.com/UKPLab/EasyNMT</ref>)
            with OPUS-MT model bat
            (<ref target="https://github.com/Helsinki-NLP/Opus-MT">https://github.com/Helsinki-NLP/Opus-MT</ref>)</desc>
          </application>
        </egXML>

        This element should be given in the corpus root, together with all the other information on applications
        inside the application information (<gi>appInfo</gi>) element.</p>
        
    </div>
      
          
      <!--div xml:id="sec-audio">
          <head>Audio</head>
          
          <p>The transcription can refer to and align with external audio data using
          the <gi>timeline</gi> element, explained in the Section on <ref
          target="https://www.tei-c.org/release/doc/tei-p5-doc/en/html/SA.html#SASYMP">Placing
          Synchronous Events in Time</ref> of the TEI Guidelines, and further elaborated in
          <ref target="https://www.iso.org/standard/37338.html">ISO 24624:2016 Language
          resource management - Transcription of spoken language</ref>. While the ISO standard
          is better elaborated, it also changes and adds element definitions, so we are using
          the standard TEI variant of speech encoding as far as the schema is concerned, while
          taking into account, as much as possible, specific encoding choices as proposed by
          the ISO standard.</p>
          
          <p>First, TEI offers the <gi>recordingStmt</gi> element (a part of <gi>fileDesc</gi>
          of the TEI header) which contains the information about the recording(s) of the
          transcription. This information can be unstructured (i.e. a series of <gi>ab</gi>
          elements) or structured (contained in the <gi>recording</gi> element); Parla-CLARIN
          recommends the structured version. As shown in the example below, the element
          contains information of the type of recording (audio / video), its duration<note>The
          <att>dur</att> values should follow the <ref
          target="https://www.w3.org/TR/2004/REC-xmlschema-2-20041028/#duration">W3C
          datatype</ref>.</note> and a pointer to the file, possibly a responsibility statement
          (<gi>respStmt</gi>) of the person or agency that made the recording, the date when
          the recording file was made and the equipment used:
          
          <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-recordingStmt">
          <recordingStmt>
          <recording type="audio" dur="PT43M45S">
          <media mimeType="audio/wav" url="WAV/Session_2018-12-01a.wav"/>
          <respStmt>
          <resp>Audio capture</resp>
          <name>John Dury</name>
          </respStmt>
          <time>2016-04-15</time>
          <equipment>
          <ab>Video downloaded from U.K. parliament site.</ab>
          <ab>Audio extracted from video with Audacity 1.4</ab>
          </equipment>
          </recording>
          </recordingStmt>
          </egXML>
          </p>
          
          <p>The mapping of time intervals of the recording to IDs in the TEI document is
          encoded in the <gi>timeline</gi> element, in particular in the contained
          <gi>when</gi> elements. As explained below, these IDs are then used to link elements
          in the transcription to the timeline, therefore each <gi>when</gi> element must have
          the <att>xml:id</att> attribute. The <gi>when</gi> elements must also be in the same
          order as the time-points they encode.</p>
          
          <p>As the example below shows, the timeline gives the unit in which the intervals are
          specified (typically second, <code>s</code>) and the time origin of the timeline,
          here referring to the first <gi>when</gi> element, at the very start of the
          recording, so specified with the absolute time.  Further <gi>when</gi> elements give
          the interval between this origin point and their end:
          
          <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-timeline">
          <timeline unit="s" origin="#T0">
          <when xml:id="T0" absolute="00:00:00.0"/>
          <when xml:id="T1" interval="1.13" since="#T0"/>
          <when xml:id="T2" interval="3.84" since="#T0"/>
          <when xml:id="T3" interval="5.33" since="#T0"/>
          <when xml:id="T4" interval="9.35" since="#T0"/>
          <when xml:id="T5" interval="12.62" since="#T0"/>
          </timeline>
          </egXML>
          </p>
          
          <p>The IDs of the timeline synchronisation are then used by the <gi>u</gi>
          elements in the transcription via their <att>start</att> and
          <att>end</att> attributes. In the examples below we give three cases of such linking:
          the first one gives a straightforward temporal structure on the <gi>u</gi>;
          the second one uses the
          empty <gi>anchor</gi> element to give additional temporal structure for cases where
          the synchronised parts of the utterance are not further marked-up
          (or the synchronisation is required for elements that don't have the <att>start</att> and
          <att>end</att> attributes); while the third and
          fourth demonstrate the case where two utterances are partially overlapping<note>Note
          that in such cases we would also use the <att>prev</att> and <att>next</att>
          attributes, as shown in the <ref target="#exa-splitu">Example on split
          utterances</ref>.</note>:
          <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-utiming">
          <u who="#SPK0" start="#T0" end="#T1" xml:id="u2">Good morning!</u>
          <u who="#SPK1" start="#T1" end="#T3">Good morning, <anchor synch="#T2"/>Mr. president!</u>
          <u who="#SPK0" start="#T4" end="#T7">You do not have the <anchor synch="#T5"/>floor!</u>
          <u who="#SPK1" start="#T5" end="#T6">Sorry, <anchor synch="#T2"/>mate!</u>
          </egXML>
          </p>
          </div>
          
          <div xml:id="sec-facsimile">
          <head>Facsimile</head>
          <p>In cases where the facsimile of the original transcription is available
          (especially valuable for older parliamentary proceedings, where the exact
          appearance of the original proceedings is of interest), it is advantageous to
          enable viewing the original together with the encoded transcription. How to
          achieve this, in general, is explained in the <ref
          target="https://www.tei-c.org/release/doc/tei-p5-doc/en/html/PH.html">Chapter on
          Representation of Primary Sources</ref> of the TEI Guidelines.</p>
          
          <p>The simplest but also the most limiting way to achieve this is to have per-page
          facsimile files and use the page break i.e. <gi>pb</gi> element to mark page
          boundaries in the transcript and then directly specify the image file of the page
          with the <att>facs</att> attribute, as illustrated in the example below:
          
          <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-pb">
          <body>
          <pb facs="PNG/page1.png"/>
          ...
          <pb facs="PNG/page2.png"/>
          ...
          </body>
          </egXML>
          </p>
          
          <p>By convention, this encoding indicates that the image indicated by the
          <att>facs</att> attribute represents the whole of the text following the
          <gi>pb</gi> element, up to the next <gi>pb</gi> element. The page break element
          can also contain a reference to the source HTML of web-harvested proceedings using
          the <att>source</att> attribute, and per-page links to audio or video (cf. the
          Section on <ref target="#sec-speechvideo">Speech and video</ref>), as illustrated
          in the following example:
          
          <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-pb-source">
          <pb n="1"
          source="https://www.psp.cz/eknih/2013ps/stenprot/044schuz/s044051.htm"
          xml:id="ParlaMint-CZ_2016-04-13-ps2013-044-02-006-162.pb1"
          corresp="#ps2013-044-02-006-162.audio1"/>
          </egXML>
          </p>
          
          <p>A more complicated solution to referring to facsimiles, where it is possible
          to have several images per page (e.g. in different resolutions) or to specify
          areas of a page is enabled by the <gi>facsimile</gi> element, which should appear
          immediately before the <gi>text</gi> element of a TEI document. The example below
          refers to the first and third pages directly with the <gi>graphic</gi> element,
          whereas the second page is encoded as a <gi>surface</gi> which then contains the
          page image in two resolutions:
          
          <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-facsimile">
          <facsimile>
          <graphic xml:id="page1" url="PNG/page1.png"/>
          <surface xml:id="page2">
          <graphic type="600dpi" url="PNG/page2-highRes.png"/>
          <graphic type="300dpi" url="PNG/page2-lowRes.png"/>
          </surface>
          <graphic xml:id="page3" url="PNG/page3.png"/>
          </facsimile>
          </egXML>
          
          Such definitions are then referred to via local references in the <att>facs</att> attribute of
          <gi>pb</gi>, as discussed previously.</p>
          
          <p>More complicated cases, such as delimiting portions of a page are also supported
          by the TEI Guidelines, and for these the reader is referred directly to the Section
          on <ref
          target="https://www.tei-c.org/release/doc/tei-p5-doc/en/html/PH.html#PHFAX">Digital
          Facsimiles</ref>.</p>
          </div>
          </div-->
      
      <div xml:id="chp-validation">
        <head>Validation and conversion</head>
        
        <p>This chapter explains how to validate and finalise a ParlaMint corpus, and introduces
        scripts for converting a ParlaMint corpus to other, derived formats.</p>

        <div xml:id="sec-validate">
          <head>Validating ParlaMint corpora</head>
          
          <p>The XML structure of ParlaMint corpora can be validated with RelaxNG schemas, which
          exist in two versions: one produced as a customisation of the TEI Guidelines, and a set
          of schemas written from scratch for ParlaMint.</p>

          <p>The TEI customisation is written as a TEI ODD document, which is, in fact, the XML
          source of this document, and is available in the <ref
          target="https://github.com/clarin-eric/ParlaMint/tree/main/TEI">TEI/ directory</ref> of
          the ParlaMint GitHub repository. The XML contains not only the prose guidelines but
          also the formal specification of the TEI schema, given in <ref
          target="#schema">Appendix A</ref>; in the on-line version this specification is
          rendered as a reference to all the elements, attributes and classes used in ParlaMint
          corpora. The ODD document cannot be used directly for XML validation: it first has to
          be converted with the TEI XSLT stylesheets into a RelaxNG schema. The resulting schema
          is available in the same directory under the name of <ref
          target="https://github.com/clarin-eric/ParlaMint/blob/main/TEI/ParlaMint.rng">ParlaMint.rng</ref>
          (in RelaxNG XML syntax) and <ref
          target="https://github.com/clarin-eric/ParlaMint/blob/main/TEI/ParlaMint.rnc">ParlaMint.rnc</ref>
          (in RelaxNG compact syntax). This schema should be used to check that ParlaMint
          component files validate against TEI.</p>
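
          <p>One way to apply this schema while editing, offered here as an assumption rather than a
          ParlaMint requirement, is to associate a file with a local copy of
          <code>ParlaMint.rng</code> through an <code>xml-model</code> processing instruction, in
          the same way as the XML source of this document refers to <code>tei_odds.rng</code>. In
          the sketch below the relative path in <code>href</code> is a placeholder and the content
          of <gi>TEI</gi> is abbreviated:

          <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-val-xmlmodel">
            <?xml-model href="../TEI/ParlaMint.rng" type="application/xml"
                        schematypens="http://relaxng.org/ns/structure/1.0"?>
            <TEI>
              <teiHeader>...</teiHeader>
              <text>...</text>
            </TEI>
          </egXML>

          Whether such an instruction may be retained in submitted files is not specified here, so
          it is safest to treat it as a local editing aid.</p>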
  
          <p>However, it is difficult to constrain a TEI ODD-derived XML schema to allow only the
          kinds of nestings and attributes that should appear in a ParlaMint corpus, so this
          schema allows (and <ref target="#schema">Appendix A</ref> lists) nestings of elements,
          as well as attributes, that are in fact forbidden in ParlaMint corpora.</p>
          
          <p>For this reason, we have also developed a set of RelaxNG schemas from scratch, which
          allow only those elements, attributes and content models that are in fact valid for
          a ParlaMint corpus. There are altogether four such schemas: one for a "plain-text"
          corpus root, one for its corpus components, one for the linguistically annotated corpus
          root, and one for its components. These schemas can be found in the <ref
          target="https://github.com/clarin-eric/ParlaMint/tree/main/Schema">Schema/
          directory</ref> of the ParlaMint GitHub repository, with the
          <ref target="https://github.com/clarin-eric/ParlaMint/blob/main/Schema/README.md">README</ref>
          file giving instructions on how to use them.</p>

          <p>Validating with XML schemas checks the formal structure of XML files but is less
          successful in validating other aspects of conformance, such as the textual content or
          the linking of pointer attributes. For this reason, we have also developed an XSLT script
          that assumes a schema-validated ParlaMint file as its input, and checks various other
          aspects of conformance. The validation scripts can be found in the <ref
          target="https://github.com/clarin-eric/ParlaMint/tree/main/Scripts">Scripts/
          directory</ref> of the ParlaMint GitHub repository, with the <ref
          target="https://github.com/clarin-eric/ParlaMint/blob/main/Scripts/README.md">README</ref>
          file listing them.</p>
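
          <p>As an illustration of what a schema alone cannot verify, consider a speaker reference.
          In the sketch below, in which all identifiers are hypothetical and the content is
          abbreviated, a schema can check that each <att>who</att> value is a well-formed pointer,
          but not that an element with the matching <att>xml:id</att> is actually defined in the
          corpus:

          <egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-val-pointer">
            <!-- in the list of persons -->
            <person xml:id="ExamplePersonA">
              <persName>...</persName>
            </person>
            <!-- in the transcription: resolves to the person above -->
            <u who="#ExamplePersonA" xml:id="ExampleCorpus.u1">...</u>
            <!-- schema-valid but dangling, as no element carries this identifier -->
            <u who="#ExamplePersonB" xml:id="ExampleCorpus.u2">...</u>
          </egXML>

          Detecting the second kind of reference is the sort of check that is left to an additional
          validation step after schema validation.</p>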

          <p>Note that the validation scripts need not be run directly, as
          the validation can be performed by the main <ref
          target="https://github.com/clarin-eric/ParlaMint/blob/main/Makefile">Makefile</ref> of the
          project. The Makefile is self-documenting, i.e. to see how to use it, please run
          <code>make help</code> in the top-level directory of the ParlaMint project.</p>
          
          <p>While each contributor of a corpus should validate their files with the ParlaMint
          schemas and validation script, further stages of validation are also applied to
          ParlaMint corpora:
          <list>
            <item>The corpora are converted to derived formats, in particular, the linguistically
            annotated version of the corpus to CoNLL-U and to the so-called vertical format for
            CQP-type concordancers. The Universal Dependencies project provides a <ref
            target="https://universaldependencies.org/release_checklist.html#validation">program</ref>
            for validating the formatting and linguistic analyses in CoNLL-U files, and this
            validation is used on the CoNLL-U files derived from their XML source, up to level 2
            conformance. The vertical files, on the other hand, are first compiled with <ref
            target="https://nlp.fi.muni.cz/trac/noske">manatee</ref> (the back end of <ref
            target="https://www.sketchengine.eu/nosketch-engine/">(no)Sketch Engine</ref>) and
            this compilation can also expose various errors.</item>

            <item>The last stage in validation is <q>human validation</q> where e.g. simply
            looking at various produced metadata files or at the concordances of a corpus exposes
            errors.</item>
          </list>
          </p>
        </div>
        
        <div xml:id="sec-final">
          <head>Finalisation of corpora</head>
          
          <p>While the vast majority of the work of converting source encodings into the ParlaMint
          corpus format is left to the compilers of a corpus, a few metadata elements can be
          produced by a common script on the basis of a nearly finished corpus, which then
          yields the final version of the corpus for a particular release. This includes
          setting the date, edition and handle under which the corpus will be distributed, and
          also calculating the size of the corpus (cf. the Sections on <ref
          target="#sec-extent">Extents</ref> and on <ref target="#sec-tagsDecl">Tags
          declaration</ref>).

          The script for finalisation can be found in the <ref
          target="https://github.com/clarin-eric/ParlaMint/tree/main/Scripts">Scripts/
          directory</ref> of the ParlaMint GitHub repository and the
          <ref target="https://github.com/clarin-eric/ParlaMint/blob/main/Scripts/README.md">README</ref>
          file briefly explains
          its function; more comments can be found in the script itself.
          <!-- The README there does not in fact mention finalisation, needs updating -->
          <!-- This script is for V2.1, it has to be updated for 3.0 -->
          </p>
        </div>
        
        <div xml:id="sec-conversion">
          <head>Conversions</head>
          
          <p>A TEI encoded document is, in general, not meant to be used directly by software
          programs; rather, it serves as an interchange and storage format. The ParlaMint project
          has produced various scripts to down-convert the XML encoded corpora to other formats.
          They can be found in the <ref
          target="https://github.com/clarin-eric/ParlaMint/tree/main/Scripts">Scripts/
          directory</ref> of the ParlaMint GitHub repository, with the
          <ref target="https://github.com/clarin-eric/ParlaMint/blob/main/Scripts/README.md">README</ref>
          file listing them
          and explaining their function. In short, the scripts convert the ParlaMint XML to plain
          text, to CoNLL-U, and to vertical format. There is also a script that takes a ParlaMint
          corpus and makes from it a sample for inclusion in the <ref
          target="https://github.com/clarin-eric/ParlaMint/">ParlaMint GitHub
          repository</ref>.</p>
        </div>
      </div>
      
      <div xml:id="chp-contributing">
        <head>Contributing to ParlaMint</head>
        
        <p>The <ref target="https://github.com/clarin-eric/ParlaMint/">ParlaMint GitHub
        repository</ref> contains these guidelines, the ParlaMint XML schemas, the scripts used
        to validate, finalise and convert the ParlaMint TEI XML corpora to derived formats, and
        samples of the ParlaMint corpora. There are three main branches in the repository:
          <list>
            <item><ref target="https://github.com/clarin-eric/ParlaMint/tree/main">main</ref>
              is the default branch used for the synchronisation of other branches.
              It is also used for releasing sample files that correspond to published corpora.</item>
            <item><ref target="https://github.com/clarin-eric/ParlaMint/tree/data">data</ref>
              serves as a pushing place for new sample files in ./Data/ParlaMint-XX directories.</item>
            <item><ref target="https://github.com/clarin-eric/ParlaMint/tree/devel">devel</ref>
              is used for the development of scripts and documentation.</item>
          </list>
        </p>

        <p>The validation procedure for corpora is explained in the Section on <ref target="#sec-validate">Validating ParlaMint
        corpora</ref>, while the technical aspects of contributing corpora are further explained in the <ref
        target="https://github.com/clarin-eric/ParlaMint/blob/main/CONTRIBUTING.md">CONTRIBUTING file</ref> of the repository.
        </p>
      </div>
      
      <div xml:id="sec-ack">
        <head>Acknowledgements</head>
        
        <!--p>The authors would like to thank all the participants of the <ref
        target="https://www.clarin.eu/blog/clarin-parlaformat-workshop">CLARIN
        ParlaFormat</ref> workshop (May 23-24, 2019, Amersfoort) for their very useful comments
        and suggestions.</p>
        
        <p>This proposal was inspired by a number of related projects, in
        particular: <ref target="https://tei-c.org/extra/teiinlibraries/">Best Practices for
        TEI in Libraries</ref>, the DARIAH and ELEXIS funded initiative <ref
        target="https://dariah-eric.github.io/lexicalresources/pages/TEILex0/TEILex0.html">TEI
        Lex0</ref> to develop an interchange encoding for machine readable dictionaries, and
        the <ref target="https://www.distant-reading.net/eltec/">ELTeC corpus</ref> initiative
        by the COST Action CA16204 <q>Distant Reading for European Literary History</q>.</p-->
        
        <p>The work on these recommendations was funded by the <ref
        target="https://www.clarin.eu/">CLARIN</ref> Research Infrastructure for Language
        Resources and Tools.</p>
      </div>
    </body>

    <back>
      <divGen type="subtoc"/>
      <!--div xml:id="examplar">
          <head>Example document</head>
          <p>This section gives a complete example document that validates according to Parla-CLARIN and
          aims to illustrate the encoding of various aspects of parliamentary proceedings corpora.</p>
          <egXML xmlns="http://www.tei-c.org/ns/Examples" xmlns:xi="http://www.w3.org/2001/XInclude"
          xml:id="exa-exemplar">
          <xi:include href="../Examples/Parla-CLARIN-Exemplar.xml"/>
          </egXML>
          </div-->
      
      <div xml:id="schema">
        <head>Formal specification</head>
        <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
                    href="ParlaMint-schemaSpecs.odd.xml"/>
      </div>
    </back>
  </text>
</TEI>