<?xml version="1.0" encoding="utf-8"?>
<?xml-model href="tei_odds.rng" type="application/xml"
schematypens="http://relaxng.org/ns/structure/1.0"?>
<TEI xmlns:rng="http://relaxng.org/ns/structure/1.0"
xmlns="http://www.tei-c.org/ns/1.0"
xmlns:sch="http://purl.oclc.org/dsdl/schematron"
xmlns:eg="http://www.tei-c.org/ns/Examples"
xmlns:egXML="http://www.tei-c.org/ns/Examples"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:tei="http://www.tei-c.org/ns/1.0"
xml:lang="en" n="tei_clarin">
<!-- To do:
- need a better introduction and discussions of samples
- maybe even more constrained schema
-->
<teiHeader>
<fileDesc>
<titleStmt>
<title>The structure and encoding of ParlaMint corpora</title>
<author>Tomaž Erjavec, [email protected]</author>
<author>Matyáš Kopp, [email protected]</author>
<author>Andrej Pančur, [email protected]</author>
</titleStmt>
<publicationStmt>
<publisher>CLARIN</publisher>
<date>2024-12-03</date>
<availability status="free">
<p>This file is freely available and you are hereby authorised to copy, modify, and
redistribute it in any way without further reference or permissions.</p>
</availability>
<pubPlace>
<ref target="https://github.com/clarin-eric/ParlaMint">https://github.com/clarin-eric/ParlaMint</ref>
</pubPlace>
</publicationStmt>
<sourceDesc>
<p>Made on the basis of the <ref
target="https://github.com/clarin-eric/parla-clarin">Parla-CLARIN</ref>
recommendation. The examples of encoding are taken, possibly modified, from the
<bibl xml:id="ParlaMint">ParlaMint corpus</bibl>.</p>
</sourceDesc>
</fileDesc>
<encodingDesc>
<projectDesc>
<p>Research Infrastructure for Language Resources and Tools <ref
target="https://www.clarin.eu/">CLARIN</ref>.</p>
</projectDesc>
</encodingDesc>
<revisionDesc>
<change when="2024-12-03">Tomaž Erjavec: Allow s/@ana.</change>
<change when="2024-05-14">Tomaž Erjavec: Make person/sex obligatory.</change>
<change when="2024-03-03">Tomaž Erjavec: Slightly edit the section of affiliations.</change>
<change when="2024-02-08">Tomaž Erjavec: Change encoding of name so it allows more .ana elements.</change>
<change when="2024-01-25">Tomaž Erjavec: Change encoding of org/state so it is TEI compliant.</change>
<change when="2023-11-06">Tomaž Erjavec: Add info on semantic annotations and other small changes.</change>
<change when="2023-10-09">Tomaž Erjavec: Allow multilingual names.</change>
<change when="2023-08-22">Tomaž Erjavec: Change description of org/state.</change>
<change when="2023-07-17">Tomaž Erjavec: Add org/@role='federatedState'.</change>
<change when="2023-06-13">Tomaž Erjavec: Start work on section for MT.</change>
<change when="2023-04-06">Tomaž Erjavec: allow orgName in affiliation.</change>
<change when="2022-12-21">Tomaž Erjavec: reorganise, fix errors, add explanations.</change>
<change when="2022-10-17">Tomaž Erjavec: add description for new XIncludes, reorganise</change>
<change when="2022-10-04">Tomaž Erjavec: reorganise sect. on organizations, add state.</change>
<change when="2022-07-27">Tomaž Erjavec: many changes to (explanations of) structure.</change>
<change when="2022-05-18">Katja Meden: add ParlaMint examples in exemplums.</change>
<change when="2022-05-16">Tomaž Erjavec: add section 3.5 on referring attributes.</change>
<change when="2022-03-04">Matyáš Kopp: finish first pass through schemaSpecs.</change>
<change when="2022-02-16">Tomaž Erjavec: first draft.</change>
</revisionDesc>
</teiHeader>
<text>
<front>
<titlePage>
<docTitle>
<titlePart type="main">The structure and encoding of
<ref target="https://github.com/clarin-eric/ParlaMint">ParlaMint corpora</ref></titlePart>
</docTitle>
<!--docEdition>v4.2</docEdition-->
<docDate>2024-12-03</docDate>
<!--byline>Tomaž Erjavec, Matyáš Kopp, Katja Meden</byline-->
</titlePage>
<p></p>
<divGen type="toc"/>
</front>
<body>
<div xml:id="chp-intro">
<head>Introduction</head>
<p>This document is meant to serve as a reference for the encoding of ParlaMint
corpora of parliamentary proceedings. In order for the ParlaMint corpora to be
interoperable (i.e. so that the same scripts can be used to process them), their
structure is fairly rigid, both in terms of file names and folder structure, as well
as their TEI XML encoding. This is not to say that all the corpora have to contain
exactly the same information because we distinguish obligatory information, which
all the corpora should contain, from that which is optional, and present only in the
corpora for which it has been possible to gather it from the corpus sources.</p>
<p>This document is a specialisation of <ref
target="https://github.com/clarin-eric/parla-clarin">Parla-CLARIN</ref>, itself a
customisation of the <ref
target="https://www.tei-c.org/release/doc/tei-p5-doc/en/html/index.html">TEI
Guidelines</ref>. But while Parla-CLARIN gives fairly general recommendations for
encoding corpora of parliamentary proceedings, ParlaMint, as mentioned, is much
stricter. This document gives very specific encoding recommendations without
necessarily stating the reasons for their choice. It covers the overall structure
of ParlaMint corpora, the metadata they contain, the encoding of transcriptions,
and, for the linguistically annotated version, the encoding of word-level
linguistic annotations, syntactic dependencies and named entities.</p>
<p>The document is not meant as a tutorial on TEI or ParlaMint, but as a reference to
elements, their nesting and attributes exemplified by snippets from the existing ParlaMint
corpora. Other sources can help in understanding the encoding of ParlaMint corpora:
<list>
<item>The freely available paper:<lb/> Erjavec, T., Ogrodniczuk, M., Osenova,
P. et al. The ParlaMint corpora of parliamentary proceedings. Language
Resources &amp; Evaluation (2022). <ref
target="https://doi.org/10.1007/s10579-021-09574-0">https://doi.org/10.1007/s10579-021-09574-0</ref>.</item>
<item>The <ref
target="https://clarin-eric.github.io/parla-clarin/">Parla-CLARIN</ref>
guidelines, which provide general guidelines for encoding parliamentary corpora
in TEI; they also give links to the relevant chapters of the <ref
target="https://www.tei-c.org/release/doc/tei-p5-doc/en/html">TEI Guidelines
</ref>.</item>
<item>Samples of ParlaMint corpora, available in the Data/ directory of the
ParlaMint GitHub repository, esp. useful as they give the complete picture of a
ParlaMint corpus; note that the samples in the <ref
target="https://github.com/clarin-eric/ParlaMint/tree/main/Data">main
branch</ref> are supposed to be publication-ready, while those in the <ref
target="https://github.com/clarin-eric/ParlaMint/tree/data/Data">data
branch</ref> are work in progress.
</item>
</list>
</p>
<p>The rest of these recommendations are structured as follows:
<list>
<item><ref target="#chp-overall">Chapter 2</ref> explains the overall XML
structure of a ParlaMint corpus, and introduces the distinction between the
corpus root and corpus components;</item>
<item><ref target="#chp-general">Chapter 3</ref> explains some general requirements
and the file-naming conventions a ParlaMint corpus has to meet; it also introduces
the top level elements and their attributes and the main pointing attributes;</item>
<item><ref target="#sec-metadata">Chapter 4</ref> concentrates on the structure and
encoding of the corpus metadata, such as the title information, documenting the source
of the corpus, taxonomies used etc.;</item>
<item><ref target="#chp-speaker">Chapter 5</ref> explains how and what information
must be encoded about the persons giving the speeches and the (political)
organisations they belong to;</item>
<item><ref target="#chp-transcript">Chapter 6</ref> treats the encoding of the
transcripts, including speeches and transcriber notes;</item>
<item><ref target="#chp-linguistic">Chapter 7</ref> details the addition of
linguistic annotations to the corpus;</item>
<item><ref target="#chp-validation">Chapter 8</ref> introduces scripts to finalise,
validate and convert a ParlaMint corpus to other formats;</item>
<item><ref target="#chp-contributing">Chapter 9</ref> gives instructions on how to
contribute samples of a ParlaMint corpus to GitHub;</item>
<item><ref target="#schema">Appendix A</ref> gives the formal specification of
the Parla-CLARIN schema.</item>
</list>
</p>
</div>
<div xml:id="chp-overall">
<head>Overall corpus structure</head>
<div xml:id="chp-xml-struct">
<head>XML structure</head>
<p>The parliamentary proceedings of one country or autonomous region constitute one
ParlaMint corpus, which is stored as one XML document, with <gi>teiCorpus</gi> as its
top-level element. It is composed of a <gi>teiHeader</gi>, giving the metadata for the
corpus as a whole (further detailed in the Section on <ref target="#sec-metadata">Corpus
metadata</ref>), followed by a series of <gi>TEI</gi> elements that each contain one
<term>corpus component</term>, as illustrated<note>Note that this is an illustrative
example, i.e. a valid ParlaMint corpus would also need certain attributes to be defined
on the illustrated elements. This holds for all the examples in this section.</note>
below:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-docstructure">
<![CDATA[
<!-- Corpus root -->
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader>...</teiHeader>
<TEI>...</TEI> <!-- Corpus component -->
<TEI>...</TEI> <!-- Corpus component -->
... <!-- More corpus components -->
</teiCorpus>]]>
</egXML>
Each corpus component should contain at most the transcripts for <hi>one
day</hi>, although several components can contain the transcript for the same
day, e.g. for different (types of) meetings. Whether and how these further
subdivisions into separate components are realised depends on the corpus, as
the granularity of parliamentary proceedings corpora, not to mention the national
rules of structuring the workings of the parliament, differ substantially.</p>
<p>A corpus component will thus be rooted in the <gi>TEI</gi> element, which then
contains its metadata in its own <gi>teiHeader</gi>, followed by the
<gi>text</gi> element, which contains the transcription of the particular
component, as illustrated below:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-docstructure-comp">
<TEI>
<teiHeader>...</teiHeader>
<text>...</text>
</TEI>
</egXML>
</p>
<p>The <gi>teiHeader</gi> of a corpus component (further detailed in the Section on
<ref target="#sec-metadata">Corpus metadata</ref>) contains the metadata specific to
this component, along with some redundant metadata about the provenance. The
component-specific metadata should be unique in the corpus, i.e. it should distinguish
the component from all the other components of the corpus.</p>
</div>
<div xml:id="sec-xinclude">
<head>Use of XInclude</head>
<p>The fact that a corpus is one XML document does not mean that it is also stored in
one file. In fact, ParlaMint requires that each corpus component is stored in a
separate file, with the <term>corpus root</term>, i.e. the top-level
<gi>teiCorpus</gi>, also stored as one file. Furthermore, some parts of the corpus
root metadata are also stored in separate files.</p>
<p>To enable one XML document to be composed of many files, we use the <ref
target="https://www.w3.org/TR/xinclude/">XInclude</ref> mechanism, and the
<term>corpus root</term> uses this mechanism (i.e. the <gi>include</gi>
elements in the XInclude namespace) to include its <term>corpus component</term>
files, so a corpus root will be in fact encoded similarly to the following
example:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-docstructure-root">
<![CDATA[
<!-- Corpus root file -->
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0" >
<teiHeader>...</teiHeader>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
href="2014/ParlaMint-NL_2014-04-16.xml"/> <!-- Corpus component file -->
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
href="2014/ParlaMint-NL_2014-04-17.xml"/> <!-- Corpus component file -->
... <!-- More corpus component files -->
</teiCorpus>]]>
</egXML>
</p>
<p>Apart from corpus components, some parts of the overall corpus metadata (i.e. the
<gi>teiCorpus</gi> <gi>teiHeader</gi> element) are also stored as separate files, and
hence also included in the corpus root using the same XInclude mechanism as explained
above.</p>
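<p>As a rough sketch of how this might look, the example below shows a corpus root
<gi>teiHeader</gi> that includes its metadata files. The file names follow the
conventions given in the Section on <ref target="#sec-files">File names and directory
structure</ref>. The placement of the includes is an assumption based on general TEI
practice (taxonomies inside <gi>classDecl</gi>, lists of organisations and persons
inside <gi>particDesc</gi>) and should be checked against the ParlaMint samples:
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<![CDATA[
<!-- Illustrative sketch only: the placement of the includes is an assumption -->
<teiHeader>
  <fileDesc>...</fileDesc>
  <encodingDesc>
    ...
    <classDecl>
      <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
                  href="ParlaMint-taxonomy-parla.legislature.xml"/>
      <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
                  href="ParlaMint-taxonomy-speaker_types.xml"/>
    </classDecl>
  </encodingDesc>
  <profileDesc>
    ...
    <particDesc>
      <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
                  href="ParlaMint-BE-listOrg.xml"/>
      <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
                  href="ParlaMint-BE-listPerson.xml"/>
    </particDesc>
  </profileDesc>
</teiHeader>]]>
</egXML>
</p>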
</div>
<div xml:id="sec-files">
<head>File names and directory structure</head>
<p>ParlaMint has strict rules on how to name the various files that constitute a
corpus, and how to collect them in directories.</p>
<p>The file names have the following structure:
<list>
<item>The corpus root file name should start with the string
<code>ParlaMint-</code>, followed by the ISO
3166 country (or autonomous region) code
(cf. Section on <ref target="#sec-standard">Standard values</ref>)
e.g. <code>ParlaMint-NL.xml</code> or <code>ParlaMint-ES-CT.xml</code>.</item>
<item>For machine-translated corpora the ISO 639 code of
the language (cf. Section on <ref target="#sec-standard">Standard
values</ref>) should follow the country code, e.g. <code>ParlaMint-NL-en.xml</code>.
</item>
<item>A corpus component filename should start with the name of the root, followed
by an underscore and the ISO 8601 formatted date of the transcript, for example
<code>ParlaMint-IS_2015-01-21-54.xml</code>. In case a corpus component is further
distinguished, so that there are several components with the same date, the
corpus compilers are free to extend the file name by a hyphen and any suffix containing
only ASCII letters and numbers and the hyphen character,
e.g. <code>ParlaMint-NL_2018-10-30-eerstekamer-4.xml</code> or
<code>ParlaMint-CZ_2016-04-13-ps2013-044-02-016-098.xml</code>.</item>
<item>Certain metadata elements from the corpus root <gi>teiHeader</gi> are stored
in separate files, in particular the list of speakers, <gi>listPerson</gi>,
the list of political parties and other organisations, <gi>listOrg</gi>, and
the ParlaMint structural and linguistic taxonomies, i.e. <gi>taxonomy</gi> elements.
The file names for such metadata files start with the name of the corpus root,
followed by a hyphen, and then the name of the element,
e.g. <code>ParlaMint-BE-listPerson.xml</code>. Where there are more files for
instances of the same element name, as is the case for taxonomies, the filename
should end with another hyphen, followed by the ID of the particular element,
e.g. <code>ParlaMint-BE-taxonomy-UD-SYN.xml</code>. Finally, some of the
taxonomies are not corpus-specific, i.e. identical files are used by all
ParlaMint corpora. In this case, the country or region code is omitted,
e.g. <code>ParlaMint-taxonomy-parla.legislature.xml</code>.
</item>
<item>The file names of the corpus as a whole or corpus components that have
been automatically converted from the source XML into some other format should
have the same name as the corpus root or components, respectively, but with
appropriate file extensions, e.g. <code>ParlaMint-IS_2015-01-21-54.txt</code>; this
is further explained in the Section on <ref
target="#sec-conversion">Conversions</ref>.</item>
<item>As discussed in the Chapter on <ref target="#chp-linguistic">Linguistic
annotation</ref> we distinguish the linguistically annotated version of the
corpus from the <q>plain-text</q> one, with the linguistically annotated version
having the additional suffix <code>.ana</code> on the corpus root and
components, e.g. <code>ParlaMint-ES-CT.ana.xml</code> or
<code>ParlaMint-IS_2015-01-21-54.ana.xml</code>.</item>
</list>
</p>
<p>For distribution the complete XML corpus should be stored in a
directory that has the same name prefix as the corpus root file. The directory
then contains the corpus root file and its metadata files,
while the corpus components should be in subdirectories, one per year, for example:
<eg xml:id="exa-directories"><lb/>
<code>ParlaMint-BE.TEI/ParlaMint-BE.xml</code><lb/>
<code>ParlaMint-BE.TEI/ParlaMint-BE-listPerson.xml</code><lb/>
<code>ParlaMint-BE.TEI/ParlaMint-BE-listOrg.xml</code><lb/>
<code>ParlaMint-BE.TEI/ParlaMint-taxonomy-parla.legislature.xml</code><lb/>
<code>ParlaMint-BE.TEI/ParlaMint-taxonomy-speaker_types.xml</code><lb/>
<code>...</code><lb/>
<code>ParlaMint-BE.TEI/2014/ParlaMint-BE_2014-06-19.xml</code><lb/>
<code>ParlaMint-BE.TEI/2014/ParlaMint-BE_2014-06-30.xml</code><lb/>
<code>ParlaMint-BE.TEI/2014/ParlaMint-BE_2014-07-17.xml</code><lb/>
<code>...</code><lb/>
<code>ParlaMint-BE.TEI/2015/ParlaMint-BE_2015-01-06-54.xml</code><lb/>
<code>ParlaMint-BE.TEI/2015/ParlaMint-BE_2015-01-07-54.xml</code><lb/>
<code>ParlaMint-BE.TEI/2015/ParlaMint-BE_2015-01-08-54.xml</code><lb/>
<code>...</code><lb/>
</eg>
The linguistically annotated version of the corpus is stored
separately, with the main directory and, as mentioned, the
corpus root and component filenames having the
additional suffix <code>.ana</code>, e.g.
<eg xml:id="exa-directories.ana"><lb/>
<code>ParlaMint-BE.TEI.ana/ParlaMint-BE.ana.xml</code><lb/>
<code>ParlaMint-BE.TEI.ana/ParlaMint-BE-listPerson.xml</code><lb/>
<code>ParlaMint-BE.TEI.ana/ParlaMint-BE-listOrg.xml</code><lb/>
<code>ParlaMint-BE.TEI.ana/ParlaMint-taxonomy-parla.legislature.xml</code><lb/>
<code>ParlaMint-BE.TEI.ana/ParlaMint-taxonomy-speaker_types.xml</code><lb/>
<code>ParlaMint-BE.TEI.ana/ParlaMint-taxonomy-NER.xml</code><lb/>
<code>ParlaMint-BE.TEI.ana/ParlaMint-taxonomy-UD.xml</code><lb/>
<code>...</code><lb/>
<code>ParlaMint-BE.TEI.ana/2014/ParlaMint-BE_2014-06-19.ana.xml</code><lb/>
<code>ParlaMint-BE.TEI.ana/2014/ParlaMint-BE_2014-06-30.ana.xml</code><lb/>
<code>ParlaMint-BE.TEI.ana/2014/ParlaMint-BE_2014-07-17.ana.xml</code><lb/>
<code>...</code><lb/>
<code>ParlaMint-BE.TEI.ana/2015/ParlaMint-BE_2015-01-06-54.ana.xml</code><lb/>
<code>ParlaMint-BE.TEI.ana/2015/ParlaMint-BE_2015-01-07-54.ana.xml</code><lb/>
<code>ParlaMint-BE.TEI.ana/2015/ParlaMint-BE_2015-01-08-54.ana.xml</code><lb/>
<code>...</code><lb/>
</eg>
</p>
</div>
</div>
<div xml:id="chp-general">
<head>General requirements</head>
<p>This section gives some general requirements a ParlaMint corpus has to meet,
in particular those relating to the characters in a corpus, and the use of
standards. It also details the structure of the file names of the ParlaMint root
and component files, as well as the attributes expected on the <gi>teiCorpus</gi>
and <gi>TEI</gi> tags.</p>
<div xml:id="sec-chars">
<head>Characters</head>
<p>The corpus should be encoded in Unicode, using the UTF-8 character encoding, at
least for European languages. In cases where the original contains characters from
the Unicode Private Use Area, these should, if possible, be given their closest
Unicode equivalents or substituted by the Unicode replacement character
U+FFFD. End-of-line hyphens, if present in the source files, should be removed, and
the split words joined in order to enhance searching the corpus and to simplify
linguistic processing.</p>
<p>The following characters, esp. prevalent when the source documents were in Word or
HTML, deserve special mention:
<list>
<item>TAB (U+0009) character helps the alignment of strings on successive
lines. As ParlaMint is not interested in preserving the layout, all TAB
characters are substituted by space characters (U+0020).</item>
<item>NO-BREAK SPACE (U+00A0) prevents, with some applications, an automatic
line break at its position and also collapsing such consecutive characters
into a single space. As the use of this character complicates (or breaks)
further processing, esp. linguistic annotation, these characters should be
substituted by the normal space character (U+0020). The same holds for other
variants of spaces (U+2000 - U+200A), which are, however, used much less
frequently.</item>
<item>NON-BREAKING HYPHEN (U+2011), similarly to NO-BREAK SPACE, prevents a
line break, in this case following its position. With a similar reasoning as
above, this character should be substituted by the normal hyphen character
('-', U+002D).</item>
<item>SOFT HYPHEN (U+00AD) indicates that a word can be hyphenated at that
point. Occurrences of this character should be removed from the corpus.</item>
</list>
</p>
<p>Text-bearing elements should also not start or end with space characters, and
sequences of whitespace characters should be changed into a single space.</p>
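<p>The character substitutions listed above could be sketched as a small XSLT 1.0
identity transform, shown below. This is only an illustration and is not the script
used by the ParlaMint project. It covers just the four characters named in the list:
the other space variants (U+2000 - U+200A), the trimming of leading and trailing
spaces, and the collapsing of whitespace sequences are left out, and joining words
split by end-of-line hyphens is also not attempted:
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<![CDATA[
<!-- Illustrative sketch only, not the project's own normalisation script -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Identity template: copy all nodes and attributes unchanged by default -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- Text nodes: TAB and NO-BREAK SPACE become a space, NON-BREAKING HYPHEN
       becomes a plain hyphen, and SOFT HYPHEN is removed because it has no
       counterpart in the third argument of translate() -->
  <xsl:template match="text()" priority="1">
    <xsl:value-of
        select="translate(., '&#x9;&#xA0;&#x2011;&#xAD;', '&#x20;&#x20;-')"/>
  </xsl:template>
</xsl:stylesheet>]]>
</egXML>
</p>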
</div>
<div xml:id="sec-standard">
<head>Standard values</head>
<p>Whenever possible, ParlaMint uses standards for information coding. In
particular, the following information must be standardised:
<list>
<item xml:id="iso3166">As the identity of a ParlaMint corpus is determined by the country or
region of the particular parliament, its code appears in many places. For
specifying these codes, the <ref
target="https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes">ISO
3166</ref> standard should be used, in particular <ref
target="https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2">ISO 3166-1
alpha-2</ref> for the two letter codes of the countries (for national
parliaments) and <ref target="https://en.wikipedia.org/wiki/ISO_3166-2">ISO
3166-2</ref> for the codes of country subdivisions (for parliaments of
autonomous provinces). So, for example, the country code for Spain is "ES",
while the code for the autonomous Basque community is "ES-PV". Note that we
use the term <term>regional parliaments</term> for such cases.</item>
<item>The codes for the languages used in the corpora (i.e. the possible
values of the <att>xml:lang</att> attribute) should follow <ref
target="https://tools.ietf.org/html/bcp47">BCP 47</ref> (cf. also <q><ref
target="https://www.w3.org/International/questions/qa-when-xmllang">xml:lang
in XML document schemas</ref></q>). Essentially, this means that the value for
a language code should have two letters, following <ref
target="https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes">ISO 639-1</ref>
or, and only if a two letter code does not exist for a language, the
three-letter <ref target="https://en.wikipedia.org/wiki/ISO_639-2">ISO
639-2/T</ref> code. For example, the code for Basque is 'eu'. ParlaMint
corpora will use at least two languages, i.e. the language that the
transcriptions are written in, which we will call the <term>local
language</term> and English, as the meta-language, which is (also) used in
the metadata.
</item>
<item>Temporal, i.e. time-related information is typically stored in the
<att>when</att>, <att>from</att> and <att>to</att> attributes of various
elements. To specify a date or time as the value of these attributes,
formatting according to the <ref
target="https://en.wikipedia.org/wiki/ISO_8601">ISO 8601</ref> standard should
be used, e.g. <val>2022-04-01</val> for the 1st of April 2022. More
information on temporal attributes is given in the Section on <ref
target="#sec-temporal">Temporal attributes</ref>.</item>
</list>
</p>
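<p>To illustrate how these standard values might appear together, the sketch below
shows a hypothetical component of the Basque regional parliament. The identifier and
the elements shown are invented for illustration, and the other required attributes
and elements are omitted; only the codes <val>ES-PV</val>, <val>eu</val>,
<val>en</val> and the date format follow from the text above:
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<![CDATA[
<!-- Hypothetical example: the ID and the selection of elements are invented -->
<TEI xmlns="http://www.tei-c.org/ns/1.0"
     xml:id="ParlaMint-ES-PV_2022-04-01"
     xml:lang="eu">                          <!-- ISO 3166-2 region code, BCP 47 language code -->
  <teiHeader>
    ...
    <title xml:lang="en">...</title>         <!-- English as the metadata language -->
    ...
    <date when="2022-04-01">...</date>       <!-- ISO 8601 date -->
    ...
  </teiHeader>
  <text>...</text>
</TEI>]]>
</egXML>
</p>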
</div>
<div xml:id="sec-roottags">
<head>Attributes of top-level elements</head>
<p>The Chapter on <ref target="#chp-overall">Overall corpus structure</ref>
introduced the top level elements of the corpus root file and of the component
files (i.e. the <gi>teiCorpus</gi> and <gi>TEI</gi> elements), but did not
elaborate on their attributes; these are presented in this section.</p>
<p>The corpus root has three required attributes, as shown below:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-teiCorpusRoot">
<![CDATA[
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0"
xml:id="ParlaMint-FR"
xml:lang="fr">]]>
</egXML>
All three attributes can also be used on any other element, and are thus of
special importance:
<list>
<item><att>xmlns</att> determines the namespace of the element, and this should
always be the TEI namespace, i.e. <val>http://www.tei-c.org/ns/1.0</val>. Note
that all lower level elements in the same file inherit this namespace, so it is
not necessary (although it is not an error) for other elements to also define
their namespace.</item>
<item><att>xml:id</att> is an attribute from the (implicitly assumed) XML
namespace, and gives the identifier for the corpus root or component. The
value of an ID should be unique in the corpus as a whole and should obey
format requirements as defined by <ref
target="https://www.w3.org/TR/xml-id/">W3C</ref>.
For the corpus root, as well as for the components, it is required that this
top level identifier is identical to the file name (without the file
extension).
The <att>xml:id</att> is a global attribute, so any element can have
it. While this is not required, it is necessary for any element that
is then referred to (via this same ID) by some other element, such
as many elements in the <gi>teiHeader</gi>, as is explained in the
Section on <ref target="#sec-metadata">Corpus metadata</ref>.
The subordinate elements in the transcription that have an ID (such as
utterances and segments), are recommended to have the top level <att>xml:id</att> as a
prefix and to indicate the element name in the ID. For example, if the top
level ID is <val>ParlaMint-GB_2021-01-06-lords</val>, the first utterance
would have the ID <val>ParlaMint-GB_2021-01-06-lords.u1</val> and the first
segment <val>ParlaMint-GB_2021-01-06-lords.seg1</val>. The number of the
element should not have leading zeros.</item>
<item><att>xml:lang</att> is also a global attribute and gives the language
code of the text content of the element; for the corpus root this does not
(just) mean the content of its TEI header, but primarily the textual content
of its XIncluded components. The convention is that the language of the text
content of an element is determined by the value of the nearest
<att>xml:lang</att> attribute on the element itself or on its ancestors. In cases where the content
is multilingual, the language code should be of the majority language. When
the proportion of the languages is about equal, then the <val>mul</val> code
for multiple languages can also be used.</item>
</list>
</p>
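<p>The recommended identifier scheme and the inheritance of <att>xml:lang</att> can be
sketched as follows. The example is illustrative only: the speaker reference and the
text are elided, the numbering of the second segment and the language override on it
are assumptions made for the sake of the example, and other attributes are omitted:
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<![CDATA[
<!-- Illustrative sketch: only the first utterance and segment IDs come from the text -->
<TEI xmlns="http://www.tei-c.org/ns/1.0"
     xml:id="ParlaMint-GB_2021-01-06-lords"
     xml:lang="en">
  <teiHeader>...</teiHeader>
  <text>
    <body>
      <div>
        <u xml:id="ParlaMint-GB_2021-01-06-lords.u1" who="...">
          <!-- No xml:lang here, so the language is inherited from TEI: en -->
          <seg xml:id="ParlaMint-GB_2021-01-06-lords.seg1">...</seg>
          <!-- Hypothetical override: this segment alone is marked as Welsh -->
          <seg xml:id="ParlaMint-GB_2021-01-06-lords.seg2" xml:lang="cy">...</seg>
        </u>
        ...
      </div>
    </body>
  </text>
</TEI>]]>
</egXML>
</p>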
<p>A corpus component also has the same three required attributes, but
additionally also the <att>ana</att> attribute:
<!-- The info in @ana should actually go in the
TEI/teiHeader/profileDesc/textClass - but it is probably too late to change
this now -->
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-teiCorpusComp">
<![CDATA[
<TEI xmlns="http://www.tei-c.org/ns/1.0"
xml:id="ParlaMint-FR_2017-07-04-E1001"
xml:lang="fr"
ana="#parla.sitting #reference">]]>
</egXML>
The same as for the corpus root, the component also sets the TEI namespace, and
gives the language of its textual content, while its <att>xml:id</att>, of
course, identifies the particular component. The <att>ana</att> attribute is a
pointing attribute, and we introduce these attributes in the next
section.</p>
</div>
<div xml:id="sec-pointing">
<head>Pointing attributes</head>
<p>The ParlaMint encoding uses pointing attributes for a number of purposes,
e.g. for references to taxonomy categories, to speaker metadata, or to
linguistic categories.</p>
<p>While a few elements have dedicated pointing attributes, there are three
generally used ones. They share the characteristics that they are all used by a
large number of different elements and that their value is a series of pointers,
i.e. a white-space delimited sequence of references to the values of some
<att>xml:id</att> attribute in the corpus or, in general, to a URI.
The three attributes are:
<list>
<item><att>ana</att> serves to provide an analysis or to classify an element
according to some pre-determined vocabulary. In ParlaMint the target element will
typically be a category in a taxonomy, an event or date, or an organisation.</item>
<item><att>corresp</att> points to items that correspond to the current
element in some way, e.g. the (URL of a) media file to a page break.
</item>
<item><att>ref</att> provides an explicit reference to the full definition or
identity for the entity being named. In ParlaMint it is used e.g. for
connecting a person's affiliation with a particular organisation. The value
of this attribute is often, but not always, a URL, e.g. for associating a
place name with its GeoNames URL.</item>
</list>
</p>
<p>To illustrate, the example below gives some elements that contain one or more of these
attributes:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-pointing">
<meeting ana="#parla.upper #parla.term #LEG.18">18 Legislatura</meeting>
...
<affiliation ref="#group.L-SP-PSd.Az" role="member" ana="#LEG.18" from="2018-03-27"/>
...
<placeName ref="https://www.geonames.org/2523918">Palermo</placeName>
...
<link ana="ud-syn:det" target="#ParlaMint-IT.seg1.2.6 #ParlaMint-IT.seg1.2.5"/>
</egXML>
The first example classifies the <gi>meeting</gi> element (the
definitions are given in the relevant taxonomy) as a meeting of the upper house,
in the scope of a parliamentary term, specifically the XVIII Legislative Term.
The example with <gi>affiliation</gi> (again, the definitions are given in the
elements with the pointed-to IDs) specifies that the (person that has this)
affiliation is a member of the parliamentary group <q>Lega-Salvini
Premier-Partito Sardo d'Azione</q> in the scope of the XVIII Legislative
Term. The <gi>placeName</gi> example gives the definition of Palermo in the
GeoNames database via the used URL. Finally, the <gi>link</gi> example
illustrates a Universal Dependencies determiner syntactic link between two
tokens. The link uses the TEI extended pointer syntax, further explained in the
Section on <ref target="#sec-ana-prefixDef">Prefix definitions</ref>.</p>
<p>It is often difficult to decide which of these attributes to use for a
particular pointer; the examples of usage given with the relevant element
should therefore always be consulted.</p>
</div>
<div xml:id="sec-temporal">
<head>Temporal attributes</head>
<p>ParlaMint makes extensive use of temporal information, e.g. to determine when
a session took place or the period when a certain person was an MP. As mentioned
in the Section on <ref target="#sec-standard">Standard values</ref>, the <ref
target="https://en.wikipedia.org/wiki/ISO_8601">ISO 8601</ref> format should be
used to specify the dates or times.</p>
<p>The following attributes are used to specify temporal information:
<list>
<item>The <att>when</att> attribute is used when the temporal information
refers to a point in time, typically a date, and is used e.g. to give the date
when the corpus was published, or when a change in the corpus was made.</item>
<item>The <att>from</att> and <att>to</att> attributes give the starting and
ending date or time of an interval, e.g. the time period the corpus covers, or
the period when a person was an MP. If only one of the two attributes is
present, then the assumption is that the interval extends back at least to the
start (if <att>from</att> is missing) or beyond the end (if <att>to</att> is
missing) of the time period that the particular ParlaMint corpus covers. Similarly,
if both attributes are missing, the assumption is that the interval covers the
complete time period of the ParlaMint corpus.</item>
</list>
</p>
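<p>As a minimal sketch of how these attributes look in practice, the following
fragment collects elements that also appear in other examples in these
guidelines; it is meant only as an illustration and not as a complete header:
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<date when="2016-04-13">13.04.2016</date>
...
<date from="2014-08-01" to="2020-07-16">1.8.2014 - 16.7.2020</date>
...
<affiliation ref="#group.L-SP-PSd.Az" role="member" ana="#LEG.18" from="2018-03-27"/>
</egXML>
The first <gi>date</gi> gives a single day, and the second an interval with both
end points. The <gi>affiliation</gi> has only the <att>from</att> attribute, so,
by the rule given above, the membership is assumed to extend beyond the end of
the time period that the corpus covers.</p>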
</div>
</div>
<div xml:id="sec-metadata">
<head>Corpus metadata</head>
<p>As mentioned, the <gi>teiCorpus</gi> and <gi>TEI</gi> elements contain the obligatory
<gi>teiHeader</gi> element, which stores the metadata of the corpus root or
component. In this section we explain and give examples of the required and optional
metadata contained in the <gi>teiHeader</gi>, proceeding through its various
elements and distinguishing which parts and what content are appropriate for
the corpus root, and which for a corpus component.</p>
<p>As a general remark, most metadata contains free text. ParlaMint requires
that this text is given in English, to help researchers
from other countries understand it. It is also recommended to give it in the
local language in which the (main portion of the) parliamentary transcripts is
written, so that local researchers can use it in their native language.</p>
<p>A ParlaMint <gi>teiHeader</gi> contains three obligatory elements, namely the file
description, <gi>fileDesc</gi>, the encoding description, <gi>encodingDesc</gi>, and
the profile description, <gi>profileDesc</gi>, followed by an optional revision description,
<gi>revisionDesc</gi>:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-teiHeader">
<teiHeader>
<fileDesc>...</fileDesc>
<encodingDesc>...</encodingDesc>
<profileDesc>...</profileDesc>
<revisionDesc>...</revisionDesc>
</teiHeader>
</egXML>
Below we explain each of these elements in turn.
</p>
<div xml:id="sec-fileDesc">
<head>File description</head>
<p>The file description, <gi>fileDesc</gi>, is composed of five obligatory
elements, namely the title statement, <gi>titleStmt</gi>, the edition statement,
<gi>editionStmt</gi>, the extent, <gi>extent</gi>, the publication statement,
<gi>publicationStmt</gi>, and the source description, <gi>sourceDesc</gi>:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-fileDesc">
<fileDesc>
<titleStmt>...</titleStmt>
<editionStmt>...</editionStmt>
<extent>...</extent>
<publicationStmt>...</publicationStmt>
<sourceDesc>...</sourceDesc>
</fileDesc>
</egXML>
</p>
<div xml:id="sec-titleStmt">
<head>Title statement</head>
<p>The title statement, <gi>titleStmt</gi>, gives the title of the corpus root or
component, along with the specification of the particular session(s) of the
parliament contained, the persons responsible for compiling the corpus, and the
funder(s) of the project.</p>
<p>This structure is exemplified by the following <hi>corpus root</hi> title
statement:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-titleStmtRoot">
<titleStmt>
<title type="main">Slovenski parlamentarni korpus ParlaMint-SI [ParlaMint]</title>
<title type="main" xml:lang="en">Slovenian parliamentary corpus ParlaMint-SI [ParlaMint]</title>
<title type="sub">Zapisi sej Državnega zbora Republike Slovenije, 7. in 8. mandat (2014 - 2020)</title>
<title type="sub" xml:lang="en">Minutes of the National Assembly of the Republic of Slovenia, Term 7 and 8 (2014 - 2020)</title>
<meeting n="7" corresp="#DZ" ana="#parla.lower #parla.term #DZ.7">7. mandat</meeting>
<meeting n="8" corresp="#DZ" ana="#parla.lower #parla.term #DZ.8">8. mandat</meeting>
<respStmt>
<persName ref="https://orcid.org/0000-0001-6143-6877">Andrej Pančur</persName>
<persName ref="https://orcid.org/0000-0002-1560-4099">Tomaž Erjavec</persName>
<resp>Kodiranje ParlaMint TEI XML</resp>
<resp xml:lang="en">ParlaMint TEI XML corpus encoding</resp>
</respStmt>
<funder>
<orgName>Raziskovalna infrastruktura CLARIN</orgName>
<orgName xml:lang="en">The CLARIN research infrastructure</orgName>
</funder>
<funder>
<orgName>Slovenska raziskovalna infrastruktura CLARIN.SI</orgName>
<orgName xml:lang="en">The Slovenian research infrastructure CLARIN.SI</orgName>
</funder>
</titleStmt>
</egXML>
The title statement starts with two titles (one main, the other subordinate), each
given in both the local language and English, with the appropriate language code possibly
inherited from a superordinate element. They are distinguished by the value
<val>main</val> or <val>sub</val> of their <att>type</att> attribute and the
value of their <att>xml:lang</att> attribute.</p>
<p>The main title has a formulaic structure <q>&lt;Country name&gt;
parliamentary corpus ParlaMint-&lt;Country code&gt; [ParlaMint]</q>, with an
equivalent structure for the local language. Note that the corpus <q>stamp</q>
in square brackets can also be <q>[ParlaMint.ana]</q> for the linguistically
annotated version of the corpus (as explained in the Chapter on <ref
target="#chp-linguistic">Linguistic annotation</ref>) or <q>[ParlaMint
SAMPLE]</q> for corpus data samples, as available on the <ref
target="https://github.com/clarin-eric/ParlaMint/">ParlaMint GitHub
repository</ref>.</p>
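<p>Assuming the Slovenian corpus from the example above, the English main titles
of its linguistically annotated version and of its sample would, following this
formula, read roughly as sketched below. The two titles are derived from the
pattern for illustration and are not copied from a released corpus:
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<title type="main" xml:lang="en">Slovenian parliamentary corpus ParlaMint-SI [ParlaMint.ana]</title>
...
<title type="main" xml:lang="en">Slovenian parliamentary corpus ParlaMint-SI [ParlaMint SAMPLE]</title>
</egXML>
</p>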
<p>The subordinate title, in contrast to the main one, is free text, and usually
formed on the basis of the source of the corpus. As with the main one, it should
be given in both languages.</p>
<p>After the titles comes the specification of the particular sessions that the
corpus contains, encoded as <gi>meeting</gi> elements: the two meeting
elements in the above example state that the ParlaMint-SI corpus contains the
meetings of the 7th and 8th terms of the lower house of the National Assembly
of the Republic of Slovenia. The <gi>meeting</gi> elements can give, as the
value of their <att>n</att> attribute, the numbers of the meetings that the
corpus covers, and their text content can give a free-text description of the
meetings in the local language.</p>
<p>The formal information on the meetings is given in the values of the
<att>corresp</att> and <att>ana</att> attributes, which are pointing
attributes, as already explained in the Section on <ref
target="#sec-pointing">Pointing attributes</ref>. Here they refer
to the definitions of organisations, further explained in the Section on <ref
target="#sec-orgs">Organisations</ref>, and to the categories of taxonomy
elements, further explained in the Section on the <ref
target="#sec-classDecl">Class declaration</ref>. The value of the
<att>corresp</att> attribute points to the governmental body that held the
meeting (in this case the National Assembly
of the Republic of Slovenia), while the <att>ana</att> attribute contains a
space-delimited sequence of pointers: <val>#parla.lower</val> points to the
definition of the lower house, <val>#parla.term</val> to the definition of a
parliamentary term, and <val>#DZ.7</val> to the definition of the seventh
term.
<!-- This is silly, why two attributes, why DZ in one, lower in the other?
Did we have a reason for such an encoding? Maybe ask Andrej
We could post an issue. -->
</p>
<p>Next come one or more responsibility statements, <gi>respStmt</gi>, each
containing one or more person names, <gi>persName</gi>, with an optional
<att>ref</att> attribute giving a URL where more information about the
person can be found, and the responsibility element, <gi>resp</gi>, which
specifies the responsibility that the statement describes.</p>
<p>In a similar manner, the <gi>funder</gi> elements give information on the
organisations which have financially contributed to the compilation of the corpus,
with the names of the organisations given in the <gi>orgName</gi> elements.</p>
<p>A <hi>corpus component</hi> has a very similar title statement to the
corpus root, except that certain elements specify the metadata of the
component, rather than of the complete corpus. It also contains some redundant
metadata, in particular the responsibility statement and the funder, as
illustrated in the example below:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-titleStmtComp">
<titleStmt>
<title type="main">Slovenski parlamentarni korpus ParlaMint-SI, izredna seja 59 [ParlaMint]</title>
<title type="main" xml:lang="en">Slovenian parliamentary corpus ParlaMint-SI, Extraordinary Session 59 [ParlaMint]</title>
<title type="sub">Zapisi sej Državnega zbora Republike Slovenije, 7. mandat, 59. izredna seja, 13.4.2018</title>
<title type="sub" xml:lang="en">Minutes of the National Assembly of the Republic of Slovenia, Term 7, Extraordinary Session 59, 13.4.2018</title>
<meeting n="59"
corresp="#DZ"
ana="#parla.lower #parla.meeting.extraordinary">Izredna</meeting>
<meeting n="7" corresp="#DZ" ana="#parla.lower #parla.term #DZ.7">7. mandat</meeting>
<respStmt>
<persName>Andrej Pančur</persName>
<resp>Kodiranje TEI</resp>
<resp xml:lang="en">TEI corpus encoding</resp>
</respStmt>
<funder>
<orgName>Raziskovalna infrastruktura CLARIN</orgName>
<orgName xml:lang="en">The CLARIN research infrastructure</orgName>
</funder>
<funder>
<orgName>Slovenska raziskovalna infrastruktura CLARIN.SI</orgName>
<orgName xml:lang="en">The Slovenian research infrastructure CLARIN.SI</orgName>
</funder>
</titleStmt>
</egXML>
In the example it can be seen that the main title of a corpus component is simply
an extension of the corpus root title, as it also gives the name of the particular
meeting that the component contains, while the subordinate title is, again, free
text. Both titles must be unique in the complete corpus.</p>
<p>The other difference is in the <gi>meeting</gi> elements, which here specify the
particular meeting transcribed in the corpus component. In the example above,
this is an extraordinary meeting of the lower house in the seventh term of the
National Assembly of the Republic of Slovenia.
</p>
</div>
<div xml:id="sec-editionStmt">
<head>Edition statement</head>
<p>ParlaMint corpora have their edition statement, <gi>editionStmt</gi>, both in
the corpus root and in the components. As illustrated below, the only element it contains
is <gi>edition</gi>:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-editionStmt">
<editionStmt>
<edition>3.0</edition>
</editionStmt>
</egXML>
We use semantic versioning to specify the version of the corpus, i.e. a
version number in which a new major version signals substantial changes to the
corpus, while a new minor version is reserved for e.g. correcting errata or other
minor changes. We do not use the patch number. It should be noted that, at least
so far, all the ParlaMint corpora have been released together, so that they are all
of the same edition, i.e. have the same version number. At the time of writing,
the latest version is 2.1, with the next one planned to be 3.0.</p>
</div>
<div xml:id="sec-extent">
<head>Extents</head>
<p>The <gi>extent</gi> element gives information on selected sizes of the
complete corpus (in the corpus root) or of one corpus component, as illustrated
below in the case of a corpus root extent:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-extent">
<extent>
<measure unit="speeches" quantity="75122" xml:lang="sl">75.122 govorov</measure>
<measure unit="speeches" quantity="75122" xml:lang="en">75,122 speeches</measure>
<measure unit="words" quantity="20190034" xml:lang="sl">20.190.034 besed</measure>
<measure unit="words" quantity="20190034" xml:lang="en">20,190,034 words</measure>
</extent>
</egXML>
ParlaMint requires two sizes to be given, namely the number of speeches
and the number of words, which are distinguished by their <att>unit</att> attribute
and are each given in both languages. The exact quantity is given in the <att>quantity</att>
attribute, while the text content of <gi>measure</gi> gives the quantity together
with the unit; if possible, the number here should contain the thousands
separator appropriate for the language.</p>
<p>It should be noted that both sizes are somewhat complex to compute and are
inserted into the TEI headers by a common script during the finalisation of a corpus (cf. the Section on
<ref target="#sec-final">Finalisation of corpora</ref>),
so it is not necessary to insert the extent while developing a ParlaMint
corpus.</p>
</div>
<div xml:id="sec-publicationStmt">
<head>Publication statement</head>
<p>The publication statement, <gi>publicationStmt</gi>, must appear in the corpus
root as well as, in identical form, in the corpus components. As illustrated
below, it contains information about the publisher of the corpus, the persistent
identifier where the complete corpus can be found, the licence under which it is
distributed, and the date when it was released:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-publicationStmt">
<publicationStmt>
<publisher>
<orgName xml:lang="sl">Raziskovalna infrastruktura CLARIN</orgName>
<orgName xml:lang="en">CLARIN research infrastructure</orgName>
<ref target="https://www.clarin.eu/">www.clarin.eu</ref>
</publisher>
<idno type="URI" subtype="handle">http://hdl.handle.net/11356/1432</idno>
<availability status="free">
<licence>http://creativecommons.org/licenses/by/4.0/</licence>
<p xml:lang="sl">To delo je ponujeno pod
<ref target="http://creativecommons.org/licenses/by/4.0/">Creative Commons Priznanje avtorstva 4.0
mednarodna licenca</ref>.</p>
<p xml:lang="en">This work is licensed under the
<ref target="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0
International License</ref>.</p>
</availability>
<date when="2021-06-11">11. 6. 2021</date>
</publicationStmt>
</egXML>
The <gi>publisher</gi> is, at least for the corpora produced in the scope of the
CLARIN ParlaMint project, the CLARIN research infrastructure, and the element
also gives the home page of the infrastructure.
The <q>identifier number</q> element, <gi>idno</gi>, specifies via its
<att>type</att> and <att>subtype</att> attributes with fixed values
<val>URI</val> and <val>handle</val> that the identifier is a handle, and
contains the handle where the complete corpus corresponding to the specified
version can be found.
The <gi>availability</gi> element specifies, via its <gi>licence</gi> element, the
fixed-value CC BY 4.0 URL, and in the following paragraphs gives a prose
description of the licence, including its URL via the <att>target</att> attribute
of <gi>ref</gi>. As usual, the textual information is given in both languages.
Finally, the <gi>date</gi> gives the date of the release, where the
<att>when</att> attribute gives the date in the ISO 8601 format, while the textual content
can give it according to the conventions used in the local language.
</p>
</div>
<div xml:id="sec-sourceDesc">
<head>Source description</head>
<p>The source description, <gi>sourceDesc</gi>, of the <hi>corpus root</hi> encodes
the original digital source of the ParlaMint corpus in the <gi>bibl</gi> element,
as shown in the following example:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-sourceDescRoot">
<sourceDesc>
<bibl>
<title type="main" xml:lang="sl">Zapisi sej Državnega zbora Republike Slovenije</title>
<title type="main" xml:lang="en">Minutes of the National Assembly of the Republic of Slovenia</title>
<idno type="URI">https://www.dz-rs.si</idno>
<date from="2014-08-01" to="2020-07-16">1.8.2014 - 16.7.2020</date>
</bibl>
</sourceDesc>
</egXML>
<!-- But what about if the corpus was made from a previous corpus, or if they
received the dump directly from the government? Have a look at the French
data! -->
Apart from the bilingual <gi>title</gi>s, it should also give, in an <gi>idno</gi>
with the fixed <att>type</att> value <val>URI</val>, the government URL from which the
transcripts were first harvested, while the dates of the earliest and latest
transcript in the corpus are indicated by the <att>from</att> and <att>to</att>
attributes of the <gi>date</gi> element. As usual, the values of these attributes
should be according to ISO 8601, while the textual content can be formatted
according to the local rules for writing dates.</p>
<p>For <hi>corpus components</hi> the source description is very similar to the
one for the corpus root, except that the <gi>title</gi> can be modified to
constrain the description to the exact meeting the component contains.
The <gi>date</gi> element must, of course, specify the exact date when the
meeting took place.
If the transcription of the meeting is available on the Web, the <gi>idno</gi>
should give this URL.
Furthermore, if the audio or video of the meeting is available, this information
can be given in the <gi>recordingStmt</gi>, as illustrated in the example below:
<egXML xmlns="http://www.tei-c.org/ns/Examples" xml:id="exa-sourceDescComp">
<sourceDesc>
<bibl>
<title type="main" xml:lang="cs">Parlament České republiky, Poslanecká sněmovna</title>
<title type="main" xml:lang="en">Parliament of the Czech Republic, Chamber of Deputies</title>
<idno type="URI">https://www.psp.cz/eknih/2013ps/stenprot/044schuz/s044033.htm</idno>
<date when="2016-04-13">13.04.2016</date>
</bibl>
<recordingStmt>
<recording type="audio">
<media xml:id="ps2013-044-02-000-000.audio1" mimeType="audio/mp3" source="https://www.psp.cz/eknih/2013ps/audio/2016/04/13/2016041308580912.mp3" url="2013ps/audio/2016/04/13/2016041308580912.mp3"/>
</recording>
</recordingStmt>
</sourceDesc>
</egXML>
<!-- The exemplified optional audio (or video) recording information is meant for
cases where the media file or, in general, URL encompasses the complete
transcription of the corpus component; for cases when the media are further
segmented, e.g. by speech, a different encoding is used, which is further
explained in the Section <ref target="#sec-audio">Audio</ref>. -->
</p>
<p>As the example shows, the recording statement contains a <gi>recording</gi>