PARTHENOS Workshop for CEE countries

Compiling parliamentary corpora, Hands-on (Tomaž Erjavec and Andrej Pančur)

How to get source data:

Focused web crawl
Web scraping (e.g. Beautiful Soup: tutorial at Programming Historian)
Directly from the parliamentary office!
Examples for each country in this GitHub repository :-)
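
As a minimal sketch of the web scraping option, the example below pulls speaker labels and speech paragraphs out of a transcript page. It uses Python's standard-library `html.parser` as a stand-in for Beautiful Soup so that it runs without extra packages. The sample HTML and the class names `speaker` and `speech` are hypothetical; real parliamentary sites will use different markup.

```python
from html.parser import HTMLParser


class TranscriptParser(HTMLParser):
    """Collect (css_class, text) pairs for every <p> element."""

    def __init__(self):
        super().__init__()
        self.items = []         # finished (css_class, text) pairs
        self._in_p = False      # True while inside a <p> element
        self._css_class = None  # class attribute of the open <p>
        self._chunks = []       # text fragments of the open <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._css_class = dict(attrs).get("class")
            self._chunks = []

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            # Join fragments and normalise runs of whitespace.
            text = " ".join("".join(self._chunks).split())
            self.items.append((self._css_class, text))
            self._in_p = False


# Hypothetical page fragment, for illustration only.
SAMPLE_HTML = """
<html><body>
  <p class="speaker">CHAIRPERSON:</p>
  <p class="speech">I open the   <b>first</b> sitting.</p>
  <p class="speaker">MEMBER A:</p>
  <p class="speech">Thank you.</p>
</body></html>
"""


def parse_transcript(html):
    """Return the list of (css_class, text) pairs found in html."""
    parser = TranscriptParser()
    parser.feed(html)
    parser.close()
    return parser.items


if __name__ == "__main__":
    for css_class, text in parse_transcript(SAMPLE_HTML):
        print(css_class, "|", text)
```

The same idea carries over to Beautiful Soup, which offers a more convenient interface for selecting elements; see the Programming Historian tutorial mentioned above.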

What do we need for this workshop?

Microsoft Word. Get your free trial version.

Oxygen XML Editor. Get your free trial license key for all products.

To begin with, an example from Parla-CLARIN

Parla-CLARIN example in this repository.

Open Parla-CLARIN-Exemplar.docx

Basic annotation in Word file

We will use the DOCX to TEI to HTML conversion service at http://nl.ijs.si/tei/convert/, an OxGarage-like service based on the TEI Stylesheets.

Save a template.docx as a template (see tutorial).

Apply the new Word template to Parla-CLARIN-Exemplar.docx (see tutorial).

Paragraph level styles
Character level styles
Using standard Word formatting
Using named tei:* styles (see also)
Using special extensions for some named tei:* styles, e.g. tei:sp
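
Before submitting a document for conversion, it can help to check which paragraph styles it actually uses. The following is a minimal sketch, not part of the workshop materials: it reads `word/document.xml` from the DOCX archive and counts the paragraph style IDs. Note that Word stores style IDs, which may differ from the style names shown in the user interface (such as `tei:sp`); the function name and the `(default)` label are assumptions made for this example.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace of WordprocessingML elements inside word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraph_style_counts(docx_file):
    """Count paragraph style IDs in a DOCX file (path or file-like object)."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    counts = Counter()
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's style, if any, is recorded in w:pPr/w:pStyle/@w:val.
        style = para.find(f"{{{W_NS}}}pPr/{{{W_NS}}}pStyle")
        if style is not None:
            style_id = style.get(f"{{{W_NS}}}val")
        else:
            style_id = "(default)"  # paragraph without an explicit style
        counts[style_id] += 1
    return counts
```

Calling `paragraph_style_counts("Parla-CLARIN-Exemplar.docx")` on a local copy of the exemplar would return a `Counter` mapping each style ID to the number of paragraphs that use it.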

Automatic conversion to TEI

Submit Parla-CLARIN-Exemplar.docx with the JSI profile; the result can be returned as:

URL
zip

Additional annotation in TEI document

Get Oxygen XML Editor. Get Trial License Key for all products.

Example of semi-automatic annotation with XSLT stylesheets

Change tei:sp/tei:l to tei:sp/tei:p

Convert from TEI drama to TEI speech
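
The workshop performs these steps with XSLT stylesheets. To illustrate only the first transformation (turning `tei:l` children of `tei:sp` into `tei:p`), here is a minimal Python sketch using the standard-library `xml.etree.ElementTree`. It is a swapped-in technique, not the workshop's stylesheet, and the function name `lines_to_paragraphs` is hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Serialise TEI elements with a default namespace instead of an ns0: prefix.
ET.register_namespace("", TEI_NS)


def lines_to_paragraphs(xml_text):
    """Rename every tei:l that is a direct child of tei:sp to tei:p."""
    root = ET.fromstring(xml_text)
    for sp in root.iter(f"{{{TEI_NS}}}sp"):
        for child in sp:
            if child.tag == f"{{{TEI_NS}}}l":
                child.tag = f"{{{TEI_NS}}}p"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = (
        '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
        "<sp><speaker>A</speaker><l>One.</l><l>Two.</l></sp>"
        "</body></text></TEI>"
    )
    print(lines_to_paragraphs(sample))
```

Only `tei:l` elements directly inside `tei:sp` are renamed, so verse lines elsewhere (for example inside `tei:lg`) are left untouched.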

Publish your corpus

Research data in CLARIN repository, e.g. http://hdl.handle.net/11356/1167

With teiPublisher

Slovenian example: https://exist.sistory.si/exist/apps/parla/

Concordancer: siParl in NoSketch Engine

Examples from CEE countries

Each country's examples are in a directory named with its country code (ISO 3166-1 alpha-2).

Mostly original PDFs converted to Word documents with ABBYY FineReader, an optical character recognition (OCR) application: AT, CZ, DE, HR, HU, MD, ME, SI, UK, YU

Some original Word documents: MK, RS

In some cases, I found only HTML webpages: BG, RO, SK
