PARTHENOS Workshop for CEE countries: Compiling parliamentary corpora, Hands-on

DARIAH-SI/CPC

PARTHENOS Workshop for CEE countries

Compiling parliamentary corpora, Hands-on (Tomaž Erjavec and Andrej Pančur)

How to get source data:

  • Focused web crawl
  • Web scraping (e.g. Beautiful Soup: tutorial at Programming Historian)
  • Directly from the parliamentary office!
  • Examples for each country in this GitHub repository :-)
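For the web scraping option, a minimal sketch of pulling speech paragraphs out of a saved HTML page. It uses Python's standard-library `html.parser` as a dependency-free stand-in for Beautiful Soup, and the markup (`<p class="speech">`) and sample text are made up for illustration; real parliamentary pages will need their own selectors.

```python
from html.parser import HTMLParser


class SpeechParser(HTMLParser):
    """Collect the text of every <p class="speech"> element (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.speeches = []   # finished paragraphs
        self._buffer = None  # text chunks of the paragraph being read, or None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "p" and ("class", "speech") in attrs:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            # Join the chunks and normalise whitespace
            self.speeches.append(" ".join("".join(self._buffer).split()))
            self._buffer = None


def extract_speeches(html):
    parser = SpeechParser()
    parser.feed(html)
    parser.close()
    return parser.speeches


# Invented sample page, not taken from any real parliament website
SAMPLE = """
<html><body>
  <h1>Session 1</h1>
  <p class="speech"><b>CHAIR:</b> The session is open.</p>
  <p class="note">(Applause)</p>
  <p class="speech"><b>MEMBER:</b> Thank you.</p>
</body></html>
"""

if __name__ == "__main__":
    for speech in extract_speeches(SAMPLE):
        print(speech)
```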

What do we need for this workshop?

Microsoft Word. Get your free trial version.

Oxygen XML Editor. Get your free trial license key for all products.

To begin with, an example from Parla-CLARIN

Parla-CLARIN example in this repository.

Open Parla-CLARIN-Exemplar.docx

Basic annotation in Word file

We will use the DOCX to TEI to HTML conversion at http://nl.ijs.si/tei/convert/, an OxGarage-like service based on the TEI Stylesheets.

Save template.docx as a Word template (see tutorial).

Apply the new Word template to Parla-CLARIN-Exemplar.docx (see tutorial).

  • Paragraph level styles

  • Character level styles

  • Using standard Word formatting

  • Using named tei:* styles (see also)

  • Using special extensions for some named tei:* styles, e.g. tei:sp
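To check which named tei:* styles a Word file actually defines, one option is to read its `word/styles.xml` part directly, since a .docx file is a zip archive. The sketch below is an assumption about how one might inspect the file, not part of the workshop tooling; `make_demo_docx` builds a tiny in-memory stand-in, and the `tei:name` style in it is invented for the demo.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/styles.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def tei_styles(docx):
    """Return sorted (style type, style name) pairs for styles named tei:*.

    `docx` is a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as archive:
        root = ET.fromstring(archive.read("word/styles.xml"))
    found = set()
    for style in root.iter(f"{{{W}}}style"):
        name = style.find(f"{{{W}}}name")
        if name is None:
            continue
        value = name.get(f"{{{W}}}val", "")
        if value.startswith("tei:"):
            found.add((style.get(f"{{{W}}}type", ""), value))
    return sorted(found)


def make_demo_docx():
    """Build a minimal in-memory stand-in for a .docx (styles.xml only)."""
    styles = (
        f'<w:styles xmlns:w="{W}">'
        '<w:style w:type="paragraph" w:styleId="teisp"><w:name w:val="tei:sp"/></w:style>'
        '<w:style w:type="character" w:styleId="teiname"><w:name w:val="tei:name"/></w:style>'
        '<w:style w:type="paragraph" w:styleId="Normal"><w:name w:val="Normal"/></w:style>'
        "</w:styles>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/styles.xml", styles)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for style_type, style_name in tei_styles(make_demo_docx()):
        print(style_type, style_name)
```

With a real file, `tei_styles("Parla-CLARIN-Exemplar.docx")` would list the paragraph and character styles available after the template has been applied.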

Automatic conversion to TEI

Submit Parla-CLARIN-Exemplar.docx with the JSI profile; the result is returned as:

  • URL
  • zip

Additional annotation in TEI document

Get Oxygen XML Editor and a trial license key for all products.

Example of semi-automatic annotation with XSLT stylesheets

Change tei:sp/tei:l to tei:sp/tei:p
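The workshop does this step with an XSLT stylesheet. As a rough illustration of what the transformation amounts to, here is the same renaming sketched in Python with the standard-library `xml.etree.ElementTree`; the sample fragment is invented, and only `tei:l` elements that are direct children of `tei:sp` are touched.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI)  # serialise TEI as the default namespace


def lines_to_paragraphs(root):
    """Rename every tei:l that is a direct child of tei:sp to tei:p, in place.

    Returns the number of elements renamed.
    """
    changed = 0
    for sp in root.iter(f"{{{TEI}}}sp"):
        for child in sp:
            if child.tag == f"{{{TEI}}}l":
                child.tag = f"{{{TEI}}}p"
                changed += 1
    return changed


# Invented fragment: one speech with two lines, plus a verse line outside any speech
SAMPLE = f"""<body xmlns="{TEI}">
  <sp><speaker>CHAIR</speaker><l>The session is open.</l><l>Please be seated.</l></sp>
  <lg><l>A verse line outside a speech.</l></lg>
</body>"""

if __name__ == "__main__":
    body = ET.fromstring(SAMPLE)
    print(lines_to_paragraphs(body), "elements renamed")
    print(ET.tostring(body, encoding="unicode"))
```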

Convert from TEI drama to TEI speech
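A hedged sketch of what such a conversion might look like, written in Python rather than XSLT. The mapping is an assumption, not the workshop's stylesheet: each `tei:sp` becomes a `tei:u`, its `tei:speaker` becomes a preceding `tei:note type="speaker"`, and each `tei:p` becomes a `tei:seg`. The `who` attribute and the sample text are invented.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI)  # serialise TEI as the default namespace


def q(name):
    """Qualify a local element name with the TEI namespace."""
    return f"{{{TEI}}}{name}"


def drama_to_speech(parent):
    """Rewrite the tei:sp children of `parent` as tei:note + tei:u, in place.

    Assumed mapping (illustrative only): sp -> u, speaker -> note[@type='speaker'],
    p -> seg. Other children of sp are moved into the u unchanged.
    """
    rebuilt = []
    for child in list(parent):
        if child.tag != q("sp"):
            rebuilt.append(child)
            continue
        utterance = ET.Element(q("u"), dict(child.attrib))
        for part in child:
            if part.tag == q("speaker"):
                note = ET.Element(q("note"), {"type": "speaker"})
                note.text = part.text
                rebuilt.append(note)
            else:
                if part.tag == q("p"):
                    part.tag = q("seg")
                utterance.append(part)
        utterance.tail = child.tail
        rebuilt.append(utterance)
    parent[:] = rebuilt  # replace the children with the rebuilt sequence


# Invented fragment in "drama" style
SAMPLE = (
    f'<div xmlns="{TEI}">'
    '<sp who="#chair"><speaker>CHAIR:</speaker><p>The session is open.</p></sp>'
    "</div>"
)

if __name__ == "__main__":
    div = ET.fromstring(SAMPLE)
    drama_to_speech(div)
    print(ET.tostring(div, encoding="unicode"))
```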

Publish your corpus

Research data in CLARIN repository, e.g. http://hdl.handle.net/11356/1167

With teiPublisher

Slovenian example: https://exist.sistory.si/exist/apps/parla/

Concordancer: siParl in NoSketch Engine

Examples from CEE countries

Each country's examples are in a directory named with its country code (ISO 3166-1 alpha-2).

Mostly original PDFs converted to Word documents with ABBYY FineReader, an optical character recognition (OCR) application: AT, CZ, DE, HR, HU, MD, ME, SI, UK, YU

Some original Word documents: MK, RS

In some cases, I found only HTML webpages: BG, RO, SK
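For scripting over the country directories, the grouping above can be written down as a small lookup table. This is only a convenience sketch; the category labels and the `source_type` helper are hypothetical, and the codes are copied from the lists above.

```python
# Source type of the example material per country directory, as listed above.
# The first group is "mostly" OCR output, so individual files may differ.
SOURCES = {
    "PDF converted to Word with OCR": ["AT", "CZ", "DE", "HR", "HU", "MD", "ME", "SI", "UK", "YU"],
    "original Word document": ["MK", "RS"],
    "HTML web pages": ["BG", "RO", "SK"],
}


def source_type(code):
    """Return the source type for a country directory name, or None if unknown."""
    code = code.upper()
    for kind, codes in SOURCES.items():
        if code in codes:
            return kind
    return None


if __name__ == "__main__":
    for directory in ("SI", "MK", "BG"):
        print(directory, "->", source_type(directory))
```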
