- Focused web crawl
- Web scraping (e.g. Beautiful Soup: tutorial at Programming Historian)
- Directly from the parliamentary office!
- Examples for each country in this GitHub repository :-)
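As a rough illustration of the web scraping option above, the sketch below collects links to transcript files from a listing page. It uses Python's standard-library `html.parser` as a stand-in for Beautiful Soup so that it runs without extra installs. The sample HTML, the `parliament.example` host, the suffix filter and the names `LinkCollector` and `extract_document_links` are all assumptions made for illustration, not part of any real parliamentary site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> element in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_document_links(html, base_url, suffixes=(".pdf", ".docx", ".doc")):
    """Return absolute URLs of links whose target ends in one of the suffixes."""
    parser = LinkCollector()
    parser.feed(html)
    return [
        urljoin(base_url, href)
        for href in parser.links
        if href.lower().endswith(suffixes)
    ]


# Hypothetical page listing session transcripts (not a real parliament site).
SAMPLE_HTML = """
<ul>
  <li><a href="/sessions/2019-01-15.pdf">Session 1</a></li>
  <li><a href="/sessions/2019-01-16.DOCX">Session 2</a></li>
  <li><a href="/about.html">About</a></li>
</ul>
"""

if __name__ == "__main__":
    for url in extract_document_links(SAMPLE_HTML, "https://parliament.example"):
        print(url)
```

In practice the HTML would come from a downloaded page rather than a string literal, and the filter would need adjusting to how each parliament publishes its records.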
Microsoft Word. Get your free trial version
Oxygen XML Editor. Get your free trial license key for all products.
Parla-CLARIN example in this repository.
Open Parla-CLARIN-Exemplar.docx
We will use the DOCX to TEI to HTML conversion at http://nl.ijs.si/tei/convert/, an OxGarage-like service based on the TEI Stylesheets.
Save template.docx as a Word template (see tutorial).
Apply the new Word template to Parla-CLARIN-Exemplar.docx (see tutorial).
Paragraph level styles
Character level styles
Using standard Word formatting
Using named tei:* styles (see also)
Using special extensions for some named tei:* styles, e.g. tei:sp
Submit Parla-CLARIN-Exemplar.docx with the JSI profile; the service returns:
- zip
Get Oxygen XML Editor. Get Trial License Key for all products.
Change tei:sp/tei:l to tei:sp/tei:p
Convert from TEI drama to TEI speech
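The renaming step above (tei:sp/tei:l to tei:sp/tei:p) can also be scripted instead of done by hand in an editor. The following is a minimal sketch using Python's standard-library `xml.etree.ElementTree`. The sample fragment and the function name `lines_to_paragraphs` are assumptions for illustration, and the sketch covers only this one renaming, not the full conversion from TEI drama to TEI speech.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Serialize TEI elements without a namespace prefix.
ET.register_namespace("", TEI_NS)


def lines_to_paragraphs(root):
    """Rename every tei:l that is a direct child of tei:sp to tei:p, in place.

    Returns the number of elements renamed.
    """
    changed = 0
    for sp in root.iter(f"{{{TEI_NS}}}sp"):
        for child in sp:
            if child.tag == f"{{{TEI_NS}}}l":
                child.tag = f"{{{TEI_NS}}}p"
                changed += 1
    return changed


# Hypothetical fragment, as it might look after a DOCX to TEI conversion.
SAMPLE = f"""<body xmlns="{TEI_NS}">
  <sp>
    <speaker>Chair</speaker>
    <l>I open the session.</l>
    <l>The agenda has two items.</l>
  </sp>
</body>"""

if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print(lines_to_paragraphs(root), "elements renamed")
    print(ET.tostring(root, encoding="unicode"))
```

Only direct children of tei:sp are touched, so tei:l elements elsewhere in the document (for example in quoted verse) would be left as they are.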
Research data in CLARIN repository, e.g. http://hdl.handle.net/11356/1167
With teiPublisher
Slovenian example: https://exist.sistory.si/exist/apps/parla/
Concordancer: siParl in NoSketch Engine
Each directory is named with the country code (ISO 3166-1 alpha-2).
Mostly original PDFs converted to Word documents with ABBYY FineReader, an optical character recognition (OCR) application: AT, CZ, DE, HR, HU, MD, ME, SI, UK, YU
Some original Word documents: MK, RS
In some cases, I found only HTML webpages: BG, RO, SK