
framapad.md


# Use cases for ELTeC

  1. DS : intertextuality : how literary works are mentioned

  2. FJ: character constellations : network measures

  3. CS: MoU says "authorship attribution, topic modelling, character network analysis, stylistic analysis"

  4. DS: What kinds of places/names of places are described in different novels

  5. LB: structural properties, formal organization

  6. DS: description of characters; types of characters; readers as characters

  7. JR: do stylometric authorship attribution methods work in all langs in same way

  8. ?: what do people eat

  9. B: location, places where stories take place; morphosyntax; readability, e.g. words/sentence

  10. BN: how about metaphors

  11. B: requires a lot of work to identify them

  12. FJ: narratological features; direct speech

  13. DS: presentation of other european cultures

  14. BN: sentiment analysis

  15. LB: bibliometric aspects: book history

  16. CO: diachronic development; publication place; gender

  17. DS: addressing forms;

  18. G: Evolution and validation of literary movements; empirical definition

  19. BN: temporal expressions

  20. FJ: all of the above in a comparative perspective varying by e.g. male/female, date, place

  21. CS: subgenres?

  22. P: Rate and measurability of change; how innovation spreads; both in topic and mode of expression

  23. JS: Character idiolects ; places

  24. evolution of word senses over time

  25. local coherence; cohesiveness; evolution of realism into modernism

**A modest proposal for minimal selection criteria**

Level 1 ("eligibility"): In order to be included, a text must...

  1. have been first published as a book between 1850 and 1920, sub-grouped and equally split by decades

  2. have first been published in a European country. [maybe not "first": within that decade]

  3. be a novel, i.e. a fictional prose narrative of a minimum length of 10,000 words

  4. have originally been written in the language of the given subcollection

Level 2 ("composition"): Among the novels in each language subcollection...

  5. between 10% and 50% have been written by female authors (upper limit for comparability?) in the language subcollection

  6. at least 10 authors are represented with three novels (upper limit for variety?)

  7. at least 30% are highly canonized novels and at least 30% are non-canonized novels, based on reprints in the period 1980-2000: Group 1 (canonized) reprinted more than once; Group 2 (non-canonized) reprinted once or not at all

  8. at least 20% are short novels (10-50k word tokens) and at least 20% are long novels (>200k word tokens). Aim to maximize the variation within each time period though.

If any of these criteria cannot be met within a given language subset, then the collection should indicate this.
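Once per-novel metadata is collected, the Level 2 composition criteria above can be checked automatically. The sketch below is a minimal illustration; the field names (`author`, `author_gender`, `word_count`, `canonized`) are assumptions for the example, not an agreed ELTeC metadata schema:

```python
# Sketch: check Level-2 composition criteria for one language subcollection.
# The metadata fields used here (author, author_gender, word_count, canonized)
# are illustrative assumptions, not an agreed ELTeC schema.
from collections import Counter

def check_composition(novels):
    """novels: non-empty list of dicts with author, author_gender,
    word_count and canonized fields. Returns per-criterion booleans."""
    n = len(novels)
    report = {}
    # Criterion 5: 10%-50% female authors
    female = sum(1 for x in novels if x["author_gender"] == "F")
    report["female_share_ok"] = 0.10 <= female / n <= 0.50
    # Criterion 6: at least 10 authors represented with three novels
    per_author = Counter(x["author"] for x in novels)
    report["authors_with_3_novels_ok"] = sum(1 for c in per_author.values() if c >= 3) >= 10
    # Criterion 7: at least 30% canonized and 30% non-canonized
    canon = sum(1 for x in novels if x["canonized"])
    report["canon_ok"] = canon / n >= 0.30 and (n - canon) / n >= 0.30
    # Criterion 8: at least 20% short (10-50k tokens) and 20% long (>200k)
    short = sum(1 for x in novels if 10_000 <= x["word_count"] <= 50_000)
    long_ = sum(1 for x in novels if x["word_count"] > 200_000)
    report["length_ok"] = short / n >= 0.20 and long_ / n >= 0.20
    return report
```

Such a report could also serve as the place where a subcollection "indicates" which criteria it cannot meet.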

Level 3: Other desirable criteria:

Timespan information is missing... Pieter: cataloguing practice sometimes misrepresents actual dates; better to group by decade.

Volume of texts expands over time, but correcting for that doesn't affect the use cases cited.

Publication place is a problem, at least for Portuguese (Brazilian publication); Croatian authors published in Germany. Jan: Publication date can be a problem too: do we take the date when the author finished the book (on good authority); first publication (e.g. in installments in newspapers, very frequent in our period); or first book publication? We proposed first book publication. What if there is a longer time lapse between them? Play it by ear?

How do we define/assess/operationalise canonicity? Lists of books? Reading lists at different educational levels? Or simply assess whether there has been at least one reprint within a given span? Two types of canonicity: reprint within a short period, or over several decades? Use canonical classifications in literary history, but put them in metadata.

But publishing country is independent of language.

How do we avoid unconscious bias in making our selections? Should we construct an idealised corpus actually representative of the population in parallel, or as an extension of ELTeC? Core ELTeC as an exploratory corpus.

  • later extend the corpus (random samples)

  • length criterion: should exclude short novels, since many use cases require larger extracts

Scope of compositional criteria: 10% over all periods, or within a given period?

WG1 2018-02-12 NOTES (\url{https://mensuel.framapad.org/p/distantReadingWG1-2018-02})

Dinner for 18 at 7 pm. (Carolin: Please note that I just copied your entries and tried to sort them thematically.)

MoU says we want to do something different in constructing ELTeC. Goal is a benchmark collection within specific period, for European literature. We want to co-ordinate construction by many partners; WG1 task is to focus on sampling and content, the data, its basic annotations and metadata. Three tasks addressed by WG1 : selection criteria; guidelines for basic annotation; workflow.

  1. Issues concerning selection criteria
  • Selection criteria are intended to define common ground, not to define what the novel is. Though we need an operational definition: based on cataloguing.

      • Minimal criteria : a long prose fiction
      • Definition of novel : realism / length ?
  • Descriptive criteria are not the same as selection criteria

  • A top-down approach permits later complexity. WG2/WG3 communication point

  • Tension: how can we compare across disparately defined corpora?

  • No translations! But the number of translations is important. Is "degree of canonicity" measurable? It varies across different cultures.

Ranking criteria

    • Can we agree on core criteria? In Greek, most of the texts would have to be scanned and OCRed from scratch, since the existing e-versions are not of good quality.
    • Should criteria be the same for each language?
  • Representative of production, reception, or variety? Link with "balance"

      • Do we have enough information to define whole population?
      • Is "relevance" a criterion? If so what does it mean? Influence on other novels..
      • It's mostly canonical novels in existing corpora: how to define canon?
      • We can operationalize canonicity by entries in a catalogue like WorldCat or the length of passages describing the author/novel in a literary history
      • The low-brow/high-brow distinction is shared by most literatures, according to Fotis. We can modify it in the light of experience.
      • It's better to have 100 novels soon! Methods of publication differ in different countries. Do we have enough texts for multiple criteria?
      • A novel index lists criteria and gives a score for features present. Scalable model.
      • Should we require at least two novels by a given author? Authorship attribution is an important application: multiple texts are needed to validate authorship claims.

Hungarian catalogue info is available! Collection in the National Library: \url{http://www.mek.oszk.hu/indexeng.phtml}

Balancing: maybe split the 100 novels into 50 canonized novels and 50 randomized novels.

  1. Extension of the scope of ELTeC:

      • European languages or countries? (Portuguese/English/Spanish/German...)
      • including other languages?
  2. Different terms

  • Corpus/Collection
  • Annotation/Encoding
  3. General issues: We should aim to make the corpus available under CC-BY as soon as possible, so that it gets used. We should also write a paper.

Carolin presents a summary of the paper at \url{https://distantreading.github.io/sampling_proposal.html}

Which texts should we include? How will we sample all European literature? Clear, operationalised and motivated selection criteria, distinct from descriptive metadata.

Canon-based or metadata-based

There are many canons, e.g. reflecting different prestige groups, economics, reader response etc. Criteria are hard to define. Metadata criteria are research-based: a distinct, precise set of metadata; selection without reading. Canons are normative, limiting, time-based. Metadata might allow the creation of different perspectives.

Different ways of considering the actual text: work, expressions, manifestations. Digitizations are ontologically different from texts.

Representativeness is not an innate quality. It's relational, and we don't have knowledge of what it relates to. The intention is either to represent the variety of possible values, or to represent the distribution of those values across the population. We propose the former. Balance: control the proportion of texts according to features. Balancing gender over the whole period or over specific time slots.

The MoU requires an equal number of languages, single genre.

Suggested sampling criteria: clear, operational, decidable without reading the text.

Text edition: first book edition; no translations; digitally available; printed full texts. Criteria: date, reprint count, author sex, length. Topic?

Date: subgroups. Reprint counts; gender/sex; length. Balance example.

Topic is a fuzzy concept; some things are now identified as novels that were not previously considered such. Link from WG3.

Christof takes over chair

Diana: reprints: the size of the edition/print run is also important; the population of readers varies over time. CO: but we don't know who they are. Fotis: difficult to get numbers -- no complete catalog for German between 1910

  • The copyright argument for the 1st edition isn't correct; hard to be certain.
  • Difference between variation and balance? CO: reflecting statistically would mean fewer female authors; representing variety means we have female authors at each period. Downside: we can't say anything about distribution.

Mateij: proposes to collect as many as possible from each language, with dynamic selection. Christof: difficult! CO: where do we stop?

Berenike: stick with Biber; define the target population first, with respect to the research question. So canonicity is relevant.
CO: our approach doesn't exclude canonical texts. Representativeness of the whole population is not possible, but we can define what we'd like to have.
George: this is a multipurpose corpus, and requirements may conflict. It should be as representative as possible, with stratified sampling, so we can construct balanced corpora.
Christof: we have the problem of not knowing the population, and the sample size is too small.
Pieter: tension between sampling for variety and sampling for population; I'd go for the former. Maximize variability as the size of the corpus increases.
Jan: now you see why no one has done this before! Literatures behave differently: the distribution for gender is different. Ask each national rep for 100 novels.
CO: personal selection is not going to help! Historical variation in what "national" means; we don't want to go there.
Diana: we have to be responsible for our choices. Proportions for balance vary.
CO: basic criteria are enough. We can add more criteria later.

[internet fail]

....

we have to start somewhere

CO : propose some use cases we'd like to support.

coffee break

## Minimal Encoding Scheme

LB: Presentation of the encoding proposal: \url{https://distantreading.github.io/encoding_slides.html#(1)} Note that most of the members know the TEI.

LB: summarizes the proposal and focuses on the things which are difficult and still open. It is important to define HOW we would like to use the TEI in our Action: defining a guaranteed minimum of features of the text (structural features and metadata). Goal: to inform the distant reading analysis. Main goal: be consistent!

First decision: In general, we agree on a first-level annotation model, a fixed set of annotations which the documents share/have in common. In a next step (not today) we could decide to have a second-level annotation model.

Second decision: Encoding: We transcribe the characters we have in the novel and use UTF-8 encoding. We retain the punctuation (no further interpretation is needed).

Third decision: Metadata: We don't want keywords (neither our own creations nor catalogue keywords).

4. Decision: Metadata: only document the main language of the novel.

5. Decision: We use metadata in the language of the novel. We need an identifier for places. We need to use an identifier for persons/authors, cf. Wikidata etc., if available.

6. Decision: We will use the rest of the metadata proposed in Lou's paper (including today's decisions).

7. Decision: We have corpus description files, representing each subcollection (language) and ELTeC as a whole.

8. Decision: If there are no book publications (the case of Latvian), we take the journal publication and add the metadata of the journals.

9. Decision: List of things we would like to annotate.

10. Decision: List of different kinds of texts (with TEI annotation):

                   * no title page,
                   * include preface and introduction (contemporary with the text), annotating them and specifying the author,
                   * no table of contents,
                   * include afterword and appendix (contemporary with the text), annotating them and specifying the author,
                   * include footnotes or comments,
                   * no errata list

11. Decision: Encoding

               * First level annotation:
                   * annotation of footnotes (we will test whether finding footnotes will be a problem; if so they go to 2nd level ), 
                   * afterwords, appendix, preface, introduction
                   * include <p>
                   * no <lb>
                   * no annotation of lists; the textual material will be in the corpus with <p> annotation
                   * suppress tables, annotate with <gap>
                   * suppress figures/pictures with a <gap>
                   * suppress the heading of a picture/figure
                   * typographic information is ignored
                   * no <pb>
                   * hyphenation is merged
                   * include <head> (for chapters etc.)
                   * include <div>
                   * no annotation of quotes (cf. mottos), instead using <p>
                   * retain information from level 0 if possible, but mark with <gap> and put into comments

           * OPEN issue: second-level annotation (we need to decide soon what will be contained in the second-level annotation!):
                   * annotation of direct and indirect speech
                   * quote
                   * discontinuous sentences
                   * include line, list, table annotation
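Taken together, the first-level decisions above imply a document body shaped roughly like the following sketch. This is only an illustration of how the listed elements might combine in a TEI text; the attribute values are assumptions for the example, not an agreed ELTeC schema:

```xml
<!-- Illustrative level-1 sketch only; not an agreed ELTeC schema. -->
<text>
  <front>
    <div type="preface"><!-- contemporary preface, author specified --></div>
  </front>
  <body>
    <div type="chapter">
      <head>Chapter I</head>
      <p>Paragraph text, punctuation retained, hyphenation merged.
        <note place="foot">A footnote, if level-1 footnote detection proves feasible.</note>
      </p>
      <gap unit="table"/><!-- tables, figures and their headings suppressed -->
    </div>
  </body>
</text>
```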

Do we already have collections we can use for ELTeC?

  • Spanish: ok
  • Serbian: ok
  • Britain: ok
  • Swiss: ok
  • Polish: ok
  • Italian: some kind of ok - problems with finding female authors
  • French: ok
  • Italian: ok
  • German: ok
  • Romanian: digibook
  • Latvian: difficult
  • Russian: ok
  • Czech: there are some (EPUB format)
  • Greek: ok
  • Portuguese: ok - problems with finding female authors
  • Norwegian: ok
  • Hungarian: ok

Homework: collection of sources of texts. We should coordinate who works on which texts.

LB: What about names of persons in different languages? Gabor: What about authors who are not listed in the reference list/authority file?

Cvetana: In which language do we write the metadata?

CO/FJ: Do we need keywords? Do we want a subgenre metadata?

Suggestion (Cvetana): We need corpus-level metadata.

FJ: Maybe a workflow question: maybe we can use some converter; starting to collect the metadata in some kind of style sheet/table and then converting it to proper TEI would be easier.
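FJ's table-to-TEI idea can be prototyped in a few lines. The sketch below is only an illustration under assumptions: the column names (`title`, `author`) and the header fragment shape are invented for the example, not an agreed ELTeC metadata format:

```python
# Sketch of FJ's suggestion: collect metadata in a table (CSV), then convert
# it to TEI-style header fragments. Column names and the output shape are
# illustrative assumptions only; real use would also need XML escaping
# (e.g. xml.sax.saxutils.escape) of the field values.
import csv
import io

TEMPLATE = """<titleStmt>
  <title>{title}</title>
  <author>{author}</author>
</titleStmt>"""

def rows_to_headers(csv_text):
    """Convert CSV metadata rows into a list of titleStmt fragments."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [TEMPLATE.format(title=row["title"], author=row["author"])
            for row in reader]
```

A spreadsheet exported as CSV could then be fed through `rows_to_headers` to pre-populate the headers, which encoders refine by hand.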

CS: using the element?

  • What would we like to represent?
  • we don't want to lose the information that there is something special at this point in the text but we don't want to define them in this first iteration. In a later step we may/can refine this annotation.
  • No distinct "special cases"

CO: What should be the scope of the annotation? With a narrow scope we can then define exactly which text is direct speech.

Diana: Argues for the annotation of direct speech.

FJ: it is hard to define direct speech. Again the suggestion: start with a level-zero markup; a second-level annotation (later) might include direct/indirect speech. LB: Typography is not as interesting as direct/indirect speech. What should be the threshold for accuracy?

Berenike: What if we find a TEI document with richer markup? LB: We remove the extra annotation (whatever is not included in our minimal TEI annotation).

What about re-annotating existing documents (for example, annotation for markup such as italics)?

What about exceptions to the annotation model?

We need to be consistent and need to document our decisions.

## Workflow

BN: Presentation of the workflow proposal: \url{https://distantreading.github.io/workflow_proposal.html}

CS: Organisation of our GitHub organization: one repository for each language as a subcollection. Each subcollection will be archived in Zenodo and get a DOI every time we make a new release. Every subcollection can be maintained separately, so we will have the possibility to work independently on the subcollections. We would like to use the homepage for promoting the Action.

LB: Pointed out that in cases where the TEI documents are rich in a first step, we can derive level-1 annotation from level-2 annotation (reduce existing TEI markup). CS: We will not necessarily proceed strictly from level 1 to level 4; there will be several ways to create and derive the corpus data. CO: The corpus will contain several formats containing different kinds of annotation. We don't aim to include or merge all annotation in one format/corpus architecture.

LB: What about the structure of the repository: Where to put the corpus header? Where to put the scripts?

Do the level-zero files need to be complete/full? LB: no, we can easily upload file by file on GitHub --> an argument for including level-zero files in the GitHub repository.

CS: Correction: use dictionaries to automatically correct common errors. Use a frequency list of all words; put these tables in our repository. CS: How do we make sure that the text is complete?
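The frequency-list idea can be as simple as counting word forms per subcollection and flagging rare forms for manual inspection. A minimal sketch, where the tokenization and the hapax threshold are illustrative choices, not an agreed workflow:

```python
# Sketch: build a word-frequency table for a (sub)collection and flag rare
# word forms as candidate OCR errors for manual review. Tokenization and the
# frequency threshold are illustrative choices only.
import re
from collections import Counter

def frequency_table(text):
    """Lowercase word-frequency table over a naive \\w+ tokenization."""
    words = re.findall(r"\w+", text.lower())
    return Counter(words)

def suspicious_forms(freq, max_count=1):
    """Return rare forms (default: hapax legomena), sorted alphabetically.
    Hapaxes are often (not always!) OCR artefacts such as 'tbe' for 'the'."""
    return sorted(w for w, c in freq.items() if c <= max_count)
```

The resulting tables could be committed to the repository, as suggested, so that corrections stay reviewable.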

To do for the WG leads: CO will update the slides in the repository; BN will upload the slides in the repository; BN, LB, CO will revise the proposals according to the results of the WG meeting; updates will be found in the repository.

Open Issues:

  • Proposal: start the period in 1840. CS: we collect the arguments for the different options, sending them to the Action so that everybody can react / agree / disagree; the core group will decide based on everybody's arguments.
    
  • [AT - Petition: The Italian group warmly suggests changing the time span to 1840-1920: ten additional years do not change the basic rules for corpus collection; they are useful for partitioning the period into intervals of 20 years in length; and, first and foremost, the Italian corpus would have the opportunity to include Alessandro Manzoni's "I promessi sposi" (a fundamental work for studying both European and Italian literary history).]

    • decide on the common licence of ELTeC
    • Normalization? Do we need one? Language-specific? When to normalize?
    • we need to work out a versioning scheme: guidelines for this and a person responsible for it
    • What about the structure of the repository: Where to put the corpus header (probably a single git repository)? Where to put the scripts (probably a single git repository)?
    • For correction, use spelling checker to automatically correct common errors.
    • How do we make sure that the text is complete? We need some heuristics: maybe checking the chapter headings (e.g. whether all chapters are in)
    • person responsible for versioning
    • person responsible for fulfilling the periods/timeline, cf. MoU: iterations
      • due by Nov 2018: 6 subcollections with level 1 annotation
      • due by : 4 subcollections in the same period, with level 1 annotation
    • Team Encoding level 2: Borja, Lou, Christof, Raquel, Carolin, Bereneika, Mateij: next version to be done by Easter
    • Team on Versioning Guidelines: Carolin, Raquel, Lou, to be done by end May
    • Sampling Support Team : Carolin, Pieter, Diana
    • Timeline Guardian and Integrity Police: Christof
  • Circulate team proposals to whole WG1: comments within 1 week.

  • TO DO: CO will create lists of the tools and the text archives we might use in our Action

    • Decision: Define an identifier for our documents: four digits, e.g. EN1111; goes into the teiHeader
  • Normalisation?

  • Maciej: we are preparing a clean training corpus for re-use in "real" distant reading. OCR must be clean; the text must [not] be normalised; semantic level, not typographic level. Word, syllable, paragraph. We need to take care of quality.

  • Raquel: if that's so, we should not do orthographic normalisation.

    • Diana: we shouldn't normalise, because better NLP tools need to know about spelling variation
    • Carolin: spelling variation is one kind of normalisation; inflection;
    • Christof: needs to be clarified
    • LB: linguistic normalisation; spelling normalisation; typographic normalisation
    • TEI tags can be used to represent normalisation: Serbian for example has many spelling changes
    • CO: spelling normalisation in DE is complex, e.g. identifying compounds is a linguistically-motivated decision
    • Christof: some materials we want to reuse will have already been normalised. We need to document for each text whether it has been normalised
    • Jan: probably canonical texts will be normalised/modernised and non-canonical ones won't.
    • Diana: in PO there have been 4 spelling reforms, so there will be many versions. In NO they do not normalise old texts at all.
    • Bere: if we wanted to normalise DE texts there are tools
    • Raquel: NLP tool argument is good.
    • normalization of typographic features: not including the errors in the texts? Do we have time to do that? Can it be done by find-and-replace?
    • BO metadata needs to specify
  • Conclusion: no normalisation of any kind a priori.

  • Christof: candidate for grant proposal: better ways of handling normalisation

  • If we find errors in a text, how do we ensure consistency across the different levels (e.g. a chapter missing)? If conversion between levels is automatic, consistency can be automated, e.g. by a cron job. But that may not be possible: better to communicate using a GitHub issue, then make a new release. Versioning guidelines should be decided.

  • Suggestion: collect one document for each language in level 1 or 2, usable for discussion and practice. LB volunteers to collect one novel in each language, in any format. Put a page on the GitHub wiki to accumulate links.

  • Validation of level-2+ annotation can be done by multiple validators. The check for completeness should be done at selection time.

  • CS: List what we will contribute to the COST Action and how many texts we already have (in HTML, TEI, etc.) to get an overview; due by the end of February.

    • CO, LB, BN will send an email to the WG1 mailing list to call for contributions
    • CO: check the member list / languages proposed in the MoU: which ones might be missing?
    • wiki: list of tools; add this link: \url{https://lindat.mff.cuni.cz/en/services}
  • management of time?

  • Romanian literature has three types of writing between 1850 and 1920

Latvian novels: is it possible to meet the criteria? The first novel was published in 1873 (in a periodical; printed as a book in 1879).

1873-1920: ~123 novels published: 38 printed as books (11 digitized); 85 in periodicals (~75% digitized), never published as books.

3 written by women

From the 38 printed books: 6 authors have more than 2 novels published (from the above mentioned); 3 authors have more than 3 novels published.