<html xmlns="http://www.w3.org/1999/xhtml" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:t="http://www.tei-c.org/ns/1.0" xml:lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta><title>WG1: Encoding Proposals</title><meta name="generator" content="Generated by TEISLIDY stylesheet"></meta><script src="https://www.w3.org/Talks/Tools/Slidy/slidy.js" type="text/javascript"></script><link rel="stylesheet" type="text/css" media="screen, projection" href="https://www.w3.org/Talks/Tools/Slidy/show.css"></link><link href="../css/egXMLhandling.css" rel="stylesheet" type="text/css"></link><link href="../css/tei.css" rel="stylesheet" type="text/css"></link></head><body class="simple" id="TOP"><div class="slide cover"><img src="media/logo.jpg" width="40%" style="float:left" alt="[Your logo here]" class="cover"></img><br clear="all"></br><h1>WG1: Encoding Proposals</h1><p></p></div><div class="slide"><h2>Who am I?</h2><p class="box">I am Lou Burnard, now on my third or fourth life.</p><ol><li class="item">I was born on the same day as the poet John Milton, but approximately 300 years later. I studied at Oxford University, taking a master's in English Studies, specialising in 19th-century literature, in 1971</li><li class="item">After that I taught World Literature at the University of Malawi for a couple of years</li><li class="item">For about 25 years I worked at Oxford University Computing Services (OUCS), initially as a data centre operator, eventually as Assistant Director</li><li class="item">I started the Oxford Text Archive in 1976, the Text Encoding Initiative in 1987, and the British National Corpus in 1994</li><li class="item">In 2010 I took early retirement from OUCS and started work as a freelance</li><li class="item">Between 2009 and 2012 I worked closely with the TGE Adonis, which eventually became Huma-Num, the French digital humanities infrastructure</li></ol><p>Look me up on Google if you get bored during the rest of this talk... 
</p></div><div class="slide"><h2>Proposed Encoding Guidelines for the ELTeC</h2><p>A discussion document setting out the full proposal is available <a class="link_ref" href="https://distantreading.github.io/encoding_proposal.html">here</a>.</p><p>We summarize the proposals as follows: </p><ul><li class="item">Use TEI XML : a well-established, customizable, scholarly standard </li><li class="item">Capture a <span style="font-variant: small-caps">guaranteed minimum</span> of features for each text: <ul><li class="item">significant structural features (chapters, headings, paragraphs...) </li><li class="item">descriptive metadata (bibliographic and non-bibliographic)</li></ul></li><li class="item">The proposal raises a number of <span style="font-variant: small-caps">open questions</span> as to <em>which</em> features should be captured</li><li class="item">The proposal defines an <span style="font-variant: small-caps">XML schema</span> and a set of rules which can be used to validate converted texts (more or less) automatically</li></ul></div><div class="slide"><h2>What sort of guaranteed minimum?</h2><p class="box">The focus is <em>not</em> to represent texts in all their original complexity of structure or appearance, but rather to facilitate a richer and better-informed distant reading than a transcription of their lexical content alone would permit.</p><p>For example, </p><ul><li class="item">to distinguish headings and annotations from the rest of the text</li><li class="item">to be able to locate stretches of text within gross structural features such as chapters and paragraphs</li><li class="item">to distinguish narrative voices (?)</li></ul></div><div class="slide"><h2>Why XML-TEI?</h2><p>Why not just use plain text?</p><ul><li class="item">By using an XML-based format, we ensure that<ul><li class="item">ELTeC texts can be validated </li><li class="item">ELTeC texts can be converted to other formats using simple, widely available technologies </li><li 
class="item">ELTeC texts can be enriched with additional, more sophisticated annotations</li></ul></li><li class="item">By using TEI, we can take advantage of tools and techniques widely used across the research community likely to be interested in the ELTeC </li><li class="item">NB: using the TEI does <em>not</em> mean our encoding will represent every possible textual feature or metadatum ... on the contrary!</li></ul></div><div class="slide"><h2>Taming the TEI</h2><div class="figure" style="text-align: center"><img src="media/occam.jpg" alt="" style="float:center; height:40%;"></img></div><ul><li class="item"><p>The TEI offers a choice of over 450 different elements ... we will use (and our schema will only permit) about thirty. </p></li><li class="item"><p>The TEI is very flexible in the structures and perspectives it supports. We will apply Occam's razor extensively. </p></li></ul></div><div class="slide"><h2>Basic structure of an ELTeC text</h2><div class="figure"><img src="media/example-1.png" alt="" style="float:center; height:60%;"></img></div><p class="box">Goal : represent only what is essential to an understanding of the text</p></div><div class="slide"><h2>What are the essential components of a novel?</h2><p>It seems uncontroversial to distinguish chapters, headings and paragraphs in our markup, but how about :</p><ul><li class="item">title page ?</li><li class="item">preface or introduction ?</li><li class="item">table of contents ?</li><li class="item">appendix or afterword ?</li><li class="item">footnotes or comments ?</li><li class="item">errata lists ?</li></ul><p>It's not hard to find TEI tags for these, but is it helpful? Can we be consistent in their application ? 
</p></div><div class="slide"><h2>TEI encoding typically loses typographic subtleties </h2><div class="frame"><div class="col"><div class="figure"><img src="media/forster-chap2.png" alt="" class="graphic" style=" height:80%;"></img></div></div><div class="col"><div class="listhead">Are we bothered?</div><ul style="float:left"><li class="item">the chapter title is centred</li><li class="item">there are linebreaks within the paragraphs (and sometimes words get hyphenated as a result)</li><li class="item">the first word is capitalised</li><li class="item">paragraphs are indented (except for the first)</li><li class="item">dash and quote marks have narrative function</li><li class="item">hyphens may or may not be significant</li><li class="item">double quotes and single quotes have different functions</li></ul></div></div></div><div class="slide"><h2>Which typographic features should we keep ?</h2><div style="text-align:center"><img src="media/howards-chap1-1970.png" alt="(Penguin, 1970)" class="graphic" style=" width:60%;"></img><div class="caption">Figure 1. (Penguin, 1970)</div></div><div style="float:right"><img src="media/howards-chap1-1921.png" alt="(Knopf, 1921)" class="graphic"></img><div class="caption">Figure 2. (Knopf, 1921)</div></div><div class="figure"><img src="media/howards-chap1-1991.png" alt="(Everyman, 1991)" class="graphic"></img><div class="caption">Figure 3. (Everyman, 1991)</div></div><div style="text-align:center"><img src="media/howards-chap1-1910.png" alt="(First US ed, 1910)" class="graphic" style=" width:80%;"></img><div class="caption">Figure 4. 
(First US ed, 1910)</div></div></div><div class="slide"><h2>What about material other than running prose and dialogue ?</h2><p>Novels often contain material other than running prose.</p><p>We could: </p><ol><li class="item">use the appropriate TEI elements for verse or drama (<span class="gi">&lt;lg&gt;</span>, <span class="gi">&lt;l&gt;</span>, <span class="gi">&lt;sp&gt;</span>, <span class="gi">&lt;stage&gt;</span>)</li><li class="item">use the appropriate TEI elements for lists and tables (<span class="gi">&lt;list&gt;</span>, <span class="gi">&lt;label&gt;</span>, <span class="gi">&lt;item&gt;</span>, <span class="gi">&lt;table&gt;</span>, <span class="gi">&lt;cell&gt;</span>, <span class="gi">&lt;row&gt;</span>)</li><li class="item">use the appropriate TEI elements for graphics (<span class="gi">&lt;figure&gt;</span>, <span class="gi">&lt;graphic&gt;</span>, <span class="gi">&lt;head&gt;</span>)</li></ol><p>Or we could </p><ul><li class="item">suppress non-prose material, replacing it by <span class="gi">&lt;gap&gt;</span></li><li class="item">lie (encode it as if it were ordinary paragraphs)</li></ul><p style="font-style:italic">Whichever we choose to do, we must be consistent!</p></div><div class="slide"><h2>An example</h2><div class="frame"><div class="col"><div class="figure"><img src="media/paulVirg-p127.png" alt="" class="graphic"></img></div></div><div class="col"><div class="p">Should this be encoded as: <div id="index.xml-egXML-d30e277" class="pre egXML_valid"><!--This otherwise redundant comment ensures egXMLs format nicely--><span class="element">&lt;p&gt;</span>
<span class="element">&lt;label&gt;</span>le vieillard.<span class="element">&lt;/label&gt;</span>
« Oh mon ami ! ne m’avez-vous pas dit que vous
n’aviez pas de naissance ?
<span class="element">&lt;/p&gt;</span></div> or (expensively) <div id="index.xml-egXML-d30e283" class="pre egXML_valid"><!--This otherwise redundant comment ensures egXMLs format nicely--><span class="element">&lt;sp&gt;</span>
<span class="element">&lt;speaker&gt;</span>le vieillard.<span class="element">&lt;/speaker&gt;</span>
<span class="element">&lt;p&gt;</span>« Oh mon ami ! ne m’avez-vous pas dit que
vous n’aviez pas de naissance ?<span class="element">&lt;/p&gt;</span>
<span class="element">&lt;/sp&gt;</span></div> or (deceitfully) <div id="index.xml-egXML-d30e290" class="pre egXML_valid"><!--This otherwise redundant comment ensures egXMLs format nicely--><span class="element">&lt;p&gt;</span>le vieillard.<span class="element">&lt;/p&gt;</span>
<span class="element">&lt;p&gt;</span>« Oh mon ami ! ne m’avez-vous pas dit que
vous n’aviez pas de naissance ?<span class="element">&lt;/p&gt;</span></div></div></div></div></div><div class="slide"><h2>Another example</h2><div class="figure"><img src="media/howards-page61.png" alt="" class="graphic"></img></div><div class="p">Should this be encoded as: <div id="index.xml-egXML-d30e304" class="pre egXML_valid"><!--This otherwise redundant comment ensures egXMLs format nicely--><span class="element">&lt;p&gt;</span> Even in her photographic days she had relied upon her smile and her figure to
attract, and now that she was <span class="element">&lt;quote&gt;</span>
<span class="element">&lt;l&gt;</span>"On the shelf,<span class="element">&lt;/l&gt;</span>
<span class="element">&lt;l&gt;</span>On the shelf,<span class="element">&lt;/l&gt;</span>
<span class="element">&lt;l&gt;</span>Boys, boys, I'm on the shelf,"<span class="element">&lt;/l&gt;</span>
<span class="element">&lt;/quote&gt;</span> she was not likely to find her tongue. Occasional bursts of song (of
which the above is an example) still issued from her lips, but the spoken word
was rare. <span class="element">&lt;/p&gt;</span></div> or (deceitfully) <div id="index.xml-egXML-d30e316" class="pre egXML_valid"><!--This otherwise redundant comment ensures egXMLs format nicely--><span class="element">&lt;p&gt;</span>... and now that she was<span class="element">&lt;/p&gt;</span>
<span class="element">&lt;p&gt;</span>"On the shelf,
<span class="element">&lt;lb/&gt;</span>On the shelf,
<span class="element">&lt;lb/&gt;</span>Boys, boys, I'm on the shelf,"<span class="element">&lt;/p&gt;</span>
<span class="element">&lt;p&gt;</span>she was not likely to find her tongue. Occasional ...<span class="element">&lt;/p&gt;</span></div></div></div><div class="slide"><h2>Some other open questions</h2><ul><li class="item">should we capture page breaks in our source edition?</li><li class="item">should we remove/resolve end-of-line hyphenation? </li><li class="item">should we try to interpret typographic variation (italics, etc.), e.g. as <span class="gi">&lt;title&gt;</span>, <span class="gi">&lt;emph&gt;</span>, <span class="gi">&lt;foreign&gt;</span> or <span class="gi">&lt;abbr&gt;</span>?</li><li class="item">should we capture typographic features using <span class="gi">&lt;hi&gt;</span> (and if so, should we use <span class="att">rend</span> or <span class="att">style</span>)?</li><li class="item">or should we just ignore them ?</li></ul><p class="box">Again, consistency of practice is essential. Whether we decide to drop or to preserve these features, we must do so for every text. </p></div><div class="slide"><h2>Metadata : the TEI Header</h2><div class="figure"><img src="media/teiHeader.png" alt="" class="graphic"></img></div><p>We propose using this for all metadata. It will provide for each text </p><ul><li class="item">bibliographic information</li><li class="item">sampling and descriptive criteria applicable</li><li class="item">housekeeping information</li></ul><p>The schema will check the consistency of the data supplied. </p></div><div class="slide"><h2>A possible title statement</h2><p>We may need to modify the TEI definitions</p><div id="index.xml-egXML-d30e387" class="pre egXML_valid"><!--This otherwise redundant comment ensures egXMLs format nicely--><span class="element">&lt;titleStmt&gt;</span>
<span class="element">&lt;title&gt;</span>Howards End : ELTeC edition<span class="element">&lt;/title&gt;</span>
<span class="element">&lt;author <span class="attribute">dates</span>="<span class="attributevalue">1879 1970</span>" <span class="attribute">sex</span>="<span class="attributevalue">M</span>"&gt;</span>
<span class="element">&lt;persName&gt;</span>
<span class="element">&lt;forename&gt;</span>Edward<span class="element">&lt;/forename&gt;</span>
<span class="element">&lt;forename&gt;</span>Morgan<span class="element">&lt;/forename&gt;</span>
<span class="element">&lt;surname&gt;</span>Forster<span class="element">&lt;/surname&gt;</span>
<span class="element">&lt;/persName&gt;</span>
<span class="element">&lt;persName&gt;</span>E.M. Forster<span class="element">&lt;/persName&gt;</span>
<span class="element">&lt;idno <span class="attribute">type</span>="<span class="attributevalue">viaf</span>"&gt;</span>https://viaf.org/viaf/31996364<span class="element">&lt;/idno&gt;</span>
<span class="element">&lt;idno <span class="attribute">type</span>="<span class="attributevalue">wiki</span>"&gt;</span>https://www.wikidata.org/wiki/Q189119<span class="element">&lt;/idno&gt;</span>
<span class="element">&lt;/author&gt;</span>
<span class="element">&lt;respStmt&gt;</span>
<span class="element">&lt;resp&gt;</span>ELTeC encoding<span class="element">&lt;/resp&gt;</span>
<span class="element">&lt;name&gt;</span>Lou Burnard<span class="element">&lt;/name&gt;</span>
<span class="element">&lt;/respStmt&gt;</span>
<span class="element">&lt;/titleStmt&gt;</span></div></div><div class="slide"><h2>An example source description</h2><div id="index.xml-egXML-d30e413" class="pre egXML_valid"><!--This otherwise redundant comment ensures egXMLs format nicely--><span class="element">&lt;sourceDesc&gt;</span>
<span class="element">&lt;bibl&gt;</span>
<span class="element">&lt;author&gt;</span>E.M. Forster<span class="element">&lt;/author&gt;</span>
<span class="element">&lt;title&gt;</span>Howards End<span class="element">&lt;/title&gt;</span>
<span class="element">&lt;pubPlace&gt;</span>London<span class="element">&lt;/pubPlace&gt;</span>
<span class="element">&lt;publisher&gt;</span>Edward Arnold<span class="element">&lt;/publisher&gt;</span>
<span class="element">&lt;date&gt;</span>1910<span class="element">&lt;/date&gt;</span>
<span class="element">&lt;idno <span class="attribute">type</span>="<span class="attributevalue">wiki</span>"&gt;</span>https://www.wikidata.org/wiki/Q1146642<span class="element">&lt;/idno&gt;</span>
<span class="element">&lt;/bibl&gt;</span>
<span class="element">&lt;bibl&gt;</span>
<span class="element">&lt;title&gt;</span>The Project Gutenberg Etext of Howards End, by E. M. Forster<span class="element">&lt;/title&gt;</span>
<span class="element">&lt;ref <span class="attribute">target</span>="<span class="attributevalue">http://www.gutenberg.org/files/2891/2891-h/2891-h.htm</span>"&gt;</span>HTML
version downloaded on <span class="element">&lt;date&gt;</span>2017-12-26<span class="element">&lt;/date&gt;</span>
<span class="element">&lt;/ref&gt;</span>
<span class="element">&lt;/bibl&gt;</span>
<span class="element">&lt;note <span class="attribute">type</span>="<span class="attributevalue">editions</span>" <span class="attribute">source</span>="<span class="attributevalue">worldcat</span>"&gt;</span> WorldCat lists 484 print editions in
English<span class="element">&lt;/note&gt;</span>
<span class="element">&lt;/sourceDesc&gt;</span></div></div><div class="slide"><h2>And finally... profile and revision descriptions</h2><div id="index.xml-egXML-d30e440" class="pre egXML_valid"><!--This otherwise redundant comment ensures egXMLs format nicely--><span class="element">&lt;profileDesc&gt;</span>
<span class="element">&lt;langUsage&gt;</span>
<span class="element">&lt;language <span class="attribute">ident</span>="<span class="attributevalue">en-GB</span>" <span class="attribute">usage</span>="<span class="attributevalue">99</span>"&gt;</span>British English<span class="element">&lt;/language&gt;</span>
<span class="element">&lt;language <span class="attribute">ident</span>="<span class="attributevalue">de</span>" <span class="attribute">usage</span>="<span class="attributevalue">1</span>"&gt;</span>German<span class="element">&lt;/language&gt;</span>
<span class="element">&lt;/langUsage&gt;</span>
<span class="element">&lt;textClass&gt;</span>
<span class="element">&lt;keywords <span class="attribute">source</span>="<span class="attributevalue">http://wikidata.org</span>"&gt;</span>
<span class="element">&lt;term&gt;</span>social class<span class="element">&lt;/term&gt;</span>
<span class="element">&lt;term&gt;</span>social convention<span class="element">&lt;/term&gt;</span>
<span class="element">&lt;term&gt;</span>modernity<span class="element">&lt;/term&gt;</span>
<span class="element">&lt;term&gt;</span>family drama<span class="element">&lt;/term&gt;</span>
<span class="element">&lt;/keywords&gt;</span>
<span class="element">&lt;catRef <span class="attribute">target</span>="<span class="attributevalue">#author_m #reprint_3</span>"/&gt;</span>
<span class="element">&lt;classCode <span class="attribute">scheme</span>="<span class="attributevalue">UDC</span>"&gt;</span>8231.111<span class="element">&lt;/classCode&gt;</span>
<span class="element">&lt;/textClass&gt;</span>
<span class="element">&lt;/profileDesc&gt;</span></div><p>The values supplied by <span class="att">target</span> are defined in a project-wide <span class="gi">&lt;taxonomy&gt;</span>; this and other project-wide metadata is held in a separate corpus header.</p><div id="index.xml-egXML-d30e468" class="pre egXML_valid"><!--This otherwise redundant comment ensures egXMLs format nicely--><span class="element">&lt;revisionDesc&gt;</span>
<span class="element">&lt;change <span class="attribute">when</span>="<span class="attributevalue">2018-02-11</span>" <span class="attribute">who</span>="<span class="attributevalue">LB</span>"&gt;</span>Added to EN collection<span class="element">&lt;/change&gt;</span>
<span class="element">&lt;/revisionDesc&gt;</span></div><p>Just one small question... </p></div><div class="slide"><h2>How do we get there from here?</h2><ul><li class="item">some, but by no means all, of the titles we would like to include may already be available in digital form</li><li class="item">we can automatically (more or less) convert from other TEI vocabularies, HTML, ePub and plain text to our target encoding </li><li class="item">(this may involve <em>removing</em> markup!)</li><li class="item">we need to investigate the effectiveness of OCR for other materials</li><li class="item">syntactic validation of the result can be automated</li><li class="item">... but determining whether or not we are correctly representing a specific text is another matter </li></ul></div></body></html>