<html xmlns="http://www.w3.org/1999/xhtml" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:t="http://www.tei-c.org/ns/1.0" xml:lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta><title>WG1: Encoding Proposals</title><meta name="generator" content="Generated by TEISLIDY stylesheet"></meta><script src="https://www.w3.org/Talks/Tools/Slidy/slidy.js" type="text/javascript"></script><link rel="stylesheet" type="text/css" media="screen, projection" href="https://www.w3.org/Talks/Tools/Slidy/show.css"></link><link href="../css/egXMLhandling.css" rel="stylesheet" type="text/css"></link><link href="../css/tei.css" rel="stylesheet" type="text/css"></link></head><body class="simple" id="TOP"><div class="slide cover"><img src="media/logo.jpg" width="40%" style="float:left" alt="[Your logo here]" class="cover"></img><br clear="all"></br><h1>WG1: Encoding Proposals</h1><p></p></div><div class="slide"><h2>Who am I?</h2><p class="box">I am Lou Burnard, now on my third or fourth life.</p><ol><li class="item">I was born on the same day as the poet John Milton, but approximately 300 years later. I studied at Oxford University, taking a master's in English Studies, specialising in 19th-century literature, in 1971</li><li class="item">After which I taught World Literature in the University of Malawi for a couple of years</li><li class="item">For about 25 years I worked at Oxford University Computing Services, initially as a data centre operator, eventually as Assistant Director</li><li class="item">I started the Oxford Text Archive in 1976, the Text Encoding Initiative in 1987, and the British National Corpus in 1994</li><li class="item">In 2010 I took early retirement from OUCS and started work as a freelance</li><li class="item">Between 2009 and 2012 I worked closely with the TGE Adonis, which eventually became HumaNum, the French digital humanities infrastructure </li></ol><p>Look me up on Google if you get bored during the rest of this talk... 
</p></div><div class="slide"><h2>Proposed Encoding Guidelines for the ELTeC</h2><p>A discussion document setting out the full proposal is available <a class="link_ref" href="https://distantreading.github.io/encoding_proposal.html">here</a></p><p>We summarize the proposals as follows: </p><ul><li class="item">Use TEI XML : a well-established, customizable, scholarly standard </li><li class="item">Capture a <span style="font-variant: small-caps">guaranteed minimum</span> of features for each text: <ul><li class="item">significant structural features (chapters, headings, paragraphs...) </li><li class="item">descriptive metadata (bibliographic and non-bibliographic)</li></ul></li><li class="item">The proposal raises a number of <span style="font-variant: small-caps">open questions</span> as to <em>which</em> features should be captured</li><li class="item">The proposal defines an <span style="font-variant: small-caps">XML schema</span> and a set of rules which can be used to validate converted texts (more or less) automatically</li></ul></div><div class="slide"><h2>What sort of guaranteed minimum?</h2><p class="box">The aim is <em>not</em> to represent texts in all their original complexity of structure or appearance, but rather to facilitate a richer and better-informed distant reading than a transcription of their lexical content alone would permit.</p><p>For example, </p><ul><li class="item">to distinguish headings and annotations from the rest of the text</li><li class="item">to be able to locate stretches of text within gross structural features such as chapters and paragraphs</li><li class="item">to distinguish narrative voices (?)</li></ul></div><div class="slide"><h2>Why XML-TEI?</h2><p>Why not just use plain text?</p><ul><li class="item">By using an XML-based format, we ensure that<ul><li class="item">ELTeC texts can be validated </li><li class="item">ELTeC texts can be converted to other formats using simple, widely available technologies </li><li 
class="item">ELTeC texts can be enriched with additional, more sophisticated annotations</li></ul></li><li class="item">By using TEI, we can take advantage of tools and techniques already widely used across the research community likely to be interested in the ELTeC </li><li class="item">NB Using the TEI does <em>not</em> mean our encoding will represent every possible textual feature or metadatum ... on the contrary!</li></ul></div><div class="slide"><h2>Taming the TEI</h2><div class="figure" style="text-align: center"><img src="media/occam.jpg" alt="" style="float:center; height:40%;"></img></div><ul><li class="item"><p>The TEI offers a choice of over 450 different elements ... we will use (and our schema will permit only) about thirty. </p></li><li class="item"><p>The TEI is very flexible in the structures and perspectives it supports. We will apply Occam's razor extensively. </p></li></ul></div><div class="slide"><h2>Basic structure of an ELTeC text</h2><div class="figure"><img src="media/example-1.png" alt="" style="float:center; height:60%;"></img></div><p class="box">Goal : represent only what is essential to an understanding of the text</p></div><div class="slide"><h2>What are the essential components of a novel?</h2><p>It seems uncontroversial to distinguish chapters, headings, and paragraphs in our markup, but how about :</p><ul><li class="item">title page ?</li><li class="item">preface or introduction ?</li><li class="item">table of contents ?</li><li class="item">appendix or afterword ?</li><li class="item">footnotes or comments ?</li><li class="item">errata lists ?</li></ul><p>It's not hard to find TEI tags for these, but is it helpful? Can we be consistent in their application ? 
</p></div><div class="slide"><h2>TEI encoding typically loses typographic subtleties </h2><div class="frame"><div class="col"><div class="figure"><img src="media/forster-chap2.png" alt="" class="graphic" style=" height:80%;"></img></div></div><div class="col"><div class="listhead">Are we bothered?</div><ul style="float:left"><li class="item">the chapter title is centred</li><li class="item">there are linebreaks within the paragraphs (and sometimes words get hyphenated as a result)</li><li class="item">the first word is capitalised</li><li class="item">paragraphs are indented (except for the first)</li><li class="item">dash and quote marks have narrative function</li><li class="item">hyphens may or may not be significant</li><li class="item">double quotes and single quotes have different functions</li></ul></div></div></div><div class="slide"><h2>Which typographic features should we keep ?</h2><div style="text-align:center"><img src="media/howards-chap1-1970.png" alt="(Penguin, 1970)" class="graphic" style=" width:60%;"></img><div class="caption">Figure 1. (Penguin, 1970)</div></div><div style="float:right"><img src="media/howards-chap1-1921.png" alt="(Knopf, 1921)" class="graphic"></img><div class="caption">Figure 2. (Knopf, 1921)</div></div><div class="figure"><img src="media/howards-chap1-1991.png" alt="(Everyman, 1991)" class="graphic"></img><div class="caption">Figure 3. (Everyman, 1991)</div></div><div style="text-align:center"><img src="media/howards-chap1-1910.png" alt="(First US ed, 1910)" class="graphic" style=" width:80%;"></img><div class="caption">Figure 4. 
(First US ed, 1910)</div></div></div><div class="slide"><h2>What about material other than running prose and dialogue ?</h2><p>Novels often contain material other than running prose</p><p>We could: </p><ol><li class="item">use the appropriate TEI elements for verse or drama (<span class="gi">&lt;lg&gt;</span>, <span class="gi">&lt;l&gt;</span>, <span class="gi">&lt;sp&gt;</span>, <span class="gi">&lt;stage&gt;</span>)</li><li class="item">use the appropriate TEI elements for lists and tables (<span class="gi">&lt;list&gt;</span>, <span class="gi">&lt;label&gt;</span>, <span class="gi">&lt;item&gt;</span>, <span class="gi">&lt;table&gt;</span>, <span class="gi">&lt;cell&gt;</span>, <span class="gi">&lt;row&gt;</span>)</li><li class="item">use the appropriate TEI elements for graphics (<span class="gi">&lt;figure&gt;</span>, <span class="gi">&lt;graphic&gt;</span>, <span class="gi">&lt;head&gt;</span>)</li></ol><p>Or we could </p><ul><li class="item">suppress non-prose material, replacing it by <span class="gi">&lt;gap&gt;</span></li><li class="item">lie </li></ul><p style="font-style:italic">Whichever we choose to do, we must be consistent!</p></div><div class="slide"><h2>An example</h2><div class="frame"><div class="col"><div class="figure"><img src="media/paulVirg-p127.png" alt="" class="graphic"></img></div></div><div class="col"><div class="p">Should this be encoded as: <div id="index.xml-egXML-d30e277" class="pre egXML_valid"><!--This otherwise redundant comment ensures egXMLs format nicely--><span class="element">&lt;p&gt;</span>
&nbsp;<span class="element">&lt;label&gt;</span>le vieillard.<span class="element">&lt;/label&gt;</span> 
 «&nbsp;Oh mon ami&nbsp;! ne m’avez-vous pas dit que vous
 n’aviez pas de naissance ?
<span class="element">&lt;/p&gt;</span></div> or (expensively) <div id="index.xml-egXML-d30e283" class="pre egXML_valid"><!--This otherwise redundant comment ensures egXMLs format nicely--><span class="element">&lt;sp&gt;</span>
&nbsp;<span class="element">&lt;speaker&gt;</span>le vieillard.<span class="element">&lt;/speaker&gt;</span>
&nbsp;<span class="element">&lt;p&gt;</span>«&nbsp;Oh mon ami&nbsp;! ne m’avez-vous pas dit que
&nbsp;&nbsp; vous n’aviez pas de naissance ?<span class="element">&lt;/p&gt;</span>
<span class="element">&lt;/sp&gt;</span></div> or (deceitfully) <div id="index.xml-egXML-d30e290" class="pre egXML_valid"><!--This otherwise redundant comment ensures egXMLs format nicely--><span class="element">&lt;p&gt;</span>le vieillard.<span class="element">&lt;/p&gt;</span>
<span class="element">&lt;p&gt;</span>«&nbsp;Oh mon ami&nbsp;! ne m’avez-vous pas dit que
 vous n’aviez pas de naissance ?<span class="element">&lt;/p&gt;</span></div></div></div></div></div><div class="slide"><h2>Another example</h2><div class="figure"><img src="media/howards-page61.png" alt="" class="graphic"></img></div><div class="p">Should this be encoded as: <div id="index.xml-egXML-d30e304" class="pre egXML_valid"><!--This otherwise redundant comment ensures egXMLs format nicely--><span class="element">&lt;p&gt;</span> Even in her photographic days she had relied upon her smile and her figure to
 attract, and now that she was <span class="element">&lt;quote&gt;</span>
&nbsp;&nbsp;<span class="element">&lt;l&gt;</span>"On the shelf,<span class="element">&lt;/l&gt;</span>
&nbsp;&nbsp;<span class="element">&lt;l&gt;</span>On the shelf,<span class="element">&lt;/l&gt;</span>
&nbsp;&nbsp;<span class="element">&lt;l&gt;</span>Boys, boys, I'm on the shelf,"<span class="element">&lt;/l&gt;</span>
&nbsp;<span class="element">&lt;/quote&gt;</span> she was not likely to find her tongue.&nbsp; Occasional bursts of song (of
 which the above is an example) still issued from her lips, but the spoken word
 was rare. <span class="element">&lt;/p&gt;</span></div> or (deceitfully) <div id="index.xml-egXML-d30e316" class="pre egXML_valid"><!--This otherwise redundant comment ensures egXMLs format nicely--><span class="element">&lt;p&gt;</span>... and now that she was<span class="element">&lt;/p&gt;</span>
<span class="element">&lt;p&gt;</span>"On the shelf,
<span class="element">&lt;lb/&gt;</span>On the shelf,
<span class="element">&lt;lb/&gt;</span>Boys, boys, I'm on the shelf,"<span class="element">&lt;/p&gt;</span>
<span class="element">&lt;p&gt;</span>she was not likely to find her tongue.&nbsp; Occasional ...<span class="element">&lt;/p&gt;</span></div></div></div><div class="slide"><h2>Some other open questions</h2><ul><li class="item">should we capture page breaks in our source edition?</li><li class="item">should we remove/resolve end of line hyphenation? </li><li class="item">should we try to interpret typographic variation (italics, etc.) e.g. as <span class="gi">&lt;title&gt;</span> <span class="gi">&lt;emph&gt;</span> <span class="gi">&lt;foreign&gt;</span> <span class="gi">&lt;abbr&gt;</span>?</li><li class="item">should we capture (using <span class="gi">&lt;hi&gt;</span>) typographic features (and if so should we use <span class="att">rend</span> or <span class="att">style</span>...</li><li class="item">or should we just ignore them ?</li></ul><p class="box">Again, consistency of practice is essential. Whether we decide to drop or to preserve these features, we must do so for every text. </p></div><div class="slide"><h2>Metadata : the TEI Header</h2><div class="figure"><img src="media/teiHeader.png" alt="" class="graphic"></img></div><p>We propose using this for all metadata. It will provide for each text </p><ul><li class="item">bibliographic information</li><li class="item">sampling and descriptive criteria applicable</li><li class="item">housekeeping information</li></ul><p>The schema will check consistency of data supplied. </p></div><div class="slide"><h2>A possible title statement</h2><p>We may need to modify the TEI definitions</p><div id="index.xml-egXML-d30e387" class="pre egXML_valid"><!--This otherwise redundant comment ensures egXMLs format nicely--><span class="element">&lt;titleStmt&gt;</span>
&nbsp;<span class="element">&lt;title&gt;</span>Howards End : ELTeC edition<span class="element">&lt;/title&gt;</span>
&nbsp;<span class="element">&lt;author&nbsp;<span class="attribute">dates</span>="<span class="attributevalue">1879 1970</span>"&nbsp;<span class="attribute">sex</span>="<span class="attributevalue">M</span>"&gt;</span>
&nbsp;&nbsp;<span class="element">&lt;persName&gt;</span>
&nbsp;&nbsp;&nbsp;<span class="element">&lt;forename&gt;</span>Edward<span class="element">&lt;/forename&gt;</span>
&nbsp;&nbsp;&nbsp;<span class="element">&lt;forename&gt;</span>Morgan<span class="element">&lt;/forename&gt;</span>
&nbsp;&nbsp;&nbsp;<span class="element">&lt;surname&gt;</span>Forster<span class="element">&lt;/surname&gt;</span>
&nbsp;&nbsp;<span class="element">&lt;/persName&gt;</span>
&nbsp;&nbsp;<span class="element">&lt;persName&gt;</span>E.M. Forster<span class="element">&lt;/persName&gt;</span>
&nbsp;&nbsp;<span class="element">&lt;idno&nbsp;<span class="attribute">type</span>="<span class="attributevalue">viaf</span>"&gt;</span>https://viaf.org/viaf/31996364<span class="element">&lt;/idno&gt;</span>
&nbsp;&nbsp;<span class="element">&lt;idno&nbsp;<span class="attribute">type</span>="<span class="attributevalue">wiki</span>"&gt;</span>https://www.wikidata.org/wiki/Q189119<span class="element">&lt;/idno&gt;</span>
&nbsp;<span class="element">&lt;/author&gt;</span>
&nbsp;<span class="element">&lt;respStmt&gt;</span>
&nbsp;&nbsp;<span class="element">&lt;resp&gt;</span>ELTeC encoding<span class="element">&lt;/resp&gt;</span>
&nbsp;&nbsp;<span class="element">&lt;name&gt;</span>Lou Burnard<span class="element">&lt;/name&gt;</span>
&nbsp;<span class="element">&lt;/respStmt&gt;</span>
<span class="element">&lt;/titleStmt&gt;</span></div></div><div class="slide"><h2>An example source description</h2><div id="index.xml-egXML-d30e413" class="pre egXML_valid"><!--This otherwise redundant comment ensures egXMLs format nicely--><span class="element">&lt;sourceDesc&gt;</span>
&nbsp;<span class="element">&lt;bibl&gt;</span>
&nbsp;&nbsp;<span class="element">&lt;author&gt;</span>E.M. Forster<span class="element">&lt;/author&gt;</span>
&nbsp;&nbsp;<span class="element">&lt;title&gt;</span>Howards End<span class="element">&lt;/title&gt;</span>
&nbsp;&nbsp;<span class="element">&lt;pubPlace&gt;</span>London<span class="element">&lt;/pubPlace&gt;</span>
&nbsp;&nbsp;<span class="element">&lt;publisher&gt;</span>Edward Arnold<span class="element">&lt;/publisher&gt;</span>
&nbsp;&nbsp;<span class="element">&lt;date&gt;</span>1910<span class="element">&lt;/date&gt;</span>
&nbsp;&nbsp;<span class="element">&lt;idno&nbsp;<span class="attribute">type</span>="<span class="attributevalue">wiki</span>"&gt;</span>https://www.wikidata.org/wiki/Q1146642<span class="element">&lt;/idno&gt;</span>
&nbsp;<span class="element">&lt;/bibl&gt;</span>
&nbsp;<span class="element">&lt;bibl&gt;</span>
&nbsp;&nbsp;<span class="element">&lt;title&gt;</span>The Project Gutenberg Etext of Howards End, by E. M. Forster<span class="element">&lt;/title&gt;</span>
&nbsp;&nbsp;<span class="element">&lt;ref&nbsp;<span class="attribute">target</span>="<span class="attributevalue">http://www.gutenberg.org/files/2891/2891-h/2891-h.htm</span>"&gt;</span>HTML
&nbsp;&nbsp;&nbsp;&nbsp; version downloaded on <span class="element">&lt;date&gt;</span>2017-12-26<span class="element">&lt;/date&gt;</span>
&nbsp;&nbsp;<span class="element">&lt;/ref&gt;</span>
&nbsp;<span class="element">&lt;/bibl&gt;</span>
&nbsp;<span class="element">&lt;note&nbsp;<span class="attribute">type</span>="<span class="attributevalue">editions</span>"&nbsp;<span class="attribute">source</span>="<span class="attributevalue">worldcat</span>"&gt;</span> Worldcat lists 484 print editions in
&nbsp;&nbsp; English<span class="element">&lt;/note&gt;</span>
<span class="element">&lt;/sourceDesc&gt;</span></div></div><div class="slide"><h2>And finally... profile and revision descriptions</h2><div id="index.xml-egXML-d30e440" class="pre egXML_valid"><!--This otherwise redundant comment ensures egXMLs format nicely--><span class="element">&lt;profileDesc&gt;</span>
&nbsp;<span class="element">&lt;langUsage&gt;</span>
&nbsp;&nbsp;<span class="element">&lt;language&nbsp;<span class="attribute">ident</span>="<span class="attributevalue">en-BR</span>"&nbsp;<span class="attribute">usage</span>="<span class="attributevalue">99</span>"&gt;</span>British English<span class="element">&lt;/language&gt;</span>
&nbsp;&nbsp;<span class="element">&lt;language&nbsp;<span class="attribute">ident</span>="<span class="attributevalue">de</span>"&nbsp;<span class="attribute">usage</span>="<span class="attributevalue">1</span>"&gt;</span>German<span class="element">&lt;/language&gt;</span>
&nbsp;<span class="element">&lt;/langUsage&gt;</span>
&nbsp;<span class="element">&lt;textClass&gt;</span>
&nbsp;&nbsp;<span class="element">&lt;keywords&nbsp;<span class="attribute">source</span>="<span class="attributevalue">http://wikidata.org</span>"&gt;</span>
&nbsp;&nbsp;&nbsp;<span class="element">&lt;term&gt;</span>social class<span class="element">&lt;/term&gt;</span>
&nbsp;&nbsp;&nbsp;<span class="element">&lt;term&gt;</span>social convention<span class="element">&lt;/term&gt;</span>
&nbsp;&nbsp;&nbsp;<span class="element">&lt;term&gt;</span>modernity<span class="element">&lt;/term&gt;</span>
&nbsp;&nbsp;&nbsp;<span class="element">&lt;term&gt;</span>family drama<span class="element">&lt;/term&gt;</span>
&nbsp;&nbsp;<span class="element">&lt;/keywords&gt;</span>
&nbsp;&nbsp;<span class="element">&lt;catRef&nbsp;<span class="attribute">target</span>="<span class="attributevalue">#author_m #reprint_3</span>"/&gt;</span>
&nbsp;&nbsp;<span class="element">&lt;classCode&nbsp;<span class="attribute">scheme</span>="<span class="attributevalue">UDC</span>"&gt;</span>8231.111<span class="element">&lt;/classCode&gt;</span>
&nbsp;<span class="element">&lt;/textClass&gt;</span>
<span class="element">&lt;/profileDesc&gt;</span></div><p>The values supplied by <span class="att">target</span> are defined in a project-wide <span class="gi">&lt;taxonomy&gt;</span>; this and other project-wide metadata is held in a separate corpus header.</p><div id="index.xml-egXML-d30e468" class="pre egXML_valid"><!--This otherwise redundant comment ensures egXMLs format nicely--><span class="element">&lt;revisionDesc&gt;</span>
&nbsp;<span class="element">&lt;change&nbsp;<span class="attribute">when</span>="<span class="attributevalue">2018-02-11</span>"&nbsp;<span class="attribute">who</span>="<span class="attributevalue">LB</span>"&gt;</span>Added to EN collection<span class="element">&lt;/change&gt;</span>
<span class="element">&lt;/revisionDesc&gt;</span></div><p>Just one small question... </p></div><div class="slide"><h2>How do we get there from here?</h2><ul><li class="item">some, but by no means all, the titles we would like to include may already be available in digital form</li><li class="item">we can automatically (more or less) convert from other TEI vocabs, HTML, ePub, text to our target encoding </li><li class="item">(this may involve <em>removing</em> markup!)</li><li class="item">we need to investigate effectiveness of OCR for other materials</li><li class="item">syntactic validation of the result can be automated</li><li class="item">... but determining whether or not we are correctly representing a specific text is another matter </li></ul></div></body></html>