Relation of METS and PAGE ReadingOrder #40
Comments
cf. #55 |
After discussing this issue with @tboenig: Reading order is not represented within METS since it is a page-level datum. |
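For reference, a minimal sketch of what "page-level" means here: reading the reading order from a PAGE file using only the standard library. It assumes the 2019-07-15 PAGE namespace and a single flat `OrderedGroup`; the sample document, the region IDs and the helper name `page_reading_order` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: PAGE 2019-07-15 namespace; older files use other version dates.
PAGE_NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Hypothetical minimal PAGE document with a flat, ordered reading order.
SAMPLE = """<pc:PcGts xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <pc:Page imageFilename="p1.tif" imageWidth="1000" imageHeight="2000">
    <pc:ReadingOrder>
      <pc:OrderedGroup id="ro1">
        <pc:RegionRefIndexed index="1" regionRef="r_body"/>
        <pc:RegionRefIndexed index="0" regionRef="r_title"/>
      </pc:OrderedGroup>
    </pc:ReadingOrder>
  </pc:Page>
</pc:PcGts>"""


def page_reading_order(root):
    """Return the region IDs of a flat OrderedGroup, sorted by their index."""
    refs = root.findall(
        ".//pc:ReadingOrder/pc:OrderedGroup/pc:RegionRefIndexed", PAGE_NS
    )
    ordered = sorted(refs, key=lambda ref: int(ref.get("index")))
    return [ref.get("regionRef") for ref in ordered]


ORDER = page_reading_order(ET.fromstring(SAMPLE))
print(ORDER)
```

Nested groups (`UnorderedGroup`, `OrderedGroupIndexed`) are deliberately ignored in this sketch.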
However, we find examples of reading orders represented in METS, e.g., within the DDR-Presseportal:
<mets:div TYPE="article-part" ORDER="1" ID="article6-1">
  <mets:div TYPE="article-zone" LABEL="title" ID="article6-zone1">
    <mets:fptr>
      <mets:area COORDS="194,886,658,170" SHAPE="RECT" FILEID="default1"/>
    </mets:fptr>
    <mets:fptr>
      <mets:area BETYPE="IDREF" BEGIN="block18" FILEID="alto1"/>
    </mets:fptr>
  </mets:div>
  <mets:div TYPE="article-zone" LABEL="body" ID="article6-zone2">
    <mets:fptr>
      <mets:area COORDS="183,1082,670,203" SHAPE="RECT" FILEID="default1"/>
    </mets:fptr>
    <mets:fptr>
      <mets:area BETYPE="IDREF" BEGIN="block19" FILEID="alto1"/>
    </mets:fptr>
  </mets:div>
  <mets:div TYPE="article-zone" LABEL="body" ID="article6-zone3">
    <mets:fptr>
      <mets:area COORDS="186,1290,673,559" SHAPE="RECT" FILEID="default1"/>
    </mets:fptr>
    <mets:fptr>
      <mets:area BETYPE="IDREF" BEGIN="block20" FILEID="alto1"/>
    </mets:fptr>
  </mets:div>
  <mets:div TYPE="article-zone" LABEL="body" ID="article6-zone4">
    <mets:fptr>
      <mets:area COORDS="189,1864,658,145" SHAPE="RECT" FILEID="default1"/>
    </mets:fptr>
    <mets:fptr>
      <mets:area BETYPE="IDREF" BEGIN="block21" FILEID="alto1"/>
    </mets:fptr>
  </mets:div>
</mets:div> |
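To make the implied reading order in such a fragment explicit, here is a rough sketch that walks the `article-zone` divs and collects their ALTO block references. It assumes that the order is carried by the document order of the zones (they have no `ORDER` attribute of their own), and it adds the METS namespace declaration that the excerpt omits. The helper name `zones_in_order` and the shortened fragment are illustrative only.

```python
import xml.etree.ElementTree as ET

METS_NS = {"mets": "http://www.loc.gov/METS/"}

# Shortened copy of the example; the xmlns declaration is added here
# because the excerpt does not carry one.
FRAGMENT = """
<mets:div xmlns:mets="http://www.loc.gov/METS/"
          TYPE="article-part" ORDER="1" ID="article6-1">
  <mets:div TYPE="article-zone" LABEL="title" ID="article6-zone1">
    <mets:fptr>
      <mets:area COORDS="194,886,658,170" SHAPE="RECT" FILEID="default1"/>
    </mets:fptr>
    <mets:fptr>
      <mets:area BETYPE="IDREF" BEGIN="block18" FILEID="alto1"/>
    </mets:fptr>
  </mets:div>
  <mets:div TYPE="article-zone" LABEL="body" ID="article6-zone2">
    <mets:fptr>
      <mets:area COORDS="183,1082,670,203" SHAPE="RECT" FILEID="default1"/>
    </mets:fptr>
    <mets:fptr>
      <mets:area BETYPE="IDREF" BEGIN="block19" FILEID="alto1"/>
    </mets:fptr>
  </mets:div>
</mets:div>
"""


def zones_in_order(part):
    """Return (zone ID, label, ALTO block ID) per article-zone, in document order."""
    result = []
    for zone in part.findall("mets:div[@TYPE='article-zone']", METS_NS):
        # Assumption: the fptr with BETYPE="IDREF" points at the ALTO text block.
        block = zone.find("mets:fptr/mets:area[@BETYPE='IDREF']", METS_NS)
        block_id = block.get("BEGIN") if block is not None else None
        result.append((zone.get("ID"), zone.get("LABEL"), block_id))
    return result


ZONES = zones_in_order(ET.fromstring(FRAGMENT))
for zone_id, label, block_id in ZONES:
    print(zone_id, label, block_id)
```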
How can you represent document structure? |
@kba Proposal for OCR-D purposes: |
@tboenig We should update the guidelines asap. |
@tboenig Push. |
This is only awaiting the updated guidelines, right? #80 is closed and I agree fully with #40 (comment). For the main purposes of OCR-D we should avoid (modifying) the depths of METS/MODS library-style structural tagging whenever we can also rely on PAGE. A solution for METS/MODS structural enrichment via external information available through our standard |
Possibly fixed by #154 |
superseded by #207, but unrelated AFAICS
Page-local reading order and structure are important both on their own and as contributors to document structure. The latter (i.e. structure across pages, like section boundaries and cross-refs/indexes) cannot be adequately represented in fileGrps, though. The only place for that is still the logical structMap IMHO. So far, we have two conventions for its representation:
The second convention is of course more powerful and general, but not as widely used. In fact, it has been somewhat forgotten even in the context of newspaper digitization, as even DDB Zeitungsportal has shied away from adopting it so far – despite listing the recording of article structure as a task in its grant proposal (AP 6 p.10) and in its master planning (Tiefenerschließung Artikelebene, p. 20). The latter document references ENMAP specifically, giving it a certain spin:
So we can see there is a chicken-and-egg problem here: automatic structural tagging is still hard (although tools for visualizing and detecting article structure are getting better), hence enriched datasets are rare, and therefore training is difficult. Not having everyone commit to the existing, agreed-upon unified representation makes this even more difficult. But it's not just a matter of simply adopting the ENMAP spec: IMO it is not trivially compatible with the DFG profile. However this is resolved, I do think it is worth pursuing some form of documentation and specification already – as an enabler for tool developers and data providers. (For example, we could simply write some OCR-D processor extracting OLR results with headings and reading order into "coarse" document structure in either DFG-profile / |
We need to specify how these constructs are related, which one to use, how to handle contradictions.
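As a starting point for the "contradictions" part, a minimal sketch of how a disagreement between the two sources could be detected, assuming both orders have already been extracted as lists over the same ID space (e.g. ALTO/PAGE block IDs). The function name `order_conflicts` and the sample IDs are hypothetical; which source should win when they disagree is exactly the open question and is not decided here.

```python
from itertools import combinations


def order_conflicts(mets_order, page_order):
    """Return ID pairs that METS lists as (a, b) but PAGE orders as (b, a)."""
    page_rank = {region_id: i for i, region_id in enumerate(page_order)}
    # Only IDs present in both sources can contradict each other.
    shared = [region_id for region_id in mets_order if region_id in page_rank]
    return [
        (a, b)
        for a, b in combinations(shared, 2)  # a precedes b in METS
        if page_rank[a] > page_rank[b]       # but follows it in PAGE
    ]


CONFLICTS = order_conflicts(
    ["block18", "block19", "block20"],  # order implied by the METS structMap
    ["block18", "block20", "block19"],  # order from the PAGE ReadingOrder
)
print(CONFLICTS)
```

An empty result means the two orders agree on every block they share.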