Pull recent changes from main repository #1

Open
wants to merge 99 commits into
base: master
99 commits
13f57ec
add support for sic/corr
peterstadler Mar 17, 2020
44f8228
Merge pull request #14 from peterstadler/patch-1
dariok Apr 21, 2020
2e82e62
Merge pull request #17 from OpenArabicPE/wrap-pages-in-ab
dariok Apr 21, 2020
24d2127
Properly list all the helping hands
dariok Apr 21, 2020
f78761a
create tei:p from TextRegion [@type = 'paragraph']
dariok Apr 21, 2020
a24ce09
adjust XSpec (white spaces)
dariok Apr 21, 2020
0d976cc
adjust XSpec: ab as default
dariok Apr 21, 2020
784553e
Adjust XSpec: full choice/sic+corr
dariok Apr 21, 2020
c96c074
cleanup
dariok Aug 18, 2020
4945c60
escape curly braces
dariok Aug 18, 2020
3bdeb22
changed some info in the authors/contributors section
dariok Jul 24, 2021
af7e21c
improve usage of Transkribus meta data for teiHeader
dariok Jul 24, 2021
ff065fd
improve result indentation (I)
dariok Oct 17, 2021
a08c7e3
Marginalia: useful value for
dariok Oct 17, 2021
6c8ab97
Context for marginalia test
dariok Oct 17, 2021
5d5b58c
XSLT to create a bounding rectangle from polygon
dariok Nov 8, 2021
f25e705
Merge branch 'master' of github.com:dariok/page2tei
dariok Nov 8, 2021
075c4ea
include Transkribus image URL as second tei:graphic
dariok Nov 10, 2021
b2a2faf
use p:page/@imageFilename for first graphic/@url
dariok Nov 10, 2021
01c06fe
force some formatting into facsimile
dariok Nov 10, 2021
59f1269
lb: create @n only if there's an index in @custom
dariok Jan 1, 2022
05ebe90
Merge
dariok Jan 1, 2022
2b8c7ef
combine tei:rs with
dariok Aug 4, 2022
af76a8e
merge
dariok Aug 4, 2022
d2a8d16
combine continued: both tests pass
dariok Aug 4, 2022
88e7ac0
move scenario file to main directory
dariok Oct 18, 2022
8c281e5
add parameter: use lines w/o baseline
dariok Oct 23, 2022
05a1053
region 'page-number' → tei:fw
dariok Oct 23, 2022
df5948f
region paragraph: eval @custom, too
dariok Oct 23, 2022
85502df
evaluate @type and @custom for all region types
dariok Oct 23, 2022
7dbcb29
preserve italics from Transkribus
dariok Dec 8, 2022
9697759
endnotes
dariok Dec 10, 2022
7441ea2
extended scenarios
dariok Dec 10, 2022
af3e348
Create LICENSE
dariok Dec 14, 2022
5c370bd
Update README.md
dariok Dec 14, 2022
f538557
footnote has precedence over footer
dariok Dec 21, 2022
59f00ea
Merge branch 'master' of github.com:dariok/page2tei
dariok Dec 21, 2022
a45583e
basic tokenisation
dariok Jan 25, 2023
421f620
evaluate custom tags and properties
dariok Jan 25, 2023
c9ebfec
always round coordinates
dariok Mar 6, 2023
a958e0b
use relative paths and useful output file names
dariok Mar 6, 2023
0811ef5
do not fail on empty properties in attribute custom
dariok Mar 10, 2023
46b862e
Y-coord were not correctly extracted
dariok Apr 6, 2023
c820557
Prepare several new parameters
dariok May 26, 2023
e66b4f6
Fully implement and test parameter rs
dariok May 26, 2023
bc8ae76
Include tokenization by parameter
dariok Jun 4, 2023
93cb790
Include combine-continued by parameter
dariok Jun 4, 2023
e6dc0d0
parameter to return ab instead of element by name
dariok Jun 4, 2023
07f411c
Store word coordinates by parameter
dariok Jun 4, 2023
7ea386e
combining of entities also or *Name
dariok Jun 4, 2023
d7f69f3
include tokenization by parameter
dariok Jun 29, 2023
18d4610
improve node selection for continued rs
dariok Jun 29, 2023
02b2eee
improve selection of elements for tokenization
dariok Jun 29, 2023
ac54f56
tokenize: add @break="no"
dariok Jun 29, 2023
409ce37
add langUsage if data present in trpMedata
dariok Jun 29, 2023
bd9c623
improve handling of nested and adjacent tags
dariok Jul 1, 2023
a14550f
add angled brackets as punctuation characters
dariok Jul 1, 2023
0db1852
continue choice[abbr]
dariok Jul 13, 2023
a90a07c
add single quotation marks to tokenize as pc
dariok Jul 15, 2023
6459934
combine hi continued over linebreaks
dariok Jul 17, 2023
67fe47e
handle white space at end of group
dariok Jul 17, 2023
e463ae3
new textStyle: smallCaps
dariok Jul 17, 2023
554ac83
improve group evaluation
dariok Jul 18, 2023
d35bde2
group eval when last is the only expan
dariok Jul 18, 2023
5b3a92e
convert letter-spacing
dariok Jul 19, 2023
21b24a4
scenarios
dariok Jul 21, 2023
6610773
tokenize: small characters may be Greek
dariok Jul 21, 2023
4d9e66b
ignore output dir (for testing purposes)
dariok Jul 21, 2023
d933bfe
correct typo
dariok Aug 9, 2023
21477fb
typo
dariok Aug 13, 2023
6292120
update readme
dariok Aug 13, 2023
c454488
update readme
dariok Aug 13, 2023
0aeac8f
do not remove hyphenation in unclear situations
dariok Nov 29, 2023
614772a
new param: write custom attributes, even if not valid TEI
dariok Dec 4, 2023
7ab7bbe
add info about parameters to README
dariok Dec 4, 2023
117a0d8
add some styling to README
dariok Dec 4, 2023
c184c12
Merge pull request #27 from dariok/parametrize
dariok Dec 4, 2023
f64ba97
fix faulty combination of hi over line breaks
dariok Jan 9, 2024
dd19609
added tests
dariok Jan 9, 2024
77593ef
use 0 for unset coordinates
dariok Jan 9, 2024
7792a93
add guillemets as punctuation characters
dariok Jan 9, 2024
4e30f05
consider column breaks when removing hyphenation
dariok Jul 27, 2024
807ee7c
no hyphenation when following line starts with a capital letter
dariok Jul 30, 2024
45a4f99
Completely rework hyphenation detection
dariok Aug 27, 2024
9dfc464
do not fail if continued but no peer for combination
dariok Aug 27, 2024
eaa1a5f
additional cases where a coordinate can be 0
dariok Aug 27, 2024
9e8daac
Also work on PAGE files with the 2019 namespace
dariok Jan 18, 2025
ba35e4c
Fix empty results where there were cases like head/choice/abbr etc.
dariok Jan 18, 2025
2c457b4
Fixed wrong placement of global parameters
dariok Jan 18, 2025
83ed597
additional fixes for tests
dariok Jan 18, 2025
1e99263
Create figure/graphic from GraphicElement
dariok Jan 18, 2025
7c66a79
publicationStmt instead of seriesStmt
dariok Jan 18, 2025
459f7b8
output graphic only if @imgUrl is present
dariok Jan 18, 2025
7b05999
subscripts
dariok Jan 18, 2025
c065bb1
House keeping
dariok Jan 18, 2025
10d8605
Merge pull request #31 from dariok/parametrize
dariok Jan 18, 2025
2c1d9c9
Picked some of the requirements from https://github.com/dariok/page2t…
dariok Jan 18, 2025
289b4ea
Use pc:TextLine as well
dariok Jan 19, 2025
ae90430
fix bug when recombining hyphenated words
dariok Jan 20, 2025
3 changes: 3 additions & 0 deletions .gitignore
@@ -1,3 +1,6 @@
/tests-ga/
/tests-wd/
/tests2/
AS/
page2tei-output/

21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2018 Dario Kampkaspar

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
35 changes: 34 additions & 1 deletion README.md
@@ -1,7 +1,40 @@
# page2tei
PAGE2TEI was created and is maintained by Dario Kampkaspar and is licensed under the MIT license.

Use the METS File for your transformation:
## How to use
Apply page2tei-0.xsl to the METS File:

```
java -jar saxon9he.jar -xsl:page2tei-0.xsl -s:mets.xml -o:[your tei file].xml
```

Additional stylesheets can be applied to the output created by the basic transformation:
- `combine-continued.xsl` (or set parameter `combine=true()`) — try to combine entities that are split over a line break into one element
- `simplify-coordinates.xsl` (parameter `bounding-rectangles=true()` by default) — convert polygons into bounding rectangles
- `tokenize.xsl` (or set parameter `tokenize=true()`) — perform (very basic!) whitespace tokenization

## Parameters
You can set the following parameters when calling `page2tei-0.xsl` (via command line or via an oXygen scenario; in oXygen, the parameters should be marked as “XPath”):

- rs (default: `true()`): create `rs type="..."` for person/place/org (default) or `persName` etc.
- tokenize (default: `false()`): Whether to run white space tokenization
- combine (default: `false()`): Whether to combine entities over line breaks
- ab (default: `false()`): If false(), region types that correspond to valid TEI elements will be returned as
this element; types that do not correspond to a TEI element will be returned as
tei:ab[@type]. If set to true(), all region types (except for paragraph, heading) will be
returned as tei:ab.
- word-coordinates (default: `false()`): If true(), export the (estimated) word coordinates to the facsimile section.
- bounding-rectangles (default: `true()`): Whether to create bounding rectangles from polygons
- withoutBaseline (default: `false()`): Whether to export lines without a baseline
- withoutTextline (default: `false()`): Whether to export regions without text lines
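
The parameters listed above can be passed on the Saxon command line. The following is a minimal sketch, assuming Saxon's `?name=value` syntax, where a leading `?` makes Saxon evaluate the value as an XPath expression, so `true()` arrives as a boolean rather than as the string `true()`. The `page2tei` wrapper function, the `DRY_RUN` switch and the file names `mets.xml` and `tei.xml` are hypothetical and not part of this repository. With `DRY_RUN=1` the command is only printed, so the sketch runs without Saxon installed.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of page2tei): assembles the Saxon call
# from "How to use" and appends any stylesheet parameters given after
# the input and output file names.
page2tei() {
    mets=$1
    out=$2
    shift 2
    # Remaining arguments are passed through unchanged as parameters.
    set -- java -jar saxon9he.jar -xsl:page2tei-0.xsl "-s:$mets" "-o:$out" "$@"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"   # dry run: only print the command
    else
        "$@"                 # real run: requires Java and saxon9he.jar
    fi
}

# Dry run with tokenization and combining switched on.
# The single quotes keep the shell from interpreting "?" and "()".
DRY_RUN=1
page2tei mets.xml tei.xml '?tokenize=true()' '?combine=true()'
# prints: java -jar saxon9he.jar -xsl:page2tei-0.xsl -s:mets.xml -o:tei.xml ?tokenize=true() ?combine=true()
```

Unset `DRY_RUN` to actually run the transformation. Without the leading `?`, Saxon would pass the value as a string instead of evaluating it; how the stylesheet would then interpret it depends on how the parameters are declared, which is not shown here.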


## Contributors
- @tboenig
- @peterstadler
- @tillgrallert

---

Some contributions to this software were created within the scope of a project funded by the German BMBF, project ID 16TOA015A.

122 changes: 122 additions & 0 deletions combine-continued.xsl
@@ -0,0 +1,122 @@
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:xd="http://www.oxygenxml.com/ns/doc/xsl"
xmlns:tei="http://www.tei-c.org/ns/1.0"
xmlns:local="local"
xmlns="http://www.tei-c.org/ns/1.0"
exclude-result-prefixes="#all"
version="3.0">
<xd:doc scope="stylesheet">
<xd:desc>
<xd:p><xd:b>Created on:</xd:b> 2022-08-03</xd:p>
<xd:p><xd:b>Author:</xd:b> [email protected]</xd:p>
<xd:p>combine neighbouring tei:rs[@continued] separated by a tei:lb</xd:p>
</xd:desc>
</xd:doc>

<!--<xsl:template match="/">
<xsl:apply-templates mode="continued" />
</xsl:template>-->

<xd:doc>
<xd:desc>
<xd:p>Combine continued elements (e.g. rs)</xd:p>
<xd:p>This works on an element that contains (more than one) continued element (there may be just one element
if the continuation happens across region borders).</xd:p>
</xd:desc>
</xd:doc>
<xsl:template match="tei:*[count(tei:*[@continued = 'true']) gt 1]" mode="continued">
<xsl:copy>
<xsl:apply-templates select="@*" mode="continued" />
<xsl:for-each-group select="node()"
group-starting-with="tei:*[
@continued eq 'true'
and normalize-space() != ''
and (
normalize-space(preceding::text()[1]) != ''
or preceding::text()[1][not(preceding-sibling::*)]
or preceding-sibling::*[1][not(@continued = 'true') and not(self::tei:lb)]
)
]">

<xsl:choose>
<xsl:when test="current-group()[1][@continued eq 'true' and tei:abbr]">
<!-- we assume there are exactly 2 choice elements with one lb in between, so no multi-line abbreviations:
1=choice, 2=text(), 3=lb, 4=text(), 5=choice -->
<choice>
<expan>
<xsl:sequence select="current-group()[1]/tei:expan/node()" />
</expan>
<abbr>
<xsl:sequence select="current-group()[1]/tei:abbr/node()" />
<xsl:sequence select="current-group()[3]" />
<xsl:sequence select="current-group()[4]/tei:abbr/node()" />
</abbr>
</choice>
<xsl:apply-templates select="current-group()[position() gt 4]" mode="continued" />
</xsl:when>
<xsl:when test="current-group()[1][@continued eq 'true'] and count(current-group()[@continued = 'true']) gt 1">
<xsl:variable
name="final"
select="
(
current-group()[
position() gt 1
and @continued = 'true'
and node()
][last()],
current-group()[4]
)[1]"
/>
<xsl:try>
<xsl:variable name="last" select="index-of(current-group(), $final)[1]"/>

<xsl:element name="{local-name()}">
<xsl:apply-templates select="@*" mode="continued" />
<xsl:apply-templates select="current-group()[position() le $last]" mode="rs-continued" />
</xsl:element>
<xsl:apply-templates select="current-group()[position() gt $last]" mode="continued" />
<xsl:catch>
<xsl:message select="current-group()" />
</xsl:catch>
</xsl:try>
</xsl:when>
<xsl:otherwise>
<xsl:apply-templates select="current-group()" mode="continued" />
</xsl:otherwise>
</xsl:choose>
</xsl:for-each-group>
</xsl:copy>
</xsl:template>

<xd:doc>
<xd:desc>lb will be returned unaltered</xd:desc>
</xd:doc>
<xsl:template match="tei:lb" mode="rs-continued">
<lb>
<xsl:sequence select="@*" />
</lb>
</xsl:template>

<xd:doc>
<xd:desc>For continued rs, only return the content</xd:desc>
</xd:doc>
<xsl:template match="*" mode="rs-continued">
<xsl:apply-templates select="node()" mode="continued" />
</xsl:template>

<xd:doc>
<xd:desc>Remove @continued</xd:desc>
</xd:doc>
<xsl:template match="@continued" mode="continued"/>

<xd:doc>
<xd:desc>Default</xd:desc>
</xd:doc>
<xsl:template match="@* | node()" mode="continued">
<xsl:copy>
<xsl:apply-templates select="@* | node()" mode="#current" />
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
45 changes: 45 additions & 0 deletions combine-hi.xsl
@@ -0,0 +1,45 @@
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:tei="http://www.tei-c.org/ns/1.0"
xmlns:math="http://www.w3.org/2005/xpath-functions/math"
xmlns="http://www.tei-c.org/ns/1.0"
exclude-result-prefixes="#all"
version="3.0">

<xsl:template match="*[tei:hi]" mode="combine-hi">
<xsl:copy>
<xsl:sequence select="@*" />
<xsl:for-each-group select="node()"
group-starting-with="tei:hi[
@style != preceding-sibling::tei:hi[1]/@style
or not(preceding-sibling::tei:hi)
or normalize-space(string-join(preceding-sibling::tei:hi[1]/following-sibling::node() intersect preceding-sibling::node())) != ''
]">
<xsl:choose>
<xsl:when test="not(current-group()[self::tei:hi])">
<xsl:sequence select="current-group()" />
</xsl:when>
<xsl:otherwise>
<xsl:variable name="lastHi" select="index-of(current-group(), current-group()[self::tei:hi][last()])"/>

<hi style="{current-group()[1]/@style}">
<xsl:apply-templates select="current-group()[position() le $lastHi]" mode="do-combine-hi" />
</hi>
<xsl:sequence select="current-group()[position() gt $lastHi]" />
</xsl:otherwise>
</xsl:choose>
</xsl:for-each-group>
</xsl:copy>
</xsl:template>

<xsl:template match="tei:hi" mode="do-combine-hi">
<xsl:apply-templates mode="combine-hi"/>
</xsl:template>

<xsl:template match="@* | node()" mode="combine-hi do-combine-hi">
<xsl:copy>
<xsl:apply-templates select="@* | node()" mode="#current"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>