Text extraction from EPUB files

Test environment: local desktop machine running Linux Mint 20.1 Ulyssa, MATE edition.

Tika-python

See installation instructions here:

https://github.com/chrismattmann/tika-python

The Airgap Environment Setup looks useful for running Tika in an environment without Internet access (haven't tried this myself).

Tested version: tika-2.6.0

Issues

Font names

Extracted text is followed by what looks like a list of font names, separated by newlines. E.g. for berk011veel01_01.epub:

Charis SIL Bold Italic

::
::

Charis SIL Small Caps

BUT if I use Tika directly from the command line this doesn't happen:

java -jar ~/tika/tika-app-2.3.0.jar -t /home/johan/kb/epub-tekstextractie/DBNL_EPUBS_moderneromans/berk011veel01_01.epub  > berk.txt

Looking at the tika-python code I noticed the "service" parameter:

https://github.com/chrismattmann/tika-python/blob/master/tika/parser.py#L74

So I changed the call to:

parsed = parser.from_file(fileIn, service='text')

After this change the font names are not reported anymore!
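For reference, here is a minimal sketch of the changed call wrapped in a small helper. The name `extract_text` and the `from_file` argument are hypothetical; the latter is only there so the helper can be tried without a running Tika server. By default it falls back to tika-python's `parser.from_file`. The sketch assumes the returned dictionary has a `content` key that may be `None` when nothing was extracted.

```python
def extract_text(file_in, from_file=None):
    """Return the plain text that Tika extracts from file_in.

    from_file is a hypothetical hook for testing; when omitted,
    tika-python's parser.from_file is used (needs Java and a Tika server).
    """
    if from_file is None:
        # Imported lazily so the helper can be defined without tika installed
        from tika import parser
        from_file = parser.from_file

    # service='text' requests the text only; in the tests described above
    # this avoided the trailing list of font names
    parsed = from_file(file_in, service='text')

    # Assumption: 'content' is missing or None if nothing was extracted
    return parsed.get('content') or ''
```

Usage would then be `content = extract_text(fileIn)`.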

Submitted issue for this:

chrismattmann/tika-python#389

Footnotes, index

dhae007euro01_01.epub contains both footnotes and an index.

Table of Contents, landmarks

Additional tests on some Standard Ebooks (https://standardebooks.org/) files also showed extraction of text from Table of Contents and Landmarks.

E.g. this one:

https://standardebooks.org/ebooks/e-e-smith/the-skylark-of-space

Also happens when using Tika directly:

java -jar ~/tika/tika-app-2.3.0.jar -t /home/johan/kb/epub-accessibility/standard-ebooks/e-e-smith_the-skylark-of-space.epub > skylark.txt

Output originates from toc.xhtml.

Colophon text

DBNL books do contain colophon text (e.g. berk011veel01_01.epub), which is also included in the extraction result.

Text encoding?

What happens if the extracted content has an encoding other than UTF-8? Or are the text strings returned by Tika always UTF-8? This matters because EPUB allows both UTF-8 and UTF-16.

For EPUB 2 (https://idpf.org/epub/20/spec/OPS_2.0.1_draft.htm):

Publications may use the entire Unicode character set, using UTF-8 or UTF-16 encodings

EPUB 3 (https://www.w3.org/TR/epub-33/#sec-xml-constraints):

Any publication resource that is an XML-based media type [rfc2046] (...) MUST be encoded in UTF-8 or UTF-16 [unicode], with UTF-8 as the RECOMMENDED encoding.

Possible test: create test EPUB with some UTF-16 encoded resources.
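A rough sketch of how such a test file could be created with Python's standard library. Everything here is an assumption for illustration: the function name `make_utf16_epub`, the file names inside the container, and the placeholder identifier. The result has not been validated with EPUBCheck, so it may need tweaking before it counts as a valid EPUB.

```python
import zipfile

CONTAINER = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

OPF = """<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="id">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="id">urn:uuid:00000000-0000-0000-0000-000000000000</dc:identifier>
    <dc:title>UTF-16 test</dc:title>
    <dc:language>en</dc:language>
    <meta property="dcterms:modified">2023-01-01T00:00:00Z</meta>
  </metadata>
  <manifest>
    <item id="nav" href="nav.xhtml" media-type="application/xhtml+xml" properties="nav"/>
    <item id="ch1" href="chapter.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine>
    <itemref idref="ch1"/>
  </spine>
</package>
"""

NAV = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">
<head><title>Contents</title></head>
<body><nav epub:type="toc"><ol><li><a href="chapter.xhtml">Chapter</a></li></ol></nav></body>
</html>
"""

CHAPTER = """<?xml version="1.0" encoding="UTF-16"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Chapter</title></head>
<body><p>Café, naïve: some non-ASCII text for the test.</p></body>
</html>
"""


def make_utf16_epub(target):
    """Write a minimal EPUB to target (a path or a binary file object)."""
    with zipfile.ZipFile(target, "w") as zf:
        # The mimetype entry must come first and be stored uncompressed
        zf.writestr("mimetype", b"application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        for name, text in (("META-INF/container.xml", CONTAINER),
                           ("content.opf", OPF),
                           ("nav.xhtml", NAV)):
            zf.writestr(name, text.encode("utf-8"),
                        compress_type=zipfile.ZIP_DEFLATED)
        # Python's "utf-16" codec prepends a byte order mark
        zf.writestr("chapter.xhtml", CHAPTER.encode("utf-16"),
                    compress_type=zipfile.ZIP_DEFLATED)
```

Calling `make_utf16_epub("utf16-test.epub")` and running the result through each extraction tool should show whether the accented characters survive.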

XHTML output

It is possible to extract the text to XHTML output instead of unformatted text. This is done by changing the call to Tika's parser function to:

parsed = parser.from_file(fileIn, xmlContent=True)

Even though this preserves the internal document structure, identifying things like footnotes is still not straightforward, because they are not explicitly tagged. See the example below:

<div class="voetnoten"><a class="footnote-link zz_voetnootcijfer" href="dhae007euro01_01-0003.xhtml#n001T" id="n001">1</a>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</div>

Here the footnote is wrapped inside a div element, where the value of the class attribute identifies it as a footnote. But the class values are not in any way standardized, and there are no controlled vocabularies for this. So the implementation will vary from one publisher to another. See also:

https://stackoverflow.com/questions/18162068/semantic-elements-for-footnote-list-and-content

and:

https://www.davidmacd.com/blog/html51-footnotes.html
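To illustrate why this is publisher-specific, here is a small sketch that picks out footnotes by class value. The function name `find_footnotes` and the `FOOTNOTE_CLASSES` set are hypothetical; "voetnoten" is taken from the DBNL example above, and every other publisher would need its own values. The sketch also assumes well-formed XML input.

```python
import xml.etree.ElementTree as ET

# Hypothetical, publisher-specific class values ("voetnoten" is what DBNL uses)
FOOTNOTE_CLASSES = {"voetnoten"}


def find_footnotes(xhtml, footnote_classes=FOOTNOTE_CLASSES):
    """Return the text of all elements whose class marks them as footnotes."""
    root = ET.fromstring(xhtml)
    notes = []
    for elem in root.iter():
        # The class attribute can hold several space-separated values
        classes = set(elem.get("class", "").split())
        if classes & set(footnote_classes):
            # itertext() also collects text from child elements,
            # such as the footnote number inside the link
            notes.append("".join(elem.itertext()).strip())
    return notes
```

For the DBNL example this returns the footnote number followed directly by the footnote text. A file from another publisher would most likely return an empty list until its class names are added.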

Also, in the tested (2.6.0) version of Tika-python the resulting output is not well-formed XHTML, as it also includes metadata. This looks like a bug, so I reported this here:

chrismattmann/tika-python#389 (comment)

Textract

Extract text from any document. No muss. No fuss.

https://github.com/deanmalmgren/textract

This uses Ebooklib for EPUB:

https://github.com/aerkalov/ebooklib

Installation:

pip install textract

This installed Textract v1.6.5. In my case this gave me some errors:

ERROR: launchpadlib 1.10.13 requires testresources, which is not installed.
ERROR: pdfx 1.4.1 has requirement chardet==4.0.0, but you'll have chardet 3.0.4 which is incompatible.
ERROR: pdfx 1.4.1 has requirement pdfminer.six==20201018, but you'll have pdfminer-six 20191110 which is incompatible.

Because of these dependency conflicts I re-installed in a virtual environment as described here.

Basic usage:

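import textract

fileIn = "e-e-smith_the-skylark-of-space.epub"
fileOut = "test.txt"

# process() returns a bytes object; decode it to a string
content = textract.process(fileIn, encoding='utf-8').decode('utf-8')

# Write the extracted text to a UTF-8 encoded file
with open(fileOut, 'w', encoding='utf-8') as fout:
    fout.write(content)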

BUT for DBNL books "content" is empty in most cases (zero-byte bytes object) or just a few words. It did work OK with the Standard Ebooks examples I tried, no idea why.

Submitted an issue:

deanmalmgren/textract#455

Word counts

King Lear: 28442 words with Tika, 18621 with Textract!

Demo scripts

The demo scripts iterate over all files with an .epub extension in the input directory, and write the extracted text to text files in the output directory. They also write a summary file with the word count for each EPUB.
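The general pattern can be sketched as follows. This is not the code of the actual demo scripts: the function name `extract_all`, the `summary.csv` file name and the way words are counted (whitespace-separated tokens) are all assumptions. The `extract` argument stands for any function that turns an EPUB path into a text string.

```python
import csv
from pathlib import Path


def extract_all(dir_in, dir_out, extract):
    """Extract text from every .epub in dir_in and write results to dir_out.

    extract is any callable that maps an EPUB path to a text string.
    Returns a list of (file name, word count) tuples.
    """
    dir_in, dir_out = Path(dir_in), Path(dir_out)
    dir_out.mkdir(parents=True, exist_ok=True)

    rows = []
    for epub in sorted(dir_in.glob("*.epub")):
        text = extract(str(epub))
        (dir_out / (epub.stem + ".txt")).write_text(text, encoding="utf-8")
        # Assumption: a word is any whitespace-separated token
        rows.append((epub.name, len(text.split())))

    # Summary file with one row per EPUB
    with open(dir_out / "summary.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["fileName", "wordCount"])
        writer.writerows(rows)
    return rows
```

Passing in a different `extract` function per tool keeps the iteration and word counting identical, so differences in the counts come from the tools themselves.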

Tika script

Usage:

python3 extract-tika.py [-h] [--trim] dirIn dirOut

positional arguments:

  • dirIn: directory with input EPUB files
  • dirOut: output directory

optional arguments:

  • -h, --help: show help message and exit

Textract script

Usage:

python3 extract-textract.py [-h] dirIn dirOut

positional arguments:

  • dirIn: directory with input EPUB files
  • dirOut: output directory

optional arguments:

  • -h, --help: show help message and exit

Ebooklib script

Usage:

python3 extract-ebooklib.py [-h] dirIn dirOut

positional arguments:

  • dirIn: directory with input EPUB files
  • dirOut: output directory

optional arguments:

  • -h, --help: show help message and exit
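For readers unfamiliar with Ebooklib, extraction with it could look roughly like the sketch below. This is not the code of extract-ebooklib.py. The helper names are hypothetical, the tag stripping is deliberately simple, and decoding every resource as UTF-8 is an assumption that would not hold for UTF-16 encoded resources. Items may also not be returned in reading order.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect character data, skipping script and style content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(markup):
    """Strip tags from an (X)HTML string and return the remaining text."""
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.parts)


def extract_with_ebooklib(file_in):
    """Return the text of all document items in an EPUB."""
    # Imported lazily so html_to_text can be used without Ebooklib installed
    import ebooklib
    from ebooklib import epub

    book = epub.read_epub(file_in)
    texts = []
    for item in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
        # Assumption: get_content() returns UTF-8 encoded bytes
        markup = item.get_content().decode("utf-8", errors="replace")
        texts.append(html_to_text(markup))
    return "\n".join(texts)
```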

Examples

Tika:

python3 ./textExtractDemo/scripts/extract-tika.py DBNL_EPUBS_moderneromans/ out-dbnl/

Textract:

python3 ./textExtractDemo/scripts/extract-textract.py DBNL_EPUBS_moderneromans/ out-dbnl/

Ebooklib:

python3 ./textExtractDemo/scripts/extract-ebooklib.py DBNL_EPUBS_moderneromans/ out-dbnl/

Word counts DBNL

| fileName | Tika | Textract | Ebooklib |
|:--|--:|--:|--:|
| eern001lief01_01.epub | 25450 | 1 | 25446 |
| spro002mure01_01.epub | 50553 | 0 | 50549 |
| berk011veel01_01.epub | 67978 | 0 | 67974 |
| sche034drie01_01.epub | 203853 | 3 | 203352 |
| jous010supe01_01.epub | 202495 | 0 | 202491 |
| dele035wegv01_01.epub | 76536 | 0 | 76530 |
| verv017eerl01_01.epub | 33844 | 0 | 33840 |
| dhae007euro01_01.epub | 394455 | 2 | 394400 |
| gomm002uurw01_01.epub | 43754 | 0 | 43731 |
| gang009lalb01_01.epub | 28453 | 4 | 28381 |
| geel005bloe01_01.epub | 76316 | 0 | 76312 |
| hart008droo02_01.epub | 77283 | 0 | 77279 |
| eede003vand04_01.epub | 120481 | 6 | 120310 |
| meij031tuss02_01.epub | 145678 | 4 | 145665 |
| maas013blau01_01.epub | 55099 | 0 | 55093 |

Word counts Standard Ebooks

| fileName | Tika | Textract | Ebooklib |
|:--|--:|--:|--:|
| william-shakespeare_king-lear.epub | 28442 | 18621 | 28430 |
| david-garnett_lady-into-fox.epub | 25240 | 25223 | 25228 |
| joseph-conrad_heart-of-darkness.epub | 38717 | 38698 | 38705 |
| anthony-trollope_the-dukes-children.epub | 223014 | 222995 | 223002 |
| agatha-christie_the-mysterious-affair-at-styles.epub | 57401 | 57229 | 57271 |
| edgar-allan-poe_the-narrative-of-arthur-gordon-pym-of-nantucket.epub | 71931 | 71837 | 71863 |
| p-g-wodehouse_short-fiction.epub | 212224 | 212182 | 212212 |
| robert-louis-stevenson_the-strange-case-of-dr-jekyll-and-mr-hyde.epub | 26370 | 26345 | 26358 |
| h-g-wells_the-time-machine.epub | 33044 | 33024 | 33032 |
| thorstein-veblen_the-theory-of-the-leisure-class.epub | 106537 | 106515 | 106525 |
Test if all output is valid UTF-8

Use the isutf8 tool (see footnote 1):

isutf8 ./out-dbnl/*.txt
isutf8 ./out-se/*.txt

OK!
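On systems where isutf8 is not available, a comparable check can be sketched in Python. The function name `invalid_utf8_files` is hypothetical, and the check only tests whether each file decodes as UTF-8, which may differ in detail from what isutf8 does.

```python
from pathlib import Path


def invalid_utf8_files(directory, pattern="*.txt"):
    """Return the names of files in directory that are not valid UTF-8."""
    bad = []
    for path in sorted(Path(directory).glob(pattern)):
        try:
            # Decoding raises UnicodeDecodeError on any invalid byte sequence
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            bad.append(path.name)
    return bad
```

An empty list for `invalid_utf8_files("./out-dbnl")` corresponds to the "OK!" result above.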

Some differences

  • Tika output contains image placeholders with alt-text descriptions ([image: DBNL]) that do not appear in the output of either Ebooklib or Textract.
  • Tika inserts newlines between sections; Ebooklib and Textract don't.
  • Tika and Ebooklib output may contain leading whitespace characters (spaces or tabs). Textract strips these away.
  • Textract fails to extract any text at all for most of the DBNL EPUBs, and at best extracts only a few words. Submitted an issue for this (deanmalmgren/textract#455).
  • Textract does a better job on the Standard Ebooks EPUBs. But even there, for the King Lear EPUB, the word count of the Textract output is about 10 thousand words lower than that of the Tika and Ebooklib outputs (18621 versus 28442 and 28430). I'm not entirely sure why, but a cursory look reveals that, for example, the Table of Contents is missing from the Textract output, even though it is included in the Tika and Ebooklib output.
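One way to dig further into differences such as the missing Table of Contents is to list the lines that occur in one tool's output but not in another's. A minimal sketch, with a hypothetical function name; it normalizes whitespace before comparing, and it will over-report if the tools break lines at different places.

```python
def lines_missing_from(reference_text, other_text):
    """Return non-empty lines of reference_text that do not occur in other_text."""

    def normalize(line):
        # Collapse runs of whitespace and strip leading/trailing whitespace
        return " ".join(line.split())

    seen = {normalize(line) for line in other_text.splitlines()}
    missing = []
    for line in reference_text.splitlines():
        norm = normalize(line)
        if norm and norm not in seen:
            missing.append(norm)
    return missing
```

Running this with the Tika output as reference and the Textract output as the other text gives a quick list of candidate passages that Textract skipped.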

Footnotes

  1. isutf8 is part of the moreutils package. To install it, run: sudo apt-get install moreutils