This repo contains scripts and resources for automated quality assessment of e-books (for now only EPUB; PDF may follow later). The scripts require Python 3.x and do not work with Python 2.x!
- Epubcheck Python wrapper (pip install epubcheck) (tested with v. 4.2.6)
- tika-python (pip install tika)
- pandas (pip install pandas)
- matplotlib (pip install matplotlib)
- python-tabulate (pip install tabulate)
This script recursively walks through a directory tree, and runs Epubcheck for each EPUB file (identified by its file extension). It then extracts all validation error and warning codes, removing duplicate codes, and writes them to a comma-delimited text file. Note that the script only reports on unique errors and warnings. For example, if an EPUB contains multiple missing referenced resources (error code RSC-007), any duplicate instances are removed.
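The walk-and-deduplicate steps described above can be sketched roughly as follows. This is a minimal illustration, not the code of extract.py: the function names find_epubs and unique_codes are hypothetical, the case-insensitive extension match is an assumption, and messages are assumed to arrive as (code, level) pairs, which may differ from what the Epubcheck wrapper actually returns.

```python
import os


def find_epubs(root_dir):
    """Yield paths of files under root_dir whose extension is .epub.

    Matching the extension case-insensitively is an assumption.
    """
    for dir_path, _dir_names, file_names in os.walk(root_dir):
        for name in sorted(file_names):
            if name.lower().endswith(".epub"):
                yield os.path.join(dir_path, name)


def unique_codes(messages, level):
    """Return the distinct codes at the given level, in first-seen order.

    messages is assumed to be an iterable of (code, level) pairs.
    """
    codes = [code for code, msg_level in messages if msg_level == level]
    # dict.fromkeys drops duplicates while preserving insertion order
    return list(dict.fromkeys(codes))
```

With this approach an EPUB that triggers RSC-007 ten times contributes the code only once, which matches the "unique errors and warnings" behaviour described above.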
The script also reports some basic metadata (identifier, author, title, publisher) and a word count for each file. The word count can be a useful heuristic for identifying EPUBs that contain only images without any actual text (particularly common for illustrated children's books of some publishers). For these books the word count is typically less than 1000.
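The word-count heuristic could look something like the sketch below. It assumes the text has already been extracted (for example with Apache Tika) and may be empty or missing; the function names and the whitespace-based counting are illustrative assumptions, and the default threshold of 1000 is simply the typical value mentioned above.

```python
def count_words(text):
    """Count whitespace-separated tokens; None or empty text counts as zero."""
    return len(text.split()) if text else 0


def looks_image_only(word_count, threshold=1000):
    """Flag a book whose word count falls below the threshold."""
    return word_count < threshold
```

A low count is only a hint that a book is mostly images, so flagged files are best checked by hand.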
python3 extract.py rootDir prefixOut
The script generates two output files (the names are based on the user-specified value of prefixOut):
- A comma-delimited text file ($prefixOut.csv) with, for each EPUB, the following columns:
  - fileName: full path to file
  - identifier: identifier
  - title: title
  - author: author name
  - publisher: publisher name
  - epubVersion: EPUB version string
  - epubStatus: EpubCheck validation outcome
  - noErrors: number of unique errors reported by EpubCheck
  - noWarnings: number of unique warnings reported by EpubCheck
  - errors: space-delimited list of unique errors reported by EpubCheck
  - warnings: space-delimited list of unique warnings reported by EpubCheck
  - wordCount: word count (based on extracted text with Apache Tika)

  Errors and warnings are reported as codes; the meaning of these codes can be found in EpubCheck's default MessageBundle.properties file.
- A text file ($prefixOut_ec.txt) with the full Epubcheck output of all processed files.
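As an illustration of how the CSV output might be consumed, the sketch below tallies how many EPUBs report each code. It uses only the standard library and assumes the file starts with a header row carrying the column names listed above; the function name tally_codes is hypothetical and is not part of this repo.

```python
import csv
from collections import Counter


def tally_codes(csv_path, column="errors"):
    """Count how many EPUBs report each code in a space-delimited column."""
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # An empty or missing cell contributes no codes
            counts.update((row.get(column) or "").split())
    return counts
```

Because each code appears at most once per row, the resulting counts are numbers of affected files rather than numbers of individual messages. Passing column="warnings" gives the same tally for warnings.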
python3 report.py inputFile dirOut
Here inputFile is the CSV file produced by extract.py, and dirOut is the name of a directory where all output is written.
- report.md: report in Markdown format
- report.html: report in HTML format
- csv: directory with CSV files (description can be found in the report, which also links to these files)
Runs the DAISY Ace tool on all EPUBs in a directory.
github-markdown-css by Sindre Sorhus, released under the MIT license.