Improved document conversion

Current setup

Office and PDF documents are currently converted to HTML, using wvHtml, xlhtml, pdftohtml, and unrtf. Other formats are not currently supported.

Documents are converted when the "as HTML" link is followed. The converted text is cached on the filesystem, so subsequent requests for the same link are served directly from the cache.
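The convert-on-first-request behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the site's actual code: the function name `cached_html`, the `convert` callable, and the choice to key the cache on a hash of the file contents are all assumptions made for the example.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_html(source: Path, cache_dir: Path,
                convert: Callable[[Path], str]) -> str:
    """Return the HTML for `source`, converting only on a cache miss."""
    # Key the cache on the file contents so a changed attachment is reconverted.
    digest = hashlib.sha256(source.read_bytes()).hexdigest()
    cached = cache_dir / f"{digest}.html"
    if cached.exists():
        return cached.read_text(encoding="utf-8")

    # `convert` is a hypothetical wrapper around wvHtml, pdftohtml, etc.
    html = convert(source)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so a crash never leaves a half-written entry.
    tmp = cached.with_suffix(".tmp")
    tmp.write_text(html, encoding="utf-8")
    tmp.replace(cached)
    return html
```

With this shape, the first request for a document pays the conversion cost and every later request is a plain file read.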

The system works pretty well, except for occasional bugs in the conversion software, which cause hangs or empty HTML versions of the document.
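One way to contain the two failure modes mentioned above (hangs and empty output) is to run each converter with a timeout and to reject blank results. The sketch below assumes the converter writes its HTML to standard output; `run_converter` and the 30-second default are hypothetical and not taken from the existing code.

```python
import subprocess
from typing import Optional, Sequence


def run_converter(cmd: Sequence[str], timeout_s: float = 30.0) -> Optional[str]:
    """Run a converter command; return its stdout, or None on failure."""
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None  # the converter hung; subprocess.run has already killed it
    except OSError:
        return None  # binary not installed or not executable
    if result.returncode != 0:
        return None
    html = result.stdout.decode("utf-8", errors="replace")
    # Treat whitespace-only output as a failed conversion, not an empty page.
    return html if html.strip() else None
```

A `None` result can then be shown as "conversion failed" and kept out of the cache, so that a bad conversion is not served repeatedly.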

Problems

Conversion software bugs leading to corrupt data
Conversions are not pretty (e.g. images in documents are not preserved nicely)
Quite limited range of supported source formats (but 99% of the documents actually used are in supported formats, i.e. doc and pdf)

Alternative

There's an alternative system used by US FOI site MuckRock, which displays the documents in a nice viewer.

Their system uses documentcloud.org, a (currently free) service for hosting journalists' source documents. The software behind it is open source and available at https://github.com/documentcloud.

The main components are:

docsplit, a Ruby frontend for OpenOffice (document conversion), Tesseract (OCR), pdftk (splitting a single PDF into one PDF per page), and GraphicsMagick (thumbnails/images of pages)
DocumentViewer from NYT. Most importantly, it supports annotations on the document (e.g. this senate bill)
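To make the docsplit stage more concrete, the sketch below builds (but does not run) the command lines that might be issued for one document. The subcommands and flags (`pdf`, `text`, `pages`, `images`, `--output`, `--size`, `--format`) are taken from docsplit's documented command-line usage as best recalled and should be checked against the installed version. The helper `docsplit_commands`, the `700x` size, and the assumption that `docsplit pdf` writes `<name>.pdf` into the output directory are illustrative only.

```python
from pathlib import Path
from typing import List


def docsplit_commands(source: Path, out_dir: Path) -> List[List[str]]:
    """Build, but do not run, the docsplit invocations for one document."""
    out = str(out_dir)
    commands: List[List[str]] = []
    if source.suffix.lower() == ".pdf":
        pdf = str(source)
    else:
        # Non-PDF input goes through the office suite to produce a PDF first.
        # Assumption: the result lands in out_dir under the same base name.
        commands.append(["docsplit", "pdf", "--output", out, str(source)])
        pdf = str(out_dir / (source.stem + ".pdf"))
    commands += [
        # Text extraction (docsplit can fall back to OCR for scanned pages).
        ["docsplit", "text", "--output", out, pdf],
        # Split into one PDF per page.
        ["docsplit", "pages", "--output", out, pdf],
        # Render page images for the viewer.
        ["docsplit", "images", "--size", "700x", "--format", "png",
         "--output", out, pdf],
    ]
    return commands
```

Keeping command construction separate from execution means the pipeline can be inspected or logged without OpenOffice, Tesseract, or docsplit being installed.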

Benefits:

Much nicer-looking conversions
Reasonably good interface for navigating around documents
OCRed text wherever text extraction is not possible
Support for every format that OpenOffice can read
All documents converted to PDF as part of process
Annotations possible

Drawbacks:

Requires new code
Likely to have higher processing overheads (needs a running headless OpenOffice, and always requires a PDF extraction step)
Is it indexable by search engines? The NYT blog post promises to fix this.
