Improved document conversion

sebbacon edited this page Jan 11, 2011 · 10 revisions

Current setup

Office and PDF documents are currently converted to HTML, using wvHtml, xlhtml, pdftohtml, and unrtf. Other formats are not currently supported.

The documents are converted when the "as HTML" link is followed. The converted HTML is cached on the filesystem, so subsequent clicks on the link serve the cached version directly.

The system works pretty well, except for occasional bugs in the conversion software, which cause hangs or empty HTML versions of the document.
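
The convert-on-first-request behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the site's actual code: the function name `cached_html`, the SHA-1 cache key, and the `convert` callable (standing in for a wrapper around wvHtml, pdftohtml, etc.) are all assumptions made for the example.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_html(attachment: bytes, cache_dir: Path,
                convert: Callable[[bytes], str]) -> str:
    """Return the HTML for an attachment, converting only on a cache miss.

    `convert` is a hypothetical stand-in for whichever external converter
    applies to the file type.
    """
    # Key the cache on the attachment's content (an assumption; the real
    # system may key on something else, such as a message or attachment id).
    key = hashlib.sha1(attachment).hexdigest()
    cache_file = cache_dir / f"{key}.html"

    if cache_file.exists():
        # Later requests are served straight from the filesystem.
        return cache_file.read_text(encoding="utf-8")

    # First request: run the (possibly slow) conversion and store the result.
    html = convert(attachment)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(html, encoding="utf-8")
    return html
```

One consequence of this design is that a bad conversion (for example an empty HTML result) is cached too, so a real implementation would probably want to validate the output before writing it.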

Problems

  • Conversion software bugs leading to corrupt data
  • Conversions are not pretty (e.g. images in documents are not preserved nicely)
  • Quite a limited range of supported source formats (but 99% of those used are supported, i.e. doc and pdf)

Alternative

There's an alternative system used by US FOI site MuckRock, which displays the documents in a nice viewer.

Their system uses documentcloud.org, a (currently free) system for journalists' source documents. The software to do this is open source, and available at https://github.com/documentcloud.

The main components are:

  • docsplit, a Ruby frontend for OpenOffice (document conversion), Tesseract (OCR), pdftk (splitting a single PDF into one PDF per page), and GraphicsMagick (thumbnails/images of pages)
  • DocumentViewer from NYT. Most importantly, supports annotations on the document (e.g. this senate bill)
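
As a rough idea of what driving docsplit might look like, the sketch below builds the command lines for the convert, extract-text, and render-images steps without running them. The helper name `docsplit_commands`, the image size and format, and the assumption that the converted PDF lands in the output directory under the source file's base name are illustrative guesses; check the docsplit documentation before relying on any of them.

```python
from pathlib import PurePosixPath
from typing import List


def docsplit_commands(source: str, out_dir: str) -> List[List[str]]:
    """Build (but do not run) docsplit invocations for one document.

    Hypothetical helper: the steps and option values are assumptions
    chosen for illustration.
    """
    # Assumed location of the PDF produced by the first step.
    pdf = str(PurePosixPath(out_dir) / (PurePosixPath(source).stem + ".pdf"))
    return [
        # 1. Convert the office document to PDF (via OpenOffice).
        ["docsplit", "pdf", source, "--output", out_dir],
        # 2. Extract text from the PDF (docsplit can fall back to OCR).
        ["docsplit", "text", pdf, "--output", out_dir],
        # 3. Render page images for the viewer.
        ["docsplit", "images", pdf, "--size", "700x",
         "--format", "gif", "--output", out_dir],
    ]
```

Each list could then be passed to something like `subprocess.run(cmd, check=True)`, ideally with a timeout, given the hangs seen with the current converters.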

Benefits:

  • Much nicer-looking conversions
  • Reasonably good interface for navigating around documents
  • OCRed text wherever text extraction is not possible
  • Full support for all supported OpenOffice formats
  • All documents converted to PDF as part of process
  • Annotations possible

Drawbacks:

  • Requires new code
  • Likely to have higher processing overhead (needs a running headless OpenOffice, and always requires a PDF extraction step)
  • Is it indexable by search engines? The NYT blog post promises to fix this.