text-extractor

Extracts text from Office and PDF files using Apache POI and PDFxStream, as a very, very tiny alternative to Apache Tika.

This library obviously does NOT replace Apache Tika. It only extracts text from Word, Excel, RTF and PDF files. It is based on the code from the blog article "Extract Text From pdf, office files(.doc, .ppt, .xls), open office files, .rtf, and text/plain files in Java", updated to use the latest Apache POI and PDFxStream versions available as of 06/10/2015.

  • org.apache.poi, 3.12
  • com.snowtide.pdfxstream, 3.1.2
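
Neither dependency above is needed for the RTF case, since the JDK ships an RTF parser of its own. The sketch below shows one way plain text could be pulled out of an RTF stream using only `javax.swing.text.rtf.RTFEditorKit`. It is an assumption that this library handles RTF this way; the class name `RtfTextSketch` and the method `extractText` are hypothetical and not part of this project's API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import javax.swing.text.BadLocationException;
import javax.swing.text.DefaultStyledDocument;
import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;

// Hypothetical helper, not part of text-extractor: illustrates RTF-to-text
// extraction with the JDK's built-in RTF parser only.
final class RtfTextSketch {

    private RtfTextSketch() {
    }

    // Parses the RTF markup from the stream and returns its plain text,
    // trimmed of leading and trailing whitespace.
    static String extractText(InputStream in) {
        RTFEditorKit kit = new RTFEditorKit();
        Document doc = new DefaultStyledDocument();
        try {
            // Load the RTF content into the in-memory document at offset 0.
            kit.read(in, doc, 0);
            // Read back the whole document as unformatted text.
            return doc.getText(0, doc.getLength()).trim();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (BadLocationException e) {
            throw new IllegalStateException("Could not read parsed RTF text", e);
        }
    }
}
```

A caller would pass any `InputStream` over RTF data, for example a `FileInputStream`, and receive the text as a `String`. Word, Excel and PDF files would instead go through Apache POI and PDFxStream, which are not shown here.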