In this project, we batch-convert PDF files to text and extract data from them without using PyPDF2 or PyPDF4.
We're going to achieve that by:
- Using the pdftotext converter from Xpdf to convert PDF files to text files
- Using regular expressions to extract data
- Performing data cleaning using pandas
- Exporting the result to an Excel file
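The extraction, cleaning, and export steps could be sketched roughly as below. This is a minimal illustration, not the code in parse_payslips.py: the labels "Pay Date" and "Net Pay", the field names, and both function names are hypothetical and have to be adapted to your own payslip layout.

```python
import re

# Hypothetical patterns: the real labels depend on your payslip layout.
PATTERNS = {
    "pay_date": re.compile(r"Pay Date:\s*(\d{2}/\d{2}/\d{4})"),
    "net_pay": re.compile(r"Net Pay:\s*([\d,]+\.\d{2})"),
}


def extract_fields(text):
    """Return the first match of each pattern, or None when a label is absent."""
    row = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        row[name] = match.group(1) if match else None
    return row


def export_rows(rows, xlsx_path):
    """Clean the extracted rows with pandas and write them to an Excel file."""
    import pandas as pd  # imported here so extraction works without pandas

    df = pd.DataFrame(rows)
    # Drop thousands separators, then convert to numbers (missing values become NaN).
    amounts = df["net_pay"].astype(str).str.replace(",", "", regex=False)
    df["net_pay"] = pd.to_numeric(amounts, errors="coerce")
    df.to_excel(xlsx_path, index=False)  # needs an Excel engine such as openpyxl
```

Call `extract_fields` once per converted text file, collect the returned dictionaries in a list, and pass that list to `export_rows`.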
Short Answer: I got this error:
TypeError: object of type 'IndirectObject' has no len()
Long Answer: If PyPDF4 had worked, I would never have had the chance to explore other approaches. I searched Stack Overflow but couldn't find a solution for this error. Surely someone else has hit the same problem, but I found no posted fix.
I was not willing to manually copy and paste the information from 52 of my payslips. Isn't that what programs are used for?
Requires pandas. Check out requirements.txt.
Converting PDF to text using Xpdf's pdftotext is really simple.
This command-line tool converts one PDF per call, so we batch-convert by running it once for each file:
pdftotext source.pdf dest.txt
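To run that command over a whole folder, one option is a small Python wrapper like the sketch below. It assumes `pdftotext` is on your PATH and that the PDFs sit in a single directory. The function names and folder names are illustrative, not taken from parse_payslips.py.

```python
import subprocess
from pathlib import Path


def build_commands(pdf_dir, txt_dir, exe="pdftotext"):
    """Return one [exe, source.pdf, dest.txt] command per PDF in pdf_dir."""
    pdf_dir, txt_dir = Path(pdf_dir), Path(txt_dir)
    return [
        [exe, str(pdf), str(txt_dir / (pdf.stem + ".txt"))]
        for pdf in sorted(pdf_dir.glob("*.pdf"))  # matches the lowercase .pdf suffix
    ]


def convert_all(pdf_dir, txt_dir, exe="pdftotext"):
    """Run pdftotext once per PDF; raises CalledProcessError if a call fails."""
    Path(txt_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_commands(pdf_dir, txt_dir, exe):
        subprocess.run(cmd, check=True)
```

For example, `convert_all("payslips", "payslips_txt")` would write one `.txt` file per PDF into `payslips_txt`. Keeping the command construction separate from the execution makes it easy to print the commands first and check them before anything runs.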
Script Link: parse_payslips.py