TextHarvester is a Python script that automates text extraction from several document formats: PowerPoint, PDF, Word, and plain text files. The extracted content is compiled into a single, organized Word document for easy reference and analysis.
- Extracts text from PowerPoint (.pptx), PDF (.pdf), Word (.docx, .doc), and Text (.txt) files.
- Compiles all extracted text into a single organized Word document.
- Includes error handling and logging.
- Shows a progress bar for tracking the extraction process.
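The extraction and error-handling features can be illustrated with a minimal sketch. This is not the script's actual code: the `EXTRACTORS` registry, the `extract_text` and `read_plain_text` helpers, and the logger name are all hypothetical, and only the plain-text case is implemented so the example runs with the standard library alone.

```python
import logging
from pathlib import Path

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
log = logging.getLogger("harvest_sketch")  # hypothetical logger name


def read_plain_text(path):
    """Return the contents of a .txt file, replacing undecodable bytes."""
    return Path(path).read_text(encoding="utf-8", errors="replace")


# Hypothetical registry mapping a file suffix to an extractor function.
# Only .txt is implemented here; handlers for .pptx, .pdf, .docx and .doc
# would be registered the same way, backed by third-party libraries.
EXTRACTORS = {".txt": read_plain_text}


def extract_text(path):
    """Return extracted text, or None if the file is unsupported or fails."""
    path = Path(path)
    extractor = EXTRACTORS.get(path.suffix.lower())
    if extractor is None:
        log.warning("Skipping unsupported file: %s", path.name)
        return None
    try:
        return extractor(path)
    except Exception as exc:  # one bad file should not stop the whole run
        log.error("Failed to extract %s: %s", path.name, exc)
        return None
```

Returning `None` instead of raising is one possible design: a single unreadable document is logged and skipped while the remaining files are still processed.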
- Python 3.6+
- pip (Python package manager)
- Clone the repository:
    git clone https://github.com/csb21jb/TextHarvester.git
    cd TextHarvester
- Install dependencies:
    pip3 install python-pptx pdfplumber python-docx tqdm colorama
    pip3 install extract-msg
    pip3 install textract --no-deps
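After installing, it can be useful to confirm that each package is importable. The sketch below is an optional check and not part of TextHarvester; `REQUIRED_MODULES` and `missing_packages` are hypothetical names. Note that several pip package names differ from the module names used in `import` statements.

```python
import importlib.util
import sys

# Maps each pip package name from the install commands to the module
# name used when importing it.
REQUIRED_MODULES = {
    "python-pptx": "pptx",
    "pdfplumber": "pdfplumber",
    "python-docx": "docx",
    "tqdm": "tqdm",
    "colorama": "colorama",
    "extract-msg": "extract_msg",
    "textract": "textract",
}


def missing_packages(modules=REQUIRED_MODULES):
    """Return the pip names whose import module cannot be found."""
    return [
        pip_name
        for pip_name, module in modules.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    if sys.version_info < (3, 6):
        sys.exit("Python 3.6 or newer is required.")
    missing = missing_packages()
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

Because `textract` is installed with `--no-deps`, an import check like this only confirms that the package itself is present, not that every optional backend it may call is available.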
- Place the Python script (e.g., TextHarvester.py) in the directory containing the documents you want to extract text from.
- Run the Python script:
    python3 TextHarvester.py
- The extracted text is saved in a file named combined_output.docx.
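To illustrate how extracted text might be compiled into one Word file, here is a minimal sketch using python-docx. The layout (one level-1 heading per source file, followed by one paragraph per non-empty line) is an assumption made for illustration, not the script's confirmed output format, and `build_sections` and `write_docx` are hypothetical helper names.

```python
from pathlib import Path


def build_sections(texts):
    """Turn {filename: text} into [(heading, [paragraphs])], sorted by name."""
    sections = []
    for name in sorted(texts):
        # Keep only non-empty lines, stripped of surrounding whitespace.
        paragraphs = [
            line.strip() for line in texts[name].splitlines() if line.strip()
        ]
        sections.append((name, paragraphs))
    return sections


def write_docx(sections, out_path="combined_output.docx"):
    """Write the sections to a Word file using python-docx."""
    # Imported here so build_sections works even without python-docx.
    from docx import Document

    doc = Document()
    for heading, paragraphs in sections:
        doc.add_heading(heading, level=1)  # assumed: one heading per file
        for paragraph in paragraphs:
            doc.add_paragraph(paragraph)
    doc.save(str(out_path))
    return Path(out_path)
```

For example, `write_docx(build_sections({"notes.txt": "First line\nSecond line"}))` would produce a document with a "notes.txt" heading and two paragraphs beneath it.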
Here's an example of the expected output structure in the combined_output.docx file: