TextHarvester

Overview

TextHarvester is a Python script that automates text extraction from several document formats: PowerPoint, PDF, Word, and plain text files. It compiles the extracted content into a single, organized Word document for easy reference and analysis.

Features

  • Extracts text from PowerPoint (.pptx), PDF (.pdf), Word (.docx, .doc), and Text (.txt) files.
  • Compiles all extracted text into a single organized Word document.
  • Includes robust error handling and logging mechanisms.
  • Displays a progress bar for tracking the extraction process.
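
The per-format extraction and error handling listed above could look roughly like the following. This is a minimal sketch, not the script's actual code: the function names (`read_txt`, `read_docx`, `read_pdf`, `read_pptx`, `extract_text`) and the logger name are hypothetical, and legacy `.doc` files are left out because they need a separate tool such as textract. Third-party imports sit inside each reader so that one missing library only affects its own format.

```python
import logging
from pathlib import Path

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("textharvester_sketch")


def read_txt(path):
    # Plain text: replace undecodable bytes instead of failing.
    return Path(path).read_text(encoding="utf-8", errors="replace")


def read_docx(path):
    import docx  # provided by the python-docx package
    return "\n".join(p.text for p in docx.Document(str(path)).paragraphs)


def read_pdf(path):
    import pdfplumber
    with pdfplumber.open(str(path)) as pdf:
        # extract_text() can return None for pages without a text layer.
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def read_pptx(path):
    from pptx import Presentation  # provided by the python-pptx package
    lines = []
    for slide in Presentation(str(path)).slides:
        for shape in slide.shapes:
            if shape.has_text_frame:
                lines.append(shape.text_frame.text)
    return "\n".join(lines)


# Map each supported extension to its reader.
READERS = {
    ".txt": read_txt,
    ".docx": read_docx,
    ".pdf": read_pdf,
    ".pptx": read_pptx,
}


def extract_text(path):
    """Return the text of one file, or None if it is unsupported or unreadable."""
    reader = READERS.get(Path(path).suffix.lower())
    if reader is None:
        logger.warning("Skipping unsupported file: %s", path)
        return None
    try:
        return reader(path)
    except Exception as exc:  # log the failure and move on to the next file
        logger.error("Failed to read %s: %s", path, exc)
        return None
```

Returning `None` on failure lets a caller skip a damaged file and keep processing the rest of the directory.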

Installation

Prerequisites

  • Python 3.6+
  • Pip (Python package manager)

Setup Instructions

  1. Clone the repository:

    git clone https://github.com/csb21jb/TextHarvester.git
    cd TextHarvester
  2. Install dependencies:

    pip3 install python-pptx pdfplumber python-docx tqdm colorama
    pip3 install extract-msg
    pip3 install textract --no-deps
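
To confirm the installation worked, a short check along these lines may help. It is a sketch that is not part of the repository; the helper name `missing_packages` is hypothetical. It maps each pip package from the commands above to the module name it installs and reports any that Python cannot find.

```python
from importlib.util import find_spec

# pip package name -> importable module name
REQUIRED = {
    "python-pptx": "pptx",
    "pdfplumber": "pdfplumber",
    "python-docx": "docx",
    "tqdm": "tqdm",
    "colorama": "colorama",
    "extract-msg": "extract_msg",
    "textract": "textract",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be located."""
    return [name for name, module in required.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All dependencies found.")
```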

Usage

  1. Place the Python script (e.g., TextHarvester.py) in a directory containing the documents you want to extract text from.

  2. Run the Python script:

    python3 TextHarvester.py
  3. The extracted text will be saved in a file named combined_output.docx.
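
The steps above amount to scanning the script's directory for supported files and writing their text into one Word document. The sketch below illustrates that flow under stated assumptions: `collect_files` and `build_document` are hypothetical names, and the one-heading-per-file layout is a guess at the output structure, not a confirmed detail of the script.

```python
from pathlib import Path

# Extensions listed under Features.
SUPPORTED = {".pptx", ".pdf", ".docx", ".doc", ".txt"}


def collect_files(directory, output_name="combined_output.docx"):
    """Return supported files in the directory, sorted, excluding the output file."""
    return sorted(
        p
        for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED and p.name != output_name
    )


def build_document(sections, output_path="combined_output.docx"):
    """Write (title, text) pairs into one Word document, one heading per source file."""
    import docx  # provided by the python-docx package

    document = docx.Document()
    for title, text in sections:
        document.add_heading(title, level=1)
        document.add_paragraph(text)
    document.save(str(output_path))
```

Excluding `combined_output.docx` from the scan keeps a second run from feeding the previous output back into the new document.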

Example Output

Here's an example of the expected output structure in the combined_output.docx file:

[Video: Recording.2024-05-08.212052.mp4]
