OpenAI based PDF summarizer
To use this -
- git clone
- source .venv/bin/activate
- pip install -r requirements.txt
- Set OPENAI_API_KEY to your API key
- python summarizer.py
Enjoy. The code has comments and assumptions. feel free to change it. For example, I assume 16k contenxt window; and use gpt-3.5-turbo-1106 which has a 16k context window. I assume 8k chunk size; which is sufficient to contain my preable, context, previous summaries and next chunk for hierarchical summarization.
Summarization is performed as a hierarchical chunking.
- Pypdf is used to extract text from PDF
- It is then chunked with auto_chunker from https://github.com/VectifyAI/LargeDocumentSummarization
- It is then fed to openAI model for summarization as follows
- First pass: chunk1 + context
- 2nd pass: summary of first pass as context + next chunk
- It is then successively done until all chunks are summarized
- Final summary send to output window
App front end is basic Gradio - no biggie. Feel free to modify anything to your needs.
Next up: More extensive summarization of PDF to include tables, images etc.
Sample program to read images from PDF and convert them to base64encoded uses saample abc.pdf
Sample program to read an image, convert it to base64encoded, send it to OpenAI GPT4V for summarization of image uses sample CT.png