A framework for building multi-modal-first document retrievers
PSI King - King of the Senses from Psychonauts 2
- a `Document` contains a list of nodes (`document.nodes`)
- each node can be one of the following types
  - `TextNode`
  - `ImageNode`
  - `TableNode`
- schemas are defined here
- detailed descriptions are available here
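
To make the `Document` / node relationship concrete, here is a minimal sketch using plain dataclasses. It is not the framework's actual schema: the field names (`text`, `image_path`, `rows`) are assumptions chosen for illustration, and only the class names and `document.nodes` come from the description above.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class TextNode:
    text: str  # hypothetical field name


@dataclass
class ImageNode:
    image_path: str  # hypothetical field name


@dataclass
class TableNode:
    rows: List[List[str]]  # hypothetical field name


# A node is any one of the three types listed above.
Node = Union[TextNode, ImageNode, TableNode]


@dataclass
class Document:
    # Ordered list of nodes, mirroring `document.nodes`.
    nodes: List[Node] = field(default_factory=list)


doc = Document(nodes=[
    TextNode(text="Quarterly revenue grew."),
    ImageNode(image_path="page_1.png"),
    TableNode(rows=[["year", "revenue"], ["2023", "100"]]),
])

# Select nodes by type, e.g. to route each modality to a different embedder.
text_nodes = [n for n in doc.nodes if isinstance(n, TextNode)]
```

Keeping all node types in one ordered list preserves reading order, while `isinstance` checks let later stages handle each modality separately.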
Document Ingestion Flow example:
- (Doc) Collection -> Extraction -> Transformation -> Index(?)
  - Extraction: read file into `Document` instance
  - Transformation: merging/chunking/filtering
  - Index: Embedding & inserting into searchable DB
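
The three stages can be sketched as plain functions chained together. This is a toy illustration under stated assumptions, not the framework's API: `extract`, `transform`, and `index` are hypothetical names, the "embedding" is a placeholder computed from text length, and the "DB" is an in-memory dict.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TextNode:
    text: str


@dataclass
class Document:
    nodes: List[TextNode] = field(default_factory=list)


def extract(raw_text: str) -> Document:
    """Extraction: read raw content into a Document, one node per non-empty line."""
    lines = [line.strip() for line in raw_text.splitlines()]
    return Document(nodes=[TextNode(line) for line in lines if line])


def transform(doc: Document, min_chars: int = 5, max_chars: int = 40) -> Document:
    """Transformation: drop very short nodes, then merge neighbours up to max_chars."""
    kept = [n for n in doc.nodes if len(n.text) >= min_chars]
    merged: List[TextNode] = []
    for node in kept:
        if merged and len(merged[-1].text) + 1 + len(node.text) <= max_chars:
            merged[-1] = TextNode(merged[-1].text + " " + node.text)
        else:
            merged.append(node)
    return Document(nodes=merged)


def index(doc: Document, store: Dict[int, dict]) -> None:
    """Index: 'embed' each node and insert it into a toy searchable store."""
    for node in doc.nodes:
        # Placeholder vector; a real pipeline would call an embedding model here.
        vector = [float(len(node.text)), float(node.text.count(" "))]
        store[len(store)] = {"text": node.text, "vector": vector}


store: Dict[int, dict] = {}
raw = (
    "Intro\nab\nRevenue grew in 2023.\nCosts fell.\n"
    "A much longer paragraph about finance outlook."
)
index(transform(extract(raw)), store)
```

Because each stage takes and returns a `Document`, transformations can be reordered or swapped without touching extraction or indexing.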
- Reading PDF files and indexing into qdrant DB for retrieval
  - data: real-life pdf files from `allganize-RAG-Evaluation-Dataset-KO`
  - parse using docling & pdf2image
  - models: `Visualized_BGE` (bge-m3) + Qdrant/BM42 (all_miniLM_L6_v2_with_attentions)
  - db: qdrant (dense + sparse)
- ingestion pipeline: notebook (3_3_allganize_ingestion_multimodal_hybrid)
  - use 'finance' domain PDF files
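
A dense + sparse setup returns two ranked lists per query that have to be combined somehow. The sketch below uses reciprocal rank fusion (RRF), which is one common choice; the notes above do not say which fusion method the notebook uses, so treat this as an assumption. The function name, the `k = 60` constant, and the hit IDs are illustrative only.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per ID."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_hits = ["p3", "p1", "p7"]   # hypothetical dense-vector results
sparse_hits = ["p1", "p9", "p3"]  # hypothetical sparse-vector results
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

RRF only looks at ranks, so it sidesteps the problem that dense similarity scores and sparse scores live on different scales.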
- pgvector docs experiments
  - use mecab-ko + `textsearch_ko` to enable korean `tsvector` calculation
- qdrant docs experiments
  - build qdrant with cjk language support for korean tokenization
- A lot of the structure of this project was inspired by llama-index
- document parsing heavily utilizes docling
`PSI King` is a character from Psychonauts 2
History of this framework's development is recorded below