Creating and Storing Embeddings for PostgreSQL Data
Ingestion Process
- Data Sources: The data sources used to build the knowledge base for a RAG architecture are foundational. They need to be comprehensive, high-quality sources that accurately cover the domains and topics the system will be queried on. This typically means selecting, with input from subject matter experts (SMEs), the subset of an enterprise's structured and unstructured data repositories that is relevant to your use case.
- Data Cleaning: Raw data is often noisy, containing irrelevant content, outdated information, and duplicates. This noise undermines RAG: the retriever surfaces irrelevant or inaccurate passages from the knowledge base, which in turn degrades generation. For example, enterprise knowledge in Jira or Confluence often carries user comments and version histories that do not belong in the knowledge base. Effective data cleaning techniques, such as filtering and deduplication, are crucial before feeding data into the vector store (see the deduplication sketch after this list).
- Privacy/PII: Enterprise datasets often contain sensitive and private information. As part of data preparation, enterprises need to define how this data will be treated based on the use case and the intended end user. In an internal use case, it may be acceptable for the LLM to surface information about individuals, for example, answering "Who is the sales rep for the Walmart account?" For external use cases, however, exposing information about individuals could result in privacy violations, and even with guardrails in place, adversarial attacks can still leak sensitive information. Detecting, filtering, redacting, or substituting PII with synthetic data where appropriate protects privacy while preserving data utility and reducing compliance risk (a simple redaction sketch follows this list).
- Text Extraction: Enterprise data comes in many formats, including PDFs, PowerPoint presentations, and images. Extracting clean, usable text from these unstructured and semi-structured sources is crucial for building a comprehensive knowledge base. The right approach depends on a document's structure, modalities, and complexity: simple cases can be handled with standard text extraction tools (a PDF extraction sketch appears after this list), while complex documents may require a combination of automated tools and human annotation.
- Text Normalization: Data from multiple sources often lacks consistency in spelling, abbreviations, numeric formats, and referencing styles, which can cause the same concept to be treated as distinct entities and matched poorly by the model. Applying normalization rules to standardize spelling, grammar, measurements, and general nomenclature is essential to get the most utility out of your text data (a minimal normalization sketch follows the list).
- Chunking Strategy: Following the above steps, documents need to be split into shorter "chunks" or passages that the retrieval component can match to queries and pass to the language model. The objective is to break documents into retrievable units that preserve complete, relevant context around key information. Common methods include fixed-size chunking, document-based chunking, and semantic chunking (a fixed-size example follows this list). Human judgment about whether text belongs in an existing chunk or should start a new one is still considered the gold standard, and an emerging, more advanced method known as "agentic chunking" attempts to mimic this human behavior. The ideal chunk size balances sufficient context against efficiency, and methods like summarization or hierarchical chunking can also be useful for long documents.
- Entity Recognition & Tagging: While the chunks derived from your knowledge bases form the core of your vector store, enriching them with metadata such as source details, topics, and key entities can significantly improve a RAG model's accuracy. Named Entity Recognition (NER) for people, organizations, products, and concepts, together with entity linking, helps the model connect passages and improves retrieval relevance (see the tagging sketch below). This can be done systematically using a data annotation platform that combines automated techniques with human-in-the-loop validation, including domain experts where required, to ensure annotation accuracy and consistency.
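To make the cleaning step concrete, below is a minimal exact-deduplication sketch in Python. It only catches duplicates that differ in case or whitespace; near-duplicate detection would call for techniques like MinHash or embedding similarity, and the sample documents are invented for illustration.

```python
import hashlib

def normalize_for_dedup(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())

def deduplicate(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, keyed by a content hash."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(normalize_for_dedup(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = ["Reset your VPN token.", "Reset your  VPN  token.", "Rotate your API key."]
print(len(deduplicate(docs)))  # 2 -- the whitespace variant is dropped
```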
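For the privacy step, here is a deliberately simple regex-based redaction sketch. Real deployments should use purpose-built PII detection (for example, NER-based detectors or a library such as Microsoft Presidio); the patterns below miss names and many other identifier formats.

```python
import re

# Illustrative patterns only; production PII detection needs broader coverage.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\b(?:\+?\d{1,2}[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact_pii(text: str) -> str:
    """Replace each matched PII span with a typed placeholder token."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact_pii("Reach Jane at jane.doe@example.com or 555-867-5309."))
# -> "Reach Jane at [EMAIL] or [PHONE]."  (note: the name itself is not caught)
```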
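For text extraction, a basic sketch using the pypdf library (one tool choice among many) handles the simple case of digital PDFs. Scanned documents would additionally need OCR, and the file name here is hypothetical.

```python
from pypdf import PdfReader  # pip install pypdf

def extract_pdf_text(path: str) -> str:
    """Concatenate the text layer of each page; empty pages contribute ""."""
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

text = extract_pdf_text("employee_handbook.pdf")  # hypothetical file
```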
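A minimal normalization sketch follows, assuming the abbreviation glossary comes from your SMEs (the entries here are invented):

```python
import re
import unicodedata

# Hypothetical glossary; in practice this is built with domain experts.
ABBREVIATIONS = {r"\bapprox\.": "approximately", r"\bdept\.": "department"}

def normalize(text: str) -> str:
    """Standardize Unicode forms, whitespace, and known abbreviations."""
    text = unicodedata.normalize("NFKC", text)  # unify Unicode variants
    text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace
    for pattern, expansion in ABBREVIATIONS.items():
        text = re.sub(pattern, expansion, text, flags=re.IGNORECASE)
    return text

print(normalize("Budget\u00a0is approx. $5k per dept."))
# -> "Budget is approximately $5k per department"
```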
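Here is a sketch of the simplest chunking strategy, fixed-size chunking with overlap, which keeps some shared context across chunk boundaries; the sizes are illustrative defaults, not recommendations:

```python
def chunk_fixed(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word-based chunks of roughly chunk_size words.
    Consecutive chunks share `overlap` words so sentences that straddle a
    boundary keep context on both sides; the final chunk may be shorter."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

chunks = chunk_fixed(text)  # `text` from the extraction sketch above
```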
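For entity tagging, a sketch using spaCy's off-the-shelf NER (one possible automated technique; a data annotation platform or fine-tuned model would typically replace or validate it) attaches entities to each chunk as metadata:

```python
import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def tag_chunk(chunk: str) -> dict:
    """Attach named entities to a chunk as retrieval metadata."""
    doc = nlp(chunk)
    return {"text": chunk,
            "entities": sorted({(ent.text, ent.label_) for ent in doc.ents})}

print(tag_chunk("Acme Corp. signed a three-year deal with Walmart in January."))
# entities would include pairs like ("Walmart", "ORG") and ("January", "DATE")
```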
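Finally, tying back to the section title: once chunks are cleaned, normalized, and tagged, they can be embedded and stored in PostgreSQL. The sketch below is one possible setup, assuming the pgvector extension plus the pgvector and sentence-transformers Python packages; the connection string and table schema are hypothetical, and it reuses `chunk_fixed` and `tag_chunk` from the sketches above.

```python
import json

import psycopg2
from pgvector.psycopg2 import register_vector          # pip install pgvector
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dim embeddings

conn = psycopg2.connect("dbname=rag user=rag")   # hypothetical connection string
with conn, conn.cursor() as cur:                 # commits on success
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS chunks (
            id        bigserial PRIMARY KEY,
            content   text NOT NULL,
            metadata  jsonb,
            embedding vector(384)
        )
    """)

register_vector(conn)  # adapt numpy arrays to the Postgres vector type

with conn, conn.cursor() as cur:
    for chunk in (tag_chunk(c) for c in chunk_fixed(text)):  # earlier sketches
        cur.execute(
            "INSERT INTO chunks (content, metadata, embedding) VALUES (%s, %s, %s)",
            (chunk["text"],
             json.dumps({"entities": chunk["entities"]}),
             model.encode(chunk["text"])),
        )
```

At query time, the same model embeds the user's question, and pgvector's distance operators (for example, `<=>` for cosine distance) retrieve the nearest chunks.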