Feature: Add support for knowledge graphs #129

Open
MB-Finski opened this issue Jan 9, 2025 · 1 comment
Labels
enhancement New feature or request

Comments

MB-Finski commented Jan 9, 2025

The traditional RAG approach has difficulty extracting complex relationships or overarching themes from the source material, because the material is chunked and only some of those chunks are later retrieved. This limits the usefulness of RAG for more complex, in-depth topics that cannot be addressed by retrieving isolated chunks of the source material. Also, real-world data sources may contain conflicting and unreliable information, which may confuse an LLM trying to generate an answer without being aware of the broader context.

Knowledge graphs can help solve this issue by incrementally building a graph structure from the source data, where each edge in the graph represents a contextual relationship between separate sets of facts or topics. This methodology allows not only retrieving the relevant chunk of source material but also the relevant context for that source material.
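
To make the idea concrete, here is a minimal sketch of such a structure. It is not taken from any existing implementation: the `KnowledgeGraph` class, its method names, and the example triples are all hypothetical, and the triples are supplied by hand where a real pipeline would use an LLM extraction step.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Toy graph: entities are nodes, each edge remembers its source chunk."""

    def __init__(self):
        self.chunks = {}                # chunk id -> chunk text
        self.edges = defaultdict(list)  # entity -> [(subject, relation, object, chunk id)]

    def add_chunk(self, chunk_id, text, triples):
        # In a real pipeline the triples would come from an LLM extraction
        # step; here they are supplied by hand (assumption for illustration).
        self.chunks[chunk_id] = text
        for subj, rel, obj in triples:
            edge = (subj, rel, obj, chunk_id)
            self.edges[subj].append(edge)
            self.edges[obj].append(edge)

    def context_for(self, entity):
        # Chunk ids of every relationship touching the entity, so retrieval
        # can return related chunks and not only the best-matching one.
        return sorted({edge[3] for edge in self.edges.get(entity, [])})


kg = KnowledgeGraph()
kg.add_chunk("c1", "Drug A inhibits enzyme X.", [("Drug A", "inhibits", "Enzyme X")])
kg.add_chunk("c2", "Enzyme X regulates pathway Y.", [("Enzyme X", "regulates", "Pathway Y")])
print(kg.context_for("Enzyme X"))  # ['c1', 'c2']
```

A query about "Enzyme X" pulls in both chunks through the graph, even though neither chunk alone states how Drug A relates to Pathway Y.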

Potential use cases

  • An expert RAG agent on a constrained topic with complex factual relationships
    • A domain expert on scientific or technical topics with relevant literature as source material
    • I've personally found traditional RAG almost useless here, since the relationships between different source documents are the actually interesting bits of information
    • A support bot based on 'dirty' datasets like old/existing support logs where some/most of the information may be unreliable or dated
  • A "project-based" expert RAG
    • Suppose a team works on a project (or separate projects), adding material to the RAG data source over time. Some of this material may be conflicting and ambiguous, or may even change as the project progresses. This is the type of real-world data in which graph RAG would excel at finding overarching relationships and themes. Given suitable source material (besides files, this could be to-do lists, calendar entries, etc.), you could ask the context chat questions like: What is the current state of the project? What is currently the most important blocking issue for the completion of the project?

Difficulties/Limitations

  • Inserting data into a knowledge graph requires LLM processing
    • Is the back-end app currently able to access LLM providers?
    • Considerably more processing is required for building the knowledge graphs vs building a vector db which limits using knowledge graphs for all users' data.
  • Seamlessly updating existing information in knowledge graphs may not be possible currently (Incremental indexing (adding new content) microsoft/graphrag#741), although knowledge graphs can apparently deal with conflicting information quite well even when it is added incrementally.

P.S. I'm working 50% currently so I'm available for discussions or just catching up!

EDIT: A good basic explanation of the concepts involved: https://www.youtube.com/watch?v=6vG_amAshTk

@kyteinsky
Contributor

hey, long time!

Nice to see knowledge graphs make a comeback. I do agree with the use cases and benefits over the current system.

This methodology allows not only retrieving the relevant chunk of source material but also the relevant context for that source material.

The vector db based RAG system was supposed to do this but alas. Let's see how graph-rag performs.
We should do a proper comparison of the two approaches. The one in the video is good, but its conclusion doesn't really hold: the short output is a result of the prompt, whereas the point of the test was which document bits/relations were retrieved.

A support bot based on 'dirty' datasets like old/existing support logs where some/most of the information may be unreliable or dated

That would be sweet, like comparing the documentation with the support tickets and trusting the documentation on conflict.
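
A rough sketch of what "trusting the documentation on conflict" could look like, assuming each extracted fact is tagged with the kind of source it came from. The `SOURCE_PRIORITY` ranking, the source labels, and the `resolve_conflicts` helper are hypothetical, made up for illustration only.

```python
# Hypothetical trust ranking: higher number wins when two sources disagree.
SOURCE_PRIORITY = {"documentation": 2, "support_ticket": 1}


def resolve_conflicts(claims):
    """Pick one value per key, preferring the most trusted source.

    claims: iterable of (key, value, source_type) tuples.
    Unknown source types get priority 0; ties keep the first claim seen.
    """
    best = {}
    for key, value, source in claims:
        rank = SOURCE_PRIORITY.get(source, 0)
        if key not in best or rank > best[key][0]:
            best[key] = (rank, value)
    return {key: value for key, (rank, value) in best.items()}


claims = [
    ("default_port", "8080", "support_ticket"),
    ("default_port", "9000", "documentation"),
    ("log_path", "/var/log/app.log", "support_ticket"),
]
print(resolve_conflicts(claims))
```

Facts that only appear in the support logs still survive; the ranking only matters when two sources disagree about the same key.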

Inserting data into a knowledge graph requires LLM processing
Considerably more processing is required for building the knowledge graphs vs building a vector db which limits using knowledge graphs for all users' data.

This is what is most concerning about this method. Embedding generation is super fast, but in graph-rag we need to run prompts on every chunk, and then some. That isn't just prompt processing but output generation too, so it would be much slower. Add the file changes/creations/deletions on top of that for graph updates, although this shouldn't be that bad ig.
Prompts from fast-graphrag: https://github.com/circlemind-ai/fast-graphrag/blob/main/fast_graphrag/_prompt.py
The prompts from Microsoft seem a bit large but come to around 2k tokens, so not that much; that leaves plenty of space for the actual text in an 8k - 1k context window.
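
For a feel of the numbers, here is a back-of-the-envelope sketch. It assumes one reading of the figures above (an 8k context window, roughly 2k tokens of prompt, and 1k tokens reserved for the generated output); the function name and the defaults are illustrative assumptions, not measurements.

```python
import math


def extraction_calls(doc_tokens, context_window=8000, prompt_tokens=2000, output_tokens=1000):
    """Estimate the text budget per call and the number of LLM calls needed.

    All defaults are assumptions for illustration; real tokenizers, prompts,
    and chunk overlap will change the result.
    """
    text_budget = context_window - prompt_tokens - output_tokens
    if text_budget <= 0:
        raise ValueError("prompt and output reserve leave no room for text")
    return text_budget, math.ceil(doc_tokens / text_budget)


budget, calls = extraction_calls(100_000)
print(budget, calls)  # 5000 20
```

Under these assumptions a 100k-token corpus needs about 20 extraction calls, each of which also generates output, compared with a single embedding pass over the same chunks.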

Is the back-end app currently able to access LLM providers?

By default it uses the text-to-text task processing API in the server, which accepts a text input and generates a response. In addition to that, we still support llama and ctransformers locally (in the container) with a config change.
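
Since every backend boils down to "text in, text out", the graph extraction step could be written against a plain callable. The sketch below is an assumption about how that might look, not the app's actual code: `TextToText`, `PROMPT`, `extract_triples`, and `fake_llm` are hypothetical names, and the prompt is a stand-in for the much longer prompts linked above.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: any backend that maps a prompt string to a response string.
TextToText = Callable[[str], str]

PROMPT = (
    "Extract relationships from the text below, one per line, "
    "formatted as 'subject | relation | object'.\n\n{text}"
)


def extract_triples(chunk: str, llm: TextToText) -> List[Tuple[str, ...]]:
    """Ask the backend for triples and keep only well-formed lines."""
    response = llm(PROMPT.format(text=chunk))
    triples = []
    for line in response.splitlines():
        parts = [part.strip() for part in line.split("|")]
        if len(parts) == 3 and all(parts):
            triples.append(tuple(parts))
    return triples


def fake_llm(prompt: str) -> str:
    # Stand-in for a real provider so the sketch runs without any service.
    return "Enzyme X | regulates | Pathway Y\nthis line is not a triple"


print(extract_triples("Enzyme X regulates pathway Y.", fake_llm))
```

Swapping `fake_llm` for a function that calls the server's task processing or a local model would leave `extract_triples` unchanged.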

Seamlessly updating existing information in knowledge graphs may not be possible currently (Incremental indexing (adding new content) microsoft/graphrag#741), although knowledge graphs can apparently deal with conflicting information quite well even when it is added incrementally.

The issue mentions that document additions are possible now, but no deletions, it seems. Someone also mentioned https://github.com/circlemind-ai/fast-graphrag, which is faster and more accurate than Microsoft's implementation according to their benchmark (https://github.com/circlemind-ai/fast-graphrag/blob/main/benchmarks/README.md), so that should not be an issue if we use it.

P.S. I'm working 50% currently so I'm available for discussions or just catching up!

🚀 let's find a common time when we can meet. When are you usually free? And allow me some time to find more interested people.

kyteinsky added the enhancement (New feature or request) label Jan 13, 2025