feat: Graph sampling with Neo4j #766
Hi team, I see this PR is still in draft, but I have a question about whether graph sampling needs to be implemented at this point.

As I understand it, GraphRAG combines a vector database with a knowledge graph (KG). After the KG's entities and relations are extracted, the nodes are embedded. When a query arrives, it is embedded into a vector as well. A vector search then finds the nodes most similar to the query, and traversal starts from those nodes to gather nearby nodes and relations as input for the LLM.

However, the current implementation in Camel does not embed either the query or the nodes, which seems to be a critical part of the GraphRAG approach (as outlined in Microsoft's GraphRAG embedding code and PingCAP's graphrag-demo).

Given that the vector search would already identify the relevant nodes, is it necessary to implement graph sampling at this stage? It could potentially improve performance, but should it be prioritized in the current phase, especially since no existing GraphRAG repository appears to have integrated graph sampling yet?

Thank you for your attention to this matter.
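For readers following the thread, the retrieval flow described in the comment above (embed the nodes, embed the query, pick the most similar nodes, then expand to their neighbours) can be sketched roughly as below. This is only an illustration, not CAMEL's or Microsoft's implementation: the node names, the three-dimensional embeddings, the edge list, and the `retrieve` helper are all made up, and a real pipeline would obtain embeddings from a model and store them in a vector database.

```python
import math

# Hypothetical toy data: these embeddings and edges are invented for illustration.
NODE_EMBEDDINGS = {
    "CAMEL": [0.9, 0.1, 0.0],
    "Neo4j": [0.1, 0.9, 0.0],
    "GraphRAG": [0.6, 0.6, 0.1],
    "Pizza": [0.0, 0.0, 1.0],
}
EDGES = [
    ("CAMEL", "SUPPORTS", "GraphRAG"),
    ("GraphRAG", "USES", "Neo4j"),
]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_embedding, top_k=1, hops=1):
    """Return (seed nodes, triples) for a query embedding.

    Step 1: vector search picks the top_k nodes most similar to the query.
    Step 2: graph traversal collects relationships within `hops` of the seeds.
    """
    ranked = sorted(
        NODE_EMBEDDINGS,
        key=lambda name: cosine(query_embedding, NODE_EMBEDDINGS[name]),
        reverse=True,
    )
    seeds = ranked[:top_k]

    frontier = set(seeds)
    visited = set(seeds)
    triples = []
    for _ in range(hops):
        next_frontier = set()
        for src, rel, dst in EDGES:
            if src in frontier or dst in frontier:
                if (src, rel, dst) not in triples:
                    triples.append((src, rel, dst))
                for node in (src, dst):
                    if node not in visited:
                        visited.add(node)
                        next_frontier.add(node)
        frontier = next_frontier
    return seeds, triples


if __name__ == "__main__":
    # A made-up query embedding that points mostly along the first dimension.
    print(retrieve([1.0, 0.0, 0.0], top_k=1, hops=2))
```

The triples returned by the second step are what would be serialized into the LLM prompt. Graph sampling, as proposed in this PR, would be a different way of choosing which part of the graph to pass along.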
return GraphElement(
    nodes=list(nodes.values()),
    relationships=relationships,
    source=self.element,
)
From kg_agent.py, I think each node's value and each relationship need to be vectorized.
Follow-up on my previous comment: All RAG approaches have trade-offs. The current implementation offers significant advantages in terms of cost and latency. From a product perspective, I recommend implementing the graph-vector RAG approach as soon as possible, as it can help the LLM generate more comprehensive and accurate answers.
Description
Implement graph sampling methods for the GraphRAG pipeline. Both methods are supported by Neo4j:
Random walk with restart - takes random walks from a set of start nodes.
Common Neighbour Aware Random Walk - avoids getting caught in local loops, which is especially useful for graphs with dense regions.
Reference: https://neo4j.com/docs/graph-data-science/current/management-ops/graph-creation/sampling/
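To make the first method concrete, here is a minimal pure-Python sketch of random walk with restart sampling over an adjacency dict. It is not the Neo4j implementation and not the code in this PR: the function names `rwr_sample` and `induced_edges`, the parameter names, and the `max_steps` safety cap are assumptions chosen for illustration.

```python
import random


def rwr_sample(adjacency, start_nodes, sampling_ratio=0.5,
               restart_probability=0.1, seed=None, max_steps=100_000):
    """Sample a set of nodes with a random walk with restart.

    adjacency: dict mapping each node to a list of its neighbours.
    start_nodes: nodes a walk may start (or restart) from.
    sampling_ratio: fraction of all nodes to collect before stopping.
    restart_probability: chance, at each step, of jumping back to a start node.
    max_steps: hypothetical safety cap so the loop always terminates.
    """
    rng = random.Random(seed)
    target = max(1, round(sampling_ratio * len(adjacency)))

    current = rng.choice(start_nodes)
    sampled = {current}
    for _ in range(max_steps):
        if len(sampled) >= target:
            break
        neighbours = adjacency.get(current, [])
        # Restart on a dead end or with the configured probability.
        if not neighbours or rng.random() < restart_probability:
            current = rng.choice(start_nodes)
        else:
            current = rng.choice(neighbours)
        sampled.add(current)
    return sampled


def induced_edges(adjacency, nodes):
    """Edges of the original graph whose endpoints are both in `nodes`."""
    return [(u, v) for u in nodes for v in adjacency.get(u, []) if v in nodes]


if __name__ == "__main__":
    # Toy undirected ring of six nodes, stored as a symmetric adjacency dict.
    ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
    nodes = rwr_sample(ring, start_nodes=[0], sampling_ratio=0.5, seed=42)
    print(sorted(nodes), induced_edges(ring, nodes))
```

The restart step keeps the sample concentrated around the start nodes, which is why this family of samplers is a natural fit when the start nodes are the entities relevant to a query. The Common Neighbour Aware variant differs in how the next neighbour is chosen and is not shown here.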
Motivation and Context
close #737
Types of changes
What types of changes does your code introduce? Put an x in all the boxes that apply:
Implemented Tasks
Checklist
Go over all the following points, and put an x in all the boxes that apply. If you are unsure about any of these, don't hesitate to ask. We are here to help!