Merge remote-tracking branch 'origin/dev-minor' into change-default-behaviour
shreyaspimpalgaonkar committed Oct 22, 2024
2 parents 0bdd504 + 6e5bc12 commit b1fc134
Showing 22 changed files with 757 additions and 8 deletions.
135 changes: 135 additions & 0 deletions docs/cookbooks/advanced-graphrag.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,135 @@
---
title: 'Advanced GraphRAG'
description: 'Advanced GraphRAG Techniques with R2R'
icon: 'chart-network'
---


## Advanced GraphRAG Techniques

R2R supports advanced GraphRAG techniques that can be easily configured at runtime. This flexibility allows you to experiment with different SoTA strategies and optimize your RAG pipeline for specific use cases.

<Note>

Advanced GraphRAG techniques are still a beta feature in R2R. There may be limitations in observability and analytics when implementing them.

Are we missing an important technique? If so, please let us know at [email protected].

</Note>


### Prompt Tuning

One way we can improve on GraphRAG's already impressive capabilities is by tuning our prompts to a specific domain. When we create a knowledge graph, an LLM extracts the relationships between entities; but for very targeted domains, a general approach may fall short.

To demonstrate this, we can run GraphRAG over the technical papers for the 2024 Nobel Prizes in chemistry, medicine, and physics. By tuning our prompts for GraphRAG, we attempt to understand our documents at a high level and provide the LLM with a more pointed description.

The following script, which utilizes the Python SDK, generates the tuned prompts and calls the knowledge graph creation process with these prompts at runtime:

```python
# Assumes a running R2R server and an initialized client, e.g.:
# from r2r import R2RClient
# client = R2RClient("http://localhost:7272")

# Step 1: Tune the prompts for knowledge graph creation
# Tune the entity description prompt
entity_prompt_response = client.get_tuned_prompt(
    prompt_name="graphrag_entity_description"
)
tuned_entity_prompt = entity_prompt_response['results']['tuned_prompt']

# Tune the triples extraction prompt
triples_prompt_response = client.get_tuned_prompt(
    prompt_name="graphrag_triples_extraction_few_shot"
)
tuned_triples_prompt = triples_prompt_response['results']['tuned_prompt']

# Step 2: Create the knowledge graph with the tuned prompts
kg_settings = {
    "kg_entity_description_prompt": tuned_entity_prompt,
    # Key name for the triples prompt is assumed to mirror the entity one
    "kg_triples_extraction_prompt": tuned_triples_prompt,
}

# Generate the initial graph
graph_response = client.create_graph(
    run_type="run",
    kg_creation_settings=kg_settings
)

# Step 3: Clean up the graph by removing duplicate entities
client.deduplicate_entities(
    collection_id='122fdf6a-e116-546b-a8f6-e4cb2e2c0a09'
)

# Step 4: Tune and apply the community reports prompt for graph enrichment
community_prompt_response = client.get_tuned_prompt(
    prompt_name="graphrag_community_reports"
)
tuned_community_prompt = community_prompt_response['results']['tuned_prompt']

# Configure enrichment settings
kg_enrichment_settings = {
    "community_reports_prompt": tuned_community_prompt
}

# Enrich the graph with additional information
client.enrich_graph(
    run_type="run",
    kg_enrichment_settings=kg_enrichment_settings
)
```

For illustrative purposes, we can look at the `graphrag_entity_description` prompt before and after prompt tuning. With tuning, we capture the intent of the documents, giving us a more targeted prompt overall.

<Tabs>
<Tab title="Prompt before Prompt Tuning">
```yaml
Provide a comprehensive yet concise summary of the given entity, incorporating its description and associated triples:

Entity Info:
{entity_info}
Triples:
{triples_txt}

Your summary should:
1. Clearly define the entity's core concept or purpose
2. Highlight key relationships or attributes from the triples
3. Integrate any relevant information from the existing description
4. Maintain a neutral, factual tone
5. Be approximately 2-3 sentences long

Ensure the summary is coherent, informative, and captures the essence of the entity within the context of the provided information.
```

</Tab>

<Tab title="Prompt after Prompt Tuning">
```yaml
Provide a comprehensive yet concise summary of the given entity, focusing on its significance in the field of scientific research, while incorporating its description and associated triples:

Entity Info:
{entity_info}
Triples:
{triples_txt}

Your summary should:
1. Clearly define the entity's core concept or purpose within computational biology, artificial intelligence, and medicine
2. Highlight key relationships or attributes from the triples that illustrate advancements in scientific understanding and reasoning
3. Integrate any relevant information from the existing description, particularly breakthroughs and methodologies
4. Maintain a neutral, factual tone
5. Be approximately 2-3 sentences long

Ensure the summary is coherent, informative, and captures the essence of the entity within the context of the provided information, emphasizing its impact on the field.
```
</Tab>

</Tabs>

After prompt tuning, we see an increase in the number of communities, and these communities are more focused and domain-specific, with clearer thematic boundaries.

Prompt tuning produces:
- **More precise community separation:** GraphRAG alone produced a single `MicroRNA Research` community, while GraphRAG with prompt tuning produced communities around `C. elegans MicroRNA Research`, `LET-7 MicroRNA`, and `miRNA-184 and EDICT Syndrome`.
- **Enhanced domain focus:** Previously, we had a single community for `AI Researchers`, but with prompt tuning we create specialized communities such as `Hinton, Hopfield, and Deep Learning`, `Hochreiter and Schmidhuber`, and `Minsky and Papert's ANN Critique`.

| Count | GraphRAG | GraphRAG with Prompt Tuning |
|-------------|----------|-----------------------------|
| Entities | 661 | 636 |
| Triples | 509 | 503 |
| Communities | 29 | 41 |

Prompt tuning allows us to generate communities that better reflect the natural organization of the domain knowledge while maintaining more precise technical and thematic boundaries between related concepts.
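As a rough sanity check on the table above, we can compare entities per community before and after tuning; a lower ratio is consistent with smaller, more tightly scoped communities. (This is an illustrative metric computed from the reported counts, not part of the R2R SDK.)

```python
# Entities-per-community ratio, computed from the counts reported above.
# A lower ratio suggests smaller, more tightly scoped communities.
counts = {
    "GraphRAG": {"entities": 661, "triples": 509, "communities": 29},
    "GraphRAG with prompt tuning": {"entities": 636, "triples": 503, "communities": 41},
}

for name, c in counts.items():
    ratio = c["entities"] / c["communities"]
    print(f"{name}: {ratio:.1f} entities per community")
# GraphRAG: 22.8 entities per community
# GraphRAG with prompt tuning: 15.5 entities per community
```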
55 changes: 55 additions & 0 deletions docs/documentation/python-sdk/graphrag.mdx
@@ -446,6 +446,61 @@ client.delete_graph_for_collection(
NOTE: Setting this flag to true will delete entities and triples for documents that are shared across multiple collections. Do not set this flag unless you are absolutely sure that you want to delete the entities and triples for all documents in the collection.
</ParamField>

## Get Tuned Prompt

```python
client.get_tuned_prompt(
    prompt_name="graphrag_entity_description",
    collection_id='122fdf6a-e116-546b-a8f6-e4cb2e2c0a09',
    documents_offset=0,
    documents_limit=100,
    chunk_offset=0,
    chunk_limit=100
)
```

<AccordionGroup>
<Accordion title="Response">
<ResponseField name="response" type="dict">
The response containing the tuned prompt for GraphRAG.
```json
{
  "results": {
    "tuned_prompt": "string"
  }
}
```
</ResponseField>
</Accordion>
</AccordionGroup>

<ParamField path="prompt_name" type="str">
The name of the prompt to tune. Valid values include "graphrag_entity_description", "graphrag_triples_extraction_few_shot", and "graphrag_community_reports".
</ParamField>

<ParamField path="collection_id" type="Optional[Union[UUID, str]]">
The ID of the collection to tune the prompt for. If not provided, the default collection will be used.
</ParamField>

<ParamField path="documents_offset" type="Optional[int]">
The offset for pagination of documents. Defaults to 0.
</ParamField>

<ParamField path="documents_limit" type="Optional[int]">
The limit for pagination of documents. Defaults to 100. Controls how many documents are used for tuning.
</ParamField>

<ParamField path="chunk_offset" type="Optional[int]">
The offset for pagination of chunks within each document. Defaults to 0.
</ParamField>

<ParamField path="chunk_limit" type="Optional[int]">
The limit for pagination of chunks within each document. Defaults to 100. Controls how many chunks per document are used for tuning.
</ParamField>

The tuning process provides an LLM with chunks from each document in the collection. The relative sample size can therefore be controlled by adjusting the document and chunk limits.
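The pagination limits above jointly bound the tuning sample: at most `documents_limit` documents each contribute at most `chunk_limit` chunks. A minimal sketch of that arithmetic (assumed behavior, using the documented defaults):

```python
# Upper bound on the number of chunks the tuner can show the LLM:
# documents_limit documents, each contributing up to chunk_limit chunks.
def max_tuning_chunks(documents_limit: int = 100, chunk_limit: int = 100) -> int:
    return documents_limit * chunk_limit

print(max_tuning_chunks())        # 10000 with the defaults
print(max_tuning_chunks(20, 10))  # 200 for a much smaller, cheaper sample
```

Lowering both limits proportionally reduces the tuning cost at the expense of a less representative sample.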


## Search and RAG

Please see the [Search and RAG](/documentation/python-sdk/retrieval) documentation for more information on how to perform search and RAG using Knowledge Graphs.
3 changes: 2 additions & 1 deletion docs/mint.json
@@ -407,8 +407,9 @@
"cookbooks/walkthrough",
"cookbooks/ingestion",
"cookbooks/hybrid-search",
"cookbooks/graphrag",
"cookbooks/advanced-rag",
"cookbooks/graphrag",
"cookbooks/advanced-graphrag",
"cookbooks/agent",
"cookbooks/orchestration",
"cookbooks/web-dev"
37 changes: 37 additions & 0 deletions py/alembic.ini
@@ -0,0 +1,37 @@
[alembic]
script_location = migrations
sqlalchemy.url = postgresql+asyncpg://postgres:postgres@localhost/postgres

[loggers]
keys = root,sqlalchemy,alembic

[handlers]
keys = console

[formatters]
keys = generic

[logger_root]
level = WARN
handlers = console
qualname =

[logger_sqlalchemy]
level = WARN
handlers =
qualname = sqlalchemy.engine

[logger_alembic]
level = INFO
handlers =
qualname = alembic

[handler_console]
class = StreamHandler
args = (sys.stderr,)
level = NOTSET
formatter = generic

[formatter_generic]
format = %(levelname)-5.5s [%(name)s] %(message)s
datefmt = %H:%M:%S
2 changes: 2 additions & 0 deletions py/core/base/api/models/__init__.py
@@ -26,6 +26,7 @@
WrappedKGEntitiesResponse,
WrappedKGEntityDeduplicationResponse,
WrappedKGTriplesResponse,
WrappedKGTunePromptResponse,
)
from shared.api.models.management.responses import (
AnalyticsResponse,
@@ -93,6 +94,7 @@
"WrappedKGEnrichmentResponse",
"KGEntityDeduplicationResponse",
"WrappedKGEntityDeduplicationResponse",
"WrappedKGTunePromptResponse",
# Management Responses
"PromptResponse",
"ServerStats",
1 change: 1 addition & 0 deletions py/core/main/abstractions.py
@@ -42,6 +42,7 @@ class R2RPipes(BaseModel):
    kg_entity_deduplication_pipe: AsyncPipe
    kg_entity_deduplication_summary_pipe: AsyncPipe
    kg_community_summary_pipe: AsyncPipe
    kg_prompt_tuning_pipe: AsyncPipe
    rag_pipe: AsyncPipe
    streaming_rag_pipe: AsyncPipe
    vector_storage_pipe: AsyncPipe
45 changes: 45 additions & 0 deletions py/core/main/api/kg_router.py
@@ -14,6 +14,7 @@
WrappedKGEntitiesResponse,
WrappedKGEntityDeduplicationResponse,
WrappedKGTriplesResponse,
WrappedKGTunePromptResponse,
)
from core.base.providers import OrchestrationProvider, Workflow
from core.utils import generate_default_user_collection_id
@@ -389,6 +390,50 @@ async def deduplicate_entities(
"entity-deduplication", {"request": workflow_input}, {}
)

@self.router.get("/tuned_prompt")
@self.base_endpoint
async def get_tuned_prompt(
    prompt_name: str = Query(
        ...,
        description="The name of the prompt to tune. Valid options are 'kg_triples_extraction_prompt', 'kg_entity_description_prompt' and 'community_reports_prompt'.",
    ),
    collection_id: Optional[UUID] = Query(
        None, description="Collection ID to retrieve communities from."
    ),
    documents_offset: Optional[int] = Query(
        0, description="Offset for document pagination."
    ),
    documents_limit: Optional[int] = Query(
        100, description="Limit for document pagination."
    ),
    chunks_offset: Optional[int] = Query(
        0, description="Offset for chunk pagination."
    ),
    chunks_limit: Optional[int] = Query(
        100, description="Limit for chunk pagination."
    ),
    auth_user=Depends(self.service.providers.auth.auth_wrapper),
) -> WrappedKGTunePromptResponse:
    """
    Auto-tune the prompt for a specific collection.
    """
    if not auth_user.is_superuser:
        logger.warning("Implement permission checks here.")

    if not collection_id:
        collection_id = generate_default_user_collection_id(auth_user.id)

    return await self.service.tune_prompt(
        prompt_name=prompt_name,
        collection_id=collection_id,
        documents_offset=documents_offset,
        documents_limit=documents_limit,
        chunks_offset=chunks_offset,
        chunks_limit=chunks_limit,
    )

@self.router.delete("/delete_graph_for_collection")
@self.base_endpoint
async def delete_graph_for_collection(
13 changes: 13 additions & 0 deletions py/core/main/assembly/factory.py
@@ -395,6 +395,7 @@ def create_pipes(
kg_entity_deduplication_pipe: Optional[AsyncPipe] = None,
kg_entity_deduplication_summary_pipe: Optional[AsyncPipe] = None,
kg_community_summary_pipe: Optional[AsyncPipe] = None,
kg_prompt_tuning_pipe: Optional[AsyncPipe] = None,
*args,
**kwargs,
) -> R2RPipes:
@@ -433,6 +434,8 @@
),
kg_community_summary_pipe=kg_community_summary_pipe
or self.create_kg_community_summary_pipe(*args, **kwargs),
kg_prompt_tuning_pipe=kg_prompt_tuning_pipe
or self.create_kg_prompt_tuning_pipe(*args, **kwargs),
)

def create_parsing_pipe(self, *args, **kwargs) -> Any:
@@ -677,6 +680,16 @@
),
)

def create_kg_prompt_tuning_pipe(self, *args, **kwargs) -> Any:
    from core.pipes import KGPromptTuningPipe

    return KGPromptTuningPipe(
        kg_provider=self.providers.kg,
        llm_provider=self.providers.llm,
        prompt_provider=self.providers.prompt,
        config=AsyncPipe.PipeConfig(name="kg_prompt_tuning_pipe"),
    )


class R2RPipelineFactory:
def __init__(self, config: R2RConfig, pipes: R2RPipes):