support graphrag v0.4.0
KylinMountain committed Nov 8, 2024
1 parent 8a8f862 commit beceaf2
Showing 31 changed files with 1,613 additions and 43 deletions.
25 changes: 25 additions & 0 deletions Dockerfile
@@ -0,0 +1,25 @@
FROM python:3.10-slim-buster

ENV PYTHONDONTWRITEBYTECODE 1
ENV PYTHONUNBUFFERED 1
ENV CARGO_HOME=/root/.cargo
ENV PATH=$CARGO_HOME/bin:$PATH

RUN apt update && apt install -y \
curl \
build-essential \
&& curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y \
&& cargo --version \
&& rustc --version

WORKDIR /app

COPY requirements.txt .

RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 20213

CMD ["uvicorn", "webserver.main:app", "--host", "0.0.0.0", "--port", "20213"]
1 change: 1 addition & 0 deletions LICENSE
@@ -1,6 +1,7 @@
MIT License

Copyright (c) Microsoft Corporation.
Copyright (c) KylinMountain.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
107 changes: 107 additions & 0 deletions README.md
@@ -1,3 +1,110 @@
# GraphRAG customized by KylinMountain
- I have added a web server that supports true streaming output.
- I have fixed an error when using a local embedding service such as LM Studio.
- I have fixed an indexing error after prompt tuning.
- I have fixed the strategy not being loaded when entity extraction is configured to use NLTK.
- I have added an advice question API (suggested questions).
- I have added reference links for the entities, relationships, sources, and reports cited in the output, so you can click through to them.
- Supports any desktop or web application compatible with the OpenAI SDK.
- Supports Docker deployment: the image is available as kylinmountain/graphrag-server:0.3.1.

![image](https://github.com/user-attachments/assets/c251d434-4925-4012-88e7-f3b2ff40471f)


![image](https://github.com/user-attachments/assets/ab7a8d2e-aeec-4a0c-afb9-97086b9c7b2a)

# How to install
You can install with Docker, or pull this repo and run from source.
## Pull the source code
- Clone the repo
```
git clone https://github.com/KylinMountain/graphrag.git
cd graphrag
```
- Create a virtual environment
```
conda create -n graphrag python=3.10
conda activate graphrag
```
- Install poetry
```
curl -sSL https://install.python-poetry.org | python3 -
```
- Install dependencies
```
poetry install
pip install -r webserver/requirements.txt
```
or
```
pip install -r requirements.txt
```
- Initialize GraphRAG
```
poetry run poe index --init --root .
# or
python -m graphrag.index --init --root .
```
- Create the input folder
```
mkdir input
```
- Configure settings.yaml
Follow the official GraphRAG configuration docs: [GraphRAG Configuration](https://microsoft.github.io/graphrag/posts/config/json_yaml/)
- Configure the webserver

You may need to adjust the following settings, but the defaults work for running locally.
```python
server_host: str = "http://localhost"
server_port: int = 20213
data: str = (
"./output"
)
lancedb_uri: str = (
"./lancedb"
)
```
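
For illustration only, these defaults resemble a pydantic-settings model. A minimal sketch of how such defaults could be declared and overridden via environment variables (the actual settings class in `webserver/` may be named and organized differently):
```python
# Illustrative sketch only - the real webserver settings class may differ.
from pydantic_settings import BaseSettings


class WebServerSettings(BaseSettings):
    server_host: str = "http://localhost"
    server_port: int = 20213
    data: str = "./output"          # directory containing the GraphRAG index output
    lancedb_uri: str = "./lancedb"  # LanceDB vector store location


settings = WebServerSettings()  # fields can be overridden via environment variables
print(settings.server_port)
```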
- Start the web server
```bash
python -m webserver.main
```
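
Because the web server exposes an OpenAI-compatible API, any OpenAI SDK client can point at it once it is running. A minimal sketch, assuming the default port; the `/v1` base path and the model name shown here are assumptions, so adjust them to whatever `webserver.main` actually exposes:
```python
# Minimal sketch of an OpenAI-SDK client pointed at the local GraphRAG server.
# The base_url path ("/v1") and the model name are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:20213/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="graphrag-local-search",  # hypothetical model identifier
    messages=[{"role": "user", "content": "What are the main themes in my documents?"}],
    stream=True,                    # the server supports streaming output
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```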
For more configuration examples, see the [WeChat articles](https://mp.weixin.qq.com/mp/appmsgalbum?__biz=MzI0OTAzNTEwMw==&action=getalbum&album_id=3429606151455670272&uin=&key=&devicetype=iMac+MacBookPro17%2C1+OSX+OSX+14.4+build(23E214)&version=13080710&lang=zh_CN&nettype=WIFI&ascene=0&fontScale=100) and the [Bilibili video](https://www.bilibili.com/video/BV113v8e6EZn).

## Install with Docker
- Pull the Docker image
```
docker pull kylinmountain/graphrag-server:0.3.1
```
- Start
Before starting, you can create the output, input, and prompts directories and the settings.yaml file so they can be mounted into the container.
```
docker run -v ./output:/app/output \
-v ./input:/app/input \
-v ./prompts:/app/prompts \
-v ./settings.yaml:/app/settings.yaml \
-v ./lancedb:/app/lancedb -p 20213:20213 kylinmountain/graphrag-server:0.3.1
```
- Index
```
docker run -v ./output:/app/output \
           -v ./input:/app/input \
           -v ./prompts:/app/prompts \
           -v ./settings.yaml:/app/settings.yaml \
           kylinmountain/graphrag-server:0.3.1 python -m graphrag.index --root .
```




-------
# GraphRAG

👉 [Use the GraphRAG Accelerator solution](https://github.com/Azure-Samples/graphrag-accelerator) <br/>
5 changes: 3 additions & 2 deletions graphrag/query/structured_search/drift_search/action.py
@@ -7,6 +7,7 @@
import logging
from typing import Any

from graphrag.llm.openai.utils import try_parse_json_object
from graphrag.query.llm.text_utils import num_tokens

log = logging.getLogger(__name__)
@@ -71,7 +72,7 @@ async def asearch(self, search_engine: Any, global_query: str, scorer: Any = Non
)

try:
response = json.loads(search_result.response)
_, response = try_parse_json_object(search_result.response)
except json.JSONDecodeError:
error_message = "Failed to parse search response"
log.exception("%s: %s", error_message, search_result.response)
@@ -198,7 +199,7 @@ def from_primer_response(
# If response is a string, attempt to parse as JSON
if isinstance(response, str):
try:
parsed_response = json.loads(response)
_, parsed_response = try_parse_json_object(response)
if isinstance(parsed_response, dict):
return cls.from_primer_response(query, parsed_response)
error_message = "Parsed response must be a dictionary."
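
For context, `try_parse_json_object` returns a `(text, parsed)` pair and tolerates the markdown-fenced or slightly malformed JSON that LLMs often emit, which is why it replaces the bare `json.loads` calls here and in primer.py below. A rough, illustrative approximation of that behavior (not the actual graphrag implementation):
```python
# Rough approximation of tolerant JSON parsing; the real
# graphrag.llm.openai.utils.try_parse_json_object may differ in details.
import json
import re
from typing import Any


def try_parse_json_object_sketch(text: str) -> tuple[str, dict[str, Any]]:
    """Return (cleaned_text, parsed_dict); parsed_dict is {} on failure."""
    cleaned = text.strip()
    # Strip a surrounding markdown code fence such as ```json ... ```
    match = re.search(r"```(?:json)?\s*(.*?)```", cleaned, flags=re.DOTALL)
    if match:
        cleaned = match.group(1).strip()
    try:
        parsed = json.loads(cleaned)
    except json.JSONDecodeError:
        parsed = {}
    return cleaned, parsed
```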
3 changes: 2 additions & 1 deletion graphrag/query/structured_search/drift_search/primer.py
@@ -14,6 +14,7 @@
from tqdm.asyncio import tqdm_asyncio

from graphrag.config.models.drift_config import DRIFTSearchConfig
from graphrag.llm.openai.utils import try_parse_json_object
from graphrag.model import CommunityReport
from graphrag.query.llm.base import BaseTextEmbedding
from graphrag.query.llm.oai.chat_openai import ChatOpenAI
@@ -139,7 +140,7 @@ async def decompose_query(
messages, response_format={"type": "json_object"}
)

parsed_response = json.loads(response)
_, parsed_response = try_parse_json_object(response)
token_ct = num_tokens(prompt + response, self.token_encoder)

return parsed_response, token_ct
@@ -23,22 +23,21 @@
The response should be JSON formatted as follows:
{{
"points": [
{{"description": "Description of point 1 [Data: Reports (report ids)]", "score": score_value}},
{{"description": "Description of point 2 [Data: Reports (report ids)]", "score": score_value}}
{{"description": "Description of point 1 [^Data:Reports(report id)][^Data:Reports(report id)]", "score": score_value}},
{{"description": "Description of point 2 [^Data:Reports(report id)][^Data:Reports(report id)]", "score": score_value}}
]
}}
The response shall preserve the original meaning and use of modal verbs such as "shall", "may" or "will".
Points supported by data should list the relevant reports as references as follows:
"This is an example sentence supported by data references [Data: Reports (report ids)]"
"This is an example sentence supported by data references [^Data:Reports(report id)][^Data:Reports(report id)]"
**Do not list more than 5 record ids in a single reference**. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more.
For example:
"Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (2, 7, 64, 46, 34, +more)]. He is also CEO of company X [Data: Reports (1, 3)]"
where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data report in the provided tables.
"Person X is the owner of Company Y and subject to many allegations of wrongdoing [^Data:Reports(2)] [^Data:Reports(7)] [^Data:Reports(34)] [^Data:Reports(46)] [^Data:Reports(64,+more)]. He is also CEO of company X [^Data:Reports(1)] [^Data:Reports(3)]"
where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data record.
Do not include information where the supporting evidence for it is not provided.
@@ -80,3 +79,4 @@
]
}}
"""

@@ -25,11 +25,11 @@
The response should also preserve all the data references previously included in the analysts' reports, but do not mention the roles of multiple analysts in the analysis process.
**Do not list more than 5 record ids in a single reference**. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more.
**References should be listed with a single record ID per citation**, with each citation containing only one record ID. For example, [^Data:Relationships(38)] [^Data:Relationships(55)], instead of [^Data:Relationships(38, 55)].
For example:
"Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (2, 7, 34, 46, 64, +more)]. He is also CEO of company X [Data: Reports (1, 3)]"
"Person X is the owner of Company Y and subject to many allegations of wrongdoing [^Data:Reports(2)] [^Data:Reports(7)] [^Data:Reports(34)] [^Data:Reports(46)] [^Data:Reports(64,+more)]. He is also CEO of company X [^Data:Reports(1)] [^Data:Reports(3)]"
where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data record.
@@ -60,11 +60,11 @@
The response should also preserve all the data references previously included in the analysts' reports, but do not mention the roles of multiple analysts in the analysis process.
**Do not list more than 5 record ids in a single reference**. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more.
**References should be listed with a single record ID per citation**, with each citation containing only one record ID. For example, [^Data:Relationships(38)] [^Data:Relationships(55)], instead of [^Data:Relationships(38, 55)].
For example:
"Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (2, 7, 34, 46, 64, +more)]. He is also CEO of company X [Data: Reports (1, 3)]"
"Person X is the owner of Company Y and subject to many allegations of wrongdoing [^Data:Reports(2)] [^Data:Reports(7)] [^Data:Reports(34)] [^Data:Reports(46)] [^Data:Reports(64,+more)]. He is also CEO of company X [^Data:Reports(1)] [^Data:Reports(3)]"
where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data record.
35 changes: 5 additions & 30 deletions graphrag/query/structured_search/local_search/system_prompt.py
@@ -17,13 +17,15 @@
Points supported by data should list their data references as follows:
"This is an example sentence supported by multiple data references [Data: <dataset name> (record ids); <dataset name> (record ids)]."
"This is an example sentence supported by multiple data references [^Data:<dataset name>(record id)] [^Data:<dataset name>(record id)]."
Do not list more than 5 record ids in a single reference. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more.
The <dataset name> should be one of Entities, Relationships, Claims, Sources, Reports.
**References should be listed with a single record ID per citation**, with each citation containing only one record ID. For example, [^Data:Relationships(38)] [^Data:Relationships(55)], instead of [^Data:Relationships(38, 55)].
For example:
"Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Sources (15, 16), Reports (1), Entities (5, 7); Relationships (23); Claims (2, 7, 34, 46, 64, +more)]."
"Person X is the owner of Company Y and subject to many allegations of wrongdoing [^Data:Sources(15)] [^Data:Sources(16)] [^Data:Reports(1)] [^Data:Entities(5)] [^Data:Entities(7)] [^Data:Relationships(23)] [^Data:Claims(2)] [^Data:Claims(7)] [^Data:Claims(34)] [^Data:Claims(46)] [^Data:Claims(64,+more)]."
where 15, 16, 1, 5, 7, 23, 2, 7, 34, 46, and 64 represent the id (not the index) of the relevant data record.
@@ -39,31 +41,4 @@
{context_data}
---Goal---
Generate a response of the target length and format that responds to the user's question, summarizing all information in the input data tables appropriate for the response length and format, and incorporating any relevant general knowledge.
If you don't know the answer, just say so. Do not make anything up.
Points supported by data should list their data references as follows:
"This is an example sentence supported by multiple data references [Data: <dataset name> (record ids); <dataset name> (record ids)]."
Do not list more than 5 record ids in a single reference. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more.
For example:
"Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Sources (15, 16), Reports (1), Entities (5, 7); Relationships (23); Claims (2, 7, 34, 46, 64, +more)]."
where 15, 16, 1, 5, 7, 23, 2, 7, 34, 46, and 64 represent the id (not the index) of the relevant data record.
Do not include information where the supporting evidence for it is not provided.
---Target response length and format---
{response_type}
Add sections and commentary to the response as appropriate for the length and format. Style the response in markdown.
"""
