support graphrag v0.4.0
KylinMountain committed Nov 8, 2024
1 parent 8a8f862 commit beceaf2
Showing 31 changed files with 1,613 additions and 43 deletions.
25 changes: 25 additions & 0 deletions Dockerfile
@@ -0,0 +1,25 @@
FROM python:3.10-slim-buster

ENV PYTHONDONTWRITEBYTECODE 1
ENV PYTHONUNBUFFERED 1
ENV CARGO_HOME=/root/.cargo
ENV PATH=$CARGO_HOME/bin:$PATH

RUN apt update && apt install -y \
curl \
build-essential \
&& curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y \
&& cargo --version \
&& rustc --version

WORKDIR /app

COPY requirements.txt .

RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 20213

CMD ["uvicorn", "webserver.main:app", "--host", "0.0.0.0", "--port", "20213"]
1 change: 1 addition & 0 deletions LICENSE
@@ -1,6 +1,7 @@
MIT License

Copyright (c) Microsoft Corporation.
Copyright (c) KylinMountain.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
107 changes: 107 additions & 0 deletions README.md
@@ -1,3 +1,110 @@
# GraphRAG customized by KylinMountain
- I have added a web server that supports true streaming output.
- I have fixed an error when using a local embedding service such as LM Studio.
- I have fixed an indexing error after prompt tuning.
- I have fixed the strategy not being loaded when entity extraction is configured to use NLTK.
- I have added an advice question API (suggested questions).
- I have added reference links for the entities, relationships, sources, and reports cited in the output, so you can click through to them.
- Supports any desktop or web application compatible with the OpenAI SDK.
- Supports Docker deployment: the image is available as kylinmountain/graphrag-server:0.3.1.

![image](https://github.com/user-attachments/assets/c251d434-4925-4012-88e7-f3b2ff40471f)


![image](https://github.com/user-attachments/assets/ab7a8d2e-aeec-4a0c-afb9-97086b9c7b2a)

# How to install
You can install with Docker, or pull this repo and run from source.
## Pull the source code
- Clone the repo
```
git clone https://github.com/KylinMountain/graphrag.git
cd graphrag
```
- Create a virtual environment
```
conda create -n graphrag python=3.10
conda activate graphrag
```
- Install poetry
```
curl -sSL https://install.python-poetry.org | python3 -
```
- Install dependencies
```
poetry install
pip install -r webserver/requirements.txt
```
or
```
pip install -r requirements.txt
```
- Initialize GraphRAG
```
poetry run poe index --init --root .
# or
python -m graphrag.index --init --root .
```
- Create the input folder
```
mkdir input
```
- Configure settings.yaml
Follow the official GraphRAG configuration docs: [GraphRAG Configuration](https://microsoft.github.io/graphrag/posts/config/json_yaml/)
- Configure the webserver

You may need to adjust the following settings, but the defaults work for running locally.
```python
server_host: str = "http://localhost"
server_port: int = 20213
data: str = (
"./output"
)
lancedb_uri: str = (
"./lancedb"
)
```
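
For illustration only, these defaults resemble a pydantic-settings model. A minimal sketch of how such defaults could be declared and overridden via environment variables (the actual settings class in `webserver/` may be named and organized differently):
```python
# Illustrative sketch only - the real webserver settings class may differ.
from pydantic_settings import BaseSettings


class WebServerSettings(BaseSettings):
    server_host: str = "http://localhost"
    server_port: int = 20213
    data: str = "./output"          # directory containing the GraphRAG index output
    lancedb_uri: str = "./lancedb"  # LanceDB vector store location


settings = WebServerSettings()  # fields can be overridden via environment variables
print(settings.server_port)
```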
- Start the web server
```bash
python -m webserver.main
```
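
Because the web server exposes an OpenAI-compatible API, any OpenAI SDK client can point at it once it is running. A minimal sketch, assuming the default port; the `/v1` base path and the model name shown here are assumptions, so adjust them to whatever `webserver.main` actually exposes:
```python
# Minimal sketch of an OpenAI-SDK client pointed at the local GraphRAG server.
# The base_url path ("/v1") and the model name are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:20213/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="graphrag-local-search",  # hypothetical model identifier
    messages=[{"role": "user", "content": "What are the main themes in my documents?"}],
    stream=True,                    # the server supports streaming output
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```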
For more configuration examples, see the [WeChat articles](https://mp.weixin.qq.com/mp/appmsgalbum?__biz=MzI0OTAzNTEwMw==&action=getalbum&album_id=3429606151455670272&uin=&key=&devicetype=iMac+MacBookPro17%2C1+OSX+OSX+14.4+build(23E214)&version=13080710&lang=zh_CN&nettype=WIFI&ascene=0&fontScale=100) and the [Bilibili video](https://www.bilibili.com/video/BV113v8e6EZn).

## Install with Docker
- Pull the Docker image
```
docker pull kylinmountain/graphrag-server:0.3.1
```
- Start
Before starting, you can create the output, input, and prompts directories and the settings.yaml file so they can be mounted into the container.
```
docker run -v ./output:/app/output \
-v ./input:/app/input \
-v ./prompts:/app/prompts \
-v ./settings.yaml:/app/settings.yaml \
-v ./lancedb:/app/lancedb -p 20213:20213 kylinmountain/graphrag-server:0.3.1
```
- Index
```
docker run -v ./output:/app/output \
           -v ./input:/app/input \
           -v ./prompts:/app/prompts \
           -v ./settings.yaml:/app/settings.yaml \
           kylinmountain/graphrag-server:0.3.1 python -m graphrag.index --root .
```




-------
# GraphRAG

👉 [Use the GraphRAG Accelerator solution](https://github.com/Azure-Samples/graphrag-accelerator) <br/>
5 changes: 3 additions & 2 deletions graphrag/query/structured_search/drift_search/action.py
@@ -7,6 +7,7 @@
import logging
from typing import Any

from graphrag.llm.openai.utils import try_parse_json_object
from graphrag.query.llm.text_utils import num_tokens

log = logging.getLogger(__name__)
@@ -71,7 +72,7 @@ async def asearch(self, search_engine: Any, global_query: str, scorer: Any = Non
)

try:
response = json.loads(search_result.response)
_, response = try_parse_json_object(search_result.response)
except json.JSONDecodeError:
error_message = "Failed to parse search response"
log.exception("%s: %s", error_message, search_result.response)
@@ -198,7 +199,7 @@ def from_primer_response(
# If response is a string, attempt to parse as JSON
if isinstance(response, str):
try:
parsed_response = json.loads(response)
_, parsed_response = try_parse_json_object(response)
if isinstance(parsed_response, dict):
return cls.from_primer_response(query, parsed_response)
error_message = "Parsed response must be a dictionary."
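
For context, `try_parse_json_object` returns a `(text, parsed)` pair and tolerates the markdown-fenced or slightly malformed JSON that LLMs often emit, which is why it replaces the bare `json.loads` calls here and in primer.py below. A rough, illustrative approximation of that behavior (not the actual graphrag implementation):
```python
# Rough approximation of tolerant JSON parsing; the real
# graphrag.llm.openai.utils.try_parse_json_object may differ in details.
import json
import re
from typing import Any


def try_parse_json_object_sketch(text: str) -> tuple[str, dict[str, Any]]:
    """Return (cleaned_text, parsed_dict); parsed_dict is {} on failure."""
    cleaned = text.strip()
    # Strip a surrounding markdown code fence such as ```json ... ```
    match = re.search(r"```(?:json)?\s*(.*?)```", cleaned, flags=re.DOTALL)
    if match:
        cleaned = match.group(1).strip()
    try:
        parsed = json.loads(cleaned)
    except json.JSONDecodeError:
        parsed = {}
    return cleaned, parsed
```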
3 changes: 2 additions & 1 deletion graphrag/query/structured_search/drift_search/primer.py
@@ -14,6 +14,7 @@
from tqdm.asyncio import tqdm_asyncio

from graphrag.config.models.drift_config import DRIFTSearchConfig
from graphrag.llm.openai.utils import try_parse_json_object
from graphrag.model import CommunityReport
from graphrag.query.llm.base import BaseTextEmbedding
from graphrag.query.llm.oai.chat_openai import ChatOpenAI
@@ -139,7 +140,7 @@ async def decompose_query(
messages, response_format={"type": "json_object"}
)

parsed_response = json.loads(response)
_, parsed_response = try_parse_json_object(response)
token_ct = num_tokens(prompt + response, self.token_encoder)

return parsed_response, token_ct
@@ -23,22 +23,21 @@
The response should be JSON formatted as follows:
{{
"points": [
{{"description": "Description of point 1 [Data: Reports (report ids)]", "score": score_value}},
{{"description": "Description of point 2 [Data: Reports (report ids)]", "score": score_value}}
{{"description": "Description of point 1 [^Data:Reports(report id)][^Data:Reports(report id)]", "score": score_value}},
{{"description": "Description of point 2 [^Data:Reports(report id)][^Data:Reports(report id)]", "score": score_value}}
]
}}
The response shall preserve the original meaning and use of modal verbs such as "shall", "may" or "will".
Points supported by data should list the relevant reports as references as follows:
"This is an example sentence supported by data references [Data: Reports (report ids)]"
"This is an example sentence supported by data references [^Data:Reports(report id)][^Data:Reports(report id)]"
**Do not list more than 5 record ids in a single reference**. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more.
For example:
"Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (2, 7, 64, 46, 34, +more)]. He is also CEO of company X [Data: Reports (1, 3)]"
where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data report in the provided tables.
"Person X is the owner of Company Y and subject to many allegations of wrongdoing [^Data:Reports(2)] [^Data:Reports(7)] [^Data:Reports(34)] [^Data:Reports(46)] [^Data:Reports(64,+more)]. He is also CEO of company X [^Data:Reports(1)] [^Data:Reports(3)]"
where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data record.
Do not include information where the supporting evidence for it is not provided.
@@ -80,3 +79,4 @@
]
}}
"""

@@ -25,11 +25,11 @@
The response should also preserve all the data references previously included in the analysts' reports, but do not mention the roles of multiple analysts in the analysis process.
**Do not list more than 5 record ids in a single reference**. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more.
**References should be listed with a single record ID per citation**, with each citation containing only one record ID. For example, [^Data:Relationships(38)] [^Data:Relationships(55)], instead of [^Data:Relationships(38, 55)].
For example:
"Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (2, 7, 34, 46, 64, +more)]. He is also CEO of company X [Data: Reports (1, 3)]"
"Person X is the owner of Company Y and subject to many allegations of wrongdoing [^Data:Reports(2)] [^Data:Reports(7)] [^Data:Reports(34)] [^Data:Reports(46)] [^Data:Reports(64,+more)]. He is also CEO of company X [^Data:Reports(1)] [^Data:Reports(3)]"
where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data record.
@@ -60,11 +60,11 @@
The response should also preserve all the data references previously included in the analysts' reports, but do not mention the roles of multiple analysts in the analysis process.
**Do not list more than 5 record ids in a single reference**. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more.
**References should be listed with a single record ID per citation**, with each citation containing only one record ID. For example, [^Data:Relationships(38)] [^Data:Relationships(55)], instead of [^Data:Relationships(38, 55)].
For example:
"Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (2, 7, 34, 46, 64, +more)]. He is also CEO of company X [Data: Reports (1, 3)]"
"Person X is the owner of Company Y and subject to many allegations of wrongdoing [^Data:Reports(2)] [^Data:Reports(7)] [^Data:Reports(34)] [^Data:Reports(46)] [^Data:Reports(64,+more)]. He is also CEO of company X [^Data:Reports(1)] [^Data:Reports(3)]"
where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data record.
35 changes: 5 additions & 30 deletions graphrag/query/structured_search/local_search/system_prompt.py
@@ -17,13 +17,15 @@
Points supported by data should list their data references as follows:
"This is an example sentence supported by multiple data references [Data: <dataset name> (record ids); <dataset name> (record ids)]."
"This is an example sentence supported by multiple data references [^Data:<dataset name>(record id)] [^Data:<dataset name>(record id)]."
Do not list more than 5 record ids in a single reference. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more.
The <dataset name> should be one of Entities, Relationships, Claims, Sources, Reports.
**References should be listed with a single record ID per citation**, with each citation containing only one record ID. For example, [^Data:Relationships(38)] [^Data:Relationships(55)], instead of [^Data:Relationships(38, 55)].
For example:
"Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Sources (15, 16), Reports (1), Entities (5, 7); Relationships (23); Claims (2, 7, 34, 46, 64, +more)]."
"Person X is the owner of Company Y and subject to many allegations of wrongdoing [^Data:Sources(15)] [^Data:Sources(16)] [^Data:Reports(1)] [^Data:Entities(5)] [^Data:Entities(7)] [^Data:Relationships(23)] [^Data:Claims(2)] [^Data:Claims(7)] [^Data:Claims(34)] [^Data:Claims(46)] [^Data:Claims(64,+more)]."
where 15, 16, 1, 5, 7, 23, 2, 7, 34, 46, and 64 represent the id (not the index) of the relevant data record.
@@ -39,31 +41,4 @@
{context_data}
---Goal---
Generate a response of the target length and format that responds to the user's question, summarizing all information in the input data tables appropriate for the response length and format, and incorporating any relevant general knowledge.
If you don't know the answer, just say so. Do not make anything up.
Points supported by data should list their data references as follows:
"This is an example sentence supported by multiple data references [Data: <dataset name> (record ids); <dataset name> (record ids)]."
Do not list more than 5 record ids in a single reference. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more.
For example:
"Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Sources (15, 16), Reports (1), Entities (5, 7); Relationships (23); Claims (2, 7, 34, 46, 64, +more)]."
where 15, 16, 1, 5, 7, 23, 2, 7, 34, 46, and 64 represent the id (not the index) of the relevant data record.
Do not include information where the supporting evidence for it is not provided.
---Target response length and format---
{response_type}
Add sections and commentary to the response as appropriate for the length and format. Style the response in markdown.
"""
