Merge branch 'microsoft:main' into main
KylinMountain authored Nov 7, 2024
2 parents 4bd5f88 + baa261c commit 8a8f862
Showing 24 changed files with 186 additions and 35 deletions.
7 changes: 5 additions & 2 deletions .github/workflows/python-ci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -2,8 +2,8 @@ name: Python CI
 on:
   push:
     branches:
-      - "**/main" # Matches branches like feature/main
-      - "main" # Matches the main branch
+      - "**/main" # match branches like feature/main
+      - "main" # match the main branch
   pull_request:
     types:
       - opened
Expand All @@ -13,6 +13,9 @@ on:
     branches:
       - "**/main"
       - "main"
+    paths-ignore:
+      - "**/*.md"
+      - ".semversioner/**"

 permissions:
   contents: read
Expand Down
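
The effect of the `paths-ignore` filter added to these workflows can be sketched in Python. This is a rough illustration, not GitHub's implementation: the helper names `is_ignored` and `should_run` are hypothetical, and the two patterns are modeled as plain suffix and prefix checks instead of GitHub's full glob syntax. The rule being modeled is that a pull request skips the workflow only when every changed file matches an ignored pattern.

```python
# Simplified stand-ins for the two patterns in the workflow (assumption:
# suffix/prefix checks are enough for these specific globs).
IGNORED_SUFFIX = ".md"             # models "**/*.md"
IGNORED_PREFIX = ".semversioner/"  # models ".semversioner/**"


def is_ignored(path: str) -> bool:
    """Return True if a changed file falls under an ignored pattern."""
    return path.endswith(IGNORED_SUFFIX) or path.startswith(IGNORED_PREFIX)


def should_run(changed_files: list[str]) -> bool:
    """The workflow runs if at least one changed file is not ignored."""
    return any(not is_ignored(path) for path in changed_files)


print(should_run(["docs/get_started.md", "README.md"]))
print(should_run(["docs/get_started.md", "graphrag/api/query.py"]))
```

A docs-only pull request therefore skips the job, while a pull request that mixes docs and code still triggers it.
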
7 changes: 5 additions & 2 deletions .github/workflows/python-integration-tests.yml
Expand Up @@ -2,8 +2,8 @@ name: Python Integration Tests
 on:
   push:
     branches:
-      - "**/main" # Matches branches like feature/main
-      - "main" # Matches the main branch
+      - "**/main" # match branches like feature/main
+      - "main" # match the main branch
   pull_request:
     types:
       - opened
Expand All @@ -13,6 +13,9 @@ on:
     branches:
       - "**/main"
       - "main"
+    paths-ignore:
+      - "**/*.md"
+      - ".semversioner/**"

 permissions:
   contents: read
Expand Down
7 changes: 5 additions & 2 deletions .github/workflows/python-notebook-tests.yml
Expand Up @@ -2,8 +2,8 @@ name: Python Notebook Tests
 on:
   push:
     branches:
-      - "**/main" # Matches branches like feature/main
-      - "main" # Matches the main branch
+      - "**/main" # match branches like feature/main
+      - "main" # match the main branch
   pull_request:
     types:
       - opened
Expand All @@ -13,6 +13,9 @@ on:
     branches:
       - "**/main"
       - "main"
+    paths-ignore:
+      - "**/*.md"
+      - ".semversioner/**"

 permissions:
   contents: read
Expand Down
7 changes: 5 additions & 2 deletions .github/workflows/python-smoke-tests.yml
Expand Up @@ -2,8 +2,8 @@ name: Python Smoke Tests
 on:
   push:
     branches:
-      - "**/main" # Matches branches like feature/main
-      - "main" # Matches the main branch
+      - "**/main" # match branches like feature/main
+      - "main" # match the main branch
   pull_request:
     types:
       - opened
Expand All @@ -13,6 +13,9 @@ on:
     branches:
       - "**/main"
       - "main"
+    paths-ignore:
+      - "**/*.md"
+      - ".semversioner/**"

 permissions:
   contents: read
Expand Down
4 changes: 4 additions & 0 deletions .semversioner/next-release/patch-20241031001404444046.json
@@ -0,0 +1,4 @@
{
"type": "patch",
"description": "add visualization guide to doc site"
}
4 changes: 4 additions & 0 deletions .semversioner/next-release/patch-20241106094228896260.json
@@ -0,0 +1,4 @@
{
"type": "patch",
"description": "fix streaming output error"
}
4 changes: 4 additions & 0 deletions .semversioner/next-release/patch-20241106184714830526.json
@@ -0,0 +1,4 @@
{
"type": "patch",
"description": "Allow some cicd jobs to skip PRs dedicated to doc updates only."
}
4 changes: 4 additions & 0 deletions .semversioner/next-release/patch-20241106193551070554.json
@@ -0,0 +1,4 @@
{
"type": "patch",
"description": "Fix a file paths issue in the viz guide."
}
4 changes: 4 additions & 0 deletions .semversioner/next-release/patch-20241106225803494336.json
@@ -0,0 +1,4 @@
{
"type": "patch",
"description": "Fix optional covariates update in incremental indexing"
}
4 changes: 4 additions & 0 deletions .semversioner/next-release/patch-20241106232311738461.json
@@ -0,0 +1,4 @@
{
"type": "patch",
"description": "Raise error on empty deltas for inc indexing"
}
5 changes: 4 additions & 1 deletion dictionary.txt
Expand Up @@ -96,6 +96,9 @@ onclick
pymdownx
linenums
twemoji
Gephi
gephi
Gephi's

# Verbs
binarize
Expand Down Expand Up @@ -183,4 +186,4 @@ kwds
astrotechnician
epitheg
unspooled
unnavigated
unnavigated
39 changes: 19 additions & 20 deletions docs/blog_posts.md
@@ -1,35 +1,34 @@
<div class="grid cards" markdown>

- [:octicons-arrow-right-24: **GraphRAG: Unlocking LLM discovery on narrative private data**](https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/)

***

<h6>Published February 13, 2024
<div class="grid cards" markdown>

By [Jonathan Larson](https://www.microsoft.com/en-us/research/people/jolarso/), Senior Principal Data Architect; [Steven Truitt](https://www.microsoft.com/en-us/research/people/steventruitt/), Principal Program Manager</h6>
- [:octicons-arrow-right-24: __GraphRAG: Unlocking LLM discovery on narrative private data__](https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/)

- [:octicons-arrow-right-24: **GraphRAG: New tool for complex data discovery now on GitHub**](https://www.microsoft.com/en-us/research/blog/graphrag-new-tool-for-complex-data-discovery-now-on-github/)
---
<h6>Published February 13, 2024

***
By [Jonathan Larson](https://www.microsoft.com/en-us/research/people/jolarso/), Senior Principal Data Architect; [Steven Truitt](https://www.microsoft.com/en-us/research/people/steventruitt/), Principal Program Manager</h6>


<h6>Published July 2, 2024
- [:octicons-arrow-right-24: __GraphRAG: New tool for complex data discovery now on GitHub__](https://www.microsoft.com/en-us/research/blog/graphrag-new-tool-for-complex-data-discovery-now-on-github/)

By [Darren Edge](https://www.microsoft.com/en-us/research/people/daedge/), Senior Director; [Ha Trinh](https://www.microsoft.com/en-us/research/people/trinhha/), Senior Data Scientist; [Steven Truitt](https://www.microsoft.com/en-us/research/people/steventruitt/), Principal Program Manager; [Jonathan Larson](https://www.microsoft.com/en-us/research/people/jolarso/), Senior Principal Data Architect</h6>
---
<h6>Published July 2, 2024

- [:octicons-arrow-right-24: **GraphRAG auto-tuning provides rapid adaptation to new domains**](https://www.microsoft.com/en-us/research/blog/graphrag-auto-tuning-provides-rapid-adaptation-to-new-domains/)
By [Darren Edge](https://www.microsoft.com/en-us/research/people/daedge/), Senior Director; [Ha Trinh](https://www.microsoft.com/en-us/research/people/trinhha/), Senior Data Scientist; [Steven Truitt](https://www.microsoft.com/en-us/research/people/steventruitt/), Principal Program Manager; [Jonathan Larson](https://www.microsoft.com/en-us/research/people/jolarso/), Senior Principal Data Architect</h6>

***

<h6>Published September 9, 2024
- [:octicons-arrow-right-24: __GraphRAG auto-tuning provides rapid adaptation to new domains__](https://www.microsoft.com/en-us/research/blog/graphrag-auto-tuning-provides-rapid-adaptation-to-new-domains/)

By [Alonso Guevara Fernández](https://www.microsoft.com/en-us/research/people/alonsog/), Sr. Software Engineer; Katy Smith, Data Scientist II; [Joshua Bradley](https://www.microsoft.com/en-us/research/people/joshbradley/), Senior Data Scientist; [Darren Edge](https://www.microsoft.com/en-us/research/people/daedge/), Senior Director; [Ha Trinh](https://www.microsoft.com/en-us/research/people/trinhha/), Senior Data Scientist; [Sarah Smith](https://www.microsoft.com/en-us/research/people/smithsarah/), Senior Program Manager; [Ben Cutler](https://www.microsoft.com/en-us/research/people/bcutler/), Senior Director; [Steven Truitt](https://www.microsoft.com/en-us/research/people/steventruitt/), Principal Program Manager; [Jonathan Larson](https://www.microsoft.com/en-us/research/people/jolarso/), Senior Principal Data Architect
---
<h6>Published September 9, 2024

- [:octicons-arrow-right-24: **Introducing DRIFT Search: Combining global and local search methods to improve quality and efficiency**](https://www.microsoft.com/en-us/research/blog/introducing-drift-search-combining-global-and-local-search-methods-to-improve-quality-and-efficiency/)
By [Alonso Guevara Fernández](https://www.microsoft.com/en-us/research/people/alonsog/), Sr. Software Engineer; Katy Smith, Data Scientist II; [Joshua Bradley](https://www.microsoft.com/en-us/research/people/joshbradley/), Senior Data Scientist; [Darren Edge](https://www.microsoft.com/en-us/research/people/daedge/), Senior Director; [Ha Trinh](https://www.microsoft.com/en-us/research/people/trinhha/), Senior Data Scientist; [Sarah Smith](https://www.microsoft.com/en-us/research/people/smithsarah/), Senior Program Manager; [Ben Cutler](https://www.microsoft.com/en-us/research/people/bcutler/), Senior Director; [Steven Truitt](https://www.microsoft.com/en-us/research/people/steventruitt/), Principal Program Manager; [Jonathan Larson](https://www.microsoft.com/en-us/research/people/jolarso/), Senior Principal Data Architect</h6>

***
- [:octicons-arrow-right-24: __Introducing DRIFT Search: Combining global and local search methods to improve quality and efficiency__](https://www.microsoft.com/en-us/research/blog/introducing-drift-search-combining-global-and-local-search-methods-to-improve-quality-and-efficiency/)

<h6>Published October 31, 2024
---
<h6>Published October 31, 2024

By Julian Whiting , Senior Machine Learning Engineer; Zachary Hills , Senior Software Engineer; [Alonso Guevara Fernández](https://www.microsoft.com/en-us/research/people/alonsog/), Sr. Software Engineer; [Ha Trinh](https://www.microsoft.com/en-us/research/people/trinhha/), Senior Data Scientist; Adam Bradley , Managing Partner, Strategic Research; [Jonathan Larson](https://www.microsoft.com/en-us/research/people/jolarso/), Senior Principal Data Architect
By Julian Whiting, Senior Machine Learning Engineer; Zachary Hills, Senior Software Engineer; [Alonso Guevara Fernández](https://www.microsoft.com/en-us/research/people/alonsog/), Sr. Software Engineer; [Ha Trinh](https://www.microsoft.com/en-us/research/people/trinhha/), Senior Data Scientist; Adam Bradley, Managing Partner, Strategic Research; [Jonathan Larson](https://www.microsoft.com/en-us/research/people/jolarso/), Senior Principal Data Architect</h6>

</div>
</div>
3 changes: 3 additions & 0 deletions docs/get_started.md
Expand Up @@ -125,3 +125,6 @@ graphrag query \
```

Please refer to [Query Engine](query/overview.md) docs for detailed information about how to leverage our Local and Global search mechanisms for extracting meaningful insights from data after the Indexer has wrapped up execution.

# Visualizing the Graph
Check out our [visualization guide](visualization_guide.md) for a more interactive experience in debugging and exploring the knowledge graph.
Binary file added docs/img/viz_guide/gephi-appearance-pane.png
Binary file added docs/img/viz_guide/gephi-layout-pane.png
6 changes: 3 additions & 3 deletions docs/query/drift_search.md
Expand Up @@ -11,9 +11,9 @@ DRIFT search (Dynamic Reasoning and Inference with Flexible Traversal) builds up
<p align="center">
<img src="../../img/drift-search-diagram.png" alt="Figure 1. An entire DRIFT search hierarchy highlighting the three core phases of the DRIFT search process." align="center" />
</p>
<p align="center"><i>
Figure 1. An entire DRIFT search hierarchy highlighting the three core phases of the DRIFT search process. A (Primer): DRIFT compares the user’s query with the top K most semantically relevant community reports, generating a broad initial answer and follow-up questions to steer further exploration. B (Follow-Up): DRIFT uses local search to refine queries, producing additional intermediate answers and follow-up questions that enhance specificity, guiding the engine towards context-rich information. A glyph on each node in the diagram shows the confidence the algorithm has to continue the query expansion step. C (Output Hierarchy): The final output is a hierarchical structure of questions and answers ranked by relevance, reflecting a balanced mix of global insights and local refinements, making the results adaptable and comprehensive.</i></p>
<p align="center">
<p align="center"><i><small>
Figure 1. An entire DRIFT search hierarchy highlighting the three core phases of the DRIFT search process. A (Primer): DRIFT compares the user’s query with the top K most semantically relevant community reports, generating a broad initial answer and follow-up questions to steer further exploration. B (Follow-Up): DRIFT uses local search to refine queries, producing additional intermediate answers and follow-up questions that enhance specificity, guiding the engine towards context-rich information. A glyph on each node in the diagram shows the confidence the algorithm has to continue the query expansion step. C (Output Hierarchy): The final output is a hierarchical structure of questions and answers ranked by relevance, reflecting a balanced mix of global insights and local refinements, making the results adaptable and comprehensive.</small></i></p>


DRIFT Search introduces a new approach to local search queries by including community information in the search process. This greatly expands the breadth of the query’s starting point and leads to retrieval and usage of a far higher variety of facts in the final answer. This addition expands the GraphRAG query engine by providing a more comprehensive option for local search, which uses community insights to refine a query into detailed follow-up questions.

Expand Down
100 changes: 100 additions & 0 deletions docs/visualization_guide.md
@@ -0,0 +1,100 @@
# Visualizing and Debugging Your Knowledge Graph

The following step-by-step guide walks through the process to visualize a knowledge graph after it's been constructed by graphrag. Note that some of the settings recommended below are based on our own experience of what works well. Feel free to change and explore other settings for a better visualization experience!

## 1. Run the Pipeline
Before building an index, please review your `settings.yaml` configuration file and ensure that graphml snapshots are enabled.
```yaml
snapshots:
  graphml: true
```
(Optional) To support other visualization tools and exploration, additional parameters can be enabled that provide access to vector embeddings.
```yaml
embed_graph:
  enabled: true # will generate node2vec embeddings for nodes
umap:
  enabled: true # will generate UMAP embeddings for nodes
```
After running the indexing pipeline over your data, there will be an output folder (defined by the `storage.base_dir` setting).

- **Output Folder**: Contains artifacts from the LLM’s indexing pass.

## 2. Locate the Knowledge Graph
In the output folder, look for a file named `merged_graph.graphml`. GraphML is a standard [file format](http://graphml.graphdrawing.org) supported by many visualization tools. We recommend trying [Gephi](https://gephi.org).
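
Before opening the file in Gephi, a quick sanity check can confirm that the export is not empty. The following is a minimal sketch using only the Python standard library. The function name `summarize_graphml` and the inline sample document are illustrative assumptions; in practice you would pass in the text of `merged_graph.graphml` from your output folder.

```python
import xml.etree.ElementTree as ET

# XML namespace used by GraphML documents
NS = {"g": "http://graphml.graphdrawing.org/xmlns"}


def summarize_graphml(xml_text: str) -> tuple[int, int]:
    """Return (node_count, edge_count) for a GraphML document."""
    root = ET.fromstring(xml_text)
    nodes = root.findall(".//g:node", NS)
    edges = root.findall(".//g:edge", NS)
    return len(nodes), len(edges)


# Tiny hand-written sample standing in for merged_graph.graphml
SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph edgedefault="undirected">
    <node id="A"/>
    <node id="B"/>
    <node id="C"/>
    <edge source="A" target="B"/>
    <edge source="B" target="C"/>
  </graph>
</graphml>"""

node_count, edge_count = summarize_graphml(SAMPLE)
print(f"{node_count} nodes, {edge_count} edges")
```
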

## 3. Open the Graph in Gephi
1. Install and open Gephi.
2. Navigate to the `output` folder containing the various parquet files.
3. Import the `merged_graph.graphml` file into Gephi. This will result in a fairly plain view of the undirected graph nodes and edges.

<p align="center">
<img src="../img/viz_guide/gephi-initial-graph-example.png" alt="A basic graph visualization by Gephi" width="300"/>
</p>

## 4. Install the Leiden Algorithm Plugin
1. Go to `Tools` -> `Plugins`.
2. Search for "Leiden Algorithm".
3. Click `Install` and restart Gephi.

## 5. Run Statistics
1. In the `Statistics` tab on the right, click `Run` for `Average Degree` and `Leiden Algorithm`.

<p align="center">
<img src="../img/viz_guide/gephi-network-overview-settings.png" alt="A view of Gephi's network overview settings" width="300"/>
</p>

2. For the Leiden Algorithm, adjust the settings:
   - **Quality function**: Modularity
   - **Resolution**: 1
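
To give a feel for what the `Modularity` quality function measures, here is a small sketch that scores a given partition of an undirected, unweighted graph. It is not the Leiden algorithm itself and not Gephi's code: it only evaluates the standard modularity formula, with a `resolution` parameter playing the role of the setting above. The function name `modularity` and the toy graph are assumptions made for illustration, and the graph is assumed to have at least one edge.

```python
from collections import defaultdict


def modularity(edges, community, resolution=1.0):
    """Modularity of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs; community: dict mapping node -> community id.
    """
    m = len(edges)
    internal = defaultdict(int)    # edges with both endpoints in the community
    degree_sum = defaultdict(int)  # total degree of the community's nodes
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(
        internal[c] / m - resolution * (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Two triangles joined by a single bridge edge (2, 3)
EDGES = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_clusters = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_cluster = {node: "a" for node in range(6)}

print(modularity(EDGES, two_clusters))
print(modularity(EDGES, one_cluster))
```

Splitting the graph at the bridge scores higher than lumping every node into one cluster, which is the kind of partition a modularity-based community detection run tends to favor.
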

## 6. Color the Graph by Clusters
1. Go to the `Appearance` pane in the upper left side of Gephi.

<p align="center">
<img src="../img/viz_guide/gephi-appearance-pane.png" alt="A view of Gephi's appearance pane" width="500"/>
</p>

2. Select `Nodes`, then `Partition`, and click the color palette icon in the upper right.
3. Choose `Cluster` from the dropdown.
4. Click the `Palette...` hyperlink, then `Generate...`.
5. Uncheck `Limit number of colors`, click `Generate`, and then `Ok`.
6. Click `Apply` to color the graph. This will color the graph based on the partitions discovered by Leiden.

## 7. Resize Nodes by Degree Centrality
1. In the `Appearance` pane in the upper left, select `Nodes` -> `Ranking`.
2. Select the `Sizing` icon in the upper right.
3. Choose `Degree` and set:
   - **Min**: 10
   - **Max**: 150
4. Click `Apply`.
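
The sizing step can be approximated outside Gephi as well. The sketch below assumes a plain linear interpolation between the smallest and largest degree; Gephi offers other interpolation curves, so treat the numbers as an approximation. The function name `node_sizes` and the sample edge list are hypothetical.

```python
from collections import Counter


def node_sizes(edges, min_size=10.0, max_size=150.0):
    """Map each node's degree linearly onto the range [min_size, max_size]."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    lo, hi = min(degree.values()), max(degree.values())
    span = hi - lo
    return {
        # When every node has the same degree, fall back to the minimum size
        node: min_size if span == 0 else min_size + (d - lo) / span * (max_size - min_size)
        for node, d in degree.items()
    }


# "hub" has degree 3, "a" and "b" have degree 2, "c" has degree 1
EDGES = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("a", "b")]
print(node_sizes(EDGES))
```
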

## 8. Lay Out the Graph
1. In the `Layout` tab in the lower left, select `OpenORD`.

<p align="center">
<img src="../img/viz_guide/gephi-layout-pane.png" alt="A view of Gephi's layout pane" width="400"/>
</p>

2. Set `Liquid` and `Expansion` stages to 50, and everything else to 0.
3. Click `Run` and monitor the progress.

## 9. Run ForceAtlas2
1. Select `Force Atlas 2` in the layout options.

<p align="center">
<img src="../img/viz_guide/gephi-layout-forceatlas2-pane.png" alt="A view of Gephi's ForceAtlas2 layout pane" width="400"/>
</p>

2. Adjust the settings:
   - **Scaling**: 15
   - **Dissuade Hubs**: checked
   - **LinLog mode**: unchecked
   - **Prevent Overlap**: checked
3. Click `Run` and wait.
4. Press `Stop` when it looks like the graph nodes have settled and no longer change position significantly.

## 10. Add Text Labels (Optional)
1. Turn on text labels in the appropriate section.
2. Configure and resize them as needed.

Your final graph should now be visually organized and ready for analysis!
2 changes: 1 addition & 1 deletion graphrag/api/query.py
Expand Up @@ -276,7 +276,7 @@ async def local_search_streaming(
     reporter.info(f"Vector Store Args: {redact(vector_store_args)}") # type: ignore

     description_embedding_store = _get_embedding_store(
-        conf_args=vector_store_args, # type: ignore
+        config_args=vector_store_args, # type: ignore
         container_suffix="entity-description",
     )

Expand Down
5 changes: 5 additions & 0 deletions graphrag/index/run/run.py
Expand Up @@ -133,6 +133,11 @@ async def run_pipeline_with_config(
     if is_update_run and update_index_storage:
         delta_dataset = await get_delta_docs(dataset, storage)

+        # Fail on empty delta dataset
+        if delta_dataset.new_inputs.empty:
+            error_msg = "Incremental Indexing Error: No new documents to process."
+            raise ValueError(error_msg)
+
         delta_storage = update_index_storage.child("delta")

         # Run the pipeline on the new documents
Expand Down
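
The guard added in this hunk can be illustrated in isolation. This is a minimal sketch that assumes pandas-like semantics, where `.empty` is true for a frame with no rows. `FakeFrame`, `FakeDelta`, and `ensure_new_inputs` are hypothetical stand-ins written for this example, not graphrag APIs.

```python
from dataclasses import dataclass, field


@dataclass
class FakeFrame:
    """Hypothetical stand-in for a DataFrame; only models `.empty`."""

    rows: list = field(default_factory=list)

    @property
    def empty(self) -> bool:
        return len(self.rows) == 0


@dataclass
class FakeDelta:
    """Hypothetical stand-in for the delta dataset in the hunk above."""

    new_inputs: FakeFrame


def ensure_new_inputs(delta: FakeDelta) -> FakeDelta:
    """Fail fast when an incremental run has nothing new to index."""
    if delta.new_inputs.empty:
        error_msg = "Incremental Indexing Error: No new documents to process."
        raise ValueError(error_msg)
    return delta


try:
    ensure_new_inputs(FakeDelta(FakeFrame()))
except ValueError as err:
    print(err)
```

Raising before the delta storage is created means an update run with no new documents stops with a clear message instead of running the pipeline on an empty input.
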
8 changes: 6 additions & 2 deletions graphrag/index/update/incremental_index.py
Expand Up @@ -120,8 +120,12 @@ async def update_dataframe_outputs(
     )

     # Merge final covariates
-    progress_reporter.info("Updating Final Covariates")
-    await _update_covariates(dataframe_dict, storage, update_storage)
+    if (
+        await storage.has("create_final_covariates.parquet")
+        and "create_final_covariates" in dataframe_dict
+    ):
+        progress_reporter.info("Updating Final Covariates")
+        await _update_covariates(dataframe_dict, storage, update_storage)

     # Merge final nodes and update community ids
     progress_reporter.info("Updating Final Nodes")
Expand Down
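
The condition introduced in this hunk only merges covariates when both the previous index and the new run actually produced them. The sketch below reproduces that check against a toy storage object. `InMemoryStorage` and `should_update_covariates` are hypothetical names for this example; the only behavior assumed from the diff is an awaitable `has()` method and a dictionary keyed by workflow name.

```python
import asyncio


class InMemoryStorage:
    """Hypothetical stand-in exposing only an async has() method."""

    def __init__(self, names):
        self._names = set(names)

    async def has(self, name: str) -> bool:
        return name in self._names


async def should_update_covariates(storage, dataframe_dict) -> bool:
    """Mirror the guard: both the stored file and the new output must exist."""
    return (
        await storage.has("create_final_covariates.parquet")
        and "create_final_covariates" in dataframe_dict
    )


storage = InMemoryStorage(["create_final_covariates.parquet"])
print(asyncio.run(should_update_covariates(storage, {"create_final_covariates": []})))
print(asyncio.run(should_update_covariates(storage, {})))
```
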
1 change: 1 addition & 0 deletions mkdocs.yaml
Expand Up @@ -54,6 +54,7 @@ nav:
- Microsoft Research Blog: "blog_posts.md"
- Extras:
- CLI: "cli.md"
- Visualization Guide: "visualization_guide.md"
- Operation Dulce:
- About: "data/operation_dulce/ABOUT.md"
- Document: "data/operation_dulce/Operation Dulce v2 1 1.md"
Expand Down
