forked from microsoft/graphrag
* Chore/dependency cleanup (microsoft#1169)
  * fix dependencies with deptry
  * change order in pyproject.toml
  * fix
  * Dependency updates and cleanup
  * Future required
  Co-authored-by: Florian Maas <[email protected]>
* Bump path-to-regexp from 6.2.1 to 6.3.0 in /docsite (microsoft#1130)
  Bumps [path-to-regexp](https://github.com/pillarjs/path-to-regexp) from 6.2.1 to 6.3.0.
  - [Release notes](https://github.com/pillarjs/path-to-regexp/releases)
  - [Changelog](https://github.com/pillarjs/path-to-regexp/blob/master/History.md)
  - [Commits](pillarjs/path-to-regexp@v6.2.1...v6.3.0)
  updated-dependencies:
  - dependency-name: path-to-regexp
    dependency-type: indirect
  Signed-off-by: dependabot[bot] <[email protected]>
  Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Collapse create final relationships (microsoft#1158)
  * Collapse pre/post embedding workflows
  * Semver
  * Fix smoke tests
  Co-authored-by: Alonso Guevara <[email protected]>
* Bump JamesIves/github-pages-deploy-action from 4.6.3 to 4.6.4 (microsoft#1104)
  Bumps [JamesIves/github-pages-deploy-action](https://github.com/jamesives/github-pages-deploy-action) from 4.6.3 to 4.6.4.
  - [Release notes](https://github.com/jamesives/github-pages-deploy-action/releases)
  - [Commits](JamesIves/github-pages-deploy-action@v4.6.3...v4.6.4)
  updated-dependencies:
  - dependency-name: JamesIves/github-pages-deploy-action
    dependency-type: direct:production
    update-type: version-update:semver-patch
  Signed-off-by: dependabot[bot] <[email protected]>
  Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
  Co-authored-by: Alonso Guevara <[email protected]>
* Release v0.3.6 (microsoft#1172)
* Remove redundant code from error-handling code in GlobalSearch (microsoft#1170)
  * remove a redundant retry
  * semver
  * formatting
  Co-authored-by: Alonso Guevara <[email protected]>
* Incremental indexing/update old outputs (microsoft#1155)
  * Create entrypoint for cli and api (microsoft#1067)
  * Add cli and api entrypoints for update index
  * Semver
  * Update docs
  * Run tests on feature branch main
  * Better /main handling in tests
  * Incremental indexing/file delta (microsoft#1123)
  * Calculate new inputs and deleted inputs on update
  * Clear ruff checks
  * Fix pyright
  * Update Final Entities merging in new and existing entities from delta
  * Update formatting
* Collapse create_final_nodes (microsoft#1171)
  * Collapse create_final_nodes
  * Update smoke tests
  * Typo
  Co-authored-by: Alonso Guevara <[email protected]>
* Fix typo in documentation for customizability (microsoft#1160)
  Corrected a misspelling of 'customizability' in the env_vars.md documentation. This change ensures clarity and accuracy in the description of input data handling configurations.
  Co-authored-by: Alonso Guevara <[email protected]>

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Alonso Guevara <[email protected]>
Co-authored-by: Florian Maas <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Nathan Evans <[email protected]>
Co-authored-by: Chris Trevino <[email protected]>
Co-authored-by: JunHo Kim (김준호) <[email protected]>
1 parent e02425c · commit 35ad5e2 · 28 changed files with 1,789 additions and 1,654 deletions
@@ -58,7 +58,7 @@ jobs:
         run: find docsite/_site

       - name: Deploy to GitHub Pages
-        uses: JamesIves/github-pages-deploy-action@v4.6.3
+        uses: JamesIves/github-pages-deploy-action@v4.6.4
         with:
           branch: gh-pages
           folder: docsite/_site
@@ -0,0 +1,14 @@
{
    "changes": [
        {
            "description": "Collapse create_final_relationships.",
            "type": "patch"
        },
        {
            "description": "Dependency update and cleanup",
            "type": "patch"
        }
    ],
    "created_at": "2024-09-20T00:09:13+00:00",
    "version": "0.3.6"
}
@@ -0,0 +1,4 @@
{
    "type": "patch",
    "description": "Calculate new inputs and deleted inputs on update"
}
@@ -0,0 +1,4 @@
{
    "type": "patch",
    "description": "Merge existing and new entities, updating values accordingly"
}
@@ -0,0 +1,4 @@
{
    "type": "patch",
    "description": "Collapse create_final_relationships."
}
@@ -0,0 +1,4 @@
{
    "type": "patch",
    "description": "remove redundant error-handling code from global-search"
}
@@ -0,0 +1,4 @@
{
    "type": "patch",
    "description": "Collapse create-final-nodes."
}
@@ -0,0 +1,4 @@
# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License

"""Incremental Indexing main module definition."""
@@ -0,0 +1,192 @@
# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License

"""Dataframe operations and utils for Incremental Indexing."""

import os
from dataclasses import dataclass

import numpy as np
import pandas as pd

from graphrag.index.storage.typing import PipelineStorage
from graphrag.utils.storage import _load_table_from_storage

mergeable_outputs = [
    "create_final_documents",
    "create_final_entities",
    "create_final_relationships",
]


@dataclass
class InputDelta:
    """Dataclass to hold the input delta.

    Attributes
    ----------
    new_inputs : pd.DataFrame
        The new inputs.
    deleted_inputs : pd.DataFrame
        The deleted inputs.
    """

    new_inputs: pd.DataFrame
    deleted_inputs: pd.DataFrame


async def get_delta_docs(
    input_dataset: pd.DataFrame, storage: PipelineStorage
) -> InputDelta:
    """Get the delta between the input dataset and the final documents.

    Parameters
    ----------
    input_dataset : pd.DataFrame
        The input dataset.
    storage : PipelineStorage
        The pipeline storage.

    Returns
    -------
    InputDelta
        The input delta, with new inputs and deleted inputs.
    """
    final_docs = await _load_table_from_storage(
        "create_final_documents.parquet", storage
    )

    # Select distinct titles from the final docs and from the dataset
    previous_docs: list[str] = final_docs["title"].unique().tolist()
    dataset_docs: list[str] = input_dataset["title"].unique().tolist()

    # Get the new documents (using loc to ensure a DataFrame)
    new_docs = input_dataset.loc[~input_dataset["title"].isin(previous_docs)]

    # Get the deleted documents (again using loc to ensure a DataFrame)
    deleted_docs = final_docs.loc[~final_docs["title"].isin(dataset_docs)]

    return InputDelta(new_docs, deleted_docs)


async def update_dataframe_outputs(
    dataframe_dict: dict[str, pd.DataFrame],
    storage: PipelineStorage,
) -> None:
    """Update the mergeable outputs.

    Parameters
    ----------
    dataframe_dict : dict[str, pd.DataFrame]
        The dictionary of dataframes.
    storage : PipelineStorage
        The storage used to store the dataframes.
    """
    await _concat_dataframes("create_base_text_units", dataframe_dict, storage)
    await _concat_dataframes("create_final_documents", dataframe_dict, storage)

    old_entities = await _load_table_from_storage(
        "create_final_entities.parquet", storage
    )
    delta_entities = dataframe_dict["create_final_entities"]

    merged_entities_df, _ = _group_and_resolve_entities(old_entities, delta_entities)

    # Save the updated entities back to storage
    # TODO: Using _new in the meantime, to compare outputs without overwriting the original
    await storage.set(
        "create_final_entities_new.parquet", merged_entities_df.to_parquet()
    )


async def _concat_dataframes(name, dataframe_dict, storage):
    """Concatenate the dataframes.

    Parameters
    ----------
    name : str
        The name of the dataframe to concatenate.
    dataframe_dict : dict[str, pd.DataFrame]
        The dictionary of dataframes from a pipeline run.
    storage : PipelineStorage
        The storage used to store the dataframes.
    """
    old_df = await _load_table_from_storage(f"{name}.parquet", storage)
    delta_df = dataframe_dict[name]

    # Merge the old and delta dataframes
    final_df = pd.concat([old_df, delta_df], copy=False)

    # TODO: Using _new in the meantime, to compare outputs without overwriting the original
    await storage.set(f"{name}_new.parquet", final_df.to_parquet())


def _group_and_resolve_entities(
    df_a: pd.DataFrame, df_b: pd.DataFrame
) -> tuple[pd.DataFrame, dict]:
    """Group and resolve entities.

    Parameters
    ----------
    df_a : pd.DataFrame
        The first dataframe.
    df_b : pd.DataFrame
        The second dataframe.

    Returns
    -------
    pd.DataFrame
        The resolved dataframe.
    dict
        The id mapping for existing entities, in the form {df_b.id: df_a.id}.
    """
    # If a name exists in both A and B, build a mapping {B.id: A.id}
    merged = df_b[["id", "name"]].merge(
        df_a[["id", "name"]],
        on="name",
        suffixes=("_B", "_A"),
        copy=False,
    )
    id_mapping = dict(zip(merged["id_B"], merged["id_A"], strict=True))

    # Concat A and B
    combined = pd.concat([df_a, df_b], copy=False)

    # Group by name and resolve conflicts
    aggregated = (
        combined.groupby("name")
        .agg({
            "id": "first",
            "type": "first",
            "human_readable_id": "first",
            "graph_embedding": "first",
            "description": lambda x: os.linesep.join(x.astype(str)),  # Ensure str
            # Flatten the arrays of text unit ids into a single comma-separated string
            "text_unit_ids": lambda x: ",".join(str(i) for j in x.tolist() for i in j),
            # Keep the embedding only when the original description wasn't modified
            "description_embedding": lambda x: x.iloc[0] if len(x) == 1 else np.nan,
        })
        .reset_index()
    )

    # Force the result into a DataFrame
    resolved: pd.DataFrame = pd.DataFrame(aggregated)

    # Recreate the human-readable id with an autonumeric
    resolved["human_readable_id"] = range(len(resolved))

    # Reorder columns to keep consistency
    resolved = resolved.loc[
        :,
        [
            "id",
            "name",
            "description",
            "type",
            "human_readable_id",
            "graph_embedding",
            "text_unit_ids",
            "description_embedding",
        ],
    ]

    return resolved, id_mapping
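The delta and merge logic in this new module can be illustrated with a small, self-contained sketch. The toy dataframes and scalar stand-ins for embeddings below are hypothetical (real embeddings are vectors, and the real code keys documents on `title`); only the `isin`/`merge`/`groupby` pattern mirrors the code above:

```python
import os

import numpy as np
import pandas as pd

# Hypothetical entities from a previous run (df_a) and from a delta run (df_b).
# Scalar floats stand in for embedding vectors, purely for brevity.
df_a = pd.DataFrame({
    "id": ["a1", "a2"],
    "name": ["Alice", "Bob"],
    "description": ["Alice, a researcher", "Bob, an engineer"],
    "text_unit_ids": [["t1"], ["t2"]],
    "description_embedding": [0.1, 0.2],
})
df_b = pd.DataFrame({
    "id": ["b1", "b2"],
    "name": ["Alice", "Carol"],
    "description": ["Alice, seen in a new document", "Carol, a new entity"],
    "text_unit_ids": [["t3"], ["t4"]],
    "description_embedding": [0.3, 0.4],
})

# Delta, in the spirit of get_delta_docs: rows only in the new run are "new",
# rows only in the old run are "deleted".
new_rows = df_b.loc[~df_b["name"].isin(df_a["name"])]        # Carol
deleted_rows = df_a.loc[~df_a["name"].isin(df_b["name"])]    # Bob

# Map ids of entities present in both frames, as {B.id: A.id}.
merged = df_b[["id", "name"]].merge(
    df_a[["id", "name"]], on="name", suffixes=("_B", "_A")
)
id_mapping = dict(zip(merged["id_B"], merged["id_A"]))

# Concat and resolve conflicts per name, as _group_and_resolve_entities does.
combined = pd.concat([df_a, df_b])
aggregated = (
    combined.groupby("name")
    .agg({
        "id": "first",  # the older id wins
        "description": lambda x: os.linesep.join(x.astype(str)),
        "text_unit_ids": lambda x: ",".join(str(i) for j in x.tolist() for i in j),
        # Drop the embedding when descriptions were merged (it is now stale)
        "description_embedding": lambda x: x.iloc[0] if len(x) == 1 else np.nan,
    })
    .reset_index()
)
```

Here "Alice" appears in both runs, so her two descriptions are concatenated, her text unit ids become `"t1,t3"`, her merged row keeps the old id `a1`, and her now-stale embedding is dropped; "Bob" and "Carol" pass through unchanged.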