Merge 0.3.6 (#26)
* Chore/dependency cleanup (microsoft#1169)

* fix dependencies with deptry

* change order in pyproject.toml

* fix

* Dependency updates and cleanup

* Future required

---------

Co-authored-by: Florian Maas <[email protected]>

* Bump path-to-regexp from 6.2.1 to 6.3.0 in /docsite (microsoft#1130)

Bumps [path-to-regexp](https://github.com/pillarjs/path-to-regexp) from 6.2.1 to 6.3.0.
- [Release notes](https://github.com/pillarjs/path-to-regexp/releases)
- [Changelog](https://github.com/pillarjs/path-to-regexp/blob/master/History.md)
- [Commits](pillarjs/path-to-regexp@v6.2.1...v6.3.0)

---
updated-dependencies:
- dependency-name: path-to-regexp
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Collapse create final relationships (microsoft#1158)

* Collapse pre/post embedding workflows

* Semver

* Fix smoke tests

---------

Co-authored-by: Alonso Guevara <[email protected]>

* Bump JamesIves/github-pages-deploy-action from 4.6.3 to 4.6.4 (microsoft#1104)

Bumps [JamesIves/github-pages-deploy-action](https://github.com/jamesives/github-pages-deploy-action) from 4.6.3 to 4.6.4.
- [Release notes](https://github.com/jamesives/github-pages-deploy-action/releases)
- [Commits](JamesIves/github-pages-deploy-action@v4.6.3...v4.6.4)

---
updated-dependencies:
- dependency-name: JamesIves/github-pages-deploy-action
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Alonso Guevara <[email protected]>

* Release v0.3.6 (microsoft#1172)

* Remove redundant code from error-handling code in GlobalSearch (microsoft#1170)

* remove a redundant retry

* semver

* formatting

---------

Co-authored-by: Alonso Guevara <[email protected]>

* Incremental indexing/update old outputs (microsoft#1155)

* Create entrypoint for cli and api (microsoft#1067)

* Add cli and api entrypoints for update index

* Semver

* Update docs

* Run tests on feature branch main

* Better /main handling in tests

* Incremental indexing/file delta (microsoft#1123)

* Calculate new inputs and deleted inputs on update

* Semver

* Clear ruff checks

* Fix pyright

* Fix PyRight

* Ruff again

* Update Final Entities, merging in new and existing entities from the delta

* Update formatting

* Pyright

* Ruff

* Fix for pyright

* Yet Another Pyright test

* Pyright

* Format

* Collapse create_final_nodes (microsoft#1171)

* Collapse create_final_nodes

* Update smoke tests

* Typo

---------

Co-authored-by: Alonso Guevara <[email protected]>

* Fix typo in documentation for customizability (microsoft#1160)

Corrected a misspelling of 'customizability' in the env_vars.md documentation. This change ensures clarity and accuracy in the description of input data handling configurations.

Co-authored-by: Alonso Guevara <[email protected]>

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Alonso Guevara <[email protected]>
Co-authored-by: Florian Maas <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Nathan Evans <[email protected]>
Co-authored-by: Chris Trevino <[email protected]>
Co-authored-by: JunHo Kim (김준호) <[email protected]>
7 people authored Sep 23, 2024
1 parent e02425c commit 35ad5e2
Showing 28 changed files with 1,789 additions and 1,654 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/gh-pages.yml
@@ -58,7 +58,7 @@ jobs:
run: find docsite/_site

- name: Deploy to GitHub Pages
uses: JamesIves/github-pages-deploy-action@v4.6.3
uses: JamesIves/github-pages-deploy-action@v4.6.4
with:
branch: gh-pages
folder: docsite/_site
14 changes: 14 additions & 0 deletions .semversioner/0.3.6.json
@@ -0,0 +1,14 @@
{
"changes": [
{
"description": "Collapse create_final_relationships.",
"type": "patch"
},
{
"description": "Dependency update and cleanup",
"type": "patch"
}
],
"created_at": "2024-09-20T00:09:13+00:00",
"version": "0.3.6"
}
4 changes: 4 additions & 0 deletions .semversioner/next-release/patch-20240911201935470388.json
@@ -0,0 +1,4 @@
{
"type": "patch",
"description": "Calculate new inputs and deleted inputs on update"
}
4 changes: 4 additions & 0 deletions .semversioner/next-release/patch-20240918221118566693.json
@@ -0,0 +1,4 @@
{
"type": "patch",
"description": "Merge existing and new entities, updating values accordingly"
}
4 changes: 4 additions & 0 deletions .semversioner/next-release/patch-20240919003117336827.json
@@ -0,0 +1,4 @@
{
"type": "patch",
"description": "Collapse create_final_relationships."
}
4 changes: 4 additions & 0 deletions .semversioner/next-release/patch-20240919223518903171.json
@@ -0,0 +1,4 @@
{
"type": "patch",
"description": "remove redundant error-handling code from global-search"
}
4 changes: 4 additions & 0 deletions .semversioner/next-release/patch-20240920000120463201.json
@@ -0,0 +1,4 @@
{
"type": "patch",
"description": "Collapse create-final-nodes."
}
20 changes: 20 additions & 0 deletions CHANGELOG.md
@@ -1,6 +1,26 @@
# Changelog
Note: version releases in the 0.x.y range may introduce breaking changes.


## 0.3.6

- patch: Collapse create_final_relationships.
- patch: Dependency update and cleanup

## 0.3.5

- patch: Add compound verbs with tests infra.
- patch: Collapse create_final_communities.
- patch: Collapse create_final_text_units.
- patch: Covariate verb collapse.
- patch: Fix duplicates in community context builder
- patch: Fix prompt tune output path
- patch: Fix seed hardcoded init
- patch: Fix seeded random gen on clustering
- patch: Improve logging.
- patch: Set default values for cli parameters.
- patch: Use static output directories.

## 0.3.4

- patch: Deep copy txt units on local search to avoid race conditions
2 changes: 1 addition & 1 deletion docsite/posts/config/env_vars.md
@@ -25,7 +25,7 @@ If the embedding target is `all`, and you want to only embed a subset of these f

## Input Data

Our pipeline can ingest .csv or .txt data from an input folder. These files can be nested within subfolders. To configure how input data is handled, what fields are mapped over, and how timestamps are parsed, look for configuration values starting with `GRAPHRAG_INPUT_` below. In general, CSV-based data provides the most customizeability. Each CSV should at least contain a `text` field (which can be mapped with environment variables), but it's helpful if they also have `title`, `timestamp`, and `source` fields. Additional fields can be included as well, which will land as extra fields on the `Document` table.
Our pipeline can ingest .csv or .txt data from an input folder. These files can be nested within subfolders. To configure how input data is handled, what fields are mapped over, and how timestamps are parsed, look for configuration values starting with `GRAPHRAG_INPUT_` below. In general, CSV-based data provides the most customizability. Each CSV should at least contain a `text` field (which can be mapped with environment variables), but it's helpful if they also have `title`, `timestamp`, and `source` fields. Additional fields can be included as well, which will land as extra fields on the `Document` table.

## Base LLM Settings

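
For reference, a minimal sketch of the input shape that the Input Data paragraph above describes: only the `text` column is required, `title`/`timestamp`/`source` are the recommended optional columns, and any extra column lands as an additional field on the `Document` table. The file path and values below are illustrative only; the exact `GRAPHRAG_INPUT_*` variables that map these columns are the ones documented in env_vars.md itself, not invented here.

# Illustrative only: build a tiny CSV with the fields named in the paragraph above.
import pandas as pd

rows = [
    {
        "text": "Full body of the first document ...",   # required field
        "title": "Doc 1",                                 # optional, recommended
        "timestamp": "2024-09-01T12:00:00Z",              # parsed per the GRAPHRAG_INPUT_* settings
        "source": "wiki-export",                          # optional, recommended
        "category": "reference",                          # extra field -> lands on the Document table
    },
]
pd.DataFrame(rows).to_csv("input/docs.csv", index=False)
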
6 changes: 3 additions & 3 deletions docsite/yarn.lock
@@ -2033,9 +2033,9 @@ __metadata:
linkType: hard

"path-to-regexp@npm:^6.2.1":
version: 6.2.1
resolution: "path-to-regexp@npm:6.2.1"
checksum: 1e266be712d1a08086ee77beab12a1804842ec635dfed44f9ee1ba960a0e01cec8063fb8c92561115cdc0ce73158cdc7766e353ffa039340b4a85b370084c4d4
version: 6.3.0
resolution: "path-to-regexp@npm:6.3.0"
checksum: 6822f686f01556d99538b350722ef761541ec0ce95ca40ce4c29e20a5b492fe8361961f57993c71b2418de12e604478dcf7c430de34b2c31a688363a7a944d9c
languageName: node
linkType: hard

60 changes: 42 additions & 18 deletions graphrag/index/run/run.py
@@ -46,6 +46,7 @@
from graphrag.index.typing import PipelineRunResult

# Register all verbs
from graphrag.index.update.dataframes import get_delta_docs, update_dataframe_outputs
from graphrag.index.verbs import * # noqa
from graphrag.index.workflows import (
VerbDefinitions,
@@ -111,9 +112,6 @@ async def run_pipeline_with_config(
else await _create_input(config.input, progress_reporter, root_dir)
)

if is_update_run:
# TODO: Filter dataset to only include new data (this should be done in the input module)
pass
post_process_steps = input_post_process_steps or _create_postprocess_steps(
config.input
)
@@ -123,21 +121,47 @@
msg = "No dataset provided!"
raise ValueError(msg)

async for table in run_pipeline(
workflows=workflows,
dataset=dataset,
storage=storage,
cache=cache,
callbacks=callbacks,
input_post_process_steps=post_process_steps,
memory_profile=memory_profile,
additional_verbs=additional_verbs,
additional_workflows=additional_workflows,
progress_reporter=progress_reporter,
emit=emit,
is_resume_run=is_resume_run,
):
yield table
if is_update_run:
delta_dataset = await get_delta_docs(dataset, storage)

delta_storage = storage.child("delta")

# Run the pipeline on the new documents
tables_dict = {}
async for table in run_pipeline(
workflows=workflows,
dataset=delta_dataset.new_inputs,
storage=delta_storage,
cache=cache,
callbacks=callbacks,
input_post_process_steps=post_process_steps,
memory_profile=memory_profile,
additional_verbs=additional_verbs,
additional_workflows=additional_workflows,
progress_reporter=progress_reporter,
emit=emit,
is_resume_run=False,
):
tables_dict[table.workflow] = table.result

await update_dataframe_outputs(tables_dict, storage)

else:
async for table in run_pipeline(
workflows=workflows,
dataset=dataset,
storage=storage,
cache=cache,
callbacks=callbacks,
input_post_process_steps=post_process_steps,
memory_profile=memory_profile,
additional_verbs=additional_verbs,
additional_workflows=additional_workflows,
progress_reporter=progress_reporter,
emit=emit,
is_resume_run=is_resume_run,
):
yield table


async def run_pipeline(
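
Condensed, the update branch added to `run_pipeline_with_config` above boils down to the flow below. This is a hedged paraphrase of the diff, not a new API: `run_pipeline`, `get_delta_docs`, and `update_dataframe_outputs` are the functions touched by this commit, while the helper name `_run_update` and the `**pipeline_kwargs` pass-through are shorthand for the arguments the real code forwards explicitly.

# Hedged paraphrase of the is_update_run branch shown in the diff above.
from graphrag.index.run.run import run_pipeline
from graphrag.index.update.dataframes import get_delta_docs, update_dataframe_outputs


async def _run_update(workflows, dataset, storage, **pipeline_kwargs):
    # Split inputs into documents not indexed yet and documents removed since the last run.
    delta = await get_delta_docs(dataset, storage)

    # Index only the new documents, writing results to a child "delta" storage
    # so the existing outputs stay untouched.
    delta_storage = storage.child("delta")
    tables = {}
    async for table in run_pipeline(
        workflows=workflows,
        dataset=delta.new_inputs,
        storage=delta_storage,
        is_resume_run=False,
        **pipeline_kwargs,
    ):
        tables[table.workflow] = table.result

    # Merge the delta outputs back into the main storage
    # (currently written as *_new.parquet files for comparison).
    await update_dataframe_outputs(tables, storage)
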
4 changes: 4 additions & 0 deletions graphrag/index/update/__init__.py
@@ -0,0 +1,4 @@
# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License

"""Incremental Indexing main module definition."""
192 changes: 192 additions & 0 deletions graphrag/index/update/dataframes.py
@@ -0,0 +1,192 @@
# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License

"""Dataframe operations and utils for Incremental Indexing."""

import os
from dataclasses import dataclass

import numpy as np
import pandas as pd

from graphrag.index.storage.typing import PipelineStorage
from graphrag.utils.storage import _load_table_from_storage

mergeable_outputs = [
"create_final_documents",
"create_final_entities",
"create_final_relationships",
]


@dataclass
class InputDelta:
"""Dataclass to hold the input delta.
Attributes
----------
new_inputs : pd.DataFrame
The new inputs.
deleted_inputs : pd.DataFrame
The deleted inputs.
"""

new_inputs: pd.DataFrame
deleted_inputs: pd.DataFrame


async def get_delta_docs(
input_dataset: pd.DataFrame, storage: PipelineStorage
) -> InputDelta:
"""Get the delta between the input dataset and the final documents.
Parameters
----------
input_dataset : pd.DataFrame
The input dataset.
storage : PipelineStorage
The Pipeline storage.
Returns
-------
InputDelta
The input delta. With new inputs and deleted inputs.
"""
final_docs = await _load_table_from_storage(
"create_final_documents.parquet", storage
)

# Select distinct title from final docs and from dataset
previous_docs: list[str] = final_docs["title"].unique().tolist()
dataset_docs: list[str] = input_dataset["title"].unique().tolist()

# Get the new documents (using loc to ensure DataFrame)
new_docs = input_dataset.loc[~input_dataset["title"].isin(previous_docs)]

# Get the deleted documents (again using loc to ensure DataFrame)
deleted_docs = final_docs.loc[~final_docs["title"].isin(dataset_docs)]

return InputDelta(new_docs, deleted_docs)


async def update_dataframe_outputs(
dataframe_dict: dict[str, pd.DataFrame],
storage: PipelineStorage,
) -> None:
"""Update the mergeable outputs.
Parameters
----------
dataframe_dict : dict[str, pd.DataFrame]
The dictionary of dataframes.
storage : PipelineStorage
The storage used to store the dataframes.
"""
await _concat_dataframes("create_base_text_units", dataframe_dict, storage)
await _concat_dataframes("create_final_documents", dataframe_dict, storage)

old_entities = await _load_table_from_storage(
"create_final_entities.parquet", storage
)
delta_entities = dataframe_dict["create_final_entities"]

merged_entities_df, _ = _group_and_resolve_entities(old_entities, delta_entities)
# Save the updated entities back to storage
# TODO: Using _new in the meantime, to compare outputs without overwriting the original
await storage.set(
"create_final_entities_new.parquet", merged_entities_df.to_parquet()
)


async def _concat_dataframes(name, dataframe_dict, storage):
"""Concatenate the dataframes.
Parameters
----------
name : str
The name of the dataframe to concatenate.
dataframe_dict : dict[str, pd.DataFrame]
The dictionary of dataframes from a pipeline run.
storage : PipelineStorage
The storage used to store the dataframes.
"""
old_df = await _load_table_from_storage(f"{name}.parquet", storage)
delta_df = dataframe_dict[name]

# Merge the final documents
final_df = pd.concat([old_df, delta_df], copy=False)

# TODO: Using _new in the meantime, to compare outputs without overwriting the original
await storage.set(f"{name}_new.parquet", final_df.to_parquet())


def _group_and_resolve_entities(
df_a: pd.DataFrame, df_b: pd.DataFrame
) -> tuple[pd.DataFrame, dict]:
"""Group and resolve entities.
Parameters
----------
df_a : pd.DataFrame
The first dataframe.
df_b : pd.DataFrame
The second dataframe.
Returns
-------
pd.DataFrame
The resolved dataframe.
dict
The id mapping for existing entities. In the form of {df_b.id: df_a.id}.
"""
# If a name exists in A and B, make a dictionary for {B.id : A.id}
merged = df_b[["id", "name"]].merge(
df_a[["id", "name"]],
on="name",
suffixes=("_B", "_A"),
copy=False,
)
id_mapping = dict(zip(merged["id_B"], merged["id_A"], strict=True))

# Concat A and B
combined = pd.concat([df_a, df_b], copy=False)

# Group by name and resolve conflicts
aggregated = (
combined.groupby("name")
.agg({
"id": "first",
"type": "first",
"human_readable_id": "first",
"graph_embedding": "first",
"description": lambda x: os.linesep.join(x.astype(str)), # Ensure str
# Concatenate each ndarray of text unit ids into a single comma-separated string
"text_unit_ids": lambda x: ",".join(str(i) for j in x.tolist() for i in j),
# Keep only descriptions where the original value wasn't modified
"description_embedding": lambda x: x.iloc[0] if len(x) == 1 else np.nan,
})
.reset_index()
)

# Force the result into a DataFrame
resolved: pd.DataFrame = pd.DataFrame(aggregated)

# Recreate human_readable_id with an auto-incrementing index
resolved["human_readable_id"] = range(len(resolved))

# Modify column order to keep consistency
resolved = resolved.loc[
:,
[
"id",
"name",
"description",
"type",
"human_readable_id",
"graph_embedding",
"text_unit_ids",
"description_embedding",
],
]

return resolved, id_mapping
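
To make the merge behavior above concrete, here is a toy run of `_group_and_resolve_entities` (a private helper introduced by this commit) on two single-row frames that share the entity name "ACME". All values are made up; only the column layout mirrors the real entity outputs.

# Toy illustration of the entity-merge behavior implemented above. Data is invented.
import numpy as np
import pandas as pd

from graphrag.index.update.dataframes import _group_and_resolve_entities

old = pd.DataFrame([{
    "id": "a1", "name": "ACME", "type": "organization",
    "human_readable_id": 0, "graph_embedding": None,
    "description": "A company.",
    "text_unit_ids": ["t1"],
    "description_embedding": np.array([0.1, 0.2]),
}])
new = pd.DataFrame([{
    "id": "b7", "name": "ACME", "type": "organization",
    "human_readable_id": 0, "graph_embedding": None,
    "description": "A company founded in 1999.",
    "text_unit_ids": ["t9"],
    "description_embedding": np.array([0.3, 0.4]),
}])

resolved, id_mapping = _group_and_resolve_entities(old, new)

# id_mapping == {"b7": "a1"}: the delta entity resolves to the existing id.
# resolved has a single "ACME" row: both descriptions joined with os.linesep,
# text_unit_ids == "t1,t9", and description_embedding == NaN because the
# description changed (more than one source row).
print(id_mapping)
print(resolved[["id", "name", "description", "text_unit_ids"]])
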