Feature/kg builder (neo4j#91)

* Pipeline (neo4j#81)
  * First draft of pipeline/component architecture. Example using the RAG pipeline.
  * More complex implementation of pipeline to deal with branching and aggregations - no async yet
  * Introduce Store to add flexibility as to where to store pipeline results - Only return the leaf components results by default
  * Test RAG with new Pipeline implementation
  * File refactoring
  * Pipeline orchestration with async support
  * Import sorting
  * Pipeline rerun + exception on cyclic graph (for now)
  * Mypy
  * Python version compat
  * Rename process->run for Components for consistency with Pipeline
  * Move components test in the example folder - add some tests
  * Race condition fix - documentation - ruff
  * Fix import sorting
  * mypy on tests
  * Mark test as async
  * Tests were not testing...
  * Ability to create Pipeline templates
  * Ruff
  * Future + header
  * Renaming + update import structure to make it more compatible with rest of the repo
  * Check input parameters before starting the pipeline
  * Introduce output model for component - Validate pipeline before running - More unit tests
  * Import..
  * Finally installed pre-commit hooks...
  * Finally installed pre-commit hooks...
  * Finally installed pre-commit hooks... and struggling with pydantic..
  * Mypy on examples
  * Add missing header
  * Update doc
  * Fix import in doc
  * Update changelog
  * Update docs/source/user_guide_pipeline.rst Co-authored-by: willtai <[email protected]>
  * Refactor tests folder to match src structure
  * Move exceptions to separate file and rename them to make it clearer they are related to pipeline
  * Mypy
  * Rename def => config
  * Introduce generic type to remove most of the "type: ignore" comments
  * Remove unnecessary comment
  * Ruff
  * Document and test is_cyclic method
  * Remove find_all method from store (simplify data retrieval)
  * value is not a list anymore (or, if it is, it's on purpose)
  * Remove comments, fix example in doc
  * Remove core directory - move files to /pipeline
  * Expose stores from pipeline subpackage
  * Ability to pass the full output of one component to the next one - useful when a component accepts a pydantic model as input
  * Component subclasses can return DataModel
  * Add note on async + schema to illustrate parameter propagation
  ---------
  Co-authored-by: willtai <[email protected]>
* Adds a Text Splitter (neo4j#82)
  * Added text splitter adapter class
  * Added copyright header to new files
  * Added __future__ import to text_splitters.py for backwards compatibility of type hints
  * Moved text splitter file and tests
  * Split text splitter adapter into 2 adapters
  * Added optional metadata to text chunks
  * Fixed typos
  * Moved text splitters inside of the components folder
  * Fixed Component import
* Added a TextChunkEmbedder (neo4j#87)
  * Added a TextChunkEmbedder
  * Added the copyright header to test_embedder.py
  * Updated test_text_chunk_embedder_run
* Adds a knowledge graph writer (neo4j#83)
  * Added copyright header to new files
  * Added copyright header to kg_writer.py
  * Added __future__ import to kg_writer.py for backwards compatibility of type hints
  * Added E2E test for Neo4jWriter
  * Added a copyright header to test_kg_builder_e2e.py
  * Added upsert_vector test for relationship embeddings
  * Moved KG writer and its tests
  * Moved Neo4jGraph and associated objects to a new file
  * Renamed KG builder fixture
  * Added unit tests for KG writer
  * Split upsert_vector into 2 functions
  * Fixed broken cypher query strings
  * Removed embedding creation from Neo4jWriter
  * Fixed setup_neo4j_for_kg_construction fixture
  * Added KGWriterModel class
  * Fixed minor mistake in test_weaviate_e2e.py
  * Renamed kg_construction folder to components
  * Updated unit tests with new folder structure
  * Fixed broken import
  * Fixed copyright headers
  * Added missing docstrings
  * Fixed typo
* Add documentation for pipeline exceptions (neo4j#90)
* Fixes and refactors the KG writer component (neo4j#92)
  * Fixes and refactors the KG writer component
  * Fixed mypy error
  * Made start_node_id and end_node_id parameters in UPSERT_RELATIONSHIP_QUERY
* Add schema for kg builder (neo4j#88)
  * Add schema for kg builder and tests
  * Fixed mypy checks
  * Reverted kg builder example with schema
  * Revert to List and Dict due to Python3.8 issue with using get_type_hints
  * Added properties to Entity and Relation
  * Add test for missing properties
  * Fix type annotations in test
  * Add property types
  * Refactored entity, relation, and property types
  * Unused import
* Moved schema to components/ (neo4j#96)
* Add entity / Relation extraction component (neo4j#85)
  * Pipeline (neo4j#81)
  * Entity / Relation extraction component
  * Adds a Text Splitter (neo4j#82)
  * Add tests
  * Keep it simple: remove deps to jinja for now
  * Update example with existing components
  * log config in example
  * Fix tests
  * Rm unused import
  * Add copyright headers
  * Rm debug code
  * Try and fix tests
  * Unused import
  * get_type_hints is failing for python 3.8/3.9, even when using __future__ annotations => back to the typing.Dict annotation which is compatible with all python versions
  * Return model is also conditioned to the existence of the run method => should raise an error if run is not implemented?
  * Log when we do not raise exception to keep track of the failure
  * Update prompt to match new KGwriter expected type
  * Fix test
  * Fix type for `examples`
  * Use SchemaConfig as input for the ER Extractor component
  * The "base" EntityRelationExtractor is an ABC that must be subclassed
  * Make node IDs unique across several runs of the pipeline by prefixing them with a timestamp
  * Option to build lexical graph in the ERExtractor component
  * Fix one test
  * Fix some more tests
  * Fix some more tests
  * Remove "type: ignore" comments
  ---------
  Co-authored-by: willtai <[email protected]>
  Co-authored-by: Alex Thomas <[email protected]>
* Update lock file after merge
* Remove pipeline/components folder (again)
* Updated component docs (neo4j#99)
  * Updated component docs
  * Removed weaviate test update
  * Updated pipeline user guide with link to components in the API section
* Feature/kg builder e2e tests (neo4j#98)
  * End to end tests for KG builder pipeline
  * Adding chunk embedder to the pipeline and e2e tests
  * Fix how the chunk embedding is saved
  * Fix e2e tests
  * Fix mypy
  * mypy stuff :'(
  * WIP: update e2e tests
  * Check counts also here
  * Enable e2e tests on this PR only
  * Fix e2e tests (was not mocking the correct method for Embedder)
  * Revert CI to normal
  * Updated CHANGELOG and set max-parallel: 1 for E2E tests in pr-e2e-tests.yaml

---------
Co-authored-by: willtai <[email protected]>
Co-authored-by: Alex Thomas <[email protected]>
Co-authored-by: willtai <[email protected]>
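
The commit message mentions raising an exception on cyclic pipeline graphs and a documented `is_cyclic` method. The sketch below shows one common way such a check can be written: a depth-first search over an adjacency mapping. It is an illustration only, not the implementation in `neo4j_genai.pipeline`. The function signature, the `edges` mapping, and the component names are assumptions made for this example.

```python
from typing import Dict, List


def is_cyclic(edges: Dict[str, List[str]]) -> bool:
    """Return True if the directed graph described by `edges` contains a cycle."""
    # Three-state marking: unvisited, on the current DFS path, fully explored.
    WHITE, GREY, BLACK = 0, 1, 2
    nodes = set(edges) | {child for children in edges.values() for child in children}
    colour = {node: WHITE for node in nodes}

    def visit(node: str) -> bool:
        colour[node] = GREY
        for child in edges.get(node, []):
            if colour[child] == GREY:
                # Reached a node that is still on the current path: back edge.
                return True
            if colour[child] == WHITE and visit(child):
                return True
        colour[node] = BLACK
        return False

    return any(colour[node] == WHITE and visit(node) for node in nodes)


# Hypothetical component names, used only for illustration.
linear = {"splitter": ["embedder"], "embedder": ["writer"]}
looped = {"splitter": ["embedder"], "embedder": ["splitter"]}
print(is_cyclic(linear), is_cyclic(looped))
```

A pipeline could run a check of this kind before execution and refuse to start when it returns `True`, which matches the "validate pipeline before running" item in the message above.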
alexthomas93 · Aug 13, 2024 · cc48eef
1 parent 242c77c
commit cc48eef

Showing 57 changed files with 6,498 additions and 1,204 deletions.
diff --git a/.github/workflows/pr-e2e-tests.yaml b/.github/workflows/pr-e2e-tests.yaml
@@ -13,6 +13,7 @@ concurrency:
 jobs:
   e2e-tests:
     runs-on: ubuntu-latest
     strategy:
+      max-parallel: 1
       matrix:
         python-version: ['3.8', '3.12']

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -6,6 +6,9 @@
 
 ### Added
 - Add optional custom_prompt arg to the Text2CypherRetriever class.
+- Introduced support for Component/Pipeline flexible architecture.
+- Added new components for knowledge graph construction, including text splitters, schema builders, entity-relation extractors, and Neo4j writers.
+- Implemented end-to-end tests for the new knowledge graph builder pipeline.
 
 ### Changed
 - `GraphRAG.search` method first parameter has been renamed `query_text` (was `query`) for consistency with the retrievers interface.

diff --git a/docs/source/api.rst b/docs/source/api.rst
@@ -3,14 +3,74 @@
 API Documentation
 #################
 
+.. _components-section:
+
+**********
+Components
+**********
+
+KGWriter
+========
+
+.. autoclass:: neo4j_genai.components.kg_writer.KGWriter
+    :members: run
+
+Neo4jWriter
+===========
+
+.. autoclass:: neo4j_genai.components.kg_writer.Neo4jWriter
+    :members: run
+
+TextSplitter
+============
+
+.. autoclass:: neo4j_genai.components.text_splitters.base.TextSplitter
+    :members: run
+
+LangChainTextSplitterAdapter
+============================
+
+.. autoclass:: neo4j_genai.components.text_splitters.langchain.LangChainTextSplitterAdapter
+    :members: run
+
+LlamaIndexTextSplitterAdapter
+=============================
+
+.. autoclass:: neo4j_genai.components.text_splitters.llamaindex.LlamaIndexTextSplitterAdapter
+    :members: run
+
+TextChunkEmbedder
+=================
+
+.. autoclass:: neo4j_genai.components.embedder.TextChunkEmbedder
+    :members: run
+
+SchemaBuilder
+=============
+
+.. autoclass:: neo4j_genai.components.schema.SchemaBuilder
+    :members: run
+
+EntityRelationExtractor
+=======================
+
+.. autoclass:: neo4j_genai.components.entity_relation_extractor.EntityRelationExtractor
+    :members: run
+
+LLMEntityRelationExtractor
+==========================
+
+.. autoclass:: neo4j_genai.components.entity_relation_extractor.LLMEntityRelationExtractor
+    :members: run
+
 .. _retrievers-section:
 
 **********
 Retrievers
 **********
 
 RetrieverInterface
-===================
+==================
 
 .. autoclass:: neo4j_genai.retrievers.base.Retriever
     :members:
@@ -70,39 +130,39 @@ PineconeNeo4jRetriever
     :members: search
 
 
-**********
+********
 Embedder
-**********
+********
 
 .. autoclass:: neo4j_genai.embedder.Embedder
     :members:
 
 SentenceTransformerEmbeddings
 ================================
 
-.. autoclass:: neo4j_genai.embeddings.SentenceTransformerEmbeddings
+.. autoclass:: neo4j_genai.embeddings.sentence_transformers.SentenceTransformerEmbeddings
     :members:
 
 **********
 Generation
 **********
 
 LLMInterface
-======================
+============
 
 .. autoclass:: neo4j_genai.llm.LLMInterface
     :members:
 
 
 OpenAILLM
-======================
+=========
 
 .. autoclass:: neo4j_genai.llm.OpenAILLM
     :members:
 
 
 PromptTemplate
-======================
+==============
 
 .. autoclass:: neo4j_genai.generation.prompts.PromptTemplate
     :members:
@@ -125,6 +185,8 @@ Database Interaction
 
 .. autofunction:: neo4j_genai.indexes.upsert_vector
 
+.. autofunction:: neo4j_genai.indexes.upsert_vector_on_relationship
+
 
 ******
 Errors
@@ -157,6 +219,12 @@ Errors
 
   * :class:`neo4j_genai.exceptions.LLMGenerationError`
 
+  * :class:`neo4j_genai.pipeline.exceptions.PipelineDefinitionError`
+
+  * :class:`neo4j_genai.pipeline.exceptions.PipelineMissingDependencyError`
+
+  * :class:`neo4j_genai.pipeline.exceptions.PipelineStatusUpdateError`
+
 
 Neo4jGenAiError
 ===============
@@ -222,7 +290,7 @@ Neo4jVersionError
 
 
 Text2CypherRetrievalError
-==========================
+=========================
 
 .. autoclass:: neo4j_genai.exceptions.Text2CypherRetrievalError
    :show-inheritance:
@@ -236,21 +304,42 @@ SchemaFetchError
 
 
 RagInitializationError
-==========================
+======================
 
 .. autoclass:: neo4j_genai.exceptions.RagInitializationError
    :show-inheritance:
 
 
 PromptMissingInputError
-==========================
+=======================
 
 .. autoclass:: neo4j_genai.exceptions.PromptMissingInputError
    :show-inheritance:
 
 
 LLMGenerationError
-==========================
+==================
 
 .. autoclass:: neo4j_genai.exceptions.LLMGenerationError
    :show-inheritance:
+
+
+PipelineDefinitionError
+=======================
+
+.. autoclass:: neo4j_genai.pipeline.exceptions.PipelineDefinitionError
+   :show-inheritance:
+
+
+PipelineMissingDependencyError
+==============================
+
+.. autoclass:: neo4j_genai.pipeline.exceptions.PipelineMissingDependencyError
+   :show-inheritance:
+
+
+PipelineStatusUpdateError
+=========================
+
+.. autoclass:: neo4j_genai.pipeline.exceptions.PipelineStatusUpdateError
+   :show-inheritance:
diff --git a/docs/source/index.rst b/docs/source/index.rst
@@ -30,7 +30,8 @@ Python versions supported:
 Topics
 ******
 
-+ :ref:`user-guide`
++ :ref:`user-guide-rag`
++ :ref:`user-guide-pipeline`
 + :ref:`api-documentation`
 + :ref:`types-documentation`
 
@@ -39,7 +40,8 @@ Topics
     :caption: Contents:
     :hidden:
 
-    user_guide.rst
+    user_guide_rag.rst
+    user_guide_pipeline.rst
     api.rst
     types.rst
 

diff --git a/docs/source/types.rst b/docs/source/types.rst
@@ -5,30 +5,80 @@ Types
 *****
 
 RawSearchResult
-==================
+===============
 
 .. autoclass:: neo4j_genai.types.RawSearchResult
 
 
 RetrieverResult
-==================
+===============
 
 .. autoclass:: neo4j_genai.types.RetrieverResult
 
 
 RetrieverResultItem
-====================
+===================
 
 .. autoclass:: neo4j_genai.types.RetrieverResultItem
 
 
 LLMResponse
-====================
+===========
 
 .. autoclass:: neo4j_genai.llm.types.LLMResponse
 
 
 RagResultModel
-====================
+==============
 
 .. autoclass:: neo4j_genai.generation.types.RagResultModel
+
+TextChunk
+=========
+
+.. autoclass:: neo4j_genai.components.types.TextChunk
+
+TextChunks
+==========
+
+.. autoclass:: neo4j_genai.components.types.TextChunks
+
+Neo4jNode
+=========
+
+.. autoclass:: neo4j_genai.components.types.Neo4jNode
+
+Neo4jRelationship
+=================
+
+.. autoclass:: neo4j_genai.components.types.Neo4jRelationship
+
+Neo4jGraph
+==========
+
+.. autoclass:: neo4j_genai.components.types.Neo4jGraph
+
+KGWriterModel
+=============
+
+.. autoclass:: neo4j_genai.components.kg_writer.KGWriterModel
+
+SchemaProperty
+==============
+
+.. autoclass:: neo4j_genai.components.schema.SchemaProperty
+
+SchemaEntity
+============
+
+.. autoclass:: neo4j_genai.components.schema.SchemaEntity
+
+SchemaRelation
+==============
+
+.. autoclass:: neo4j_genai.components.schema.SchemaRelation
+
+SchemaConfig
+============
+
+.. autoclass:: neo4j_genai.components.schema.SchemaConfig