Merge branch 'lt/dataframe' into 'main'

Introduce Pandas DataFrame base view

See merge request deepsense.ai/g-internal/db-ally!74
deepsense-ai · Mar 28, 2024 · 380c4b7
2 parents c396073 + 77131f1
commit 380c4b7
Showing 9 changed files with 354 additions and 19 deletions.
diff --git a/docs/about/roadmap.md b/docs/about/roadmap.md
@@ -26,7 +26,7 @@ Below you can find a list of planned integrations.
 ### Data sources
 
 - [x] Sqlalchemy
-- [ ] Pandas DataFrame
+- [x] Pandas DataFrame
 - [ ] HTTP REST Endpoints
 - [ ] GraphQL Endpoints
 

diff --git a/docs/how-to/custom_views.md b/docs/how-to/custom_views.md
@@ -1,4 +1,4 @@
-# How-To: Write Custom Views
+# How-To: Use custom data sources with db-ally
 
 !!! note
     This is an advanced topic. If you're looking to create a view that retrieves data from an SQL database, please refer to the [SQL Views](sql_views.md) guide instead.

diff --git a/docs/how-to/pandas_views.md b/docs/how-to/pandas_views.md
@@ -0,0 +1,96 @@
+# How To: Use Pandas DataFrames with db-ally
+
+In this guide, you will learn how to write [views](../concepts/views.md) that use [Pandas](https://pandas.pydata.org/) DataFrames as their data source. You will understand how to define such a view, create filters that operate on the DataFrame, and register it while providing it with the source DataFrame.
+
+The example used in this guide is a DataFrame containing information about candidates. The DataFrame includes columns such as `id`, `name`, `country`, `years_of_experience`. This is the same use case as the one in the [Quickstart](../quickstart/index.md) and [Custom Views](./custom_views.md) guides. Please feel free to compare the different approaches.
+
+## The DataFrame
+Here is an example of a DataFrame containing information about candidates:
+
+```python
+import pandas as pd
+
+CANDIDATE_DATA = pd.DataFrame.from_records([
+    {"id": 1, "name": "John Doe", "position": "Data Scientist", "years_of_experience": 2, "country": "France"},
+    {"id": 2, "name": "Jane Doe", "position": "Data Engineer", "years_of_experience": 3, "country": "France"},
+    {"id": 3, "name": "Alice Smith", "position": "Machine Learning Engineer", "years_of_experience": 4, "country": "Germany"},
+    {"id": 4, "name": "Bob Smith", "position": "Data Scientist", "years_of_experience": 5, "country": "Germany"},
+    {"id": 5, "name": "Janka Jankowska", "position": "Data Scientist", "years_of_experience": 3, "country": "Poland"},
+])
+```
+
+## View Definition
+Views operating on Pandas DataFrames are defined by subclassing the `DataFrameBaseView` class:
+
+```python
+from dbally import decorators, DataFrameBaseView
+
+class CandidateView(DataFrameBaseView):
+    """
+    View for retrieving information about candidates.
+    """
+```
+
+Typically, a view contains one or more filters that operate on the DataFrame. In the case of views inheriting from `DataFrameBaseView`, filters are expected to return a [`Series`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) object that can be used as a [boolean index](https://pandas.pydata.org/pandas-docs/version/2.1/user_guide/indexing.html#boolean-indexing) for the original DataFrame. In other words, the filter should return a boolean `Series` with the same length as the original DataFrame where `True` values denote rows that should be included in the result and `False` values indicate rows that should be omitted.
+
+Such `Series` are usually produced by applying comparison operators (`==`, `>`, `<`) to DataFrame columns and combining the results with `&` (for "and"), `|` (for "or"), and `~` (for "not"). For instance, `df.years_of_experience > 5` will return a boolean `Series` with `True` values for rows where the `years_of_experience` column is greater than 5.
+
+As always, the LLM will choose the best filter to apply based on the query it receives and will combine multiple filters if necessary.
+
+Here are two filters that operate on the DataFrame: one selects candidates with at least a certain number of years of experience, and the other selects candidates from a specific country:
+
+```python
+@decorators.view_filter()
+def at_least_experience(self, years: int) -> pd.Series:
+    """
+    Filters candidates with at least `years` of experience.
+    """
+    return self.df.years_of_experience >= years
+
+@decorators.view_filter()
+def from_country(self, country: str) -> pd.Series:
+    """
+    Filters candidates from a specific country.
+    """
+    return self.df.country == country
+```
+
+As you can see, the DataFrame object is accessed via the `self.df` attribute. This attribute is set by the `DataFrameBaseView` constructor and holds the DataFrame passed in when the view instance is created during registration.
+
+Here is an example of a more advanced filter that filters candidates considered for a senior data scientist position. It uses the `&` operator to combine two conditions:
+
+```python
+@decorators.view_filter()
+def senior_data_scientist_position(self) -> pd.Series:
+    """
+    Filters candidates that can be considered for a senior data scientist position.
+    """
+    return self.df.position.isin(["Data Scientist", "Machine Learning Engineer", "Data Engineer"]) \
+        & (self.df.years_of_experience >= 3)
+```
+
+## Registering the View
+To use the view, you need to create a [Collection](../concepts/collections.md) and register the view with it. This is done in the same manner as registering other types of views, but you need to provide the view with the DataFrame on which it should operate:
+
+```python
+import dbally
+
+collection = dbally.create_collection("recruitment")
+collection.add(CandidateView, lambda: CandidateView(CANDIDATE_DATA))
+
+result = await collection.ask("Find me French candidates suitable for a senior data scientist position.")
+
+print(f"Retrieved {len(result.results)} candidates:")
+for candidate in result.results:
+    print(candidate)
+```
+
+This code will return a list of French candidates eligible for a senior data scientist position and display them:
+
+```
+Retrieved 1 candidates:
+{'id': 2, 'name': 'Jane Doe', 'position': 'Data Engineer', 'years_of_experience': 3, 'country': 'France'}
+```
+
+## Full Example
+You can access the complete example here: [pandas_views_code.py](pandas_views_code.py)
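
The boolean-mask mechanics described in the guide can be tried without db-ally at all. The following is a minimal sketch using only pandas; the variable names (`candidates`, `from_france`, `senior_fit`) are illustrative and not part of the merge request. It builds the same two conditions as the `from_country` and `senior_data_scientist_position` filters and combines them with `&`, which is roughly what happens when several filters are chosen for one question.

```python
import pandas as pd

# Same sample data as in the guide.
candidates = pd.DataFrame.from_records([
    {"id": 1, "name": "John Doe", "position": "Data Scientist", "years_of_experience": 2, "country": "France"},
    {"id": 2, "name": "Jane Doe", "position": "Data Engineer", "years_of_experience": 3, "country": "France"},
    {"id": 3, "name": "Alice Smith", "position": "Machine Learning Engineer", "years_of_experience": 4, "country": "Germany"},
    {"id": 4, "name": "Bob Smith", "position": "Data Scientist", "years_of_experience": 5, "country": "Germany"},
    {"id": 5, "name": "Janka Jankowska", "position": "Data Scientist", "years_of_experience": 3, "country": "Poland"},
])

# Each condition is a boolean Series aligned with the DataFrame index.
from_france = candidates.country == "France"
senior_fit = candidates.position.isin(
    ["Data Scientist", "Machine Learning Engineer", "Data Engineer"]
) & (candidates.years_of_experience >= 3)

# Combine the masks with & and use the result as a boolean index.
mask = from_france & senior_fit
selected = candidates.loc[mask].to_dict(orient="records")

# ~ negates a mask: everyone who is not from France.
not_from_france = candidates.loc[~from_france]

print(selected)
```

Only Jane Doe satisfies both conditions, which matches the single record shown in the guide's sample output.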
diff --git a/docs/how-to/pandas_views_code.py b/docs/how-to/pandas_views_code.py
@@ -0,0 +1,64 @@
+# pylint: disable=missing-return-doc, missing-param-doc, missing-function-docstring, missing-class-docstring, missing-raises-doc
+import dbally
+import os
+import asyncio
+from dataclasses import dataclass
+from typing import Iterable, Callable, Any
+import pandas as pd
+
+from dbally import decorators, DataFrameBaseView
+from dbally.audit.event_handlers.cli_event_handler import CLIEventHandler
+
+dbally.use_openai_llm(
+    openai_api_key=os.environ["OPENAI_API_KEY"],
+    model_name="gpt-3.5-turbo",
+)
+
+class CandidateView(DataFrameBaseView):
+    """
+    View for retrieving information about candidates.
+    """
+    @decorators.view_filter()
+    def at_least_experience(self, years: int) -> pd.Series:
+        """
+        Filters candidates with at least `years` of experience.
+        """
+        return self.df.years_of_experience >= years
+
+    @decorators.view_filter()
+    def from_country(self, country: str) -> pd.Series:
+        """
+        Filters candidates from a specific country.
+        """
+        return self.df.country == country
+
+    @decorators.view_filter()
+    def senior_data_scientist_position(self) -> pd.Series:
+        """
+        Filters candidates that can be considered for a senior data scientist position.
+        """
+        return self.df.position.isin(["Data Scientist", "Machine Learning Engineer", "Data Engineer"]) \
+            & (self.df.years_of_experience >= 3)
+
+CANDIDATE_DATA = pd.DataFrame.from_records([
+    {"id": 1, "name": "John Doe", "position": "Data Scientist", "years_of_experience": 2, "country": "France"},
+    {"id": 2, "name": "Jane Doe", "position": "Data Engineer", "years_of_experience": 3, "country": "France"},
+    {"id": 3, "name": "Alice Smith", "position": "Machine Learning Engineer", "years_of_experience": 4, "country": "Germany"},
+    {"id": 4, "name": "Bob Smith", "position": "Data Scientist", "years_of_experience": 5, "country": "Germany"},
+    {"id": 5, "name": "Janka Jankowska", "position": "Data Scientist", "years_of_experience": 3, "country": "Poland"},
+])
+
+async def main():
+    collection = dbally.create_collection("recruitment")
+    dbally.use_event_handler(CLIEventHandler())
+    collection.add(CandidateView, lambda: CandidateView(CANDIDATE_DATA))
+
+    result = await collection.ask("Find me French candidates suitable for a senior data scientist position.")
+
+    print(f"Retrieved {len(result.results)} candidates:")
+    for candidate in result.results:
+        print(candidate)
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -13,11 +13,13 @@ nav:
       - concepts/similarity_indexes.md
       - concepts/nl_responder.md
   - How-to:
+      - Using data sources:
+        - how-to/sql_views.md
+        - how-to/pandas_views.md
+        - how-to/custom_views.md
       - how-to/log_runs_to_langsmith.md
       - how-to/create_custom_event_handler.md
-      - how-to/sql_views.md
       - how-to/openai_assistants_integration.md
-      - how-to/custom_views.md
   - API Reference:
       - reference/collection.md
       - reference/event_handler.md

diff --git a/src/dbally/__init__.py b/src/dbally/__init__.py
@@ -3,6 +3,7 @@
 from dbally.views import decorators
 from dbally.views.base import AbstractBaseView
 from dbally.views.methods_base import MethodsBaseView
+from dbally.views.pandas_base import DataFrameBaseView
 from dbally.views.sqlalchemy_base import SqlAlchemyBaseView
 
 from .__version__ import __version__
@@ -17,4 +18,5 @@
     "MethodsBaseView",
     "SqlAlchemyBaseView",
     "AbstractBaseView",
+    "DataFrameBaseView",
 ]
diff --git a/src/dbally/views/pandas_base.py b/src/dbally/views/pandas_base.py
@@ -0,0 +1,86 @@
+import asyncio
+import time
+from functools import reduce
+
+import pandas as pd
+
+from dbally.data_models.execution_result import ExecutionResult
+from dbally.iql import IQLQuery, syntax
+from dbally.views.methods_base import MethodsBaseView
+
+
+class DataFrameBaseView(MethodsBaseView):
+    """
+    Base class for views that use Pandas DataFrames to store and filter data.
+
+    The views take a Pandas DataFrame as input and apply filters to it. The filters are defined as methods
+    that return a Pandas Series representing a boolean mask to be applied to the DataFrame.
+    """
+
+    def __init__(self, df: pd.DataFrame) -> None:
+        """
+        Initializes the view with the input DataFrame.
+
+        :param df: Pandas DataFrame with the data to be filtered
+        """
+        super().__init__()
+        self.df = df
+
+        # The mask to be applied to the dataframe to filter the data
+        self._filter_mask: pd.Series = None
+
+    async def apply_filters(self, filters: IQLQuery) -> None:
+        """
+        Applies the chosen filters to the view.
+
+        :param filters: IQLQuery object representing the filters to apply
+        """
+        self._filter_mask = await self.build_filter_node(filters.root)
+
+    async def build_filter_node(self, node: syntax.Node) -> pd.Series:
+        """
+        Converts a filter node from the IQLQuery to a Pandas Series representing
+        a boolean mask to be applied to the dataframe.
+
+        :param node: IQLQuery node representing the filter or logical operator
+
+        :return: Pandas Series representing the boolean mask
+
+        :raises ValueError: If the node type is not supported
+        """
+        if isinstance(node, syntax.FunctionCall):
+            return await self.call_filter_method(node)
+        if isinstance(node, syntax.And):  # logical AND
+            children = await asyncio.gather(*[self.build_filter_node(child) for child in node.children])
+            return reduce(lambda x, y: x & y, children)
+        if isinstance(node, syntax.Or):  # logical OR
+            children = await asyncio.gather(*[self.build_filter_node(child) for child in node.children])
+            return reduce(lambda x, y: x | y, children)
+        if isinstance(node, syntax.Not):
+            child = await self.build_filter_node(node.child)
+            return ~child
+        raise ValueError(f"Unsupported grammar: {node}")
+
+    def execute(self, dry_run: bool = False) -> ExecutionResult:
+        """
+        Executes the view and returns the results. The results are filtered based on the applied filters.
+
+        :param dry_run: If True, returns no rows, only the filter mask (in the result context)
+
+        :return: ExecutionResult object with the results and context information
+        """
+        start_time = time.time()
+        filtered_data = pd.DataFrame()
+
+        if not dry_run:
+            filtered_data = self.df
+            if self._filter_mask is not None:
+                filtered_data = filtered_data.loc[self._filter_mask]
+
+        return ExecutionResult(
+            results=filtered_data.to_dict(orient="records"),
+            execution_time=time.time() - start_time,
+            context={
+                "filter_mask": self._filter_mask,
+            },
+        )
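
To illustrate how `build_filter_node` folds a filter tree into a single mask, here is a simplified, synchronous sketch that does not depend on db-ally. The `Call`, `And`, `Or`, and `Not` dataclasses are hypothetical stand-ins for the `syntax` node types, and `Call` compares a column to a value instead of dispatching to a view method; only the `reduce`-based combination mirrors the code above.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Any, List

import pandas as pd


@dataclass
class Call:  # hypothetical stand-in for syntax.FunctionCall
    column: str
    value: Any


@dataclass
class And:  # hypothetical stand-in for syntax.And
    children: List[Any]


@dataclass
class Or:  # hypothetical stand-in for syntax.Or
    children: List[Any]


@dataclass
class Not:  # hypothetical stand-in for syntax.Not
    child: Any


def build_mask(df: pd.DataFrame, node: Any) -> pd.Series:
    """Recursively converts a node tree into one boolean mask."""
    if isinstance(node, Call):
        return df[node.column] == node.value
    if isinstance(node, And):
        # Fold the child masks pairwise with element-wise AND.
        return reduce(lambda x, y: x & y, [build_mask(df, c) for c in node.children])
    if isinstance(node, Or):
        # Fold the child masks pairwise with element-wise OR.
        return reduce(lambda x, y: x | y, [build_mask(df, c) for c in node.children])
    if isinstance(node, Not):
        return ~build_mask(df, node.child)
    raise ValueError(f"Unsupported node: {node}")


df = pd.DataFrame.from_records([
    {"name": "Alice", "city": "London", "year": 2020},
    {"name": "Bob", "city": "Paris", "year": 2020},
    {"name": "Charlie", "city": "London", "year": 2021},
    {"name": "David", "city": "Paris", "year": 2021},
    {"name": "Eve", "city": "Berlin", "year": 2020},
])

# Equivalent in spirit to: not (filter_city("Paris") and filter_year(2020))
tree = Not(And([Call("city", "Paris"), Call("year", 2020)]))
mask = build_mask(df, tree)
print(mask.tolist())
```

The real method is asynchronous and gathers child masks with `asyncio.gather`; the sketch drops that to keep the recursion easy to follow.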
diff --git a/tests/unit/views/test_pandas_base.py b/tests/unit/views/test_pandas_base.py
@@ -0,0 +1,99 @@
+# pylint: disable=missing-docstring, missing-return-doc, missing-param-doc, disallowed-name
+
+import pandas as pd
+
+from dbally.iql import IQLQuery
+from dbally.views.decorators import view_filter
+from dbally.views.pandas_base import DataFrameBaseView
+
+MOCK_DATA = [
+    {"name": "Alice", "city": "London", "year": 2020, "age": 30},
+    {"name": "Bob", "city": "Paris", "year": 2020, "age": 25},
+    {"name": "Charlie", "city": "London", "year": 2021, "age": 35},
+    {"name": "David", "city": "Paris", "year": 2021, "age": 40},
+    {"name": "Eve", "city": "Berlin", "year": 2020, "age": 45},
+]
+
+MOCK_DATA_BERLIN_OR_LONDON = [
+    {"name": "Alice", "city": "London", "year": 2020, "age": 30},
+    {"name": "Charlie", "city": "London", "year": 2021, "age": 35},
+    {"name": "Eve", "city": "Berlin", "year": 2020, "age": 45},
+]
+
+MOCK_DATA_PARIS_2020 = [
+    {"name": "Bob", "city": "Paris", "year": 2020, "age": 25},
+]
+
+MOCK_DATA_NOT_PARIS_2020 = [
+    {"name": "Alice", "city": "London", "year": 2020, "age": 30},
+    {"name": "Charlie", "city": "London", "year": 2021, "age": 35},
+    {"name": "David", "city": "Paris", "year": 2021, "age": 40},
+    {"name": "Eve", "city": "Berlin", "year": 2020, "age": 45},
+]
+
+
+class MockDataFrameView(DataFrameBaseView):
+    """
+    Mock class for testing the DataFrameBaseView
+    """
+
+    @view_filter()
+    def filter_city(self, city: str) -> pd.Series:
+        return self.df["city"] == city
+
+    @view_filter()
+    def filter_year(self, year: int) -> pd.Series:
+        return self.df["year"] == year
+
+    @view_filter()
+    def filter_age(self, age: int) -> pd.Series:
+        return self.df["age"] == age
+
+    @view_filter()
+    def filter_name(self, name: str) -> pd.Series:
+        return self.df["name"] == name
+
+
+async def test_filter_or() -> None:
+    """
+    Test that filtering the DataFrame with logical OR works correctly
+    """
+    mock_view = MockDataFrameView(pd.DataFrame.from_records(MOCK_DATA))
+    query = await IQLQuery.parse(
+        'filter_city("Berlin") or filter_city("London")',
+        allowed_functions=mock_view.list_filters(),
+    )
+    await mock_view.apply_filters(query)
+    result = mock_view.execute()
+    assert result.results == MOCK_DATA_BERLIN_OR_LONDON
+    assert result.context["filter_mask"].tolist() == [True, False, True, False, True]
+
+
+async def test_filter_and() -> None:
+    """
+    Test that filtering the DataFrame with logical AND works correctly
+    """
+    mock_view = MockDataFrameView(pd.DataFrame.from_records(MOCK_DATA))
+    query = await IQLQuery.parse(
+        'filter_city("Paris") and filter_year(2020)',
+        allowed_functions=mock_view.list_filters(),
+    )
+    await mock_view.apply_filters(query)
+    result = mock_view.execute()
+    assert result.results == MOCK_DATA_PARIS_2020
+    assert result.context["filter_mask"].tolist() == [False, True, False, False, False]
+
+
+async def test_filter_not() -> None:
+    """
+    Test that filtering the DataFrame with logical NOT works correctly
+    """
+    mock_view = MockDataFrameView(pd.DataFrame.from_records(MOCK_DATA))
+    query = await IQLQuery.parse(
+        'not (filter_city("Paris") and filter_year(2020))',
+        allowed_functions=mock_view.list_filters(),
+    )
+    await mock_view.apply_filters(query)
+    result = mock_view.execute()
+    assert result.results == MOCK_DATA_NOT_PARIS_2020
+    assert result.context["filter_mask"].tolist() == [True, False, True, True, True]
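
The tests above exercise `execute()` only with the default `dry_run=False`. The sketch below shows, with plain pandas, what a dry-run check could look like, assuming the intended dry-run behavior is "no records, mask still available". `execute_sketch` is a hypothetical stand-in for the view's `execute` method, not db-ally code. Note that it starts from `pd.DataFrame()`, an empty DataFrame instance, since `DataFrame.empty` is a boolean property rather than a constructor.

```python
import pandas as pd


def execute_sketch(df, mask=None, dry_run=False):
    """Hypothetical stand-in for execute(): returns records, or none on a dry run."""
    filtered = pd.DataFrame()  # an empty DataFrame instance
    if not dry_run:
        filtered = df if mask is None else df.loc[mask]
    return filtered.to_dict(orient="records")


df = pd.DataFrame.from_records([
    {"name": "Alice", "city": "London"},
    {"name": "Bob", "city": "Paris"},
])
mask = df["city"] == "Paris"

dry = execute_sketch(df, mask, dry_run=True)  # no rows are materialized
real = execute_sketch(df, mask)               # rows selected by the mask

print(dry)
print(real)
```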