Merge branch 'lt/dataframe' into 'main'
Introduce Pandas DataFrame base view

See merge request deepsense.ai/g-internal/db-ally!74
ludwiktrammer committed Mar 28, 2024
2 parents c396073 + 77131f1 commit 380c4b7
Showing 9 changed files with 354 additions and 19 deletions.
2 changes: 1 addition & 1 deletion docs/about/roadmap.md
@@ -26,7 +26,7 @@ Below you can find a list of planned integrations.
 ### Data sources

 - [x] Sqlalchemy
-- [ ] Pandas DataFrame
+- [x] Pandas DataFrame
 - [ ] HTTP REST Endpoints
 - [ ] GraphQL Endpoints

2 changes: 1 addition & 1 deletion docs/how-to/custom_views.md
@@ -1,4 +1,4 @@
-# How-To: Write Custom Views
+# How-To: Use custom data sources with db-ally

 !!! note
     This is an advanced topic. If you're looking to create a view that retrieves data from an SQL database, please refer to the [SQL Views](sql_views.md) guide instead.
96 changes: 96 additions & 0 deletions docs/how-to/pandas_views.md
@@ -0,0 +1,96 @@
# How To: Use Pandas DataFrames with db-ally

In this guide, you will learn how to write [views](../concepts/views.md) that use [Pandas](https://pandas.pydata.org/) DataFrames as their data source. You will understand how to define such a view, create filters that operate on the DataFrame, and register it while providing it with the source DataFrame.

The example used in this guide is a DataFrame containing information about candidates. The DataFrame includes columns such as `id`, `name`, `country`, and `years_of_experience`. This is the same use case as the one in the [Quickstart](../quickstart/index.md) and [Custom Views](./custom_views.md) guides. Please feel free to compare the different approaches.

## The DataFrame
Here is an example of a DataFrame containing information about candidates:

```python
import pandas as pd

CANDIDATE_DATA = pd.DataFrame.from_records([
    {"id": 1, "name": "John Doe", "position": "Data Scientist", "years_of_experience": 2, "country": "France"},
    {"id": 2, "name": "Jane Doe", "position": "Data Engineer", "years_of_experience": 3, "country": "France"},
    {"id": 3, "name": "Alice Smith", "position": "Machine Learning Engineer", "years_of_experience": 4, "country": "Germany"},
    {"id": 4, "name": "Bob Smith", "position": "Data Scientist", "years_of_experience": 5, "country": "Germany"},
    {"id": 5, "name": "Janka Jankowska", "position": "Data Scientist", "years_of_experience": 3, "country": "Poland"},
])
```

## View Definition
Views operating on Pandas DataFrames are defined by subclassing the `DataFrameBaseView` class:

```python
from dbally import decorators, DataFrameBaseView

class CandidateView(DataFrameBaseView):
    """
    View for retrieving information about candidates.
    """
```

Typically, a view contains one or more filters that operate on the DataFrame. In the case of views inheriting from `DataFrameBaseView`, filters are expected to return a [`Series`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) object that can be used as a [boolean index](https://pandas.pydata.org/pandas-docs/version/2.1/user_guide/indexing.html#boolean-indexing) for the original DataFrame. In other words, the filter should return a boolean `Series` with the same length as the original DataFrame where `True` values denote rows that should be included in the result and `False` values indicate rows that should be omitted.

Such `Series` are usually created by applying comparison operators (`==`, `>`, `<`) to DataFrame columns and combining the results with the element-wise operators `&` (for "and"), `|` (for "or"), and `~` (for "not"). For instance, `df.years_of_experience > 5` will return a boolean `Series` with `True` values for rows where the `years_of_experience` column is greater than 5.
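
As a rough illustration of how such masks behave, the sketch below builds a small DataFrame and selects rows with a comparison. The rows and variable names are invented for this example and are not part of db-ally:

```python
import pandas as pd

# Hypothetical three-row DataFrame, used only for this illustration.
df = pd.DataFrame({
    "name": ["Ann", "Ben", "Cleo"],
    "years_of_experience": [2, 6, 9],
    "country": ["France", "Poland", "France"],
})

# A comparison on a column yields a boolean Series with one value per row.
experienced = df.years_of_experience > 5
print(experienced.tolist())

# Boolean indexing keeps only the rows where the mask is True.
experienced_names = df.loc[experienced, "name"].tolist()
print(experienced_names)

# `~` inverts a mask and `|` accepts rows matching either condition.
not_french_names = df.loc[~(df.country == "France"), "name"].tolist()
french_or_junior_names = df.loc[
    (df.country == "France") | (df.years_of_experience < 3), "name"
].tolist()
print(not_french_names, french_or_junior_names)
```

A filter method only needs to return the mask itself; the row selection is shown here just to make the effect visible.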

As always, the LLM will choose the best filter to apply based on the query it receives and will combine multiple filters if necessary.
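
When several filters are chosen, their masks have to be merged into a single one. The following is a minimal sketch of that idea, assuming the masks are folded together with `&` or `|` using `functools.reduce`; the DataFrame and the `masks` list are made up for the example:

```python
from functools import reduce

import pandas as pd

# Hypothetical data, used only to show how masks combine.
df = pd.DataFrame({
    "city": ["London", "Paris", "London", "Paris", "Berlin"],
    "year": [2020, 2020, 2021, 2021, 2020],
})

# One boolean mask per filter that was selected.
masks = [df.city == "Paris", df.year == 2020]

# "and": a row must satisfy every mask.
all_of = reduce(lambda x, y: x & y, masks)

# "or": a row must satisfy at least one mask.
any_of = reduce(lambda x, y: x | y, masks)

# "not": invert a mask element-wise.
none_of = ~any_of

print(all_of.tolist())
print(any_of.tolist())
print(none_of.tolist())
```

Because every mask has the same length and index as the DataFrame, the combined result can be used for boolean indexing in exactly the same way as a single mask.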

Here are two filters that operate on the DataFrame - one filters candidates with at least a certain number of years of experience and another filters candidates from a specific country:

```python
@decorators.view_filter()
def at_least_experience(self, years: int) -> pd.Series:
    """
    Filters candidates with at least `years` of experience.
    """
    return self.df.years_of_experience >= years

@decorators.view_filter()
def from_country(self, country: str) -> pd.Series:
    """
    Filters candidates from a specific country.
    """
    return self.df.country == country
```

As you can see, the DataFrame object is accessed via the `self.df` attribute. This attribute is automatically set by the `DataFrameBaseView` class and contains the DataFrame provided when the view is registered.

Here is an example of a more advanced filter that filters candidates considered for a senior data scientist position. It uses the `&` operator to combine two conditions:

```python
@decorators.view_filter()
def senior_data_scientist_position(self) -> pd.Series:
    """
    Filters candidates that can be considered for a senior data scientist position.
    """
    return self.df.position.isin(["Data Scientist", "Machine Learning Engineer", "Data Engineer"]) \
        & (self.df.years_of_experience >= 3)
```
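
The parentheses around the second condition matter: in Python, `&` binds more tightly than comparison operators such as `>=`. The sketch below first shows this with plain integers and then evaluates the same kind of mask outside of a view, on a cut-down copy of the candidate data. The standalone `df` variable exists only for this illustration:

```python
import pandas as pd

# Plain integers: `&` is applied before `>=` unless parentheses say otherwise.
unparenthesized = 1 & 2 >= 1   # parsed as (1 & 2) >= 1
parenthesized = 1 & (2 >= 1)   # parsed as 1 & True
print(unparenthesized, parenthesized)

# Cut-down copy of the candidate data from this guide.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "position": [
        "Data Scientist",
        "Data Engineer",
        "Machine Learning Engineer",
        "Data Scientist",
        "Data Scientist",
    ],
    "years_of_experience": [2, 3, 4, 5, 3],
})

# Both conditions are complete boolean masks before `&` combines them.
senior = df.position.isin(
    ["Data Scientist", "Machine Learning Engineer", "Data Engineer"]
) & (df.years_of_experience >= 3)

senior_ids = df.loc[senior, "id"].tolist()
print(senior_ids)
```

Wrapping each comparison in its own parentheses is a simple habit that avoids this class of mistake.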

## Registering the View
To use the view, you need to create a [Collection](../concepts/collections.md) and register the view with it. This is done in the same manner as registering other types of views, but you need to provide the view with the DataFrame on which it should operate:

```python
import dbally

collection = dbally.create_collection("recruitment")
collection.add(CandidateView, lambda: CandidateView(CANDIDATE_DATA))

result = await collection.ask("Find me French candidates suitable for a senior data scientist position.")

print(f"Retrieved {len(result.results)} candidates:")
for candidate in result.results:
    print(candidate)
```

This code will return a list of French candidates eligible for a senior data scientist position and display them:

```
Retrieved 1 candidates:
{'id': 2, 'name': 'Jane Doe', 'position': 'Data Engineer', 'years_of_experience': 3, 'country': 'France'}
```
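
To see where this output comes from, the result can be reproduced by hand without an LLM. The sketch below assumes the model chose the country filter with `"France"` together with the senior position filter; the `run` helper is a hypothetical stand-in for the step that applies the mask and converts the remaining rows into records:

```python
import pandas as pd

CANDIDATE_DATA = pd.DataFrame.from_records([
    {"id": 1, "name": "John Doe", "position": "Data Scientist", "years_of_experience": 2, "country": "France"},
    {"id": 2, "name": "Jane Doe", "position": "Data Engineer", "years_of_experience": 3, "country": "France"},
    {"id": 3, "name": "Alice Smith", "position": "Machine Learning Engineer", "years_of_experience": 4, "country": "Germany"},
    {"id": 4, "name": "Bob Smith", "position": "Data Scientist", "years_of_experience": 5, "country": "Germany"},
    {"id": 5, "name": "Janka Jankowska", "position": "Data Scientist", "years_of_experience": 3, "country": "Poland"},
])


def run(df, mask=None):
    """Hypothetical helper: apply an optional boolean mask, return rows as dicts."""
    filtered = df if mask is None else df.loc[mask]
    return filtered.to_dict(orient="records")


# Masks equivalent to the two filters assumed to be selected for the question.
french = CANDIDATE_DATA.country == "France"
senior = CANDIDATE_DATA.position.isin(
    ["Data Scientist", "Machine Learning Engineer", "Data Engineer"]
) & (CANDIDATE_DATA.years_of_experience >= 3)

records = run(CANDIDATE_DATA, french & senior)
print(f"Retrieved {len(records)} candidates:")
for record in records:
    print(record)

# Without a mask, every row is returned.
all_records = run(CANDIDATE_DATA)
```

Only Jane Doe is both based in France and has at least three years of experience, which is why a single record is returned.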

## Full Example
You can access the complete example here: [pandas_views_code.py](pandas_views_code.py)
64 changes: 64 additions & 0 deletions docs/how-to/pandas_views_code.py
@@ -0,0 +1,64 @@
# pylint: disable=missing-return-doc, missing-param-doc, missing-function-docstring, missing-class-docstring, missing-raises-doc
import dbally
import os
import asyncio
from dataclasses import dataclass
from typing import Iterable, Callable, Any
import pandas as pd

from dbally import decorators, DataFrameBaseView
from dbally.audit.event_handlers.cli_event_handler import CLIEventHandler

dbally.use_openai_llm(
    openai_api_key=os.environ["OPENAI_API_KEY"],
    model_name="gpt-3.5-turbo",
)

class CandidateView(DataFrameBaseView):
    """
    View for retrieving information about candidates.
    """
    @decorators.view_filter()
    def at_least_experience(self, years: int) -> pd.Series:
        """
        Filters candidates with at least `years` of experience.
        """
        return self.df.years_of_experience >= years

    @decorators.view_filter()
    def from_country(self, country: str) -> pd.Series:
        """
        Filters candidates from a specific country.
        """
        return self.df.country == country

    @decorators.view_filter()
    def senior_data_scientist_position(self) -> pd.Series:
        """
        Filters candidates that can be considered for a senior data scientist position.
        """
        return self.df.position.isin(["Data Scientist", "Machine Learning Engineer", "Data Engineer"]) \
            & (self.df.years_of_experience >= 3)

CANDIDATE_DATA = pd.DataFrame.from_records([
    {"id": 1, "name": "John Doe", "position": "Data Scientist", "years_of_experience": 2, "country": "France"},
    {"id": 2, "name": "Jane Doe", "position": "Data Engineer", "years_of_experience": 3, "country": "France"},
    {"id": 3, "name": "Alice Smith", "position": "Machine Learning Engineer", "years_of_experience": 4, "country": "Germany"},
    {"id": 4, "name": "Bob Smith", "position": "Data Scientist", "years_of_experience": 5, "country": "Germany"},
    {"id": 5, "name": "Janka Jankowska", "position": "Data Scientist", "years_of_experience": 3, "country": "Poland"},
])

async def main():
    collection = dbally.create_collection("recruitment")
    dbally.use_event_handler(CLIEventHandler())
    collection.add(CandidateView, lambda: CandidateView(CANDIDATE_DATA))

    result = await collection.ask("Find me French candidates suitable for a senior data scientist position.")

    print(f"Retrieved {len(result.results)} candidates:")
    for candidate in result.results:
        print(candidate)


if __name__ == "__main__":
    asyncio.run(main())
6 changes: 4 additions & 2 deletions mkdocs.yml
@@ -13,11 +13,13 @@ nav:
     - concepts/similarity_indexes.md
     - concepts/nl_responder.md
   - How-to:
+    - Using data sources:
+      - how-to/sql_views.md
+      - how-to/pandas_views.md
+      - how-to/custom_views.md
     - how-to/log_runs_to_langsmith.md
     - how-to/create_custom_event_handler.md
-    - how-to/sql_views.md
     - how-to/openai_assistants_integration.md
-    - how-to/custom_views.md
   - API Reference:
     - reference/collection.md
     - reference/event_handler.md
2 changes: 2 additions & 0 deletions src/dbally/__init__.py
@@ -3,6 +3,7 @@
 from dbally.views import decorators
 from dbally.views.base import AbstractBaseView
 from dbally.views.methods_base import MethodsBaseView
+from dbally.views.pandas_base import DataFrameBaseView
 from dbally.views.sqlalchemy_base import SqlAlchemyBaseView

 from .__version__ import __version__
@@ -17,4 +18,5 @@
     "MethodsBaseView",
     "SqlAlchemyBaseView",
     "AbstractBaseView",
+    "DataFrameBaseView",
 ]
86 changes: 86 additions & 0 deletions src/dbally/views/pandas_base.py
@@ -0,0 +1,86 @@
import asyncio
import time
from functools import reduce

import pandas as pd

from dbally.data_models.execution_result import ExecutionResult
from dbally.iql import IQLQuery, syntax
from dbally.views.methods_base import MethodsBaseView


class DataFrameBaseView(MethodsBaseView):
    """
    Base class for views that use Pandas DataFrames to store and filter data.
    The views take a Pandas DataFrame as input and apply filters to it. The filters are defined as methods
    that return a Pandas Series representing a boolean mask to be applied to the DataFrame.
    """

    def __init__(self, df: pd.DataFrame) -> None:
        """
        Initializes the view with the input DataFrame.
        :param df: Pandas DataFrame with the data to be filtered
        """
        super().__init__()
        self.df = df

        # The mask to be applied to the dataframe to filter the data
        self._filter_mask: pd.Series = None

    async def apply_filters(self, filters: IQLQuery) -> None:
        """
        Applies the chosen filters to the view.
        :param filters: IQLQuery object representing the filters to apply
        """
        self._filter_mask = await self.build_filter_node(filters.root)

    async def build_filter_node(self, node: syntax.Node) -> pd.Series:
        """
        Converts a filter node from the IQLQuery to a Pandas Series representing
        a boolean mask to be applied to the dataframe.
        :param node: IQLQuery node representing the filter or logical operator
        :return: Pandas Series representing the boolean mask
        :raises ValueError: If the node type is not supported
        """
        if isinstance(node, syntax.FunctionCall):
            return await self.call_filter_method(node)
        if isinstance(node, syntax.And):  # logical AND
            children = await asyncio.gather(*[self.build_filter_node(child) for child in node.children])
            return reduce(lambda x, y: x & y, children)
        if isinstance(node, syntax.Or):  # logical OR
            children = await asyncio.gather(*[self.build_filter_node(child) for child in node.children])
            return reduce(lambda x, y: x | y, children)
        if isinstance(node, syntax.Not):
            child = await self.build_filter_node(node.child)
            return ~child
        raise ValueError(f"Unsupported grammar: {node}")

    def execute(self, dry_run: bool = False) -> ExecutionResult:
        """
        Executes the view and returns the results. The results are filtered based on the applied filters.
        :param dry_run: If True, the method will only return the mask that would be applied to the dataframe
        :return: ExecutionResult object with the results and context information
        """
        start_time = time.time()
        filtered_data = pd.DataFrame()  # empty frame (not the `DataFrame.empty` property), so a dry run yields no records

        if not dry_run:
            filtered_data = self.df
            if self._filter_mask is not None:
                filtered_data = filtered_data.loc[self._filter_mask]

        return ExecutionResult(
            results=filtered_data.to_dict(orient="records"),
            execution_time=time.time() - start_time,
            context={
                "filter_mask": self._filter_mask,
            },
        )
99 changes: 99 additions & 0 deletions tests/unit/views/test_pandas_base.py
@@ -0,0 +1,99 @@
# pylint: disable=missing-docstring, missing-return-doc, missing-param-doc, disallowed-name

import pandas as pd

from dbally.iql import IQLQuery
from dbally.views.decorators import view_filter
from dbally.views.pandas_base import DataFrameBaseView

MOCK_DATA = [
    {"name": "Alice", "city": "London", "year": 2020, "age": 30},
    {"name": "Bob", "city": "Paris", "year": 2020, "age": 25},
    {"name": "Charlie", "city": "London", "year": 2021, "age": 35},
    {"name": "David", "city": "Paris", "year": 2021, "age": 40},
    {"name": "Eve", "city": "Berlin", "year": 2020, "age": 45},
]

MOCK_DATA_BERLIN_OR_LONDON = [
    {"name": "Alice", "city": "London", "year": 2020, "age": 30},
    {"name": "Charlie", "city": "London", "year": 2021, "age": 35},
    {"name": "Eve", "city": "Berlin", "year": 2020, "age": 45},
]

MOCK_DATA_PARIS_2020 = [
    {"name": "Bob", "city": "Paris", "year": 2020, "age": 25},
]

MOCK_DATA_NOT_PARIS_2020 = [
    {"name": "Alice", "city": "London", "year": 2020, "age": 30},
    {"name": "Charlie", "city": "London", "year": 2021, "age": 35},
    {"name": "David", "city": "Paris", "year": 2021, "age": 40},
    {"name": "Eve", "city": "Berlin", "year": 2020, "age": 45},
]


class MockDataFrameView(DataFrameBaseView):
    """
    Mock class for testing the DataFrameBaseView
    """

    @view_filter()
    def filter_city(self, city: str) -> pd.Series:
        return self.df["city"] == city

    @view_filter()
    def filter_year(self, year: int) -> pd.Series:
        return self.df["year"] == year

    @view_filter()
    def filter_age(self, age: int) -> pd.Series:
        return self.df["age"] == age

    @view_filter()
    def filter_name(self, name: str) -> pd.Series:
        return self.df["name"] == name


async def test_filter_or() -> None:
    """
    Test that filtering the DataFrame with logical OR works correctly
    """
    mock_view = MockDataFrameView(pd.DataFrame.from_records(MOCK_DATA))
    query = await IQLQuery.parse(
        'filter_city("Berlin") or filter_city("London")',
        allowed_functions=mock_view.list_filters(),
    )
    await mock_view.apply_filters(query)
    result = mock_view.execute()
    assert result.results == MOCK_DATA_BERLIN_OR_LONDON
    assert result.context["filter_mask"].tolist() == [True, False, True, False, True]


async def test_filter_and() -> None:
    """
    Test that filtering the DataFrame with logical AND works correctly
    """
    mock_view = MockDataFrameView(pd.DataFrame.from_records(MOCK_DATA))
    query = await IQLQuery.parse(
        'filter_city("Paris") and filter_year(2020)',
        allowed_functions=mock_view.list_filters(),
    )
    await mock_view.apply_filters(query)
    result = mock_view.execute()
    assert result.results == MOCK_DATA_PARIS_2020
    assert result.context["filter_mask"].tolist() == [False, True, False, False, False]


async def test_filter_not() -> None:
    """
    Test that filtering the DataFrame with logical NOT works correctly
    """
    mock_view = MockDataFrameView(pd.DataFrame.from_records(MOCK_DATA))
    query = await IQLQuery.parse(
        'not (filter_city("Paris") and filter_year(2020))',
        allowed_functions=mock_view.list_filters(),
    )
    await mock_view.apply_filters(query)
    result = mock_view.execute()
    assert result.results == MOCK_DATA_NOT_PARIS_2020
    assert result.context["filter_mask"].tolist() == [True, False, True, True, True]