docs(aggregations): structured view aggregations #87

Merged
merged 12 commits into from
Sep 2, 2024
Changes from 11 commits
6 changes: 4 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -33,7 +33,7 @@ The benefits of db-ally can be described in terms of its four main characteristi

## Quickstart

In db-ally, developers define their use cases by implementing [**views**](https://db-ally.deepsense.ai/concepts/views) and **filters**. A list of possible filters is presented to the LLM in terms of [**IQL**](https://db-ally.deepsense.ai/concepts/iql) (Intermediate Query Language). Views are grouped and registered within a [**collection**](https://db-ally.deepsense.ai/concepts/views), which then serves as an entry point for asking questions in natural language.
In db-ally, developers define their use cases by implementing [**views**](https://db-ally.deepsense.ai/concepts/views), **filters** and **aggregations**. A list of possible filters and aggregations is presented to the LLM in terms of [**IQL**](https://db-ally.deepsense.ai/concepts/iql) (Intermediate Query Language). Views are grouped and registered within a [**collection**](https://db-ally.deepsense.ai/concepts/views), which then serves as an entry point for asking questions in natural language.

This is a basic implementation of a db-ally view for an example HR application, which retrieves candidates from an SQL database:

Expand All @@ -60,8 +60,10 @@ class CandidateView(SqlAlchemyBaseView):
"""
return Candidate.country == country

engine = create_engine('sqlite:///examples/recruiting/data/candidates.db')

llm = LiteLLM(model_name="gpt-3.5-turbo")
engine = create_engine("sqlite:///examples/recruiting/data/candidates.db")

my_collection = create_collection("collection_name", llm)
my_collection.add(CandidateView, lambda: CandidateView(engine))

2 changes: 1 addition & 1 deletion docs/about/roadmap.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@ Below you can find a list of planned features and integrations.

## Planned Features

- [ ] **Support analytical queries**: support for exposing operations beyond filtering.
- [x] **Support analytical queries**: support for exposing operations beyond filtering.
- [x] **Few-shot prompting configuration**: allow users to configure the few-shot prompting in View definition to
improve IQL generation accuracy.
- [ ] **Request contextualization**: allow to provide extra context for db-ally runs, such as user asking the question.
39 changes: 36 additions & 3 deletions docs/concepts/iql.md
Original file line number Diff line number Diff line change
@@ -1,12 +1,45 @@
# Concept: IQL

Intermediate Query Language (IQL) is a simple language that serves as an abstraction layer between natural language and data source-specific query syntax, such as SQL. With db-ally's [structured views](./structured_views.md), LLM utilizes IQL to express complex queries in a simplified way.
Intermediate Query Language (IQL) is a simple language that serves as an abstraction layer between natural language and data source-specific query syntax, such as SQL. With db-ally's [structured views](structured_views.md), the LLM uses IQL to express complex queries in a simplified way. IQL allows developers to model operations such as filtering and aggregation on the underlying data.

## Filtering

For instance, an LLM might generate an IQL query like this when asked "Find me French candidates suitable for a senior data scientist position":

```python
from_country("France") AND senior_data_scientist_position()
```
from_country('France') AND senior_data_scientist_position()

The capabilities made available to the AI model via IQL differ between projects. Developers control these by defining special [views](structured_views.md). db-ally automatically exposes special methods defined in structured views, known as "filters", via IQL. For instance, the expression above suggests that the specific project contains a view that includes the `from_country` and `senior_data_scientist_position` methods (and possibly others that the LLM did not choose to use for this particular question). Additionally, the LLM can use boolean operators (`AND`, `OR`, `NOT`) to combine individual filters into more complex expressions.
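To make the semantics of such an expression concrete, here is a minimal sketch in plain Python. It is not how db-ally evaluates IQL (db-ally translates filters into data source queries); the `Candidate` dataclass, the sample data, the `and_`/`or_`/`not_` helpers, and the "5+ years means senior" rule are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Candidate:
    name: str
    country: str
    position: str
    years_of_experience: int


# A filter call yields a predicate that accepts or rejects one candidate.
Predicate = Callable[[Candidate], bool]


def from_country(country: str) -> Predicate:
    """Filter: keep candidates from the given country."""
    return lambda c: c.country == country


def senior_data_scientist_position() -> Predicate:
    """Filter: keep data scientists with 5+ years (the threshold is an assumption)."""
    return lambda c: c.position == "Data Scientist" and c.years_of_experience >= 5


# Hypothetical stand-ins for the IQL boolean operators AND, OR and NOT.
def and_(left: Predicate, right: Predicate) -> Predicate:
    return lambda c: left(c) and right(c)


def or_(left: Predicate, right: Predicate) -> Predicate:
    return lambda c: left(c) or right(c)


def not_(inner: Predicate) -> Predicate:
    return lambda c: not inner(c)


candidates: List[Candidate] = [
    Candidate("Alice", "France", "Data Scientist", 7),
    Candidate("Bruno", "France", "Data Scientist", 2),
    Candidate("Chen", "China", "Data Scientist", 8),
    Candidate("Dana", "France", "Recruiter", 10),
]

# Equivalent of: from_country("France") AND senior_data_scientist_position()
expression = and_(from_country("France"), senior_data_scientist_position())
matches = [c.name for c in candidates if expression(c)]
print(matches)
```

The point of the sketch is that each filter is an independent building block, so the LLM only has to pick filters, supply arguments, and connect them with boolean operators.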

## Aggregation

Similar to filtering, developers can define special methods in [structured views](structured_views.md) that perform aggregation. These methods are also exposed to the LLM via IQL. For example, an LLM might generate the following IQL query when asked "What's the average salary for each country?":

```python
average_salary_by_country()
```

The capabilities made available to the AI model via IQL differ between projects. Developers control these by defining special [Views](structured_views.md). db-ally automatically exposes special methods defined in structured views, known as "filters", via IQL. For instance, the expression above suggests that the specific project contains a view that includes the `from_country` and `senior_data_scientist_position` methods (and possibly others that the LLM did not choose to use for this particular question). Additionally, the LLM can use Boolean operators (`and`,`or`, `not`) to combine individual filters into more complex expressions.
The `average_salary_by_country` method groups candidates by country and calculates the average salary for each group.

The aggregation IQL call has access to the raw query, so it can perform more complex aggregations, such as grouping by different columns or applying custom functions. For example, we can ask db-ally to generate a candidate report with the following IQL query:

```python
candidate_report()
```

In this case, the `candidate_report` method is defined in a structured view and performs a series of aggregations and calculations to produce a report with the average salary, number of candidates, and other metrics by country.
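As a rough idea of the kind of query such a report could boil down to, the sketch below runs a grouped aggregation with Python's built-in `sqlite3` module instead of db-ally and SQLAlchemy. The table layout, the sample rows, and the choice of `MAX(salary)` as the "other metric" are assumptions for illustration only.

```python
import sqlite3

# In-memory database with a hypothetical candidates table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE candidates (name TEXT, country TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO candidates VALUES (?, ?, ?)",
    [
        ("Alice", "France", 60000),
        ("Bruno", "France", 80000),
        ("Ewa", "Poland", 50000),
        ("Filip", "Poland", 70000),
        ("Gosia", "Poland", 90000),
        ("Hank", "USA", 100000),
    ],
)

# One query computes several metrics per country in a single pass.
REPORT_SQL = """
    SELECT country,
           COUNT(*)    AS number_of_candidates,
           AVG(salary) AS average_salary,
           MAX(salary) AS highest_salary
    FROM candidates
    GROUP BY country
    ORDER BY country
"""

report = conn.execute(REPORT_SQL).fetchall()
for row in report:
    print(row)
```

Because the aggregation method controls the whole select statement, a single IQL call can return several metrics at once rather than one value per call.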

## Operation chaining

Some queries require both filtering and aggregation. For example, to calculate the average salary of a senior data scientist in the US, we first need to filter the data to include only US candidates who are senior data scientists, and then calculate the average salary. In this case, db-ally first generates an IQL query to filter the data, and then another IQL query to calculate the average salary:

```python
from_country("USA") AND senior_data_scientist_position()
```

```python
average_salary()
```

In this case, db-ally processes the two IQL queries in sequence to build a single query plan, which it then executes on the data source.
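The idea of two IQL steps feeding one query can be sketched with the standard-library `sqlite3` module. This is not db-ally's implementation: the mapping of filters to `WHERE` fragments, the `and_` helper, the `seniority` column, and the sample rows are all hypothetical.

```python
import sqlite3
from typing import List, Tuple

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE candidates "
    "(name TEXT, country TEXT, position TEXT, seniority TEXT, salary REAL)"
)
conn.executemany(
    "INSERT INTO candidates VALUES (?, ?, ?, ?, ?)",
    [
        ("Ann", "USA", "Data Scientist", "senior", 150000),
        ("Bob", "USA", "Data Scientist", "senior", 130000),
        ("Cid", "USA", "Data Scientist", "junior", 90000),
        ("Dee", "France", "Data Scientist", "senior", 80000),
        ("Eve", "USA", "Recruiter", "senior", 70000),
    ],
)

# A filter is represented as a WHERE fragment plus its bound parameters.
Filter = Tuple[str, List[str]]


def from_country(country: str) -> Filter:
    return "country = ?", [country]


def senior_data_scientist_position() -> Filter:
    return "position = ? AND seniority = ?", ["Data Scientist", "senior"]


def and_(*filters: Filter) -> Filter:
    """Combine filters with AND, keeping parameters in matching order."""
    sql = " AND ".join(f"({fragment})" for fragment, _ in filters)
    params = [value for _, values in filters for value in values]
    return sql, params


def average_salary() -> str:
    """Aggregation: decides what the final query selects."""
    return "AVG(salary) AS average_salary"


# Step 1: the filtering IQL narrows the rows.
where_sql, params = and_(from_country("USA"), senior_data_scientist_position())
# Step 2: the aggregation IQL is applied on top, yielding ONE query.
query = f"SELECT {average_salary()} FROM candidates WHERE {where_sql}"

(average,) = conn.execute(query, params).fetchone()
print(query)
print(average)
```

Only one statement reaches the database: the filter contributes the `WHERE` clause and the aggregation contributes the select list.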
6 changes: 3 additions & 3 deletions docs/concepts/structured_views.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@ Structured views are a type of [view](../concepts/views.md), which provide a way

Given different natural language queries, a db-ally view will produce different responses while maintaining a consistent data structure. This consistency offers a reliable interface for integration - the code consuming responses from a particular structured view knows what data structure to expect and can utilize this knowledge when displaying or processing the data. This feature of db-ally makes it stand out in terms of reliability and stability compared to standard text-to-SQL approaches.

Each structured view can contain one or more filters, which the LLM may decide to choose and apply to the extracted data so that it meets the criteria specified in the natural language query. Given such a query, LLM chooses which filters to use, provides arguments to the filters, and connects the filters with Boolean operators. The LLM expresses these filter combinations using a special language called [IQL](iql.md), in which the defined view filters provide a layer of abstraction between the LLM and the raw syntax used to query the data source (e.g., SQL).
Each structured view can contain one or more **filters** or **aggregations**, which the LLM may decide to choose and apply to the extracted data so that it meets the criteria specified in the natural language query. Given such a query, the LLM chooses which filters to use, provides arguments to the filters, and connects the filters with boolean operators. For aggregations, the LLM selects an appropriate aggregation method and applies it to the data. The LLM expresses these filter combinations and aggregations using a special language called [IQL](iql.md), in which the defined view filters and aggregations provide a layer of abstraction between the LLM and the raw syntax used to query the data source (e.g., SQL).

!!! example
For instance, this is a simple [view that uses SQLAlchemy](../how-to/views/sql.md) to select data from specific columns in a SQL database. It contains a single filter, that the LLM may optionally use to control which table rows to fetch:
Expand All @@ -18,14 +18,14 @@ Each structured view can contain one or more “filters”, which the LLM may de
A view for retrieving candidates from the database.
"""

def get_select(self):
def get_select(self) -> Select:
"""
Defines which columns to select
"""
return sqlalchemy.select(Candidate.id, Candidate.name, Candidate.country)

@decorators.view_filter()
def from_country(self, country: str):
def from_country(self, country: str) -> ColumnElement:
"""
Filter candidates from a specific country.
"""
10 changes: 6 additions & 4 deletions docs/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,8 +10,8 @@ hide:
</style>

<div align="center" markdown="span">
![dbally logo](https://raw.githubusercontent.com/deepsense-ai/db-ally/mp/update-logo/docs/assets/banner-light.svg#only-light){ width="30%" }
![dbally logo](https://raw.githubusercontent.com/deepsense-ai/db-ally/mp/update-logo/docs/assets/banner-dark.svg#only-dark){ width="30%" }
![dbally logo](https://raw.githubusercontent.com/deepsense-ai/db-ally/main/docs/assets/banner-light.svg#only-light){ width="30%" }
![dbally logo](https://raw.githubusercontent.com/deepsense-ai/db-ally/main/docs/assets/banner-dark.svg#only-dark){ width="30%" }
</div>

<p align="center">
Expand Down Expand Up @@ -49,7 +49,7 @@ The benefits of db-ally can be described in terms of its four main characteristi

## Quickstart

In db-ally, developers define their use cases by implementing [**views**](https://db-ally.deepsense.ai/concepts/views) and **filters**. A list of possible filters is presented to the LLM in terms of [**IQL**](https://db-ally.deepsense.ai/concepts/iql) (Intermediate Query Language). Views are grouped and registered within a [**collection**](https://db-ally.deepsense.ai/concepts/views), which then serves as an entry point for asking questions in natural language.
In db-ally, developers define their use cases by implementing [**views**](https://db-ally.deepsense.ai/concepts/views), **filters** and **aggregations**. A list of possible filters and aggregations is presented to the LLM in terms of [**IQL**](https://db-ally.deepsense.ai/concepts/iql) (Intermediate Query Language). Views are grouped and registered within a [**collection**](https://db-ally.deepsense.ai/concepts/views), which then serves as an entry point for asking questions in natural language.

This is a basic implementation of a db-ally view for an example HR application, which retrieves candidates from an SQL database:

Expand All @@ -76,8 +76,10 @@ class CandidateView(SqlAlchemyBaseView):
"""
return Candidate.country == country

engine = create_engine('sqlite:///examples/recruiting/data/candidates.db')

llm = LiteLLM(model_name="gpt-3.5-turbo")
engine = create_engine("sqlite:///examples/recruiting/data/candidates.db")

my_collection = create_collection("collection_name", llm)
my_collection.add(CandidateView, lambda: CandidateView(engine))

93 changes: 93 additions & 0 deletions docs/quickstart/aggregations.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,93 @@
# Quickstart: Aggregations

This guide is a continuation of the [Intro](./intro.md) guide. It assumes that you have already set up the views and the collection. If not, please refer to the complete Part 1 code on [GitHub](https://github.com/deepsense-ai/db-ally/blob/main/examples/intro.py){:target="_blank"}.

In this guide, we will add aggregations to our view so that we can calculate some general metrics about the candidates.

## View Definition

To add aggregations to our [structured view](../concepts/structured_views.md), we'll define new methods. These methods allow the LLM to perform calculations and summarize data across multiple rows. Let's add three aggregation methods to our `CandidateView`:

```python
class CandidateView(SqlAlchemyBaseView):
"""
A view for retrieving candidates from the database.
"""

def get_select(self) -> sqlalchemy.Select:
"""
Creates the initial SqlAlchemy select object, which will be used to build the query.
"""
return sqlalchemy.select(Candidate)

@decorators.view_aggregation()
def average_years_of_experience(self) -> sqlalchemy.Select:
"""
Calculates the average years of experience of candidates.
"""
return self.select.with_only_columns(
sqlalchemy.func.avg(Candidate.years_of_experience).label("average_years_of_experience")
)

@decorators.view_aggregation()
def positions_per_country(self) -> sqlalchemy.Select:
"""
Returns the number of candidates per position per country.
"""
return (
self.select.with_only_columns(
sqlalchemy.func.count(Candidate.position).label("number_of_positions"),
Candidate.position,
Candidate.country,
)
.group_by(Candidate.position, Candidate.country)
.order_by(sqlalchemy.desc("number_of_positions"))
)

@decorators.view_aggregation()
def candidates_per_country(self) -> sqlalchemy.Select:
"""
Returns the number of candidates per country.
"""
return (
self.select.with_only_columns(
sqlalchemy.func.count(Candidate.id).label("number_of_candidates"),
Candidate.country,
)
.group_by(Candidate.country)
)
```

By setting up these aggregations, you enable the LLM to calculate the average years of experience, the number of candidates per position per country, and the number of candidates per country.

## Query Execution

Having already defined and registered the view with the collection, we can now execute the query:

```python
result = await collection.ask("What is the average years of experience of candidates?")
print(result.results)
```

This will return the average years of experience of candidates.

<details>
<summary>The expected output</summary>
```
The generated SQL query is: SELECT avg(candidates.years_of_experience) AS average_years_of_experience
FROM candidates

Number of rows: 1
{'average_years_of_experience': 4.98}
```
</details>

Feel free to try other questions like: "What's the distribution of candidates across different positions and countries?" or "How many candidates are from China?".

## Full Example

Access the full example on [GitHub](https://github.com/deepsense-ai/db-ally/blob/main/examples/aggregations.py){:target="_blank"}.

## Next Steps

Explore [Quickstart Part 3: Semantic Similarity](./semantic-similarity.md) to expand on the example and learn about using semantic similarity.
4 changes: 2 additions & 2 deletions docs/quickstart/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -52,7 +52,7 @@ Candidate = Base.classes.candidates

## View Definition

To use db-ally, define the views you want to use. A [structured view](../concepts/structured_views.md) is a class that specifies what to select from the database and includes methods that the AI model can use to filter rows. These methods are known as "filters".
To use db-ally, define the views you want to use. A [structured view](../concepts/structured_views.md) is a class that specifies what to select from the database and includes methods that the AI model can use to filter rows. These methods are known as **filters**.

```python
from dbally import decorators, SqlAlchemyBaseView
Expand Down Expand Up @@ -174,4 +174,4 @@ Access the full example on [GitHub](https://github.com/deepsense-ai/db-ally/blob

## Next Steps

Explore [Quickstart Part 2: Semantic Similarity](./semantic-similarity.md) to expand on the example and learn about using semantic similarity.
Explore [Quickstart Part 2: Aggregations](./aggregations.md) to expand on the example and learn about using aggregations.
3 changes: 2 additions & 1 deletion docs/quickstart/multiple-views.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# Quickstart: Multiple Views

This guide continues from [Semantic Similarity](./semantic-similarity.md) guide. It assumes that you have already set up the views and the collection. If not, please refer to the complete Part 2 code on [GitHub](https://github.com/deepsense-ai/db-ally/blob/main/examples/semantic_similarity.py){:target="_blank"}.
This guide continues from the [Semantic Similarity](./semantic-similarity.md) guide. It assumes that you have already set up the views and the collection. If not, please refer to the complete Part 3 code on [GitHub](https://github.com/deepsense-ai/db-ally/blob/main/examples/semantic_similarity.py){:target="_blank"}.

The guide illustrates how to use multiple views to handle queries requiring different types of data. `CandidateView` and `JobView` are used as examples.

Expand Down Expand Up @@ -28,6 +28,7 @@ jobs_data = pd.DataFrame.from_records([
{"title": "Machine Learning Engineer", "company": "Company C", "location": "Berlin", "salary": 90000},
{"title": "Data Scientist", "company": "Company D", "location": "London", "salary": 110000},
{"title": "Data Scientist", "company": "Company E", "location": "Warsaw", "salary": 80000},
{"title": "Data Scientist", "company": "Company F", "location": "Warsaw", "salary": 100000},
])
```

4 changes: 2 additions & 2 deletions docs/quickstart/semantic-similarity.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# Quickstart: Semantic Similarity

This guide is a continuation of the [Intro](./index.md) guide. It assumes that you have already set up the views and the collection. If not, please refer to the complete Part 1 code on [GitHub](https://github.com/deepsense-ai/db-ally/blob/main/examples/intro.py){:target="_blank"}.
This guide is a continuation of the [Aggregations](./aggregations.md) guide. It assumes that you have already set up the views and the collection. If not, please refer to the complete Part 2 code on [GitHub](https://github.com/deepsense-ai/db-ally/blob/main/examples/aggregations.py){:target="_blank"}.

This guide will demonstrate how to use semantic similarity to handle queries in which the filter values are similar to those in the database, without requiring an exact match. We will use filtering by country as an example.

Expand Down Expand Up @@ -150,4 +150,4 @@ To see the full example, you can find the code on [GitHub](https://github.com/de

## Next Steps

Explore [Quickstart Part 3: Multiple Views](./multiple-views.md) to learn how to run queries with multiple views and display the results based on the view that was used to fetch the data.
Explore [Quickstart Part 4: Multiple Views](./multiple-views.md) to learn how to run queries with multiple views and display the results.