docs(aggregations): structured view aggregations (#87)

deepsense-ai · Sep 2, 2024 · 9f6b5df
1 parent 64086a6
Showing 13 changed files with 377 additions and 18 deletions.
diff --git a/README.md b/README.md
@@ -33,7 +33,7 @@ The benefits of db-ally can be described in terms of its four main characteristi
 
 ## Quickstart
 
-In db-ally, developers define their use cases by implementing [**views**](https://db-ally.deepsense.ai/concepts/views) and **filters**. A list of possible filters is presented to the LLM in terms of [**IQL**](https://db-ally.deepsense.ai/concepts/iql) (Intermediate Query Language). Views are grouped and registered within a [**collection**](https://db-ally.deepsense.ai/concepts/views), which then serves as an entry point for asking questions in natural language.
+In db-ally, developers define their use cases by implementing [**views**](https://db-ally.deepsense.ai/concepts/views), **filters** and **aggregations**. A list of possible filters and aggregations is presented to the LLM in terms of [**IQL**](https://db-ally.deepsense.ai/concepts/iql) (Intermediate Query Language). Views are grouped and registered within a [**collection**](https://db-ally.deepsense.ai/concepts/views), which then serves as an entry point for asking questions in natural language.
 
 This is a basic implementation of a db-ally view for an example HR application, which retrieves candidates from an SQL database:
 
@@ -60,8 +60,10 @@ class CandidateView(SqlAlchemyBaseView):
         """
         return Candidate.country == country
 
-engine = create_engine('sqlite:///examples/recruiting/data/candidates.db')
+
 llm = LiteLLM(model_name="gpt-3.5-turbo")
+engine = create_engine("sqlite:///examples/recruiting/data/candidates.db")
+
 my_collection = create_collection("collection_name", llm)
 my_collection.add(CandidateView, lambda: CandidateView(engine))
 

diff --git a/docs/about/roadmap.md b/docs/about/roadmap.md
@@ -9,7 +9,7 @@ Below you can find a list of planned features and integrations.
 
 ## Planned Features
 
-- [ ] **Support analytical queries**: support for exposing operations beyond filtering.
+- [x] **Support analytical queries**: support for exposing operations beyond filtering.
 - [x] **Few-shot prompting configuration**: allow users to configure the few-shot prompting in View definition to
     improve IQL generation accuracy.
 - [ ] **Request contextualization**: allow to provide extra context for db-ally runs, such as user asking the question.

diff --git a/docs/concepts/iql.md b/docs/concepts/iql.md
@@ -1,12 +1,45 @@
 # Concept: IQL
 
-Intermediate Query Language (IQL) is a simple language that serves as an abstraction layer between natural language and data source-specific query syntax, such as SQL. With db-ally's [structured views](./structured_views.md), LLM utilizes IQL to express complex queries in a simplified way.
+Intermediate Query Language (IQL) is a simple language that serves as an abstraction layer between natural language and data source-specific query syntax, such as SQL. With db-ally's [structured views](structured_views.md), the LLM uses IQL to express complex queries in a simplified way. IQL allows developers to model operations such as filtering and aggregation on the underlying data.
+
+## Filtering
 
 For instance, an LLM might generate an IQL query like this when asked "Find me French candidates suitable for a senior data scientist position":
 
+```python
+from_country("France") AND senior_data_scientist_position()
 ```
-from_country('France') AND senior_data_scientist_position()
+
+The capabilities made available to the AI model via IQL differ between projects. Developers control these by defining special [views](structured_views.md). db-ally automatically exposes special methods defined in structured views, known as "filters", via IQL. For instance, the expression above suggests that the specific project contains a view that includes the `from_country` and `senior_data_scientist_position` methods (and possibly others that the LLM did not choose to use for this particular question). Additionally, the LLM can use boolean operators (`AND`, `OR`, `NOT`) to combine individual filters into more complex expressions.
+
+## Aggregation
+
+Similar to filtering, developers can define special methods in [structured views](structured_views.md) that perform aggregation. These methods are also exposed to the LLM via IQL. For example, an LLM might generate the following IQL query when asked "What's the average salary for each country?":
+
+```python
+average_salary_by_country()
 ```
 
-The capabilities made available to the AI model via IQL differ between projects. Developers control these by defining special [Views](structured_views.md). db-ally automatically exposes special methods defined in structured views, known as "filters", via IQL. For instance, the expression above suggests that the specific project contains a view that includes the `from_country` and `senior_data_scientist_position` methods (and possibly others that the LLM did not choose to use for this particular question). Additionally, the LLM can use Boolean operators (`and`,`or`, `not`) to combine individual filters into more complex expressions.
+The `average_salary_by_country` method groups candidates by country and calculates the average salary for each group.
+
+The aggregation IQL call has access to the raw query, so it can perform more complex aggregations, such as grouping by several columns or applying custom functions. For example, we can ask db-ally to generate a candidate report with the following IQL query:
+
+```python
+candidate_report()
+```
+
+In this case, the `candidate_report` method is defined in a structured view, and it performs a series of aggregations and calculations to produce a report with the average salary, number of candidates, and other metrics, by country.
+
+## Operation chaining
+
+Some queries require both filtering and aggregation. For example, to calculate the average salary for a senior data scientist in the US, we first need to filter the data to include only US candidates in senior data scientist positions, and then calculate the average salary. In this case, db-ally will first generate an IQL query to filter the data, and then another IQL query to calculate the average salary.
+
+```python
+from_country("USA") AND senior_data_scientist_position()
+```
+
+```python
+average_salary()
+```
 
+In this case, db-ally applies the two IQL queries in order to build a single query plan, which is then executed on the data source.
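The operation chaining described in `docs/concepts/iql.md` can be illustrated with a minimal sketch that uses only the standard-library `sqlite3` module. This is not db-ally's implementation: the table layout, the sample rows, and the way the two steps are merged into one SQL string are all assumptions made for illustration. The point is only that the filter step contributes a `WHERE` clause, the aggregation step contributes the selected expression, and a single statement reaches the database.

```python
import sqlite3

# Hypothetical in-memory data standing in for a candidates table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE candidates (name TEXT, country TEXT, position TEXT, salary REAL)"
)
conn.executemany(
    "INSERT INTO candidates VALUES (?, ?, ?, ?)",
    [
        ("Ann", "USA", "Senior Data Scientist", 150000.0),
        ("Bob", "USA", "Senior Data Scientist", 130000.0),
        ("Cid", "USA", "Junior Data Scientist", 90000.0),
        ("Dee", "France", "Senior Data Scientist", 100000.0),
    ],
)

# Step 1 (filtering): from_country("USA") AND senior_data_scientist_position()
# is modeled here as a WHERE clause with bound parameters.
where_clause = "country = ? AND position = ?"
params = ("USA", "Senior Data Scientist")

# Step 2 (aggregation): average_salary() is modeled as the selected expression.
select_expr = "AVG(salary) AS average_salary"

# Both steps are merged into one statement, which is executed once.
query = f"SELECT {select_expr} FROM candidates WHERE {where_clause}"
average_salary = conn.execute(query, params).fetchone()[0]
print(average_salary)
```

Only the two US senior data scientists pass the filter, so the average is taken over their salaries alone.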
diff --git a/docs/concepts/structured_views.md b/docs/concepts/structured_views.md
@@ -7,7 +7,7 @@ Structured views are a type of [view](../concepts/views.md), which provide a way
 
 Given different natural language queries, a db-ally view will produce different responses while maintaining a consistent data structure. This consistency offers a reliable interface for integration - the code consuming responses from a particular structured view knows what data structure to expect and can utilize this knowledge when displaying or processing the data. This feature of db-ally makes it stand out in terms of reliability and stability compared to standard text-to-SQL approaches.
 
-Each structured view can contain one or more “filters”, which the LLM may decide to choose and apply to the extracted data so that it meets the criteria specified in the natural language query. Given such a query, LLM chooses which filters to use, provides arguments to the filters, and connects the filters with Boolean operators. The LLM expresses these filter combinations using a special language called [IQL](iql.md), in which the defined view filters provide a layer of abstraction between the LLM and the raw syntax used to query the data source (e.g., SQL).
+Each structured view can contain one or more **filters** or **aggregations**, which the LLM may decide to choose and apply to the extracted data so that it meets the criteria specified in the natural language query. Given such a query, the LLM chooses which filters to use, provides arguments to the filters, and connects the filters with boolean operators. For aggregations, the LLM selects an appropriate aggregation method and applies it to the data. The LLM expresses these filter combinations and aggregations using a special language called [IQL](iql.md), in which the defined view filters and aggregations provide a layer of abstraction between the LLM and the raw syntax used to query the data source (e.g., SQL).
 
 !!! example
     For instance, this is a simple [view that uses SQLAlchemy](../how-to/views/sql.md) to select data from specific columns in a SQL database. It contains a single filter, that the LLM may optionally use to control which table rows to fetch:
@@ -18,14 +18,14 @@ Each structured view can contain one or more “filters”, which the LLM may de
         A view for retrieving candidates from the database.
         """
 
-        def get_select(self):
+        def get_select(self) -> Select:
             """
             Defines which columns to select
             """
             return sqlalchemy.select(Candidate.id, Candidate.name, Candidate.country)
 
         @decorators.view_filter()
-        def from_country(self, country: str):
+        def from_country(self, country: str) -> ColumnElement:
             """
             Filter candidates from a specific country.
             """

diff --git a/docs/index.md b/docs/index.md
@@ -10,8 +10,8 @@ hide:
 </style>
 
 <div align="center" markdown="span">
-  ![dbally logo](https://raw.githubusercontent.com/deepsense-ai/db-ally/mp/update-logo/docs/assets/banner-light.svg#only-light){ width="30%" }
-  ![dbally logo](https://raw.githubusercontent.com/deepsense-ai/db-ally/mp/update-logo/docs/assets/banner-dark.svg#only-dark){ width="30%" }
+  ![dbally logo](https://raw.githubusercontent.com/deepsense-ai/db-ally/main/docs/assets/banner-light.svg#only-light){ width="30%" }
+  ![dbally logo](https://raw.githubusercontent.com/deepsense-ai/db-ally/main/docs/assets/banner-dark.svg#only-dark){ width="30%" }
 </div>
 
 <p align="center">
@@ -49,7 +49,7 @@ The benefits of db-ally can be described in terms of its four main characteristi
 
 ## Quickstart
 
-In db-ally, developers define their use cases by implementing [**views**](https://db-ally.deepsense.ai/concepts/views) and **filters**. A list of possible filters is presented to the LLM in terms of [**IQL**](https://db-ally.deepsense.ai/concepts/iql) (Intermediate Query Language). Views are grouped and registered within a [**collection**](https://db-ally.deepsense.ai/concepts/views), which then serves as an entry point for asking questions in natural language.
+In db-ally, developers define their use cases by implementing [**views**](https://db-ally.deepsense.ai/concepts/views), **filters** and **aggregations**. A list of possible filters and aggregations is presented to the LLM in terms of [**IQL**](https://db-ally.deepsense.ai/concepts/iql) (Intermediate Query Language). Views are grouped and registered within a [**collection**](https://db-ally.deepsense.ai/concepts/views), which then serves as an entry point for asking questions in natural language.
 
 This is a basic implementation of a db-ally view for an example HR application, which retrieves candidates from an SQL database:
 
@@ -76,8 +76,10 @@ class CandidateView(SqlAlchemyBaseView):
         """
         return Candidate.country == country
 
-engine = create_engine('sqlite:///examples/recruiting/data/candidates.db')
+
 llm = LiteLLM(model_name="gpt-3.5-turbo")
+engine = create_engine("sqlite:///examples/recruiting/data/candidates.db")
+
 my_collection = create_collection("collection_name", llm)
 my_collection.add(CandidateView, lambda: CandidateView(engine))
 

diff --git a/docs/quickstart/aggregations.md b/docs/quickstart/aggregations.md
@@ -0,0 +1,93 @@
+# Quickstart: Aggregations
+
+This guide is a continuation of the [Intro](./index.md) guide. It assumes that you have already set up the views and the collection. If not, please refer to the complete Part 1 code on [GitHub](https://github.com/deepsense-ai/db-ally/blob/main/examples/intro.py){:target="_blank"}.
+
+In this guide, we will add aggregations to our view to calculate general metrics about the candidates.
+
+## View Definition
+
+To add aggregations to our [structured view](../concepts/structured_views.md), we'll define new methods. These methods will allow the LLM to perform calculations and summarize data across multiple rows. Let's add three aggregation methods to our `CandidateView`:
+
+```python
+class CandidateView(SqlAlchemyBaseView):
+    """
+    A view for retrieving candidates from the database.
+    """
+
+    def get_select(self) -> sqlalchemy.Select:
+        """
+        Creates the initial SqlAlchemy select object, which will be used to build the query.
+        """
+        return sqlalchemy.select(Candidate)
+
+    @decorators.view_aggregation()
+    def average_years_of_experience(self) -> sqlalchemy.Select:
+        """
+        Calculates the average years of experience of candidates.
+        """
+        return self.select.with_only_columns(
+            sqlalchemy.func.avg(Candidate.years_of_experience).label("average_years_of_experience")
+        )
+
+    @decorators.view_aggregation()
+    def positions_per_country(self) -> sqlalchemy.Select:
+        """
+        Returns the number of candidates per position per country.
+        """
+        return (
+            self.select.with_only_columns(
+                sqlalchemy.func.count(Candidate.position).label("number_of_positions"),
+                Candidate.position,
+                Candidate.country,
+            )
+            .group_by(Candidate.position, Candidate.country)
+            .order_by(sqlalchemy.desc("number_of_positions"))
+        )
+
+    @decorators.view_aggregation()
+    def candidates_per_country(self) -> sqlalchemy.Select:
+        """
+        Returns the number of candidates per country.
+        """
+        return (
+            self.select.with_only_columns(
+                sqlalchemy.func.count(Candidate.id).label("number_of_candidates"),
+                Candidate.country,
+            )
+            .group_by(Candidate.country)
+        )
+```
+
+By setting up these aggregations, you enable the LLM to calculate metrics about the average years of experience, the number of candidates per position per country, and the number of candidates per country.
+
+## Query Execution
+
+Having already defined and registered the view with the collection, we can now execute the query:
+
+```python
+result = await collection.ask("What is the average years of experience of candidates?")
+print(result.results)
+```
+
+This will return the average years of experience of candidates.
+
+<details>
+  <summary>The expected output</summary>
+```
+The generated SQL query is: SELECT avg(candidates.years_of_experience) AS average_years_of_experience
+FROM candidates
+
+Number of rows: 1
+{'average_years_of_experience': 4.98}
+```
+</details>
+
+Feel free to try other questions like: "What's the distribution of candidates across different positions and countries?" or "How many candidates are from China?".
+
+## Full Example
+
+Access the full example on [GitHub](https://github.com/deepsense-ai/db-ally/blob/main/examples/aggregations.py){:target="_blank"}.
+
+## Next Steps
+
+Explore [Quickstart Part 3: Semantic Similarity](./semantic-similarity.md) to expand on the example and learn about using semantic similarity.
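The grouping performed by `candidates_per_country` in `docs/quickstart/aggregations.md` can be tried without db-ally or an LLM. The sketch below uses the standard-library `sqlite3` module with a hypothetical three-row table. The SQL text is an assumption modeled on the generated query shown in the guide's expected output, not the exact statement SQLAlchemy renders, and the `ORDER BY` clause is added here only to make the row order deterministic.

```python
import sqlite3

# Hypothetical in-memory table; only the columns needed for this aggregation.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE candidates (id INTEGER PRIMARY KEY, country TEXT)")
conn.executemany(
    "INSERT INTO candidates (country) VALUES (?)",
    [("China",), ("China",), ("Poland",)],
)

# Count candidate ids per country, mirroring the count + group_by in the view.
# ORDER BY is an addition for this sketch so the output order is stable.
rows = conn.execute(
    "SELECT count(candidates.id) AS number_of_candidates, candidates.country "
    "FROM candidates "
    "GROUP BY candidates.country "
    "ORDER BY candidates.country"
).fetchall()

for number_of_candidates, country in rows:
    print(country, number_of_candidates)
```

Each result row pairs a count with its country, which is the shape a question such as "How many candidates are from China?" would draw on once a country filter is also applied.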
diff --git a/docs/quickstart/index.md b/docs/quickstart/index.md
@@ -52,7 +52,7 @@ Candidate = Base.classes.candidates
 
 ## View Definition
 
-To use db-ally, define the views you want to use. A [structured view](../concepts/structured_views.md) is a class that specifies what to select from the database and includes methods that the AI model can use to filter rows. These methods are known as "filters".
+To use db-ally, define the views you want to use. A [structured view](../concepts/structured_views.md) is a class that specifies what to select from the database and includes methods that the AI model can use to filter rows. These methods are known as **filters**.
 
 ```python
 from dbally import decorators, SqlAlchemyBaseView
@@ -174,4 +174,4 @@ Access the full example on [GitHub](https://github.com/deepsense-ai/db-ally/blob
 
 ## Next Steps
 
-Explore [Quickstart Part 2: Semantic Similarity](./semantic-similarity.md) to expand on the example and learn about using semantic similarity.
+Explore [Quickstart Part 2: Aggregations](./aggregations.md) to expand on the example and learn about using aggregations.
diff --git a/docs/quickstart/multiple-views.md b/docs/quickstart/multiple-views.md
@@ -1,6 +1,6 @@
 # Quickstart: Multiple Views
 
-This guide continues from [Semantic Similarity](./semantic-similarity.md) guide. It assumes that you have already set up the views and the collection. If not, please refer to the complete Part 2 code on [GitHub](https://github.com/deepsense-ai/db-ally/blob/main/examples/semantic_similarity.py){:target="_blank"}.
+This guide continues from the [Semantic Similarity](./semantic-similarity.md) guide. It assumes that you have already set up the views and the collection. If not, please refer to the complete Part 3 code on [GitHub](https://github.com/deepsense-ai/db-ally/blob/main/examples/semantic_similarity.py){:target="_blank"}.
 
 The guide illustrates how to use multiple views to handle queries requiring different types of data. `CandidateView` and `JobView` are used as examples.
 
@@ -28,6 +28,7 @@ jobs_data = pd.DataFrame.from_records([
     {"title": "Machine Learning Engineer", "company": "Company C", "location": "Berlin", "salary": 90000},
     {"title": "Data Scientist", "company": "Company D", "location": "London", "salary": 110000},
     {"title": "Data Scientist", "company": "Company E", "location": "Warsaw", "salary": 80000},
+    {"title": "Data Scientist", "company": "Company F", "location": "Warsaw", "salary": 100000},
 ])
 ```
 

diff --git a/docs/quickstart/semantic-similarity.md b/docs/quickstart/semantic-similarity.md
@@ -1,6 +1,6 @@
 # Quickstart: Semantic Similarity
 
-This guide is a continuation of the [Intro](./index.md) guide. It assumes that you have already set up the views and the collection. If not, please refer to the complete Part 1 code on [GitHub](https://github.com/deepsense-ai/db-ally/blob/main/examples/intro.py){:target="_blank"}.
+This guide is a continuation of the [Aggregations](./aggregations.md) guide. It assumes that you have already set up the views and the collection. If not, please refer to the complete Part 2 code on [GitHub](https://github.com/deepsense-ai/db-ally/blob/main/examples/aggregations.py){:target="_blank"}.
 
 This guide will demonstrate how to use semantic similarity to handle queries in which the filter values are similar to those in the database, without requiring an exact match. We will use filtering by country as an example.
 
@@ -150,4 +150,4 @@ To see the full example, you can find the code on [GitHub](https://github.com/de
 
 ## Next Steps
 
-Explore [Quickstart Part 3: Multiple Views](./multiple-views.md) to learn how to run queries with multiple views and display the results based on the view that was used to fetch the data.
+Explore [Quickstart Part 4: Multiple Views](./multiple-views.md) to learn how to run queries with multiple views and display the results.