Commit
Add documentation for multi-schemas
jcjc712 committed Apr 23, 2024
1 parent 80c360e commit 4f3c4ea
Showing 8 changed files with 56 additions and 17 deletions.
25 changes: 22 additions & 3 deletions README.md
Original file line number Diff line number Diff line change
@@ -178,6 +178,24 @@ curl -X 'POST' \
}'
```

##### Connecting multi-schemas
You can connect multiple schemas through a single db connection if you want to create SQL joins between schemas.
Currently only `BigQuery`, `Snowflake`, `Databricks` and `Postgres` support this feature.
To use multi-schemas, omit the `schema` from the `connection_uri` and set it in the `schemas` param instead, like this:

```shell
curl -X 'POST' \
'<host>/api/v1/database-connections' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"alias": "my_db_alias",
"use_ssh": false,
"connection_uri": "snowflake://<user>:<password>@<organization>-<account-name>/<database>",
"schemas": ["schema_1", "schema_2", ...]
}'
```
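The same request can be built programmatically; below is a minimal sketch using only the Python standard library. The host is a placeholder, and the payload simply mirrors the curl example above — note that the schema is left out of the connection URI and listed in `schemas` instead:

```python
import json
from urllib import request

# Hypothetical host; replace with your deployment's URL.
HOST = "http://localhost"

# The schema is omitted from the connection URI and listed in "schemas" instead.
payload = {
    "alias": "my_db_alias",
    "use_ssh": False,
    "connection_uri": "snowflake://<user>:<password>@<organization>-<account-name>/<database>",
    "schemas": ["schema_1", "schema_2"],
}

req = request.Request(
    f"{HOST}/api/v1/database-connections",
    data=json.dumps(payload).encode(),
    headers={"accept": "application/json", "Content-Type": "application/json"},
    method="POST",
)
# response = request.urlopen(req)  # only works against a running server
```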

##### Connecting to supported Data warehouses and using SSH
You can find the details on how to connect to the supported data warehouses in the [docs](https://dataherald.readthedocs.io/en/latest/api.create_database_connection.html)

@@ -194,7 +212,8 @@ While only the Database scan part is required to start generating SQL, adding ve
#### Scanning the Database
The database scan is used to gather information about the database including table and column names and identifying low cardinality columns and their values to be stored in the context store and used in the prompts to the LLM.
In addition, it retrieves logs, which consist of historical queries associated with each database table. These records are then stored within the query_history collection. The historical queries retrieved encompass data from the past three months and are grouped based on query and user.
db_connection_id is the id of the database connection you want to scan, which is returned when you create a database connection.
The `db_connection_id` param is the id of the database connection you want to scan, which is returned when you create a database connection.
The `ids` param lists the `table_description_id` values of the tables you want to scan.
You can trigger a scan of a database from the `POST /api/v1/table-descriptions/sync-schemas` endpoint. Example below


@@ -205,11 +224,11 @@ curl -X 'POST' \
-H 'Content-Type: application/json' \
-d '{
"db_connection_id": "db_connection_id",
"table_names": ["table_name"]
"ids": ["<table_description_id_1>", "<table_description_id_2>", ...]
}'
```

Since the endpoint identifies low cardinality columns (and their values) it can take time to complete. Therefore while it is possible to trigger a scan on the entire DB by not specifying the `table_names`, we recommend against it for large databases.
Since the endpoint identifies low cardinality columns (and their values) it can take time to complete.
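For intuition, low-cardinality detection can be thought of as counting distinct values per column and keeping only the columns below some threshold. The sketch below is purely illustrative — the scanner's actual logic and thresholds may differ:

```python
# Illustrative sketch only; not the scanner's actual implementation.
def low_cardinality_columns(rows, threshold=10):
    """Return {column: sorted distinct values} for columns with few distinct values."""
    if not rows:
        return {}
    result = {}
    for col in rows[0].keys():
        distinct = {row[col] for row in rows}
        if len(distinct) <= threshold:
            result[col] = sorted(distinct)
    return result

sample = [
    {"status": "active", "user_id": 1},
    {"status": "inactive", "user_id": 2},
    {"status": "active", "user_id": 3},
]
print(low_cardinality_columns(sample, threshold=2))
# {'status': ['active', 'inactive']}
```

Columns that pass the threshold (enum-style fields like `status`) are the ones whose values are worth storing in the context store for LLM prompts; high-cardinality columns such as ids are skipped, which is also why scanning every column of a large database takes time.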

#### Get logs per db connection
Once a database has been scanned you can use this endpoint to retrieve the table logs.
10 changes: 1 addition & 9 deletions dataherald/tests/test_api.py
Original file line number Diff line number Diff line change
@@ -14,21 +14,13 @@ def test_heartbeat():
assert response.status_code == HTTP_200_CODE


def test_scan_all_tables():
response = client.post(
"/api/v1/table-descriptions/sync-schemas",
json={"db_connection_id": "64dfa0e103f5134086f7090c"},
)
assert response.status_code == HTTP_201_CODE


def test_scan_one_table():
try:
client.post(
"/api/v1/table-descriptions/sync-schemas",
json={
"db_connection_id": "64dfa0e103f5134086f7090c",
"table_names": ["foo"],
"ids": ["74dfa0e103f5134086f7090a"],
},
)
except ValueError as e:
25 changes: 24 additions & 1 deletion docs/api.create_database_connection.rst
Original file line number Diff line number Diff line change
@@ -26,6 +26,9 @@ Once the database connection is established, it retrieves the table names and cr
"alias": "string",
"use_ssh": false,
"connection_uri": "string",
"schemas": [
"string"
],
"path_to_credentials_file": "string",
"llm_api_key": "string",
"ssh_settings": {
@@ -189,7 +192,7 @@ Connections to supported Data warehouses
-----------------------------------------

The format of the ``connection_uri`` parameter in the API call will depend on the data warehouse type you are connecting to.
You can find samples and how to generate them :ref:<below >.
You can find samples and how to generate them below.

Postgres
^^^^^^^^^^^^
@@ -324,3 +327,23 @@ Example::
"connection_uri": bigquery://v2-real-estate/K2


**Connecting multi-schemas**

You can connect multiple schemas through a single db connection if you want to create SQL joins between schemas.
Currently only ``BigQuery``, ``Snowflake``, ``Databricks`` and ``Postgres`` support this feature.
To use multi-schemas, omit the ``schema`` from the ``connection_uri`` and set it in the ``schemas`` param instead, like this:

**Example**

.. code-block:: shell

   curl -X 'POST' \
     '<host>/api/v1/database-connections' \
     -H 'accept: application/json' \
     -H 'Content-Type: application/json' \
     -d '{
     "alias": "my_db_alias_identifier",
     "use_ssh": false,
     "connection_uri": "snowflake://<user>:<password>@<organization>-<account-name>/<database>",
     "schemas": ["foo", "bar"]
     }'
1 change: 1 addition & 0 deletions docs/api.get_table_description.rst
Original file line number Diff line number Diff line change
@@ -24,6 +24,7 @@ HTTP 200 code response
"table_schema": "string",
"status": "NOT_SCANNED | SYNCHRONIZING | DEPRECATED | SCANNED | FAILED",
"error_message": "string",
"columns": [
{
"name": "string",
2 changes: 2 additions & 0 deletions docs/api.list_database_connections.rst
Original file line number Diff line number Diff line change
@@ -21,6 +21,7 @@ HTTP 200 code response
"dialect": "databricks",
"use_ssh": false,
"connection_uri": "foooAABk91Q4wjoR2h07GR7_72BdQnxi8Rm6i_EjyS-mzz_o2c3RAWaEqnlUvkK5eGD5kUfE5xheyivl1Wfbk_EM7CgV4SvdLmOOt7FJV-3kG4zAbar=",
"schemas": null,
"path_to_credentials_file": null,
"llm_api_key": null,
"ssh_settings": null
@@ -31,6 +32,7 @@ HTTP 200 code response
"dialect": "postgres",
"use_ssh": true,
"connection_uri": null,
"schemas": null,
"path_to_credentials_file": "bar-LWxPdFcjQw9lU7CeK_2ELR3jGBq0G_uQ7E2rfPLk2RcFR4aDO9e2HmeAQtVpdvtrsQ_0zjsy9q7asdsadXExYJ0g==",
"llm_api_key": "gAAAAABlCz5TeU0ym4hW3bf9u21dz7B9tlnttOGLRDt8gq2ykkblNvpp70ZjT9FeFcoyMv-Csvp3GNQfw66eYvQBrcBEPsLokkLO2Jc2DD-Q8Aw6g_8UahdOTxJdT4izA6MsiQrf7GGmYBGZqbqsjTdNmcq661wF9Q==",
"ssh_settings": {
1 change: 1 addition & 0 deletions docs/api.list_table_description.rst
Original file line number Diff line number Diff line change
@@ -33,6 +33,7 @@ HTTP 200 code response
"table_schema": "string",
"status": "NOT_SCANNED | SYNCHRONIZING | DEPRECATED | SCANNED | FAILED",
"error_message": "string",
"columns": [
{
"name": "string",
1 change: 1 addition & 0 deletions docs/api.refresh_table_description.rst
Original file line number Diff line number Diff line change
@@ -34,6 +34,7 @@ HTTP 201 code response
"table_schema": "string",
"status": "NOT_SCANNED | SYNCHRONIZING | DEPRECATED | SCANNED | FAILED",
"error_message": "string",
"columns": [
{
"name": "string",
8 changes: 4 additions & 4 deletions docs/api.scan_table_description.rst
Original file line number Diff line number Diff line change
@@ -9,7 +9,7 @@ which consist of historical queries associated with each database table. These r
query_history collection. The historical queries retrieved encompass data from the past three months and are grouped
based on query and user.

It can scan all db tables or if you specify a `table_names` then It will only scan those tables.
The ``ids`` param specifies the table description ids of the tables you want to scan.

The process is carried out through Background Tasks, ensuring that even if it operates slowly, taking several minutes, the HTTP response remains swift.

@@ -23,7 +23,7 @@ Request this ``POST`` endpoint::
{
"db_connection_id": "string",
"table_names": ["string"] # Optional
"ids": ["string"]
}
**Responses**
@@ -36,7 +36,6 @@ HTTP 201 code response
**Request example**

To scan all the tables in a db don't specify a `table_names`

.. code-block:: rst
@@ -45,5 +44,6 @@
   '<host>/api/v1/table-descriptions/sync-schemas' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"db_connection_id": "db_connection_id"
"db_connection_id": "db_connection_id",
"ids": ["14e52c5f7d6dc4bc510d6d27", "15e52c5f7d6dc4bc510d6d34"]
}'
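The ids in the example above come from existing table descriptions. A minimal sketch of discovering them and building the scan body follows — the host is a placeholder, and the list endpoint path is inferred from the table-description docs in this repo, so verify it against your version:

```python
import json
from urllib import parse, request

HOST = "http://localhost"  # hypothetical host
DB_CONNECTION_ID = "db_connection_id"

# 1. List the table descriptions for this connection to discover their ids
#    (endpoint path assumed from the table-description docs).
list_url = (
    f"{HOST}/api/v1/table-descriptions?"
    + parse.urlencode({"db_connection_id": DB_CONNECTION_ID})
)
# descriptions = json.load(request.urlopen(list_url))  # against a live server

# 2. Build the scan request body from the ids you want to scan.
def scan_payload(db_connection_id, table_description_ids):
    return json.dumps(
        {"db_connection_id": db_connection_id, "ids": table_description_ids}
    )

body = scan_payload(DB_CONNECTION_ID, ["14e52c5f7d6dc4bc510d6d27"])
```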
