text to sql after additional data #15930
The most efficient way to handle new or updated data in a text-to-SQL pipeline, specifically for keeping the embeddings and the index in sync when new rows arrive, is to use a function that indexes all tables and builds (or reloads) a vector store index for each table. This ensures that the embeddings and the index stay up to date with the data. Here is a relevant code snippet:

```python
import os
from pathlib import Path
from typing import Dict

from sqlalchemy import text

from llama_index.core import (
    SQLDatabase,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)
from llama_index.core.schema import TextNode


def index_all_tables(
    sql_database: SQLDatabase, table_index_dir: str = "table_index_dir"
) -> Dict[str, VectorStoreIndex]:
    """Index all tables."""
    if not Path(table_index_dir).exists():
        os.makedirs(table_index_dir)

    vector_index_dict = {}
    engine = sql_database.engine
    for table_name in sql_database.get_usable_table_names():
        print(f"Indexing rows in table: {table_name}")
        if not os.path.exists(f"{table_index_dir}/{table_name}"):
            # get all rows from the table
            with engine.connect() as conn:
                cursor = conn.execute(text(f'SELECT * FROM "{table_name}"'))
                result = cursor.fetchall()
                row_tups = [tuple(row) for row in result]

            # turn each row into a node and put the nodes into a
            # vector store index (uses OpenAI embeddings by default,
            # unless configured otherwise)
            nodes = [TextNode(text=str(t)) for t in row_tups]
            index = VectorStoreIndex(nodes)

            # save the index to disk
            index.set_index_id("vector_index")
            index.storage_context.persist(f"{table_index_dir}/{table_name}")
        else:
            # rebuild the storage context and load the persisted index
            storage_context = StorageContext.from_defaults(
                persist_dir=f"{table_index_dir}/{table_name}"
            )
            index = load_index_from_storage(
                storage_context, index_id="vector_index"
            )
        vector_index_dict[table_name] = index

    return vector_index_dict


vector_index_dict = index_all_tables(sql_database)
```

This function builds one `VectorStoreIndex` per table: on the first run it reads every row, wraps each row tuple in a `TextNode`, embeds the nodes, and persists the index under `table_index_dir/<table_name>`; on later runs it loads the persisted index instead of re-embedding. Note that, as written, a table that already has a persisted index is loaded as-is, so picking up new or changed rows requires inserting new nodes or rebuilding that table's index (see the sketch below).
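For the incremental case the question asks about, you do not have to re-embed everything. Here is a minimal sketch, assuming the `vector_index_dict` returned above and a hypothetical `new_rows` iterable of row tuples that you have already identified on the database side (e.g. by primary key or an `updated_at` column): `insert_nodes` embeds and adds only the new nodes, and re-persisting saves them; for rows that changed in place, the simplest option with this on-disk layout is to delete that table's persist directory and let `index_all_tables` rebuild it.

```python
import shutil

from llama_index.core.schema import TextNode


def upsert_new_rows(
    table_name: str,
    new_rows,  # hypothetical: row tuples you have identified as new
    vector_index_dict,
    table_index_dir: str = "table_index_dir",
) -> None:
    """Embed and insert only the new rows into an existing table index."""
    index = vector_index_dict[table_name]
    nodes = [TextNode(text=str(tuple(row))) for row in new_rows]
    index.insert_nodes(nodes)  # embeds and adds just these nodes
    # re-persist so the on-disk copy includes the new rows
    index.storage_context.persist(f"{table_index_dir}/{table_name}")


def refresh_table(
    table_name: str, table_index_dir: str = "table_index_dir"
) -> None:
    """Drop a table's persisted index so the next index_all_tables()
    call re-reads and re-embeds that table from scratch -- the simple
    way to pick up updated or deleted rows with this layout."""
    shutil.rmtree(f"{table_index_dir}/{table_name}", ignore_errors=True)
```

With this split, appends stay cheap via `upsert_new_rows`, while in-place updates go through `refresh_table` followed by another `index_all_tables(sql_database)` call. `insert_nodes` cannot detect changed rows on its own, so updates need the rebuild path unless you also track node IDs and delete stale nodes yourself.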
I have a text-to-SQL pipeline and I am wondering how I can address cases when additional or updated data comes in, mainly from the viewpoint of updating the embeddings and the index. What is the most efficient way of handling new or updated data?