text to sql after additional data #15930
The most efficient way to handle new or updated data in a text-to-SQL pipeline, specifically for keeping the embeddings and the index in sync when new rows arrive, is to use a function that indexes all tables and builds (or reloads) a vector store index for each table. This ensures that the embeddings and the index stay up to date with the data. Here is a relevant code snippet:

```python
import os
from pathlib import Path
from typing import Dict

from sqlalchemy import text

from llama_index.core import (
    SQLDatabase,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)
from llama_index.core.schema import TextNode


def index_all_tables(
    sql_database: SQLDatabase, table_index_dir: str = "table_index_dir"
) -> Dict[str, VectorStoreIndex]:
    """Index all tables."""
    if not Path(table_index_dir).exists():
        os.makedirs(table_index_dir)

    vector_index_dict = {}
    engine = sql_database.engine
    for table_name in sql_database.get_usable_table_names():
        print(f"Indexing rows in table: {table_name}")
        if not os.path.exists(f"{table_index_dir}/{table_name}"):
            # get all rows from the table
            with engine.connect() as conn:
                cursor = conn.execute(text(f'SELECT * FROM "{table_name}"'))
                result = cursor.fetchall()
                row_tups = [tuple(row) for row in result]

            # turn each row into a node and put the nodes into a
            # vector store index (uses OpenAI embeddings by default,
            # unless configured otherwise)
            nodes = [TextNode(text=str(t)) for t in row_tups]
            index = VectorStoreIndex(nodes)

            # save the index to disk
            index.set_index_id("vector_index")
            index.storage_context.persist(f"{table_index_dir}/{table_name}")
        else:
            # rebuild the storage context and load the persisted index
            storage_context = StorageContext.from_defaults(
                persist_dir=f"{table_index_dir}/{table_name}"
            )
            index = load_index_from_storage(
                storage_context, index_id="vector_index"
            )
        vector_index_dict[table_name] = index

    return vector_index_dict


vector_index_dict = index_all_tables(sql_database)
```

This function builds one `VectorStoreIndex` per table: on the first run it reads every row, wraps each row tuple in a `TextNode`, embeds the nodes, and persists the index under `table_index_dir/<table_name>`; on later runs it loads the persisted index instead of re-embedding. Note that, as written, a table that already has a persisted index is loaded as-is, so picking up new or changed rows requires inserting new nodes or rebuilding that table's index (see the sketch below).
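For the incremental case the question asks about, you do not have to re-embed everything. Here is a minimal sketch, assuming the `vector_index_dict` returned above and a hypothetical `new_rows` iterable of row tuples that you have already identified on the database side (e.g. by primary key or an `updated_at` column): `insert_nodes` embeds and adds only the new nodes, and re-persisting saves them; for rows that changed in place, the simplest option with this on-disk layout is to delete that table's persist directory and let `index_all_tables` rebuild it.

```python
import shutil

from llama_index.core.schema import TextNode


def upsert_new_rows(
    table_name: str,
    new_rows,  # hypothetical: row tuples you have identified as new
    vector_index_dict,
    table_index_dir: str = "table_index_dir",
) -> None:
    """Embed and insert only the new rows into an existing table index."""
    index = vector_index_dict[table_name]
    nodes = [TextNode(text=str(tuple(row))) for row in new_rows]
    index.insert_nodes(nodes)  # embeds and adds just these nodes
    # re-persist so the on-disk copy includes the new rows
    index.storage_context.persist(f"{table_index_dir}/{table_name}")


def refresh_table(
    table_name: str, table_index_dir: str = "table_index_dir"
) -> None:
    """Drop a table's persisted index so the next index_all_tables()
    call re-reads and re-embeds that table from scratch -- the simple
    way to pick up updated or deleted rows with this layout."""
    shutil.rmtree(f"{table_index_dir}/{table_name}", ignore_errors=True)
```

With this split, appends stay cheap via `upsert_new_rows`, while in-place updates go through `refresh_table` followed by another `index_all_tables(sql_database)` call. `insert_nodes` cannot detect changed rows on its own, so updates need the rebuild path unless you also track node IDs and delete stale nodes yourself.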
I have a text-to-SQL pipeline and I am wondering how I can address cases when additional or updated data comes in, mainly from the viewpoint of updating the embeddings and the index. What is the most efficient way of handling new or updated data?