Workflow for Pandas query pipeline in Llama-index 0.11 #15615

mraguth · 2024-08-24T05:13:47Z

I am thrilled to explore LlamaIndex 0.11.
My first objective is to move my pandas query pipeline to a workflow, since query pipelines will not be supported in newer versions. Is there any sample workflow code that implements the current pandas query pipeline, like the example below? Thank you.

https://docs.llamaindex.ai/en/stable/examples/pipeline/query_pipeline_pandas/

@dosu · 2024-08-24T05:16:23Z

Here is the code for the current pandas query pipeline from the LlamaIndex documentation. Note that it uses QueryPipeline, not the Workflow API:

%pip install llama-index-llms-openai llama-index-experimental

from llama_index.core.query_pipeline import (
    QueryPipeline as QP,
    Link,
    InputComponent,
)
from llama_index.experimental.query_engine.pandas import (
    PandasInstructionParser,
)
from llama_index.llms.openai import OpenAI
from llama_index.core import PromptTemplate

Download Data

Here we load the Titanic CSV dataset.

!wget 'https://raw.githubusercontent.com/jerryjliu/llama_index/main/docs/docs/examples/data/csv/titanic_train.csv' -O 'titanic_train.csv'

import pandas as pd

df = pd.read_csv("./titanic_train.csv")

Define Modules

instruction_str = (
    "1. Convert the query to executable Python code using Pandas.\n"
    "2. The final line of code should be a Python expression that can be called with the `eval()` function.\n"
    "3. The code should represent a solution to the query.\n"
    "4. PRINT ONLY THE EXPRESSION.\n"
    "5. Do not quote the expression.\n"
)

pandas_prompt_str = (
    "You are working with a pandas dataframe in Python.\n"
    "The name of the dataframe is `df`.\n"
    "This is the result of `print(df.head())`:\n"
    "{df_str}\n\n"
    "Follow these instructions:\n"
    "{instruction_str}\n"
    "Query: {query_str}\n\n"
    "Expression:"
)
response_synthesis_prompt_str = (
    "Given an input question, synthesize a response from the query results.\n"
    "Query: {query_str}\n\n"
    "Pandas Instructions (optional):\n{pandas_instructions}\n\n"
    "Pandas Output: {pandas_output}\n\n"
    "Response: "
)

pandas_prompt = PromptTemplate(pandas_prompt_str).partial_format(
    instruction_str=instruction_str, df_str=df.head(5)
)
pandas_output_parser = PandasInstructionParser(df)
response_synthesis_prompt = PromptTemplate(response_synthesis_prompt_str)
llm = OpenAI(model="gpt-3.5-turbo")

Build Query Pipeline

qp = QP(
    modules={
        "input": InputComponent(),
        "pandas_prompt": pandas_prompt,
        "llm1": llm,
        "pandas_output_parser": pandas_output_parser,
        "response_synthesis_prompt": response_synthesis_prompt,
        "llm2": llm,
    },
    verbose=True,
)
qp.add_chain(["input", "pandas_prompt", "llm1", "pandas_output_parser"])
qp.add_links(
    [
        Link("input", "response_synthesis_prompt", dest_key="query_str"),
        Link(
            "llm1", "response_synthesis_prompt", dest_key="pandas_instructions"
        ),
        Link(
            "pandas_output_parser",
            "response_synthesis_prompt",
            dest_key="pandas_output",
        ),
    ]
)
# add link from response synthesis prompt to llm2
qp.add_link("response_synthesis_prompt", "llm2")

Run Query

response = qp.run(
    query_str="What is the correlation between survival and age?",
)

print(response.message.content)

Additionally, you can refer to the test_pandas_query_engine function in the LlamaIndex repository for more examples on how to use the PandasQueryEngine:

def test_pandas_query_engine(mock_service_context: ServiceContext) -> None:
    """Test pandas query engine."""
    df = pd.DataFrame(
        {
            "city": ["Toronto", "Tokyo", "Berlin"],
            "population": [2930000, 13960000, 3645000],
            "description": [
                """Toronto, Canada's largest city, is a vibrant and diverse metropolis situated in the province of Ontario.
Known for its iconic skyline featuring the CN Tower, Toronto is a cultural melting pot with a rich blend of communities, languages, and cuisines.
It boasts a thriving arts scene, world-class museums, and a strong economic hub.
Visitors can explore historic neighborhoods, such as Kensington Market and Distillery District, or enjoy beautiful natural surroundings on Toronto Islands.
With its welcoming atmosphere, top-notch education, and multicultural charm, Toronto is a global destination for both tourists and professionals alike.""",
                "A city",
                "Another City",
            ],
        }
    )
    query_engine = PandasQueryEngine(
        df, service_context=mock_service_context, verbose=True
    )
    response = query_engine.query(QueryBundle("population"))
    assert str(response) == str(df["population"])
    metadata = cast(Dict[str, Any], response.metadata)
    assert metadata["pandas_instruction_str"] == ('df["population"]')

This example demonstrates how to set up and test the PandasQueryEngine with a sample DataFrame ^[1]^[2].


4 replies

mraguth Aug 24, 2024
Author

@dosu It seems the pandas query pipeline code is the same as in the previous version. My question is: how do I convert this pandas query pipeline to a workflow? Please help.

dosubot[bot] bot Aug 24, 2024

To convert the pandas query pipeline to a workflow in LlamaIndex 0.11, you can start from the example in the documentation. Here is the complete code to build and run the query pipeline:

# Install necessary packages
%pip install llama-index-llms-openai llama-index-experimental

# Import necessary modules
from llama_index.core.query_pipeline import (
    QueryPipeline as QP,
    Link,
    InputComponent,
)
from llama_index.experimental.query_engine.pandas import (
    PandasInstructionParser,
)
from llama_index.llms.openai import OpenAI
from llama_index.core import PromptTemplate

# Download the Titanic dataset
!wget 'https://raw.githubusercontent.com/jerryjliu/llama_index/main/docs/docs/examples/data/csv/titanic_train.csv' -O 'titanic_train.csv'

# Load the dataset into a pandas DataFrame
import pandas as pd
df = pd.read_csv("./titanic_train.csv")

# Define the instruction and prompt templates
instruction_str = (
    "1. Convert the query to executable Python code using Pandas.\n"
    "2. The final line of code should be a Python expression that can be called with the `eval()` function.\n"
    "3. The code should represent a solution to the query.\n"
    "4. PRINT ONLY THE EXPRESSION.\n"
    "5. Do not quote the expression.\n"
)

pandas_prompt_str = (
    "You are working with a pandas dataframe in Python.\n"
    "The name of the dataframe is `df`.\n"
    "This is the result of `print(df.head())`:\n"
    "{df_str}\n\n"
    "Follow these instructions:\n"
    "{instruction_str}\n"
    "Query: {query_str}\n\n"
    "Expression:"
)
response_synthesis_prompt_str = (
    "Given an input question, synthesize a response from the query results.\n"
    "Query: {query_str}\n\n"
    "Pandas Instructions (optional):\n{pandas_instructions}\n\n"
    "Pandas Output: {pandas_output}\n\n"
    "Response: "
)

# Create prompt templates
pandas_prompt = PromptTemplate(pandas_prompt_str).partial_format(
    instruction_str=instruction_str, df_str=df.head(5)
)
pandas_output_parser = PandasInstructionParser(df)
response_synthesis_prompt = PromptTemplate(response_synthesis_prompt_str)
llm = OpenAI(model="gpt-3.5-turbo")

# Build the query pipeline
qp = QP(
    modules={
        "input": InputComponent(),
        "pandas_prompt": pandas_prompt,
        "llm1": llm,
        "pandas_output_parser": pandas_output_parser,
        "response_synthesis_prompt": response_synthesis_prompt,
        "llm2": llm,
    },
    verbose=True,
)
qp.add_chain(["input", "pandas_prompt", "llm1", "pandas_output_parser"])
qp.add_links(
    [
        Link("input", "response_synthesis_prompt", dest_key="query_str"),
        Link(
            "llm1", "response_synthesis_prompt", dest_key="pandas_instructions"
        ),
        Link(
            "pandas_output_parser",
            "response_synthesis_prompt",
            dest_key="pandas_output",
        ),
    ]
)
# Add link from response synthesis prompt to llm2
qp.add_link("response_synthesis_prompt", "llm2")

# Run the query
response = qp.run(
    query_str="What is the correlation between survival and age?",
)

# Print the response
print(response.message.content)

This code sets up a query pipeline that takes a user query, has an LLM infer the pandas operations needed to answer it, runs them against the DataFrame, and synthesizes a response from the result. LlamaIndex's declarative QueryPipeline API chains modules such as prompts, LLMs, and output parsers together, which keeps the orchestration readable and easier to integrate with low-code/no-code solutions ^[1]^[2]^[3].
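To make the wiring easier to follow, here is a minimal plain-Python sketch of the data flow that the chain and links above describe. It does not use LlamaIndex at all: `stub_llm`, `stub_parser`, and `run_pipeline` are hypothetical stand-ins, the prompts are shortened copies of the templates above, and the returned expression is hard-coded for illustration only.

```python
from typing import Callable, Dict

# Shortened copies of the two prompt templates from the pipeline above.
PANDAS_PROMPT = "Query: {query_str}\n\nExpression:"
RESPONSE_PROMPT = (
    "Query: {query_str}\n\n"
    "Pandas Instructions (optional):\n{pandas_instructions}\n\n"
    "Pandas Output: {pandas_output}\n\n"
    "Response: "
)


def stub_llm(prompt: str) -> str:
    """Hypothetical stand-in for the OpenAI LLM; returns canned text."""
    if prompt.endswith("Expression:"):
        # Illustrative expression only; the column names are assumed.
        return "df['survived'].corr(df['age'])"
    return "stub answer"


def stub_parser(expression: str) -> str:
    """Hypothetical stand-in for PandasInstructionParser (nothing is evaluated)."""
    return f"<result of {expression}>"


def run_pipeline(
    query_str: str,
    llm: Callable[[str], str],
    parser: Callable[[str], str],
) -> Dict[str, str]:
    # Chain: input -> pandas_prompt -> llm1 -> pandas_output_parser
    pandas_instructions = llm(PANDAS_PROMPT.format(query_str=query_str))
    pandas_output = parser(pandas_instructions)
    # Three links feed response_synthesis_prompt, one per dest_key.
    synthesis_prompt = RESPONSE_PROMPT.format(
        query_str=query_str,
        pandas_instructions=pandas_instructions,
        pandas_output=pandas_output,
    )
    # Final link: response_synthesis_prompt -> llm2
    return {"synthesis_prompt": synthesis_prompt, "response": llm(synthesis_prompt)}
```

Each `dest_key` in the `Link` definitions corresponds to one keyword argument of the synthesis prompt here, which is the part of the pipeline that has to be carried over explicitly when the same logic is moved elsewhere.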


mraguth Aug 24, 2024
Author

@dosu The question here is: can I add this pandas query pipeline to a workflow? If so, can you give me sample workflow code?

dosubot[bot] bot Aug 24, 2024

Yes, you can add the pandas query pipeline to a workflow in LlamaIndex 0.11. The sample below builds and runs the pipeline itself with QueryPipeline; it does not yet wrap it in a Workflow class:

%pip install llama-index llama-index-experimental

from llama_index.core.query_pipeline import (
    QueryPipeline as QP,
    Link,
    InputComponent,
)
from llama_index.experimental.query_engine.pandas import (
    PandasInstructionParser,
)
from llama_index.llms.openai import OpenAI
from llama_index.core import PromptTemplate
import pandas as pd

# Download Data
!wget 'https://raw.githubusercontent.com/jerryjliu/llama_index/main/docs/docs/examples/data/csv/titanic_train.csv' -O 'titanic_train.csv'

df = pd.read_csv("./titanic_train.csv")

# Define Modules
instruction_str = (
    "1. Convert the query to executable Python code using Pandas.\n"
    "2. The final line of code should be a Python expression that can be called with the `eval()` function.\n"
    "3. The code should represent a solution to the query.\n"
    "4. PRINT ONLY THE EXPRESSION.\n"
    "5. Do not quote the expression.\n"
)

pandas_prompt_str = (
    "You are working with a pandas dataframe in Python.\n"
    "The name of the dataframe is `df`.\n"
    "This is the result of `print(df.head())`:\n"
    "{df_str}\n\n"
    "Follow these instructions:\n"
    "{instruction_str}\n"
    "Query: {query_str}\n\n"
    "Expression:"
)
response_synthesis_prompt_str = (
    "Given an input question, synthesize a response from the query results.\n"
    "Query: {query_str}\n\n"
    "Pandas Instructions (optional):\n{pandas_instructions}\n\n"
    "Pandas Output: {pandas_output}\n\n"
    "Response: "
)

pandas_prompt = PromptTemplate(pandas_prompt_str).partial_format(
    instruction_str=instruction_str, df_str=df.head(5)
)
pandas_output_parser = PandasInstructionParser(df)
response_synthesis_prompt = PromptTemplate(response_synthesis_prompt_str)
llm = OpenAI(model="gpt-3.5-turbo")

# Build Query Pipeline
qp = QP(
    modules={
        "input": InputComponent(),
        "pandas_prompt": pandas_prompt,
        "llm1": llm,
        "pandas_output_parser": pandas_output_parser,
        "response_synthesis_prompt": response_synthesis_prompt,
        "llm2": llm,
    },
    verbose=True,
)
qp.add_chain(["input", "pandas_prompt", "llm1", "pandas_output_parser"])
qp.add_links(
    [
        Link("input", "response_synthesis_prompt", dest_key="query_str"),
        Link(
            "llm1", "response_synthesis_prompt", dest_key="pandas_instructions"
        ),
        Link(
            "pandas_output_parser",
            "response_synthesis_prompt",
            dest_key="pandas_output",
        ),
    ]
)
# add link from response synthesis prompt to llm2
qp.add_link("response_synthesis_prompt", "llm2")

# Run Query
response = qp.run(
    query_str="What is the correlation between survival and age?",
)

print(response.message.content)

This code uses the QueryPipeline abstraction in LlamaIndex to chain together a pandas instruction parser, LLMs, and prompt templates. The modules can be connected as a sequential chain or as a DAG, which makes complex orchestration over your data easier to read, easier to integrate with low-code/no-code solutions, and potentially easier to deploy and port ^[1]^[2]^[3]^[4]^[5].
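Since the question was about workflows, here is a rough, standard-library-only sketch of how the same three stages could be arranged in an event-driven style: each step consumes one event type and emits the next. This is not the LlamaIndex API. The event classes, `stub_llm`, and `run_workflow` are hypothetical names, and the LLM and the pandas evaluation are stubbed out. In LlamaIndex 0.11 the corresponding building blocks are, as far as I know, `Workflow`, `step`, `Event`, `StartEvent`, and `StopEvent` from `llama_index.core.workflow`, with steps matched to events by type annotation; treat the mapping as an assumption to verify against the docs.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, Dict


@dataclass
class QueryStart:  # plays the role of a start event
    query_str: str


@dataclass
class CodeGenerated:  # emitted after the first LLM call
    query_str: str
    pandas_instructions: str


@dataclass
class CodeExecuted:  # emitted after the expression has been run
    query_str: str
    pandas_instructions: str
    pandas_output: str


@dataclass
class Done:  # plays the role of a stop event
    result: str


async def stub_llm(prompt: str) -> str:
    """Hypothetical stand-in for an async LLM call; returns canned text."""
    if prompt.endswith("Expression:"):
        return "df['survived'].corr(df['age'])"  # illustrative only
    return "stub answer"


async def generate_code(ev: QueryStart) -> CodeGenerated:
    expression = await stub_llm(f"Query: {ev.query_str}\n\nExpression:")
    return CodeGenerated(ev.query_str, expression)


async def execute_code(ev: CodeGenerated) -> CodeExecuted:
    # A real step would hand the expression to PandasInstructionParser here.
    output = f"<result of {ev.pandas_instructions}>"
    return CodeExecuted(ev.query_str, ev.pandas_instructions, output)


async def synthesize(ev: CodeExecuted) -> Done:
    prompt = (
        f"Query: {ev.query_str}\n\n"
        f"Pandas Instructions (optional):\n{ev.pandas_instructions}\n\n"
        f"Pandas Output: {ev.pandas_output}\n\n"
        "Response: "
    )
    return Done(await stub_llm(prompt))


# Dispatch table: the type of the incoming event selects the step to run.
STEPS: Dict[type, Callable[[Any], Awaitable[Any]]] = {
    QueryStart: generate_code,
    CodeGenerated: execute_code,
    CodeExecuted: synthesize,
}


async def run_workflow(query_str: str) -> str:
    ev: Any = QueryStart(query_str)
    while not isinstance(ev, Done):
        ev = await STEPS[type(ev)](ev)
    return ev.result


if __name__ == "__main__":
    print(asyncio.run(run_workflow("What is the correlation between survival and age?")))
```

The point of the sketch is the shape: the values that the pipeline passed through `Link(..., dest_key=...)` become fields on the events, so each step receives everything it needs from the event it handles.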
