
[RFC] Support Python in OS Scripting Service #17432

Open
yuancu opened this issue Feb 24, 2025 · 1 comment

yuancu commented Feb 24, 2025

Introduction

This RFC proposes adding Python 3 as a supported language in the OpenSearch Scripting Service, in particular as an alternative to Painless script, which many users now consider 'painful'.

Python is widely recognized as a simple yet powerful language, especially within the data science community. By integrating Python, OpenSearch aims to broaden its appeal to users who rely on Python for data processing and analytical tasks.

Background & Motivation

OpenSearch currently supports several scripting languages, such as Painless, Mustache, and Expressions. While each has its merits, each also imposes a learning curve on users who are more comfortable in Python. Python’s ecosystem offers extensive data processing, machine learning, and analytical libraries. Enabling Python scripts within OpenSearch will reduce adoption barriers and empower a broader segment of the community to write custom logic for tasks such as scoring documents, executing specialized aggregations, and customizing ingestion pipelines.

Proposed Solution

Overview

The proposal is to implement a Python script plugin that integrates with the existing Scripting Service. This plugin will allow users to evaluate Python scripts at runtime under various contexts.

Below is a high-level flowchart illustrating how Python scripting will interact with the existing OpenSearch architecture:

flowchart LR
    A[Client] -- "`{#quot;source#quot;: #quot;sum(doc['ratings'])#quot;,<br>#quot;lang#quot;: #quot;python#quot;}`" --> B("ScriptService<br>(Coordinating Node)")
    B -- Dispatch --> D(PythonScriptPlugin)
    D -- "Compile & cache" --> H(Shard Execution)
    H --> I("Aggregation<br>(Coordinating Node)")
    I -- Return results --> A
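
For orientation, the plugin would presumably hook into the same extension points that Painless and Mustache already use. Below is a minimal, hypothetical sketch of such an engine, assuming OpenSearch's existing ScriptPlugin/ScriptEngine interfaces; the class names are illustrative only, and the GraalVM-backed compilation step (discussed below) is elided.

import org.opensearch.common.settings.Settings;
import org.opensearch.plugins.Plugin;
import org.opensearch.plugins.ScriptPlugin;
import org.opensearch.script.ScoreScript;
import org.opensearch.script.ScriptContext;
import org.opensearch.script.ScriptEngine;

import java.util.Collection;
import java.util.Map;
import java.util.Set;

public class PythonScriptPlugin extends Plugin implements ScriptPlugin {

    @Override
    public ScriptEngine getScriptEngine(Settings settings, Collection<ScriptContext<?>> contexts) {
        return new PythonScriptEngine();
    }

    private static class PythonScriptEngine implements ScriptEngine {

        @Override
        public String getType() {
            // Matched against the "lang" field of incoming script requests.
            return "python";
        }

        @Override
        public <FactoryType> FactoryType compile(String name, String code,
                ScriptContext<FactoryType> context, Map<String, String> params) {
            // Compile the Python source once, cache it, and return a
            // context-specific factory (e.g. ScoreScript.Factory); elided here.
            throw new UnsupportedOperationException("compilation sketch elided");
        }

        @Override
        public Set<ScriptContext<?>> getSupportedContexts() {
            return Set.of(ScoreScript.CONTEXT);
        }
    }
}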

Implementation Approaches

We have identified two primary implementation strategies:

  1. Parsing and Translating Python Code

    The Python code could be parsed into an intermediate form (e.g. Calcite’s logical plan) that OpenSearch can convert into its native execution plan.

    • Pros: The translation into a native representation is fully under the maintainers’ control, which naturally brings a higher degree of predictability and security.
    • Cons: This approach involves a more complex and extensive development effort, and only a limited subset of Python’s capabilities can be supported.
  2. Direct Execution as Guest Language (GraalVM)

    Using GraalVM’s Polyglot API, Python code can run directly within the OpenSearch process: a standalone Python runtime is hosted inside the same JVM. This solution embeds Python as a guest language, enabling execution of custom Python scripts.

    • Pros: Python’s full power is unleashed: users can leverage Python’s standard and third-party packages, enabling more complex and more robust data processing. This approach is also more straightforward than implementing and maintaining a custom Python translator.
    • Cons: Running a Python runtime inside the JVM increases the attack surface and may introduce resource overhead. Strong sandboxing and resource usage limits are critical.

A PoC has demonstrated the feasibility of the second approach with GraalVM.
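
To make the core of the PoC concrete, the sketch below evaluates the scoring expression from the demo that follows using GraalVM's Polyglot API. It assumes the GraalPy runtime (e.g. the org.graalvm.polyglot:python artifact) is on the classpath, and it simplifies the per-document bindings to a plain Python dict.

import org.graalvm.polyglot.Context;
import org.graalvm.polyglot.Value;

public class PolyglotScoringDemo {
    public static void main(String[] args) {
        try (Context context = Context.create("python")) {
            // Stand-in for the per-document bindings the plugin would provide.
            context.eval("python", "doc = {'ratings': [4, 3, 5]}");
            // The same expression as the stored agg_ratings script, with factor = 2.0.
            Value score = context.eval("python",
                "sum(doc['ratings']) / len(doc['ratings']) * 2.0");
            System.out.println(score.asDouble()); // 8.0, as in the demo response below
        }
    }
}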


Demo

Demo1: Custom Scoring with Python

This demo shows how to calculate scores as the average of ratings using a Python script.

  1. Create an index called “books” and insert 3 books into it.

    POST /_bulk
    {"create": {"_index": "books", "_id": 1}
    {"name":"Beneath the Wheel", "ratings":[4,3,5]}
    {"create": {"_index": "books", "_id": 2}}
    {"name":"Faust", "ratings":[5,5,5]}
    {"create": {"_index": "books", "_id": 3}}
    {"name":"The Odyssey", "ratings":[2,1,5]}
    
  2. Store a Python script called agg_ratings.

    PUT /_scripts/agg_ratings
    {
      "script": {
          "lang": "python",
          "source": "sum(doc['ratings']) / len(doc['ratings']) * params['factor']"
      }
    }
    

    The script takes the average of book ratings and multiplies it by a factor, which is passed in from the query parameters.

  3. Execute the script under the score context. The score context runs a script as if the script were in a script_score function in a function_score query.

    POST /books/_search
    {
      "query": {
        "function_score": {
          "script_score": {
            "script": {
              "id": "agg_ratings",
              "params": {
                "factor": 2.0
              }
            }
          }
        }
      }
    }
    

    The params object specifies the factor as 2.0, which scales the average ratings to a 0–10 range. For example, “Faust” with ratings [5, 5, 5] scores (5 + 5 + 5) / 3 × 2.0 = 10.0.

    A sample response might look as follows:

    {
        "took": 330,
        "timed_out": false,
        "_shards": {...},
        "hits": {
            "total": {"value": 3, "relation": "eq"},
            "max_score": 10.0,
            "hits": [
                {
                    "_index": "books",
                    "_id": "2",
                    "_score": 10.0,
                    "_source": {
                        "name": "Faust",
                        "ratings": [5, 5, 5]
                    }
                },
                {
                    "_index": "books",
                    "_id": "1",
                    "_score": 8.0,
                    "_source": {
                        "name": "Beneath the Wheel",
                        "ratings": [4, 3, 5]
                    }
                },
                {
                    "_index": "books",
                    "_id": "3",
                    "_score": 5.3333335,
                    "_source": {
                        "name": "The Odyssey",
                        "ratings": [2, 1, 5]
                    }
                }
            ]
        }
    }
    

    Here, _score is the average of the document’s ratings multiplied by the specified factor of 2.0. This confirms that the Python script correctly evaluates the provided documents and parameters.

Demo2: Post-processing tensor output in neural search

Neural search applies language models to transform document text into vector embeddings for better semantic search. It supports using externally hosted models to embed documents; this tutorial explains the process in more detail. However, different language model vendors return tensors wrapped in different formats, so users have historically had to write Painless scripts to transform the data into a unified format that the document ingestion pipeline can recognize. In this demonstration, we use Python to process responses from the Bedrock Cohere embed-english model. The following steps follow the standard way to connect to externally hosted models and are adapted from this blueprint; only the post-processing part is altered to use a custom Python script. Irrelevant parts are omitted for brevity.

  1. Create a connector for Amazon Bedrock

    POST /_plugins/_ml/connectors/_create
    {
        "name": "Amazon Bedrock Connector: Cohere embed-english-v3",
        ...
        "parameters": {
            "region": "us-east-1",
            "service_name": "bedrock",
            "truncate": "END",
            "input_type": "search_document",
            "model": "cohere.embed-english-v3"
        },
        "actions": [
            {
                "action_type": "predict",
                ...
                "url": "https://bedrock-runtime.${parameters.region}.amazonaws.com/model/${parameters.model}/invoke",
                "request_body": "{ \"texts\": ${parameters.texts}, \"truncate\": \"${parameters.truncate}\", \"input_type\": \"${parameters.input_type}\" }",
                "pre_process_function": "\n    StringBuilder builder = new StringBuilder();\n    builder.append(\"[\");\n    for (int i=0; i< params.text_docs.length; i++) {\n        builder.append(\"\\\"\");\n        builder.append(params.text_docs[i]);\n        builder.append(\"\\\"\");\n        if (i < params.text_docs.length - 1) {\n          builder.append(\",\")\n        }\n    }\n    builder.append(\"]\");\n    def parameters = \"{\" +\"\\\"prompt\\\":\" + builder + \"}\";\n    return  \"{\" +\"\\\"parameters\\\":\" + parameters + \"}\";",
                "pre_process_lang": "painless",
                "post_process_function": "import json\nNone if not doc['embeddings'] else json.dumps([{'name':'sentence_embedding','data_type':'FLOAT32','shape':[len(x)],'data':x} for x in doc['embeddings']])",
                "post_process_lang": "python"
            }
        ]
    }

    In the above example:

    • pre_process_function: Utilizes a Painless script to prepare the request payload for the model.
    • post_process_function: Uses a Python script to transform the returned embeddings into JSON objects that include metadata such as name, data_type, and shape.

    The Python script is shown below:

    import json
    None if not doc['embeddings'] else json.dumps([{'name':'sentence_embedding', 'data_type':'FLOAT32', 'shape':[len(x)], 'data':x} for x in doc['embeddings']])

    This script unpacks the returned list of tensors into JSON objects with their corresponding metadata, which downstream components in the ingestion pipeline can then consume; a sketch of how the plugin could evaluate this script appears after these steps.

    Note:

    • The ml-commons plugin has been modified to support the optional pre_process_lang and post_process_lang parameters for this proof-of-concept.
    • In the current release of ml-commons, built-in support for certain Cohere models is available via connector.pre_process.cohere.embedding and connector.post_process.cohere.embedding. This demonstration uses custom scripts for illustrative purposes and to verify correctness.
  2. Generate embeddings with custom post-processing

    POST /_plugins/_ml/models/<MODEL_ID>/_predict
    {
      "parameters": {
        "texts" : ["Hello world", "This is a test"]
      }
    }

    Here, <MODEL_ID> is the identifier of the external model returned by the previous steps. The response is as follows:

    {
        "inference_results": [
            {
                "output": [
                    {
                        "name": "sentence_embedding",
                        "data_type": "FLOAT32",
                        "shape": [1024],
                        "data": [-0.029205322, -0.02357483, ...]
                    },
                    {
                        "name": "sentence_embedding",
                        "data_type": "FLOAT32",
                        "shape": [1024],
                        "data": [-0.013885498, 0.009994507,...]
                    }
                ],
                "status_code": 200
            }
        ]
    }

    The embeddings here have been post-processed by the Python script to provide standardized metadata alongside the raw tensor data.
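
As referenced in step 1, here is a hypothetical sketch of how the connector could hand a model response to the stored Python post-process script through the same Polyglot API; the response body is faked inline, and the variable name doc mirrors the demo script.

import org.graalvm.polyglot.Context;
import org.graalvm.polyglot.Value;

public class PostProcessDemo {
    public static void main(String[] args) {
        // The post_process_function from the connector definition above.
        String postProcess = "import json\n"
            + "None if not doc['embeddings'] else json.dumps("
            + "[{'name': 'sentence_embedding', 'data_type': 'FLOAT32', "
            + "'shape': [len(x)], 'data': x} for x in doc['embeddings']])";
        try (Context context = Context.create("python")) {
            // Stand-in for the raw Bedrock/Cohere response; values are truncated.
            context.eval("python",
                "doc = {'embeddings': [[-0.0292, -0.0236], [-0.0139, 0.0099]]}");
            Value result = context.eval("python", postProcess);
            System.out.println(result.asString()); // JSON list of tensor objects
        }
    }
}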

Python packages

Built-in Python libraries are self-contained in GraalVM’s Polyglot Python runtime. Third-party Python packages can be configured with the GraalPy Gradle plugin by specifying package names and versions in build.gradle:

// An example of including a numpy dependency
graalPy {
  packages = ["numpy==1.26.4"]
  ...
}

GraalPy is compatible with common Python packages such as NumPy and Pandas. Please consult GraalPy package compatibility for the list of supported Python packages.
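
As a quick sanity check that a configured package is importable from an embedded script, something like the following could be used. This is hypothetical: depending on the GraalPy version, the context may need to be created through GraalPy's embedding helpers so that the packaged virtual filesystem is visible, and allowAllAccess is used here only for brevity.

import org.graalvm.polyglot.Context;
import org.graalvm.polyglot.Value;

public class NumpyCheck {
    public static void main(String[] args) {
        try (Context context = Context.newBuilder("python").allowAllAccess(true).build()) {
            context.eval("python", "import numpy as np");
            // Average of the demo ratings, computed by numpy this time.
            Value mean = context.eval("python", "float(np.mean([4, 3, 5]))");
            System.out.println(mean.asDouble()); // 4.0
        }
    }
}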

Security and Compatibility

Security (varies based on implementation)

  • Sandboxing: GraalVM offers security mechanisms like sandboxing and host access control out of the box. We will need to scrutinize them to ensure the extended capability aligns with OpenSearch’s security guidelines.

  • Malicious scripts: Multiple approaches have been discussed to mitigate the risks of malicious scripts; a sketch combining several of these controls follows this list.

    • Import restrictions: only whitelisted packages may be imported
    • No I/O access: access to files and the network is forbidden
    • Fine-grained syntactic / behavioral whitelist: only whitelisted syntax or behaviors are allowed
    • Please feel free to propose more measures to enhance security
  • Resource management: Python scripts should be subject to resource usage limits (e.g., CPU and memory) to ensure they do not disrupt cluster stability.
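
As a starting point, GraalVM's context options can already express several of the controls listed above. The sketch below combines host-access, I/O, native-access, and statement-count restrictions; the option names come from the GraalVM Polyglot API, but the exact configuration would need review against OpenSearch's guidelines (for instance, GraalPy may need additional filesystem configuration for its standard library).

import org.graalvm.polyglot.Context;
import org.graalvm.polyglot.HostAccess;
import org.graalvm.polyglot.PolyglotException;
import org.graalvm.polyglot.ResourceLimits;

public class SandboxDemo {
    public static void main(String[] args) {
        // Abort any script after a fixed number of executed statements.
        ResourceLimits limits = ResourceLimits.newBuilder()
            .statementLimit(1_000_000, null)
            .build();
        try (Context context = Context.newBuilder("python")
                .allowHostAccess(HostAccess.NONE) // no calls into Java host objects
                .allowIO(false)                   // no file or socket access
                .allowNativeAccess(false)
                .allowCreateThread(false)
                .resourceLimits(limits)
                .build()) {
            context.eval("python", "while True: pass"); // hits the statement limit
        } catch (PolyglotException e) {
            System.out.println("script aborted, cancelled=" + e.isCancelled());
        }
    }
}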

GraalVM Compatibility

GraalVM’s Polyglot API can run on various Java runtimes, including OpenJDK, GraalVM Community Edition, and Oracle JDK; this should cover most use cases. On runtimes other than GraalVM, execution can be further optimized by enabling the experimental VM options below:

-XX:+UnlockExperimentalVMOptions
-XX:+EnableJVMCI
@model-collapse

This is a fantastic proposal. We all agree that Python can definitely help OpenSearch break into lots of potential areas. Scripts are considered a lightweight interface for our users to do customizations, and Python, as a popular language, will turn around users' impression of Painless script being 'painful'.
