Introduction
This RFC proposes adding Python 3 as a supported language in the OpenSearch Scripting Service, in particular as a complement to Painless script, which many users now consider to be 'painful'.
Python is widely recognized as a simple yet powerful language, especially within the data science community. By integrating Python, OpenSearch aims to broaden its appeal to users who rely on Python for data processing and analytical tasks.
Background & Motivation
OpenSearch currently supports several scripting languages, such as Painless, Mustache, and Expressions. While each has merits, each also comes with a learning curve and may be unfamiliar to Python users. Python’s ecosystem offers extensive data processing, machine learning, and analytical libraries. Enabling Python scripts within OpenSearch will reduce adoption barriers and empower a broader segment of the community to write custom logic for tasks such as scoring documents, executing specialized aggregations, and customizing ingestion pipelines.
Proposed Solution
Overview
The proposal is to implement a Python script plugin that integrates with the existing Scripting Service. This plugin will allow users to evaluate Python scripts at runtime under various contexts.
Below is a high-level flowchart illustrating how Python scripting will interact with the existing OpenSearch architecture:
```mermaid
flowchart LR
    A[Client] -- "`{#quot;source#quot;: #quot;sum(doc['ratings'])#quot;,<br>#quot;lang#quot;: #quot;python#quot;}`" --> B("ScriptService<br>(Coordinating Node)")
    B -- Dispatch --> D(PythonScriptPlugin)
    D -- "Compile & cache" --> H(Shard Execution)
    H --> I("Aggregation<br>(Coordinating Node)")
    I -- Return results --> A
```
Implementation Approaches
We have identified two primary implementation strategies:
Parsing and Translating Python Code
The Python code could be parsed into an intermediate form (e.g. Calcite’s logical plan) that OpenSearch can convert into its native execution plan.
Pros: The translation into a native representation is fully under the maintainers' control, which brings a higher degree of predictability and enhanced security.
Cons: This approach involves a more complex and extensive development effort, and only a limited subset of Python's capabilities can be supported.
Direct Execution as Guest Language (GraalVM)
Using GraalVM’s Polyglot APIs, Python code can run directly within the OpenSearch process: a standalone Python runtime is hosted inside the same JVM. This solution embeds Python as a guest language, enabling execution of custom Python scripts.
Pros: Python’s full power is unleashed: users can leverage Python’s standard library and third-party packages, enabling more complex and more robust data processing. This approach is also more straightforward than implementing and maintaining a custom Python translator.
Cons: Running a Python runtime inside the JVM increases the attack surface and may introduce resource overhead. Strong sandboxing and resource usage limits are critical.
A PoC has demonstrated the feasibility of the second approach with GraalVM.
Demo
Demo 1: Custom Scoring with Python
This demo shows how to calculate document scores as the average of ratings using a Python script.
Create an index called “books” and insert 3 books into it.
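For reference, a minimal sketch of this step; the document IDs, titles, and exact rating values below are illustrative assumptions, not the PoC's actual data:

```
PUT /books

POST /books/_bulk
{ "index": { "_id": "1" } }
{ "title": "Book One", "ratings": [4, 5, 3] }
{ "index": { "_id": "2" } }
{ "title": "Book Two", "ratings": [5, 5, 4] }
{ "index": { "_id": "3" } }
{ "title": "Book Three", "ratings": [2, 3, 4] }
```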
Store a Python script called agg_ratings. The script takes the average of book ratings and multiplies it by a factor, which is passed in from the query parameters.
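A sketch of storing the script, assuming the proposed "python" lang value and a Painless-like doc/params binding for Python scripts (the exact script API surface is a PoC detail, not fixed by this RFC):

```
PUT /_scripts/agg_ratings
{
  "script": {
    "lang": "python",
    "source": "sum(doc['ratings']) / len(doc['ratings']) * params['factor']"
  }
}
```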
Execute the script under the score context. The score context runs a script as if the script were in a script_score function in a function_score query. The params object specifies the factor as 2.0, which will scale the average ratings to a 0–10 range.
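A sketch of such a query, reusing the stored agg_ratings script:

```
GET /books/_search
{
  "query": {
    "function_score": {
      "query": { "match_all": {} },
      "script_score": {
        "script": {
          "id": "agg_ratings",
          "params": { "factor": 2.0 }
        }
      }
    }
  }
}
```

A sample response might look as follows (scores computed from the illustrative ratings above, trimmed for brevity):

```
{
  "hits": {
    "hits": [
      { "_id": "2", "_score": 9.33, "_source": { "title": "Book Two", "ratings": [5, 5, 4] } },
      { "_id": "1", "_score": 8.0,  "_source": { "title": "Book One", "ratings": [4, 5, 3] } },
      { "_id": "3", "_score": 6.0,  "_source": { "title": "Book Three", "ratings": [2, 3, 4] } }
    ]
  }
}
```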
Here, _score is the average of the document’s ratings multiplied by the specified factor of 2.0. This confirms that the Python script correctly evaluates the provided documents and parameters.
Demo 2: Post-processing tensor output in neural search
Neural search applies language models to transform document text into vector embeddings for better semantic search performance. It supports using externally hosted models to embed documents; this tutorial explains the process in more detail. However, different language model vendors return tensors wrapped in different formats, so users have historically had to write Painless scripts to transform the data into the unified format recognized by the document ingestion pipeline. In this demonstration, we use Python to process responses from the Bedrock Cohere embed-english model. The following steps follow the standard way to connect to externally hosted models and are modified from this blueprint; we only alter the post-processing part to use a custom Python script. Irrelevant parts are omitted for brevity.
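Create a connector for Amazon Bedrock

A trimmed sketch of the connector creation request, in the style of the Bedrock Cohere blueprint; the region, model name, and request body shown here are assumptions borrowed from that blueprint, and the pre/post process function bodies are elided (the Python one is shown below):

```
POST /_plugins/_ml/connectors/_create
{
  "name": "Amazon Bedrock: Cohere embed-english",
  "protocol": "aws_sigv4",
  "parameters": { "region": "us-east-1", "service_name": "bedrock" },
  "actions": [
    {
      "action_type": "predict",
      "method": "POST",
      "url": "https://bedrock-runtime.us-east-1.amazonaws.com/model/cohere.embed-english-v3/invoke",
      "request_body": "{ \"texts\": ${parameters.texts}, \"input_type\": \"search_document\" }",
      "pre_process_function": "...",
      "post_process_lang": "python",
      "post_process_function": "..."
    }
  ]
}
```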
In the above example:
pre_process_function: utilizes a Painless script to prepare the request payload for the model.
post_process_function: uses a Python script to transform the returned embeddings into JSON objects that include metadata such as name, data_type, and shape. The Python script is shown below:
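A minimal sketch of such a script, assuming the connector passes the model's list of embedding vectors to the script and expects the standard ml-commons model-tensor layout back; the function name and binding are hypothetical PoC details:

```python
def post_process(embeddings):
    # Wrap each raw embedding vector in an object carrying the
    # metadata fields (name, data_type, shape) that the ingestion
    # pipeline expects alongside the tensor data.
    outputs = []
    for embedding in embeddings:
        outputs.append({
            "name": "sentence_embedding",
            "data_type": "FLOAT32",
            "shape": [len(embedding)],
            "data": list(embedding),
        })
    return outputs
```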
This script unpacks the returned list of tensors into JSON objects with their corresponding metadata, which can then be used by downstream components in the ingestion pipeline.
Note:
The ml-commons plugin has been modified to support the optional pre_process_lang and post_process_lang parameters for this proof-of-concept.
In the current release of ml-commons, built-in support for certain Cohere models is available via connector.pre_process.cohere.embedding and connector.post_process.cohere.embedding. This demonstration uses custom scripts for illustrative purposes and to verify correctness.
Generate embeddings with custom post-processing
```
POST /_plugins/_ml/models/<MODEL_ID>/_predict
{
  "parameters": {
    "texts": ["Hello world", "This is a test"]
  }
}
```
The <MODEL_ID> is the identifier of the external model registered in the previous steps. The response is as follows:
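A trimmed, hypothetical response (vector values abbreviated; the 1024-dimension shape is an assumption for the Cohere embed-english model):

```
{
  "inference_results": [
    {
      "output": [
        {
          "name": "sentence_embedding",
          "data_type": "FLOAT32",
          "shape": [1024],
          "data": [0.0132, -0.0247, ...]
        },
        {
          "name": "sentence_embedding",
          "data_type": "FLOAT32",
          "shape": [1024],
          "data": [0.0478, 0.0093, ...]
        }
      ]
    }
  ]
}
```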
The embeddings here have been post-processed by the Python script to provide standardized metadata alongside the raw tensor data.
Python packages
Built-in Python libraries are self-contained in GraalVM’s Polyglot Python runtime. Third-party Python packages can be configured with the GraalPy Gradle plugin by specifying package names and versions in build.gradle:

```
// An example of including the numpy dependency
graalPy {
    packages = ["numpy==1.26.4"]
    ...
}
```
GraalPy is compatible with common Python packages such as NumPy and Pandas. Please consult the GraalPy package compatibility list for the supported Python packages.
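For illustration, once numpy is bundled, a scoring script along the lines of Demo 1 could use it directly (again assuming the hypothetical doc/params bindings, with the script's last expression as its return value):

```python
import numpy as np

# Score a document by the mean of its ratings, scaled by a query-time factor.
ratings = np.array(doc['ratings'], dtype=np.float64)
float(np.mean(ratings) * params['factor'])
```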
Security and Compatibility
Security (varies based on implementation)
Sandboxing: GraalVM offers security mechanisms like sandboxing and host access control out of the box. We will need to scrutinize them to ensure the extended capability aligns with the security guidelines of OpenSearch.
Malicious scripts: Multiple approaches have been discussed to mitigate the risks of malicious scripts:
Import restrictions: only whitelisted packages may be imported
No I/O access: access to files and the network is forbidden
Fine-grained syntactical/behavioral whitelist: only whitelisted syntax or behaviors are allowed
Please feel free to propose more measures to enhance security.
Resource management: Python scripts should be subject to resource usage limits (e.g., CPU and memory) to ensure they do not disrupt cluster stability.
GraalVM Compatibility
GraalVM’s polyglot API is able to run on various Java runtimes, including OpenJDK, GraalVM Community Edition, and Oracle JDK, which should cover most use cases. On runtimes other than GraalVM, guest-language performance can be further improved by enabling the experimental VM options below:

```
-XX:+UnlockExperimentalVMOptions
-XX:+EnableJVMCI
```
This is a fantastic proposal. We all agree that Python can definitely help OpenSearch break into lots of potential areas. Scripts are considered a lightweight interface for our users to do customizations, and Python, as a popular language, can turn around users' impression of Painless script as 'painful'.