custom_func is a transform that allows the schema writer to alter, compose, transform and aggregate existing data from the input. Among all custom_func transforms, javascript is the most important one to understand and master.
A custom_func has 4 basic parts: xpath/xpath_dynamic, name, args, and type.
Like any other transform, custom_func uses the optional xpath/xpath_dynamic directive to move the current IDR tree cursor. See here for more details.
name is the name of the function to call.
args is a list of arguments to the function; each argument is itself a transform, evaluated recursively.
The optional type indicates that the result needs a type cast. Valid types are 'string', 'int', 'float', and 'boolean'. Not specifying type tells omniparser to keep the result of the custom_func in whatever type it already has.
- Fixed Argument List
  Look at the following transform example:
      "carrier": { "custom_func": { "name": "lower", "args": [ { "xpath": "./CARRIER_NAME" } ] } },
  This transform, in English, takes the value of the immediate child node CARRIER_NAME from the current IDR tree cursor position, and returns it in lower-case.
- Variable Argument List
  Look at the following transform example (adapted from here):
      "event_datetime": { "custom_func": {
          "name": "concat",
          "args": [
              { "xpath": "event_date" },
              { "const": "T" },
              { "xpath": "event_time" }
          ]
      }},
  This transform, in English, takes the value of a child node event_date, a constant string T and the value of a child node event_time, and returns them concatenated.
- Chaining/Composability
  Arguments of a custom_func transform can also be custom_func transforms, thus enabling chaining and composability. Look at the following example (adapted from here):
      "event_date_template": { "custom_func": {
          "name": "dateTimeToRFC3339",
          "args": [
              { "custom_func": {
                  "name": "concat",
                  "args": [
                      { "xpath": "event_date" },
                      { "const": "T" },
                      { "xpath": "event_time" }
                  ]
              }},
              { "xpath": "event_timezone", "_comment": "input timezone" },
              { "const": "", "_comment": "output timezone" }
          ]
      }}
  This transform, in English, concatenates child nodes to produce a full event datetime string and then uses dateTimeToRFC3339 to normalize the datetime string into RFC3339 standard format.
  There is no limit on how deep custom_func chaining can be.
- xpath/xpath_dynamic Anchoring
  The schema writer can also use xpath (or xpath_dynamic) to change the current IDR tree cursor to make data extraction in the arguments easier. Consider the same transform as above, but imagine this time that all the event date/time related fields are not at the current IDR cursor node, but rather in a child node data. Instead of writing each data extraction xpath in the arguments as "data/...", we can simply move the cursor to data by specifying xpath on the custom_func itself.
      "event_date_template": { "xpath": "data", "custom_func": {
          "name": "dateTimeToRFC3339",
          "args": [
              { "custom_func": {
                  "name": "concat",
                  "args": [
                      { "xpath": "event_date" },
                      { "const": "T" },
                      { "xpath": "event_time" }
                  ]
              }},
              { "xpath": "event_timezone", "_comment": "input timezone" },
              { "const": "", "_comment": "output timezone" }
          ]
      }}
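
The recursive argument evaluation and cursor anchoring described in the list above can be modeled in a few lines of JavaScript. This is only a sketch under simplifying assumptions, not omniparser's implementation: evalTransform, select and FUNCS are hypothetical names, the IDR node is stood in for by a plain object, xpath is reduced to a '/'-separated key lookup, and the sample record is made up.

```javascript
// Hypothetical stand-ins for two built-in functions; not omniparser's code.
const FUNCS = {
  lower: (s) => String(s).toLowerCase(),
  concat: (...parts) => parts.join(''),
};

// Simplified "xpath": follows '/'-separated keys and ignores '.' segments.
function select(node, path) {
  return path
    .split('/')
    .filter((seg) => seg !== '' && seg !== '.')
    .reduce((cur, seg) => (cur == null ? undefined : cur[seg]), node);
}

// Evaluates one transform against a node (the "cursor").
function evalTransform(t, node) {
  if ('const' in t) return t.const;
  if ('custom_func' in t) {
    // An xpath next to custom_func moves the cursor before arguments are evaluated.
    const cursor = 'xpath' in t ? select(node, t.xpath) : node;
    const fn = FUNCS[t.custom_func.name];
    if (!fn) throw new Error('unknown custom_func: ' + t.custom_func.name);
    // Each argument is itself a transform, so evaluation recurses.
    const args = t.custom_func.args.map((a) => evalTransform(a, cursor));
    return fn(...args);
  }
  if ('xpath' in t) return select(node, t.xpath);
  throw new Error('unsupported transform');
}

// Made-up sample record with the date/time fields under a child node "data".
const record = { data: { event_date: '2019-01-31', event_time: '12:34:56' } };
const eventDatetime = evalTransform({
  xpath: 'data',
  custom_func: { name: 'lower', args: [
    { custom_func: { name: 'concat', args: [
      { xpath: 'event_date' }, { const: 'T' }, { xpath: 'event_time' },
    ] } },
  ] },
}, record);
console.log(eventDatetime); // 2019-01-31t12:34:56
```

Because the inner concat sees the cursor already moved to data, its xpath arguments stay short, which is the point of anchoring.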
Omniparser has several basic custom_func like lower, upper, dateTimeToRFC3339, uuidv3, etc., among which the most important, flexible and powerful one is javascript (and its sibling javascript_with_context).
javascript is a custom_func transform that executes a JavaScript script with optional input arguments. Omniparser uses https://github.com/dop251/goja, a native Golang ECMAScript implementation, and is thus free of external C/C++ lib dependencies.
A simple example (adapted from here):
    "temp_in_f": { "custom_func": {
        "name": "javascript",
        "args": [
            { "const": "Math.floor((temp_c * 9 / 5 + 32) * 10) / 10" },
            { "const": "temp_c" }, { "xpath": ".", "type": "float" }
        ]
    }}
This transform takes the value of the current IDR node, assumed to be a temperature in Celsius, and converts it to Fahrenheit, rounded down to one decimal place.
The first argument is typically a const transform that contains the JavaScript code. The rest of the arguments always come in pairs: in each pair, the first argument specifies the name of an input argument, and the second specifies its value. Remember that chaining is allowed for advanced composability.
The result type is whatever type the script's return value has, unless the schema writer adds a type cast to the custom_func transform to force a type conversion.
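
To make the name/value pairing concrete, here is a minimal sketch in plain JavaScript. It does not reflect how omniparser drives goja internally; runScript is a hypothetical helper that binds each pair as a function parameter and assumes the script is a single expression, as in the temp_in_f example.

```javascript
// Hypothetical helper: args = [script, name1, value1, name2, value2, ...].
function runScript(args) {
  const [script, ...rest] = args;
  if (rest.length % 2 !== 0) {
    throw new Error('arguments after the script must come in name/value pairs');
  }
  const names = [];
  const values = [];
  for (let i = 0; i < rest.length; i += 2) {
    names.push(rest[i]);      // first of the pair: input argument name
    values.push(rest[i + 1]); // second of the pair: input argument value
  }
  // Assumption: the script is one expression, so it can be returned directly.
  const fn = new Function(...names, 'return (' + script + ');');
  return fn(...values);
}

const tempInF = runScript([
  'Math.floor((temp_c * 9 / 5 + 32) * 10) / 10',
  'temp_c', 25,
]);
console.log(tempInF); // 77
```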
If the script throws an exception, the javascript transform fails with an error. The transform also fails with an error if the result of the script is NaN, null, Infinity or undefined.
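
It is easy to produce one of these failing values by accident. The sketch below lists a few ordinary expressions that do so, together with a check that mirrors the rule. checkScriptResult is a hypothetical function written for illustration, and treating -Infinity like Infinity is an assumption of this sketch.

```javascript
// Hypothetical post-check mirroring the rule above; not omniparser's code.
function checkScriptResult(result) {
  if (result === null || result === undefined) {
    throw new Error('result is ' + String(result));
  }
  // Number.isFinite is false for NaN, Infinity and -Infinity.
  // Assumption: -Infinity is rejected just like Infinity.
  if (typeof result === 'number' && !Number.isFinite(result)) {
    throw new Error('result is ' + String(result));
  }
  return result;
}

// Ordinary expressions whose values fall in the failing set.
const failing = {
  'parseInt("abc", 10)': parseInt('abc', 10), // NaN
  '1 / 0': 1 / 0,                             // Infinity
  '({}).missing': ({}).missing,               // undefined
  '"a".match(/b/)': 'a'.match(/b/),           // null
};
for (const [expr, value] of Object.entries(failing)) {
  try {
    checkScriptResult(value);
  } catch (err) {
    console.log(expr + ' -> ' + err.message);
  }
}
```

A script can guard against these cases itself, for example by returning a fallback value when a lookup or a parse does not succeed.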
Another example (adapted from here):
    "uv_index": { "custom_func": {
        "name": "javascript",
        "args": [
            { "const": "uv.split('/').map(function(s){return s.trim();}).filter(function(s){return !!s;})" },
            { "const": "uv" }, { "xpath": "UV_INDEX" }
        ]
    }},
where the UV_INDEX column contains text like "12/4/6".
The script above splits the input by '/', trims away spaces, tosses out empty items and returns the rest as an array, so the result for "uv_index" in the output JSON would look like this:
    "uv_index": [
        "12",
        "4",
        "6"
    ],
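
The same script body can be tried outside omniparser. In this sketch it is wrapped in a function named splitUv (a name chosen here for illustration) so it can be called with different inputs, including one with stray spaces and an empty segment.

```javascript
// The script body from the uv_index example, wrapped in a function for illustration.
function splitUv(uv) {
  return uv
    .split('/')
    .map(function (s) { return s.trim(); })
    .filter(function (s) { return !!s; }); // drops empty strings
}

console.log(JSON.stringify(splitUv('12/4/6')));    // ["12","4","6"]
console.log(JSON.stringify(splitUv(' 12 / /6 '))); // ["12","6"]
```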
So far, the input arguments in the samples above have all been single values. An input argument can also be an array, which enables aggregation (from here):
    "sum_price_times_10": { "custom_func": {
        "name": "javascript",
        "args": [
            { "const": "t=0; for (i=0; i<prices.length; i++) { t+=prices[i]*10; } Math.floor(t*100)/100;" },
            { "const": "prices" }, { "array": [ { "xpath": "books/*/price", "type": "float" } ] }
        ]
    }},
Though contrived, this transform takes all the price values matched by the "books/*/price" XPath query, multiplies each by 10 (why oh why?! :)), sums them all up, and returns the sum rounded down to 2 decimal places.
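
For reference, the aggregation logic can be run as plain JavaScript. The sketch below wraps it in a function (sumPriceTimes10 is a name made up here) and declares t and i with var rather than relying on implicit globals as the one-line script does; the sample prices are made up.

```javascript
// Same arithmetic as the sum_price_times_10 script, wrapped in a function.
function sumPriceTimes10(prices) {
  var t = 0;
  for (var i = 0; i < prices.length; i++) {
    t += prices[i] * 10;
  }
  // Rounds down to 2 decimal places.
  return Math.floor(t * 100) / 100;
}

console.log(sumPriceTimes10([10.5, 20.25])); // 307.5
```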
Input arguments to the javascript function can be of simple primitive types (such as strings, numbers, etc.) but also objects or arrays, as illustrated above.
To provide ultimate freedom of parsing and transform, javascript has an even more powerful sibling function, javascript_with_context. javascript_with_context is very similar to javascript, except that omniparser automatically injects the current IDR node and its sub-tree, serialized as JSON, into the script under the global variable name _node, thus allowing the script to parse and transform the current IDR node tree as it sees fit. (You may ask: why not just have javascript auto-inject _node? Because converting an IDR node tree to JSON isn't exactly cheap, and in the vast majority of cases _node isn't needed, so javascript is perfectly sufficient.)
Consider the following example:
    "full_name": { "xpath": "./personal_info", "custom_func": {
        "name": "javascript_with_context",
        "args": [
            { "const": "var n = JSON.parse(_node); n['Last Name'] + ', ' + n['First Name']" }
        ]
    }}
assuming the current IDR context for this "full_name" transform is:
    Node(Type: ElementNode)
        Node(Type: ElementNode, Data: "First Name")
            Node(Type: TextNode, Data: "John")
        Node(Type: ElementNode, Data: "Last Name")
            Node(Type: TextNode, Data: "Doe")
        Node(Type: ElementNode, Data: "Age")
            Node(Type: TextNode, Data: "35")
When javascript_with_context is invoked, omniparser will convert the IDR tree above into a JSON object:
    {
        "First Name": "John",
        "Last Name": "Doe",
        "Age": "35"
    }
thus allowing the script to parse the JSON object and work with its fields.
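
The script side of this can be tried in isolation. The sketch below assumes that _node arrives as JSON text (which is why the script calls JSON.parse on it) and simulates the injection with JSON.stringify; fullName is a wrapper name made up here. Note that keys containing spaces need bracket notation, n['Last Name'], since dot notation cannot express them.

```javascript
// Assumption: _node holds the JSON text of the IDR sub-tree shown above.
const _node = JSON.stringify({
  'First Name': 'John',
  'Last Name': 'Doe',
  'Age': '35',
});

// The full_name script body, wrapped in a function for illustration.
function fullName(nodeJSON) {
  var n = JSON.parse(nodeJSON);
  return n['Last Name'] + ', ' + n['First Name'];
}

console.log(fullName(_node)); // Doe, John
```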
Theoretically, the entire FINAL_OUTPUT transform can be done with javascript_with_context. However, the cost of doing so, or of any similarly "large-scale" javascript_with_context, is twofold: 1) multiple round trips of serializing IDR into JSON and then parsing the JSON into a JavaScript object, and 2) it is simply hard to write that much JavaScript in one line, a current limitation of the schema being strictly JSON, which doesn't support multi-line string literals.
If any of the argument transforms returns an error, or the custom function itself fails, the error will be relayed out, unless ignore_error is specified.
Look at the following example (adapted from here):
    "event_date_template": { "custom_func": {
        "name": "dateTimeToRFC3339",
        "args": [
            { "custom_func": {
                "name": "concat",
                "args": [
                    { "xpath": "event_date" },
                    { "const": "T" },
                    { "xpath": "event_time" }
                ]
            }},
            { "xpath": "event_timezone", "_comment": "input timezone" },
            { "const": "", "_comment": "output timezone" }
        ],
        "ignore_error": true
    }}
Say event_date and event_time contain invalid characters. dateTimeToRFC3339 would then typically fail to convert the value to RFC3339 standard format, thus failing the transform of FINAL_OUTPUT for the current record. However, because of "ignore_error": true, this custom_func would instead simply return nil/null without error.
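
The effect of "ignore_error": true can be pictured as a try/catch around both argument evaluation and the function call. The sketch below is a simplified model, not omniparser's code: callCustomFunc is a hypothetical wrapper, arguments are passed as zero-argument functions so that their failures are caught too, and strictParseDate is only a stand-in for a function that rejects bad input, not an equivalent of dateTimeToRFC3339.

```javascript
// Hypothetical wrapper modeling "ignore_error": true.
function callCustomFunc(fn, argThunks, ignoreError) {
  try {
    // Argument failures and function failures are handled the same way.
    const args = argThunks.map(function (thunk) { return thunk(); });
    return fn.apply(null, args);
  } catch (err) {
    if (ignoreError) return null; // swallow the error and yield null
    throw err;                    // otherwise relay the error out
  }
}

// Stand-in for a function that fails on invalid input.
function strictParseDate(s) {
  const ms = Date.parse(s);
  if (Number.isNaN(ms)) throw new Error('invalid datetime: ' + s);
  return new Date(ms).toISOString();
}

const bad = [function () { return 'not-a-date' + 'T' + 'not-a-time'; }];
console.log(callCustomFunc(strictParseDate, bad, true)); // null
```

With ignoreError set to false, the same call throws instead of returning null.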
If an argument transform's value is nil/null (possibly due to an argument transform failure coupled with its own "ignore_error": true), then this argument's value will be whatever the default value of the argument type dictates, such as 0 for int, 0.0 for float, "" for string, etc.
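
As a rough illustration of this defaulting rule, the sketch below substitutes a per-type default when an argument value is null. DEFAULTS and argValueOrDefault are hypothetical names, and only the three defaults named above are included; defaults for other types are not guessed here.

```javascript
// Hypothetical table of per-type defaults, limited to the values listed above.
const DEFAULTS = { int: 0, float: 0.0, string: '' };

function argValueOrDefault(value, type) {
  if (value !== null && value !== undefined) return value;
  if (!(type in DEFAULTS)) {
    // Assumption of this sketch: types not listed above are left unhandled.
    throw new Error('no default known for type: ' + type);
  }
  return DEFAULTS[type];
}

console.log(argValueOrDefault(null, 'int'));    // 0
console.log(argValueOrDefault('abc', 'string')); // abc
```

One practical consequence: a function such as concat would see an empty string for a failed string argument rather than failing itself.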