Enable JSON Scan and from_json by default (#11753)
Signed-off-by: Robert (Bobby) Evans <[email protected]>
Co-authored-by: Nghia Truong <[email protected]>
revans2 and ttnghia authored Nov 25, 2024
1 parent 6cba00d commit 6539441
Showing 53 changed files with 151 additions and 176 deletions.
6 changes: 3 additions & 3 deletions docs/additional-functionality/advanced_configs.md
@@ -95,8 +95,8 @@ Name | Description | Default Value | Applicable at
<a name="sql.format.hive.text.write.enabled"></a>spark.rapids.sql.format.hive.text.write.enabled|When set to false disables Hive text table write acceleration|false|Runtime
<a name="sql.format.iceberg.enabled"></a>spark.rapids.sql.format.iceberg.enabled|When set to false disables all Iceberg acceleration|true|Runtime
<a name="sql.format.iceberg.read.enabled"></a>spark.rapids.sql.format.iceberg.read.enabled|When set to false disables Iceberg input acceleration|true|Runtime
<a name="sql.format.json.enabled"></a>spark.rapids.sql.format.json.enabled|When set to true enables all json input and output acceleration. (only input is currently supported anyways)|false|Runtime
<a name="sql.format.json.read.enabled"></a>spark.rapids.sql.format.json.read.enabled|When set to true enables json input acceleration|false|Runtime
<a name="sql.format.json.enabled"></a>spark.rapids.sql.format.json.enabled|When set to true enables all json input and output acceleration. (only input is currently supported anyways)|true|Runtime
<a name="sql.format.json.read.enabled"></a>spark.rapids.sql.format.json.read.enabled|When set to true enables json input acceleration|true|Runtime
<a name="sql.format.orc.enabled"></a>spark.rapids.sql.format.orc.enabled|When set to false disables all orc input and output acceleration|true|Runtime
<a name="sql.format.orc.floatTypesToString.enable"></a>spark.rapids.sql.format.orc.floatTypesToString.enable|When reading an ORC file, the source data schemas(schemas of ORC file) may differ from the target schemas (schemas of the reader), we need to handle the castings from source type to target type. Since float/double numbers in GPU have different precision with CPU, when casting float/double to string, the result of GPU is different from result of CPU spark. Its default value is `true` (this means the strings result will differ from result of CPU). If it's set `false` explicitly and there exists casting from float/double to string in the job, then such behavior will cause an exception, and the job will fail.|true|Runtime
<a name="sql.format.orc.multiThreadedRead.maxNumFilesParallel"></a>spark.rapids.sql.format.orc.multiThreadedRead.maxNumFilesParallel|A limit on the maximum number of files per task processed in parallel on the CPU side before the file is sent to the GPU. This affects the amount of host memory used when reading the files in parallel. Used with MULTITHREADED reader, see spark.rapids.sql.format.orc.reader.type.|2147483647|Runtime
@@ -278,7 +278,7 @@ Name | SQL Function(s) | Description | Default Value | Notes
<a name="sql.expression.IsNaN"></a>spark.rapids.sql.expression.IsNaN|`isnan`|Checks if a value is NaN|true|None|
<a name="sql.expression.IsNotNull"></a>spark.rapids.sql.expression.IsNotNull|`isnotnull`|Checks if a value is not null|true|None|
<a name="sql.expression.IsNull"></a>spark.rapids.sql.expression.IsNull|`isnull`|Checks if a value is null|true|None|
<a name="sql.expression.JsonToStructs"></a>spark.rapids.sql.expression.JsonToStructs|`from_json`|Returns a struct value with the given `jsonStr` and `schema`|false|This is disabled by default because it is currently in beta and undergoes continuous enhancements. Please consult the [compatibility documentation](../compatibility.md#json-supporting-types) to determine whether you can enable this configuration for your use case|
<a name="sql.expression.JsonToStructs"></a>spark.rapids.sql.expression.JsonToStructs|`from_json`|Returns a struct value with the given `jsonStr` and `schema`|true|None|
<a name="sql.expression.JsonTuple"></a>spark.rapids.sql.expression.JsonTuple|`json_tuple`|Returns a tuple like the function get_json_object, but it takes multiple names. All the input parameters and output column types are string.|false|This is disabled by default because Experimental feature that could be unstable or have performance issues.|
<a name="sql.expression.KnownFloatingPointNormalized"></a>spark.rapids.sql.expression.KnownFloatingPointNormalized| |Tag to prevent redundant normalization|true|None|
<a name="sql.expression.KnownNotNull"></a>spark.rapids.sql.expression.KnownNotNull| |Tag an expression as known to not be null|true|None|
161 changes: 69 additions & 92 deletions docs/compatibility.md
@@ -316,133 +316,110 @@ case.

## JSON

The JSON format read is an experimental feature which is expected to have some issues, so we disable
it by default. If you would like to test it, you need to enable `spark.rapids.sql.format.json.enabled` and
`spark.rapids.sql.format.json.read.enabled`.
JSON, despite being a standard format, has some ambiguity in it. Spark also offers the ability to allow
some invalid JSON to be parsed. We have tried to provide JSON parsing that is compatible with
what Apache Spark does support. Note that Spark itself has changed through different releases, and we will
try to call out which releases we offer different results for. JSON parsing is enabled by default
except for date and timestamp types where we still have work to complete. If you wish to disable
JSON Scan you can set `spark.rapids.sql.format.json.enabled` or
`spark.rapids.sql.format.json.read.enabled` to false. To disable `from_json` you can set
`spark.rapids.sql.expression.JsonToStructs` to false.
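
As a rough illustration of the settings named above, here is a minimal configuration sketch. It assumes an active `SparkSession` named `spark` (for example in `spark-shell`) with the RAPIDS Accelerator already loaded. The config keys come from this document; everything else is an assumption for illustration.

```scala
// Assumes `spark` is an existing SparkSession with the RAPIDS Accelerator loaded.
// Disable GPU JSON Scan; per the text above, setting either key to false is enough.
spark.conf.set("spark.rapids.sql.format.json.enabled", "false")
spark.conf.set("spark.rapids.sql.format.json.read.enabled", "false")

// Disable GPU acceleration of from_json.
spark.conf.set("spark.rapids.sql.expression.JsonToStructs", "false")
```

These keys are listed as `Runtime` in `advanced_configs.md`, so they can presumably also be supplied with `--conf` at submit time.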

### Invalid JSON
### Limits

In Apache Spark on the CPU if a line in the JSON file is invalid the entire row is considered
invalid and will result in nulls being returned for all columns. It is considered invalid if it
violates the JSON specification, but with a few extensions.
In versions of Spark before 3.5.0 there is no maximum to how deeply nested JSON can be. After
3.5.0 this was updated to be 1,000 by default. The current GPU implementation of JSON Scan and
`from_json` limits this to 254 no matter what version of Spark is used. If the nesting level is
over this the JSON is considered invalid and all values will be returned as nulls.
`get_json_object` and `json_tuple` have a maximum nesting depth of 64. An exception is thrown if
the nesting depth goes over the maximum.
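
To make the nesting limit concrete, the following is a minimal sketch of a pre-check that estimates how deeply a JSON string is nested. The object and method names are hypothetical and not part of the plugin. It only recognises double-quoted strings with backslash escapes and does not validate the input. Counting the outermost `{` or `[` as level 1 is an assumption about how the limit is measured.

```scala
object JsonDepth {
  // Limit quoted in the text above for GPU JSON Scan and from_json.
  val GpuMaxDepth = 254

  /** Returns the deepest level of `{` / `[` nesting found outside of string literals. */
  def maxNestingDepth(json: String): Int = {
    var depth = 0
    var max = 0
    var inString = false
    var escaped = false
    for (c <- json) {
      if (inString) {
        // Inside a string: skip escaped characters and look for the closing quote.
        if (escaped) escaped = false
        else if (c == '\\') escaped = true
        else if (c == '"') inString = false
      } else c match {
        case '"' => inString = true
        case '{' | '[' =>
          depth += 1
          if (depth > max) max = depth
        case '}' | ']' => depth -= 1
        case _ => // any other character does not affect nesting
      }
    }
    max
  }

  /** True when the estimated depth is over the GPU limit described above. */
  def exceedsGpuLimit(json: String): Boolean = maxNestingDepth(json) > GpuMaxDepth
}
```

A check like this could be used to flag rows that the GPU would treat as invalid before comparing results against the CPU.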

* Single quotes are allowed to quote strings and keys
* Unquoted values like NaN and Infinity can be parsed as floating point values
* Control characters do not need to be replaced with the corresponding escape sequences in a
quoted string.
* Garbage at the end of a row, if there is valid JSON at the beginning of the row, is ignored.
Spark 3.5.0 and above limit the maximum string length to 20,000,000 and the maximum number length to
1,000. We do not have any of these limits on the GPU.

The GPU implementation does the same kinds of validations, but many of them are done on a per-column
basis, which, for example, means if a number is formatted incorrectly, it is likely only that value
will be considered invalid and return a null instead of nulls for the entire row.
We, like Spark, cannot support a JSON string that is larger than 2 GiB in size.

There are options that can be used to enable and disable many of these features which are mostly
listed below.
### JSON Validation

### JSON options
Spark supports the option `allowNonNumericNumbers`. Versions of Spark prior to 3.3.0 were inconsistent between
quoted and non-quoted values ([SPARK-38060](https://issues.apache.org/jira/browse/SPARK-38060)). The
GPU implementation is consistent with 3.3.0 and above.

Spark supports passing options to the JSON parser when reading a dataset. In most cases if the RAPIDS Accelerator
sees one of these options that it does not support it will fall back to the CPU. In some cases we do not. The
following options are documented below.
### JSON Floating Point Types

- `allowNumericLeadingZeros` - Allows leading zeros in numbers (e.g. 00012). By default this is set to false.
When it is false Spark considers the JSON invalid if it encounters this type of number. The RAPIDS
Accelerator supports validating columns that are returned to the user with this option on or off.

- `allowUnquotedControlChars` - Allows JSON Strings to contain unquoted control characters (ASCII characters with
value less than 32, including tab and line feed characters) or not. By default this is set to false. If the schema
is provided while reading JSON file, then this flag has no impact on the RAPIDS Accelerator as it always allows
unquoted control characters but Spark sees these are invalid are returns nulls. However, if the schema is not provided
and this option is false, then RAPIDS Accelerator's behavior is same as Spark where an exception is thrown
as discussed in `JSON Schema discovery` section.

- `allowNonNumericNumbers` - Allows `NaN` and `Infinity` values to be parsed (note that these are not valid numeric
values in the [JSON specification](https://json.org)). Spark versions prior to 3.3.0 have inconsistent behavior and will
parse some variants of `NaN` and `Infinity` even when this option is disabled
([SPARK-38060](https://issues.apache.org/jira/browse/SPARK-38060)). The RAPIDS Accelerator behavior is consistent with
Spark version 3.3.0 and later.

### Nesting
In versions of Spark before 3.5.0 there is no maximum to how deeply nested JSON can be. After
3.5.0 this was updated to be 1000 by default. The current GPU implementation limits this to 254
no matter what version of Spark is used. If the nesting level is over this the JSON is considered
invalid and all values will be returned as nulls.

Mixed types can have some problems. If an item being read could have some lines that are arrays
and others that are structs/dictionaries it is possible an error will be thrown.

Dates and Timestamps have some issues and may return values for technically invalid inputs.

Floating point numbers have issues generally like with the rest of Spark, and we can parse them into
a valid floating point number, but it might not match 100% with the way Spark does it.

Strings are supported, but the data returned might not be normalized in the same way as the CPU
implementation. Generally this comes down to the GPU not modifying the input, whereas Spark will
do things like remove extra white space and parse numbers before turning them back into a string.
Parsing floating-point values has the same limitations as [casting from string to float](#string-to-float).

### JSON Floating Point
### JSON Integral Types

Parsing floating-point values has the same limitations as [casting from string to float](#string-to-float).
Versions of Spark prior to 3.3.0 would parse quoted integer values, like "1". But 3.3.0 and above consider
these to be invalid and will return `null` when they are parsed as an integral type. The GPU implementation
follows 3.3.0 and above.

Prior to Spark 3.3.0, reading JSON strings such as `"+Infinity"` when specifying that the data type is `FloatType`
or `DoubleType` caused these values to be parsed even when `allowNonNumericNumbers` is set to false. Also, Spark
versions prior to 3.3.0 only supported the `"Infinity"` and `"-Infinity"` representations of infinity and did not
support `"+INF"`, `"-INF"`, or `"+Infinity"`, which Spark considers valid when unquoted. The GPU JSON reader is
consistent with the behavior in Spark 3.3.0 and later.
### JSON Decimal Types

Another limitation of the GPU JSON reader is that it will parse strings containing non-string boolean or numeric values where
Spark will treat them as invalid inputs and will just return `null`.
Spark supports parsing decimal types formatted as either floating-point or integral numbers, even if they
are in a quoted string. If the number is in a quoted string, the locale of the JVM is used to determine the
number format. If the locale is not `US`, which is the default, we fall back to the CPU because we do not
currently parse those numbers correctly. The `US` format removes all commas ',' from the quoted string.
As a part of this, though, non-Arabic numbers are also supported. We do not support parsing these numbers;
see [issue 10532](https://github.com/NVIDIA/spark-rapids/issues/10532).
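
The `US` rule described above (drop every comma, then parse) can be sketched in plain Scala. This is an illustration of the described behaviour only, not the plugin's or Spark's actual code. `UsDecimal` and `parseQuoted` are hypothetical names.

```scala
import java.math.{BigDecimal => JBigDecimal}

object UsDecimal {
  /**
   * Sketch of the US-locale handling of a quoted decimal:
   * remove all commas, then parse what is left as a decimal.
   * Throws NumberFormatException on input that is still not a number.
   */
  def parseQuoted(s: String): JBigDecimal = new JBigDecimal(s.replace(",", ""))
}
```

Under this rule `"1,234.50"` is read as the decimal 1234.50, and because commas are removed wherever they appear, `"1,2,3"` is read as 123.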

### JSON Dates/Timestamps
### JSON Date/Timestamp Types

Dates and timestamps are not supported by default in JSON parser, since the GPU implementation is not 100%
compatible with Apache Spark.
If needed, they can be turned on through the config `spark.rapids.sql.json.read.datetime.enabled`.
Once enabled, the JSON parser still does not support the `TimestampNTZ` type and will fall back to CPU
if `spark.sql.timestampType` is set to `TIMESTAMP_NTZ` or if an explicit schema is provided that
contains the `TimestampNTZ` type.
This config works for both JSON scan and `from_json`. Once enabled, the JSON parser still does
not support the `TimestampNTZ` type and will fall back to CPU if `spark.sql.timestampType` is set
to `TIMESTAMP_NTZ` or if an explicit schema is provided that contains the `TimestampNTZ` type.

There is currently no support for reading numeric values as timestamps and null values are returned instead
([#4940](https://github.com/NVIDIA/spark-rapids/issues/4940)). A workaround would be to read as longs and then cast
to timestamp.
([#4940](https://github.com/NVIDIA/spark-rapids/issues/4940)). A workaround would be to read as longs and then cast to timestamp.

### JSON Schema discovery
### JSON Arrays and Structs with Overflowing Numbers

Spark SQL can automatically infer the schema of a JSON dataset if schema is not provided explicitly. The CPU
handles schema discovery and there is no GPU acceleration of this. By default Spark will read/parse the entire
dataset to determine the schema. This means that some options/errors which are ignored by the GPU may still
result in an exception if used with schema discovery.
Spark is inconsistent between versions in how it handles numbers that overflow that are nested in either an array
or a non-top-level struct. In some versions only the value that overflowed is marked as null. In other versions the
wrapping array or struct is marked as null. We currently only mark the individual value as null. This matches
versions 3.4.2 and above of Spark for structs. Arrays on most versions of Spark invalidate the entire array if there
is a single value that overflows within it.

### `from_json` function
### Duplicate Struct Names

`JsonToStructs` of `from_json` is based on the same code as reading a JSON lines file. There are
a few differences with it.
The JSON specification technically allows for duplicate keys in a struct, but does not explain what to
do with them. Spark is inconsistent between operators about which value wins. The result of `get_json_object`
depends on the query being performed. We do not always match what Spark does. We do match it in many cases,
but we consider this enough of a corner case that we have not tried to make it work in all cases.

The `from_json` function is disabled by default because it is experimental and has some known
incompatibilities with Spark, and can be enabled by setting
`spark.rapids.sql.expression.JsonToStructs=true`. You don't need to set
`spark.rapids.sql.format.json.enabled` and`spark.rapids.sql.format.json.read.enabled` to true.
In addition, if the input schema contains date and/or timestamp types, an additional config
`spark.rapids.sql.json.read.datetime.enabled` also needs to be set to `true` in order
to enable this function on the GPU.
We also do not support schemas where there are duplicate column names. We just fall back to the CPU for those cases.

There is no schema discovery as a schema is required as input to `from_json`
### JSON Normalization (String Types)

In addition to `structs`, a top level `map` type is supported, but only if the key and value are
strings.
In versions of Spark prior to 4.0.0 input JSON Strings were parsed to JSON tokens and then converted back to
strings. This effectively normalizes the output string. So things like single quotes are transformed into double
quotes, floating point numbers are parsed and converted back to strings possibly changing the format, and
escaped characters are converted back to their simplest form. We try to support this on the GPU as well. Single quotes
will be converted to double quotes. Only `get_json_object` and `json_tuple` attempt to normalize floating point
numbers. There is no implementation on the GPU right now that tries to normalize escape characters.

### `from_json` Function

`JsonToStructs` or `from_json` is based on the same code as reading a JSON lines file. There are
a few differences with it.

### `to_json` function
The main difference is that `from_json` supports parsing Maps and Arrays directly from a JSON column, whereas
JSON Scan only supports parsing top level structs. The GPU implementation of `from_json` has support for parsing
a `MAP<STRING,STRING>` as a top level schema, but does not currently support arrays at the top level.

The `to_json` function is disabled by default because it is experimental and has some known incompatibilities
with Spark, and can be enabled by setting `spark.rapids.sql.expression.StructsToJson=true`.
### `to_json` Function

Known issues are:

- There can be rounding differences when formatting floating-point numbers as strings. For example, Spark may
produce `-4.1243574E26` but the GPU may produce `-4.124357351E26`.
- Not all JSON options are respected

### get_json_object
### `get_json_object` Function

Known issue:
- [Floating-point number normalization error](https://github.com/NVIDIA/spark-rapids-jni/issues/1922). `get_json_object` floating-point number normalization on the GPU could sometimes return incorrect results if the string contains high-precision values, see the String to Float and Float to String section for more details.
4 changes: 2 additions & 2 deletions docs/supported_ops.md
@@ -9279,7 +9279,7 @@ are limited.
<td rowSpan="2">JsonToStructs</td>
<td rowSpan="2">`from_json`</td>
<td rowSpan="2">Returns a struct value with the given `jsonStr` and `schema`</td>
<td rowSpan="2">This is disabled by default because it is currently in beta and undergoes continuous enhancements. Please consult the [compatibility documentation](../compatibility.md#json-supporting-types) to determine whether you can enable this configuration for your use case</td>
<td rowSpan="2">None</td>
<td rowSpan="2">project</td>
<td>jsonStr</td>
<td> </td>
@@ -9320,7 +9320,7 @@ are limited.
<td> </td>
<td> </td>
<td><b>NS</b></td>
<td><em>PS<br/>MAP only supports keys and values that are of STRING type;<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types NULL, BINARY, CALENDAR, MAP, UDT, DAYTIME, YEARMONTH</em></td>
<td><em>PS<br/>MAP only supports keys and values that are of STRING type and is only supported at the top level;<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types NULL, BINARY, CALENDAR, MAP, UDT, DAYTIME, YEARMONTH</em></td>
<td><em>PS<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types NULL, BINARY, CALENDAR, MAP, UDT, DAYTIME, YEARMONTH</em></td>
<td> </td>
<td> </td>
@@ -3780,7 +3780,8 @@ object GpuOverrides extends Logging {
ExprChecks.projectOnly(
TypeSig.STRUCT.nested(jsonStructReadTypes) +
TypeSig.MAP.nested(TypeSig.STRING).withPsNote(TypeEnum.MAP,
"MAP only supports keys and values that are of STRING type"),
"MAP only supports keys and values that are of STRING type " +
"and is only supported at the top level"),
(TypeSig.STRUCT + TypeSig.MAP + TypeSig.ARRAY).nested(TypeSig.all),
Seq(ParamCheck("jsonStr", TypeSig.STRING, TypeSig.STRING))),
(a, conf, p, r) => new UnaryExprMeta[JsonToStructs](a, conf, p, r) {
@@ -3821,10 +3822,7 @@ object GpuOverrides extends Logging {
override def convertToGpu(child: Expression): GpuExpression =
// GPU implementation currently does not support duplicated json key names in input
GpuJsonToStructs(a.schema, a.options, child, a.timeZoneId)
}).disabledByDefault("it is currently in beta and undergoes continuous enhancements."+
" Please consult the "+
"[compatibility documentation](../compatibility.md#json-supporting-types)"+
" to determine whether you can enable this configuration for your use case"),
}),
expr[StructsToJson](
"Converts structs to JSON text format",
ExprChecks.projectOnly(