Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add missing json reader options for JsonScanRetrySuite #11898

Merged
merged 2 commits into from
Dec 20, 2024
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -193,18 +193,22 @@ object GpuJsonReadCommon {
val allowUnquotedControlChars = options.buildJsonFactory()
.isEnabled(JsonParser.Feature.ALLOW_UNQUOTED_CONTROL_CHARS)

baseCudfJsonOptionsBuilder()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel that this name can be something better but don't know what 😃

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate? From what perspective can it be better? Like readability? brevity?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is "base" and why? The builder is not final and not a "base" for something. It is just the builder, with some options set, and before we pass in more options. So I don't like to have "base" in it. But I don't have a better name for it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I used "base" to imply that it has default options. How about cudfJsonOptionsBuilderWithDefaults()? Is that better? It's little long though.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

defaultJsonOptionsBuilder? No need to have such long name.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I like the name, but want to have cudf in it as there are two different JSONOptions, one for cudf and another for catalyst. I will rename it to defaultCudfJsonOptionsBuilder in #11899.

.withNormalizeSingleQuotes(options.allowSingleQuotes)
.withLeadingZeros(options.allowNumericLeadingZeros)
.withNonNumericNumbers(options.allowNonNumericNumbers)
.withUnquotedControlChars(allowUnquotedControlChars)
.build()
}

def baseCudfJsonOptionsBuilder(): ai.rapids.cudf.JSONOptions.Builder = {
ai.rapids.cudf.JSONOptions.builder()
.withRecoverWithNull(true)
.withMixedTypesAsStrings(true)
.withNormalizeWhitespace(true)
.withKeepQuotes(true)
.withNormalizeSingleQuotes(options.allowSingleQuotes)
.withStrictValidation(true)
.withLeadingZeros(options.allowNumericLeadingZeros)
.withNonNumericNumbers(options.allowNonNumericNumbers)
.withUnquotedControlChars(allowUnquotedControlChars)
.withCudfPruneSchema(true)
.withExperimental(true)
.build()
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -16,10 +16,10 @@

package com.nvidia.spark.rapids

import ai.rapids.cudf.JSONOptions
import com.nvidia.spark.rapids.jni.RmmSpark

import org.apache.spark.sql.catalyst.json.rapids.JsonPartitionReader
import org.apache.spark.sql.rapids.GpuJsonReadCommon
import org.apache.spark.sql.types._

class JsonScanRetrySuite extends RmmSparkRetrySuiteBase {
Expand All @@ -29,7 +29,7 @@ class JsonScanRetrySuite extends RmmSparkRetrySuiteBase {

val cudfSchema = GpuColumnVector.from(StructType(Seq(StructField("a", IntegerType),
StructField("b", IntegerType))))
val opts = JSONOptions.builder().withLines(true).build()
val opts = GpuJsonReadCommon.baseCudfJsonOptionsBuilder().withLines(true).build()
Copy link
Collaborator

@ttnghia ttnghia Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line(true) is default and the plugin relies on it. So probably we should better set withLines(true) explicitly in baseCudfJsonOptionsBuilder instead.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should do that in a separate PR. I'd like to keep the change of this PR minimized.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But you already changed the cudfJsonOptions function right? So adding one more change like that would not be different IMO.

Copy link
Collaborator Author

@jihoonson jihoonson Dec 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am modifying the code in this function without functionality change. Your suggestion changes its behavior which may or may not impact other tests. Given that this test failure is blocking all PRs, I think it's better to make that change in a different PR without unnecessarily blocking others further.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created #11899.

RmmSpark.forceRetryOOM(RmmSpark.getCurrentThreadId, 1,
RmmSpark.OomInjectionType.GPU.ordinal, 0)
val table = JsonPartitionReader.readToTable(bufferer, cudfSchema, NoopMetric,
Expand Down
Loading