Change mixed type as string to have higher priority over JSON schema #16731

karthikeyann · 2024-09-03T23:47:10Z

Description

Fixes #15260
In the JSON reader, prefer the mixed-type-as-string option over the input schema when the two conflict.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

ttnghia

Please update docs to explicitly clarify this behavior.

revans2 · 2024-09-05T20:55:48Z

I need to spend some more time looking at this as it has not fixed the test cases that I have. I might have simplified the issue too much for the test case that I included when I filed the issue, or I might have lumped too many things under the same issue.

I am also a little concerned about this code. I don't think I want mixed type as string to have higher priority than the schema that I passed in, but I need to do some more testing to really understand what is happening here rather than just speculating from the description.

revans2

I'm sorry, but this just does not work for the issues I am having right now.

revans2 · 2024-09-05T21:01:55Z

cpp/tests/io/json/json_test.cpp

+  std::map<std::string, cudf::io::schema_element> data_types;
+  data_types.insert(
+    std::pair{"data", cudf::io::schema_element{cudf::data_type{cudf::type_id::LIST}}});
+


I am a little confused by this test. The data type of the schema element is just LIST, but the test I posted in the bug (#15260 (comment)) has it as a LIST of STRING. I'm not sure whether that makes much of a difference here, but it makes a big difference if I want a LIST<STRUCT<a:STRING,b:STRING>> rather than just a LIST.
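To make the distinction in the comment above concrete, here is a minimal, self-contained sketch of a nested schema tree. It is only an illustration: `kind`, `schema_node`, and `describe` are hypothetical names invented for this example and are not part of libcudf, whose real type is the `cudf::io::schema_element` used in the test.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a nested reader schema (not a libcudf type).
enum class kind { STRING, LIST, STRUCT };

struct schema_node {
  kind type;
  std::string name;                   // field name; only meaningful for STRUCT members
  std::vector<schema_node> children;  // empty means the child types were left unspecified
};

// Render a schema as text, e.g. "LIST", "LIST<STRING>", "LIST<STRUCT<a:STRING,b:STRING>>".
std::string describe(schema_node const& n)
{
  if (n.type == kind::STRING) { return "STRING"; }
  std::string out = (n.type == kind::LIST) ? "LIST" : "STRUCT";
  if (n.children.empty()) { return out; }  // a bare LIST or STRUCT says nothing about its children
  out += "<";
  for (std::size_t i = 0; i < n.children.size(); ++i) {
    if (i > 0) { out += ","; }
    if (n.type == kind::STRUCT) { out += n.children[i].name + ":"; }
    out += describe(n.children[i]);
  }
  return out + ">";
}
```

In this model, the schema in the test corresponds to a `LIST` node with no children, while the schema from the issue would carry one `STRING` child. The two are different requests, and a reader that honors the schema has much less to go on in the first case.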

revans2 · 2024-09-05T21:05:56Z

cpp/tests/io/json/json_test.cpp

+  EXPECT_EQ(result.tbl->num_rows(), 2);
+  EXPECT_EQ(result.tbl->get_column(0).type().id(), cudf::type_id::STRING);
+  // expected output without whitespace
+  cudf::test::strings_column_wrapper expected({R"({"A": 0, "B": 1})", "[1,0]"});


This is not the result I want from this. I want a LIST of strings to come out, not a column of strings. I filed the issue because my users asked us (Spark) to return a LIST for the data column. Even when the data types don't match that, I don't want to get back a STRING column; I want to get back a column that has a LIST of STRINGs in it. So the first entry would be null, because {"A":0, "B": 1} is not something that can be coerced into a LIST. The second entry would be a list with two STRING values in it, "1" and "0", because each of them can be coerced into a string.
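The behavior requested above can be modeled in a few lines of plain C++. This is a sketch of the desired semantics only, not the libcudf implementation: `json_object`, `json_array`, `coerce_to_list_of_strings`, and `read_as_list_of_strings` are hypothetical names, and the rows are assumed to be already parsed into either an object or an array of raw scalar tokens.

```cpp
#include <optional>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-ins for already-parsed JSON rows (not libcudf types).
struct json_object { std::string raw; };                 // e.g. {"A": 0, "B": 1}
struct json_array  { std::vector<std::string> items; };  // raw scalar tokens, e.g. 1 and 0
using json_value  = std::variant<json_object, json_array>;
using string_list = std::vector<std::string>;

// Coerce one row to LIST<STRING>: an array keeps its elements as strings,
// and anything that is not an array becomes null (std::nullopt).
std::optional<string_list> coerce_to_list_of_strings(json_value const& v)
{
  if (auto const* arr = std::get_if<json_array>(&v)) { return arr->items; }
  return std::nullopt;
}

// Apply the coercion to every row of the column.
std::vector<std::optional<string_list>> read_as_list_of_strings(
  std::vector<json_value> const& rows)
{
  std::vector<std::optional<string_list>> out;
  out.reserve(rows.size());
  for (auto const& row : rows) { out.push_back(coerce_to_list_of_strings(row)); }
  return out;
}
```

Feeding this model the two rows from the test, an object followed by the array `[1,0]`, gives a null first entry and a second entry holding the strings "1" and "0", which is the LIST<STRING> result described in the comment rather than a flat STRING column.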

karthikeyann · 2024-09-26T16:38:58Z

The issue is fixed by #16545.
Closing this PR.

Change mixed type as string to have higher priority over schema

1802a8b

karthikeyann added the labels bug, 3 - Ready for Review, libcudf, cuIO, Spark, and breaking on Sep 3, 2024

karthikeyann requested a review from revans2 September 3, 2024 23:47

karthikeyann requested a review from a team as a code owner September 3, 2024 23:47

karthikeyann requested review from harrism and nvdbaranec September 3, 2024 23:47

karthikeyann mentioned this pull request Sep 3, 2024

[BUG] mixed_type_as_string throws exception for nested data with nested STRING schema request #15260

Closed

harrism approved these changes Sep 3, 2024

ttnghia approved these changes Sep 4, 2024

karthikeyann added 3 commits September 5, 2024 02:05

fix merge issues

55c84d3

Update doc mixed type as string in json.hpp

cdc417e

Merge branch 'branch-24.10' into enh-json_prefer_mixed_type

14298a4

GregoryKimball requested a review from shrshi September 5, 2024 17:26

nvdbaranec approved these changes Sep 5, 2024

karthikeyann added the label 5 - DO NOT MERGE and removed the label 3 - Ready for Review on Sep 5, 2024

revans2 requested changes Sep 5, 2024

karthikeyann closed this Sep 26, 2024

karthikeyann deleted the enh-json_prefer_mixed_type branch September 26, 2024 16:39
