Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change mixed type as string to have higher priority over JSON schema #16731

Conversation

karthikeyann
Copy link
Contributor

Description

Fixes #15260
Prefer mixed type as string than input schema in json reader, to fix the issue.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@karthikeyann karthikeyann added bug Something isn't working 3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue Spark Functionality that helps Spark RAPIDS breaking Breaking change labels Sep 3, 2024
@karthikeyann karthikeyann requested a review from revans2 September 3, 2024 23:47
@karthikeyann karthikeyann requested a review from a team as a code owner September 3, 2024 23:47
Copy link
Contributor

@ttnghia ttnghia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update docs to explicitly clarify this behavior.

@revans2
Copy link
Contributor

revans2 commented Sep 5, 2024

I need to spend some more time looking at this as it has not fixed the test cases that I have. I might have simplified the issue too much for the test case that I included when I filed the issue, or I might have lumped too many things under the same issue.

I am also a little concerned about this code. I don't think I want mixed type as string to have higher priority compared to the schema that I passed in, but I need to do some more testing to really understand what is happening here and not just my speculation from the description.

@karthikeyann karthikeyann added 5 - DO NOT MERGE Hold off on merging; see PR for details and removed 3 - Ready for Review Ready for review by team labels Sep 5, 2024
Copy link
Contributor

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry this just does not work for the issues I am having right now.

std::map<std::string, cudf::io::schema_element> data_types;
data_types.insert(
std::pair{"data", cudf::io::schema_element{cudf::data_type{cudf::type_id::LIST}}});

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a little confused by this test. The data type of the schema element is just LIST? but the test I posed in the bug #15260 (comment) has it as a LIST of STRING. I'm not really sure if that makes much of a difference, but it makes a big difference if I want a LIST<STRUCT<a:STRING,b:STRING>> compared to just a LIST

EXPECT_EQ(result.tbl->num_rows(), 2);
EXPECT_EQ(result.tbl->get_column(0).type().id(), cudf::type_id::STRING);
// expected output without whitespace
cudf::test::strings_column_wrapper expected({R"({"A": 0, "B": 1})", "[1,0]"});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the result I want to come out of this. I want an LIST of strings to come out. Not a column of strings. I filed the issue because my users asked us (Spark) to return a LIST for the data column, but because the data types don't match that I don't want to get back a STRING column. I want to get back a column that has a LIST of STRINGS in it. So the first entry would be null because {"A":0, "B": 1} is not something that can be coerced into a LIST. The second entry would be a list with two STRING value in it "1" and "0" because each of them can be coerced into a string.

@karthikeyann
Copy link
Contributor Author

Issue is fixed with #16545
closing this PR.

@karthikeyann karthikeyann deleted the enh-json_prefer_mixed_type branch September 26, 2024 16:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - DO NOT MERGE Hold off on merging; see PR for details breaking Breaking change bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] mixed_type_as_string throws exception for nested data with nested STRING schema request
5 participants