Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change mixed type as string to have higher priority over JSON schema #16731

Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions cpp/include/cudf/io/json.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -376,6 +376,7 @@ class json_reader_options {
/**
* @brief Set whether to parse mixed types as a string column.
* Also enables forcing to read a struct as string column using schema.
* If enable, mixed types are parsed a string column regardless of schema.
*
* @param val Boolean value to enable/disable parsing mixed types as a string column
*/
Expand Down
6 changes: 3 additions & 3 deletions cpp/src/io/json/json_column.cu
Original file line number Diff line number Diff line change
Expand Up @@ -987,15 +987,15 @@ std::pair<std::unique_ptr<column>, std::vector<column_name_info>> device_json_co

data_type target_type{};

if (schema.has_value()) {
if (json_col.forced_as_string_column) {
target_type = data_type{type_id::STRING};
} else if (schema.has_value()) {
#ifdef NJP_DEBUG_PRINT
std::cout << "-> explicit type: "
<< (schema.has_value() ? std::to_string(static_cast<int>(schema->type.id()))
: "n/a");
#endif
target_type = schema.value().type;
} else if (json_col.forced_as_string_column) {
target_type = data_type{type_id::STRING};
}
// Infer column type, if we don't have an explicit type for it
else {
Expand Down
25 changes: 25 additions & 0 deletions cpp/tests/io/json/json_test.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -2776,4 +2776,29 @@ TEST_F(JsonReaderTest, JSONMixedTypeChildren)
}
}

TEST_F(JsonReaderTest, MixedTypesWithSchema)
{
std::string data = "{\"data\": {\"A\": 0, \"B\": 1}}\n{\"data\": [1,0]}\n";

std::map<std::string, cudf::io::schema_element> data_types;
data_types.insert(
std::pair{"data", cudf::io::schema_element{cudf::data_type{cudf::type_id::LIST}}});

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a little confused by this test. The data type of the schema element is just LIST? but the test I posed in the bug #15260 (comment) has it as a LIST of STRING. I'm not really sure if that makes much of a difference, but it makes a big difference if I want a LIST<STRUCT<a:STRING,b:STRING>> compared to just a LIST

cudf::io::json_reader_options in_options =
cudf::io::json_reader_options::builder(cudf::io::source_info{data.data(), data.size()})
.dtypes(data_types)
.recovery_mode(cudf::io::json_recovery_mode_t::RECOVER_WITH_NULL)
.mixed_types_as_string(true)
.keep_quotes(true)
.lines(true);
cudf::io::table_with_metadata result = cudf::io::read_json(in_options);

EXPECT_EQ(result.tbl->num_columns(), 1);
EXPECT_EQ(result.tbl->num_rows(), 2);
EXPECT_EQ(result.tbl->get_column(0).type().id(), cudf::type_id::STRING);
// expected output without whitespace
cudf::test::strings_column_wrapper expected({R"({"A": 0, "B": 1})", "[1,0]"});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the result I want to come out of this. I want an LIST of strings to come out. Not a column of strings. I filed the issue because my users asked us (Spark) to return a LIST for the data column, but because the data types don't match that I don't want to get back a STRING column. I want to get back a column that has a LIST of STRINGS in it. So the first entry would be null because {"A":0, "B": 1} is not something that can be coerced into a LIST. The second entry would be a list with two STRING value in it "1" and "0" because each of them can be coerced into a string.

CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected, result.tbl->get_column(0));
}

CUDF_TEST_PROGRAM_MAIN()
Loading