ORC-1403: ORC supports reading empty field name #1458
IllegalArgumentException e = assertThrows(IllegalArgumentException.class, () -> {
  TypeDescription.fromString("struct<``:int>");
});
assertTrue(e.getMessage().contains("Empty quoted field name at 'struct<``^:int>'"));
This test coverage means the check was an intended feature before.
I'd probably prefer to just ignore it instead of removing it.
Does Apache Hive support empty field names?
See apache/spark#35253 (comment). Before ORC-529 (1.6.0), ORC could read empty field names. Apache Hive does not support empty field names.
@cxzl25, sorry, but the existing way is the correct direction. Since Apache Hive (HiveFileFormat) doesn't support it, we should not remove the test coverage, testQuotedField2.
Apache Hive does not support empty field.
As you can see in https://issues.apache.org/jira/browse/SPARK-20901, the Apache Spark and ORC communities are working together to reduce the differences among different configurations.
Currently the Spark ORC data source does not check field names, while the Spark Hive ORC format does, so the Spark ORC data source is able to write an empty field name.
In addition, can we check the schema in the ORC writer? Otherwise the written data may not be readable.
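To make the writer-side check suggested above concrete, here is a minimal sketch of what such a validation could look like. The `Field` class and `requireNonEmptyNames` method are hypothetical stand-ins invented for illustration; they are not part of the ORC writer API, and a real check would walk ORC's own schema type instead.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a writer-side schema check; not ORC's actual API. */
final class WriterSchemaCheckSketch {

    /** Minimal stand-in for a schema field: a name plus nested child fields. */
    static final class Field {
        final String name;
        final List<Field> children = new ArrayList<>();

        Field(String name) {
            this.name = name;
        }

        /** Adds a child field and returns this field so calls can be chained. */
        Field add(Field child) {
            children.add(child);
            return this;
        }
    }

    /**
     * Walks the schema depth-first and throws on the first empty field name,
     * reporting the dotted path of the parent that contains it.
     */
    static void requireNonEmptyNames(Field field, String parentPath) {
        if (field.name.isEmpty()) {
            throw new IllegalArgumentException(
                "Empty field name under '" + parentPath + "'");
        }
        String fullPath = parentPath.isEmpty() ? field.name : parentPath + "." + field.name;
        for (Field child : field.children) {
            requireNonEmptyNames(child, fullPath);
        }
    }
}
```

Under this assumption, calling `requireNonEmptyNames(root, "")` before any data is written would reject the schema up front, so a file with an unreadable field name is never produced.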
Yes, we can do some of that. However, instead of switching behaviors back and forth in this layer, I believe we need to focus on your end goals in the higher layers. What is your end goal as a user? For example, did you hit the following? If so, do we have test coverage in the Apache Spark codebase across multiple data sources (Hive, ORC, Parquet)? Can we start from there?
We have encountered some problems. Some older versions of Spark use an older version of ORC to write data with empty field names, but reading that data with Spark 3.2 and ORC 1.6 fails. We have injected custom rules into Spark through So I was thinking, it would be better if this problem could be solved at the Spark or ORC level.
Based on the above scenario, in order to avoid some additional side effects, maybe we could skip the limitation by adding a new configuration?
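As a rough illustration of the configuration idea, the sketch below gates the empty-name check behind a flag. It is a simplified, self-contained parser for a single backquoted name, not ORC's actual `ParserUtils` code; the class name, method name, and `allowEmpty` parameter are all hypothetical, and the doubled-backquote escape rule is an assumption made for the example.

```java
/** Hypothetical sketch of a flag-gated empty-name check; not ORC's ParserUtils. */
final class QuotedNameSketch {

    /**
     * Parses a backquoted field name that starts at index 0 of the input.
     * A doubled backquote inside the quotes is treated as one literal backquote.
     * When allowEmpty is false, an empty quoted name is rejected.
     */
    static String parseQuotedName(String input, boolean allowEmpty) {
        if (input.isEmpty() || input.charAt(0) != '`') {
            throw new IllegalArgumentException("Missing opening quote in '" + input + "'");
        }
        StringBuilder name = new StringBuilder();
        boolean closed = false;
        int i = 1;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c != '`') {
                name.append(c);
                i++;
            } else if (i + 1 < input.length() && input.charAt(i + 1) == '`') {
                // Escaped backquote: consume both characters, keep one.
                name.append('`');
                i += 2;
            } else {
                // Closing quote reached.
                closed = true;
                break;
            }
        }
        if (!closed) {
            throw new IllegalArgumentException("Unmatched quote in '" + input + "'");
        }
        if (name.length() == 0 && !allowEmpty) {
            throw new IllegalArgumentException("Empty quoted field name in '" + input + "'");
        }
        return name.toString();
    }
}
```

With `allowEmpty` set to false the sketch keeps the strict behavior exercised by the existing test; setting it to true would let a reader accept files written with an empty field name, without changing the default.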
What changes were proposed in this pull request?
ParserUtils removes the empty check.
Why are the changes needed?
apache/spark#35253 (comment)
How was this patch tested?
Added a unit test.