-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add configurable normalization for configuration options and preserve case for S3 paths #13576
base: main
Are you sure you want to change the base?
Conversation
Actually, I think it should work differently - you should specify whether a value is case-sensitive per item. For example, an AWS password would be case-sensitive, but CREATE EXTERNAL TABLE test
STORED AS CSV
LOCATION 'data.csv'
OPTIONS ('has_header' 'TRUE');
Error parsing TRUE as bool
caused by
External error: provided string was not `true` or `false` Will change |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels the right direction.
Can you add tests showing why this matters?
Perhaps some tests with parsing non-lowercase s3 paths?
i agree but it doesn't have to be part of the SQL parser at all. alternatively, we could have some sort of option registry where options could have a type declared, so that parsing is consistent, but that would be a much bigger change |
a4490dd
to
7c2b3fe
Compare
When I see this PR, I remember a previous discussion: #11650. I have also tried some approaches for it earlier: #11330 (comment). Isn't it better to implement a future-proof and general solution, like the one I proposed, instead of toggling the configuration up and down? Do you have time to review the proposed approach and its intended behavior? Is it feasible to implement it without much effort? I have forgotten some of the implementation details, so perhaps you can provide a more reasonable perspective. |
hey @berkaysynnada, by future-proof solution do you mean synnada-ai@341b484#diff-2f697c310af1a48521829d8bd665cf64b6415fbf88edd370efa30b1ed686354dR1655? I agree that fine-grained control is better, but isn't it similar to how I implemented it? e.g. datafusion/datafusion/common/src/config.rs Line 442 in 06c013d
|
datafusion/common/src/config.rs
Outdated
@@ -212,11 +219,12 @@ config_namespace! { | |||
pub enable_ident_normalization: bool, default = true | |||
|
|||
/// When set to true, SQL parser will normalize options value (convert value to lowercase) | |||
pub enable_options_value_normalization: bool, default = true | |||
pub enable_options_value_normalization: bool, default = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the important change in this PR, along with the definition of transform
for option values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWI @xinlifoobar and @berkaysynnada who worked on #11330 where this feature was added. Interestingly, this option was added in response to seemingly the same problem (AWS access keys being normalized) as faced by @tinfoil-knight here:
Rather than changing the default config values, we could potentially change the default in datafusion cli and/or improve documentation as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that solution is good enough for the long-term.
I understand a global config was introduced to avoid changing behavior for everyone, but that clearly doesn't work when different parts of the system need different behavior. For example S3 component needs case-sensitivity whereas some other component, breaks when lowercasing is turned off.
The global config toggle solved the problem only for some use-cases and only for people knowing about the toggle. This PR aims to solve the problem the better way (IMO).
I would prefer we continue with the approach proposed here and later deprecate the global toggle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the linked issue and its PR, it seems what I am looking for is exactly what @ozankabak wanted in #11330 (comment)
We would like to have key-level control on what normalization means for config options, built-in and extension.
this PR addresses this for built-in options. A nice generic normalization for extension options is yet to be addressed.
@@ -444,7 +452,7 @@ config_namespace! { | |||
/// Valid values are: "none", "chunk", and "page" | |||
/// These values are not case sensitive. If NULL, uses |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now the These values are not case sensitive
can be removed
@@ -470,7 +478,7 @@ config_namespace! { | |||
/// delta_byte_array, rle_dictionary, and byte_stream_split. | |||
/// These values are not case sensitive. If NULL, uses |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These values are not case sensitive.
this suggests the code reading this value does normalization and that it's no longer needed, can be simplified
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually think this is a good info for users that you assign any values to it - "plain", "PLAIN", "Plain", etc. Also, I think that's what Oleks asked in #13576 (comment) - so maybe let's keep it as is?
this changes config option handling. |
Sorry, I have missed your reply. After looking more in detail, you did actually what I referred. We have talked with @ozankabak, and the most optimal solution is that we can accommodate this config setting such that it will enables/disables all kind of normalizations. For example, if normalization is disabled, then all given inputs are needed to be written as how they are exactly written. If it is enabled, then the hardcoded normalization functions become activated, and the inputs are normalized accordingly. With all of these,I think the default value should be true, "enable normalizations". |
I feel we should deprecate and remove the global config, in favor of per-value normalization. |
I know it will most probably be a super-rare use case, but assume we add a hardcoded normalization for an option, and user somehow needs to not normalize his input. Does he need to change the code? |
If the option has hard-coded normalization (eg lowercasing), it means there is no difference between certain values (eg those differing in case). Why would a user be concerned about that? |
If an option accepts a file name and is coded to normalize the input to snake_case (which applies to most data file naming conventions), but for a specific file type snake_case normalization is not preferable, the user might want to retain the original naming style. Can that be a valid example? I am not insisting that we should keep it, I am just playing devil's advocate 😅 |
I like this example, thank you @berkaysynnada. |
The config option is basically an escape hatch for users in case they find themselves in a situation where the normalization we enforce is inapplicable in their (possibly edge) case. |
Feels like the right move is to add a deprecation to that option, see if we get any requests or issues, and if not, just remove it completely. I also don’t think this option should exist. Each value is either case-sensitive or not - we know that for sure for each value. If we’re missing some edge case, we should fix it instead of normalizing the value. Normalizing might solve one problem but could also break other stuff, like aws integration. If you’re good with that, @ozankabak @berkaysynnada, I’ll deprecate the option |
I am OK with removing it. We can add it later if it turns out to be necessary |
|
97590fe
to
66c3cd5
Compare
Which issue does this PR close?
Closes #11650
Related to #13456
Rationale for this change
Surprisingly, even the simplest S3 example doesn't work 😱
That's because s3 keys are case-sensitive, but now we always lowercase them.
I see you had the same issue for HuggingFace so I really think the default behaviour should be NOT normalizing.
What changes are included in this PR?
Change the default value, added transformation for some fields where it makes sense (enums, bools, etc)
Are these changes tested?
I updated the tests
Are there any user-facing changes?
Yes, changed the default value