Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add configurable normalization for configuration options and preserve case for S3 paths #13576

Open
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

blaginin
Copy link
Contributor

@blaginin blaginin commented Nov 26, 2024

Which issue does this PR close?

Closes #11650
Related to #13456

Rationale for this change

Surprisingly, even the simplest S3 example doesn't work 😱

> CREATE EXTERNAL TABLE test
STORED AS CSV
OPTIONS(
    'aws.access_key_id' '************',
    'aws.secret_access_key' '**********',
    'aws.region' 'eu-west-1'
)
LOCATION 's3://blaginin/playground/titanic.csv';


Object Store error: The operation lacked the necessary privileges to complete for path playground/titanic.csv: Client error with status 403 Forbidden: No Body

That's because s3 keys are case-sensitive, but now we always lowercase them.

I see you had the same issue for HuggingFace so I really think the default behaviour should be NOT normalizing.

What changes are included in this PR?

Change the default value, added transformation for some fields where it makes sense (enums, bools, etc)

Are these changes tested?

I updated the tests

Are there any user-facing changes?

Yes, changed the default value

@github-actions github-actions bot added sql SQL Planner common Related to common crate labels Nov 26, 2024
@blaginin blaginin changed the title Fix S3 in CLI: Do not normalize option values Fix S3 in CLI: Do not normalize options values Nov 26, 2024
@github-actions github-actions bot added documentation Improvements or additions to documentation sqllogictest SQL Logic Tests (.slt) labels Nov 26, 2024
@blaginin blaginin marked this pull request as ready for review November 28, 2024 12:53
@blaginin blaginin marked this pull request as draft November 28, 2024 12:57
@blaginin
Copy link
Contributor Author

blaginin commented Dec 1, 2024

Actually, I think it should work differently - you should specify whether a value is case-sensitive per item. For example, an AWS password would be case-sensitive, but has_header would be case-insensitive. Otherwise this doesn't work:

CREATE EXTERNAL TABLE test
STORED AS CSV
LOCATION 'data.csv'
OPTIONS ('has_header' 'TRUE');

Error parsing TRUE as bool
caused by
External error: provided string was not `true` or `false`

Will change

Copy link
Member

@findepi findepi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels the right direction.
Can you add tests showing why this matters?
Perhaps some tests with parsing non-lowercase s3 paths?

@findepi
Copy link
Member

findepi commented Dec 2, 2024

you should specify whether a value is case-sensitive per item.

i agree

but it doesn't have to be part of the SQL parser at all.
It can simply be applied when pulling values from the option map (the moment when strings are interpreted into something meaningful)

alternatively, we could have some sort of option registry where options could have a type declared, so that parsing is consistent, but that would be a much bigger change

@github-actions github-actions bot added the core Core DataFusion crate label Dec 3, 2024
@blaginin blaginin force-pushed the bugfix/do-not-normalize-values branch from a4490dd to 7c2b3fe Compare December 4, 2024 23:39
@blaginin blaginin marked this pull request as ready for review December 4, 2024 23:49
@blaginin blaginin requested a review from findepi December 6, 2024 13:28
@berkaysynnada
Copy link
Contributor

When I see this PR, I remember a previous discussion: #11650. I have also tried some approaches for it earlier: #11330 (comment).

Isn't it better to implement a future-proof and general solution, like the one I proposed, instead of toggling the configuration up and down?

Do you have time to review the proposed approach and its intended behavior? Is it feasible to implement it without much effort? I have forgotten some of the implementation details, so perhaps you can provide a more reasonable perspective.

@blaginin
Copy link
Contributor Author

blaginin commented Dec 10, 2024

hey @berkaysynnada, by future-proof solution do you mean synnada-ai@341b484#diff-2f697c310af1a48521829d8bd665cf64b6415fbf88edd370efa30b1ed686354dR1655? I agree that fine-grained control is better, but isn't it similar to how I implemented it? e.g.

pub compression: Option<String>, transform = str::to_lowercase, default = Some("zstd(3)".into())

@blaginin blaginin requested a review from findepi December 10, 2024 17:59
@findepi findepi changed the title Fix S3 in CLI: Do not normalize options values Add configurable normalization for configuration options and preserve case for S3 paths Dec 11, 2024
@@ -212,11 +219,12 @@ config_namespace! {
pub enable_ident_normalization: bool, default = true

/// When set to true, SQL parser will normalize options value (convert value to lowercase)
pub enable_options_value_normalization: bool, default = true
pub enable_options_value_normalization: bool, default = false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the important change in this PR, along with the definition of transform for option values.

cc @alamb @comphead @jayzhan211

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWI @xinlifoobar and @berkaysynnada who worked on #11330 where this feature was added. Interestingly, this option was added in response to seemingly the same problem (AWS access keys being normalized) as faced by @tinfoil-knight here:

Rather than changing the default config values, we could potentially change the default in datafusion cli and/or improve documentation as well

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that solution is good enough for the long-term.
I understand a global config was introduced to avoid changing behavior for everyone, but that clearly doesn't work when different parts of the system need different behavior. For example S3 component needs case-sensitivity whereas some other component, breaks when lowercasing is turned off.
The global config toggle solved the problem only for some use-cases and only for people knowing about the toggle. This PR aims to solve the problem the better way (IMO).
I would prefer we continue with the approach proposed here and later deprecate the global toggle.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the linked issue and its PR, it seems what I am looking for is exactly what @ozankabak wanted in #11330 (comment)

We would like to have key-level control on what normalization means for config options, built-in and extension.

this PR addresses this for built-in options. A nice generic normalization for extension options is yet to be addressed.

datafusion/common/src/config.rs Outdated Show resolved Hide resolved
@@ -444,7 +452,7 @@ config_namespace! {
/// Valid values are: "none", "chunk", and "page"
/// These values are not case sensitive. If NULL, uses
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now the These values are not case sensitive can be removed

@@ -470,7 +478,7 @@ config_namespace! {
/// delta_byte_array, rle_dictionary, and byte_stream_split.
/// These values are not case sensitive. If NULL, uses
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These values are not case sensitive.

this suggests the code reading this value does normalization and that it's no longer needed, can be simplified

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually think this is a good info for users that you assign any values to it - "plain", "PLAIN", "Plain", etc. Also, I think that's what Oleks asked in #13576 (comment) - so maybe let's keep it as is?

datafusion/common/src/config.rs Outdated Show resolved Hide resolved
datafusion/common/src/config.rs Outdated Show resolved Hide resolved
datafusion/core/tests/config_from_env.rs Outdated Show resolved Hide resolved
@findepi
Copy link
Member

findepi commented Dec 11, 2024

this changes config option handling.
cc @alamb @jayzhan211 @comphead @crepererum @jonahgao if you want to skim the PR

@comphead
Copy link
Contributor

Thanks @blaginin and @findepi for the review.

@berkaysynnada
Copy link
Contributor

berkaysynnada commented Dec 12, 2024

hey @berkaysynnada, by future-proof solution do you mean synnada-ai@341b484#diff-2f697c310af1a48521829d8bd665cf64b6415fbf88edd370efa30b1ed686354dR1655? I agree that fine-grained control is better, but isn't it similar to how I implemented it? e.g.

pub compression: Option<String>, transform = str::to_lowercase, default = Some("zstd(3)".into())

Sorry, I have missed your reply. After looking more in detail, you did actually what I referred.
I just wonder can we remove that config setting at all 🤔 because now every option can modify the input as intended, and the users should write knowing that behavior -- for example an option is case sensitive, you would know that and write accordingly. But if it's not a specific option and just takes some elements from a set, and then you would know they are all normalized. In which scenarios does that config setting get a meaning?

We have talked with @ozankabak, and the most optimal solution is that we can accommodate this config setting such that it will enables/disables all kind of normalizations. For example, if normalization is disabled, then all given inputs are needed to be written as how they are exactly written. If it is enabled, then the hardcoded normalization functions become activated, and the inputs are normalized accordingly. With all of these,I think the default value should be true, "enable normalizations".

@findepi
Copy link
Member

findepi commented Dec 12, 2024

I feel we should deprecate and remove the global config, in favor of per-value normalization.
@berkaysynnada What's the use-case to have per-value normalization and a global config?

@berkaysynnada
Copy link
Contributor

I feel we should deprecate and remove the global config, in favor of per-value normalization. @berkaysynnada What's the use-case to have per-value normalization and a global config?

I know it will most probably be a super-rare use case, but assume we add a hardcoded normalization for an option, and user somehow needs to not normalize his input. Does he need to change the code?

@findepi
Copy link
Member

findepi commented Dec 12, 2024

If the option has hard-coded normalization (eg lowercasing), it means there is no difference between certain values (eg those differing in case). Why would a user be concerned about that?
It feels you didn't describe a use-case fully, or I didn't quite understand it.

@berkaysynnada
Copy link
Contributor

berkaysynnada commented Dec 12, 2024

If the option has hard-coded normalization (eg lowercasing), it means there is no difference between certain values (eg those differing in case). Why would a user be concerned about that? It feels you didn't describe a use-case fully, or I didn't quite understand it.

If an option accepts a file name and is coded to normalize the input to snake_case (which applies to most data file naming conventions), but for a specific file type snake_case normalization is not preferable, the user might want to retain the original naming style. Can that be a valid example?

I am not insisting that we should keep it, I am just playing devil's advocate 😅

@findepi
Copy link
Member

findepi commented Dec 12, 2024

I like this example, thank you @berkaysynnada.
If an option accepts a file name (eg to write data to), would we want to normalize it? Maybe we just shouldn't.

@ozankabak
Copy link
Contributor

The config option is basically an escape hatch for users in case they find themselves in a situation where the normalization we enforce is inapplicable in their (possibly edge) case.

@blaginin
Copy link
Contributor Author

blaginin commented Dec 12, 2024

Feels like the right move is to add a deprecation to that option, see if we get any requests or issues, and if not, just remove it completely.

I also don’t think this option should exist. Each value is either case-sensitive or not - we know that for sure for each value. If we’re missing some edge case, we should fix it instead of normalizing the value. Normalizing might solve one problem but could also break other stuff, like aws integration.

If you’re good with that, @ozankabak @berkaysynnada, I’ll deprecate the option

@ozankabak
Copy link
Contributor

I am OK with removing it. We can add it later if it turns out to be necessary

@blaginin
Copy link
Contributor Author

> set datafusion.sql_parser.enable_options_value_normalization to true;
[2024-12-13T12:44:10Z WARN  datafusion_common::config] `enable_options_value_normalization` is deprecated and ignored

@blaginin blaginin force-pushed the bugfix/do-not-normalize-values branch from 97590fe to 66c3cd5 Compare December 13, 2024 14:18
@blaginin blaginin requested a review from comphead December 13, 2024 17:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate core Core DataFusion crate documentation Improvements or additions to documentation sql SQL Planner sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support per-option value normalization
6 participants