Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move DynamicFileCatalog back to core #10745

Closed
wants to merge 21 commits into from

Conversation

goldmedal
Copy link
Contributor

Which issue does this PR close?

No specific issue but mentioned here #4850 (comment)

Rationale for this change

Discussed with @alamb here. #10619 (comment)

What changes are included in this PR?

Move DynamicFileCatalog and object storage-related features from CLI to the DataFusion core.

Are these changes tested?

No, just move original tests to core.

Are there any user-facing changes?

No

@github-actions github-actions bot added the core Core DataFusion crate label May 31, 2024
Comment on lines 95 to 96
aws-config = "0.101.0"
aws-credential-types = "0.101.0"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upgrade aws package to fix the audit failed message:

Crate:     rustls
Version:   0.20.9
error: 1 vulnerability found!
Title:     `rustls::ConnectionCommon::complete_io` could fall into an infinite loop based on network input
Date:      2024-04-19
ID:        RUSTSEC-2024-0336
URL:       https://rustsec.org/advisories/RUSTSEC-2024-0336
Severity:  7.5 (high)
Solution:  Upgrade to >=0.23.5 OR >=0.22.4, <0.23.0 OR >=0.21.[11](https://github.com/apache/datafusion/actions/runs/9321626823/job/25661037574?pr=10745#step:4:12), <0.22.0
Dependency tree:
rustls 0.20.9
├── tokio-rustls 0.23.4
│   └── hyper-rustls 0.23.2
│       └── aws-smithy-client 0.55.3
│           ├── aws-types 0.55.3
│           │   ├── aws-sig-auth 0.55.3
│           │   │   ├── aws-sdk-sts 0.28.0
│           │   │   │   └── aws-config 0.55.3
│           │   │   │       └── datafusion 38.0.0
│           │   │   │           ├── datafusion-wasmtest 38.0.0
│           │   │   │           ├── datafusion-substrait 38.0.0
│           │   │   │           ├── datafusion-sqllogictest 38.0.0
│           │   │   │           ├── datafusion-proto 38.0.0
│           │   │   │           │   └── datafusion-benchmarks 38.0.0
│           │   │   │           ├── datafusion-examples 38.0.0
│           │   │   │           ├── datafusion-docs-tests 38.0.0
│           │   │   │           └── datafusion-benchmarks 38.0.0
│           │   │   └── aws-sdk-sso 0.28.0
│           │   │       └── aws-config 0.55.3
│           │   ├── aws-sdk-sts 0.28.0
│           │   ├── aws-sdk-sso 0.28.0
│           │   ├── aws-http 0.55.3
│           │   │   ├── aws-sdk-sts 0.28.0
│           │   │   ├── aws-sdk-sso 0.28.0
│           │   │   └── aws-config 0.55.3
│           │   ├── aws-endpoint 0.55.3
│           │   │   ├── aws-sdk-sts 0.28.0
│           │   │   └── aws-sdk-sso 0.28.0
│           │   └── aws-config 0.55.3
│           ├── aws-sdk-sts 0.28.0
│           ├── aws-sdk-sso 0.28.0
│           └── aws-config 0.55.3
├── hyper-rustls 0.[23](https://github.com/apache/datafusion/actions/runs/9321626823/job/25661037574?pr=10745#step:4:24).2
└── aws-smithy-client 0.55.3

I'm not sure which version is better. Just pick up the latest version for version 0.xxx.xx.

@goldmedal goldmedal marked this pull request as draft May 31, 2024 17:49
@goldmedal goldmedal force-pushed the chore/move-dyfile-to-core branch from 8c76b92 to a3fb390 Compare June 1, 2024 05:10
@goldmedal
Copy link
Contributor Author

Hi @alamb, I encountered a dependency issue with wasm-build.
https://github.com/apache/datafusion/actions/runs/9327817555/job/25678224652?pr=10745

After some research, I found that some people encountered a similar error message, and someone mentioned it could be an issue with Tokio's limited support for WASM. I found that the failed crate, mio, is used by tokio and aws-config. datafusion-cli enables more of Tokio's features than core. I think I might have moved too many objects to core. I might need to keep aws, gcp, or other object-store-related things in cli. I'll keep trying.

However, this also means that this feature will have limited support for WASM.

Do you have any ideas about this issue? Thanks.

@goldmedal goldmedal force-pushed the chore/move-dyfile-to-core branch from 0417f89 to 9811979 Compare June 2, 2024 15:07
@github-actions github-actions bot removed the substrait label Jun 2, 2024
@goldmedal goldmedal force-pushed the chore/move-dyfile-to-core branch from e492aa3 to 4654397 Compare June 2, 2024 15:21
@goldmedal
Copy link
Contributor Author

goldmedal commented Jun 2, 2024

To solve the WASM building issue, I disabled the object_store related feature for WASM. There is one remaining issue with macOS that I don't know how to fix.

GitHub Actions Log:

libunwind: stepWithCompactEncoding - invalid compact unwind encoding
error: test failed, to rerun pass `-p datafusion --lib`

Caused by:
  process didn't exit successfully: `/Users/runner/work/datafusion/datafusion/target/debug/deps/datafusion-e14cd1eb6c13d6be` (signal: 6, SIGABRT: process abort signal)

After some research, it seems to be a known issue with Xcode 15. Refer to:

I'm not sure if it's related to the macos-latest GitHub runner. I can't reproduce it in my local.

I have tried the following:

  • Upgrading Rust to version 1.78.
  • Upgrading AWS-related crates to the latest version.
  • Rolling back AWS-related crates to version 0.55.3.
  • Upgrade dirs to the latest version.

None of these solutions have worked.

@goldmedal goldmedal force-pushed the chore/move-dyfile-to-core branch from 090558d to efa9bac Compare June 2, 2024 16:23
@goldmedal goldmedal marked this pull request as ready for review June 2, 2024 16:55
@goldmedal goldmedal marked this pull request as draft June 3, 2024 15:33
aws-config = "0.101.0"
aws-credential-types = "0.101.0"
# the default features will cause libunwind error when testing on macos paltform.
object_store = { version = "0.9.1", default-features = false, features = ["aws", "gcp", "http"] }
Copy link
Contributor Author

@goldmedal goldmedal Jun 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I don't know which default feature causes the error on macOS. I just used trial and error many times, and eventually, it worked. I noticed that we disabled the default features in the root Cargo.toml by #8095, but I can't find the reason. Could someone explain this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to be problematic -- I don't think we can add these dependencies to the core datafusion crate without massively increasing the dependency footprint.

The idea is that the core of datafusion can be used from many places (including wasm) that don't want the (very large) additional dependency of the aws sdk. Having the dependencies in the DataFusion CLI is fine as it is the standalone app (rather than a library)

@goldmedal goldmedal marked this pull request as ready for review June 3, 2024 16:25
@goldmedal
Copy link
Contributor Author

goldmedal commented Jun 3, 2024

I'm not sure if it's related to the macos-latest GitHub runner. I can't reproduce it in my local.

I have checked that the macos-latest GitHub runner has been updated to Xcode 15.4 after actions/runner-images#9842. So, it may not be an Xcode issue.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @goldmedal -- it is very impressive you got this to build 🙏

I think in general, I don't think we should add these new dependencies into the core, I think we should split out the features of the dynamic file catalog (which knows how to treat strings as table names and look them up as tables) and the registering / interpreting of URLs

I took a look around in the code, and it seems like one challenge is that the catalog traits (like CatalogProvider are still in the core crate and thus any new catalog implementation would have to be there.

I think it may be time to break out the catalog code into its own crate and enforce the boundaries a bit more. I wrote up this idea here: #10782

Let me know what you think

aws-config = "0.101.0"
aws-credential-types = "0.101.0"
# the default features will cause libunwind error when testing on macos paltform.
object_store = { version = "0.9.1", default-features = false, features = ["aws", "gcp", "http"] }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to be problematic -- I don't think we can add these dependencies to the core datafusion crate without massively increasing the dependency footprint.

The idea is that the core of datafusion can be used from many places (including wasm) that don't want the (very large) additional dependency of the aws sdk. Having the dependencies in the DataFusion CLI is fine as it is the standalone app (rather than a library)

@goldmedal
Copy link
Contributor Author

Thanks @alamb
I agree with this approach. The dependency of the core should be simpler. Therefore, I think I can wait for the discussion in #10782 to plan the next steps for this issue.

@alamb alamb marked this pull request as draft June 6, 2024 20:52
@alamb
Copy link
Contributor

alamb commented Jun 6, 2024

Converting to draft as we work on separating out the catalog a bit more

@goldmedal
Copy link
Contributor Author

Hi @alamb, is there any update on this topic? Or maybe I can help by splitting out some traits into the new crate datafusion-catalog first?

@alamb
Copy link
Contributor

alamb commented Jun 18, 2024

Hi @alamb, is there any update on this topic? Or maybe I can help by splitting out some traits into the new crate datafusion-catalog first?

Hi @goldmedal -- sorry for the lack of update. From my perspective, it is not a good idea to add any more dependencies (even optional) to the core datafusion crate.

After thinking about this some more, I think there is a difference between:

  1. The logic that interprets table names as potential object store locations
  2. The logic that handles urls like s3://... by automatically registering object store factories

Putting the first in the datafusion core makes sense. I think the second comes with many dependencies and thus doesn't make sense.

Thus, what do you think about somehow splitting the two pieces of functionalty apart? Maybe via a trait like

struct DynamicTableProvider {
...
  /// A callback function that is 
  url_lookup: Arc<dyn UrlLookup>
}

/// Trait for looking up the correct object store instance based on url
pub trait UrlLookup{
  fn lookup(&self, url: &Url) -> Result<Arc<dyn ObjectStore>>;
}

Then we could put DynamicTableProvider in the core (with a default handler that knew how to look up file:// urls, for example) but leave the s3 specicific stuff in datafusion-cli

Thank you for working on this PR / IDE

@goldmedal
Copy link
Contributor Author

Then we could put DynamicTableProvider in the core (with a default handler that knew how to look up file:// urls, for example) but leave the s3 specicific stuff in datafusion-cli

I think it makes sense to me. As far as I know, DuckDB also only supports reading local files by default. You need to install an additional extension, httpfs, to enable the feature for object storages. I'm not sure if there is any similar feature (installing an extension or table function, then query dynamically) in DataFusion, but I can implement DynamicTableProvider first.

@goldmedal
Copy link
Contributor Author

I think we should close this PR to keep the discussions. Then, I'll create another PR for DynamicTableProvider implementation. WDYT @alamb?

@alamb
Copy link
Contributor

alamb commented Jun 18, 2024

. I'm not sure if there is any similar feature (installing an extension or table function, then query dynamically) in DataFusion, but I can implement DynamicTableProvider first.

I think this is what https://docs.rs/datafusion/latest/datafusion/execution/context/struct.SessionContext.html#method.register_object_store is used for

And access to the remote store is hanlded via and ObjectStore

@alamb
Copy link
Contributor

alamb commented Jun 18, 2024

I think we should close this PR to keep the discussions. Then, I'll create another PR for DynamicTableProvider implementation. WDYT @alamb?

I think this is a good idea

Given how this project has grown, I think it might also be valuable to file a ticket describing the high level idea too -- are you able to do so? I can do so too, but likely won't have a chance until tomorrow

@goldmedal
Copy link
Contributor Author

Given how this project has grown, I think it might also be valuable to file a ticket describing the high level idea too -- are you able to do so? I can do so too, but likely won't have a chance until tomorrow

Sure, I can file a ticket for it.

@goldmedal
Copy link
Contributor Author

I have filed #10986. Let's move the follow up discussion to it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants