Proposal for DBFS file access API with mocked test suite #3236
base: main
Conversation
Don't directly depend on py4j; see the techniques below. Otherwise the approach is good.
pyproject.toml (outdated)
```diff
@@ -49,7 +49,8 @@ dependencies = ["databricks-sdk~=0.30",
     "databricks-labs-blueprint>=0.9.1,<0.10",
     "PyYAML>=6.0.0,<7.0.0",
     "sqlglot>=25.5.0,<25.30",
-    "astroid>=3.3.1"]
+    "astroid>=3.3.1",
+    "py4j==0.10.9.7"]
```
We cannot add py4j as a dependency to UCX, as it would then have to be packaged along with its transitive dependencies to run the CLI.
```python
def _jvm(self):
    try:
        _jvm = self._spark._jvm
        self._java_import(_jvm, "org.apache.hadoop.fs.FileSystem")
```
Suggested change:

```diff
-        self._java_import(_jvm, "org.apache.hadoop.fs.FileSystem")
+        FileSystem = jvm.org.apache.hadoop.fs.FileSystem
```
Use the same technique as here, where you don't have to depend directly on py4j:
https://github.com/databrickslabs/ucx/blame/main/src/databricks/labs/ucx/hive_metastore/locations.py#L397-L401
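For reference, a minimal sketch of that attribute-traversal idiom (not a copy of the linked code), assuming a live SparkSession named `spark` on a Databricks cluster; `dbfs:/` is just an example path:

```python
# JVM classes resolve lazily via attribute access on spark._jvm,
# so no direct py4j import (and no java_import call) is needed.
# Assumes `spark` is the active SparkSession on a Databricks cluster.
FileSystem = spark._jvm.org.apache.hadoop.fs.FileSystem
Path = spark._jvm.org.apache.hadoop.fs.Path

hadoop_path = Path("dbfs:/")
fs = FileSystem.get(hadoop_path.toUri(), spark._jsc.hadoopConfiguration())
children = [str(status.getPath()) for status in fs.listStatus(hadoop_path)]
```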
Changes
scan-tables-in-mounts.

Why this PR?
Refactoring scan-tables-in-mounts to enhance performance will need to leverage parallelized crawling of dbfs:/mnt/ locations. @nfx suggested using backend Hadoop libraries through py4j for this, via the SparkSession. Since this is a fairly big change, I wanted to make sure we are aligned on the backend file lister before I proceed to modify any existing files.

Enable dbfs file listing
This PR proposes the DbfsFiles class, which uses the SparkSession's Java backend with py4j to leverage the following Hadoop libraries for efficient dbfs:/ (and mount) file access:
- org.apache.hadoop.fs.FileSystem
- org.apache.hadoop.fs.Path

As of now, the only useful method is list_dir, but if the approach is confirmed I plan to add crawling capabilities leveraging databricks.labs.blueprint.parallel.
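The description names the class and method but doesn't quote the code, so here is a hedged sketch of what DbfsFiles.list_dir could look like with those two Hadoop classes, written in the attribute-traversal style the reviewer asked for; apart from the DbfsFiles and list_dir names, every detail below is an assumption:

```python
from pyspark.sql import SparkSession


class DbfsFiles:
    """Hypothetical sketch; the PR's actual implementation is not quoted here."""

    def __init__(self, spark: SparkSession):
        self._spark = spark
        self._jvm = spark._jvm  # py4j gateway already exposed by the session

    def list_dir(self, path: str) -> list[str]:
        """List the immediate children of a dbfs:/ (or mount) directory."""
        # Both Hadoop classes resolve via attribute access, no java_import.
        hadoop_path = self._jvm.org.apache.hadoop.fs.Path(path)
        fs = self._jvm.org.apache.hadoop.fs.FileSystem.get(
            hadoop_path.toUri(), self._spark._jsc.hadoopConfiguration()
        )
        return [str(status.getPath()) for status in fs.listStatus(hadoop_path)]
```

On a cluster this would be constructed as `DbfsFiles(spark)` and queried with `list_dir("dbfs:/mnt/some-mount")`.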
Test strategy
A testing strategy was painstakingly created for scan-tables-in-mounts: I created a simple trie-based mock file system which can be leveraged for the scan-tables-in-mounts functionality after the refactor. This mock file system also has its own unit tests.
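The mock isn't quoted in the PR either, so the following is only an illustrative sketch of one way a trie-based mock file system could look; MockFileSystem, add_file, and list_dir are assumed names, not the PR's actual API:

```python
class MockFileSystem:
    """Trie-backed stand-in for dbfs:/ listings in unit tests (illustrative)."""

    def __init__(self):
        self._root: dict = {}  # dict nodes are directories; None marks a file

    def add_file(self, path: str) -> None:
        *parents, name = path.strip("/").split("/")
        node = self._root
        for part in parents:
            node = node.setdefault(part, {})  # create directories on the way
        node[name] = None

    def list_dir(self, path: str) -> list[str]:
        node = self._root
        for part in (p for p in path.strip("/").split("/") if p):
            node = node[part]  # KeyError doubles as "path not found"
        return sorted(node)


# Example: fake a mount tree without a workspace, then assert on listings.
fs = MockFileSystem()
fs.add_file("mnt/sales/table1/part-000.parquet")
assert fs.list_dir("mnt/sales") == ["table1"]
assert fs.list_dir("mnt/sales/table1") == ["part-000.parquet"]
```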
Functionality
databricks labs ucx ...
...
...
Tests