
Proposal for dbfs file access api with mocked test suite #3236

Open
wants to merge 2 commits into main

Conversation

danzafar (Contributor)

Changes

❗ This PR does not provide any new user-facing functionality, nor does it refactor existing code; it is simply intended as a sanity check on the approach before the comprehensive refactor of scan-tables-in-mounts.

Why this PR?

Refactoring scan-tables-in-mounts to improve performance will require parallelized crawling of dbfs:/mnt/ locations. @nfx suggested using the backend Hadoop libraries through py4j, reached via the SparkSession. Since this is a fairly big change, I wanted to make sure we are aligned on the backend file lister before I modify any existing files.

Enable dbfs file listing

This PR proposes a DbfsFiles class that uses the SparkSession's Java backend (via py4j) to leverage the following Hadoop classes for efficient dbfs:/ (and mount) file access:

  • org.apache.hadoop.fs.FileSystem
  • org.apache.hadoop.fs.Path

As of now, the only useful method is list_dir, but if the approach is confirmed I plan to add crawling capabilities leveraging databricks.labs.blueprint.parallel.
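The PR doesn't inline the DbfsFiles code, so here is a minimal sketch of the shape described above. The class and method names (DbfsFiles, list_dir) come from the PR; the Hadoop calls (FileSystem.get, listStatus, getPath) and the spark._jsc.hadoopConfiguration() access are my assumptions about the implementation. The usage below exercises it against a mocked JVM, in the same spirit as the PR's mocked test suite:

```python
from unittest.mock import MagicMock


class DbfsFiles:
    """Sketch: reach Hadoop's FileSystem/Path through the SparkSession's
    JVM gateway by plain attribute access, without importing py4j."""

    def __init__(self, spark):
        self._spark = spark
        self._jvm = spark._jvm  # the py4j gateway object Spark already holds

    def list_dir(self, path: str) -> list[str]:
        # Attribute chaining resolves JVM classes lazily; no java_import needed.
        file_system_class = self._jvm.org.apache.hadoop.fs.FileSystem
        hadoop_path = self._jvm.org.apache.hadoop.fs.Path(path)
        # Assumption: the Hadoop configuration is taken from the Java SparkContext.
        fs = file_system_class.get(self._spark._jsc.hadoopConfiguration())
        return [str(status.getPath()) for status in fs.listStatus(hadoop_path)]


# Exercise the class against a mocked JVM, no cluster required.
spark = MagicMock()
status = MagicMock()
status.getPath.return_value = "dbfs:/mnt/data/file.csv"
spark._jvm.org.apache.hadoop.fs.FileSystem.get.return_value.listStatus.return_value = [status]

files = DbfsFiles(spark).list_dir("dbfs:/mnt/data")
print(files)  # ['dbfs:/mnt/data/file.csv']
```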

Test strategy

A testing strategy was painstakingly created to:

  • mock the JVM functionality needed
  • clean up the currently messy test-suite pattern for the existing scan-tables-in-mounts. I created a simple trie-based mock file system which can be leveraged for scan-tables-in-mounts functionality after the refactor; this mock file system also has its own unit tests.
⚠️ WARNING
If you don't like my mock file system, it will probably hurt my feelings so please have a good reason!
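The PR's actual mock file system isn't shown here, so the following is only a guess at its shape: a trie where each directory is a nested dict and files are leaf markers. All names (MockFileSystem, add_file, list_dir) are hypothetical:

```python
class MockFileSystem:
    """Hypothetical trie-based in-memory file system for unit tests."""

    def __init__(self):
        # Each node maps a name to a child dict (directory) or None (file).
        self._root: dict = {}

    def add_file(self, path: str) -> None:
        *parents, leaf = path.strip("/").split("/")
        node = self._root
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = None  # None marks a file; a dict marks a directory

    def list_dir(self, path: str) -> list[str]:
        node = self._root
        for part in path.strip("/").split("/"):
            if part:
                node = node[part]
        return sorted(node)


fs = MockFileSystem()
fs.add_file("mnt/data/a.csv")
fs.add_file("mnt/data/b.csv")
fs.add_file("mnt/logs/run.log")
print(fs.list_dir("mnt"))       # ['data', 'logs']
print(fs.list_dir("mnt/data"))  # ['a.csv', 'b.csv']
```

A trie keeps listing a directory O(depth + entries) and makes it trivial to build arbitrary mount layouts per test case.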

Functionality

  • added relevant user documentation
  • added new CLI command
  • modified existing command: databricks labs ucx ...
  • added a new workflow
  • modified existing workflow: ...
  • added a new table
  • modified existing table: ...

Tests

  • manually tested
  • added unit tests
  • added integration tests
  • verified on staging environment (screenshot attached)

@nfx (Collaborator) left a comment


Don't directly depend on py4j; see the techniques below. Otherwise the approach is good.

pyproject.toml Outdated
```diff
@@ -49,7 +49,8 @@ dependencies = ["databricks-sdk~=0.30",
     "databricks-labs-blueprint>=0.9.1,<0.10",
     "PyYAML>=6.0.0,<7.0.0",
     "sqlglot>=25.5.0,<25.30",
-    "astroid>=3.3.1"]
+    "astroid>=3.3.1",
+    "py4j==0.10.9.7"]
```

we cannot add py4j as a dependency to UCX, as it would have to be packaged with its transitive dependencies to run the CLI.

```python
def _jvm(self):
    try:
        _jvm = self._spark._jvm
        self._java_import(_jvm, "org.apache.hadoop.fs.FileSystem")
```

Suggested change:

```diff
-        self._java_import(_jvm, "org.apache.hadoop.fs.FileSystem")
+        FileSystem = jvm.org.apache.hadoop.fs.FileSystem
```

use the same technique as here, where you don't have to depend directly on py4j:
https://github.com/databrickslabs/ucx/blame/main/src/databricks/labs/ucx/hive_metastore/locations.py#L397-L401
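To illustrate the suggested technique: py4j's java_import requires importing py4j itself, whereas plain attribute access on the gateway object Spark already exposes does not. The sketch below uses a MagicMock standing in for spark._jvm, since no JVM is available here:

```python
from unittest.mock import MagicMock

# With java_import, py4j becomes a direct dependency of the package:
#   from py4j.java_gateway import java_import
#   java_import(jvm, "org.apache.hadoop.fs.FileSystem")
#   fs_class = jvm.FileSystem
#
# With attribute chaining, py4j stays a transitive detail of pyspark:
jvm = MagicMock()  # stands in for spark._jvm
fs_class = jvm.org.apache.hadoop.fs.FileSystem
path_class = jvm.org.apache.hadoop.fs.Path
hadoop_path = path_class("dbfs:/mnt/data")
```

MagicMock memoizes child attributes, so repeated lookups return the same object, which is what makes this style easy to assert against in unit tests.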
