
Proposal for dbfs file access api with mocked test suite #3236

Open
wants to merge 2 commits into main

Conversation

danzafar (Contributor)

Changes

❗ This PR does not provide any new user-facing functionality, nor does it refactor existing code; it is simply intended as a sanity check on the approach before the comprehensive refactor of scan-tables-in-mounts.

Why this PR?

Refactoring scan-tables-in-mounts to improve performance will require parallelized crawling of dbfs:/mnt/ locations. @nfx suggested using the backend Hadoop libraries through py4j, reached via the SparkSession. Since this is a fairly big change, I wanted to make sure we are aligned on the backend file lister before I modify any existing files.

Enable dbfs file listing

This PR proposes a DbfsFiles class that uses the SparkSession's Java backend (via py4j) to leverage the following Hadoop classes for efficient dbfs:/ (and mount) file access:

  • org.apache.hadoop.fs.FileSystem
  • org.apache.hadoop.fs.Path

As of now, the only useful method is list_dir, but if the approach is confirmed I plan to add crawling capabilities leveraging databricks.labs.blueprint.parallel.
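The PR doesn't inline the DbfsFiles code, so here is a minimal sketch of the shape described above. The class and method names (DbfsFiles, list_dir) come from the PR; the Hadoop calls (FileSystem.get, listStatus, getPath) and the spark._jsc.hadoopConfiguration() access are my assumptions about the implementation. The usage below exercises it against a mocked JVM, in the same spirit as the PR's mocked test suite:

```python
from unittest.mock import MagicMock


class DbfsFiles:
    """Sketch: reach Hadoop's FileSystem/Path through the SparkSession's
    JVM gateway by plain attribute access, without importing py4j."""

    def __init__(self, spark):
        self._spark = spark
        self._jvm = spark._jvm  # the py4j gateway object Spark already holds

    def list_dir(self, path: str) -> list[str]:
        # Attribute chaining resolves JVM classes lazily; no java_import needed.
        file_system_class = self._jvm.org.apache.hadoop.fs.FileSystem
        hadoop_path = self._jvm.org.apache.hadoop.fs.Path(path)
        # Assumption: the Hadoop configuration is taken from the Java SparkContext.
        fs = file_system_class.get(self._spark._jsc.hadoopConfiguration())
        return [str(status.getPath()) for status in fs.listStatus(hadoop_path)]


# Exercise the class against a mocked JVM, no cluster required.
spark = MagicMock()
status = MagicMock()
status.getPath.return_value = "dbfs:/mnt/data/file.csv"
spark._jvm.org.apache.hadoop.fs.FileSystem.get.return_value.listStatus.return_value = [status]

files = DbfsFiles(spark).list_dir("dbfs:/mnt/data")
print(files)  # ['dbfs:/mnt/data/file.csv']
```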

Test strategy

A testing strategy was painstakingly created to:

  • mock the JVM functionality needed
  • clean up the currently messy test-suite pattern for the existing scan-tables-in-mounts. I created a simple trie-based mock file system which can be leveraged for scan-tables-in-mounts functionality after the refactor; this mock file system also has its own unit tests.
⚠️ WARNING
If you don't like my mock file system, it will probably hurt my feelings so please have a good reason!
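The PR's actual mock file system isn't shown here, so the following is only a guess at its shape: a trie where each directory is a nested dict and files are leaf markers. All names (MockFileSystem, add_file, list_dir) are hypothetical:

```python
class MockFileSystem:
    """Hypothetical trie-based in-memory file system for unit tests."""

    def __init__(self):
        # Each node maps a name to a child dict (directory) or None (file).
        self._root: dict = {}

    def add_file(self, path: str) -> None:
        *parents, leaf = path.strip("/").split("/")
        node = self._root
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = None  # None marks a file; a dict marks a directory

    def list_dir(self, path: str) -> list[str]:
        node = self._root
        for part in path.strip("/").split("/"):
            if part:
                node = node[part]
        return sorted(node)


fs = MockFileSystem()
fs.add_file("mnt/data/a.csv")
fs.add_file("mnt/data/b.csv")
fs.add_file("mnt/logs/run.log")
print(fs.list_dir("mnt"))       # ['data', 'logs']
print(fs.list_dir("mnt/data"))  # ['a.csv', 'b.csv']
```

A trie keeps listing a directory O(depth + entries) and makes it trivial to build arbitrary mount layouts per test case.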

Functionality

  • added relevant user documentation
  • added new CLI command
  • modified existing command: databricks labs ucx ...
  • added a new workflow
  • modified existing workflow: ...
  • added a new table
  • modified existing table: ...

Tests

  • manually tested
  • added unit tests
  • added integration tests
  • verified on staging environment (screenshot attached)

@nfx (Collaborator) left a comment


Don't directly depend on py4j; see the techniques below. Otherwise the approach is good.

pyproject.toml Outdated
```diff
@@ -49,7 +49,8 @@ dependencies = ["databricks-sdk~=0.30",
     "databricks-labs-blueprint>=0.9.1,<0.10",
     "PyYAML>=6.0.0,<7.0.0",
     "sqlglot>=25.5.0,<25.30",
-    "astroid>=3.3.1"]
+    "astroid>=3.3.1",
+    "py4j==0.10.9.7"]
```

we cannot add py4j as a dependency to UCX, as it would have to be packaged with its transitive dependencies to run the CLI.

```python
def _jvm(self):
    try:
        _jvm = self._spark._jvm
        self._java_import(_jvm, "org.apache.hadoop.fs.FileSystem")
```

Suggested change:

```diff
-        self._java_import(_jvm, "org.apache.hadoop.fs.FileSystem")
+        FileSystem = jvm.org.apache.hadoop.fs.FileSystem
```

use the same technique as here, where you don't have to depend directly on py4j:
https://github.com/databrickslabs/ucx/blame/main/src/databricks/labs/ucx/hive_metastore/locations.py#L397-L401
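To illustrate the suggested technique: py4j's java_import requires importing py4j itself, whereas plain attribute access on the gateway object Spark already exposes does not. The sketch below uses a MagicMock standing in for spark._jvm, since no JVM is available here:

```python
from unittest.mock import MagicMock

# With java_import, py4j becomes a direct dependency of the package:
#   from py4j.java_gateway import java_import
#   java_import(jvm, "org.apache.hadoop.fs.FileSystem")
#   fs_class = jvm.FileSystem
#
# With attribute chaining, py4j stays a transitive detail of pyspark:
jvm = MagicMock()  # stands in for spark._jvm
fs_class = jvm.org.apache.hadoop.fs.FileSystem
path_class = jvm.org.apache.hadoop.fs.Path
hadoop_path = path_class("dbfs:/mnt/data")
```

MagicMock memoizes child attributes, so repeated lookups return the same object, which is what makes this style easy to assert against in unit tests.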
