Add --recursive option to fs ls command #513
base: main
Conversation
Thanks for your PR, this looks useful.
Two general comments:
- It seems to me that recursive implies absolute, or at least relative to the specified path. As is, it would allow for recursively showing the basename only, which doesn't make a lot of sense. Can you make the behavior of non-absolute recursive listing show the relative paths? (See the sketch after this list.)
- Please add a couple of tests for this behavior.
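For illustration, one way the requested relative-path behavior could look. This is a standalone sketch using `posixpath.relpath` from the standard library, not the PR's code, and the paths are made up:

```python
import posixpath

# Given absolute DBFS paths found during recursion, print them
# relative to the path the user asked to list:
listed_path = 'dbfs:/logs'
found = ['dbfs:/logs/driver/stdout', 'dbfs:/logs/executor-0/stderr']
for p in found:
    print(posixpath.relpath(p, listed_path))
# driver/stdout
# executor-0/stderr
```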
databricks_cli/dbfs/cli.py (outdated)

    if f.is_dir:
        recursive_echo(this_dbfs_path.join(f.basename))

    recursive_echo(dbfs_path) if recursive else echo_path(dbfs_path)
This modifies existing behavior to no longer list files in the specified path, but just the path itself.
Hi @pietern, thank you for your helpful feedback! I believe I fixed this in the latest commit:
databricks-cli/databricks_cli/dbfs/api.py, line 94 in 3330faf:

    def _recursive_list(self, **kwargs):
databricks_cli/dbfs/cli.py (outdated)

    def echo_path(files):
        table = tabulate([f.to_row(is_long_form=l, is_absolute=absolute) for f in files],
                         tablefmt='plain')
What happens to `tabulate` if the max width of these files is different?
@pietern thanks again. I believe `tabulate` will only be called once now after the latest commit, so the max file width should be a single value.
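For context on the concern: `tabulate` pads columns to the widest cell within a single call, so calling it once per directory would misalign output across directories, while a single call over all rows uses one shared width. A minimal standalone illustration (not the PR's code):

```python
from tabulate import tabulate

# Two separate calls pad their columns independently, so the
# output across calls does not line up:
print(tabulate([['logs/short', 10]], tablefmt='plain'))
print(tabulate([['logs/a-much-longer-name', 20]], tablefmt='plain'))

# One call over all rows pads every column to a single max width:
print(tabulate([['logs/short', 10],
                ['logs/a-much-longer-name', 20]], tablefmt='plain'))
```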
databricks_cli/dbfs/cli.py (outdated)

    for f in files:
        if f.is_dir:
            recursive_echo(this_dbfs_path.join(f.basename))
Do we have a rough estimate of how many more requests this would generate? Do we have customers who are already doing this manually, so that we are just giving them a shortcut?
Hi @stormwindy, thanks for your review. I truly have no idea how many more requests it will generate; it's totally dependent on the customer's file structure. It will be one extra request for each directory in the path and each of their subdirectories (e.g., a path containing 10 directories with 5 subdirectories each adds 60 extra requests).
I see. What are the use cases for this, at least from your perspective? Unless there is a crazy deep folder structure, this shouldn't cause any problems.
Getting the file paths of all worker logs in a cluster: #512
Overall, the idea to add this option is good on our end. I am also happy with the overall implementation direction. I have left one comment about the code. After that, @pietern can decide if/when to approve the PR. Thanks a lot for the contribution.
fix Dbfs files assignment
@justinTM Thanks for addressing my earlier comment. You refactored a bunch of the PR and I think there are correctness issues. Please have a look.
    paths = self.client.list_files(*args, **kwargs)
    files = [p for p in paths if not p.is_dir]
    for p in paths:
        files = files + self._recursive_list(p) if p.is_dir else files
I had to read this a couple of times to understand what you're doing. I think it should be simplified to iterate over `paths` just once and then have an `if p.is_dir` in there, dealing with both files and directories. (See the sketch below.)
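For illustration, a minimal sketch of the single-pass shape being suggested, which also forwards `**kwargs` (including `headers`) to the nested calls, as raised in the next comment. It assumes, as the PR's own snippet does, that `self.client.list_files` accepts these arguments and that each entry exposes `is_dir` and `dbfs_path`; this is a sketch, not the PR's final code:

```python
def _recursive_list(self, dbfs_path, **kwargs):
    # One pass over the listing: collect files directly, recurse into
    # directories. kwargs (e.g. headers) are forwarded so nested calls
    # behave the same as the top-level list_files call.
    files = []
    for p in self.client.list_files(dbfs_path, **kwargs):
        if p.is_dir:
            files.extend(self._recursive_list(p.dbfs_path, **kwargs))
        else:
            files.append(p)
    return files
```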
Separate comment: this also needs to pass along the `headers` from the `**kwargs` to be consistent with the `list_files` API today. Otherwise you use them on the first call but not on later calls.
    assert len(files) == 2
    assert TEST_FILE_INFO0 == files[0]
    assert TEST_FILE_INFO1 == files[1]
There doesn't seem to be a test for a real recursive call. The variable `TEST_FILE_JSON2` is unused. When you add a test that uses it and does a real recursive call, I expect it to fail for the reason I commented on above: the mismatch between `FileInfo` and `DbfsPath`. (A sketch of such a test follows.)
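For illustration, a hypothetical shape such a test might take. `DbfsApi` and `_recursive_list` come from the PR's `databricks_cli/dbfs/api.py`; the use of mocks, the stubbed construction, and the call signature below are assumptions, not the PR's code, and a mock-based test like this would not by itself surface the `FileInfo`/`DbfsPath` mismatch noted above:

```python
from unittest import mock

from databricks_cli.dbfs.api import DbfsApi


def test_recursive_list_descends_into_directories():
    # Skip __init__ so no real API client is needed (an assumption
    # about how to stub the class, not something the PR does).
    api = DbfsApi.__new__(DbfsApi)
    api.client = mock.MagicMock()

    directory = mock.MagicMock(is_dir=True)
    nested_file = mock.MagicMock(is_dir=False)
    # First call lists the requested path; second lists the directory.
    api.client.list_files.side_effect = [[directory], [nested_file]]

    files = api._recursive_list('dbfs:/root')

    assert files == [nested_file]
    assert api.client.list_files.call_count == 2
```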
No description provided.