Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Storage] Azure bucket sub directory is ignored if the bucket previously exists #4706

Merged
merged 6 commits into from
Feb 14, 2025

Conversation

Michaelvll
Copy link
Collaborator

@Michaelvll Michaelvll commented Feb 13, 2025

This PR resolves two issues:

Issue 1: When there is an existing Azure bucket in SkyPilot storage database, the sub path will be ignored. This affects the smoke test pytest tests/test_smoke.py::test_managed_jobs_storage --azure.

To reproduce:

  1. set jobs.bucket: https://xxx.blob.core.windows.net/yyy that is previously created by SkyPilot in ~/.sky/config.yaml
  2. sky jobs launch -n test-azure-upload test.yaml
# A managed spot example with storages.
#
# Runs a task that uses cloud buckets for uploading and accessing files.
#
# Usage:
#   sky jobs launch -c spot-storage examples/managed_job_with_storage.yaml
#   sky down spot-storage

file_mounts:
  # File mounts for folder
  /tmp/workdir: ~/tmp-workdir


  # File mounts for file
  ~/tmpfile: ~/tmpfile
  ~/a/b/c/tmpfile: ~/tmpfile

run: |
  set -ex
  cat ~/tmpfile
  cat ~/a/b/c/tmpfile
  
  # Write to a file in the mounted bucket
  echo "hello world!" > /output_path/output.txt

The bucket will be polluted with in the root directory, and the job above fails with:

Detailed Error
E 02-13 01:43:29 subprocess_utils.py:149] Cannot perform sync due to error: trying to sync between different resource types (either file <-> directory or directory <-> file) which is not allowed.sync must happen between source and destination of the same type, e.g. either file <-> file or directory <-> directory.To make sure target is handled as a directory, add a trailing '/' to the target.
E 02-13 01:43:29 subprocess_utils.py:149] 

I 02-13 01:43:29 recovery_strategy.py:339] Failed to launch a cluster with error: sky.exceptions.CommandError: Command mkdir -p ~/tmpfile && azcopy --version > /dev/null 2>&1 || (mkdir -p /usr/local/bin; curl -L https:/... failed with return code 1.
I 02-13 01:43:29 recovery_strategy.py:339] Failed to run command before rsync https://sky635611d9a69285a3.blob.core.windows.net/skypilot-filemounts-vscode-11d9a692-ahpbb7v3/job-7xgk48n3/tmp-files/file-0 -> ~/tmpfile. Ensure that the network is stable, then retry. mkdir -p ~/tmpfile && azcopy --version > /dev/null 2>&1 || (mkdir -p /usr/local/bin; curl -L https://aka.ms/downloadazcopy-v10-linux -o azcopy.tar.gz; sudo tar -xvzf azcopy.tar.gz --strip-components=1 -C /usr/local/bin --exclude=*.txt; sudo chmod +x /usr/local/bin/azcopy; rm azcopy.tar.gz) && azcopy sync 'https://sky635611d9a69285a3.blob.core.windows.net/skypilot-filemounts-vscode-11d9a692-ahpbb7v3/job-7xgk48n3/tmp-files/file-0?se=2025-02-13T02%3A43%3A27Z&sp=rcwl&sv=2025-01-05&sr=c&sig=mr59%2B5SlTa72/5c%2BkM0BY6eiIqCx/uQOs3dUB%2BFnV1g%3D' ~/tmpfile/ --recursive --delete-destination=false See logs in ~/sky_logs/sky-2025-02-13-01-39-47-646407/file_mounts.log)
I 02-13 01:43:29 recovery_strategy.py:342]   Traceback: Traceback (most recent call last):
I 02-13 01:43:29 recovery_strategy.py:342]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/jobs/recovery_strategy.py", line 286, in _launch
I 02-13 01:43:29 recovery_strategy.py:342]     sky.launch(
I 02-13 01:43:29 recovery_strategy.py:342]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/utils/common_utils.py", line 385, in _record
I 02-13 01:43:29 recovery_strategy.py:342]     return f(*args, **kwargs)
I 02-13 01:43:29 recovery_strategy.py:342]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/utils/common_utils.py", line 385, in _record
I 02-13 01:43:29 recovery_strategy.py:342]     return f(*args, **kwargs)
I 02-13 01:43:29 recovery_strategy.py:342]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/execution.py", line 529, in launch
I 02-13 01:43:29 recovery_strategy.py:342]     return _execute(
I 02-13 01:43:29 recovery_strategy.py:342]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/execution.py", line 329, in _execute
I 02-13 01:43:29 recovery_strategy.py:342]     backend.sync_file_mounts(handle, task.file_mounts,
I 02-13 01:43:29 recovery_strategy.py:342]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/utils/common_utils.py", line 385, in _record
I 02-13 01:43:29 recovery_strategy.py:342]     return f(*args, **kwargs)
I 02-13 01:43:29 recovery_strategy.py:342]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/utils/common_utils.py", line 365, in _record
I 02-13 01:43:29 recovery_strategy.py:342]     return f(*args, **kwargs)
I 02-13 01:43:29 recovery_strategy.py:342]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/backends/backend.py", line 101, in sync_file_mounts
I 02-13 01:43:29 recovery_strategy.py:342]     return self._sync_file_mounts(handle, all_file_mounts, storage_mounts)
I 02-13 01:43:29 recovery_strategy.py:342]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/backends/cloud_vm_ray_backend.py", line 3207, in _sync_file_mounts
I 02-13 01:43:29 recovery_strategy.py:342]     self._execute_file_mounts(handle, all_file_mounts)
I 02-13 01:43:29 recovery_strategy.py:342]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/backends/cloud_vm_ray_backend.py", line 4790, in _execute_file_mounts
I 02-13 01:43:29 recovery_strategy.py:342]     backend_utils.parallel_data_transfer_to_nodes(
I 02-13 01:43:29 recovery_strategy.py:342]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/backends/backend_utils.py", line 1477, in parallel_data_transfer_to_nodes
I 02-13 01:43:29 recovery_strategy.py:342]     subprocess_utils.run_in_parallel(_sync_node, runners, num_threads)
I 02-13 01:43:29 recovery_strategy.py:342]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/utils/subprocess_utils.py", line 122, in run_in_parallel
I 02-13 01:43:29 recovery_strategy.py:342]     return [func(args[0])]
I 02-13 01:43:29 recovery_strategy.py:342]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/backends/backend_utils.py", line 1455, in _sync_node
I 02-13 01:43:29 recovery_strategy.py:342]     subprocess_utils.handle_returncode(rc,
I 02-13 01:43:29 recovery_strategy.py:342]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/utils/subprocess_utils.py", line 156, in handle_returncode
I 02-13 01:43:29 recovery_strategy.py:342]     raise exceptions.CommandError(returncode, command, format_err_msg,
I 02-13 01:43:29 recovery_strategy.py:342] sky.exceptions.CommandError: Command mkdir -p ~/tmpfile && azcopy --version > /dev/null 2>&1 || (mkdir -p /usr/local/bin; curl -L https:/... failed with return code 1.
I 02-13 01:43:29 recovery_strategy.py:342] Failed to run command before rsync https://sky635611d9a69285a3.blob.core.windows.net/skypilot-filemounts-vscode-11d9a692-ahpbb7v3/job-7xgk48n3/tmp-files/file-0 -> ~/tmpfile. Ensure that the network is stable, then retry. mkdir -p ~/tmpfile && azcopy --version > /dev/null 2>&1 || (mkdir -p /usr/local/bin; curl -L https://aka.ms/downloadazcopy-v10-linux -o azcopy.tar.gz; sudo tar -xvzf azcopy.tar.gz --strip-components=1 -C /usr/local/bin --exclude=*.txt; sudo chmod +x /usr/local/bin/azcopy; rm azcopy.tar.gz) && azcopy sync 'https://sky635611d9a69285a3.blob.core.windows.net/skypilot-filemounts-vscode-11d9a692-ahpbb7v3/job-7xgk48n3/tmp-files/file-0?se=2025-02-13T02%3A43%3A27Z&sp=rcwl&sv=2025-01-05&sr=c&sig=mr59%2B5SlTa72/5c%2BkM0BY6eiIqCx/uQOs3dUB%2BFnV1g%3D' ~/tmpfile/ --recursive --delete-destination=false See logs in ~/sky_logs/sky-2025-02-13-01-39-47-646407/file_mounts.log

Issue 2: During the file mounts, the storage upload does not include the sub directory, which can be a bit confusing:
previously

  Excluded files to sync to cluster based on .gitignore.
✓ Storage synced: ./examples -> https://xxxx.blob.core.windows.net/skypilot-filemounts-vscode-yyyy/  View logs at: ~/sky_logs/sky-2025-02-13-01-38-17-371662/storage_sync.log
  Excluded files to sync to cluster based on .gitignore.
✓ Storage synced: ~/tmp-workdir -> https://xxxx.blob.core.windows.net/skypilot-filemounts-vscode-yyyy/  View logs at: ~/sky_logs/sky-2025-02-13-01-38-30-519142/storage_sync.log

Now

  Workdir: './examples' -> storage: 'skypilot-filemounts-vscode-11d9a692-ahpbb7v3'.
Verifying bucket and uploading from source for storage skypilot-filemounts-vscode-11d9a692-ahpbb7v3
  Excluded files to sync to cluster based on .gitignore.
✓ Storage synced: ~/tmp-workdir -> https://xxxx.blob.core.windows.net/skypilot-filemounts-vscode-yyyy/job-ct66lqv4/local-file-mounts/0/  View logs at: ~/sky_logs/sky-2025-02-13-01-35-59-046482/storage_sync.log
  Folder : '~/tmp-workdir' -> storage: 'skypilot-filemounts-vscode-11d9a692-ahpbb7v3'.
Verifying bucket and uploading from source for storage skypilot-filemounts-vscode-11d9a692-ahpbb7v3
  Excluded files to sync to cluster based on .gitignore.
✓ Storage synced: /tmp/skypilot-filemounts-files-ct66lqv4 -> https://xxxx.blob.core.windows.net/skypilot-filemounts-vscode-yyyy/job-ct66lqv4/tmp-files/  View logs at: ~/sky_logs/sky-2025-02-13-01-36-12-515049/storage_sync.log

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@Michaelvll Michaelvll changed the title [Storage] Fix Azure bucket sub directory ignored issue [Storage] Azure bucket sub directory is ignored if the bucket previously exists Feb 13, 2025
Comment on lines 2319 to 2325
'sync_on_reconstruction', True),
# Backward compatibility
# TODO: remove the hasattr check after v0.11.0
_bucket_sub_path=override_args.get(
'_bucket_sub_path',
metadata._bucket_sub_path # pylint: disable=protected-access
) if hasattr(metadata, '_bucket_sub_path') else None)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the main part that resolves issue 1

Michaelvll added a commit that referenced this pull request Feb 13, 2025
Copy link
Collaborator

@zpoint zpoint left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we update test_managed_jobs_intermediate_storage to also support azure so that we can include this test case?

@Michaelvll
Copy link
Collaborator Author

Could we update test_managed_jobs_intermediate_storage to also support azure so that we can include this test case?

There seems a bunch of places that require aws specific CLIs. Could you help add the test for azure in a future PR @zpoint?

@zpoint
Copy link
Collaborator

zpoint commented Feb 13, 2025

Sure

Copy link
Collaborator

@zpoint zpoint left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Michaelvll

@@ -354,7 +354,8 @@ def from_metadata(cls, metadata: StoreMetadata, **override_args):
metadata.is_sky_managed),
sync_on_reconstruction=override_args.get('sync_on_reconstruction',
True),
# backward compatibility
# Backward compatibility
# TODO: remove the hasattr check after v0.11.0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we saying that backward compatibility is removed after v0.11.0?
For example, v0.6.0 won't be compatible with v0.11.0?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we have a convention that keep backward compatibilty for 3 major releases.

@Michaelvll
Copy link
Collaborator Author

/smoke-test --azure -k TestStorageWithCredentials

@Michaelvll
Copy link
Collaborator Author

/smoke-test --azure -k TestStorageWithCredentials

@Michaelvll
Copy link
Collaborator Author

/smoke-test --azure -k TestStorageWithCredentials

@Michaelvll
Copy link
Collaborator Author

/smoke-test --azure -k test_docker_storage_mounts

@Michaelvll
Copy link
Collaborator Author

cc'ing @romilbhardwaj for a quick look as well : )

@Michaelvll Michaelvll merged commit aa413c6 into master Feb 14, 2025
19 checks passed
@Michaelvll Michaelvll deleted the azure-bucket-path-fix branch February 14, 2025 01:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants