Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

batch dump and batch update of old stages #10630

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

bric-afisher
Copy link

@bric-afisher bric-afisher commented Nov 23, 2024

Description

Rather than running stage.dump separately for each stage, which involves loading the dvc.lock file many times, this proposed code groups stages together by their dvcfiles and does a dvcfile.batch_dump (see the stages_by_dvcfile object).

Also, rather than getting the old version of each stage within get_versioned_outs, this proposed code gets all old versions in one step, stores them in a old_stages object, and then passes each old stage to stage.save. This similarly removes a step where the lock file is reloaded once per stage.

My understanding is that these changes improve the speed of dvc commit when the dvc.lock file is very large and slow to load, without causing any changes or problems regarding the function's behavior. That said, I am not 100%, especially when it comes to the get_versioned_outs part which I have less context for. I would appreciate feedback about what this could be missing!

Related issues

Fixes #10629, although it seems like it would be even better to have separate lock files.

Checklist

  • ❗❗ (WITH CAVEAT BELOW!!!) ❗❗ I have followed the Contributing to DVC checklist.
    • I am not a python expert so I apologize if I missed something! I did not add a test for this change because I did not see an obvious unit test to update. However, I did run this version with the reproducible example in dvc commit is slow when there are many stages #10629 and saw that it reduced commit time as described there.
    • I ran pytest in the the development environment and there were 6 tests that failed that seem to be related to some kind of API. These tests also failed when I ran pytest while on the main branch without any changes. So it seems like there was something about the developer environment that I did not set up correctly, rather than these being caused by the changes I made. I'm not sure though. Might you have any insight as to why these fail?
FAILED tests/func/test_daemon.py::test_analytics - subprocess.CalledProcessError: Command '['/home/fishea10/dvc/venv/bin/python3', '/home/fishea10/dvc/dvc/__main__.py', 'config', '-l', '-vv']' returned non-zero exit status 1.

FAILED tests/func/test_daemon.py::test_updater - subprocess.CalledProcessError: Command '['/home/fishea10/dvc/venv/bin/python3', '/home/fishea10/dvc/dvc/__main__.py', 'version', '-vv']' returned non-zero exit status 1.

FAILED tests/integration/test_studio_live_experiments.py::test_post_to_studio[None-False-True] - AssertionError: assert 'mytoken' == 'STUDIO_TOKEN'

FAILED tests/integration/test_studio_live_experiments.py::test_post_to_studio[None-False-False] - AssertionError: assert 'mytoken' == 'STUDIO_TOKEN'

FAILED tests/integration/test_studio_live_experiments.py::test_post_to_studio[DVC_EXP_GIT_REMOTE-False-True] - AssertionError: assert 'mytoken' == 'STUDIO_TOKEN'

FAILED tests/integration/test_studio_live_experiments.py::test_post_to_studio[DVC_EXP_GIT_REMOTE-False-False] - AssertionError: assert 'mytoken' == 'STUDIO_TOKEN'
  • 📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
    • NA. There is no change from the user's perspective.

Copy link

codecov bot commented Nov 23, 2024

Codecov Report

Attention: Patch coverage is 90.16393% with 6 lines in your changes missing coverage. Please review.

Project coverage is 91.00%. Comparing base (2431ec6) to head (aedc2fa).
Report is 5 commits behind head on main.

Files with missing lines Patch % Lines
dvc/dvcfile.py 85.18% 1 Missing and 3 partials ⚠️
dvc/repo/commit.py 91.66% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10630      +/-   ##
==========================================
+ Coverage   90.68%   91.00%   +0.32%     
==========================================
  Files         504      504              
  Lines       39795    39835      +40     
  Branches     3141     3154      +13     
==========================================
+ Hits        36087    36253     +166     
+ Misses       3042     2949      -93     
+ Partials      666      633      -33     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


🚨 Try these New Features:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

dvc commit is slow when there are many stages
1 participant