dvc commit is slow when there are many stages #10629

Open
bric-afisher opened this issue Nov 23, 2024 · 0 comments · May be fixed by #10630
bric-afisher commented Nov 23, 2024

Description

My team has been running into a problem as we've increased the size of our dataset. We've built an internal tool that uses the dvc DAG, together with dvc matrix, to parallelize the different stages and substages. After running these stages, we use dvc commit to update the lock file. However, because the lock file is large, this commit step takes a very long time. Looking at it more closely, my hypothesis is that the time required for dvc commit grows quadratically with the number of stages. For example, in the reproduction below, the commit step takes longer than the repro step.

The problem seems to be that dvc opens the dvc.lock file at least twice per stage: once to update the stage objects, and once to update the lockfile. As far as I understand, the first operation involves calling:

repo.commit -> stage.save -> stage.get_versioned_outs -> stage.reload -> dvcfile._reset

and the second involves:

repo.commit -> stage.dump -> ProjectFile.dump -> Lockfile.dump

When the lock file is large, the cost of these steps adds up.
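To make the scaling concrete, here's a toy cost model (my own sketch, not DVC code) under the assumption that each per-stage commit re-reads and re-writes the full N-stage lockfile:

```python
# Hypothetical cost model: count lockfile entries touched during commit,
# assuming stage.reload and Lockfile.dump each process all N entries.

def per_stage_io(n_stages: int) -> int:
    # One full reload plus one full dump, repeated once per stage
    # -> O(n^2) entries processed in total.
    return n_stages * (2 * n_stages)

def batched_io(n_stages: int) -> int:
    # One read at the start and one write at the end -> O(n).
    return 2 * n_stages

print(per_stage_io(343), batched_io(343))  # 235298 686
```

Under this model, doubling the stage count quadruples the per-stage total but only doubles the batched one, which matches the slowdown we're seeing.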

Possible solutions?

It would be amazing if dvc supported multiple lock files per pipeline, as this would make the files faster to load and also make it easier to scale the size of the dataset, possibly in a streaming way. Alternatively, it would be nice if dvc commit performed these two operations in batches rather than once per stage.

I'm working on a pull request for this "batch solution" that I'm hoping to submit later today, and I'd be curious to hear your thoughts on it, or whether there are better approaches!
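In pseudocode, the batch idea looks roughly like this (all names here are illustrative, not DVC's actual internals; I'm also using JSON purely to keep the sketch dependency-free, whereas dvc.lock is really YAML):

```python
# Sketch: collect every stage's lock entry in memory first, then
# serialize the lockfile exactly once instead of once per stage.
import json

def commit_all(stages: dict, lockfile_path: str = "dvc.lock") -> None:
    lock = {"schema": "2.0", "stages": {}}
    for name, entry in stages.items():
        # 'entry' stands in for the hashed outs a stage.save() would produce.
        lock["stages"][name] = entry
    with open(lockfile_path, "w") as f:
        json.dump(lock, f, indent=2)  # single write for the whole pipeline
```

The point is simply that the read-modify-write cycle moves outside the per-stage loop.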

[update: PR here]

Reproduce

Make a dvc project with a data folder and a dvc.yaml file containing:

```yaml
stages:
  stepA:
    matrix:
      x: [1,2,3,4,5,6,7]
      y: [1,2,3,4,5,6,7]
      z: [1,2,3,4,5,6,7]
    cmd: echo "substep${item.x}-${item.y}-${item.z}" > data/substep${item.x}-${item.y}-${item.z}.txt
    outs:
      - data/substep${item.x}-${item.y}-${item.z}.txt
```
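For reference, that matrix expands to 7 × 7 × 7 = 343 stage instances, which you can sanity-check with a quick script:

```python
# Enumerate the substep names the matrix above generates.
stages = [
    f"substep{x}-{y}-{z}"
    for x in range(1, 8)
    for y in range(1, 8)
    for z in range(1, 8)
]
print(len(stages))  # 343
```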

Run `dvc repro --no-commit`, which takes about 2.5 minutes.
Run `dvc commit -f`, which takes about 4 minutes!

Of course, you can scale this example up or down.

With the "batch solution" that I'm hoping form a pull request for today, the commit time goes down to ~15 seconds.

Related issues

#755

#7607

Environment

Output of dvc doctor:

```
DVC version: 0.1.dev9360+gfb24775
---------------------------------
Platform: Python 3.10.9 on Linux-5.10.227-219.884.amzn2.x86_64-x86_64-with-glibc2.31
Subprojects:
        dvc_data = 3.16.7
        dvc_objects = 5.1.0
        dvc_render = 1.0.2
        dvc_task = 0.3.0
        scmrepo = 3.3.8
Supports:
        http (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.5, aiohttp-retry = 2.8.3)
Config:
        Global: /home/fishea10/.config/dvc
        System: /etc/xdg/dvc
Cache types: hardlink, symlink
Cache directory: lustre on 10.175.175.16@tcp:/eriwzbev/home/fishea10
Caches: local
Remotes: None
Workspace directory: lustre on 10.175.175.16@tcp:/eriwzbev/home/fishea10
Repo: dvc (subdir), git
Repo.site_cache_dir: /var/tmp/dvc/repo/d8250502cb0524d2bb958c32788698bf
```
@bric-afisher bric-afisher linked a pull request Nov 23, 2024 that will close this issue