dvc commit is slow when there are many stages #10629

Open
bric-afisher opened this issue Nov 23, 2024 · 0 comments · May be fixed by #10630
bric-afisher commented Nov 23, 2024

Description

My team has been running into a problem as we've increased the size of our dataset. We've built an internal tool that uses the dvc DAG, together with dvc matrix, to parallelize the different stages and substages. After running these stages, we use dvc commit to update the lock file. However, because the lock file is large, this commit step takes a very long time. Looking at it more closely, my hypothesis is that the time required for dvc commit grows quadratically with the number of stages. For example, in the reproduction below, the commit step takes longer than the repro step.

The problem seems to be that dvc opens the dvc.lock file at least twice per stage: once to update the stage objects, and once to update the lockfile. As far as I understand, the first operation involves calling:

repo.commit -> stage.save -> stage.get_versioned_outs -> stage.reload -> dvcfile._reset

and the second involves:

repo.commit -> stage.dump -> ProjectFile.dump -> Lockfile.dump

When the lock file is large, the cost of these steps adds up.
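To make the scaling concrete, here's a toy cost model (my own sketch, not DVC code) under the assumption that each per-stage commit re-reads and re-writes the full N-stage lockfile:

```python
# Hypothetical cost model: count lockfile entries touched during commit,
# assuming stage.reload and Lockfile.dump each process all N entries.

def per_stage_io(n_stages: int) -> int:
    # One full reload plus one full dump, repeated once per stage
    # -> O(n^2) entries processed in total.
    return n_stages * (2 * n_stages)

def batched_io(n_stages: int) -> int:
    # One read at the start and one write at the end -> O(n).
    return 2 * n_stages

print(per_stage_io(343), batched_io(343))  # 235298 686
```

Under this model, doubling the stage count quadruples the per-stage total but only doubles the batched one, which matches the slowdown we're seeing.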

Possible solutions?

It would be amazing if dvc supported multiple lock files per pipeline, as this would make the files faster to load and also make it easier to scale the size of the dataset, possibly in a streaming way. Alternatively, it would be nice if dvc commit performed these two operations in batches rather than once per stage.

I'm working on a pull request for this "batch solution" that I'm hoping to submit later today, and I'd be curious to hear your thoughts on it, or whether there are better approaches!
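In pseudocode, the batch idea looks roughly like this (all names here are illustrative, not DVC's actual internals; I'm also using JSON purely to keep the sketch dependency-free, whereas dvc.lock is really YAML):

```python
# Sketch: collect every stage's lock entry in memory first, then
# serialize the lockfile exactly once instead of once per stage.
import json

def commit_all(stages: dict, lockfile_path: str = "dvc.lock") -> None:
    lock = {"schema": "2.0", "stages": {}}
    for name, entry in stages.items():
        # 'entry' stands in for the hashed outs a stage.save() would produce.
        lock["stages"][name] = entry
    with open(lockfile_path, "w") as f:
        json.dump(lock, f, indent=2)  # single write for the whole pipeline
```

The point is simply that the read-modify-write cycle moves outside the per-stage loop.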

[update: PR here]

Reproduce

Make a dvc project with a data folder and a dvc.yaml file containing:

```yaml
stages:
  stepA:
    matrix:
      x: [1,2,3,4,5,6,7]
      y: [1,2,3,4,5,6,7]
      z: [1,2,3,4,5,6,7]
    cmd: echo "substep${item.x}-${item.y}-${item.z}" > data/substep${item.x}-${item.y}-${item.z}.txt
    outs:
      - data/substep${item.x}-${item.y}-${item.z}.txt
```
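For reference, that matrix expands to 7 × 7 × 7 = 343 stage instances, which you can sanity-check with a quick script:

```python
# Enumerate the substep names the matrix above generates.
stages = [
    f"substep{x}-{y}-{z}"
    for x in range(1, 8)
    for y in range(1, 8)
    for z in range(1, 8)
]
print(len(stages))  # 343
```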

Run `dvc repro --no-commit`, which takes about 2.5 minutes.
Run `dvc commit -f`, which takes about 4 minutes!

Of course, you can scale this example up or down.

With the "batch solution" that I'm hoping form a pull request for today, the commit time goes down to ~15 seconds.

Related issues

#755

#7607

Environment

Output of dvc doctor:

```
DVC version: 0.1.dev9360+gfb24775
---------------------------------
Platform: Python 3.10.9 on Linux-5.10.227-219.884.amzn2.x86_64-x86_64-with-glibc2.31
Subprojects:
        dvc_data = 3.16.7
        dvc_objects = 5.1.0
        dvc_render = 1.0.2
        dvc_task = 0.3.0
        scmrepo = 3.3.8
Supports:
        http (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.5, aiohttp-retry = 2.8.3)
Config:
        Global: /home/fishea10/.config/dvc
        System: /etc/xdg/dvc
Cache types: hardlink, symlink
Cache directory: lustre on 10.175.175.16@tcp:/eriwzbev/home/fishea10
Caches: local
Remotes: None
Workspace directory: lustre on 10.175.175.16@tcp:/eriwzbev/home/fishea10
Repo: dvc (subdir), git
Repo.site_cache_dir: /var/tmp/dvc/repo/d8250502cb0524d2bb958c32788698bf
```
@bric-afisher bric-afisher linked a pull request Nov 23, 2024 that will close this issue