Description
My team has been having a problem as we've been increasing the size of our dataset. We've created an internal tool that uses the dvc DAG, along with dvc matrix, to parallelize the different stages and substages. After running these stages, we use dvc commit to update the lock file. However, because the lock file is large, this commit step takes a very long time. Looking at it more closely, my hypothesis is that the time required for dvc commit grows quadratically in the number of stages: each of the N stages re-reads and re-writes a lock file whose size is itself proportional to N. In the example below, the commit step takes longer than the repro step.
The problem seems to be that dvc opens the dvc.lock file at least twice per stage: once to update the stage objects, and once to update the lockfile. As far as I understand, the first operation involves calling:
repo.commit -> stage.save -> stage.get_versioned_outs -> stage.reload -> dvcfile._reset
and the second involves:
repo.commit -> stage.dump -> ProjectFile.dump -> Lockfile.dump
When the lock file is large, the cost of these steps adds up.
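To make the cost concrete, here is a rough sketch of what such a dvc.lock looks like in our setup; the stage names, paths, hashes, and sizes below are all invented for illustration. There is one entry per substage, each carrying hashes and sizes for its deps and outs, so the file grows with the number of stages, and re-reading and re-serializing it once per stage is what gets expensive.

```yaml
schema: '2.0'
stages:
  process@0:
    cmd: python process.py --part 0
    deps:
    - path: data/part_0
      md5: 3b1c5f0a9d2e4c6b8a7f1e0d9c8b7a65.dir
      size: 104857600
      nfiles: 1000
    outs:
    - path: out/part_0
      md5: 9f8e7d6c5b4a39281706f5e4d3c2b1a0.dir
      size: 104857600
      nfiles: 1000
  process@1:
    cmd: python process.py --part 1
    deps:
    - path: data/part_1
      md5: 0a1b2c3d4e5f60718293a4b5c6d7e8f9.dir
      size: 104857600
      nfiles: 1000
    outs:
    - path: out/part_1
      md5: f0e1d2c3b4a5968778695a4b3c2d1e0f.dir
      size: 104857600
      nfiles: 1000
  # ...and so on, one entry per matrix substage
```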
Possible solutions?
It would be amazing if dvc supported multiple lock files per pipeline, as this would make the files faster to load and also make it easier to scale the size of the dataset used, possibly in a streaming way. Alternatively, it would be nice if dvc commit did these two operations in batches rather than once per stage.
I'm working on a pull request for this "batch solution" that I'm hoping to submit later today, and would be curious to hear your thoughts on it or if there are better ways!
[Update: PR here]
Reproduce
1. Make a dvc project with a data folder and a dvc.yaml file containing many matrix substages (a sketch of such a file is shown after this list).
2. Run dvc repro --no-commit, which takes about 2.5 minutes.
3. Run dvc commit -f, which takes about 4 minutes!

Of course, you can scale this example up or down.
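The dvc.yaml from the original benchmark isn't reproduced here; the sketch below only illustrates the shape of such a file, assuming a single matrix stage (the stage name process, the script process.py, and the data/ and out/ paths are all made up):

```yaml
stages:
  process:
    matrix:
      part: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]  # scale this list up to stress dvc commit
    cmd: python process.py --part ${item.part}
    deps:
      - process.py
      - data/part_${item.part}
    outs:
      - out/part_${item.part}
```

Each matrix item expands into its own substage with its own entry in dvc.lock, so growing the part list into the hundreds or thousands is what makes the per-stage commit cost visible.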
With the "batch solution" that I'm hoping form a pull request for today, the commit time goes down to ~15 seconds.
Related issues
#755
#7607
Environment
Output of dvc doctor: