
Faster index building #9813

Closed · johnyaku opened this issue Aug 7, 2023 · 27 comments

Labels: A: data-management (Related to dvc add/checkout/commit/move/remove), p1-important (Important, aka current backlog of things to do), performance (improvement over resource / time consuming tasks)

@johnyaku

johnyaku commented Aug 7, 2023

We have several large DVC-controlled projects. One example is over 10TB. The "raw" input data includes about 3000 large files ranging in size from 5GB to 100GB or more. Building an index takes 10 hours or more.

If this were a one-time cost, it might be acceptable. But the index needs to be rebuilt for each instance (clone) of the project, and for each compute node working on the project, even though the shared external cache is on a shared file system. We have a workaround for the multi-node problem, but the paths in cache.db are absolute rather than relative to the project top, so we are unable to recycle indices across project instances.

  • Using relative paths would make indices recyclable across instances.
  • Parallelization across multiple cores would also greatly speed up initial indexing.
  • The fastest fix would be to simply trust the checksums implied by the filenames in the cache, as I believe is already done for remotes.

This bottleneck is threatening to become a show-stopper. I understand from @efiop that a fix is under development, but we'd really appreciate it if this could become a priority.

Tagging @dlroden

daavoo added the performance and A: data-management labels on Aug 7, 2023
@skshetry (Member)

Hi, @johnyaku.

What commands are you using? Also, I would appreciate some profiling data to see what is slow.
Also, I am curious what the performance was like before the index was introduced.

@johnyaku (Author)

What commands are you using?

Any command that requires checksum calculations for indexing. This includes dvc add, dvc commit, dvc import etc.

The most frustrating part involves importing data that is already in an external shared cache, with filenames already based on checksums. Having to spend another 10 hours recalculating these checksums feels like a waste of time. I understand that we might want to verify that the content of the files matches the filenames, but the files in the external shared cache are read only and so the content is unlikely to have changed, barring disk corruption. It would be nice to be able to trust the filenames and construct an index quickly.

I would appreciate some profiling data to see what is slow.

Not sure exactly what you are looking for here, but happy to help.

I am curious what the performance was like before the index was introduced.

Well, dvc add etc. was much faster (no indexing required) but dvc data status was much slower (no index). We are OK with the idea of using an index, but the practical realities of creating indices are causing some of the team to question our adoption of DVC.

@efiop (Contributor)

efiop commented Aug 17, 2023

For the record, here is the original thread https://discord.com/channels/485586884165107732/563406153334128681/1133619509656301628

The index discussed here mostly means having it with the cache so that it can be shared in a shared-cache scenario instead of every user having their own instance. This would also be useful for other applications (e.g. in a remote, so that we can download the index and use it instead of polling the remote for every file).

I don't think this issue is actionable on its own, as it depends on a lot of other stuff, after which we can dive deeper into related topics. E.g. #9333 needs to be completed, so that we can throw out the old state db and use a new index db that is more suitable for sharing.

@skshetry (Member)

skshetry commented Aug 18, 2023

@johnyaku, one tip: if dvc add is being slow, you can now selectively update the dataset, which should be a bit faster if the changes are small/limited.

e.g.: if the dataset is data, you can do dvc add data/subpath to only update from subpath.

@johnyaku (Author)

Thanks @skshetry. Can I clarify that last point with a concrete example?

Suppose I have already added a directory with subdirectories, as follows, and this is tracked via a.dvc:

└── a
    ├── b
    └── c

Then I change the content of a/c.

I'm interpreting your comment as meaning that I can just dvc add a/c rather than having to match the a.dvc file and run dvc add a. Is this correct?

I can see how that might speed things up a bit, so thanks! :)

However, our bigger problem is due to the multi-node nature of our work environment. Our hacks to mirror the index dbs sometimes work, but not always. And we need to build a new index for each user working on a dataset, and even for each instance of the dataset owned by the same user. This would be fine if it only took a few minutes, but with many large files this can take 10 hours or more.

So we desperately need either

  1. faster index building, or
  2. portable indices

Faster index building could include:

  1. trusting the external shared cache
  2. parallel checksums
  3. skipping verification

By 1) trusting the cache I mean trusting the checksum implied by the filename in the cache, rather than recalculating. I think something like this is already done for remotes?

For 2) there is already a config option core.checksum_jobs that I believe parallelizes checksum calculations during push and pull operations. It would be great if it could also apply to index building operations, whether they are triggered by dvc add, dvc data status or whatever. We can potentially throw up to 144 cores at this, so that would reduce 10 hours to under five minutes. (I understand that calculating the checksum for an individual file can't be parallelized, but calculating checksums for hundreds of files can -- by assigning a separate core to each file.)
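
As a rough illustration of what I mean by per-file fan-out (this is not a claim about DVC's internals, just a sketch using the standard library; the data directory and job count are placeholders):

```python
# Hypothetical sketch: hash many files in parallel, one worker per file.
# A single file's checksum is still computed serially, but hundreds of
# large files can be hashed concurrently.
import hashlib
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def md5sum(path: Path, chunk_size: int = 8 * 1024 * 1024) -> tuple[str, str]:
    """Return (path, md5 hex digest), reading the file in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return str(path), digest.hexdigest()


def hash_files(paths: list[Path], jobs: int = 16) -> dict[str, str]:
    # One process per file, up to `jobs` at a time (value is illustrative).
    with ProcessPoolExecutor(max_workers=jobs) as pool:
        return dict(pool.map(md5sum, paths))


if __name__ == "__main__":
    files = [p for p in Path("data").rglob("*") if p.is_file()]  # placeholder dir
    print(hash_files(files, jobs=16))
```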

For 3), while having an index does deliver a noticeable performance boost later, sometimes we just want to get on with life. We typically use dvc install to set up git hooks, but having to wait hours for dvc data status to run in order to complete a git commit is causing many of our team to either disable the pre-commit hook or skip it with --no-verify. I understand that there are risks in simply trusting the cache, but there are also risks in disabling these checks ...

Turning to portable indices, I can think of multiple ways to skin that particular cat but it would be great to have a ready-made solution that allowed indices to be recycled across machines, users and instances. I understand that the index needs to be on a local disk for performance reasons, but the /var directories on our on-prem nodes are regularly deleted by the sys admins and so we need a more permanent store (perhaps in the cache/remote or even in git itself) that can be rapidly "reflated" to the site_cache_dir on demand. Ideally, this would be taken care of by the dvc commands that reference the index rather than being something that we would have to remember to do ourselves first (as is the case with our current hacks).

Inspecting the sqlite files I noticed that the key column uses absolute paths, which gets in the way of portability between instances. Switching to relative paths (relative to the top of the project) might be a prerequisite for portability.

Finally, all of these headaches are coming close to being a showstopper for our use of DVC, and are making folks reluctant to migrate to v3.0+. We are currently using v2.58.2. If a solution for these indexing issues is implemented, it would be super helpful to have a patch for v2 so that we can regain the team's confidence before dealing with the challenges of migrating to the new version.

@johnyaku (Author)

I'm sure that a complete fix for this will require quite a bit of planning and testing.

But progress on any of the points above would greatly improve our situation.

@efiop @skshetry

skshetry added this to DVC on Aug 29, 2023
github-project-automation bot moved this to Backlog in DVC on Aug 29, 2023
skshetry removed this from DVC on Aug 29, 2023
skshetry added this to DVC on Aug 29, 2023
github-project-automation bot moved this to Backlog in DVC on Aug 29, 2023
@skshetry (Member)

skshetry commented Aug 29, 2023

I'm interpreting your comment as meaning that I can just dvc add a/c rather than having to match the a.dvc file and run dvc add a. Is this correct?

Yes, that's what I meant. :)

I understand that the index needs to be on a local disk for performance reasons, but the /var directories on our on-prem nodes are regularly deleted by the sys admins and so we need a more permanent store.

Are you aware of core.site_cache_dir config?

For 2) there is already a config option core.checksum_jobs

core.checksum_jobs is a no-op (even in push/pull IIRC) after iterative/dvc-data#53, as it had high overhead with small files. Related: #3177.

Btw, do you mind sharing cProfile data for dvc add? You can generate it using the following command:

dvc add --cprofile-dump add.prof <...args>
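
If it helps, the dump should be loadable with Python's built-in pstats module (assuming it is a standard cProfile stats file), e.g.:

```python
# Minimal sketch for inspecting the dumped profile, assuming it is a
# standard cProfile stats file named add.prof.
import pstats

stats = pstats.Stats("add.prof")
stats.sort_stats("cumulative").print_stats(20)  # top 20 entries by cumulative time
```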

@johnyaku (Author)

Are you aware of core.site_cache_dir config?

Yes. We actually set core.site_cache_dir to /tmp/dvc but unfortunately /tmp is not much safer from the sys admins than /var/tmp. Both are small local volumes used by many users and wiped frequently.

core.checksum_jobs is a no-op

That is unfortunate. Although we lack admin rights over our compute environments, we do have a lot of compute resources that we could throw at this if there was a way to utilise them. I echo a lot of @ivartz's sentiments in #3177 even though our datasets have very different file number/file size profiles.

add.prof attached via Discord for a modestly sized directory that I just added. Many of our datasets are orders of magnitude larger than this.

@skshetry (Member)

skshetry commented Aug 29, 2023

Looking at the profiling data, most of the time is spent hashing here. On a total runtime of 1505s, 75% of the time was spent hashing 475 files (2.3s/file) and the rest is spent transferring those files to cache.

Regarding the first, I think parallelization would help to a certain extent for new files.
Index caching will help if they are existing tracked files. I am not sure how many new files you will have versus existing tracked files. But anyway, we won't be able to use all cores, as we will be IO-bound, not CPU-bound. And using Python, we haven't been able to scale linearly with more cores/threads even when we supported parallel hashing. The performance gain was negative on small files, and maybe 2x-3x on big files.

But with single core/thread, we are not able to saturate IO either.

Regarding the transfer, I guess panfs does not support hardlinks, does it? We use hardlinks during dvc add. Another option here is server-side copying using the copy_file_range syscall, if panfs supports it.
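
If you want to check what the filesystem supports, something like this standalone probe could help (illustrative only; the mount path is a placeholder, and os.copy_file_range needs Python 3.8+ on Linux):

```python
# Quick probe of what a filesystem supports: hardlinks (what dvc add prefers)
# and the copy_file_range syscall (server-side copy where implemented).
import os
import tempfile


def probe(directory: str) -> None:
    with tempfile.TemporaryDirectory(dir=directory) as tmp:
        src = os.path.join(tmp, "src")
        with open(src, "wb") as fobj:
            fobj.write(b"x" * 1024)

        # Hardlink support
        try:
            os.link(src, os.path.join(tmp, "hardlink"))
            print("hardlinks: supported")
        except OSError as exc:
            print(f"hardlinks: not supported ({exc})")

        # copy_file_range support (Linux, Python 3.8+)
        dst = os.path.join(tmp, "copy")
        try:
            with open(src, "rb") as s, open(dst, "wb") as d:
                os.copy_file_range(s.fileno(), d.fileno(), 1024)
            print("copy_file_range: supported")
        except (AttributeError, OSError) as exc:
            print(f"copy_file_range: not supported ({exc})")


probe("/path/to/panfs/mount")  # placeholder path
```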

add.prof.tar.gz

@johnyaku (Author)

Thanks @skshetry. We are not particularly concerned about transfer times at this point.

75% of the time was spent hashing

That sounds about right. So I guess the essence of this issue is ...

  1. can we reduce the time spent hashing? or
  2. can we minimise index building, or avoid the need for it completely?

Thanks for explaining the nuances of how parallelization impacts performance. Although it seems that parallelization is not a general panacea for reducing hashing time, I feel that in our case -- several datasets with hundreds of files in the 50~150GB range -- it would make a huge difference, so it would be nice to have the option to turn parallelization on manually (with a --jobs option, for example). Obviously it would be optimal if this option only applied to files over a certain size, but right now we'd take a few seconds' hit on the small files for a few hours gained on the big ones.

But even if hashing inevitably takes time, we are prepared to pay the upfront cost of building the index in return for performance improvements later. The problem is that we find ourselves indexing the same files again and again for existing tracked files. This happens for three reasons:

  1. hashes are recalculated for imports
  2. we switch between compute nodes on the HPC
  3. different users clone different instances at different root directories

Recalculating hashes for 1) above feels like a huge waste of time. It should be possible to either query the index for the import source, or simply trust the hash-based file names in the shared cache. Problems 2) and 3) could be solved by making indices more portable, so if indices could be cached or otherwise recycled that would be a huge help.

There are a few moving pieces to this, and a perfect solution might involve a few coordinated steps. But this is such a huge blocker right now that even partial improvements would be game changing.

@johnyaku (Author)

johnyaku commented Sep 6, 2023

We also hit this obstacle when using dvc checkout to inspect a dataset that somebody else is working on. It can take hours just to build the index, whereas previously we could check out a large dataset in seconds.

If making the index more performant is not a priority right now, it would be great to have the option to delay indexing. Yes, having an index will make future operations run faster, but I'd like to inspect the dataset right now. With delayed indexing I could then index the workspace overnight, but still get on with life in the meantime.

Seriously thinking of reverting at this point. What is the most recent version prior to the introduction of indexing?

@johnyaku (Author)

Delaying indexing as in #9982 is probably the simplest change that might help us work around this issue.

The next simplest change would be to adjust build_entry() or hash_file() or Meta.from_info() so that ...

  1. if a file is a symlink to the cache, then
  2. the hash is determined based on the filename of the link target, rather than recalculating the checksum.

I appreciate the risks entailed by this approach given the possibility that somebody might manipulate files in the cache directly (rather than via dvc operations). However, the files in the cache are read-only and so it would take some fairly deliberate stupidity to mess up the cache like this -- it isn't something that is likely to happen by accident.

The paranoid option is to recalculate the checksums each time, but this can take hours as described above. We'd really appreciate either an option or a setting to enable hashes to be derived from the filenames in the cache.
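
To make the idea concrete, here is a rough sketch of what I have in mind; the cache layout (<cache>/files/md5/<2 hex chars>/<30 hex chars>) and the helper name are my assumptions for illustration, not DVC internals:

```python
# Rough sketch: derive a file's md5 from its symlink target in the cache
# instead of re-hashing the content. Cache layout assumed here is
# <cache_dir>/files/md5/<first 2 hex chars>/<remaining 30 hex chars>;
# adjust for other layouts. Illustrative only, not DVC code.
import os
from pathlib import Path
from typing import Optional


def md5_from_cache_link(path: Path, cache_dir: Path) -> Optional[str]:
    if not path.is_symlink():
        return None
    target = Path(os.path.realpath(path))
    try:
        rel = target.relative_to(cache_dir)
    except ValueError:
        return None  # link does not point into the cache
    parts = rel.parts  # e.g. ("files", "md5", "ab", "cdef...")
    if len(parts) >= 2:
        candidate = parts[-2] + parts[-1]
        if len(candidate) == 32 and all(c in "0123456789abcdef" for c in candidate):
            return candidate
    return None


# Usage example (placeholder paths):
# md5_from_cache_link(Path("data/bigfile.bam"), Path("/shared/dvc-cache"))
```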

dberenbaum removed the status in DVC on Oct 3, 2023
dberenbaum removed this from DVC on Oct 3, 2023
@ghost

ghost commented Nov 2, 2023

Hi, just to bump this issue: we also have a similar problem on our end (though not exactly the same). The index is rebuilt on each of our CI runs when the data is pulled, and this is what takes 90% of the duration of the dvc pull (more than 1 hour).
Attached is a screenshot of the profiling of the dvc pull call.

We are looking into workarounds, but delaying indexing or avoiding recalculating checksums would make things easier.

@johnyaku (Author)

Here is another case for faster index building.

I have a directory tracked by dvc via a .dvc file. It happens to be an imported directory, but I think that is incidental.

When I add (import) the directory, an index gets built. It takes hours, but I have other things to do so I can live with it. Later, though, I discover that the directory contained a few empty files that act as checkpoints for the pipeline that generated the directory. So I decide to clean up the directory by deleting the checkpoint files and dvc committing the directory again.

Since none of the other (large) files have changed, all that really needs to happen is to create a new .dir file (without the deleted checkpoint files) and then recreate a new .dvc file to point to this .dir file. But instead dvc starts computing checksums all over again! So an operation that should have taken a few seconds ends up taking forever.

I am down with being paranoid about possible changes to the data that still remains in the directory, but in this case I know that nothing has changed because I only just added the data. I'd really appreciate having the ability to skip the checksum validation, or perhaps just do a light touch validation by comparing the hashes in the cached filenames against what the old .dir file thought should be there.
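
As an illustration of the kind of light-touch check I mean (purely a sketch -- the .dir layout assumed here, a JSON list of {"md5": ..., "relpath": ...} entries, and the cache path scheme are my guesses, not documented formats):

```python
# Sketch of a "light touch" check: compare what the old .dir manifest expects
# against what is actually present in the cache, without re-hashing anything.
# Assumes the .dir file is a JSON list of {"md5": ..., "relpath": ...} entries
# and a <cache>/files/md5/<2 chars>/<30 chars> layout; both are assumptions.
import json
from pathlib import Path


def missing_from_cache(dir_file: Path, cache_dir: Path) -> list[str]:
    entries = json.loads(dir_file.read_text())
    missing = []
    for entry in entries:
        md5 = entry["md5"]
        if not (cache_dir / "files" / "md5" / md5[:2] / md5[2:]).exists():
            missing.append(entry["relpath"])
    return missing


# Usage (placeholder paths):
# print(missing_from_cache(Path("/shared/dvc-cache/files/md5/ab/cdef....dir"),
#                          Path("/shared/dvc-cache")))
```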

@johnyaku (Author)

Nudging this issue with a question ...

We are considering migrating to public cloud (GCP). Will we still face long waits for indexing? Or can the index be constructed quickly from file metadata?

@maxstrobel

Hi, any updates on that one?

@dberenbaum (Collaborator)

Sorry, no updates at the moment

@maxstrobel

Thanks for the quick response!
I'm also facing issues in a CI environment - jobs get spawned in different temporary directories, which leads to an expensive re-hashing of the full dataset. So instead of only creating a bunch of links (the cache is on the same machine), the additional checksum calculation takes ages, which makes usage in a CI environment infeasible.
For us this is a show stopper.

@dberenbaum @efiop @skshetry Does any of you have an idea which of the proposals in #9813 (comment) would be realistic to add to DVC?

@johnyaku Do you have any workarounds that you can share?

dberenbaum added the p1-important label on Apr 3, 2024
@dberenbaum (Collaborator)

@johnyaku Thanks for your detailed explanations and persistence here. I have tried to take a closer look at some of these issues and test out the bottlenecks, and it seems that there are multiple different issues:

  1. Sqlite operations are not batched - mainly a problem for many files. See building tree: batch save/get from state dvc-data#111, checkout: bulk save to state db dvc-data#125, repro: Rebuilds same tree unnecessarily #9085 (comment)
  2. Files are always relinked - mainly a problem for many files. See DVC very slow for datasets with many files #7607, meta: capture nlink/ishardlink and islink/issymlink dvc-data#274
  3. Checksums are processed sequentially - mainly a problem for large files (see above)
  4. Regression in imports that rehashes and transfers all the data. See import: local cache is ignored when importing data that already exist in cache #10255 (comment).

For already tracked files, dvc will not actually recompute any checksums (except for imports due to 4), but issues 1 and 2 can still be a problem for many thousands of files. I don't think any of these issues are specific to the index, and I don't see that it was faster before indexing (dvc add has to calculate checksums of all files with or without indexing).

Until we are able to address those issues, @skshetry already mentioned that you can use dvc add with granular targets in the directory so dvc knows to only check for changes to those files. You can even specify deleted files as targets to dvc add.

@johnyaku (Author)

johnyaku commented Apr 5, 2024

@dberenbaum Thanks for looking deeply into this! I hadn't appreciated the possibility of granular targets with dvc add and that looks useful.

For already tracked files, dvc will not actually recompute any checksums (except for imports due to 4)

I initially wrote "That's not true!" ... then I did some testing.

It wasn't true in v2.58.2 and I even think it may not have been true as late as v3.27 (approx).
With earlier versions, checksums were recalculated when dvc checkout was run on a fresh clone with the data already in a shared cache, and this has been driving me mad for at least six months.

But the claim above is true in v3.43.1 and possibly earlier. I've been stalking the release notes ever since v3 came out waiting for a solution so that I could persuade the rest of the team to migrate, but I somehow missed the change that addressed this. That's going to be an easy case to make now.

I'll migrate a couple of projects to v3 and close this after a couple of days of testing if the fixes already in place address all of our concerns. Tagging @dlroden

@gregstarr

also everything is rehashed in experiments: #10308

@gregstarr

@johnyaku for your large datasets, how are you accessing them from your projects? won't dvc add download them all to your local computer/node? Or are you using external data?

@johnyaku (Author)

@gregstarr
At last count we have 71 projects. Some of them use external data, but data we generate in-house is added to one of several data registries. We then create datasets by importing selected data from one or more registries.
It is possible that all this importing is the root cause of the rehashing problem, but I haven't had time to test properly.

@gregstarr

Interesting.

  • For your external data, are they on a NAS or in object storage like s3 or minio?
  • for importing data, are you doing --no-download or not?
  • on projects with large datasets, have you run into the issue where running experiments causes rehashing?

@johnyaku (Author)

johnyaku commented May 2, 2024

@gregstarr

  • We don't use experiments (yet). Just add / import and repro, etc.
  • We sometimes use --no-download but eventually we need to download the data to do something with it.
  • External data could be literally anything: local file system, remote (SSH) file system, s3, HTTPS, etc. But our most common pattern is to add data generated in-house (or by collaborators) to a registry, and then import from there into different datasets for different projects.

@dberenbaum: I've finally had a chance to do some systematic testing.

With v3.27 I can check out from an external shared cache without rehashing (unlike v2.58.2).
Hashes are recalculated for imports, which still seems a bit redundant, but only when the data is first imported.
Subsequent checkouts from the external shared cache containing the imports do not trigger another rehash (again, unlike v2.58.2).

So, v3 is not quite perfect, but dramatically better than v2. We're in the process of migrating our many repos, and once this is complete I expect this problem will become a thing of the past.

Tagging @dlroden

johnyaku closed this as completed on May 2, 2024
@gregstarr

@johnyaku thanks for the info. I have a medium size dataset ~1TB and have found certain things very inconvenient. It sounds like using import rather than external inputs/outputs could improve things for me.

@johnyaku (Author)

johnyaku commented May 6, 2024

The downside with imports from a registry is that you have to actively manage the registry. The upside is that the registry is actively managed ...
