Faster index building #9813
Hi, @johnyaku. What commands are you using? Also, I would appreciate some profiling data to see what is slow.
Any command that requires checksum calculations for indexing. The most frustrating part involves importing data that is already in an external shared cache, with filenames already based on checksums. Having to spend another 10 hours recalculating these checksums feels like a waste of time. I understand that we might want to verify that the content of the files matches the filenames, but the files in the external shared cache are read-only, so the content is unlikely to have changed, barring disk corruption. It would be nice to be able to trust the filenames and construct an index quickly.
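To illustrate what I mean by "trusting the filenames": the checksum is already encoded in the cache path, so in principle it could be read back without touching the file contents. A rough sketch of the idea (my assumption about the layout, not an existing DVC feature):

```sh
# Hypothetical sketch, not a DVC command: recover the checksum implied by a
# cache entry's path instead of re-hashing its content. Assumes the usual
# two-character-prefix layout (.../files/md5/ab/cdef... in DVC 3.x,
# .../ab/cdef... in 2.x).
cache_file=/shared/cache/files/md5/ab/cdef0123456789abcdef0123456789
md5="$(basename "$(dirname "$cache_file")")$(basename "$cache_file")"
echo "$md5"   # abcdef0123456789abcdef0123456789
```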
Not sure exactly what you are looking for here, but happy to help.
Well, ...
For the record, here is the original thread: https://discord.com/channels/485586884165107732/563406153334128681/1133619509656301628 The index discussed here mostly means having it live with the cache so that it can be shared in a shared-cache scenario, instead of every user having their own instance. This would also be useful for other applications (e.g. in a remote, so that we can download the index and use it instead of polling the remote for every file). I don't think this issue is actionable on its own, as it depends on a lot of other stuff, after which we can dive deeper into related topics. E.g. #9333 needs to be completed, so that we can throw out the old state db and use the new index db, which is more suitable for sharing.
@johnyaku, one tip: you can use granular targets, e.g. if the dataset is a directory and only one subdirectory has changed, you can `dvc add` just that subdirectory instead of re-adding the whole dataset.
Thanks @skshetry. Can I clarify that last point with a concrete example? Suppose I have already added a directory with subdirectories, and the whole directory is tracked via a single `.dvc` file.
Then I change the content of one file in one of the subdirectories. I'm interpreting your comment as meaning that I can just `dvc add` that subdirectory, rather than re-adding the whole directory. I can see how that might speed things up a bit, so thanks! :)

However, our bigger problem is due to the multi-node nature of our work environment. Our hacks to mirror the index dbs sometimes work, but not always. And we need to build a new index for each user working on a dataset, and even for each instance of the dataset owned by the same user. This would be fine if it only took a few minutes, but with many large files this can take 10 hours or more. So we desperately need either faster index building or more portable indices (or both).
Faster index building could include:
1) trusting the cache
2) parallelizing checksum calculations
3) making indexing optional or deferrable
By 1) trusting the cache, I mean trusting the checksum implied by the filename in the cache, rather than recalculating it. I think something like this is already done for remotes?

For 2), there is already a config option for parallel checksum jobs.

For 3), while having an index does deliver a noticeable performance boost later, sometimes we just want to get on with life. We typically use ...

Turning to portable indices, I can think of multiple ways to skin that particular cat, but it would be great to have a ready-made solution that allowed indices to be recycled across machines, users and instances. I understand that the index needs to be on a local disk for performance reasons, but the ...

Inspecting the sqlite files, I noticed that the paths appear to be absolute rather than relative to the project top, which prevents us from recycling them across project instances.

Finally, all of these headaches are coming close to being a showstopper for our use of DVC, and are making folks reluctant to migrate to v3.0+. We are currently using v2.58.2. If a solution for these indexing issues is made, it would be super helpful to have a patch for v2 so that we can regain the team's confidence before dealing with the challenges of migrating to the new version.
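To be concrete about 2): the setting I have in mind is, if I have the name right, DVC's `core.checksum_jobs` option (the value below is just an example):

```sh
# Existing DVC config setting, to the best of my knowledge: the number of
# parallel threads used for checksum calculation. 16 is an arbitrary example.
dvc config core.checksum_jobs 16
```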
Yes, that's what I meant. :)
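For concreteness, the granular re-add would look something like this (paths are hypothetical):

```sh
# Re-add only the changed subdirectory of an already-tracked directory,
# instead of re-hashing the whole dataset. Paths are hypothetical.
dvc add dataset/subdir1
# ...rather than:
# dvc add dataset
```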
Are you aware of
Btw, do you mind sharing cProfile data from `dvc add --cprofile-dump add.prof <...args>`?
Yes. We actually set it
That is unfortunate. Although we lack admin rights over our compute environments, we do have a lot of compute resources that we could throw at this if there was a way to utilise them. I echo a lot of @ivartz's sentiments in #3177 even though our datasets have very different file number/file size profiles.
Looking at the profiling data, most of the time is spent hashing. Out of a total runtime of 1505s, 75% was spent hashing 475 files (about 2.3s/file), and the rest was spent transferring those files to the cache. Regarding the first, I think parallelization would help to a certain extent for new files, but with a single core/thread we are not able to saturate IO either. Regarding the transfer, I guess panfs does not support hardlinks, does it? We use hardlinks during ...
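(If panfs does support any of the link types, the order in which DVC tries them is configurable; whether that helps here depends on the filesystem. As far as I know, the relevant setting looks like this, with the value shown being just one possible ordering:)

```sh
# Existing DVC config setting, as far as I know: the link types DVC tries, in
# order, when placing files between the cache and the workspace.
dvc config cache.type "reflink,hardlink,symlink,copy"
```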
Thanks @skshetry. We are not particularly concerned about transfer times at this point.
That sounds about right. So I guess the essence of this issue is ...
Thanks for explaining the nuances of how parallelization impacts performance. Altho it seems that parallelization is not a general panacea for reducing hashing time, I feel that in our case -- several datasets with hundreds of files in the 50~150GB range -- it would make a huge difference, so it would be nice to have the option to turn parallelization on manually (with a config setting or flag).

But even if hashing inevitably takes time, we are prepared to pay the upfront cost of building the index in return for performance improvements later. The problem is that we find ourselves indexing the same files again and again for existing tracked files. This happens for three reasons:
1) importing data that is already in the shared cache
2) working with multiple instances (clones) of the same project
3) working across multiple compute nodes (and users)
Recalculating hashes for 1) above feels like a huge waste of time. It should be possible to either query the index for the import source, or simply trust the hash-based file names in the shared cache. Problems 2) and 3) could be solved by making indices more portable, so if indices could be cached or otherwise recycled, that would be a huge help. There are a few moving pieces to this, and a perfect solution might involve a few coordinated steps. But this is such a huge blocker right now that even partial improvements would be game changing.
Another time we hit this obstacle is when using ... If making the index more performant is not a priority right now, it would be great to have the option to delay indexing. Yes, having an index will make future operations run faster, but I'd like to inspect the dataset right now. With delayed indexing I could index the workspace overnight, but still get on with life in the meantime. Seriously thinking of reverting at this point. What is the most recent version prior to the introduction of indexing?
Delaying indexing as in #9982 is probably the simplest change that might help us work around this issue. The next simplest change would be to adjust `build_entry()`, `hash_file()` or `Meta.from_info()` so that ...
I appreciate the risks entailed by this approach, given the possibility that somebody might manipulate files in the cache directly (rather than via dvc operations). However, the files in the cache are read-only, so it would take some fairly deliberate stupidity to mess up the cache like this -- it isn't something that is likely to happen by accident. The paranoid option is to recalculate the checksums each time, but this can take hours, as described above. We'd really appreciate either an option or a setting to enable hashes to be derived from the filenames in the cache.
Here is another case for faster index building. I have a directory tracked by dvc via a single `.dvc` file. When I add (import) the directory, an index gets built. It takes hours, but I have other things to do, so I can live with it.

Later, tho, I discover that the directory contained a few empty files that act as checkpoints for the pipeline that generated the directory. So I decide to clean up the directory by deleting the checkpoint files and re-adding it.

Since none of the other (large) files have changed, all that really needs to happen is to create a new manifest for the directory, minus the deleted files.

I am down with being paranoid about possible changes to the data that still remains in the directory, but in this case I know that nothing has changed, because I only just added the data. I'd really appreciate having the ability to skip the checksum validation, or perhaps just do a light-touch validation by comparing the hashes in the cached filenames against what the old manifest recorded.
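To make the "light touch" idea concrete: as far as I can tell, a tracked directory is represented in the cache by a JSON `.dir` manifest of `{md5, relpath}` entries, so removing a few files should conceptually only mean dropping their entries from that list. Purely as an illustration (hypothetical path and suffix; I'm not suggesting editing the cache by hand):

```sh
# Illustration only, not a supported workflow: show the directory manifest
# with the deleted checkpoint files filtered out. The .dir path and the
# ".checkpoint" suffix are hypothetical.
jq 'map(select(.relpath | endswith(".checkpoint") | not))' \
  /shared/cache/files/md5/ab/cdef0123456789abcdef0123456789.dir
```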
Nudging this issue with a question ... We are considering migrating to public cloud (GCP). Will we still face long waits for indexing? Or can the index be constructed quickly from file metadata?
Hi, any updates on that one? |
Sorry, no updates at the moment.
Thanks for the quick response! @dberenbaum @efiop @skshetry Do any of you have an idea which of the proposals in #9813 (comment) would be realistic to add to DVC? @johnyaku Do you have any workarounds that you can share?
@johnyaku Thanks for your detailed explanations and persistence here. I have tried to take a closer look at some of these issues and test out the bottlenecks, and it seems that there are multiple different issues:
For already tracked files, dvc will not actually recompute any checksums (except for imports, due to 4), but issues 1 and 2 can still be a problem for many thousands of files. I don't think any of these issues are specific to the index, and I don't see that it was faster before indexing. Until we are able to address those issues, @skshetry already mentioned that you can use granular targets.
@dberenbaum Thanks for looking deeply into this! I hadn't appreciated the possibility of granular targets with `dvc add`.
I initially wrote "That's not true!" ... then I did some testing. It wasn't true in v2.58.2 and I even think it may not have been true as late as v3.27 (approx). But the claim above is true in v3.43.1 and possibly earlier. I've been stalking the release notes ever since v3 came out, waiting for a solution so that I could persuade the rest of the team to migrate, but I somehow missed the change that addressed this. That's going to be an easy case to make now. I'll migrate a couple of projects to v3 and close this after a couple of days of testing if the fixes already in place address all of our concerns. Tagging @dlroden
Also, everything is rehashed in experiments: #10308
@johnyaku for your large datasets, how are you accessing them from your projects? won't |
@gregstarr |
Interesting.
@dberenbaum: I've finally had a chance to do some systematic testing. With v3.27 I can ... So, v3 is not quite perfect, but dramatically better than v2. We're in the process of migrating our many repos, and once this is complete I expect this problem will become a thing of the past. Tagging @dlroden
@johnyaku thanks for the info. I have a medium-sized dataset (~1TB) and have found certain things very inconvenient. It sounds like using import rather than external inputs/outputs could improve things for me.
The downside with imports from a registry is that you have to actively manage the registry. The upside is that the registry is actively managed ... |
We have several large DVC-controlled projects. One example is over 10TB. The "raw" input data includes about 3000 large files ranging in size from 5GB to 100GB or more. Building an index takes 10 hours or more.
If this was a one-time cost, it might be acceptable. But the index needs to be rebuilt for each instance (clone) of the project, and for each compute node working on the project, even tho the shared external cache is on a shared file system. We have a work-around for the multi-node problem, but the paths in `cache.db` are absolute rather than relative to the project top, and so we are unable to recycle indices across project instances.

This bottleneck is threatening to become a show-stopper. I understand from @efiop that a fix is under development, but we'd really appreciate it if this could become a priority.
Tagging @dlroden