-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Epic: migrating to dvc-data's index #9333
Comments
сс @shcheklein |
Prioritizing dvc fetch now because it's needed for studio backend to use cloud versioning (https://github.com/iterative/studio/issues/4782). Not prioritizing other items now. |
Thanks @efiop for creating this! Can we add some context here related to features that we dropped along the way (e.g. the one we discussed in Slack- parallelism in hashes, |
iterative/dvc-data#341 (comment) says:
Should I pause the work in #2036 until we finish this migration? |
Indeed, I have a working branch using the current logic. I was asking just in case we are going to "throw out" that code any time soon |
In operations like fetch/data-status, we can think of all data files tracked as one single piece. In #4657, I am thinking of using |
@skshetry Hm, could you elaborate why |
I am still trying to understand With def virtual_commit(repo, path):
out = repo.index.outs_trie.longest_prefix(path)
filter_info = out.fs.relpath(path, out.path)
_fetch_tree_if_needed()
out.obj = _update_obj(out.get_obj(), path)
out.stage.dump() But with index, you are working with a larger data structure, which you have to build, I see the benefits of using an index where we can assume all the files/datasets tracked as one single piece, but add/commit are selective updates, so I don't quite see the need for Index here. There are potential performance benefits of the Index in the future (when we introduce batching or getting rid of state, etc.) but that can be implemented outside the Index too. Anyway, I am only looking at this from #4657 implementation perspective only. What would Index provide I think we should wait before implementing #4657 in any case (cc @dberenbaum). |
Wait for what? Your rationale for |
If we want to use Index in dvc add/commit, outs_trie implementation will be temporary, and we will have to rewrite virtual directory support based on Index. Let’s wait for Ruslan, i want to know his thoughts. |
Okay, makes sense, thanks for the explanation. |
We should probably move this discussion to the related ticket.
@skshetry But we use Index views for targets even below output level of granularity. You can create a view from a path, tweak index to your liking and then use, for example, outs_trie, to map it back to stages. But if the way you see it doesn't involve index - that's totally fine too, I have no problems with it. Feel free to implement it the way you want it for now. |
Views are more of a convenience, conceptually it's just an index though. Could you please elaborate more on the
But if we use Index, we won't need outs_trie, right? As all the virtual operations has to be operated on the Index in that case. |
DataIndex is a structure for managing data. It is below the level of outputs. It is similar to Git's Index, but more general and is not tied to real local workspaces. Even in git, you don't always have to use index to do things, you could write objects directly if you really want to, same here.
Not really, outputs are above data index level. DataIndex is only managing data, which outputs use. |
I'd appreciate more details on the index. Is there any document?
|
@dmpetrov Unfortunately there is no comprehensive document.
|
It looks like an extension of the state, isn't it? Why these two should be separate? An example would be helpful.
👍 the most critical part since dvc is "codification" tool.
Could you please elaborate on this? Why version_ids is special? It is conceptually the same as etag. An extra question. A similar index can significantly speed up metrics and plots. Should it be the same index? Why? |
It is not an extension of state. State is key-value storage that is an afterthought for data management, while index is hierarchical and can be used directly. The best analogy I can come up with is our state vs git index.
But etag is not a hash either. We have misused it as hash for external outputs and stuff, but that was wrong, since etag is a checksum not a hash. Same with version_ids. etag/version_id/etc is a metadata that we capture that serves like a checksum in a sense that it allows us to detect changes, but it can't be used for universal content identification (version_ids with a particular path on particular cloud allow you to get access to particular data, but that's not strictly speaking universal even for file renames or copies on the same cloud). I think git index vs git objects is the best analogy here.
Since we cache index - yes, it can speed up those operations. But for metrics and plots there is an additional step of "rendering", so it will benefit from caching that separately as well. |
The state is just a table in DB. Why the hierarchical index cannot be added to the same DB?
I'm not sure I understand the analogy. I got the hierarchy part (is it the Git analogy). However, I don't understanding the need in a separate DB instance/file.
Also from Dave's comment:
I got an impression that this index was introduced for cloud versioning. Is my impression correct? The question is - can cloud versioning be implemented without this index?
The question is - will we need another DB instance/file to cache the metrics or the same file can be re-used? So far, it feel like we are introducing to many DB files while the best practice is to keep a single DB. |
It can be, but there is no reason. We already have state and legacy remote index dbs and new index is going to replace both of those, so it didn't make sense to try to shove it into state. Instead we'll complete migration to the new index and drop legacy state/remote indexes.
There is no need in separate new file, see answer above.
It was needed for cloud versioning because it doesn't fit into old ways of doing things.
Could be reused, sure, but I don't know what size and format will metrics/plots try to cache, so I can't say how it will be stored and where right now. |
Introducing a new sqlite file(DB) instead of migrating a schema for existing one? Please elaborate why is this better.
I'd appreciate an explanation. |
State and remote indexes are using diskcache and not pure sqlite, while new index is pure sqlite. diskcache makes some assumptions about db structure so it would be messy to try to shove them into one db. Considering that state and remote indexes are going to be dropped, we can just have a fresh start in the new index. So 1 new file is better than 2+ old ones.
I've elaborated above in this issue. This is about how it works internally, with no user implication. |
That's helpful! Additionally, it would be great to know the options that were considered - is there is an option to do a DB migration (these DBs are compatible I assume).
🤔 It's not for users, it is for the team to understand the motivation. Based on the discussion and the questions, The initiative looks great but It requires a discussion with the team. I strongly encourage the team to discuss and carefully consider such design decisions before proceeding with implementation. |
btw (since I asked @efiop to create this to actually better understand the timeline for the It would be very great to have some tech design specs behind the index, and other systems for the whole team. |
diskcache is a sqlite db, but it has its own assumptions about tables. It is not worth trying to shove them together. We are getting rid of it completely in favor of 1 file, so it is unreasonable to try to migrate or merge together with discache. Those are really all the options that we had. I don't think this issue of the temporary number of hidden db files is so critical.
Sure, there wasn't an official process with clear prelude, but this was a gradual process and we've talked about it in private for more than a year and lots of things were already implemented with that (e.g. cloud versioning).
This issue is just for visibility for the things that were going on for a very long time already. This is not a brand-new effort or anything like that.
Not sure how to approach that at this point. There is https://github.com/iterative/dvc-data and how it is used in dvc. The analogies are from well-known git concepts. Do we have some great examples from other teams maybe that I could take as a template? |
For the record: had a meeting with @skshetry and agreed that the legacy obj-based implementation even though would work for regular outputs, won't work for cloud versioning/import/etc (there are no hashes, no objects there, hence no Tree objects), so it makes sense to also use index there as discussed before. The missing part in our toolset is serializing dvc's |
Sounds reasonable. @skshetry Can we keep the serialization part as minimal as possible so it doesn't become an entire separate story for now? |
Very basic implementation to start with. Will be tweaked later. Pre-requisite for iterative/dvc#9333
Very basic implementation to start with. Will be tweaked later. Pre-requisite for iterative/dvc#9333
Aiming to do push this sprint |
Summary / Background
Tracking progress on migrating dvc internals to using dvc-data's DataIndex
Scope
Old hashfile/state/pipeline/output based logic in data management migrating to
DataIndex
to introduce a common logic/code/arch that everything could rely on and so everything could benefit from optimizations in one place.Assumptions
Open Questions
Blockers / Dependencies
General Approach
Steps
Must have (p1)
Optional / followup (p2)
Timelines
Iterative migration based on higher-level product needs.
The text was updated successfully, but these errors were encountered: