Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Epic: migrating to dvc-data's index #9333

Open
8 of 12 tasks
efiop opened this issue Apr 15, 2023 · 29 comments
Open
8 of 12 tasks

Epic: migrating to dvc-data's index #9333

efiop opened this issue Apr 15, 2023 · 29 comments
Labels
epic Umbrella issue (high level). Include here: list of tasks/PRs, details, Qs, timelines, etc

Comments

@efiop
Copy link
Contributor

efiop commented Apr 15, 2023

Summary / Background

Tracking progress on migrating dvc internals to using dvc-data's DataIndex

Scope

Old hashfile/state/pipeline/output based logic in data management migrating to DataIndex to introduce a common logic/code/arch that everything could rely on and so everything could benefit from optimizations in one place.

Assumptions

Open Questions

Blockers / Dependencies

General Approach

Steps

Must have (p1)

Optional / followup (p2)

  • fine-tuning/optimizing index checkout/fetch/etc

Timelines

Iterative migration based on higher-level product needs.

@efiop efiop added the epic Umbrella issue (high level). Include here: list of tasks/PRs, details, Qs, timelines, etc label Apr 15, 2023
@efiop
Copy link
Contributor Author

efiop commented Apr 15, 2023

сс @shcheklein

@dberenbaum
Copy link
Collaborator

Prioritizing dvc fetch now because it's needed for studio backend to use cloud versioning (https://github.com/iterative/studio/issues/4782). Not prioritizing other items now.

@shcheklein
Copy link
Member

Thanks @efiop for creating this! Can we add some context here related to features that we dropped along the way (e.g. the one we discussed in Slack- parallelism in hashes, -j, etc) assuming that we can get to all these items soon? So that we can track it first, see if need to some actions to mitigate this for now, more visibility to users.

@daavoo
Copy link
Contributor

daavoo commented Apr 19, 2023

iterative/dvc-data#341 (comment) says:

This is about all data management that we have in dvc. So that we can get rid of get_used_objs stuff and so that all manipulations (like filtering by size, etc) are in one place.

dvc gc also relies on get_used_objs.

Should I pause the work in #2036 until we finish this migration?
Do we want to also migrate gc (add it to the list of p1 here) or are we ok with using the current logic for now?

cc @dberenbaum @efiop

@efiop
Copy link
Contributor Author

efiop commented Apr 19, 2023

@daavoo Since current gc doesn't work with cloud versioning and the whole logic for #2036 is to compare local objects with remote ones and delete overlap - you should be able to do it just fine without waiting for index for now.

@daavoo
Copy link
Contributor

daavoo commented Apr 19, 2023

@daavoo Since current gc doesn't work with cloud versioning and the whole logic for #2036 is to compare local objects with remote ones and delete overlap - you should be able to do it just fine without waiting for index for now.

Indeed, I have a working branch using the current logic.

I was asking just in case we are going to "throw out" that code any time soon

@skshetry
Copy link
Member

skshetry commented Apr 24, 2023

In operations like fetch/data-status, we can think of all data files tracked as one single piece.
But I am not sure that applies to add/commit where they are updates to individual dataset. Also mapping index changes back to dvc.yaml/.dvc is much more complicated and fragile than stages/outs.

In #4657, I am thinking of using outs_trie (still investigating though).

@efiop
Copy link
Contributor Author

efiop commented Apr 24, 2023

@skshetry Hm, could you elaborate why outs_trie seems to be more suitable than index? Just sounds surprising. It is fine if you want to discuss it in a PR and not now.

@skshetry
Copy link
Member

I am still trying to understand Index, so I may be wrong here, but to my mind I find outs_trie simpler.

With outs_trie, I find implementing #4657 this quite straightforward:

def virtual_commit(repo, path):
  out = repo.index.outs_trie.longest_prefix(path)
  filter_info = out.fs.relpath(path, out.path)

  _fetch_tree_if_needed()
  out.obj = _update_obj(out.get_obj(), path)
  out.stage.dump()

But with index, you are working with a larger data structure, which you have to build,
and update. Then, you have to diff two indexes again to find changes so that you can write it back to dvc.yaml/.dvc which is again complicated. With outs_trie, we already have a good mapping between output and dvc.yaml/.dvc, and the updates happen in-place which makes it much easier. (I wish we had data section where all outputs are tracked, which would have made all of these things simpler).

I see the benefits of using an index where we can assume all the files/datasets tracked as one single piece, but add/commit are selective updates, so I don't quite see the need for Index here. There are potential performance benefits of the Index in the future (when we introduce batching or getting rid of state, etc.) but that can be implemented outside the Index too.
Working with a complicated data structure than you need might not be a good idea.

Anyway, I am only looking at this from #4657 implementation perspective only. What would Index provide add/commit apart from the code-reuse?

I think we should wait before implementing #4657 in any case (cc @dberenbaum).

@dberenbaum
Copy link
Collaborator

I think we should wait before implementing #4657 in any case (cc @dberenbaum).

Wait for what? Your rationale for outs_trie makes sense to me, but I don't follow why it means we should wait to implement #4657.

@skshetry
Copy link
Member

Wait for what?

If we want to use Index in dvc add/commit, outs_trie implementation will be temporary, and we will have to rewrite virtual directory support based on Index.
So I think it’s better to discuss and plan on either ways.

Let’s wait for Ruslan, i want to know his thoughts.
Also, it depends on if Ruslan was suggesting use of Index in dvc add/commit is as a primary data structure or auxiliary.

@dberenbaum
Copy link
Collaborator

Okay, makes sense, thanks for the explanation.

@efiop
Copy link
Contributor Author

efiop commented Apr 26, 2023

We should probably move this discussion to the related ticket.

In operations like fetch/data-status, we can think of all data files tracked as one single piece.

@skshetry But we use Index views for targets even below output level of granularity. You can create a view from a path, tweak index to your liking and then use, for example, outs_trie, to map it back to stages. DataIndex also automatically fetches .dirs.

But if the way you see it doesn't involve index - that's totally fine too, I have no problems with it. Feel free to implement it the way you want it for now. outs_trie is likely going to stay and your effort won't be wasted even if we swap it index later on.

@skshetry
Copy link
Member

Views are more of a convenience, conceptually it's just an index though.

Could you please elaborate more on the Index please? What is it, and how does it fit?
What does it solve? etc.
Should it be treated like Git's Index (which means we have to populate it in all operations even on add/commit, etc)?

Feel free to implement it the way you want it for now. outs_trie is likely going to stay and your effort won't be wasted even if we swap it index later on.

But if we use Index, we won't need outs_trie, right? As all the virtual operations has to be operated on the Index in that case.

@efiop
Copy link
Contributor Author

efiop commented Apr 28, 2023

Could you please elaborate more on the Index please? What is it, and how does it fit?
What does it solve? etc.
Should it be treated like Git's Index (which means we have to populate it in all operations even on add/commit, etc)?

DataIndex is a structure for managing data. It is below the level of outputs. It is similar to Git's Index, but more general and is not tied to real local workspaces. Even in git, you don't always have to use index to do things, you could write objects directly if you really want to, same here.

But if we use Index, we won't need outs_trie, right? As all the virtual operations has to be operated on the Index in that case.

Not really, outputs are above data index level. DataIndex is only managing data, which outputs use. outs_trie will stay in one shape or form, because you need to map index entries to outputs/stages to serialize them.

@dmpetrov
Copy link
Member

I'd appreciate more details on the index. Is there any document?

  1. DVC already has the state which was introduced for optimization. Was index introduced for the same purpose - optimization - or something else?
  2. Is there a separated index file/dir? What is the path to the file?
  3. Does user need to commit the index (I assume there is a separate index file)?
  4. What if user accidentally removes the index file?
  5. Why cloud versioning is special and need a new data structure?

@efiop
Copy link
Contributor Author

efiop commented Apr 29, 2023

@dmpetrov Unfortunately there is no comprehensive document.

  1. Index is a data management layer, one level below pipeline outputs. It happens to remove the need for state though, since index is serializable and contains the same metadata. Think git index.

  2. It can be virtual or it can be persistent. The persistent one is a db file. It is in Repo's site_cache_dir (see your dvc doctor output).

  3. No, user doesn't need to even know about it. We do write persistent index for optimization to avoid reparsing large repos and all their dvc files (e.g. when handling a particular git revision).

  4. It is not essential, we will rebuild it if we need it (dvcfiles are the source of truth). Just like state.

  5. Because cloud versioning is not about hash files/objects, but about particular metadata (e.g. version_ids). You could build hashfiles/objects/trees out of index, but that requires additional processing (e.g. hashing).

@dmpetrov
Copy link
Member

  • Index is a data management layer, one level below pipeline outputs. It happens to remove the need for state though, since index is serializable and contains the same metadata. Think git index.

It looks like an extension of the state, isn't it? Why these two should be separate? An example would be helpful.

dvcfiles are the source of truth

👍 the most critical part since dvc is "codification" tool.

5. cloud versioning is not about hash files/objects, but about particular metadata (e.g. version_ids).

Could you please elaborate on this? Why version_ids is special? It is conceptually the same as etag.

An extra question. A similar index can significantly speed up metrics and plots. Should it be the same index? Why?

@efiop
Copy link
Contributor Author

efiop commented Apr 30, 2023

It looks like an extension of the state, isn't it? Why these two should be separate? An example would be helpful.

It is not an extension of state. State is key-value storage that is an afterthought for data management, while index is hierarchical and can be used directly. The best analogy I can come up with is our state vs git index.

Could you please elaborate on this? Why version_ids is special? It is conceptually the same as etag.

But etag is not a hash either. We have misused it as hash for external outputs and stuff, but that was wrong, since etag is a checksum not a hash. Same with version_ids. etag/version_id/etc is a metadata that we capture that serves like a checksum in a sense that it allows us to detect changes, but it can't be used for universal content identification (version_ids with a particular path on particular cloud allow you to get access to particular data, but that's not strictly speaking universal even for file renames or copies on the same cloud). I think git index vs git objects is the best analogy here.

An extra question. A similar index can significantly speed up metrics and plots. Should it be the same index? Why?

Since we cache index - yes, it can speed up those operations. But for metrics and plots there is an additional step of "rendering", so it will benefit from caching that separately as well.

@dmpetrov
Copy link
Member

dmpetrov commented May 1, 2023

It is not an extension of state. State is key-value storage that is an afterthought for data management, while index is hierarchical and can be used directly.

The state is just a table in DB. Why the hierarchical index cannot be added to the same DB?

The best analogy I can come up with is our state vs git index.

I'm not sure I understand the analogy. I got the hierarchy part (is it the Git analogy). However, I don't understanding the need in a separate DB instance/file.

Could you please elaborate on this? Why version_ids is special? It is conceptually the same as etag.

But etag is not a hash either. We have misused it as hash for external outputs and stuff

Also from Dave's comment:

dvc fetch now because it's needed for studio backend to use cloud versioning

I got an impression that this index was introduced for cloud versioning. Is my impression correct? The question is - can cloud versioning be implemented without this index?

A similar index can significantly speed up metrics and plots. Should it be the same index? Why?

Since we cache index - yes, it can speed up those operations. But for metrics and plots there is an additional step of "rendering", so it will benefit from caching that separately as well.

The question is - will we need another DB instance/file to cache the metrics or the same file can be re-used?

So far, it feel like we are introducing to many DB files while the best practice is to keep a single DB.

@efiop
Copy link
Contributor Author

efiop commented May 1, 2023

@dmpetrov

The state is just a table in DB. Why the hierarchical index cannot be added to the same DB?

It can be, but there is no reason. We already have state and legacy remote index dbs and new index is going to replace both of those, so it didn't make sense to try to shove it into state. Instead we'll complete migration to the new index and drop legacy state/remote indexes.

I'm not sure I understand the analogy. I got the hierarchy part (is it the Git analogy). However, I don't understanding the need in a separate DB instance/file.

There is no need in separate new file, see answer above.

I got an impression that this index was introduced for cloud versioning. Is my impression correct? The question is - can cloud versioning be implemented without this index?

It was needed for cloud versioning because it doesn't fit into old ways of doing things.

The question is - will we need another DB instance/file to cache the metrics or the same file can be re-used?

Could be reused, sure, but I don't know what size and format will metrics/plots try to cache, so I can't say how it will be stored and where right now.

@dmpetrov
Copy link
Member

dmpetrov commented May 1, 2023

It can be, but there is no reason.

Introducing a new sqlite file(DB) instead of migrating a schema for existing one? Please elaborate why is this better.

It was needed for cloud versioning because it doesn't fit into old ways of doing things.

I'd appreciate an explanation.

@efiop
Copy link
Contributor Author

efiop commented May 1, 2023

Introducing a new sqlite file(DB) instead of migrating a schema for existing one? Please elaborate why is this better.

State and remote indexes are using diskcache and not pure sqlite, while new index is pure sqlite. diskcache makes some assumptions about db structure so it would be messy to try to shove them into one db. Considering that state and remote indexes are going to be dropped, we can just have a fresh start in the new index. So 1 new file is better than 2+ old ones.

I'd appreciate an explanation.

I've elaborated above in this issue. This is about how it works internally, with no user implication.

@dmpetrov
Copy link
Member

dmpetrov commented May 2, 2023

State and remote indexes are using diskcache and not pure sqlite, while new index is pure sqlite.

That's helpful! Additionally, it would be great to know the options that were considered - is there is an option to do a DB migration (these DBs are compatible I assume).

This is about how it works internally, with no user implication.

🤔 It's not for users, it is for the team to understand the motivation. Based on the discussion and the questions,
it feels there is a lack of understanding of the motivation behind the changes. Some design decisions were made without proper collaboration with the team and without alignment with the broader vision (how about metrics and plots?).

The initiative looks great but It requires a discussion with the team. I strongly encourage the team to discuss and carefully consider such design decisions before proceeding with implementation.

@shcheklein
Copy link
Member

btw (since I asked @efiop to create this to actually better understand the timeline for the -j and parallelism for some use cases and even in the first place have place to refer users to), and in case it's not 100% - this epic by itself is not something new, as far as I understand most of the checkboxes were done a long time ago and as @dberenbaum mentioned we are migrating fetch (?) because it's needed for Studio cloud versioning. We are not making any new global decision (right, @efiop ?.

It would be very great to have some tech design specs behind the index, and other systems for the whole team.

@efiop
Copy link
Contributor Author

efiop commented May 2, 2023

@dmpetrov

That's helpful! Additionally, it would be great to know the options that were considered - is there is an option to do a DB migration (these DBs are compatible I assume).

diskcache is a sqlite db, but it has its own assumptions about tables. It is not worth trying to shove them together. We are getting rid of it completely in favor of 1 file, so it is unreasonable to try to migrate or merge together with discache. Those are really all the options that we had. I don't think this issue of the temporary number of hidden db files is so critical.

🤔 It's not for users, it is for the team to understand the motivation. Based on the discussion and the questions,
it feels there is a lack of understanding of the motivation behind the changes. Some design decisions were made without proper collaboration with the team and without alignment with the broader vision (how about metrics and plots?).

Sure, there wasn't an official process with clear prelude, but this was a gradual process and we've talked about it in private for more than a year and lots of things were already implemented with that (e.g. cloud versioning).

@shcheklein

We are not making any new global decision (right, @efiop ?.

This issue is just for visibility for the things that were going on for a very long time already. This is not a brand-new effort or anything like that.

It would be very great to have some tech design specs behind the index, and other systems for the whole team.

Not sure how to approach that at this point. There is https://github.com/iterative/dvc-data and how it is used in dvc. The analogies are from well-known git concepts. Do we have some great examples from other teams maybe that I could take as a template?

@efiop
Copy link
Contributor Author

efiop commented May 3, 2023

For the record: had a meeting with @skshetry and agreed that the legacy obj-based implementation even though would work for regular outputs, won't work for cloud versioning/import/etc (there are no hashes, no objects there, hence no Tree objects), so it makes sense to also use index there as discussed before.

The missing part in our toolset is serializing dvc's Index back into dvcfiles. We have some hacks in cloud versioning for that and maybe something like that would do, but probably need to make a new separate function for that, similar to how we serialize stages into dvc.yaml and lockfiles.

@dberenbaum
Copy link
Collaborator

Sounds reasonable.

@skshetry Can we keep the serialization part as minimal as possible so it doesn't become an entire separate story for now?

This was referenced May 11, 2023
efiop added a commit to efiop/dvc-data that referenced this issue Jul 30, 2023
Very basic implementation to start with. Will be tweaked later.

Pre-requisite for iterative/dvc#9333
efiop added a commit to iterative/dvc-data that referenced this issue Jul 31, 2023
Very basic implementation to start with. Will be tweaked later.

Pre-requisite for iterative/dvc#9333
@dberenbaum
Copy link
Collaborator

Aiming to do push this sprint

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
epic Umbrella issue (high level). Include here: list of tasks/PRs, details, Qs, timelines, etc
Projects
No open projects
Status: In Progress
Development

No branches or pull requests

6 participants