Epic: migrating to dvc-data's index #9333

efiop · 2023-04-15T18:47:45Z

efiop · 2023-04-15T18:53:22Z

dberenbaum · 2023-04-18T12:16:43Z

Prioritizing dvc fetch now because it's needed for studio backend to use cloud versioning (https://github.com/iterative/studio/issues/4782). Not prioritizing other items now.

shcheklein · 2023-04-18T18:20:02Z

Thanks @efiop for creating this! Can we add some context here about the features that we dropped along the way (e.g. the ones we discussed in Slack: parallelism in hashes, -j, etc.), assuming that we can get to all these items soon? That way we can track them, see if we need to take some actions to mitigate this for now, and give users more visibility.

daavoo · 2023-04-19T12:03:12Z

iterative/dvc-data#341 (comment) says:

> This is about all data management that we have in dvc. So that we can get rid of get_used_objs stuff and so that all manipulations (like filtering by size, etc) are in one place.

dvc gc also relies on get_used_objs.

Should I pause the work in #2036 until we finish this migration?
Do we want to also migrate gc (add it to the list of p1 here) or are we ok with using the current logic for now?

cc @dberenbaum @efiop

efiop · 2023-04-19T12:48:00Z

@daavoo Since current gc doesn't work with cloud versioning and the whole logic for #2036 is to compare local objects with remote ones and delete overlap - you should be able to do it just fine without waiting for index for now.

daavoo · 2023-04-19T12:52:23Z

> @daavoo Since current gc doesn't work with cloud versioning and the whole logic for #2036 is to compare local objects with remote ones and delete overlap - you should be able to do it just fine without waiting for index for now.

Indeed, I have a working branch using the current logic.

I was asking just in case we are going to "throw out" that code any time soon

skshetry · 2023-04-24T12:11:22Z

In operations like fetch/data-status, we can think of all data files tracked as one single piece.
But I am not sure that applies to add/commit, which are updates to an individual dataset. Also, mapping index changes back to dvc.yaml/.dvc is much more complicated and fragile than working with stages/outs.

In #4657, I am thinking of using outs_trie (still investigating though).

efiop · 2023-04-24T16:30:38Z

@skshetry Hm, could you elaborate why outs_trie seems to be more suitable than index? Just sounds surprising. It is fine if you want to discuss it in a PR and not now.

skshetry · 2023-04-26T08:30:18Z

I am still trying to understand Index, so I may be wrong here, but to my mind I find outs_trie simpler.

With outs_trie, I find implementing #4657 quite straightforward:

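    def virtual_commit(repo, path):
        # Find the tracked output that contains `path`.
        out = repo.index.outs_trie.longest_prefix(path)
        # Path of the target relative to the output root.
        filter_info = out.fs.relpath(path, out.path)

        # Placeholder helpers, not defined in this sketch.
        _fetch_tree_if_needed()
        out.obj = _update_obj(out.get_obj(), path)
        # Write the updated output back to its dvc.yaml/.dvc file.
        out.stage.dump()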

But with index, you are working with a larger data structure, which you have to build and update. Then you have to diff two indexes again to find changes so that you can write them back to dvc.yaml/.dvc, which is again complicated. With outs_trie, we already have a good mapping between outputs and dvc.yaml/.dvc, and the updates happen in place, which makes it much easier. (I wish we had a data section where all outputs are tracked, which would have made all of these things simpler.)
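
The longest-prefix lookup that the outs_trie approach relies on can be sketched in plain Python. This is a toy illustration under stated assumptions, not DVC's implementation: `OUTS`, `find_output`, the tuple keys, and the sample paths are all hypothetical.

```python
from typing import Optional

# Hypothetical mapping from output path parts to the file that tracks them.
OUTS = {
    ("data",): "data.dvc",
    ("models", "model.pkl"): "dvc.yaml",
}


def find_output(path: str) -> Optional[tuple]:
    """Return (prefix, tracking_file) for the deepest output containing `path`."""
    parts = tuple(path.split("/"))
    # Try the longest prefix first, then progressively shorter ones.
    for end in range(len(parts), 0, -1):
        prefix = parts[:end]
        if prefix in OUTS:
            return prefix, OUTS[prefix]
    return None


match = find_output("data/images/cat.png")
```

Here a path nested inside a tracked directory resolves to the output that owns it, which is the mapping needed to write a change back to the right dvc.yaml/.dvc file.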

I see the benefits of using an index where we can assume all the files/datasets tracked as one single piece, but add/commit are selective updates, so I don't quite see the need for Index here. There are potential performance benefits of the Index in the future (when we introduce batching or getting rid of state, etc.) but that can be implemented outside the Index too.
Working with a more complicated data structure than you need might not be a good idea.

Anyway, I am looking at this only from the #4657 implementation perspective. What would Index provide for add/commit apart from code reuse?

I think we should wait before implementing #4657 in any case (cc @dberenbaum).

dberenbaum · 2023-04-26T15:27:59Z

> I think we should wait before implementing #4657 in any case (cc @dberenbaum).

Wait for what? Your rationale for outs_trie makes sense to me, but I don't follow why it means we should wait to implement #4657.

skshetry · 2023-04-26T16:38:21Z

> Wait for what?

If we want to use Index in dvc add/commit, the outs_trie implementation will be temporary, and we will have to rewrite virtual directory support based on Index.
So I think it's better to discuss and plan either way.

Let's wait for Ruslan, I want to know his thoughts.
Also, it depends on whether Ruslan was suggesting using Index in dvc add/commit as a primary data structure or an auxiliary one.

dberenbaum · 2023-04-26T18:18:31Z

Okay, makes sense, thanks for the explanation.

efiop · 2023-04-26T20:21:27Z

We should probably move this discussion to the related ticket.

> In operations like fetch/data-status, we can think of all data files tracked as one single piece.

@skshetry But we use Index views for targets even below output level of granularity. You can create a view from a path, tweak index to your liking and then use, for example, outs_trie, to map it back to stages. DataIndex also automatically fetches .dirs.

But if the way you see it doesn't involve index - that's totally fine too, I have no problems with it. Feel free to implement it the way you want it for now. outs_trie is likely going to stay and your effort won't be wasted even if we swap it for index later on.

skshetry · 2023-04-27T03:26:37Z

Views are more of a convenience, conceptually it's just an index though.

Could you please elaborate more on the Index? What is it, and how does it fit?
What does it solve? etc.
Should it be treated like Git's Index (which means we have to populate it in all operations even on add/commit, etc)?

> Feel free to implement it the way you want it for now. outs_trie is likely going to stay and your effort won't be wasted even if we swap it for index later on.

But if we use Index, we won't need outs_trie, right? As all the virtual operations have to be performed on the Index in that case.

efiop · 2023-04-28T10:20:14Z

> Could you please elaborate more on the Index? What is it, and how does it fit?
> What does it solve? etc.
> Should it be treated like Git's Index (which means we have to populate it in all operations even on add/commit, etc)?

DataIndex is a structure for managing data. It is below the level of outputs. It is similar to Git's Index, but more general and is not tied to real local workspaces. Even in git, you don't always have to use index to do things, you could write objects directly if you really want to, same here.

> But if we use Index, we won't need outs_trie, right? As all the virtual operations have to be performed on the Index in that case.

Not really, outputs are above data index level. DataIndex is only managing data, which outputs use. outs_trie will stay in one shape or form, because you need to map index entries to outputs/stages to serialize them.

dmpetrov · 2023-04-29T18:19:18Z

I'd appreciate more details on the index. Is there any document?

1. DVC already has the state, which was introduced for optimization. Was index introduced for the same purpose (optimization) or for something else?
2. Is there a separate index file/dir? What is the path to the file?
3. Does the user need to commit the index (I assume there is a separate index file)?
4. What if the user accidentally removes the index file?
5. Why is cloud versioning special, and why does it need a new data structure?

efiop · 2023-04-29T23:27:48Z

@dmpetrov Unfortunately there is no comprehensive document.

1. Index is a data management layer, one level below pipeline outputs. It happens to remove the need for state though, since index is serializable and contains the same metadata. Think git index.
2. It can be virtual or it can be persistent. The persistent one is a db file. It is in Repo's site_cache_dir (see your dvc doctor output).
3. No, user doesn't need to even know about it. We do write persistent index for optimization to avoid reparsing large repos and all their dvc files (e.g. when handling a particular git revision).
4. It is not essential, we will rebuild it if we need it (dvcfiles are the source of truth). Just like state.
5. Because cloud versioning is not about hash files/objects, but about particular metadata (e.g. version_ids). You could build hashfiles/objects/trees out of index, but that requires additional processing (e.g. hashing).

dmpetrov · 2023-04-30T08:22:27Z

> Index is a data management layer, one level below pipeline outputs. It happens to remove the need for state though, since index is serializable and contains the same metadata. Think git index.

It looks like an extension of the state, doesn't it? Why should these two be separate? An example would be helpful.

> dvcfiles are the source of truth

👍 the most critical part since dvc is "codification" tool.

> 5. cloud versioning is not about hash files/objects, but about particular metadata (e.g. version_ids).

Could you please elaborate on this? Why is version_ids special? It is conceptually the same as etag.

An extra question. A similar index can significantly speed up metrics and plots. Should it be the same index? Why?

efiop · 2023-04-30T16:01:07Z

> It looks like an extension of the state, doesn't it? Why should these two be separate? An example would be helpful.

It is not an extension of state. State is key-value storage that is an afterthought for data management, while index is hierarchical and can be used directly. The best analogy I can come up with is our state vs git index.
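
One way to picture the difference is a toy model: a flat key-value store answers "what do we know about this exact path?", while a hierarchical structure can also answer questions about whole subtrees. The sketch below uses hypothetical names (`flat_state`, `build_tree`, `iter_subtree`) and sample data; it is not the dvc-data API.

```python
# Flat key-value store: one opaque key per path.
flat_state = {
    "data/train/a.csv": {"size": 10},
    "data/train/b.csv": {"size": 20},
    "data/test/c.csv": {"size": 5},
}


def build_tree(entries):
    """Nest entries by path component so directories become subtrees."""
    root = {}
    for path, meta in entries.items():
        node = root
        *dirs, name = path.split("/")
        for part in dirs:
            node = node.setdefault(part, {})
        # Files are stored as tuples so they can be told apart from dirs.
        node[name] = ("file", meta)
    return root


def iter_subtree(node, prefix=()):
    """Yield (path_parts, meta) for every file under `node`."""
    for name, child in sorted(node.items()):
        if isinstance(child, tuple):
            yield prefix + (name,), child[1]
        else:
            yield from iter_subtree(child, prefix + (name,))


tree = build_tree(flat_state)
train_files = list(iter_subtree(tree["data"]["train"], ("data", "train")))
train_size = sum(meta["size"] for _, meta in train_files)
```

With the nested form, operations on a directory (listing, sizing, filtering) work on a subtree directly instead of scanning every key.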

> Could you please elaborate on this? Why is version_ids special? It is conceptually the same as etag.

But etag is not a hash either. We have misused it as a hash for external outputs and stuff, but that was wrong, since etag is a checksum, not a hash. Same with version_ids. etag/version_id/etc. is metadata that we capture and that serves like a checksum, in the sense that it allows us to detect changes, but it can't be used for universal content identification (version_ids with a particular path on a particular cloud give you access to particular data, but strictly speaking that's not universal even across file renames or copies on the same cloud). I think git index vs git objects is the best analogy here.
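
The distinction can be illustrated with a toy example. A content hash depends only on the bytes, so a copy or rename keeps the same identifier, while a version id is only meaningful together with a path on a particular remote. The `Entry` class and the `version_id` values below are made up for illustration and do not come from any cloud API.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    path: str
    version_id: str  # hypothetical opaque id assigned by the remote
    content: bytes


original = Entry("data/a.csv", "v1-abc", b"1,2,3\n")
# Same bytes copied to a new path: assume the remote assigns a fresh id.
copy = Entry("data/b.csv", "v1-xyz", b"1,2,3\n")


def content_hash(entry: Entry) -> str:
    """Identifier derived from the bytes alone, independent of location."""
    return hashlib.sha256(entry.content).hexdigest()


same_content = content_hash(original) == content_hash(copy)
same_version = original.version_id == copy.version_id
```

The hash recognizes the copy as the same content; the version ids, which differ here by assumption, only let you detect changes or fetch a specific object at a specific location.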

> An extra question. A similar index can significantly speed up metrics and plots. Should it be the same index? Why?

Since we cache index - yes, it can speed up those operations. But for metrics and plots there is an additional step of "rendering", so it will benefit from caching that separately as well.

dmpetrov · 2023-05-01T20:21:37Z

> It is not an extension of state. State is key-value storage that is an afterthought for data management, while index is hierarchical and can be used directly.

The state is just a table in DB. Why can't the hierarchical index be added to the same DB?

> The best analogy I can come up with is our state vs git index.

I'm not sure I understand the analogy. I got the hierarchy part (is that the Git analogy?). However, I don't understand the need for a separate DB instance/file.

> > Could you please elaborate on this? Why is version_ids special? It is conceptually the same as etag.
>
> But etag is not a hash either. We have misused it as hash for external outputs and stuff

Also from Dave's comment:

> dvc fetch now because it's needed for studio backend to use cloud versioning

I got the impression that this index was introduced for cloud versioning. Is my impression correct? The question is - can cloud versioning be implemented without this index?

> > A similar index can significantly speed up metrics and plots. Should it be the same index? Why?
>
> Since we cache index - yes, it can speed up those operations. But for metrics and plots there is an additional step of "rendering", so it will benefit from caching that separately as well.

The question is - will we need another DB instance/file to cache the metrics or the same file can be re-used?

So far, it feels like we are introducing too many DB files, while the best practice is to keep a single DB.

efiop · 2023-05-01T21:28:01Z

@dmpetrov

> The state is just a table in DB. Why can't the hierarchical index be added to the same DB?

It can be, but there is no reason. We already have state and legacy remote index dbs and new index is going to replace both of those, so it didn't make sense to try to shove it into state. Instead we'll complete migration to the new index and drop legacy state/remote indexes.

> I'm not sure I understand the analogy. I got the hierarchy part (is that the Git analogy?). However, I don't understand the need for a separate DB instance/file.

There is no need for a separate new file, see the answer above.

> I got the impression that this index was introduced for cloud versioning. Is my impression correct? The question is - can cloud versioning be implemented without this index?

It was needed for cloud versioning because it doesn't fit into old ways of doing things.

> The question is - will we need another DB instance/file to cache the metrics or the same file can be re-used?

Could be reused, sure, but I don't know what size and format metrics/plots will try to cache, so I can't say right now how and where it will be stored.

dmpetrov · 2023-05-01T21:41:44Z

> It can be, but there is no reason.

Introducing a new sqlite file (DB) instead of migrating the schema of the existing one? Please elaborate on why this is better.

> It was needed for cloud versioning because it doesn't fit into old ways of doing things.

I'd appreciate an explanation.

efiop · 2023-05-01T22:15:13Z

> Introducing a new sqlite file (DB) instead of migrating the schema of the existing one? Please elaborate on why this is better.

State and remote indexes are using diskcache and not pure sqlite, while new index is pure sqlite. diskcache makes some assumptions about db structure so it would be messy to try to shove them into one db. Considering that state and remote indexes are going to be dropped, we can just have a fresh start in the new index. So 1 new file is better than 2+ old ones.
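
As a rough illustration of what storing index entries in plain sqlite could look like, here is a minimal sketch using Python's standard `sqlite3` module. The `entries` table, its columns, and the sample rows are hypothetical and are not the schema that dvc-data actually uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per index entry, keyed by its path.
conn.execute("CREATE TABLE entries (key TEXT PRIMARY KEY, size INTEGER)")
conn.executemany(
    "INSERT INTO entries VALUES (?, ?)",
    [
        ("data/train/a.csv", 10),
        ("data/train/b.csv", 20),
        ("data/test/c.csv", 5),
    ],
)

# Subtree query: every entry under data/train/.
rows = conn.execute(
    "SELECT key, size FROM entries WHERE key LIKE ? ORDER BY key",
    ("data/train/%",),
).fetchall()
conn.close()
```

Owning the table layout directly is what makes this kind of query possible; a cache library that manages its own tables leaves less room for that.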

> I'd appreciate an explanation.

I've elaborated above in this issue. This is about how it works internally, with no user implication.

dmpetrov · 2023-05-02T01:31:08Z

> State and remote indexes are using diskcache and not pure sqlite, while new index is pure sqlite.

That's helpful! Additionally, it would be great to know the options that were considered - is there an option to do a DB migration (these DBs are compatible, I assume)?

> This is about how it works internally, with no user implication.

🤔 It's not for users, it is for the team to understand the motivation. Based on the discussion and the questions, it feels like there is a lack of understanding of the motivation behind the changes. Some design decisions were made without proper collaboration with the team and without alignment with the broader vision (how about metrics and plots?).

The initiative looks great, but it requires a discussion with the team. I strongly encourage the team to discuss and carefully consider such design decisions before proceeding with implementation.

shcheklein · 2023-05-02T02:08:12Z

btw (since I asked @efiop to create this, to better understand the timeline for -j and parallelism for some use cases, and to have a place to refer users to in the first place), and in case it's not 100% clear: this epic by itself is not something new. As far as I understand, most of the checkboxes were done a long time ago and, as @dberenbaum mentioned, we are migrating fetch (?) because it's needed for Studio cloud versioning. We are not making any new global decision (right, @efiop?).

It would be great to have some tech design specs behind the index, and other systems, for the whole team.

efiop · 2023-05-02T12:31:57Z

@dmpetrov

> That's helpful! Additionally, it would be great to know the options that were considered - is there an option to do a DB migration (these DBs are compatible, I assume)?

diskcache is a sqlite db, but it has its own assumptions about tables. It is not worth trying to shove them together. We are getting rid of it completely in favor of 1 file, so it is unreasonable to try to migrate or merge together with diskcache. Those are really all the options that we had. I don't think this issue of the temporary number of hidden db files is so critical.

> 🤔 It's not for users, it is for the team to understand the motivation. Based on the discussion and the questions, it feels like there is a lack of understanding of the motivation behind the changes. Some design decisions were made without proper collaboration with the team and without alignment with the broader vision (how about metrics and plots?).

Sure, there wasn't an official process with clear prelude, but this was a gradual process and we've talked about it in private for more than a year and lots of things were already implemented with that (e.g. cloud versioning).

@shcheklein

> We are not making any new global decision (right, @efiop?).

This issue is just for visibility for the things that were going on for a very long time already. This is not a brand-new effort or anything like that.

> It would be great to have some tech design specs behind the index, and other systems, for the whole team.

Not sure how to approach that at this point. There is https://github.com/iterative/dvc-data and how it is used in dvc. The analogies are from well-known git concepts. Do we have some great examples from other teams maybe that I could take as a template?

efiop · 2023-05-03T18:41:53Z

For the record: had a meeting with @skshetry and agreed that the legacy obj-based implementation, even though it would work for regular outputs, won't work for cloud versioning/import/etc. (there are no hashes and no objects there, hence no Tree objects), so it makes sense to also use index there, as discussed before.

The missing part in our toolset is serializing dvc's Index back into dvcfiles. We have some hacks in cloud versioning for that and maybe something like that would do, but we probably need to make a new separate function for that, similar to how we serialize stages into dvc.yaml and lockfiles.

dberenbaum · 2023-05-03T19:00:19Z

Sounds reasonable.

@skshetry Can we keep the serialization part as minimal as possible so it doesn't become an entire separate story for now?


dberenbaum · 2023-10-17T12:34:47Z

Aiming to do push this sprint

efiop added the epic label Apr 15, 2023

daavoo mentioned this issue Apr 24, 2023

gc: Add --not-in-remote arg. #9350

Merged

skshetry mentioned this issue May 9, 2023

improve performance of data status #9428

Closed

This was referenced May 11, 2023

checkout: use index checkout #9444

Merged

fetch: use index fetch #9424

Merged

efiop mentioned this issue Jul 5, 2023

SCM doesn't reflect remote status. iterative/vscode-dvc#1769

Closed

efiop mentioned this issue Jul 25, 2023

fetch/push/status: not handling config from other revisions #9754

Open

3 tasks

efiop added a commit to efiop/dvc-data that referenced this issue Jul 30, 2023

index: introduce basic push

473b2dc

Very basic implementation to start with. Will be tweaked later. Pre-requisite for iterative/dvc#9333

efiop mentioned this issue Jul 31, 2023

index: introduce basic push iterative/dvc-data#411

Merged

efiop added a commit to iterative/dvc-data that referenced this issue Jul 31, 2023

index: introduce basic push (#411)

290781d

Very basic implementation to start with. Will be tweaked later. Pre-requisite for iterative/dvc#9333

efiop mentioned this issue Aug 5, 2023

push: use index push #9807

Merged

efiop mentioned this issue Aug 17, 2023

Faster index building #9813

Closed

skshetry mentioned this issue Dec 8, 2023

Better control over jobs number for dvc add #8030

Closed

efiop mentioned this issue Feb 19, 2024

index: check dep.hash_name for imports #10270

Merged

dberenbaum added this to DVC Feb 19, 2024

github-project-automation bot moved this to Backlog in DVC Feb 19, 2024

dberenbaum moved this from Backlog to In Progress in DVC Feb 19, 2024

dberenbaum assigned efiop Feb 22, 2024

skshetry unassigned efiop Mar 6, 2024

skshetry mentioned this issue Mar 25, 2024

replace index and state with one (workspace? fs?) index #6916

Closed
