Some ideas for the next major version #6653
Replies: 6 comments 22 replies
-
Regarding hash algorithms and moving away from MD5, there is an existing discussion in #3069. The TL;DR is that DVC will probably have to use one of the SHA variants. Something like BLAKE2 would be better from a performance standpoint, but since none of the BLAKE variants are FIPS/NIST certified, choosing one would mean DVC could not be used anywhere that requires US-government/NIST compliance (which makes up a significant portion of the potential enterprise userbase).

Regarding cache structure: file chunking and other backwards-incompatible cache/remote structure changes have been planned for a while in #829, and research plus other prerequisite work in core DVC has been ongoing for the past several months.
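To make the hashing trade-off above a bit more concrete, here is a minimal sketch using only Python's standard `hashlib`. It is not DVC's implementation; the `file_digest` helper and its `chunk_size` default are assumptions made for illustration. It only shows that swapping MD5 for a SHA variant or BLAKE2 is a change of algorithm name when hashing is done incrementally.

```python
import hashlib
import io


def file_digest(stream, algorithm="sha256", chunk_size=1024 * 1024):
    """Hash a binary stream incrementally so large files never sit in memory.

    `file_digest` is a hypothetical helper for this sketch, not a DVC API.
    """
    hasher = hashlib.new(algorithm)
    # Read fixed-size chunks until the stream returns an empty bytes object.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        hasher.update(chunk)
    return hasher.hexdigest()


if __name__ == "__main__":
    data = b"example payload" * 1000
    # Same data, three candidate algorithms mentioned in the discussion.
    for name in ("md5", "sha256", "blake2b"):
        print(name, file_digest(io.BytesIO(data), name))
```

Note that on Python builds restricted to FIPS-approved algorithms, requesting `md5` this way may raise an error, which is one practical face of the compliance concern raised above.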
-
Workspace
There have been discussions to keep everything in [...] This isn't just about performance, it also helps with:
-
Remotes
Can you clarify what you see as the major benefits of tying the DVC remote to the Git repo? I can see a couple of benefits:
My concerns are:
I would rather see a middle ground where we document patterns for pushing Git repos to the same place as the DVC remote and include some features that make this easy to use.
-
Experiments
Can you clarify why you think keeping experiments in Git is bad? DVC is heavily tied to a Git workflow, so I think if we were to implement experiments outside of Git, we would encounter even more clunkiness when asking users to later move their experiments back into a Git workflow. I also think the confusion is more likely either a UX problem that we need to improve or something that would be hard to implement on our own because it relies on the complexity of Git.
-
Technology
Regarding analytics, we do use them, and they are anonymized, but there's no doubt that they make some people uncomfortable. We should also consider that they may be even more of a concern to a potential enterprise customer than to a typical user.
-
What would go into these read-only workspace snapshots? We had similar discussions about how to store experiments when the feature was first started. The reason we went with the git-based implementation is that the conclusions were that for experiments, we need a "snapshot" that gives us both the state of the repo code (and DVC params files) as well as the state of DVC tracked data (meaning .dvc/dvc.lock files). Git commits give us exactly that, as diffs, and allow us to take advantage of git's squash/merge/rebase capabilities as well. We considered generating/storing/applying our own diffs/patches, but decided that there was no reason to re-implement git ourselves. Especially when we considered the checkpoints use case - for checkpoints we need a linear patch set, and in order to support the use case of starting from some intermediate checkpoint and then making some code/param modification, we would just be re-implementing git branches.
To me this still sounds like the issue is with UI/UX (in the existing exp sharing workflows based on [...] And even aside from [...]
I disagree that anything we do in Git for experiments is "tampering with the user's setup". We do not modify any existing Git commits or refs in the user's repo, and we do not modify any Git configuration. Using custom refs in a user's local repo is nothing new; plenty of other tools do the same thing. Custom refs are a core Git feature intended to be used by third-party tools to extend Git functionality, which is exactly what DVC is doing.
-
While doing some work for #6547 and reading performance tickets and the documentation, I've come up with some backward-incompatible ideas. These are really questions open for discussion. Please feel free to show me the downsides of these ideas from the user's perspective.
Workspace
- dvc.yaml and .dvcignore files are expected to be edited by the user. Other than these, all tracking information should be moved to .dvc/.
- *.dvc files that track the contents should be moved to .dvc/tree. If there is a reason (that I can't think of) to keep separate files, they can mirror the file structure of the user's directory under .dvc/tree. Otherwise, we can keep the whole file hierarchy in a few files. This removes the need to walk all the files to detect changes, which would be a major improvement for large projects.
- [...] .dvc/tree, data structures for single files and directories can be merged. We don't need a .dir file to keep track of directories, and we can simplify operations on directories.

Hashing
- BLAKE2 seems like a good candidate.
- Something "NIST-compliant and fast" might be better.

Cache
- The cache.type option should be granular. A project may have 1,000,000 files to read but write only 2 of them. Instead of dvc unprotect, these two files can be set to cache.type=copy.
- [...] ABCDEF1234..., the two directories are AB/CD/, but currently we split the filename to EF1234.... This makes the cache path AB/CD/EF1234..., but we can keep the filename intact: AB/CD/ABCDEF1234.... This will simplify lookup operations, and we can increase the number of directories as needs arise, e.g., "if there are more than 1000 files in a cache dir, create dirs and move files". We can have an expanding/shrinking cache.
- [...] ABCDEF...1234, the cache structure can be: [...] Here whole.ext is the file's content and part1 ... part3 are partial contents; part.meta is the file that contains the parts' hashes and other metadata. Local operations can use the whole file, and remote operations can work on part1 ... for parallel uploads/downloads.
- [...] file1.jpg has a hash ABC123 and file2.jpg has EFE321 as hash values, if these are stored in a group132.tar file with the hash 789BFF, their cache structure becomes like [...] When one of the files is to be checked out, the archive is extracted. When one of the files changes, a new archive can be created from all the newer files, keeping the old one in place (so that an older version may be extracted if needed). In general, remotes/cache can use a "group files less than 1MB together if they are not checked out" rule.
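The two Cache ideas above (keeping the full hash as the filename under prefix directories, and storing a large file as parts plus a part.meta record) can be sketched roughly as follows. This is an illustration under stated assumptions, not DVC code: `cache_path`, `split_into_parts`, the `depth`/`width` parameters, the metadata keys, and the use of SHA-256 as a placeholder hash are all hypothetical choices made for this example.

```python
import hashlib


def cache_path(digest, depth=2, width=2):
    """Build a cache path that shards by hash prefix but keeps the full hash
    as the filename, e.g. ABCDEF1234 -> AB/CD/ABCDEF1234.

    Hypothetical helper: `depth` is the number of directory levels and
    `width` the number of hash characters per level.
    """
    dirs = [digest[i * width:(i + 1) * width] for i in range(depth)]
    return "/".join(dirs + [digest])


def split_into_parts(data, part_size):
    """Split content into fixed-size parts and describe them in a metadata
    record, loosely mirroring the part1..partN plus part.meta idea.

    SHA-256 is only a placeholder; the thread does not settle on a hash.
    """
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    meta = {
        "whole": hashlib.sha256(data).hexdigest(),
        "size": len(data),
        "parts": [
            {
                "name": f"part{n}",
                "hash": hashlib.sha256(part).hexdigest(),
                "size": len(part),
            }
            for n, part in enumerate(parts, start=1)
        ],
    }
    return parts, meta


if __name__ == "__main__":
    print(cache_path("ABCDEF1234"))           # two directory levels
    print(cache_path("ABCDEF1234", depth=3))  # one more level, same filename
    parts, meta = split_into_parts(b"x" * 10, part_size=4)
    print([p["name"] for p in meta["parts"]])
```

Because the filename never changes, adding a directory level only moves files and does not require recomputing or re-parsing names, which is the property the "expanding/shrinking cache" point relies on. Each part can also be verified and transferred independently using the per-part hashes in the metadata.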
Remotes

Experiments
- [...] .dvc/exps/METADATAHASH/ is enough and this satisfies most semantics and simplifies operations. Currently, we have to differentiate "git remote" and "dvc remote" in every sentence we write in the docs. This shows there is something wrong with the current design.
- [...] .dvc/exps and use this clone. Tampering with the user's setup is not something I'm comfortable with.

Git
- [...] dvc-UNIQUE-ID branch that can also be used in the remotes.) If the user wants to peek at what's inside it, they can check it out and look.

File Formats

Technology
- [...] dvc stats command to show the stats and ask the user to send these anonymously to us. Instead of analytics, we need some way to get users' exceptions/errors/benchmarks when something goes awry.

@shcheklein @dmpetrov @dberenbaum