Some ideas for the next major version #6653

iesahin · 2021-09-20T08:15:52Z

While doing some work for #6547 and reading performance tickets and the documentation, I came up with some backward-incompatible ideas. These are really questions open for discussion. Please feel free to point out the downsides of these ideas from the user's perspective.

Workspace

We shouldn't add files to the user's directory if the user is not expected to edit them. AFAIK only the dvc.yaml and .dvcignore files are expected to be edited by the user. Other than these, all tracking information should be moved to .dvc/.
The *.dvc files that track contents should be moved to .dvc/tree. If there is a reason (that I can't think of) to keep separate files, they can mirror the file structure of the user's directory under .dvc/tree. Otherwise, we can keep the whole file hierarchy in a few files. This removes the need to walk all the files to detect changes, which would be a major improvement for large projects.
If content tracking can be moved to .dvc/tree, the data structures for single files and directories can be merged. We don't need a .dir file to keep track of directories, and we can simplify operations on directories.

Hashing

We need to replace MD5 with something modern. Something Git uses looks more natural to me, but ~~BLAKE2 seems like a good candidate.~~ Something "NIST-compliant and fast" might be better.
Instead of content-hash, we can use metadata-hash to detect changes for performance.
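
As a rough illustration of the metadata-hash idea, here is a minimal sketch that hashes a file's size and modification time instead of its bytes. The function names and the choice of fields are assumptions for illustration, not anything DVC defines. SHA-256 is used only because it is in the standard library.

```python
import hashlib
import os


def metadata_hash(path: str) -> str:
    """Hash a file's metadata (size and mtime) instead of its content."""
    st = os.stat(path)
    # Assumption: size plus nanosecond mtime is enough to signal a change.
    key = f"{st.st_size}:{st.st_mtime_ns}".encode()
    return hashlib.sha256(key).hexdigest()


def has_changed(path: str, recorded: str) -> bool:
    """Compare the current metadata hash with a previously recorded one."""
    return metadata_hash(path) != recorded
```

The trade-off is that an edit preserving both size and mtime goes unnoticed, so a check like this would be a fast first pass rather than a replacement for content hashing.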

Cache

The cache.type option should be granular. A project may have 1,000,000 files to read but write only 2 of them. Instead of running dvc unprotect, those two files could be set to cache.type=copy.
The cache should be able to accommodate 1,000,000 files without a performance hit, so instead of 256 first-level directories it can use the first 3 hex digits for 4K directories, or use 2/2/ for a deeper structure.
For a file with a content hash ABCDEF1234..., the two directories are AB/CD/, but currently we split the filename to EF1234.... This makes the cache path AB/CD/EF1234..., but we can keep the filename intact: AB/CD/ABCDEF1234.... This will simplify lookup operations, and we can increase the number of directories as the need arises, e.g., "if there are more than 1000 files in a cache dir, create dirs and move files". We can have an expanding/shrinking cache.
The cache can use the hash as a directory name and put the file inside it. For a file with the hash ABCDEF...1234, the cache structure can be:

cache/
  AB/
     CD/
        ABCDE...1234/
              whole.ext
              part.meta
              part1
              part2
              part3

Here whole.ext is the file's content and part1...part3 are partial contents, part.meta is the file that contains parts' hash and other metadata. Local operations can use the whole and remote operations can work on part1... for parallel upload/downloads.
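
The layout above can be sketched as follows. This is only an illustration under assumptions: the function names, the fixed-size chunking, the JSON format of part.meta, and the use of SHA-256 are all hypothetical and not part of DVC.

```python
import hashlib
import json
from pathlib import Path


def entry_dir(cache: Path, digest: str, levels: int = 2, width: int = 2) -> Path:
    """Return cache/AB/CD/ABCDEF.../, keeping the full hash as the last component."""
    prefixes = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return cache.joinpath(*prefixes, digest)


def store(cache: Path, src: Path, part_size: int = 4) -> Path:
    """Store src as whole.ext plus fixed-size parts and a part.meta index."""
    # Sketch only: reads the whole file into memory; a real tool would stream.
    data = src.read_bytes()
    digest = hashlib.sha256(data).hexdigest()
    target = entry_dir(cache, digest)
    target.mkdir(parents=True, exist_ok=True)
    (target / "whole.ext").write_bytes(data)
    meta = []
    for n, start in enumerate(range(0, len(data), part_size), start=1):
        chunk = data[start:start + part_size]
        (target / f"part{n}").write_bytes(chunk)
        meta.append({"name": f"part{n}", "sha256": hashlib.sha256(chunk).hexdigest()})
    # part.meta lists every part with its own hash (hypothetical JSON format).
    (target / "part.meta").write_text(json.dumps(meta))
    return target
```

Because the directory name is the full hash, changing `levels` or `width` only moves directories around; the lookup key itself never changes.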

Another feature may be for "many small files" use case. The cache should be able to group these files into a single tar file and use these to upload to/download from remotes. It's possible to track these archives and their updates by "cache pointers." Supposing file1.jpg has a hash ABC123 and file2.jpg has EFE321 as hash values, if these are stored in a group132.tar file with the hash 789BFF, their cache structure becomes like

AB/C1/ABC123/archived-in (that contains 789BFF)
EF/E3/EFE321/archived-in (that contains 789BFF)
78/9B/789BFF/archive.tar

When one of the files is to be checked out, the archive is extracted. When one of the files changes, a new archive can be created from all the newer files, keeping the old one in place (so that the older version can still be extracted if needed). In general, remotes/cache can use a "group files smaller than 1MB together if they are not checked out" rule.
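
A minimal sketch of the "cache pointer" idea, assuming the 2/2 layout described earlier. All function names here are hypothetical, archive members are named by content hash purely for convenience, and SHA-256 stands in for whatever hash is eventually chosen.

```python
import hashlib
import shutil
import tarfile
from pathlib import Path


def sha(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def entry_dir(cache: Path, digest: str) -> Path:
    return cache / digest[:2] / digest[2:4] / digest


def archive_small_files(cache: Path, files: list[Path], workdir: Path) -> str:
    """Pack files into one tar, store it, and leave 'archived-in' pointers."""
    tmp_tar = workdir / "group.tar"
    with tarfile.open(tmp_tar, "w") as tar:
        for f in files:
            # Members are named by content hash so they can be found later.
            tar.add(str(f), arcname=sha(f.read_bytes()))
    archive_hash = sha(tmp_tar.read_bytes())
    dest = entry_dir(cache, archive_hash)
    dest.mkdir(parents=True, exist_ok=True)
    shutil.move(str(tmp_tar), str(dest / "archive.tar"))
    for f in files:
        pointer_dir = entry_dir(cache, sha(f.read_bytes()))
        pointer_dir.mkdir(parents=True, exist_ok=True)
        (pointer_dir / "archived-in").write_text(archive_hash)
    return archive_hash


def read_file(cache: Path, digest: str) -> bytes:
    """Follow the pointer for digest and read the member out of its archive."""
    archive_hash = (entry_dir(cache, digest) / "archived-in").read_text()
    with tarfile.open(entry_dir(cache, archive_hash) / "archive.tar") as tar:
        return tar.extractfile(digest).read()
```

Creating a new archive for changed files would just write new pointers; the old archive stays in place under its own hash, as described above.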

Remotes

Remotes must know which DVC projects use them. Each DVC repository should have a unique ID, and a remote should be able to list which repositories it serves.
Remotes should support splitting/merging large files. This is possible by introducing the above cache structure.
Remotes can in essence be Git repositories + DVC cache, to keep track of the metafiles of different remotes. A remote can have a different branch for each DVC repository, and metafiles can be pulled/pushed first to see what has changed for each remote.

Experiments

Experimentation doesn't need Git. Putting read only snapshots of the workspace to .dvc/exps/METADATAHASH/ is enough and this satisfies most semantics and simplifies operations. Currently, we have to differentiate "git remote" and "dvc remote" in every sentence we write in the docs. This shows there is something wrong with the current design.
Experiments shouldn't tamper with the user's Git directory. If we really need Git operations for experiments, we can clone the user's repo to .dvc/exps and use this clone. Tampering with the user's setup is not something I'm comfortable with.

Git

Git operations should be automated as much as possible. DVC shouldn't ask its metafiles to be committed into Git repositories after doing its job.
DVC can keep track of its metafiles in Git in a completely separate branch. (dvc-UNIQUE-ID branch that can also be used in the remotes.) If the user wants to peek at what's inside it, they can checkout and look.

File Formats

I believe it's better to use a single format for config, pipelines, etc. Instead of "ini", we can use YAML for the config files as well. Or, as editing YAML is a bit cumbersome and leads to parsing errors now and then (dvc.yaml: very hard to edit, can we detect and do human readable/helpful schema errors? #5371), we can use TOML as the default for all user-editable file formats. For serialization/deserialization, it might be better to use JSON.
I believe adding for each and similar constructs to configuration files is a step in the wrong direction. We can support some kind of "user-generated config" functionality in the languages we support, like templating the pipeline definitions. But adding for-each to the YAML isn't a good way to do this. While trying to make it a "good configuration format", it becomes a "bad programming language."

Technology

We need to use a "systems programming language" to write the basic operations and use this as a library. Currently, even tracking the requirements becomes a burden.
We can release this library bundled for Python, Julia, R, JavaScript etc. to use DVC within different products.
We had better remove analytics. IMO it provides less value than it takes away. We can have a dvc stats command to show the stats and ask the user to send these to us anonymously. Instead of analytics, we need some way to get users' exceptions/errors/benchmarks when something goes awry.

@shcheklein @dmpetrov @dberenbaum

pmrowla · 2021-09-21T02:56:46Z

Regarding hash algorithms/moving away from MD5, there is an existing discussion here: #3069

The tl;dr version is that DVC will probably have to use one of the SHA variants. Something like Blake2 would be better from a performance standpoint, but since none of the Blake variants are FIPS/NIST certified, going with a Blake variant would mean DVC could not be used anywhere that is required to be US-government/NIST compliant (which makes up a significant portion of the potential enterprise userbase).
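
For anyone who wants to check the performance side on their own machine, here is a small timing sketch using only the standard library's hashlib. The payload size and the set of algorithms are arbitrary choices for illustration, and a single run like this is not a rigorous benchmark.

```python
import hashlib
import time


def time_hash(name: str, data: bytes) -> float:
    """Return the seconds taken to hash data once with the named algorithm."""
    start = time.perf_counter()
    hashlib.new(name, data).hexdigest()
    return time.perf_counter() - start


payload = b"\0" * (16 * 1024 * 1024)  # 16 MiB of zeros, purely illustrative
timings = {name: time_hash(name, payload) for name in ("sha256", "sha512", "blake2b")}
for name, seconds in timings.items():
    print(f"{name:8s} {seconds:.4f}s")
```

Results depend heavily on the CPU and on how the Python build was compiled, so the numbers are only meaningful relative to each other on the same machine.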

Regarding cache structure - file chunking and other backwards incompatible cache/remote structure changes have already been planned for a while: #829 (and research plus other related core dvc prerequisite work has been ongoing for the past several months)

2 replies

iesahin Sep 21, 2021
Author

since none of the Blake variants are FIPS/NIST certified, going with a Blake variant would end up making it so DVC could not be used anywhere that is required to be US-government/NIST compliant

I wasn't aware of this. I thought it was only BLAKE3 not being compliant. That's a deal-breaker. Thank you.

We can go for the "fastest NIST-compliant" one in this case.

iesahin Sep 21, 2021
Author

Regarding cache structure - file chunking and other backwards incompatible cache/remote structure changes have already been planned for a while: #829 (and research plus other related core dvc prerequisite work has been ongoing for the past several months)

Yes, these ideas are not that new. Some of them are expressed elsewhere already. File chunking affects all other operations I believe.

dberenbaum · 2021-09-23T20:56:49Z

Collaborator

Workspace

There have been discussions to keep everything in dvc.yaml and dvc.lock, although most likely we will have to continue to support .dvc files regardless for some time to avoid breaking older repos. It's possible that dvc.lock could also be moved inside of .dvc, although I don't think this has been discussed much.

This isn't just about performance, it also helps with:

Consistency: It's not that clear why we have version info spread across .dvc files and dvc.lock.
Readability: Having all paths specified within dvc.yaml makes it easier for users to see everything that dvc is tracking.
Decluttering: No need for repos to be littered with .dvc files.
Separation of concerns: Although adding data to dvc.yaml seems like it's moving towards combining data management and pipelines, it might actually better separate them. If we had a data section within dvc.yaml, stages could be simplified to avoid data management options like cache: false.
Gitignore issues: This can be annoying for users when the .dvc file ends up in a path that is gitignored.

0 replies

dberenbaum · 2021-09-23T21:07:29Z

Collaborator

Remotes

Can you clarify what you see as the major benefits of tying the DVC remote to the Git repo?

I can see a couple of benefits:

Users can potentially push/pull from one location.
Users can see everything about the repo in one location.

My concerns are:

Users will have their own repos locally and in other places like Github, which will create confusion when they are out of sync with the DVC remote.
Users will want features to manage the Git repos (this seems like it leads to being another Dagshub).

I would rather see a middle ground where we document patterns for pushing git repos to the same place as the dvc remote and include some features that make this easy to use.

4 replies

iesahin Sep 24, 2021
Author

I may not have expressed myself fully but what I mean was DVC remotes should be tracking the status of their data and where everything resides. It's an internal Git repository to keep track of metafiles, not a remote in some other place.

That was an afterthought about making remotes aware of their data and the DVC repositories they serve. A set of metafiles in remotes like dvc-<ID>-<timestamp>.meta is probably enough to show what actions were taken by repositories, what data was pushed/pulled, from which IP address, etc. Putting a timestamp on these files should be enough. These can be collected and processed first to see what's inside the DVC repository, before attempting to download the actual data files.

I was thinking that, instead of timestamps, we could keep these metafiles in internal Git repositories.

Thinking again, I see this was a bad idea. We can't support local Git repositories in all kinds of remotes (AWS, Azure, etc.). Defining a file format to list the files and operations is enough.
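
As a sketch of what such a file format could look like, the snippet below writes one dvc-<ID>-<timestamp>.meta record per operation and reads them back. The JSON fields, the function names, and the use of a local directory as a stand-in for the remote are all assumptions for illustration.

```python
import json
import time
from pathlib import Path


def write_meta(remote_root: Path, repo_id: str, action: str, hashes: list[str]) -> Path:
    """Write one dvc-<ID>-<timestamp>.meta record describing a push or pull."""
    timestamp = time.time_ns()
    record = {"repo": repo_id, "action": action, "objects": hashes, "timestamp": timestamp}
    path = remote_root / f"dvc-{repo_id}-{timestamp}.meta"
    path.write_text(json.dumps(record))
    return path


def objects_for_repo(remote_root: Path, repo_id: str) -> set[str]:
    """Collect every object hash that repo_id has pushed, from metafiles only."""
    found = set()
    for meta in remote_root.glob("dvc-*.meta"):
        record = json.loads(meta.read_text())
        # Filter on the recorded ID rather than the filename to avoid prefix clashes.
        if record["repo"] == repo_id and record["action"] == "push":
            found.update(record["objects"])
    return found
```

Since only small metafiles are read, listing what a repository has stored would not require touching the data objects themselves.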

dberenbaum Sep 25, 2021
Collaborator

I'm still unclear on what problem this is intended to solve. Most of this metadata is being tracked in a Git repo already. Is that information insufficient? Is there some issue because the metadata is detached from the data itself?

iesahin Sep 27, 2021
Author

DVC remotes should be aware of the data they contain and which DVC repository it belongs to. This can simplify a lot, like getting the list of the files in remotes or deleting only the project-related elements, and it can add currently missing functionality, like tracking who uploaded a certain file to a remote. They can be used to track incomplete transfers as well. We need self-aware remotes to solve most of the upload/download related issues.

shcheklein Sep 27, 2021
Maintainer

We have a plan to reconsider how data storage and transfer is organized in 3.0 (e.g. chunk objects, etc). I'm not sure if we have an epic ticket or doc for this (cc @efiop @dberenbaum ), but it would be nice to hear some specific suggestions/ideas about the implementation. As you mentioned, it's hard for us to support Git repo on top of AWS - we had quite a lot of discussions about this in the past and it always felt too complicated.

dberenbaum · 2021-09-23T21:13:16Z

Collaborator

Experiments

Can you clarify why you think experiments in Git is bad?

DVC is heavily tied to a Git workflow, so I think if we were to implement experiments outside of Git we would encounter even more clunkiness in trying to ask users to later move their experiments back into a Git workflow. I also think the confusion more likely comes either from UX that we need to improve or from something that would be hard to implement on our own because it relies on the complexity of Git.

10 replies

dberenbaum Oct 1, 2021
Collaborator

Rather than focus on whether to use Git, could we focus on the problems with experiments today?

Feel free to make a plan if it's useful for you, but I think more concrete motivation is needed to consider it. Non-Git workflows are rare for DVC users, so this doesn't add enough value to me to overhaul experiments. A lot of this was already considered when implementing experiments, and as @pmrowla mentioned, most issues would be present with or without Git. Very little of the feedback we've gotten on experiments from users has been about problems with merging or Git workflows.

The non-Git proposal doesn't seem that different from the current implementation, other than leaving it to either users or the DVC team to deal with the complexities that Git is handling. To new users, experiments should look like the proposals you made above, but DVC should also handle and guide users through more complex merges.

iesahin Oct 4, 2021
Author

Very little of the feedback we've gotten on experiments from users has been about problems with merging or Git workflows.

I believe this is the case too, but my interpretation is different: Currently, only the Git users are able to use the experiments, so we don't get input from non-Git users.

The non-Git proposal doesn't seem that different from the current implementation, other than leaving it to either users or the DVC team to deal with the complexities that Git is handling.

I tend to leave these complexities to the user. Currently, the team is trying to handle DVC + Git problems. I'd like them to focus on DVC only. The user may decide to track the experiments in Git, and that's a good path we should also support, but the current implementation requires using Git. That's a hindrance to expanding the user base and also to the flexibility of the design.

If experiments didn't depend on Git, someone who knows Git at the currently required level could run

git checkout -b exp-$(date +%s)
dvc exp run -S myparam=myvalue
git add .
git commit -m "experiment"

to get all the benefits of a Git workflow.

However, separating Git from the current dvc exp run requires much higher levels of Git and DVC knowledge. We are intentionally locking ourselves to Git.

Other than Git, the experiments only had one major flaw, which is requiring a dvc init and a pipeline to run them. You're already doing great work on that front, so I didn't mention it.

dberenbaum Oct 20, 2021
Collaborator

Thanks for all this discussion! I wanted to take some more time to think about it before responding.

It got me thinking about whether DVC experiments should provide a traditional experiment tracking experience or something more. I would argue that DVC can provide "experiment versioning," which has benefits over traditional tracking. However, I agree with you that this is more complex for users. I also think that experiments currently suffer both from asking users to understand too much of Git and other implementation details and from a lack of clarity about the expected workflow. I tried to summarize my thoughts in https://www.notion.so/iterative/Experiment-Workflows-3873c4f3cc2e49c6a7871a831bc8302b.

shcheklein Oct 20, 2021
Maintainer

@jorgeorpinel here we can get some idea for benefits of using DVC for the use case ^^

jorgeorpinel Nov 2, 2021

Currently, we have to differentiate "git remote" and "dvc remote" in every sentence we write in the docs. This shows there is something wrong with the current design.

@iesahin this is no longer the case as we have discussed. That was more of an issue with the explanation than with dvc exp I think.

We are having feature creep by adding seemingly free features from Git to DVC...

This I have argued for too actually, but about specific utils that just wrap Git like exp branch (rel. #5896).

We don't have to solve the user's Git problems in experiments.
DS people don't have to know about the intricacies of Git.

Those 2 statements seem contradictory though.

Let's say exp apply fails because of conflicts. This creates confusion...

@dberenbaum to be fair it can also be confusing when you apply an experiment and it deletes all your unrelated unstaged files without warning (can be destructive). But this could be addressed with a confirmation prompt.

dberenbaum · 2021-09-23T21:18:24Z

Collaborator

Technology

Regarding analytics, we do use them, and they are anonymized, but there's no doubt that they make some people uncomfortable. We should also consider that they may be even more of a concern to a potential enterprise customer than to a typical user.

1 reply

iesahin Sep 24, 2021
Author

I'm not against it, but the way it's presented now is too "alarming." The message looks like "we collect analytics and you probably don't want it." This can turn into "you don't want this tool anyway" quickly.

It would be better as opt-in. We can provide some benefit if they opt in, and ask for this at the beginning. For CLI applications, this kind of analytics collection is rare even for the typical user.

pmrowla · 2021-09-24T02:21:50Z

Putting read only snapshots of the workspace to .dvc/exps/METADATAHASH/ is enough and this satisfies most semantics and simplifies operations.

What would go into these read-only workspace snapshots?

We had similar discussions about how to store experiments when the feature was first started. The reason we went with the git-based implementation is that the conclusions were that for experiments, we need a "snapshot" that gives us both the state of the repo code (and DVC params files) as well as the state of DVC tracked data (meaning .dvc/dvc.lock files). Git commits give us exactly that, as diffs, and allow us to take advantage of git's squash/merge/rebase capabilities as well.

We considered generating/storing/applying our own diffs/patches, but decided that there was no reason to re-implement git ourselves. Especially when we considered the checkpoints use case - for checkpoints we need a linear patch set, and in order to support the use case of starting from some intermediate checkpoint and then making some code/param modification, we would just be re-implementing git branches.

Currently, we have to differentiate "git remote" and "dvc remote" in every sentence we write in the docs. This shows there is something wrong with the current design.

To me this still sounds like the issue is with UI/UX (in the existing exp sharing workflows based on exp push/pull) and is unrelated to the actual internal implementation of experiments.

And even aside from exp push pull, the user is already used to differentiating between git and DVC remotes - any time the user runs git clone/push/pull they have to specify git remotes (or define the per-branch default), and any time they run dvc push/pull they have to specify DVC remotes (or define a default).

Experiments shouldn't tamper with the user's Git directory. If we really need Git operations for experiments, we can clone the user's repo to .dvc/exps and use this clone. Tampering with the user's setup is not something I'm comfortable with.

I disagree that anything we do in git for experiments is "tampering with the user's setup". We do not modify any existing git commits or refs in the user's repo and do not modify any git configuration. Using custom refs in a user's local repo is nothing new, there are plenty of other tools that do the same thing. Custom refs are a core git feature intended to be used by 3rd party tools to extend git functionality, which is exactly what DVC is doing.

5 replies

iesahin Sep 24, 2021
Author

What would go into these read-only workspace snapshots?

Experiment results.

In essence, experiments are running a command/pipeline and getting models/plots/metrics. When we introduce Git to this workflow, it doesn't come free; it comes with its own concepts like branch/tag/commit. If the user is savvy enough to know what a Git branch is, they are probably also savvy enough to run an experiment (without DVC) and create a separate branch for it afterwards.

With Git comes a certain way of thinking. We are thinking of experiments as "Git objects", and this complicates the basic idea unnecessarily.

Suppose we design experiments to work like

copy the current pipeline and dependents to a temp dir in .dvc/exps/$(date +%s)
run the pipeline (or given command)
keep the changed artifacts there, delete the rest (or keep them)
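
The three steps above can be sketched roughly as follows. This is an illustration only: the function names are hypothetical, the whole workspace is copied rather than just the pipeline and its dependencies, and two runs started within the same second would collide on the directory name.

```python
import shutil
import subprocess
import time
from pathlib import Path


def run_snapshot_experiment(workspace: Path, command: list[str]) -> Path:
    """Copy the workspace to .dvc/exps/<timestamp>, run command there, return the copy."""
    exp_dir = workspace / ".dvc" / "exps" / str(int(time.time()))
    # Skip .dvc and .git so the snapshot doesn't recurse into itself.
    shutil.copytree(workspace, exp_dir, ignore=shutil.ignore_patterns(".dvc", ".git"))
    subprocess.run(command, cwd=exp_dir, check=True)
    return exp_dir


def changed_files(workspace: Path, exp_dir: Path) -> list[Path]:
    """List files in the snapshot that are new or differ from the workspace."""
    changed = []
    for f in sorted(p for p in exp_dir.rglob("*") if p.is_file()):
        rel = f.relative_to(exp_dir)
        original = workspace / rel
        if not original.is_file() or original.read_bytes() != f.read_bytes():
            changed.append(rel)
    return changed
```

The result of `changed_files` is what the "keep the changed artifacts, delete the rest" step would retain; the user's working directory is never modified by the run.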

What will change from the user's perspective? It will simplify a lot of things on our part too. If the user knows Git and wants to create a branch, we can help them by dvc exp branch, but I believe most of our potential users don't use Git, or don't want to tie these experiments to Git.

DVC is perfectly usable for the data without Git. The user can keep their data versions in different directories and DVC will work that way too.

we need a "snapshot" that gives us both the state of the repo code (and DVC params files) as well as the state of DVC tracked data (meaning .dvc/dvc.lock files). Git commits give us exactly that, as diffs, and allow us to take advantage of git's squash/merge/rebase capabilities as well.

"Snapshot" is the copy of the current directory, and we are able to get the dependents of the pipeline already and can also just copy them. I believe, the user doesn't want to "make another branch" or "create a Git object", they want to run the pipeline and get results.

Do we really need "squash/merge/rebase" capabilities here? Another disadvantage is when we begin to think in terms of Git objects, everything becomes a Git object.

iesahin Sep 24, 2021
Author

Currently, we have to differentiate "git remote" and "dvc remote" in every sentence we write in the docs. This shows there is something wrong with the current design.

To me this still sounds like the issue is with UI/UX (in the existing exp sharing workflows based on exp push/pull) and is unrelated to the actual internal implementation of experiments.

And even aside from exp push pull, the user is already used to differentiating between git and DVC remotes - any time the user runs git clone/push/pull they have to specify git remotes (or define the per-branch default), and any time they run dvc push/pull they have to specify DVC remotes (or define a default).

"Remote" had a single meaning before experiments. Now we have two kinds of remotes that we have to differentiate in all actions. You can't use that kind of remote in this command, this kind of remote is configured by that command etc. We have two, completely separate ideas of remotes. This is simply not a UI/UX problem. There is a concept drift here.

Before this, anytime they ran Git commands, they were using Git remotes. Anytime they ran DVC commands, they were using DVC remotes. Now when they run some DVC commands, they may be using Git remotes, and sometimes those commands may or may not also use the DVC remotes. Instead of simplifying experiments for a Data Scientist, we have now complicated everything.

iesahin Sep 24, 2021
Author

Experiments shouldn't tamper with the user's Git directory. If we really need Git operations for experiments, we can clone the user's repo to .dvc/exps and use this clone. Tampering with the user's setup is not something I'm comfortable with.

I disagree that anything we do in git for experiments is "tampering with the user's setup". We do not modify any existing git commits or refs in the user's repo and do not modify any git configuration. Using custom refs in a user's local repo is nothing new, there are plenty of other tools that do the same thing. Custom refs are a core git feature intended to be used by 3rd party tools to extend git functionality, which is exactly what DVC is doing.

They are probably Git tools and the user's intention is to provide some extra functionality for Git. Git is primarily used for textual data. These are all basic assumptions that we don't share.

Our user base doesn't have to know Git or its model, and doesn't have to know Git objects. They may be using only "add/commit/push" commands to keep a backup. They might never have heard of a "branch."

Suppose we begin to support "pipeline-free experiments" like dvc exp run --command python train.py. How will we decide whether some artifact is to be put in the Git remote or the DVC remote? Do we also consider 500MB model files as Git objects? Where will we push them?

Before "Git-dependent experiments", we had clear answers for all remote-related tasks. Now we have to think twice about each of these questions.

pmrowla Sep 24, 2021

What will change from the user's perspective?

If we are only storing artifacts, we lose reproducibility, which IMO is one of the more important features provided by DVC pipelines and by extension, experiments.

Do we really need "squash/merge/rebase" capabilities here?

If we are only storing DVC artifacts, then no, most likely we don't.

But if we consider code/dependency changes to be part of the experiment, then yes, I'd say that we do. As soon as we get into needing to apply someone else's shared experiment changes into a user's workspace, we will run into merge conflicts. Git is usually capable of resolving these conflicts itself since it has information about shared commit/branch history. If we are storing experiments as standalone patches, we lose that history and essentially have to just force apply everything and hope that we aren't overwriting anything important.

DVC is perfectly usable for the data without Git.

DVC does function for simple data management scenarios without git, but I would not really consider the --no-scm usage to be "perfectly usable". It restricts users to a very limited subset of DVC functionality (anything that requires actual historical versioning is disabled).

I would argue that we should focus on educating users on best practices for data/pipeline management with Git+DVC rather than bending over backwards to accommodate --no-scm scenarios.

"Remote" had a single meaning before experiments. Now we have two kinds of remotes that we have to differentiate in all actions. ... This is simply not a UI/UX problem.

Before this, anytime they ran Git commands, they were using Git remotes. Anytime they ran DVC commands, they were using DVC remotes.

This still sounds like a UI/UX problem to me. Experiment sharing in its current form was built on the assumption that users are already used to using git push/pull + dvc push/pull for sharing experiments as git commits (which was how Git+DVC experiment/pipeline state would be shared before dvc exp existed) - exp push/pull is just a shortcut for git push/pull + dvc push/pull, and uses 2 different remotes because git and DVC use different remotes.

If that assumption is flawed (or exp push/pull as a shortcut is too confusing), we should just revisit the experiment sharing scenario and come up with a better workflow.

Suppose we begin to support "pipeline-free experiments" like dvc exp run --command python train.py. How will we decide whether some artifact is to be put in the Git remote or the DVC remote? Do we also consider 500MB model files as Git objects? Where will we push them?

My understanding of the "pipeline-free experiment" scenario is that it's not actually pipeline-free, it's just DVC pipelines with "sane defaults" dependency & output paths.

I haven't been involved in the product discussions for this feature, but I would think that in this case we would just handle all output artifacts as DVC-tracked data. While Git may be somewhat better than DVC at handling small text-only artifacts (like typical metrics files), Git is significantly worse than DVC at handling large binary artifacts. When choosing the best of the 2 as a sane default, it seems like the obvious choice would be to default to tracking all outputs with DVC.

(And cache: True is already the default state for any output in DVC)

iesahin Sep 27, 2021
Author

If we are only storing artifacts, we lose reproducibility, which IMO is one of the more important features provided by DVC pipelines and by extension, experiments.

Snapshots must contain all the elements for experiments for reproducibility. If this means copying/symlinking the entire directory, that's OK. Local disk space is cheap. We can also show the amount of disk space each experiment takes and user can delete them.

But if we consider code/dependency changes to be part of the experiment, then yes, I'd say that we do. As soon as we get into needing to apply someone else's shared experiment changes into a user's workspace, we will run into merge conflicts.

If there happen to be merge conflicts, it's up to the Git user to merge the two different elements. In Git repositories, experiments can keep track of the last commit ID and require dvc exp apply to put the code/params/text on top of that. Then it becomes the user's responsibility to keep a clean commit history or different branches. We won't need to solve these.

DVC does function for simple data management scenarios without git, but IMO I would not really consider the --no-scm usage to be "perfectly usable". It restricts users to a very limited subset of DVC functionality (anything that requires actual historical versioning is disabled)

If I want to share my music collection or datasets with you, I can dvc add them, dvc push to Google Drive, zip the directory, and send it to you. You can unzip the directory and pull the ones you need.

I can also send you my latest experiment if I run the pipeline with dvc repro, but not with dvc exp run. For history, we may need Git but I don't believe the majority of our intended user base has that interest/knowledge.

Git certainly must be supported but not required. That's a blocker for many people who may use DVC otherwise.

If that assumption is flawed (or exp push/pull as a shortcut is too confusing), we should just revisit the experiment sharing scenario and come up with a better workflow.

Suppose dvc exp push zips the .dvc/exps/exp-12345 directory and uploads it to the DVC remote. The user then has the option to use this zip file in another way or share it with someone who's not using DVC. They can do this currently as well, but it needs more "Git talent" than simple file uploads/downloads.

This still sounds like a UI/UX problem to me. Experiment sharing in its current form was built on the assumption that users are already used to using git push/pull + dvc push/pull for sharing experiments as git commits

An experienced Git+DVC user can know the difference in these two different kinds of push/pull commands, but I believe there is a conceptual difference between these two. "It's possible" doesn't mean "it's easy" or "natural." I know every feature that I mention is possible, but these are not streamlined and easy to describe, hence the user may lose their interest halfway if we push Git to them.
