[QUESTION & DOCS]: Comparison with DVC #168

vanangamudi · 2019-11-22T13:07:00Z

Executive Summary
How does this compare with DVC?

Additional Context / Explanation
I am not trying to start a flame war, but we have spent quite a lot of time investigating DVC[1] for our purposes, and a friend suggested I take a look at Hangar. One key thing we really like about DVC is the metrics feature. I read through the Hangar docs; it looks quite different from DVC but quite similar to the dat project[2]. I may be wrong, and would appreciate some help understanding the differences.

External Links
[1] https://github.com/iterative/dvc
[2] https://datproject.org/

hhsecond · 2019-11-22T20:16:04Z

Hi Selva,
Thanks for trying out Hangar and raising the issue. I assume you found us through Adam. If you have tried out DVC, you are probably familiar with a few of its downsides, especially when it comes to performance. Hangar takes a different approach to versioning your data: it is built from the ground up rather than relying on Git for versioning, and it stores tensors rather than blobs. We completely understand that DVC's approach is useful for some folks, and that making data versioning go hand in hand with Git is especially useful (we are working on a solution for this now, which should also help with the metrics feature). We are also building Hangar to be part of the user's code base and to work easily with existing frameworks. I would be happy to take a call with you to walk through a few examples I had discussed with Adam. Please be aware that we have a Slack user group where you'll get faster responses than here. Here is the link to the Slack group.
Also, @rlizzo might have a few more things to add here.

rlizzo · 2019-11-26T12:26:35Z

Hey @vanangamudi,

Thanks for waiting on a reply here! I'll expand a bit upon @hhsecond's excellent summary above.

Executive Summary

At first glance, Hangar, DVC, and DAT might appear to solve similar(ish) problems: versioning, making use of, and distributing/collaborating on "data". However, the implementation, design, and world-view of each tool are drastically different, and those differences strongly affect end-user workflows and performance.

Direct Comparisons

Hangar vs DVC

Philosophy

The simplest way to understand why/how Hangar and DVC differ might be:

Hangar is what results when you start from nothing and ask "how would I go about building a version control system (with similar semantics to Git) but specifically designed to deal with arbitrary numeric 'data'?"
DVC is the result of starting with Git and asking: "how can I add additional components and modules to this existing version control system in order to allow it to deal with arbitrary binary 'data' files?"

A really important point in the above statements is the difference between what Hangar and DVC consider "data":

Hangar made the decision to think about "data" in the way it is used in computations, i.e. as numeric arrays/tensors. The user sends numbers in, Hangar stores them, and sends identical numbers out when asked for them. This is much more thoroughly explained in "How Hangar Thinks About Data".
DVC models itself after Git, and tracks "data" as files the user creates and uses on disk. The user sends DVC a list of files, DVC creates a snapshot, and then restores those files in the same place when asked.

This massively affects every aspect of usage, performance, and scaling abilities (as explained below).

Workflow

Hangar

Because Hangar thinks of data only as numeric arrays, there is no need for Hangar to consider domain-specific storage conventions or formats! With a small set of schemas and technologies, Hangar generates highly optimized storage containers to read/write data on disk completely automatically and transparently for the user.

As a Hangar user, all you do is say:

>>> import numpy as np
>>> # write data (`checkout` is a write-enabled checkout of the repository)
>>> checkout.arrayset['some_sample'] = np.array([0, 1, 2])  # some data
>>> # read data
>>> res = checkout.arrayset['some_sample']
>>> res
array([0, 1, 2])

In a Hangar workflow, there is

No need to write domain-specific file data readers / writers.
No need to write multithreaded / multiprocess code in order to saturate CPU cores (reading in Hangar is parallelizable across processes/threads/machines by default).
No need to transform data into a "computable" form: what you stored is immediately ready for computation as a np.array (or torch.Tensor / tf.Tensor if using our built-in torch/tensorflow dataloaders).
No need to manage a directory of data at all. Hangar manages its data directory itself (in a hidden .hangar directory). You'll never have to deal with files at all if data is stored in Hangar.

Most importantly in Hangar: the numeric data you put in is exactly the numeric data you get out. While explaining how data is indexed/stored in the Hangar backends is well beyond the scope of this article, it should be noted that the method by which Hangar stores the data is nearly completely arbitrary. Over time, the backend some piece of data resides in can (and often will) change or update. It is up to Hangar to ensure that when you ask for some data sample, that numeric data is returned exactly as it went in. How it is stored is irrelevant to the end user (and to the majority of the Hangar core).
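
To make the "same numbers out, regardless of backend" idea concrete, here is a minimal sketch using only the Python standard library. The two "backends" below (`write_raw` / `write_compressed`) are hypothetical stand-ins invented for illustration; they are not Hangar's actual storage backends.

```python
import struct
import zlib


def write_raw(values):
    """Hypothetical backend A: little-endian float64 bytes, uncompressed."""
    return struct.pack(f"<{len(values)}d", *values)


def read_raw(blob):
    count = len(blob) // 8  # 8 bytes per float64
    return list(struct.unpack(f"<{count}d", blob))


def write_compressed(values):
    """Hypothetical backend B: the same packing, followed by zlib compression."""
    return zlib.compress(write_raw(values))


def read_compressed(blob):
    return read_raw(zlib.decompress(blob))


sample = [0.0, 1.0, 2.0]
blob_a = write_raw(sample)
blob_b = write_compressed(sample)

# The bytes on disk differ between the two backends...
assert blob_a != blob_b
# ...but the numbers handed back to the caller are identical.
assert read_raw(blob_a) == read_compressed(blob_b) == sample
```

Under this view, migrating a sample from one backend to another is invisible to the caller, as long as each backend's read path returns the original values.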

DVC

At its core, DVC is dependent on the Git project's code, as well as its worldview. For the sake of brevity, I'll avoid venturing into inner workings / interactions between Git / DVC; just know that the design as explained below follows from Git's implementation & fundamental design.

Essentially, all DVC does is create a snapshot of some set of files (which the user marks as being "data" files, identified by either a filename suffix or via manually adding the file path to DVC). Because DVC operates in a Git directory, AND because it thinks of "data" as some collection of bytes in a "file" on disk, any commit DVC generates of the "data files" will always return an exact snapshot of that "data"'s bytes (the file contents), file format (suffix), file name, and directory path.

In the DVC model, regardless of how the needs / processing pipeline / usage of some piece of data changes in the future, if you want to see data from some previous point in time, you get files written for the processes that exist at that point in time. In DVC, you must always retain:

The custom code to read the data from some file.
Knowledge of folder names / file names / project structure mapping to some piece of data you want (as it existed at that point in time).
The transformation pipeline/code needed to make the data usable in computations.
Access to compatible versions of any library binaries / environments which were used to write that file.

This is a fundamental limitation of DVC because Git was written to handle text files representing pieces of code. Thinking of "data" and "text" as analogous entities is a fallacy disproved by the following argument:

A text file is universally stable, the encoding is universally agreed upon, and it is a prerequisite for EVERY computer to be able to read a file containing text. This is NOT true for data files (and by extension, DVC). Data files are domain specific, ever changing, and can be very complex. Assuming that a data file will still be readable in 10 years, with the same tools and code which we have today, is just not reasonable, advisable, or good practice in any way.

What you really want is the data itself, the directly computable set of numbers representing some piece of information in the real world, not the container in which it is stored (i.e. what you want is a system like Hangar).
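
The container-versus-values distinction can be illustrated with a small standard-library sketch. It assumes a made-up file layout (big-endian float32 with no header) and does not use DVC itself; the point is only that a byte-exact snapshot is usable solely by a reader that still knows the original layout.

```python
import struct

values = [0.5, 1.5, 2.5]

# A "data file" written by today's pipeline: big-endian float32, no header.
# (This layout is an assumption made up for the example.)
file_bytes = struct.pack(">3f", *values)

# A file-snapshot tool hands back exactly these bytes for this version.
restored = bytes(file_bytes)
assert restored == file_bytes

# Decoding only works if the original layout is still known to the reader.
correct = list(struct.unpack(">3f", restored))
wrong = list(struct.unpack("<3f", restored))  # reader assumes little-endian

assert correct == values
assert wrong != values
```

The snapshot is faithful in both cases; what differs is whether the knowledge needed to turn bytes back into numbers has survived alongside it.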

Performance

Hangar

The Hangar backend storage methods are highly optimized. Storing numeric data is our specialty, and the team has spent countless hours (and relied on many years of experience) writing backends which are highly performant for reads while balancing compression, shuffling, integrity checking, & multi-threading/processing considerations. Performance is a main consideration, and much work has gone into making sure that Hangar has some of the highest performance reads and compression levels around. I would suggest testing this on a sample set of data you deal with in the real world.

Further, most Hangar book-keeping operations (checkout, commit, diff, merge, fetch, clone, etc.) do not actually require reading the numeric data files (which can be very large) from disk in order to execute. The vast majority of operations occur on only book-keeping "metadata" (very small, ~40 bytes each, structures describing a commit / sample / data location). Combined with highly performant algorithms (similar to those used in the Git core itself), this means that common tasks in Hangar (checkout, diff, merge, etc.) occur in less than a second, even for very large repositories / checkouts. Any operation which requires the actual numeric data on disk to be touched (i.e. writing new data, or reading old) is an O(1) operation (generally with a small constant).
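
As a rough illustration of why such operations can stay cheap, here is a sketch of a diff that looks only at small per-sample records. It assumes each commit can be represented as a `{sample_name: content_digest}` mapping; that representation and the `diff_commits` helper are hypothetical and are not Hangar's internal format.

```python
def diff_commits(old, new):
    """Compare two commits given as {sample_name: content_digest} maps.

    Only the small digest records are inspected; the (possibly very large)
    array data those digests refer to is never read.
    """
    added = {name for name in new if name not in old}
    removed = {name for name in old if name not in new}
    mutated = {name for name in old.keys() & new.keys() if old[name] != new[name]}
    return added, removed, mutated


commit_1 = {"img_0": "aa11", "img_1": "bb22"}
commit_2 = {"img_0": "aa11", "img_1": "cc33", "img_2": "dd44"}

added, removed, mutated = diff_commits(commit_1, commit_2)
assert added == {"img_2"}
assert removed == set()
assert mutated == {"img_1"}
```

The work here grows with the number of records compared, not with the size of the arrays they describe.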

Disk space is further preserved by automatically deduplicating data. If you add a sample whose contents are identical to any sample existing anywhere in the Hangar repository history, only a new reference is saved; that reference points to the previously saved sample, which is what gets read upon request.
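
One common way to get this behaviour is content addressing: key each payload by a hash of its bytes. The toy `DedupStore` below is a sketch of that general technique under that assumption; the class name, the choice of SHA-256, and the in-memory dictionaries are all illustrative and say nothing about how Hangar implements deduplication.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: identical payloads are kept only once."""

    def __init__(self):
        self.blobs = {}  # digest -> payload bytes (one entry per unique payload)
        self.refs = {}   # sample name -> digest

    def put(self, name, payload):
        digest = hashlib.sha256(payload).hexdigest()
        # Store the payload only if this content has never been seen before.
        self.blobs.setdefault(digest, payload)
        self.refs[name] = digest
        return digest

    def get(self, name):
        return self.blobs[self.refs[name]]


store = DedupStore()
store.put("sample_a", b"\x00\x01\x02")
store.put("sample_b", b"\x00\x01\x02")  # same contents, different name

assert len(store.refs) == 2   # two references...
assert len(store.blobs) == 1  # ...but only one stored payload
assert store.get("sample_b") == b"\x00\x01\x02"
```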

DVC

DVC stores what it is given. Read speed / compression ratios are only as good as the files added to it. Without dedicated engineering efforts this commonly results in sub-par usability and increased costs through disk usage and CPU requirements during the reading / decoding phases. Also, for many operations DVC scales with O(N) or O(N^2) computational time complexity.

Direct Comparisons

Feature Set

Hangar can both "Branch" and "Merge" changes together, automatically. (DVC cannot "merge" branches together).
Hangar is extensible through CLI plugins and open-source backend contributions.
Hangar repositories are stand-alone, no need to interplay both Git and Hangar commands in the same context.
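
For readers unfamiliar with what an automatic merge involves, the following is a sketch of a generic three-way merge over `{sample_name: content_digest}` maps. It illustrates the general technique only; the `three_way_merge` helper and the map representation are assumptions for this example, not Hangar's merge implementation.

```python
def three_way_merge(base, ours, theirs):
    """Merge two branches against their common ancestor.

    Each argument maps sample name -> content digest. A missing key means
    the sample does not exist on that side. Returns (merged, conflicts).
    """
    merged, conflicts = {}, []
    for key in sorted(base.keys() | ours.keys() | theirs.keys()):
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o == t:
            winner = o  # both sides agree (including "both deleted")
        elif o == b:
            winner = t  # only their branch changed this sample
        elif t == b:
            winner = o  # only our branch changed this sample
        else:
            conflicts.append(key)  # both changed it, in different ways
            continue
        if winner is not None:
            merged[key] = winner
    return merged, conflicts


base = {"a": "1", "b": "2"}
ours = {"a": "1x", "b": "2"}             # we modified "a"
theirs = {"a": "1", "b": "2", "c": "3"}  # they added "c"

merged, conflicts = three_way_merge(base, ours, theirs)
assert merged == {"a": "1x", "b": "2", "c": "3"}
assert conflicts == []
```

When both branches change the same sample in different ways, the key lands in `conflicts` and is left out of `merged`, so a caller can decide how to resolve it.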

Hangar vs DAT

Many of the performance / workflow points made in the comparison of Hangar vs DVC apply equally to Hangar vs DAT. However, DAT isn't even really a formal "version control" system. It is a networking protocol which handles arbitrary data in the form of "files". While it is certainly relevant to Hangar (more on this if interested), I don't see them as filling the same use case:

Hangar = Version Control + Distribution Protocol
DAT = Distribution Protocol

More Info

For further reading on the details above, I would encourage you to read up on the following section of the Hangar ReadTheDocs Site:

rlizzo · 2019-11-26T13:13:26Z

Also, I never addressed your comment on DVC "metrics".

Hangar is a much more focused project than DVC. Rather than try to handle execution, results tracking, and pipelining of a specific workflow (ML graphs / training) in the same tool which is responsible for versioning and accessing your data, we limit our scope to putting/retrieving data on/from disk, versioning it, and enabling distribution/collaboration.

I liken adding pipeline/run features directly to the Hangar core to be akin to Git building in the functionality of Jenkins CI right into itself. It would be problematic for a few reasons:

Version control is complex enough for the average user without additional bloat to learn.
If the addition is not sufficiently general (like text is to code for versioning) to cover every potential application, then the temptation for domains to fork and build custom ports would be huge. All of a sudden the community would have 100s of (potentially non-interoperable) version control programs to deal with. It would be a mess for both users and maintainers.
Many projects which try to do too much have a strange tendency to end up doing nothing well. Maintaining a project over the long term requires people with specific expertise. If the demands of keeping a codebase clean and optimized are too much, or suddenly fall outside the scope of the dev team, the project is in serious trouble. This only gets worse over time.

Solution

While there isn't any built in support for "metrics" like DVC, the general nature of Hangar makes writing your own metrics alongside any hangar commit trivially easy:

Method 1: Define an arrayset with some descriptive name, and write values with relevant sample keys. At every commit, if the data generated some new metric, overwrite that sample.

>>> co.arraysets['metrics']['AUC'] = np.array([2.414])
>>> co.arraysets['metrics']['ROC'] = np.array([0.4522])
>>> # continue as needed
>>> co.arraysets['metrics']['AUC']
array([2.414])

Method 2: Use metadata key/value pairs in a similar fashion. Note: metadata values must be string typed, though they can hold any arbitrary value, and once returned can be converted to any dtype the calling application desires.

>>> co.metadata['model1-AUC'] = str(2.414)
>>> res = co.metadata['model1-AUC']
>>> res
'2.414'
>>> float(res)
2.414
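
If you record several metrics per model this way, a pair of small helpers keeps the string conversion in one place. The sketch below uses a plain dict as a stand-in for `co.metadata`, and assumes the real object supports item assignment and `.items()` in the same way; `record_metrics` and `load_metrics` are hypothetical names, not part of Hangar.

```python
def record_metrics(metadata, model, metrics):
    """Write each metric as a string value under a '<model>-<name>' key."""
    for name, value in metrics.items():
        # repr() of a float round-trips exactly through float().
        metadata[f"{model}-{name}"] = repr(float(value))


def load_metrics(metadata, model):
    """Read back every '<model>-<name>' entry as a float."""
    prefix = f"{model}-"
    return {
        key[len(prefix):]: float(value)
        for key, value in metadata.items()
        if key.startswith(prefix)
    }


metadata = {}  # stand-in for co.metadata (assumed to behave like a str -> str mapping)
record_metrics(metadata, "model1", {"AUC": 0.914, "loss": 0.25})

assert metadata["model1-AUC"] == "0.914"
assert load_metrics(metadata, "model1") == {"AUC": 0.914, "loss": 0.25}
```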

