Data deduplication / CAS storage #163

Open
killua-eu opened this issue Jul 2, 2017 · 9 comments
@killua-eu

killua-eu commented Jul 2, 2017

Hi, a quick question: does meteor-file-collection have data deduplication based on hash comparisons built-in, or in other words, is it a content-addressable storage? Did you consider choosing a different hash (e.g. SHA-256), or adding another one, to protect against MD5 collisions (quite theoretical for file-based storage, I admit)? Thanks in advance, P.

@vsivsi
Owner

vsivsi commented Jul 3, 2017

Thanks for your question. No, it doesn't implement de-dup/CAS. This package is meant to be a (relatively) simple package to expose the MongoDB gridFS implementation to Meteor, and gridFS does not implement these features. The use of MD5 is specified by the gridFS specification and implemented in the MongoDB drivers and DB server software.

You are free to add another hash (or any other information) as file metadata to meet the needs of your application. It should be possible/straightforward to implement a CAS/dedup solution on top of gridFS, and this has probably already been done (or could be added easily enough by writing a MongoDB backend for one of the many such systems that are under active development, e.g. Dat, Noms, Restic, etc.) But fileCollection will not support any of these directly because it is outside the scope of this project to do so.

@vsivsi vsivsi added the question label Jul 3, 2017
@killua-eu
Author

Thanks for the answers! As for CAS/dedup: I haven't read enough of the gridFS docs, but isn't dedup there kind of the default (though relying only on MD5)? I can use the MD5 to request a file from gridFS via fileCollection, so I was assuming that fileCollection would, upon inserting the same data under a different filename, just ignore the data and add a new filename, basically deduplication upon insertion.

@vsivsi
Owner

vsivsi commented Jul 3, 2017

Nope. There's no deduping or reference counting or anything in gridFS. Each file has its own chunks regardless of the MD5 sum. And the chunks themselves are not deduped or individually hashed in any way. It's very simple. So simple in fact that it is not inherently safe for concurrent writes (e.g. there is no locking of any kind). MongoDB has been "talking" about redesigning it for years, but I've seen no recent progress on that either.
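The point above can be illustrated with a minimal in-memory sketch of gridFS's layout (the real collections are typically `fs.files` and `fs.chunks`; the documents and hashes below are illustrative): each file document owns its own chunk documents via `files_id`, so two inserts of identical content still store the bytes twice.

```javascript
// Sketch of gridFS's data layout: one files-collection document per file,
// and chunk documents keyed by files_id. Inserting the same content twice
// produces separate chunk documents -- nothing is shared or deduplicated.
const files = [
  { _id: 'f1', filename: 'a.txt', md5: '5eb63bbbe01eeed093cb22bb8f5acdc3', length: 11 },
  { _id: 'f2', filename: 'b.txt', md5: '5eb63bbbe01eeed093cb22bb8f5acdc3', length: 11 },
];
const chunks = [
  { files_id: 'f1', n: 0, data: 'aGVsbG8gd29ybGQ=' }, // "hello world", base64
  { files_id: 'f2', n: 0, data: 'aGVsbG8gd29ybGQ=' }, // duplicate bytes stored again
];

// Storage grows with total chunk count, not with unique content:
function chunksFor(fileId) {
  return chunks.filter(c => c.files_id === fileId).length;
}
```

Matching MD5 values in the files collection only tell you the contents are (probably) identical; the chunks are still stored independently per file.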

@vsivsi
Owner

vsivsi commented Jul 3, 2017

To clarify: not safe for concurrent writes/reads to any given file. FileCollection does implement locking on top of MongoDB to make such operations safe, although more recently gridFS in MongoDB has actually been de-featured in this respect (rather than fixed) because of the mythical replacement technology that has yet to materialize.

@killua-eu
Author

Aah, that's a bit of a letdown on Mongo's side. Please correct me if I'm wrong: could a "poor man's" dedup-on-write be implemented in a few lines by first querying FileCollection for the MD5 to be written, and then deciding whether to write the data or only update references?

@vsivsi
Owner

vsivsi commented Jul 3, 2017

Sure, if all you want is file-level dedup, that could work (probably a bit more than a "few liner" though). You'd need to implement reference counting and ensure that the inc/dec logic is safe for concurrent operations, and come up with a scheme to implement "per copy" metadata, etc. In general, gridFS is very simple, but it is also pretty flexible in terms of making it possible for lots of higher level functionality to be built on top at the application level, precisely because it specifies so little.
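The reference-counting logic described above can be sketched as follows, with the index reduced to an in-memory `Map` so the decision logic is clear. In a real application the index would live in a MongoDB collection and the increment/decrement would need to be atomic (e.g. `findAndModify` with `$inc`) to be safe under concurrency; the class and method names here are hypothetical, not part of fileCollection's API.

```javascript
// "Poor man's" file-level dedup: look up the MD5 before writing; on a hit,
// bump a reference count instead of storing the bytes again, and only
// delete the stored bytes when the count drops to zero.
class DedupIndex {
  constructor() {
    this.byMd5 = new Map(); // md5 -> { fileId, refs }
  }

  // On insert: returns { duplicate, fileId }. A hit reuses the stored file.
  insert(md5, newFileId) {
    const hit = this.byMd5.get(md5);
    if (hit) {
      hit.refs += 1;
      return { duplicate: true, fileId: hit.fileId };
    }
    this.byMd5.set(md5, { fileId: newFileId, refs: 1 });
    return { duplicate: false, fileId: newFileId };
  }

  // On delete: decrement; deleteBytes is true only when the last reference goes.
  remove(md5) {
    const hit = this.byMd5.get(md5);
    if (!hit) return { deleteBytes: false };
    hit.refs -= 1;
    if (hit.refs === 0) {
      this.byMd5.delete(md5);
      return { deleteBytes: true, fileId: hit.fileId };
    }
    return { deleteBytes: false, fileId: hit.fileId };
  }
}
```

Note this only gives whole-file dedup keyed on MD5; "per copy" metadata (different filenames, owners, timestamps for the same bytes) would still need its own documents pointing at the shared file.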

@killua-eu
Author

Oh, I believed/hoped that I could rely on Mongo for concurrency safety. Thanks for the info, I'll have to read a bit more on gridFS to figure out the limitations.

@vsivsi
Owner

vsivsi commented Jul 4, 2017

You should check out my gridFS locking package (and the sister gridFS streaming package). Lots of good info there, and file-collection is built on top of it.

https://github.com/vsivsi/gridfs-locks

@killua-eu
Author

Lovely! Thanks a lot for all the info :)
