Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CountMinSketch[K] assumes working equals method on K. #497

Open
johnynek opened this issue Nov 1, 2015 · 11 comments
Open

CountMinSketch[K] assumes working equals method on K. #497

johnynek opened this issue Nov 1, 2015 · 11 comments
Labels

Comments

@johnynek
Copy link
Collaborator

johnynek commented Nov 1, 2015

For small count-min sketches, you create CMSItem:

https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala#L467

But to get the frequency of K we use ==:
https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala#L479

if you have CMSItem[Array[Byte]] as happens in scalding for sketches of tiny key spaces (which might come up in unit tests or toy examples), this means we always reurn a frequency of 0 for CMSItem[Array[Byte]].frequency unless you happen to pass the same exact array in.

trying to bite my tongue about the design flaw in Java that equals is eq for Arrays.

@johnynek
Copy link
Collaborator Author

johnynek commented Nov 2, 2015

A possible solution here is to add an def equiv(a: K, b: K): Boolean method to CMSHasher[K] and then implement this for Array[Byte].

PS: @non this is why I think any Hashing typeclass should have equivalence: you often need it, and it seems impossible to state any law on the typeclass without it (well, you could make probabilisitic statements about collisions).

@johnynek
Copy link
Collaborator Author

johnynek commented Nov 2, 2015

Also, the SparseCMS does not seem fixable at all with Array[T] keys. Since we are actively using it for Array[Byte] in scalding, and I guess CMSHasherByteArray: is broken if we allow the SparseCMS.
https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala#L1353

We could fix this by making the table have the key which is the hash, and then check collisions using the equivalence in the CMSHasher, or we could remove the sparse business.

@ianoc
Copy link
Collaborator

ianoc commented Nov 12, 2015

We internally use structures often where having a graceful degradation from exact down to approximate is nice. (TSAR uses these heavily), so it sort of would be nice to keep it if we could. Requiring the Hasher type class to provide the equality also i think would be most preferable if not too painful. (We'd need to be careful of course about using any built in maps or other scala stuff... which is annoying)

@isnotinvain
Copy link
Contributor

This is old, but, here is the ugly wrapper class I created in scalding to make arrays have efficient orderings and hash codes:

https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/typed/HashEqualsArrayWrapper.scala

The problem is you have to make sure to wrap your arrays in these wrappers.

@johnynek
Copy link
Collaborator Author

that may be worth adding. The core issue is unsolved: we are using equality without communicating in the types that we are doing so. We need something like https://github.com/non/algebra/blob/master/core/src/main/scala/algebra/Eq.scala#L12

@ianoc
Copy link
Collaborator

ianoc commented Jan 21, 2016

When/if should we take the algebra dep into algebird I wonder, that was the original intent right? Or should we start going for more Equiv typeclass usage and less =='s anywhere?

@johnynek
Copy link
Collaborator Author

Good question. I hoped soon, then this cats/algebra conflict came up:

typelevel/algebra#142

The problem with Equiv is that there is a default for all types falling back to ==. So it is not much safer.

@isnotinvain
Copy link
Contributor

Yeah unless we get away from use of == then users (and library implementors) just have to take care to wrap types when they have a bad ==, no way around that really.

Using an Equiv makes sense to me, maybe the migration path could be to require an Equiv wherever we use ==, and if one isn't provided, fall back to ==, then deprecate the fall back path... etc

@johnynek
Copy link
Collaborator Author

Scala provides a fallback Equiv (==) in the companion object already so
that approach won't be much safer.

On Thu, Jan 21, 2016 at 3:38 PM, Alex Levenson [email protected]
wrote:

Yeah unless we get away from use of == then users (and library
implementors) just have to take care to wrap types when they have a bad ==,
no way around that really.

Using an Equiv makes sense to me, maybe the migration path could be to
require an Equiv wherever we use ==, and if one isn't provided, fall back
to ==, then deprecate the fall back path... etc


Reply to this email directly or view it on GitHub
#497 (comment).

P. Oscar Boykin, Ph.D. | http://twitter.com/posco | http://pobox.com/~boykin

@isnotinvain
Copy link
Contributor

Ok, well, what I said but replace scala's Equiv with something similar that has no fallback on == then?

@sritchie sritchie added the bug label Oct 24, 2016
@sritchie
Copy link
Collaborator

sritchie commented Dec 10, 2016

@isnotinvain we can definitely pick this up now that we have the algebra dep!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants