Inconsistent Selector gains and duplicate elements in Selector ranking in case of Facility Locations function in v0.6.0 #19
That's weird, looks like there is a bug in what's being reported. But if you run the following:

```python
import numpy
from apricot import FacilityLocationSelection

# x is the precomputed (1000, 1000) similarity matrix from the attached
# notebook (a torch tensor, hence the .numpy() calls below).
selector1 = FacilityLocationSelection(800, metric='precomputed', optimizer='naive')
selector2 = FacilityLocationSelection(800, metric='precomputed', optimizer='lazy')

selector1.fit(x.numpy())
selector2.fit(x.numpy())

selector1.ranking[selector1.gains != 0]
# array([  0, 671, 228, 165, 930, 563, 534, 610,  41, 675, 213, 500, 509,
#        255, 293, 608, 963, 204, 253,  15, 662, 284, 503, 909, 369, 649,
#        701, 241, 643, 877, 153, 463, 754, 388, 679, 724, 219, 573, 758,
#        259, 706,  34, 703,  32, 105, 694, 792, 380, 603, 214,  74, 147,
#        845, 159,  80, 892, 911, 844, 919, 622, 149, 258, 464, 203, 956,
#        595, 405, 972, 114, 789, 131, 809, 815, 328, 808, 211, 329, 605,
#        123, 496, 653,  82, 280, 311])

selector2.ranking[selector2.gains != 0]
# array([  0, 671, 228, 165, 930, 563, 534, 610,  41, 675, 213, 500, 509,
#        255, 293, 608, 963, 204, 253,  15, 662, 284, 503, 909, 369, 649,
#        701, 241, 643, 877, 153, 463, 754, 388, 679, 724, 219, 573, 758,
#        259, 706,  34, 703,  32, 105, 694, 792, 380, 603, 214,  74, 147,
#        845, 159,  80, 892, 911, 844, 919, 622, 149, 258, 464, 203, 956,
#        595, 405, 972, 114, 789, 131, 809, 815, 328, 808, 211, 329, 605,
#        123, 496, 653,  82, 280, 311])
```

you get the right answer: both the gains and the rankings are right there. I'll have to look into why a bunch of 0 gains are added at the front for the lazy selection. As for what you can do now, I'd suggest switching to the naive optimizer. The naive optimizer is robust against almost all of the issues that will affect the lazy optimizer, because the lazy optimizer has to deal with the hassle and edge cases of maintaining a priority queue.

I had a question about your data, though. apricot uses kernel matrices, in the sense that 1 means perfect similarity and 0 means perfect dissimilarity. Are you sure that you're not feeding in a dissimilarity or distance matrix? There are a ton of 1's in here, meaning your data is super redundant.
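If what you have is actually a distance matrix, one common way to convert it to the similarity (kernel) matrix that `metric='precomputed'` expects is an RBF kernel. A minimal sketch; the random `distances` matrix and the `gamma` value here are made up for illustration:

```python
import numpy

# Hypothetical pairwise distance matrix; 'precomputed' expects similarities,
# where 1 means identical and values near 0 mean very dissimilar.
distances = numpy.random.uniform(0, 1, size=(1000, 1000))
distances = (distances + distances.T) / 2   # symmetrize
numpy.fill_diagonal(distances, 0)

# RBF conversion: distance 0 maps to similarity 1, large distances to ~0.
gamma = 1.0
similarities = numpy.exp(-gamma * distances ** 2)
```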
Hi,
Coming to your query,
Gotcha, just checking that the redundancy is expected. Yes, using 0.5 should be fine. I think that 0.6 just added new selectors and made the existing ones faster (but apparently introduced this issue, which the unit tests missed).
Alright, I will use v0.5. Are there any other differences between v0.5 and v0.6?
I looked into this a bit more and the issue looks somewhat benign. Essentially, there is a bug in apricot, but it's an edge case related to the structure of your data. If you make a heatmap of your matrix, you can see that, basically, there are rows and columns comprised of exactly the same value. This means that, essentially, a single one of the redundant samples can explain every sample just as well as any of the other samples. Thus, submodular optimization involves first choosing one of the redundant samples (with no real preference for which one), and then choosing the non-redundant ones in reverse order of their similarity to the redundant examples, with each selection only providing a gain by explaining itself. Basically, after the first step, adding more items doesn't explain any of the OTHER examples (see the toy demonstration after this comment).

This shouldn't normally be a big problem, because submodular optimization doesn't require that the gain of every item go down each iteration; it can also stay the same. However, in v0.6.0 I added a small optimization to the priority queue that terminates it when the gain it finds is 0 and the change is 0. Obviously, this is somewhat of an issue for data structured like yours, where a bunch of items are perfectly redundant and, after the first selection, the remaining items are almost perfectly non-redundant.

You can check this hypothesis by adding a small amount of random Gaussian noise to the data so that the gain is never perfectly 0, and that's exactly what you see:

```python
# Convert to numpy first so the noise can be added directly.
x = aa['gram'][0].numpy() + numpy.abs(numpy.random.normal(0, 0.00001, size=(1000, 1000)))
selector1 = FacilityLocationSelection(800, metric='precomputed', optimizer='naive')
selector2 = FacilityLocationSelection(800, metric='precomputed', optimizer='lazy')
selector1.fit(x)
selector2.fit(x)

(selector1.ranking[:50] != selector2.ranking[:50]).sum()
# 0
```

tl;dr: the issue is minor, so I might not get around to it for a bit. You can use 0.5.0 because it doesn't include this "optimization." Here's a twitter thread about what I've added in v0.6: https://twitter.com/jmschreiber91/status/1311358116353642496?s=20
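To make the edge case concrete, here is a toy hand-rolled sketch of greedy facility location (this is not apricot's internals, and the 4x4 matrix is made up) on data with two perfectly redundant items. Once either duplicate is selected, the other's marginal gain drops to exactly 0, which is the condition the v0.6.0 early-termination check keys on:

```python
import numpy

# Made-up 4x4 similarity matrix in the spirit of the data above:
# items 0 and 1 are perfectly redundant copies of one another.
K = numpy.array([
    [1.0, 1.0, 0.2, 0.1],
    [1.0, 1.0, 0.2, 0.1],
    [0.2, 0.2, 1.0, 0.3],
    [0.1, 0.1, 0.3, 1.0],
])

# Greedy facility location by hand: f(S) = sum_i max_{j in S} K[i, j],
# so the marginal gain of j is sum_i max(0, K[i, j] - best_so_far[i]).
best_so_far = numpy.zeros(len(K))
selected = []
for step in range(3):
    gains = numpy.maximum(K - best_so_far[:, None], 0).sum(axis=0)
    gains[selected] = -numpy.inf          # never re-pick an item
    j = int(gains.argmax())
    print(f"step {step}: picked {j}, gain {gains[j]:.1f}, "
          f"gain of the duplicate item 1: {max(gains[1], 0.0):.1f}")
    selected.append(j)
    best_so_far = numpy.maximum(best_so_far, K[:, j])

# step 0: picked 0, gain 2.3, gain of the duplicate item 1: 2.3
# step 1: picked 2, gain 1.0, gain of the duplicate item 1: 0.0
# step 2: picked 3, gain 0.7, gain of the duplicate item 1: 0.0
```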
Thanks for the detailed response!
Hi,
I was working with two versions of apricot, 0.5.0 and 0.6.0. I am using the Facility Location function with the 'precomputed' setting, passing my own matrix. I have two kinds of matrices: one highly sparse (which even contains zero diagonal elements) and another with a lot of 1s. The dimension of each matrix is (1000, 1000), and I have to select 800 elements.
I have attached a folder containing the input matrices and the Jupyter notebooks, including the output runs.
Problem: inconsistent selector gains and duplicate elements in the selector ranking for the Facility Location function in v0.6.0.
Now, when I pass these matrices to get the selector gains and the ranking of indices, there are the following problems:

1. The selector gains are not in descending order.
2. The ranking array contains duplicate elements.
This finding is consistent for both matrices (the sparse one and the one with a lot of 1s).
However, this issue is not present in v0.5.0: the gains are in descending order, and the ranking array does not contain any duplicate elements.
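For reference, both symptoms can be checked programmatically from a fitted selector. A minimal sketch, where the random `X_sim` is only a stand-in for one of the attached matrices:

```python
import numpy
from apricot import FacilityLocationSelection

# Stand-in similarity matrix; substitute the attached (1000, 1000) matrix
# to reproduce the actual report.
X_sim = numpy.random.uniform(0, 1, size=(1000, 1000))
X_sim = (X_sim + X_sim.T) / 2
numpy.fill_diagonal(X_sim, 1.0)

selector = FacilityLocationSelection(800, metric='precomputed', optimizer='lazy')
selector.fit(X_sim)

gains = numpy.asarray(selector.gains)
ranking = numpy.asarray(selector.ranking)

# Greedy submodular selection should produce non-increasing gains...
print("gains in descending order:", bool(numpy.all(numpy.diff(gains) <= 0)))

# ...and should never select the same index twice.
print("duplicates in ranking:", len(ranking) - len(numpy.unique(ranking)))
```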
Please find the attached Jupyter notebook and the corresponding file in the folder to reproduce the error.
Issue.zip