
Inconsistent selector gains and duplicate elements in selector ranking with the Facility Location function in v0.6.0 #19

Open
bhattg opened this issue Dec 6, 2020 · 6 comments


bhattg commented Dec 6, 2020

Hi,

I was working with two versions of apricot, 0.5.0 and 0.6.0. I am using the Facility Location function with the precomputed setting, passing in my own matrices. I have two kinds of matrices: one highly sparse (which even contains zero diagonal elements) and another with a lot of 1s. Each matrix has shape (1000, 1000), and I have to select 800 elements.
I have attached a folder containing the input matrices and the Jupyter notebooks, including the output runs.

Problem: Inconsistent selector gains and duplicate elements in the selector ranking with the Facility Location function in v0.6.0

When I pass these matrices in to get the selector gains and the ranking of indices, I see the following problems:

  • In v0.6.0, the selector gains are not in descending order, even though they should be.
  • Moreover, the ranking array contains duplicate elements, which should not happen.

This happens for both matrices (the sparse one and the one with many 1s).

However, this issue does not occur in v0.5.0: there, the gains are in descending order and the ranking array contains no duplicates.
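
For concreteness, the checks I am doing boil down to something like the following sketch (here X stands for one of my precomputed (1000, 1000) similarity matrices as a numpy array):

import numpy
from apricot import FacilityLocationSelection

# X is a placeholder for one of the precomputed similarity matrices
selector = FacilityLocationSelection(800, metric='precomputed')
selector.fit(X)

# the gains should be non-increasing...
print(numpy.all(numpy.diff(selector.gains) <= 0))
# ...and the ranking should contain no duplicate indices
print(len(numpy.unique(selector.ranking)) == len(selector.ranking))

In v0.5.0 both checks print True; in v0.6.0 both print False for my matrices.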

Please find the attached Jupyter notebook and the corresponding files in the folder to reproduce the error.
Issue.zip


jmschrei commented Dec 7, 2020

That's weird; it looks like there is a bug in what's being reported in the ranking and gains attributes. For instance, if you do the following with the 0.6 notebook:

from apricot import FacilityLocationSelection

# x below is the precomputed (1000, 1000) similarity matrix loaded in the attached notebook (a torch tensor)
selector1 = FacilityLocationSelection(800, metric='precomputed', optimizer='naive')
selector2 = FacilityLocationSelection(800, metric='precomputed', optimizer='lazy')

selector1.fit(x.numpy())
selector2.fit(x.numpy())

selector1.ranking[selector1.gains != 0]
# array([  0, 671, 228, 165, 930, 563, 534, 610,  41, 675, 213, 500, 509,
#        255, 293, 608, 963, 204, 253,  15, 662, 284, 503, 909, 369, 649,
#        701, 241, 643, 877, 153, 463, 754, 388, 679, 724, 219, 573, 758,
#        259, 706,  34, 703,  32, 105, 694, 792, 380, 603, 214,  74, 147,
#        845, 159,  80, 892, 911, 844, 919, 622, 149, 258, 464, 203, 956,
#        595, 405, 972, 114, 789, 131, 809, 815, 328, 808, 211, 329, 605,
#        123, 496, 653,  82, 280, 311])

selector2.ranking[selector2.gains != 0]
# array([  0, 671, 228, 165, 930, 563, 534, 610,  41, 675, 213, 500, 509,
#        255, 293, 608, 963, 204, 253,  15, 662, 284, 503, 909, 369, 649,
#        701, 241, 643, 877, 153, 463, 754, 388, 679, 724, 219, 573, 758,
#        259, 706,  34, 703,  32, 105, 694, 792, 380, 603, 214,  74, 147,
#        845, 159,  80, 892, 911, 844, 919, 622, 149, 258, 464, 203, 956,
#        595, 405, 972, 114, 789, 131, 809, 815, 328, 808, 211, 329, 605,
#        123, 496, 653,  82, 280, 311])

you get the right answer. Both the gains and the rankings are right there. I'll have to look into why a bunch of 0 gains are added at the front for the lazy selection.

As for what you can do now, I'd suggest switching to the naive optimizer. The naive optimizer is robust against almost all of the issues that will affect the lazy optimizer, because the lazy optimizer has to deal with the hassle and edge cases of maintaining a priority queue.
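
For context, the lazy selection is essentially CELF-style lazy greedy over a priority queue of stale upper bounds; a rough sketch of the idea (illustrative only, not apricot's actual implementation) looks like this:

import heapq
import numpy

def lazy_facility_location(K, k):
    # illustrative CELF-style lazy greedy for facility location on a non-negative similarity matrix K
    n = len(K)
    current = numpy.zeros(n)                        # best similarity of each sample to the selected set
    heap = [(-K[:, j].sum(), j) for j in range(n)]  # max-heap of (negated upper-bound gain, item)
    heapq.heapify(heap)
    selected, gains = [], []
    while len(selected) < k and heap:
        neg_bound, j = heapq.heappop(heap)
        gain = numpy.maximum(K[:, j] - current, 0).sum()  # recompute the true marginal gain
        if not heap or gain >= -heap[0][0]:
            # the fresh gain still beats every other (stale) bound, so j really is the best item
            selected.append(j)
            gains.append(gain)
            current = numpy.maximum(current, K[:, j])
        else:
            heapq.heappush(heap, (-gain, j))        # bound was stale: push back with the updated gain
    return numpy.array(selected), numpy.array(gains)

The stale-bound bookkeeping (re-pushing entries, handling ties, deciding when to stop) is where those edge cases live; the naive optimizer just recomputes every gain each iteration and sidesteps all of it.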

I had a question about your data though. apricot uses kernel matrices, in the sense that 1 means perfect similarity and 0 means perfect dissimilarity. Are you sure that you're not feeding in a dissimilarity or distance matrix? There are a ton of 1's in here, meaning your data is super redundant.
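
If what you actually have is a distance matrix, one simple way to turn it into the kind of similarity matrix apricot expects would be something like this (the specific normalization is just an illustrative choice):

import numpy

# hypothetical example: `distances` is a symmetric pairwise distance matrix
distances = numpy.random.uniform(0, 1, size=(1000, 1000))
distances = (distances + distances.T) / 2

similarities = 1 - distances / distances.max()  # large distance -> small similarity
numpy.fill_diagonal(similarities, 1.0)          # each point is perfectly similar to itself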


bhattg commented Dec 7, 2020

Hi,
Thanks for the quick response. I have a question related to v0.5.0:

  • Is it OK to use v0.5.0 for the time being? I did not find this issue in that version.

As for your query:

  • Yes, the data actually IS highly redundant; only a few of the entries are important, and even though I am choosing 800 entries here, they too have high redundancy. This is something of a corner case in the work I am dealing with; however, it should not matter from a submodularity point of view.


jmschrei commented Dec 7, 2020

Gotcha, just checking that the redundancy is expected.

Yes, using 0.5 should be fine. I think that 0.6 just added in new selectors and made the existing ones faster (but apparently introduced this issue that the unit tests missed).


bhattg commented Dec 7, 2020

Alright, I will use v0.5. Are there any other differences between v0.5 and 0.6?


jmschrei commented Dec 7, 2020

I looked into this a bit more and the issue looks somewhat benign. Essentially, there is a bug in apricot, but it's an edge case related to the structure of your data. If you make a heatmap of your matrix you get the following:

[attached image: heatmap of the input similarity matrix]

You can see that, basically, there are rows and columns composed of exactly the same values. This means that, essentially, a single one of the redundant samples can explain every sample just as well as any of the other samples. Thus, submodular optimization involves first choosing one of the redundant samples (with no real preference for which one). Then, it chooses the non-redundant ones in reverse order of their similarity to the redundant examples, with each selection only providing a gain by explaining itself. Basically, after the first step, adding more items doesn't explain any of the OTHER examples.
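
As a sanity check on that reading, here is a tiny toy example (not your data) of how the facility location gains behave when several rows and columns are identical:

import numpy

# toy similarity matrix: samples 0-2 are perfectly redundant, samples 3-4 are mostly unrelated
K = numpy.array([[1.0, 1.0, 1.0, 0.1, 0.1],
                 [1.0, 1.0, 1.0, 0.1, 0.1],
                 [1.0, 1.0, 1.0, 0.1, 0.1],
                 [0.1, 0.1, 0.1, 1.0, 0.0],
                 [0.1, 0.1, 0.1, 0.0, 1.0]])

# naive greedy facility location: f(S) = sum_i max_{j in S} K[i, j]
current = numpy.zeros(len(K))
for _ in range(3):
    gains = numpy.maximum(K - current[:, None], 0).sum(axis=0)
    best = int(gains.argmax())
    print(best, round(float(gains[best]), 2))
    current = numpy.maximum(current, K[:, best])

This prints 0 3.2, then 3 0.9, then 4 0.9: after the first (redundant) pick, each remaining item's gain comes almost entirely from explaining itself.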

This shouldn't be a big problem, normally, because submodular optimization doesn't require that the gain of every item goes down each iteration-- it can also stay the same. However, in v0.6.0 I added a small optimization to the priority queue that terminates it when the gain it finds is 0 and the change is 0. Obviously this is somewhat of an issue for data structured like yours, where a bunch of items are perfectly redundant and, after the first selection, the remaining items are almost perfectly non-redundant.

You can check this hypothesis by adding a small amount of random Gaussian noise to the data so that the gain is never exactly 0, and that's exactly what you see:

# add tiny positive noise so that no marginal gain is ever exactly 0
x = aa['gram'][0] + numpy.abs(numpy.random.normal(0, 0.00001, size=(1000, 1000)))

selector1 = FacilityLocationSelection(800, metric='precomputed', optimizer='naive')
selector2 = FacilityLocationSelection(800, metric='precomputed', optimizer='lazy')

selector1.fit(x.numpy())
selector2.fit(x.numpy())

(selector1.ranking[:50] != selector2.ranking[:50]).sum()
# 0

tl;dr: the issue is minor, so I might not get around to it for a bit. You can use 0.5.0 because it doesn't include this "optimization."

Here's a Twitter thread about what I've added in v0.6: https://twitter.com/jmschreiber91/status/1311358116353642496?s=20


bhattg commented Dec 10, 2020

Thanks for the detailed response!
