PCA-Based Binning via Clustering #32
- This method trains forests on the four smaller "islands" that appear in the data when reduced using PCA.
- Various other improvements as well, such as restarts and FOM optimization.
- In the case where a learning rate is passed, it uses that.
- Otherwise, it runs a range test to find a "good" one (see the sketch after this list).
- Notebook is a bit messy right now because I'm working fast and late at night.
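In code, the learning-rate selection described in the bullets above might look something like the following sketch (`pick_learning_rate`, the candidate grid, and the few-step probe are illustrative assumptions, not the PR's actual implementation):

```python
import jax
import jax.numpy as jnp

def pick_learning_rate(loss_fn, params, user_lr=None,
                       candidates=jnp.logspace(-4, 0, 16), n_steps=5):
    # If a learning rate is passed, use it directly.
    if user_lr is not None:
        return user_lr
    # Otherwise, a crude range test: take a few gradient steps at each
    # candidate rate and keep the one that lowers the loss the most.
    grad_fn = jax.grad(loss_fn)
    base = loss_fn(params)
    best_lr, best_drop = None, -jnp.inf
    for lr in candidates:
        p = params
        for _ in range(n_steps):
            p = p - lr * grad_fn(p)
        drop = base - loss_fn(p)
        if drop > best_drop:
            best_lr, best_drop = float(lr), drop
    return best_lr
```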
Wow! This is really cool :-D Thanks so much for your entry @dylanagreen! Clustering by gradient descent :-D I like that a lot!
@EiffL Thank you very much! I appreciate it! I'm not convinced my training is the most efficient possible, since it's so sensitive to hyperparameters like the learning rate and starting centroids; if I had more time I'd probably dedicate it to refining the method further.
- Additionally brings it in line with the new training methods.
- Use a one-cycle policy rather than a range test for training (see the schedule sketch below).
- Do not train over beta; approximate it as n_bins.
- Also speeds training to ~1/3 of its previous time.
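For reference, a one-cycle learning-rate schedule of the kind the first bullet describes could look like this (the warmup fraction, divisor, and cosine tail are assumptions; the PR's actual schedule may differ):

```python
import jax.numpy as jnp

def one_cycle_lr(step, total_steps, max_lr, warmup_frac=0.3, div=25.0):
    # Ramp linearly from max_lr / div up to max_lr over the warmup
    # fraction of training, then cosine-anneal back down to the start.
    warmup = warmup_frac * total_steps
    min_lr = max_lr / div
    if step < warmup:
        return min_lr + (step / warmup) * (max_lr - min_lr)
    frac = (step - warmup) / (total_steps - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + jnp.cos(jnp.pi * frac))
```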
My method with no changes performs about the same (in comparison to the forest) on the Buzzard dataset (the above plots were generated for DC2). In order to get this DETF performance, however, I had to modify the learning rate very slightly. I will be adding a commit that adds a "buzzard" parameter to the .yaml file to implement this change as a toggle. The method itself is unchanged, but the learning rate is slightly higher for the Buzzard DETF training. I'm not sure why only the DETF metric improves with this change; the FOM and SNR here are generated using the exact same learning-rate scheme as the DC2 data. It might not even be worth implementing, committing, and pushing the change, since the improvement over the identical DC2 learning rate is somewhat marginal and irrelevant to whether the method "works", but I will do it anyway just so the option to enable it is there!
- This parameter controls whether to use the Buzzard learning-rate scheme for the FOM_DETF metric (and that metric only, as it is the only one that shows improvement under this scheme); see the config sketch after this list.
- Also fixed a misspelling of the word "verbose".
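A sketch of how such a toggle might be consumed (the file name, keys, and learning-rate values below are placeholders, not the PR's actual config):

```python
import yaml

with open("bag.yaml") as f:  # hypothetical config file name
    cfg = yaml.safe_load(f)

DC2_LR, BUZZARD_LR = 1e-2, 1.5e-2  # placeholder values, not from the PR
metric = cfg.get("metric", "FOM")

# The Buzzard learning-rate scheme applies only to the FOM_DETF metric.
lr = BUZZARD_LR if (cfg.get("buzzard", False) and metric == "FOM_DETF") else DC2_LR
```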
In true astrophysics fashion I have forced an acronym, and hence affectionately refer to my algorithm as "BAG" or more accurately "Binning As clusterinG."
In lieu of explaining thoroughly how my method works, I have created a Jupyter notebook that steps through the math and code iteratively, in notebooks/binning_as_clustering.ipynb. In essence I sought to find the method that would classify points the fastest, and here present a method that bins galaxies by first reducing the dimensionality of their color and magnitude data to three dimensions and then assigning each galaxy to a bin by finding which centroid it is closest to. Classification then requires only an argmax and calculating n vector distances, where n is the number of bins. This can be done in about 3 seconds for 3 bins! Using JAX to jit-compile the classification function reduces this time to a blistering 0.5 s on average. That's about the best thing I have going for this method.
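A minimal sketch of that assignment step (written with an argmin over squared distances, which is equivalent to the argmax over negated distances described above; the array shapes and names are assumptions):

```python
import jax
import jax.numpy as jnp

@jax.jit
def assign_bins(reduced, centroids):
    """Assign each galaxy to the bin of its nearest centroid.

    reduced:   (N, 3) array of PCA-reduced color/magnitude data.
    centroids: (n_bins, 3) array of learned centroid positions.
    Returns an (N,) integer array of bin labels.
    """
    # Squared Euclidean distance from every galaxy to every centroid.
    d2 = jnp.sum((reduced[:, None, :] - centroids[None, :, :]) ** 2, axis=-1)
    # Nearest centroid wins (argmin over d2 == argmax over -d2).
    return jnp.argmin(d2, axis=1)
```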
Training is done to find centroid positions that optimize the requested metric, using gradient descent and JAX. The learning rate for gradient descent is found using a range test, plus a few little tricks I implemented. Training is slow, although I do not know whether it is any slower than training a NN. Generally, training for 3 bins takes between ~5 and 10 minutes, and it takes longer for more bins, since computing the metric is more expensive at higher bin counts. The range test takes some time, but does make the training itself slightly faster.
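A sketch of what that training loop might look like, assuming a differentiable soft assignment (a softmax over negative squared distances, with the sharpness beta approximated as n_bins, per the commit message above); `metric_fn` is a hypothetical stand-in for whichever metric is being optimized:

```python
import jax
import jax.numpy as jnp

def soft_assignments(reduced, centroids, beta):
    # Differentiable (soft) bin assignment: a softmax over negative
    # squared distances, sharpened by beta, so gradients can flow
    # back to the centroid positions.
    d2 = jnp.sum((reduced[:, None, :] - centroids[None, :, :]) ** 2, axis=-1)
    return jax.nn.softmax(-beta * d2, axis=1)

def train_centroids(reduced, centroids, metric_fn, lr, n_steps):
    # Gradient *ascent* on the requested metric via descent on its
    # negation. metric_fn must be JAX-traceable for the jit to work.
    n_bins = centroids.shape[0]

    def neg_metric(c):
        return -metric_fn(soft_assignments(reduced, c, beta=n_bins))

    grad_fn = jax.jit(jax.grad(neg_metric))
    for _ in range(n_steps):
        centroids = centroids - lr * grad_fn(centroids)
    return centroids
```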
FOM
BAG achieves its best results when training for FOM, but shows instability above 7 bins. By 8 bins, the method is trying to optimize parameters in 24 dimensions (8 bins * 3-dimensional coordinates per bin centroid) and starts to get extremely finicky. I've spent most of the past week ironing out instabilities between 4 and 7 bins; beyond that has eluded me for the moment. This is my most recent performance plot; after 8 bins the metric seems to plateau at about the same value as 7 bins. Seven seems to be the unlucky number, already coming in below the trend that the previous five bin counts establish. To be investigated...?
An example of the binning generated for 3 bins when optimized for FOM:
FOM_DETF
I only started experimenting with the DETF Figure of Merit quite late. I was able to improve performance on this metric with some tweaks to the learning rate and some approximations. Best-case performance is stable and better than the forest up to 6 bins or so. I haven't been able to push performance beyond that above the plateau, but anyone is welcome to take up the challenge:
Example binning:
SNR
Improvements I made to the method to optimize FOM also improved its performance when optimizing for SNR (which was pretty abysmal when I started). BAG is definitely best used to classify for an optimized Figure of Merit, but I present the SNR results for completeness:
Example binning: