PCA-Based Binning via Clustering #32

Open · dylanagreen wants to merge 19 commits into master
Conversation


@dylanagreen commented Aug 31, 2020

In true astrophysics fashion I have forced an acronym, and hence affectionately refer to my algorithm as "BAG", or more accurately "Binning As clusterinG."

Rather than explain the method exhaustively here, I have created a Jupyter notebook that steps through the math and code, in notebooks/binning_as_clustering.ipynb. In essence I sought the method that would classify points the fastest, and here present one that bins galaxies by first reducing the dimensionality of their color and magnitude data to three dimensions with PCA, and then assigning each galaxy to the bin whose centroid it is closest to. Classification then requires only n vector distances and an argmax, where n is the number of bins. This takes about 3 seconds for 3 bins! Using JAX to jit-compile the classification function reduces this to a blistering 0.5 s on average. That's about the best thing I have going for this method.
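For illustration, the assignment step could be jit-compiled along these lines (a minimal sketch with illustrative names, not the PR's actual code; I use an argmin over squared distances, which is equivalent to the argmax formulation above):

```python
import jax
import jax.numpy as jnp

@jax.jit
def assign_bins(points, centroids):
    """Assign each PCA-reduced point to the bin of its nearest centroid.

    points: (N, 3) array of PCA-reduced galaxy coordinates.
    centroids: (n_bins, 3) array of trained centroid positions.
    """
    # Squared Euclidean distance from every point to every centroid: (N, n_bins)
    d2 = jnp.sum((points[:, None, :] - centroids[None, :, :]) ** 2, axis=-1)
    # Nearest centroid per point; argmin over distances is equivalent to an
    # argmax over negative distances.
    return jnp.argmin(d2, axis=-1)
```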

Training uses gradient descent and JAX to find the centroid positions that optimize the requested metric. The learning rate for gradient descent is found using a range test, plus a few little tricks I implemented.
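Schematically, a training step is then plain gradient descent on the centroid coordinates (a sketch; `metric_fn` here is a stand-in for a differentiable surrogate of the requested metric (FOM, DETF FOM, or SNR), whose actual computation is not reproduced here):

```python
import jax

def training_step(centroids, points, lr, metric_fn):
    """One gradient-descent step on the centroid positions.

    metric_fn(centroids, points) stands in for a differentiable version of
    the requested metric; the real computation from the tomographic bins is
    in the notebook, not here.
    """
    # Maximize the metric by descending the negative metric.
    neg_metric, grads = jax.value_and_grad(lambda c: -metric_fn(c, points))(centroids)
    return centroids - lr * grads, -neg_metric
```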

Training is slow, although I don't know whether it is any slower than training a neural network. Training for 3 bins generally takes between ~5 and 10 minutes, and takes longer for more bins, since computing the metric is more expensive with more bins. The range test takes some time, but does make the training itself slightly faster.

FOM

BAG achieves its best results when training for FOM, but shows instability above 7 bins. By 8 bins the method is trying to optimize parameters in 24 dimensions (8 bins × 3-dimensional coordinates per bin centroid) and becomes extremely finicky. I've spent most of the past week ironing out instabilities between 4 and 7 bins; beyond that has eluded me for the moment. This is my most recent performance plot; after 8 bins performance seems to plateau at about the same value as for 7 bins. 7 also seems to be the unlucky number that already falls below the trend established by the previous five bin numbers. To be investigated...?

[Image: hist_fom]

An example of the binning generated for 3 bins when optimized for FOM:
[Image: bins_fom]

FOM_DETF

I only began experimenting with the DETF Figure of Merit quite late. I was able to improve performance on the DETF metric with some tweaks to the learning rate and some approximations. Best-case performance is stable and better than the forest up to 6 bins or so. I haven't been able to push performance beyond that above the plateau, but anyone is welcome to take up the challenge:

[Image: hist_fom_detf]

Example binning:
[Image: bins_fom_detf]

SNR

The improvements I made while optimizing for FOM also improved performance when optimizing for SNR (which was pretty abysmal when I started). BAG is definitely best used to classify for an optimized Figure of Merit, but I present the SNR results for completeness:

[Image: hist_snr]

Example binning:
[Image: bins_snr]

dylanagreen and others added 11 commits August 13, 2020 12:49
- This method trains forests on the four smaller "islands" that appear in the data when reduced using PCA.
- Various other improvements as well, like restarts and FOM optimization.
- In the case where a learning rate is passed, use that.
- Otherwise, run a range test to find a "good" one.
- Notebook is a bit messy right now because I'm working fast and late at night.
@EiffL added the entry (Challenge entry) label Aug 31, 2020
@EiffL (Member) commented Aug 31, 2020

Wow! This is really cool :-D Thanks so much for your entry @dylanagreen: clustering by gradient descent :-D I like that a lot!

@dylanagreen (Author) commented Sep 1, 2020

@EiffL Thank you very much! I appreciate it! I'm not convinced my training is the most efficient possible, since it's so sensitive to hyperparameters like the learning rate and the starting centroids; if I had more time I'd probably dedicate it to refining the method further.

@dylanagreen marked this pull request as ready for review September 1, 2020 18:34
- Use the one-cycle policy rather than a range test for training (sketched below).
- Do not train over beta; approximate it as n_bins.
- This also speeds training to about a third of its previous time.
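For reference, a one-cycle learning-rate schedule of the kind named in these commits could look roughly like this (a generic linear-ramp sketch; the PR's exact ramp shape and `div_factor` are assumptions):

```python
import jax.numpy as jnp

def one_cycle_lr(step, total_steps, max_lr, div_factor=25.0):
    """Linear one-cycle schedule: ramp up to max_lr, then anneal back down.

    A generic sketch of the policy, not the PR's exact schedule.
    """
    base_lr = max_lr / div_factor
    half = total_steps / 2.0
    # Fraction of the way toward the peak: rises on the first half-cycle,
    # falls symmetrically on the second.
    frac = jnp.where(step < half, step / half, (total_steps - step) / half)
    return base_lr + (max_lr - base_lr) * frac
```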
@dylanagreen (Author) commented:
My method, with no changes, performs about the same (relative to the forest) on the Buzzard dataset (the plots above were generated for DC2):

[Image: hist_fom(1)]
[Image: hist_detf]
[Image: hist_snr(1)]

To get this DETF performance, however, I had to modify the learning rate very slightly. I will be adding a commit that adds a "buzzard" parameter to the .yaml file to implement this change as a toggle. The method itself is unchanged, but the learning rate is slightly higher for the Buzzard DETF training. I'm not sure why only the DETF metric benefits from this change; both the FOM and SNR results here were generated using the exact same learning rate scheme as the DC2 data. It might not even be worth implementing, committing, and pushing the change, since the improvement over the identical DC2 learning rate is somewhat marginal and irrelevant to whether the method "works" or not, but I will do it anyway just so the option to enable it is there!
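For concreteness, the toggle might be consumed along these lines (hypothetical key names and placeholder learning-rate values, not the PR's actual config schema or rates):

```python
import yaml  # PyYAML

# Hypothetical shape of the toggle described above; key names are illustrative.
with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

# Use the slightly higher Buzzard learning rate only for the DETF metric.
use_buzzard_lr = cfg.get("buzzard", False) and cfg.get("metric") == "FOM_DETF"
lr = 1.5e-2 if use_buzzard_lr else 1e-2  # placeholder values, not the PR's rates
```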

- This parameter controls whether we should use the Buzzard learning rate scheme for the FOM_DETF metric (and that metric only, as it is the only one that shows improvement under this scheme).
- Also fixed a misspelling of the word "verbose".