PCA-Based Binning via Clustering #32
- This method trains forests on the four smaller "islands" that appear in the data when reduced using PCA.
- Various other improvements as well, such as restarts and FOM optimization.
- In the case where a learning rate is passed, it uses that.
- Otherwise, it runs a range test to find a "good" one (see the sketch after this list).
- Notebook is a bit messy right now because I'm working fast and late at night.
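In code, the learning-rate selection described in the bullets above might look something like the following sketch (`pick_learning_rate`, the candidate grid, and the few-step probe are illustrative assumptions, not the PR's actual implementation):

```python
import jax
import jax.numpy as jnp

def pick_learning_rate(loss_fn, params, user_lr=None,
                       candidates=jnp.logspace(-4, 0, 16), n_steps=5):
    # If a learning rate is passed, use it directly.
    if user_lr is not None:
        return user_lr
    # Otherwise, a crude range test: take a few gradient steps at each
    # candidate rate and keep the one that lowers the loss the most.
    grad_fn = jax.grad(loss_fn)
    base = loss_fn(params)
    best_lr, best_drop = None, -jnp.inf
    for lr in candidates:
        p = params
        for _ in range(n_steps):
            p = p - lr * grad_fn(p)
        drop = base - loss_fn(p)
        if drop > best_drop:
            best_lr, best_drop = float(lr), drop
    return best_lr
```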
Wow! This is really cool :-D Thanks so much for your entry @dylanagreen! Clustering by gradient descent :-D I like that a lot!
@EiffL Thank you very much! I appreciate it! I'm not convinced my training is the most efficient possible, since it's so sensitive to hyperparameters like the learning rate and starting centroids; if I had more time I'd probably dedicate it to refining the method further.
- Additionally brings it in line with the new training methods.
- Use a one-cycle policy rather than a range test for training (see the schedule sketch below).
- Do not train over beta; approximate it as n_bins.
- Also speeds training to ~1/3 of its previous time.
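For reference, a one-cycle learning-rate schedule of the kind the first bullet describes could look like this (the warmup fraction, divisor, and cosine tail are assumptions; the PR's actual schedule may differ):

```python
import jax.numpy as jnp

def one_cycle_lr(step, total_steps, max_lr, warmup_frac=0.3, div=25.0):
    # Ramp linearly from max_lr / div up to max_lr over the warmup
    # fraction of training, then cosine-anneal back down to the start.
    warmup = warmup_frac * total_steps
    min_lr = max_lr / div
    if step < warmup:
        return min_lr + (step / warmup) * (max_lr - min_lr)
    frac = (step - warmup) / (total_steps - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + jnp.cos(jnp.pi * frac))
```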
My method with no changes performs about the same (in comparison to the forest) on the Buzzard dataset (the above plots were generated for DC2). In order to get this DETF performance, however, I had to modify the learning rate very slightly. I will be adding a commit that adds a "buzzard" parameter to the .yaml file to implement this change as a toggle. The method itself is unchanged, but the learning rate is slightly higher for the Buzzard DETF training. I'm not sure why only the DETF metric improves with this change; the FOM and SNR here are generated using the exact same learning-rate scheme as the DC2 data. It might not even be worth implementing, committing, and pushing the change, since the improvement over the identical DC2 learning rate is somewhat marginal and irrelevant to whether the method "works", but I will do it anyway just so the option to enable it is there!
- This parameter controls whether to use the Buzzard learning-rate scheme for the FOM_DETF metric (and that metric only, as it is the only one that shows improvement under this scheme); see the config sketch after this list.
- Also fixed a misspelling of the word "verbose".
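A sketch of how such a toggle might be consumed (the file name, keys, and learning-rate values below are placeholders, not the PR's actual config):

```python
import yaml

with open("bag.yaml") as f:  # hypothetical config file name
    cfg = yaml.safe_load(f)

DC2_LR, BUZZARD_LR = 1e-2, 1.5e-2  # placeholder values, not from the PR
metric = cfg.get("metric", "FOM")

# The Buzzard learning-rate scheme applies only to the FOM_DETF metric.
lr = BUZZARD_LR if (cfg.get("buzzard", False) and metric == "FOM_DETF") else DC2_LR
```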
In true astrophysics fashion I have forced an acronym, and hence affectionately refer to my algorithm as "BAG" or more accurately "Binning As clusterinG."
In lieu of explaining thoroughly how my method works, I have created a Jupyter notebook that steps through the math and code iteratively, in notebooks/binning_as_clustering.ipynb. In essence I sought to find the method that would classify points the fastest, and here present a method that bins galaxies by first reducing the dimensionality of their color and magnitude data to three dimensions and then assigning each galaxy to a bin by finding which centroid it is closest to. Classification then requires only an argmax and calculating n vector distances, where n is the number of bins. This can be done in about 3 seconds for 3 bins! Using JAX to jit-compile the classification function reduces this time to a blistering 0.5 s on average. That's about the best thing I have going for this method.
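A minimal sketch of that assignment step (written with an argmin over squared distances, which is equivalent to the argmax over negated distances described above; the array shapes and names are assumptions):

```python
import jax
import jax.numpy as jnp

@jax.jit
def assign_bins(reduced, centroids):
    """Assign each galaxy to the bin of its nearest centroid.

    reduced:   (N, 3) array of PCA-reduced color/magnitude data.
    centroids: (n_bins, 3) array of learned centroid positions.
    Returns an (N,) integer array of bin labels.
    """
    # Squared Euclidean distance from every galaxy to every centroid.
    d2 = jnp.sum((reduced[:, None, :] - centroids[None, :, :]) ** 2, axis=-1)
    # Nearest centroid wins (argmin over d2 == argmax over -d2).
    return jnp.argmin(d2, axis=1)
```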
Training is done to find centroid positions that optimize the requested metric, using gradient descent and JAX. The learning rate for gradient descent is found using a range test, plus a few little tricks I implemented. Training is slow, although I do not know whether it is any slower than training a NN. Generally, training for 3 bins takes between ~5 and 10 minutes, and it takes longer for more bins, since computing the metric is more expensive at higher bin counts. The range test takes some time, but does make the training itself slightly faster.
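A sketch of what that training loop might look like, assuming a differentiable soft assignment (a softmax over negative squared distances, with the sharpness beta approximated as n_bins, per the commit message above); `metric_fn` is a hypothetical stand-in for whichever metric is being optimized:

```python
import jax
import jax.numpy as jnp

def soft_assignments(reduced, centroids, beta):
    # Differentiable (soft) bin assignment: a softmax over negative
    # squared distances, sharpened by beta, so gradients can flow
    # back to the centroid positions.
    d2 = jnp.sum((reduced[:, None, :] - centroids[None, :, :]) ** 2, axis=-1)
    return jax.nn.softmax(-beta * d2, axis=1)

def train_centroids(reduced, centroids, metric_fn, lr, n_steps):
    # Gradient *ascent* on the requested metric via descent on its
    # negation. metric_fn must be JAX-traceable for the jit to work.
    n_bins = centroids.shape[0]

    def neg_metric(c):
        return -metric_fn(soft_assignments(reduced, c, beta=n_bins))

    grad_fn = jax.jit(jax.grad(neg_metric))
    for _ in range(n_steps):
        centroids = centroids - lr * grad_fn(centroids)
    return centroids
```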
FOM
BAG achieves its best results when training for FOM, but shows instability above 7 bins. By 8 bins, the method is trying to optimize parameters in 24 dimensions (8 bins * 3-dimensional coordinates per bin centroid) and starts to get extremely finicky. I've spent most of the past week ironing out instabilities between 4 and 7 bins; beyond that has eluded me for the moment. This is my most recent performance plot; after 8 bins the metric seems to plateau at about the same value as 7 bins. Seven seems to be the unlucky number, already coming in below the trend that the previous five bin counts establish. To be investigated...?
An example of the binning generated for 3 bins when optimized for FOM:
FOM_DETF
I only started experimenting with the DETF Figure of Merit quite late. I was able to improve performance on this metric with some tweaks to the learning rate and some approximations. Best-case performance is stable and better than the forest up to 6 bins or so. I haven't been able to push performance beyond that above the plateau, but anyone is welcome to take up the challenge:
Example binning:
SNR
Improvements I made to the method to optimize FOM also improved its performance when optimizing for SNR (which was pretty abysmal when I started). BAG is definitely best used to classify for an optimized Figure of Merit, but I present the SNR results for completeness:
Example binning: