Pymatviz figures need refinement before they can be shown on main website/MP #172
Also tagging @mkhorton as some of these issues might be important when these figures are saved as JSON and loaded into the MP website...
@janosh Another cool plot might be a clustering based on structure similarity, with the size of the dots being the value of the target or similar.
Looks like I was able to get a more reasonable plot for the violins: seems like the column index for selecting target variables was set to the wrong column. But it seems like these plots only make sense for regression problems; they're not super informative for classification. It also takes quite a while to load/render these plots, esp. for the larger datasets...
@ardunn Sorry for the silence here. Wanted to reply yesterday that the compositions along the y-axis must be coming from the wrong column index but forgot. Too many things going on. 😄 Good thing you already fixed it! 👍 Definitely still committed to addressing your points and getting these plots onto the website. What would be the best way to collaborate on a PR? I can open one and then you as maintainer can commit to it, but it sounds like you already have some local changes. You want to merge them first? Or just push the branch and I branch off of that?
@janosh Yes please! Just open a PR and then make sure the box that says "allow edits from maintainers" etc. is checked, and I think I should be able to push commits to it as well. Also, to reiterate, this should be done on the `pymatviz_eda` branch. For simplicity, let me summarize and itemize the issues I had with the original plots:
Ok great! I think we should invert the title text as well.
This sounds good to me! I am generally not a fan of having links to outside services (i.e., plots hosted on Plotly) because links are always susceptible to link rot or random stuff happening. If possible I'd like to have some static assets stored in the repo which can be loaded by the website quickly (within ~1 s of loading the page). I'm not particularly concerned with how exactly this would be done (iframe, something else, etc.), so definitely open to other options.
I don't think the clustering itself needs to be fast, as these figures (at least how I have it written) would only be regenerated infrequently. What is more important is that the figures be informative!
Any objections? @ardunn I might explore making it interactive (i.e. you can change some of the clustering parameters), but to make such a figure static, the HTML file could get large (not exactly sure how large without trying). Any sense of an upper limit I should try to stay within if I go down that route? 100 MB? 10 MB? 5 MB? I suggest looking at this figure for an example of what I have in mind. In terms of being informative, I think something missing (that is on my backlog) is showing the N most frequent anonymous formulas and the N most frequent elements present in each cluster. I think that would help a lot with immediate interpretation. Another visualization worth mentioning is target vs. (a proxy for) chemical novelty; see this example.
How much slower would it be with the …?
@CompRhys I haven't compared to …
Just tested it with …
@sgbaird Cool, thanks for the great suggestions! 👍
I'm a bit reluctant about the numba dependency.
@CompRhys I based it off of …

@janosh Sure thing! What about JAX as a dependency? It would be easy enough to whip up a JAX implementation given the NumPy support there, though there are no guarantees on the speedup. Sounds good about jumping on the …
I'd prefer not to have it as a main dependency for matbench, as numba can be finicky and it is not really needed for the core matbench functionality. If we 100% absolutely truly need it, we can add it as a dependency for the docs only, or as a dependency of pymatviz only (though that would of course be @janosh's call). I'd like to keep the dependencies for matbench itself very minimal.
Eh, I'd say something like 5 MB should be the max. Otherwise maybe we can just use the "cdn" argument to the Plotly `write_html` method so that the extra JS/CSS is referenced from Plotly's servers. I think this is what @janosh did in the original commits and it reduced most HTML file sizes from 10 MB+ to a few hundred KB.
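For reference, a minimal sketch of that CDN option (figure contents and file name are placeholders):

```python
# Sketch: shrink exported figure HTML by referencing plotly.js from the CDN
# instead of embedding the full JS bundle in every file.
import plotly.express as px

fig = px.scatter(x=[1, 2, 3], y=[4, 5, 6])

# include_plotlyjs="cdn" writes a <script src=".../plotly.min.js"> tag instead
# of inlining the library, typically cutting the file from several MB to ~100s of KB.
fig.write_html("my_plot.html", include_plotlyjs="cdn", full_html=True)
```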
Yes! That seems like a really good idea!
@CompRhys @sgbaird @janosh Perhaps before we go down the rabbit hole of optimizing this: how long does a typical clustering take without numba/JAX/pot/etc. for one of the large datasets (e.g., …)?
... I'd choose (a)
Avoiding numba sounds OK to me 👍 Didn't know about the "cdn" stuff. I'll keep that in mind. Thanks!
Based on @CompRhys's estimated time of 3 min for …, I should probably mention that UMAP (and by extension DensMAP) also depends on Numba, and HDBSCAN depends on Cython (I'm guessing the latter isn't an issue). Part of our discussion here might be spilling a bit past Matbench's scope, since the runtime is a much bigger consideration for … What about the option of uploading hard-coded 2D embeddings to FigShare (or even just the Matbench repo) for each of the Matbench datasets?
That option seems pretty attractive. We could just keep it with the other matbench metadata and it would have about the same frequency of updates as that metadata (i.e., infrequently). It might be worth first seeing what some of these clustering plots look like in regard to the target variables. For example, does a clustering of the expt_gap dataset where the points are scaled by target value actually reveal anything about the dataset, or is it just pretty? I'd be glad to look into that myself if you have some starter-ish code for doing so.
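A rough sketch of what such starter code could look like (the feature matrix, target values, and file name below are placeholders; in practice they'd come from an actual Matbench dataset and a composition featurizer):

```python
# Starter sketch for "clustering with points scaled by target value".
# Random data stands in for real composition features so the snippet runs standalone.
import numpy as np
import plotly.express as px
import umap  # pip install umap-learn (note: pulls in numba)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64))  # placeholder for composition features
y = rng.uniform(0, 5, size=500)  # placeholder for e.g. band gaps in eV

# densmap=True gives the DensMAP variant discussed above
embedding = umap.UMAP(densmap=True, random_state=0).fit_transform(X)

fig = px.scatter(
    x=embedding[:, 0],
    y=embedding[:, 1],
    size=y,  # scale marker size by target value
    color=y,  # and color by it as well for readability
    labels={"x": "UMAP 1", "y": "UMAP 2", "color": "target"},
)
fig.write_html("expt_gap_umap.html", include_plotlyjs="cdn")
```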
Sounds good! I think we should go with that.
See … I think the target values expressed in the manifold could provide a visual interpretation of the regression task's difficulty (i.e. the complexity of the response surface). Normally I think this is revealed in more quantitative measures like comparing the ratio between a dummy score and a baseline model score (which is already a nice part of Matbench); maybe this could reveal some other trends like "the model tends to struggle with these types of compositions", or "this indicates the model is highly structure-dependent because the response surface in composition space exhibits virtually no trend".

Despite these comments, I think your comment "or is it just pretty?" is well-justified. Suppose the conclusions based on those visualizations don't corroborate tried-and-tested quantitative measures like the ratio between dummy score and a baseline model or something like …

Having invested in composition-based models, I recognize I have some bias: for example, a tendency to ask "what's the best way to visualize just the compositional information?" rather than "how useful is it really to visualize composition-only information?" cc also @SurgeArrester, who might have some suggestions.
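For concreteness, a toy sketch of that dummy-vs-baseline score ratio using scikit-learn (random data stands in for a real Matbench fold, and the random forest is an arbitrary baseline choice, not Matbench's actual scoring code):

```python
# Toy illustration of the "dummy score / model score" ratio mentioned above.
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

dummy_mae = mean_absolute_error(
    y_test, DummyRegressor(strategy="mean").fit(X_train, y_train).predict(X_test)
)
model_mae = mean_absolute_error(
    y_test, RandomForestRegressor(random_state=0).fit(X_train, y_train).predict(X_test)
)

# Ratios well above 1 indicate learnable structure in the task;
# ratios near 1 suggest the model barely beats guessing the mean.
print(f"dummy MAE / model MAE = {dummy_mae / model_mae:.2f}")
```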
When I've been testing … For the speed differences, I haven't done a deep dive into why … The simplex algorithm has an O(2^n) running time for the worst-case counterexample, but the average running time is empirically shown to be less than O(n^3) (often less than O(n^2)) for randomly initialised problems. I haven't found a direct reference for empirical running times of the network simplex, but I believe a similar phenomenon is occurring for composition matching problems (Vanderbei, Linear Programming, '20).

If it is just the distance that is required, without the transportation plan, and a monotonic 1D elemental scale is acceptable (e.g. Mendeleev numbers but not elemental embeddings), then the method discussed in this issue should be the fastest. I can't find a reference for how I derived this implementation, so I'm not 100% sure whether this aligns with the method given in the literature: https://www.imagedatascience.com/transport/OTCrashCourse.pdf slide 45, https://arxiv.org/pdf/1804.01947.pdf section IV A.

For linear compositional clustering I used to create these by generating a full distance matrix and projecting it to 1D PCA embeddings, but I've found that simply computing the distance to hydrogen and sorting based on this value is a much faster method that gives a comparable ordering. A list of ElMD objects can be sorted this way using the built-in …
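A small sketch of that distance-to-hydrogen sorting trick using the ElMD package (the exact ElMD API may differ between versions, so treat the `.elmd()` call as an assumption):

```python
# Sort compositions by their ElMD distance to pure hydrogen, which acts as a
# cheap 1D proxy for chemical ordering and avoids the full pairwise distance
# matrix + PCA projection described above.
from ElMD import ElMD  # pip install ElMD

formulas = ["NaCl", "CaTiO3", "Fe2O3", "LiFePO4", "SrTiO3"]

# Assumes ElMD(formula).elmd(other_formula) returns the earth mover's distance.
ordered = sorted(formulas, key=lambda f: ElMD(f).elmd("H"))
print(ordered)
```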
I've removed the …
Wow, super cool @SurgeArrester! I'm learning a lot just from reading this thread. 😄 I just tried …

Maybe worth adding a note in the readme for py3.8+ users. Also, GH comments support KaTeX math using … Anyways, haven't started working on this yet but it's high on my list.
Ahh, excellent suggestion, that's a much cleaner solution, many thanks! Pushed to latest version.
Another option is https://github.com/ptooley/numbasub if you wanted to use additional numba features. Looks unmaintained but I imagine it still works; unfortunate it's not on PyPI. (Edit: given the license, could probably also just copy the …)
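For illustration, roughly the pattern such a shim provides, sketched from scratch rather than copied from numbasub: fall back to a no-op decorator when numba isn't importable, so the same code runs (just slower) without it.

```python
# Keep numba optional: use its njit when available, otherwise a no-op decorator.
try:
    from numba import njit
except ImportError:

    def njit(*args, **kwargs):
        # Mimic numba's two calling styles: @njit and @njit(cache=True, ...)
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]  # used as a bare decorator
        return lambda func: func  # used as a decorator factory

import numpy as np


@njit(cache=True)
def pairwise_sq_dists(x):
    # Toy kernel that benefits from numba but also works in pure Python/NumPy.
    n = x.shape[0]
    out = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            out[i, j] = np.sum((x[i] - x[j]) ** 2)
    return out


print(pairwise_sq_dists(np.random.rand(5, 3)))
```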
@SurgeArrester nice! I ran a comparison and it looks like there's a ~2x speedup for …

@ardunn Back to the topic of computing it once and loading it: here's a notebook that does the clustering for …
@faris-k mentioned https://projector.tensorflow.org/ to me. For compositions, a heatmap image of the periodic table could be used. A 2D image of the crystal structure could also be used for structures. |
@janosh thanks for the PR and the work on pymatviz! A lot of the pymatviz plots which are not yet in matbench (esp. the uncertainty ones) will come in very handy for matbench in the near future :) But for now the EDA ones seem like a good start.

There are still some rough edges I'd like to iron out before we move it to be automatically shown on the main website.
I made some edits to your code so the information artifacts (the bz2s), as well as the plots, are only generated if the script specifies two arguments. There is also some renaming of stuff (mostly appending `pymatviz` to the front of everything) to keep things consistent for naming, specifying a particular version of pymatviz to use, and copying a static set of HTMLs to the static docs dir instead of keeping them directly in the static docs dir (because `*.html`s get purged by `nuke_docs` on every run and the plots then need to be regenerated instead of just copied).

I was planning on showing the figures beneath the leaderboards for the "Per Task Leaderboards" by just stuffing them in iframes, but this doesn't look particularly good at the moment (see screenshots). For this reason, all the code (incl. artifacts etc.) for actually generating and putting the plots in the docs automatically is currently in the `pymatviz_eda` branch, not `main`.

The Problems (screenshots)
Colors are clashing with dark background
Width of frame is more than width of readable area, causing a nasty-looking horizontal scroll bar (and the colorbar can't be seen)
Colors are clashing with dark background
I also don't get what these plots are actually saying. Like, the y-axis is composition (!?), but if I hover over individual compositions along that y-axis they rarely match the y-axis... I'm just not really sure what the y-axis represents here?
Maybe we should change the y-axis in this case to show the target variable rather than composition? So, like in your original PR #126, we could see a breakdown of refractive index by crystal system, etc.
This plot looks generally OK, but could we make the text bigger, and have the title text not blend into the dark background?
The Solutions (I need help)
It seems like some of these things can be fixed by changing the pymatviz arguments, but I'm not sure of the best way to do that.
Some of the other stuff seems like it needs to be edited in the iframe... I'm not so great at frontend web design, so I was just formatting iframes based on the HTML filenames like so:
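(The original snippet isn't shown here; the following is a hypothetical sketch of that kind of filename-based iframe formatting, with made-up paths, class names, and sizes.)

```python
# Hypothetical sketch: build iframe tags for each exported pymatviz HTML figure.
from glob import glob

iframe_template = (
    '<iframe src="../static/{fname}" width="800" height="600" '
    'frameborder="0"></iframe>'
)

iframes = "\n".join(
    iframe_template.format(fname=path.split("/")[-1])
    for path in sorted(glob("docs/static/pymatviz_*.html"))
)
print(iframes)
```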
...but there surely has to be a better way...
Do you know how we could fix these things?