[FEAT] Better handling of integers distribution in TableReport #1164

Open
Vincent-Maladiere opened this issue Nov 28, 2024 · 10 comments
Labels
enhancement New feature or request

Comments

@Vincent-Maladiere
Member

Vincent-Maladiere commented Nov 28, 2024

Problem Description

The xtick locations for integer distributions are often off, and the bars are spaced irregularly, which looks inconsistent.

Screenshot 2024-11-28 at 12 23 58

The years in the plot above are floats, but converting to integers doesn't help.

Screenshot 2024-11-28 at 12 27 00

Feature Description

We could display the bars at regular intervals for integers (and floats?), especially when the number of bins is < 10. We can come up with a simple heuristic/fix at first.

Alternative Solutions

.

Additional Context

skrub 0.4.0 :))

@Vincent-Maladiere Vincent-Maladiere added the enhancement New feature or request label Nov 28, 2024
@jeromedockes
Member

Thanks @Vincent-Maladiere. To help look for a solution, here is a minimal reproducer of the issue that does not require generating a report:

from matplotlib import pyplot as plt
import numpy as np

x = np.arange(9)
fig, ax = plt.subplots()
ax.hist(x)

histogram
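To make the cause visible without a plot, here is a small sketch that inspects the bins `np.histogram` picks for the same data with its default of 10 equal-width bins. It only illustrates the reproducer above; it is not a proposed fix.

```python
import numpy as np

x = np.arange(9)  # the integers 0..8, as in the reproducer

# By default np.histogram uses 10 equal-width bins spanning [x.min(), x.max()].
counts, edges = np.histogram(x)

# The bin width is (8 - 0) / 10 = 0.8, so the edges are not integers
# and the bars cannot line up with integer ticks.
print(edges)

# Nine values spread over ten bins: one bin stays empty, which shows up
# as a gap between the bars.
print(counts)
```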

@jeromedockes
Member

also to try out solutions, could you share the "Year" column you used above?

@jeromedockes
Member

I think maybe when there are few unique values we shouldn't plot a histogram but a stem plot instead: https://matplotlib.org/stable/plot_types/basic/stem.html#sphx-glr-plot-types-basic-stem-py

or treat the variable as categorical and do a bar plot 🤔 if there was some way to detect that the actual values don't matter too much besides their ordering
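A minimal sketch of the stem-plot idea, assuming a cutoff on the number of unique values. The helper name `plot_few_values` and the `max_unique=10` threshold are hypothetical, chosen only for illustration; they are not part of skrub.

```python
import numpy as np
from matplotlib import pyplot as plt


def plot_few_values(values, ax, max_unique=10):
    """Hypothetical helper: stem plot for few unique values, histogram otherwise."""
    values = np.asarray(values)
    uniques, counts = np.unique(values, return_counts=True)
    if len(uniques) <= max_unique:
        # One stem per distinct value, with a tick exactly under each stem.
        ax.stem(uniques, counts)
        ax.set_xticks(uniques)
        return "stem"
    # Too many distinct values: fall back to the usual histogram.
    ax.hist(values)
    return "hist"


fig, ax = plt.subplots()
kind = plot_few_values(np.arange(9), ax)
```

Replacing `ax.stem(uniques, counts)` with `ax.bar(uniques.astype(str), counts)` would give the categorical bar-plot variant, at the cost of losing the numeric spacing between values.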

@rcap107
Contributor

rcap107 commented Nov 28, 2024

FWIW, the misalignment between bins and labels is something I've seen in general matplotlib use, so I don't know how it could be addressed specifically in the TableReport

> I think maybe when there are few unique values we shouldn't plot a histogram but a stem plot instead: https://matplotlib.org/stable/plot_types/basic/stem.html#sphx-glr-plot-types-basic-stem-py
>
> or treat the variable as categorical and do a bar plot 🤔 if there was some way to detect that the actual values don't matter too much besides their ordering

I like the idea of using stem plots

@Vincent-Maladiere
Member Author

Maybe we could derive good heuristics using np.histogram and plt.bar instead of plt.hist directly

@jeromedockes
Member

sure, I don't think it will make much of a difference -- plt.hist just forwards the binning arguments to np.histogram

@Vincent-Maladiere
Member Author

What I meant is that we might have better control of the bins by decoupling the histogram computation from the bar plot. I don't have anything against stem plots though, as long as they are easy to see on small plots
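One way the decoupling could look, as a sketch only: compute the counts with `np.histogram` using one unit-width bin centered on each integer, then draw them with `ax.bar`. The helper `integer_bins` is hypothetical, and this only makes sense when the range of the integers is small enough to give each one its own bar.

```python
import numpy as np
from matplotlib import pyplot as plt


def integer_bins(values):
    """Hypothetical helper: edges of unit-width bins centered on each integer."""
    lo, hi = int(np.min(values)), int(np.max(values))
    # Edges at lo - 0.5, lo + 0.5, ..., hi + 0.5.
    return np.arange(lo - 0.5, hi + 1.5)


x = np.arange(9)
edges = integer_bins(x)
counts, _ = np.histogram(x, bins=edges)
centers = (edges[:-1] + edges[1:]) / 2  # the integers themselves

fig, ax = plt.subplots()
ax.bar(centers, counts, width=0.9)  # bars centered on the integers
ax.set_xticks(centers)              # ticks line up with the bars
```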

@jeromedockes
Member

one question with the stem plots is how to handle outliers -- add a red stem on the side of the axis?

@jeromedockes
Member

btw here's another example in the "day of the week" column in this other issue

@Vincent-Maladiere
Member Author

> one question with the stem plots is how to handle outliers -- add a red stem on the side of the axis?

This is where I prefer bars as well, although a red stem thingy looks fine I guess
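For discussion, a rough sketch of what the "red stem on the side of the axis" could look like. The helper `stem_with_outliers` and the explicit `low`/`high` bounds are hypothetical; how the bounds would be chosen is left open here.

```python
import numpy as np
from matplotlib import pyplot as plt


def stem_with_outliers(values, low, high, ax):
    """Hypothetical helper: stems for values in [low, high], red stems for the rest."""
    values = np.asarray(values)
    inside = values[(values >= low) & (values <= high)]
    uniques, counts = np.unique(inside, return_counts=True)
    if len(uniques):
        ax.stem(uniques, counts)

    n_below = int((values < low).sum())
    n_above = int((values > high).sum())
    # At most one red stem per side, drawn one unit past the kept range,
    # whose height is the number of values that fell outside.
    if n_below:
        ax.stem([low - 1], [n_below], linefmt="r-", markerfmt="ro")
    if n_above:
        ax.stem([high + 1], [n_above], linefmt="r-", markerfmt="ro")
    return n_below, n_above


fig, ax = plt.subplots()
n_below, n_above = stem_with_outliers([1, 2, 2, 3, 3, 3, 250], low=0, high=10, ax=ax)
```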
