TableReport outliers improvement? #1167

Open
Vincent-Maladiere opened this issue Nov 29, 2024 · 3 comments

@Vincent-Maladiere
Member

Describe the bug

I don't think this is a bug; I'm opening this issue to record a behaviour I just observed.

Here is a series where the outlier detection is slightly off. It is an integer series taking the values {0, 80, 100}. The values are actually categorical: they only distinguish states, and the particular state values are arbitrary (they could have been 0, 1, 2, but for some reason they are 0, 80, 100).

Unless we tune the heuristic a little bit to avoid counting outliers when the cardinality is "very low", I don't see an easy improvement.
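
One possible shape for such a tweak, as a minimal sketch: this is not skrub's actual heuristic, just a Tukey-style IQR fence with an optional cardinality guard. The function name `flag_outliers`, the `min_unique` parameter, and the counts in `states` are all made up for illustration (the contents of state.csv are not reproduced here).

```python
from statistics import quantiles


def flag_outliers(values, factor=1.5, min_unique=None):
    """Return the values that fall outside an IQR-based fence.

    Hypothetical helper, not skrub's implementation. When `min_unique`
    is set and the series has fewer distinct values than that, outlier
    detection is skipped entirely.
    """
    if min_unique is not None and len(set(values)) < min_unique:
        return []
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < low or v > high]


# Invented counts for a three-state column.
states = [0] * 5 + [80] * 45 + [100] * 50

without_guard = flag_outliers(states)
with_guard = flag_outliers(states, min_unique=10)
print(len(without_guard), len(with_guard))
```

With the guard off, the five 0 entries are flagged even though 0 is just another state; with `min_unique=10`, nothing is flagged for this column.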

WDYT?

[Image: Screenshot 2024-11-29 at 09 42 52]

The column dataset:

state.csv

Steps/Code to Reproduce

import pandas as pd
from skrub import TableReport

df = pd.read_csv("state.csv")
TableReport(df)

Expected Results

No outliers

Actual Results

Some outliers

Versions

0.4.0 :)))
Vincent-Maladiere added the bug label Nov 29, 2024
@jeromedockes
Member

Thanks @Vincent-Maladiere. I had considered turning off outlier detection when there are few unique values, but a wide range of values is still a problem even when there are only a few of them. Imagine you had the values 1, 2, 3, and -1000, say, to indicate some invalid value. If you don't cut the axis because of the -1000, then 1, 2, and 3 all get squished together and you lose the information.
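
To put rough numbers on the squishing, here is a small sketch. Only the values 1, 2, 3 and -1000 come from the example above; the counts and variable names are invented for illustration.

```python
# Hypothetical column: three regular values plus a rare sentinel.
values = [1, 2, 3] * 30 + [-1000] * 2

# Width of the axis if it is not cut and has to include the sentinel.
full_span = max(values) - min(values)

# Width actually occupied by the regular values.
regular = [v for v in values if v != -1000]
regular_span = max(regular) - min(regular)

# Share of the axis left for 1, 2 and 3.
share = regular_span / full_span
print(f"{share:.2%} of the axis holds the regular values")
```

Here the regular values end up in well under 1% of the axis width, while the column still has only four distinct values, so a guard based purely on low cardinality would switch off the cut exactly where it helps.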

What we would like in this case is to realize that the actual values don't matter and treat the variable as categorical, as you say. But I'm not sure that it's reasonable to assume that is the case whenever there are few unique values 🤔

jeromedockes removed the bug label Nov 29, 2024
@jeromedockes
Member

Here is another example, from the "titanic" dataset:

[Image: screenshot_2024-11-29T10:43:55+01:00]

@Vincent-Maladiere
Member Author

Yes, that's tricky, I agree. The best scenario would be for the user to notice this and convert the column to a string or category dtype. The main difference between my screenshot and yours is that in mine the outlier segment is bigger than the category displayed on the left, so the outliers are not really outliers, if that makes sense?

> What we would like in this case is to realize that the actual values don't matter and treat the variable as categorical, as you say. But I'm not sure that it's reasonable to assume that is the case whenever there are few unique values 🤔

Let's wait a bit to get feedback and decide about this.
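
The "outlier segment is bigger than the displayed category" observation could be turned into a check along these lines. This is only a sketch of the idea, not something skrub does today; the function name `flagged_is_rarer` and all the counts are hypothetical.

```python
from collections import Counter


def flagged_is_rarer(inliers, flagged):
    """Return True if the flagged values are rarer than every inlier value.

    Hypothetical rule of thumb: if the flagged group is at least as large
    as some value that stays on the axis, calling it "outliers" is doubtful.
    """
    if not inliers or not flagged:
        return False
    smallest_inlier_group = min(Counter(inliers).values())
    return len(flagged) < smallest_inlier_group


# State-like column: the flagged group is larger than the displayed category.
state_case = flagged_is_rarer(inliers=[100] * 20, flagged=[0] * 50 + [80] * 30)

# Sentinel-like column: a couple of -1000 entries next to frequent values.
sentinel_case = flagged_is_rarer(inliers=[1, 2, 3] * 30, flagged=[-1000] * 2)

print(state_case, sentinel_case)
```

Under these made-up counts the state-like column would not be reported as having outliers, while the -1000 sentinel case would still be cut from the axis.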
