TableReport outliers improvement? #1167

Vincent-Maladiere · 2024-11-29T08:52:03Z

Describe the bug

I don't think this is a bug. I'm just opening this issue to record a behaviour I just saw.

Here is a series where the outlier detection is slightly off. It is an integer series whose values are all in {0, 80, 100}. The values are actually categorical because they just distinguish states, and the choice of the state values is arbitrary (it could have been 0, 1, 2, but for some reason it's 0, 80, 100).

Unless we tune the heuristic a little bit to avoid counting outliers when the cardinality is "very low", I don't see an easy improvement.
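To make the idea concrete, here is a minimal sketch of such a low-cardinality guard. It is not skrub code: the function name and the `min_unique` threshold are hypothetical, picked only for illustration, and the sample values are made up rather than taken from state.csv.

```python
def has_enough_distinct_values(values, min_unique=10):
    """Return True if the column has at least `min_unique` distinct non-missing values.

    `min_unique` is a hypothetical threshold, not an existing skrub parameter.
    """
    distinct = {v for v in values if v is not None}
    return len(distinct) >= min_unique


# Made-up state codes with only three distinct values.
states = [0, 80, 100, 100, 80, 0, 100]

# With the guard in place, outlier detection would be skipped for this column.
print(has_enough_distinct_values(states))  # False
```

The open question is where to put the threshold, since any fixed value is a guess about what "very low" means.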

WDYT?

The column dataset:

state.csv

Steps/Code to Reproduce

import pandas as pd
from skrub import TableReport

df = pd.read_csv("state.csv")
TableReport(df)

Expected Results

No outliers

Actual Results

Some outliers

Versions

0.4.0 :)))

jeromedockes · 2024-11-29T09:40:50Z

Thanks @Vincent-Maladiere. I had considered turning off outlier detection when there are few unique values, but a wide range of values is still a problem even when there are only a few of them. Imagine you had values 1, 2, 3, and -1000, say, to indicate some invalid value. If you don't cut the axis because of the -1000, then 1, 2, and 3 will all get squished together and you will lose the information.

What we would like in this case is to realize that the actual values don't matter and treat the variable as categorical, as you say. But I'm not sure that it's reasonable to assume that is the case whenever there are few unique values 🤔
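The 1, 2, 3, -1000 case can be sketched with a generic Tukey-fence rule (1.5 × IQR). This is only an illustration and not necessarily the heuristic TableReport uses. The function name and the sample values are made up.

```python
import statistics


def tukey_outliers(values, k=1.5):
    """Return the sorted distinct values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return sorted({v for v in values if v < low or v > high})


# Made-up column: three real codes plus one sentinel meaning "invalid".
values = [1, 2, 3] * 10 + [-1000]

print(len(set(values)))        # 4 distinct values: a low-cardinality guard would skip this column
print(tukey_outliers(values))  # [-1000]

# Share of the full axis taken up by the real codes 1..3 if the axis is not cut.
bulk_share = (3 - 1) / (max(values) - min(values))
print(f"{bulk_share:.2%}")     # 0.20%
```

So a column with only four distinct values can still have a sentinel that squeezes the real codes into a tiny fraction of the axis, which is why switching detection off on cardinality alone is not enough.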

jeromedockes · 2024-11-29T09:44:28Z

here is another example from the "titanic" dataset

Vincent-Maladiere · 2024-11-29T16:28:41Z

Yes, that's tricky I agree. The best scenario would be for the user to notice that and convert to string or category dtypes. The main difference between my screenshot and yours is that on mine the outlier segment is bigger than the category displayed on the left, so the outliers are not really outliers, if that makes sense?

> what we would like in this case is to realize that the actual values don't matter and treat the variable as categorical as you say. but I'm not sure that it's reasonable to assume that is the case whenever there are few unique values 🤔

Let's wait a bit to get feedback and decide about this.
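For reference, the user-side workaround mentioned above (converting to a string or category dtype) could look like this in pandas. The column name `state` and the values are assumptions standing in for state.csv.

```python
import pandas as pd

# Made-up stand-in for state.csv; the column name "state" is an assumption.
df = pd.DataFrame({"state": [0, 80, 100, 100, 80, 0]})

# Mark the codes as categorical so they are no longer treated as numbers.
df["state"] = df["state"].astype("category")
# Alternative: df["state"] = df["state"].astype(str)

print(df["state"].dtype)  # category
# The converted frame can then be passed to TableReport(df) as before.
```

This relies on the user noticing the problem first, so it does not remove the need to decide what the report should do by default.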

Vincent-Maladiere added the "bug" label · Nov 29, 2024

jeromedockes removed the "bug" label · Nov 29, 2024
