
Running analysis on data subsets #66

Open
dubmix opened this issue Feb 19, 2024 · 2 comments

dubmix commented Feb 19, 2024

Noticed some unexpected behaviour in the code after trying to run the analysis on a very small subset of the data (~25 models).


dubmix commented Feb 19, 2024

In the modelling notation part, this line was causing an issue:
df_meta_selected = df_meta_selected.groupby('namespace').resample('Y').sum(numeric_only=True).reset_index()

[Screenshot (2024-02-16): resulting dataframe with a duplicate namespace column of NaN values]

As the screenshot shows, when the count was 0 for a specific date, a duplicate namespace column filled with NaN values was added to the dataframe.

To mitigate this issue, I added the min_count option and then filled the NaNs with 0.
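A minimal sketch of that fix, using a hypothetical df_meta_selected with an illustrative count column and a date gap so one yearly bin is empty:

```python
import pandas as pd

# Hypothetical metadata frame: one row per model, indexed by date.
df_meta_selected = pd.DataFrame(
    {"namespace": ["a", "a", "b"], "count": [1, 2, 3]},
    index=pd.to_datetime(["2020-03-01", "2022-06-01", "2020-09-01"]),
)

# With min_count=1, an empty yearly bin (namespace 'a' in 2021) yields NaN
# instead of a spurious all-zero row; the NaNs are then filled with 0.
df_meta_selected = (
    df_meta_selected.groupby("namespace")
    .resample("Y")
    .sum(numeric_only=True, min_count=1)
    .reset_index()
    .fillna(0)
)
```

The column names and data here are assumptions for illustration; only the groupby/resample/sum chain comes from the original line.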


dubmix commented Feb 19, 2024

In the element types section, the small number of models used for the analysis raised another issue. After the data crunching, Seaborn interprets the dataframe as having the original number of rows instead of the actual one. This leads to an error of the form:
AttributeError: 'NoneType' object has no attribute 'get_bbox'

This part of the error message gives us a hint:
The palette list has fewer values (18) than needed (27) and will cycle, which may produce an uninterpretable plot.

The palette supplies only 18 values where 27 are needed, so the bar containers end up mismatched, hence the error.

The current solution is to drop hue="category" below an arbitrary threshold of models. This particular issue seems to affect only small data subsets; a potential fix could be upgrading pandas to v2.
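The workaround can be sketched as follows. The dataframe, column names, and threshold value are assumptions for illustration; the only parts taken from the issue are the hue="category" argument and the idea of dropping it under a model-count threshold:

```python
import pandas as pd
import seaborn as sns

# Hypothetical crunched element-type counts; names are illustrative.
df_elements = pd.DataFrame({
    "element_type": ["class", "class", "interface"],
    "category": ["structural", "behavioral", "structural"],
    "count": [10, 4, 7],
})

MODEL_THRESHOLD = 50  # arbitrary cutoff below which hue is dropped
n_models = 25         # size of the data subset being analysed

# For small subsets, omit hue="category" to avoid the palette/container
# mismatch that raises: AttributeError: 'NoneType' object has no attribute 'get_bbox'
kwargs = {"hue": "category"} if n_models >= MODEL_THRESHOLD else {}
ax = sns.barplot(data=df_elements, x="element_type", y="count", **kwargs)
```

With n_models below the threshold, the plot is drawn without the category hue, trading the per-category split for a plot that renders without error.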
