
Running analysis on data subsets #66

Open
dubmix opened this issue Feb 19, 2024 · 2 comments

dubmix commented Feb 19, 2024

Noticed some unexpected behaviour in the code after trying to run the analysis on a very small subset of the data (~25 models).


dubmix commented Feb 19, 2024

In the modelling notation part, this line was causing an issue:
df_meta_selected = df_meta_selected.groupby('namespace').resample('Y').sum(numeric_only=True).reset_index()

[Screenshot (2024-02-16): resulting dataframe with a duplicate namespace column of NaN values]

As the screenshot shows, when the count was 0 for a specific date, a duplicate namespace column filled with NaN values was added to the dataframe.

To mitigate this issue, I added the min_count option and then filled the NaNs with 0.
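A minimal sketch of that fix, using a hypothetical df_meta_selected with an illustrative count column and a date gap so one yearly bin is empty:

```python
import pandas as pd

# Hypothetical metadata frame: one row per model, indexed by date.
df_meta_selected = pd.DataFrame(
    {"namespace": ["a", "a", "b"], "count": [1, 2, 3]},
    index=pd.to_datetime(["2020-03-01", "2022-06-01", "2020-09-01"]),
)

# With min_count=1, an empty yearly bin (namespace 'a' in 2021) yields NaN
# instead of a spurious all-zero row; the NaNs are then filled with 0.
df_meta_selected = (
    df_meta_selected.groupby("namespace")
    .resample("Y")
    .sum(numeric_only=True, min_count=1)
    .reset_index()
    .fillna(0)
)
```

The column names and data here are assumptions for illustration; only the groupby/resample/sum chain comes from the original line.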


dubmix commented Feb 19, 2024

In the element types section, the small number of models used for the analysis raised another issue. After the data crunching, Seaborn interprets the dataframe as having the original number of rows instead of the actual one. This leads to an error of the form:
AttributeError: 'NoneType' object has no attribute 'get_bbox'

This part of the error message gives us a hint:
The palette list has fewer values (18) than needed (27) and will cycle, which may produce an uninterpretable plot.

The palette supplies only 18 values where 27 are needed, so the bar containers end up mismatched, hence the error.

The current solution is to drop hue="category" below an arbitrary threshold of models. This particular issue seems to affect only small data subsets; a potential fix could be upgrading pandas to v2.
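The workaround can be sketched as follows. The dataframe, column names, and threshold value are assumptions for illustration; the only parts taken from the issue are the hue="category" argument and the idea of dropping it under a model-count threshold:

```python
import pandas as pd
import seaborn as sns

# Hypothetical crunched element-type counts; names are illustrative.
df_elements = pd.DataFrame({
    "element_type": ["class", "class", "interface"],
    "category": ["structural", "behavioral", "structural"],
    "count": [10, 4, 7],
})

MODEL_THRESHOLD = 50  # arbitrary cutoff below which hue is dropped
n_models = 25         # size of the data subset being analysed

# For small subsets, omit hue="category" to avoid the palette/container
# mismatch that raises: AttributeError: 'NoneType' object has no attribute 'get_bbox'
kwargs = {"hue": "category"} if n_models >= MODEL_THRESHOLD else {}
ax = sns.barplot(data=df_elements, x="element_type", y="count", **kwargs)
```

With n_models below the threshold, the plot is drawn without the category hue, trading the per-category split for a plot that renders without error.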
