-
-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Factor rvars #258
Factor rvars #258
Conversation
Codecov Report
@@ Coverage Diff @@
## master #258 +/- ##
==========================================
+ Coverage 95.05% 95.83% +0.78%
==========================================
Files 42 44 +2
Lines 2932 3291 +359
==========================================
+ Hits 2787 3154 +367
+ Misses 145 137 -8
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
Thank you for working on this! I don't have much input right now but at least a few thoughts.
|
works for me
Thinking about the different formats:
I think in the short term we could add support for factors to |
Matthew, we chatted on Twitter about possible ways to summarize information on posterior distributions of categorical variables. In particular, I wondered if the entropy reported for the posterior distribution of a categorical variable was computed differently depending on whether the categorical variable was nominal or ordinal. (It didn't seem like it was.) In doing some more digging about this, I found an R package called agrmt (agrmt: Calculate Concentration and Dispersion in Ordered Rating Scales) which purports to do the following:
On the surface of it, it sounds like this package should give you some more options in terms of describing posterior distributions of ordinal categorical variables. Here's some R code I put together to get you started - it is based on the package vignette available at https://cran.r-project.org/web/packages/agrmt/vignettes/agrmt.pdf and on the help files available for this package. In the code, I used a frequency vector mentioned at the beginning of the vignette, frequencies <- c(10, 20, 30, 15, 4), which corresponds to the following 5-category bar plot:
IMPORTANT NOTE: _Some functions in the agrmt package require frequency vectors while others require Where standardized frequency vectors are required, frequency
My thinking is that the user could be given a menu of possibilities for describing the posterior distribution corresponding to an ordinal categorical variable via some of the functions in the agrmt package: agreement(), polarization(), Leik(), consensus(), entropy(), etc. (Note that the package includes other functions - I haven't explored them all.) Thanks, Matthew - I hope some of the above will be helpful to you. Isabella |
Thank you @isabellaghement for checking out all these options! |
I've come to the conclusion that this isn't really worth it (I added rationale to the top comment). For now, I have gone with lossy conversion + a warning. Long term, we may want a way of storing type information for each variable in a draws_array/matrix, which would allow us to address this issue. I believe this PR has now reached a "ready" state. I have updated the summary comment to be more comprehensive, as well as NEWS.md. Feedback welcome! |
The factor_rvar branch of {ggdist} now has experimental support for rvar_factors as well. Simple example of brms + posterior + ggdist with an ordinal model: library(brms)
library(posterior)
library(ggdist)
df <- esoph
m <- brm(tobgp ~ agegp, data = df, family = cumulative, seed = 12345, backend = "cmdstanr")
grid <- unique(df[,"agegp", drop = FALSE])
grid$pred <- rvar_ordered(posterior_predict(m, newdata = grid))
grid |>
ggplot(aes(x = agegp, ydist = pred)) +
stat_slab() Relevant issue over there is: mjskay/ggdist#108 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR looks already really good, up to a few minor things, mostly style related.
I am impressed by the level of thought (and effort) you put into getting factors and ordered factors to work as natively as possible!
Some of the rvar internals remain a bit of black magic to me, so I don't understand every detail of it, but I trust you and the tests you put in place on that one.
I agree with your assessment of using a lossy conversion to draws_matrix and draws_array, at least for the time being.
as_draws_rvars.draws_df <- function(x, ...) { | ||
x_df <- as.data.frame(x, optional = TRUE)[, variables(x, reserved = TRUE), drop = FALSE] | ||
data_frame_to_matrix <- function(df) { | ||
if (any(vapply(df, is.factor, logical(1)))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we have a separate function/method that does the check whether any of the variables in non-numeric in a draws object?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
like, a generic function that does this for any draws type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I think I'm not seeing where to use it currently, so I'm not sure exactly the signature / definition)
Thanks for the really thorough review! I've addressed everything except two comments (see those replies).
Heh, yeah, I agree it is pretty arcane. Getting rvar to mimic everything it does so far has been quite a journey (base R datatypes are... certainly something!), and I should probably clean up and document the rvar internals a bit more with everything I've had to figure out to get it to work (plus I sometimes forget when I've been away from it a bit). |
Thank you for your changes and explanations of what you what to keep the way it is. It does all make sense to me. I want to give people a bit more time to check out the new features and perhaps add a review. So I will put a reminder in my calender for start of next week to merge this PR if no new things have come up until then. Would that sound good to you? |
Sounds great, thanks! Should we ping some folks who might be able to provide feedback? Off the top of my head: @isabellaghement, @bwiernik, @jgabry, @avehtari? If any of you have a chance to "kick the tires" of |
pinging @TeemuSailynoja as he has been running recently some categorical and ordinal models |
Try to run some tests with it this weekend |
@mjskay From my side, this is ready to merge. Anything you want to change before I merge? |
Sorry for the delayed response - yes, I think this is good for now. I've built some stuff in ggdist on top of it and haven't run into anything new that needs changing, so that's the best I can say at the moment without other folks testing it. Thanks! |
Lovely, merging now. Thank you again for your enormous effort to make this feature possible! |
Thanks! |
Summary
This is a PR for adding factor-like
rvar
s, for #149.Basic idea is that the backing array can be a
factor
orordered
with a"levels"
attribute. These are auto-detected when objects with a"levels"
attribute are passed torvar()
, or can be created explicitly withrvar_factor()
,rvar_ordered()
,as_rvar_factor()
, oras_rvar_ordered()
.A few other changes are in here, summarized in NEWS.md.
Examples
Currently the summary output in vector / array layouts is modal category (
modal_category()
) with normalized entropy (entropy()
, which is normalized to be between 0 (all one category) and 1 (max entropy: uniform)).Ordered factors are similar, but show
mode <dissent>
:Where
dissent()
is Tastle and Wierman's "dissention" measure (thanks @isabellaghement), and ranges from 0 (all in one category) to 1 (bimodal at opposite ends of the scale).Here's another example on output of
posterior_predict()
, which should work for bothrstanarm::stan_polr()
(which uses strings) andbrms()
(which uses an array with a"levels"
attribute):This PR also provides an implementation of
match()
and%in%
forrvar
based on a conversation with @bwiernik, which is helpful especially forrvar_factor
; e.g.:(this required adding generics for
match()
and%in%
, but I think that is reasonable).Support in other formats
Currently,
draws_rvars
,draws_list
, anddraws_df
can all support discrete variables (when converting to the list and df formats, factors and ordereds are used). Thedraws_matrix
anddraws_array
formats do not support discrete variables. I thought about @paul-buerkner's suggestion to allow them to support such formats when all variables are of the same type, but concluded this would be pretty hacky: they would have to be either character arrays (which lose information about the order of levels, so could only be used for factors --- and even then, lose information about levels with 0 realizations) or would have to be factor-like arrays, which have poor support in base R (most operations on factor-like arrays just convert them to vectors). And even then, it would only work in situations where the levels of all variables are the same. The only way I can think of to support them comprehensively in those formats would be to add a parallel data structure as an attribute that stores type information for each column (is it a factor, is it ordered, what levels does it have, ...). I think we may end up having to do that anyway (e.g. to support #239), so maybe we'll get there eventually, but I think this PR is already pretty big.So, ultimately, I decided to stick with lossy conversion for now: when converting a
draws_rvars
,draws_list
, ordraws_df
to adraws_array
ordraws_matrix
, if the source object contains factors or ordereds, a warning is given and the variable is converted to numeric.For example:
One current issue created by the fact that draws_array does not support discrete variables is that this conversion warning will be generated whenever discrete variables are used with
summarise_draws()
, as all formats currently convert to draws_array insummarise_draws()
:I think that for now at least the warning is helpful to indicate that the output may be problematic. Long term, I'm not sure I see a way around this without either re-writing internals of
summarise_draws()
, adding variable-level type information to draws_array, or providing an alternative function, since currentlydraws_array
fundamentally can't support heterogeneous variable types.Notes / Feedback
There are still several places feedback would be helpful:
summarise_draws()
(warning on discrete variables) okay?modal_category()
,entropy()
(which is normalized entropy; not sure if the name should benormalized_entropy()
instead though that is rather wordy), anddissent()
. I don't want to add a whole bunch, but would like the defaults for printing to be sensible in some way.Copyright and Licensing
By submitting this pull request, the copyright holder is agreeing to
license the submitted work under the following licenses: