Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Factor rvars #258

Merged
merged 56 commits into from
Dec 20, 2022
Merged

Factor rvars #258

merged 56 commits into from
Dec 20, 2022

Conversation

mjskay
Copy link
Collaborator

@mjskay mjskay commented Oct 9, 2022

Summary

This is a PR for adding factor-like rvars, for #149.

Basic idea is that the backing array can be a factor or ordered with a "levels" attribute. These are auto-detected when objects with a "levels" attribute are passed to rvar(), or can be created explicitly with rvar_factor(), rvar_ordered(), as_rvar_factor(), or as_rvar_ordered().

A few other changes are in here, summarized in NEWS.md.

Examples

x = rvar(array(c("a","a","b","c","d","d","d","a"), dim = c(4,2)))
x
#> rvar_factor<4>[2] mode <entropy>:
#> [1] a <0.75>  d <0.41> 
#> 4 levels: a b c d

Currently the summary output in vector / array layouts is modal category (modal_category()) with normalized entropy (entropy(), which is normalized to be between 0 (all one category) and 1 (max entropy: uniform)).

Ordered factors are similar, but show mode <dissent>:

x = rvar_ordered(array(c("a","a","b","c","d","d","d","a"), dim = c(4,2)))
x
#> rvar_ordered<4>[2] mode <dissent>:
#> [1] a <0.43>  d <0.81> 
#> 4 levels: a < b < c < d

Where dissent() is Tastle and Wierman's "dissention" measure (thanks @isabellaghement), and ranges from 0 (all in one category) to 1 (bimodal at opposite ends of the scale).

Here's another example on output of posterior_predict(), which should work for both rstanarm::stan_polr() (which uses strings) and brms() (which uses an array with a "levels" attribute):

library(brms)

df <- esoph
m <- brm(tobgp ~ agegp, data = df, family = cumulative, seed = 12345, backend = "cmdstanr")

df$y_rep = rvar_ordered(posterior_predict(m))
head(df, n = 15)
#>    agegp     alcgp    tobgp ncases ncontrols           y_rep
#> 1  25-34 0-39g/day 0-9g/day      0        40    10-19 <0.63>
#> 2  25-34 0-39g/day    10-19      0        10    10-19 <0.63>
#> 3  25-34 0-39g/day    20-29      0         6    10-19 <0.62>
#> 4  25-34 0-39g/day      30+      0         5 0-9g/day <0.64>
#> 5  25-34     40-79 0-9g/day      0        27    10-19 <0.63>
#> 6  25-34     40-79    10-19      0         7 0-9g/day <0.64>
#> 7  25-34     40-79    20-29      0         4    10-19 <0.62>
#> 8  25-34     40-79      30+      0         7    10-19 <0.63>
#> 9  25-34    80-119 0-9g/day      0         2    10-19 <0.64>
#> 10 25-34    80-119    10-19      0         1 0-9g/day <0.64>
#> 11 25-34    80-119      30+      0         2    10-19 <0.63>
#> 12 25-34      120+ 0-9g/day      0         1    10-19 <0.64>
#> 13 25-34      120+    10-19      1         0    10-19 <0.64>
#> 14 25-34      120+    20-29      0         1    10-19 <0.62>
#> 15 25-34      120+      30+      0         2    10-19 <0.62>

This PR also provides an implementation of match() and %in% for rvar based on a conversation with @bwiernik, which is helpful especially for rvar_factor; e.g.:

x
#> rvar_ordered<4>[2] mode <dissent>:
#> [1] a <0.43>  d <0.81> 
#> 4 levels: a < b < c < d
x %in% c("a", "b")
#> rvar<4>[2] mean ± sd:
#> [1] 0.75 ± 0.5  0.25 ± 0.5

(this required adding generics for match() and %in%, but I think that is reasonable).

Support in other formats

Currently, draws_rvars, draws_list, and draws_df can all support discrete variables (when converting to the list and df formats, factors and ordereds are used). The draws_matrix and draws_array formats do not support discrete variables. I thought about @paul-buerkner's suggestion to allow them to support such formats when all variables are of the same type, but concluded this would be pretty hacky: they would have to be either character arrays (which lose information about the order of levels, so could only be used for factors --- and even then, lose information about levels with 0 realizations) or would have to be factor-like arrays, which have poor support in base R (most operations on factor-like arrays just convert them to vectors). And even then, it would only work in situations where the levels of all variables are the same. The only way I can think of to support them comprehensively in those formats would be to add a parallel data structure as an attribute that stores type information for each column (is it a factor, is it ordered, what levels does it have, ...). I think we may end up having to do that anyway (e.g. to support #239), so maybe we'll get there eventually, but I think this PR is already pretty big.

So, ultimately, I decided to stick with lossy conversion for now: when converting a draws_rvars, draws_list, or draws_df to a draws_array or draws_matrix, if the source object contains factors or ordereds, a warning is given and the variable is converted to numeric.

For example:

d = draws_rvars(x = rvar(c("a","a","b","c")))
d
#> # A draws_rvars: 4 iterations, 1 chains, and 1 variables
#> $x: rvar_factor<4>[1] mode <entropy>:
#> [1] a <0.95> 
#> 3 levels: a b c
as_draws_df(d)
#> # A draws_df: 4 iterations, 1 chains, and 1 variables
#>   x
#> 1 a
#> 2 a
#> 3 b
#> 4 c
#> # ... hidden reserved variables {'.chain', '.iteration', '.draw'}
as_draws_array(d)
#> Warning: draws_array does not support non-numeric variables (e.g., factors).
#> Converting non-numeric variables to numeric.
#> # A draws_array: 4 iterations, 1 chains, and 1 variables
#> , , variable = x
#> 
#>          chain
#> iteration 1
#>         1 1
#>         2 1
#>         3 2
#>         4 3

One current issue created by the fact that draws_array does not support discrete variables is that this conversion warning will be generated whenever discrete variables are used with summarise_draws(), as all formats currently convert to draws_array in summarise_draws():

summarise_draws(d)
#> # A tibble: 1 × 10
#>   variable  mean median    sd   mad    q5   q95  rhat ess_bulk ess_tail
#>   <chr>    <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>
#> 1 x         1.75    1.5 0.957 0.741     1  2.85  2.56       NA       NA
#> Warning message:
#> draws_array does not support non-numeric variables (e.g., factors). Converting non-numeric variables to numeric. 

I think that for now at least the warning is helpful to indicate that the output may be problematic. Long term, I'm not sure I see a way around this without either re-writing internals of summarise_draws(), adding variable-level type information to draws_array, or providing an alternative function, since currently draws_array fundamentally can't support heterogeneous variable types.

Notes / Feedback

There are still several places feedback would be helpful:

  • If anyone using categorical models can give this a test drive that'd be useful.
  • Is the current solution to conversion to/from other draws formats acceptable? My thinking is, if we want more comprehensive support in all formats (i.e. array and matrix), that could be a future change related to How to handle discrete parameters? #17 and Attribute or other way to distinguish MCMC vs MC #239.
  • Is the current solution for summarise_draws() (warning on discrete variables) okay?
  • Thoughts on summary functions for discrete parameters? Currently there is modal_category(), entropy() (which is normalized entropy; not sure if the name should be normalized_entropy() instead though that is rather wordy), and dissent(). I don't want to add a whole bunch, but would like the defaults for printing to be sensible in some way.
  • Currently these things probably won't be that easy to visualize. {ggdist} will get some kind of support for them but I haven't figured that out yet.

Copyright and Licensing

By submitting this pull request, the copyright holder is agreeing to
license the submitted work under the following licenses:

@codecov-commenter
Copy link

codecov-commenter commented Oct 9, 2022

Codecov Report

Merging #258 (3c4728b) into master (f343850) will increase coverage by 0.78%.
The diff coverage is 99.57%.

@@            Coverage Diff             @@
##           master     #258      +/-   ##
==========================================
+ Coverage   95.05%   95.83%   +0.78%     
==========================================
  Files          42       44       +2     
  Lines        2932     3291     +359     
==========================================
+ Hits         2787     3154     +367     
+ Misses        145      137       -8     
Impacted Files Coverage Δ
R/rvar-factor.R 97.29% <97.29%> (ø)
R/rvar-print.R 97.61% <98.36%> (+0.05%) ⬆️
R/as_draws_array.R 100.00% <100.00%> (+5.00%) ⬆️
R/as_draws_df.R 99.01% <100.00%> (+0.09%) ⬆️
R/as_draws_list.R 88.52% <100.00%> (ø)
R/as_draws_matrix.R 97.84% <100.00%> (+0.17%) ⬆️
R/as_draws_rvars.R 100.00% <100.00%> (ø)
R/discrete-summaries.R 100.00% <100.00%> (ø)
R/for_each_draw.R 100.00% <100.00%> (ø)
R/rvar-.R 97.13% <100.00%> (+0.43%) ⬆️
... and 12 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@paul-buerkner
Copy link
Collaborator

Thank you for working on this! I don't have much input right now but at least a few thoughts.

  • I like the mode <entropy> printing.
  • Do we want to support categorical variables more generally in all formats? If yes, it could make sense to design a consistent interface across all at this point. Said differently, could there be any problems we run into later in other formats if we only implement categorical variables in rvars now?

@mjskay
Copy link
Collaborator Author

mjskay commented Oct 15, 2022

I like the mode <entropy> printing.

works for me

Do we want to support categorical variables more generally in all formats? If yes, it could make sense to design a consistent interface across all at this point. Said differently, could there be any problems we run into later in other formats if we only implement categorical variables in rvars now?

Thinking about the different formats:

  • draws_list and draws_df should have no problems with categorical variables, as they could just use factor and ordered types for those variables
  • draws_matrix and draws_array I'm not sure what to do with, as every variable in an array/matrix must have the same type. One option is that we make converting to them be lossy by changing factors to integers. Another option would be to store a list of levels for each variable as an attribute of the draws object (e.g. similar to the way that dimnames are stored). But we would also need to store if they are ordered or not, and possibly handle modifications of the variables too... ultimately it might amount to reimplementing the factor type and not be worth it.

I think in the short term we could add support for factors to draws_list and draws_df easily, and go with the lossy-cast-to-integer approach for the other two formats. Then we can decide later if we want to deal with the complexity of adding support to them. I think this should be future-proof...?

@isabellaghement
Copy link

isabellaghement commented Oct 19, 2022

Matthew, we chatted on Twitter about possible ways to summarize information on posterior distributions of categorical variables. In particular, I wondered if the entropy reported for the posterior distribution of a categorical variable was computed differently depending on whether the categorical variable was nominal or ordinal. (It didn't seem like it was.)

In doing some more digging about this, I found an R package called agrmt (agrmt: Calculate Concentration and Dispersion in Ordered Rating Scales) which purports to do the following:

Calculates concentration and dispersion in ordered rating scales. It implements various measures of concentration and dispersion to describe what researchers variably call agreement, concentration, consensus, dispersion, or polarization among respondents in ordered data. It also implements other related measures to classify distributions. In addition to a generic city-block based concentration measure and a generic dispersion measure, the package implements various measures, including van der Eijk's (2001) <doi:10.1023/A:1010374114305> measure of agreement A, measures of concentration by Leik, Tatsle and Wierman, Blair and Lacy, Kvalseth, Berry and Mielke, Reardon, and Garcia-Montalvo and Reynal-Querol. Furthermore, the package provides an implementation of Galtungs AJUS-system to classify distributions, as well as a function to identify the position of multiple modes.

On the surface of it, it sounds like this package should give you some more options in terms of describing posterior distributions of ordinal categorical variables.

Here's some R code I put together to get you started - it is based on the package vignette available at https://cran.r-project.org/web/packages/agrmt/vignettes/agrmt.pdf and on the help files available for this package. In the code, I used a frequency vector mentioned at the beginning of the vignette, frequencies <- c(10, 20, 30, 15, 4), which corresponds to the following 5-category bar plot:

image


library(agrmt)

categories <- c(1, 2, 3, 4, 5)
frequencies <- c(10, 20, 30, 15, 4)

freqtable <- data.frame(categories, frequencies)

freqtable$categories <- factor(freqtable$categories, ordered = TRUE)

library(ggplot2) 
library(magrittr)
theme_set(theme_bw())

plot_freqtable <- 
freqtable %>% 
  ggplot() + 
  aes(x = categories, y = frequencies, fill = categories) + 
  geom_col() +
  labs(x = "Category", y = "Frequency") + 
  theme(legend.position="none") 

plot_freqtable

IMPORTANT NOTE:

_Some functions in the agrmt package require frequency vectors while others require
standardized frequency vectors, that is the frequencies expressed as a proportion.

Where standardized frequency vectors are required, frequency
vectors that do not sum to 1 (i.e. that are not standardized) are automatically
standardized by dividing each element of the frequency vector by the sum of
the vector. This assumes complete data; a general assumption of the measures
in this package._

#==============================================================
# 1) Agreement: Levels of agreement range from −1 to 1
# Agreement is only defined if there are at least three response categories, 
# and it does not tell you which of the categories is the most common one. 
# For this reason, it is also advisable to look at an appropriate measure of 
#central tendency, such as the interpolated median.
#==============================================================

# To calculate agreement, we simply call agreement with the frequency vector as
# the argument. There is normally no reason to use the old algorithm, which can
# be set using old=TRUE. This is the ‘old’ algorithm as described by Van der Eijk.

help(agreement, package = "agrmt") 

library(dplyr)

agreement_freqtable <- 
freqtable %>% 
  pull(frequencies) %>% 
  agreement() %>% 
  round(2)

agreement_freqtable  # 0.39

plot_freqtable + 
  ggtitle(paste("Agreement:", agreement_freqtable))
  

#==============================================================
# 2) Polarization
#
# The agreement scores are suitable to express polarization, but the numbers are
# not intuitive to interpret. The function polarization offers easily interpretable
# values by re-scaling agreement A. More precisely, it gives (1 − agreement)/2.
# This means that a polarization value of 1 means perfect polarization, 
# and a value of 0 means perfect agreement. A value of 0.5 corresponds to  
# ‘no agreement’.
#==============================================================

polarization_freqtable <- 
  freqtable %>% 
  pull(frequencies) %>% 
  polarization() %>% 
  round(2)

polarization_freqtable  # 0.30

#==============================================================
# 3) Leik’s Consensus (measure of ordinal dispersion):
# 
# If all observations are in the same category, ordinal dispersion is 0. 
# With half the observations in one extreme category, and half the observations 
# in the other extreme, Leik’s measure gives a value of 1.
#
# Leik’s ordinal dispersion measure is a percentage, and can be interpreted
# accordingly. [Important Note: R reports it as a proportion, not as a percentage!]
# 
# Ordinal dispersion can be used to express consensus or agreement, 
# simply by taking: 
#    1 - ordinal dispersion.
#==============================================================

help(Leik, package = "agrmt")

leik_freqtable <- 
  freqtable %>% 
  pull(frequencies) %>% 
  Leik() %>% 
  round(2)

leik_freqtable  # 0.40

#==============================================================
# 4) Tatsle and Wierman’s Consensus (computed using the Shannon entropy as the basis)
# 
#  It is 1 if there is perfect uniformity; 
#  it is 0 if there is perfect bimodality (=lack of agreement)
#==============================================================

help(consensus, package = "agrmt")

consensus_freqtable <- 
  freqtable %>% 
  pull(frequencies) %>% 
  consensus() %>% 
  round(2)

consensus_freqtable  # 0.62

#==============================================================
# 5) Entropy
# 
# The Shannon entropy can be calculated using the entropy() function. 
# This function follows TW (i.e., Tatsle and Wierman) and ignores categories 
# with zero observations. 
#
# The Shannon entropy ignores the order of categories; use the consensus() 
# function to consider the order of categories.
#==============================================================

help(entropy, package = "agrmt")

entropy_freqtable <- 
  freqtable %>% 
  pull(frequencies) %>% 
  entropy() %>% 
  round(2)

entropy_freqtable  # 2.08


#==============================================================
# Aside:   We can also classify a distribution using the AJUS-system introduced by Galtung (1969), 
# with the aim of reducing complexity. 
# 
# Distributions are classified as A if they are unimodal with a peak in the centre, 
# as J if they are unimodal with a peak at either end, 
# as U if they are bimodal with a peak at both ends, and as S if they are multimodal. 
# 
# In addition to Galtung's classification, the function classifies distributions 
# as F if there is no peak and all values are more or less the same (flat). 
# 
# Furthermore, a distinction is drawn between J and L distributions, 
# depending on whether they increase or decrease: J types have a peak on the right, 
# L types have the peak on the left. The skew is given as +1 for a positive skew, 
# as 0 for no skew, and -1 for a negative skew.
# 
# The skew is identified by comparing the sum of values left and right of the midpoint 
# respectively. For J-type of distributions, the skew is identified on the basis of the 
# changes between values. This way, long tails cannot influence the skew, 
# and a single peak at the left and right-hand end can be differentiated in all cases.
# 
#==============================================================

help(ajus, package = "agrmt")

freqtable %>% 
  pull(frequencies) %>% 
  ajus() 

My thinking is that the user could be given a menu of possibilities for describing the posterior distribution corresponding to an ordinal categorical variable via some of the functions in the agrmt package: agreement(), polarization(), Leik(), consensus(), entropy(), etc. (Note that the package includes other functions - I haven't explored them all.)

Thanks, Matthew - I hope some of the above will be helpful to you.

Isabella

@paul-buerkner
Copy link
Collaborator

Thank you @isabellaghement for checking out all these options!

@mjskay
Copy link
Collaborator Author

mjskay commented Nov 24, 2022

With regard to conversion, it should be possible to support categorical only draws_array and draws_matrix, right? Only if we mixed categorical and numerical variables, the conversion would be lossy (and then trigger a warning).

Hmm, fair point. It could get a little hairy since we'd probably have to apply a class (or at least an attribute) to remember that the arrays are factors or ordereds... and base-R factor() and ordered() typically do not work well as arrays (they tend to lose their dimensions). I can think about it and/or try something out to see what might work.

I've come to the conclusion that this isn't really worth it (I added rationale to the top comment). For now, I have gone with lossy conversion + a warning. Long term, we may want a way of storing type information for each variable in a draws_array/matrix, which would allow us to address this issue.

I believe this PR has now reached a "ready" state. I have updated the summary comment to be more comprehensive, as well as NEWS.md. Feedback welcome!

@mjskay mjskay marked this pull request as ready for review November 24, 2022 04:59
@mjskay
Copy link
Collaborator Author

mjskay commented Nov 25, 2022

The factor_rvar branch of {ggdist} now has experimental support for rvar_factors as well. Simple example of brms + posterior + ggdist with an ordinal model:

library(brms)
library(posterior)
library(ggdist)

df <- esoph
m <- brm(tobgp ~ agegp, data = df, family = cumulative, seed = 12345, backend = "cmdstanr")

grid <- unique(df[,"agegp", drop = FALSE])
grid$pred <- rvar_ordered(posterior_predict(m, newdata = grid))

grid |>
  ggplot(aes(x = agegp, ydist = pred)) +
  stat_slab()

image

Relevant issue over there is: mjskay/ggdist#108

Copy link
Collaborator

@paul-buerkner paul-buerkner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR looks already really good, up to a few minor things, mostly style related.

I am impressed by the level of thought (and effort) you put into getting factors and ordered factors to work as natively as possible!

Some of the rvar internals remain a bit of black magic to me, so I don't understand every detail of it, but I trust you and the tests you put in place on that one.

I agree with your assessment of using a lossy conversion to draws_matrix and draws_array, at least for the time being.

R/as_draws_array.R Outdated Show resolved Hide resolved
R/as_draws_array.R Outdated Show resolved Hide resolved
as_draws_rvars.draws_df <- function(x, ...) {
x_df <- as.data.frame(x, optional = TRUE)[, variables(x, reserved = TRUE), drop = FALSE]
data_frame_to_matrix <- function(df) {
if (any(vapply(df, is.factor, logical(1)))) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we have a separate function/method that does the check whether any of the variables in non-numeric in a draws object?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

like, a generic function that does this for any draws type?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I think I'm not seeing where to use it currently, so I'm not sure exactly the signature / definition)

R/as_draws_rvars.R Outdated Show resolved Hide resolved
R/discrete-summaries.R Outdated Show resolved Hide resolved
R/rvar-summaries-over-draws.R Outdated Show resolved Hide resolved
R/rvar-summaries-within-draws.R Show resolved Hide resolved
tests/testthat/test-rvar-apply.R Show resolved Hide resolved
tests/testthat/test-rvar-bind.R Outdated Show resolved Hide resolved
DESCRIPTION Outdated Show resolved Hide resolved
R/rvar-dist.R Outdated Show resolved Hide resolved
@mjskay
Copy link
Collaborator Author

mjskay commented Nov 29, 2022

This PR looks already really good, up to a few minor things, mostly style related.

Thanks for the really thorough review! I've addressed everything except two comments (see those replies).

Some of the rvar internals remain a bit of black magic to me, so I don't understand every detail of it, but I trust you and the tests you put in place on that one.

Heh, yeah, I agree it is pretty arcane. Getting rvar to mimic everything it does so far has been quite a journey (base R datatypes are... certainly something!), and I should probably clean up and document the rvar internals a bit more with everything I've had to figure out to get it to work (plus I sometimes forget when I've been away from it a bit).

@paul-buerkner
Copy link
Collaborator

Thank you for your changes and explanations of what you what to keep the way it is. It does all make sense to me. I want to give people a bit more time to check out the new features and perhaps add a review. So I will put a reminder in my calender for start of next week to merge this PR if no new things have come up until then. Would that sound good to you?

@mjskay
Copy link
Collaborator Author

mjskay commented Nov 29, 2022

Sounds great, thanks!

Should we ping some folks who might be able to provide feedback? Off the top of my head: @isabellaghement, @bwiernik, @jgabry, @avehtari? If any of you have a chance to "kick the tires" of rvar_factor() / rvar_ordered() on this branch sometime in the next week, it would be much appreciated (of course, no pressure!).

@avehtari
Copy link
Collaborator

pinging @TeemuSailynoja as he has been running recently some categorical and ordinal models

@bwiernik
Copy link

Try to run some tests with it this weekend

@paul-buerkner
Copy link
Collaborator

@mjskay From my side, this is ready to merge. Anything you want to change before I merge?

@mjskay
Copy link
Collaborator Author

mjskay commented Dec 20, 2022

Sorry for the delayed response - yes, I think this is good for now. I've built some stuff in ggdist on top of it and haven't run into anything new that needs changing, so that's the best I can say at the moment without other folks testing it. Thanks!

@paul-buerkner
Copy link
Collaborator

paul-buerkner commented Dec 20, 2022

Lovely, merging now. Thank you again for your enormous effort to make this feature possible!

@paul-buerkner paul-buerkner merged commit ef688e3 into master Dec 20, 2022
@mjskay
Copy link
Collaborator Author

mjskay commented Dec 20, 2022

Thanks!

@mjskay mjskay deleted the factor branch December 21, 2022 05:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants