Time to include tidymodels in R-Instat? #7203

rdstern · 2022-02-06T07:41:11Z

rdstern
Feb 6, 2022
Maintainer

In prepare we are fitting with the tidydata suite of packages by default. The tidymodels suite relates to our describe and particularly to our model menu. It is still changing, but has moved on attractively in the past couple of years. I suggest it is time to assess whether these packages provide our defaults here as well. If so, then how should they be incorporated.

We already use some of the packages in this suite, particularly ggplot2 and also broom (a bit). There are quite a few more. Below I mention infer, skimr, parsnip and yardstick. There is also an interesting package of data sets we should add. It is called modeldata.

A good place to start is here, with the first set of articles, called Perform Statistical Analysis.

The first article is particularly to show the importance of the by group analyses. I would like to include this feature in R-Instat soon - by Version 0.8. That's so we can give our own version of this article, because I would like to improve on the ideas for some situations. I suggest it is a mistake to start these articles with one that shows how to break down data into groups for separate analyses, as though this is the natural default. I think it is very clever computing-wise, and sometimes very useful. But we can show 2 alternatives, namely a) doing the analyses one-at-a-time, so the individual results are examined as they are produced. We want to encourage interactive work. b) Keeping the sets together so (for example) parallel line models can be examined.

I digress slightly to mention the interesting online book and package called Statistical Inference via Data Science. There is a companion package called moderndive which has a geom for parallel-line models called geom_parallel_slopes, which I think we should add. (It also has geom_categorical model. I am less sure what that does.

The other articles use the infer package that is part of the tidymodels suite. It looks as though we can incorporate this, (including the exciting visualize function) in our one-variable and two variable dialogues. The improvements are perhaps better versions of some of the functions - so fitting in with tidymodels - plus considerable improvements in the use of these dialogues for teaching purposes. Also the use of repeated sampling as well as theory to produce the results. I suggest @volloholic will like that a lot - correctly - and it is an important addition.

So there is quite a big task. Quite important, because it could make the one variable and two variable modelling dialogues pretty impressive, and hence a good parallel to the one-variable and two variables describe.

Next topic is smaller! skimr is a replacement for the summary function, which we use in the Describe > One Variable > Summarise. Now I really like our existing default summarise, but perhaps this is better. Like summarise it gives sensible summaries for all types of data. I wonder if it can provide a third alternative in the One Variable dialogue?

Of course it works with pipes, e.g.

iris %>%
dplyr::group_by(Species) %>%
skim()

I wonder if it could enhance our two variable summarise - possibly a new dialogue, and also for 3-variable. Our current dialogue has the multiple receiver only allowing a single type of variable. This could allow all types!

Back to modelling! A key package in the tidymodels suite is parsnip. This could be our passport to something David has been aiming for, as an enhancement to our general fitting dialogue. There is now an interesting book that shows its use. It also implies we should include the rsample package in the Prepare > Data Reshape Menu. Nice for us - we make 2 data frames, for Train and Test, and the chapter in the book discusses stratifying, time series and multiple levels in splitting the data. Perhaps it should be called Split? Also we can try using parsnip immediately, by installing the package and then using the fit model and the use model keyboards. I suggest one task will be to add a parsnip keyboard each time. The tests can be used to define the keys needed.

Then, I am not sure where it fits, but I do like the look of yardstick! I think it may go naturally in the general Use Models dialogue?

rdstern · 2022-02-07T07:48:42Z

rdstern
Feb 7, 2022
Maintainer Author

So, let me follow this thread with a more detailed set of suggestions that, I hope can become a set of additional features on our modelling dialogues.
This is one-variable > Fit, 2 variables Fit and 3 Variables Fit. We add to them so they become even more useful teaching tools. First a simple task to include the 2 new geoms mentioned above. But let's use it (or them) in the 3-variable fit. It is perfect there. First simply add the geoms.

Next the proposed changes in one- variable and 2 variable. We may incorporate their t-test etc, but we already have a pretty good one from the mosaic package - so that isn't clear to me. (We use tidymodels when it is a real improvement only. But we certainly add visualise, which looks great to show significance levels.

And the big one now, we also add the randomisation tests equivalent. Perhaps they are not even a separate radio button, but perhaps that will be simpler. And it may be easier in teaching if it is obviously a different approach? There is a nice vignette that shows the 2 methods together.

I suggest this is all work to be led by @jkmusyoka supporting one of the interns. Happy to help, and @volloholic needs to check the conceptualisation.

0 replies

rdstern · 2022-02-08T08:08:30Z

rdstern
Feb 8, 2022
Maintainer Author

And another area that can be considered here is simple Bayesian methods. There is a useful book here. This states it is a companion to the Bayesian Statistics in the Coursera course(s) on Statistics with R. There is also a package (for the courses?) called statsr. This includes a function namely:

bayes_inference which is for "Bayes hypothesis tests and credible regions." I suggest it can be another radio button, called Bayes in our Model > One Variable > Fit and Two Variables Fit.

So there will be a Bootstrap (from the ideas above - from Infer package) and a Bayes button (from the statsr package) on these 2 dialogues.

The places and ways we do for further Bayesian modelling is not quite so clear to me. There is a popular package called BAS that adds bas_lm and bas_glm, which looks nice, but we should also check what is possible with Parsnip. There is also a package called bayesplot which is all ggplot2 stuff. Exciting.

0 replies

rdstern · 2022-07-20T17:59:42Z

rdstern
Jul 20, 2022
Maintainer Author

For Bayesian methods Gelman seems a key person. He also has a recent book here:

1 reply

shadrackkibet Jul 21, 2022
Collaborator

@rdstern I agree that Andrew Gelman is the Key person on Bayesian. I have been following him on social media and listening to his talks on Bayesian stats for some time now. He has a number of discussions on youtube but this one impressed me https://www.youtube.com/watch?v=cuE9eHSbjNI&t=6s

For Bayesian, it might be good to look into Stan software https://mc-stan.org/users/interfaces/ which has an R Interface https://mc-stan.org/users/interfaces/rstan.html

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Time to include tidymodels in R-Instat? #7203

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 3 comments 1 reply

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Time to include tidymodels in R-Instat? #7203

rdstern Feb 6, 2022 Maintainer

Replies: 3 comments · 1 reply

rdstern Feb 7, 2022 Maintainer Author

rdstern Feb 8, 2022 Maintainer Author

rdstern Jul 20, 2022 Maintainer Author

shadrackkibet Jul 21, 2022 Collaborator

rdstern
Feb 6, 2022
Maintainer

Replies: 3 comments 1 reply

rdstern
Feb 7, 2022
Maintainer Author

rdstern
Feb 8, 2022
Maintainer Author

rdstern
Jul 20, 2022
Maintainer Author

shadrackkibet Jul 21, 2022
Collaborator