Large data (in R-Instat) #8473

rdstern (Maintainer) · 2023-08-04T09:38:32Z

This is particularly for a video, one that addresses different ideas of "large" in an attractive way. Having written the items below, I would like to combine them with another idea, namely that we would like to promote having data in a spreadsheet, both to view the data and to help data exploration. So, what are the limits if you wish to do this with R?

Aha, that will add pivot tables in R to the topics below!

The theme is then that you pay a price for having a menu-driven front end that makes it easy to use R. So, what is the price? We will need an introduction, and should also provide alternatives. We give 2 extremes:

a) Why not just use the spreadsheet?
b) Why not just use R?

a) I propose to start with the diamonds data, as we use that in the first tutorial. It has roughly 50,000 rows (53,940 in the ggplot2 version) and 3 factors with 5, 7 and 8 levels. We really like factors, and they are quite nice for graphs. Could we make another variable, carat, into a factor? It has 273 levels. Then for bar charts the x-axis is a mess, with too many levels. What can we do about it? Wait until the discrete sub-dialog is completed; then we can do lots!
b) 273 is quite a lot of levels. price takes this further: made into a factor, it has 11602 levels. Are the levels/labels OK? The graph, as before, takes longer, but all still seems OK. Round the numbers: use -1 in transform to round to the nearest 10. Maybe that is more sensible?
c) There was a time when 50,000 would be quite a long column. Let's make it longer and see if that's still OK. 4 times gives about 200,000 rows. Still seems fine. Could we do that 100 times? That would also test whether R-Instat can easily handle many data frames, so it is "large" in the number of data frames open at the same time. We suggest 100 data frames is quite a lot for a single study. It is also a large number of times to repeat importing the data. Then the combined columns will be quite long, over 5 million rows. Some spreadsheets have a limit of about 1 million rows (Excel allows 1,048,576), so this checks 3 different components of large at the same time!
d) Is a loop with the importing code simple? That avoids repeating a dialog a large number of times.
e) Seems OK. The data view seems to switch between the data frames quite well.
f) In R-Instat, variables can be of many different types. The three most important are numeric, factor and character. We have already looked at large for a factor: it can have a large number of levels. Now we look at large numbers. Use examples from elsewhere that show limited accuracy, where R can do better. This is discussed in detail elsewhere.
g) Character variable: check it is OK, perhaps using a Jane Austen book. Make it wide. Then discuss with word wrap, a good challenge to see how it copes! That's even more evident with polygons, so we could mention them here too, and discuss their use in graphs.
h) It would be nice to look at lists too. Large can be wide in many ways! Can we mention the calculations on lists using a more modern approach to loops in the calculator?
i) Finally, in this tour, we consider large in terms of the number of variables. Here R-Instat is limited, compared with using R through RStudio. Why is that? Is it largely the metadata? But it could be OK for most studies. The largest example for the number of variables is DHS, with 7000. Also emphasise select, and using names and metadata, which come free compared with just a spreadsheet.
j) Do we want to include a large number of types of missing value, as a very esoteric application? DHS has stopped supplying data in SPSS format, because SPSS only permits 3 types, though for each variable. Stata, SAS and R permit more!
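
For the video script, the arithmetic behind items b), c) and f) can be sketched in a few lines. This is only an illustration in plain Python, not R-Instat or R code. The row count of 53,940 assumes the ggplot2 version of diamonds, the sample prices are made up, and the 2**53 example is just one common way to show the limited accuracy of double-precision numbers (the type R also uses for numeric columns).

```python
# Illustrative arithmetic only; this is Python, not R-Instat code.

DIAMONDS_ROWS = 53_940      # assumed row count of the diamonds data (ggplot2 version)
EXCEL_MAX_ROWS = 1_048_576  # row limit of an Excel worksheet

# Item b: rounding with -1 digits rounds to the nearest 10,
# which merges many distinct prices into far fewer levels.
prices = [326, 327, 334, 2757, 2759, 18823]  # made-up sample values
rounded = [round(p, -1) for p in prices]
levels_before = len(set(prices))
levels_after = len(set(rounded))

# Item c: stacking the data 4 times and 100 times.
rows_x4 = 4 * DIAMONDS_ROWS
rows_x100 = 100 * DIAMONDS_ROWS
exceeds_spreadsheet = rows_x100 > EXCEL_MAX_ROWS

# Item f: a double-precision number cannot tell 2**53 from 2**53 + 1.
big = 2**53
same_as_double = float(big) == float(big + 1)

print(rounded)                                  # rounded prices
print(levels_before, levels_after)              # distinct values before and after
print(rows_x4, rows_x100, exceeds_spreadsheet)  # stacked row counts vs. the limit
print(same_as_double)                           # True: the two integers collapse
```

With these assumed figures the 100-fold stack is about five times the Excel row limit, and rounding to the nearest 10 halves the number of distinct values even in this tiny sample.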
