NOTE: This package and wiki are being phased out. Please see research.coursekata.org for more info, and the mo for a more efficient data cleaner.
This is the source code and an introduction to the CourseKataData
package. If you are trying to learn more about the contents of the
actual data, or how other researchers are using this package, head to
the Wiki.
The goal of CourseKataData
is to help researchers working with
CourseKata courses on data science, statistics, and modeling. The data
downloaded from CourseKata can be useful
in the more-or-less “raw” form that it comes in, but you will usually
want to process the data before working with it. This package includes
useful functions for cleaning and tidying the raw course data from
completed courses.
Download the package directly from this repository using devtools:
library(devtools)
install_github("UCLATALL/CourseKataData")
This section details the usage of the functions in the CourseKataData
package. If you would like to read more about the actual structure of
the data downloaded from CourseKata, check out the associated Data
Structure page in the
Wiki.
The easiest way to get started working with your data is to just download it and run process_data() with the path to the zip file (no need to unzip the file first).
library(CourseKataData)
process_data("path/to/downloaded/data.zip")
If you would like to unzip the file first (which can sometimes be faster), you can also pass the name of the directory to process_data(). If your data was in path/to/downloaded/data.zip and you extracted it to path/to/downloaded/data, then the call should be to the latter path:
process_data("path/to/downloaded/data")
When you run process_data(), it will load all of the data and create six data frames (actually tibbles, but they work pretty much the same). These are the names of the tables that are created:
- classes
- responses
- items
- tags
- media_views
- page_views
If there is already an R object with one of these names, you will be given the opportunity to abort the processing or overwrite the existing object. Advanced users can check the R documentation (?process_data) to learn how to create these variables in a different environment.
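Once process_data() finishes, the six tables are ordinary data frames sitting in your workspace, so you can inspect them with base R. A minimal sketch (the path is a placeholder for your own download):
# after processing, each table can be examined like any other data frame
process_data("path/to/downloaded/data.zip")
nrow(responses)  # how many response records were loaded
head(classes)    # peek at the class-level table
str(page_views)  # column types, including any list-columns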
If you have downloaded multiple data download bundles from CourseKata, don’t worry, this package takes care of merging the files for you. You can specify a vector of zip files or directories to load:
paths <- c("path/to/first/zip", "path/to/second/zip", "path/to/a/directory")
process_data(paths)
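If all of your zip bundles live in one folder, you can also build that vector with base R instead of typing each path (plain base R; the folder name is a placeholder):
# collect every zip file in a downloads folder and process them together
paths <- list.files("path/to/downloads", pattern = "\\.zip$", full.names = TRUE)
process_data(paths)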
Or, you can put all of the data into a common directory and specify that directory:
# assuming you put all of the downloaded bundles in the same directory
process_data("path/to/directory/containing/bundles")
Note: this process relies on the fact that the data bundles you download have a consistent format. It may get confused if you reorganize things too much within the downloaded bundles. (See more about the structure of the data download and what is in each file in the Data Structure page in the Wiki.)
By default the data is parsed using the “UTC” time zone. If you would like to convert the date time data in each table to a different time zone you can specify it. Here is an example with Los Angeles, and another showing how to let your computer decide what time zone you are in:
# specifically Los Angeles
process_data("path/to/downloaded/data", time_zone = "America/Los_Angeles")
# let the system figure it out for you
process_data("path/to/downloaded/data", time_zone = Sys.timezone())
If you are having trouble with this, see ?timezones for more information.
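If you are not sure whether a time zone string is valid, base R can check it against the full list of Olson names (this check is plain base R, not part of CourseKataData):
# check a candidate time zone against the list that R knows about
"America/Los_Angeles" %in% OlsonNames()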
By default, all of the responses in the course are included in the responses table. These responses can be semantically split into three parts: the surveys at the beginning and end of the course, the practice quizzes at the end of each chapter (these are now called review questions, but this function still calls them quizzes because they used to be called that), and the rest of the items in the text. If you would like to split the responses like this, there is a handy split_responses() function that will return a list of three tables:
parts <- split_responses(responses)
parts$in_text
parts$quizzes
parts$surveys
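A quick way to sanity-check the split, assuming parts is the list returned above, is to count the rows in each piece with base R:
# number of responses that ended up in each part of the split
sapply(parts, nrow)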
If you would instead like to process only part of the data, you can use the helpful sub-process functions that exist for each of the file types in the data download:
- process_classes()
- process_responses()
- process_items()
- process_tags()
- process_media_views()
- process_page_views()
As with process_data(), each of these functions accepts the path to a CourseKata data bundle zip file, the directory of an extracted bundle, or a directory full of bundles; you can also simply supply the name of the specific file that you want to process with the function.
# zip file
process_classes("path/to/downloaded/data.zip")
# directory
process_classes("path/to/downloaded/data)
The classes data file contains information about each class in the data download. However, the other files are specific to each class included in the download. If you want to extract a specific class, you can specify the class_id, which should correspond to the name of the class's folder in the classes folder of the download. (See more about the structure of the data download and what is in each file in the Data Structure page in the Wiki.)
# a single class
process_page_views(
  "path/to/downloaded/data.zip",
  class_id = "dad3c954-3cb5-11ea-b81d-1d385b778009"
)

# two classes
process_page_views(
  "path/to/downloaded/data.zip",
  class_id = c(
    "dad3c954-3cb5-11ea-b81d-1d385b778009",
    "1a99bb15-4d19-4e3f-bb60-6332141573ed"
  )
)
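Because class_id matches the name of each class's folder inside the classes folder of the download, you can list the ids available in an extracted bundle with base R (the path is a placeholder for your own extracted download):
# list the class ids available in an extracted download
basename(list.dirs("path/to/downloaded/data/classes", recursive = FALSE))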
If you are not going to use R to perform your data analysis, you will likely want to export the processed data back to CSV or some other format. Note that many of the base functions in R will have trouble writing the processed data because many of the processed tables have list-columns. A list-column is a column of a data frame where each row/element is itself a list. Here is an example:
tbl_with_lists <- tibble::tibble(x = list(
  # first element/row of column x
  list(
    list(1, 2, 3),
    list(4, 5, 6)
  ),
  # second element/row of column x
  list(
    list(7, 8, 9)
  )
))
These columns need to be converted back to a format that can be written as a string if you want to export the data to CSV or a similar format that does not have clear rules for exporting lists. For convenience, the convert_lists() function is provided to handle the conversion, and a custom CSV writer is loaded to take care of this for you:
# if you just want to convert lists
convert_lists(tbl_with_lists)
#> # A tibble: 2 × 1
#> x
#> <chr>
#> 1 [[[1],[2],[3]],[[4],[5],[6]]]
#> 2 [[[7],[8],[9]]]
If you specifically want to export to CSV, we have included a function that can handle this for you. It automatically masks the standard write.csv() with one that will handle these list columns:
# make sure you have CourseKataData loaded
library(CourseKataData)
responses <- process_responses("path/to/some/data")
write.csv(responses, "responses.csv")
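If you would rather be explicit about which writer is called, an equivalent approach (a sketch, assuming convert_lists() returns only atomic columns, as in the output shown above) is to convert first and then use the writer from utils:
# convert list-columns to strings, then write with the unmasked base function
utils::write.csv(convert_lists(responses), "responses.csv", row.names = FALSE)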
If you see an issue, problem, or improvement that you think we should know about, or you think would fit with this package, please let us know on our issues page. Alternatively, if you are up for a little coding of your own, submit a pull request:
- Fork it!
- Create your feature branch: git checkout -b my-new-feature
- Commit your changes: git commit -am 'Add some feature'
- Push to the branch: git push origin my-new-feature
- Submit a pull request :D