# Contents

1. [Installing ccdata](installing-ccdata)
2. [Using ccdata](using-ccdata)
3. [Additional functions in `ccfun`](ccfun)

---

## Before you start

If you have some experience with R, feel free to carry on through the introductory tutorial. If you are new to programming or to R, we recommend spending a little time on the basics of R programming with online tutorials such as [DataCamp](https://www.datacamp.com/) or the [R Studio](https://www.rstudio.com/online-learning/) resources.

# Working with the `ccd` object

Data from the HIC project is shared in XML format from each site, and then re-assembled into an RData file which contains a single object that we normally call `ccd`. You will either be provided with a copy of this data or use the raw data on the UCL IDHS.

The `ccd` object contains the raw data with no processing. It is up to you how you want to represent these data. The typical steps involve deciding which fields you want, and at what time resolution you want to represent the information.

To start with, we will work on the small anonymised dataset included with the cleanEHR package. The command will look something like the one below; you will need to alter the path in the `load` command to point to your Rdata file.

```{r eval=FALSE, echo=TRUE}
load("/Users/steve/Documents/R/win-library/3.3/cleanEHR/doc/sample_ccd.Rdata")
```

This loads the raw data (`ccd`). You will see it appear as a 'Value' in the Environment window at the top right of RStudio.

## The `ccd` structure

### Data overview

You can get a quick overview of the data by checking `infotb`. In the sample dataset, patient identifiers and sensitive variables such as NHS number and admission time have been removed or altered.
```{r eval=FALSE, echo=TRUE}
print(head(ccd@infotb))
```

```{r eval=FALSE, echo=TRUE}
##        site_id episode_id nhs_number pas_number         t_admission
## 1: pseudo_site          1         NA         NA 1970-01-01 01:00:00
## 2: pseudo_site          2         NA         NA 1970-01-01 01:00:00
## 3: pseudo_site          3         NA         NA 1970-01-01 01:00:00
## 4: pseudo_site          4         NA         NA 1970-01-01 01:00:00
## 5: pseudo_site          5         NA         NA 1970-01-01 01:00:00
## 6: pseudo_site          6         NA         NA 1970-01-01 01:00:00
## t_discharge parse_file parse_time pid index
## 1: NA 1 1
## 2: NA 2 2
## 3: NA 3 3
## 4: NA 4 4
## 5: NA 5 5
## 6: NA 6 6
```

The basic unit of the data is an admission episode, which represents one admission to a participating site. Together, `episode_id` and `site_id` identify a unique admission entry. `pid` is a unique patient identifier.

### Quickly check how many episodes there are in the dataset

```{r eval=FALSE, echo=TRUE}
ccd@nepisodes
```

```{r eval=FALSE, echo=TRUE}
[1] 30
```

## Scope of the dataset

There are 263 fields, which cover patient demographics, physiology and laboratory results as well as medication information. Each field has 2 labels: an NHIC code (e.g. `NIHR_HIC_ICU_0108`) and a short name (e.g. `h_rate`). The function `lookup.items()` lets you look up the fields you need. It is case insensitive and allows fuzzy search.
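If you are new to R, it may help to see what case-insensitive, partial matching looks like in plain base R. The sketch below is only an illustration and is not how `lookup.items()` is implemented: the two-row `items` table is a hand-made stand-in built from fields mentioned in this tutorial, and `find_items()` is a hypothetical helper.

```r
# A tiny stand-in for the field reference table, using only items
# shown in this tutorial (not the real 263-field list).
items <- data.frame(
  NHIC.Code  = c("NIHR_HIC_ICU_0108", "NIHR_HIC_ICU_0109"),
  Short.Name = c("h_rate", "h_rhythm"),
  Long.Name  = c("Heart rate", "Heart rhythm"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the rows in which any column contains the
# keyword, ignoring case. An illustration, not the lookup.items() code.
find_items <- function(keyword, tbl) {
  hit <- apply(tbl, 1, function(row) {
    any(grepl(keyword, row, ignore.case = TRUE))
  })
  tbl[hit, , drop = FALSE]
}

print(find_items("HEART", items))   # matches both rows despite the case
print(find_items("rhythm", items))  # matches only the heart rhythm row
```

Plain `grepl()` only finds substrings; base R's `agrepl()` is one option if you also want to tolerate small spelling differences.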
### Searching for heart rate

```{r eval=FALSE, echo=TRUE}
lookup.items('heart') # fuzzy search
```

```{r eval=FALSE, echo=TRUE}
+-------------------+--------------+--------------+--------+-------------+
| NHIC.Code         | Short.Name   | Long.Name    | Unit   | Data.type   |
+===================+==============+==============+========+=============+
| NIHR_HIC_ICU_0108 | h_rate       | Heart rate   | bpm    | numeric     |
+-------------------+--------------+--------------+--------+-------------+
| NIHR_HIC_ICU_0109 | h_rhythm     | Heart rhythm | N/A    | list        |
+-------------------+--------------+--------------+--------+-------------+
```

A list of included data fields can be found [here](https://github.com/CC-HIC/analysis-template/blob/master/data-raw/item_ref.yaml).

### Inspect individual episode

    episode.graph(ccd, 7, c("h_rate", "bilirubin", "fluid_balance_d"))

# Choosing your fields

We now want to fill in an empty table with one row per patient per time point.

```{r eval=FALSE, echo=TRUE}
dt <- create.cctable(ccd, freq=1, conf=NULL)
```

You can find out more about the `cctable()` [function](https://github.com/CC-HIC/cleanEHR/tree/29e457782914eef3a89abd82a9391bd4f20a7bb4/R), which is part of the cleanEHR package.

If the `conf` argument is empty (no configuration provided), then the function returns _everything_ from the raw data. We are unlikely to want all the data entries in any given analysis, and such a large data table will make for slow processing. Hence we specify the fields we are interested in as a list in a separate [`YAML`](http://yaml.org/start.html) configuration file.

    NIHR_HIC_ICU_0108:
    NIHR_HIC_ICU_0122:

Fields `NIHR_HIC_ICU_0108` and `NIHR_HIC_ICU_0122` are heart rate and the lactate from an arterial blood gas, respectively.

Create an empty file and list the fields you want, one per line, with a colon `:` after each (the colon indicates that these are 'nodes').
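If you prefer not to create the file by hand, the same two-line configuration could be written from R itself. This is a minimal sketch using only base R; the variable names `fields`, `yaml_lines` and `conf_path` are illustrative, and the file is written to a temporary location so the example runs anywhere.

```r
# Field codes taken from the example above; each becomes a YAML node.
fields <- c("NIHR_HIC_ICU_0108", "NIHR_HIC_ICU_0122")

# Append the colon that marks each entry as a node.
yaml_lines <- paste0(fields, ":")

# Written to a temporary file here; in practice choose your own path
# inside your project directory.
conf_path <- tempfile(fileext = ".yaml")
writeLines(yaml_lines, conf_path)

# Read the file back to confirm its contents.
cat(readLines(conf_path), sep = "\n")
```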
Save the file (ideally with a `.yaml` extension), and then give the path to the file to the `create.cctable` function.

You can find a full list of fields on the [ccdata wiki](https://github.com/UCL-HIC/ccdata/wiki/Data-set-1.0). For now, the fields must be specified with their NIHR HIC reference number.

### Choose your time cadence

The other argument in the `create.cctable()` function above sets the time cadence of the table you are building. Different fields will have information recorded at different times. In most cases, a 'wide table' with one row per episode per time point and columns for all the fields of interest is desired. You need to specify the 'time points', and you do this with the `freq` argument, given in hours (e.g. 1, 24, 0.5, ...).

If `freq=1`, then you will create an empty row for every hour between admission and discharge. Where raw data is available for that hour, it will be filled in. Where no data is available, a 'missing' value will remain. You can additionally choose to impute missing values if you wish (see later).

Putting this all together, and assuming you have a file called `walkthrough.yaml` that looks like the YAML code snippet above:

```{r eval=TRUE, echo=TRUE}
library(ccdata)
load("/Users/steve/projects/ac-CCHIC/data/anon.RData") # the raw data
dt <- create.cctable(ccd, freq=24, conf="walkthrough.yaml") # make a 'wide' table
```

## Cleaning `ccdata`

## Imputing missing data

---

[Previous](installing-ccdata)

---

[Next](using-ccdata.md)