This repository provides materials for a session of the I2DS Tools for Data Science workshop. The student-run workshop is part of the course Introduction to Data Science, taught by Simon Munzert at the Hertie School, Berlin, in Fall 2021.
The repository contains the materials for the interactive tutorial and the two datasets students will need to conduct their analysis.
This session teaches students how to clean 'typical' dirty datasets using the janitor package. Data scientists spend a great deal of time cleaning and organising data to get it into shape for analysis. janitor provides a set of simple but powerful functions that help to clean variable names, remove unnecessary rows and columns, check that two datasets are ready to be bound together, and more. In the first, pre-recorded part of the session, we introduce students to key janitor functions. In the second, live part, students practise using janitor functions on a dirty data example.
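As a rough illustration of the kind of cleaning described above, here is a minimal sketch. The data frames `dirty` and `other` are invented for this example and are not the workshop datasets; the native pipe `|>` assumes R 4.1 or later.

```r
library(janitor)

# Hypothetical messy data, invented for illustration only
dirty <- data.frame(
  `First Name` = c("Ana", "Ben", NA, "Ana"),
  `Test Score` = c(90, 85, NA, 90),
  `Empty Col`  = NA,
  check.names  = FALSE  # keep the awkward column names as they are
)

# Standardise the column names, then drop rows and columns that are entirely NA
clean <- dirty |>
  clean_names() |>
  remove_empty(which = c("rows", "cols"))

# List records that appear more than once, with a dupe_count column
dupes <- get_dupes(clean, first_name, test_score)

# Check whether a second (hypothetical) dataset has compatible columns for binding
other <- data.frame(first_name = "Cy", test_score = 70)
bindable <- compare_df_cols_same(clean, other)
```

Here `clean` keeps only the non-empty rows and columns, with names such as `first_name`, while `dupes` shows the repeated record so it can be inspected before deciding whether to drop it.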
By the end of our workshop session, we hope that students will be able to:
- Use janitor functions to clean a typical dirty dataset, including removing unnecessary columns and rows, creating readable column names, and locating duplicates.
- Understand how janitor integrates with the tidyverse and can be used in tandem with tidyverse packages to clean data quickly.
- Use janitor functions to quickly create clearly labelled tables.
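The table-building objective can be sketched with janitor's `tabyl()` and `adorn_*()` functions. The `survey` data frame and its values are hypothetical, and the particular chain of adornments is just one reasonable choice, not the workshop's prescribed approach.

```r
library(janitor)

# Hypothetical data, invented for illustration only
survey <- data.frame(
  group = c("A", "A", "B", "C", "C", "C"),
  year  = c(1, 2, 1, 1, 2, 2)
)

# Two-way cross-tabulation: one row per group, one column per year
counts <- tabyl(survey, group, year)

# Add a totals row, convert to row percentages, and label the result
labelled <- counts |>
  adorn_totals("row") |>
  adorn_percentages("row") |>
  adorn_pct_formatting(digits = 0) |>
  adorn_ns() |>
  adorn_title("combined")

print(labelled)
```

Because a `tabyl` is an ordinary data frame, `counts` can also be passed on to other tidyverse functions before or after the `adorn_*()` steps.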
Further resources on janitor:
- GitHub overview of the janitor package
- Cross-tabulation with janitor
- CRAN reference for all janitor functions
- Blog post from 'Swimming in the Data Lake' on janitor
The material in this repository is made available under the MIT license.
We collaborated to develop our approach to the workshop and to design our presentation and materials. However, we each took the lead on a key aspect of the workshop.
Claudia Zwar was responsible for preparing the materials for the live tutorial and writing the README file.
Eduardo Campbell was responsible for developing the recorded presentation and creating the recording.