---
title: "ReadMe"
author: "Kevin E. D'Elia"
date: "07/22/2015"
output: html_document
---
This repository contains the project work done for my Coursera class Getting and Cleaning Data.
The purpose of this project is to demonstrate my ability to collect, work with, and clean a data set. The goal is to prepare a dataset that meets the principles of tidy data^3^ and can be used for later analysis. To this end, the file run_analysis.R contains the R code necessary for the task.
The dataset which is constructed during the initial processing of the data is a "messy" dataset because of the following violations of the tidy data principles:
- There are duplicate columns (i.e., not every column contains a distinct variable)
- Not every row is a single observation
The motivation for the project comes from research being done in the area of wearable computing^1^. The data used in the project was collected from the accelerometers of the Samsung Galaxy S smartphone. A full description is available at the site where the data was obtained^2^.
Guidelines for approaching the project work can be found in the Coursera Course Project Forum^4^, ^5^.
In addition to this README, there is a file in the repository named CodeBook.md which describes the variables, the data, and any transformations or work performed to clean up the data.
- R version: R version 3.2.0 (2015-04-16)
- R Studio version: RStudio Desktop 0.99.467
- Operating environment(s): Mac OS/X Yosemite, Ubuntu 14.04, RHEL 6.6
- Dependencies: This script requires the following package to be installed and available - dplyr
NOTE: This script was developed and executed on Unix-like platforms and makes no provisions for running on a non-Unix platform. A future version may take into consideration the target execution platform and adjust paths accordingly.
- Obtain and extract the data
- Merge the training and the test sets to create one data set
- Extract only the columns containing values for the mean and standard deviation of each measurement
- Use descriptive activity names to name the activities in the data set
- Appropriately label the data set columns with descriptive variable names
- From the data set in step 4, create a second, independent tidy data set with the average of each variable for each activity and each subject
The dataset under analysis (UCI HAR Dataset) is obtained from the UCI Machine Learning Repository^2^. The script run_analysis.R contains code to:
- set the working directory to the directory from which the script is run
- check for the existence of the compressed data file
- if the compressed data file does not exist:
    - the file will be downloaded as data.zip from the link pointing to the UCI Machine Learning Repository^2^
    - the compressed file will be extracted to the data directory, relative to the current working directory
- the files for analysis will reside in data/UCI HAR Dataset, data/UCI HAR Dataset/train, and data/UCI HAR Dataset/test
NOTE: The code that performs the actual download and extraction has been commented out because of the prohibitively long download time. Before executing the script, the user is expected to have downloaded the file directly from the page link (much faster!) and extracted the contents to the following directories: ./data, ./data/train, and ./data/test. Step 1 of the script verifies the existence of the ./data directory and exits should it not exist.
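The guard described above can be sketched in a few lines of R. This is a minimal illustration, not the code from run_analysis.R: the function name ensure_data and the message text are made up, and the zip URL is omitted (take it from the UCI repository page^2^). The download/unzip lines are left commented out, matching the actual script.

```r
# Illustrative step-1 guard: fail fast when the extracted data is missing.
ensure_data <- function(data_dir = "./data") {
  # download.file(zip_url, destfile = "data.zip")  # commented out: slow download
  # unzip("data.zip", exdir = data_dir)            # extract next to the script
  if (!dir.exists(data_dir)) {
    stop("Expected directory ", data_dir,
         " - download and extract the dataset first.")
  }
  invisible(TRUE)
}
```

Note that dir.exists() is available from R 3.2.0 onward, which matches the R version listed above.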
The script first checks for the existence of the directory ./data, which is relative to the location of the script. If this directory does not exist, the script exits with an appropriate message. Once the existence of ./data has been confirmed, the script reads in a total of 8 files, described below:
- activity_labels.txt: contains textual descriptions of the activities performed by the volunteer subjects (WALKING, etc.)
- features.txt: contains the names of the measurements taken during training/test (tBodyAcc-mean()-X). Information regarding the approach to naming the measurements can be found at the UCI Machine Learning Repository^2^.
- train/subject_train.txt: contains the IDs of the subjects who participated in the training sessions
- train/y_train.txt: contains the IDs of the activities performed by the training subjects
- train/X_train.txt: contains the measurements captured during the training sessions
- test/subject_test.txt: contains the IDs of the subjects who participated in the testing sessions
- test/y_test.txt: contains the IDs of the activities performed by the testing subjects
- test/X_test.txt: contains the measurements captured during the testing sessions
NOTE: The dataset also includes Inertial Signals data for both the training and testing sessions. These data files were not used in the analysis.
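As a self-contained illustration of how one of these files is read, the sketch below writes a stand-in activity_labels.txt to a temporary directory so it runs anywhere; the real script reads from data/UCI HAR Dataset. The column names ActivityID and ActivityName are illustrative, not taken from run_analysis.R.

```r
# Write a two-row stand-in file, then read it the way the script would.
base <- tempdir()
writeLines(c("1 WALKING", "2 WALKING_UPSTAIRS"),
           file.path(base, "activity_labels.txt"))

activity_labels <- read.table(file.path(base, "activity_labels.txt"),
                              col.names = c("ActivityID", "ActivityName"),
                              stringsAsFactors = FALSE)
activity_labels$ActivityName
# -> "WALKING" "WALKING_UPSTAIRS"
```

The remaining seven files are read the same way, with read.table splitting each line on whitespace by default.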
Step 2 - Extract only the columns containing values for the mean and standard deviation of each measurement
After the data has been read in successfully, the column headers are set using the text SubjectID, Activity, and the contents of the second column from features.txt. Duplicate columns are identified and the dataset is subsetted using the logical vector of unique columns; the data is then sorted first by SubjectID, then by Activity. Next, the column names are extracted and a grep is run against them to obtain the indices of the mean and standard deviation measurement columns. The rule used when coding the regular expression was that any column name containing the literal text mean() or std() is kept; all others are excluded. This excludes names such as angle(X,gravityMean) (mean appears only as a parameter) and fBodyBodyGyroJerkMag-meanFreq() (a weighted average). The resulting vector of indices is used to subset the dataset, producing a reduced dataset containing only the mean and standard deviation values.
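The selection rule can be demonstrated on a toy vector of feature names; the real script applies the same kind of pattern to the full set of column names, and the exact regular expression in run_analysis.R may differ slightly.

```r
# Four sample feature names: two that should be kept, two that should not.
features <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-X",
              "fBodyBodyGyroJerkMag-meanFreq()", "angle(X,gravityMean)")

# Keep only names containing the literal text mean() or std().
# meanFreq() and angle(...,gravityMean) do not match and are dropped.
keep <- grep("mean\\(\\)|std\\(\\)", features)
features[keep]
# -> "tBodyAcc-mean()-X" "tBodyAcc-std()-X"
```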
Step 3 - Use descriptive activity names to name the activities in the data set
After the original dataset has been reduced to one containing only mean and standard deviation measurements, the numerical values in the Activity column are converted to textual values using the information in activity_labels.txt. The conversion is done with a loop that iterates over the activity_labels vector and, for each index, replaces the Activity value of all matching rows with the corresponding entry from activity_labels.
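A toy version of that replacement loop is sketched below; the three rows and the three-label vector are made up, while in the script the labels come from activity_labels.txt.

```r
# Made-up labels and data for illustration.
activity_labels <- c("WALKING", "WALKING_UPSTAIRS", "SITTING")
df <- data.frame(SubjectID = c(1, 1, 2),
                 Activity  = c(2, 3, 1))

# Replace each numeric activity ID with its textual label.
for (i in seq_along(activity_labels)) {
  df$Activity[df$Activity == i] <- activity_labels[i]
}
df$Activity
# -> "WALKING_UPSTAIRS" "SITTING" "WALKING"
```

The first assignment coerces the Activity column from numeric to character, after which the remaining numeric IDs still match because `"3" == 3` compares as character in R.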
Step 4 - Appropriately label the data set columns with descriptive variable names
The first step in this phase extracts the column headers so they can be given more meaningful names. The modifications replace all -mean()- and -std()- text with Mean and Std, respectively. The final modification prefixes each measurement column header with the Greek letter mu, which represents an arithmetic mean value, and encloses the header in parentheses. The existing column headers of the dataset are then replaced with the new ones.
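The renaming can be sketched with gsub and paste0 on two sample headers; the exact patterns in run_analysis.R may differ slightly.

```r
# Two sample headers, renamed the way the step above describes.
headers <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-Y")
headers <- gsub("-mean\\(\\)-?", "Mean", headers)   # -mean()- -> Mean
headers <- gsub("-std\\(\\)-?",  "Std",  headers)   # -std()-  -> Std
headers <- paste0("\u03BC(", headers, ")")          # prefix mu, wrap in ()
headers
# -> "μ(tBodyAccMeanX)" "μ(tBodyAccStdY)"
```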
Step 5 - From the data set in step 4, create a second, independent tidy data set with the average of each variable for each activity and each subject
The final step in the processing is the analysis itself. The dataset is grouped by SubjectID and Activity, and the remaining columns are summarized with the mean function. The output of the analysis is written to the local file system using write.table with no options other than the defaults; the name of the output file is tidy_data.txt, and it can be read back into R for verification using tidy <- read.table("./tidy_data.txt", header=TRUE) and then viewing the tidy variable.
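The summarisation can be illustrated on a made-up four-row dataset. The script itself uses dplyr (a group_by on SubjectID and Activity followed by a mean summarise); base R's aggregate produces the same result and keeps this sketch dependency-free.

```r
# Made-up data: two subjects, one activity each, two observations per pair.
df <- data.frame(SubjectID = c(1, 1, 2, 2),
                 Activity  = c("WALKING", "WALKING", "SITTING", "SITTING"),
                 val       = c(1, 3, 10, 20))

# One row per (SubjectID, Activity) pair, with the mean of each measurement.
tidy <- aggregate(val ~ SubjectID + Activity, data = df, FUN = mean)

# Write and re-read with the same defaults the script uses.
out <- file.path(tempdir(), "tidy_data.txt")
write.table(tidy, out)
read.table(out, header = TRUE)
```

The means here are 2 for subject 1 / WALKING and 15 for subject 2 / SITTING.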
Note: When reading the data back in, read.table applies make.names() to the column headers by default (check.names=TRUE), which replaces characters that are not valid in R variable names, such as parentheses, with periods. So μ(tBodyAccMeanX) comes back as μ.tBodyAccMeanX. instead. Passing check.names=FALSE to read.table preserves the original names.