---
title: "ReadMe"
author: "Kevin E. D'Elia"
date: "07/22/2015"
output: html_document
---
This repository contains the project work done for my Coursera class Getting and Cleaning Data.
The purpose of this project is to demonstrate my ability to collect, work with, and clean a data set. The goal is to prepare a dataset that meets the principles of tidy data^3^ and can be used for later analysis. To this end, the file run_analysis.R contains the R code necessary for the task.
The dataset which is constructed during the initial processing of the data is a "messy" dataset because of the following violations of the tidy data principles:
- There are duplicate columns (i.e., not every column contains a distinct variable)
- Not every row is a single observation
The motivation for the project comes from research being done in the area of wearable computing^1^. The data used in the project was collected from the accelerometers of the Samsung Galaxy S smartphone. A full description is available at the site where the data was obtained^2^.
Guidelines for approaching the project work can be found in the Coursera Course Project Forum^4^, ^5^.
In addition to this README, there is a file in the repository named CodeBook.md which describes the variables, the data, and any transformations or work performed to clean up the data.
- R version: R version 3.2.0 (2015-04-16)
- R Studio version: RStudio Desktop 0.99.467
- Operating environment(s): Mac OS/X Yosemite, Ubuntu 14.04, RHEL 6.6
- Dependencies: This script requires the following package to be installed and available - dplyr
NOTE: This script was developed and executed on Unix-like platforms and makes no provisions for running on a non-Unix platform. A future version may take into consideration the target execution platform and adjust paths accordingly.
- Obtain and extract the data
- Merge the training and the test sets to create one data set
- Extract only the columns containing values for the mean and standard deviation of each measurement
- Use descriptive activity names to name the activities in the data set
- Appropriately label the data set columns with descriptive variable names
- From the data set in step 4, create a second, independent tidy data set with the average of each variable for each activity and each subject
The dataset under analysis (UCI HAR Dataset) is obtained from the UCI Machine Learning Repository^2^. The script run_analysis.R contains code to:
- set the working directory to the directory from which the script is run
- check for the existence of the compressed data file
- if the compressed data file does not exist:
    - the file will be downloaded as data.zip from the link pointing to the UCI Machine Learning Repository^2^
    - the compressed file will be extracted to the data directory, relative to the current working directory
- the files for analysis will reside in data/UCI HAR Dataset, data/UCI HAR Dataset/train, and data/UCI HAR Dataset/test
NOTE: The code that performs the actual download and extraction has been commented out because of the prohibitively long download time. Before executing the script, the user is expected to have downloaded the file directly from the page link (much faster!) and extracted the contents to the following directories: ./data, ./data/train, and ./data/test. Step 1 of the script verifies the existence of the ./data directory and exits should it not exist.
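The guard described above can be sketched in a few lines of R. This is a minimal illustration, not the code from run_analysis.R: the function name ensure_data and the message text are made up, and the zip URL is omitted (take it from the UCI repository page^2^). The download/unzip lines are left commented out, matching the actual script.

```r
# Illustrative step-1 guard: fail fast when the extracted data is missing.
ensure_data <- function(data_dir = "./data") {
  # download.file(zip_url, destfile = "data.zip")  # commented out: slow download
  # unzip("data.zip", exdir = data_dir)            # extract next to the script
  if (!dir.exists(data_dir)) {
    stop("Expected directory ", data_dir,
         " - download and extract the dataset first.")
  }
  invisible(TRUE)
}
```

Note that dir.exists() is available from R 3.2.0 onward, which matches the R version listed above.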
The script first checks for the existence of the directory ./data, which is relative to the location of the script. If this directory does not exist, the script exits with an appropriate message. Once the existence of ./data has been confirmed, the script reads in a total of 8 files, described below:
- activity_labels.txt: contains textual descriptions of the activities performed by the volunteer subjects (WALKING, etc.)
- features.txt: contains the names of the measurements taken during training/test (tBodyAcc-mean()-X). Information regarding the approach to naming the measurements can be found at the UCI Machine Learning Repository^2^.
- train/subject_train.txt: contains the IDs of the subjects who participated in the training sessions
- train/y_train.txt: contains the IDs of the activities performed by the training subjects
- train/X_train.txt: contains the measurements captured during the training sessions
- test/subject_test.txt: contains the IDs of the subjects who participated in the testing sessions
- test/y_test.txt: contains the IDs of the activities performed by the testing subjects
- test/X_test.txt: contains the measurements captured during the testing sessions
NOTE: The dataset also includes Inertial Signals data for both the training and testing sessions. These data files were not used in the analysis.
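As a self-contained illustration of how one of these files is read, the sketch below writes a stand-in activity_labels.txt to a temporary directory so it runs anywhere; the real script reads from data/UCI HAR Dataset. The column names ActivityID and ActivityName are illustrative, not taken from run_analysis.R.

```r
# Write a two-row stand-in file, then read it the way the script would.
base <- tempdir()
writeLines(c("1 WALKING", "2 WALKING_UPSTAIRS"),
           file.path(base, "activity_labels.txt"))

activity_labels <- read.table(file.path(base, "activity_labels.txt"),
                              col.names = c("ActivityID", "ActivityName"),
                              stringsAsFactors = FALSE)
activity_labels$ActivityName
# -> "WALKING" "WALKING_UPSTAIRS"
```

The remaining seven files are read the same way, with read.table splitting each line on whitespace by default.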
Step 2 - Extract only the columns containing values for the mean and standard deviation of each measurement
After the data has been read in successfully, the column headers are set using the text SubjectID, Activity, and the contents of the second column from features.txt. Duplicate columns are identified and the dataset is subsetted using the logical vector of unique columns; the data is then sorted first by SubjectID, then by Activity. Next, the column names are extracted and a grep is run against them to obtain the indices of the mean and standard deviation measurement columns. The rule used when coding the regular expression was that any column name containing the literal text mean() or std() is kept; all others are excluded. This excludes names such as angle(X,gravityMean) (mean appears only as a parameter) and fBodyBodyGyroJerkMag-meanFreq() (a weighted average). The resulting vector of indices is used to subset the dataset, producing a reduced dataset containing only the mean and standard deviation values.
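The selection rule can be demonstrated on a toy vector of feature names; the real script applies the same kind of pattern to the full set of column names, and the exact regular expression in run_analysis.R may differ slightly.

```r
# Four sample feature names: two that should be kept, two that should not.
features <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-X",
              "fBodyBodyGyroJerkMag-meanFreq()", "angle(X,gravityMean)")

# Keep only names containing the literal text mean() or std().
# meanFreq() and angle(...,gravityMean) do not match and are dropped.
keep <- grep("mean\\(\\)|std\\(\\)", features)
features[keep]
# -> "tBodyAcc-mean()-X" "tBodyAcc-std()-X"
```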
Step 3 - Use descriptive activity names to name the activities in the data set
After the original dataset has been reduced to one containing only mean and standard deviation measurements, the numerical values in the Activity column are converted to textual values using the information in activity_labels.txt. The conversion is done with a loop that iterates over the activity_labels vector and, for each index, replaces the Activity value of all matching rows with the corresponding entry from activity_labels.
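A toy version of that replacement loop is sketched below; the three rows and the three-label vector are made up, while in the script the labels come from activity_labels.txt.

```r
# Made-up labels and data for illustration.
activity_labels <- c("WALKING", "WALKING_UPSTAIRS", "SITTING")
df <- data.frame(SubjectID = c(1, 1, 2),
                 Activity  = c(2, 3, 1))

# Replace each numeric activity ID with its textual label.
for (i in seq_along(activity_labels)) {
  df$Activity[df$Activity == i] <- activity_labels[i]
}
df$Activity
# -> "WALKING_UPSTAIRS" "SITTING" "WALKING"
```

The first assignment coerces the Activity column from numeric to character, after which the remaining numeric IDs still match because `"3" == 3` compares as character in R.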
Step 4 - Appropriately label the data set columns with descriptive variable names
The first step in this phase extracts the column headers so they can be given more meaningful names. The modifications replace all -mean()- and -std()- text with Mean and Std, respectively. The final modification prefixes each measurement column header with the Greek letter mu, which represents an arithmetic mean value, and encloses the header in parentheses. The existing column headers of the dataset are then replaced with the new ones.
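The renaming can be sketched with gsub and paste0 on two sample headers; the exact patterns in run_analysis.R may differ slightly.

```r
# Two sample headers, renamed the way the step above describes.
headers <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-Y")
headers <- gsub("-mean\\(\\)-?", "Mean", headers)   # -mean()- -> Mean
headers <- gsub("-std\\(\\)-?",  "Std",  headers)   # -std()-  -> Std
headers <- paste0("\u03BC(", headers, ")")          # prefix mu, wrap in ()
headers
# -> "μ(tBodyAccMeanX)" "μ(tBodyAccStdY)"
```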
Step 5 - From the data set in step 4, create a second, independent tidy data set with the average of each variable for each activity and each subject
The final step in the processing is the analysis itself. The dataset is grouped by SubjectID and Activity, and the remaining columns are summarized with the mean function. The output of the analysis is written to the local file system using write.table with no options other than the defaults; the name of the output file is tidy_data.txt, and it can be read back into R for verification using tidy <- read.table("./tidy_data.txt", header=TRUE) and then viewing the tidy variable.
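The summarisation can be illustrated on a made-up four-row dataset. The script itself uses dplyr (a group_by on SubjectID and Activity followed by a mean summarise); base R's aggregate produces the same result and keeps this sketch dependency-free.

```r
# Made-up data: two subjects, one activity each, two observations per pair.
df <- data.frame(SubjectID = c(1, 1, 2, 2),
                 Activity  = c("WALKING", "WALKING", "SITTING", "SITTING"),
                 val       = c(1, 3, 10, 20))

# One row per (SubjectID, Activity) pair, with the mean of each measurement.
tidy <- aggregate(val ~ SubjectID + Activity, data = df, FUN = mean)

# Write and re-read with the same defaults the script uses.
out <- file.path(tempdir(), "tidy_data.txt")
write.table(tidy, out)
read.table(out, header = TRUE)
```

The means here are 2 for subject 1 / WALKING and 15 for subject 2 / SITTING.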
Note: When reading the data back in, read.table applies make.names() to the column headers by default (check.names=TRUE), which replaces characters that are not valid in R variable names, such as parentheses, with periods. So μ(tBodyAccMeanX) comes back as μ.tBodyAccMeanX. instead. Passing check.names=FALSE to read.table preserves the original names.