Author: Lisa Salmon // Date: October 26, 2014
This file describes the process followed to generate the requested tidy data set from the original set of data files. It includes the following sections:
- Original Data Set - information about the original data set and where to get it
- New Tidy Data Set - the variables contained within the resulting tidy data set
- Process - the steps taken to convert the original data set to the new tidy data set
The original raw data is the "Human Activity Recognition Using Smartphones Data Set" available from the UCI Machine Learning Repository. As stated in the abstract, it is a "Human Activity Recognition database built from the recordings of 30 subjects performing activities of daily living (ADL) while carrying a waist-mounted smartphone with embedded inertial sensors."
The original data set can be downloaded from the UCI Machine Learning Repository.
The original data set includes multiple files and folders, with test data separated from training data. Each of the test and training sets includes three separate files for the activity measurements, the activity labels, and the subject performing the activity.
The downloaded data extracts into the following folders and files:
- activity_labels.txt
- features_info.txt
- features.txt
- README.txt
- test/subject_test.txt
- test/X_test.txt
- test/y_test.txt
- test/Inertial Signals/* (multiple files, not used)
- train/subject_train.txt
- train/X_train.txt
- train/y_train.txt
- train/Inertial Signals/* (multiple files, not used)
The original feature set includes 10,299 observations of 561 separate measurements, each associated with one of 30 subjects and one of 6 activities.
The new tidy data set includes a subset of just 66 feature measurements from the original -- only the standard deviation and mean value measurements, as requested in the assignment specification. After reading the original data description, this was interpreted to mean any measurement whose name includes the exact character string "mean()" or "std()".
The final tidy data set includes the overall average for each of the 66 feature measurements grouped by subject AND activity. This results in a table with 68 columns (subject + activity + 66 feature averages) and 180 rows (30 subjects X 6 activities).
Subjects are indicated by number, in the range 1 - 30.
Activities are indicated by one of six character strings:
- WALKING
- WALKING_UPSTAIRS
- WALKING_DOWNSTAIRS
- SITTING
- STANDING
- LAYING
According to the information that came with the original data set, the feature measurements come from the accelerometer and gyroscope 3-axial raw signals. These time domain signals were captured at a constant rate of 50 Hz, and the corresponding feature names are prefixed in our tidy data set with the string Time_.
Accelerometer readings are indicated in the tidy data set feature name by Acc and gyroscope readings are indicated by Gyro. The 3-axial signals contain a suffix of either -X, -Y or -Z indicating which direction the measurement is for.
In the original data, the acceleration signal was then separated into body and gravity acceleration signals, indicated in the tidy data set by Body and Gravity in the feature names. The body linear acceleration and angular velocity were also derived in time to obtain Jerk signals (indicated by Jerk in the feature name). The magnitude of these three-dimensional signals was calculated using the Euclidean norm (indicated by Mag in the feature name).
Finally, again as stated in the original data, a Fast Fourier Transform (FFT) was applied to some of the signals. In our tidy data set, these measurement names are preceded by Freq_ to indicate that the value is in the frequency domain.
The full resulting set of column names in the tidy data set is:
Subject
ActivityName
Time_BodyAcc_MeanValue-X
Time_BodyAcc_MeanValue-Y
Time_BodyAcc_MeanValue-Z
Time_BodyAcc_StdDeviation-X
Time_BodyAcc_StdDeviation-Y
Time_BodyAcc_StdDeviation-Z
Time_GravityAcc_MeanValue-X
Time_GravityAcc_MeanValue-Y
Time_GravityAcc_MeanValue-Z
Time_GravityAcc_StdDeviation-X
Time_GravityAcc_StdDeviation-Y
Time_GravityAcc_StdDeviation-Z
Time_BodyAccJerk_MeanValue-X
Time_BodyAccJerk_MeanValue-Y
Time_BodyAccJerk_MeanValue-Z
Time_BodyAccJerk_StdDeviation-X
Time_BodyAccJerk_StdDeviation-Y
Time_BodyAccJerk_StdDeviation-Z
Time_BodyGyro_MeanValue-X
Time_BodyGyro_MeanValue-Y
Time_BodyGyro_MeanValue-Z
Time_BodyGyro_StdDeviation-X
Time_BodyGyro_StdDeviation-Y
Time_BodyGyro_StdDeviation-Z
Time_BodyGyroJerk_MeanValue-X
Time_BodyGyroJerk_MeanValue-Y
Time_BodyGyroJerk_MeanValue-Z
Time_BodyGyroJerk_StdDeviation-X
Time_BodyGyroJerk_StdDeviation-Y
Time_BodyGyroJerk_StdDeviation-Z
Time_BodyAccMag_MeanValue
Time_BodyAccMag_StdDeviation
Time_GravityAccMag_MeanValue
Time_GravityAccMag_StdDeviation
Time_BodyAccJerkMag_MeanValue
Time_BodyAccJerkMag_StdDeviation
Time_BodyGyroMag_MeanValue
Time_BodyGyroMag_StdDeviation
Time_BodyGyroJerkMag_MeanValue
Time_BodyGyroJerkMag_StdDeviation
Freq_BodyAcc_MeanValue-X
Freq_BodyAcc_MeanValue-Y
Freq_BodyAcc_MeanValue-Z
Freq_BodyAcc_StdDeviation-X
Freq_BodyAcc_StdDeviation-Y
Freq_BodyAcc_StdDeviation-Z
Freq_BodyAccJerk_MeanValue-X
Freq_BodyAccJerk_MeanValue-Y
Freq_BodyAccJerk_MeanValue-Z
Freq_BodyAccJerk_StdDeviation-X
Freq_BodyAccJerk_StdDeviation-Y
Freq_BodyAccJerk_StdDeviation-Z
Freq_BodyGyro_MeanValue-X
Freq_BodyGyro_MeanValue-Y
Freq_BodyGyro_MeanValue-Z
Freq_BodyGyro_StdDeviation-X
Freq_BodyGyro_StdDeviation-Y
Freq_BodyGyro_StdDeviation-Z
Freq_BodyAccMag_MeanValue
Freq_BodyAccMag_StdDeviation
Freq_BodyAccJerkMag_MeanValue
Freq_BodyAccJerkMag_StdDeviation
Freq_BodyGyroMag_MeanValue
Freq_BodyGyroMag_StdDeviation
Freq_BodyGyroJerkMag_MeanValue
Freq_BodyGyroJerkMag_StdDeviation
These are the steps followed in the run_analysis.R script to transform the original data into the tidy data set.
Before running the script, the original data files must first be downloaded and extracted into a folder called data, located in the same working directory as the run_analysis.R file.
####1. Read the 6 relevant data files from the original data set into 6 different data frames.####
test/subject_test.txt > testSubjects
test/X_test.txt > testFeatures
test/y_test.txt > testLabels
train/subject_train.txt > trainSubjects
train/X_train.txt > trainFeatures
train/y_train.txt > trainLabels
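As an illustration, a minimal sketch of this step in R; the data/ path prefix follows the folder layout described above, and the exact read.table() arguments used in run_analysis.R may differ:

```r
# Read the six relevant files into data frames (the original files have no header row)
testSubjects  <- read.table("data/test/subject_test.txt",   header = FALSE)
testFeatures  <- read.table("data/test/X_test.txt",         header = FALSE)
testLabels    <- read.table("data/test/y_test.txt",         header = FALSE)
trainSubjects <- read.table("data/train/subject_train.txt", header = FALSE)
trainFeatures <- read.table("data/train/X_train.txt",       header = FALSE)
trainLabels   <- read.table("data/train/y_train.txt",       header = FALSE)
```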
####2. Combine the test and training sets into 3 data frames which include all observations.####
The same order is maintained in all sets, so observations can be correctly associated across all data frames.
testSubjects + trainSubjects = allSubjects
testFeatures + trainFeatures = allFeatures
testLabels + trainLabels = allLabels
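A sketch of the combination step; rbind() is one way to stack the rows while preserving their order:

```r
# Stack test rows followed by training rows; row order is preserved,
# so row i refers to the same observation in all three combined frames
allSubjects <- rbind(testSubjects, trainSubjects)
allFeatures <- rbind(testFeatures, trainFeatures)
allLabels   <- rbind(testLabels,   trainLabels)
```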
####3. Create a subsetFeatures data frame which contains ONLY the standard deviation and mean value measurements.####
First, the feature names are read from the features.txt file into a data frame, featureNames. The grepl() function is then used to create a logical vector indicating which of the original feature names include either -std() or -mean().
This logical vector is used to extract the matching columns from the allFeatures data frame, as well as the corresponding original column names from the featureNames data frame.
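A sketch of this step, assuming literal (fixed) matching so that -mean() does not also match -meanFreq(); subsetFeatureNames and wantedCols are illustrative names, not necessarily the ones used in run_analysis.R:

```r
# Read the 561 feature names; the second column holds the name strings
featureNames <- read.table("data/features.txt", header = FALSE,
                           stringsAsFactors = FALSE)

# Logical vector: TRUE where the name contains the literal "-mean()" or "-std()"
wantedCols <- grepl("-mean()", featureNames[, 2], fixed = TRUE) |
              grepl("-std()",  featureNames[, 2], fixed = TRUE)

# Keep only those 66 measurement columns and their original names
subsetFeatures     <- allFeatures[, wantedCols]
subsetFeatureNames <- featureNames[wantedCols, 2]
```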
####4. Clean up the feature names to be easier to read/understand.####
This step includes a cleanColName() function that takes a column name from the original data and returns an easier-to-read tidy version, as described in the Features section above.
A for loop runs through each element of the featureNames vector, passes it to the cleanColName function, and populates a cleanNames vector with the returned value, maintaining order.
The new clean names are set as the column names of the subsetFeatures data frame, using the colnames() function.
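An illustrative version of the renaming; the actual cleanColName() in run_analysis.R may be implemented differently, but a gsub()-based sketch that produces the names listed above looks like this (subsetFeatureNames carries over from the sketch in step 3):

```r
# Convert an original feature name, e.g. "tBodyAcc-mean()-X",
# into the tidy form "Time_BodyAcc_MeanValue-X"
cleanColName <- function(name) {
  name <- sub("^t", "Time_", name)                 # time-domain prefix
  name <- sub("^f", "Freq_", name)                 # frequency-domain prefix
  name <- gsub("BodyBody", "Body", name)           # fix doubled "Body" in some original names
  name <- gsub("-mean\\(\\)", "_MeanValue", name)
  name <- gsub("-std\\(\\)",  "_StdDeviation", name)
  name
}

# Build the clean names in order and apply them as column names
cleanNames <- character(length(subsetFeatureNames))
for (i in seq_along(subsetFeatureNames)) {
  cleanNames[i] <- cleanColName(subsetFeatureNames[i])
}
colnames(subsetFeatures) <- cleanNames
```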
####5. Add activity names to each observation.####
The original allLabels data contains the activity number for each observation, but we want to change this to be the activity name and add it as a new column in our subset data frame. The 6 activity names are read into a table from the activity_labels.txt file and converted into an activityNameVector vector of length six, where the order of the vector matches the activity number in the allLabels data.
A for loop runs through each element of the allLabels number values, and a new labelNameVector is populated using the activityNameVector key that matches the number.
Using the cbind() function, the subsetFeatures columns and the labelNameVector values are combined into a new final data frame called subsetData. The column name for the activity names is set to "ActivityName".
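A sketch of this step; the for loop mirrors the description above (indexing the name vector directly, activityNameVector[allLabels[, 1]], would give the same result), and activityTable is an illustrative name:

```r
# Read the six activity labels and keep the name column,
# ordered by activity number (1 = WALKING, ..., 6 = LAYING)
activityTable      <- read.table("data/activity_labels.txt",
                                 stringsAsFactors = FALSE)
activityNameVector <- activityTable[, 2]

# Translate each observation's activity number into its name
labelNameVector <- character(nrow(allLabels))
for (i in seq_len(nrow(allLabels))) {
  labelNameVector[i] <- activityNameVector[allLabels[i, 1]]
}

# Attach the names as a new column of the subset data frame
subsetData <- cbind(subsetFeatures, ActivityName = labelNameVector)
```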
####6. Add subject numbers to the subset data frame.####
The allSubjects number column is added to the subsetData data frame as the first column, and the column name is set to "Subject".
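A sketch of this step, assuming the subject numbers are in the first (and only) column of allSubjects:

```r
# Prepend the subject numbers as the first column, named "Subject"
subsetData <- cbind(Subject = allSubjects[, 1], subsetData)
```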
At this point all the separate data from the multiple original data files has been combined into one tidy data set, with human-readable activity names, and column names that are also easier to read than before.
####7. Create a second tidy data set that contains the average of all feature measurements, grouped by subject AND activity.####
Finally, using the plyr library, the following code is run to find the average of all numeric columns, grouped by subject and activity:
averages <- ddply(subsetData, c("Subject", "ActivityName"), numcolwise(mean))
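For completeness, plyr must be loaded before the call above, and the dimensions of the result can be compared with the description in the New Tidy Data Set section; this check is illustrative and not part of run_analysis.R:

```r
library(plyr)   # must be loaded before the ddply() call above

# Expected size, per the description above:
# 180 rows (30 subjects x 6 activities) and 68 columns
# (Subject + ActivityName + 66 feature averages)
dim(averages)   # should return: 180 68
```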