GitHub project (Project 3) repository for PDSND

This is the README file for the master branch, now merged with the README of the documentation branch.


README for documentation branch

Explore Bike Share Data

This repository is where the team keeps and modifies the data analysis for this project.

For this project, your goal is to ask and answer three questions about the available bikeshare data from Washington, Chicago, and New York. This notebook can be submitted directly through the workspace when you are confident in your results.

You will be graded against the project Rubric by a mentor after you have submitted. To get you started, you can use the template below, but feel free to be creative in your solutions!

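#load the packages used for data wrangling and plotting
library(dplyr)
library(ggplot2)
#upload the csv files into the environment
ny <- read.csv('new_york_city.csv')
wash <- read.csv('washington.csv')
chi <- read.csv('chicago.csv')
#add column called city to each table
ny['CITY'] = 'New york'
wash['CITY'] = 'Washington'
chi['CITY'] = 'Chicago'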
#combine only same columns of all tables, to ignore the additional two columns from New York table
common_cols <- intersect(intersect(colnames(ny), colnames(wash)),colnames(chi))

#merge all three tables into one single data frame
cities <- rbind(subset(ny, select = common_cols),
                 subset(wash, select = common_cols),
                 subset(chi, select =common_cols))
#create an additional column with the day of the week of each trip's start time

cities$Weekday <- format(as.Date(cities$Start.Time), "%A")
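The weekday column above relies on `as.Date()` dropping the time of day and on `format(..., "%A")` returning the weekday name in the current locale. The sketch below illustrates that behaviour on a few hypothetical timestamps (the values are made up for illustration and assume the same "YYYY-MM-DD HH:MM:SS" shape as `Start.Time`); it also shows why a missing start time ends up as an `NA` weekday.

```r
# Hypothetical sample timestamps, assumed to match the shape of Start.Time
start_time <- c("2017-06-23 15:09:32", "2017-01-01 00:07:57", NA)

# as.Date() keeps only the calendar date; trailing time characters are ignored
trip_date <- as.Date(start_time, format = "%Y-%m-%d")

# "%A" gives the full weekday name in the current locale
weekday_name <- format(trip_date, "%A")

# Locale-independent alternative: 0 = Sunday ... 6 = Saturday
weekday_num <- as.POSIXlt(trip_date)$wday

print(data.frame(start_time, weekday_name, weekday_num))
```

Because `%A` depends on the locale, weekday names may come out in another language on a differently configured machine; the numeric `wday` value does not have that problem.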

Question 1

What is the most common day of the week to start a trip?

#get the count for each day by city
by(cities$Weekday, cities$CITY, summary)
cities$CITY: Chicago
   Friday    Monday  Saturday    Sunday  Thursday   Tuesday Wednesday
     1285      1302      1150      1111      1254      1292      1236
cities$CITY: New york
   Friday    Monday  Saturday    Sunday  Thursday   Tuesday Wednesday
     8168      7570      6176      6597      8729      7898      9632
cities$CITY: Washington
   Friday    Monday  Saturday    Sunday  Thursday   Tuesday Wednesday      NA's
    12926     11721     12133     11566     13204     13288     14212         1
#visualization to get the distribution of day of the week by City
ggplot(data = subset(cities, !is.na(Weekday))) +
  geom_bar(mapping = aes(x=Weekday, fill = CITY),
           position = "dodge") +
  ggtitle("Day of the week trip start date by city")


Chicago riders start most of their trips on Monday (1,302), although every day of the week has a similar number of trip starts. Washington and New York riders start most of their trips on Wednesday. Weekends are the quietest days in Chicago and New York. In Washington, Sunday has the fewest trip starts, but Monday (11,721) is slightly lower than Saturday (12,133). A possible explanation for the quieter weekends is that weekend travel is already under way by then.
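As a quick check of the reading above, the busiest and quietest day per city can be picked out programmatically. This is a minimal sketch that re-enters the counts printed by `by(cities$Weekday, cities$CITY, summary)` by hand; the object names `counts`, `busiest`, and `quietest` are introduced here for illustration only.

```r
days <- c("Friday", "Monday", "Saturday", "Sunday",
          "Thursday", "Tuesday", "Wednesday")

# Counts copied from the summary output above (Washington's single NA is left out)
counts <- rbind(
  Chicago    = c(1285, 1302, 1150, 1111, 1254, 1292, 1236),
  "New york" = c(8168, 7570, 6176, 6597, 8729, 7898, 9632),
  Washington = c(12926, 11721, 12133, 11566, 13204, 13288, 14212)
)
colnames(counts) <- days

# Position of the largest and smallest count in each row, mapped back to a day name
busiest  <- days[apply(counts, 1, which.max)]
quietest <- days[apply(counts, 1, which.min)]
names(busiest) <- names(quietest) <- rownames(counts)

print(busiest)
print(quietest)
```

On the full data frame, the same idea could be applied to `table(cities$CITY, cities$Weekday)` instead of hand-entered counts.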

Question 2

What are the counts of each user type?

#get the summary info for each user type by city
by(cities$User.Type, cities$CITY, summary)

cities$CITY: Chicago
             Customer Subscriber
         1       1746       6883
cities$CITY: New york
             Customer Subscriber
       119       5558      49093
cities$CITY: Washington
             Customer Subscriber
         1      23450      65600
#remove the blank user type shown in previous code
cities1 = cities %>% mutate_if(is.factor,trimws) %>% filter(User.Type!='')
  1. 'Subscriber'
  2. 'Customer'
#create visualization of user type by city
ggplot(data = cities1) +
  geom_bar(mapping = aes(x = CITY, fill = User.Type),
           position = "dodge") +
  ggtitle("User Types by city")


The data frame contains mostly Subscribers and Customers, plus a few rows with a blank user type (1 in Chicago, 119 in New York, 1 in Washington). I removed the blank user type to get a clearer visualization. Results show that all cities have more Subscribers than Customers. Washington has more Customers than any other city (23,450). The difference in Subscribers between Washington and Chicago is 58,717 (65,600 versus 6,883).
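The comparison above can be reproduced from the printed counts. The sketch below re-enters the Customer and Subscriber counts by hand (blank user types excluded) and derives the Subscriber gap and each city's Subscriber share; `user_counts`, `subscriber_share`, and `subscriber_gap` are names chosen for this example, not objects from the notebook.

```r
# Counts copied from the by(cities$User.Type, cities$CITY, summary) output above
user_counts <- rbind(
  Chicago    = c(Customer = 1746,  Subscriber = 6883),
  "New york" = c(Customer = 5558,  Subscriber = 49093),
  Washington = c(Customer = 23450, Subscriber = 65600)
)

# Share of Subscribers among rows with a known user type, per city
subscriber_share <- user_counts[, "Subscriber"] / rowSums(user_counts)

# Absolute difference in Subscribers between Washington and Chicago
subscriber_gap <- user_counts["Washington", "Subscriber"] -
  user_counts["Chicago", "Subscriber"]

print(round(subscriber_share, 3))
print(subscriber_gap)
```

Looking at shares as well as raw counts helps here because the three cities have very different sample sizes.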

Question 3

What is the average travel time for users in different cities?

#get trip duration by city summary
by(cities$Trip.Duration, cities$CITY, summary)
cities$CITY: Chicago
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   60.0   394.2   670.0   937.2  1119.0 85408.0
cities$CITY: New york
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's
     61.0     368.0     610.0     903.6    1051.0 1088634.0         1
cities$CITY: Washington
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's
    60.3    410.9    707.0   1234.0   1233.2 904591.4        1
#visualization as box plot to compare the distribution of trip duration per city
qplot(x = CITY, y = Trip.Duration ,
      data = cities, geom ='boxplot') +
  ggtitle("Trip Duration By city")
Warning message:
“Removed 2 rows containing non-finite values (stat_boxplot).”


#hiding outliers from view: limiting the y axis to 1500
qplot(x = CITY, y = Trip.Duration ,
      data = cities, geom ='boxplot') +
  ggtitle("Trip Duration By city") +
  coord_cartesian(ylim = c(50.0, 1500))
Warning message:
“Removed 2 rows containing non-finite values (stat_boxplot).”


The first graph shows all values by city. Because trip durations range from 60 seconds to 1,088,634 seconds, each city has outliers far above the third quartile (75th percentile), which compress the boxes. In the second graph the y axis is limited to 1,500 seconds to make the boxes readable. Trips in Washington have the highest average duration at 1,234 seconds, while New York has the shortest at 903.6 seconds.
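To give a sense of how far the extreme values sit from the bulk of the trips, the usual box plot rule (points more than 1.5 times the interquartile range above the third quartile are drawn as outliers) can be applied to the quartiles reported above. This is an approximation under the assumption that the `summary()` quartiles are close to the hinges ggplot2 computes from the raw data; the object names are introduced for this example only.

```r
# Quartiles copied from the by(cities$Trip.Duration, cities$CITY, summary) output, in seconds
q1 <- c(Chicago = 394.2,  "New york" = 368.0,  Washington = 410.9)
q3 <- c(Chicago = 1119.0, "New york" = 1051.0, Washington = 1233.2)

# Interquartile range and the upper outlier threshold (Q3 + 1.5 * IQR)
iqr <- q3 - q1
upper_fence <- q3 + 1.5 * iqr

# Reported maximum per city, to compare against the threshold
max_duration <- c(Chicago = 85408.0, "New york" = 1088634.0, Washington = 904591.4)

print(round(upper_fence, 1))
print(max_duration > upper_fence)
```

Note that `coord_cartesian()` only zooms the view; the quartiles and the box positions are still computed from all trips, including the ones above the visible range.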

Date created


Project title

Explore Bike Share Data

Files Used

chicago.csv, washington.csv, new_york_city.csv


