Google Data Analytics Capstone Project

Maung Agus Sutikno
6 min readAug 27, 2022

--

For this course 8th module (finally!), I chose case study 1 and case study 3 as my capstone project. Case Study 1 challenges us with a business goal in Divvy, a bike-share fictional company of Cyclist. Another option that I did also is finishing case study 3: researching my own case, which is about sentiment analysis toward cryptocurrency. At this writing, I would like to share about the prior, case study 1, that the technical part has been posted in my Kaggle. To go through the analysis, the writing structure will follow Google Data Analytics framework: ask, prepare, process, analyze, share, and act.

Google Data Analytics certificate in completing 8 modules of courses

As an analyst on this project, we are required to give some recommendations for marketing programs. The financial background is membership schemes are more profitable than casual users, customers who pay once they use the product only. For that reason, the project goals are based on:

1. How do annual members and casual riders use Cyclist bikes differently?
2. Why would casual riders buy Cyclist annual memberships?
3. How can Cyclist use digital media to influence casual riders to become members?

ASK

The business task is finding the right marketing program in order to convert casual riders to become annual members. Deciding which marketing programs that is suitable for changing a group’s preferences requires researches into their behaviors.

PREPARE

In order to do the research, we need to be assisted with some analysis derived from this given dataset: divvy bike-share data. The data consists of columns that can help our analysis; for example started_at and ended_at which explain the time stamp when the user starts using the bicycle and returning it back to the station. Writer decided to have a two year of research period, between April 2020 and March 2022. The data is stored on a monthly basis which will be combined later after conducting data cleaning process.

PROCESS

R programming language or RStudio is used in the data analysis process due to its better compatibility in accommodating statistics compared to other option like SQL or python. Now, we start the process with uploading the data after (of course) the required R libraries have been installed. Following R coding is for uploading the *.CSV file and converting it to a data frame.

apr2020 <- read.csv("../input/divvy-tripdata/202004-divvy-tripdata.csv")str(apr2020)

After uploading and inspecting the data frames for incongruities using R coding above, April 2020 until November 2020 apparently have integer in column start_station_id and end_station_id. In fact, the rest are in character. Therefore, we need to mutate these two columns in the mentioned periods from integer to become a character using the following R coding example.

nov2020 <-  mutate(nov2020, start_station_id = as.character(start_station_id)
,end_station_id = as.character(end_station_id))

ANALYZE

The next steps are stacking or combining the individual monthly data frames into one big data frame, using the following R coding.

all_trips <- bind_rows(apr2020, may2020, jun2020, jul2020, aug2020, sep2020,
oct2020, nov2020, dec2020, jan2021, feb2021, mar2021, apr2021,
may2021, jun2021, jul2021, aug2021, sep2021, oct2021, nov2021,
dec2021, jan2022, feb2022, mar2022)

The next data preparation is modifying the data: adding columns that list the date, month, day, and year of each ride. This will allow us to aggregate ride data for each month, day, or year … before completing these operations we could only aggregate at the ride level.

all_trips$date <- as.Date(all_trips$started_at) #The default format is yyyy-mm-dd
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")

Like Microsoft Excel, we can add calculated fields using R. Here, we add a ride_length calculation to the data.

all_trips$ride_length <- difftime(all_trips$ended_at,all_trips$started_at)

The data frame includes some entries of data when bikes were taken out of docks for quality checking by Divvy. Hence some ride_length are negative. We can create a new version of the data frame (v2) since the data is being removed. This is part of ensuring data’s integrity.

all_trips2 <- all_trips[!(all_trips$start_station_name=="HQ QR"|all_trips$ride_length<0),]
summary(all_trips2) #statistical summary of data for numeric data

From the statistical descriptive output above, we can now see that the negative ride_length has been removed. In addition, the average duration of riding the bicycle is 1,438 seconds or 23.97 minutes. Back to the business question of differentiating between casual riders and annual members; the riding duration average for both is 36.5 minutes and 14 minutes, respectively. It means that casual riders use the bicycle longer than the annual members in one-time usage.

aggregate(all_trips2$ride_length ~ all_trips2$member_casual, FUN = mean)

By analyzing the ridership data by type and weekday as output below, we can highlight at glance that the casual riders, the marketing target group, have higher number of rides on weekend in comparison to weekdays. In profiling purpose, it raises a question: are they tourists?

all_trips2 %>%
mutate(weekday=wday(started_at, label = TRUE)) %>% #creates weekday field using wday()
group_by(member_casual, weekday) %>% #groups by user type and weekday
summarise(number_of_rides = n(), average_duration = mean(ride_length)) %>% #calculates the average duration
arrange(member_casual, weekday) #sorting

SHARE

In this phase, we can get help from data visualization (below) to grasp better the data trend. It confirms our earlier hypothesis that casual riders are weekend riders while tending to be lower usability on Monday till Friday. Referring to the same visualization, we also understand that annual members have opposite behavior from casual riders. They tend to use bicycles less on the weekend. Again, for profiling purposes, it raises the assumption that the annual members are using bicycles to commute between places on a daily basis as if they were students or workers. Lastly, there is a question: is this trend happening within one year?

all_trips2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n()
,average_duration = mean(ride_length)) %>%
arrange(member_casual, weekday) %>%
ggplot(aes(x = weekday, y = number_of_rides, fill = member_casual)) +
geom_col(position = "dodge")

Addressing the question above about one year trend, we can create visualization with R coding and its result below. It shows that the peak in number of rides is in the summer season and the lowest usage is in winter. These trends explain that in winter the annual members, students or workers (as assumed before), choose another option for transportation.

all_trips2 %>%
group_by(member_casual, day_of_week, month) %>%
summarise(number_of_rides = n(), average_duration = mean(ride_length/60)
, .groups = 'drop_last') %>%
arrange(member_casual, day_of_week) %>%
ggplot(aes(x = day_of_week, y = number_of_rides/100000,
fill = member_casual)) +
geom_col(position = "dodge") +
facet_wrap(~month) +
ylab("Number of Rides (1 = 100,000)") +
xlab("Day of Week")

The interesting part we can spot in the graphic is a constant trend for casual riders when in winter. It means we can conclude that some parts of casual riders are not all tourists and apparently they are the the group that the marketing team needs to be aware of and converting them into annual members.

ACT

This phase is a time to draw conclusion and propose the recommendations to the business based on our analysis. Let’s start with the data analysis conclusions:

  • Casual riders use the bicycle longer than the annual members in one-time usage.
  • Casual riders are mostly using the bicycle on the weekend whereas the annual members are weekday users.
  • The bicycles are most optimized used in summer and least used in winter.
  • There is recurring usage from casual riders in winter.

The recommendations based on the findings are:

  • We need to generate all the casual riders in winter to offer them the marketing program of converting to become annual members.

--

--