The other day, I hopped on a Citi Bike for the first time in a few months and was surprised – and a little frightened – to discover that I’d mounted an ebike. When had Citi Bike rolled out the electric counterpart to their more familiar, human-powered model?
According to a recent New York Times article, Citi Bike “introduced ebikes in 2018, but removed them in 2019 after reports of brakes malfunctioning, causing rider injuries,” before phasing them back in early last year.
But that’s not the whole story. A Twitter bot that publishes live bike availability numbers tweeted that there were 813 available ebikes on September 9th, a high-water mark at the time:
There are 813 docked ebikes right now. That's the most seen so far. https://t.co/8kUZ9KxVuf
— CitiBike Stats (@CitiBikeStats) September 9, 2020
Then the account went radio silent until mid-December, when it started reporting ebike availability numbers that were almost three times higher:
There are 13338 bikes right now. 2303 of them are electric.
— CitiBike Stats (@CitiBikeStats) December 15, 2020
So, presumably, during this period Citi Bike added thousands of ebikes to their fleet. Perhaps this was a gradual increase, but if there had been a single injection of ebikes, or at least a primary one, would we be able to figure out when?
Well, ebikes go faster, if you hadn’t heard, so maybe there was some sort of signature in the ridership data, which Citi Bike thankfully publishes, that showed bikers going faster after a certain date. In other words, we might expect the average duration between stations to tick down once a greater proportion of the riding population was sitting on top of a motor.
It’s not worth my time or my laptop’s processor to sift through the millions of rides that comprise months’ worth of Citi Bike data. Instead, let’s choose a particularly difficult set of routes that we’d suspect would be made much easier with the boost of an ebike. From experience, I can tell you that ascending the first half of the Williamsburg Bridge as you cross over from Brooklyn to Manhattan is a brutal, uphill push that I wouldn’t wish upon an unassisted rider.
And so we have our goal: examine the various routes that we know crossed that bridge and see if their average duration dipped at some point in our suspected window. This exercise would qualify as change point detection.
Don’t like/know/care about R? Click here to skip to the results.
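To make the idea concrete before touching the real data, here is change point detection in miniature: simulated daily mean durations (all numbers made up) that drop after an unknown day, recovered with a crude least-squares split rather than the Bayesian model built below:

```r
# simulate daily mean durations with a hidden change point (made-up values)
set.seed(1)
n_days <- 100
true_cp <- 60
mu <- ifelse(1:n_days < true_cp, 20, 18) # a two-minute drop after day 60
y <- rnorm(n_days, mean = mu, sd = 0.5)

# crude estimate: the split point that minimizes the within-segment
# sum of squared errors
sse <- sapply(2:(n_days - 1), function(cp) {
  sum((y[1:(cp - 1)] - mean(y[1:(cp - 1)]))^2) +
    sum((y[cp:n_days] - mean(y[cp:n_days]))^2)
})
cp_hat <- which.min(sse) + 1
```

With a signal this strong, the estimate lands essentially on top of the true change point; the harder part, tackled below, is quantifying the uncertainty around it.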
We need ride and station data, both of which Citi Bike provides.
library(tidyverse)
library(rstan)
library(ggmap)
library(lubridate)
library(patchwork)
options(mc.cores = 4)
# download trip files from https://s3.amazonaws.com/tripdata/index.html
trips_files <- paste0(rep(2020:2021, c(5, 1)),
                      str_pad(c(8:12, 1), 2, "left", 0),
                      "-citibike-tripdata.csv")
trips <- map(trips_files, ~read_csv(.x)) %>%
  bind_rows() %>%
  as_tibble(.name_repair = "universal")
# read in station info as well
stations_json <- jsonlite::read_json('https://feeds.citibikenyc.com/stations/stations.json')
stations <- bind_rows(map(stations_json$stationBeanList, ~unlist(.x))) %>%
  mutate_at(c('id', 'latitude', 'longitude'), as.numeric)
We can then identify our stations of interest, which will be within close proximity of either end of the Williamsburg Bridge:
# station ids for those close to bridge on Brooklyn side
stations_wburg <- c(3104, 3103, 3101, 3096, 3098, 3092, 3093, 3090, 3091, 2002,
                    539, 3085, 3088, 3087, 3084, 460, 481, 471, 3082, 3079, 389,
                    282, 2005, 443, 532, 3078, 3080, 3075, 3073, 3071, 3069, 3072,
                    3065, 3066, 3079, 3077, 3089)
# station ids for those close to bridge on Manhattan side
stations_mnhtn <- c(2009, 408, 331, 340, 291, 502, 332, 308, 296, 295, 307, 350,
                    356, 263, 412, 361, 311, 263, 531, 342, 341, 400, 410, 401, 473,
                    265, 312, 302, 150, 301, 393, 411, 266, 339, 394, 317, 432, 445,
                    511, 433, 326, 438, 236, 300, 439, 403, 428, 300, 335, 229, 375,
                    161, 250, 251, 151, 303)
map <- get_map(c(left = -74.000,
                 bottom = 40.695,
                 right = -73.945,
                 top = 40.730))
ggmap(map) +
  geom_point(data = stations %>%
               filter(id %in% c(stations_wburg, stations_mnhtn)),
             aes(x = longitude, y = latitude, fill = id %in% stations_wburg),
             col = "black", pch = 21, size = 3, alpha = 0.7) +
  theme(axis.title = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        legend.position = "none")
For each unique route between a starting station in Brooklyn and an ending one in Manhattan, we take the daily mean tripduration, although we throw away rides longer than a half hour in an effort to weed out bikers who might be meandering through a scenic trip instead of making a mostly direct path between stations.
We also index our dates and our routes, since we will need a way to keep track of them once we begin modeling.
trips_agg <- trips %>%
  filter(start.station.id %in% stations_wburg,
         end.station.id %in% stations_mnhtn,
         # filter out unreasonably long trips
         tripduration < 30*60) %>%
  group_by(date = date(starttime),
           start.station.id, start.station.name,
           end.station.id, end.station.name) %>%
  summarize(trips = n(),
            mean_duration = mean(tripduration)/60) %>%
  ungroup() %>%
  arrange(date) %>%
  # index dates and routes
  mutate(date_idx = as.numeric(as.factor(date)),
         route = paste0(start.station.name, " (", start.station.id, ") > ",
                        end.station.name, " (", end.station.id, ")")) %>%
  mutate(route_idx = as.numeric(factor(route, levels = unique(.$route)))) %>%
  arrange(date_idx, route_idx)
# assign chunks of data to each date
date_idx <- trips_agg %>%
  mutate(rn = row_number()) %>%
  group_by(date_idx) %>%
  summarize(first = min(rn),
            last = max(rn))
This is what our final dataset looks like, with each row providing the number of trips and mean trip duration for a single route on a particular day:
trips_agg %>%
  filter(trips > 2) %>%
  group_by(route_idx) %>%
  sample_n(1) %>%
  ungroup() %>%
  sample_n(5) %>%
  select(date, date_idx, start.station.name, end.station.name, route_idx, trips, mean_duration) %>%
  knitr::kable()
|date       | date_idx|start.station.name      |end.station.name         | route_idx| trips| mean_duration|
|:----------|--------:|:-----------------------|:------------------------|---------:|-----:|-------------:|
|2020-09-05 |       36|Roebling St & N 4 St    |E 7 St & Avenue A        |       315|     3|      20.87778|
|2020-11-11 |      103|Grand St & Havemeyer St |Forsyth St & Canal St    |       331|     3|      24.45000|
|2020-09-27 |       58|S 5 Pl & S 5 St         |Stanton St & Chrystie St |        51|     3|      15.66111|
|2020-11-10 |      102|Union Ave & N 12 St     |Allen St & Hester St     |       566|     3|      20.68889|
|2020-08-02 |        2|Roebling St & N 4 St    |E 5 St & Avenue C        |       191|     3|      22.04444|
On to the model itself. The data we pass to Stan consists of:

- N, the number of rows (or unique route-date combinations)
- N_routes, the number of unique routes
- N_days, the number of unique dates
- mean_duration, the mean trip duration for each route-date
- trips, the number of trips for each route-date
- route, the route for each route-date
We also provide the first and last row in the data for each date, which lets us access observations by date. This workaround is necessary because logical indexing isn’t possible (by default) in Stan, so I can’t simply write something like mean_durations[date == 1], as is commonplace in R:
data {
  int<lower = 1> N; // data size
  int<lower = 1> N_routes; // number of unique routes
  int<lower = 1> N_days; // number of unique dates
  // assigning chunks of data to each day
  int<lower = 1, upper = N> day_first_idx [N_days];
  int<lower = 1, upper = N> day_last_idx [N_days];
  // duration, sample size, and route index for each observation
  vector [N] mean_duration;
  vector [N] trips;
  int<lower = 1, upper = N_routes> route [N];
}
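The effect of day_first_idx and day_last_idx is easiest to see back in R. Because trips_agg is sorted by date_idx, each date’s rows form one contiguous block, so slicing by a start and end row recovers exactly what logical indexing would. A small sketch with a made-up, stand-in data frame:

```r
# made-up stand-in for trips_agg: rows already sorted by date index
trips_agg_toy <- data.frame(
  date_idx = c(1, 1, 2, 2, 2, 3),
  mean_duration = c(14, 16, 18, 20, 19, 15)
)

# first and last row of each date's contiguous block
first <- c(1, 3, 6)
last  <- c(2, 5, 6)

# slicing by row range (what the Stan code does)...
d <- 2
by_range <- trips_agg_toy$mean_duration[first[d]:last[d]]

# ...matches logical indexing (what we'd write in R)
by_logical <- trips_agg_toy$mean_duration[trips_agg_toy$date_idx == d]
```

The sort step earlier in the pipeline is what makes this trick valid: without it, a date’s observations wouldn’t be contiguous and a single start/end pair couldn’t describe them.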
We assume trip durations are normally distributed and seek to estimate the mean (\(\mu_1, \mu_2\)) and spread (\(\sigma_1, \sigma_2\)) of this distribution both before and after the theoretical changepoint. We also know that different routes will have durations that vary around this “grand” trip mean, so we allow for a “random effect” in \(Z\) for each route (with a scale \(\sigma_Z\) that also must be estimated):
parameters {
  // average trip length before/after change
  real mu1;
  real mu2;
  // trip standard deviation before/after change
  real<lower = 0> sigma1;
  real<lower = 0> sigma2;
  // random effects by route
  vector[N_routes] Z;
  real<lower = 0> sigma_Z;
}
Of course, what we’re really after here is the date that the change occurred. But Stan doesn’t allow for direct estimation of latent discrete parameters, so we have to marginalize this parameter out of the posterior. By that, I mean that we can only sample from \(P(Y, \mu, \sigma)\) and not \(P(Y, d, \mu, \sigma)\), so we need a way to get from the latter to the former. Here I’m using \(Y\) as our observed mean trip durations and \(d\) as the latent changepoint date, while momentarily ignoring the random effect parameters \(Z\) and \(\sigma_Z\) for simplicity.
How do we go about that? Well, the chain rule of probability says that we can break the joint distribution into a marginal distribution and a conditional distribution, as shown in the first line below. From there, we can express the conditional as a sum over all the possible values of \(d\) (line 2), and then once again split that joint distribution into a marginal and a conditional (line 3):
\[
\begin{aligned}
P(Y, \mu, \sigma) &= P(\mu, \sigma) \times P(Y \mid \mu, \sigma)\\
&= P(\mu, \sigma) \times \sum_{d \in D} P(Y, d \mid \mu, \sigma)\\
&= P(\mu, \sigma) \times \sum_{d \in D} P(d)\,P(Y \mid \mu, \sigma, d)\\
\end{aligned}
\]
That first term, \(P(\mu, \sigma)\), will already be addressed in the model
block, so we use the transformed parameters
block to also calculate lp, or the log of the term \(P(d)\,P(Y \mid \mu, \sigma, d)\), for each possible date. Note that each element is initialized as -log(N_days), which implies that our prior for any single date being the changepoint is 1/N_days, or a uniform distribution over the available possibilities.
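In log space, which is where Stan actually works, the decomposition above becomes a log-sum-exp over the per-changepoint terms, with each element of lp playing the role of the bracketed summand:

\[
\log P(Y, \mu, \sigma) = \log P(\mu, \sigma) + \log \sum_{d \in D} \exp\big(\underbrace{\log P(d) + \log P(Y \mid \mu, \sigma, d)}_{\texttt{lp[d]}}\big)
\]

This is why lp starts at the log prior and accumulates log densities, and why the model block will later add log_sum_exp(lp) to the target.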
A couple of other implementation notes:

- the random route effects are subtracted out from the beginning in order to isolate the grand mean in mean_duration_ctr
- the data-generating normal distribution has the individual route scale shrunk according to the sample size, since we’re actually observing the distribution of a mean
transformed parameters {
  // initialize each log probability with uniform distribution
  vector[N_days] lp = rep_vector(-log(N_days), N_days);
  // center the observed means at grand mean by removing random effects
  vector [N] mean_duration_ctr = mean_duration - Z[route];
  // loop over each possible changepoint
  for (cp in 1:N_days){
    // calculate log probability for each date
    for (d in 1:N_days){
      // find rows associated with date
      int start = day_first_idx[d];
      int end = day_last_idx[d];
      // mean/scale dependent on whether changepoint has passed
      real mu = d < cp ? mu1 : mu2;
      real sigma = d < cp ? sigma1 : sigma2;
      // add density of observed daily means to log probability
      lp[cp] = lp[cp] + normal_lpdf(mean_duration_ctr[start:end] | mu, sigma ./ sqrt(trips[start:end]));
    }
  }
}
Except…this is unnecessarily slow. We’re constantly recomputing values we already know. Think about moving the changepoint back one day: we should really only be figuring out the new probability of our data at that changepoint, not calculating the density again for all the surrounding days whose distributions are unaffected.
So instead we turn to a dynamic programming solution demonstrated in the Stan manual. The key here is to create two vectors, lp_pre and lp_post, that hold the cumulative log probability at all days under each condition (i.e. \(\mu_1, \sigma_1\) vs. \(\mu_2, \sigma_2\)). Then, to simulate the changepoint occurring at each day, we combine the cumulative log probability under the first condition up until that day (head(lp_pre, N_days)) with the cumulative log probability under the second condition after that day (lp_post[N_days + 1] - head(lp_post, N_days)):
transformed parameters {
  // initialize each log probability with uniform distribution
  vector[N_days] lp = rep_vector(-log(N_days), N_days);
  // center the observed means at grand mean by removing random effects
  vector [N] mean_duration_ctr = mean_duration - Z[route];
  // vectors to hold log probability under each scenario
  vector[N_days + 1] lp_pre;
  vector[N_days + 1] lp_post;
  lp_pre[1] = 0;
  lp_post[1] = 0;
  // calculate log probability for each date
  for (d in 1:N_days) {
    // find rows associated with date
    int start = day_first_idx[d];
    int end = day_last_idx[d];
    // add density of observed daily means to both log probabilities
    lp_pre[d + 1] = lp_pre[d] + normal_lpdf(mean_duration_ctr[start:end] | mu1, sigma1 ./ sqrt(trips[start:end]));
    lp_post[d + 1] = lp_post[d] + normal_lpdf(mean_duration_ctr[start:end] | mu2, sigma2 ./ sqrt(trips[start:end]));
  }
  lp = lp +
    head(lp_pre, N_days) + // log probability up until each possible changepoint
    (lp_post[N_days + 1] - head(lp_post, N_days)); // log probability after each possible changepoint
}
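The algebra behind this speedup can be checked in plain R. Given per-day log densities under each regime (made-up numbers below), the O(N²) double loop and the O(N) cumulative-sum version produce the same lp vector:

```r
set.seed(2)
n_days <- 5
# made-up per-day log densities: log P(day d's data | regime)
ld_pre  <- log(runif(n_days))  # under mu1, sigma1
ld_post <- log(runif(n_days))  # under mu2, sigma2

# naive version: O(N^2) double loop over changepoints and days
lp_naive <- rep(-log(n_days), n_days)
for (cp in 1:n_days) {
  for (d in 1:n_days) {
    lp_naive[cp] <- lp_naive[cp] + if (d < cp) ld_pre[d] else ld_post[d]
  }
}

# dynamic programming version: O(N) via cumulative sums
lp_pre  <- c(0, cumsum(ld_pre))
lp_post <- c(0, cumsum(ld_post))
lp_dp <- -log(n_days) +
  head(lp_pre, n_days) +
  (lp_post[n_days + 1] - head(lp_post, n_days))
```

Each lp_pre[cp] holds the log density of all days before cp, and subtracting lp_post[cp] from the grand total lp_post[n_days + 1] leaves the log density of all days from cp onward, which is exactly what the inner loop was recomputing from scratch.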
We put somewhat informative priors on our parameters in our model
block, since we have an idea of how long it should take to bike over the bridge, and make sure to add lp
to our target, the equivalent of including the \(\sum_{d \in D} P(Y, d \mid \mu, \sigma)\) term in the breakdown of our posterior from above.
model {
  // would expect rides to be around 15 minutes
  mu1 ~ normal(15, 5);
  mu2 ~ normal(15, 5);
  // half-normal prior on scale terms
  sigma1 ~ normal(0, 5);
  sigma2 ~ normal(0, 5);
  // tighter half-normal prior on random effects
  sigma_Z ~ normal(0, 3);
  Z ~ normal(0, sigma_Z);
  // include marginalized latent parameter in posterior
  target += log_sum_exp(lp);
}
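That log_sum_exp call is doing real work: each element of lp is a sum of thousands of log densities, so exponentiating it directly would underflow to zero. Stan provides log_sum_exp as a built-in, but a sketch of the standard stable implementation in R (with a made-up lp_example) shows the trick of factoring out the maximum:

```r
# numerically stable log(sum(exp(lp))): factor out the max before exponentiating
log_sum_exp <- function(lp) {
  m <- max(lp)
  m + log(sum(exp(lp - m)))
}

# made-up log probabilities far too negative for exp() to survive
lp_example <- c(-1000, -1001, -1002)

# the naive computation underflows: exp(-1000) is 0, so log(0) = -Inf
naive <- log(sum(exp(lp_example)))

# the stable version recovers the true value, -1000 + log(1 + e^-1 + e^-2)
stable <- log_sum_exp(lp_example)
```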
And finally, in order to get posterior samples of the changepoint, we need to create them in the generated quantities block since we’re not sampling them directly in the model block:
generated quantities {
  // sample change points from posterior simplex
  int<lower = 1, upper = N_days> cp_sim = categorical_rng(softmax(lp));
}
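Here softmax turns the marginal log probabilities into a simplex and categorical_rng draws one date per posterior iteration. The same two steps, sketched in R with a made-up lp vector standing in for one posterior draw:

```r
# convert log probabilities to a probability simplex, stably
softmax <- function(lp) {
  p <- exp(lp - max(lp))
  p / sum(p)
}

# made-up lp vector for a single posterior draw over four candidate dates
lp_draw <- c(-5, -1, -20, -3)
probs <- softmax(lp_draw)

# draw one changepoint date, weighted by its marginal probability
cp_sim <- sample(seq_along(lp_draw), size = 1, prob = probs)
```

Repeating this draw across all posterior iterations yields a full posterior distribution over the changepoint date, which is what gets plotted below.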
We can now fit this model. It takes a few minutes (with chains run in parallel), but without the dynamic programming workaround it would be intractable.
fit <- stan("/Users/walkerharrison/Downloads/Bikez/bike_CPD.stan",
            data = list(
              N = nrow(trips_agg),
              N_routes = max(trips_agg$route_idx),
              N_days = max(trips_agg$date_idx),
              day_first_idx = date_idx$first,
              day_last_idx = date_idx$last,
              mean_duration = trips_agg$mean_duration,
              trips = trips_agg$trips,
              route = trips_agg$route_idx
            )
)
The top row shows the posterior distribution of the mean and standard deviation of the rides before (blue) and after (red) the supposed change point. It appears that rides got about two minutes shorter at one point, which is consistent with what we might expect from an ebike rollout.
As for the actual timing of the change point, we see that the log probability peaks at October 12th and that posterior samples are extremely concentrated on this date. In fact, I’m a little suspicious of how confident this model is in a single date, although I don’t have much experience with this type of analysis, so my fear might not be warranted.
(Graphing code is hidden, but it’s available in the original .Rmd.)
We can also see the random effects in action by comparing the average durations by route with their respective parameter estimates. In general there’s a clean one-to-one relationship, but we can also see the effect of varying sample sizes, as routes with little data get regressed more while those with lots of rides “earn” more extreme intercepts.
So why did the model pick October 12th? It’s not really feasible to grasp the varying influences of thousands of routes, but we can pick a select few that illustrate the concept. Below are three popular trips that (if you squint) all saw slight downticks in their mean ride duration around the date in question:
This, of course, is just a best guess. It’s more likely that ebikes were released gradually and that the effect, if there was one, accrued over time rather than on a single date. But change point models are a neat concept that requires some creative programming. And who knows…maybe Citi Bike really did just flood New York with ebikes on October 12th and was just waiting for a curious data blogger to reveal their strategy.