A Bayesian approach to predicting Citi Bike rides

Understanding Citi Bike through weather

Citi Bike is the third most used mode of public transportation in New York City, behind the subway and buses. There were 1.2 million Citi Bike trips in February 2020, about 40,000 trips per day. Unlike its larger transit counterparts, Citi Bike daily ridership is difficult to predict because it is a mixture of commuter and leisure riders: the former is tied to weekdays, the latter to the weather.

Given this limited information of just weather and weekday, can Citi Bike ridership be accurately predicted?

Key findings

A Bayesian negative binomial model can predict the number of daily Citi Bike trips using just the weather and the day of the week. It clearly captures the seasonality of bike rides, and it provides a credible interval that covers almost every actual observation. The model suggests there are more trips on weekdays, fewer on rainy days, more as the temperature increases, and fewer as the wind increases. These are all intuitive drivers of Citi Bike trips. However, using point estimates alone, the RMSE of 14,001 is too large for many practical uses.

Analysis process

Citi Bike data

Citi Bike publishes real-time and monthly datasets detailing each bike trip taken since 2013. Information includes the date, start and end times, departure and arrival stations, subscriber status, rider sex, and rider birth year. Ridership has been growing steadily, from an average of 26,238 daily trips in 2013 to 56,306 in 2019. To minimize the impact of this omitted growth variable while maximizing the size of the training set, only data from 2017, 2018, and 2019 is included. The data is then aggregated into daily trip counts. A random sample of 80% is used to train the model and the remaining 20% is held out to evaluate its predictions.

Weather data

The Citi Bike dataset does not contain weather data. Weather information is obtained from the National Oceanic and Atmospheric Administration (NOAA) for the Central Park weather station. Information includes the daily precipitation, average temperature, and maximum gust speed. The weather data is recorded after the fact (ex post), while a practical application of the final model would require a forecast (ex ante), since the model will be used for prediction. This mismatch is acceptable because one-day weather forecasts are quite accurate.

Final dataset

The final dataset consists of 938 observations and five variables.

Variable Type Description
Trip_count integer Count of daily Citi Bike trips
Weekday boolean 1 = weekday, 0 = weekend
Precipitation boolean 1 = rain, 0 = no rain
Temp integer Average daily temperature in Fahrenheit
Gust_speed continuous Maximum gust speed in miles per hour

Model selection

The outcome variable is a simple count of how many bike trips occurred in a single day. Count data is frequently modeled with a Poisson model or the more flexible negative binomial model. The data is assumed not to be zero-inflated, as it seems unlikely there would be many days, if any, with no Citi Bike trips at all.

First, we will fit a negative binomial model, then evaluate it against a Poisson using the expected log predictive density, the effective number of model parameters, and Pareto k values obtained via leave-one-out cross-validation.

The model form in R syntax:

Trip_count ~ Weekday + Precipitation + Temp + Gust_speed

The primary tool for this analysis is brms, an excellent R package for Bayesian generalized models that interfaces with Stan.

Drawing from the prior predictive distribution

Specifying priors is the first step of a Bayesian analysis. Each coefficient, plus the intercept and the shape parameter, needs its own prior. Passing the model specification to brms::get_prior() returns the priors that need to be set along with their default values. These defaults should be replaced with our own priors.
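A sketch of that call, assuming the training data frame is named Citibike_train:

```r
library(brms)

# Lists every parameter that needs a prior, alongside brms's defaults
get_prior(Trip_count ~ Weekday + Precipitation + Temp + Gust_speed,
          data = Citibike_train, family = negbinomial())
```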

Interpreting coefficients (and priors)
Negative binomial models are specified in log counts, so the coefficients need to be interpreted accordingly. For example, if a coefficient estimate is 0.50, then a one-unit increase in that predictor increases the expected log count of bike trips by 0.50, holding everything else constant. When setting priors it is more helpful to think about a coefficient like this:

  • If our outcome variable is 50,000 bike trips and we believe changing only this one predictor will have a positive effect of 33,000 bike trips then the coefficient estimate for this predictor would be: log(83,000) - log(50,000) = 0.50
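That arithmetic can be checked directly (the trip counts are the hypothetical numbers from the bullet above):

```r
# A coefficient of roughly 0.50 on the log scale corresponds to moving from
# 50,000 to 83,000 expected trips (hypothetical numbers from the text)
baseline <- 50000
bumped   <- 83000
coef_est <- log(bumped) - log(baseline)
round(coef_est, 2)  # approximately 0.5

# Going the other way: exponentiating the coefficient gives the
# multiplicative effect on the expected count
baseline * exp(0.5)  # approximately 82,400 trips
```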

Specifying priors

Temperature and gust speed are unlikely to be associated with large effects on the outcome per one-unit increase: temperature is in degrees Fahrenheit and gust speed is in miles per hour, so a one-unit change in either should have only a small effect on trips. Both priors are set to normal(0, 0.01).

Weekday and precipitation, I believe, are likely to be associated with larger effects since they are binary variables. Their effects, however, should be roughly inverse to each other: weekdays have more commuter riders and rainy days have fewer riders overall. The priors are set to normal(0.5, 0.1) and normal(-0.5, 0.1) respectively.

Prior Class Coef
normal(0, 0.01) b Gust_speed
normal(-0.5, 0.1) b Precipitation
normal(0, 0.01) b Temp
normal(0.5, 0.1) b Weekday
normal(10, 1) Intercept
exponential(10) shape
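The table above can be encoded for brms with set_prior(); the resulting priors object is what gets passed to brm() below:

```r
# Encode the priors from the table; coef names match the model formula
priors <- c(
  set_prior("normal(0, 0.01)",   class = "b", coef = "Gust_speed"),
  set_prior("normal(-0.5, 0.1)", class = "b", coef = "Precipitation"),
  set_prior("normal(0, 0.01)",   class = "b", coef = "Temp"),
  set_prior("normal(0.5, 0.1)",  class = "b", coef = "Weekday"),
  set_prior("normal(10, 1)",     class = "Intercept"),
  set_prior("exponential(10)",   class = "shape")
)
```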

We can draw from this model using brms::brm() with the optional arguments family = 'negbinomial' and sample_prior = "only". The resulting estimates are in line with our priors, except that the shape parameter seems a little off. More on that later.

prior_estimate <- brm(Trip_count ~ Weekday + Precipitation + Temp + Gust_speed,
                      data = Citibike_train, family = 'negbinomial',
                      prior = priors, sample_prior = "only")

And finally, we compute the expected value using brms::pp_expect(). The plot below shows these draws from the prior distribution of the conditional expectation are reasonable. Examining the deciles shows the middle 90% covers a reasonable range of data. The values are a little low but close to the actuals: the mean draw is 38,778 whereas the mean daily ridership for 2017-2019 is closer to 50,000.

prior_draws <- pp_expect(prior_estimate, nsamples = 4000)

Conditioning on the observed data

Since the priors are reasonable we can now condition on the data. With brms this is a single call: brms::update() on the model with the argument sample_prior = "no".

post_nb <- update(prior_estimate, sample_prior = "no")

The model converges: Rhat values are all 1.00 and the effective sample sizes are large, ranging from 2,800 to 5,500. Importantly, the estimates are close to the priors.

The largest surprise is that precipitation is associated with a larger effect on the outcome than weekday, which may indicate that the hypothesized weekday commuters have a weaker positive effect than rain's negative effect.

Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 9.41 0.08 9.25 9.58 1.000 4,933 3,338
Weekday 0.24 0.03 0.19 0.30 1.000 4,006 2,828
Precipitation -0.36 0.03 -0.42 -0.30 1.003 4,007 2,830
Temp 0.02 0.00 0.02 0.02 1.000 4,433 3,095
Gust_speed -0.01 0.00 -0.02 0.00 1.000 5,583 2,635
shape 8.39 0.43 7.59 9.24 1.003 4,003 2,773

There are a few drawbacks to these priors. Notably, the shape parameter was misspecified, and the temperature and weekday priors may have been too strong (i.e. the resulting prior distributions are too narrow). Slightly wider priors would have been more appropriate, and the shape prior should have been set with rate = 0.1 instead of rate = 10: a rate of 0.1 corresponds to a mean of 1/0.1 = 10, which was the goal. Below, the darker grey fill shows the posterior estimates and the black lines the prior estimates.

The posterior draws are well within range — these values are, effectively, 4,000 estimates of the model’s prediction for each observation so they should follow the data closely. The middle 90% [26,141, 82,748] fits the data well; the middle 90% of the actual data is [25,881, 75,358]. Additionally, the mean (52,636) and median (49,902) are close to the actual data (51,528 and 53,809 respectively).

Data characteristics

Since we've fit the model we can now examine the data closely. The trip count is roughly normally distributed but slightly left-skewed, with a mean of 51,895. There are more trips on weekdays and fewer on rainy days. Trips are positively correlated (0.77) with temperature and negatively correlated (-0.44) with gust speed. Temperature is left skewed and gust speed is right skewed.

Evaluating the negative binomial model

The model meets all the standard criteria. Executing leave-one-out cross-validation via brms::loo(), the Pareto k estimates are fine, with values less than 0.5 indicating no high-leverage data points. The expected log predictive density (elpd_loo) is approximately -8,300; this will be important when comparing against the Poisson model. And the expected number of parameters in the model, p_loo, is approximately six. Note that looic is just elpd_loo multiplied by -2 and is used similarly.

Estimate SE
elpd_loo -8,330.8 21.51
p_loo 5.7 0.76
looic 16,661.6 43.03

The model estimates are in line with the actual observations, and they clearly capture the seasonality of trips. The credible interval may be wide, but the middle of the estimates is close to the actuals.

Predicting new data

The goal of this exercise is accurate prediction. The data was originally split 80% (757 observations) for training and 20% (181 observations) for testing.

Posterior predictions are made using brms::posterior_predict() with the argument newdata = Citibike_test. Similar to the in-sample estimates, the out-of-sample estimates fit the data well but with a wide credible interval.
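A sketch of the prediction step, assuming the held-out data frame is named Citibike_test as above:

```r
# Posterior predictive draws for each of the 181 held-out days
# (rows are draws, columns are observations)
post_pred <- posterior_predict(post_nb, newdata = Citibike_test)

# Collapse to one point estimate per observation, e.g. the posterior mean
point_est <- colMeans(post_pred)
```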

An alternative model: Poisson

Count data is most frequently associated with the Poisson model, which is a special case of the negative binomial. Below, the model is refitted as a Poisson using the same model form.
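A sketch of the refit (post_pois is a hypothetical name; the earlier coefficient priors could be passed via prior =, minus the shape term, which a Poisson does not have):

```r
# Same formula, Poisson likelihood instead of negative binomial
post_pois <- brm(Trip_count ~ Weekday + Precipitation + Temp + Gust_speed,
                 data = Citibike_train, family = poisson())
```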

Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 9.65 0.001 9.65 9.66 1.001 4,018 3,070
Weekday 0.19 0.000 0.19 0.19 1.001 1,388 1,593
Precipitation -0.29 0.000 -0.29 -0.29 1.003 1,010 1,422
Temp 0.02 0.000 0.02 0.02 1.000 4,124 3,082
Gust_speed -0.01 0.000 -0.01 -0.01 1.002 3,676 2,567

Compared to the negative binomial model, it has a significantly worse ELPD, is more complicated (a considerably larger p_loo), and has a substantial number of large Pareto k values: 532 observations (70%) have Pareto k above 0.5, indicating the posterior distribution is sensitive to those points, and 373 observations (49%) have values over 1.

Poisson Estimate Poisson SE Negative binomial Estimate Negative binomial SE
elpd_loo -1,236,590 64,663 -8,330.8 21.513
p_loo 11,250 483 5.7 0.764
looic 2,473,179 129,326 16,661.6 43.025
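The comparison above can be reproduced with loo (a sketch; post_pois is the hypothetical name of the fitted Poisson model):

```r
loo_nb   <- loo(post_nb)
loo_pois <- loo(post_pois)

# The model in the first row of the output has the highest ELPD
loo_compare(loo_nb, loo_pois)
```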

The Poisson model fits the training data well but is severely overfitting.

Prediction comparison

Simplifying the predictions down to just point estimates, the models look similar. Using the mean estimate for each out-of-sample observation, the root-mean-squared errors (RMSE) of the negative binomial and Poisson models are close. However, the severe overfitting of the Poisson relative to the negative binomial is evident in the plot below.

Model RMSE
Negative binomial 14,001
Poisson 12,605
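For reference, RMSE here is just the root of the mean squared difference between actuals and point predictions; a minimal sketch with made-up numbers:

```r
# Root-mean-squared error between actuals and point predictions
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Toy check with made-up numbers
rmse(c(100, 200, 300), c(110, 190, 310))  # 10
```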

Final thoughts

Between these two models the negative binomial is superior. Negative binomial models allow the variance of the distribution to be larger than the mean. This matters here because the Citi Bike data is over-dispersed: the mean is 51,824 while the variance is 418,294,720.
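The degree of over-dispersion is easy to quantify from those two summary statistics:

```r
# Variance-to-mean ratio; a Poisson implies a ratio of 1
trip_mean <- 51824
trip_var  <- 418294720
trip_var / trip_mean  # roughly 8,000, i.e. heavily over-dispersed
```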

Both models suggest there are more riders on weekdays, fewer riders during rain, more riders as the temperature increases, and slightly fewer riders as the wind speed increases. Overall model performance is mediocre, though. The negative binomial model captures much of the variation due to weather and day of the week, but the RMSE of 14,001 is large enough that the model may not be effective in practice.

Addendum: a frequentist comparison

The frequentist approach to the negative binomial is simple: fit the model via MASS::glm.nb() with the same model form specified above. The coefficient estimates are similar to the Bayesian ones. The standard errors are notably tight and the p-values are effectively zero, which would usually be read as good signs.
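A minimal sketch of the frequentist fit, using the same hypothetical data frame name as above:

```r
library(MASS)

# Same formula, fit by maximum likelihood instead of MCMC
freq_nb <- glm.nb(Trip_count ~ Weekday + Precipitation + Temp + Gust_speed,
                  data = Citibike_train)
summary(freq_nb)
```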

Estimate Std. Error z value Pr(>|z|)
(Intercept) 9.45 0.08 122.47 0.0000
Weekday 0.22 0.02 9.09 0.0000
Precipitation -0.34 0.03 -11.62 0.0000
Temp 0.02 0.00 30.20 0.0000
Gust_speed -0.01 0.00 -3.55 0.0004
shape 10.75

And the RMSE is effectively the same as the Bayesian negative binomial model.

Negative binomial - Frequentist 13,923
Negative binomial - Bayesian 14,001
Poisson - Bayesian 12,605

However, plotting the predictions makes it obvious the model is overfitting tremendously, similar to the Bayesian Poisson model.

A perk of Bayesian

Bayesians treat model parameters as random and the data as fixed, whereas frequentists treat parameters as fixed and the data as random. The Bayesian approach lets us make probabilistic statements about these parameters. For example, what is the probability that Weekday has an associated positive effect greater than 0.25? 0.41.
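With brms that probability is a one-liner over the posterior draws (a sketch; both forms below should agree):

```r
# Posterior probability that the Weekday coefficient exceeds 0.25
hypothesis(post_nb, "Weekday > 0.25")

# Equivalently, computed directly from the draws
mean(posterior_samples(post_nb)$b_Weekday > 0.25)
```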

To wrap it all up, the plot below shows the spread of each model's predictions. The predictions have been standardized so 0 represents a perfect prediction, and the ranges, shaded dark to light green, cover roughly 69%, 96%, and 99% of the predictions for that observation. The methods are not directly comparable between frequentist and Bayesian, but this still gives an idea of the spread of each model's predictions. The Bayesian negative binomial model always covers the actual value, represented by the dark vertical line at x = 0.

May 2020
Find the code here: