:HIPSTER_DEV_BLOG

Another Octopress blog about programming and infrastructure.

NY Taxi Data Visualized

Recently a massive dataset of NYC Taxi Data was made public. There are torrents available but at 19gb the data can be quite unwieldy to manage on a home machine. /r/BigQuery have uploaded the dataset to Google’s BigQuery service.

BQ provides a simple way to get insights out of this dataset without tearing through your internet usage or waiting for your home machine to query 173 million records. For example on reddit they have already discovered some anonymization issues.

I’ve taken some of the popular Queries and charted them.

Histogram of tips as a % of fare.

1
2
3
4
5
6
SELECT INTEGER(ROUND(FLOAT(tip_amount) / FLOAT(fare_amount) * 100)) tip_pct,
  count(*) trips
FROM [833682135931:nyctaxi.trip_fare]
WHERE payment_type='CRD' and float(fare_amount) > 0.00
GROUP BY 1
ORDER BY 1

Average Speed Over Hour.

1
2
3
4
5
6
7
8
9
10
11
12
SELECT
  HOUR(TIMESTAMP(pickup_datetime)) as hour,
  ROUND(AVG(FLOAT(trip_distance)/FLOAT(trip_time_in_secs)*60*60)) AS speed
FROM
  [833682135931:nyctaxi.trip_data]
WHERE
  INTEGER(trip_time_in_secs) > 10
  AND FLOAT(trip_distance) < 90
GROUP BY
  hour
ORDER BY
  hour;

Average Tip Over Month.

1
2
3
4
5
6
SELECT INTEGER(AVG(tip_amount)*100)/100 avg_tip,
  REGEXP_EXTRACT(pickup_datetime, "2013-([0-9]*)") month
FROM [833682135931:nyctaxi.trip_fare]
WHERE payment_type='CRD'
GROUP BY 2
ORDER BY 2

Comments