Wednesday, September 27, 2017
Transit schedule data demystified using GTFS
General Transit Feed Specification (GTFS) is the Google-originated standard format for transit route, stop, trip, schedule, map, and fare data. Everything except realtime.
It's called a feed because it (usually) includes an RSS update for changes.
There are lists of feeds on the Google wiki, and on the separate GTFS data website.
Each organization's GTFS file includes all its services, so some agency files can get pretty big, and they get updated often. Any schedule change or route adjustment means a new release of the entire GTFS file. The file itself is merely a big zipfile containing several CSV files that are, strangely, required to carry a .txt extension.
Here are the contents of Milwaukee County Transit System's GTFS file:
$ unzip -l mcts.zip
Archive: mcts.zip
Length Date Time Name
--------- ---------- ----- ----
169 2014-01-10 05:01 agency.txt
40136 2014-01-10 05:00 calendar_dates.txt
5746 2014-01-10 05:01 routes.txt
307300 2014-01-10 05:00 stops.txt
35198135 2014-01-10 05:00 stop_times.txt
650622 2014-01-10 05:01 trips.txt
8369736 2014-01-10 05:01 shapes.txt
3490 2014-01-10 05:01 terms_of_use.txt
--------- -------
44575334 8 files
Yeah, 44MB unzipped.
But only 5MB zipped. Still not something you want to download every day to your phone.
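The same listing can be produced from Python without unpacking anything. This is a minimal sketch using only the standard library; the function names list_feed and read_table are my own, and it assumes each table starts with a header row, as GTFS requires.

```python
import csv
import io
import zipfile


def list_feed(zip_source):
    """Return (name, uncompressed_size, compressed_size) for each member."""
    with zipfile.ZipFile(zip_source) as feed:
        return [(i.filename, i.file_size, i.compress_size) for i in feed.infolist()]


def read_table(zip_source, member):
    """Read one GTFS table straight from the zipfile as a list of dicts."""
    with zipfile.ZipFile(zip_source) as feed:
        with feed.open(member) as raw:
            # utf-8-sig tolerates a byte-order mark at the start of the file.
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            # skipinitialspace drops the blank that follows some commas.
            return list(csv.DictReader(text, skipinitialspace=True))
```

Calling list_feed("mcts.zip") should give the same names and sizes as unzip -l, and read_table("mcts.zip", "agency.txt") returns the agency rows keyed by column name.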
Let's find a stop at Mitchell International Airport:
$ cat stops.txt | grep AIRPORT
7168,7168,AIRPORT,, 42.9460473, -87.9037345,,,1
7162,7162,AIRPORT & ARRIVALS TERMINAL,, 42.9469597, -87.9030569,,,0
That's right, there are two stops at the airport. Each stop has a latitude and longitude, a unique ID number, and a descriptive name. The final field designates a timepoint (1=Timepoint, 0=Not).
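grep matches anywhere on the line, so a search can hit the wrong column. Here is a hedged Python sketch that matches only the stop name. It assumes the header row uses the standard GTFS column names stop_id, stop_name, stop_lat, and stop_lon (the header isn't shown above), and find_stops is a name I made up.

```python
import csv


def find_stops(lines, text):
    """Yield (stop_id, stop_name, lat, lon) for stops whose name contains text."""
    reader = csv.DictReader(lines, skipinitialspace=True)
    wanted = text.upper()
    for row in reader:
        if wanted in row["stop_name"].upper():
            yield (row["stop_id"], row["stop_name"],
                   float(row["stop_lat"]), float(row["stop_lon"]))
```

Usage would look like list(find_stops(open("stops.txt", newline=""), "airport")).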
Let's try an intersection where two routes cross:
$ cat stops.txt | grep "HOWELL & OKLAHOMA"
709,709,HOWELL & OKLAHOMA,, 42.9882051, -87.9043319,,,1
658,658,HOWELL & OKLAHOMA,, 42.9885464, -87.9045333,,,1
$ cat stops.txt | grep "OKLAHOMA & HOWELL"
5152,5152,OKLAHOMA & HOWELL,, 42.9881561, -87.9046550,,,1
5068,5068,OKLAHOMA & HOWELL,, 42.9883466, -87.9041176,,,1
Here's a problem that will require some logic to solve. I consider the intersection to be one place (not a GTFS term). Many trips and routes can use the same stop. Multiple stops (a GTFS term) can exist at the same place. In this case, northbound, southbound, eastbound, and westbound buses each have a different stop at the same place.
This might make your job easier...or harder.
GTFS cares about trips and stops. It doesn't care that Stops #709 and #5152 are twenty meters apart and serve different routes, or that together they form a transfer point. Nothing in GTFS explicitly links the two stops. Generally, you must figure out the logic to do that yourself. You have the lat/lon and the name to work with.
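One possible way to build that logic is to group stops by distance. The sketch below uses the haversine formula and a 50 meter radius. Both the radius and the function names are my own assumptions, not anything GTFS defines, and a real feed would probably need tuning.

```python
import math


def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance between two points, in meters."""
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def stops_near(stops, stop_id, radius_m=50.0):
    """Return sorted ids of stops within radius_m of stop_id, itself included.

    stops maps stop_id -> (lat, lon).
    """
    lat, lon = stops[stop_id]
    return sorted(sid for sid, (la, lo) in stops.items()
                  if distance_m(lat, lon, la, lo) <= radius_m)
```

With the coordinates shown above, Stops #709 and #5152 come out roughly 25 to 30 meters apart, and all four Howell/Oklahoma stops fall inside the 50 meter radius of #709.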
GTFS does have an optional transfers.txt file that fills in the preferred transfer locations for you. But that's for a more advanced exercise.
Let's see what stops at #709:
$ grep -m 5 ,709, stop_times.txt
4819177_1560,06:21:00,06:21:00,709, 14,,0,0
4819179_1562,06:49:00,06:49:00,709, 14,,0,0
4819180_1563,07:02:00,07:02:00,709, 14,,0,0
4819181_1564,07:15:00,07:15:00,709, 14,,0,0
4819182_1565,07:28:00,07:28:00,709, 14,,0,0
The first five fields are trip_id, arrival_time, departure_time, stop_id, and stop_sequence (#709 is the 14th stop of each of these trips).
Let's see the entire run of trip 4819177_1560:
$ grep 4819177_1560 stop_times.txt
4819177_1560,06:09:00,06:09:00,7162, 2,,0,0 # Hey, look - stops out of sequence in the file
4819177_1560,06:09:00,06:09:00,7168, 1,,0,0 # Begin Trip
4819177_1560,06:11:00,06:11:00,7178, 3,,0,0
[...]
4819177_1560,06:20:00,06:20:00,8517, 13,,0,0
4819177_1560,06:21:00,06:21:00,709, 14,,0,0 # Howell & Oklahoma
4819177_1560,06:22:00,06:22:00,711, 15,,0,0
[...]
4819177_1560,07:17:00,07:17:00,1371, 66,,0,0
4819177_1560,07:19:00,07:19:00,6173, 67,,0,0
4819177_1560,07:20:00,07:20:00,7754, 68,,0,0 # End of trip
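Since the rows are not guaranteed to be in order, it is safer to sort by stop_sequence than to trust file order. A minimal Python sketch follows, assuming the header row names the first five columns trip_id, arrival_time, departure_time, stop_id, and stop_sequence (the standard GTFS names). The function name trip_stops is hypothetical.

```python
import csv


def trip_stops(lines, trip_id):
    """Return [(stop_sequence, departure_time, stop_id)] for one trip, in order."""
    reader = csv.DictReader(lines, skipinitialspace=True)
    rows = [(int(r["stop_sequence"]), r["departure_time"], r["stop_id"])
            for r in reader if r["trip_id"] == trip_id]
    # Sorting tuples orders them by their first element, the sequence number.
    return sorted(rows)
```

Matching on the trip_id column also avoids grep's habit of matching the same digits anywhere on the line.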
We can also look up more information about trip 4819177_1560:
$ grep 4819177_1560 trips.txt
GRE,13-DEC_WK,4819177_1560,N BAYSHORE - VIA OAKLAND-HOWELL METROEXPRESS,0,515111,13-DEC_GRE_0_12
This needs a little more explanation:
- route_id: Green Line (bus)
- service_id (weekday/days-of-service): 13-DEC_WK
- headsign: N BAYSHORE - VIA OAKLAND-HOWELL METROEXPRESS
- direction_id (binary, 0 or 1): 0
- block_id (useful only if the same bus changes routes): 515111
- shape_id (useful for route maps): 13-DEC_GRE_0_12
Let's look up the route_id:
$ grep GRE routes.txt
GRE,MCTS, GRE,MetroEXpress GreenLine,,3,http://www.ridemcts.com/Routes-Schedules/Routes/GRE/,,
The full route name is MetroEXpress GreenLine, it's a bus route (route type 3 = bus), and we have the operator website for it.
Let's look up the service_id:
$ grep -m 10 13-DEC_WK calendar_dates.txt
13-DEC_WK,20140113,1
13-DEC_WK,20140114,1
13-DEC_WK,20140115,1
13-DEC_WK,20140116,1
13-DEC_WK,20140117,1
13-DEC_WK,20140120,1
13-DEC_WK,20140121,1
13-DEC_WK,20140122,1
13-DEC_WK,20140123,1
13-DEC_WK,20140124,1
Ah, this specific trip is a weekday (Monday-Friday) only trip.
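The weekday pattern can be checked rather than eyeballed. This sketch assumes the header row is service_id,date,exception_type, and relies on the GTFS convention that exception_type 1 means service runs on that date (2 means it is removed). This feed has no calendar.txt, so calendar_dates.txt alone defines the service days. The name service_dates is my own.

```python
import csv
import datetime


def service_dates(lines):
    """Map each service_id to the set of dates on which it runs."""
    active = {}
    for row in csv.DictReader(lines, skipinitialspace=True):
        if row["exception_type"] == "1":  # 1 = service added on this date
            day = datetime.datetime.strptime(row["date"], "%Y%m%d").date()
            active.setdefault(row["service_id"], set()).add(day)
    return active
```

For the rows shown above, every date has weekday() below 5, meaning Monday through Friday.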
Let's look up the route map shape for the trip:
$ grep 13-DEC_GRE_0_12 shapes.txt
13-DEC_GRE_0_12, 42.946054, -87.903810,10001
13-DEC_GRE_0_12, 42.946828, -87.903659,10002
13-DEC_GRE_0_12, 42.946824, -87.903588,10003
13-DEC_GRE_0_12, 42.946830, -87.903472,10004
[...]
13-DEC_GRE_0_12, 43.123137, -87.915431,670004
13-DEC_GRE_0_12, 43.123359, -87.915228,670005
13-DEC_GRE_0_12, 43.124016, -87.914535,670006
13-DEC_GRE_0_12, 43.124117, -87.914440,670007
The line for this trip has 520 points. That's pretty detailed.
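To draw the line, the points need to be in sequence order. Here is a small sketch, assuming the header row uses the standard GTFS names shape_id, shape_pt_lat, shape_pt_lon, and shape_pt_sequence. The function name shape_points is made up for illustration.

```python
import csv


def shape_points(lines, shape_id):
    """Return [(lat, lon)] for one shape, ordered by shape_pt_sequence."""
    pts = [(int(r["shape_pt_sequence"]), float(r["shape_pt_lat"]), float(r["shape_pt_lon"]))
           for r in csv.DictReader(lines, skipinitialspace=True)
           if r["shape_id"] == shape_id]
    return [(lat, lon) for _, lat, lon in sorted(pts)]
```

The length of the returned list is the point count for that shape.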
So what do we know?
We know that Stop #709 is served by the GreenLine route, it's the 14th stop in direction 0, it's a bus line, we have all the times the stop is served, and we have the route website. We know the route map and all the other stops of any trip serving that stop.
How can we find the next scheduled bus at stop #709?
One way is to start with all trips that stop at #709 from stop_times.txt.
Since we probably know what time it is, we can filter out all the past times, and most of the future times. This leaves us with a nice, small list of, say, 10 candidates, some of which may be trips that don't run today at all (we must dig deeper to find out).
We can look up each of those trips in trips.txt, and get the route.
Each trip also includes a service_id code. The calendar_dates.txt file tells us which dates each service_id code is valid.
Right, we need to do three lookups.
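The three lookups can be sketched in Python roughly like this. It is not the script timed in this post, just a compact illustration. It resolves the calendar first so that later steps can discard non-running trips early, it assumes standard GTFS header names in each file, and it ignores trips from yesterday's service day that run past midnight. The names to_seconds and next_departures are my own.

```python
import csv


def to_seconds(hhmmss):
    """Convert a GTFS time to seconds after midnight; hours may exceed 23."""
    h, m, s = (int(x) for x in hhmmss.split(":"))
    return h * 3600 + m * 60 + s


def next_departures(stop_ids, now, today, stop_times, trips, calendar_dates, limit=10):
    """Return [(departure_time, route_id, headsign)] for upcoming departures.

    stop_ids: set of stop_id strings; now: "HH:MM:SS"; today: "YYYYMMDD".
    The last three positional arguments are iterables of CSV lines, header first.
    """
    def table(lines):
        return csv.DictReader(lines, skipinitialspace=True)

    # calendar_dates.txt: which service_ids run today?
    running = {r["service_id"] for r in table(calendar_dates)
               if r["date"] == today and r["exception_type"] == "1"}
    # trips.txt: route and headsign for trips on a running service.
    trip_info = {r["trip_id"]: (r["route_id"], r["trip_headsign"])
                 for r in table(trips) if r["service_id"] in running}
    # stop_times.txt: departures at our stops, from now onward.
    cutoff = to_seconds(now)
    found = []
    for r in table(stop_times):
        if r["stop_id"] in stop_ids and r["trip_id"] in trip_info:
            when = to_seconds(r["departure_time"])
            if when >= cutoff:
                found.append((when, r["departure_time"]) + trip_info[r["trip_id"]])
    found.sort()
    return [row[1:] for row in found[:limit]]
```

Passing a set of stop_ids means one call can cover all four stops at an intersection as easily as a single stop.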
The shell code gets a bit complex with three lookups, so I shifted to Python and wrote a basic next-vehicle-at-stop lookup in about 160 lines. Python lists are handy, since they can handle all the stops at a location just as easily as a single stop. Python's zipfile module is also handy, so I can read data directly from the zipfile. But at 13 seconds, Python is probably too slow for this kind of application:
$ time ./next_bus.py
Next departures from Howell & Okahoma
16:16 GRE N AIRPORT - VIA OAKLAND-HOWELL METROEXPRESS
16:22 GRE N BAYSHORE - VIA OAKLAND-HOWELL METROEXPRESS
16:26 51 OKLAHOMA - TO LAKE DRIVE
16:28 51 TO 124TH ST. - VIA OKLAHOMA
16:30 GRE N AIRPORT - VIA OAKLAND-HOWELL METROEXPRESS
16:35 GRE N BAYSHORE - VIA OAKLAND-HOWELL METROEXPRESS
16:43 51 TO 124TH ST. - VIA OKLAHOMA
16:44 GRE N AIRPORT - VIA OAKLAND-HOWELL METROEXPRESS
16:45 51 TO NEW YORK
16:45 GRE N BAYSHORE - VIA OAKLAND-HOWELL METROEXPRESS
16:56 GRE N BAYSHORE - VIA OAKLAND-HOWELL METROEXPRESS
real 0m13.171s # Ugh. If I had started 13 seconds sooner, I wouldn't be bored now.
user 0m10.740s
sys 0m0.260s
All that time crunching the GTFS file has not gone unnoticed.
Trip planners (like Google) pre-process the data, mapping out and caching link-node and transfer relationships, limiting the trip data to the next hour or two (as appropriate), and using rather fancy algorithms to prune the link-node map to a likely set of possibilities before looking at trips along those links.
That's one reason Google Transit is much faster than 13 seconds.
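A tiny taste of the pre-processing idea: parse stop_times once, keep a per-stop list sorted by time, and answer each query with a binary search instead of rescanning 35MB of text. This is only a sketch of the general caching approach, not how any real trip planner works, and the names build_index and upcoming are hypothetical.

```python
import bisect
from collections import defaultdict


def build_index(rows):
    """Group (stop_id, seconds, trip_id) rows into per-stop lists sorted by time."""
    index = defaultdict(list)
    for stop_id, seconds, trip_id in rows:
        index[stop_id].append((seconds, trip_id))
    for departures in index.values():
        departures.sort()
    return index


def upcoming(index, stop_id, now_seconds, limit=5):
    """Return the next few (seconds, trip_id) pairs at a stop via binary search."""
    departures = index.get(stop_id, [])
    # (now_seconds,) sorts before any (now_seconds, trip_id), so ties are kept.
    start = bisect.bisect_left(departures, (now_seconds,))
    return departures[start:start + limit]
```

The index is built once per feed release; each lookup after that touches only one stop's list.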
But that's all advanced stuff.
Also advanced is how to integrate real-time data, which uses one of several different formats. Next time...