Data science PythonFebruary 7, 2016

Convert London Oyster (Travel) PDFs to Pandas DataFrames

As a part of analysing Emily’s allergic rhinitis we want to test whether using the London Underground (notoriously dirty!) increases the likelihood of sneezing. The “black snot” phenomenon is well known to Londoners, possibly the particulates (from oil and metal) cause irritation. You can get updates via our allergic rhinitis analysis mailing list (very very low volume).

Transport for London lets us download a log of journeys – either as a CSV file (just dates and costs, no details) or a PDF file (containing full details of the journey and time). It would be much nicer if they made the data available in a cleanly-formatted open format (e.g. at least a CSV, preferably as HDF5).

The goal is to take the detail-rich PDFs and to build a DataFrame like:

                             from is_train                to
date                                                        
2016-01-30  Bus Journey, Route 46    False                  
2016-01-28           Kentish Town     True  Leicester Square
2016-01-28             Old Street     True      Kentish Town
2016-01-28       Leicester Square     True        Old Street
2016-01-27                  Angel     True      Kentish Town

Using textract (see these Python 3.4 install notes, I also use pdftotext) and a very hacky parser (written this evening, it really is a stateful-messy-hack <sorry>) I can parse a single PDF or a folder to build a Pandas DataFrame of journeys. You’ll find London Oyster PDF to DataFrame Parser here. The output is an HDF5 which can be loaded by Python into Pandas (or R or Matlab or whatever).

Ian is a Chief Interim Data Scientist via his Mor Consulting. Sign-up for Data Science tutorials in London and to hear about his data science thoughts and jobs. He lives in London, is walked by his high energy Springer Spaniel and is a consumer of fine coffees.

14 Comments

tommorris
February 7, 2016 at 11:26 pm
@ianozsvald as someone with rhinitis (albeit non-allergic), please keep doing this.
dsguidebiz
February 7, 2016 at 11:30 pm
Convert London Oyster (Travel) PDFs to Pandas DataFrames https://t.co/zGM7zONWMs
DimensionVtweet
February 7, 2016 at 11:46 pm
Convert London Oyster (Travel) PDFs to Pandas #DataFrames: As a part of analysing Emily's allergic rhinitis... https://t.co/N8vawdUv7g
kyran_dale
February 8, 2016 at 12:54 am
RT @ianozsvald: Convert London Oyster (Travel) PDFs to Pandas DataFrames: As a part of analysing Emily's allergic rhinitis... https://t.co/…
RJDrexel
February 8, 2016 at 2:52 am
Ian Ozsvald: Convert London Oyster (Travel) PDFs to Pandas DataFrames https://t.co/JtdMNdCyrN #python #coding
programmingncr
February 8, 2016 at 2:59 am
Ian Ozsvald: Convert London Oyster (Travel) PDFs to Pandas DataFrames As a part of analysing Emily’s allergic rhi… https://t.co/sz7xpwKTyX
snipe
February 8, 2016 at 3:00 am
Ian Ozsvald: Convert London Oyster (Travel) PDFs to Pandas DataFrames https://t.co/AtPOyfeBUu
__J_L_M__
February 8, 2016 at 3:06 am
#PlanetPython Ian Ozsvald: Convert London Oyster (Travel) PDFs to Pandas DataFrames https://t.co/StKNFKgMF4
planetpython
February 8, 2016 at 4:06 am
Ian Ozsvald: Convert London Oyster (Travel) PDFs to Pandas DataFrames https://t.co/1egqY6Rp3m
patwater
February 8, 2016 at 4:09 am
RT @planetpython: Ian Ozsvald: Convert London Oyster (Travel) PDFs to Pandas DataFrames https://t.co/1egqY6Rp3m
fx86
February 8, 2016 at 5:27 am
RT @ianozsvald: Convert London Oyster (Travel) PDFs to Pandas DataFrames: As a part of analysing Emily's allergic rhinitis... https://t.co/…
pdipao
February 8, 2016 at 7:02 am
#Python Ian Ozsvald: Convert London Oyster (Travel) PDFs to Pandas DataFrames https://t.co/ivdzf7poiM
ManAHLTech
February 8, 2016 at 9:17 am
RT @ianozsvald: Convert London Oyster (Travel) PDFs to Pandas DataFrames: As a part of analysing Emily's allergic rhinitis... https://t.co/…
EvasiumDev
February 8, 2016 at 8:41 pm
Ian ozsvald: convert london oyster (travel) pdfs to pandas dataframes https://t.co/qbK0UpPAqo

Convert London Oyster (Travel) PDFs to Pandas DataFrames

14 Comments

Navigation

Recent Posts

About Ian