About

This is Ian Ozsvald's blog (@IanOzsvald), I'm an entrepreneurial geek, a Data Science/ML/NLP/AI consultant, author of O'Reilly's High Performance Python book, co-organiser of PyDataLondon, a Pythonista, co-founder of ShowMeDo and also a Londoner. Here's a little more about me.


7 February 2016 - 23:15

Convert London Oyster (Travel) PDFs to Pandas DataFrames

As part of analysing Emily’s allergic rhinitis we want to test whether using the London Underground (notoriously dirty!) increases the likelihood of sneezing. The “black snot” phenomenon is well known to Londoners; possibly the particulates (from oil and metal) cause irritation. You can get updates via our allergic rhinitis analysis mailing list (very, very low volume).

Transport for London lets us download a log of journeys, either as a CSV file (just dates and costs, no details) or as a PDF file (containing full details of each journey and its time). It would be much nicer if they made the detailed data available in a cleanly formatted open format (e.g. at least a CSV, preferably HDF5).

The goal is to take the detail-rich PDFs and to build a DataFrame like:

                             from is_train                to
date                                                        
2016-01-30  Bus Journey, Route 46    False                  
2016-01-28           Kentish Town     True  Leicester Square
2016-01-28             Old Street     True      Kentish Town
2016-01-28       Leicester Square     True        Old Street
2016-01-27                  Angel     True      Kentish Town
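Once the journeys are in this shape, counting Underground trips per day is a short step. The following is only a sketch: it rebuilds the five rows above by hand, and the variable names (`journeys`, `train_trips_per_day`) are made up for illustration rather than taken from the parser.

```python
import pandas as pd

# Hand-built copy of the example rows shown above (illustrative only).
journeys = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2016-01-30", "2016-01-28", "2016-01-28", "2016-01-28", "2016-01-27"]
        ),
        "from": [
            "Bus Journey, Route 46",
            "Kentish Town",
            "Old Street",
            "Leicester Square",
            "Angel",
        ],
        "is_train": [False, True, True, True, True],
        "to": ["", "Leicester Square", "Kentish Town", "Old Street", "Kentish Town"],
    }
).set_index("date")

# Summing the boolean column per date counts the train journeys on each day.
train_trips_per_day = journeys["is_train"].groupby(level="date").sum()
print(train_trips_per_day)
```

A per-day count like this could then be lined up against a daily log of sneezes.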

Using textract (see these Python 3.4 install notes; I also use pdftotext) and a very hacky parser (written this evening, it really is a stateful-messy-hack <sorry>) I can parse a single PDF or a folder of PDFs to build a Pandas DataFrame of journeys. You’ll find the London Oyster PDF to DataFrame Parser here. The output is an HDF5 file which can be loaded by Python into Pandas (or by R or Matlab or whatever).
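To give a flavour of what "stateful" means here, below is a minimal sketch of the idea, not the actual parser. It assumes the text has already been extracted from the PDF and that it follows a made-up layout: a `dd/mm/yyyy` date on its own line, followed by one journey per line. The real statement text is messier, and `DATE_RE` and `parse_journeys` are hypothetical names.

```python
import re

import pandas as pd

# Assumed (hypothetical) layout: a date line, then the journey lines for that date.
DATE_RE = re.compile(r"^\d{2}/\d{2}/\d{4}$")


def parse_journeys(lines):
    """Remember the most recent date line and attach it to each journey below it."""
    records = []
    current_date = None  # the parser's state: the date we are currently "inside"
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if DATE_RE.match(line):
            current_date = pd.to_datetime(line, format="%d/%m/%Y")
            continue
        if current_date is None:
            continue  # skip any header text before the first date
        if " to " in line:
            # Train journeys are assumed to look like "<origin> to <destination>".
            origin, destination = line.split(" to ", 1)
            records.append(
                {"date": current_date, "from": origin, "is_train": True, "to": destination}
            )
        elif line.startswith("Bus Journey"):
            records.append(
                {"date": current_date, "from": line, "is_train": False, "to": ""}
            )
    df = pd.DataFrame(records, columns=["date", "from", "is_train", "to"])
    return df.set_index("date")


sample = """30/01/2016
Bus Journey, Route 46
28/01/2016
Kentish Town to Leicester Square
Old Street to Kentish Town
""".splitlines()

journeys = parse_journeys(sample)
print(journeys)
```

Writing the result out with something like `journeys.to_hdf("journeys.h5", key="journeys")` (the file name and key are placeholders, and PyTables must be installed) gives a file that `pd.read_hdf` can load back.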


Ian applies Data Science as an AI/Data Scientist for companies in ModelInsight, sign-up for Data Science tutorials in London. Historically Ian ran Mor Consulting. He also founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

Tags: Data science, Python