About

This is Ian Ozsvald's blog (@IanOzsvald). I'm an entrepreneurial geek, a Data Science/ML/NLP/AI consultant, founder of the Annotate.io social media mining API, author of O'Reilly's High Performance Python book, co-organiser of PyDataLondon, co-founder of the SocialTies App, author of the A.I.Cookbook, author of The Screencasting Handbook, a Pythonista, co-founder of ShowMeDo and FivePoundApps and also a Londoner. Here's a little more about me.

7 February 2016 - 23:15 Convert London Oyster (Travel) PDFs to Pandas DataFrames

As part of analysing Emily’s allergic rhinitis we want to test whether using the London Underground (notoriously dirty!) increases the likelihood of sneezing. The “black snot” phenomenon is well known to Londoners; possibly the particulates (from oil and metal) cause irritation. You can get updates via our allergic rhinitis analysis mailing list (very very low volume).

Transport for London lets us download a log of journeys – either as a CSV file (just dates and costs, no details) or a PDF file (containing full details of the journey and time). It would be much nicer if they made the data available in a cleanly-formatted open format (e.g. at least a CSV, preferably as HDF5).

The goal is to take the detail-rich PDFs and to build a DataFrame like:

                             from is_train                to
date                                                        
2016-01-30  Bus Journey, Route 46    False                  
2016-01-28           Kentish Town     True  Leicester Square
2016-01-28             Old Street     True      Kentish Town
2016-01-28       Leicester Square     True        Old Street
2016-01-27                  Angel     True      Kentish Town

Using textract (see these Python 3.4 install notes, I also use pdftotext) and a very hacky parser (written this evening, it really is a stateful-messy-hack <sorry>) I can parse a single PDF or a folder to build a Pandas DataFrame of journeys. You’ll find London Oyster PDF to DataFrame Parser here. The output is an HDF5 which can be loaded by Python into Pandas (or R or Matlab or whatever).
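As a rough illustration of the parsing step, here is a minimal sketch that turns extracted text lines into rows shaped like the table above. The line layout (`DD/MM/YYYY <from> to <to>` for train journeys, `DD/MM/YYYY Bus Journey, Route N` for buses), the regular expressions and the `parse_lines` helper are all assumptions made for this example; it is not the parser linked above, and real pdftotext output will need its own patterns.

```python
import re
from datetime import datetime

# Hypothetical line layouts, assumed for illustration only.
BUS = re.compile(r"^(\d{2}/\d{2}/\d{4})\s+(Bus Journey, Route \S+)$")
TRAIN = re.compile(r"^(\d{2}/\d{2}/\d{4})\s+(.+?)\s+to\s+(.+)$")


def parse_lines(lines):
    """Return one dict per recognised journey line; other lines are skipped."""
    rows = []
    for line in lines:
        line = line.strip()
        bus = BUS.match(line)
        train = TRAIN.match(line)
        if bus:
            date, origin = bus.groups()
            destination, is_train = "", False
        elif train:
            date, origin, destination = train.groups()
            is_train = True
        else:
            continue  # page headers, totals and similar noise
        rows.append({
            "date": datetime.strptime(date, "%d/%m/%Y").date(),
            "from": origin,
            "is_train": is_train,
            "to": destination,
        })
    return rows


# Made-up sample text, not taken from a real statement.
sample = [
    "Journey history",
    "30/01/2016 Bus Journey, Route 46",
    "28/01/2016 Kentish Town to Leicester Square",
]
rows = parse_lines(sample)
```

With Pandas installed, `pandas.DataFrame(rows).set_index("date")` gives a DataFrame in the shape shown earlier, and `DataFrame.to_hdf` can write it to HDF5 (this needs PyTables).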


Ian applies Data Science as an AI/Data Scientist for companies in ModelInsight, sign-up for Data Science tutorials in London. Historically Ian ran Mor Consulting. He also founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

5 Comments | Tags: Data science, Python

25 January 2016 - 21:27PyDataLondon 2016 Call for Proposals Open

Our Call for Proposals for PyDataLondon 2016 (May 6-8) is open until approximately the end of February (5ish weeks), so you need to get your submission in soon!

If you want to sponsor so that you can talk with 330 cutting-edge data scientists, you’d better hurry: we’ve already started signing deals.

In the CfP we’re looking for:

  • Stories about successful data science projects (including the highs and lows)
  • Machine learning (including Deep Learning) – especially why you used certain algorithms and how you diagnosed features
  • Visualisation – have you explained or explored something that’s good to share?
  • Data cleaning
  • Data process (getting data, understanding it, building models, deploying solutions)
  • Industrial and Academic stories
  • Big data including Spark

You might also be interested in PyDataAmsterdam on March 12-13th (their Call for Proposals is already open).

We’ve also got a new (temporary URL) webpage for our regular meetups here. It has notes on how to submit a talk to the meetup (not the conference, just the PyDataLondon meetup). Please take a look if you’d like to speak to 200 folk at our monthly meetup.


22 Comments | Tags: Data science, pydata, Python

12 January 2016 - 11:28 Data Scientist Jobs in London

Back in January 2015 I announced my Data Science Jobs UK email list. This has grown nicely: several hundred data scientists have joined it and are interested in (mostly) Python-related jobs around London, with an even split between contract and permanent roles. If you sign up to the mailing list you’ll get:

  • 1-2 plain-ASCII mails a month with a summary of current jobs (typically 4-6), mostly focused around London
  • Sometimes the jobs are remote
  • Mostly they’re for Python but Matlab and R also come up

I manage the list. Your email is never shared and the list is run by Mailchimp, so you can easily unsubscribe. Active data scientists who attend PyDataLondon can post for free; others (e.g. recruiters and folk in companies) can post at a commercial rate. I vet all the jobs to ensure they’re relevant. Drop me an email if you’ve got a relevant job to share.

“After placing a contract ad on this list I was contacted by a number of high quality and enthusiastic data scientists, who all proposed innovative and exciting solutions to my research problem, and were able to explain their proposals clearly to a non-specialist; the quality of responses was so high that I was presented with a real dilemma in choosing who to work with”. – Hazel Wilkinson, Cambridge University

I put the list together to help local data scientists find more relevant jobs. Feel free to dip in and out when it might be useful.


3 Comments | Tags: Data science, Python

11 January 2016 - 23:57 Allergic Rhinitis (“Why do I always sneeze?!”) research project using Machine Learning

Since April my wife (@fluffyemily) and I have been running a research project around her allergies. She sneezes all year and we’re trying to figure out the cause. Allergic Rhinitis affects 10-30% of Westerners. In Emily’s case it is all-year, so it isn’t just pollen related. We figure that a good data-collection process coupled with robust analysis might reveal some of the causes of sneezing, such that Emily’s in better control of her Rhinitis.

Emily’s a senior iOS developer with Mozilla. She wrote an open source App for her iPhone to log her sneezes, antihistamine use and interactions with “things” like animals. The App gives us a time-stamp and geolocation. Since she’s mostly in London we’ve got a rich source of events to join to other datasets.

This post is just to put down a marker. I’ve made some progress using Machine Learning to predict when an antihistamine might be used. Currently I can out-predict a Dummy (majority-class) classifier over many cross-validation runs. This is hardly brilliant, but we didn’t expect diagnosing a long-term allergy to be a simple affair! Exploratory data analysis shows lots of interesting behaviours, and I hope to talk about some of these in the future.
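For anyone unfamiliar with the baseline mentioned above, here is a minimal sketch of a majority-class “classifier” scored with k-fold cross-validation. The labels are made up for illustration (they are not Emily’s data) and `majority_baseline_cv` is a hypothetical helper written for this example. In scikit-learn the same baseline is available as `DummyClassifier(strategy="most_frequent")`.

```python
from collections import Counter


def majority_baseline_cv(labels, n_folds=5):
    """Accuracy per fold when always predicting the training folds' majority class."""
    scores = []
    for k in range(n_folds):
        # Every n_folds-th item (offset k) is held out as the test fold.
        test = labels[k::n_folds]
        train = [y for i, y in enumerate(labels) if i % n_folds != k]
        majority = Counter(train).most_common(1)[0][0]
        scores.append(sum(y == majority for y in test) / len(test))
    return scores


# Made-up labels: 1 = antihistamine taken that day, 0 = not taken.
labels = [0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
fold_scores = majority_baseline_cv(labels)
baseline_accuracy = sum(fold_scores) / len(fold_scores)
```

Any real model has to beat `baseline_accuracy` consistently across folds before its predictions mean much. With imbalanced classes a model can look accurate while doing no better than this.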

We’ve tried (and so far rejected) airborne particulates as a reason for her allergies via King’s College London’s LondonAir data (thanks!). Weather data from a local wunderground station is more promising (Emily seems to be a little sensitive to humidity and windspeed). I’ve recently started work on MyFitnessPal logged data (the Python 3.4 port was thankfully easy) to start to look at alcohol (a known histamine modifier) and possibly other food.

Behind the scenes I’ve got a collaborative group (thanks Frank and Giles!) in Slack and a private github repo, and I plan to talk a little about how this works. I think talking about ways we can collaborate on research projects has value. Anything that helps us move on from just working in an office seems like a good idea.

If you’re interested in hearing updates about this project and maybe getting involved to log your own allergy data, join this email announce list. Your email will be kept private, I’ll just send you an email every now and again when we’ve made some progress (which will probably appear here) and when we need volunteers.

Ultimately we’d like to help predict the causes of allergies for other folk. We’ve been talking about this for around 2 years, and it is encouraging to see research like this pointing to the use of ML to predict and model the body’s behaviours.


10 Comments | Tags: Data science, Life, Python

10 January 2016 - 23:08 Announcing PyDataLondon 2016 (May 6-8th)

We’re very happy to announce that Bloomberg will host us a second time for PyDataLondon 2016 (our 3rd annual conference). We’ll run the conference over May 6-8th (a tutorial day and 2 conference days as last time) with approximately 330 people in attendance. The location is Central London – near Bank underground station and London Bridge.

Our PyDataLondon meetup community has grown amazingly in the last year: we’ve almost doubled in size to 2,500+ members with 200 in the room each month. We’ve had 19 events in almost 2 years, mostly around Python (some with R, Julia and Matlab), mostly on data science (and stats, visualisation and high performance) and all with a lovely collaborative audience.

The conference Call for Proposals will be opened very soon (in a week or two). If you’d like to speak in front of 330 active data scientists in London’s most active data science community, get thinking on your topic. We’re interested in data science topics, mostly around Python (but we’re cool with other tech and theory). Extra attention will be paid to talks offering real-world stories (for both success and failure – all lessons are equally useful).

Sign-up to this email announce list to be kept in the loop, I’ll write a couple of mails when the CfP is open and as the conference plans develop.

If you’ve not been to one of our conferences before, check out my write-ups from 2015 and 2014.

If you’re hiring or you have a relevant product – think on sponsoring. We expect to sell all of our spots this year due to increased demand for strong data scientists – if you’d like to have a prime spot in the central room (all the talk-rooms hang off of the central room so sponsors are in the thick of it), do get in contact.

You might also be interested in PyDataAmsterdam on March 12-13th (their Call for Proposals is already open).


17 Comments | Tags: Data science, pydata, Python

7 December 2015 - 23:59 “Data Science Delivered” (a collection of notes on getting stuff shipped)

Over the last year I’ve given a collection of keynotes and talks around shipping and supporting data science products with Python. I’ve started to gather up my notes into a document. They’re hosted on github as Data Science Delivered, and currently it’s around 5 pages of A4. I put the rough form together after my last keynote of the year in Budapest.

Right now it has notes on how to approach a new project, ways of dealing with bad data, ways to ship working products and ways projects might get sunk.

I’m slowly going to add to this list. I think the rough structure is in place and there’s a lot of detail to add. If you’re interested in getting updates then add your email here and I’ll mail you on occasion when I’ve added a new chunk of information.


7 Comments | Tags: Data science, pydata, Python

6 November 2015 - 18:24 “Featherweight” data science API to publish Python functions on the web

One of the challenges I’ve encountered when coaching data science teams in smaller organisations is the difficulty of publishing proof-of-concept data science products via web calls, when the team doesn’t know anything about web programming. My preference is to use Flask (and flask-restful and maybe Swagger docs) but that’s an awful lot of learning to put onto a non-engineering researcher to help them publish code that another team can consume.

I’ve prototyped “featherweight” as a very simple solution to this problem. Behind the scenes Flask is used to publish your function(s) on a local server. You can then call the function with standard GET requests and key/value arguments (e.g. via cURL or a web browser or the requests module) and get a block of JSON that wraps whatever results your function returned.
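To make the idea concrete, here is a minimal standard-library sketch of the same pattern: wrap a plain Python function so that GET key/value arguments go in and a block of JSON comes out. This is not featherweight’s actual API (featherweight uses Flask behind the scenes). The names `publish`, `serve` and `add`, the port and the JSON field names are all assumptions made for this example.

```python
import json
from urllib.parse import parse_qs
from wsgiref.simple_server import make_server


def add(a, b):
    """Example research function to publish (hypothetical)."""
    return float(a) + float(b)


def publish(func):
    """Wrap func as a WSGI app: GET key/value arguments in, JSON out."""
    def app(environ, start_response):
        query = parse_qs(environ.get("QUERY_STRING", ""))
        # parse_qs returns a list of values per key; keep the first one.
        kwargs = {key: values[0] for key, values in query.items()}
        try:
            body = {"success": True, "result": func(**kwargs)}
            status = "200 OK"
        except Exception as err:  # bad or missing arguments, or a failing function
            body = {"success": False, "error": str(err)}
            status = "400 Bad Request"
        payload = json.dumps(body).encode("utf-8")
        start_response(status, [
            ("Content-Type", "application/json"),
            ("Content-Length", str(len(payload))),
        ])
        return [payload]
    return app


def serve(func, port=8000):
    """Serve func on localhost until interrupted (single-threaded, dev use only)."""
    make_server("localhost", port, publish(func)).serve_forever()
```

Calling `serve(add)` and then fetching `http://localhost:8000/?a=2&b=3` with cURL, a browser or the requests module should return a JSON object wrapping the result. All arguments arrive as strings, so the published function has to do its own type conversion.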

The goal is to make it super-easy for a non-engineering researcher to take their Python function or method and to publish it on a web API, without knowing anything about web programming. Examples on github include publishing a simple math function and publishing scikit-learn’s Iris classifier.

Whilst this API won’t solve production use-cases (it is single-threaded, it doesn’t do any clever logging, there’s no additional security) it will solve proof-of-concept and dev-level usage. It also opens the door to moving from Featherweight to a custom Flask interface. Feedback happily received!


3 Comments | Tags: Data science, Python

14 October 2015 - 13:21 Opening Plenary at BudapestBI Forum 2015

I’ve just given my final talk for the year – I’m “at my other home” in Budapest (I’m half-Hungarian) and have had the honour of opening Bence and team’s BudapestBI Forum 2015. This conference has both an open-source-day and (tomorrow) an enterprise-day, all around analytics and with lots of Python and R.

This talk is an iteration of my previous Shipping talks. It is backed in part by results from our latest PyDataLondon survey of 2,000 members, where we asked about member frustrations, and I’ve integrated some of the results into this talk:

Shipping Data Science Products
(source)

In the room we had roughly 2/3 ‘engineers/builders’ and 1/3 ‘researchers/analysts’. It seems that Python and R are used by a large number of folk here today.

I also ‘released’ a set of my notes that I’ve tentatively entitled “Data Science Delivered” – this is a github doc with a series of the notes that I wish I’d learned years ago. Right now these notes are super-rough, but I figure “release early, release often” will help me refine them.

The document is based in part on my talking, teaching and coaching over the last couple of years. I intend to add more in the next couple of weeks (so hopefully by November 2015 it’ll be far less rough!), and I’d like to add some Notebooks as examples. You’re welcome to post bugs/requests and I’ll try to add notes, if I know about those areas. Please feel free to share some of your experiences (via @ianozsvald, via email, via Bugs etc).


6 Comments | Tags: Data science, Life, pydata, Python

20 September 2015 - 17:23 “Ship Data Science Products!” at PyConUK2015

PyConUK2015 is over. It was another year of happy Pythonistic hobbitness in Coventry. I spoke on shipping data science products on the new Science track (organised by Sarah).

It was nice to hear some polite-abuse being thrown at folk stuck on Python 2.x reminding them that it is high time to upgrade to Python 3. Propaganda was given away to support this move.

Obviously I plugged PyDataLondon and our upcoming meetups – if you like data science then come along to our meetups.


3 Comments | Tags: Data science, Life, pydata, Python

28 August 2015 - 11:27 EuroSciPy 2015 and Data Cleaning on Text for ML (talk)

I’m at EuroSciPy 2015, where we have 2 days of Pythonistic Science in Cambridge. Next year’s conference will be in Bavaria, and you can sign up for announcements.

I spoke in the morning on Data Cleaning on Text to Prepare for Data Analysis and Machine Learning (which is a terribly verbose title, sorry!). I’ve just covered 10 years of lessons learned working with NLP on (often crappy) text data, and ways to clean it up to make it easy to work with. Topics covered:

  • decoding bytes into unicode (including chardet, ftfy, chromium language detector) to step past the UnicodeDecodeError
  • validating that a new dataset looks like a previous+trusted dataset (I’m thinking of writing a tool for this – would that be useful to you?)
  • automatically transforming data from “what I have” to “what I want” with annotate.io without writing regexps (now public)!
  • manual approaches to normalisation (the stuff I do that started me thinking on annotate.io)
  • visualisation with GlueViz, Seaborn and csv-fingerprint
  • starting your first ML project
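As a small illustration of the first topic, here is a minimal sketch of decoding bytes with a list of fallback encodings so that a single `UnicodeDecodeError` doesn’t stop the pipeline. It uses only the standard library rather than chardet or ftfy, and both the `decode_bytes` helper and the order of encodings tried are assumptions chosen for this example.

```python
def decode_bytes(raw, encodings=("utf-8", "cp1252")):
    """Return (text, encoding_used), trying each candidate encoding in turn."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this last resort cannot
    # fail, although the resulting text may be wrong.
    return raw.decode("latin-1"), "latin-1"


# The same word stored in two different encodings.
as_utf8 = decode_bytes("café".encode("utf-8"))
as_cp1252 = decode_bytes("café".encode("cp1252"))
```

Recording which encoding succeeded is useful later on: a column that mostly decodes as UTF-8 but occasionally falls through to a fallback is a hint that the source data was mixed or already mangled.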

Thanks to Enthought and the org-team for a lovely conference!


9 Comments | Tags: Data science, pydata, Python