About

This is Ian Ozsvald's blog (@IanOzsvald), I'm an entrepreneurial geek, a Data Science/ML/NLP/AI consultant, founder of the Annotate.io social media mining API, author of O'Reilly's High Performance Python book, co-organiser of PyDataLondon, co-founder of the SocialTies App, author of the A.I.Cookbook, author of The Screencasting Handbook, a Pythonista, co-founder of ShowMeDo and FivePoundApps and also a Londoner. Here's a little more about me.

29 February 2016 - 23:02
Will we see “[module] on Python 3.4+ is free but only paid-support for Python 2.7”?

I’m curious about the transition in our ecosystem from Python 2 to Python 3. On stage at our monthly PyDataLondon meetups I’m known to badger folk to take the step and upgrade to reduce the support burden on developers. The transition gathers pace but it still feels slow. I’ve noted my recommendations for moving to Python 3+ at the end. See also the reddit discussion.

I’m wondering – when will we see the point where open source projects say “We support Python 3.x for free but if you want bugs fixed for Python 2.7, you’ll have to pay”? I’m not saying “if”, but “when”. There’s already one example below and others will presumably follow.

In the last couple of years a slew of larger projects have dropped or are dropping support for Python 2.6 – numpy (some discussion), pandas, scipy, matplotlib, NLTK, astropy, ipython, django, numba, twisted, scrapy. Good – Python 2.6 was deprecated when 2.7 was released in 2010 (that’s 6 years ago!).

The position of the matplotlib and django teams is clearly “Python 2.7 and Python 3.4+”. Django states that Python 2.7 will be supported until the 2020 sunset date:

“As a final heads up, Django 1.11 is likely to be the last version to support Python 2.7 as it will be supported until the end of Python 2 upstream support in 2020. We’ve adopted a Python version support policy…”

We can expect the larger projects to support legacy userbases with a mix of Python 2.7 and 3.4+ for 3.5 years (at least until 2020). After this we should expect projects to start to drop 2.7 support, some (hopefully) more aggressively than others.

What about smaller projects? Several have outright dropped Python 2.7 support already – nikola, python-thumbnails – or never supported it – wordfreq, featherweight. Which others have I missed? Update: JupyterHub (cheers Thomas) too.

More interestingly David MacIver (of Hypothesis) stated a while back that he’d support Python 2.7 for free but Python 2.6 would be a paid support option. He’s also tagged (regardless of version) a bunch of bugs that can be fixed for a fee. Viewflow is another – Python 3.4 is free for non-commercial use but a commercial license or support for Python 2.7 requires a fee. Asking for money to support old, PITA or difficult options seems mightily sensible. I guess we’ll see this first for tools that have a good industrial userbase who’d be used to paying for support (like Viewflow).

What next? I imagine most new smaller projects will be Python 3.4+ (probably 3.5+ only soon), they’ll have no legacy userbase to support. They could widen their potential userbase by supporting Python 2.7 but this window only exists for 3 years and those users will have to upgrade anyhow. So why bother going backwards?

Once users notice that cooler new toys are Python 3.4+ only they’ll want to upgrade (e.g. NetworKit is Python 3.3+ only for high volume graph network analysis). They’ll only hold back if they’re supporting legacy internal systems (which will be the case for an awful lot of people). We’ll see this more as we get closer to 2020. What about after 2020?

I guess many companies will be slow to jump to Python 3 (it’d take a lot of effort for no practical improvement), so I’d imagine separate groups will start to support Python 2.7 libraries as forks. Hopefully the main library developers will drop support fairly quickly, to stop open source (cost-free) developers having a tax on their time supporting both platforms.

Separate evidence – Drupal 6 adopted a commercial-support-for-old-versions policy (thanks @chx). It is also worth noting that Ubuntu 16.04 LTS ships without Python 2. Microsoft and Brett Cannon have discussed the benefits of moving to Python 3+ recently.

My recommendations (coming from a Senior Industrial Data Scientist with 15+ years commercial experience and 10+ years using Python):

  • Where feasible – all new projects must use Python 3.5+ (e.g. Proof of Concepts, Research, new isolated systems) – this is surprisingly easy
  • If Python 2.7 compatibility is required – write all new code in a Python 3.5+ compatible way (1, 2, 3, 4), make extensive tests for the later inevitable migration (you already have good test-coverage, right?)
  • Accept that support for Python 2.7 gets turned off in 3.5 years and that all Python 2.7 code written now will likely have to be migrated later (this is a business cost that you can estimate now)
  • Accept that as we get closer to 2020 more programmers (both new and experienced) will be using Python 3.5+, so support for Python 2.7-based libraries will inevitably decline (there’s a big business risk here)
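As a rough illustration of the second recommendation, here is a minimal sketch of code written to behave the same on Python 2.7 and 3.5+. The function names (mean, host_of, read_text) are made up for this example. Adding `from __future__ import division, print_function` at the top of each module is another common step, left out here to keep the sketch short.

```python
import io

try:
    from urllib.parse import urlparse  # Python 3 location
except ImportError:
    from urlparse import urlparse  # Python 2.7 location


def mean(values):
    # float() forces true division on 2.7, where int / int would truncate
    return sum(values) / float(len(values))


def host_of(url):
    # Same call on both versions once the import above has resolved
    return urlparse(url).netloc


def read_text(path):
    # io.open takes an encoding and returns unicode text on 2.7 and 3.x alike
    with io.open(path, encoding="utf-8") as handle:
        return handle.read()


print(mean([1, 2]))
print(host_of("https://example.org/a"))
```

Small tests around functions like these are what make the later migration cheap: the same assertions can run under both interpreters until 2.7 is finally switched off.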

Graham and I did a lightning talk on jumping to Python 3 six months back; there are a lot of new features in Python 3.4+ that will make your life easier (and make your code safer, so you burn less time hunting for problems). Jake also discussed the general problem for scientists back in 2013. It’ll be lovely when we get past this (now-very-boring) discussion.


Ian applies Data Science as an AI/Data Scientist for companies in ModelInsight, sign-up for Data Science tutorials in London. Historically Ian ran Mor Consulting. He also founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

15 Comments | Tags: Data science, Python

7 February 2016 - 23:15
Convert London Oyster (Travel) PDFs to Pandas DataFrames

As a part of analysing Emily’s allergic rhinitis we want to test whether using the London Underground (notoriously dirty!) increases the likelihood of sneezing. The “black snot” phenomenon is well known to Londoners, possibly the particulates (from oil and metal) cause irritation. You can get updates via our allergic rhinitis analysis mailing list (very very low volume).

Transport for London lets us download a log of journeys – either as a CSV file (just dates and costs, no details) or a PDF file (containing full details of the journey and time). It would be much nicer if they made the data available in a cleanly-formatted open format (e.g. at least a CSV, preferably as HDF5).

The goal is to take the detail-rich PDFs and to build a DataFrame like:

                             from is_train                to
date                                                        
2016-01-30  Bus Journey, Route 46    False                  
2016-01-28           Kentish Town     True  Leicester Square
2016-01-28             Old Street     True      Kentish Town
2016-01-28       Leicester Square     True        Old Street
2016-01-27                  Angel     True      Kentish Town

Using textract (see these Python 3.4 install notes, I also use pdftotext) and a very hacky parser (written this evening, it really is a stateful-messy-hack <sorry>) I can parse a single PDF or a folder to build a Pandas DataFrame of journeys. You’ll find London Oyster PDF to DataFrame Parser here. The output is an HDF5 file which can be loaded by Python into Pandas (or R or Matlab or whatever).
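The parser itself isn't reproduced here, so the following is only a sketch of the row-building step. It assumes (hypothetically) that each extracted journey description reads "A to B" for a train journey and anything else, such as a bus journey, has no destination. The function name split_journey and the sample strings are made up for illustration.

```python
def split_journey(description):
    """Split one journey description into from/is_train/to fields.

    Assumed format: train journeys read "A to B"; anything else
    (e.g. a bus journey) is treated as having no destination.
    """
    if " to " in description:
        origin, destination = description.split(" to ", 1)
        return {"from": origin.strip(), "is_train": True, "to": destination.strip()}
    return {"from": description.strip(), "is_train": False, "to": ""}


rows = [
    split_journey("Bus Journey, Route 46"),
    split_journey("Kentish Town to Leicester Square"),
]
print(rows)
```

A list of dicts like rows can be handed straight to pandas.DataFrame(rows) to get the columns shown in the table above; the dates would be parsed separately and set as the index.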


6 Comments | Tags: Data science, Python

25 January 2016 - 21:27
PyDataLondon 2016 Call for Proposals Open

Our Call for Proposals for PyDataLondon 2016 (May 6-8) is open until approx. end of February (5ish weeks), you need to get your submission in soon!

If you want to sponsor to talk with 330 cutting edge data scientists – you’d better hurry, we’ve already started signing deals.

In the CfP we’re looking for:

  • Stories about successful data science projects (including the highs and lows)
  • Machine learning (including Deep Learning) – especially why you used certain algorithms and how you diagnosed features
  • Visualisation – have you explained or explored something that’s good to share?
  • Data cleaning
  • Data process (getting data, understanding it, building models, deploying solutions)
  • Industrial and Academic stories
  • Big data including Spark

You might also be interested in PyDataAmsterdam on March 12-13th (their Call for Proposals is already open).

We’ve also got a new (temporary URL) webpage for our regular meetups here, this has notes on how to submit a talk to the meetup (not the conference, just the PyDataLondon meetup). Please take a look if you’d like to speak to 200 folk at our monthly meetup.


22 Comments | Tags: Data science, pydata, Python

12 January 2016 - 11:28
Data Scientist Jobs in London

Back in January 2015 I announced my Data Science Jobs UK email list. This has grown nicely, several hundred data scientists have joined it and are interested in (mostly) Python related jobs around London with an even split between contract and permanent roles. If you sign-up to the mailing list you’ll get:

  • 1-2 plain-ASCII mails a month with a summary of current jobs (typically 4-6), mostly focused around London
  • Sometimes the jobs are remote
  • Mostly they’re for Python but Matlab and R also come up

I manage the list, your email is never shared and the list is run by mailchimp so you can easily unsubscribe. Active data scientists who attend PyDataLondon can post for free, others can post at a commercial rate (e.g. recruiters and folk in companies). I vet all the jobs to ensure they’re relevant. Drop me an email if you’ve got a relevant job to share.

“After placing a contract ad on this list I was contacted by a number of high quality and enthusiastic data scientists, who all proposed innovative and exciting solutions to my research problem, and were able to explain their proposals clearly to a non-specialist; the quality of responses was so high that I was presented with a real dilemma in choosing who to work with”. – Hazel Wilkinson, Cambridge University

I put the list together to help local data scientists find more relevant jobs, feel free to dip in and out when it might be useful.


3 Comments | Tags: Data science, Python

11 January 2016 - 23:57
Allergic Rhinitis (“Why do I always sneeze?!”) research project using Machine Learning

Since April my wife (@fluffyemily) and I have been running a research project around her allergies. She sneezes all year and we’re trying to figure out the cause. Allergic Rhinitis affects 10-30% of Westerners, in Emily’s case it is all-year so it isn’t just pollen related. We figure that a good data-collection process coupled with robust analysis might reveal some of the causes of sneezing such that Emily’s in better control of her Rhinitis.

Emily’s a senior iOS developer with Mozilla, she wrote an open source App for her iPhone to log her sneezes, antihistamine use and interactions with “things” like animals. The App gives us a time-stamp and geolocation. Since she’s mostly in London we’ve got a rich source of events to join to other datasets.

This post is just to put down a marker. I’ve made some progress using Machine Learning to predict when an antihistamine might be used. Currently I can out-predict a Dummy (majority-class) classifier using many cross-validation runs, this is hardly brilliant but we didn’t expect diagnosing a long-term allergy to be a simple affair! Exploratory data analysis on the data shows lots of interesting behaviours, I hope to talk about some of these in the future.
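For readers unfamiliar with the baseline mentioned above: a majority-class "dummy" classifier always predicts whichever label was most common in the training data, so any useful model has to beat its accuracy. The sketch below uses made-up labels, not the project's data, and the function name is hypothetical. scikit-learn's DummyClassifier with strategy="most_frequent" provides the same baseline inside a cross-validation loop.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy from always predicting the most common training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return hits / len(test_labels)


# Hypothetical labels: 1 = antihistamine taken that day, 0 = not taken
train = [0, 0, 0, 1, 0, 1, 0, 0]
test = [0, 1, 0, 0]
print(majority_baseline_accuracy(train, test))
```

With imbalanced labels like these the baseline is already fairly high, which is why out-predicting it across many cross-validation runs is a meaningful (if modest) first result.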

We’ve tried (and so far rejected) airborne particulates as a reason for her allergies via Kings College LondonAir data (thanks!). Weather data is more promising using a local wunderground station (Emily seems to be a little sensitive to humidity and windspeed). I’ve recently started work on MyFitnessPal logged data (the Python 3.4 port was thankfully easy) to start to look at alcohol (a known histamine modifier) and possibly other food.

Behind the scenes I’ve got a collaborative group (thanks Frank and Giles!) in Slack and a private github repo, I plan to talk a little on how this works. I think talking about ways we can collaborate on research projects has value, anything that helps us move on from just working in an office seems like a good idea.

If you’re interested in hearing updates about this project and maybe getting involved to log your own allergy data, join this email announce list. Your email will be kept private, I’ll just send you an email every now and again when we’ve made some progress (which will probably appear here) and when we need volunteers.

Ultimately we’d like to help predict the causes of allergies for other folk. We’ve been talking about this for around 2 years, it is encouraging to see research like this pointing to the use of ML to predict and model the body’s behaviours.


10 Comments | Tags: Data science, Life, Python

10 January 2016 - 23:08
Announcing PyDataLondon 2016 (May 6-8th)

We’re very happy to announce that Bloomberg will host us a second time for PyDataLondon 2016 (our 3rd annual conference). We’ll run the conference over May 6-8th (a tutorial day and 2 conference days as last time) with approximately 330 people in attendance. The location is Central London – near Bank underground station and London Bridge.

Our PyDataLondon meetup community has grown amazingly in the last year, we’ve almost doubled in size to 2,500+ members with 200 in the room each month. We’ve had 19 events in almost 2 years, mostly around Python (some with R, Julia and Matlab), mostly on data science (and stats, visualisation and high performance) and all with a lovely collaborative audience.

The conference Call for Proposals will be opened very soon (in a week or two). If you’d like to speak in front of 330 active data scientists in London’s most active data science community, get thinking on your topic. We’re interested in data science topics, mostly around Python (but we’re cool with other tech and theory). Extra attention will be paid to talks offering real-world stories (for both success and failure – all lessons are equally useful).

Sign-up to this email announce list to be kept in the loop, I’ll write a couple of mails when the CfP is open and as the conference plans develop.

If you’ve not been to one of our conferences before, check out my write-ups from 2015 and 2014.

If you’re hiring or you have a relevant product – think on sponsoring. We expect to sell all of our spots this year due to increased demand for strong data scientists – if you’d like to have a prime spot in the central room (all the talk-rooms hang off of the central room so sponsors are in the thick of it), do get in contact.

You might also be interested in PyDataAmsterdam on March 12-13th (their Call for Proposals is already open).


17 Comments | Tags: Data science, pydata, Python

7 December 2015 - 23:59
“Data Science Delivered” (a collection of notes on getting stuff shipped)

Over the last year I’ve given a collection of keynotes and talks around shipping and supporting data science products with Python. I’ve started to gather up my notes into a document – they’re hosted on github as Data Science Delivered, currently it’s around 5 pages of A4. I put the rough form together after my last keynote of the year in Budapest.

Right now it has notes on how to approach a new project, ways of dealing with bad data, ways to ship working products and ways projects might get sunk.

I’m slowly going to add to this list, I think the rough structure is in place and there’s a lot of detail to add. If you’re interested in getting updates then add your email here and I’ll mail you on occasion when I’ve added a new chunk of information.


7 Comments | Tags: Data science, pydata, Python

6 November 2015 - 18:24
“Featherweight” data science API to publish Python functions on the web

One of the challenges I’ve encountered when coaching data science teams in smaller organisations is the difficulty of publishing proof-of-concept data science products via web calls, when the team doesn’t know anything about web programming. My preference is to use Flask (and flask-restful and maybe Swagger docs) but that’s an awful lot of learning to put onto a non-engineering researcher to help them publish code that another team can consume.

I’ve prototyped “featherweight” as a very simple solution to this problem. Behind the scenes Flask is used to publish your function(s) on a local server. You can then call the function with standard GET requests and key/value arguments (e.g. via cURL or a web browser or the requests module) and get a block of JSON that wraps whatever results your function returned.
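To make the idea concrete, here is a standard-library sketch of the mechanism described above: take the key/value arguments from a GET-style query string, call a plain Python function with them, and wrap the result in JSON. This is not featherweight's code or API. The names call_with_query and add, the "result" wrapper key and the float coercion are all assumptions made for this example.

```python
import json
from urllib.parse import parse_qs


def _coerce(value):
    """Turn a query-string value into a float where possible, else keep the string."""
    try:
        return float(value)
    except ValueError:
        return value


def call_with_query(func, query_string):
    """Call func with key/value arguments taken from a GET-style query string.

    Returns a JSON string wrapping the result. The "result" key is a
    hypothetical wrapper name, not necessarily what featherweight emits.
    """
    parsed = parse_qs(query_string)
    # parse_qs maps each key to a list of values; keep the first of each
    kwargs = {key: _coerce(values[0]) for key, values in parsed.items()}
    return json.dumps({"result": func(**kwargs)})


def add(a, b):
    return a + b


print(call_with_query(add, "a=1&b=2"))
```

In a real deployment a web framework such as Flask would supply the query string from the incoming request and send the JSON back as the response body; the dispatch step in the middle stays this simple.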

The goal is to make it super-easy for a non-engineering researcher to take their Python function or method and to publish it on a web API, without knowing anything about web programming. Examples on github include publishing a simple math function and publishing scikit-learn’s Iris classifier.

Whilst this API won’t solve production use-cases (it is single-threaded, it doesn’t do any clever logging, there’s no additional security) it will solve proof-of-concept and dev-level usage. It also opens the door to moving from Featherweight to a custom Flask interface. Feedback happily received!


3 Comments | Tags: Data science, Python

14 October 2015 - 13:21
Opening Plenary at BudapestBI Forum 2015

I’ve just given my final talk for the year – I’m “at my other home” in Budapest (I’m half-Hungarian) and have had the honour of opening Bence and team’s BudapestBI Forum 2015. This conference has both an open-source-day and (tomorrow) an enterprise-day, all around analytics and with lots of Python and R.

This talk is an iteration of my previous Shipping talks, backed in part by our latest PyDataLondon survey of 2,000 members, where we asked about member frustrations. I’ve integrated some of the results into this talk:

[Image: Shipping Data Science Products (source)]

In the room we had roughly 2/3 ‘engineers/builders’ and 1/3 ‘researchers/analysts’, it seems that Python and R are used by a large number of folk here today.

I also ‘released’ a set of my notes that I’ve tentatively entitled “Data Science Delivered” – this is a github doc with a series of the notes that I wish I’d learned years ago. Right now these notes are super-rough, I figure “release early, release often” will help me refine these.

It is based in part on my talking, teaching and coaching over the last couple of years. I intend to add more in the next couple of weeks (so hopefully by November 2015 it’ll be far less rough!), and I’d like to add some Notebooks as examples. You’re welcome to post bugs/requests and I’ll try to add notes, if I know about those areas. Please feel free to share some of your experiences (via @ianozsvald, via email, via Bugs etc).


6 Comments | Tags: Data science, Life, pydata, Python

20 September 2015 - 17:23
“Ship Data Science Products!” at PyConUK2015

PyConUK2015 is over; it was another year of happy Pythonistic hobbitness in Coventry. I spoke on shipping data science products on the new Science track (organised by Sarah).

It was nice to hear some polite-abuse being thrown at folk stuck on Python 2.x reminding them that it is high time to upgrade to Python 3. Propaganda was given away to support this move.

Obviously I plugged PyDataLondon and our upcoming meetups – if you like data science then come along to our meetups.


3 Comments | Tags: Data science, Life, pydata, Python