Ian Ozsvald picture

This is Ian Ozsvald's blog (@IanOzsvald), I'm an entrepreneurial geek, a Data Science/ML/NLP/AI consultant, author of O'Reilly's High Performance Python book, co-organiser of PyDataLondon, a Pythonista, co-founder of ShowMeDo and also a Londoner. Here's a little more about me.

High Performance Python book with O'Reilly

View Ian Ozsvald's profile on LinkedIn

ModelInsight Data Science Consultancy London Protecting your bits. Open Rights Group

22 April 2015 - 21:47A review of ModelInsight’s growth this last year

Early last year Chris and I founded ModelInsight, a boutique Python-focused Data Science agency in London. We’ve grown well, I figure some reflection is in order. In addition the Data Science scene has grown very well in London, I’ll put some notes on that down below too.

Through consulting, training, workshops and coaching we’ve had the pleasure of working with the likes of King.com, Intel, YouGov and ElevateDirect. Each project aimed to help our client identify and use their data more effectively to generate more business. Projects have included machine learning, natural language processing, prediction, data extraction for both prototyping and deploying live services.

I’ve particularly enjoyed the training and coaching. We’ve run courses introducing data science with Python, covering stats and scikit-learn and high performance Python (based on my book), if you want to be notified of future courses then please join our training announce list.

With the coaching I’ve had the pleasure of working with two data scientists who needed to deploy reliably-working classifiers faster, to automate several human-driven processes for scale. I’ve really enjoyed the challenges they’re posing. If your team could do with some coaching (on-site or off-site) then get in touch, we have room for one more coaching engagement.

I’ve also launched my first data-cleaning service at Annotate.io, it aims to save you time during the early data-cleaning part of a new project. I’d value your feedback and you can join an announce list if you’d like to follow the new services we have planned that’ll make data-cleaning easier.

All the above occurs because the Data Science scene here in London has grown tremendously in the last couple of years. I co-organise the PyDataLondon meetup (over 1,400 members in a year!), here’s a chart showing our month-on-month growth. At Christmas it turned up a notch and it just keeps growing:


Each month we have 150-200 people in the room for strong Data Science talks, in a couple of months we’ll have our second conference with 300 people at Bloomberg (CfP announce list). We’re actively seeking speakers – join that list if you’d like to know when the CfP opens.

I’ve been privileged to speak as the opening keynoter on The Real Unsolved Problems in Data Science last year at PyConIreland, I’ve just spoken on data cleaning at PyDataParis and soon I’ll keynote on Data Science Deployed at PyConSE. I’m deeply grateful to the community for letting me share my experience. My goal is to help more companies utilise their data to improve their business, if you’ve got ideas on how we could help then I’d love to hear from you!

I’m also thinking of writing a book on Building Python Data Science Products, see the link for some notes, it’ll cover 15 years of hard-won advice in building and shipping successful data science products using Python.

Ian applies Data Science as an AI/Data Scientist for companies in ModelInsight, sign-up for Data Science tutorials in London. Historically Ian ran Mor Consulting. He also founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

5 Comments | Tags: Data science, pydata, Python

3 April 2015 - 11:05PyDataParis 2015 and “Cleaning Confused Collections of Characters”

I’m at PyDataParis, this is the first PyData in France and we have a 300-strong turn-out. In my talk I asked about the split of academic and industrial folk, we have 70% industrialists here (at least – in my talk of 70 folk). The bulk of the attendees are in the Intro track and maybe the split is different in there. All slides are up, videos are following, see them here.

Here’s a photo of Gael giving a really nice opening keynote on Scikit-Learn:

I spoke on data cleaning with text data, I packed quite a bit into my 40 minutes and got a nice set of questions. The slides are below, it covers:

  • Data extraction from text files, PDF, HTML/XML and images
  • Merging on columns of data
  • Correctly processing datetimes from files and the dangers of relying on the pandas defaults
  • Normalising text columns so we could join on otherwise messy data
  • Automated data transformation using my annotate.io (Python demo)
  • Ideas on automated feature extraction
  • Ideas on automating visualisation for new, messy datasets to get a “bird’s eye view”
  • Tips on getting started – make a Gold Standard!

One question concerned the parsing of datetime strings from unusual sources. I’d mentioned dateutil‘s parser in the talk and a second parser is delorean. In addition I’ve also seen arrow (an extension of the standard datetime) which has a set of parsers including one for ISO8601. The parsedatetime module has an NLP module to convert statements like “tomorrow” into a datetime.

I don’t know of other, better parsers – do you? In particular I want one that’ll take a list of datetimes and return one consistent converter that isn’t confused by individual instances (e.g. “1/1” is MM/DD or DD/MM ambiguous).

I’m also asking for feedback on the subject of automated feature extraction and automated column-join tools for messy data. If you’ve got ideas on these subjects I’d love to hear from you.

In addition I was reminded of DiffBot, it uses computer vision and NLP to extract meaning from web pages. I’ve never tried it, can any of you comment on its effectiveness? Olivier Grisel mentioned pyquery to me, it is an lxml parser which lets you make jquery-like queries on HTML.

update I should have mentioned chardet, it detects encodings (UTF8, CP1252 etc) from raw text, very useful if you’re trying to figure out the encoding for a collection of bytes off of a random data source! libextract (write-up) looks like a young but nice tool for extracting text blocks from HTML/XML sources, also goose. boltons is a nice collection of bolton-tools to the standard library (e.g. timeutils, strutils, tableutils). Possibly mETL is a useful tool to think about the extract, transform and load process.

update It might also be worth noting some useful data sources from which you can extract semi-structured data, e.g. ‘tech tags’ from stackexchange‘s forums (and I also see a new hackernews dump). Here’s a big list of “awesome public datasets“.

update Peadar Coyle (@springcoil) gave a nice talk at PyConItaly 2015 on “Data Products – how to get models into production” which is related.

Camilla Montonen has just spoken on Rush Hour Dynamics, visualising London Underground behaviour. She noted graph-tool, a nice graphing/viz library I’d not seen before. Fabian has just shown me his new project, it collects NLP IPython Notebooks and lists them, it tries to extract titles or summaries (which is a gnarly sub-problem!). The AXA Data Innovation Lab have a nice talk on explaining machine learned models.

Gilles Loupe’s slides for his ML/sklearn talk on trees and boosting are online, as are Alexandre Gramfort‘s on sklearn linear models.

Ian applies Data Science as an AI/Data Scientist for companies in ModelInsight, sign-up for Data Science tutorials in London. Historically Ian ran Mor Consulting. He also founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

18 Comments | Tags: Data science, Life, pydata, Python