About

Ian Ozsvald picture

This is Ian Ozsvald's blog (@IanOzsvald), I'm an entrepreneurial geek, a Data Science/ML/NLP/AI consultant, author of O'Reilly's High Performance Python book, co-organiser of PyDataLondon, a Pythonista, co-founder of ShowMeDo and also a Londoner. Here's a little more about me.

High Performance Python book with O'Reilly

View Ian Ozsvald's profile on LinkedIn

ModelInsight Data Science Consultancy London Protecting your bits. Open Rights Group

11 October 2014 - 16:18My Keynote at PyConIreland 2014 – “The Real Unsolved Problems in Data Science”

I’ve just given the opening keynote here at PyConIreland 2014 – many thanks to the organisers for letting me get on stage. This is based on 15 years experience running my own consultancies in Data Science and Artificial Intelligence. (Small note  – with the pic below James mis-tweeted ‘sexist’ instead of ‘sexiest’ (from my opening slide) <sigh>)

Sidenote – this is the precursor to my “Data Science Deployed” opening keynote at PyConSE 2015.

 

The slides for “The Real Unsolved Problems in Data Science” are available on speakerdeck along with the full video. I wrote this for the more engineering-focused PyConIreland audience. These are the high level points, I did rather fill my hour:

  • Data Science is driven by companies needing new differentiation tactics (not by ‘big data’)
  • Problem 1 – People asking for too-complex stuff that’s not really feasible (‘magic’)
  • Problem 2 – Lack of statistical education for engineers – do go statistics courses!
  • Problem 3 – Dirty data is a huge cost – think about doing a Data Audit
  • Problem 4 – We need higher-level data cleaning APIs that understand human-level data (rather than numbers, strings and bools!) – much work is required here
  • Problem 5 – Visualisation with Python still hard and clunky, has a poor on-boarding experience for new users (and R does well here)
  • Problem 6 – Lots of go-faster/high-performance options but really Python should ‘handle this for us’ (and yes, I have written a book on this)
  • Problem 7 – Lack of shared vocabulary for statisticians & engineers
  • Problem 8 – Heterogeneous storage world is mostly non-Python (at least for high performance work), we need a “LAMP Stack for Data Science”
  • Problem 9 – Collaboration is still painful (but the IPython Notebook is improving this)
  • Problem 10 – We’re still building the same tools over and over (but the Notebook makes it easier) – we could do with some shared tools here
  • Linked Open Data is very useful and you should contribute to it and consume it
  • Our common tooling in Python is very powerful – please join numpy and scipy projects and contribute to the core
  • I noted a few times that the Python science stack works in Python 3 so you should just use Python 3.4+ for all new projects
  • PyData/EuroSciPy/SciPy/DataKind meetups are a great way to get involved
  • We need a “Design Patterns for Data Science with Python” book (and I want to know what you want to learn)

From discussions afterwards it seems that my message “you need clean data to do neat data science stuff” was well received. I’m certainly not the only person in the room battling with Unicode foolishness (not in Python of course as Python 3+ solves the Unicode problem :-).


Ian applies Data Science as an AI/Data Scientist for companies in ModelInsight, sign-up for Data Science tutorials in London. Historically Ian ran Mor Consulting. He also founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

25 Comments | Tags: High Performance Python Book, pydata, Python