I’ve just given the opening keynote here at PyConIreland 2014 – many thanks to the organisers for letting me get on stage. This is based on 15 years’ experience running my own consultancies in Data Science and Artificial Intelligence. (Small note – with the pic below James mis-tweeted ‘sexist’ instead of ‘sexiest’ (from my opening slide) <sigh>)
Sidenote – this is the precursor to my “Data Science Deployed” opening keynote at PyConSE 2015.
The slides for “The Real Unsolved Problems in Data Science” are available on speakerdeck along with the full video. I wrote this for the more engineering-focused PyConIreland audience. These are the high-level points (I did rather fill my hour):
- Data Science is driven by companies needing new differentiation tactics (not by ‘big data’)
- Problem 1 – People asking for too-complex stuff that’s not really feasible (‘magic’)
- Problem 2 – Lack of statistical education for engineers – do go on statistics courses!
- Problem 3 – Dirty data is a huge cost – think about doing a Data Audit
- Problem 4 – We need higher-level data cleaning APIs that understand human-level data (rather than numbers, strings and bools!) – much work is required here
- Problem 5 – Visualisation with Python is still hard and clunky, with a poor on-boarding experience for new users (R does well here)
- Problem 6 – Lots of go-faster/high-performance options but really Python should ‘handle this for us’ (and yes, I have written a book on this)
- Problem 7 – Lack of shared vocabulary for statisticians & engineers
- Problem 8 – The heterogeneous storage world is mostly non-Python (at least for high performance work), so we need a “LAMP Stack for Data Science”
- Problem 9 – Collaboration is still painful (but the IPython Notebook is improving this)
- Problem 10 – We’re still building the same tools over and over (but the Notebook makes it easier) – we could do with some shared tools here
- Linked Open Data is very useful and you should contribute to it and consume it
- Our common tooling in Python is very powerful – please join the numpy and scipy projects and contribute to the core
- I noted a few times that the Python science stack works in Python 3 so you should just use Python 3.4+ for all new projects
- PyData/EuroSciPy/SciPy/DataKind meetups are a great way to get involved
- We need a “Design Patterns for Data Science with Python” book (and I want to know what you want to learn)
From discussions afterwards it seems that my message “you need clean data to do neat data science stuff” was well received. I’m certainly not the only person in the room battling with Unicode foolishness (not in Python of course as Python 3+ solves the Unicode problem :-).
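As a rough illustration of what a first-pass Data Audit might look like, here is a minimal sketch using only the standard library. It was not part of the talk: the names `classify`, `audit` and `normalise`, the `MISSING` set and the sample rows are all hypothetical, and a real audit would need rules chosen for your own data.

```python
import unicodedata
from collections import Counter

# Hypothetical spellings of "no value"; a real data set will have its own.
MISSING = {"", "n/a", "na", "null", "none", "-"}


def classify(value):
    """Return a coarse label ('missing', 'number' or 'text') for one raw cell."""
    if value is None:
        return "missing"
    text = str(value).strip()
    if text.lower() in MISSING:
        return "missing"
    try:
        float(text)
        return "number"
    except ValueError:
        return "text"


def audit(rows):
    """Count the coarse labels seen in each column of a list of dict rows."""
    report = {}
    for row in rows:
        for column, value in row.items():
            report.setdefault(column, Counter())[classify(value)] += 1
    return report


def normalise(text):
    """Fold Unicode composition, surrounding whitespace and case differences."""
    return unicodedata.normalize("NFC", text).strip().casefold()


# Made-up sample rows with typical dirt: mixed case, trailing spaces,
# two different Unicode encodings of the same word, and missing markers.
rows = [
    {"city": "Dublin", "age": "34"},
    {"city": "dublin ", "age": "n/a"},
    {"city": "Cafe\u0301", "age": "forty"},
    {"city": "Caf\u00e9", "age": None},
]

report = audit(rows)
raw_cities = {row["city"] for row in rows}
clean_cities = {normalise(row["city"]) for row in rows}

for column, counts in report.items():
    print(column, dict(counts))
print("distinct cities:", len(raw_cities), "raw,", len(clean_cities), "normalised")
```

Even a summary this crude makes the cost of dirty data visible: a supposedly numeric column that is mostly missing or free text, and a text column whose distinct values shrink once Unicode forms, case and whitespace are normalised.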
Ian is a Chief Interim Data Scientist via his Mor Consulting. Sign up for Data Science tutorials in London and to hear about his data science thoughts and jobs. He lives in London, is walked by his high-energy Springer Spaniel and is a consumer of fine coffees.