About


This is Ian Ozsvald's blog (@IanOzsvald), I'm an entrepreneurial geek, a Data Science/ML/NLP/AI consultant, founder of the Annotate.io social media mining API, author of O'Reilly's High Performance Python book, co-organiser of PyDataLondon, co-founder of the SocialTies App, author of the A.I.Cookbook, author of The Screencasting Handbook, a Pythonista, co-founder of ShowMeDo and FivePoundApps and also a Londoner. Here's a little more about me.


18 January 2015 - 19:40 Data Science Jobs UK (ModelInsight) – Python Jobs Email List

I’ve had people asking me about how they can find data scientists in London and through our PyDataLondon meetup we’ve had members announcing jobs. There’s no central location for data science jobs so I’ve put together a new list (administered through my ModelInsight agency).

Sign-up to the list here: Data Science Jobs UK (ModelInsight)

  • Aimed at Data Science jobs in the UK
  • Mostly Python (maybe R, Matlab, Julia if relevant)
  • It’ll include Permie and Contract jobs

The list will only work if you can trust it so:

  • Your email is private (it is never shared)
  • The list is on MailChimp so you can unsubscribe at any time
  • We vet the job posts and only forward them if they’re in the interests of the list
  • Nobody else can post into the list (all jobs are forwarded just by us)
  • It’ll be low volume and all posts will be very relevant

Sign-up to the list here: Data Science Jobs UK (ModelInsight)

Obviously if you’re interested in joining the London Python data science community then come along to our PyDataLondon meetups.


Ian applies Data Science as an AI/Data Scientist for companies in ModelInsight, sign-up for Data Science tutorials in London. Historically Ian ran Mor Consulting. He also founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

5 Comments | Tags: Data science, pydata, Python

10 January 2015 - 14:04 A first approach to automatic text data cleaning

In October I gave the opening keynote at PyConIreland on The Real Unsolved Problems in Data Science. One of the topics I covered was poor quality data: by some estimates data cleaning occupies 50-80% of a data scientist’s time.

Personally I’ve just spent the better part of last year figuring out ways to convert poorly-represented company names on 100,000s of CVs/resumes to a cleaned subset for my contract recruitment client (via my ModelInsight agency). This enables us to build ranking engines for contract job applicants (and I’ll note happily that it works rather well!). It only works because we put so much effort into cleaning the raw data. Huge investments like this are expensive in time and money, and that carries risk for a client. Tools used include NLTK, ftfy, Pandas, scikit-learn and the re module, all in Python 3.4.

During the keynote I asked if anyone had tooling they could open up to make this sort of task easier. I didn’t get a lot of feedback on that, so I’ve had a crack at one of the problems I’d discussed, over at my Annotate.io.

The mapping of raw input data to a lower-dimensional output isn’t trivial, but it felt like something that might be automated. Let’s say you scraped job adverts (e.g. using import.io on adzuna, both based in London). The salary field for the jobs will be messy: it’ll include strings like “To 53K w/benefits”, “30000 OTE plus bonus” and maybe even non-numeric descriptions like “Forty two thousand GBP”. These strings are collated from a diverse set of job adverts, all typed by hand by a human, and there’s no standard format.

Let’s say we’re after “53000”, “30000”, “42000” as an output. We can expand contractions (“<nbr>K”->“<nbr>000”), convert written numbers into integers and then extract the number. If you’re used to this sort of process then you might expect to spend 30-60 minutes writing unit tests and support code. When you come to the next challenge, you’ll repeat that hour or so of work. If you’re not sure how you want your output data to look you might spend considerably longer trying transformation ideas. What if we could short-circuit this development process and just focus on “what we have” and “what we want”?
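To make the hand-written route concrete, here is a minimal sketch of the three steps above (expand “K”, convert written numbers, extract the number) using only the re module. The function names and the small word list are assumptions made for illustration; they only cover simple phrases like the three example strings and would need extending for real adverts (decimals such as “53.5K” are not handled).

```python
import re

# Illustrative vocabulary only: enough for simple phrases such as "forty two thousand".
SMALL = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19, "twenty": 20, "thirty": 30, "forty": 40,
    "fifty": 50, "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}


def words_to_int(text):
    """Convert a simple written number to an int, or None if no number words are found."""
    total, current, seen = 0, 0, False
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in SMALL:
            current += SMALL[word]
            seen = True
        elif word == "hundred" and seen:
            current *= 100
        elif word == "thousand" and seen:
            total += current * 1000
            current = 0
    return total + current if seen else None


def clean_salary(raw):
    """Return the salary as a string of digits, or None if nothing usable is found."""
    text = raw.lower().replace(",", "")
    # Expand the "<nbr>K" contraction to "<nbr>000".
    text = re.sub(r"(\d+)\s*k\b", r"\g<1>000", text)
    match = re.search(r"\d+", text)
    if match:
        return match.group()
    # No digits at all, so fall back to written numbers.
    number = words_to_int(text)
    return None if number is None else str(number)


for raw in ["To 53K w/benefits", "30000 OTE plus bonus", "Forty two thousand GBP"]:
    print(raw, "->", clean_salary(raw))
```

Even this small case needs a word list, a regular expression and a fallback path, which is the repeated hour of work described above.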

More complex tasks include transforming messy company name strings, fixing broken unicode and converting unicode to ASCII (which can ease indexing for search) and identifying tokens that need to be stripped or transformed. There’s a second example over at Annotate and more will follow. I’m about to start work on ‘fact extraction’ – given a block of text (e.g. a description field) can we reliably extract a single fact that’s written in a variety of ways?
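As a rough illustration of the unicode-to-ASCII step, here is a sketch that uses only the standard library’s unicodedata module. The helper name to_ascii is made up for this example. It does not repair broken (mis-decoded) unicode, which is a separate job for a tool like ftfy, and characters with no decomposition (e.g. “ß”) are simply dropped.

```python
import unicodedata


def to_ascii(text):
    """Decompose accented characters, then drop anything with no ASCII form."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Café Société"))
print(to_ascii("Müller GmbH"))
```

Folding names to ASCII like this loses information, so it is probably best kept for the search index while the original string is stored alongside.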

Over at Annotate.io I’ll be uploading the first version of a learning text transformer soon. It takes a set of example input->output mappings, learns a transformation sequence that minimizes the transformation distance (hopefully to a distance of 0 meaning it has solved the problem) and then it can use this transformation sequence on future text you pass into the system.
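To give a flavour of the idea (this is a toy sketch, not the Annotate.io implementation), the code below brute-forces short sequences drawn from a handful of made-up transformations and keeps the sequence whose outputs are closest to the wanted outputs. The transformation set, the function names and the use of difflib’s similarity ratio as the “distance” are all assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher
from itertools import permutations


def first_number(text):
    """Keep only the first run of digits, or leave the text alone if there is none."""
    match = re.search(r"\d+", text)
    return match.group() if match else text


# Hypothetical building blocks; a real system would need a much richer set.
TRANSFORMS = {
    "lowercase": str.lower,
    "strip": str.strip,
    "expand_k": lambda s: re.sub(r"(\d+)\s*[kK]\b", r"\g<1>000", s),
    "first_number": first_number,
}


def distance(a, b):
    """0.0 for identical strings, approaching 1.0 as they diverge."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def apply_sequence(names, text):
    """Apply the named transformations in order."""
    for name in names:
        text = TRANSFORMS[name](text)
    return text


def learn(examples, max_len=3):
    """Search for the transformation sequence with the lowest total distance."""
    names = sorted(TRANSFORMS)
    best_seq = ()
    best_cost = sum(distance(raw, wanted) for raw, wanted in examples)
    for length in range(1, max_len + 1):
        for seq in permutations(names, length):
            cost = sum(distance(apply_sequence(seq, raw), wanted)
                       for raw, wanted in examples)
            if cost < best_cost:
                best_seq, best_cost = seq, cost
            if best_cost == 0:
                # Every example is reproduced exactly, so stop searching.
                return best_seq, best_cost
    return best_seq, best_cost


examples = [("To 53K w/benefits", "53000"), ("30000 OTE plus bonus", "30000")]
sequence, cost = learn(examples)
print(sequence, cost)
print(apply_sequence(sequence, "Up to 45k"))
```

The learned sequence is just a tuple of names, so it can be stored and replayed on future text. Exhaustive search like this only works for a tiny transformation set; anything larger would need a smarter search.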

The API is JSON based and will come with Python examples. There’s a mailing list you can join on the site for announcements. I’m specifically interested in the kind of problems you might want to put into this system, so please get in contact if you’re curious.

I’m also hoping to work on another data cleaning tool later. If you want to talk about this at a future PyDataLondon meetup, I’d love to chat.



6 Comments | Tags: ArtificialIntelligence, Data science, Python

10 December 2014 - 13:31 New Relic, uWSGI and “Cannot perform a data harvest for ‘<appname>’ as there is no active session.”

This is more a note-to-self and maybe to another confused soul – if you’re using New Relic (it seems to be really rather nice for web app monitoring) with uWSGI, note that by default uWSGI does not initialise the GIL. This means threads created by your app (which the New Relic agent relies on) do not run properly, so New Relic won’t report anything, which leads to a confusing first try.

Specifically, read the Best Practices notes for uWSGI around “--enable-threads”. You have to add “--enable-threads” if you’re using New Relic’s Python agent. This is documented in their Python Agent Integration docs for uWSGI, but for me the clue was in their log (by default in /tmp/newrelic-python-agent.log if you enable it in newrelic.ini), which showed:

(3717/NR-Harvest-Thread) newrelic.core.agent DEBUG 
 - Commencing harvest of all application data.
(3717/NR-Harvest-Thread) newrelic.core.application DEBUG 
 - Cannot perform a data harvest for '<appname>' as there is no active session.
(3717/NR-Harvest-Thread) newrelic.core.agent DEBUG 
 - Completed harvest of all application data in 0.00 seconds.

Once I’d added “--enable-threads” to uWSGI the logs looked a lot healthier, particularly:

(3292/NR-Harvest-Thread) newrelic.core.agent DEBUG 
 - Commencing harvest of all application data.
(3292/NR-Harvest-Thread) newrelic.core.application DEBUG 
 - Commencing data harvest of '<appname>'.
 ...
(3292/NR-Harvest-Thread) newrelic.core.application DEBUG 
 - Send profiling data for harvest of '<appname>'.
(3292/NR-Harvest-Thread) newrelic.core.application DEBUG 
 - Done sending data for harvest of '<appname>'.
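If you’d rather not eyeball the log, a few lines of Python can count the two messages shown above. This is a small sketch: the function name is made up, and it assumes the agent is logging at DEBUG level with the message text exactly as in the excerpts.

```python
def summarise_harvests(log_lines):
    """Count failed and successful harvest messages in New Relic agent log lines."""
    failed = sum("as there is no active session" in line for line in log_lines)
    sent = sum("Done sending data for harvest" in line for line in log_lines)
    return {"failed": failed, "sent": sent}


# Sample lines copied from the log excerpts above.
sample = [
    " - Cannot perform a data harvest for '<appname>' as there is no active session.",
    " - Commencing data harvest of '<appname>'.",
    " - Done sending data for harvest of '<appname>'.",
]
print(summarise_harvests(sample))
```

To run it on a real log, pass in the open file, e.g. `summarise_harvests(open("/tmp/newrelic-python-agent.log"))`. A log with many “failed” entries and no “sent” entries matches the symptom described here.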


No Comments | Tags: Life, Python

25 November 2014 - 19:11 We’re running more Data Science Training in 2015 Q1 in London

A couple of weeks ago Bart and I ran two very successful training courses in London through my ModelInsight agency. One introduced data science using pandas and numpy to build a recommender engine; the second was a two-day course on High Performance Python (and yes, that was somewhat based on my book, with a lot of hands-on exercises). Based on feedback from those courses we’re looking to introduce up to 5 courses at the start of next year.

If you’d like to hear about our London data science training then sign-up to our (very low volume) announce list. I posted an anonymous survey onto the mailing list; if you’d like to vote on the courses we should run then jump over here (no sign-up, there’s only 1 question, there’s no commitment).

If you’d like to talk about these in person then you can find me (probably on-stage) co-running the PyDataLondon meetups.

Here are the synopses for each of the proposed courses:

“Playing with data – pandas and matplotlib” (1 day)

Aimed at beginner Pythonista data scientists who want to load, manipulate and visualise data
We’ll use pandas with many practical exercises on different sorts of data (including messy data that needs fixing) to manipulate, visualise and join data. You’ll be able to work with your own data sets after this course. We’ll also look at other visualisation tools like Seaborn and Bokeh. This will suit people who haven’t used pandas and who want a practical introduction, such as data journalists, engineers and semi-technical managers.

“Building a recommender system with Python” (1 day)

Aimed at intermediate Pythonistas who want to use pandas and numpy to build a working recommender engine, this covers both using data through to delivering a working data science product. You already know a little linear algebra and you’ve used numpy lightly, you want to see how to deploy a working data science product as a microservice (Flask) that could reliably be put into production.

“Statistics and Big Data using scikit-learn” (2 days)

Aimed at beginner/intermediate Pythonistas with some mathematical background and a desire to learn everyday statistics and to start with machine learning
Day 1 – Probability, distributions, Frequentist and Bayesian approaches, Inference and Regression, Experiment Design – part discussion and part practical
Day 2 – Applying these approaches with scikit-learn to everyday problems. Examples may include Bayesian spam detection, predicting political campaigns, quality testing, clustering and weather forecasting (note that *examples may change*, this just gives a flavour). Tools will include Statsmodels and matplotlib.

“Hands on with Scikit-Learn” (5 days)

Aimed at intermediate Pythonistas who need a practical and comprehensive introduction to machine learning in Python, you’ve already got a basic statistical and linear algebra background
This course will cover all the terminology and stages that make up the machine learning pipeline and the fundamental skills needed to perform machine learning successfully. Aided by many hands on labs with Python scikit-learn the course will enable you to understand the basic concepts, become confident in applying the tools and techniques, and provide a firm foundation from which to dig deeper and explore more advanced methods.

“High Performance Python” (2 days)

Aimed at intermediate Pythonistas whose code is too slow
Day 1 – Profiling (CPU and RAM), compiling with Cython, using Numba, PyPy and Pythran (all the way through to using OpenMP)
Day 2 – Going multicore (multiprocessing) and multi-machine (IPython parallel), fitting more into RAM, probabilistic counting, storage engines, Test Driven Development and several debugging exercises
A mix of theory and practical exercises, you’ll be able to use the main Python tools to confidently and reliably make your code run faster
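As a taste of the Day 1 profiling material, here is a minimal sketch using only the standard library: cProfile for CPU time and tracemalloc (Python 3.4+) for peak memory. These are function-level stand-ins chosen for illustration, not the line-by-line tools used on the course, and the helper names and the toy workload are made up.

```python
import cProfile
import io
import pstats
import tracemalloc


def slow_sum_of_squares(n):
    """Deliberately naive workload to give the profilers something to measure."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_cpu(func, *args):
    """Run func under cProfile; return (result, report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
    return result, stream.getvalue()


def peak_memory(func, *args):
    """Run func under tracemalloc; return (result, peak bytes allocated)."""
    tracemalloc.start()
    try:
        result = func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


result, report = profile_cpu(slow_sum_of_squares, 100000)
print(report)
numbers, peak = peak_memory(list, range(100000))
print("peak bytes:", peak)
```

Measure first, then optimise: the report shows where the time goes before you reach for Cython, Numba or PyPy.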



1 Comment | Tags: Data science, Python

11 October 2014 - 16:18 My Keynote at PyConIreland 2014 – “The Real Unsolved Problems in Data Science”

I’ve just given the opening keynote here at PyConIreland 2014 – many thanks to the organisers for letting me get on stage. This is based on 15 years’ experience running my own consultancies in Data Science and Artificial Intelligence. (Small note – with the pic below James mis-tweeted ‘sexist’ instead of ‘sexiest’ (from my opening slide) <sigh>)

 

The slides for “The Real Unsolved Problems in Data Science” are available on speakerdeck along with the full video. I wrote this for the more engineering-focused PyConIreland audience. These are the high level points, I did rather fill my hour:

  • Data Science is driven by companies needing new differentiation tactics (not by ‘big data’)
  • Problem 1 – People asking for too-complex stuff that’s not really feasible (‘magic’)
  • Problem 2 – Lack of statistical education for engineers – do go on statistics courses!
  • Problem 3 – Dirty data is a huge cost – think about doing a Data Audit
  • Problem 4 – We need higher-level data cleaning APIs that understand human-level data (rather than numbers, strings and bools!) – much work is required here
  • Problem 5 – Visualisation with Python still hard and clunky, has a poor on-boarding experience for new users (and R does well here)
  • Problem 6 – Lots of go-faster/high-performance options but really Python should ‘handle this for us’ (and yes, I have written a book on this)
  • Problem 7 – Lack of shared vocabulary for statisticians & engineers
  • Problem 8 – Heterogeneous storage world is mostly non-Python (at least for high performance work), we need a “LAMP Stack for Data Science”
  • Problem 9 – Collaboration is still painful (but the IPython Notebook is improving this)
  • Problem 10 – We’re still building the same tools over and over (but the Notebook makes it easier) - we could do with some shared tools here
  • Linked Open Data is very useful and you should contribute to it and consume it
  • Our common tooling in Python is very powerful – please join numpy and scipy projects and contribute to the core
  • I noted a few times that the Python science stack works in Python 3 so you should just use Python 3.4+ for all new projects
  • PyData/EuroSciPy/SciPy/DataKind meetups are a great way to get involved
  • We need a “Design Patterns for Data Science with Python” book (and I want to know what you want to learn)

From discussions afterwards it seems that my message “you need clean data to do neat data science stuff” was well received. I’m certainly not the only person in the room battling with Unicode foolishness (not in Python of course as Python 3+ solves the Unicode problem :-).



25 Comments | Tags: High Performance Python Book, pydata, Python

5 September 2014 - 12:20 Fourth PyDataLondon Meetup

We’ve just run our 4th PyDataLondon meetup (@PyDataLondon). Having over 500 members is superb for just 4 months’ growth, woot :-)

Many thanks to @GoPivotalEMEA for hosting us.

We had 3 speakers and 1 lightning talk.

Here are my slides on “The High Performance Python Landscape”:

I’m still collecting data for my two surveys (to discuss at a future PyData when I’ve got enough data), one on Data Science training needs and one on Why Are More Companies Not Using Data Science?

Next Dirk spoke on Data for Good and datamining water sources in Tanzania including some very honest thoughts on how to (hopefully) leave behind working systems that local teams can maintain. Dirk’s talk is built on a project called Taarifa (and source) that our Florian helped build.

Finally we had Matt from Plot.ly over from San Francisco, he gave very compelling reasons to investigate the online visualisation (and data sharing) system for plot.ly. The matplotlib 1-line converter was particularly nice.

Tariq gave a lightning talk on his Make Your Own Mandelbrot book (aimed at kids and newbies to Python), his slides are online.

We’ve got a growing collection of Offer/Want cards which help connect folk in the pub afterwards, we’ll keep building these up:

Our next event is on October 7th, be sure to follow the @pydatalondon twitter account and join the PyDataLondon meetup group to see the forthcoming announcement. The RSVPs for the 4th event were filled within 2 hours of the general announcement!



3 Comments | Tags: pydata, Python

30 August 2014 - 12:06 Slides for High Performance Python tutorial at EuroSciPy2014 + Book signing!

Yesterday I taught an excerpt of my 2 day High Performance Python tutorial as a 1.5 hour hands-on lesson at EuroSciPy 2014 in Cambridge with 70 students:


We covered profiling (down to line-by-line CPU & memory usage), Cython (pure-py and OpenMP with numpy), Pythran, PyPy and Numba. This is an abridged set of slides from my 2 day tutorial, take a look at those details for the upcoming courses (including an intro to data science) we’re running in October.

I’ll add the video in here once it is released, the slides are below.

I also got to do a book-signing for our High Performance Python book (co-authored with Micha Gorelick), O’Reilly sent us 20 galley copies to give away. The finished printed book will be available via O’Reilly and Amazon in the next few weeks.

Book signing at EuroSciPy 2014

If you want to hear about our future courses then join our low-volume training announce list. I have a short (no-signup) survey about training needs for Pythonistas in data science, please fill that in to help me figure out what we should be teaching.

I also have a further survey on how companies are using (or not using!) data science, I’ll be using the results of this when I keynote at PyConIreland in October, your input will be very useful.

Here are the slides (License: CC By NonCommercial), there’s also source on github:



No Comments | Tags: Life, pydata, Python

28 August 2014 - 10:38 High Performance Python Training at EuroSciPy this afternoon

I’m training on High Performance Python this afternoon at EuroSciPy, my github source is here (as a shortlink: http://bit.ly/euroscipy2014hpc). There are prerequisites for the course.

This training is actually a tiny part of what I’ll teach on my 2 day High Performance Python course in London in October (along with a Data Science course). If you’re at EuroSciPy, please say Hi :-)



No Comments | Tags: Python

26 August 2014 - 21:35 Why are technical companies not using data science?

Here’s a quick question. How come more technical companies aren’t making use of data science? By “technical” I mean any company with data and the smarts to spot that it has value, by “data science” I mean any technical means to exploit this data for financial gain (e.g. visualisation to guide decisions, machine learning, prediction).

I’m guessing that it comes down to an economic question – either it isn’t as valuable as some other activity (making mobile apps? improving UX on the website? paid marketing? expanding sales to new territories?) or it is perceived as being valuable but cannot be exploited (maybe due to lack of skills and training or data problems).

I’m thinking about this for my upcoming keynote at PyConIreland, would you please give me some feedback in the survey below (no sign-up required)?

To be clear – this is an anonymous survey, I’ll have no idea who gives the answers.


 

If the above is interesting then note that we’ve got a data science training list where we make occasional announcements about our upcoming training and we have two upcoming training courses. We also discuss these topics at our PyDataLondon meetups. I also have a slightly longer survey (it’ll take you 2 minutes, no sign-up required), I’ll be discussing these results at the next PyDataLondon so please share your thoughts.



3 Comments | Tags: ArtificialIntelligence, Data science, pydata, Python

20 August 2014 - 21:24 Data Science Training Survey

I’ve put together a short survey to figure out what’s needed for Python-based Data Science training in the UK. If you want to be trained in strong data science, analysis and engineering skills please complete the survey, it doesn’t need any sign-up and will take just a couple of minutes. I’ll share the results at the next PyDataLondon meetup.

If you want training you probably want to be on our training announce list, this is a low volume list (run by MailChimp) where we announce upcoming dates and suggest topics that you might want training around. You can unsubscribe at any time.

I’ve written about the current two courses that run in October through ModelInsight, one focuses on improving skills around data science using Python (including numpy, scipy and TDD), the second on high performance Python (I’ve now finished writing O’Reilly’s High Performance Python book). Both courses focus on practical skills, you’ll walk away with working systems and a stronger understanding of key Python skills. Your developer skills will be stronger as will your debugging skills, in the longer run you’ll develop stronger software with fewer defects.

If you want to talk about this, come have a chat at the next PyData London meetup or in the pub after.



3 Comments | Tags: Data science, pydata, Python