About


This is Ian Ozsvald's blog (@IanOzsvald). I'm an entrepreneurial geek, a Data Science/ML/NLP/AI consultant, founder of the Annotate.io social media mining API, author of O'Reilly's High Performance Python book, co-organiser of PyDataLondon, co-founder of the SocialTies App, author of the A.I.Cookbook, author of The Screencasting Handbook, a Pythonista, co-founder of ShowMeDo and FivePoundApps and also a Londoner. Here's a little more about me.


7 December 2015 - 23:59 “Data Science Delivered” (a collection of notes on getting stuff shipped)

Over the last year I’ve given a collection of keynotes and talks around shipping and supporting data science products with Python. I’ve started to gather up my notes into a document – they’re hosted on github as Data Science Delivered, currently it’s around 5 pages of A4. I put the rough form together after my last keynote of the year in Budapest.

Right now it has notes on how to approach a new project, ways of dealing with bad data, ways to ship working products and ways projects might get sunk.

I’m slowly going to add to this list, I think the rough structure is in place and there’s a lot of detail to add. If you’re interested in getting updates then add your email here and I’ll mail you on occasion when I’ve added a new chunk of information.


Ian applies Data Science as an AI/Data Scientist for companies in ModelInsight, sign-up for Data Science tutorials in London. Historically Ian ran Mor Consulting. He also founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

11 Comments | Tags: Data science, pydata, Python

6 November 2015 - 18:24 “Featherweight” data science API to publish Python functions on the web

One of the challenges I’ve encountered when coaching data science teams in smaller organisations is the difficulty of publishing proof-of-concept data science products via web calls, when the team doesn’t know anything about web programming. My preference is to use Flask (and flask-restful and maybe Swagger docs) but that’s an awful lot of learning to put onto a non-engineering researcher to help them publish code that another team can consume.

I’ve prototyped “featherweight” as a very simple solution to this problem. Behind the scenes Flask is used to publish your function(s) on a local server. You can then call the function with standard GET requests and key/value arguments (e.g. via cURL or a web browser or the requests module) and get a block of JSON that wraps whatever results your function returned.

The goal is to make it super-easy for a non-engineering researcher to take their Python function or method and to publish it on a web API, without knowing anything about web programming. Examples on github include publishing a simple math function and publishing scikit-learn’s Iris classifier.
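
As a rough illustration of the idea (this is not featherweight's code or API), here is a minimal sketch that publishes a plain Python function over HTTP GET and wraps the result in JSON. It assumes the standard library's wsgiref in place of Flask so that it has no dependencies. The names `publish`, `serve` and `add`, and the `success`/`result`/`error` keys in the JSON, are all hypothetical.

```python
import json
from urllib.parse import parse_qsl
from wsgiref.simple_server import make_server


def publish(func):
    """Wrap func as a WSGI app: GET key/value arguments in, JSON out."""
    def app(environ, start_response):
        # Query-string values arrive as strings; try to read each one as a
        # JSON literal (so "3" becomes 3) and fall back to the raw string.
        kwargs = {}
        for key, value in parse_qsl(environ.get("QUERY_STRING", "")):
            try:
                kwargs[key] = json.loads(value)
            except ValueError:
                kwargs[key] = value
        try:
            body = {"success": True, "result": func(**kwargs)}
            status = "200 OK"
        except Exception as err:  # report any failure to the caller as JSON
            body = {"success": False, "error": repr(err)}
            status = "400 Bad Request"
        payload = json.dumps(body).encode("utf-8")
        start_response(status, [("Content-Type", "application/json"),
                                ("Content-Length", str(len(payload)))])
        return [payload]
    return app


def serve(func, host="localhost", port=8000):
    """Block and serve func locally (single-threaded, proof-of-concept only)."""
    with make_server(host, port, publish(func)) as server:
        server.serve_forever()


def add(x, y):
    """Example function to publish (hypothetical)."""
    return x + y
```

Calling `serve(add)` and then requesting `http://localhost:8000/?x=3&y=4` with cURL or a browser should return a JSON block whose `result` is 7. Like the prototype described above, this sketch has no security or logging, so it only suits local proof-of-concept use.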

Whilst this API won’t solve production use-cases (it is single-threaded, it doesn’t do any clever logging, there’s no additional security) it will solve proof-of-concept and dev-level usage. It also opens the door to moving from Featherweight to a custom Flask interface. Feedback happily received!



10 Comments | Tags: Data science, Python

14 October 2015 - 13:21 Opening Plenary at BudapestBI Forum 2015

I’ve just given my final talk for the year – I’m “at my other home” in Budapest (I’m half-Hungarian) and have had the honour of opening Bence and team’s BudapestBI Forum 2015. This conference has both an open-source-day and (tomorrow) an enterprise-day, all around analytics and with lots of Python and R.

This talk is an iteration of my previous Shipping talks, in part backed by results from our latest PyDataLondon survey of 2,000 members. We asked about member frustrations and I’ve integrated some of the results into this talk:

Shipping Data Science Products
(source)

Here are my slides:

In the room we had roughly 2/3 ‘engineers/builders’ and 1/3 ‘researchers/analysts’, it seems that Python and R are used by a large number of folk here today.

I also ‘released’ a set of my notes that I’ve tentatively entitled “Data Science Delivered” – this is a github doc with a series of the notes that I wish I’d learned years ago. Right now these notes are super-rough, I figure “release early, release often” will help me refine these.

It is based in part on my talking, teaching and coaching over the last couple of years. I intend to add more in the next couple of weeks (so hopefully by November 2015 it’ll be far less rough!), I’d like to add some Notebooks as examples. You’re welcome to post bugs/requests and I’ll try to add notes, if I know about those areas. Please feel free to share some of your experiences (via @ianozsvald, via email, via Bugs etc).



8 Comments | Tags: Data science, Life, pydata, Python

20 September 2015 - 17:23 “Ship Data Science Products!” at PyConUK2015

PyConUK2015 is over, it was another year of happy Pythonistic hobbitness in Coventry. I spoke on shipping data science products on the new Science track (organised by Sarah):

It was nice to hear some polite-abuse being thrown at folk stuck on Python 2.x reminding them that it is high time to upgrade to Python 3. Propaganda was given away to support this move.

Obviously I plugged PyDataLondon and our upcoming meetups – if you like data science then come along to our meetups.



8 Comments | Tags: Data science, Life, pydata, Python

28 August 2015 - 11:27 EuroSciPy 2015 and Data Cleaning on Text for ML (talk)

I’m at EuroSciPy 2015, we have 2 days of Pythonistic Science in Cambridge. Next year it will be in Bavaria, you can sign up for announcements.


I spoke in the morning on Data Cleaning on Text to Prepare for Data Analysis and Machine Learning (which is a terribly verbose title, sorry!). I’ve just covered 10 years of lessons learned working with NLP on (often crappy) text data, and ways to clean it up to make it easy to work with. Topics covered:

  • decoding bytes into unicode (including chardet, ftfy, chromium language detector) to step past the UnicodeDecodeError
  • validating that a new dataset looks like a previous+trusted dataset (I’m thinking of writing a tool for this – would that be useful to you?)
  • automatically transforming data from “what I have” to “what I want” with annotate.io without writing regexps (now public)!
  • manual approaches to normalisation (the stuff I do that started me thinking on annotate.io)
  • visualisation with GlueViz, Seaborn and csv-fingerprint
  • starting your first ML project
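
To make the first bullet concrete, here is a minimal sketch of stepping past a UnicodeDecodeError using only the standard library. It is a stand-in for the detection that chardet or ftfy would do, not a reproduction of the talk's code. The helper name `decode_bytes` and the choice of candidate encodings (UTF-8, then Windows-1252) are assumptions.

```python
def decode_bytes(raw, encodings=("utf-8", "cp1252")):
    """Return (text, encoding_used) for raw bytes.

    Tries each candidate encoding in order. This is a crude guess, not real
    detection: a wrong-but-valid encoding will decode without complaint.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this last resort never fails.
    return raw.decode("latin-1"), "latin-1"


# b"caf\xe9" is not valid UTF-8, so the second candidate is used.
text, used = decode_bytes(b"caf\xe9")
```

Recording which encoding was used alongside the text makes it easier to audit a dataset later and spot files that fell through to the last-resort decoding.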

Here are the slides:

 

Thanks to Enthought and the org-team for a lovely conference!



14 Comments | Tags: Data science, pydata, Python

1 August 2015 - 21:57 PyConUK and the Science Track

PyConUK is in its 9th year and this year it’ll host its first Science Track aimed at scientists (not “data scientists” but real lab-coat-wearing scientists). I’m speaking in that track, yay (“Ship Data Science Products!“)! This track is part of the main conference, it all runs during September 19-21. Here’s a tiny reminder from the first 2007 event.

If you’d like to learn about Python’s role in helping researchers with their work, enabling reproducible research and the spread of digital literacy in the sciences, you should attend this track. This track can be attended for just £99 (without attending the rest of the conference), this is a bit of a steal given you’ll get 3 days of great networking and learning.

The Software Sustainability Institute is involved and PyConUK is looking for sponsors, this is a great way to spread your message into a scientific community and to over 300 attendees. For details you should contact PyConUK directly (pyconuk-sponsorship@python.org).

Other speakers include members of PyDataLondon (I’m a co-org) and the wider UK Python community.



9 Comments | Tags: Data science, pydata, Python

21 June 2015 - 16:27 PyDataLondon 2015 Write-up and my “Ship It!” talk on publishing data science products

(this post is still evolving June 22nd…)

We’ve just run our 2nd PyDataLondon conference, we’ve had around 300 attendees, 3 keynotes, 3 tracks over 3 days. It has been fab! We’ve grown 50% on last year along with 20% female speakers and 20% female attendees (both up on last year). I’m really happy with the results of all the hard work of our conference committee. Here’s Helena giving our opening keynote:

Video status – forthcoming. Slide status – they’ll get linked in this github repo.

Our keynoters were Helena Bengtsson (Editor for Data Projects at The Guardian), Eric Drass (the data scientist’s artist-philosopher, see @bffbot2 and @theresamaybot) and Meta Brown (speaker and writer for statistics and business analytics). Meta gave me a copy of her latest book Data Mining for Dummies which covers the CRISP-DM process she discussed – yay and thanks!

Florian has posted a huge set of high quality conf photos, go dig to see some gems!

Our monthly meetup is now at 1,650 members and our 13th meetup is scheduled for Tues July 7th at AHL (near Bank tube) – go RSVP now! If you have questions about Pythonic data science – you’ll get them answered with 200+ folk at our meetups (probably in the pub after – buy beer and talk to folk!).

I gave a talk entitled “Ship It!“, breaking down 10 years of experience on building, running and deploying successful data science projects. It reflects on recent experiences consulting on automated contract recruitment over 1.5 years with ElevateDirect here in London. I looked at 10 years of my consulting projects, removed those that failed (noting reasons why) and then categorised those that worked into the 4 groups that I start the talk with. After that I build on lessons as the groups build into each other.

Peadar Coyle (@springcoil) spoke on deployment recently at PyConItaly, his talk is worth a watch. You’ll probably want to catch up on his PyMC tutorial that we had over the weekend at PyDataLondon.

I’m thinking of writing a book (or something like that) in the future on building and shipping data science products, if you’re interested take a look and join the announce list.

In my talk and during the closing notes I made a point to everyone – if there’s one simple thing you do today to help support open source projects (particularly if you use them, but don’t contribute to them in other ways) – please please Cite the Project in Public. scikit-learn has a citations page, this helps them raise money from funding bodies, they justify the funding by showing how it helps companies do more business. All you have to do is write a paragraph’s testimonial and send it to your favourite project. The scikit’s, scipy, numpy, ML tools, matplotlib etc – they’d all love to have new testimonials. It’ll take you 15 minutes, please go do it.

Other reviews:

Since the conference was a huge success it means a good chunk of money was raised for NumFOCUS, the non-profit that backs the PyData conferences. As a result the awards and scholarships that they provide to the community including the John Hunter scholarship, diversity grants and women in tech, grants for development on tools like AstroPy, IPython, SymPy and Software Carpentry will get a huge boost. Good job all!

“”If you want to support open source projects publicly say you use them and write testimonials” – @ianozsvald at #pydataldn15 YES PLEASE.” @drmaciver of Hypothesis

UPDATE – David has a testimonials page for his Hypothesis library.

I’ll call out a new project that I mentioned: DSADD (Data Scientists Against Dirty Data, now known as Engarde), a set of decorators to apply to Pandas DataFrames to set constraints on your data. This helps when dealing with dirty data.
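
To show the general pattern of constraining data with a decorator, here is a dependency-free sketch. It is not Engarde's API: it assumes the decorated function returns a list of dicts rather than a Pandas DataFrame, and the names `none_missing` and `load_rows` are hypothetical.

```python
import functools


def none_missing(columns):
    """Decorator factory: fail fast if any returned row lacks a value in columns."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            rows = func(*args, **kwargs)
            # Check every row of the result before handing it back.
            for index, row in enumerate(rows):
                for column in columns:
                    if row.get(column) is None:
                        raise ValueError(
                            f"row {index} has no value for {column!r}")
            return rows
        return wrapper
    return decorator


@none_missing(["name", "salary"])
def load_rows(raw_rows):
    """Stand-in for a real loading or cleaning step (hypothetical)."""
    return [dict(row) for row in raw_rows]
```

The constraint sits next to the function that produces the data, so dirty input raises an error at the step that introduced it rather than several steps downstream.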

I also got to do another book signing for my High Performance Python, along with Yves and his Python for Finance:

Our team (my co-chair Emlyn and team Cecilia, Graham, Florian, Slavi and Calvin) did a wonderful job, along with Leah and James (our International Team [they make all the background stuff happen – particularly Leah!]), and Bloomberg’s team including Amy, Kenny and Darren:

Our wonderful sponsors were Continuum (thanks for PyDatas and for Anaconda!), Bloomberg (thanks for the venue!), Pivigo, Pivotal, Adthena, Pluralsight, Plotly, Sainsburys. Huge thanks to you all for making this possible.

The party last night was in a local Bier Keller with a live Oompah Band (don’t ask!). Much conversation was had 🙂

It was encouraging to see more folk using Python 3.4 at the conference, though still 2.7 was in the majority. I wonder how news that the next Ubuntu (15.10 Wily Werewolf) is switching to Python 3.5 in October will help with people’s transition?

If you’re interested in hearing about PyDataLondon 2016, join this announce list. It’ll be almost-zero-volume for the next 6 months, I’ll do something with it once we’re planning the next conference.

If you’re interested in other conferences, also check out:

Finally – if you’re after a Data Science Job, I run a very-low-volume jobs list (mostly for London but for the UK in general), read about it here. My ModelInsight also runs data science Python training in London, we announce new training courses on this list. All the lists are MailChimp (so you can unsubscribe instantly at any time), I rarely post to the lists and I keep it all relevant.



41 Comments | Tags: Data science, pydata, Python

13 May 2015 - 16:42 Data Science Deployed – Opening Keynote for PyConSE 2015

I’ve just had a fab couple of days at PyConSE in Stockholm, I really enjoyed giving the opening keynote (thanks!) and attending two days of interesting talks. The Saturday was packed with data science talks (see below), it felt like a mini PyData or EuroSciPy, most cool!

The goal of my talk was to show use-cases for why you should do data science, why it is valuable, how to do it successfully with Python and how to get the data products deployed. The whole shebang in 40 minutes. Tools mentioned include scikit-learn, statsmodels, textract, pandas, matplotlib, seaborn, bokeh, IPython and Notebooks, Spyder, PyCharm, Flask and Spyre.

Sidenote – this is the follow-on to my “The Real Unsolved Problems in Data Science” opening keynote at PyConIreland 2014.

My main points seemed to make it through, phew!

What I take from @ianozsvald talk:
“How can i turn our data into business value?”
“Log everything!”
Think + hypothesize + test @pythse

Exploiting your data is key to staying relevant in your business! Listening to @ianozsvald at #pyconse @scalior

Note – I’ll be updating this write-up a little over the next couple of days (it is the end of the conf and I’m rather shattered right now!).

The slides and video for my Data Science Deployed talk are below:

I’d like to acknowledge Ollie Glass along with Ferenc Huszár (Balderton) and Thomas Stone (Prediction.io) for feedback on early ideas for my talk – cheers gents!

I also plugged PyDataBerlin, our upcoming PyDataLondon (June 19-21, CfP open for just 1 more week) and EuroSciPy on stage, hopefully we’ll see a few more international visitors. I should have plugged PyConUK too, as there’s now a Science Track!

The following talks from yesterday will interest you, I hope the videos come online soon:

  • Analyzing data with Pandas
  • Data processing and machine learning with Python (slides)
  • Deep Learning and Deep Data Science
  • Hacking Human Language
  • IPython: How a notebook is changing science
  • The Hitchhikers Guide to Python

Here’s a couple of extra links that might be interesting:

Here’s Ilian Iliev’s review of the conference too.

I have a vague idea to write up these topics more in the future, I’m calling this Building Data Science Products with Python. There’s a mailing list, I’ll email to ask questions a little over the coming months to figure out if/how I should write this.

Thanks everyone for a lovely conference!



14 Comments | Tags: Life, pydata, Python

7 May 2015 - 20:22 12th PyDataLondon meetup at AHL

We’ve just had our 12th meetup – we’re fully a year old, we’ve nearly 1,500 members and now we’re planning our second conference (the Call for Proposals is open for just another 10 days!). Python Data Science has grown crazily-popular in the last couple of years!

Here’s a photo from last week’s meetup, that’s over 220 people at our new host hedge-fund AHL (they’re hiring):


Our two speakers were:

  • Slavi Marinov talking on using gensim for topic classification for financial prediction
  • Lasse Bohling talking on using statistics for football prediction at footballradar.com

Slides are linked in the meetup comments. We’ll take a break for a month to run the conference (June 19-21), then we’ll pick up again in July.



5 Comments | Tags: pydata, Python

3 May 2015 - 15:08 “#talkpay” tweet salary visualisation

This weekend the #talkpay tag has shown people outing their salaries, to democratise some of this information. This provides some interesting data for visualisation. If you’re curious about a discussion around salary data then @patio11’s blog entry is a good starting point.

@echen grabbed some of the data, I took a copy of the online sheet and made the following code to visualise the salaries. This is a very simplistic analysis, it is mostly US data, there’s no filtering for location (you’d expect San Francisco to pay significantly more than many other US cities).

First, here’s a histogram of the majority of the salaries listed (ignoring the top 9, which go up to $1.1 million and distort the plot):

Next we can filter by some text terms, here’s a similar histogram for software developers. Note the interesting peaks at $80k and $120k, then smaller but obvious bumps at $150k, $200k and $250k:

There’s much less data for teachers but you can get an idea of the difference in likely salaries:

Finally we can plot a normed (summed to 1.0) cumulative histogram, you can think of the data as probabilities to get an idea of the proportion of people who earn less/more than a certain amount:
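
The numbers behind a normed cumulative histogram can be sketched without a plotting library. This is an illustration rather than the code used for the plots here. The salary figures are made-up placeholders (not the #talkpay data), `cumulative_histogram` is a hypothetical helper, and values are assumed to be non-negative.

```python
from itertools import accumulate


def cumulative_histogram(values, bin_width):
    """Return (upper_bin_edge, proportion_below_edge) pairs for values.

    Bin i covers [i * bin_width, (i + 1) * bin_width). The proportions are
    normalised so that the final entry is 1.0.
    """
    n_bins = int(max(values) // bin_width) + 1
    counts = [0] * n_bins
    for value in values:
        counts[int(value // bin_width)] += 1
    total = len(values)
    return [((i + 1) * bin_width, running / total)
            for i, running in enumerate(accumulate(counts))]


# Placeholder salaries for illustration only.
salaries = [40_000, 80_000, 80_000, 120_000]
for edge, proportion in cumulative_histogram(salaries, 50_000):
    print(f"below ${edge:,}: {proportion:.0%}")
```

Each pair reads as "this proportion of people reported less than this amount", which is the probability-style reading described above.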

It is worth remembering that the data is thin, just 800 samples, it is also self-reported so most of the reports will be from people who are confident in being public. It is likely that the true distribution of salaries is lower, as people who aren’t confident are less likely to publish.



15 Comments | Tags: pydata, Python