About


This is Ian Ozsvald's blog (@IanOzsvald), I'm an entrepreneurial geek, a Data Science/ML/NLP/AI consultant, founder of the Annotate.io social media mining API, author of O'Reilly's High Performance Python book, co-organiser of PyDataLondon, co-founder of the SocialTies App, author of the A.I.Cookbook, author of The Screencasting Handbook, a Pythonista, co-founder of ShowMeDo and FivePoundApps and also a Londoner. Here's a little more about me.


20 June 2016 - 8:56
Results for “Which version of Python (2.x vs 3.x) do London Data Scientists use?”

Over the last week I’ve surveyed my PyDataLondon meetup community (3,400+ members) to ask “Which version of Python do you use at work and at home?”. The goal is to gain evidence about which versions of Python are used by Data Scientists. This will help tool developers so they can make evidence-based decisions (e.g. this Dask discussion and another for h5py) about which versions of Python need support now and in the future.

Below I also discuss some business risks of sticking with Python 2.7. Of the 3,400+ members, over 466 (13%) responded to the 4 emails I sent. Python 2.7’s support will end by 2020 (3.5 years from now).

TL;DR: Python 2.7 is still dominant for UK Data Scientists at work, while Python 3.4+ dominates outside of work. I hypothesise that 50% of London Data Scientists will be using Python 3.x by June 2017. Business risks exist for companies that lack a 2.7->3.x migration plan.

Survey results:

At work Python 2.7 dominates (58%) for PyDataLondon members. Python 3.4+ is used by 33% of our respondents (including me).

Version of Python at work

Outside of work Python 3.4+ dominates 2.7 by a small margin (home users on 3.x mostly choose Python 3.5 (37%) over 3.4 (12%)). For both work and home usage, Python versions <=3.3 and 2.6 are used by approximately 2% of respondents each, and nobody reported using <= 2.5. Separately I know at least 2 members at our meetups who have noted that they use Python 2.4 (so that’s at least 2 in 3,400 members).

Version of Python at home

 

The more interesting outcome is “If you use Python 2.7 – do you expect to be using Python 3.x within a year?”. 25% of respondents are using Python 2.7 and do expect to upgrade. 36% are already on Python 3.x. 38% expect to still be using Python 2.7 in a year. Of the aspirational 25% who believe they’ll upgrade, I suspect that at least half of these will have upgraded within a year.

Hypothesis – if I survey again in June 2017 we’ll see Python 3.x usage at 50% of the PyDataLondon community.
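The arithmetic behind that 50% figure can be sketched in a few lines. The two shares come from the survey question above; the 0.5 follow-through rate is only my "at least half" guess, so treat the result as a rough lower estimate and not a forecast.

```python
# Shares reported in the survey question above (fractions of respondents).
already_on_py3 = 0.36      # already using Python 3.x
expect_to_upgrade = 0.25   # on 2.7 today but expect to move within a year

# Assumption from the text: half of the aspirational group really do upgrade.
upgrade_follow_through = 0.5

projected_py3_share = already_on_py3 + expect_to_upgrade * upgrade_follow_through
print("Projected Python 3.x share: {:.1%}".format(projected_py3_share))
```

This lands a little under 50%. If more than half of the aspirational group follow through, the share rises above it.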

Will I upgrade to Python 3.x

When asked about their choice of distribution, respondents clearly favour Continuum’s Anaconda. A significant number of users still use their Operating System’s default Python.

Which distribution at work

Edit – I did have a question about choice of Operating System but I’d left it as a multiple-choice not single choice question. Since the results were hard to interpret I’ve removed that result.

The above results mirror the findings in Randal Olson’s 2014 and 2013 surveys. There are also a couple of related surveys, one from early 2015 (for scientific Python users) and one from 2013.

There’s a final question on “Anything else you’d like to add?”. Some users note that they are fixed to 2.7 for the time being due to a large legacy code-base. Themes like “Current python use at work is around 60% 2.7 and 40% 3.4 .. this ratio is continuously moving towards 3.4 as most new things are in 3.4+.” and “Just made jump to Py3, still have a body of legacy running under 2” occurred throughout the comments. Nobody commented that they’d moved backwards and nobody ruled out upgrading (though some said that they were in no hurry).

Business risk: A few newer tools are only written for Python 3.4+ and are unlikely to be back-ported. Some established projects (e.g. IPython/Jupyter) are moving their next development versions to Python 3.4+ and keeping Python 2.7 for the current branch as a move towards discontinuing Python 2.7 support. There’s an increasing risk that Python 2.7-based Data Scientists will see newer tools appear around them for Python 3.4+ which won’t fit into their development chain. I’ve made notes on this before. If your business uses 2.7 you should at least have a plan for strong unit-test coverage, and new code should be written to be 3.4-compatible, to ease your journey into 3.x+.

Advice to developers of new packages in 2016: If you’re not worried about losing some of your potential users, you might focus just on Python 3.4+. You’ll lose around 60% of your potential userbase, but this will move in your favour fairly quickly over the next 2 years. You might want to invest time in cross-compatibility using tools like __future__ and six if supporting Python 2.7 isn’t too complicated. The burden is heavier if you’re doing text processing and web-based data processing (as the bytes/str/unicode distinctions induce more pain). You probably shouldn’t focus solely on Python 2.7 as the trend is against you.
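To make the cross-compatibility point concrete, here is a minimal sketch of a module written to behave the same on Python 2.7 and 3.4+. It uses only `__future__` imports from the standard library. The `PY2`, `text_type` and `to_text` names are hypothetical helpers for illustration (six offers similar ready-made names such as `six.text_type`, which I have left out so the snippet has no dependencies).

```python
from __future__ import absolute_import, division, print_function, unicode_literals

import sys

# Hypothetical helpers: on 2.7 the text type is `unicode`, on 3.x it is `str`.
PY2 = sys.version_info[0] == 2
text_type = unicode if PY2 else str  # noqa: F821 (unicode only exists on 2.7)


def to_text(value, encoding="utf-8"):
    """Return `value` as unicode text, decoding bytes if necessary."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# With unicode_literals this literal is text on both versions.
raw = "caf\u00e9".encode("utf-8")  # bytes: 5 of them, as the accent takes 2
text = to_text(raw)                # text: 4 characters

print(7 / 2)                # true division on both versions
print(len(raw), len(text))  # byte length vs character length
```

The bytes versus text split is where most of the pain sits. Decoding once at the edges (files, sockets, web responses) and working with text everywhere else keeps the 2.7 and 3.x behaviour aligned.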

 

If you run a community group then maybe you’d like to make a survey like this?

I used SurveyMonkey. It is free if you have <100 respondents; I had to buy a monthly plan to access these results. Here are some notes:

Community surveyed: London Python-using Data Scientists who are members of PyDataLondon. These are mainly industrial users (40% PhD, 40% MSc, the majority self-identify as being Practicing Data Scientists); some are academics. In the UK we’ve had various Python communities grow over the years including PyConUK (2007+), London Financial Python Usergroup (2009-2014), London Python Usergroup (2010+). Our PyDataLondon is 3 years old; it is also the largest active Python usergroup in Europe. The above results probably reflect (within a margin of error) the general state of Python-using Data Scientists in the UK.
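For a rough feel of what “within a margin of error” could mean here, the sketch below applies the textbook 95% margin of error for a proportion, with a finite population correction. It assumes the responses are a simple random sample of the 3,400 members, which a self-selected survey like this one is not, so read the number as a floor on the uncertainty. The function name and defaults are my own.

```python
import math


def margin_of_error(p, n, population=None, z=1.96):
    """Approximate margin of error for an observed proportion.

    p: observed proportion (0..1), n: number of responses,
    population: group size for the finite population correction,
    z: normal quantile (1.96 corresponds to roughly 95% confidence).
    """
    standard_error = math.sqrt(p * (1.0 - p) / n)
    if population is not None:
        # Finite population correction: the sample is a sizeable
        # fraction of the whole group, which shrinks the error a little.
        standard_error *= math.sqrt((population - n) / (population - 1.0))
    return z * standard_error


# 58% on Python 2.7 at work, 466 responses, 3,400 members.
moe = margin_of_error(p=0.58, n=466, population=3400)
print("58% +/- {:.1f} percentage points".format(100 * moe))
```

Under those assumptions the interval is roughly 4 percentage points either side of the reported figure.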

Bias: Accepting the demographics of the audience noted above (i.e. self-selected professional and active individuals focused around London), I did observe an interesting bias in the distribution of results over time. I issued the survey 4 times over 1 week. At first I received clear evidence (approx 40 responses) that Python 3.4+ was used significantly at work. I wrote a second email to our meetup group strongly requesting more submissions and quickly this climbed to 200+. In this second tranche the dominance of Python 3.4 dropped and 2.7 edged forwards. After a 3rd and 4th email it was clear that Python 2.7 is dominant. Possibly the early responders are more vocal in their Python 3.4+ support?

Improving the questions: If I were to run the survey again (and if you want to run one like it) – I’d suggest changing the “Which distribution do you use at work?” question to “Which distribution do you use the majority of the time?” (removing ‘work’). I’d probably also add a new question with a short “Which industry do you primarily work in?” list of radio-boxes (e.g. Finance, Biotech, Gaming, Retail, Hardware, …). I think the break-down of Python 2.7 users against Industry might be interesting.


Ian applies Data Science as an AI/Data Scientist for companies in ModelInsight, sign-up for Data Science tutorials in London. Historically Ian ran Mor Consulting. He also founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

37 Comments | Tags: Data science, Life, pydata, Python

10 May 2016 - 22:43
PyDataLondon 2016 Conference Write-up

We’ve just run our 3rd PyDataLondon Conference (2016) – 3 days, 4 tracks, 330 people. This builds on PyDataLondon 2015. It was ace! If you’d like to be notified about PyDataLondon 2017 then join this announce list (it’ll be super low volume like it has been for the last 2 years).

Big thanks to the organizers, sponsors and speakers, such a great conference it was. Being super tired going home on the train, but it was totally worth it. – Brigitta

We held it at Bloomberg UK again – many thanks to our hosts! I’d also like to thank my colleagues, review committee and all our volunteers for their hard work, the weekend went incredibly smoothly and that’s because our team is so on-top-of-everything – thanks!

Our keynote speakers were:

Our videos are being uploaded to YouTube. Slides will be linked against each author’s entry. There are an awful lot of happy comments on Twitter too. Our speakers covered Python, Julia, R, MCMC, clustering, geodata, financial modeling, visualisation, deployment, pipelines and a whole lot more. I spoke on Statistically Solving Sneezes and Sniffles (a citizen science project using ML to try to diagnose the causes of Rhinitis). Our Beginner Bootcamp (led by Conrad) had over 50 attendees!

…Let me second that. My first PyData also. It was incredible. Well organised – kudos to everyone who helped make it happen; you guys are pros. I found Friday useful as well, are the meetups like that? I’d love to be more involved in this community. –  lewis

We had two signing sessions for five authors with a ton of free books to give away:

  • Kyran Dale – Data Visualisation with Python and Javascript (these were the first copies in the UK!)
  • Amit Nandi – Spark for Python Developers
  • Malcolm Sherrington – Mastering Julia
  • Rui Miguel Forte – Mastering Predictive Analytics with R
  • Ian Ozsvald (me!) – High Performance Python (now in Italian, Polish and Japanese)

 

Some achievements

  • We used Slack for all members at the conference – attendees started side-channels to share tutorial files, discuss the meets and recommend lunch venues (!)
  • We added an Unconference track (7 blank slots that anyone could sign-up for on the day), this brought us a nice random mix of new topics and round-table discussions
  • A new bioinformatics Slack channel is likely to be formed due to collaborations at the conference
  • We signed up a ton of new volunteers to help us next year (thanks!)
  • An impromptu jobs board appeared on a notice board and was rapidly filled (if useful – also see my jobs list)

Thank you to all the organisers and speakers! It’s been my first PyData and it’s been great! – raffo

We had 15-20% female attendance this year, a slight drop on last year’s numbers (we’ll keep working to do better).

On a personal note it was great to see colleagues who I’ve coached in the past – especially as some were speaking or were a part of our organising committee.

With thanks to our sponsors and via ticket sales we raised more money this year for the NumFOCUS non-profit that backs the scientific Python stack (they give grants and stipends for contributors). We’d love to have more sponsors next year (this is especially useful if you’re hiring!). Thanks to:

Let me know if you do a write-up so I can link it here please:

If you’d like to hear about next year’s event then join this announce list (it’ll be super low volume). You probably also want to join our PyDataLondon meetup.

There are other upcoming PyData conferences including Berlin, Paris and Cologne. Take a look and get involved!

As an aside – if your data science team needs coaching, do drop me a line (and take a look at my coaching testimonials on LinkedIn). If you want a job in data science, take a look at my London Python data science jobs list.



31 Comments | Tags: Data science, Life, pydata, Python

7 May 2016 - 15:04
Statistically Solving Sneezes and Sniffles – a Work in Progress Report at PyDataLondon 2016

This is a Work in Progress report, presented this morning at my PyDataLondon 2016 conference. A group of 4 of us are modelling a year’s worth of self-reported data from my wife around her allergies – we’re learning to model which environmental conditions cause her sneezes such that she might have more control over her antihistamine use. Join the email updates list for low-volume updates about this project.

I really should have warned my audience that I was about to photograph them (honest – they seemed to enjoy the talk!):

Emily created the Allergy Tracker (open src) iPhone app a year ago, she logs every sneeze, antihistamine, alcoholic drink, runny nose and more. She’s sneezed for 20 years and by heck, we wondered if we could apply some Data Science to the problem to see if her symptoms correlate with weather, food and pollution. I’m pleased to say we’ve made some progress – it looks like humidity is connected to her propensity to use an antihistamine.

This talk (co-presented with Giles Weaver) discusses the data, the app, our approach to analysis and our tools (including Jupyter, scikit-learn, R, Anaconda and Seaborn) to build a variety of machine learned models to try to model antihistamine usage against external factors. Here are the slides:

Now we’re moving forward to a couple of other participants (we’d like a few more to join us – if you’re on iOS and in London and can commit to 3 months consistent usage we’ll try to tell you what drives your sneezes). We also have academic introductions so we can validate our ideas (and/or kick them into the ground and try again!).

This is the second full day of the conference – we have 330 attendees and we’ve had 2 great keynote speakers and a host of wonderful talks and tutorials (yesterday). Tonight we have our conference party. I’m super happy with how things are progressing – many thanks to all of our speakers, volunteers, Bloomberg and our sponsors for making this work so well.

Update – featured in Mode Analytics #23.



17 Comments | Tags: Data science, pydata, Python

25 January 2016 - 21:27
PyDataLondon 2016 Call for Proposals Open

Our Call for Proposals for PyDataLondon 2016 (May 6-8) is open until approx. end of February (5ish weeks), you need to get your submission in soon!

If you want to sponsor to talk with 330 cutting edge data scientists – you’d better hurry, we’ve already started signing deals.

In the CfP we’re looking for:

  • Stories about successful data science projects (including the highs and lows)
  • Machine learning (including Deep Learning) – especially why you used certain algorithms and how you diagnosed features
  • Visualisation – have you explained or explored something that’s good to share?
  • Data cleaning
  • Data process (getting data, understanding it, building models, deploying solutions)
  • Industrial and Academic stories
  • Big data including Spark

You might also be interested in PyDataAmsterdam on March 12-13th (their Call for Proposals is already open).

We’ve also got a new (temporary URL) webpage for our regular meetups here, this has notes on how to submit a talk to the meetup (not the conference, just the PyDataLondon meetup). Please take a look if you’d like to speak to 200 folk at our monthly meetup.



25 Comments | Tags: Data science, pydata, Python

10 January 2016 - 23:08
Announcing PyDataLondon 2016 (May 6-8th)

We’re very happy to announce that Bloomberg will host us a second time for PyDataLondon 2016 (our 3rd annual conference). We’ll run the conference over May 6-8th (a tutorial day and 2 conference days as last time) with approximately 330 people in attendance. The location is Central London – near Bank underground station and London Bridge.

Our PyDataLondon meetup community has grown amazingly in the last year, we’ve almost doubled in size to 2,500+ members with 200 in the room each month. We’ve had 19 events in almost 2 years, mostly around Python (some with R, Julia and Matlab), mostly on data science (and stats, visualisation and high performance) and all with a lovely collaborative audience.

The conference Call for Proposals will be opened very soon (in a week or two). If you’d like to speak in front of 330 active data scientists in London’s most active data science community, get thinking on your topic. We’re interested in data science topics, mostly around Python (but we’re cool with other tech and theory). Extra attention will be paid to talks offering real-world stories (for both success and failure – all lessons are equally useful).

Sign-up to this email announce list to be kept in the loop, I’ll write a couple of mails when the CfP is open and as the conference plans develop.

If you’ve not been to one of our conferences before, check out my write-ups from 2015 and 2014.

If you’re hiring or you have a relevant product – think on sponsoring. We expect to sell all of our spots this year due to increased demand for strong data scientists – if you’d like to have a prime spot in the central room (all the talk-rooms hang off of the central room so sponsors are in the thick of it), do get in contact.

You might also be interested in PyDataAmsterdam on March 12-13th (their Call for Proposals is already open).



20 Comments | Tags: Data science, pydata, Python

7 December 2015 - 23:59
“Data Science Delivered” (a collection of notes on getting stuff shipped)

Over the last year I’ve given a collection of keynotes and talks around shipping and supporting data science products with Python. I’ve started to gather up my notes into a document – they’re hosted on github as Data Science Delivered, and currently it’s around 5 pages of A4. I put the rough form together after my last keynote of the year in Budapest.

Right now it has notes on how to approach a new project, ways of dealing with bad data, ways to ship working products and ways projects might get sunk.

I’m slowly going to add to this list, I think the rough structure is in place and there’s a lot of detail to add. If you’re interested in getting updates then add your email here and I’ll mail you on occasion when I’ve added a new chunk of information.



11 Comments | Tags: Data science, pydata, Python

14 October 2015 - 13:21
Opening Plenary at BudapestBI Forum 2015

I’ve just given my final talk for the year – I’m “at my other home” in Budapest (I’m half-Hungarian) and have had the honour of opening Bence and team’s BudapestBI Forum 2015. This conference has both an open-source-day and (tomorrow) an enterprise-day, all around analytics and with lots of Python and R.

This talk is an iteration of my previous Shipping talks. It is backed in part by results from our latest PyDataLondon survey of 2,000 members, where we asked about member frustrations; I’ve integrated some of the results into this talk:

Shipping Data Science Products
(source)

Here are my slides:

In the room we had roughly 2/3 ‘engineers/builders’ and 1/3 ‘researchers/analysts’, it seems that Python and R are used by a large number of folk here today.

I also ‘released’ a set of my notes that I’ve tentatively entitled “Data Science Delivered” – this is a github doc with a series of the notes that I wish I’d learned years ago. Right now these notes are super-rough, I figure “release early, release often” will help me refine these.

It is based in part on my talking, teaching and coaching over the last couple of years. I intend to add more in the next couple of weeks (so hopefully by November 2015 it’ll be far less rough!), and I’d like to add some Notebooks as examples. You’re welcome to post bugs/requests and I’ll try to add notes, if I know about those areas. Please feel free to share some of your experiences (via @ianozsvald, via email, via Bugs etc).



8 Comments | Tags: Data science, Life, pydata, Python

20 September 2015 - 17:23
“Ship Data Science Products!” at PyConUK2015

PyConUK2015 is over, it was another year of happy Pythonistic hobbitness in Coventry. I spoke on shipping data science products on the new Science track (organised by Sarah):

It was nice to hear some polite-abuse being thrown at folk stuck on Python 2.x reminding them that it is high time to upgrade to Python 3. Propaganda was given away to support this move.

Obviously I plugged PyDataLondon and our upcoming meetups – if you like data science then come along to our meetups.



8 Comments | Tags: Data science, Life, pydata, Python

28 August 2015 - 11:27
EuroSciPy 2015 and Data Cleaning on Text for ML (talk)

I’m at EuroSciPy 2015, we have 2 days of Pythonistic Science in Cambridge. Next year will be in Bavaria, you can sign-up for announces.

EuroSciPy 2015

I spoke in the morning on Data Cleaning on Text to Prepare for Data Analysis and Machine Learning (which is a terribly verbose title, sorry!). I’ve just covered 10 years of lessons learned working with NLP on (often crappy) text data, and ways to clean it up to make it easy to work with. Topics covered:

  • decoding bytes into unicode (including chardet, ftfy, chromium language detector) to step past the UnicodeDecodeError
  • validating that a new dataset looks like a previous+trusted dataset (I’m thinking of writing a tool for this – would that be useful to you?)
  • automatically transforming data from “what I have” to “what I want” with annotate.io without writing regexps (now public)!
  • manual approaches to normalisation (the stuff I do that started me thinking on annotate.io)
  • visualisation with GlueViz, Seaborn and csv-fingerprint
  • starting your first ML project
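As a small illustration of the first topic (stepping past the UnicodeDecodeError), here is a stdlib-only Python 3 sketch that tries a short list of candidate encodings in order. It is not the approach from the talk, which leaned on chardet and ftfy to guess and repair encodings; the `decode_bytes` helper and its candidate list are my own assumptions.

```python
def decode_bytes(raw, encodings=("utf-8", "cp1252")):
    """Decode `raw` with the first candidate encoding that works.

    Returns a (text, encoding_used) tuple. latin-1 maps every byte to a
    character, so it serves as a last resort that cannot fail (though
    the resulting text may be wrong).
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return raw.decode("latin-1"), "latin-1"


samples = [
    "na\u00efve".encode("utf-8"),   # well-formed UTF-8
    "na\u00efve".encode("cp1252"),  # legacy Windows bytes, invalid as UTF-8
]
for raw in samples:
    text, used = decode_bytes(raw)
    print(repr(raw), "->", repr(text), "decoded as", used)
```

Trying encodings in a fixed order is crude: a wrong encoding can decode without error and silently give mojibake, which is exactly the case where a detector or a repair tool earns its keep.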

Here are the slides:

 

Thanks to Enthought and the org-team for a lovely conference!



14 Comments | Tags: Data science, pydata, Python

1 August 2015 - 21:57
PyConUK and the Science Track

PyConUK is in its 9th year and this year it’ll host its first Science Track aimed at scientists (not “data scientists” but real lab-coat-wearing scientists). I’m speaking in that track, yay (“Ship Data Science Products!”)! This track is part of the main conference, it all runs during September 19-21. Here’s a tiny reminder from the first 2007 event.

If you’d like to learn about Python’s role in helping researchers with their work, enabling reproducible research and the spread of digital literacy in the sciences, you should attend this track. This track can be attended for just £99 (without attending the rest of the conference), which is a bit of a steal given you’ll get 3 days of great networking and learning.

The Software Sustainability Institute is involved and PyConUK is looking for sponsors, this is a great way to spread your message into a scientific community and to over 300 attendees. For details you should contact PyConUK directly (pyconuk-sponsorship@python.org).

Other speakers include members of PyDataLondon (I’m a co-org) and the wider UK Python community.



9 Comments | Tags: Data science, pydata, Python