About


This is Ian Ozsvald's blog (@IanOzsvald). I'm an entrepreneurial geek, a Data Science/ML/NLP/AI consultant, founder of the Annotate.io social media mining API, author of O'Reilly's High Performance Python book, co-organiser of PyDataLondon, co-founder of the SocialTies App, author of the A.I.Cookbook and The Screencasting Handbook, a Pythonista, co-founder of ShowMeDo and FivePoundApps, and a Londoner. Here's a little more about me.


23 September 2016 - 12:28 Practical ML for Engineers talk at #pyconuk last weekend

Last weekend I had the pleasure of introducing Machine Learning for Engineers (a practical walk-through, no maths) [YouTube video] at PyConUK 2016. Each year the conference grows while maintaining a lovely vibe; this year it was up to 600 people! My talk gave a practical guide to a 2-class classification challenge (Kaggle's Titanic) with scikit-learn, backed by a longer Jupyter Notebook (github) and further backed by Ezzeri's 2-hour tutorial from PyConUK 2014.

Debugging slide from my talk (thanks Olivia)


Topics covered include (a short code sketch follows the list):

  • Going from raw data to a DataFrame (notable tip – read Katharine’s book on Data Wrangling)
  • Starting with a DummyClassifier to get a baseline result (everything you do from here should give a better classification score than this!)
  • Switching to a RandomForestClassifier, adding Features
  • Switching from a train/test set to a cross validation methodology
  • Dealing with NaN values using a sentinel value (robust for RandomForests, doesn’t require scaling, doesn’t require you to impute your own creative values)
  • Diagnosing quality and mistakes using a Confusion Matrix and looking at very-wrong classifications to give you insight back to the raw feature data
  • Notes on deployment
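
To make that flow concrete, here's a minimal sketch of the baseline-then-improve loop, assuming a local copy of Kaggle's Titanic CSV and the current scikit-learn API – the file name and feature choice are illustrative, not lifted from the talk:

import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("titanic.csv")  # hypothetical local copy of the Kaggle data
X = df[["Age", "Fare", "Pclass"]].fillna(-1)  # -1 as a NaN sentinel (fine for forests)
y = df["Survived"]

# everything you do after this baseline should score higher
baseline = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=5)
forest = cross_val_score(RandomForestClassifier(n_estimators=100), X, y, cv=5)
print("baseline {:.2f}, forest {:.2f}".format(baseline.mean(), forest.mean()))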

I had to cover the above in 20 minutes; obviously that was a bit of a push! I plan to give this talk again at regional meetups, probably with 30-40 minutes. As it stands the talk (github) should lead you into the Notebook and that'll lead you to Ezzeri's 2-hour tutorial. This should be enough to help you start on your own 2-class classification challenge, if your data looks 'somewhat like' the Titanic data.

I’m generally interested in the idea of helping more engineers get into data science and machine learning. If you’re curious – I have a longer set of notes called Data Science Delivered and some vague plans to maybe write a book (maybe) – for the book join the mailing list here if you’d like to hear more (no hard sell, almost no emails at the moment, I’m still figuring out if I should do this).

You might also want to follow-up on Katharine Jarmul's data wrangling talk and tutorial, Nick Radcliffe's Test Driven Data Analysis (with new automated TDD-for-data tool to come in a few months), Tim Vivian-Griffiths' SVM Diagnostics, Dr. Gusztav Belteki's Ventilator medical talk, Geoff French's Deep Learning tutorial and Marco Bonzanini and Miguel's Intro to ML tutorial. The videos are probably in this list.

If you like the above then do think on coming to our monthly PyDataLondon data science meetups near London Bridge.

PyConUK itself has grown amazingly – the core team put in a huge amount of effort. It was very cool to see the growth of the kids' sessions, the trans track, all the tutorials and the general growth in the diversity of our community's membership. I was quite sad to leave at lunch on the Sunday – next year I plan to stay longer; this community deserves more investment. If you've yet to attend a PyConUK then I strongly urge you to think on submitting a talk for next year and definitely suggest that you attend.

The organisers were kind enough to let Kat and myself do a book signing; I suggest other authors think on joining us next year. Attendees love meeting authors and it is yet another activity that helps bind the community together.

Book signing at PyConUK



Ian applies Data Science as an AI/Data Scientist for companies in ModelInsight; sign up for Data Science tutorials in London. Historically Ian ran Mor Consulting. He also founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

9 Comments | Tags: Data science, pydata, Python

19 August 2016 - 18:47 Some notes on building a conda recipe

I’ve spent the day building a conda recipe, the process wasn’t super-smooth, hopefully these notes will help others and/or maybe you can leave me a comment to improve my flow. The goal was to learn how to use conda to distribute a package that ordinarily I’d put on PyPI.

I’m using Linux 64bit (Mint 18 on an XPS 9550), conda 4.1 and conda build 1.21.14 (up to date as of today). My goal was to build a recipe (python_template_with_config_recipe) to install my python_template_with_config (a bit of boilerplate code I use sometimes when making a new project). That template has two modules, a test, a setup.py and it depends on numpy.

The short story:

  1. git clone https://github.com/ianozsvald/python_template_with_config_recipe.git
  2. cd inside, run "conda build --debug ."
  3. # note the period means “current directory” and that's two dashes before debug
  4. a local bzip2 archive will be built and you'll see that the one test ran OK

On my machine the built code ends up in “~/anaconda3/pkgs/python_template_with_config-0.1-py35_0/lib/python3.5/site-packages/python_template_with_config” and the building takes place in “~/anaconda3/conda-bld/linux-64”.

In a new conda environment I can use “conda install --use-local python_template_with_config” and it'll install the built recipe into the new environment.

To get started with this I first made a fresh empty conda environment (note that anaconda isn’t my default Python, hence the long-form access to `conda`):

  1. $ ~/anaconda3/bin/conda create -n new_recipe_env python=3.5
  2. $ . ~/anaconda3/bin/activate new_recipe_env

To check that my existing “setup.py” runs, I use pip to install from git; we'll need the setup.py in the conda recipe later, so we want to confirm that it works:

  • $ pip install git+https://github.com/ianozsvald/python_template_with_config.git # runs setup.py
  • # $ pip uninstall python_template_with_config # use this if you need to uninstall whilst developing

I can check that this has installed as a module using:

In [1]: from python_template_with_config import another_module
In [2]: another_module.a_math_function() 
# silly function just to check that numpy is installed
Out[2]: -2.4492935982947064e-16

Now I’ll make a second conda environment to develop the recipe:

  1. $ ~/anaconda3/bin/conda create -n new_recipe_env2 python=3.5 # vanilla environment, no numpy
  2. $ . ~/anaconda3/bin/activate new_recipe_env2
  3. git clone https://github.com/ianozsvald/python_template_with_config_recipe.git
  4. cd inside, run "conda build --debug ."

The recipe (meta.yaml) will look at the git repo for python_template_with_config, pull down a copy, build using build.sh and then store a bzip2 archive locally. The build step also notes that I can upload this to Anaconda using `$ anaconda upload /home/ian/anaconda3/conda-bld/linux-64/python_template_with_config-0.1-py35_0.tar.bz2`.
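
For reference, a minimal meta.yaml for a recipe like this might look as follows – a hedged sketch based on the conda-build docs rather than a copy of my actual file, so treat the field values as illustrative:

# sketch of a minimal meta.yaml (values illustrative)
package:
  name: python_template_with_config
  version: "0.1"

source:
  git_url: https://github.com/ianozsvald/python_template_with_config.git

requirements:
  build:
    - python
    - setuptools
  run:
    - python
    - numpy

test:
  imports:
    - python_template_with_config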

A few caveats occurred whilst creating the recipe:

  • You need a bld.bat, build.sh and meta.yaml; at first I created bld.sh and meta.yml (both typos) and there were no complaints, just frustration on my part – the first clue was seeing “source tree  in: /home/ian/anaconda3/conda-bld/work \n number of files: 0” in the build output
  • When running conda build it seems to not overwrite the version in ~/anaconda3/pkgs/ – I ended up deleting “python_template_with_config-0.1-py35_0/” and “python_template_with_config-0.1-py35_0.tar.bz2” by hand just to make sure on each build iteration – I must be missing something here, please enlighten me (see note from Marco below)
  • Having deleted the cached versions and fixed the typos I’d later see “number of files: 14”
  • Later I added “run_tests.py” rather than “run_test.py”; I knew it wasn't being executed as I'd added a “1/0” line inside it, which should raise a ZeroDivisionError even if the tests themselves passed. Again, this was a typo on my part
  • The above is tested on Linux, it ought to work on Windows but I’ve not tested it
  • This meta.yaml installs from github, there’s a commented out line in there showing how to access the local source files instead

Marco Bonzanini has noted that “conda clean”, “conda clean -t” (tarballs) and “conda clean -p” (packages) can help with the caching issue mentioned above. He also notes that “conda skeleton <pypi package url>” takes care of the boilerplate if you have a published version on PyPI, so that avoids the silly mistakes I made by hand. Cheers!

I didn’t get as far as uploading this to Anaconda to make it ‘public’ (as I don’t think that’s so useful) but I believe that final step is easy enough.

Useful docs:



13 Comments | Tags: Data science, Python

20 June 2016 - 8:56 Results for “Which version of Python (2.x vs 3.x) do London Data Scientists use?”

Over the last week I’ve surveyed my PyDataLondon meetup community (3,400+ members) to ask “Which version of Python do you use at work and at home?”. The goal is to gain evidence about which versions of Python are used by Data Scientists. This will help tool developers so they can make evidence-based decisions (e.g. this Dask discussion and another for h5py) about which versions of Python need support now and in the future.

Below I also discuss some business risks of sticking with Python 2.7. Of the 3,400+ members, over 466 (13%) responded to the 4 emails I sent. By 2020 (3.5 years from now) Python 2.7's support will end.

TL;DR Python 2.7 is still dominant for UK Data Scientists at work; Python 3.4 dominates outside of work; I hypothesise that 50% of London Data Scientists will be using Python 3.x by June 2017; business risks exist for companies who lack a 2.7->3.x migration plan.

Survey results:

At work Python 2.7 dominates (58%) for PyDataLondon members. Python 3.4+ is used by 33% of our respondents (including me).

Version of Python at work

Outside of work Python 3.4+ dominates 2.7 by a small margin (the majority of home users choose Python 3.5 (37%) over 3.4 (12%)). For work and home usage Python versions <=3.3 and 2.6 are used by approximately 2% of respondents each, nobody uses <= 2.5. Separately I know at least 2 members at our meetups who have noted that they use Python 2.4 (so that’s at least 2 in 3,400 members).

Version of Python at home


The more interesting outcome is “If you use Python 2.7 – do you expect to be using Python 3.x within a year?”. 25% of respondents are using Python 2.7 and do expect to upgrade. 36% are already on Python 3.x. 38% expect to still be using Python 2.7 in a year. Of the aspirational 25% who believe they’ll upgrade, I suspect that at least half of these will have upgraded within a year.

Hypothesis – if I survey again in June 2017 we’ll see Python 3.x usage at 50% of the PyDataLondon community.

Will I upgrade to Python 3.x

When asked about choice of distribution, Continuum's Anaconda is the clear winner. A significant number of users still use their Operating System's default Python.

Which distribution at work

Edit – I did have a question about choice of Operating System but I'd left it as a multiple-choice rather than a single-choice question. Since the results were hard to interpret I've removed that result.

The above results mirror the findings in Randal Olson's 2014 and 2013 surveys. There are a couple of related surveys too (early 2015 for scientific Python users, and 2013).

There’s a final question on “Anything else you’d like to add?”. Some users note that they are fixed to 2.7 for the time being due to a large legacy code-base. This sort of theme “Current python use at work is around 60% 2.7 and 40% 3.4 .. this ratio is continuously moving towards 3.4 as most new things are in 3.4+.” and “Just made jump to Py3, still have a body of legacy running under 2” occurred through-out the comments. Nobody commented that they’d moved backwards and nobody ruled-out upgrading (though some said that they were in no hurry).

Business risk: A few newer tools are only written for Python 3.4+ and are unlikely to be back-ported. Some established projects (e.g. IPython/Jupyter) are moving their next development versions to Python 3.4+ and keeping Python 2.7 for the current branch as a move towards discontinuing Python 2.7 support. There's an increasing risk that Python 2.7-based Data Scientists will see newer tools appear around them for Python 3.4+ which won't fit into their development chain. I've made notes on this before. If your business uses 2.7 you should at least have a plan for strong unit-test coverage, and new code should be written to be 3.4-compatible, to ease your journey into 3.x+.

Advice to developers of new packages in 2016: If you're not worried about losing some of your potential users, you might focus just on Python 3.4+; you'll lose around 60% of your potential userbase today, but this will move in your favour fairly quickly over the next 2 years. You might want to invest time in cross-compatibility using tools like __future__ and six if supporting Python 2.7 isn't too complicated – the burden is heavier if you're doing text processing and web-based data processing (as the bytes/str/unicode distinctions induce more pain). You probably shouldn't focus solely on Python 2.7 as the trend is against you.
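
As a hedged illustration of that cross-compatibility style (the function is invented for this post, not taken from any particular project):

from __future__ import division, print_function, unicode_literals
import six

def describe(value):
    # six.string_types covers str on 3.x and str/unicode on 2.x
    if isinstance(value, six.string_types):
        return "text: " + value
    # true division behaves identically on 2.7 and 3.x thanks to __future__
    return "half: {}".format(value / 2)

print(describe("hello"), describe(5))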


If you run a community group then maybe you’d like to make a survey like this?

I used SurveyMonkey; it is free if you have <100 respondents, but I had to buy a monthly plan to access these results. Here are some notes:

Community surveyed: London Python-using Data Scientists who are members of PyDataLondon; these are mainly industrial users (40% PhD, 40% MSc, the majority self-identify as being Practicing Data Scientists), some are academics. In the UK we've had various Python communities grow over the years including PyConUK (2007+), the London Financial Python Usergroup (2009-2014) and the London Python Usergroup (2010+). Our PyDataLondon is 3 years old and is the largest active Python usergroup in Europe. The above results probably reflect (within a margin of error) the general state of Python-using Data Scientists in the UK.

Bias: Accepting the demographics of the audience noted above (i.e. self-selected professional and active individuals focused around London), I did observe an interesting bias in the distribution of results over time. I issued the survey 4 times over 1 week. At first I received clear evidence (approx 40 responses) that Python 3.4+ was used significantly at work. I wrote a second email to our meetup group strongly requesting more submissions and quickly this climbed to 200+. In this second tranche the dominance of Python 3.4 dropped and 2.7 edged forwards. After a 3rd and 4th email it was clear that Python 2.7 is dominant. Possibly the early responders are more vocal in their Python 3.4+ support?

Improving the questions: If I were to run the survey again (and if you want to run one like it) – I'd suggest changing “Which distribution do you use at work?” to “Which distribution do you use the majority of the time?” (removing ‘work’). I'd probably also add a short “Which industry do you primarily work in?” question with a list of radio-boxes (e.g. Finance, Biotech, Gaming, Retail, Hardware, …). I think the break-down of Python 2.7 users against industry might be interesting.



37 Comments | Tags: Data science, Life, pydata, Python

10 May 2016 - 22:43 PyDataLondon 2016 Conference Write-up

We’ve just run our 3rd PyDataLondon Conference (2016) – 3 days, 4 tracks, 330 people.This builds on PyDataLondon 2015. It was ace! If you’d like to be notified about PyDataLondon 2017 then join this announce list (it’ll be super low volume like it has been for the last 2 years).

Big thanks to the organizers, sponsors and speakers, such a great conference it was. Being super tired going home on the train, but it was totally worth it. – Brigitta

We held it at Bloomberg UK again – many thanks to our hosts! I’d also like to thank my colleagues, review committee and all our volunteers for their hard work, the weekend went incredibly smoothly and that’s because our team is so on-top-of-everything – thanks!

Our keynote speakers were:

Our videos are being uploaded to YouTube. Slides will be linked against each author’s entry. There are an awful lot of happy comments on Twitter too. Our speakers covered Python, Julia, R, MCMC, clustering, geodata, financial modeling, visualisation, deployment, pipelines and a whole lot more. I spoke on Statistically Solving Sneezes and Sniffles (a citizen science project using ML to try to diagnose the causes of Rhinitis). Our Beginner Bootcamp (led by Conrad) had over 50 attendees!

…Let me second that. My first PyData also. It was incredible. Well organised – kudos to everyone who helped make it happen; you guys are pros. I found Friday useful as well, are the meetups like that? I’d love to be more involved in this community. –  lewis

We had two signing sessions for five authors with a ton of free books to give away:

  • Kyran Dale – Data Visualisation with Python and Javascript (these were the first copies in the UK!)
  • Amit Nandi – Spark for Python Developers
  • Malcolm Sherrington – Mastering Julia
  • Rui Miguel Forte – Mastering Predictive Analytics with R
  • Ian Ozsvald (me!) – High Performance Python (now in Italian, Polish and Japanese)


Some achievements

  • We used Slack for all members at the conference – attendees started side-channels to share tutorial files, discuss the meets and recommend lunch venues (!)
  • We added an Unconference track (7 blank slots that anyone could sign-up for on the day), this brought us a nice random mix of new topics and round-table discussions
  • A new bioinformatics Slack channel is likely to be formed due to collaborations at the conference
  • We signed up a ton of new volunteers to help us next year (thanks!)
  • An impromptu jobs board appeared on a notice board and was rapidly filled (if useful – also see my jobs list)

Thank you to all the organisers and speakers! It’s been my first PyData and it’s been great! – raffo

We had 15-20% female attendance this year, a slight drop on last year’s numbers (we’ll keep working to do better).

On a personal note it was great to see colleagues who I’ve coached in the past – especially as some were speaking or were a part of our organising committee.

With thanks to our sponsors and via ticket sales we raised more money this year for the NumFOCUS non-profit that backs the scientific Python stack (they give grants and stipends for contributors). We’d love to have more sponsors next year (this is especially useful if you’re hiring!). Thanks to:

Let me know if you do a write-up so I can link it here please:

If you’d like to hear about next year’s event then join this announce list (it’ll be super low volume). You probably also want to join our PyDataLondon meetup.

There are other upcoming PyData conferences including Berlin, Paris and Cologne. Take a look and get involved!

As an aside – if your data science team needs coaching, do drop me a line (and take a look at my coaching testimonials on LinkedIn). If you want a job in data science, take a look at my London Python data science jobs list.



31 Comments | Tags: Data science, Life, pydata, Python

7 May 2016 - 15:04 Statistically Solving Sneezes and Sniffles – a Work in Progress Report at PyDataLondon 2016

This is a Work in Progress report, presented this morning at our PyDataLondon 2016 conference. A group of 4 of us are modelling a year's worth of self-reported data from my wife around her allergies – we're learning to model which environmental conditions cause her sneezes, such that she might have more control over her antihistamine use. Join the email updates list for low-volume updates about this project.

I really should have warned my audience that I was about to photograph them (honest – they seemed to enjoy the talk!):

Emily created the Allergy Tracker (open src) iPhone app a year ago; she logs every sneeze, antihistamine, alcoholic drink, runny nose and more. She's sneezed for 20 years and, by heck, we wondered if we could apply some Data Science to the problem to see if her symptoms correlate with weather, food and pollution. I'm pleased to say we've made some progress – it looks like humidity is connected to her propensity to use an antihistamine.

This talk (co-presented with Giles Weaver) discusses the data, the app, our approach to analysis and our tools (including Jupyter, scikit-learn, R, Anaconda and Seaborn) to build a variety of machine learned models to try to model antihistamine usage against external factors. Here are the slides:

Now we’re moving forward to a couple of other participants (we’d like a few more to join us – if you’re on iOS and in London and can commit to 3 months consistent usage we’ll try to tell you what drives your sneezes). We also have academic introductions so we can validate our ideas (and/or kick them into the ground and try again!).

This is the second full day of the conference – we have 330 attendees and we’ve had 2 great keynote speakers and a host of wonderful talks and tutorials (yesterday). Tonight we have our conference party. I’m super happy with how things are progressing – many thanks to all of our speakers, volunteers, Bloomberg and our sponsors for making this work so well.

Update – featured in Mode Analytics #23.



17 Comments | Tags: Data science, pydata, Python

29 February 2016 - 23:02 Will we see “[module] on Python 3.4+ is free but only paid-support for Python 2.7”?

I’m curious about the transition in our ecosystem from Python 2 to Python 3. On stage at our monthly PyDataLondon meetups I’m known to badger folk to take the step and upgrade to reduce the support burden on developers. The transition gathers pace but it still feels slow. I’ve noted my recommendations for moving to Python 3+ at the end. See also the reddit discussion.

I’m wondering – when will we see the point where open source projects say “We support Python 3.x for free but if you want bugs fixed for Python 2.7, you’ll have to pay“? I’m not saying “if”, but “when”. There’s already one example below and others will presumably follow.

In the last couple of years a slew of larger projects have dropped or are dropping support for Python 2.6 – numpy (some discussion), pandas, scipy, matplotlib, NLTK, astropy, dask, ipython, django, numba, twisted, scrapy. Good – Python 2.6 was deprecated when 2.7 was released in 2010 (that's 6 years ago!).

The position of the matplotlib and django teams is clearly “Python 2.7 and Python 3.4+”. Django states that Python 2.7 will be supported until the 2020 sunset date:

“As a final heads up, Django 1.11 is likely to be the last version to support Python 2.7 as it will be supported until the end of Python 2 upstream support in 2020. We’ve adopted a Python version support policy…”

We can expect the larger projects to support legacy userbases with a mix of Python 2.7 and 3.4+ for 3.5 years (at least until 2020). After this we should expect projects to start to drop 2.7 support, some (hopefully) more aggressively than others.

What about smaller projects? Several have outright dropped Python 2.7 support already – errbot (2016 is the last Python 2.7-supported year), nikola, python-thumbnails – or never supported it – wordfreq, featherweight. Which others have I missed? Update – JupyterHub (cheers Thomas) too. IPython 6.0 in 2017 will be Python 3.4+ only (IPython 5.0 is the last supported Python 2.7 release). mitmproxy has just switched to a Python-3-only branch (Sept 2016).

More interestingly David MacIver (of Hypothesis) stated a while back that he’d support Python 2.7 for free but Python 2.6 would be a paid support option. He’s also tagged (regardless of version) a bunch of bugs that can be fixed for a fee. Viewflow is another – Python 3.4 is free for non-commercial use but a commercial license or support for Python 2.7 requires a fee. Asking for money to support old, PITA or difficult options seems mightily sensible. I guess we’ll see this first for tools that have a good industrial userbase who’d be used to paying for support (like Viewflow).

Aaron Meurer (lead dev on SymPy) has taken the position that library leaders should pledge for a switch to Python 3.x only by 2020. The pledge shows that scikit-bio is about to go Python 3-only and that IPython 6.x+ will be Python 3 only (from 2017). Increasingly we’ll see new libraries adding the shiny features for their Python 3 branch only.

What next? I imagine most new smaller projects will be Python 3.4+ (probably 3.5+ only soon), they’ll have no legacy userbase to support. They could widen their potential userbase by supporting Python 2.7 but this window only exists for 3 years and those users will have to upgrade anyhow. So why bother going backwards?

Once users notice that cooler new toys are Python 3.4+ only they’ll want to upgrade (e.g. NetworKit is Python 3.3+ only for high volume graph network analysis). They’ll only hold back if they’re supporting legacy internal systems (which will be the case for an awful lot of people). We’ll see this more as we get closer to 2020. What about after 2020?

I guess many companies will be slow to jump to Python 3 (it’d take a lot of effort for no practical improvement), so I’d imagine separate groups will start to support Python 2.7 libraries as forks. Hopefully the main library developers will drop support fairly quickly, to stop open source (cost-free) developers having a tax on their time supporting both platforms.

Separate evidence – Drupal 6 adopted a commercial-support-for-old-versions policy (thanks @chx). It is also worth noting that Ubuntu 16.04 LTS ships without Python 2. Microsoft and Brett Cannon have discussed the benefits of moving to Python 3+ recently.

My recommendations (coming from a Senior Industrial Data Scientist with 15+ years commercial experience and 10+ years using Python):

  • Where feasible – all new projects must use Python 3.5+ (e.g. Proof of Concepts, Research, new isolated systems) – this is surprisingly easy
  • If Python 2.7 compatibility is required – write all new code in a Python 3.5+ compatible way (1, 2, 3, 4), make extensive tests for the later inevitable migration (you already have good test-coverage, right?)
  • Accept that support for Python 2.7 gets turned off in 3.5 years and that all Python 2.7 code written now will likely have to be migrated later (this is a business cost that you can estimate now)
  • Accept that as we get closer to 2020 more programmers (both new and experienced) will be using Python 3.5+, so support for Python 2.7-based libraries will inevitably decline (there’s a big business risk here)

Graham and I did a lightning talk on jumping to Python 3 six months back; there are a lot of new features in Python 3.4+ that will make your life easier (and make your code safer, so you burn less time hunting for problems). Jake also discussed the general problem for scientists back in 2013; it'll be lovely when we get past this (now-very-boring) discussion.
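
As a taster, here's a tiny invented example using two of those newer conveniences (nothing here is from the lightning talk):

from pathlib import Path  # new in 3.4: object-oriented filesystem paths

def total_log_lines(log_dir: str) -> int:  # PEP 484 type hints (3.5+) help tooling catch mistakes
    total = 0
    for path in Path(log_dir).glob("*.log"):
        with path.open() as f:
            total += sum(1 for _ in f)
    return total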



36 Comments | Tags: Data science, Python

7 February 2016 - 23:15 Convert London Oyster (Travel) PDFs to Pandas DataFrames

As a part of analysing Emily’s allergic rhinitis we want to test whether using the London Underground (notoriously dirty!) increases the likelihood of sneezing. The “black snot” phenomenon is well known to Londoners, possibly the particulates (from oil and metal) cause irritation. You can get updates via our allergic rhinitis analysis mailing list (very very low volume).

Transport for London lets us download a log of journeys – either as a CSV file (just dates and costs, no details) or a PDF file (containing full details of the journey and time). It would be much nicer if they made the data available in a cleanly-formatted open format (e.g. at least a CSV, preferably as HDF5).

The goal is to take the detail-rich PDFs and to build a DataFrame like:

                             from is_train                to
date                                                        
2016-01-30  Bus Journey, Route 46    False                  
2016-01-28           Kentish Town     True  Leicester Square
2016-01-28             Old Street     True      Kentish Town
2016-01-28       Leicester Square     True        Old Street
2016-01-27                  Angel     True      Kentish Town

Using textract (see these Python 3.4 install notes, I also use pdftotext) and a very hacky parser (written this evening, it really is a stateful-messy-hack <sorry>) I can parse a single PDF or a folder to build a Pandas DataFrame of journeys. You’ll find London Oyster PDF to DataFrame Parser here. The output is an HDF5 which can be loaded by Python into Pandas (or R or Matlab or whatever).
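
Once you have the HDF5 file, getting back to a DataFrame is a one-liner – a quick sketch (the file name and key are my illustrative guesses, check the repo for the real ones):

import pandas as pd

journeys = pd.read_hdf("oyster_journeys.h5", "journeys")  # hypothetical file/key names
train_trips = journeys[journeys["is_train"]]
print(train_trips["from"].value_counts().head())  # most frequent origin stations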



14 Comments | Tags: Data science, Python

25 January 2016 - 21:27 PyDataLondon 2016 Call for Proposals Open

Our Call for Proposals for PyDataLondon 2016 (May 6-8) is open until approximately the end of February (5ish weeks) – you need to get your submission in soon!

If you want to sponsor, to talk with 330 cutting-edge data scientists, you'd better hurry – we've already started signing deals.

In the CfP we’re looking for:

  • Stories about successful data science projects (including the highs and lows)
  • Machine learning (including Deep Learning) – especially why you used certain algorithms and how you diagnosed features
  • Visualisation – have you explained or explored something that’s good to share?
  • Data cleaning
  • Data process (getting data, understanding it, building models, deploying solutions)
  • Industrial and Academic stories
  • Big data including Spark

You might also be interested in PyDataAmsterdam on March 12-13th (their Call for Proposals is already open).

We’ve also got a new (temporary URL) webpage for our regular meetups here, this has notes on how to submit a talk to the meetup (not the conference, just the PyDataLondon meetup). Please take a look if you’d like to speak to 200 folk at our monthly meetup.



25 Comments | Tags: Data science, pydata, Python

12 January 2016 - 11:28 Data Scientist Jobs in London

Back in January 2015 I announced my Data Science Jobs UK email list. This has grown nicely – several hundred data scientists have joined it and are interested in (mostly) Python-related jobs around London, with an even split between contract and permanent roles. If you sign up to the mailing list you'll get:

  • 1-2 plain-ASCII mails a month with a summary of current jobs (typically 4-6), mostly focused around London
  • Sometimes the jobs are remote
  • Mostly they’re for Python but Matlab and R also come up

I manage the list; your email is never shared, and the list is run by MailChimp so you can easily unsubscribe. Active data scientists who attend PyDataLondon can post for free, others can post at a commercial rate (e.g. recruiters and folk in companies). I vet all the jobs to ensure they're relevant. Drop me an email if you've got a relevant job to share.

“After placing a contract ad on this list I was contacted by a number of high quality and enthusiastic data scientists, who all proposed innovative and exciting solutions to my research problem, and were able to explain their proposals clearly to a non-specialist; the quality of responses was so high that I was presented with a real dilemma in choosing who to work with”. – Hazel Wilkinson, Cambridge University

I put the list together to help local data scientists find more relevant jobs, feel free to dip in and out when it might be useful.



9 Comments | Tags: Data science, Python

11 January 2016 - 23:57 Allergic Rhinitis (“Why do I always sneeze?!”) research project using Machine Learning

Since April my wife (@fluffyemily) and I have been running a research project around her allergies. She sneezes all year and we're trying to figure out the cause. Allergic Rhinitis affects 10-30% of Westerners; in Emily's case it is all-year, so it isn't just pollen-related. We figure that a good data-collection process coupled with robust analysis might reveal some of the causes of sneezing, such that Emily's in better control of her Rhinitis.

Emily’s a senior iOS developer with Mozilla, she wrote an open source App for her iPhone to log her sneezes, antihistamine use and interactions with “things” like animals. The App gives us a time-stamp and geolocation. Since she’s mostly in London we’ve got a rich source of events to join to other datasets.

This post is just to put down a marker. I've made some progress using Machine Learning to predict when an antihistamine might be used. Currently I can out-predict a Dummy (majority-class) classifier over many cross-validation runs; this is hardly brilliant, but we didn't expect diagnosing a long-term allergy to be a simple affair! Exploratory data analysis on the data shows lots of interesting behaviours; I hope to talk about some of these in the future.

We’ve tried (and so far rejected) air-born particulates as a reason for her allergies via Kings College LondonAir data (thanks!). Weather data is more promising using a local wunderground station (Emily seems to be a little sensitive to humidity and windspeed). I’ve recently started work on MyFitnessPal logged data (the Python 3.4 port was thankfully easy) to start to look at alcohol (a known histamine modifier) and possibly other food.

Behind the scenes I've got a collaborative group (thanks Frank and Giles!) in Slack and a private github repo; I plan to talk a little on how this works. I think talking about ways we can collaborate on research projects has value – anything that helps us move on from just working in an office seems like a good idea.

If you’re interested in hearing updates about this project and maybe getting involved to log your own allergy data, join this email announce list. Your email will be kept private, I’ll just send you an email every now and again when we’ve made some progress (which will probably appear here) and when we need volunteers.

Ultimately we’d like to help predict the causes of allergies for other folk. We’ve been talking about this for around 2 years, it is encouraging to see research like this pointing to the use of ML to predict and model the body’s behaviours.



15 Comments | Tags: Data science, Life, Python