Entrepreneurial Geekiness

Ian is a London-based independent Chief Data Scientist who coaches teams, teaches and creates data products. More about Ian here.

On the growth of our PyDataLondon community

I haven’t written about our PyDataLondon meetup community in a while, so I figure a few numbers are due. We’re now at an incredible 7,800 members, and just this month we had 200 members in the room at AHL’s new venue. We’re a volunteer-run community – you’ll see the list of our brilliant volunteers here, along with their Twitter accounts.

I polled the attendees this month and a third of the hands went up (see below) in answer to the question “Who is a first-timer to this meetup?”. This shows continued growth in our community and in the wider London data science ecosystem. Welcome along!


One of our talks was on Pandas v1; it included an update by Marc on how Python 2.x support is being deprecated in Pandas next year, and on the new Cyberpandas and Fletcher libraries (dtype extensions, including faster strings). Marc also noted that Pandas is estimated to have 5-10 million users! One benefit of the internal Pandas updates will be the “UInt8” and related dtypes – we’ll have integers with NaNs for the first time ever (previously, int arrays with NaNs were promoted to floats, which have NaN support in NumPy).
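As a quick sketch of what the nullable dtypes enable (assuming a recent pandas with the extension-array API):

```python
import pandas as pd

# New extension dtype: an unsigned-integer array that can hold missing values
arr = pd.array([1, 2, None], dtype="UInt8")
print(arr.dtype)        # UInt8 - still an integer dtype
print(pd.isna(arr[2]))  # True - the missing value is preserved

# Classic behaviour for contrast: a missing value promotes ints to float64
s = pd.Series([1, 2, None])
print(s.dtype)          # float64
```

Note that older pandas releases without the extension dtypes will raise on the `"UInt8"` name.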

Given the continued growth of our ecosystem, we have more Python newbies and more data science newbies (including converts moving away from Excel and SPSS). We’re always looking for new speakers. New speakers don’t have to be experienced data scientists – a 5 minute lightning talk on how you are transitioning into this ecosystem from elsewhere can be hugely valuable to other new members. A talk (5 or 30 minutes) on a technique you’re experienced in – even if there’s no equivalent Python library – is also incredibly educational. Please come and share your knowledge.

Speaking raises your profile, and your employer’s profile too (if that’s what you’re after), which obviously helps with hiring. We continue after the meetup in the local pub (typically The Banker), so any speaker who ends with “and we’re hiring!” tends to have interesting conversations afterwards. With 200 attendees it isn’t hard to find folk who’d be interested in your role. Remember – this is most effective for speakers, as you have the entire audience’s attention. You’ll find instructions here on how to submit a talk.

AHL continue to support our open source PyData world (along with other open events like the London Machine Learning meetup). Each month they now rent a professional auditorium next to their building for us, with full hosting, mics on every chair and video recording (see them at PyDataTV) for speakers who consent. This isn’t cheap, of course, and it’s evidence of the growth of Python’s data science stack in the London financial community. AHL’s activity at the meetup is to say a few words before the break about who they’re hiring for, before everyone heads out for more beer. Thanks AHL for your continued support! You might also want to check their GitHub repo.

There’s a whole pile of PyData conferences coming up, you might find some are closer to you or your offices than you imagine. Go check the list. The PyData meetup ecosystem has grown world-wide to 111 events now too!

PyData of course is supported by NumFOCUS, the non-profit in the US. NumFOCUS backs a lot of our open source tools. They’re having a summit late this September in the US – everyone is welcome, if you’re interested in the deeper direction of Python and the Data Science community then you might want to attend (or send a representative from your group?).

Of course you might also want to be hired by a company that works in our PyData ecosystem. I post out jobs (UK-centric but they stretch to western Europe and sometimes to the US) every 2 weeks to 650+ data scientists and engineers, typically 7 roles (mostly permie, some contract, all Python focused). You might want to join that list (note your email is always kept private and is never shared). Attending PyData members (i.e. anyone who helps build our ecosystem) gets a first post gratis.


Ian is a Chief Interim Data Scientist via his Mor Consulting. Sign-up for Data Science tutorials in London and to hear about his data science thoughts and jobs. He lives in London, is walked by his high energy Springer Spaniel and is a consumer of fine coffees.

Keynote at EuroPython 2018 on “Citizen Science”

I’ve just had the privilege of giving my first keynote at EuroPython (and my second keynote this year), speaking on “Citizen Science”. I gave a talk aimed at engineers, showing examples of healthcare and humanitarian projects using Python that make the world a better place. The main point was “gather your data, draw graphs, start to ask questions” – this is something that anyone can do.

Last day. Morning keynote by @IanOzsvald (sp.) “Citizen Science”. Really cool talk! – @bz_sara

EuroPython crowd for my keynote

In the talk I covered 4 short stories and then gave a live demo in Jupyter Lab, graphing some audience-collected data:

  • Gorjan’s talk on Macedonian awful-air-quality from PyDataAmsterdam 2018
  • My talks on solving Sneeze Diagnosis given at PyDataLondon 2017, ODSC 2017 and elsewhere
  • Anna’s talk on improving baby-delivery healthcare from PyDataWarsaw 2017
  • Dirk’s talk on saving Orangutans with Drones from PyDataAmsterdam 2017
  • A Jupyter Lab demo on “guessing my dog’s weight”, crowd-sourcing guesses which we then investigated in the Lab

The goal of the live demo was to a) collect data (before and after showing photos of my dog) and b) show some interesting results that come out of graphing the results using histograms so that c) everyone realises that drawing graphs of their own data is possible and perhaps is something they too can try. Whilst having folk estimate my dog’s weight won’t change the world, getting them involved in collecting and thinking about data will, I hope, get more folk engaged outside of the conference.
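That first “draw a graph of your own data” step needs very little code – a minimal matplotlib sketch, using simulated guesses rather than the real survey data:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Simulated weight guesses standing in for the audience-collected data
rng = np.random.default_rng(42)
guesses = rng.lognormal(mean=2.6, sigma=0.5, size=440)

counts, bins, _ = plt.hist(guesses, bins=30)
plt.xlabel("Estimated weight (kg)")
plt.ylabel("Number of guesses")
plt.title("Audience estimates of the dog's weight")
plt.savefig("estimates.png")
```

Once you can do this for a silly dataset, you can do it for a serious one.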

The slides are here.

One of the audience members took some notes:

Here’s some output. Approximately 440 people participated in the two single-answer surveys. The first (poor-information estimate) is “What’s the weight of my dog in kg when you know nothing about the dog?” and the second (good-information estimate) is “The same, but now you’ve seen 8+ pictures of my dog”.

With poor information folk tended to go for the round numbers (see the spikes at 15, 20, 30, 35, 40). After the photos were shown the variance reduced (the talk used more graphs to show this), which is what I wanted to see. Ada’s actual weight is 17kg so the “wisdom of the crowds” estimate was off, but not terribly so and since this wasn’t a dog-fanciers crowd, that’s hardly surprising!

Before showing the photos, the median estimate was 12.85kg (mean 14.78kg) from 448 estimates. The 5% quantile was 4kg and the 95% quantile 34kg, so 90% of the estimates fell within a 30kg range.

After showing the photos, the median estimate was 12kg (mean 12.84kg) from 412 estimates. The 5% quantile was 5kg and the 95% quantile 25kg, so 90% of the estimates fell within a 20kg range.
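The medians and quantiles above come straight from numpy; on simulated guesses (the real survey data isn’t reproduced here) the calculation looks like:

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated stand-in for the 448 pre-photo estimates
guesses = rng.lognormal(mean=2.55, sigma=0.55, size=448)

median = np.median(guesses)
q5, q95 = np.percentile(guesses, [5, 95])
print(f"median {median:.2f}kg, 5% quantile {q5:.1f}kg, "
      f"95% quantile {q95:.1f}kg, 90% range {q95 - q5:.1f}kg")
```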

There were only a couple of guesses above 80kg before the photos were shown, and none after. A large, heavy dog can weigh over 100kg, so a guess that high, before knowing anything about my dog, was feasible.

Around 3% of my audience decided to test my CSV parsing code during my live demo (oh, the wags) with somewhat “tricky” values including “NaN”, “None”, “Null”, “Inf”, “∞”, “-15”, “⁴4”, “1.00E+106”, “99999999999”, “Nana”, “1337” (i.e. leet!), “1-30”, “+[[]]” (???). The “show the raw values in a histogram” cell blew up with this input but the subsequent cells (using a mask to select only a valid positive range) all worked fine. Ah, live demos.
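The defensive masking that kept the later cells alive can be sketched with `pd.to_numeric` – the exact cells in my Notebook differ, but the idea is:

```python
import pandas as pd

# A few of the "tricky" audience answers alongside some honest guesses
raw = pd.Series(["17", "12.5", "NaN", "None", "Inf", "-15", "99999999999", "leet"])

# Coerce anything unparseable to NaN rather than raising
weights = pd.to_numeric(raw, errors="coerce")

# Keep only a plausible positive range for a dog's weight;
# NaN, inf, negatives and absurdly large values all fail the mask
valid = weights[(weights > 0) & (weights < 80)]
print(valid.tolist())  # [17.0, 12.5]
```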

The slides conclude with two sets of links, one of which points the reader at open data sources you could use in your own explorations. Source code is linked on my GitHub.



“Creating correct and capable classifiers” at PyDataAmsterdam 2018

This weekend I got to attend PyDataAmsterdam 2018 – my first trip to the Netherlands (yay! It is lovely here). The conference grew on last year’s to 345 attendees, with over 20% female speakers.

In addition to attending some lovely talks, I also got to run another “Making your first open source contribution” session with James Powell. In 30 minutes, a couple of us fixed some typos in Nick Radcliffe’s tdda project to improve his overview documentation. I’m happy to have introduced a couple of new people to the idea that a “contribution” can start with a one-word typo fix or adding notes to an existing bug report, without diving into the possibly harder world of making a code contribution.

We also had Sergii along as our NumFOCUS representative (and Marc Garcia of the Pandas Sprints has done this before too). If you want to contribute to the community, you might consider talking to NumFOCUS about being an ambassador at a future conference.

I gave an updated version of my earlier presentation from PyDataLondon 2018; this time I spoke more on:

  • YellowBrick‘s ROC curves
  • SHAPley machine learning explanations
  • My earlier ideas on diagnosis using Pandas and t-SNE
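YellowBrick’s ROCAUC visualizer wraps the standard scikit-learn machinery; stripped of the plotting, the underlying ROC/AUC computation is roughly this (synthetic data for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for a real problem
X, y = make_classification(n_samples=500, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(solver="liblinear").fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

# False/true positive rates across all decision thresholds, then the area under that curve
fpr, tpr, _ = roc_curve(y_test, scores)
auc_value = auc(fpr, tpr)
print(f"AUC: {auc_value:.3f}")
```

YellowBrick draws the same curve for you, per class, with one `fit`/`score`/`show` call.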

I had a lovely room, wide enough that I only got a third of my audience in the shot below:

Audience at PyDataAmsterdam 2018
My audience at PyDataAmsterdam 2018

I’ve updated some of the material from my London talk; in particular, I’ve added a few slides on SHAPley debugging approaches to contrast with the ELI5 approach I used before. I’ll keep pushing the notion that we need to debug our ML models so we can explain to colleagues why they work (if we can’t, doesn’t that mean we just don’t understand the black box?).

It is lovely to get supportive feedback afterwards – thank you Ondrej and Tobias:

Here are the slides (the code has been added to my data_science_delivered github repo):

I’m really happy with the growth of our international community (we’re up to 100 PyData meetups now!). As usual we had 5 minute lightning talks at the close of the conference. I introduced the nbdime Notebook diff tool.

I’m also very pleased to say that I’ve had a lot of people come up to say Thanks after the talk. This is no doubt because I now highlight the amount of work done by volunteer conference organisers and volunteer speakers (almost everyone involved in running a PyData conference is an unpaid volunteer – organisers and speakers alike). We need to continue making it clear that contributing back to the open source ecosystem is essential, rather than just consuming from it. James and I gave a lightning talk on this right at the end.

Update – I’m very happy to see this tweet about how James’ and my little talk inspired Christian to land a PR. I’m also very happy to see this exchange with Ivo about potentially mentoring newer community members. I wonder where this all leads?



PyDataLondon 2018 and “Creating Correct and Capable Classifiers”

This weekend we ran PyDataLondon 2018, the fifth iteration of our conference (connected with our monthly PyDataLondon meetup). This year we grew to 500 attendees! Read about the past PyDataLondon 2017 here.

Update – videos are online; reportedly we raised £91,000 towards open source support for NumFOCUS via ticket sales & sponsorship (all the London team are unpaid volunteers – this money goes back to NumFOCUS to support the PyData ecosystem).

Here’s a summary of what we covered with 500 attendees over 3 days:

On Thursday morning I co-ran a “Make your first open source contribution” session with Nick (of PyDataEdinburgh). We had a group who’d rarely (or never) made a contribution on GitHub. We managed to commit a couple of minor doc fixes, recreated a bug in ELI5, and subsequently a new (failing) test was submitted to the project. Great success! I’m interested in another bug if you want to make a contribution.

Each room was packed with 150-200 people (with a comfy number of chairs for everyone!):

One of our key NumFOCUS organisers is Leah Silen, she’s an unsung hero who makes every conference come together. She broke her foot recently and couldn’t fly over. It turns out the crowd rather misses her and all of her work. Get well soon!

At the conference I spoke on “Creating Correct and Capable Classifiers” (worked Notebook in my github repo, full video online). We took a look at starting with a baseline model, building a better stable model, visualising errors, diagnosing where it might be failing and explaining the end results to a colleague.
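The “baseline first” step from the talk can be sketched with scikit-learn’s DummyClassifier – synthetic data here, not the dataset used in the Notebook:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data for a binary-classification task
X, y = make_classification(n_samples=500, n_informative=5, random_state=0)

# Baseline: always predict the most frequent class - any real model must beat this
baseline = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=5)
model = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=5)

print(f"baseline accuracy {baseline.mean():.2f}, model accuracy {model.mean():.2f}")
```

If your clever model doesn’t clearly beat the dummy, you’ve learned something important before investing further.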

Many thanks to @matti of PyDataBerlin for taking a lovely photo of our speaker-duck gift for speakers:

Many thanks also to all of our volunteers and to the staff at the Tower Hotel – thanks for making the weekend so much fun 🙂



AHL Python Data Hackathon

Yesterday I got to attend Man AHL’s first London Python Data hackathon (21-22 April – photos online). I went with the goal of publishing my ipython_memory_usage tool from GitHub to PyPI (success!), updating the docs (success!) and starting to work on the YellowBrick project (partial-success).

This is AHL’s first crack at running a public Python hackathon and, from my perspective, it went flawlessly. They use Python internally, they’ve been hosting my PyDataLondon meetup for a couple of years (and, all going well, for years to come), and they support the Python ecosystem with public open source contributions – this hackathon was another way for them to contribute back. This is lovely (since so many companies aren’t so good at contributing and only consume from open source) and should be encouraged.

Here’s Bernd of AHL introducing the hackathon. We had 85 or so folk (10% women) in the room:

Bernd introducing Python Data hackathon at AHL

I (and 10 or so others) then introduced our projects. I was very happy to have 6 new contributors volunteer for my project. I introduced the goals, got everyone up to speed, and then we split the work: fixing the docs, publishing to the test PyPI server and finally publishing to the official public PyPI server.

This took around 3 hours; most of the team had some knowledge of a git workflow, but none had seen my project before. With luck one of my colleagues will post a conda-forge recipe soon too. Here’s my team in action (photo taken by AHL’s own CTO Gary Collier):

Team at AHL hackathon

Many thanks to Hetal, Takuma, Robin, Lucija, Preyesh and Pav.

Robin had recently published his own project to PyPI so he had some handy links. Specifically we used twine and these notes. In addition the Pandas Sprint guide was useful for things like pulling the upstream master between our collaborative efforts (along with Robin’s notes).

This took about 3 hours. Next we had a crack at YellowBrick, the sklearn visualiser – first getting it running and tested, then fixing the docs on a recent code contribution I’d made (a sklearn-compatible wrapper for statsmodels’ GLM), with some success. It turns out we might need to work on the “get the tests running” process – it didn’t work well for a couple of us, and fixing this alone will make for a nice contribution.

Overall this effort helped 6 people contribute to two projects; 5 of the collaborators had (as best I remember!) only limited prior experience of making an open source contribution. I’m very happy with our output – thanks everyone!

