Entrepreneurial Geekiness

Ian is a London-based independent Chief Data Scientist who coaches teams, teaches and creates data products. More about Ian here.

“On the Diagramatic Diagnosis of Data” at BudapestBI 2018

A couple of days back I spoke on using diagrams (matplotlib, seaborn, pandas profiling) to diagnose data during the exploratory data analysis (EDA) phase. I also introduced my new tool discover_feature_relationships, which helps prioritise which features to investigate in a new dataset by identifying pairs of features that have some sort of ‘interesting’ relationship. We finished with a short note on Bertil’s ‘data story’ concept for documenting the EDA process.

I had a lovely room of international folk. We had a higher proportion of Hungarians this year as the organiser Bence has worked to build up the local community. The talk was followed by a variety of interesting questions around ways to tackle the EDA challenge.

BudapestBI room for my talk

My new tool discover_feature_relationships uses a Random Forest to identify predictive (and possibly non-linear) relationships between all pairs of columns in a dataframe. Typically we’d like to identify which features predict a target in machine learning; here instead I’m asking “what relationships exist throughout my data?”. I’ve used this to help me understand how data ‘works’. This is especially useful in semi-structured business data dumps which aren’t necessarily the right source of data to solve a particular task, but where up-front we don’t know what we have and what we need. I’d certainly welcome feedback on this idea; you’ll see diagrams and examples for the Boston and Titanic datasets on the github page.
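To make the idea concrete, here is a minimal sketch of the pairwise approach described above. It is not the code from discover_feature_relationships: the function name pairwise_predictive_scores, the toy columns x, y and z, and the choice of cross-validated R² (clipped at zero) as the score are all assumptions made for illustration, using scikit-learn's RandomForestRegressor.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score


def pairwise_predictive_scores(df, n_estimators=50, cv=3, random_state=0):
    """Score how well each single column predicts each other column.

    Hypothetical helper: fits one small Random Forest per ordered
    (feature, target) pair and records the mean cross-validated R^2.
    """
    rows = []
    for feature in df.columns:
        for target in df.columns:
            if feature == target:
                continue
            model = RandomForestRegressor(
                n_estimators=n_estimators, random_state=random_state
            )
            fold_scores = cross_val_score(
                model, df[[feature]], df[target], cv=cv, scoring="r2"
            )
            # Negative R^2 means "worse than predicting the mean"; clip to 0.
            rows.append(
                {
                    "feature": feature,
                    "target": target,
                    "score": max(0.0, float(fold_scores.mean())),
                }
            )
    # Rows are targets, columns are the features used to predict them.
    return pd.DataFrame(rows).pivot(index="target", columns="feature", values="score")


# Toy data: y depends non-linearly on x, z is unrelated noise.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 300)
toy = pd.DataFrame(
    {
        "x": x,
        "y": x ** 2 + rng.normal(0, 0.1, 300),
        "z": rng.normal(size=300),
    }
)
scores = pairwise_predictive_scores(toy)
print(scores.round(2))
```

In this toy case x should predict y well while y predicts x poorly (the sign of x is lost when squaring), which is the kind of asymmetric, non-linear relationship a plain correlation matrix would miss.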

Next year I’d like to run some courses on the subject of successful project delivery (which includes “what have I got and what do I need to solve this challenge?!”). If you’d like to hear about that then you might want to join my training notification list.

Here are the slides for my talk:

 

 


Ian is a Chief Interim Data Scientist via his Mor Consulting. Sign-up for Data Science tutorials in London and to hear about his data science thoughts and jobs. He lives in London, is walked by his high energy Springer Spaniel and is a consumer of fine coffees.

On helping to open the inaugural PyDataPrague meetup

A couple of weeks back I had the wonderful opportunity to open the PyDataPrague meetup – this is the second meetup I’ve opened after our PyDataLondon started back in 2014. The core organisers Ondřej Kokeš, Jakub Urban and Jan Pipek asked me to give two short talks on:

We had over 100 people in the room, many from the extant local Python meetup.

 

Štěpán Roučka also gave a talk on SymPy with lots of lovely demos (video). The organisers were lovely – do please think about speaking at PyDataPrague, you’ll get a warm reception. I also got to see the wonderful architecture in Prague and even visit the local observatory where we saw the sun’s corona.



On receiving the Community Leadership Award at the NumFOCUS Summit 2018

At the end of September I was honoured to receive the Community Leadership Award from NumFOCUS for my work building out the PyData community here in London and at associated events. This was awarded at the NumFOCUS 2018 Summit. I couldn’t attend the New York event, so James Powell gave my speech on my behalf (thanks James!).

I’m humbled to be singled out for the award – things only worked out so well because of the work of all of my colleagues (and alumni) at PyDataLondon and all the other wonderful folk at events like PyDataBerlin, PyDataAmsterdam, EuroPython (which has had a set of PyData sub-tracks) and PyConUK (with similar sub-tracks).

NumFOCUS posted a blog entry on the awards. In addition, Kelle Cruz received the Project Sustainability Award and Shahrokh Mortazavi received the Corporate Stewardship Award.

Cecilia Liao, Emlyn Clay and I started the first PyDataLondon conference in 2014 with lots of help, guidance and nudging from NumFOCUS (notably Leah – thanks!), James and, via Continuum (now Anaconda Inc), Travis and Peter. Many thanks to you all for your help – we’re now at 8,000+ members and our monthly events have 200+ attendees thanks to AHL’s hosting.

If you don’t know NumFOCUS – they’re the group who do a lot of the background support for a number of our PyData ecosystem packages (including numpy, Jupyter and Pandas and beyond to R and Julia), back the PyData conference series and help lots of associated events and groups. They’re a non-profit and an awful lot of work goes on that you never see – if you’d like to provide financial support, you can set up a monthly sponsorship here. If you currently don’t provide any contributions back into our open source ecosystem – setting up a regular monthly payment is the easiest possible thing you could do to help NumFOCUS raise more money, which helps more development occur in our ecosystem.



PyConUK 2018

Last weekend we had another fine PyConUK (2018) conference. Each year the conference grows; the Django Girls group had 70 or so women learning Django (and, often, Python for the first time). The kids hack day was a great success. The Pythonic-hardware demo session was fun.

Each year PyConUK encourages first-time speakers so we had the diverse-as-usual set of speakers and topics. If you’ve never attended – I’d encourage you to think on at least attending next year, and if you’re game do think about submitting a talk (even a 5 minute lightning talk as an easy first contribution).

This year I chaired two sets of sessions on the PyData track and spoke on the Diagramatic Diagnosis of Data. Slides are linked here (note that the PDF lacks some images and formatting); they are the PDF export from a live Jupyter Notebook presentation (here’s the repo).

I spoke on:

  • Styling Pandas
  • Initial exploratory data analysis using Google’s Facets and pandas_profiling
  • Data story-telling using matplotlib and Seaborn
  • Data stories by Bertil
  • Data relationship discovery using my discover_feature_relationships to help prioritise which columns to investigate

Here’s the talk:

Pete Inglesby also ran a rather fun competition for us to write a Python-based limited-opcode Connect 4 solver. You wrote some code, uploaded it and watched it battle the other entrants. For a little while I held 2nd place but I had dropped to 7th by the finals. Here’s Pete’s botany code, Rob’s winning set of solutions and Sev’s bots. Here’s the diagnostic session after the competition (I’d gone home a day before). A few lessons learned:

  • Analyse the bot failures against any default bots
  • Play the bot by hand to see what kind of mistakes it makes
  • Submitting many entries yields more information about placing than running local simulations (just as with Kaggle)
  • Don’t trust the bot titles (“minimax” and others didn’t actually use that strategy)
  • Don’t go complex early – check the simple ways you can lose and avoid these mistakes (I tried doing full-board scoring – that eats all of your scant opcodes in no time at all)
  • Check for traps which will play out against you and block where possible
  • A pre-calculated 8-ply deep solution, uploaded as a compressed data structure in the source file, is pretty sweet (this came 4th with no other strategies)
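The “check the simple ways you can lose” and “block where possible” lessons can be sketched in a few lines. This is only an illustration and not any entrant’s bot: the board representation (a list of 7-character strings, top row first), the helper names and the centre-first fallback are assumptions, and the sketch ignores the competition’s opcode limit entirely.

```python
ROWS, COLS = 6, 7
EMPTY = "."


def drop(board, col, player):
    """Return a new board with player's piece dropped in col, or None if full."""
    for row in range(ROWS - 1, -1, -1):  # scan from the bottom row upwards
        if board[row][col] == EMPTY:
            new_board = [list(r) for r in board]
            new_board[row][col] = player
            return new_board
    return None


def has_four(board, player):
    """True if player has four in a row horizontally, vertically or diagonally."""
    directions = [(0, 1), (1, 0), (1, 1), (1, -1)]
    for r in range(ROWS):
        for c in range(COLS):
            for dr, dc in directions:
                cells = [(r + i * dr, c + i * dc) for i in range(4)]
                if all(
                    0 <= rr < ROWS and 0 <= cc < COLS and board[rr][cc] == player
                    for rr, cc in cells
                ):
                    return True
    return False


def winning_columns(board, player):
    """Columns where player would win immediately by playing there."""
    wins = []
    for col in range(COLS):
        after = drop(board, col, player)
        if after is not None and has_four(after, player):
            wins.append(col)
    return wins


def choose_move(board, me, opponent):
    """Take a win if one exists, otherwise block an immediate loss."""
    my_wins = winning_columns(board, me)
    if my_wins:
        return my_wins[0]
    threats = winning_columns(board, opponent)
    if threats:
        return threats[0]
    # Assumed fallback: prefer the open column closest to the centre.
    for col in sorted(range(COLS), key=lambda c: abs(c - COLS // 2)):
        if board[0][col] == EMPTY:
            return col
    return None


board = [
    ".......",
    ".......",
    ".......",
    ".......",
    "X......",
    "XOOO...",
]
print(choose_move(board, "X", "O"))  # X has no win, so it blocks O's threat
```

A one-ply check like this is cheap compared with full-board scoring, which is the point of the “don’t go complex early” lesson.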

If you’re roughly in the area of Cardiff you might want to look at the PyDataCardiff and PyDataBristol meetups. They’d be great places for you to meet local community members and, perhaps, to practice giving a talk that you might later submit to PyConUK next year. If you’re in London then you’re very welcome to attend our PyDataLondon or maybe you’ll want to look at the London Python meetup.



On the growth of our PyDataLondon community

I haven’t spoken on our PyDataLondon meetup community in a while so I figure a few numbers are due. We’re now at an incredible 7,800 members and just this month we had 200 members in the room at AHL’s new venue. We’re a volunteer-run community – you’ll see the list of our brilliant volunteers here along with their Twitter accounts.

I polled the attendees this month and 1/3 of the hands went up (see below) to the question “Who is a first-timer to this meetup?”. This shows the continued growth in our community and in the wider data science ecosystem in London. Welcome along!

 

One of our talks was on Pandas v1, which included an update by Marc on how Python 2.x is being deprecated in Pandas next year and on the new Cyberpandas and Fletcher (dtype extensions including faster strings) libraries. Marc also noted that Pandas is estimated to have 5-10 million users! One of the benefits of internal Pandas updates will be the “UInt8” and related dtypes – we’ll have integers with NaNs for the first time ever (previously int arrays with NaNs were promoted to floats, which have NaN support in numpy).
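As a small illustration of that last point, the sketch below compares the default float promotion with the “UInt8” extension dtype. It assumes a pandas release that ships the nullable integer dtypes; the variable names are mine, and how the missing value is displayed may differ between versions and from what was shown in the talk.

```python
import numpy as np
import pandas as pd

values = [1, 2, None, 4]

# Default behaviour: the missing value forces promotion to float64.
promoted = pd.Series(values)

# Nullable extension dtype: the data stays integer, with a missing marker.
nullable = pd.Series(values, dtype="UInt8")

print(promoted.dtype)
print(nullable.dtype)
print(nullable.isna().tolist())
```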

Given the continued growth of our ecosystem, we have more Python newbies and more Data Science newbies (including converts moving away from Excel and SPSS). We’re always looking for new speakers. New speakers don’t have to be experienced data scientists – a 5 minute lightning talk on how you are transitioning into this ecosystem from elsewhere can be hugely valuable to other new members. A talk (5 mins or 30 minutes) on a technique you’re experienced in – even if there’s no equivalent Python library – is also incredibly educational. Please come and share your knowledge.

Talking will raise your profile and it’ll raise your employer’s profile (if that’s what you’re after) and that obviously helps with hiring. We continue after the meetup in the local pub (typically The Banker) so anyone who’s been speaking and who ends with “and we’re hiring!” tends to have interesting conversations in the pub afterwards. With 200 attendees it isn’t hard to find folk who’d be interested in your role. Remember – this is most effective for speakers as you have the entire audience’s attention. You’ll find instructions here on how to submit a talk.

AHL continue to support our open source PyData world (along with other open events like the London Machine Learning meetup). They now rent a professional auditorium next to their building each month for us with full hosting, mics on every chair and video recording (see them at PyDataTV) for speakers who consent. This isn’t cheap of course and it provides evidence of the growth of Python’s Data Science stack in the London financial community. AHL’s activity at the meetup is to say a few words before the break about who they’re hiring for, before everyone heads out for more beer. Thanks AHL for your continued support! You might also want to check their github repo.

There’s a whole pile of PyData conferences coming up; you might find some are closer to you or your offices than you imagine. Go check the list. The PyData meetup ecosystem has grown world-wide to 111 events now too!

PyData of course is supported by NumFOCUS, the non-profit in the US. NumFOCUS backs a lot of our open source tools. They’re having a summit late this September in the US and everyone is welcome. If you’re interested in the deeper direction of Python and the Data Science community then you might want to attend (or send a representative from your group?).

Of course you might also want to be hired by a company that works in our PyData ecosystem. I post out jobs (UK-centric but they stretch to western Europe and sometimes to the US) every 2 weeks to 650+ data scientists and engineers, typically 7 roles (mostly permie, some contract, all Python focused). You might want to join that list (note your email is always kept private and is never shared). Attending PyData members (i.e. anyone who helps build our ecosystem) get a first post gratis.

