“Higher Performance Python” at PyDataCambridge 2019

I’ve had the pleasure of speaking at the first PyDataCambridge conference (2019). This is the second PyData conference in the UK after PyDataLondon (which colleagues and I co-founded 6 years back). I’m super proud to see PyData spread to 6 regional meetups and now 2 UK conferences.

We had over 200 attendees, a swanky black-tie conference dinner, and a single-track event with a rich set of topics (schedule). For me, scikit-multiflow (extending sklearn to streaming data) was a hit, along with model stability checking (by FarFetch) and an overview of GA2M (an extended Generalised Additive Model with explainability). Thanks to the speakers for fine talks, the audience for fine questions, and Cambridge Spark and the PyDataCambridge meetup for helping make it all happen!

I spoke on Higher Performance Python with a focus on making Pandas operations go faster and an eye on the upcoming Second Edition of our High Performance Python (O’Reilly) book. The talk covers:

  • Using line_profiler to compare sklearn’s LinearRegression with NumPy’s lstsq (spoiler: lstsq is much faster, but only because sklearn is much safer; the slow-down is all due to safety code in sklearn that helps keep your overall productivity higher)
  • Using Pandas for line-by-line iteration (slow) vs apply (faster) and apply with raw=True to expose NumPy arrays (fastest)
  • Using Numba to JIT compile lstsq using apply with raw=True for a huge speed-up
  • Using Dask to parallelise the Numba solution for further speed-ups
  • Advice on being a “highly performant data scientist”
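
To make the three Pandas approaches above concrete, here is a minimal sketch rather than the code from the talk. The `ols_slope` helper, the random data and the DataFrame shape are all assumptions made for illustration; only `np.linalg.lstsq`, `DataFrame.iterrows` and `DataFrame.apply(..., raw=True)` are the real APIs being compared.

```python
import numpy as np
import pandas as pd


def ols_slope(row):
    """Fit y = m*x + c to one row of values and return the slope m.

    Hypothetical helper for illustration; accepts a Series or an ndarray.
    """
    y = np.asarray(row, dtype=float)
    x = np.arange(y.shape[0], dtype=float)
    # Design matrix: a column of x values plus a column of ones for the intercept.
    A = np.column_stack([x, np.ones_like(x)])
    m, c = np.linalg.lstsq(A, y, rcond=None)[0]
    return m


# Made-up data: 1000 rows, each treated as a short series of 14 observations.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 14)))

# 1. Row-by-row iteration: pandas builds a new Series for every row.
slopes_iter = pd.Series(
    [ols_slope(row) for _, row in df.iterrows()], index=df.index
)

# 2. apply along rows: less boilerplate, but each row is still a Series.
slopes_apply = df.apply(ols_slope, axis=1)

# 3. apply with raw=True: each row arrives as a plain NumPy array.
slopes_raw = df.apply(ols_slope, axis=1, raw=True)
```

All three variants compute the same slopes; they differ only in how much per-row Pandas overhead is paid. Because the `raw=True` version only ever sees NumPy arrays, it is also the natural starting point for handing the function to Numba. Timing each variant with `%timeit` on your own data is the way to see how large the gap is for your workload.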

The last point is important – going “compiler happy” and writing highly efficient code may well slow down your team and your overall velocity. Amongst other items I recommended profiling first, maybe introducing Dask & Numba only with a team’s consent and looking at tools like Bulwark to add tests to DataFrames to avoid being derailed by strange data bugs.
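
As a rough illustration of what “adding tests to DataFrames” can look like, here is a hand-rolled check written in plain Pandas. It is not Bulwark’s API; the `check_no_nans` name and its behaviour are assumptions chosen only to show the idea of failing early on unexpected data.

```python
import pandas as pd


def check_no_nans(df, columns):
    """Raise AssertionError if any of the given columns contains NaN.

    Hypothetical helper; returns the DataFrame unchanged so it can be chained.
    """
    missing = df[columns].isna().sum()
    bad = missing[missing > 0]
    assert bad.empty, f"NaN values found: {bad.to_dict()}"
    return df


# Made-up example data: the check passes and hands the frame back.
readings = pd.DataFrame({"sensor": [1.0, 2.5, 3.0], "label": ["a", "b", "c"]})
checked = check_no_nans(readings, ["sensor"])
```

Placing a check like this at the start of a pipeline step means a bad input fails loudly at the boundary, before it reaches any Numba or Dask code where the resulting error would be harder to trace.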

Right now Micha and I are busily working to complete the second edition of our book. All going well, it’ll be in for Christmas with a publication date around April 2020.

 


Ian is a Chief Interim Data Scientist via his Mor Consulting. Sign up for Data Science tutorials in London and to hear about his data science thoughts and jobs. He lives in London, is walked by his high-energy Springer Spaniel and is a consumer of fine coffees.