Entrepreneurial Geekiness

Ian is a London-based independent Chief Data Scientist who coaches teams, teaches and creates data products. More about Ian here.

“Making Pandas Fly” for PyDataAmsterdam 2020

I thank the PyDataAmsterdam 2020 organisers for another chance to speak on Making Pandas Fly (PyDataAmsterdam 2020). This variant of the talk focuses more on:

  • Understanding when categories beat strings and smaller floats beat larger ones
  • What’s happening with NumPy behind the scenes
  • How we can save 50% of our RAM (and so fit in more data to the same machine) by checking dtypes with my dtype_diet tool
  • Considering that float16 is emulated in software on most CPUs, so it is memory efficient but slow for calculating!
  • Tips to install bottleneck & numexpr to make Pandas faster
  • Digging into some Pandas internals when I filed a bug – and what I learned as a result (you can learn too by reading the bug report!)
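As a rough illustration of the dtype points in the list above, here is a minimal sketch that compares memory use for strings versus categories and for float64 versus smaller floats. The column names and data are made up for the example, and the exact byte counts for strings will vary with your Pandas version.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data: a low-cardinality string column and a float column
df = pd.DataFrame({
    "city": rng.choice(["London", "Amsterdam", "Budapest"], size=n),
    "value": rng.random(n),
})


def mem_bytes(series):
    """Bytes used by the column's values (index excluded, strings counted)."""
    return series.memory_usage(index=False, deep=True)


city_as_str = mem_bytes(df["city"])
city_as_cat = mem_bytes(df["city"].astype("category"))

value_f64 = mem_bytes(df["value"])
value_f32 = mem_bytes(df["value"].astype("float32"))
value_f16 = mem_bytes(df["value"].astype("float16"))

# Smaller floats save RAM but lose precision, so check the error you introduce
max_f32_error = (df["value"] - df["value"].astype("float32")).abs().max()

print(f"city  str: {city_as_str:,} bytes, category: {city_as_cat:,} bytes")
print(f"value f64: {value_f64:,}, f32: {value_f32:,}, f16: {value_f16:,} bytes")
print(f"largest float32 rounding error: {max_f32_error:.2e}")
```

Categories help most when a column has few distinct values. For float16, the saving is in storage only, so it is worth timing any arithmetic on your own machine before committing to it.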

In a few months I’ll run another of my Higher Performance Python virtual training classes and you’re most welcome to join. You’ll find details on my very-lightly-used “training email list”; join it if you’d like to hear about my upcoming training courses.

I make notes on some of these topics in my irregular “weekish notes” here on the blog and in my every-2-weeks “thoughts & jobs” email list. You’re welcome to join the list (your email is always kept private) if getting it in your inbox is more convenient.

At the end of my talks I always ask for a postcard “if you learned something”. I’ve just received the first for last week’s talk, from the Netherlands – thanks!


Ian is a Chief Interim Data Scientist via his Mor Consulting. Sign-up for Data Science tutorials in London and to hear about his data science thoughts and jobs. He lives in London, is walked by his high energy Springer Spaniel and is a consumer of fine coffees.

Weeknote (dtype-diet)

Over the weekend I hacked on dtype_diet – a tool for Pandas users that checks a DataFrame to see if smaller datatypes might be applicable. Where they are, the smaller datatypes lose no data and reduce RAM usage, and for Categorical data there’s also the possibility of faster calculations. The tool makes no changes itself; it recommends code you might copy into your project. This was one of the ideas I didn’t get around to building whilst I was working on my book, but since that’s now published…
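To make the "smaller datatype with no data loss" idea concrete, here is a minimal sketch of the kind of check involved. This is not dtype_diet's code or API; the helper names `suggest_int_dtype` and `float32_is_lossless` are hypothetical and only illustrate the principle.

```python
import numpy as np
import pandas as pd


def suggest_int_dtype(series):
    """Hypothetical helper: smallest signed int dtype that holds every value."""
    for candidate in (np.int8, np.int16, np.int32):
        info = np.iinfo(candidate)
        if series.min() >= info.min and series.max() <= info.max:
            return np.dtype(candidate)
    return None  # nothing smaller than int64 is safe


def float32_is_lossless(series):
    """Hypothetical helper: True if a float64 column survives a float32 round trip."""
    roundtrip = series.astype("float32").astype("float64")
    # Series.equals treats NaNs in matching positions as equal
    return roundtrip.equals(series)


ages = pd.Series([23, 45, 67, 89], dtype="int64")
big_counts = pd.Series([1, 40_000], dtype="int64")
halves = pd.Series([0.5, 1.25, 3.0])
tenths = pd.Series([0.1, 0.2])

print(suggest_int_dtype(ages))        # fits in int8
print(suggest_int_dtype(big_counts))  # too large for int16, fits in int32
print(float32_is_lossless(halves))    # exactly representable in float32
print(float32_is_lossless(tenths))    # 0.1 is rounded differently in float32
```

Like the tool described above, this only reports what might be safe. Applying the change with `astype` is left to you.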

Talking of which – Micha and I have cleaned up our profiles on Amazon and Goodreads for High Performance Python 2nd ed. It is a pity we can’t do any book signings at events (seeing as we won’t have physical events for a while!). One thing I might do via my “thoughts & jobs” email list is to organise a Friday “coffee morning” to chat with whoever turns up about high performance Python and other subjects.

I’ve also learned that if I make a 1kg sourdough dough, immediately after kneading I can put 0.5kg into the fridge and 4 days later it cooks just fine. This is a nice refinement. I’m now interspersing buying fresh bread (for variety and convenience) with making my own and it feels like a more sustainable practice.

All going well I’ll be talking this month at PyDataAmsterdam and next month at EuroPython on higher performance Python and Pandas. I figure the dtype_diet tool will fit in nicely with a discussion about the new Pandas dtypes and their benefits over NumPy dtypes.



Week(ish) note

So – High Performance Python 2nd ed finally shipped (Amazon, Goodreads) – yay! In brief we’ve added notes on how you can be a “highly performant programmer”, added some more profiling, added Pandas onto NumPy, improved the Compiling to C chapter with more Numba and a new full section on GPUs (in the first edition we said – GPUs are hard, nobody uses them, avoid them – oh, how the world moved on there!), updated async, JobLib with multiprocessing, multi-core including Dask, less RAM with Pandas and NumPy and four new “lessons from the field” from very experienced colleagues (thanks to Soledad, Linda, Vincent and Valentin). I need to write some more about this.

I posted it to Twitter and got a lot of lovely feedback.

I’ve started working on a very interesting project aimed at helping the likes of the Bank of England, the UK Government, the Civil Service and more understand the impact of the pandemic on the UK economy so fixes can be designed. My involvement is just starting, I’m hoping this will grow in interesting ways.

I put a simple solution into the Kaggle Covid 19 Week 5 competition and on the private leaderboard I’m hovering around rank 10 of 90ish. My solution takes the 5 day moving average and builds wide percentile confidence bounds – it is really dumb but it turns out to be really robust too.
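For readers who want to try a baseline like this, here is one way it could look. This is a sketch under my own assumptions, not the submitted solution: the daily counts are invented, and deriving the bounds from percentiles of the historical ratio between each day's value and the previous day's moving average is just one plausible choice.

```python
import pandas as pd

# Invented daily case counts, purely for illustration
daily_cases = pd.Series(
    [12, 15, 11, 18, 20, 17, 22, 25, 21, 19, 24, 28], dtype="float64"
)

window = 5
moving_avg = daily_cases.rolling(window).mean()

# Point forecast for the next day: the latest 5 day moving average
point_forecast = moving_avg.iloc[-1]

# Assumption: measure how far each day landed from the previous day's
# moving average, then use wide percentiles of that ratio as bounds
ratios = (daily_cases / moving_avg.shift(1)).dropna()
lower = point_forecast * ratios.quantile(0.05)
upper = point_forecast * ratios.quantile(0.95)

print(f"forecast {point_forecast:.1f}, bounds [{lower:.1f}, {upper:.1f}]")
```

The appeal of a baseline like this is that there is very little to go wrong: no model fitting, just a smoothed recent level with bounds wide enough to absorb day-to-day noise.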

For now I’ve cancelled my upcoming courses, life with the pandemic continues to be weird so I’ve decided to simplify things a bit. They’ll restart in a few months.

In other news my sourdough baking continues to improve, my radishes got pretty big, my lettuce is up, the plums and pears have started to grow and generally the garden is looking awesome. I’m also working harder at training my Spaniel, we’ve got a long way to go but having her not try to flush every cat she sees in the street is a huge step forwards.



Week note

Well, mid-next-week note I guess. I gave another variant of my higher performance Python talk last night for PyDataUK to 250 live streamers, we had some good questions, cheers all.

On Friday Micha & I heard that the 2nd edition of our High Performance Python book has gone to the printers – we’d said we’d do 1 person-month each on it last summer and 9 months later (with many person-months invested each) we’re finally there. Phew.

I now have an open PR on the dabl project to add ordinal-sorted y-axis box plot items in place of the default always-sort-by-median, which I think makes some of the exploratory process more intuitive. This also involved figuring out a new weird matplotlib rendering behaviour and writing my first unit test where I make up a dummy matplotlib figure in a test which is rendered but never displayed. There’s always a million new things to learn, right?

I’ve also been digging into Companies House data to look at how the economy is responding to the pandemic and this lets me play with some higher performance Pandas operations (for my talks) and to dig into pivot_table, pivot, groupby and crosstab along with stack & unstack in Pandas. I’ve always been confused about how many options I have available here, I’m less confused now, but I still don’t understand some of the performance differences I see for otherwise-equivalent operations. I also discovered the Pandas xs operation (take a cross-section of a dataframe) whilst reading a wikipedia page on crosstabs. Learning. Always learning.
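To show how these options overlap, here is a small sketch that builds the same summary table three ways and then takes a cross-section with xs. The DataFrame is a made-up stand-in, not the Companies House data.

```python
import pandas as pd

# Made-up stand-in data
df = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "status": ["active", "dissolved", "active", "active", "dissolved"],
    "employees": [10, 3, 7, 5, 2],
})

# Three routes to the same region x status table of summed employees
via_pivot_table = df.pivot_table(
    index="region", columns="status", values="employees", aggfunc="sum"
)
via_groupby = df.groupby(["region", "status"])["employees"].sum().unstack()
via_crosstab = pd.crosstab(
    df["region"], df["status"], values=df["employees"], aggfunc="sum"
)

print(via_pivot_table)

# xs takes a cross-section from a MultiIndex: all regions for one status
grouped = df.groupby(["region", "status"])["employees"].sum()
active_only = grouped.xs("active", level="status")
print(active_only)
```

The results match here, but the routes differ internally, so timing each one on your own data is the only reliable way to compare their performance.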

My kneaded sourdough is improving, I’m up to 1kg now. I think I’m done with no-knead for a bit, that was fun but the really-wet dough is hard to handle. Radishes are great, pretty big now, but annoyingly the snails have found my lettuce.



“Flying Pandas” and “Making Pandas Fly” – virtual talks this weekend on faster data processing with Pandas, Modin, Dask and Vaex

This Saturday and Monday I’ve had my first experience presenting at virtual conferences – on Saturday it was for Remote Pizza Python (brilliant line-up!) and on Monday (note – this post predates the talk, I’ll update it tomorrow after I’ve spoken) at BudapestBI. UPDATE added 2nd variant of Making Pandas Fly for a short-notice PyDataUK talk too.

My slides for Remote Pizza Python are here: “Flying Pandas – Modin, Dask & Vaex”. I cover the following in a 10 min talk:

  • Modin – new academic project, makes a new algebra for dataframes (not just Pandas), provides automated column & row parallelisation options for no code changes
  • Dask – great for blocked Pandas DataFrames in parallel on 1 or more machines (it can also parallelise on a single machine multi-core with in-RAM data which I didn’t cover)
  • Vaex – new Pandas-like DataFrame with a subset of operations, better string implementation so you fit more strings into RAM than with Pandas
  • I recommended sticking to Pandas if your data fits comfortably in RAM, trying Modin if you have a larger dataset that is still in RAM or using Dask if you have a bigger-than-RAM scenario, with Vaex being great for an experiment

The reaction was very positive and on the internal Discord chat we had some great Q&A about the use of Numba, Dask, Modin and other tools.

For PyDataBudapest I gave a longer 30 min talk on Making Pandas Fly (GitHub source as Notebooks for both parts):

  • Dask to preprocess larger dataset with nice diagnostics
  • Pandas – using Category and float32 or float16 to save RAM and to do faster lookups
  • Pandas – dropping to NumPy to calculate numeric operations faster with a look at James Powell’s great callgraph prototype to dig into the call history complexity
  • Pandas – using Numba to accelerate numeric functions
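As a small illustration of the "dropping to NumPy" point in the list above, here is a minimal sketch that computes the same mean through Pandas and through the underlying NumPy array, with a rough timing of each. The data is random and the timings depend on your machine and library versions, so treat the printed numbers as indicative only.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({"a": rng.random(1_000_000)})

# Same calculation via the Pandas Series and via the raw NumPy array
pandas_mean = df["a"].mean()
values = df["a"].to_numpy()
numpy_mean = values.mean()

# Rough timings; Pandas adds missing-data handling and extra function calls
pandas_time = timeit.timeit(lambda: df["a"].mean(), number=20)
numpy_time = timeit.timeit(lambda: values.mean(), number=20)

print(f"pandas mean {pandas_mean:.6f} in {pandas_time:.4f}s")
print(f"numpy  mean {numpy_mean:.6f} in {numpy_time:.4f}s")
```

The two results can differ in the last few decimal places because the libraries sum the values in a different order, so compare them with a tolerance rather than exact equality.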

The remote talk to Budapest was slightly hampered by a chunk of the Virgin internet backbone disappearing just before I spoke, thankfully we got it back a few minutes later (else I was going to live present, with a live Dask demo, tethered via a 4G mobile connection!). I had some great questions from the Budapest audience – thanks for having me!

For PyDataUK, our inaugural event this week (a week after the two talks above), I gave a variant of Making Pandas Fly to 250 live streamers, organised by the lovely crowd at PyDataManchester. The YouTube link is available (via the meetup page). Paige Bailey of Google was the lead speaker, on TensorFlow Probability, which was an intriguing talk (sadly yet another thing I’ll run out of time before trying).

Thanks to James Powell for his CallGraph code (uploaded here) to show how many extra calls Pandas might add onto a NumPy operation on a column.

The sensible outcomes of both talks are:

  • Use Dask to preprocess large datasets that don’t fit into RAM
  • Use Pandas intelligently to save RAM and make your manipulations run faster for investigatory work
  • Look at Modin to see if making no changes to your code can result in speed-ups on larger in-RAM datasets
  • Check Vaex if you have large memory mapped (HDF5) datasets or if you want faster string processing
