Entrepreneurial Geekiness

Ian is a London-based independent Chief Data Scientist who coaches teams, teaches and creates data products. More about Ian here.

Week note

Well, a mid-next-week note I guess. I gave another variant of my Higher Performance Python talk last night for PyDataUK to 250 live streamers; we had some good questions, cheers all.

On Friday Micha & I heard that the 2nd edition of our High Performance Python book has gone to the printers – we’d said we’d do one person-month each on it last summer, and nine months later (with many person-months invested each) we’re finally there. Phew.

I now have an open PR on the dabl project to add ordinal-sorted y-axis box plot items in place of the default always-sort-by-median, which I think makes some of the exploratory process more intuitive. This also involved figuring out a weird new matplotlib rendering behaviour and writing my first unit test that builds a dummy matplotlib figure which is rendered but never displayed. There’s always a million new things to learn, right?
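
If you’ve not written a test like that before, the trick is to force a non-interactive backend so the figure really renders without ever needing a display. A minimal sketch (the test body is mine, not the dabl PR’s):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; set before importing pyplot
import matplotlib.pyplot as plt
import numpy as np


def test_boxplot_renders_without_display():
    # build a dummy figure with a few box plots
    fig, ax = plt.subplots()
    data = [np.random.normal(loc, 1.0, size=50) for loc in (0, 1, 2)]
    ax.boxplot(data)
    fig.canvas.draw()  # forces a real render, but no window ever appears
    # assert on the drawn artists rather than eyeballing the output
    assert len(ax.lines) > 0
    plt.close(fig)
```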

I’ve also been digging into Companies House data to look at how the economy is responding to the pandemic. This lets me play with some higher performance Pandas operations (for my talks) and dig into pivot_table, pivot, groupby and crosstab, along with stack & unstack, in Pandas. I’ve always been confused about how many options are available here; I’m less confused now, but I still don’t understand some of the performance differences I see between otherwise-equivalent operations. I also discovered the Pandas xs operation (take a cross-section of a DataFrame) whilst reading a Wikipedia page on crosstabs. Learning. Always learning.
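
As a reminder-to-self, here’s how several of those options can produce the same table from the same data – toy rows below, not the Companies House schema:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["London", "London", "Leeds", "Leeds"],
    "status": ["active", "dissolved", "active", "active"],
    "count":  [10, 3, 7, 5],
})

# pivot_table aggregates duplicate index/column pairs for you
pt = df.pivot_table(index="region", columns="status",
                    values="count", aggfunc="sum")

# groupby + unstack builds the same table via a MultiIndex
gb = df.groupby(["region", "status"])["count"].sum().unstack("status")

# crosstab works directly from the raw columns and can aggregate too
ct = pd.crosstab(df["region"], df["status"],
                 values=df["count"], aggfunc="sum")

# stack reverses unstack, and xs takes a cross-section of the result
stacked = gb.stack()
london = stacked.xs("London", level="region")
```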

My kneaded sourdough is improving – I’m up to 1kg now. I think I’m done with no-knead for a bit; that was fun, but the really-wet dough is hard to handle. The radishes are great, pretty big now, but annoyingly the snails have found my lettuce.


Ian is a Chief Interim Data Scientist via his Mor Consulting. Sign-up for Data Science tutorials in London and to hear about his data science thoughts and jobs. He lives in London, is walked by his high energy Springer Spaniel and is a consumer of fine coffees.

“Flying Pandas” and “Making Pandas Fly” – virtual talks this weekend on faster data processing with Pandas, Modin, Dask and Vaex

This Saturday and Monday I had my first experiences of presenting at virtual conferences – on Saturday it was for Remote Pizza Python (brilliant line-up!) and on Monday (note – this post predates the talk; I’ll update it tomorrow after I’ve spoken) at BudapestBI. UPDATE: I’ve since added a 2nd variant of Making Pandas Fly for a short-notice PyDataUK talk too.

My slides for Remote Pizza Python are here: “Flying Pandas – Modin, Dask & Vaex“. In a 10 minute talk I cover:

  • Modin – a new academic project that defines a new algebra for dataframes (not just Pandas) and provides automated column & row parallelisation with no code changes
  • Dask – great for blocked Pandas DataFrames in parallel on one or more machines (it can also parallelise multi-core on a single machine with in-RAM data, which I didn’t cover)
  • Vaex – a new Pandas-like DataFrame with a subset of operations and a better string implementation, so you fit more strings into RAM than with Pandas
  • I recommended sticking to Pandas if your data fits in RAM, trying Modin for a possible free speed-up on in-RAM data, using Dask for bigger-than-RAM scenarios, and keeping Vaex in mind as a great experiment (see the sketch below)
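
To make that advice concrete, here’s a hedged sketch of how little code each route needs – the file and column names are invented, and Modin needs a Ray or Dask backend installed:

```python
import pandas as pd
df = pd.read_csv("transactions.csv")      # fits in RAM: stay with Pandas

# Modin aims to be a drop-in replacement that parallelises for you
import modin.pandas as mpd
mdf = mpd.read_csv("transactions.csv")    # same API, uses multiple cores

# Dask uses blocked DataFrames for bigger-than-RAM data
import dask.dataframe as dd
ddf = dd.read_csv("transactions-*.csv")   # lazy and partitioned
result = ddf.groupby("company_id")["amount"].sum().compute()
```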

The reaction was very positive and on the internal Discord chat we had some great Q&A about the use of Numba, Dask, Modin and other tools.

For PyDataBudapest I gave a longer 30 min talk on Making Pandas Fly (GitHub source as Notebooks for both parts):

  • Dask to preprocess a larger dataset, with nice diagnostics
  • Pandas – using Category and float32 or float16 dtypes to save RAM and do faster lookups (see the sketch after this list)
  • Pandas – dropping to NumPy to run numeric operations faster, with a look at James Powell’s great callgraph prototype to dig into the call-history complexity
  • Pandas – using Numba to accelerate numeric functions
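
A small sketch of the RAM-saving ideas from the list above (the column names are illustrative, not from my Companies House work):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "sic_code": ["62012"] * 1_000_000,   # low-cardinality repeated strings
    "value": np.random.rand(1_000_000),
})
print(df.memory_usage(deep=True))

# Category stores each distinct string once plus small integer codes
df["sic_code"] = df["sic_code"].astype("category")
# float32 halves the footprint of a float64 column (mind the precision)
df["value"] = df["value"].astype(np.float32)
print(df.memory_usage(deep=True))

# dropping to NumPy skips Pandas' per-call overhead for simple numeric ops
total = df["value"].to_numpy().sum()
```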

The remote talk to Budapest was slightly hampered by a chunk of the Virgin internet backbone disappearing just before I spoke; thankfully it came back a few minutes later (otherwise I’d have presented live, Dask demo and all, tethered via a 4G mobile connection!). I had some great questions from the Budapest audience – thanks for having me!

For PyDataUK, our inaugural event this week (a week after the two talks above), I gave a variant of Making Pandas Fly to 250 live streamers, organised by the lovely crowd at PyDataManchester. The YouTube link is available (via the meetup page); Paige Bailey of Google was the lead speaker, on TensorFlow Probability, which was an intriguing talk (sadly yet another thing I’ll run out of time to try).

Thanks to James Powell for his CallGraph code (uploaded here), which shows how many extra calls Pandas might add on top of a NumPy operation on a column.
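
CallGraph is James’s tool; as a crude stdlib stand-in, you can count the Python-level calls behind each route with cProfile and see the same effect:

```python
import cProfile
import pstats

import numpy as np
import pandas as pd

ser = pd.Series(np.random.rand(1_000_000))
arr = ser.to_numpy()

for label, fn in [("pandas", ser.sum), ("numpy", arr.sum)]:
    prof = cProfile.Profile()
    prof.runcall(fn)
    # Pandas routes through far more function calls than the raw NumPy sum
    print(label, "calls:", pstats.Stats(prof).total_calls)
```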

The sensible outcomes of both talks are:

  • Use Dask to preprocess large datasets that don’t fit into RAM
  • Use Pandas intelligently to save RAM and make your manipulations run faster for investigatory work
  • Look at Modin to see if making no changes to your code can result in speed-ups on larger in-RAM datasets
  • Check Vaex if you have large memory-mapped (HDF5) datasets or if you want faster string processing (tiny sketch below)
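
For completeness, a tiny Vaex sketch – the file and column names here are invented:

```python
import vaex

# opening an HDF5 file memory-maps it; nothing is loaded into RAM up front
df = vaex.open("companies.hdf5")
# operations are evaluated lazily, in chunks, out-of-core
print(df.mean(df.turnover))
```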


Recent “week notes”

I’ve not done a public “week notes” before. I’ve been hacking on various things and I figure it is worth sharing some of it.

Using public Companies House data I’ve started to plot the decline in new company formations in the UK. Here’s a first crack, which shows a decline at the end of March. This data comes monthly as a single dump so it didn’t contain April. Here’s a second crack going back 10 years, it shows co-ordinated drops in activity during UK public holidays (and this March still looks awful).
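
The plotting itself is a few lines of Pandas once the snapshot is downloaded – a hedged sketch, with the file name shortened and only the incorporation-date column assumed:

```python
import matplotlib.pyplot as plt
import pandas as pd

# the monthly Companies House snapshot, one row per registered company
df = pd.read_csv("BasicCompanyDataAsOneFile.csv",
                 usecols=["IncorporationDate"],
                 parse_dates=["IncorporationDate"], dayfirst=True)

# count new registrations per week to make the holiday dips visible
weekly = df.set_index("IncorporationDate").resample("W").size()
weekly.loc["2010":].plot(title="New UK company registrations per week")
plt.show()
```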

For this third crack I’ve used the Companies House API to augment the static dump with up-to-date data for April (which’ll be replaced when the new data dump is provided in a week). There’s a 3-week current window showing “no dissolutions”, which I suspect means they’ve not yet been added to the public database; the decline in registrations is clear. I’m guessing registrations go via a different human process than dissolutions, and dissolutions might be very laggy due to admin.

In Pandas I learned about the memory_usage function, which gives a per-column memory report. Benjamin noted in a reply that this appears in Dask and cuDF too.
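
It’s a one-liner; deep=True is the important flag, as it makes Pandas inspect the true cost of object (string) columns:

```python
import pandas as pd

df = pd.DataFrame({"name": ["ACME LTD"] * 10_000,
                   "turnover": range(10_000)})
print(df.memory_usage(deep=True))        # bytes per column, index included
print(df.memory_usage(deep=True).sum())  # total bytes
```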

For my upcoming Remote Pizza Python talk (tomorrow) on Modin, Dask & Vaex I’ve delved further into Modin and Dask. The Modin folk gave useful feedback on how Modin works, and I’ve got an open question about Dask memory usage on Stack Overflow.

On Monday I give a remote talk for PyDataBudapest which focuses more on how to get more out of Pandas in a smaller-data scenario. Experiences from both of these talks will go into my upcoming Higher Performance Python training (start of June).

My garden is doing well – I’m now eating my new radishes. Kilos of flour will arrive soon so I can expand my bread making experiments too!



New Higher Performance Python class (June 1-3)

I’ve listed my next Higher Performance Python public class; it’ll run online for 3 mornings on June 1-3 during UK hours. We’ll use Zoom and Slack with pre-distributed Notebooks and modules, and you’ll run it using an Anaconda environment. Here’s the write-up from my recent class.

We’ll focus on:

  • Profiling to find what’s slow in your code so you spend your time fixing the right things (this is so important – our intuitions are always wrong!)
  • Switching to NumPy to get the benefits of vectorisation
  • Compiling with Numba to get C-like speeds for very little effort (we’ll get a 200x speed-up overall)
  • Running in parallel with OpenMP and with Joblib to take advantage of multiple cores
  • Learning slow and faster ways of solving problems in Pandas (we’ll see a massive speed-up once we go slightly “under the covers” with Pandas and avoid doing silly access operations)
  • Using Numba-compiled functions to process Pandas data (via the raw=True trick – sketched after this list)
  • Using Dask to process Pandas in parallel and use all your cores when your data fits in RAM
  • Looking into Dask for processing bigger-than-RAM datasets
  • Reviewing other tooling and process options to make you generally more performant in your work
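
A taste of the raw=True trick from the list above – the function is illustrative and the speed-up you see will depend on your workload:

```python
import numba
import numpy as np
import pandas as pd

@numba.njit
def root_sum(values):
    # plain Python loop, but compiled to machine code by Numba
    total = 0.0
    for v in values:
        total += v ** 0.5
    return total

df = pd.DataFrame({"a": np.random.rand(1_000_000)})
# raw=True hands the compiled function a bare NumPy array per column,
# avoiding the cost of wrapping each column in a Series
result = df.apply(root_sum, raw=True)
```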

Feel free to contact me if you have questions about the course. I’m currently not planning to run another iteration of this for some months (possibly October for the next one).



Notes on last week’s Higher Performance Python class

Last week I ran a two-morning Higher Performance Python class; we covered:

  • Profiling slow code (using a 2D particle infection model in an interactive Jupyter Notebook) with line_profiler & py-spy (see the profiling sketch after this list)
  • Vectorising code with NumPy vs running the original with PyPy
  • Moving to Numba to make iterative and vectorised NumPy really fast (with up to a 200x improvement on one exercise)
  • Ways to make applying functions in Pandas much faster and multicore (with Dask & Swifter, along with Numba)
  • Best practice for each tool
  • An excellent discussion where I got taught a few new tips too (in part this is why I love teaching smart crowds!)
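
For reference, line_profiler can be driven from plain Python as well as via the %lprun magic – a minimal sketch with a stand-in function (py-spy, by contrast, is a command-line tool, e.g. “py-spy record -o profile.svg -- python model.py”):

```python
from line_profiler import LineProfiler

def step_model(n):
    # stand-in for the particle-model step function we profiled in class
    total = 0.0
    for i in range(n):
        total += i ** 0.5
    return total

lp = LineProfiler()
wrapped = lp(step_model)  # wrap the function to record per-line timings
wrapped(1_000_000)
lp.print_stats()          # per-line hit counts and timings
```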

If you’d like to hear about upcoming iterations please join my low-volume training announce list; I also offer a discount code in exchange for you spending two minutes filling in this very brief survey about your training needs.

“Working through the exercises from day 1 of the high performance python course. Who knew there was so much time to shave off from functions I use every day?.. apart from Ian of course” – Tim Williams

Here’s my happy class on the first morning:

[Photo: class attendees]

We used Zoom to orchestrate the calls, with a mix of screen-share for my demos and group discussion. Every hour we took a break; after the first morning I set some homework and I’m waiting to hear how the take-home exercise works out. In 2 weeks we’ll have a follow-up call to clear up any remaining questions. One thing that was apparent was that we need more time to discuss Pandas and “getting more rows into RAM”, so I’ll extend the next iteration to include this. A little of the class came directly from the 2nd edition of my High Performance Python book with O’Reilly (due out in May); almost all of it was freshly written for this class.

In the class Slack a bunch of interesting links were shared, and we got to discuss how several people use Numba with success in their companies. Whilst I still need to gather feedback from my class, it feels like “how to profile your code so you focus your effort on the real bottlenecks” was the winner of this class, along with showing how easily we can use Numba to speed things up in Pandas (if you know the raw=True trick!).

I plan to run another iteration of this class, along with online-only versions of my Successfully Delivering Data Science Projects & Software Engineering for Data Scientists courses – do let me know, or join my training email list, if you’d like to join.

