This Saturday and Monday I had my first experiences of presenting at virtual conferences. On Saturday it was for Remote Pizza Python (brilliant line-up!) and on Monday at BudapestBI (note: this post predates that talk, I’ll update it tomorrow after I’ve spoken). UPDATE: I’ve added a 2nd variant of Making Pandas Fly for a short-notice PyDataUK talk too.
My slides for Remote Pizza Python are here: “Flying Pandas – Modin, Dask & Vaex”. I covered the following in a 10 minute talk:
- Modin – new academic project, makes a new algebra for dataframes (not just Pandas), provides automated column & row parallelisation options for no code changes
- Dask – great for processing blocked (partitioned) Pandas DataFrames in parallel on 1 or more machines (it can also parallelise in-RAM data across multiple cores on a single machine, which I didn’t cover)
- Vaex – new Pandas-like DataFrame with a subset of operations, better string implementation so you fit more strings into RAM than with Pandas
- I recommended sticking with Pandas if your data fits in RAM, trying Modin if your data is in RAM and you want a speed-up without code changes, or using Dask if you have a bigger-than-RAM scenario, with Vaex being great for an experiment
For BudapestBI I gave “Making Pandas Fly”, covering:
- Dask to preprocess larger dataset with nice diagnostics
- Pandas – using Category and float32 or float16 to save RAM and to do faster lookups
- Pandas – dropping to NumPy to calculate numeric operations faster with a look at James Powell’s great callgraph prototype to dig into the call history complexity
- Pandas – using Numba to accelerate numeric functions
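The Pandas points above (Category, float32 and dropping to NumPy) can be sketched roughly as follows. This is a minimal illustration rather than code from the talk: the `city` and `price` columns and the data are made up, and the actual savings will depend on your data and Pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: a low-cardinality string column and a float64 column.
rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "city": rng.choice(["London", "Budapest", "Manchester"], size=n),
    "price": rng.random(n) * 100,
})

before = df.memory_usage(deep=True).sum()

# Category stores each distinct string once plus a small integer code per row;
# float32 is half the width of float64, at the cost of some precision.
small = df.assign(
    city=df["city"].astype("category"),
    price=df["price"].astype("float32"),
)
after = small.memory_usage(deep=True).sum()
print(f"before: {before:,} bytes, after: {after:,} bytes")

# Dropping to NumPy: run the numeric operation on the underlying array,
# which skips some of the Pandas machinery around each call.
pandas_mean = df["price"].mean()
numpy_mean = df["price"].to_numpy().mean()
```

Timing the last two lines with `%timeit` on your own data is the easiest way to see whether the NumPy route is worth it for a given operation.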
The remote talk to Budapest was slightly hampered by a chunk of the Virgin internet backbone disappearing just before I spoke. Thankfully we got it back a few minutes later (else I was going to present live, with a live Dask demo, tethered via a 4G mobile connection!). I had some great questions from the Budapest audience – thanks for having me!
For PyDataUK, our inaugural event this week (a week after the two talks above), I gave a variant of Making Pandas Fly to 250 live streamers, organised by the lovely crowd at PyDataManchester. The YouTube link is available (via the meetup page). Paige Bailey of Google was the lead speaker on TensorFlow Probability, which was an intriguing talk (sadly yet another thing I’ll run out of time before trying).
The sensible outcomes of both talks are:
- Use Dask to preprocess large datasets that don’t fit into RAM
- Use Pandas intelligently to save RAM and make your manipulations run faster for investigatory work
- Look at Modin to see if making no changes to your code can result in speed-ups on larger in-RAM datasets
- Check Vaex if you have large memory mapped (HDF5) datasets or if you want faster string processing
Ian is a Chief Interim Data Scientist via his Mor Consulting. Sign up for Data Science tutorials in London and to hear about his data science thoughts and jobs. He lives in London, is walked by his high energy Springer Spaniel and is a consumer of fine coffees.