Entrepreneurial Geekiness

Ian is a London-based independent Chief Data Scientist who coaches teams, teaches and creates data products. More about Ian here.

“Making Pandas Fly” at EuroPython 2020

I’ve had a chance to return to talking about High Performance Python at EuroPython 2020 after my first tutorial on this topic back in 2011 in Florence. Today I spoke on Making Pandas Fly with a focus on making Pandas run faster. This covered:

  • Categories and RAM-saving datatypes to make 100-500x speed-ups (well, some of the time) including dtype_diet
  • Dropping to NumPy to make things potentially 10x faster (thanks James Powell and his callgraph code)
  • Numba for compilation (another 10x!)
  • Dask for parallelisation (2-8x!)
  • and taking a view on Modin & Vaex
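To make the first two bullets concrete, here is a minimal sketch (not the code from the talk). It assumes a made-up DataFrame with a repeated string column; the column names, sizes and city values are hypothetical. It converts the strings to a category, compares memory use, and checks that a groupby and a NumPy-level sum give the same answers as the plain Pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical data: many rows drawn from a handful of repeated strings
rng = np.random.default_rng(0)
cities = np.array(["London", "Amsterdam", "Florence", "Bilbao"])
df = pd.DataFrame({
    "city": cities[rng.integers(0, len(cities), size=200_000)],
    "value": rng.random(200_000),
})

# Strings store every label per row; a category stores small integer codes
bytes_as_str = df["city"].memory_usage(deep=True)
df["city_cat"] = df["city"].astype("category")
bytes_as_cat = df["city_cat"].memory_usage(deep=True)

# The same groupby works on either column and gives the same means
mean_str = df.groupby("city")["value"].mean()
mean_cat = df.groupby("city_cat", observed=True)["value"].mean()

# Dropping to NumPy: operate on the underlying array rather than the Series
total_pd = df["value"].sum()
total_np = df["value"].to_numpy().sum()

print(f"str: {bytes_as_str:,} bytes, category: {bytes_as_cat:,} bytes")
```

How much faster any of this runs depends heavily on the data and the operation, so it is worth timing on your own DataFrame (for example with `%timeit` in a Notebook) rather than trusting a headline number.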

We might ask “why do this” and my answer is “let’s go faster using the tools we already know how to use”. Specifically, without investing time learning a new tool (e.g. Intel SDC, Vaex, Modin, Dask, Spark and more) we can extend our ability to work with larger datasets without leaving the comfort of Pandas, so we get to our answers quicker. This message went down well (image: feedback from EuroPython 2020).

If you’re curious about this and want to go further you might want to look at my upcoming training courses (this includes Higher Performance, Software Engineering and Successful Data Science Projects). If you want tips and you want to stay on top of what I’m working on then join my twice-a-month mailing list (see the link for a recent example post).

 


Ian is a Chief Interim Data Scientist via his Mor Consulting. Sign-up for Data Science tutorials in London and to hear about his data science thoughts and jobs. He lives in London, is walked by his high energy Springer Spaniel and is a consumer of fine coffees.

Weekish notes

I’ve recently switched back from sourdough yeast to a dried packet yeast mix, using a recipe given to me by a colleague (thanks Nick!). I immediately set to work modifying his recipe (well, cutting out steps if we’re honest). The first loaf looked fine but was bland because I cut out too much salt. The next was really very good (“shop quality”). For the third I used off-boil water for my autolyse and I think the water was still too hot and killed some of the yeast, which later gave me a dense lump. Later that evening, after 2.5 hours, I had a repeat loaf made with lukewarm water and it was brilliant. I confirmed this with toast & jam this morning.

I’ve got quite a log of notes for my two main recipes now and will have a Sourdough on the go again this weekend.

Working with my “still secret” client in a locked-down, safe-haven remote instance, I lack most of my usual tools (partly by design, partly through my ignorance during configuration). I’ve got Vi so I’m getting my hands dirty with the underlying operations (hey! :bnext and :e work fine! Ctrl P does some sort of autocomplete! :ls lists my buffers!). This is a little painful and Apache Guacamole’s remote viewer can be troublesome (stripping £ symbols, giving me 3 different keyboard configs depending on when I log in, forgetting some of my windows!) but on the whole the setup is working well.

I’ve also had to get down and dirty with Git – no GitK or other fun tools. I’ve discovered some nice light git configs like “git logline” which help with terminal based navigation in our small team.

Training classes are now listed for:

  • Software Engineering for Data Scientists (September) – write strong, tested, reliable and defensible code from Notebooks to modules to improve collaboration and resilience
  • Higher Performance Python (October) – profile CPU & memory usage, speed up your code, compile where useful and improve your Pandas & Dask to enable faster iteration and faster processing on your projects with minimal effort on your part
  • Successful Data Science Projects (November) – discover new process & tools to design data science projects that’ll run successfully, improve collaboration between your team and the wider business (this is built out of 15 years of painful lessons so you don’t have to make the same mistakes!)


Weekish notes

I gave another iteration of my Making Pandas Fly talk sequence for PyDataAmsterdam recently and received some lovely postcards from attendees as a result. I’ve also had time to list new iterations of my training courses for Higher Performance Python (October) and Software Engineering for Data Scientists (September); both will run virtually via Zoom & Slack in the UK timezone.

I’ve been using my dtype_diet tool to time more performance improvements with Pandas and I look forward to talking more on this at EuroPython this month.

In baking news I’ve improved my face-making on sourdough loaves (but still have work to do) and I figure now is a good time to have a crack at dried-yeast baking again.

 



“Making Pandas Fly” for PyDataAmsterdam 2020

I thank the PyDataAmsterdam 2020 organisers for another chance to speak on Making Pandas Fly. This variant of the talk focuses more on:

  • Understanding when categories beat strings and smaller floats beat larger ones
  • What’s happening with NumPy behind the scenes
  • How we can save 50% of our RAM (and so fit in more data to the same machine) by checking dtypes with my dtype_diet tool
  • Considering that float16 is simulated on modern hardware and so is memory efficient but slow for calculating!
  • Tips to install bottleneck & numexpr to make Pandas faster
  • Digging into some Pandas internals when I filed a bug – and what I learned as a result (you can learn too by reading the bug report!)
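The “smaller floats” points above can be sketched as follows. This is an illustration under my own assumptions (an evenly spaced, made-up Series), not material from the talk. It shows the memory halving from float64 to float32, and that float16 halves it again at the cost of precision.

```python
import numpy as np
import pandas as pd

# Hypothetical column of 100,000 values between 0 and 1
s64 = pd.Series(np.linspace(0.0, 1.0, 100_000), dtype="float64")
s32 = s64.astype("float32")
s16 = s64.astype("float16")

# Bytes used by the values only (the index is excluded)
bytes64 = s64.memory_usage(index=False)
bytes32 = s32.memory_usage(index=False)
bytes16 = s16.memory_usage(index=False)

# Worst-case rounding error introduced by each smaller dtype
max_err32 = float((s32.astype("float64") - s64).abs().max())
max_err16 = float((s16.astype("float64") - s64).abs().max())

print(bytes64, bytes32, bytes16)
print(f"float32 max error {max_err32:.2e}, float16 max error {max_err16:.2e}")
```

Whether the lost precision matters is a judgment call for each column, which is why it helps to check the error on real data before downcasting.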

In a few months I’ll run another of my Higher Performance Python virtual training classes and you’re most welcome to join. You’ll find details on my very-lightly-used “training email list”; you should join this if you’d like to hear about my upcoming training courses.

I make notes on some of these topics in my irregular “weekish notes” here on the blog and in my every-2-weeks “thoughts & jobs” email list. You’re welcome to join the list (your email is always kept private) if getting it in your inbox is more convenient.

At the end of my talks I always ask for a postcard “if you learned something”. I’ve just received the first for last week’s talk, from the Netherlands – thanks!



Weeknote (dtype-diet)

Over the weekend I hacked on dtype_diet – a tool for Pandas users that checks their DataFrame to see if smaller datatypes might be applicable. Where they are, the smaller datatypes offer a reduction in RAM with no data loss, and for Categorical data there’s also the possibility of faster calculations. This tool makes no changes; it recommends the code you might copy into your project. This is one of the ideas I didn’t get around to building whilst I was working on my book, but since that’s now published…
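To give a feel for the kind of check involved, here is a hand-rolled sketch. It is not dtype_diet’s API or implementation; the function name `lossless_int_candidates` and the example Series are hypothetical. It only reports which smaller integer dtypes could hold every value and changes nothing.

```python
import numpy as np
import pandas as pd

def lossless_int_candidates(series):
    """Return names of smaller integer dtypes that can hold every value."""
    candidates = []
    for dtype in (np.int8, np.int16, np.int32):
        # Skip dtypes that are not actually smaller than the current one
        if np.dtype(dtype).itemsize >= series.dtype.itemsize:
            continue
        info = np.iinfo(dtype)
        # Lossless only if the full value range fits inside the dtype
        if series.min() >= info.min and series.max() <= info.max:
            candidates.append(np.dtype(dtype).name)
    return candidates

ages = pd.Series([0, 18, 42, 101], dtype="int64")  # fits in int8
big = pd.Series([0, 70_000], dtype="int64")        # too large for int16

suggest_ages = lossless_int_candidates(ages)
suggest_big = lossless_int_candidates(big)
print(suggest_ages, suggest_big)
```

A recommendation like this can then be applied by hand, e.g. `ages.astype("int8")`, once you are happy that future data will stay in range.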

Talking of which – Micha and I have cleaned up our profiles on Amazon and Goodreads for High Performance Python 2nd ed. It is a pity we can’t do any book signings at events (seeing as we won’t have physical events for a while!). One thing I might do via my “thoughts & jobs” email list is to organise a Friday “coffee morning” to chat with whoever turns up about high performance Python and other subjects.

I’ve also learned that if I make a 1kg sourdough dough, immediately after kneading I can put 0.5kg into the fridge and 4 days later it cooks just fine. This is a nice refinement. I’m now interspersing buying fresh bread (for variety and convenience) with making my own and it feels like a more sustainable practice.

All going well I’ll be talking this month at PyDataAmsterdam and next month at EuroPython on higher performance Python and Pandas. I figure the dtype_diet tool will fit in nicely with a discussion about the new Pandas dtypes and their benefits over numpy dtypes.
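One benefit of the newer Pandas dtypes that I’d expect to come up (an assumption on my part about what such a discussion might cover) is missing-value handling. In this small sketch with made-up values, a NumPy-backed column with a gap becomes float64, whereas the nullable `Int64` dtype stays integer.

```python
import numpy as np
import pandas as pd

# NumPy integers cannot hold a missing value, so the column becomes float64
as_numpy = pd.Series([1, 2, np.nan])

# The nullable Pandas dtype keeps integers and marks the gap as pd.NA
as_nullable = pd.Series([1, 2, None], dtype="Int64")

print(as_numpy.dtype, as_nullable.dtype)
print(as_nullable.sum())  # missing values are skipped by default
```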

