Python – Entrepreneurial Geekiness

Upcoming discussion calls for Team Structure and Building a Backlog for data science leads

Ian — Fri, 01 Jul 2022 12:37:06 +0000

I ran another Executives at PyData discussion session for 50+ leaders at our PyDataLondon conference a couple of weeks back. We had a great conversation which dug into a lot of topics. I’ve written up notes on my NotANumber newsletter. If you’re a leader of DS and Data Eng teams, you probably want to review those notes.

To follow on from the conversations I’m going to run the following two (free) Zoom based discussion sessions. I’ll be recording the calls and adding notes to future newsletters. If you’d like to join, fill in this invite form and I can add you to the calendar invite. You can lurk and listen or – better – join in with questions.

~~Monday July 11, 4pm (UK time)~~, Data Science Team Structure – getting a good structure for your org, hybrid vs fully remote practices, processes that support your team, how to avoid being left out
Monday August 8th, 4pm (UK time), Backlog & Derisking & Estimation – how to build a backlog, derisking quickly and estimating the value behind your project

I’m expecting a healthy list of issues and good feedback and discussion for both calls. I’ll be sharing an agenda in advance to those who have contacted me. My goal is to turn these into bigger events in the future.

Ian is a Chief Interim Data Scientist via his Mor Consulting. Sign-up for Data Science tutorials in London and to hear about his data science thoughts and jobs. He lives in London, is walked by his high energy Springer Spaniel and is a consumer of fine coffees.

My first commit to Pandas

Ian — Wed, 10 Mar 2021 14:15:10 +0000

I’ve used the Pandas data science toolkit for over a decade and I’ve filed a couple of issues, but I’ve never contributed to the source. At the weekend I got to balance the books a little by making my first commit. With this pull request I fixed the recent request to update the pct_change docs to make the final example more readable.

The change was trivial – adding “periods=-1” to the argument and updating the docstring. The build process was a lot more involved – thankfully I was on a call with PyLadies London to try to help others make their first contribution to Pandas and I had organiser Marco Gorelli (a core contributor) to help when needed.
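If you've not met pct_change before, here's a minimal sketch of what periods=-1 does. The numbers are made up for illustration and this is not the example from the Pandas docs. With the default, each row is compared with the previous row; with periods=-1, each row is compared with the next row, so the last row has nothing to compare against and comes out as NaN.

```python
import pandas as pd

# Made-up prices for illustration (not the example from the Pandas docs)
prices = pd.Series([100.0, 110.0, 121.0])

# Default periods=1 compares each row with the previous row
forward = prices.pct_change()

# periods=-1 compares each row with the next row instead
backward = prices.pct_change(periods=-1)

print(forward)
print(backward)
```

In the forward case the first row is NaN, in the backward case the last row is NaN.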

Ultimately it boiled down to setting up a docker environment, running a new example in my shell, updating the relevant docstring on the local filesystem and then following the “contributing to the documentation” guide. My initial commit fell foul of the docstring style rules and the automated checking tools in the docker environment pointed this out. Once the local filesystem checker scripts were happy I pushed to my fork, created a PR and shortly after everything was done.

All in it took 45 minutes to get the environment setup, another 45 minutes to make my changes and figure out how to run the right scripts, then a bit longer to push and submit a PR (followed by overnight patience before it got picked up by the team).

When I teach my classes I always recommend that a good way to learn new development practices (like the automated use of black & flake8 in a precommit process) is to submit small fixes to open source projects – you learn so much along the way. I’ve not used docker in years and I don’t use automated docstring checking tools, so both presented nice little points for learning. I also have never used the pct_change function in Pandas…and now I have.

If you’ve not yet made a commit to an open source project, do have a think about it – you’ll get lots of hand holding (just be patient, positive and friendly when you leave comments) and you can stick a reference to the result on your CV for bragging rights. And you’ll have made the world a slightly better place.

Skinny Pandas Riding on a Rocket at PyDataGlobal 2020

Ian — Mon, 30 Nov 2020 19:25:02 +0000

On November 11th we saw the most ambitious ever PyData conference – PyData Global 2020 was a combination of world-wide PyData groups putting on a huge event to both build our international community and to leverage the on-line only conferences that we need to run during Covid 19.

The conference brought together almost 2,000 attendees from 65 countries with 165 speakers over 5 days on a 5-track schedule. All speaker videos had to be uploaded in advance so they could be checked and then provided ahead-of-time to attendees. You can see the full program here; the topic list was very solid since the selection committee had the best of the international community uploading their proposals.

The volunteer organising committee felt that giving attendees a chance to watch all the speakers at their leisure took away constraints of time zones – but we wanted to avoid the common end result of “watching a webinar” that has plagued many other conferences this year. Our solution included timed (and repeated) “watch parties” so you could gather to watch the video simultaneously with others, and then share discussion in chat rooms. The volunteer organising committee also worked hard to build a “virtual 2D world” with Gather.town – you walk around a virtual conference space (including the speakers’ rooms, an expo hall, parks, a bar, a helpdesk and more). Volunteer Jesper Dramsch made a very cool virtual tour of “how you can attend PyData Global” which has a great demo of how Gather works – it is worth a quick watch. Other conferences should take note.

Through Gather you could “attend” the keynote and speaker rooms during a watch-party and actually see other attendees around you, you could talk to them and you could watch the video being played. You genuinely got a sense that you were attending an event with others, that’s the first time I’ve really felt that in 2020 and I’ve presented at 7 events this year prior to PyDataGlobal (and frankly some of those other events felt pretty lonely – presenting to a blank screen and getting no feedback…that’s not very fulfilling!).

I spoke on “Skinny Pandas Riding on a Rocket” – a culmination of ideas covered in earlier talks, with a focus on getting more into Pandas so you don’t have to learn new technologies, and on seeing Vaex, Dask and SQLite in action if you do need to scale up your Pythonic data science.

I also organised another “Executives at PyData” session aimed at getting decision makers and team leaders into a (virtual) room for an hour to discuss pressing issues. Given 6 iterations of my “Successful Data Science Projects” training course in London over the last 1.5 years I know of many issues that repeatedly come up that plague decision makers on data science teams. We got to cover a set of issues and talk on solutions that are known to work. I have a fuller write-up to follow.

The conference also enabled a “pay what you can” model for those attending outside of a corporate ticket, this brought in a much wider audience than could normally attend a PyData conference. The goal of the non-profit NumFOCUS (who back the PyData global events) is to fund open source so the goal is always to raise more money and to provide a high quality educational and networking experience. For this on-line global event we figured it made sense to open out the community to even more folk – the “pay what you can” model is regarded as a success (this is the first time we’ve done it!) and has given us some interesting attendee insights to think on.

There are definitely some lessons to learn, notably the on-boarding process was complex (3 systems had to be activated) – the volunteer crew wrote very clear instructions but nonetheless it was a more involved process than we wanted. This will be improved in the future.

I extend my thanks to the wider volunteer organising committee and to NumFOCUS for making this happen!

“Making Pandas Fly” at EuroPython 2020

Ian — Fri, 24 Jul 2020 14:49:22 +0000

I’ve had a chance to return to talking about High Performance Python at EuroPython 2020 after my first tutorial on this topic back in 2011 in Florence. Today I spoke on Making Pandas Fly with a focus on making Pandas run faster. This covered:

Categories and RAM-saving datatypes to make 100-500x speed-ups (well, some of the time) including dtype_diet
Dropping to NumPy to make things potentially 10x faster (thanks James Powell and his callgraph code)
Numba for compilation (another 10x!)
Dask for parallelisation (2-8x!)
and taking a view on Modin & Vaex
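As a rough sketch of the first item in the list above, here's the "categories and smaller dtypes" idea done by hand on a made-up DataFrame (the column names and sizes are my assumptions, and this does not use dtype_diet). It only measures RAM, not speed, and the savings you see will depend on your data.

```python
import numpy as np
import pandas as pd

# Made-up data: a low-cardinality string column and a float64 column
rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "city": rng.choice(["London", "Budapest", "Amsterdam"], size=n),
    "value": rng.random(n),
})

before = df.memory_usage(deep=True).sum()

# Store the repeated strings as a Category and the floats as float32
slim = df.assign(
    city=df["city"].astype("category"),
    value=df["value"].astype("float32"),
)
after = slim.memory_usage(deep=True).sum()

print(f"before: {before:,} bytes, after: {after:,} bytes")
```

Note that float32 trades precision for RAM, so it is worth checking that your calculations tolerate it before converting.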

We might ask “why do this” and my answer is “let’s go faster using the tools we already know how to use”. Specifically – without investing time learning a new tool (e.g. Intel SDC, Vaex, Modin, Dask, Spark and more) we can extend our ability to work with larger datasets without leaving the comfort of Pandas so you can get to your answers quicker. This message went down well.

If you’re curious about this and want to go further you might want to look at my upcoming training courses (this includes Higher Performance, Software Engineering and Successful Data Science Projects). If you want tips and you want to stay on top of what I’m working on then join my twice-a-month mailing list (see the link for a recent example post).

Weekish notes

Ian — Wed, 15 Jul 2020 19:20:59 +0000

I’ve recently switched back from Sourdough yeast to dried packet yeast mix, given a recipe by a colleague (thanks Nick!). I immediately set to work modifying his recipe (well, cutting out steps if we’re honest). The first loaf looked fine but was bland – I cut out too much salt. The next was really very good (“shop quality”). For the third I used off-boil water for my autolyse and I think the water was still too hot and killed some of the yeast later giving me this dense lump. Later that evening after 2.5 hours I had a luke-warm water repeat loaf and it was brilliant. I confirmed this with toast & jam this morning.

I’ve got quite a log of notes for my two main recipes now and will have a Sourdough on the go again this weekend.

Working with my “still secret” client in a safe haven locked down remote instance I lack most of my usual tools (part by design, part my ignorance during configuration). I’ve got Vi so I’m getting my hands dirty with the underlying operations (hey! :bnext and :e work fine! Ctrl P does some sort of autocomplete! :ls lists my buffers!). This is a little painful and Apache Guacamole’s remote viewer can be troublesome (stripping £ symbols, giving me 3 different keyboard configs depending on when I login, forgetting some of my windows!) but on the whole the setup is working well.

I’ve also had to get down and dirty with Git – no GitK or other fun tools. I’ve discovered some nice light git configs like “git logline” which help with terminal based navigation in our small team.

Training classes are now listed for:

Software Engineering for Data Scientists (September) – write strong, tested, reliable and defensible code from Notebooks to modules to improve collaboration and resilience
Higher Performance Python (October) – profile CPU & memory usage, speed up your code, compile where useful and improve your Pandas & Dask to enable faster iteration and faster processing on your projects with minimal effort on your part
Successful Data Science Projects (November) – discover new process & tools to design data science projects that’ll run successfully, improve collaboration between your team and the wider business (this is built out of 15 years of painful lessons so you don’t have to make the same mistakes!)

Weekish notes

Ian — Sun, 05 Jul 2020 15:42:33 +0000

I gave another iteration of my Making Pandas Fly talk sequence for PyDataAmsterdam recently and received some lovely postcards from attendees as a result. I’ve also had time to list new iterations of my training courses for Higher Performance Python (October) and Software Engineering for Data Scientists (September), both will run virtually via Zoom & Slack in the UK timezone.

I’ve been using my dtype_diet tool to time more performance improvements with Pandas and I look forward to talking more on this at EuroPython this month.

In baking news I’ve improved my face-making on sourdough loaves (but still have work to do) and I figure now is a good time to have a crack at dried-yeast baking again.

Week note

Ian — Wed, 06 May 2020 12:41:51 +0000

Well, mid-next-week note I guess. I gave another variant of my higher performance Python talk last night for PyDataUK to 250 live streamers, we had some good questions, cheers all.

On Friday Micha & I heard that the 2nd edition of our High Performance Python book has gone to the printers – we’d said we’d do 1 person-month each on it last summer and 9 months later (with many-person-months invested each) we’re finally there. Phew.

I now have an open PR on the dabl project to add ordinal-sorted y-axis box plot items in place of the default always-sort-by-median, which I think makes some of the exploratory process more intuitive. This also involved figuring out a new weird matplotlib rendering behaviour and writing my first unit test where I make up a dummy matplotlib figure in a test which is rendered but never displayed. There’s always a million new things to learn, right?

I’ve also been digging into Companies House data to look at how the economy is responding to the pandemic and this lets me play with some higher performance Pandas operations (for my talks) and to dig into pivot_table, pivot, groupby and crosstab along with stack & unstack in Pandas. I’ve always been confused about how many options I have available here, I’m less confused now, but I still don’t understand some of the performance differences I see for otherwise-equivalent operations. I also discovered the Pandas xs operation (take a cross-section of a dataframe) whilst reading a wikipedia page on crosstabs. Learning. Always learning.
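To make the "how many options do I have" point concrete, here's a small sketch on made-up data (this is not the Companies House dataset, and the column names are hypothetical). It builds the same region-by-year table three ways and then uses xs to pull a cross-section out of a MultiIndex.

```python
import pandas as pd

# Made-up data standing in for a real dataset
df = pd.DataFrame({
    "region": ["London", "London", "Wales", "Wales", "London"],
    "year": [2019, 2020, 2019, 2020, 2020],
    "count": [5, 3, 2, 4, 1],
})

# Three routes to the same region x year table of summed counts
via_groupby = df.groupby(["region", "year"])["count"].sum().unstack("year")
via_pivot = df.pivot_table(index="region", columns="year", values="count", aggfunc="sum")
via_crosstab = pd.crosstab(df["region"], df["year"], values=df["count"], aggfunc="sum")

# xs takes a cross-section: all the years for one region
totals = df.groupby(["region", "year"])["count"].sum()
london = totals.xs("London", level="region")

print(via_pivot)
print(london)
```

The three tables hold the same numbers here, which is exactly why timing differences between them are surprising.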

My kneaded sourdough is improving, I’m up to 1kg now. I think I’m done with no-knead for a bit, that was fun but the really-wet dough is hard to handle. Radishes are great, pretty big now, but annoyingly the snails have found my lettuce.

“Flying Pandas” and “Making Pandas Fly” – virtual talks this weekend on faster data processing with Pandas, Modin, Dask and Vaex

Ian — Mon, 27 Apr 2020 11:42:15 +0000

This Saturday and Monday I’ve had my first experience presenting at virtual conferences – on Saturday it was for Remote Pizza Python (brilliant line-up!) and on Monday (note – this post predates the talk, I’ll update it tomorrow after I’ve spoken) at BudapestBI. UPDATE added 2nd variant of Making Pandas Fly for a short-notice PyDataUK talk too.

My slides for Remote Pizza Python are here “Flying Pandas – Modin, Dask & Vaex”. I cover the following in a 10 min talk:

Modin – new academic project, makes a new algebra for dataframes (not just Pandas), provides automated column & row parallelisation options for no code changes
Dask – great for blocked Pandas DataFrames in parallel on 1 or more machines (it can also parallelise on a single machine multi-core with in-RAM data which I didn’t cover)
Vaex – new Pandas-like DataFrame with a subset of operations, better string implementation so you fit more strings into RAM than with Pandas
I recommended sticking to Pandas if your data fits in RAM, trying Modin if you have it in RAM or using Dask if you have a bigger-than-RAM scenario, with Vaex being great for an experiment

The reaction was very positive and on the internal Discord chat we had some great Q&A about the use of Numba, Dask, Modin and other tools.

For PyDataBudapest I gave a longer 30 min talk on Making Pandas Fly (GitHub source as Notebooks for both parts):

Dask to preprocess larger dataset with nice diagnostics
Pandas – using Category and float32 or float16 to save RAM and to do faster lookups
Pandas – dropping to NumPy to calculate numeric operations faster with a look at James Powell’s great callgraph prototype to dig into the call history complexity
Pandas – using Numba to accelerate numeric functions

The remote talk to Budapest was slightly hampered by a chunk of the Virgin internet backbone disappearing just before I spoke, thankfully we got it back a few minutes later (else I was going to live present, with a live Dask demo, tethered via a 4G mobile connection!). I had some great questions from the Budapest audience – thanks for having me!

For PyDataUK, our inaugural event this week (a week after the two talks above) I gave a variant of Making Pandas Fly to 250 live streamers, organised by the lovely crowd at PyDataManchester. The YouTube link is available (via the meetup page), Paige Bailey of Google was the lead speaker on TensorFlow Probability which was an intriguing talk (sadly yet another thing I’ll run out of time before trying).

Thanks to James Powell for his CallGraph code (uploaded here) to show how many extra calls Pandas might add onto a NumPy operation on a column.
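If you want a rough feel for those extra calls without the CallGraph code, here's a stand-in sketch using cProfile from the standard library. To be clear, this is not James's tool. The count_calls helper is a hypothetical name of mine, and the exact counts will vary with your Pandas and NumPy versions. It simply counts how many function calls are recorded for a sum on a Series versus the same sum on the underlying NumPy array.

```python
import cProfile
import pstats

import numpy as np
import pandas as pd


def count_calls(fn):
    """Run fn() under cProfile and return the total number of calls recorded."""
    profiler = cProfile.Profile()
    profiler.enable()
    fn()
    profiler.disable()
    return pstats.Stats(profiler).total_calls


series = pd.Series(np.arange(1_000_000, dtype="float64"))
array = series.to_numpy()  # the NumPy array behind the Series

pandas_calls = count_calls(series.sum)
numpy_calls = count_calls(array.sum)
print(f"Series.sum(): {pandas_calls} calls, ndarray.sum(): {numpy_calls} calls")
```

Both routes give the same answer; the difference is in how much machinery runs on the way there.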

The sensible outcomes of both talks are:

Use Dask to preprocess large datasets that don’t fit into RAM
Use Pandas intelligently to save RAM and make your manipulations run faster for investigatory work
Look at Modin to see if making no changes to your code can result in speed-ups on larger in-RAM datasets
Check Vaex if you have large memory mapped (HDF5) datasets or if you want faster string processing

New Higher Performance Python class (June 1-3)

Ian — Fri, 24 Apr 2020 17:01:03 +0000

I’ve listed my next Higher Performance Python public class, it’ll run online for 3 mornings on June 1-3 during UK hours. We’ll use Zoom and Slack with pre-distributed Notebooks and modules and you’ll run it using an Anaconda environment. Here’s the write-up from my recent class.

We’ll focus on

Profiling to find what’s slow in your code so you spend your time fixing the right things (this is so important, our intuitions are always wrong!)
Switching to NumPy to get benefits from vectorisation
Compiling with Numba to get C-like speeds for very little effort (we’ll get a 200x speed-up overall)
Run in parallel with OpenMP and with JobLib to take advantage of multiple cores
Learn slow and faster ways of solving problems in Pandas (we’ll see a massive speed-up once we go slightly “under the covers” with Pandas and avoid doing silly access operations)
Use Numba compiled functions to process Pandas data (using the raw=True trick)
Use Dask to process Pandas in parallel to use all your cores when your data fits in RAM
Look into using Dask to process bigger-than-RAM datasets
Review other tooling and process options to make you generally more performant in your work
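As a taster for the raw=True item in the list above, here's a minimal sketch on made-up data. With raw=True, apply hands each row to your function as a plain NumPy array rather than building a Series per row. The row_range function is a hypothetical example of mine and is plain NumPy here to avoid needing Numba installed; my understanding is that a Numba-compiled function can be passed in the same way because it receives NumPy arrays.

```python
import numpy as np
import pandas as pd


def row_range(row):
    """Spread of the values in one row; works on a Series or an ndarray."""
    return row.max() - row.min()


# Made-up numeric data
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((1_000, 4)), columns=list("abcd"))

# Default: each row is wrapped in a Series before row_range sees it
via_series = df.apply(row_range, axis=1)

# raw=True: each row arrives as a NumPy array, skipping the Series overhead
via_raw = df.apply(row_range, axis=1, raw=True)

print(via_series.head())
print(via_raw.head())
```

The results match; timing the two calls on your own data (for example with %timeit in a Notebook) is the way to see how much the per-row Series construction costs.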

Feel free to contact me if you have questions about the course. I’m currently not planning to run another iteration of this for some months (possibly October for the next one).

Notes on last week’s Higher Performance Python class

Ian — Tue, 14 Apr 2020 15:13:24 +0000

Last week I ran a two-morning Higher Performance Python class, we covered:

Profiling slow code (using a 2D particle infection model in an interactive Jupyter Notebook) with line_profiler & PySpy
Vectorising code with NumPy vs running the original with PyPy
Moving to Numba to make iterative and vectorised NumPy really fast (with up to a 200x improvement on one exercise)
Ways to make applying functions in Pandas much faster and multicore (with Dask & Swifter, along with Numba)
Best practice for each tool
Excellent discussion where I got taught a few new tips too (and in part this is why I love teaching smart crowds!)
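To give a flavour of the vectorising step in the list above, here's a small sketch that is not taken from the class material. The function names and the toy "particles within a radius of the origin" task are my own assumptions. It shows the same calculation written as a pure Python loop and as a single NumPy expression.

```python
import numpy as np


def count_within_loop(xs, ys, radius):
    """Pure Python loop: test each particle one at a time."""
    count = 0
    for x, y in zip(xs, ys):
        if x * x + y * y <= radius * radius:
            count += 1
    return count


def count_within_vectorised(xs, ys, radius):
    """NumPy version: one expression over whole arrays, no Python-level loop."""
    return int(np.count_nonzero(xs * xs + ys * ys <= radius * radius))


# Made-up particle positions in the square [-1, 1] x [-1, 1]
rng = np.random.default_rng(0)
xs = rng.uniform(-1.0, 1.0, size=100_000)
ys = rng.uniform(-1.0, 1.0, size=100_000)

print(count_within_loop(xs, ys, 0.5))
print(count_within_vectorised(xs, ys, 0.5))
```

Profile before and after a change like this so you know the loop really was the bottleneck.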

If you’d like to hear about the upcoming iterations please join my low-volume training announce list and I offer a discount code in exchange for you spending two minutes filling in this very brief survey about your training needs.

“Working through the exercises from day 1 of the high performance python course from. Who knew there was so much time to shave off from functions I use every day?.. apart from Ian of course” – Tim Williams

Here’s my happy class on the first morning:

We used Zoom to orchestrate the calls with a mix of screen-share for my demos and group discussion. Every hour we took a break. After the first morning I set some homework and I’m waiting to hear how the take-home exercise will work out. In 2 weeks we’ll have a follow-up call to clear up any remaining questions. One thing that was apparent was that we need more time to discuss Pandas and “getting more rows into RAM” so I’ll extend the next iteration to include this. A little of the class came directly from the 2nd edition of my High Performance Python book with O’Reilly (due out in May), almost all of it was freshly written for this class.

In the class Slack a bunch of interesting links were shared, we got to discuss how several people use Numba in their companies with success. Whilst I need to gather feedback from my class it feels like the “how to profile your code so you focus your effort on the real bottlenecks” was the winner from this class, along with showing how easily we can use Numba to speed up things in Pandas (if you know the “raw=True” trick!).

I plan to run another iteration of this class, along with online-only versions of my Successfully Delivering Data Science Projects & Software Engineering for Data Scientists – do let me know, or join my training email list, if you’d like to join.