Entrepreneurial Geekiness
Notes on last week’s Higher Performance Python class
Last week I ran a two-morning Higher Performance Python class; we covered:
- Profiling slow code (using a 2D particle infection model in an interactive Jupyter Notebook) with line_profiler & py-spy
- Vectorising code with NumPy vs running the original with PyPy
- Moving to Numba to make iterative and vectorised NumPy really fast (with up to a 200x improvement on one exercise)
- Ways to make applying functions in Pandas much faster and multicore (with Dask & Swifter, along with Numba)
- Best practice for each tool
- An excellent discussion in which I was taught a few new tips too (in part this is why I love teaching smart crowds!)
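As a flavour of the vectorisation topic above, here's a minimal sketch (my own illustration, not the class materials) comparing a pure-Python loop with its NumPy equivalent:

```python
import numpy as np

def distances_python(xs, ys):
    """Pure-Python loop: compute each point's distance from the origin."""
    out = []
    for x, y in zip(xs, ys):
        out.append((x * x + y * y) ** 0.5)
    return out

def distances_numpy(xs, ys):
    """Vectorised version: one array expression, no Python-level loop."""
    return np.sqrt(xs ** 2 + ys ** 2)

xs = np.arange(5.0)
ys = np.arange(5.0)
# Both give the same answer; the NumPy form avoids per-element Python
# overhead so it is typically much faster on large arrays
assert np.allclose(distances_python(xs, ys), distances_numpy(xs, ys))
```

The vectorised form is also the shape of code that Numba's `@njit` compiles well, which is where the larger speed-ups in the class came from.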
If you’d like to hear about upcoming iterations, please join my low-volume training announce list; I also offer a discount code in exchange for two minutes spent filling in this very brief survey about your training needs.
“Working through the exercises from day 1 of the high performance python course from. Who knew there was so much time to shave off from functions I use every day?.. apart from Ian of course” – Tim Williams
Here’s my happy class on the first morning:
We used Zoom to orchestrate the calls with a mix of screen-share for my demos and group discussion. Every hour we took a break; after the first morning I set some homework and I’m waiting to hear how the take-home exercise works out. In two weeks we’ll have a follow-up call to clear up any remaining questions. One thing that was apparent was that we need more time to discuss Pandas and “getting more rows into RAM”, so I’ll extend the next iteration to include this. A little of the class came directly from the 2nd edition of my High Performance Python book with O’Reilly (due out in May); almost all of it was freshly written for this class.
In the class Slack a bunch of interesting links were shared, and we got to discuss how several people use Numba in their companies with success. Whilst I still need to gather feedback from my class, it feels like “how to profile your code so you focus your effort on the real bottlenecks” was the winner from this class, along with showing how easily we can use Numba to speed things up in Pandas (if you know the “raw=True” trick!).
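To illustrate the `raw=True` trick mentioned above, here's a small hypothetical sketch (the DataFrame and `row_sum` are my own, not class material):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(5.0), "b": np.arange(5.0) * 2})

def row_sum(row):
    # With raw=True each row arrives as a plain NumPy array rather than
    # a Series, skipping per-row Series construction overhead; this also
    # makes the function a candidate for Numba compilation
    return row[0] + row[1]

result = df.apply(row_sum, axis=1, raw=True)
```

Without `raw=True` the same call builds a fresh Series object for every row, which dominates the runtime for cheap functions.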
I plan to run another iteration of this class, along with online-only versions of my Successfully Delivering Data Science Projects & Software Engineering for Data Scientists courses – do let me know, or join my training email list, if you’d like to attend.
Ian is a Chief Interim Data Scientist via his Mor Consulting. Sign-up for Data Science tutorials in London and to hear about his data science thoughts and jobs. He lives in London, is walked by his high energy Springer Spaniel and is a consumer of fine coffees.
Notes from Zoom call on “Problems & Solutions for Data Science Remote Work”
On Friday I held an open Zoom call to discuss the problems and solutions posed by remote work for data scientists. I put this together as I’ve observed, from my teaching cohorts and from conversations with colleagues, that for anyone “suddenly working remotely” the process has typically not been smooth. I invited folk to join and asked that they share 1 pain and 1 tip via a GForm; some tips were also submitted via Twitter. We held a live video chat, I took notes, and I’ve summarised these below.
Given that we’re likely to stay in this remote mode for a minimum of 3 months, possibly 6 months, and to a greater or lesser extent over 1-2 years, it’ll pay for your team to invest in building a good remote process.
This post by Chris Parsons at Gower covers a CTO’s view of collaboration in a tech org (including data scientists), I like the notion of relinquishing control and discouraging continual availability.
I was joined by Jon Markwell (@jot), founder of The Skiff co-working environment down in Brighton (he’s built a brilliant community of freelancers there who often work remotely). He helps companies with their remote transformations. He’s been working on a new tool for remote readiness prior to Coronavirus and I invited him to share his tech-focused remote practices on the call. He’s open to you reaching out via Twitter or LinkedIn if you’d like his advice.
We spoke at a high level on:
- Well-being
- Avoiding distractions and isolation
- Team discussions
- Whiteboarding, tools, knowledge sharing
Well-being came up frequently in my “share 1 problem and 1 tip”. Tips included:
- Getting a manager to set the tone that “it is ok to work at a slower pace, take it easy, adapting takes time” helps folk reduce stress. Saying “I’m in” and “I’m out” in Slack can signal clear working hours, so everyone knows when folk are around; this helps when a distributed team maintains core hours
- Building a backlog of lower-priority but important work (back-filling tests, refactoring, reviewing untouched code) provides useful low-stress tasks a colleague can take on when they’re feeling less productive, so a positive outcome is still achieved
- Make a routine and stick to it. Maybe go for a walk first-thing prior to work to simulate a commute. Put on your “work clothes” rather than PJs.
- Figuring out processes to keep morale up is a team-wide issue. Overwork should be watched for just as much as underwork. A “#wellbeing” Slack channel at work might be a good place to share fun things, possibly with a separate “#covid19” channel to keep that news in one place (where it can also be avoided)
- For teams that don’t know each other Federico suggested GeoGuessr as a simple game that all can play on Zoom to break the ice
Distractions and isolation was also a frequent issue:
- Many of us on the call use the Pomodoro technique (working for 25 minutes on a timer, then taking a short break); Sandrine suggested this physical timer and I use the countdown timer on my phone
- To limit distracting websites, Flipd was suggested for phone focus, along with Freedom (now apparently banned by Apple), and Bertil suggested SelfControl (Mac)
Team communication was more tricky:
- A frequently cited issue was the loss of ad-hoc in-person communication for discussion
- Jon reminded everyone of the importance of over-communicating whilst the team adapts, with a focus on transparency to avoid people feeling left out
- Severin noted keeping core hours was helpful
- I suggested that if team members and bosses are unsure how a remote process might work, point them at large, decentralised and demonstrably-capable teams like those behind the open source scikit-learn, Pandas and Linux projects. The mailing lists, occasional calls and GitHub-fuelled process work very well
- Calls need to be controlled – establish an agenda and a protocol for asking questions (perhaps using chat simultaneously)
- Having an always-open video call where folk can just drop in and natter might simulate some of the relaxed chat in an office
Tools:
- A kanban/scrum process will work fine in a remote scenario, Trello boards work well (if you don’t have a system yet – columns like “ideas”, “blocked”, “in process”, “done” where they move left->right is a sensible basic flow)
- Tools noted include Miro (collaborative whiteboard), TandemChat (team chat), Slack or Microsoft Teams, Google suite during a video call
- Jon gave us a demo of Retrium for remote retrospectives, this looks fairly powerful and has a free 30 day trial
- Jon also showed us the output of his Remote Readiness tool, which helps a team score where they’re strong & weak for the move to remote – given Jon’s prior experience this will certainly help managers spot areas of weakness to address (also – contact Jon if you’d like his advice!)
Home office setup (we didn’t get to discuss this):
- A physical cable to the router is likely to be more stable than wifi if you’re a distance from your router (I use 15m of cat-5e cable)
- I also have a Netgear EX6120 range extender but neighbours now have similar so I prefer to depend on the physical cable
- I use a Logitech C920 HD camera (the C920S seems to be the newer version) which has auto-focus and “just works” with my Linux, it sits on top of my external monitor
- I’ve also got a comfy wired headset with microphone (wired as having a battery fail during a call is less helpful)
- Some people use a greenscreen as they hate having their home on display (this is discussed in this Hacker News thread)
Thoughts on the format of the Zoom call:
- This is the first time I’ve done a discussion like this, getting feedback from a group who don’t know each other was hard
- Prior to the call I created an agenda, agreed with my co-presenter, based on the GForm feedback
- At the start of the call everyone introduced themselves in the Zoom chat window (name, company, city) and I explained everyone should stay on mute unless they had a point to raise
- The fact that I knew 50% of the participants helped; I feel it would have been much more wooden if that proportion had been smaller
- Getting folk to contribute ideas was hard; asking folk to physically raise a hand works for those with video on (about 1/3 of the attendees), Zoom’s “raise hand” feature may be useful, and frequent reminders that questions can go into the Zoom chat would also help
- Once you close the meeting your chat history seems to be deleted – this is awful; I lost some of the notes I hadn’t copied over and I couldn’t see a way to retrieve them
- Having Jon Markwell along was great, having a “guest co-presenter” is important and he’s got a ton of useful experience
Another Successful Data Science Projects course completed
A week back I ran the 4th iteration of my 1 day Successful Data Science Projects course. We covered:
- How to write a Project Specification including a strong Definition of Done
- How to derisk a new dataset quickly using Pandas Profiling, Seaborn and dabl
- Building interactive data tools using Altair to identify trends and outliers (for quick discussion and diagnosis with business colleagues)
- A long discussion on best practice for designing and running projects to increase their odds of a good outcome
- Several diagnosis scenarios for prioritisation and valuation of potential projects
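As a flavour of the “derisk a new dataset quickly” topic above, here's a minimal first-look sketch using plain Pandas (my own hypothetical data; the class itself uses Pandas Profiling, Seaborn and dabl, which go much further):

```python
import numpy as np
import pandas as pd

# A toy dataset standing in for a newly-received file
df = pd.DataFrame({
    "age": [25, 31, np.nan, 44, 52],
    "city": ["London", "Brighton", "London", None, "Leeds"],
})

# Quick checks before trusting a new dataset:
dtypes = df.dtypes          # are the column types what we expect?
missing = df.isna().mean()  # fraction of missing values per column
summary = df.describe()     # ranges and outliers for numeric columns
```

Tools like Pandas Profiling automate these checks (and many more) into a single HTML report, which is why they make a good first derisking step.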
One of the lovely outcomes in the training Slack is that new tools get shared by the attendees – I particularly liked Streamlit, which was shared as an alternative to my Jupyter Widgets + sklearn estimator demo (which shows a way to hand-build an estimator under Widget control with GUI plots for interactive diagnosis). I’m going to look into integrating this into a future iteration of this course. Here’s my happy room of students:
If you’re interested in attending my future courses then make sure you’re on my low-volume training announce list (and/or you might want my more frequent Thoughts & Jobs email list). My upcoming Software Engineering for Data Scientists course has a seat left; if it’s sold out, do contact me to be on the reserve list. If you’d like a discount code for future courses, please complete my research survey for my 2020 courses.
Higher Performance Python (ODSC 2019)
Building on last week’s PyDataCambridge talk, I had the additional pleasure of speaking on Higher Performance Python at ODSC 2019 yesterday. I had a brilliant room of 300 Pythonic data scientists at all levels who asked an interesting array of questions:
This talk expanded on last week’s version at PyDataCambridge as I had some more time. The problem-intro was a little longer (and this helped set the scene as I had more first-timers in the room), then I dug a little further into Pandas and added extra advice at the end. Overall I covered:
- Robert Kern’s line_profiler to profile performance in sklearn’s “fit” method against a custom numpy function
- Pandas function calling using iloc/iterrows/apply and apply with raw=True (in increasingly-fast order)
- Using Swifter and Dask to parallelise over many cores
- Using Numba to get an easy additional 10x speed-up
- A discussion of advice for highly-performant teams, to sanity-check some of the options
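The “increasingly-fast order” of Pandas function calling above can be sketched as follows (a hypothetical example, not the talk’s actual code):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(1000.0)})

def double(x):
    return x * 2

# 1. Explicit iloc loop - slowest, pure-Python indexing on every row
by_iloc = [double(df["x"].iloc[i]) for i in range(len(df))]

# 2. iterrows - still constructs a Series object per row
by_iterrows = [double(row["x"]) for _, row in df.iterrows()]

# 3. apply - avoids the explicit loop but still calls Python per element
by_apply = df["x"].apply(double)

# 4. apply with raw=True - rows arrive as plain NumPy arrays,
#    skipping the per-row Series construction entirely
by_raw = df[["x"]].apply(lambda arr: arr[0] * 2, axis=1, raw=True)

# All four agree; they get progressively cheaper per row
assert np.allclose(by_iloc, by_raw)
```

Swifter, Dask and Numba then build on the fastest of these forms to spread the work across cores or compile it to machine code.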
“It was a fantastic talk.” – Stewart
My publisher O’Reilly were also kind enough to send over a box of the 1st edition High Performance Python books for signing, just as I did in Cambridge last week. As usual I didn’t have enough free books for those hoping for a copy – sorry if you missed out (I only get given a limited set to give away). The new content for the 2nd edition is available online in O’Reilly’s Safari Early Access Programme.
The talk ends with my customary note requesting a postcard if you learned something useful – feel free to send me an email asking for my address, I love to receive postcards 🙂 I have an email announce list for my upcoming training in January with a plan to introduce a High Performance Python training day, so join that list if you’d like low-volume announcements. I have a twice-a-month email list for “Ian’s Thoughts & Jobs Listing” which includes jobs I know about in our UK community and my recommendations and notes. Join this if you’d like an idea of what’s happening in the UK Pythonic Data Science scene.
The 2nd edition of High Performance Python should be out next April; preview it in the Early Access Programme here.
Training Courses for 2020 Q1 – Successful Data Science Projects & Software Engineering for Data Scientists
Early next year I’ll run new iterations of two of my existing training courses for Pythonic Data Scientists:
- Successful Data Science Projects (Jan, 1 day)
- Software Engineering for Data Scientists (Feb, 2 days)
Successful Data Science Projects focuses on reducing uncertainty in a new data science project. We’ll look at the reasons why these projects can fail (and heck – this is research – they can and occasionally should fail), review ways to derisk a project with tools you’re probably not yet using, plan out a Project Specification for agreement with stakeholders and review techniques to make your team more highly performant overall.
“After attending the course I can identify and communicate to the project team and client the uncertainties of the project efficiently. I am using the techniques covered on the course to write project initiation documents and put in place the necessary processes to reduce uncertainty. The course was very engaging and I was very happy to learn from Ian’s experience to ensure a successful delivery on all future projects.” – Dani Papamaximou, Data Scientist at Arcadia
Software Engineering for Data Scientists is a 2 day course aimed at data scientists (perhaps from an academic background) who lack strong software engineering skills. We cover reviewing “bad Notebook code”, refactoring this code, using a standardised folder structure (with cookiecutter), adding unit tests and defensive Pandas tests along with checking how to introduce these techniques back into your team.
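As a flavour of the “defensive Pandas tests” mentioned above, a minimal hypothetical sketch (`load_ages` and its checks are my own invention, not course material):

```python
import pandas as pd

def load_ages(df):
    """Validate assumptions about the data before using it downstream."""
    assert "age" in df.columns, "expected an 'age' column"
    assert df["age"].notna().all(), "unexpected missing ages"
    assert (df["age"] >= 0).all(), "ages must be non-negative"
    return df["age"]

df = pd.DataFrame({"age": [25, 31, 44]})
ages = load_ages(df)  # passes all the checks
```

The point is to fail loudly at the boundary where data enters the pipeline, rather than letting a bad assumption silently corrupt results several Notebook cells later.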
“Ian’s Software Engineering for Data Scientists course provides an excellent overview of best practices with focus on testing, debugging and general code maintenance. Ian has a wealth of experience and also makes sure to keep on top of the latest tools and libraries in the Data Science world. I would especially recommend the course to Data Science practitioners coming from an academic rather than software engineering background.” – Mirka, LibertyGlobal
I am also thinking of introducing a High Performance Python course based on the updates coming to the 2nd edition of my High Performance Python book (for release April 2020). You’ll get details about this on my low-frequency email training list and if you have strong thoughts about this, please get in contact!