Entrepreneurial Geekiness

Ian is a London-based independent Chief Data Scientist who coaches teams, teaches and creates data products. More about Ian here.

Higher Performance Python (ODSC 2019)

Building on PyDataCambridge last week, I had the additional pleasure of talking on Higher Performance Python at ODSC 2019 yesterday. I had a brilliant room of 300 Pythonic data scientists at all levels who asked an interesting array of questions.

Happy smiling audience

This talk expanded on last week’s version at PyDataCambridge as I had some more time. The problem-intro was a little longer (and this helped set the scene as I had more first-timers in the room), then I dug a little further into Pandas and added extra advice at the end. Overall I covered:

  • Robert Kern’s line_profiler to profile performance in sklearn’s “fit” method against a custom numpy function
  • Pandas function calling using iloc/iterrows/apply and apply with raw=True (in increasingly-fast order)
  • Using Swifter and Dask to parallelise over many cores
  • Using Numba to get an easy additional 10x speed-up
  • Advice for highly performant teams, to sanity-check some of these options
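
As a rough illustration of the Pandas bullet above (this is a sketch, not the code from the talk), the four row-wise calling styles can be compared on a small DataFrame. The helper name `ols_slope`, the random data and the frame shape are all assumptions made for this example; each style should give the same answer, with only the speed differing.

```python
import numpy as np
import pandas as pd


def ols_slope(row):
    """Least-squares slope of the values in `row` against 0..n-1 (hypothetical helper)."""
    y = np.asarray(row, dtype=float)
    x = np.arange(len(y), dtype=float)
    # Design matrix with a slope column and an intercept column.
    A = np.column_stack([x, np.ones_like(x)])
    slope, _intercept = np.linalg.lstsq(A, y, rcond=None)[0]
    return slope


# Made-up data: 200 rows of 14 observations each.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 14)))

# 1. Positional lookup with iloc inside a Python loop.
via_iloc = pd.Series([ols_slope(df.iloc[i]) for i in range(len(df))], index=df.index)

# 2. iterrows yields (index, Series) pairs, one per row.
via_iterrows = pd.Series({idx: ols_slope(row) for idx, row in df.iterrows()})

# 3. apply along axis=1 lets Pandas drive the loop, passing a Series per row.
via_apply = df.apply(ols_slope, axis=1)

# 4. raw=True hands the function a plain NumPy array instead of a Series.
via_raw = df.apply(ols_slope, axis=1, raw=True)
```

Timing each variant (for example with `%timeit` in a Notebook) on your own data is the only reliable way to see the ordering for your workload.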

“It was a fantastic talk.” – Stewart

My publisher O’Reilly were also kind enough to send over a box of the 1st edition High Performance Python books for signing, just as I did in Cambridge last week. As usual I didn’t have enough free books for those hoping for a copy – sorry if you missed out (I only get given a limited set to give away). The new content for the 2nd edition is available online in O’Reilly’s Safari Early Access Programme.

Book signing

The talk ends with my customary note requesting a postcard if you learned something useful – feel free to send me an email asking for my address, as I love to receive postcards 🙂 I have an email announce list for my upcoming training in January, where I plan to introduce a High Performance Python training day, so join that list if you’d like low-volume announcements. I also have a twice-a-month email list, “Ian’s Thoughts & Jobs Listing”, which includes jobs I know about in our UK community along with my recommendations and notes. Join this if you’d like an idea of what’s happening in the UK Pythonic Data Science scene.

The 2nd edition of High Performance Python should be out next April; you can preview it in the Early Access Programme here.


Ian is a Chief Interim Data Scientist via his Mor Consulting. Sign-up for Data Science tutorials in London and to hear about his data science thoughts and jobs. He lives in London, is walked by his high energy Springer Spaniel and is a consumer of fine coffees.

Training Courses for 2020 Q1 – Successful Data Science Projects & Software Engineering for Data Scientists

Early next year I run new iterations of two of my existing training courses for Pythonic Data Scientists:

Successful Data Science Projects focuses on reducing uncertainty in a new data science project. We’ll look at the reasons why these projects can fail (and heck – this is research – they can and occasionally should fail), review ways to derisk a project with tools you’re probably not yet using, plan out a Project Specification for agreement with stakeholders and review techniques to make your team more highly performant overall.

“After attending the course I can identify and communicate to the project team and client the uncertainties of the project efficiently. I am using the techniques covered on the course to write project initiation documents and put in place the necessary processes to reduce uncertainty. The course was very engaging and I was very happy to learn from Ian’s experience to ensure a successful delivery on all future projects.” – Dani Papamaximou, Data Scientist at Arcadia

Software Engineering for Data Scientists is a 2-day course aimed at data scientists (perhaps from an academic background) who lack strong software engineering skills. We cover reviewing “bad Notebook code”, refactoring this code, using a standardised folder structure (with cookiecutter), and adding unit tests and defensive Pandas tests, along with how to introduce these techniques back into your team.
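
To give a flavour of what “unit tests and defensive Pandas tests” can look like, here is a minimal sketch using plain assertions. It is not taken from the course material; the column names (`order_id`, `quantity`, `unit_price`) and the function names are hypothetical, chosen only for this example.

```python
import pandas as pd


def check_orders(df):
    """Raise AssertionError early if the frame breaks our assumptions."""
    expected = {"order_id", "quantity", "unit_price"}  # hypothetical schema
    missing = expected - set(df.columns)
    assert not missing, f"missing columns: {sorted(missing)}"
    assert df["order_id"].is_unique, "order_id must be unique"
    assert df[["quantity", "unit_price"]].notna().all().all(), "unexpected NaNs"
    assert (df["quantity"] > 0).all(), "quantity must be positive"
    return df


def add_total(df):
    """Add a line total, validating the input first."""
    out = check_orders(df).copy()
    out["total"] = out["quantity"] * out["unit_price"]
    return out


def test_add_total():
    df = pd.DataFrame(
        {"order_id": [1, 2], "quantity": [2, 3], "unit_price": [1.5, 2.0]}
    )
    assert add_total(df)["total"].tolist() == [3.0, 6.0]


def test_rejects_bad_quantity():
    bad = pd.DataFrame({"order_id": [1], "quantity": [0], "unit_price": [1.0]})
    try:
        add_total(bad)
    except AssertionError:
        return  # the defensive check fired, as intended
    raise RuntimeError("expected the defensive check to fail")
```

The two `test_` functions follow the naming convention that pytest collects automatically, so the same file can be dropped into a `tests/` folder unchanged.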

“Ian’s Software Engineering for Data Scientists course provides an excellent overview of best practices with focus on testing, debugging and general code maintenance. Ian has a wealth of experience and also makes sure to keep on top of the latest tools and libraries in the Data Science world. I would especially recommend the course to Data Science practitioners coming from an academic rather than software engineering background.” – Mirka, LibertyGlobal

I am also thinking of introducing a High Performance Python course based on the updates coming to the 2nd edition of my High Performance Python book (for release April 2020). You’ll get details about this on my low-frequency email training list and if you have strong thoughts about this, please get in contact!



“Higher Performance Python” at PyDataCambridge 2019

I’ve had the pleasure of speaking at the first PyDataCambridge conference (2019). This is the second PyData conference in the UK after PyDataLondon (which colleagues and I co-founded 6 years back). I’m super proud to see PyData spread to 6 regional meetups and now 2 UK conferences.

We had over 200 attendees at the conference (plus a swanky black-tie conference dinner), and the single-track event had a rich set of topics (schedule). For me scikit-multiflow (extending sklearn to streaming data) was a hit, along with model stability checking (by FarFetch) and an overview of GA2M (an extended Generalised Additive Model with explainability). Thanks to the speakers for fine talks, the audience for fine questions, and Cambridge Spark and the PyDataCambridge meetup for helping make it all happen!

I spoke on Higher Performance Python with a focus towards making Pandas operations go faster and an eye on the upcoming Second Edition of our High Performance Python (O’Reilly) book. The talk covers:

  • Using line_profiler to evaluate sklearn’s LinearRegression vs NumPy’s lstsq (spoiler – lstsq is much faster, but only because sklearn is much safer: the slow-down comes from safety code in sklearn that helps keep your productivity higher overall)
  • Using Pandas for line-by-line iteration (slow) vs apply (faster) and apply with raw=True to expose NumPy arrays (fastest)
  • Using Numba to JIT compile lstsq using apply with raw=True for a huge speed-up
  • Using Dask to parallelise the Numba solution for further speed-ups
  • Advice on being a “highly performant data scientist”
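
The Numba step in the list above can be sketched roughly as follows. This is an assumption-laden illustration rather than the talk’s code: instead of JIT compiling a call to lstsq, it swaps in the closed-form slope for a simple linear fit written as explicit loops, which is the style Numba compiles well. The function name `slope_loop` and the random data are made up, and the `try`/`except` fallback only exists so the sketch still runs (uncompiled) where Numba is not installed.

```python
import numpy as np
import pandas as pd

try:
    from numba import njit
except ImportError:  # fall back to plain Python so the sketch still runs
    def njit(func):
        return func


@njit
def slope_loop(y):
    """Closed-form least-squares slope of y against 0..n-1 (hypothetical helper)."""
    n = y.shape[0]
    x_mean = (n - 1) / 2.0
    y_mean = 0.0
    for i in range(n):
        y_mean += y[i]
    y_mean /= n
    num = 0.0
    den = 0.0
    for i in range(n):
        dx = i - x_mean
        num += dx * (y[i] - y_mean)
        den += dx * dx
    return num / den


# Made-up data: 200 rows of 14 observations each.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 14)))

# raw=True passes each row as a NumPy array, which the compiled function accepts.
slopes = df.apply(slope_loop, axis=1, raw=True)

# Cross-check against NumPy's polyfit on the same rows.
x = np.arange(df.shape[1])
reference = np.array([np.polyfit(x, row, 1)[0] for row in df.to_numpy()])
```

The first call includes Numba’s compilation time, so any timing comparison should be made on a second call.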

The last point is important – going “compiler happy” and writing highly efficient code may well slow down your team and your overall velocity. Amongst other items I recommended profiling first, maybe introducing Dask & Numba only with a team’s consent and looking at tools like Bulwark to add tests to DataFrames to avoid being derailed by strange data bugs.

Right now Micha and I are busily working to complete the second edition of our book. All going well, it’ll be in for Christmas with a publication date around April 2020.

 



“A starter data science process for software engineers” – talk at PyLondinium 2019

I’ve just spoken on “A starter data science process for software engineers” (slides linked) at PyLondinium 2019. This talk is aimed at software engineers who are starting to ask data-related questions and who are starting a data science journey. I’ve noted that many software engineers – without a formal data science background – are joining our PyData/data science world but lack useful transitionary resources. [note – video to come]

In this talk (based in part upon my current training courses and my recent PyDataCambridge talk) I cover:

  • What enables a good data science project
  • Ways to plan a project spec for success (really, do this, it saves so much pain)
  • A live demo covering a Jupyter Notebook with Altair, matplotlib, sklearn, yellowbrick and Widgets, then served up with Voila and Binder

The Notebook lives in github and this link should start a live Binder version (in which Altair is interactive and the slider Widget at the bottom of the Notebook live-drives scikit-learn predictions).

After the talk it seems that both Altair and the message “make a project spec” were the main winners, with Voila as a close third.

PyLondinium were also kind enough to organise a book signing for my High Performance Python book where I got to talk a bit about our in-preparation 2nd edition (for January).

This conference builds on last year’s inaugural event; it has grown and has a lovely feel. You may want to think about putting in a talk for next year’s PyLondinium!

 



“On the Delivery of Data Science Projects” – talk at PyDataCambridge meetup

A few weeks ago I got to speak at PyDataCambridge (thanks for having me!). Slides are here for “On The Delivery of Data Science Projects”.

This talk is based on my experiences coaching teams (whilst building IP for clients) to help them derisk, design and deliver working data science products. This talk is really in two halves – it takes the important lessons from my two training classes and boils them down into a 30 minute talk. We cover:

  • What makes for a successful data science project?
  • Developing a Project Specification for shared agreement including a Definition of Done
  • Using standard tools and processes to standardise and simplify
  • Ideas around best practice

Let me know if you found this talk useful. I really think the ideas around successful project delivery need to be collected and shared. We’re still in the “wild west” and I’m keen to collate more examples of successful process.

 

