About

Ian Ozsvald picture

This is Ian Ozsvald's blog, I'm an entrepreneurial geek, a Data Science/ML/NLP/AI consultant, founder of the Annotate.io social media mining API, author of O'Reilly's High Performance Python book, co-organiser of PyDataLondon, co-founder of the SocialTies App, author of the A.I.Cookbook, author of The Screencasting Handbook, a Pythonista, co-founder of ShowMeDo and FivePoundApps and also a Londoner. Here's a little more about me.

High Performance Python book with O'Reilly View Ian Ozsvald's profile on LinkedIn Visit Ian Ozsvald's data science consulting business Protecting your bits. Open Rights Group

4 July 2014 - 16:08Second PyDataLondon Meetup a Javascript/Analystic-tastic event

This week we ran our 2nd PyDataLondon meetup (@PyDataLondon), we had 70 in the room and a rather techy set of talks. As before we hosted by Pivotal (@gopivotal) via Ian – many thanks for the beer and pizza! I took everyone to the pub after for a beer on out  data science consultancy to help get everyone talking.

UPDATE As of October 2014 I’ll be teaching Data Science and High Performance Python in London, sign-up here if you’d like to be notified of our courses (no spam, just occasional notes about our plans).

As a point of admin – we’re very happy that people who were RSVPd but couldn’t make it were able to unRSVP to free up spots for those on the waitlist. This really helps with predicting the number of attendees (which we need for beer & pizza estimates) so we can get in everyone who wants to attend.

We’re now looking for speakers for our 3rd event – please get in contact via the meetup group.

First up we had Kyran Dale (my old co-founder in our ShowMeDo educational site) talking around his consulting speciality of JavaScript and Python, he covered ways to get started including ways to export Pandas data into D3 with example code, JavaScript pitfalls and linting in “Getting your Python data into the Browser“:

 

Next we had Laurie Clark-Michalek talking on “Day of the Ancient 2 Game Analysis using Python“, Laurie went low-level into Cython with profiling via gprof2dot (which incidently we cover in our HPC book) and gave some insight into the professional game-play and analysis world:

We then had 2 lightning talks:

We finished with a small experiment – I brought a set of cards and people filled in a list of problems they’d like to discuss and skills they could share. Here’s the set, we’ll run this experiment next month (and iterate, having learned a little from this one). In the pub after I had a couple of nice chats from my ‘want’ (around “company name cleaning” from free-text sources):

Topics listed on the cards included Apache Spark, network analysis, numpy, facial recognition, geospatial and a job post. I expect we’ll grow this idea over the next few events.

Please get in contact via the meetup group if you’d like to speak, next month we have a talk on a new data science platform. The event will be on Tues August 5th at the same location.

I’ll be out at EuroPython & PyDataBerlin later this month, I hope to see some of you there. EuroSciPy is in Cambridge this year in August.

 


Ian applies Data Science as an AI/Data Scientist for companies in ModelInsight, sign-up for Data Science tutorials in London. Historically Ian ran Mor Consulting. He also founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

No Comments | Tags: High Performance Python Book, Life, pydata, Python

23 June 2014 - 22:47High Performance Python manuscript submitted to O’Reilly

I’m super-happy to say that Micha and I have submitted the manuscript to O’Reilly for our High Performance Python book. Here’s the final chapter list:

  • Understanding Performant Python
  • Profiling to find bottlenecks (%timeit, cProfile, line_profiler, memory_profiler, heapy and more)
  • Lists and Tuples (how they work under the hood)
  • Dictionaries and Sets (under the hood again)
  • Iterators and Generators (introducing intermediate-level Python techniques)
  • Matrix and Vector Computation (numpy and scipy and Linux’s perf)
  • Compiling to C (Cython, Shed Skin, Pythran, Numba, PyPy) and building C extensions
  • Concurrency (getting past IO bottlenecks using Gevent, Tornado, AsyncIO)
  • The multiprocessing module (pools, IPC and locking)
  • Clusters and Job Queues (IPython, ParallelPython, NSQ)
  • Using less RAM (ways to store text with far less RAM, probabilistic counting)
  • Lessons from the field (stories from experienced developers on all these topics)

August is still the expected publication date, a soon-to-follow Early Release will have all the chapters included. Next up I’ll be teaching on some of this in August at EuroSciPy in Cambridge.

Some related (but not covered in the book) bit of High Performance Python news:

  • PyPy.js is now faster than CPython (but not as fast as PyPy) – crazy and rather cutting effort to get Python code running on a javascript engine through the RPython PyPy toolchain
  • Micropython runs in tiny memory environments, it aims to runs on embedded devices (e.g. ARM boards) with low RAM where CPython couldn’t possibly run, it is pretty advanced and lets us use Python code in a new class of environment
  • cytools offers Cython compiled versions of the pytoolz extended iterator objects, running faster than pytoolz and via iterators probably using significantly less RAM than when using standard Python containers

Ian applies Data Science as an AI/Data Scientist for companies in ModelInsight, sign-up for Data Science tutorials in London. Historically Ian ran Mor Consulting. He also founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

No Comments | Tags: Data science, High Performance Python Book, Python

9 June 2014 - 22:297 chapters of “High Performance Python” now live

O’Reilly have just released another update to our High Performance Python book, in total we’ve now released the following:

  • Understanding Performance Python
  • Profiling to find bottlenecks (%timeit, cProfile, line_profiler, memory_profiler, heapy)
  • Lists and Tuples
  • Dictionaries and Sets
  • Iterators and Generators
  • Matrix and Vector Computation (numpy and scipy)
  • Compiling to C (Cython, Shed Skin, Pythran, Numba, PyPy)

We’re in the final edit cycle, we have a lot of edits to commit to the main chapters over the next week for the next Early Release. All going well the book will be published in August.


Ian applies Data Science as an AI/Data Scientist for companies in ModelInsight, sign-up for Data Science tutorials in London. Historically Ian ran Mor Consulting. He also founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

No Comments | Tags: High Performance Python Book, Python

2 June 2014 - 20:02New High Performance Python chapters online & teaching a 2 day course on HPC

The last month has been crazy busy, not least because I got to run my first High Performance Python 2 day tutorial at a university. I was out in Aalborg University teaching a PhD group, we covered four blocks:

  1. Profiling (CPU and RAM)
  2. Compilers and JITs
  3. Multi-core and distributed
  4. Using less RAM, storage systems and lessons

UPDATE As of October 2014 I’ll be teaching High Performance Python and Data Science in London, sign-up here to join our announce list (no spam, just occasional updates about our courses).

Here’s a picture of my class, it all went rather swimmingly. I plan to run the same class in London in the coming months (details to follow):

class_aalborg_teaching

On the same note we pushed some more chapters for our High Performance Python book on to O’Reilly’s build system a week back, we now have:

  • Introduction
  • Performant Python
  • Tuples and Dictionaries
  • Iterators and Generators
  • Profiling
  • Matrices with numpy
  • Compiling and JITs

More chapters will go live in a couple of weeks, we’re in the final editing phase now.

Don’t forget that PyDataBerlin is coming up in a couple of months, it runs during EuroPython. If you’re out for EuroPython then it makes a lot of sense to go to PyDataBerlin too :-)


Ian applies Data Science as an AI/Data Scientist for companies in ModelInsight, sign-up for Data Science tutorials in London. Historically Ian ran Mor Consulting. He also founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

No Comments | Tags: High Performance Python Book, Python

16 April 2014 - 21:112nd Early Release of High Performance Python (we added a chapter)

Here’s a quick book update – we just released a second Early Release of High Performance Python which adds a chapter on lists, tuples, dictionaries and sets. This is available to anyone who has bought it already (login into O’Reilly to get the update). Shortly we’ll follow with chapters on Matrices and the Multiprocessing module.

One bit of feedback we’ve had is that the images needed to be clearer for small-screen devices – we’ve increased the font sizes and removed the grey backgrounds, the updates will follow soon. If you’re curious about how much paper is involved in writing a book, here’s a clue:

We announce each updates along with requests for feedback via our mailing list.

I’m also planning on running some private training in London later in the year, please contact me if this is interesting? Both High Performance and Data Science are possible.

In related news – the PyDataLondon conference videos have just been released and you can see me talking on the High Performance Python landscape here.


Ian applies Data Science as an AI/Data Scientist for companies in ModelInsight, sign-up for Data Science tutorials in London. Historically Ian ran Mor Consulting. He also founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

No Comments | Tags: High Performance Python Book, Life, pydata, Python

10 March 2014 - 14:15High Performance Book almost ready for Early Release preview

Our first few chapters of High Performance Python are nearly ready for the Early Release on O’Reilly’s platform, we’ll be releasing:

  • Understanding Performance Python (an overview of the virtual machine and modern PC hardware)
  • Profiling Python Code (for CPU and RAM profiling, lots of options)
  • Pure Python (generators, the guts of dicts and the like)

We’ll announce the release via our mailing list, sign-up if you want to know as soon as it is available. Overall Micha and I have written half the book now (although this first Early Release will be just the first 3 chapters), we aim to finish the other half in the next few months.

The process of writing is very iterative…which means it is far too easy to write a bunch of stuff and then realise there’s a bunch of holes that need filling in (some of which turn into real rabbit-holes one call fall into for days!). Out the other side you get a nice narrative, lots of verified results and some nice explanatory plots. It just takes rather a long time to get there.

Here’s the first couple of pages of the start of the (just being written-up) multiprocessing chapter, I’m using a Monte Carlo Pi estimator to discuss performance characteristics of threads vs processes. Notice the black pen and scrawls to improves the diagrams – this happens a lot:

HPC book multiprocessing scrawls

Right now I’ve been playing with Pi estimation using straight Python and numpy over many cores using both threads and processes (this chapter is just looking at multiprocessing, not JITs or compilers [that's a later chapter]). Processes obviously provide a linear speed-up for this sort of problem, exploring the right number of processes and the right job sizes (especially for variable-time workloads) was fun.

Threads on numpy do something interesting – you can actually use multiple cores and so code can run faster (about 10% in this case compared to a single core), although they’re often not as efficient as using multiple processes. Here’s a plot of CPU use for 4 cores (and 4 HyperThreads) over time with threads. It turns out that the random number generation is GIL bound but the vectorised operations for Pi’s estimation aren’t:colour_08_pi_lists_parallel_py_4

Join the mailing list for updates. Work on the book is supported via my London based Artificial Intelligence agency Mor Consulting.


Ian applies Data Science as an AI/Data Scientist for companies in ModelInsight, sign-up for Data Science tutorials in London. Historically Ian ran Mor Consulting. He also founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

7 Comments | Tags: High Performance Python Book, Python

23 February 2014 - 11:53High Performance Python at PyDataLondon 2014

Yesterday I spoke on The High Performance Python Landscape at PyDataLondon 2014 (our first PyData outside of the USA – see my write-up). I was blessed with a full room and interesting questions. With Micha I’m authoring a High Performance Python book with O’Reilly (email list for early access) and I took the topics from a few of our chapters.

“@ianozsvald providing eye-opening discussion of tools for high-performance #Python: #Cython, #ShedSkin, #Pythran, #PyPy, #numba… #pydata” – @davisjmcc

Overall I covered:

  • line_profiler for CPU profiling in a function
  • memory_profiler for RAM profiling in a function
  • memory_profiler’s %memit
  • memory_profiler’s mprof to graph memory use during program’s runtime
  • thoughts on adding network and disk I/O tracking to mprof
  • Cython on lists
  • Cython on numpy by dereferencing elements (which would normally be horribly inefficient) plus OpenMP
  • ShedSkin‘s annotated output and thoughts on using this as an input to Cython
  • PyPy and numpy in PyPy
  • Pythran with numpy and OpenMP support (you should check this out)
  • Numba
  • Concluding thoughts on why you should probably use JITs over Cython

Here’s my room full of happy Pythonistas :-)

pydatalondon2014_highperformancepython

“Really useful and practical performance tips from @ianozsvald @pydata #pydata speeding up #Python code” – @iantaylorfb

Slides from the talk:

 

 

UPDATE Armin and Maciej came back today with some extra answers about the PyPy-numpy performance (here and here), the bottom line is that they plan to fix it (Maciej says it is now fixed – quick service!). Maciej also notes improvements planned using e.g. vectorisation in numpy.

VIDEO TO FOLLOW


Ian applies Data Science as an AI/Data Scientist for companies in ModelInsight, sign-up for Data Science tutorials in London. Historically Ian ran Mor Consulting. He also founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

9 Comments | Tags: High Performance Python Book, pydata, Python

14 January 2014 - 23:05Installing the numpy module in PyPy

Working on the High Performance Python book (mailing list here for our occasional announces) I’ve reinstalled PyPy a couple of times, each time I forget how to install the numpy module. Note that PyPy’s numpy is different and much smaller than CPython’s numpy. It does however work for smaller problems if you just need some of the core features (i.e. not the libs that numpy wraps). It used to be included in a branch, now it comes as a separate package.

I’m posting this as a reminder to myself and maybe as  a bit of help to another intrepid soul. The numpy PyPy install instructions are in this Nov 2013 blog post. You need to clone their numpy repo and then install it as a regular module using the “setup.py” that’s provided (it takes just a couple of minutes and installs fine). Download PyPy from here and just extract it somewhere.

Having installed it I can:
$ ../bin/pypy 
Python 2.7.3 (87aa9de10f9c, Nov 24 2013, 18:48:13)
[PyPy 2.2.1 with GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
And now for something completely different: ``it seems to me that once you
settle on an execution / object model and / or bytecode format, you've already
decided what languages (where the 's' seems superfluous) support is going to be
first class for''
>>>> import numpy as np
>>>> np.__version__
'1.8.0.dev-74707b0'

From here I can use the random module and do various vectorized operations, it isn’t as fast as CPython’s numpy for the Pi example I’m working but it does work. Does anyone know which parts offer comparable speed to its bigger brother?


Ian applies Data Science as an AI/Data Scientist for companies in ModelInsight, sign-up for Data Science tutorials in London. Historically Ian ran Mor Consulting. He also founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

8 Comments | Tags: High Performance Python Book, Python

23 December 2013 - 14:52Progress on High Performance Python book

I figured a short update was in order. Micha (@mynameisfiber, github) and I are progressing on our High Performance Python book, we have a proposed chapter outline below and hope to have a rough cut of some early chapters for January. The book should be finalised ‘earlier in 2014′ though we won’t be drawn on a date just yet.

The book is aimed at an Intermediate Pythonista audience for people who need to go beyond single core performance with CPython 2.7 and who want to use multi-cores, other compilers and clusters. It is not an HPC-Expert level book, though we hope that Python Experts will find some of the chapters to be useful.

We have a mailing list (signup here) which we use very sparingly to update our group about our progress (and soon we’ll be posting the rough-cut chapter notifications via the list). We’re posting once a month or so, you can opt out at any time.

Some people have questioned our focus on CPython 2.7. We ran a survey a few months ago and a couple of hundred people told us that they mainly use CPython 2.7 for their number crunching work, so that’s the focus of the book. Almost everything will or should run in Python 3.3+ with little or no change so we’re not worrying about the smaller differences.

Planned chapter list (this might be subject to change):
  • Understanding Performance Programming (high level introduction)
  • Profiling Python code (focusing on CPU and RAM profiling)
  • Pure Python (mainly CPython 2.7)
  • Matrix Computation with numpy
  • Disks and Network Processing
  • Calculating in Parallel (introduction)
  • Multiprocessing (multi-core on a single machine)
  • Cluster Computing (commodity clusters)
  • Just In Time Compiling
  • Static Compiling
  • Using less RAM
  • Lessons from the Field

 


Ian applies Data Science as an AI/Data Scientist for companies in ModelInsight, sign-up for Data Science tutorials in London. Historically Ian ran Mor Consulting. He also founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

2 Comments | Tags: High Performance Python Book, Python