18 July 2014 - 9:12 IPython Memory Usage interactive tool

I’ve written a tool (ipython_memory_usage) to help my colleague and me understand how RAM is allocated during large matrix work. It works for any large memory allocation (numpy, regular Python objects, or anything else), and the allocations/deallocations are reported after every command. Here’s an example – we make a matrix of 10,000,000 elements costing 76MB and then delete it:

IPython 2.1.0 -- An enhanced Interactive Python.
In [1]: %run -i  ipython_memory_usage.py
In [2]: a=np.ones(1e7)
'a=np.ones(1e7)' used 76.2305 MiB RAM in 0.32s, 
peaked 0.00 MiB above current, total RAM usage 125.61 MiB 
In [3]: del a 
'del a' used -76.2031 MiB RAM in 0.10s, 
peaked 0.00 MiB above current, total RAM usage 49.40 MiB
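
For the curious, the basic mechanism is simple. Here is a minimal sketch of the idea (not the actual ipython_memory_usage implementation – the names and details are my own): record the process’s resident set size with psutil before and after each statement using IPython’s execution events. A sketch like this can’t report the “peaked above current” figure, which needs a background thread sampling RAM while the statement runs.

import psutil
from IPython import get_ipython

_process = psutil.Process()
_rss_before = [0]  # shared state between the two callbacks

def _pre_run_cell(info=None):
    # record resident set size just before the cell executes
    _rss_before[0] = _process.memory_info().rss

def _post_run_cell(result=None):
    # report the delta and the new total once the cell has finished
    rss_after = _process.memory_info().rss
    used_mib = (rss_after - _rss_before[0]) / 2.0 ** 20
    print("used {0:.4f} MiB RAM, total RAM usage {1:.2f} MiB".format(
        used_mib, rss_after / 2.0 ** 20))

ip = get_ipython()
ip.events.register('pre_run_cell', _pre_run_cell)
ip.events.register('post_run_cell', _post_run_cell)

Paste something like this into an IPython session (or %run -i it) and every subsequent statement gets a report.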

 

UPDATE As of October 2014 I’ll be teaching High Performance Python and Data Science in London, sign-up here to get on our course announce list (no spam, just occasional updates about upcoming courses). We’ll cover topics like this one, from beginner to advanced, using Python to do interesting science and to give you an edge.

The more interesting behaviour is to check the intermediate RAM usage during an operation. In the following example we create three arrays costing approximately 760MB each, then combine them and assign the result to a fourth array. Overall the operation also pays for a temporary fifth array, which would be invisible to the end user if they’re not aware that temporaries are used in the background:

In [2]: a=np.ones(1e8); b=np.ones(1e8); c=np.ones(1e8)
'a=np.ones(1e8); b=np.ones(1e8); c=np.ones(1e8)' 
used 2288.8750 MiB RAM in 1.02s, 
peaked 0.00 MiB above current, total RAM usage 2338.06 MiB 
In [3]: d=a*b+c 
'd=a*b+c' used 762.9453 MiB RAM in 0.91s, 
peaked 667.91 MiB above current, total RAM usage 3101.01 MiB
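
If that hidden ~763 MiB temporary matters, there are ways to sidestep it. Here is a small sketch (my own, not from the post) of the same calculation using in-place NumPy operations so that only the output array is allocated:

import numpy as np

a = np.ones(int(1e8)); b = np.ones(int(1e8)); c = np.ones(int(1e8))

# d = a*b+c materialises a*b as a hidden temporary before adding c.
# Writing a*b straight into the output and then adding c in place avoids it:
d = np.empty_like(a)
np.multiply(a, b, out=d)  # d = a*b, no temporary
d += c                    # in-place add, still no temporary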

 

If you’re running out of RAM when you work with large datasets in IPython, this tool should give you a clue as to where your RAM is being used.

UPDATE – this works in IPython for PyPy too, so we can show off PyPy’s homogeneous memory optimisation:

# CPython 2.7
In [3]: l=range(int(1e8))
'l=range(int(1e8))' used 3107.5117 MiB RAM in 2.18s, 
peaked 0.00 MiB above current, total RAM usage 3157.91 MiB

And the same in PyPy:

# IPython with PyPy 2.7
In [7]: l=[x for x in range(int(1e8))]
'l=[x for x in range(int(1e8))]' used 763.2031 MiB RAM in 9.88s, 
peaked 0.00 MiB above current, total RAM usage 815.09 MiB

If we then add a non-homogeneous type (e.g. appending None to the list of ints) then it gets converted back to a list of regular (heavy-weight) Python objects:

In [8]:  l.append(None)
'l.append(None)' used 3850.1680 MiB RAM in 8.16s, 
peaked 0.00 MiB above current, total RAM usage 4667.53 MiB
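
As an aside, in CPython you can get a similar “unboxed” saving explicitly with the array module or NumPy. Note that sys.getsizeof on a list only counts the list’s pointer table, not the int objects it points at – one reason a process-level tool like this is more honest. A rough sketch (my own illustration, not from the post):

import sys
import array
import numpy as np

n = int(1e6)
python_list = list(range(n))          # boxed int objects plus a pointer table
compact = array.array('l', range(n))  # unboxed C longs in one buffer
np_array = np.arange(n)               # unboxed integers in one buffer

print(sys.getsizeof(python_list))  # list object only; excludes the ints themselves
print(sys.getsizeof(compact))      # includes the packed buffer
print(np_array.nbytes)             # raw buffer size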

The inspiration for this tool came from a chat with my colleague: we were discussing the memory profiling techniques I cover in my new High Performance Python book, and I realised that what we needed was a lighter-weight tool that just ran in the background.

My colleague was fighting a scikit-learn feature matrix scaling problem where all the intermediate objects that led to a binarised matrix took >6GB on his 6GB laptop. As a result I wrote this tool (no, it isn’t in the book – I only wrote it last Saturday!). During discussion (and later validated with the tool) we got his allocation to <4GB so it ran without a hitch on his laptop.
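
The post doesn’t show the actual matrices involved, but the general trick is worth a sketch: a binarised (e.g. one-hot) feature matrix is mostly zeros, so building it directly in a scipy.sparse format rather than via dense intermediates can cut the RAM bill dramatically. A hypothetical illustration (my own numbers and names):

import numpy as np
from scipy import sparse

n_rows, n_cols = 100000, 1000
rows = np.arange(n_rows)
cols = np.random.randint(0, n_cols, n_rows)  # one 'hot' column per row
data = np.ones(n_rows, dtype=np.uint8)

# Build the binarised matrix directly in CSR form; a dense uint8 version
# would need n_rows * n_cols bytes (~95 MiB here), the CSR version ~1 MiB
X = sparse.csr_matrix((data, (rows, cols)), shape=(n_rows, n_cols))
sparse_mib = (X.data.nbytes + X.indices.nbytes + X.indptr.nbytes) / 2.0 ** 20
print("dense would be {0:.1f} MiB, CSR is {1:.1f} MiB".format(
    n_rows * n_cols / 2.0 ** 20, sparse_mib))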

UPDATE UPDATE I’ll excitedly note (and this will definitely be exciting to about 5 other people, including at least @FrancescAlted) that I’ve added prototype perf-stat integration to track cache misses and stalled CPU cycles (whilst waiting for RAM to be transferred to the caches), so you can observe which operations cause poor cache performance. This lives in a second version of the script (same github repo, see the README for notes). I’ve also experimented with viewing how NumExpr makes far more efficient use of the cache than regular Python.
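
To give a flavour of the NumExpr point, here is a small sketch (mine, not the script itself): numexpr evaluates the whole expression in cache-sized blocks, so it avoids the hidden temporary from the earlier d=a*b+c example and keeps the working set in the caches.

import numpy as np
import numexpr

a = np.ones(int(1e8)); b = np.ones(int(1e8)); c = np.ones(int(1e8))

d_numpy = a * b + c                        # allocates a full-size temporary for a*b
d_numexpr = numexpr.evaluate("a * b + c")  # processed in small blocks, no big temporary

assert np.allclose(d_numpy, d_numexpr)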

I’m probably going to demo this at a future PyDataLondon meetup.


Ian applies Data Science as an AI/Data Scientist for companies in ModelInsight, sign-up for Data Science tutorials in London. Historically Ian ran Mor Consulting. He also founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

No Comments | Tags: Python

4 July 2014 - 16:08 Second PyDataLondon Meetup – a Javascript/Analystic-tastic event

This week we ran our 2nd PyDataLondon meetup (@PyDataLondon); we had 70 in the room and a rather techy set of talks. As before we were hosted by Pivotal (@gopivotal) via Ian – many thanks for the beer and pizza! Afterwards I took everyone to the pub for a beer on our data science consultancy, to help get everyone talking.

UPDATE As of October 2014 I’ll be teaching Data Science and High Performance Python in London, sign-up here if you’d like to be notified of our courses (no spam, just occasional notes about our plans).

As a point of admin – we’re very happy that people who had RSVPd but couldn’t make it unRSVPd to free up spots for those on the waitlist. This really helps with predicting the number of attendees (which we need for beer & pizza estimates) so we can get in everyone who wants to attend.

We’re now looking for speakers for our 3rd event – please get in contact via the meetup group.

First up we had Kyran Dale (my old co-founder of our ShowMeDo educational site) talking around his consulting speciality of JavaScript and Python. In “Getting your Python data into the Browser” he covered ways to get started, including ways to export Pandas data into D3 with example code, JavaScript pitfalls and linting:
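
Kyran’s example code isn’t reproduced here, but the core Pandas-to-D3 step usually boils down to dumping a DataFrame as row-oriented records that d3.json() or d3.csv() can load. A minimal sketch (my own, not Kyran’s slides):

import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "c"], "value": [1, 2, 3]})

# D3's loaders expect a list of records, one object per row
df.to_json("data.json", orient="records")
df.to_csv("data.csv", index=False)  # or CSV for d3.csv()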

 

Next we had Laurie Clark-Michalek talking on “Day of the Ancient 2 Game Analysis using Python”. Laurie went low-level into Cython with profiling via gprof2dot (which, incidentally, we cover in our HPC book) and gave some insight into the professional game-play and analysis world:
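
For anyone who hasn’t used gprof2dot, the usual workflow is to dump cProfile stats and then render them as a call graph. A rough sketch (my own commands and a hypothetical stand-in function, not Laurie’s code):

import cProfile
import pstats

def simulate_match(n=1000000):
    # stand-in for the real analysis code
    return sum(i * i for i in range(n))

cProfile.run("simulate_match()", "profile.pstats")
pstats.Stats("profile.pstats").sort_stats("cumulative").print_stats(5)
# then from a shell:
#   gprof2dot -f pstats profile.pstats | dot -Tpng -o callgraph.png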

We then had 2 lightning talks:

We finished with a small experiment – I brought a set of cards and people filled in a list of problems they’d like to discuss and skills they could share. Here’s the set; we’ll run this experiment next month (and iterate, having learned a little from this one). In the pub afterwards I had a couple of nice chats about my ‘want’ (around “company name cleaning” from free-text sources):

Topics listed on the cards included Apache Spark, network analysis, numpy, facial recognition, geospatial and a job post. I expect we’ll grow this idea over the next few events.

Please get in contact via the meetup group if you’d like to speak – next month we have a talk on a new data science platform. The event will be on Tuesday August 5th at the same location.

I’ll be out at EuroPython & PyDataBerlin later this month – I hope to see some of you there. EuroSciPy is in Cambridge this year in August.

 



No Comments | Tags: High Performance Python Book, Life, pydata, Python