pyCUDA on Windows and Mac for super-fast Python math using CUDA

I’ve just started to play with pyCUDA, which lets you run parallel math operations on a CUDA-compliant NVidia graphics card through Python.

Update – I’ve written a High Performance Python tutorial (July 2011, 55 pages) which covers pyCUDA and other technologies, you might find it useful.

CUDA stands for Compute Unified Device Architecture – it lets us program the Graphics Processing Unit (GPU) on a high-powered graphics card to do scientific or graphical math calculations rather than the usual texture processing for games. In essence it is a mini supercomputer that is specialised just for fast math operations – if you can figure out how to use it.

The goal is to off-load the CPU-intensive calculations for two of my clients (a physics company and a flood modelling company) to achieve 10* to 100* speed-ups using commodity graphics cards.

pyCUDA makes it easy to interactively program a CUDA device rather than hitting C++ code with the slow write/compile/debug loop.  Recent MacBooks (mine was bought in January 2009) have NVidia cards with CUDA-compatible devices built-in (mine is a 9400M).  For my desktop computer I have a 9800 GT (costing £100).
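To give a flavour of that interactivity, here is a minimal sketch along the lines of the pyCUDA documentation's introductory example – the CUDA C kernel is compiled on the fly from inside Python (this assumes a CUDA-capable card, the NVidia toolkit and pyCUDA are all installed and working):

```python
import numpy as np
import pycuda.autoinit  # chooses a device and creates a context for us
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

# A trivial CUDA kernel, written as a string and compiled at run-time.
mod = SourceModule("""
__global__ void double_them(float *a)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    a[idx] *= 2.0f;
}
""")
double_them = mod.get_function("double_them")

a = np.random.randn(256).astype(np.float32)
a_gpu = gpuarray.to_gpu(a)  # copy the array onto the card
double_them(a_gpu, block=(256, 1, 1), grid=(1, 1))
result = a_gpu.get()        # copy the doubled values back
assert np.allclose(result, 2 * a)
```

The nice part is that you can tweak the kernel string and re-run it straight from the Python prompt, with no separate compile step to manage yourself.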

It turns out that this is bleeding-edge stuff – getting pyCUDA compiled on my MacBook and Win XP machine took some time (see my forum posts on the Mac and Windows issues). Thankfully the group is helpful, the wiki has an installation section for Windows, Mac and Linux, and there is some reasonable documentation.

Right now I’ve got as far as running some of the demo code on my MacBook (showing a 5* speed-up over the CPU) and my desktop (showing a 30* speed-up over the CPU).  I’ll report more as I progress.

Update – pyCUDA works inside IPython too, lovely.

Update – I don’t have OpenGL working yet but, as noted here, you need “CUDA_ENABLE_GL = True” in your build configuration and you need PyOpenGL installed. When rebuilding, MSVC threw a hissy fit; it isn’t essential to my work so I’m skipping this demo.

Update – I’ve submitted a patch and two examples to the wiki (SimpleSpeedTest, Mandelbrot). I get 200* speed-ups on the speed test (using a for loop on a sin() calculation) and 5 to 20* speed-up on Mandelbrots (it seems to scale very well vs numpy with increasing dimensions).
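For reference, the CPU side of that speed test looks something like the sketch below – a plain Python for loop calling sin() element by element, against the same calculation vectorised in numpy (the pyCUDA version runs the sin() inside a kernel instead; the array size here is just illustrative, not the one from my wiki example):

```python
import math
import time
import numpy as np

n = 100_000
x = np.linspace(0.0, 10.0, n)

# Slow baseline: a Python for loop calling math.sin() per element.
start = time.time()
loop_result = np.empty(n)
for i in range(n):
    loop_result[i] = math.sin(x[i])
loop_time = time.time() - start

# The same calculation vectorised in numpy.
start = time.time()
numpy_result = np.sin(x)
numpy_time = time.time() - start

assert np.allclose(loop_result, numpy_result)
print("for loop: %.4fs, numpy: %.4fs" % (loop_time, numpy_time))
```

On my machines the numpy version is already far faster than the for loop; the GPU version then beats numpy again once the arrays get large enough to keep the card busy.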

Update – There are lots of interesting papers for CUDA surfacing like this one showing a 3* speed-up for voice recognition tasks (using CPU and GPU together) and yet another way to improve fluid dynamic simulations. This Tom’s 3D article gives a great write-up (starting with the history of audio cards) on where 3D is right now and how NVidia is beating ATI for scientific computing.

Books to read:

The following CUDA books will help you understand the basics of CUDA programming – I particularly like the first (Kirk and Hwu).

Ian is a Chief Interim Data Scientist via his Mor Consulting. Sign-up for Data Science tutorials in London and to hear about his data science thoughts and jobs. He lives in London, is walked by his high energy Springer Spaniel and is a consumer of fine coffees.