About

This is Ian Ozsvald's blog. I'm an entrepreneurial geek, a Data Science/ML/NLP/AI consultant, founder of the Annotate.io social media mining API, author of O'Reilly's High Performance Python book, co-organiser of PyDataLondon, co-founder of the SocialTies App, author of the A.I.Cookbook, author of The Screencasting Handbook, a Pythonista, co-founder of ShowMeDo and FivePoundApps, and also a Londoner. Here's a little more about me.

30 November 2008 - 18:14 Upgrading Ubuntu Hardy to Ibex

I’ve just upgraded from 8.04 LTS to 8.10.  Inevitably there were some hiccups – no sound, difficulties with the nVidia drivers.

The initial upgrade was flawless, it took under 30 minutes to prepare itself, download the new packages, install everything and reboot.

I have an NVidia 8500GT which requires the non-free driver for decent video performance.  I also have an NVidia 7050 built into the motherboard which I ignore.

After upgrading to Ibex the default video driver was the free NVidia driver – boring but stable.  I tried to use ‘envyng -t’ (the GTK frontend isn’t working for Ibex yet) but it crashes with

TypeError: list indices must be integers

Instead I used the restricted driver manager – this installed, but on reboot I got dumped at the console.

The clue was in /var/log/Xorg.0.log with a message like ‘(!!) More than one possible primary device found’.  dmesg showed nothing useful.  Previously I’d booted to safe mode with the latest kernel and tried ‘xfix’, but that didn’t seem to help.

The problem is explained in this bug report; the solution is to manually add a BusID line to the Device section of /etc/X11/xorg.conf (after backing up your xorg.conf, just in case).  My xorg.conf now looks like:

Section "Device"
Identifier     "Device0"
Driver         "nvidia"
VendorName     "NVIDIA Corporation"
BusID          "PCI:02:00:00" # !This is the line I manually added
EndSection

Now that I can play video I notice that there’s no sound.  Investigating System->Preferences->Sound I see that everything is set to play using ALSA with the on-board HDA NVidia (Alsa mixer) as the default mixer device.

I only use my SoundBlaster Audigy for sound playback.  I changed the Default Mixer Track to Audigy 2 ZS [SB0350] Alsa Mixer, then set each ALSA output device to Audigy 2 ZS p16v (ALSA), and sound now plays back in the sound tool.

To get sound playing in Amarok, VLC and mplayer I had to make sure that Audigy Analog/Digital Output Jack was ticked, ALSA was selected in each media player and pulseaudio was killed in the process list.  There are some notes here and here.  I unticked PulseAudio in Preferences->Sessions, as noted in the first link above.



No Comments | Tags: Life

25 November 2008 - 17:19 Girl Geeks, Flash Big Screen, £5 App Xmas Special

Here are three local events that I’m rather looking forward to:

Girl Geeks on Tues Dec 2nd with Emily on Building Robots.  Emily will walk you through a history of robotics, show neat demos and then show real live hand-built robots (4 of ‘em) Doing Stuff including, just possibly, sumo wrestling.  Someone remind Emily to update robochick with more robot posts!

Flash Brighton have a big event on Tues Dec 9th for 200 people called the Big Screen Bonanza.  Lots of Flash geekery, prizes and a whole lotta people.

Our £5 App Xmas Special is on Weds Dec 10th.  Aleks will launch SpaceShip with The Guardian, then we’ll follow with a set of fun gamesy demos including bluetooth+accelerometer-driven Lightsaber mobiles, Xmas snow+games in Flash and a 3D in-development iPhone game, with more talks to follow.



No Comments | Tags: projectbrightonblogs, sussexdigital, £5 App Meet

18 November 2008 - 2:10 £5 App Xmas Games Special

We’re plotting our 14th £5 App meet.  This, our second Christmas Special, will have a gamesy happy crimbo feel.

Date: Wednesday 10th December, sign-up on Upcoming please.  Location TBC.

Our very own Aleks Krotoski will lead the evening with the launch of the Guardian’s new on-line text adventure SpaceShip!

We’ll probably run the second half of the night as a demo spot for some of the local gamers.  Beer, cake, good crowds and the pub will all occur as usual.  See photos on the fivepoundapp site if you’re not sure what to expect.

If you have a game to demo, please get in touch.



No Comments | Tags: projectbrightonblogs, sussexdigital, £5 App Meet

17 November 2008 - 18:16 Making Python math 196* faster with shedskin

Dr. Michael Thomas approached me with an interesting A.I. job to see if we could speed up his neural network code from PlaNet, a 10-year-old research platform. Using new Sun boxes they weren’t getting the speed-ups they expected; old libs or other monkey business were suspected.

As a first investigation I took Neil Schemenauer’s bpnn.py (a 200 line back-prop artificial neural network library with doc and comparison). The intention was to see how much faster the code might run using psyco and shedskin.

The results were really quite surprising; notes and src follow.

Addition – Leonardo Maffi has written a companion piece showing that his ShedSkin output is 1.5 to 7* slower than hand-coded C.  He also shows solutions using the D language and runtimes for Python 2.6 (I use Python 2.5 below).  He notes:

“I have translated the Python code to D (using my D libraries) in just few minutes, something like 15-20 minutes, and the translation was mostly painless and sometimes almost mechanical. I have translated the D code to C in many hours. Translating Python => C may require something like 20-30 times the time you need to translate Python => D + my libs. And this despite I have used a rigorous enough method to perform the translation, and despite at the end I am not sure the C code is bug-free. This is an enormous difference.”

End addition.

Addition – Robert Bradshaw has created a Cython version with src, see comments. End addition.

The run-times in minutes for my harder test case are below.  Note that these are averages of 4 runs each:

  1. Vanilla Python 153 minutes
  2. Python + Psyco 1.6.0.final.0 57 minutes (2.6* faster)
  3. Shedskin 0.0.29 0.78 minutes [47 seconds] (196* faster)

The test machine uses Python 2.5.2 on Ubuntu 8.04. The box is an Intel Core Duo 2.4GHz running a single process.

The ‘hard’ problem trains the ANN using 508 patterns with 57 input neurons, 50 hidden and 62 output neurons over 1000 iterations. If you know ANNs then the configuration (0.1 learning rate, 0 momentum) might seem unusual; be assured that this is correct for my researcher’s problem.
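
For concreteness, here is a minimal sketch of how that configuration maps onto the library, assuming the NN(ni, nh, no) constructor and train(patterns, iterations, N, M) method from Neil Schemenauer’s bpnn.py; the pattern data below is a placeholder rather than the real 508 patterns:

# Sketch only – assumes the original bpnn.py is importable as 'bpnn'.
from bpnn import NN

# Each pattern is (inputs, targets): 57 input values and 62 target values.
# Placeholder data; the real problem has 508 hand-prepared patterns.
patterns = [
    ([0.0] * 57, [0.0] * 62),
]

net = NN(57, 50, 62)                                # 57 input, 50 hidden, 62 output neurons
net.train(patterns, iterations=1000, N=0.1, M=0.0)  # 0.1 learning rate, 0 momentum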

There is a shorter version of this problem using just 2 patterns; this is useful if you want to replicate these results but don’t want to wait 3 hours on your first run.

My run times for the shorter problem are (again averaged using 4 runs):

  1. Vanilla Python 42 seconds
  2. Python + Psyco 14 seconds
  3. Shedskin 0.2 seconds (210* faster)

Shedskin has an issue with numerical stability – it seems that internally some truncation occurs with floating point math. Whilst the results for vanilla Python and Python+Psyco were identical, the results with Shedskin were similar but with fractional divergences in each result.

Whilst these divergences caused some very different results in the final weights for the ANN, my researcher confirms that all the results look equivalent.

Mark Dufour (Shedskin’s author) confirms that Shedskin uses the same C double as Python, but notes that rounding (or a bug) may be the culprit. Shedskin is a young project and Mark will welcome extra eyes if you want to look into this.
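
As an aside (purely an illustration of floating point sensitivity, not a diagnosis of the Shedskin issue), even summing the same doubles in a different order produces fractionally different results, and small differences like these compound over thousands of weight updates:

# Illustration only: the same values summed in a different order give
# fractionally different doubles, much like the divergences described above.
values = [0.1, 0.2, 0.3]
print(sum(values))            # 0.6000000000000001
print(sum(reversed(values)))  # 0.6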

Running the code with Shedskin was fairly easy. On Ubuntu I had to install libgc-dev and libpcre3-dev (detailed in the Shedskin docs) and g++, afterwards shedskin was ready. From download to first run was 15 minutes.

On my first attempt to compile bpnn.py with Shedskin I received an error as the ‘raise’ keyword isn’t yet supported. I replaced the ‘raise’ calls with ‘assert False’ for sanity; afterwards compilation was fine.

Edit – Mark notes that the basic form of ‘raise’ is supported but the version used in bpnn.py isn’t yet supported.  Something like ‘raise ValueError(‘some msg’)’ works fine.

Mark notes that Shedskin currently works well up to 500 lines (maybe up to 1000); since bpnn.py is only 200 lines, compilation is quick.

Note that if you can’t use Psyco because you aren’t on x86, Shedskin might be useful to you since it’ll work anywhere that Python and g++ compile.

Running this yourself

If you want to recreate my results, download bpnn_shedskin_src_20081117.zip. You’ll see bpnn_shedskin.py; this is the main code. bpnn_shedskin.py includes either ‘examples_short.py’ or ‘examples_full.py’; short is the easier 2-pattern problem and full has 508 patterns.

Note that these patterns are stored as lists of tuples (Shedskin doesn’t support the csv module so I hardcoded the input patterns to speed development); the full version is over 500 lines of Python, which slows Shedskin’s compilation somewhat.

By default the imports for Psyco are commented out and the short problem is configured. At the command line you’ll get an output like this:

python bpnn_shedskin.py
Using 2 examples
ANN uses 57 input, 50 hidden, 62 output, 1000 iterations, 0.100000 learning rate, 0.000000 momentum
error 65.454309      2008-11-17 15:22:58.318593
error 45.176110      2008-11-17 15:22:59.060787
error 44.616933      2008-11-17 15:23:00.246280
error 44.026883      2008-11-17 15:23:01.743821
error 44.049276      2008-11-17 15:23:02.815876
error 44.905183      2008-11-17 15:23:03.860352
error 44.674506      2008-11-17 15:23:05.270307
error 43.365627      2008-11-17 15:23:06.757126
error 43.299160      2008-11-17 15:23:08.244466
error 42.540076      2008-11-17 15:23:09.732035
Elapsed: 0:00:41.472192

If you uncomment the two Psyco lines, your code will run about 2.6* faster.
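
The two Psyco lines in question are the usual idiom, shown here as a reminder (check the downloaded source for the exact placement):

import psyco   # x86-only JIT for CPython
psyco.full()   # compile every function it can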

Using Shedskin

To use shedskin, first run the Python source through shedskin and then ‘make’ the result. The compiled binary runs much faster than the vanilla Python code; the result below shows the short problem taking about 0.2 seconds compared to 41 seconds above.

shedskin bpnn_shedskin.py
*** SHED SKIN Python-to-C++ Compiler 0.0.29 ***
Copyright 2005-2008 Mark Dufour; License GNU GPL version 3 (See LICENSE)
[iterative type analysis..]
***
iterations: 3 templates: 519
[generating c++ code..]
*WARNING* bpnn_shedskin.py:178: function (class NN, 'weights') not called!
*WARNING* bpnn_shedskin.py:156: function (class NN, 'test') not called!

make
g++  -O2 -pipe -Wno-deprecated  -I. -I/usr/lib/shedskin/lib /usr/lib/shedskin/lib/string.cpp /usr/lib/shedskin/lib/random.cpp /usr/lib/shedskin/lib/datetime.cpp examples_short.cpp bpnn_shedskin.cpp /usr/lib/shedskin/lib/builtin.cpp /usr/lib/shedskin/lib/time.cpp /usr/lib/shedskin/lib/math.cpp -lgc  -o bpnn_shedskin

./bpnn_shedskin
Using 2 examples
ANN uses 57 input, 50 hidden, 62 output, 1000 iterations, 0.100000 learning rate, 0.000000 momentum
error 65.454309      2008-11-17 16:11:08.452087
error 44.970416      2008-11-17 16:11:08.476869
error 46.444249      2008-11-17 16:11:08.506324
error 44.209054      2008-11-17 16:11:08.519375
error 44.058518      2008-11-17 16:11:08.532430
error 45.655892      2008-11-17 16:11:08.545741
error 44.518816      2008-11-17 16:11:08.558520
error 43.643572      2008-11-17 16:11:08.571705
error 44.800429      2008-11-17 16:11:08.584241
error 43.710905      2008-11-17 16:11:08.597465
Elapsed: 0:00:00.198747

Why is the math different?

An open question remains as to why the evolution of the floating point arithmetic differs between Python and Shedskin. If anyone fancies delving into this, I’d be very interested to hear from you.

Extension modules

Mark notes that the extension module support is perhaps a more useful way to use Shedskin for this sort of problem.

A single module can be compiled (e.g. ‘shedskin -e module.py’) and with Python you just import it (e.g. ‘import module’) and use it…with a big speed-up.

This ties the code to your installed libs – not so great for easy distribution but great for lone researchers needing a speed boost.
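
A minimal sketch of that workflow, using a hypothetical fastmath.py module (the module and function names are made up for illustration):

# One-off build step at the shell:
#     shedskin -e fastmath.py && make
# Then, in ordinary CPython, import and use it as before – just faster:
import fastmath              # now resolved as the compiled extension module
result = fastmath.crunch()   # hypothetical function defined in fastmath.py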

Shedskin 0.1 in the works

Mark’s plan is to get 0.1 released over the coming months. One aim is to get the extension module support to a similar level of functionality as SWIG and to improve the core library support so that Shedskin comes with (some more) Batteries Included.

Mark is open to receiving code (up to 1000 lines) that doesn’t compile.  The project would always happily accept new contributors.

See the Shedskin homepage, blog and group.



4 Comments | Tags: ArtificialIntelligence, Programming, Python