About

This is Ian Ozsvald's blog, I'm an entrepreneurial geek, a Data Science/ML/NLP/AI consultant, founder of the Annotate.io social media mining API, author of O'Reilly's High Performance Python book, co-organiser of PyDataLondon, co-founder of the SocialTies App, author of the A.I.Cookbook, author of The Screencasting Handbook, a Pythonista, co-founder of ShowMeDo and FivePoundApps and also a Londoner. Here's a little more about me.

18 February 2013 - 23:43
PyCon Tutorial Notes for Applied Parallel Computing

This post is for students of the Applied Parallel Computing tutorial that Minesh B. Amin and I will run during March 2013 at PyCon. This is a wiki-post and I’ll update it over the next month. If you are attending the tutorial you must check this post in the run-up to the tutorial. Important notes are below for you to read now. This is linked to from our PyCon Tutorial Support page.

If you come to this after the tutorial you’ll probably find this useful for setup. The following is for my students:

  • Check this post before you come to PyCon. You will be expected to have followed the instructions and installed the software and updates before the tutorial
  • You won’t have time to install or set up during the tutorial. You must arrive prepared: we have a lot to work through and we’ll start immediately
  • Although the PyCon wifi has been great in past years, you must assume that the wifi will be broken – come prepared with a fully working environment
  • We strongly recommend that you use our VirtualBox image (it has all the libs and the github repo pre-installed, it is open source and it’ll run on Win/Mac/Linux). If you install your own package set then we can’t help you if it doesn’t work as expected (it is also quite fiddly to set up yourself) – you can of course buddy-up with someone else during the tutorial if required

You will be able to get the VirtualBox image (about 7GB) from this post in the next week. You’ll be better off using the torrent that we’ll provide (please seed if you can, ideally all the way until the tutorial runs, to help fellow students).

Download link for VirtualBox (required!) for the tutorial:

(v1.1 torrent deleted as it didn’t run cleanly on Macs)

PyCON-2013_AppliedParallelComputing1.2.zip torrent (very robust – resume if download breaks, 2.2GB zip decompresses to 6.9GB) or via direct download (more brittle – no resume if the download breaks).


md5sum: ce43b52a18ca913e62842ae72cc8df74
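If you don’t have an md5sum tool to hand, the check can be done in a few lines of Python. This is a minimal sketch, not part of the tutorial material: the function name md5_of_file is illustrative, and it assumes the md5sum above refers to the downloaded zip file.

```python
import hashlib
import sys

# Checksum quoted in this post (assumed here to apply to the downloaded zip).
EXPECTED_MD5 = "ce43b52a18ca913e62842ae72cc8df74"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks to save RAM."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__" and len(sys.argv) == 2:
    actual = md5_of_file(sys.argv[1])
    print("OK" if actual == EXPECTED_MD5 else "MISMATCH: got " + actual)
```

Save it under any name you like and pass the path to the zip as the only argument. A mismatch usually means the download was truncated or corrupted, so fetch it again before trying to unzip.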

NOTE – I had the v1.1 version linked in the torrent above for a few days – if you got that and you can’t start the VirtualBox, just right-click in VirtualBox and discard the saved state, then restart the image. If you have the v1.2 version (linked as of March 4th) then you’re fine.

Video – this YouTube Video Demo (7 minutes) shows you how to install the image.

Instructions:

  1. Unzip to a directory with 7GB of disk space (MAC USERS – the built-in unzip doesn’t seem to handle 64 bit files, use 7zip for success [maybe Windows users too?])
  2. Open VirtualBox (optional but useful – add the extension pack for host integration)
  3. Machine | Add and open the directory that contains the .vdi and .vbox files
  4. Start the machine, it’ll boot to the Linux desktop
  5. Open the web link on the Desktop if you want to see the latest version of this blog post
  6. Double click the “Download GITHUB Repo” script on the desktop and it’ll refresh the repository (in case we’ve added new code)
  7. Familiarise yourself with the environment (Linux Mint 14), GTK Vim and emacs are installed
  8. Open a terminal and run ./pycon2013_applied_parallel_computing/run_this_to_confirm_you_have_the_correct_libraries.py (from the home directory) which confirms to you that the necessary Python libraries are installed (I’ve done this, you can do it for confirmation)

The VirtualBox image is a fully configured Linux Mint 14 32 bit (based on Ubuntu 12.10) distribution with a GUI and with gvim installed. Feel free to add anything else. You don’t need to bother installing further system updates, the OS was up to date when we released it. It is configured to provide 2 CPUs and 3GB RAM – you might need to reduce these figures to get it running on your machine.

It runs on my 64 bit laptop (Linux Mint 13 64 bit) and on 32 bit machines, and it should work equally well on Windows and Mac (we’ve tested it on both). You should install the Guest Additions: once the guest OS has booted, use the Devices menu at the top of the VirtualBox window and choose “install guest additions”. This installs integration features such as a shared clipboard (copy/paste) with your host OS.

Instructions if you can’t/won’t use our VirtualBox (but you’re on your own in this case):

You can get the github repo here – if you set this up yourself then we can’t offer help if it doesn’t work (go to the relevant forums and ask there). There is a test script in the root of the repo (run_this_to_confirm_you_have_the_correct_libraries.py) which will confirm if you have the right libraries installed (it only checks for the presence of Disco, it doesn’t confirm that it is configured correctly). The README will give you some guidance but we really recommend that you get our VirtualBox (to be released in the next week via this post).
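To give a feel for what a presence check like this involves, here is a minimal sketch. It is not the script from the repo. The module names are assumptions based on the tools named in the tutorial outline (pp is assumed to be the import name for parallelpython), and the helper missing_modules is hypothetical. Like the real script, it only tells you whether a library can be imported, not whether it is configured correctly.

```python
import importlib

# Assumed import names for the tutorial's tools; check the repo's README
# for the authoritative list.
ASSUMED_MODULES = ["multiprocessing", "pp", "redis", "disco", "matplotlib"]


def missing_modules(names):
    """Return the subset of module names that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules(ASSUMED_MODULES)
    if missing:
        print("Missing libraries: " + ", ".join(missing))
    else:
        print("All of the listed libraries can be imported")
```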


Ian applies Data Science as an AI/Data Scientist for companies in Mor Consulting, founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

Tags: Life, Python

10 February 2013 - 14:28
Applied Parallel Computing at PyCon 2013 (March)

Minesh B. Amin (MBA Sciences) and I (Mor Consulting) are teaching Applied Parallel Computing at PyCon in San Jose in just over a month, here’s an outline of the tutorial. The conference is sold out but there are still tickets for the tutorials (note that they’re selling quickly too).

Typically a recording of the tutorial is released a couple of months after PyCon to PyVideo – you miss out on the networking but you can at least catch up on the material. The source code will also be released.

Our tutorial uses a lot of tools so we’re providing a VirtualBox image (32 bit requiring about 5GB of disk space, runs on Win/Lin/Mac). Those who choose not to use the VBox image will have to install the requirements themselves. For some parts this is a bit tough so we strongly recommend using the VBox image. Details of the image will be provided to students a few weeks before the conference.

Parts of my tutorial build on my PyCon 2012 High Performance Python 1 tutorial. You might also be interested in the (slightly vague!) idea I have of writing a book on these topics – if so you should add your name to my High Performance Python Mailing List (it is an announce list for when/if I make progress on this project, very lightweight).

This year’s 3 hour tutorial is split into five sections:

  1. Types of parallelism
  2. Hard-won lessons in building reliable/debuggable/extensible parallel systems
  3. “List of tasks” – solving a Mandelbrot task using multiprocessing (single machine), parallelpython (can run multi-machine), redis queue (multi machine and language)
  4. “Map/reduce” – investigating and understanding a set of Tweets using Disco, practical guide to configuration, visualisation with word-cloud and matplotlib, possibly moving on to social network connectivity analysis and visualisation
  5. “Hyperparameter optimisation” – solving a many-parameter optimisation problem whose parameter space is not fixed at the start of the run

During the Mandelbrot solver we’ll look at where the complexity lies in generating an image like this:

Mandelbrot Surface
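As a rough idea of the “list of tasks” approach on a single machine, here is a minimal sketch using multiprocessing.Pool, written for Python 3. It is not the tutorial’s code: the function names, the plot bounds and the choice of one task per image row are all illustrative assumptions. Each row is an independent task, and rows that cross the set cost far more than rows outside it, which is part of where the complexity lies.

```python
import multiprocessing


def escape_iterations(c, max_iter):
    """Count iterations of z = z*z + c before |z| exceeds 2 (capped at max_iter)."""
    z = 0j
    for i in range(max_iter):
        if abs(z) > 2.0:
            return i
        z = z * z + c
    return max_iter


def compute_row(task):
    """Compute one image row. The arguments arrive packed in a tuple for Pool.map."""
    y, width, x_min, x_max, max_iter = task
    step = (x_max - x_min) / (width - 1)
    return [escape_iterations(complex(x_min + col * step, y), max_iter)
            for col in range(width)]


def mandelbrot(width, height, max_iter, processes=2):
    """Build the list of row tasks and farm them out to a pool of workers.

    width and height must both be at least 2. The bounds are illustrative.
    """
    x_min, x_max, y_min, y_max = -2.0, 1.0, -1.5, 1.5
    y_step = (y_max - y_min) / (height - 1)
    tasks = [(y_min + row * y_step, width, x_min, x_max, max_iter)
             for row in range(height)]
    with multiprocessing.Pool(processes=processes) as pool:
        return pool.map(compute_row, tasks)  # results come back in task order


if __name__ == "__main__":
    max_iter = 50
    # Crude ASCII rendering: '#' marks points that never escaped.
    for row in mandelbrot(60, 24, max_iter):
        print("".join("#" if n == max_iter else " " for n in row))
```

The worker function lives at module level so that it can be pickled and sent to the worker processes, and the `if __name__ == "__main__"` guard stops child processes from re-running the driver code on platforms that spawn rather than fork.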

During the Disco problem we’ll visualise the results using Andreas’ word-cloud tool, we may also cover the use of map/reduce for social network exploration:

Word-cloud of Apple mentions
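The word frequencies behind a word-cloud like this are the classic map/reduce example. The sketch below shows the shape of the idea in plain Python only. It deliberately does not use Disco’s API, and the function names and sample tweets are made up for illustration. A real framework runs the map and reduce steps across many workers and handles the grouping step between them for you.

```python
from collections import defaultdict


def map_tweet(tweet):
    """Map step: emit a (word, 1) pair for every whitespace-separated word."""
    for word in tweet.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group the emitted values by key, as a framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word, counts):
    """Reduce step: collapse all the counts for one word into a total."""
    return word, sum(counts)


def word_frequencies(tweets):
    """Run map, shuffle and reduce in-process over an iterable of tweets."""
    pairs = (pair for tweet in tweets for pair in map_tweet(tweet))
    return dict(reduce_counts(word, counts)
                for word, counts in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["apple pie", "Apple iphone apple"]  # made-up sample data
    print(word_frequencies(sample))
```

The resulting word-to-count dictionary is the kind of input a word-cloud or a matplotlib bar chart would be drawn from.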

Install requirements will be announced closer to the tutorial along with the (recommended!) VirtualBox image. I’m probably providing more material than we can cover for my two sections (Mandelbrot, Disco – how far we get depends on the size and capabilities of the class), all the material will be provided for keen students to continue and we’ll run an after-class session for those with more questions.


Tags: Data science, Python