Entrepreneurial Geekiness
Use of VirtualBox to prepare students (PyCon tutorials)
Minesh and I ran a tutorial (Applied Parallel Computing) at PyCon 2013 yesterday. We’d been building and distributing a VirtualBox image (7GB) so that students had a unified, preconfigured environment, which simplifies the teaching. The process took a while; below are my notes. Others (e.g. Kat teaching SimpleCV) also had a VirtualBox.
The upside of a VirtualBox is that everyone has a unified environment, so students see on their screen exactly what you have on your screen. The downside is that this doesn’t help them install the tools onto their laptop for normal use. If you’re teaching a medley of tools (as we were) and especially if some require non-trivial installation (e.g. Disco map/reduce for us) then VirtualBoxes are a clear win.
- We zipped the directory containing the VDI file, Kat used a single OVF file (both for VirtualBox), I think the single OVF file might be easier to distribute and might work in other (non-VirtualBox) environments. Our zip took 7GB down to 2.2GB
- Your VirtualBox will be configured for you, but students might have foreign keyboards (e.g. Minesh made our VBox image with a US keyboard, I have a UK keyboard, some students have German etc keyboards) – provide notes on how to reconfigure the Guest OS so the student can set up their keyboard
- git clone a read-only repo into the VBox, students can then just git pull to get updates
- We added a run_this_to_confirm_you_have_the_correct_libraries.py script, it checks that everything is installed, students can run this to double check that their install is good
- Use a standard user and password – we used “pycon:pycon”
- I made a YouTube screencast using RecordMyDesktop (with desktop compositing disabled to reduce flicker)
- Bundle everything into a blog post that you can easily update – here are our install notes and video
- A large zip is harder to distribute – I linked to the zip on my blog (I have lots of bandwidth) and created a torrent using the super-easy burnbit site (here’s my download page) – you can see the torrent link on the install notes page linked above
- You probably want to use a 32 bit OS for the Guest OS (we used Linux Mint 14 32 bit), a 64 bit Guest OS won’t run on a 32 bit system (but a 32 bit Guest OS will run on a 64 bit host)
- Despite linking our tutorial notes to the tutorial page on the PyCon website (and mailing students), many didn’t have a preinstalled environment – we had a set of USB Thumb Drives which simplified the setup. Our first 30 minutes was talking so students had time to install the VBox
- Github is a great place to store code, data (if not huge) and slides
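A check script along the lines of run_this_to_confirm_you_have_the_correct_libraries.py can be very small. The following is only a sketch of the idea, not the script we shipped: the module list is a hypothetical placeholder, and like any presence check it confirms that a library can be found, not that it is configured correctly.

```python
import importlib.util

# Hypothetical module list for illustration only; the real script's list
# is not reproduced in this post.
REQUIRED_MODULES = ["json", "multiprocessing", "numpy", "disco"]


def find_missing(module_names):
    """Return the top-level module names that cannot be located for import."""
    missing = []
    for name in module_names:
        # find_spec locates a top-level module without importing it.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing libraries: " + ", ".join(missing))
    else:
        print("All required libraries were found.")
```

Printing the names of the missing libraries, rather than stopping at the first failure, lets a student fix everything in one pass.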
Ian is a Chief Interim Data Scientist via his Mor Consulting. Sign-up for Data Science tutorials in London and to hear about his data science thoughts and jobs. He lives in London, is walked by his high energy Springer Spaniel and is a consumer of fine coffees.
PowerPoint: Brief Introduction to NLProc. for Social Media
For my client (AdaptiveLab) I recently gave an internal talk on the state of the art of Natural Language Processing around Social Media (specifically Twitter and Facebook), having spent a few days digesting recent research papers. The area is fascinating (I want to do some work here via my Annotate.io) as the text is so much dirtier than in long form entries such as we might find with Reuters and BBC News.
The Powerpoint below is just the outline, I also gave some brief demos using NLTK (great Python NLP library).
ANN: twitter-text-python 1.0.0.2 release (Python Tweet parsing library)
A few weeks back I took over as maintainer of the twitter-text-python library (source on github). This library lets you take a tweet like:
"@ianozsvald, you now support #IvoWertzel's tweet ... parser! https://github.com/ianozsvald/"
and extract the Twitter entities as defined in the Twitter conformance tests. The entities in the above tweet would be:
- reply: 'ianozsvald'
- users: ['ianozsvald']
- tags: ['IvoWertzel']
- urls: ['https://github.com/ianozsvald/']
- lists: [] # no lists in this tweet
- output html: u'<a href="http://twitter.com/ianozsvald">@ianozsvald</a>, ...
  you now support <a href="http://search.twitter.com/search?q=%23IvoWertzel">#IvoWertzel</a>\'s
  tweet parser! <a href="https://github.com/ianozsvald/">https://github.com/ianozsvald/</a>'
If you’re parsing Tweets or status-update-like-entities (from e.g. App.net) in Python then this library makes it easy to extract @people, URLs and #hashtags. You can also request the spans (character locations) for each entity, very useful if you have repeated phrases and you’re doing a search/replace.
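To make the idea of entities and spans concrete, here is a deliberately simplified sketch using only Python's standard re module. It is not the twitter-text-python API or its implementation, and the patterns are far looser than the Twitter conformance tests; the function name extract_entities and the three regular expressions are assumptions made up for this illustration.

```python
import re

# Deliberately simplified patterns; the real conformance rules are stricter.
MENTION_RE = re.compile(r"(?<![\w/])@(\w+)")
HASHTAG_RE = re.compile(r"(?<![\w/])#(\w+)")
URL_RE = re.compile(r"https?://\S+")


def extract_entities(tweet):
    """Return users, tags and urls, each as a list of (text, (start, end))."""
    users = [(m.group(1), m.span()) for m in MENTION_RE.finditer(tweet)]
    tags = [(m.group(1), m.span()) for m in HASHTAG_RE.finditer(tweet)]
    urls = [(m.group(0), m.span()) for m in URL_RE.finditer(tweet)]
    # Treat the tweet as a reply only when it opens with a mention.
    reply = users[0][0] if users and users[0][1][0] == 0 else None
    return {"reply": reply, "users": users, "tags": tags, "urls": urls}


tweet = ("@ianozsvald, you now support #IvoWertzel's tweet ... parser! "
         "https://github.com/ianozsvald/")
entities = extract_entities(tweet)
for name, (start, end) in entities["users"] + entities["tags"]:
    # The span indexes straight back into the original string.
    print(name, (start, end), tweet[start:end])
```

Because each span points at an exact character range, a search/replace can rewrite one occurrence of a repeated phrase without touching the others.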
The library is easily installed using “$ pip install twitter-text-python” (MIT license) via the Python Package Index, currently at version 1.0.0.2.
Credit – the library was developed by Ivo Wertzel (BonsiaDan on github), I merged a few Pull requests after forking to fix some bugs and have now taken over official maintenance.
PyCon Tutorial Notes for Applied Parallel Computing
This post is for students of the Applied Parallel Computing tutorial that Minesh B. Amin and I will run during March 2013 at PyCon. This is a wiki-post; I’ll update it over the next month. If you are attending the tutorial you must check this post in the run-up to the tutorial. Important notes are below for you to read now. This is linked to from our PyCon Tutorial Support page.
If you come to this after the tutorial you’ll probably find this useful for setup. The following is for my students:
- Check this post before you come to PyCon, you will be expected to have followed instructions and installed the software and updates before the tutorial
- You won’t have time to install/setup during the tutorial, you must arrive prepared, we have a lot to work through and we’ll start immediately
- Although the PyCon wifi has been great in past years, you must assume that wifi will be broken – come prepared with a fully working environment
- We strongly recommend that you use our VirtualBox (it has all the libs and the github repo pre-installed, it is open source, it’ll run on Win/Mac/Linux). If you install your own package set then we can’t help you if it doesn’t work as expected (it is also quite fiddly to set up yourself) – you can of course buddy-up with someone else during the tutorial if required
You will be able to get the VirtualBox (about 7GB) from this post in the next week. You’ll be better off using the torrent that we’ll provide (please seed if you can, if possible all the way until the tutorial runs, to help fellow students).
Download link for VirtualBox (required!) for the tutorial:
(v1.1 torrent deleted as it didn’t run cleanly on Macs)
PyCON-2013_AppliedParallelComputing1.2.zip torrent (very robust – resume if download breaks, 2.2GB zip decompresses to 6.9GB) or via direct download (more brittle – no resume if the download breaks).
md5sum: ce43b52a18ca913e62842ae72cc8df74
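If you don't have an md5sum command to hand, the same check can be sketched in Python with the standard hashlib module. The function names here are made up for this example, and the file path you pass in is whatever you saved the download as.

```python
import hashlib

# The checksum published above for the v1.2 zip.
EXPECTED_MD5 = "ce43b52a18ca913e62842ae72cc8df74"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Reading in chunks avoids loading a multi-gigabyte zip into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path):
    """True if the file's MD5 equals the checksum published in this post."""
    return md5_of_file(path) == EXPECTED_MD5
```

Call matches_published_checksum with the path to your downloaded zip; a False result suggests a broken or incomplete download.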
NOTE – I had the v1.1 version linked in the torrent above for a few days – if you got that and you can’t start the VirtualBox, just right-click in VirtualBox and discard the saved state, then restart the image. If you have the v1.2 version (linked as of March 4th) then you’re fine.
Video – this YouTube Video Demo (7 minutes) shows you how to install the image.
Instructions:
- Unzip to a directory with 7GB of disk space (MAC USERS – the built-in unzip doesn’t seem to handle 64 bit files, use 7zip for success [maybe Windows users too?])
- Open VirtualBox (optional but useful – add the extension pack for host integration)
- Machine | Add and open the directory that contains the .vdi and .vbox files
- Start the machine, it’ll boot to the Linux desktop
- Open the web link on the Desktop if you want to see the latest version of this blog post
- Double click the “Download GITHUB Repo” script on the desktop and it’ll refresh the repository (in case we’ve added new code)
- Familiarise yourself with the environment (Linux Mint 14), GTK Vim and emacs are installed
- Open a terminal and run ./pycon2013_applied_parallel_computing/run_this_to_confirm_you_have_the_correct_libraries.py (from the home directory) which confirms to you that the necessary Python libraries are installed (I’ve done this, you can do it for confirmation)
The VirtualBox is a fully configured Linux Mint 14 32 bit (based on Ubuntu 12.10) distribution, with gui, also with gvim installed. Feel free to add anything else. You don’t need to bother installing further system updates, the OS was up to date when we released it. It is configured to provide 2 CPUs and 3GB RAM – you might need to reduce these figures to get it running on your machine.
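The CPU and RAM figures can be lowered in the VirtualBox settings dialog, or from the command line with VBoxManage modifyvm while the machine is powered off. As a sketch, the command can be assembled like this; the VM name "PyCon2013" and the reduced figures are placeholders rather than the real values for our image, so substitute the name shown in your VirtualBox window.

```python
def modifyvm_command(vm_name, cpus, memory_mb):
    """Build a VBoxManage command that sets a VM's CPU count and RAM in MB."""
    if cpus < 1 or memory_mb < 1:
        raise ValueError("cpus and memory_mb must be positive")
    return ["VBoxManage", "modifyvm", vm_name,
            "--cpus", str(cpus), "--memory", str(memory_mb)]


# "PyCon2013" is a hypothetical VM name used only for illustration.
command = modifyvm_command("PyCon2013", cpus=1, memory_mb=1536)
print(" ".join(command))
# To apply it, with the VM powered off:
#   import subprocess; subprocess.run(command, check=True)
```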
It runs on my 64 bit laptop (Linux Mint 13 64 bit) and on 32 bit machines, and it should work equally well on Windows and Mac (we’ve tested it on both). You should install the Guest Additions, which provide integration features such as a shared clipboard (copy/paste) with the host OS: once the guest has booted, use the Devices menu at the top of the VirtualBox window and choose “install guest additions”.
Instructions if you can’t/won’t use our VirtualBox (but you’re on your own in this case):
You can get the github repo here – if you set this up yourself then we can’t offer help if it doesn’t work (go to the relevant forums and ask there). There is a test script in the root of the repo (run_this_to_confirm_you_have_the_correct_libraries.py) which will confirm if you have the right libraries installed (it only checks for the presence of Disco, it doesn’t confirm that it is configured correctly). The README will give you some guidance but we really recommend that you get our VirtualBox (to be released in the next week via this post).
Applied Parallel Computing at PyCon 2013 (March)
Minesh B. Amin (MBA Sciences) and I (Mor Consulting) are teaching Applied Parallel Computing at PyCon in San Jose in just over a month; here’s an outline of the tutorial. The conference is sold out but there are still tickets for the tutorials (note that they’re selling quickly too).
Typically a recording of the tutorial is released a couple of months after PyCon to PyVideo – you miss out on the networking but you can at least catch up on the material. The source code will also be released.
Our tutorial uses a lot of tools so we’re providing a VirtualBox image (32 bit requiring about 5GB of disk space, runs on Win/Lin/Mac). Those who choose not to use the VBox image will have to install the requirements themselves; for some parts this is a bit tough so we strongly recommend using the VBox image. Details of the image will be provided to students a few weeks before the conference.
Parts of my tutorial build on my PyCon 2012 High Performance Python 1 tutorial. You might also be interested in the (slightly vague!) idea I have of writing a book on these topics – if so you should add your name to my High Performance Python Mailing List (it is an announce list for when/if I make progress on this project, very lightweight).
This year’s 3 hour tutorial is split into five sections:
- Types of parallelism
- Hard-won lessons in building reliable/debuggable/extensible parallel systems
- “List of tasks” – solving a Mandelbrot task using multiprocessing (single machine), parallelpython (can run multi-machine), redis queue (multi machine and language)
- “Map/reduce” – investigating and understanding a set of Tweets using Disco, practical guide to configuration, visualisation with word-cloud and matplotlib, possibly moving on to social network connectivity analysis and visualisation
- “Hyperparameter optimisation” – solving a many-parameter optimisation problem whose parameter space is not fixed at the start of the run
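To give a flavour of the “list of tasks” section, here is a minimal sketch of a Mandelbrot solver that hands one image row per task to a multiprocessing pool on a single machine. It is not the tutorial's code: the image size, coordinate ranges and function names are assumptions chosen to keep the example short.

```python
import multiprocessing

WIDTH, HEIGHT, MAX_ITER = 60, 30, 100


def escape_iterations(c, max_iter=MAX_ITER):
    """Count iterations of z = z*z + c before |z| exceeds 2 (capped)."""
    z = 0j
    for n in range(max_iter):
        if abs(z) > 2.0:
            return n
        z = z * z + c
    return max_iter


def compute_row(y):
    """Compute one image row; each row is an independent task."""
    imag = -1.2 + 2.4 * y / (HEIGHT - 1)
    return [escape_iterations(complex(-2.0 + 3.0 * x / (WIDTH - 1), imag))
            for x in range(WIDTH)]


if __name__ == "__main__":
    # The pool farms the list of row indices out to worker processes.
    with multiprocessing.Pool() as pool:
        rows = pool.map(compute_row, range(HEIGHT))
    for row in rows:
        print("".join("#" if n == MAX_ITER else "." for n in row))
```

Rows that cross the set need many more iterations than rows far from it, so the tasks are uneven in cost; that imbalance is one reason the choice of task size matters.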
During the Mandelbrot solver we’ll look at where the complexity lies in generating an image like this:
During the Disco problem we’ll visualise the results using Andreas’ word-cloud tool, we may also cover the use of map/reduce for social network exploration:
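The word frequencies that feed a word cloud come from a word count, the classic map/reduce job. The sketch below shows the two steps in plain Python so the shape of the computation is visible; it does not use Disco's API, and the sample tweets and function names are invented for illustration.

```python
from collections import defaultdict
import re


def map_tweet(tweet):
    """Map step: emit a (word, 1) pair for every word in one tweet."""
    for word in re.findall(r"[a-z0-9']+", tweet.lower()):
        yield word, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count(tweets):
    """Run the map step over every tweet, then reduce all the pairs."""
    pairs = (pair for tweet in tweets for pair in map_tweet(tweet))
    return reduce_counts(pairs)


# Made-up sample data, not tweets from the tutorial's data set.
tweets = ["Python at PyCon", "parallel Python is fun"]
counts = word_count(tweets)
print(sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])))
```

Because the map step looks at one tweet at a time, it can be spread across many workers; only the reduce step needs to see all the pairs for a given word.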
Install requirements will be announced closer to the tutorial along with the (recommended!) VirtualBox image. I’m probably providing more material than we can cover for my two sections (Mandelbrot, Disco – how far we get depends on the size and capabilities of the class), all the material will be provided for keen students to continue and we’ll run an after-class session for those with more questions.