I’m a little behind with the blogging so here’s the short version. StrongSteam has been under constant development for 2 months and we’re close to putting up the first AI tools behind a few Python demos (hopefully they’ll be up next week). I’m talking about this at HackerNewsLondon tomorrow night.
We haven’t (quite) finished the demos so it’ll be a slideshow. I’m thinking of running a workshop in a month or so to show what’s possible, talk through the limitations and possibilities and help people get comfy with the API.
I’m also very pleased to say that we were accepted into the StartupChile programme alongside RadicalRobot (my better half). In StrongSteam Kyran and I will get 6 months in Santiago with a $40k budget (for no equity!) to build our API and this opens the door to further travel. We’re also very happy to welcome Balthazar Rouberol (linkedin) to our team, he’ll be joining us remotely as an intern for 6 months.
Our biggest priority now is to get the alpha out there. If you’re curious to see what we’re doing please follow us via @strongsteamapi and join the mailing list on the strongsteam homepage.
We also have two surveys – the first is so you can tell us about your general AI interest, the second focuses on some of the points raised in the first to tell us more about your needs. We’d really appreciate your input here if you have 10 minutes to spare.
Ian applies Artificial Intelligence as an Artificial Intelligence Researcher for companies (Mor Consulting), co-founded the StrongSteam A.I. datamining toolkit, co-authored SocialTies, programs Python, writes The Screencasting Handbook and is also a sea-side dweller and consumer of fine coffees.
Kyran and I are starting work on a new project – strongsteam offers a web API with artificial intelligence and data mining tools. The goal is to make it easy for you to do things like:
get the text out of images using optical character recognition
determine whether two images look the same and if one object (e.g. a certain book or a can of coke) can be found in another
use natural language processing to analyse, cluster and compare text
extract text from audio (e.g. to pull out keywords from podcasts)
use machine learning on text to derive new data
If you’d like to join the closed alpha then visit strongsteam and add your email to the announce list on the homepage.
We’ve started with Python bindings which make it easy to talk to the strongsteam web service. Initially we’ll wrap open source tools that we’ve used along with lots of our own A.I. data mining tools from years of work in my Mor Consulting A.I. consultancy.
At EuroSciPy last week I demo’d using O.C.R. to extract the words from plant labels at Wakehurst Place gardens so you can lookup the plant on Wikipedia once you’ve taken a photo like this one:
Plant label for Ostrich Plume Fern at Wakehurst Place (Sussex)
Now we’re looking at applying O.C.R. to conference name-badges; this will be a bit of a mash-up from data used in our SocialTies conference app and Lanyrd.com’s data. Next we’ll look at image matching and some text processing tools.
My updated High Performance Python tutorial is now available as a 55 page PDF. The goal is to take you on several journeys which show you different ways of making Python code run much faster (up to 75* on the CPU, faster with a GPU).
This is an update to the 49 page v0.1 I published three weeks ago after running the tutorial at EuroPython 2011 in Florence.
PyPy – Python’s new Just In Time compiler, a note on the new numpy module
Cython – annotate your code and compile to C
numpy integration with Cython – fast numerical Python library wrapped by Cython
ShedSkin – automatic code annotation and conversion to C
numpy vectors – fast vector operations using numpy arrays
NumExpr on numpy vectors – automatic numpy compilation to multiple CPUs and vector units
multiprocessing – built-in module to use multiple CPUs
ParallelPython – run tasks on multiple computers
pyCUDA – run tasks on your Graphics Processing Unit
Other algorithmic choices and options you have
The improvement over the last version (v0.1) is that I’ve filled in all the sections now including pyCUDA (there are still a few IAN_TODOs marked, I hope to finish these in a future v0.3). I’ve also added a short section on Algorithmic Choices, link to the new Cython prange operator and show the new numpy module in PyPy.
The source code is on my github page. The original slides are on slideshare too. If you’re after a challenge then at the end of the report I suggest some ported versions of the code that I’d like to see.
The report is licensed Creative Commons by Attribution (please link back here) – I’ll also happily accept a beer if you meet me in person! If you’re curious about this sort of work then note that I offer A.I. and high performance computing consulting and training via my Mor Consulting.
Update – ShedSkin 0.9 adds faster complex number support. I haven’t added it to the report yet, evidence in the ShedSkin Group suggests it gets closer to the non-complex-number version (i.e. you don’t have to do more work but you get a nice speed boost whilst still using complex numbers).
Update (Nov 2011) – Antonio and Armin posted a note which explains some of the slowness in PyPy and show how it is competitive, under the right conditions. Armin also contributed a C version which shows PyPy to run as fast as C (for their chosen configuration).
I enjoyed running a 4 hour tutorial on High Performance Python at EuroPython last week (great event guys!). The class was limited to 40 people and I’d love for more people to benefit from the several weeks of work that went into it so I’ve written it up as a 49 page PDF (license: Creative Commons By Attribution).
This is v0.1, please take a look and give me feedback so I can release an improved v0.2 within a few weeks. Is anything missing? Sure! A couple of sections just have src (no write-up) and there’s a bunch of IAN_TODO markers for me to complete for the next revision. The 49 pages should have something useful for you to chew on though.
numpy integration with Cython – fast numerical Python library wrapped by Cython
ShedSkin – automatic code annotation and conversion to C
numpy vectors – fast vector operations using numpy arrays
NumExpr on numpy vectors – automatic numpy compilation to multiple CPUs and vector units
multiprocessing – built-in module to use multiple CPUs
ParallelPython – run tasks on multiple computers
pyCUDA – run tasks on your Graphics Processing Unit
If you haven’t been to a EuroPython – I definitely recommend them. Next year’s will also be in Florence (a lovely city with lovely people), the science/HPC tracks were very interesting to me and I hope to see more of the same next year.
Python Text Processing with NLTK 2.0 Cookbook (Amazon US, UK) is a cookbook for Python’s Natural Language Processing Toolkit. I’d suggest that this book is seen as a companion for O’Reilly’s Natural Language Processing with Python (available for free at nltk.org). The older O’Reilly book gives a lot of explanation for how to use NLTK’s components, whereas Packt’s new book shows you lots of little recipes which build to larger projects, giving you a great hands-on toolkit.
Overall the book is easy to read, has a huge set of sample recipes and feels very useful. I’ll be referring to it for our upcoming @socialties mobile app.
You’ll need to download NLTK. You can also refer to some sample articles at Packt’s site and get Chapter 3 as a free PDF (see below). The author is Jacob Perkins; his blog links to many related articles and he also has a nice ‘how it started’ article.
Here are my thoughts on the book. Disclosure – I was sent a free copy of the book by Packt for review, the thoughts below are entirely my own.
Chapter 1: Tokenizing Text and WordNet Basics
If you haven’t tried tokenising text before you may not realise how complicated it can be (expressing even basic rules for English is jolly hard!). This chapter has a good overview of tokenisation and the excellent WordNet library. Filtering stopwords (low value words like ‘the’, ‘of’) and synsets approaches (synonym groups in WordNet) are also covered. The word similarity measure was new to me, the book certainly throws up nice nuggets.
Chapter 2: Replacing and Correcting Words
Stemming approaches are covered, the goal is to find common root words (e.g. “running”, “runs” and “run” can each have “run” as their stem) to simplify your input text. Synonym replacement (e.g. converting “bday” to “birthday”) and negating words using antonyms are nicely treated. Babelfish is provided through NLTK for translation and the PyEnchant spellchecker is introduced.
Chapter 3: Creating Custom Corpora (sample PDF chapter)
This chapter discusses MongoDB (a NoSQL document store) as a way to store your own corpora in NLTK’s format, it also introduces part of speech tagging. File locking using lockfile is mentioned in case you’re using multiple processes (discussed later).
Chapter 4: Part-of-Speech Tagging and Chapter 5: Extracting Chunks
I was less interested in this part, I’ve had to extract Named Entities before and there’s a nice discussion in Chapter 5.
Chapter 6: Transforming Chunks and Trees
The section on filtering out insignificant words using part of speech tags was interesting (i.e. using the Determiner tag DT to filter words like “a”, “all”, “an” and “that”). Cardinals (numbers) are discussed; I liked the recipe for swapping noun cardinal phrases so e.g. “Dec 10” becomes “10 Dec” (whilst “10 Dec” doesn’t change).
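To make the noun/cardinal swap concrete, here is a minimal sketch of the idea. It is not the book's recipe: it assumes the input is a plain list of (word, tag) tuples using Penn Treebank style tags ("CD" for cardinals, tags starting with "NN" for nouns), and the function name is my own.

```python
def swap_noun_cardinal(tagged):
    """Swap the first cardinal (CD) with a noun that sits directly before it.

    `tagged` is a list of (word, tag) tuples. This is an illustrative
    stand-in for the book's chunk-based recipe, not a copy of it.
    """
    result = list(tagged)
    # Find the position of the first cardinal number, if there is one.
    cd_index = next(
        (i for i, (_, tag) in enumerate(result) if tag == "CD"), None
    )
    if cd_index is None or cd_index == 0:
        return result  # no cardinal, or nothing before it to swap with
    if result[cd_index - 1][1].startswith("NN"):
        result[cd_index - 1], result[cd_index] = (
            result[cd_index],
            result[cd_index - 1],
        )
    return result


print(swap_noun_cardinal([("Dec", "NNP"), ("10", "CD")]))
print(swap_noun_cardinal([("10", "CD"), ("Dec", "NNP")]))
```

The second call leaves its input alone because the cardinal already comes first, which matches the “10 Dec” case above.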
Chapter 7: Text Classification
This feels like it will be useful – bag of words classification and the Naive Bayes Classifier are discussed (along with some other classifiers). Here the author starts to build a movie rating classifier. Precision and Recall are explained nicely. A high-information classifier is built, this is useful as we can then remove low-information words (those that aren’t biased to a single class in the classifier) which can improve classification results. Combining classifiers to further improve results is also covered.
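As a reminder of what Precision and Recall measure, here is a small pure-Python sketch. The function name and the toy labels are my own assumptions for illustration; the book works with NLTK's own classifiers and metrics rather than this code.

```python
def precision_recall(expected, predicted, label):
    """Return (precision, recall) for one class label.

    precision = correct predictions of `label` / all predictions of `label`
    recall    = correct predictions of `label` / all true items of `label`
    """
    true_pos = sum(
        1 for e, p in zip(expected, predicted) if e == label and p == label
    )
    predicted_pos = sum(1 for p in predicted if p == label)
    actual_pos = sum(1 for e in expected if e == label)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall


# Hypothetical movie review labels: what the reviews really were
# versus what a classifier guessed.
expected = ["pos", "pos", "pos", "neg"]
predicted = ["pos", "pos", "neg", "neg"]
print(precision_recall(expected, predicted, "pos"))
```

Here every “pos” guess was right (high precision) but one real “pos” review was missed (lower recall).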
Chapter 8: Distributed Processing and Handling Large Datasets
This chapter has promise – I wasn’t aware of the share-nothing distributed execution engine execnet. Redis is also used, Jacob builds towards a distributed word scoring engine which uses Redis as a single storage system. I’ve yet to use Redis but really want to hook it into our future @socialtiesapp, distributed processing will definitely be on the agenda too.
Chapter 9: Parsing Specific Data
This is a little gem, tucked at the end of the book. Ages ago I’d come across a date parsing module (which I then forgot about), having needed it recently I was super-happy to see dateutil discussed. It makes the parsing of different date formats incredibly easy and also handles timezones.
The timex module in NLTK is introduced (I’d never heard of it before) – it takes a fuzzy reference to a date or time and marks it up. An example would be “let’s go sometime <TIMEX2>this week</TIMEX2>”, you can then extract the fuzzy reference and decide how to interpret it in your application.
Overall I recommend this book. If you have the original O’Reilly book (and you really ought to) then this makes for a great companion. I also spotted these two other reviews.
I’ve been using the O’Reilly book for over a year, I’m curious to see what’s different between the two. I’ll post a full review once I’ve been through it.
Over the last couple of months I’ve been building up a social microprinter (inspired by Tom Taylor’s implementation and Matt Webb’s original idea). Here’s the current version – Arduino+WiShield+CBM231+off-site server (powered partly by BenOSteen’s Python driver):
The goal is to build a social microprinter – a printer that’d live in a social environment (currently The Skiff co-working office in Brighton) which would help bring people a little bit closer. Currently it prints tweets (for ‘theskiff’) and shows events, later it’ll show recent Gowalla check-ins and maybe some local news headlines or the weather (but there’s got to be better stuff to show, right?…ideas on a postcard please).
My original intent was to build a device that could be stuck on the wall in a cafe, it would show tweets on a screen (probably under the cafe’s or Brighton’s hashtag) and let non-Internet folk post their own messages back. Doing this nicely would have needed a screen, machine, wall space etc – using a receipt printer seemed like an easy way to prototype the idea.
Jumping forward, here’s an early version – this is a CBM231 connected to my Ubuntu laptop via a USB->RS232 lead (note – this lead is good, the cheap ones on eBay can be bad – see below). Here I’m using BenOSteen’s Python driver to send tweets via serial to the printer.
This device has done the rounds, here it is on display at BuildBrighton’s talk to the British Computer Society:
Here it is in use at Likemind Brighton showing international #likemind tweets as other groups meet around the world on Friday morning (note – unicode converted to ‘?’ as I haven’t figured out if/how to get international characters out of the printer yet!):
It ran during the weekend of Barcamp Brighton and printed out barcampy stuff, I added some notes about local cafes and a job ad for one of the companies:
The goal all along was to build an independent controller (so removing the laptop from the equation). For this I coupled an Arduino with a WiShield 1.0. The WiShield libraries are easy enough to work with, after an hour’s experimentation I got WPA2 working (it takes 25 seconds to negotiate the connection on each attempt), we use WPA2 at home and in The Skiff.
Coupling the Arduino to the printer was easy enough. I have been trying (and so far failing) to get a Max233 chip acting as a voltage level converter, so for now I’m using a pre-built RS232 Level Shifter. This converts the Arduino’s 0V/5V TTL to +12V/-12V RS232 levels (powered from the Arduino’s 5V out). To output text I’m using Roo Reynolds’s Arduino sketch, which handily includes some control codes to cut the receipt after printing.
Next I wanted live data. At first I simply put a short plain text file on a web site, used the WiShield to fetch it and Roo’s code to print it. Now I’m using a hacked version of Ben’s code to write tweets (including bold and underline control codes) to a text file which is stored online (microprinter.ianozsvald.com), this ready-to-print file is grabbed over the WiShield, printed and then cut. The online file is updated every 2 minutes.
The final tweak was to add a button to the printer. Using the Arduino’s demo button sketch I hooked up a big thumb-sized button. The Arduino’s main loop is looking for a combination of ‘at least 5 seconds have passed since the last print’ and ‘button pressed’, then it’ll kick off the web request for new data. Once this request returns it prints out the text.
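The real logic lives in the Arduino sketch, but the ‘button pressed AND at least 5 seconds since the last print’ check is easy to show in Python. This is only a sketch of the idea: the class name, method name and the injectable clock are my own assumptions, and it treats the very first press as allowed.

```python
import time


class PrintTrigger:
    """Allow a print only if the button is pressed and the cooldown has passed."""

    def __init__(self, cooldown_seconds=5.0, clock=time.monotonic):
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock  # injectable so the logic can be tested without waiting
        self.last_print = None  # nothing has been printed yet

    def should_print(self, button_pressed):
        now = self.clock()
        cooled_down = (
            self.last_print is None
            or now - self.last_print >= self.cooldown_seconds
        )
        if button_pressed and cooled_down:
            self.last_print = now  # remember when this print started
            return True
        return False


trigger = PrintTrigger()
print(trigger.should_print(True))   # first press is accepted
print(trigger.should_print(True))   # an immediate second press is ignored
```

The cooldown stops an impatient thumb from firing off several web requests in a row.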
I look for the pattern “--------------” (14 dashes) to start and end the message, before this we get HTTP headers (from the WiShield) that I didn’t want to print.
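Here is a rough Python equivalent of that marker search (the Arduino version does the same job on the incoming bytes). The function name and the sample response are made up for illustration; only the 14 dash marker comes from my setup.

```python
MARKER = "-" * 14  # the 14 dash delimiter that wraps the printable text


def extract_message(response_text, marker=MARKER):
    """Return the text between the first two markers, or None if not found."""
    start = response_text.find(marker)
    if start == -1:
        return None  # no opening marker, so nothing to print
    start += len(marker)
    end = response_text.find(marker, start)
    if end == -1:
        return None  # message was cut off before the closing marker
    # Drop the line breaks that sit next to the markers.
    return response_text[start:end].strip("\r\n")


# A made-up response: HTTP headers first, then the wrapped message.
raw = (
    "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\n"
    + MARKER + "\nhello from the skiff\n" + MARKER + "\n"
)
print(extract_message(raw))
```

Returning None when a marker is missing means a half-received response never reaches the printer.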
Here’s the finished hardware:
This is a WiShield 1.0. The button (shown just out of shot top-left) is connected 3.3V->button, button->Pin 6 AND Ground (via a 15k resistor). For the printer I’m using Pin 8 for tx (blue lead on the RS232 level converter) and Ground, the level converter is powered by the 5V out.
Here’s the connector:
The connector is overly-connected in this image. I think all you actually need is Pin 2 from the RS232 Level Converter to Pin 3 on the 25 pin connector along with Pin 5 (GND) to Pin 7 (GND on 25 pin connector). With yellow wires I’ve shorted Pins 4&5 and 8&20 but I think this is overkill (they’re used for bus control but they’re probably ignored in this configuration). Here’s a full pinout.
During all the hacking our faithful cat Mia has attempted to assist whenever she could. Here she’s taken ownership of the bag used to transport the early versions:
Along the way I also acquired an Epson TM T88 II receipt printer, it is ‘just another serial printer’ but takes different control codes (and it looks like it might have a smaller character set than the CBM 231). As yet I’ve only tried printing plain ASCII, I’d like to investigate further and build a library that supports this printer too.
A note on buying leads from eBay: if you buy cheap leads (e.g. £2 silver/blue leads) then you might buy a pack of 5 (because if one breaks you’ve got 4 more that work, right?) and end up with 5 dead-on-arrival leads. You could then report the problem and the nice people could then ship you a replacement set, but then you might discover that you’ve got another 5 DOA leads. You have been warned.
If you’re buying your first microprinter do try to buy a working serial lead with it (it’ll probably be a 9 pin to 25 pin converter lead) – if you get the wrong lead (null modem vs straight serial – I forget which you need!) then you won’t get anything (the bane of my first few weeks of testing). Buy a printer+lead that’s known to work and you won’t go wrong.
Spend the £8 per lead and buy from Amazon if you don’t want to waste hours wondering why your printer is just printing out reams of ‘?’ rubbish:
If you want to build your own then the best first source of info is the microprinter wiki. Roo Reynolds has Arduino drivers (which I hacked a bit for my implementation) that don’t depend on external data sources.
You’ll find my Python server source and Arduino sketch (which assumes you’ve got a WiShield 1.0) here: social_microprinter. Note that the code is horribly hacky, it was written over many short sessions when I could steal an hour or two from other projects.
It could do with being straightened out and commented and a few nice new features would include Gowalla check-in notifications, event RSS reading and weather printing.
Many thanks to my fellow hackers at BuildBrighton for help debugging my early serial problems and to Barney for the lend of his RS232 Shifter (I’ll soon get this Max233 working, promise!).
Here’s the finished, installed unit on the work bench at BuildBrighton in The Skiff (just by the social kitchen space). Once it is a bit more robust it’ll move to the front of the building:
Over the weekend at BarCampBrighton5 I demonstrated a quick visualisation that Kyran and I built over breakfast in Berlin last Friday. It looks like:
To see it yourself open Bar Camp Brighton 5 Visualisation using Chrome or WebKit (it’ll work in Firefox but might be rather slow). It is interactive so it is worth opening, about 60 people are shown here.
If you reload the page you’ll see the force directed graph bouncing around as it settles into a low energy configuration. The nodes are people attending the event, edges are friend links to other people at the event. The image sizes for the nodes reflect the number of links a person has at that event.
As you can see above Jot (the main host and co-organiser) is most connected at the event. Two people aren’t following anyone at the event, they’ve been pushed to the bottom left of the window.
You can drag nodes using the grab-handles (blue circles) or move the entire graph by dragging the image.
Here you can see that seb_ly is the most connected, closely followed by niqui and bitchwhocodes. At the bottom left is a sub graph of two nodes – these two people follow each other but don’t follow anyone in the main graph.
In both cases the data is extracted from the relevant Lanyrd pages (BCB5, FOTB), friends for each attendee are read from Twitter and then a graph is built as a JSON dictionary which links nodes (screen_names) to friends (lists of screen_names). Ready to run Python source code is at github: LanyrdViewerUsingProtoVis.
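To give a feel for the data structure, here is a minimal sketch of building that dictionary and counting links per person. It is not the code from the github repository: the function names and the sample screen_names are invented, and counting a follow in either direction as one link is my assumption about how node sizes could be derived.

```python
import json


def build_friend_graph(attendees, friends_by_user):
    """Map each attendee to the friends who are also at the event."""
    attending = set(attendees)
    return {
        name: sorted(
            f for f in friends_by_user.get(name, [])
            if f in attending and f != name
        )
        for name in attendees
    }


def link_counts(graph):
    """Count links per person; a follow in either direction is one link."""
    neighbours = {name: set() for name in graph}
    for name, friends in graph.items():
        for friend in friends:
            neighbours[name].add(friend)
            neighbours[friend].add(name)
    return {name: len(others) for name, others in neighbours.items()}


# Invented sample data: "dave" is followed by alice but is not attending.
attendees = ["alice", "bob", "carol"]
friends_by_user = {"alice": ["bob", "dave"], "bob": ["alice", "carol"], "carol": []}

graph = build_friend_graph(attendees, friends_by_user)
print(json.dumps(graph, indent=2))
print(link_counts(graph))
```

Filtering friends down to people at the event keeps the graph small enough for the browser to lay out.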
Both of these links should work on a mobile device but they’ll be awfully slow (they’re useless on my iPhone 3G!)
Kyran used ProtoVis to build the force directed graph, it includes a bit of a hack to make images work on the nodes.
If you’re interested in seeing more of this stuff then Kyran will have more to demo at our upcoming £5 App show and tell.
In the hope that this’ll save someone else the bother…if you’re installing the web scraping Python library scrapy on your Mac (I’m on Leopard 10.5.8) and you come across an error like:
checking for libxml libraries >= 2.6.8... configure: error:
Version 2.6.7 found. You need at least libxml2 2.6.8 for this
version of libxslt
then here’s the solution.
Presumably you’ll be following the Scrapy install instructions. I used the supplied links for libxml2-2.7.3 and libxslt-1.1.24. libxml built and installed to /usr/local/lib just fine. libxslt wouldn’t ./configure – it kept reporting that it could only see the older libxml from /usr/lib, not the newer one in /usr/local/lib.
At this point libxslt configured, built and installed just fine. To make python see it I had to update my .bash_profile so PYTHONPATH linked to the default output directory:
Side note – whatever you do, don’t mess with /usr/lib. I tried moving the default libxml and libxslt libraries and I had the same consequence mentioned by Kevin Watters – lots of system tools (including su!) depend on libxslt to be in /usr/lib. I had to boot to Single User Mode to copy the files back before the system would work again.
On Wednesday night I jumped on a train up to London to visit the London Financial Python User Group to give a short demo of pyCUDA. I’m using CUDA heavily for my physics consultancy and I figured the finance guys would be interested in 10-1000* speed-ups for their calculations.
To introduce pyCUDA I used P. Narayanan’s GPUs: For Graphics and Beyond PDF presentation (the first 13 pages), his explanation and diagrams are very clear.
To put CUDA in context against regular CPUs I used the recent Peak MHz graph and the main power/speed/transistor count graph in The Free Lunch is Over: A Fundamental Turn to Concurrency in Software. The main point here is that we’ve topped out at 2-3GHz CPUs and now we have to parallelise our code. Doing so on CPUs means we get 4, 8, 16 (and soon 24 then 32) cores to play with…but with CUDA if the problem is mathematics based we have 480 cores to use!
If you’re interested in the general use of CUDA and GPUs then check out the excellent gpgpu.org.
You may wonder about real-world performance with CUDA. Without naming names I can say that I’m now delivering a 115* speed-up on a particularly gnarly problem (I mentioned during the talk that I’d reached 80* – I’ve managed to improve that in the last 2 days). On an earlier problem when I knew far less about CUDA I delivered a 100* speed-up for the same company.
It was grand to meet a lot of new faces at the group, a few people I’ve met before at PyCons (hi Ben! Giles!). Making a contact with Didrik of Enthought was rather grand too. I hope to visit again.