About


This is Ian Ozsvald's blog, I'm an entrepreneurial geek, an AI consultant, founder of the Annotate.io social media mining API, co-founder of the SocialTies App, author of the A.I.Cookbook, author of The Screencasting Handbook, a Pythonista, co-founder of ShowMeDo and FivePoundApps and also a Brightonian. Here's a little more about me.


10 February 2013 - 14:28 Applied Parallel Computing at PyCon 2013 (March)

Minesh B. Amin (MBA Sciences) and I (Mor Consulting) are teaching Applied Parallel Computing at PyCon in San Jose in just over a month; here’s an outline of the tutorial. The conference is sold out but there are still tickets for the tutorials (note that they’re selling quickly too).

Typically a recording of the tutorial is released a couple of months after PyCon to PyVideo – you miss out on the networking but you can at least catch up on the material. The source code will also be released.

Our tutorial uses a lot of tools so we’re providing a VirtualBox image (32 bit, requiring about 5GB of disk space, runs on Win/Lin/Mac). Those who choose not to use the VBox image will have to install the requirements themselves; for some parts this is a bit tough so we strongly recommend using the VBox image. Details of the image will be provided to students a few weeks before the conference.

Parts of my tutorial build on my PyCon 2012 High Performance Python 1 tutorial. You might also be interested in the (slightly vague!) idea I have of writing a book on these topics – if so you should add your name to my High Performance Python Mailing List (it is an announce list for when/if I make progress on this project, very lightweight).

This year’s 3 hour tutorial is split into five sections:

  1. Types of parallelism
  2. Hard-won lessons in building reliable/debuggable/extensible parallel systems
  3. “List of tasks” – solving a Mandelbrot task using multiprocessing (single machine), parallelpython (can run multi-machine), redis queue (multi machine and language)
  4. “Map/reduce” – investigating and understanding a set of Tweets using Disco, practical guide to configuration, visualisation with word-cloud and matplotlib, possibly moving on to social network connectivity analysis and visualisation
  5. “Hyperparameter optimisation” – solving a many-parameter optimisation problem whose parameter space is not fixed at the start of the run
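As a rough idea of what the “list of tasks” pattern in section 3 looks like, here is a minimal sketch using only the standard library’s multiprocessing module. It is not the tutorial’s code: the function names, the image size and the region of the complex plane are assumptions chosen for illustration. Each row of the image is an independent task, so a pool of worker processes can compute rows in any order.

```python
import multiprocessing


def escape_iterations(c, max_iter=100):
    """Count iterations of z = z*z + c before |z| exceeds 2 (capped at max_iter)."""
    z = 0j
    for n in range(max_iter):
        if abs(z) > 2.0:
            return n
        z = z * z + c
    return max_iter


def compute_row(task):
    """Compute one image row; each row is an independent task."""
    y, width, x_min, x_max, max_iter = task
    step = (x_max - x_min) / (width - 1)
    return [escape_iterations(complex(x_min + i * step, y), max_iter)
            for i in range(width)]


def mandelbrot(width=60, height=30, max_iter=100, processes=2):
    """Build the list of row tasks and hand them to a pool of workers."""
    # Illustrative bounds for the region of the complex plane (an assumption).
    x_min, x_max, y_min, y_max = -2.0, 1.0, -1.0, 1.0
    y_step = (y_max - y_min) / (height - 1)
    tasks = [(y_min + j * y_step, width, x_min, x_max, max_iter)
             for j in range(height)]
    with multiprocessing.Pool(processes=processes) as pool:
        return pool.map(compute_row, tasks)


if __name__ == "__main__":
    image = mandelbrot()
    # Crude text rendering: '#' marks points that never escaped.
    for row in image:
        print("".join("#" if n == 100 else "." for n in row))
```

Points inside the set are the expensive ones because they run for the full max_iter iterations, so rows through the middle of the image take longer than rows near the edges. That uneven cost per task is one place the complexity hides.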

During the Mandelbrot solver we’ll look at where the complexity lies in generating an image like this:

Mandelbrot Surface

During the Disco problem we’ll visualise the results using Andreas’ word-cloud tool, we may also cover the use of map/reduce for social network exploration:

Word-cloud of Apple mentions

Install requirements will be announced closer to the tutorial along with the (recommended!) VirtualBox image. I’m probably providing more material than we can cover for my two sections (Mandelbrot, Disco – how far we get depends on the size and capabilities of the class), all the material will be provided for keen students to continue and we’ll run an after-class session for those with more questions.

 


Ian applies Data Science as an AI/Data Scientist for companies in Mor Consulting, founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

No Comments | Tags: Data science, Python

23 January 2013 - 0:09 Layers of “data science”?

The field of “data science” covers a lot of areas, it feels like there’s a continuum of layers that can be considered and lumping them all as “data science” is perhaps less helpful than it could be. Maybe by sharing my list you can help me with further insight. In terms of unlocking value in the underlying data I see the least to most valuable being:

  • Storing data
  • Making it searchable/accessible
  • Augmenting it to fashion new data and insights
  • Understanding what drives the trends in the data
  • Predicting the future

Storing a “large” amount of data has always been feasible (data warehouses of the 90s don’t sound all that different to our current Big Data processing needs). If you’re dealing with daily Terabyte dumps from telecomms, astro arrays or LHCs then storing it might not be economical but it feels that more companies can easily store more data this decade than in previous decades.

Making the data instantly accessible is harder, this used to be the domain of commercial software and now we have the likes of postgres, mongodb and solr which scale rather well (though there will always be room for higher-spec solutions that deal with things like fsync down to the platter level reliably regardless of power supply and modeling less usual data structures like graphs efficiently). Since CPUs are cheap building a cluster of commodity high-spec machines is no longer a heavy task.

Augmenting our data can make it more valuable. For example – applying sentiment analysis to a public tweet stream and adding private demographic information gives YouGov’s SoMA (disclosure – I’m working on this via AdaptiveLab) an edge in the brand-analysis game. Once you start joining datasets you have to start dealing with the thorny problems – how do we deal with missing data? If the tools only work with some languages (e.g. English), how do we deal with other languages (e.g. the variants of Spanish) to offer a similarly good product? How do we accurately disambiguate a mention of “apple” between a fruit and a company?

Modeling textual data is somewhat mainstream (witness the availability of Sentiment, NER and categorisation tools). Doing the same for photographs (e.g. Instagram photos) is in the quite-hard domain (have you ever seen a food-identifier classifier for photos that actually works?). We rarely see any augmentations for video. For audio we have song identification and speech recognition, but I don’t recall coming across dog-bark/aeroplane/giggling classifiers (which you might find in YouTube videos). Graph network analysis tools are at an interesting stage: we’re only just witnessing them scale to large amounts of data on commodity PCs, and tying this data to social networks or geographic networks still feels like the domain of commercial tools.

Understanding the trends and communicating them – combining different views on the data to understand what’s really occurring is hard, it still seems to involve a fair bit of art and experience. Visualisations seem to take us a long way to intuitively understanding what’s happening. I’ve started to play with a few for tweets, social graphs and email (unpublished as yet). Visualising many dimensions in 2D or 3D plots is rather tricky, doubly so when your data set contains millions of points or more.

Predicting the future – in ecommerce this would be the pinnacle – understanding the underlying trends well enough to be able to predict future outcomes from hypothesised actions. Here we need mathematical models that are strong enough to stand up to some rigorous testing (financial prediction is obviously an example, another would be inventory planning). This requires serious model building and thought and is solidly the realm of the statistician.

Currently we just talk about “data science” when often we should be specifying more clearly which sub-domain we’re involved with. Personally I sit somewhere in the middle of this stack, with a goal to move towards the statistical end. I’m not sure how to define the names for these layers, I’d welcome insight.

This is probably too simple a way of thinking about the field – if you have thoughts I’d be most happy to receive them.



2 Comments | Tags: ArtificialIntelligence, Data science, Life

15 January 2013 - 21:35 Do self-driving cars make the courier redundant?

I’ll start with a quote via “Why workers are losing the war against the machines” taken from A Farewell to Alms by economist Gregory Clark:

“There was a type of employee at the beginning of the Industrial Revolution whose job and livelihood largely vanished in the early twentieth century. This was the horse. The population of working horses actually peaked in England long after the Industrial Revolution, in 1901, when 3.25 million were at work. … There was always a wage at which all these horses could have remained employed. But that wage was so low that it did not pay for their feed. “

Now I’m back in London I’m watching the prevalence of couriers and delivery people bringing a constant stream of packages through the busy streets. I’m betting this will be automated in the near future. Couple self-driving cars and a physical-packet-delivery-platform that looks a bit like the Internet protocol and then you’ve got (I think) a bit of a game changer.

Self-driving cars have the potential to be legal in cities (they’re legal in a few US states at present, accepting longer legal battles to come). They’ll drive safely and predictably, they’re unlikely to react erratically (e.g. no pulling out in busy streets for a foolish maneuver and hitting a cyclist), they don’t need a lunch break and they could pick-up and drop-off from depots a long way from traditional storage facilities (as nobody has to commute to the facility).

Consider one of these vehicles arriving outside your office and phoning you to give you a secret ID number. You come out to the street, key in the number, a panel pops open and there’s your package. Internally the packages are retrieved in a similar way to automated warehouses. Since the system is always calling home to report its status it could notify all upcoming delivery recipients of its expected ETA. You could probably buy an upgrade to reserve your delivery slot (giving delivery companies a new revenue stream?).

If they’re controlled via a derivative of the Internet Protocol then we have a decentralised physical-packet-routing system. If the cars can ‘mate’, perhaps by backing on to each other, they can trade packages so the packages travel further without human intervention. Maybe you end up with an open market for atoms-distribution, assuming compatible protocols exist amongst the courier companies.

I’ve followed John Robb’s recent discussion of DroneNet (more) – it is the same idea (props – I’m tagging on his/others’ thinking) applied to low cost drones. I think drones will follow later as they’re constrained by weight and flight restrictions and so they are far less useful in the city at present.

At the end of the day I think that humans will be pushed out of the physical package delivery game (be it via drones or via delivery cars). Trying to understand the speed at which humans will be removed from traditional working disciplines in specialist areas continues to baffle me.



2 Comments | Tags: ArtificialIntelligence, Life

13 January 2013 - 20:10 Map/Reduce (Disco) on millions of tweets

Whilst working on data sciencey problems for AdaptiveLab I’m becoming more involved in simple visualisations for proof-of-concepts for clients. This ties in nicely with my PyCon Parallel Computing tutorial with Minesh. I’ve been prototyping a Disco map/reduce tutorial (part 2 for PyCon) using tweets collected during the life of SocialTies during 2011-2012.

Using 11,645,331 tweets on 1 machine running through Disco with a modified word_count example it is easy to filter to keep tweets with a certain word (“loving” in this case) and to plot a word cloud (thanks Andreas!) of the remaining tweets:

Words in “loving” tweets

Tweet analysis often shows a self-referential nature – here we see “i’m” as one of the most popular words. It is nice to see “:)” making an appearance. Brands mentioned include “Google”, “iPhone”, “iPad”. We also see “thanks”, “love”, “nice” and “watching” along with “London” and “music”. Annoyingly I’m not cleaning the words so we see “it!”, “it.”, “(via” (with erroneous brackets) and the like which clutter the results a bit.
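The filter-then-count idea can be sketched in plain Python without Disco. This is not the modified word_count job itself, just the same map and reduce steps run in a single process; the function names and the three sample tweets are invented for illustration, and matching the keyword against whole whitespace-separated tokens is an assumption.

```python
from collections import Counter
from itertools import chain


def map_tweet(tweet, keyword):
    """Map step: emit (word, 1) pairs, but only for tweets containing the keyword."""
    words = tweet.lower().split()
    if keyword in words:
        for word in words:
            yield word, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return totals


def word_counts(tweets, keyword):
    """Run the map step over every tweet and reduce the combined output."""
    pairs = chain.from_iterable(map_tweet(t, keyword) for t in tweets)
    return reduce_counts(pairs)


if __name__ == "__main__":
    # Invented sample tweets, not taken from the real data set.
    sample = ["I'm loving it!", "loving the new iPad :)", "hating Mondays"]
    print(word_counts(sample, "loving").most_common(5))
```

Because the words are only lower-cased and split on whitespace, tokens such as “it!” survive with their punctuation attached, which is the clutter described above.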

Next I’ve applied “hating” as the filter to the same set:

Words in “hating” tweets

One of the most mentioned words is “people” which is a bit of a shame, along with “i’m”. Thankfully we see some “love” and “loving” there. “apple” appears more frequently than “twitter” or “google”. Lots of related negative words also appear e.g. “stupid”, “hate”, “shit”, “fuck”, “bitch”.

Interestingly few of the terms shown include Twitter users or hashtags.

Finally I tried the same using “apple” on an earlier smaller set (859,157 tweets):

Words in “apple” tweets

Unsurprisingly we see “store”, “iphone”, “ipad” and “steve”. Hashtags include “#wwdc”, “#apple” and “#ipad”. The Twitter accounts shown are errors due to string-matching on “apple” except for @techcrunch.

I find it interesting to see competitor brands being mentioned in the same tweets (e.g. “google”, “microsoft”, “android”, “samsung”, “amazon”, “nokia”), although the firms are obviously related to “apple”.

An improvement would be to remove words from the chart that match the original pattern (hence removing words like “apple” and “#apple” but keeping everything else). Removing near-duplicate terms (e.g. “apple”, “apples”, “apple’”) and performing common string clean-ups (removing punctuation) would also help.
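A minimal sketch of that clean-up, assuming the words arrive as a list of raw tokens. The helper names are hypothetical, and dropping every token that contains the pattern is a blunt stand-in for proper near-duplicate handling (it removes “apples” as well as “apple”).

```python
import string
from collections import Counter

# ASCII punctuation plus the curly quotes that turn up in tweets.
STRIP_CHARS = string.punctuation + "’‘“”"


def clean_token(token):
    """Lower-case a token and strip punctuation from both ends, keeping a
    leading '#' or '@' so hashtags and user names stay recognisable."""
    token = token.lower()
    prefix = token[0] if token[:1] in ("#", "@") else ""
    stripped = token.strip(STRIP_CHARS)
    return prefix + stripped if stripped else ""


def cleaned_counts(tokens, pattern):
    """Count cleaned tokens, skipping any that contain the search pattern."""
    counts = Counter()
    for token in tokens:
        word = clean_token(token)
        if not word or pattern in word:
            continue
        counts[word] += 1
    return counts


if __name__ == "__main__":
    raw = ["Apple", "#apple", "(via", "it!", "it.", "iPad", "apple’"]
    print(cleaned_counts(raw, "apple"))
```

One side effect worth knowing about: tokens made only of punctuation, such as “:)”, are dropped entirely, so smileys would need a special case if they should stay in the cloud.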

It would also be good to change the colour channels – perhaps using red for commonly-negative words and green for commonly-positive words, with the rest in a neutral colour. Maybe we could also colour the neutral words differently if they’re commonly associated with the key word (e.g. brands of the key word).

Getting started with Disco was easy enough. The installation takes a few hours (the Disco project instructions assume a certain familiarity with networked systems), after that editing the examples is straightforward. Visualising using Andreas’ code was very straight-forward. The source will be posted around the time of my PyCon tutorial in March.



4 Comments | Tags: ArtificialIntelligence, Data science, Python

13 December 2012 - 0:23 Office social graph connectivity using NetworkX

I wanted an excuse to play with the Python NetworkX graph visualisation library and recently I joined AdaptiveLab to consult on some data science & visualisation problems. Thus formed the question – how were we all connected together? I figured that looking at who follows us all will yield a little insight into the people we have in common. I’m particularly interested in this question seeing as I was living in Brighton, then lived in Chile for most of the year and have only recently moved to London – my social graph is likely to be disjoint from the graph of the existing London-based team.

Below I show the follower graph with my new colleagues at the top (James, Kat, Ben, Mark, Steve), Emily, Jon and myself in the middle and my collaborator Balthazar at the bottom:

sample_full_network_thumb

I chose to visualise followers rather than who-we-follow as I cared about the graph of who-pays-(some)-attention-to-us. I figure this is a good surrogate for people who might actually know us, suggesting a good chance that we have friends and colleagues in common.

Balthazar worked in France with me in StrongSteam (whilst I was in Chile), he’s followed by almost nobody from my usual network. Emily and I are a couple, we’re followed by a lot of the same people. Our friend Jon lives in Brighton and runs the central co-working environment (where we were for 10 years), he is followed by many of the people who follow us. The top of the graph shows that my colleagues are followed by only a few people who follow others in the company (so we all have different social networks), with the exception of boss-James who shares a set of followers with Jon and myself (I guess because we’re all outspoken in the UK tech scene).

In the above graph I deliberately reduced the number of nodes drawn if they were only connected to one person in the network. Seeing as a few of us have over a thousand followers the graph got too busy too quickly. Below is a subsampled version of the early network with no limit on the number of one-edge-only nodes:

sample_network_thumb

The subsampled network looks nicely organic, like living cells.

The code is on github as twitter-social-graph-networkx, it includes some patches that have just been added back to the python-twitter module to enable whole-graph downloading. You can use this code to download the follower graph for your own network, then plot it using NetworkX (it is configured to use GraphViz as the plots are faster, you can use pure NetworkX if you don’t have GraphViz). The git project has pickles of my social network so if you satisfy the dependencies, you should be good to plot straight away.
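The trick of limiting one-edge-only nodes can be sketched in plain Python. This is not the code from the github project: the function name, the max_single parameter and the sample names are assumptions for illustration. The result is a plain list of (follower, person) edges, which could then be loaded into a NetworkX graph with add_edges_from for plotting.

```python
from collections import defaultdict


def follower_edges(followers_by_person, max_single=5):
    """Return (follower, person) edges. Followers shared by two or more
    people are always kept; followers of only one person are capped at
    `max_single` per person so the plot does not get too busy."""
    # First pass: record which people each follower follows.
    followed = defaultdict(set)
    for person, followers in followers_by_person.items():
        for follower in followers:
            followed[follower].add(person)

    # Second pass: keep shared followers, cap the one-edge-only ones.
    edges = []
    for person, followers in followers_by_person.items():
        singles_kept = 0
        for follower in sorted(followers):
            if len(followed[follower]) > 1:
                edges.append((follower, person))
            elif singles_kept < max_single:
                edges.append((follower, person))
                singles_kept += 1
    return edges


if __name__ == "__main__":
    # Invented follower sets, purely for illustration.
    sample = {
        "alice": {"f1", "f2", "f3", "shared"},
        "bob": {"f4", "shared"},
    }
    for edge in follower_edges(sample, max_single=1):
        print(edge)
```

Setting max_single very high gives the “no limit” behaviour, in which case subsampling the people or followers beforehand is what keeps the graph readable.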



No Comments | Tags: Data science, Life, Python

25 November 2012 - 22:39 Testing 3 modern face detection libraries (face.com, openCV, libccv)

As a research project months back Balthazar and I tested 3 modern face detection libraries (definitely see Balthazar’s write-up). Face.com had just been acquired by facebook, they had a great and free service which annotated not just face locations but also sex, age and emotion. We also tested OpenCV (popular and free) and the lesser known libccv.

Previously I’d used openCV to build a face tracking robot head in Python and we figured a review of what’s easily available might be fun:

Balthazar ran the face detection process with face.com and OpenCV, I added libccv. We used 200 images kindly provided by Rosario Rascuna (@_sarhus), collected from Instagram and annotated by us. We listed 150 images with faces and 50 without to test how often faces are correctly detected and whether faces are seen where they shouldn’t be.

We did not test the locations of the face, just the absolute count per image. This means that a face could be incorrectly spotted in an image whilst the true face was missed – our scoring system would still say ’1 was expected and 1 was found so that is correct’. Manual inspection suggested that this is a minor problem (though if I ran the experiment again I’d take the time to hand-annotate every face’s location and check that faces were detected in the right place).

OpenCV provides a set of pre-trained data files (as xml with names like alt_tree_cascade), we tested them individually and then combined all their detections into an uber-detector. The goal for OpenCV was just to see how well it might do without fine tuning.

For OpenCV we used v2.3, for libccv we used v0.1.

I’ll be posting some of the code that we used along with the dataset, I’ll update a link here when I’ve done that.

Results:

  • face.com found 144 of the 150 images with faces with 0 false positives (i.e. it didn’t say once that an image without a face had a face)
  • OpenCV found 93 images with faces of the 150 images and an additional 4 that were false positives
  • libccv found 99 images with faces of the 150 images and an additional 6 that were false positives
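To make the scoring concrete, here is a small sketch of count-based scoring plus the detection rates implied by the figures above. It is an assumption about how such a score could be computed, not the code we ran: the function names are hypothetical, an image with faces counts as correct when the detected count equals the expected count, and any detection in a face-free image counts as a false positive.

```python
def score_counts(expected, detected):
    """Compare per-image face counts; face locations are ignored.
    `expected` and `detected` map an image name to a number of faces."""
    result = {"with_faces": 0, "correct": 0,
              "without_faces": 0, "false_positives": 0}
    for image, n_expected in expected.items():
        n_detected = detected.get(image, 0)
        if n_expected > 0:
            result["with_faces"] += 1
            if n_detected == n_expected:
                result["correct"] += 1
        else:
            result["without_faces"] += 1
            if n_detected > 0:
                result["false_positives"] += 1
    return result


def detection_rate(found, total):
    """Fraction of images with faces that were found."""
    return found / total


if __name__ == "__main__":
    # Found counts as reported in the results list, out of 150 images with faces.
    reported = {"face.com": 144, "OpenCV": 93, "libccv": 99}
    for name, found in reported.items():
        print(name, round(detection_rate(found, 150), 2))
```

On the reported numbers this gives roughly 96% for face.com, 62% for OpenCV and 66% for libccv.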

The short story is that the open source tools are ‘pretty good’ but face.com was better (and is now unavailable). Since this piece of work Stephen’s LambdaLabs offers a RESTful face detection (and recognition) API, I’ve not evaluated it.

There’s clearly room for a web based service in this area, training it with feedback would be a nice feature. Adding face recognition (as LambdaLabs has, but OpenCV/libccv doesn’t) is an obvious bonus. I’ve seen face detection used for:

  • cropping uploaded faces in web profile pictures
  • filtering non-face photos from photo albums
  • filtering face photos from restaurant review sites

I suspect we’ll see more computer vision APIs that make it easier to annotate images (much the reason why I’ve registered this skeleton site for annotate.io), given the rise in photos on sites like Instagram (and flickr before).



No Comments | Tags: ArtificialIntelligence, Life

25 November 2012 - 19:50 StartupChile (Round 2.1) all finished, thoughts

The odd thing is that I’ve been trying to write this post for 3 months. Having started and stopped several times (including during the flight back from Chile on Oct 15th) I figure I ought to put something out. The journey was, it turns out, somewhat of a roller-coaster ride.

Early in January Kyran Dale and I flew to Santiago for Round 2.1 of StartupChile to build StrongSteam, a cloud-based computer vision API. Emily (my fiancée) also won funding and came out to build TinyEars. Sadly StrongSteam didn’t make it (my co-founder and I went in different directions, it was easier to end the project).

The goal of the StartupChile project is to bring working entrepreneurs in from around the world to teach Chileans how to build start-ups. Teaching includes running events, building partnerships, explaining lessons-learnt in prior experiences and explaining that failure/experimentation is a part of the process. In return we stay for 6 months, get a $40k reimbursement package (90% of our expenses up to $40k USD are reimbursed via a slightly torturous bureaucratic process) and are free to leave at the end. We never have to register our business out there, give up shares or pay tax on foreign earnings.

During the last 8 months I:

  • ran a pair of Python programming courses (material open-sourced)
  • started private self-mentorship groups (now an official part of the StartupChile programme)
  • built a novel AI backend with Kyran for using Optical Character Recognition to replace the need for QR codes (which is now OpenPlants)
  • won ‘best choice for investment‘ on the Jason Calacanis show This Week in Startups (ha!)
  • played with Kinects and Python for rock-sizing with computer vision for the Chilean mining industry
  • organised some data meetups
  • spoke on agile lessons-learnt
  • presented to VCs and Angel groups (and got offered $500k investment lumps in both San Francisco and Chile)
  • received acquisition offers from companies in San Francisco and Chile
  • presented at conferences like PyCon and got mentions in places like the BBC
  • wrote up demo day meets
  • finished the programme by moving with Emily to San Francisco for 2 months to continue our networking

The main upsides of the programme are:

  • time to build your idea without the need to work/consult to pay the bills (your living expenses are covered)
  • lovely group of proactive people to meet from both around the world and locally
  • supportive (if overworked) staff members who do their best to help
  • lovely people in Chile in general (warm, friendly, interested, those building companies are particularly open and friendly)
  • increasing recognition in the investment/startup community which opens doors (e.g. The Economist and others covered it recently) – a few months ago StartupChile held its first Demo Day in San Francisco to ease fund-raising
  • easy access to North America if you’re coming in from outside the US (I used it as a springboard in our final two months to head to San Francisco to continue the networking)
  • you’re encouraged to travel within Chile to teach other groups, you also have easy access to places like Argentina and Uruguay if you fancy traveling (we certainly did) and can justify it as work-related
  • other related spaces like the Santiago Hackerspace and new co-work venues are popping up

The main goal of the programme definitely seems to be working for the Chileans. In our time in Chile we saw many Chileans step forwards with either young working companies or ideas (some high-tech, many not), who then got on with building, partnering and growing their businesses. The company registration process is being massively simplified, failure is becoming more acceptable (generally it is not socially acceptable to fail – much the case in the UK only 20 years ago – and thankfully that attitude is changing in Chile).

More Chileans are traveling around the world, more doors are being opened in cities like San Francisco and more money, connections and opportunities are flowing back into Chile. Being part of a government’s experiment to change their citizens’ attitude to risk (and seeing it work) has been a very rewarding experience.

On a personal level I’ve also made some lovely contacts – people I’d work with who I consider friends who I’d never have met otherwise. I suspect that the “StartupChile Mafia” (ex-StartupChile folk) will open doors for all of us in the programme in the future too. I’ve met a few ex-StartupChile folk here in London (one by accident in the pub last week – hi Michael!) and I’m wondering if we can run a Mafia meetup before Christmas.

There are several downsides to Chile which should be considered by future applicants:

  • there’s a reason we’re paid to be entrepreneurs in Chile – the ecosystem is lacking certain things and maybe you’d not setup shop there otherwise. Make sure your eyes are open to the very young/conservative investment scene, the small tech community and the conservative nature of businesses (bureaucracy and caution->long time to get things done)
  • things that worked elsewhere in the world a few years ago will probably be successful now in Chile (e.g. people building online food services and education sites were doing well, persons trying to offer novel AI/data applications and things requiring iPads had a, well, harder time of it) so don’t assume your cutting edge idea from California will move quickly in Chile
  • the air in winter is polluted and horrid (bad news if e.g. you have asthma) but lovely in summer
  • the programme’s goals are focused on making Chile successful (and not you, per se, but that’s a nice side-effect for StartupChile if it occurs)
  • most people only speak the Chilean-variant of Spanish called Chileno (StartupChile participants and staff all speak some level of English) – this can make buying things in the street a bit of a challenge – try to learn some Spanish before you come
  • there was little explanation about the interests & needs of companies within Chile – for example it took me months to learn just how large and hungry the mining industry is for innovative solutions (and it is a rich industry)

I spoke with Mitch Altman (a founder of the San Franciscan hackerspace Noisebridge) recently and, paraphrased, he pointed out that in most places in the world (he travels a lot to promote hackerspaces) if you open the door to encourage experiments, accept failure and encourage small business and knowledge sharing then It Just Tends To Happen. I suspect that this model can be applied around the world, without big Government funding, and I expect to see many more countries try this bottom-up approach of bringing entrepreneurs in (rather than building expensive ‘innovation clusters’ that rarely seem to perform).

There are other positive and negative write-ups about the programme including Emily‘s, Liis Peetermanns‘s, another, Nathan Lustig, Maptia (lovely British team!). My posts here are under the startup-chile tag.

If you’re interested in building your business in South America then this is the go-to programme. If you need 6 months time in an interesting country with an increasing investor scene, this is not a bad choice. If you want mentorship and hands-on help or you want to deal with the large corporates that you might find in London, New York or Frankfurt then Chile hasn’t proven itself here yet (though it may, given time). What’s impressed me most about the programme is the way it keeps on improving – keep an eye on it, definitely consider it! Seek a wide set of opinions if you want to apply, lots of people experience the programme differently.

Emily and I have discussed what we’d like to see in future StartupChile-like programmes (I suspect we’ll see more, with further innovation, as Governments wake up to the positive change that can occur):

  • invite academics and industrialists to a country to work on a specific problem for a fixed time period without heavy-handed IP controls but funded like StartupChile – this could be a wonderful way to foster innovation and collaboration and to build new IP that could be exploited (perhaps with a share in the IP being owned by all in these projects)
  • setup targets for sector improvement in a country – e.g. in Chile perhaps choose to make mining more energy efficient – then invite companies to come with industrial doors opened and primed for collaboration (so many StartupChile companies could have formed local partnerships if only doors had been opened so the incumbents knew we were coming!)
  • list the problems that entrepreneurs could solve and make it public – actively seek entrepreneurs to visit to try to fix things (e.g. in Chile the winter pollution must be fixable, education is super-expensive [which led to student protests] and surely can be improved, the mining industry suffers from growing energy and mine-discovery costs)
  • encourage an alumni group so past members can easily help future members (something that’s been long discussed in StartupChile but seems to be low on the agenda)
  • work harder to jump language & cultural barriers – in Chile we were told everyone on the programme would speak English but the locals notably didn’t so the very people we were trying to help were hard to communicate with – add language & cultural lessons to a programme to ease the transition for both sides

As of now I’m back to my AI consulting for natural language processing (working with the lovely team at AdaptiveLab in Shoreditch), tinkering on the side with industrial needs learned via StrongSteam in annotate.io. If you’re ex-StartupChile and you’d be interested in meeting in London, drop me a line.



1 Comment | Tags: ArtificialIntelligence, Business Idea, Entrepreneur, Life, StartupChile

1 November 2012 - 20:00 Making “from lxml import etree” work with virtualenv (Python)

Update – these steps are overly complicated and *unnecessary*! See fizyk and Marius’ comments below. I’ll leave this post just in case it helps anyone – hopefully anyone coming here will realise it isn’t hard (now) to install lxml, as long as the OS dependencies are installed

I use virtualenv for all development. Recently I was stumped with the need for the lxml module – installing it using virtualenv on Linux requires a bit of work.

Let’s see the problem first:

$ virtualenv testlibxml
 New python executable in testlibxml/bin/python
 Installing distribute.............................................................................................................................................................................................done.
 Installing pip...............done.
.../virtualenvs/testlibxml $ source bin/activate
$ pip install lxml
gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -I/home/ian/workspace/virtualenvs/testlibxml/build/lxml/src/lxml/includes -I/usr/include/python2.7 -c src/lxml/lxml.etree.c -o build/temp.linux-x86_64-2.7/src/lxml/lxml.etree.o
In file included from src/lxml/lxml.etree.c:254:0:
/home/ian/workspace/virtualenvs/testlibxml/build/lxml/src/lxml/includes/etree_defs.h:9:31: fatal error: libxml/xmlversion.h: No such file or directory
compilation terminated.
error: command 'gcc' failed with exit status 1

Following these instructions (taking care to follow the steps for *both* libxml2 and libxml, further below), I ran the following, with the path changed to point at my local virtualenv:

./configure --with-python=/home/ian/workspace/virtualenvs/testlibxml/bin/python

And now we can start python and import libxml2:

(testlibxml)ian@ian-Latitude-E6420 ~/workspace/virtualenvs/testlibxml $ python
 Python 2.7.3 (default, Aug  1 2012, 05:14:39)
 [GCC 4.6.3] on linux2
 Type "help", "copyright", "credits" or "license" for more information.
 >>> import libxml2 # works
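
A quick way to confirm the result from inside the activated virtualenv is to try each import and report what happened. This is only a minimal sketch and not part of the original steps; the helper name `can_import` is made up for illustration.

```python
import importlib


def can_import(module_name):
    """Return (True, None) if module_name imports, else (False, error text).

    Hypothetical helper for illustration; it only wraps importlib.
    """
    try:
        importlib.import_module(module_name)
    except ImportError as exc:
        return False, str(exc)
    return True, None


if __name__ == "__main__":
    # libxml2 is the binding built by the ./configure step above;
    # lxml.etree is the import this post is ultimately about.
    for name in ("libxml2", "lxml.etree"):
        ok, error = can_import(name)
        print("%s: %s" % (name, "ok" if ok else "missing (%s)" % error))
```

Run it with the virtualenv's own python so that it checks the environment you actually care about rather than the system install.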


3 Comments | Tags: Life, Python

4 September 2012 - 19:51 EuroSciPy Parallel Python tutorial now online

I taught Parallel Python at EuroSciPy 2012 last week in Brussels and I’ve uploaded all the necessary material. In the talk we covered:

  • multiprocessing (built in)
  • parallelpython (an easy shift from multiprocessing to do multi-machine and multi-core processing)
  • gearman (cross-platform job server for heterogeneous job processing)
  • picloud.com (cloud-based processing using EC2; was Python only, now supports any infrastructure)
  • ipython cluster (for easy parallelisation with ipython)
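
To give a flavour of the first item, here is a minimal sketch of a Mandelbrot-style calculation farmed out with the built-in multiprocessing module. It is not the tutorial's code (that lives in the repo); the function names, grid bounds and iteration limit are all assumptions chosen for illustration.

```python
import multiprocessing


def escape_count(point, max_iter=100):
    """Count iterations until z = z*z + point leaves |z| > 2 (capped)."""
    z = 0j
    for i in range(max_iter):
        if abs(z) > 2.0:
            return i
        z = z * z + point
    return max_iter


def make_grid(width, height):
    """Build a row-major list of complex points starting at -2.0-1.5j."""
    return [complex(-2.0 + 3.0 * x / width, -1.5 + 3.0 * y / height)
            for y in range(height) for x in range(width)]


def run_serial(points):
    """Reference result computed in a single process."""
    return [escape_count(p) for p in points]


def run_parallel(points, processes=2):
    """Same calculation, with the points shared across worker processes."""
    pool = multiprocessing.Pool(processes=processes)
    try:
        return pool.map(escape_count, points)
    finally:
        pool.close()
        pool.join()


if __name__ == "__main__":
    points = make_grid(40, 40)
    counts = run_parallel(points)
    # The parallel result should match the serial one exactly.
    assert counts == run_serial(points)
    print("points that never escaped: %d" % sum(1 for n in counts if n == 100))
```

Each point is independent of the others, which is what makes this a convenient "list of tasks" problem: the same `escape_count` function can be handed to a different job runner without changing the maths.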

Here’s my github repo with the powerpoint and python source; the README outlines the install requirements. I used Python 2.7 on Ubuntu but it should run on any platform. This is related to my PyCon 2012 High Performance Python tutorial. Here’s my class of 70, taught by a bleary-eyed me as I’d just flown in from San Francisco (9 hours’ time difference, ugh!):

I’m thinking of writing an updated High Performance+Parallel Python guide (probably as a self-published book). If you’re interested in hearing about it please join the High Performance Python Mailing List (I’ve only got a list right now). I’ll make an announcement once I know more.

We had I think 190 folk at the tutorials and 170 for the conference over the weekend (along with a discussion about whether weekend conferences hurt attendance…). 25% of the attendees were from small companies (my tweet). I was jolly pleased to attend my first sprint, which I promptly ignored so that I could hack a change into Fabian’s new memory_profiler: you no longer have to use a decorator to choose the function to profile (which, um, I must submit back in the next few days).

I also gave a 3 minute lightning talk on my experiences building (and ending) StrongSteam via StartupChile. The point I forgot to make (dang, just 3 minutes!) is that whilst StartupChile will be on various government radars around the world, I doubt it’d be on the radar of governments looking to build a similar programme that includes open research (hopefully also coupled with entrepreneurship). As and when a government/institution suggests inviting researchers to their country to collaborate on large, ground-breaking research problems requiring much collaboration, perhaps with business opportunities, I hope they look to StartupChile as a template for successfully inviting 1000s of folk under a simple grant system to a foreign country.

As noted in my slides I’m also looking for new work opportunities around Big Data/Natural Language Processing/High Performance Computing – take a look at my work site and please drop me an email.

As usual we had a big social, organised by Ludovic Gasc of AFPyro. Here are my mussels nestled in their bed with snails wrapped in warm cheese blankets (fine food!):

I also got introduced to the fine Free Beer, modelled by Didrik of Enthought:



4 Comments | Tags: Python

18 August 2012 - 0:51 EuroSciPy2012 Parallel Python tutorial requirements now online

My EuroSciPy 2012 Parallel Python tutorial requirements are online in this github repo. If you’re coming to my tutorial next Thursday please make sure everything is installed beforehand. The repo includes the slides (not quite yet finished) and a ‘solutions/’ directory which you shouldn’t peek at (that’s there in case we run behind in the tutorial). In the course we’ll cover:

  • multiprocessing
  • parallelpython
  • gearman
  • picloud
  • ipython cluster

This course builds on the Mandelbrot example from my previous High Performance Python course. As noted at the end of the slides I’m probably looking for some long-running consultancy work around Parallel Python/High Performance Python in London (or maybe the Euro/US areas), as the start-up I took to StartupChile didn’t work out. I’m in California at present, returning to London to stay in October.

Also – I’m thinking of writing an updated High Performance+Parallel Python guide (probably as a self-published book) that builds on my original 55 page High Performance Python guide. If you’re interested in hearing about it please join the High Performance Python Mailing List (I’ve only got a list right now). I’ll make an announcement once I know more.



No Comments | Tags: Python