
This is Ian Ozsvald's blog, I'm an entrepreneurial geek, a Data Science/ML/NLP/AI consultant, founder of the Annotate.io social media mining API, author of O'Reilly's High Performance Python book, co-organiser of PyDataLondon, co-founder of the SocialTies App, author of the A.I.Cookbook, author of The Screencasting Handbook, a Pythonista, co-founder of ShowMeDo and FivePoundApps and also a Londoner. Here's a little more about me.


8 October 2013 - 10:23 What confusion leads from self driving vehicles and their talking to each other?

This is a light follow-up from my “Do self driving cars make the courier redundant?”  post from January. I’m wondering which first- and second-order effects occur from self-driving cars talking to each other.

Let’s assume they can self-drive and self-park and that they have some ability to communicate with each other. Noting their speed and intent should help self-driving cars make better utilisation of the road (they could drive closer together), they could quickly signal if they have a failure (e.g. “My brake readings have just become odd – everyone pull back! I’m slowing using the secondary brake system”), they can signal that e.g. they intend to reverse park and that other cars should slow further back along the road to avoid having to halt. It is hard to see how a sensibly designed system of self-driving cars could be worse than a similar sized pack of normal humans (who might be tired, overconfident, in a rush etc) behind the wheel.

Would cars deliberately lie? There are many running jokes about drivers (often “elsewhere” in the world) where some may signal one way and then exploit nearby gaps regardless of their signalled intention. Might cars do the same? By design or by poor coding? I’d guess people might mod their driving computer to help them get somewhere faster – maybe they’d ask it to be less cautious in its manoeuvres  (taking turns quicker, giving less distance between other vehicles) or hypermile more closely than a human would. Manufacturers would fight back as these sorts of modifications would increase their liabilities and accidents would damage their brand.

What about poorly implemented protocols? On the Internet with TCP/IP we suffer from bufferbloat – many intermediate devices between packet destinations have varying sized buffers, they all try to cache to manage traffic but we end up with lower throughput and odd jams that are rather unpredictable and contrary to the design goal. Cars could have poor implementations of communication protocols (just as some smartphones and laptop brands have trouble with certain WiFi routers), so they’d fail to talk or maybe talk with errors.

Maybe cars would not communicate directly but would implement some boids-like behaviours based on local sensing (probably more robust but also less efficient due to no longer-range negotiation). Even so local odd behaviours might emerge – two cars backing off from each other, then accelerating to close the gap, then repeating – maybe a group of cars get into an unstable ‘dance’ whilst driving down the motorway. This might only be visible from the air and would look rather inhuman.

Presumably self-driving cars would have to avoid hitting humans at all costs. This might make humans less observant as they cross the road – why look if you know that a car is always anticipating (and avoiding) your arrival into the road? This presumably leaves self-driving cars at the mercy of mischievous humans – leaving out human-like dolls in the road that cause slow-and-avoid behaviours, just for kicks.

Governments are likely to introduce some kind of control overrides into the cars in the name of safety and national security (NSA/GCHQ – looking at you). This is likely to be as secure as the “unbreakable” DVD encryption, since any encryption system released into the wild is subject to various attacks. Having people steal cars or subvert their behaviours once the backdoors and overrides are noticed seems inevitable.

I wonder what sort of second order effects we’d see? I suspect that self-driving delivery vehicles would shift to more night work (when the roads are less congested and possibly petrol is dynamically priced to be cheaper), so roads could be less congested by day (and so could be filled by more humans as they commute longer distances to work?). Maybe people en masse forget how to drive? More people will never have to drive a car, so we’d need fewer driving instructors. Maybe we’d need fewer parking spaces as cars could self-park elsewhere and return when summoned – maybe the addition of intelligence helps us use parking resources more efficiently?

If we have self-driving trucks then maybe the cost of removals and deliveries drops. No longer would I need to hire a large truck with a driver, instead the truck would drive itself (it’d still need loading/unloading of course). This would mean fewer people taking the larger-vehicle licensing exams, so fewer test centres (just as for driving schools) would be needed.

An obvious addition – if cars can self-drive then repair centres don’t need to be small and local. Whither the local street of car mechanics (inevitably of varying quality and, sadly, honesty)? I’d guess larger, out of town centralised garages more closely monitored by the manufacturers will surface (along with a fleet of pick-up trucks for broken-down vehicles). What happens to the local street of car mechanic shops? More hackspaces and assembly shops? Conversion to housing seems more likely.

If we need fewer parking spaces (e.g. in Hove [1927 photo!] there are huge boulevards – see Grand Avenue lanes here) then maybe we get more cycle lanes and maybe we can repurpose some of the road space for other usages – communal green patches (for kids and/or for growing stuff?).

The NYTimes has a good article on how driverless cars could reshape cities.

Charles Stross has a nice thread on geo-political consequences of self-driving cars. One comment alludes to improved social lives – if we can get to and from a party/restaurant/pub/nice social scene very easily (without e.g. hoping for the last Tube train home in London or a less pleasant bus journey), maybe our social dimension increases? The comment on flying vs driving  is interesting – you’d probably drive further rather than fly if you could sleep for much of the journey, so that hurts flight companies and increases the burden on road maintenance (but maybe preserves motorway service stations that might otherwise get less business since you’d be less in need of a break if you’re not concentrating on driving all the time!).

Hmmm…drone networks look like they might do interesting things for delivery to non-road locations, but drones have a limited range. What about coupling an HGV ‘mother truck’ with a drone fleet for the distribution of goods to remote locations, with the ‘mother truck’ containing a generator and a large storage unit of stuff-to-distribute. I’m thinking about feeding animals in winter that are stuck in fields, reaching hurricane survivors, more extreme running races (and hopefully helping to avoid deaths) or even supplying people living out of cities and in remote areas (maybe Amazon-by-drone deliveries whilst living up a mountain become feasible?).

Ian applies Data Science as an AI/Data Scientist for companies in ModelInsight and Mor Consulting, founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

2 Comments | Tags: ArtificialIntelligence, Life

22 September 2013 - 12:13 PyConUK 2013

I’m just finishing with PyConUK, it has been a fun 3 days (and the sprints carry on tomorrow).


Yesterday I presented a lightly tweaked version of my Brand Disambiguation with scikit-learn talk on natural language processing for social media processing. I had 65 people in the room (cripes!), 2/3 had used ML or NLP for their own projects though only a handful of the participants had used either ‘in anger’ for commercial work. The slides below are slightly updated from my DataScienceLondon talk earlier in the year, there’s more on this blog over the last 2 months that I hadn’t integrated into the talk.



The project is in github if you’re interested, I’m looking for new collaborators and I can share the dataset of hand-tagged tweets.

I’d like to see more scientific talks at PyConUK, a lightning talk for later today will introduce EuroSciPy 2014 which will take place in Cambridge. I’d love to see more Pythonistas talking about scientific work, numerical computing and parallel computing (rather than quite so much web and db development). I also met David Miller who spoke on censorship (giving a call-out to the OpenRightsGroup – you too should pay them a tenner a month to support digital freedoms in the UK) and looked back over the long history of censorship in the UK and the English language. As ever, there were a ton of interesting folk to meet.

David mentioned the Andrews and Arnold ISP who pledge not to censor their broadband, apparently the only ISP in the UK to put up a strong pledge. This is interesting.

Shortly in London I’ll organise (or co-opt) some sort of Natural Language Processing meetup, I’m keen to meet others (Pythonistas, R, Matlab, whoever) who are involved in the field. I’ll announce it here when I’ve figured something out.


3 Comments | Tags: ArtificialIntelligence, Python, SocialMediaBrandDisambiguator

7 July 2013 - 17:52 Overfitting with a Decision Tree

Below is a plot of Training versus Testing errors using a Precision metric (actually 1.0-precision, so lower is better) that shows how easy it is to over-fit a decision tree to the detriment of generalisation. It is important to check that a classifier isn’t overfitting to the training data such that it is just learning the training set, rather than generalising to the true patterns that make up the entire dataset. It will only be a good predictor on unseen data if it has generalised to the true patterns.


Looking at the first column (depth 1 decision tree) the training error (red) is around 0.29 (so the Precision is around 71%). If we look at the exported depth 1 decision tree (1 page pdf) we see that it picks out 1 feature (“http”) as the most informative feature to split the dataset (ignore the threshold, that’s held at a constant 0.5 as we only have 0 or 1 values in our training matrix). It has 935 samples in the dataset with 465 in class 0 (not-a-brand) and 470 in class 1 (is-the-brand).

The right sub-tree is chosen if the term “http” is seen in the tweet. In that case the training set is left with 331 samples of which 95 are class 0 and 236 are class 1. 1.0/331*236 is roughly 0.71, i.e. 71%. If “http” isn’t seen then the left branch is taken where 234 class 1 samples are given a false negative labelling.
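The arithmetic for the depth 1 tree can be checked directly. This is a small worked example using only the counts quoted above; the recall figure is not in the post and is derived here purely for illustration.

```python
# Counts quoted in the post for the depth 1 tree that splits on "http"
total_class0, total_class1 = 465, 470      # whole training set (935 tweets)
right_class0, right_class1 = 95, 236       # tweets containing "http"

# Class 1 is the majority in the right branch, so it is predicted there
precision = right_class1 / (right_class0 + right_class1)
training_error = 1.0 - precision           # the quantity plotted as the error

# Class 1 tweets without "http" land in the left branch and are missed
false_negatives = total_class1 - right_class1
recall = right_class1 / total_class1       # derived here, not quoted in the post

print(f"precision={precision:.2f} error={training_error:.2f} "
      f"false negatives={false_negatives} recall={recall:.2f}")
```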

As we allow greater depth in the decision tree we see both the training and the testing error improve. By around depth 35 we have a very low training error and (roughly) the optimum testing error. By allowing the decision tree to add yet more branches beyond this point it overfits, becoming a great predictor for the training set (the error goes to 0) but with worsening testing errors (the thin green line is the average – it increases past a depth of 35 layers). Decision trees tend to overfit due to their greedy nature.
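The effect of depth can be reproduced on toy data. The sketch below is not the project's learn1_biasvar.py: it is a deliberately tiny greedy tree written in plain Python, run on synthetic binary features (hypothetical data where only feature 0 carries signal and 25% of labels are flipped), and it reports plain misclassification error rather than 1.0-precision. Training error can only fall as depth grows, while the testing error typically stops improving once the tree starts fitting the noise.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def build_tree(rows, labels, max_depth):
    """Greedily grow a tree on binary features; leaves hold the majority class."""
    majority = Counter(labels).most_common(1)[0][0]
    if max_depth == 0 or len(set(labels)) == 1:
        return majority
    best = None
    for f in range(len(rows[0])):
        left = [y for x, y in zip(rows, labels) if x[f] == 0]
        right = [y for x, y in zip(rows, labels) if x[f] == 1]
        if not left or not right:
            continue  # this feature does not split the node
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[0]:
            best = (score, f)
    if best is None:
        return majority
    f = best[1]
    lo = [(x, y) for x, y in zip(rows, labels) if x[f] == 0]
    hi = [(x, y) for x, y in zip(rows, labels) if x[f] == 1]
    return (f,
            build_tree([x for x, _ in lo], [y for _, y in lo], max_depth - 1),
            build_tree([x for x, _ in hi], [y for _, y in hi], max_depth - 1))

def predict(tree, x):
    """Walk from the root to a leaf; internal nodes are (feature, left, right)."""
    while isinstance(tree, tuple):
        f, left, right = tree
        tree = right if x[f] == 1 else left
    return tree

def error(tree, rows, labels):
    """Fraction of rows that the tree misclassifies."""
    wrong = sum(predict(tree, x) != y for x, y in zip(rows, labels))
    return wrong / len(labels)

def make_data(n, n_features=20, noise=0.25):
    """Synthetic data: the label copies feature 0, flipped with probability noise."""
    rows = [[random.randint(0, 1) for _ in range(n_features)] for _ in range(n)]
    labels = [x[0] if random.random() > noise else 1 - x[0] for x in rows]
    return rows, labels

random.seed(0)
train_x, train_y = make_data(200)
test_x, test_y = make_data(200)
for depth in (1, 3, 6, 12):
    tree = build_tree(train_x, train_y, depth)
    print(depth, round(error(tree, train_x, train_y), 3),
          round(error(tree, test_x, test_y), 3))
```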

I’ve added an example of a depth 50 (1 page pdf) decision tree if you’re curious. The social media disambiguator project has example code (learn1_biasvar.py) to generate this plot.


4 Comments | Tags: ArtificialIntelligence, Python, SocialMediaBrandDisambiguator

17 June 2013 - 20:13 Demonstrating the first Brand Disambiguator (a hacky, crappy classifier that does something useful)

Last week I had the pleasure of talking at both BrightonPython and DataScienceLondon to about 150 people in total (Robin East wrote-up the DataScience night). The updated code is in github.

The goal is to disambiguate the word-sense of a token (e.g. “Apple”) in a tweet as being either the-brand-I-care-about (in this case – Apple Inc.) or anything-else (e.g. apple sauce, Shabby Apple clothing, apple juice etc). This is related to named entity recognition, I’m exploring simple techniques for disambiguation. In both talks people asked if this could classify an arbitrary tweet as being “about Apple Inc or not” and whilst this is possible, for this project I’m restricting myself to the (achievable, I think) goal of robust disambiguation within the 1 month timeline I’ve set myself.

Below are the slides from the longer of the two talks at BrightonPython:

As noted in the slides for week 1 of the project I built a trivial LogisticRegression classifier using the default CountVectorizer, applied a threshold and tested the resulting model on a held-out validation set. Now I have a few more weeks to build on the project before returning to consulting work.

Currently I use a JSON file of tweets filtered on the term ‘apple’, obtained using the free streaming API from Twitter using cURL. I then annotate the tweets as being in-class (apple-the-brand) or out-of-class (any other use of the term “apple”). I used the Chromium Language Detector to filter non-English tweets and also discard English tweets that I can’t disambiguate for this data set. In total I annotated 2014 tweets. This set contains many duplicates (e.g. retweets) which I’ll probably thin out later, possibly they over-represent the real frequency of important tokens.

Next I built a validation set using 100 in- and 100 out-of-class tweets at random and created a separate test/train set with 584 tweets of each class (a balanced set from the two classes but ignoring the issue of duplicates due to retweets inside each class).
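A minimal sketch of that split, assuming the annotated tweets are already held in two Python lists. The function name, the seed and the 700/1314 class sizes in the demo are hypothetical (the post only gives the 2014 total); the 100 and 584 per-class figures come from the text.

```python
import random

def split_balanced(in_class, out_class, n_validation=100, n_train=584, seed=0):
    """Return (validation, train_test) lists with equal numbers from each class."""
    rng = random.Random(seed)
    validation, train_test = [], []
    for tweets in (in_class, out_class):
        shuffled = list(tweets)
        rng.shuffle(shuffled)
        # The first slice is held out, the next slice is used for test/train
        validation.extend(shuffled[:n_validation])
        train_test.extend(shuffled[n_validation:n_validation + n_train])
    return validation, train_test

# Placeholder tweets standing in for the hand-annotated data
in_class = [f"in-{i}" for i in range(700)]
out_class = [f"out-{i}" for i in range(1314)]
validation, train_test = split_balanced(in_class, out_class)
print(len(validation), len(train_test))
```

Duplicates caused by retweets are ignored here, just as in the post, so near-identical tweets can still land on both sides of the split.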

To convert the tweets into a dense matrix for learning I used the CountVectorizer with all the defaults (simple tokenizer [which is not great for tweets], minimum document frequency=1, unigrams only).
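To see why the default tokenizer is not great for tweets, here is a plain-Python approximation of unigram counting. It reuses the regular expression that scikit-learn documents as the CountVectorizer default token_pattern, but everything else (function names, the example tweets) is made up for illustration and this is not the library's implementation.

```python
import re
from collections import Counter

# scikit-learn's documented default token_pattern: words of 2+ characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and keep runs of two or more word characters."""
    return TOKEN_PATTERN.findall(text.lower())

def count_matrix(tweets):
    """Build a sorted vocabulary and a dense list-of-lists count matrix."""
    vocabulary = sorted({tok for tweet in tweets for tok in tokenize(tweet)})
    index = {tok: i for i, tok in enumerate(vocabulary)}
    matrix = []
    for tweet in tweets:
        row = [0] * len(vocabulary)
        for tok, n in Counter(tokenize(tweet)).items():
            row[index[tok]] = n
        matrix.append(row)
    return vocabulary, matrix

# Made-up example tweets
tweets = ["Love my #apple http://t.co/abc @tim", "apple sauce and apple juice"]
vocabulary, matrix = count_matrix(tweets)
print(tokenize(tweets[0]))
print(vocabulary)
```

The hashtag and the @user lose their markers and the URL is shredded into "http", "co" and "abc", which is how a bare "http" token can end up as a feature.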

Using the simplest possible approach that could work – I trained a LogisticRegression classifier with all its defaults on the dense matrix of 1168 inputs. I then apply this classifier to the held-out validation set using a confidence threshold (>92% for in-class, anything less is assumed to be out-of-class). It classifies 51 of the 100 in-class examples as in-class and makes no errors (100% precision, 51% recall). This threshold was chosen arbitrarily on the validation set rather than deriving it from the test/train set (poor hackery on my part), but it satisfied me that this basic approach was learning something useful from this first data set.
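The thresholding step and the two metrics can be sketched without any library. The probabilities below are invented so that the counts match the figures quoted above (51 of 100 in-class tweets above the threshold, no out-of-class tweet above it); they are not the classifier's real outputs.

```python
def precision_recall(probabilities, labels, threshold=0.92):
    """Call a tweet in-class only if its probability exceeds the threshold."""
    predicted = [p > threshold for p in probabilities]
    tp = sum(1 for pred, y in zip(predicted, labels) if pred and y == 1)
    fp = sum(1 for pred, y in zip(predicted, labels) if pred and y == 0)
    fn = sum(1 for pred, y in zip(predicted, labels) if not pred and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Invented scores: 51 confident in-class, 49 unsure in-class, 100 out-of-class
probabilities = [0.95] * 51 + [0.60] * 49 + [0.30] * 100
labels = [1] * 100 + [0] * 100
precision, recall = precision_recall(probabilities, labels)
print(precision, recall)
```

Lowering the threshold would trade some of that precision for recall, which is why it should be tuned on the test/train data rather than on the validation set.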

The strong (but not generalised at all!) result for the very basic LogisticRegression classifier will be due to token artefacts in the time period I chose (March 13th 2013 around 7pm for the 2014 tweets). Extracting the top features from LogisticRegression shows that it is identifying terms like “Tim”, “Cook”, “CEO” as significant features (along with other features that you’d expect to see like “iphone” and “sauce” and “juice”) – this is due to their prevalence in this small dataset (in this set examples like this are very frequent). Once a larger dataset is used this advantage will disappear.

I’ve added some TODO items to the README, maybe someone wants to tinker with the code? Building an interface to the open source DBPediaSpotlight (based on WikiPedia data using e.g. this python wrapper) would be a great start for validating progress, along with building some naive classifiers (a capital-letter-detecting one and a more complex heuristic-based one, to use as controls against the machine learning approach).
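One possible shape for the capital-letter control classifier is sketched below. The function names are hypothetical and the cut-off of 5 capitals is borrowed from the observation on this data set, so it should be treated as a starting guess rather than a tuned value.

```python
def count_capitals(tweet):
    """Number of upper-case characters in the tweet."""
    return sum(1 for ch in tweet if ch.isupper())

def capital_letter_baseline(tweet, min_capitals=5):
    """Naive control: guess in-class when the tweet has enough capitals."""
    return count_capitals(tweet) >= min_capitals

# Made-up examples; the second is the misspelt in-class tweet style from the project
print(capital_letter_baseline("Apple CEO Tim Cook talks about the new iPhone"))
print(capital_letter_baseline("Loving my apple, like how it werks with the iphon"))
```

The second tweet is about the brand but has a single capital, so this baseline misses it, which is the kind of weakness a control like this is meant to expose.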

Looking at the data 6% of the out-of-class examples are retweets and 20% of the in-class examples are retweets. I suspect that the repeated strings are distorting each class so I think they need to be thinned out so we just have one unique example of each tweet.
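Thinning could look something like the sketch below, which treats a tweet and its retweets as the same string once any leading "RT @user:" prefixes, case and extra whitespace are removed. The normalisation rules and names are assumptions for illustration, not what the project does.

```python
import re

# One or more leading "RT @user:" prefixes (the colon is optional)
RETWEET_PREFIX = re.compile(r"^(?:rt\s+@\w+:?\s*)+", re.IGNORECASE)

def normalise(tweet):
    """Strip retweet prefixes, lowercase and collapse whitespace."""
    text = RETWEET_PREFIX.sub("", tweet.strip())
    return " ".join(text.lower().split())

def thin_duplicates(tweets):
    """Keep only the first tweet seen for each normalised text."""
    seen = set()
    unique = []
    for tweet in tweets:
        key = normalise(tweet)
        if key not in seen:
            seen.add(key)
            unique.append(tweet)
    return unique

# Made-up tweets: three copies of one message plus one distinct tweet
tweets = [
    "New apple juice recipe",
    "RT @foo: New apple juice recipe",
    "RT @bar: RT @foo: new apple  juice recipe",
    "Apple event tonight",
]
print(thin_duplicates(tweets))
```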

Counting the number of capital letters in-class and out-of-class might be useful, in this set a count of <5 capital letters per tweet suggests an out-of-class example:

This histogram of tweet lengths for in-class and out-of-class tweets might also suggest that shorter tweets are more likely to be out-of-class (though the evidence is much weaker):


Next I need to:

  • Update the docs so that a contributor can play with the code, this includes exporting a list of tweet-ids and class annotations so the data can be archived and recreated
  • Spend some time looking at the most-important features (I want to properly understand the numbers so I know what is happening), I’ll probably also use a Decision Tree (and maybe RandomForests) to see what they identify (since they’re much easier to debug)
  • Improve the tokenizer so that it respects some of the structure of tweets (preserving #hashtags and @users would be a start, along with URLs)
  • Build a bigger data set that doesn’t exhibit the easily-fitted unigrams that appear in the current set
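For the tokenizer item in the list above, a first pass could be a single regular expression that tries URLs, then #hashtags and @users, then ordinary words. This is only a sketch with a hypothetical pattern and function name; it is not the twitter text module, and a URL followed directly by punctuation would keep that punctuation.

```python
import re

# Alternatives are tried in order: URL, then #hashtag or @user, then a plain word
TWEET_TOKEN = re.compile(r"https?://\S+|[#@]\w+|\w+")

def tokenize_tweet(text):
    """Split a tweet into tokens, lowercasing everything except URLs."""
    return [tok if tok.startswith("http") else tok.lower()
            for tok in TWEET_TOKEN.findall(text)]

# Made-up example tweet
print(tokenize_tweet("RT @IanOzsvald: new #iPhone review http://t.co/AbC123 looks good"))
```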

Longer term I’ve got a set of Homeland tweets (to disambiguate the TV show vs references to the US Department and various sayings related to the term) which I’d like to play with – I figure making some progress here opens the door to analysing media commentary in tweets.


2 Comments | Tags: ArtificialIntelligence, Data science, Life, Python, SocialMediaBrandDisambiguator

3 June 2013 - 20:24 Social Media Brand Disambiguator first steps

As noted a few days back I’m spending June working on a social-media focused brand disambiguator using Python, NLTK and scikit-learn. This project has grown out of frustrations using existing Named Entity Recognition tools (like OpenCalais and DBPediaSpotlight) to recognise brands in social media messages. These tools are generally trained to work on long-form clean text and tweets are anything but long or cleanly written!

The problem is this: in a short tweet (e.g. “Loving my apple, like how it werks with the iphon”) we have little context to differentiate the sense of the word “apple”. As a human we see the typos and deliberate spelling errors and know that this use of “apple” is for the brand, not for the fruit. Existing APIs don’t make this distinction, typically they want a lot more text with fewer text errors. I’m hypothesising that with a supervised learning system (using scikit-learn and NLTK) and hand tagged data I can outperform the existing APIs.

I started on Saturday (freshly back from honeymoon), a very small github repo is online. Currently I can ingest tweets from a JSON file (captured using curl), marking the ones with a brand and those with the same word but not-a-brand (in-class and out-of-class) in a SQLite db. I’ll benchmark my results against my hand-tagged Gold Standard to see how I do.
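The ingest-and-annotate step might be sketched as follows, assuming the curl capture holds one JSON object per line with "id" and "text" fields (the usual shape of Twitter's streaming output). The table layout and function names are hypothetical and are not taken from the github repo.

```python
import json
import sqlite3

def load_tweets(lines, conn):
    """Insert each tweet from line-delimited JSON, leaving it unlabelled."""
    conn.execute("CREATE TABLE IF NOT EXISTS tweets "
                 "(tweet_id INTEGER PRIMARY KEY, text TEXT, label INTEGER)")
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank keep-alive lines in the stream
        tweet = json.loads(line)
        if "text" not in tweet:
            continue  # non-tweet messages such as delete notices
        conn.execute("INSERT OR IGNORE INTO tweets (tweet_id, text, label) "
                     "VALUES (?, ?, NULL)", (tweet["id"], tweet["text"]))
    conn.commit()

def annotate(conn, tweet_id, in_class):
    """Store 1 for in-class (the brand) or 0 for out-of-class."""
    conn.execute("UPDATE tweets SET label = ? WHERE tweet_id = ?",
                 (1 if in_class else 0, tweet_id))
    conn.commit()

# Made-up stream content standing in for the captured JSON file
lines = ['{"id": 1, "text": "Loving my apple"}', "",
         '{"delete": {"status": {"id": 5}}}',
         '{"id": 2, "text": "apple sauce"}']
conn = sqlite3.connect(":memory:")
load_tweets(lines, conn)
annotate(conn, 1, True)
annotate(conn, 2, False)
rows = conn.execute("SELECT tweet_id, label FROM tweets ORDER BY tweet_id").fetchall()
print(rows)
```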

Currently I’m using my Python template to allow environment-variable controlled configurations, simple logging, argparse and unittests. I’ll also be using the twitter text python module that I’m now supporting to parse some structure out of the tweets.

I’ll be presenting my progress next week at Brighton Python, my goal is to have a useful MIT-licensed tool that is pre-trained with some obvious brands (e.g. Apple, Orange, Valve, Seat) and software names (e.g. Python, vine, Elite) by the end of this month, with instructions so anyone can train their own models. Assuming all goes well I can then plumb it into my planned annotate.io online service later.


1 Comment | Tags: ArtificialIntelligence, Python, SocialMediaBrandDisambiguator

5 May 2013 - 14:32 June project: Disambiguating “brands” in Social Media

Having returned from Chile last year, settled in to consulting in London, got married and now on honeymoon I’m planning on a change for June.

I’m taking the month off from clients to work on my own project, an open sourced brand disambiguator for social media. As an example this will detect that the following tweet mentions Apple-the-brand:
“I love my apple, though leopard can be a pain”
and that this tweet does not:
“Really enjoying this apple, very tasty”

I’ve used AlchemyAPI, OpenCalais, DBPedia Spotlight and others for client projects and it turns out that these APIs expect long-form text (e.g. Reuters articles) written with good English.

Tweets are short-form, messy, use colloquialisms, can be compressed (e.g. using contractions) and rely on local context (both local in time and social group). Linguistically a lot is expressed in 140 characters and it doesn’t look like “good English”.

A second problem with existing APIs is that they cannot be trained and often don’t know about European brands, products, people and places. I plan to build a classifier that learns whatever you need to classify.

Examples for disambiguation will include Apple vs apple (brand vs e.g. fruit/drink/pie), Seat vs seat (brand vs furniture), cold vs cold (illness vs temperature), ba (when used as an abbreviation for British Airways).

The goal of the June project will be to out-perform existing Named Entity Recognition APIs for well-specified brands on Tweets, developed openly with a liberal licence. The aim will be to solve new client problems that can’t be solved with existing APIs.

I’ll be using Python, NLTK, scikit-learn and Tweet data. I’m speaking on progress at BrightonPy and DataScienceLondon in June.

Probably for now I should focus on having no computer on my honeymoon…


4 Comments | Tags: ArtificialIntelligence, Life, Python, SocialMediaBrandDisambiguator

17 April 2013 - 11:38 Visualising London, Brighton and the UK using Geo-Tweets

Recently I’ve been grabbing Tweets for some natural language processing analysis (in Python using NetworkX and NLTK) – see this PyCon and PyData conversation analysis. Using the London dataset (visualised in the PyData post) I wondered if the geo-tagged tweets would give a good-looking map of London. It turns out that they do:


You can see the bright centre of London, the Thames is visible wiggling left-to-right through the centre. The black region to the left of the centre is Hyde Park. If you look around the edges you can even see the M25 motorway circling the city. This is about a week’s worth of geo-filtered Tweets from the Twitter 10% firehose. It is easier to locate using the following Stamen tiles:


Can you see Canary Wharf and the O2 arena to its east? How about Heathrow to the west edge of the map? And the string of reservoirs heading north north east from Tottenham?

Here’s a zoom around Victoria and London Bridge, we see a lot of Tweets around the railway stations, Oxford Street and Soho. I’m curious about all the dots in the Thames – presumably people Tweeting about their pleasure trips?


Here’s a zoom around the Shoreditch/Tech City area. I was surprised by the cluster of Tweets in the roundabout (Old Street tube station), there’s a cluster in Bonhill Street (where Google’s Campus is located – I work above there in Central Working). The cluster off of Old Street onto Rivington Street seems to be at the location of the new and fashionable outdoor eatery spot (with Burger Bear). Further to the east is a more pubby/restauranty area.


I’ve yet to analyse the content of these tweets (doing something like phrase extraction from the PyCon/PyData tweets onto this map would be great). As such I’m not sure what’s being discussed, probably a bunch of the banal along with chitchat between people (“I’m on my way”…). Hopefully some of it discusses the nearby environment.

I’m using Seth’s Python heatmap (inspired by his lovely visuals). In addition I’m using Stamen map tiles (via OpenStreetMap). I’m using curl to consume the Twitter firehose via a geo-defined area for London, saving the results to a JSON file which I consume later (shout if you’d like the code and I’ll put it in github) – here’s a tutorial.
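The core of any such map is counting tweets per grid cell. The sketch below is not Seth's heatmap code: it is a plain-Python binning of points into a grid, assuming each saved tweet carries a GeoJSON-style "coordinates" field in [longitude, latitude] order. The bounding box is an illustrative rough box around London, not the one used for these plots.

```python
def tweet_points(tweets):
    """Yield (lon, lat) for tweets that carry a point location."""
    for tweet in tweets:
        coords = tweet.get("coordinates")
        if coords:
            lon, lat = coords["coordinates"]
            yield lon, lat

def bin_points(points, bbox, width, height):
    """Count points per grid cell; row 0 is the northern edge of the box."""
    west, south, east, north = bbox
    grid = [[0] * width for _ in range(height)]
    for lon, lat in points:
        if not (west <= lon < east and south <= lat < north):
            continue  # outside the area of interest
        col = int((lon - west) / (east - west) * width)
        row = int((north - lat) / (north - south) * height)
        grid[min(row, height - 1)][min(col, width - 1)] += 1
    return grid

# Made-up tweets: two with locations inside the box, one without a location
tweets = [
    {"text": "a", "coordinates": {"type": "Point", "coordinates": [-0.12, 51.50]}},
    {"text": "b", "coordinates": None},
    {"text": "c", "coordinates": {"type": "Point", "coordinates": [-0.45, 51.47]}},
]
london_bbox = (-0.56, 51.26, 0.28, 51.69)  # west, south, east, north (illustrative)
grid = bin_points(tweet_points(tweets), london_bbox, width=4, height=4)
print(grid)
```

Each cell count can then be mapped to a brightness value and drawn over the map tiles.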

During London Fashion Week I grabbed the tagged tweets (for “#lfw” and those mentioning “london fashion week” in the London area), if you zoom on the official event map you’ll see that the primary Tweet locations correspond to the official venue sites.


What about Brighton? Down on the south coast (about 1 hour on the train south of London), it is where I’ve spent the last 10 years (before my recent move to London). You can see the coastline, also Sussex University’s campus (north east corner). Western Road (the thick line running west a little way back from the sea) is the main shopping street with plenty of bars.


It’ll make more sense with the Stamen tiles, Brighton Marina (south east corner) is clear along with the small streets in the centre of Brighton:


Zooming to the centre is very nice, the North Laines are obvious (to the north) and the pedestrianised area below (the “south laines”) is clear too. Further south we see the Brighton Pier reaching into the sea. To the north west on the edge of the map is another cluster inside Brighton Station:


Finally – what about all the geo-tagged Tweets for the UK (annoyingly I didn’t go far enough west to get Ireland)? I’m pleased to see that the entirety of the mainland is well defined, I’m guessing many of the tweets around the coastline are more from pretty visiting points.


How might this compare with a satellite photograph of the UK at night? Population centres are clearly visible but tourist spots are far less visible, the edge of the country is much less defined (via dailymail):

Europe satellite

I’m guessing we can use these Tweets for:

  • Understanding what people talk about in certain areas (e.g. Oxford Street at rush-hour?)
  • Learning why foursquare check-ins (below) aren’t in the same place as tweet locations (can we filter locations away by using foursquare data?)
  • Seeing how people discuss the weather – is it correlated with local weather reports?
  • Learning if people talk about their environment (e.g. too many cars, poor London tube climate control, bad air, too noisy, shops and signs, events)
  • Seeing how shops, gigs and events are discussed – could we recommend places and events in real time based on their discussion?
  • Figuring out how people discuss landmarks and tourist spots – maybe this helps with recommending good spots to visit?
  • Looking at the trail people leave as they Tweet over time – can we figure out their commute and what they talk about before and after? Maybe this is a sort of survey process that happens using public data?

Here are some other geo-based visualisations I’ve recently seen:

If you want help with this sort of work then note that I run my own AI consultancy, analysing and visualising social media like Twitter is an active topic for me at present (and will be more so via my planned API at annotate.io).


13 Comments | Tags: ArtificialIntelligence, Data science, Entrepreneur, Life, Python

2 April 2013 - 8:32 Applied Parallel Computing (PyCon 2013 Tutorial) slides and code

Minesh B. Amin (MBASciences) and I (Mor Consulting Ltd) taught Applied Parallel Computing over 3 hours at PyCon 2013. PyCon this year was a heck of a lot of fun, I did the fun run (mentioned below), received one of the 2500 free RaspberryPis that were given away, met an awful lot of interesting people and ran two birds-of-a-feather sessions (parallel computing for our tutorial, another on natural language processing).

I held posting this entry until the video was ready (it came out yesterday). All the code and slides are in the github repo. Currently (but not indefinitely) there’s a VirtualBox image with everything (Redis, Disco etc) pre-installed.

After the conference, partly as a result of the BoF NLP session I created a Twitter graph “Concept Map” based on #pycon tweets, then another for #pydata. They neatly summarise many of the topics of conversation.

Here’s our room of 60+ students, slides and video are below:

Applied Parallel Computing PyCon 2013 (left side of room)

Applied Parallel Computing PyCon 2013 (left side)

The video runs for 2 hours 40:

Here’s a list of our slides:

  1. Intro to Parallelism (Minesh)
  2. Lessons Learned (Ian)
  3. List of Tasks with Mandelbrot set (Ian)
  4. Map/Reduce with Disco (Ian)
  5. Hyperparameter optimisation with grid and random search (Minesh)

These are each of the slide decks:


I also had fun in the 5k fun run (coming around 77th of 150 runners), we raised $7k or so for cancer research and the John Hunter Memorial Fund.


5 Comments | Tags: ArtificialIntelligence, Life, Python

22 March 2013 - 1:16 Analysing #pydata, London and Brighton tweets for concept mapping

Below I’ve visualised tweets for #PyData conference and the cities of London and Brighton – this builds on my ‘concept cloud‘ from a few days ago at the #PyCon conference. Props to Maksim for his Social Media Analysis tutorial for inspiration.

Update – Maksim’s Analysing Social Networks tutorial video is online.

For the earlier #PyCon 2013 analysis I visualised #hashtags and @usernames from #pycon tagged tweets during the conference. I’ve built upon this to add some natural language processing for ‘noun phrase extraction’ which I detail below – this helps me to pull out phrases that are descriptive but haven’t been tagged. It also helps us to see which people are connected with which subjects. For the PyCon analysis I collected 22k tweets, after removing retweets I was left with 7,853 for analysis.
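The graph behind a concept map like this starts from co-occurrence counts. The sketch below uses only the standard library (no NetworkX) and assumes that a retweet can be recognised by a leading "RT "; the pattern, function name and example tweets are made up for illustration.

```python
import re
from collections import Counter
from itertools import combinations

ENTITY = re.compile(r"[#@]\w+")  # #hashtags and @usernames

def cooccurrence_edges(tweets):
    """Count how often each pair of entities appears in the same tweet."""
    edges = Counter()
    for tweet in tweets:
        if tweet.startswith("RT "):
            continue  # retweets are removed before analysis
        entities = sorted({e.lower() for e in ENTITY.findall(tweet)})
        edges.update(combinations(entities, 2))
    return edges

# Made-up tweets
tweets = [
    "Great #pycon talk by @fperez_org on #ipython",
    "RT @someone: Great #pycon talk by @fperez_org on #ipython",
    "#pycon #ipython sprint",
]
edges = cooccurrence_edges(tweets)
print(edges.most_common())
```

Each pair and its count can then be loaded into a graph library as a weighted edge for layout and plotting.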

#PyData (PyData Santa Clara 2013)


PyData 2013 is a much smaller conference than PyCon (PyCon had 2,500 people and 20% female attendance, PyData had around 400 with 10% female attendance). Being smaller it had far fewer tweets – after removing retweets I had just 225 tweets to analyse. Cripes! This is clearly not big data. The other problem was that people weren’t using many #hashtags, they were referring to topics using natural language. For example:

“Peter Norvig was giving a talk at PyData in Santa Clara, CA on the topic of innovation in education.” (source)

Clearly some natural language processing was required. I took two approaches:

  • Extract capitalised sub-phrases (e.g. “Peter Norvig”, “Santa Clara”) of one or more words
  • Use NLTK’s bigram collocation analyser (to find lowercased phrases such as “ipython notebook”, “machine learning”)
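
The two approaches in the list above can be sketched with just the standard library. This is a hedged approximation rather than the original code: the regular expression, the tiny stopword list, the `capitalised_phrases` and `frequent_bigrams` helpers and the lowercase sample texts are assumptions, and a raw bigram frequency count is swapped in for NLTK's collocation scoring.

```python
import re
from collections import Counter

# One or more consecutive Capitalised words, e.g. "Peter Norvig".
# Mixed-case tokens such as "PyData" and all-caps tokens such as "CA" are skipped.
CAP_PHRASE_RE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")

# A deliberately tiny stopword list, assumed for this example only.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "at", "to",
             "and", "is", "was", "for", "with"}


def capitalised_phrases(text):
    """Approach 1: extract capitalised sub-phrases of one or more words."""
    return CAP_PHRASE_RE.findall(text)


def frequent_bigrams(texts, min_count=2):
    """Approach 2 (simplified): count lowercased word pairs across all texts.

    This is a plain frequency count standing in for a collocation analyser.
    """
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        for first, second in zip(words, words[1:]):
            if first in STOPWORDS or second in STOPWORDS:
                continue
            counts[(first, second)] += 1
    return [(" ".join(pair), n) for pair, n in counts.most_common() if n >= min_count]


quote = ("Peter Norvig was giving a talk at PyData in Santa Clara, CA "
         "on the topic of innovation in education.")
print(capitalised_phrases(quote))

# Invented sample texts for the bigram count.
sample_texts = [
    "The ipython notebook is great for machine learning demos",
    "Loving the ipython notebook at pydata",
    "Talk on machine learning with scikit-learn",
]
print(frequent_bigrams(sample_texts))
```

A word that merely starts a sentence (such as "The") would also be picked up by the capitalised-phrase pattern, so some filtering of single-word matches is likely to be needed on real tweets.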

Starting at the bottom of the plot we see three types of colour:

  • white is for #hashtags
  • light blue is for @usernames
  • dark green is for phrases (extracted using natural language processing)

We see two clusters of references around @fperez_org (Fernando Perez of IPython): one is around @swcarpentry (the scientist-friendly software carpentry movement), the other is around IPython and the IPython Notebook (@minrk of IPython/parallel is linked too). I like the connection to Julia – Fernando discussed during his keynote that Julia now interoperates with Python.

The day before we had Peter Norvig (Director of research at Google) giving a keynote on the use of Python in education at Udacity including a discussion of how machine learning could be used to identify the mistakes that new coders make so we could make friendlier error messages to help students correct their code. See the clustering around this at the top of the graph.

Later the same day Henrik (@brinkar) spoke on Wise.io‘s Random Forest classifier. Their approach was efficient enough to demo live on a RaspberryPi. The connection from Peter to Henrik goes via #venturebeat who covered wise.io’s new software release during the conference.

Connecting IPython and Wise.io is @ogrisel (Olivier Grisel) of scikit-learn. He gave an impressive (and given the variability of conference wifi – slightly ballsy) live demo of scaling a machine learning system via IPython Parallel on EC2.

In the middle we see @teoliphant (Travis Oliphant) joined to Continuum (his company). Off to the right I get to blow my own trumpet – the phrases “awesome python” and “network analysis” connect to “russel brand” which is how one wag described my lightning talk. I got a chance to demo the earlier version of this at the end of @katychuang‘s talk on networkx.

London (geo-tagged tweets)


For the last month I’ve been grabbing tweets in the London geo area for another project. I had to raise my filtering levels to bring the network down to a sane (and easily visualised) number of nodes. After removing ReTweets I have 497,771 tweets from just a subset of my data. Some obvious clusters can be seen:

  • #weather and #rain and (presumably a rather wet) “St Albans” (a very British discussion)
  • The “O2 Arena” near the centre with “Justin Beiber” and #believetour, linked with #amazing, #excited, #nowplaying
  • @onedirection must have been playing (connected with band members @louis_tomlinson and @real_liam_payne amongst others)
  • To the top-right we have a football cluster with “Manchester United”, “Champions League”, #cpfc, #realmadrid and “Old Trafford”
  • The usual tourist spots like “Tower Bridge”, “Covent Garden”, “Hyde Park”, “Big Ben”, “Trafalgar Square” are discussed with #happy #sun #loveit; linked just off here are “London Heathrow Airport” and “New York”
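
One plausible way to bring a network like this down to a sane number of nodes is to link terms that appear in the same tweet and keep only the links seen often enough. The sketch below assumes that approach; the `min_weight` threshold, the `cooccurrence_edges` and `node_kind` helpers and the sample term lists are hypothetical and are not taken from the repo.

```python
from collections import Counter
from itertools import combinations


def node_kind(term):
    """Classify a term into the three node types used for colouring the plots."""
    if term.startswith("#"):
        return "hashtag"
    if term.startswith("@"):
        return "username"
    return "phrase"


def cooccurrence_edges(term_lists, min_weight=2):
    """Count how often two terms share a tweet; keep pairs seen >= min_weight times."""
    weights = Counter()
    for terms in term_lists:
        # Sort so that each pair is always stored in the same order.
        for a, b in combinations(sorted(set(terms)), 2):
            weights[(a, b)] += 1
    return {edge: w for edge, w in weights.items() if w >= min_weight}


# Invented per-tweet term lists (hashtags, usernames and extracted phrases).
tweet_terms = [
    ["#weather", "#rain", "St Albans"],
    ["#rain", "#weather"],
    ["Tower Bridge", "#happy"],
    ["#happy", "Tower Bridge", "#sun"],
    ["O2 Arena", "#amazing"],
]

edges = cooccurrence_edges(tweet_terms, min_weight=2)
for (a, b), weight in edges.items():
    print(f"{a} ({node_kind(a)}) -- {b} ({node_kind(b)}): {weight}")
```

Raising `min_weight` prunes rarely seen pairs first, which is what shrinks the graph; the surviving edges could then be loaded into a graph library for layout.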

Brighton (geo-tagged tweets)


This is my favourite, analysed using 40,379 tweets after removing ReTweets. The nature of the two cities (Brighton is 50 miles south of London on the coast, it is a university town with a young & party-friendly population) is quite apparent:

  • Top left there is discussion around “One Direction”, #justinbeiber and #seo (a particular Brighton tech thing)
  • Just south of @justinbieber is a single chain of not-safe-for-work ranting (another particular Brighton thing)
  • If you jump to the bottom right you’ll see #underwear, #lingerie, #teenagers – not as dodgy as you might expect, Sweetling were doing a social media bra campaign
  • #hove is joined with #sunny #morning and nearby places #lewes #shoreham
  • #brightonbeach and “Brighton Pier” connect with #birds (Seagulls – a bane!) and #sun
  • #friends, #memories, #happy, #goodtimes, #marina, #fun, #girls cluster around the centre (Brighton does like a party)
  • Off down to the bottom left is some sort of political discussion (what were they doing in Brighton?)

Reproducing this

All the code is on GitHub at twitter_networkx_concept_map, including the one-line cURL command to capture the data. An example .gephi file is included for visualisation in Gephi. The built-in networkx viewer (optionally using GraphViz) works reasonably well but isn’t interactive. Maksim’s tutorial and utils class were jolly useful (utils is in my repo). I’m also using twitter-text-python for parsing @usernames, #hashtags and URLs from the tweets.

If you want some custom work around this, give me a shout via Mor Consulting.

19 Comments | Tags: ArtificialIntelligence, Life, Python

7 March 2013 - 17:10 PowerPoint: Brief Introduction to NLProc. for Social Media

For my client (AdaptiveLab) I recently gave an internal talk on the state of the art of Natural Language Processing around Social Media (specifically Twitter and Facebook), having spent a few days digesting recent research papers. The area is fascinating (I want to do some work here via my Annotate.io) as the text is so much dirtier than in long form entries such as we might find with Reuters and BBC News.

The PowerPoint below is just the outline; I also gave some brief demos using NLTK (a great Python NLP library).

2 Comments | Tags: ArtificialIntelligence, Data science, Life