About


This is Ian Ozsvald's blog. I'm an entrepreneurial geek, a Data Science/ML/NLP/AI consultant, founder of the Annotate.io social media mining API, author of O'Reilly's High Performance Python book, co-organiser of PyDataLondon, co-founder of the SocialTies App, author of the A.I.Cookbook and The Screencasting Handbook, a Pythonista, co-founder of ShowMeDo and FivePoundApps, and a Londoner. Here's a little more about me.


1 November 2013 - 12:10 “Introducing Python for Data Science” talk at SkillsMatter

On Wednesday Bart and I spoke at SkillsMatter to 75 Pythonistas with an Introduction to Data Science using Python. A video of the 4 talks is now online. We covered:

Since the group is more of a general programming community, we wanted to talk at a high level about the various ways that Python can be used for data science. It was lovely to have such a large turn-out, and the pub conversation that followed was much fun.


Ian applies Data Science as an AI/Data Scientist for companies in Mor Consulting, founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

16 Comments | Tags: Data science, Life, Python

7 October 2013 - 17:10 Future Cities Hackathon (@ds_ldn) Oct 2013 on Parking Usage Inefficiencies

On Saturday six of us attended the Future Cities Hackathon organised by Carlos and DataScienceLondon (@ds_ldn). I counted about 100 people in the audience (see lots of photos, original meetup thread) and, from asking around, there seemed to be a very diverse skill set (Python and R as expected, lots of Java/C, Excel and other tools). There were several newly-released data sets to choose from. We spoke with Len Anderson of SocITM, who works with Local Government; he suggested that the parking datasets for Westminster Ward might be interesting, as results with an economic outcome might actually do something useful for Government policy. This seemed like a sensible reason to tackle the data. Other data sets included flow-of-people and ASBO/dog-mess/graffiti recordings.

Overall we won an ‘honourable mention’ for showing that the data supported a way of changing parking behaviour, and for introducing a dynamic pricing model so that parking spaces might be better utilised and generate increased revenue for the council. I suspect that there are more opportunities for improving the efficiency of static systems as the government opens more data here in the UK.

Sidenote – previously I’ve thought about the replacement of delivery drivers with self-driving cars and other outcomes of self-driving vehicles, the efficiencies discussed here connect with those ideas.

With the parking datasets we had over 4 million lines of cashless parking-meter payments for 2012-13 in Westminster to analyse, tagged with duration (you buy a ticket at a certain time for a fixed period such as 30 minutes or 2 hours) and a latitude/longitude for location. We also had a smaller dataset of parking offence tickets (with date/time and location – but only street name, not latitude/longitude) and a third set with readings from the small number of parking sensors in Westminster.

Ultimately we produced a geographic plot of over 1000 parking bays, coloured by average percentage occupancy in Westminster. The motivation was to show that some bays are well used (i.e. often have a car parked in them) whilst other areas are under-utilised and could take a higher load (darker means better utilised):

Westminster Parking Bays by Percentage Occupancy

At first we thought we’d identified a striking result. After a few more minutes hacking (around 9.30pm on the Saturday) we pulled out the pricing per bay and noted that it was actually quite varied and confusing, so a visitor to the area would have a hard time figuring out which bays were likely to be both under-utilised and cheap (darker means more expensive):

Westminster parking bays by cost

If we’d had more time we’d have checked which bays were likely to be under-utilised and cheap and ranked the best bays in various areas. One can imagine turning this into a smartphone app to help visitors and locals find available parking.

The video below shows the cost and availability of parking over the course of the day. Opacity (how see-through it is) represents the expense – darker means more expensive (so you want to find very see-through areas). Size represents the number of free spaces: bigger means more free space, while smaller (i.e. during the working day) means there are few free spaces:

Behind this model we captured the minute-by-minute stream of ticket purchases by lat/lng to model the occupancy of bays; the data also records the maximum number of bays that can be used (though the payment machines don’t know how many are in use – we had to model this). Using Pandas we modelled usage over time (+1 for each ticket purchase and -1 for each expiry). The red line shows the maximum number of bays that are available; the sections over the line suggest that people aren’t parking for their full allocation (e.g. you might buy an hour’s ticket but only stay for 20 minutes, then someone else buys a ticket and uses the same bay):

parking_starts_and_ends
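
A minimal sketch of that +1/-1 model in Pandas – the file name and the start/end column names here are assumptions, not the hackathon code:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed columns: 'start' and 'end' hold each ticket's purchase and expiry datetimes
df = pd.read_csv("bay_payments.csv", parse_dates=["start", "end"])

# +1 at every ticket purchase, -1 at every expiry, accumulated over time
events = pd.concat([
    pd.Series(1, index=df["start"]),
    pd.Series(-1, index=df["end"]),
]).sort_index()

occupancy = events.cumsum()  # estimated cars parked at each moment
occupancy.plot()             # compare against the known number of bays (the red line above)
plt.show()
```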

We extended the above model for one Tuesday over all 1,000+ parking bays in Westminster.

Additionally this analysis shows the times and days when parking tickets are most likely to be issued. The 1am and 3am results were odd; Sunday (day 6) is clearly the quietest, and weekdays at 9am are clearly the worst:

parking_fines_bucketed_over_many_weeks_cropped

Conclusion:

We believe that this carrot-and-stick approach to parking management (showing where to park – and noting that you’ll likely get fined if you don’t do it properly) should increase the correct utilisation of parking bays in Westminster, which would help to reduce congestion and driver frustration whilst increasing income for the local council.

Update – at least one parking area in New Zealand is experimenting with truly dynamic demand-based pricing.

We also believe the data could be used by Traffic Wardens to better patrol the high-risk areas to deter poor parking (e.g. double-parking), which can be a traffic hazard (e.g. by obstructing the road for larger vehicles like Fire Engines). The static dataset we used could certainly be processed into a form suitable for a smartphone app, and updated as new data sets are released.

Our code is available in this github repo: ParkingWestminster.

Here’s our presentation:

Team:

Tools used:

  • Python and IPython
  • Pandas
  • QGIS (visualisation of shapefiles backed by OpenLayers maps from Google and OSM)
  • pyshp to handle shapefiles
  • Excel (quick analysis of dates and times, quick visualisation of lat/lng co-ords)
  • HackPad (useful for lightweight note/URL sharing and code snippet collaboration)

 Some reflections for future hackathons:

  • Pre-cleaning of data would speed team productivity – we all hacked various approaches to fixing the odd Date and separate Time fields in the CSV data, and I suspect many in the room solved this same problem over the first hour or two. We should have flagged the issue early on, had a couple of us solve it, and written out a new 1.4GB fixed CSV file for all to share (something like the cleaning sketch after this list)
  • Decide early on a goal – for us it was “work to show that a dynamic pricing model is feasible” – that lets you frame and answer early questions (quite possibly an hour in we’d have discovered that the data didn’t support our hypothesis – thankfully it did!)
  • Always visualise quickly – whilst I wrote a new shapefile to represent the lat/lng data Bart just loaded it into Excel and did a scatter plot – super quick and easy (and shortly after I added the Map layer via QGIS so we could line up street names and validate we had sane data)
  • Check for outliers and odd data – we discovered lots of NaN lines (easily caught and either deleted or fixed using Pandas); when these were output and visualised, QGIS interpreted them as extreme but legal values, so early on we had some odd visuals until we eyeballed the generated CSV files. Always watch for NaNs!
  • It makes sense to print a list of extreme and normal values for a column, again as a sanity check – histograms are useful, as are sets of unique values if you have categories
  • Question whether the result we see would actually match reality – having spent hours on a problem it is nice to think you’ve visualised something new and novel, but probably the data you’re drawing on is already integrated into people’s behaviour (e.g. in our case at least some drivers in Westminster will already know where the cheap, under-utilised parking spaces are – so there shouldn’t be too many left to find)
  • Set up a github repo early and make sure all the team can contribute (some of our team weren’t experienced with github so we deferred this step and ended up emailing code…that was a poor use of time!)
  • Go visit the other teams – we hacked so intently we forgot to talk to anyone else…I’m sure we’d have learned and skill-shared had we actually stepped away from our keyboards!
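
To make the first and fourth points concrete, a cleaning pass along these lines would have saved us an hour or two (a sketch only – the file names and the Date/Time/latitude/longitude column names are assumptions):

```python
import pandas as pd

# Assumed raw columns: separate 'Date' and 'Time' text fields plus latitude/longitude
raw = pd.read_csv("westminster_payments_raw.csv")

# Combine the awkward separate Date and Time fields into one proper datetime column
raw["timestamp"] = pd.to_datetime(raw["Date"] + " " + raw["Time"], dayfirst=True)

# Drop rows with missing coordinates so NaNs never reach QGIS as "extreme but legal" values
clean = raw.dropna(subset=["latitude", "longitude"])

# Write a single fixed file once, for the whole team to share
clean.to_csv("westminster_payments_clean.csv", index=False)
```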

Update – Stephan Hügel has a nice article on various Python tools for making maps of London wards, his notes are far more in-depth than the approach we took here.

Update – nice picture of London house prices by postcode, this isn’t strictly related to the above but it is close enough. Visualising the workings of the city feels rather powerful. I wonder how the house prices track availability of public transport and local amenities?



6 Comments | Tags: Data science, Life, Python

17 June 2013 - 20:13 Demonstrating the first Brand Disambiguator (a hacky, crappy classifier that does something useful)

Last week I had the pleasure of talking at both BrightonPython and DataScienceLondon to about 150 people in total (Robin East wrote up the DataScience night). The updated code is in github.

The goal is to disambiguate the word-sense of a token (e.g. “Apple”) in a tweet as being either the-brand-I-care-about (in this case – Apple Inc.) or anything-else (e.g. apple sauce, Shabby Apple clothing, apple juice etc). This is related to named entity recognition; here I’m exploring simple techniques for disambiguation. In both talks people asked if this could classify an arbitrary tweet as being “about Apple Inc or not” and whilst this is possible, for this project I’m restricting myself to the (achievable, I think) goal of robust disambiguation within the 1 month timeline I’ve set myself.

Below are the slides from the longer of the two talks at BrightonPython:

As noted in the slides for week 1 of the project I built a trivial LogisticRegression classifier using the default CountVectorizer, applied a threshold and tested the resulting model on a held-out validation set. Now I have a few more weeks to build on the project before returning to consulting work.

Currently I use a JSON file of tweets filtered on the term ‘apple’, obtained using the free streaming API from Twitter using cURL. I then annotate the tweets as being in-class (apple-the-brand) or out-of-class (any other use of the term “apple”). I used the Chromium Language Detector to filter non-English tweets and also discarded English tweets that I couldn’t disambiguate for this data set. In total I annotated 2014 tweets. This set contains many duplicates (e.g. retweets) which I’ll probably thin out later, as they possibly over-represent the real frequency of important tokens.

Next I built a validation set using 100 in- and 100 out-of-class tweets at random and created a separate test/train set with 584 tweets of each class (a balanced set from the two classes but ignoring the issue of duplicates due to retweets inside each class).

To convert the tweets into a dense matrix for learning I used the CountVectorizer with all the defaults (simple tokenizer [which is not great for tweets], minimum document frequency=1, unigrams only).

Using the simplest possible approach that could work, I trained a LogisticRegression classifier with all its defaults on the dense matrix of 1168 inputs. I then applied this classifier to the held-out validation set using a confidence threshold (>92% for in-class; anything less is assumed to be out-of-class). It classifies 51 of the 100 in-class examples as in-class and makes no errors (100% precision, 51% recall). This threshold was chosen arbitrarily on the validation set rather than deriving it from the test/train set (poor hackery on my part), but it satisfied me that this basic approach was learning something useful from this first data set.
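
The real code is in the github repo linked above; the shape of the approach, with tiny stand-in data, looks roughly like this:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny stand-in data; the real sets hold 584 annotated tweets per class
train_tweets = ["Tim Cook unveils the new iPhone", "apple sauce with pork is lovely"]
train_labels = [1, 0]  # 1 = apple-the-brand, 0 = any other use of "apple"
validation_tweets = ["new ipad launch rumoured", "baking an apple pie tonight"]

vectorizer = CountVectorizer()  # all defaults: simple tokenizer, unigrams, min_df=1
X_train = vectorizer.fit_transform(train_tweets)

clf = LogisticRegression()      # all defaults
clf.fit(X_train, train_labels)

# Arbitrary confidence threshold: only call a tweet in-class if the model is >92% sure
probs = clf.predict_proba(vectorizer.transform(validation_tweets))[:, 1]
predictions = (probs > 0.92).astype(int)
print(predictions)
```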

The strong (but not at all generalised!) result for this very basic LogisticRegression classifier will be due to token artefacts in the time period I chose (March 13th 2013 around 7pm for the 2014 tweets). Extracting the top features from LogisticRegression shows that it is identifying terms like “Tim”, “Cook” and “CEO” as significant features (along with features you’d expect to see like “iphone”, “sauce” and “juice”) – this is due to their prevalence in this small dataset, where such examples are very frequent. Once a larger dataset is used this advantage will disappear.

I’ve added some TODO items to the README, maybe someone wants to tinker with the code? Building an interface to the open source DBPediaSpotlight (based on WikiPedia data using e.g. this python wrapper) would be a great start for validating progress, along with building some naive classifiers (a capital-letter-detecting one and a more complex heuristic-based one, to use as controls against the machine learning approach).

Looking at the data, 6% of the out-of-class examples are retweets and 20% of the in-class examples are retweets. I suspect that the repeated strings are distorting each class, so they need to be thinned out to leave just one unique example of each tweet.

Counting the number of capital letters in-class and out-of-class might be useful, in this set a count of <5 capital letters per tweet suggests an out-of-class example:

nbr_capitals_scikit_testtrain_apple
This histogram of tweet lengths for in-class and out-of-class tweets might also suggest that shorter tweets are more likely to be out-of-class (though the evidence is much weaker):

histogram_tweet_lengths_scikit_testtrain_apple
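
Both heuristics are cheap to compute – something like the following (the example tweets are made up):

```python
def capital_count(tweet):
    """Number of upper-case characters; <5 hinted at out-of-class in this data set."""
    return sum(1 for ch in tweet if ch.isupper())

def tweet_length(tweet):
    """Character length; shorter tweets looked slightly more likely to be out-of-class."""
    return len(tweet)

print(capital_count("Tim Cook announces the new iPhone at WWDC"))  # 7
print(capital_count("making apple crumble for tea"))               # 0
```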

Next I need to:

  • Update the docs so that a contributor can play with the code; this includes exporting a list of tweet-ids and class annotations so the data can be archived and recreated
  • Spend some time looking at the most-important features (I want to properly understand the numbers so I know what is happening); I’ll probably also use a Decision Tree (and maybe RandomForests) to see what they identify, since they’re much easier to debug
  • Improve the tokenizer so that it respects some of the structure of tweets (preserving #hashtags and @users would be a start, along with URLs – see the sketch after this list)
  • Build a bigger data set that doesn’t exhibit the easily-fitted unigrams that appear in the current set
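
A regex-based sketch of that tokenizer improvement might start like this (hypothetical – not the project’s final tokenizer); the resulting callable can be passed to CountVectorizer via its tokenizer argument:

```python
import re

# Keep URLs, @users and #hashtags as single tokens, then fall back to plain word characters
TOKEN_RE = re.compile(r"https?://\S+|@\w+|#\w+|\w+")

def tweet_tokenize(text):
    return TOKEN_RE.findall(text.lower())

print(tweet_tokenize("Loving the new #iPhone from @Apple http://apple.com"))
# ['loving', 'the', 'new', '#iphone', 'from', '@apple', 'http://apple.com']
```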

Longer term I’ve got a set of Homeland tweets (to disambiguate the TV show vs references to the US Department and various sayings related to the term) which I’d like to play with – I figure making some progress here opens the door to analysing media commentary in tweets.



2 Comments | Tags: ArtificialIntelligence, Data science, Life, Python, SocialMediaBrandDisambiguator

17 April 2013 - 11:38 Visualising London, Brighton and the UK using Geo-Tweets

Recently I’ve been grabbing Tweets for some natural language processing analysis (in Python using NetworkX and NLTK) – see this PyCon and PyData conversation analysis. Using the London dataset (visualised in the PyData post) I wondered if the geo-tagged tweets would give a good-looking map of London. It turns out that they do:

london_all_r1_nomap

You can see the bright centre of London, the Thames is visible wiggling left-to-right through the centre. The black region to the left of the centre is Hyde Park. If you look around the edges you can even see the M25 motorway circling the city. This is about a week’s worth of geo-filtered Tweets from the Twitter 10% firehose. It is easier to locate using the following Stamen tiles:

london_all_r5

Can you see Canary Wharf and the O2 arena to its east? How about Heathrow to the west edge of the map? And the string of reservoirs heading north north east from Tottenham?

Here’s a zoom around Victoria and London Bridge; we see a lot of Tweets around the railway stations, Oxford Street and Soho. I’m curious about all the dots in the Thames – presumably people Tweeting about their pleasure trips?

centrallondon_r3_map

Here’s a zoom around the Shoreditch/Tech City area. I was surprised by the cluster of Tweets in the roundabout (Old Street tube station), there’s a cluster in Bonhill Street (where Google’s Campus is located – I work above there in Central Working). The cluster off of Old Street onto Rivington Street seems to be at the location of the new and fashionable outdoor eatery spot (with Burger Bear). Further to the east is a more pubby/restauranty area.

london_shoreditch_all

I’ve yet to analyse the content of these tweets (doing something like phrase extraction from the PyCon/PyData tweets onto this map would be great). As such I’m not sure what’s being discussed – probably a bunch of the banal along with chitchat between people (“I’m on my way”…). Hopefully some of it discusses the nearby environment.

I’m using Seth’s Python heatmap (inspired by his lovely visuals). In addition I’m using Stamen map tiles (via OpenStreetMap). I’m using curl to consume the Twitter firehose via a geo-defined area for London, saving the results to a JSON file which I consume later (shout if you’d like the code and I’ll put it in github) – here’s a tutorial.
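
The consuming step is simple enough – assuming the curl output was saved as one tweet JSON object per line (the file name and error handling here are assumptions), pulling out the coordinates for the heatmap looks roughly like this:

```python
import json

points = []
with open("london_tweets.json") as f:
    for line in f:
        try:
            tweet = json.loads(line)
        except ValueError:
            continue  # skip keep-alive blanks and truncated lines
        coords = tweet.get("coordinates")  # GeoJSON point: [longitude, latitude]
        if coords:
            lng, lat = coords["coordinates"]
            points.append((lat, lng))

# 'points' is then fed to the heatmap script as (latitude, longitude) pairs
```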

During London Fashion Week I grabbed the tagged tweets (for “#lfw” and those mentioning “london fashion week” in the London area); if you zoom on the official event map you’ll see that the primary Tweet locations correspond to the official venue sites.

lfw

What about Brighton? Down on the south coast (about 1 hour on the train south of London), it is where I’ve spent the last 10 years (before my recent move to London). You can see the coastline, also Sussex University’s campus (north east corner). Western Road (the thick line running west a little way back from the sea) is the main shopping street with plenty of bars.

brighton_gps_to0103_nomap

It’ll make more sense with the Stamen tiles; Brighton Marina (south east corner) is clear, along with the small streets in the centre of Brighton:

brighton_gps_to0403_map

Zooming to the centre is very nice: the North Laines are obvious (to the north) and the pedestrianised area below (the “south laines”) is clear too. Further south we see the Brighton Pier reaching into the sea. To the north west on the edge of the map is another cluster inside Brighton Station:

brighton_gps_to0403_map_zoomed

Finally – what about all the geo-tagged Tweets for the UK (annoyingly I didn’t go far enough west to get Ireland)? I’m pleased to see that the entirety of the mainland is well defined; I’m guessing many of the tweets around the coastline come from pretty spots that people visit.

uk_gps_to0404_map_r5_zoomed

How might this compare with a satellite photograph of the UK at night? Population centres are clearly visible but tourist spots are far less so, and the edge of the country is much less defined (via dailymail):

Europe satellite

I’m guessing we can use these Tweets for:

  • Understanding what people talk about in certain areas (e.g. Oxford Street at rush-hour?)
  • Learning why foursquare check-ins (below) aren’t in the same place as tweet locations (can we filter locations away by using foursquare data?)
  • Seeing how people discuss the weather – is it correlated with local weather reports?
  • Learning if people talk about their environment (e.g. too many cars, poor London tube climate control, bad air, too noisy, shops and signs, events)
  • Seeing how shops, gigs and events are discussed – could we recommend places and events in real time based on their discussion?
  • Figuring out how people discuss landmarks and tourist spots – maybe this helps with recommending good spots to visit?
  • Looking at the trail people leave as they Tweet over time – can we figure out their commute and what they talk about before and after? Maybe this is a sort of survey process that happens using public data?

Here are some other geo-based visualisations I’ve recently seen:

If you want help with this sort of work then note that I run my own AI consultancy; analysing and visualising social media like Twitter is an active topic for me at present (and will be more so via my planned API at annotate.io).



13 Comments | Tags: ArtificialIntelligence, Data science, Entrepreneur, Life, Python

7 March 2013 - 17:10 PowerPoint: Brief Introduction to NLProc. for Social Media

For my client (AdaptiveLab) I recently gave an internal talk on the state of the art of Natural Language Processing around Social Media (specifically Twitter and Facebook), having spent a few days digesting recent research papers. The area is fascinating (I want to do some work here via my Annotate.io) as the text is so much dirtier than in long form entries such as we might find with Reuters and BBC News.

The PowerPoint below is just the outline; I also gave some brief demos using NLTK (a great Python NLP library).

 



2 Comments | Tags: ArtificialIntelligence, Data science, Life

10 February 2013 - 14:28 Applied Parallel Computing at PyCon 2013 (March)

Minesh B. Amin (MBA Sciences) and I (Mor Consulting) are teaching Applied Parallel Computing at PyCon in San Jose in just over a month; here’s an outline of the tutorial. The conference is sold out but there are still tickets for the tutorials (note that they’re selling quickly too).

Typically a recording of the tutorial is released a couple of months after PyCon to PyVideo – you miss out on the networking but you can at least catch up on the material. The source code will also be released.

Our tutorial uses a lot of tools so we’re providing a VirtualBox image (32 bit, requiring about 5GB of disk space; runs on Win/Lin/Mac). Those who choose not to use the VBox image will have to install the requirements themselves; for some parts this is a bit tough so we strongly recommend using the VBox image. Details of the image will be provided to students a few weeks before the conference.

Parts of my tutorial build on my PyCon 2012 High Performance Python 1 tutorial. You might also be interested in the (slightly vague!) idea I have of writing a book on these topics – if so you should add your name to my High Performance Python Mailing List (it is an announce list for when/if I make progress on this project, very lightweight).

This year’s 3 hour tutorial is split into five sections:

  1. Types of parallelism
  2. Hard-won lessons in building reliable/debuggable/extensible parallel systems
  3. “List of tasks” – solving a Mandelbrot task using multiprocessing (single machine), parallelpython (can run multi-machine), redis queue (multi machine and language)
  4. “Map/reduce” – investigating and understanding a set of Tweets using Disco, practical guide to configuration, visualisation with word-cloud and matplotlib, possibly moving on to social network connectivity analysis and visualisation
  5. “Hyperparameter optimisation” – solving a many-parameter optimisation problem whose parameter space is not fixed at the start of the run

During the Mandelbrot solver we’ll look at where the complexity lies in generating an image like this:

Mandelbrot Surface
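
To give a flavour of the “list of tasks” pattern from section 3, a pared-down multiprocessing version might look like this (just a sketch – the tutorial material goes much further):

```python
from multiprocessing import Pool

def mandelbrot_point(args):
    """Escape-time count for one complex coordinate - one 'task' from the list."""
    c, max_iter = args
    z = 0
    for n in range(max_iter):
        z = z * z + c
        if abs(z) > 2:
            return n
    return max_iter

if __name__ == "__main__":
    width, height, max_iter = 300, 200, 100
    # Build the list of tasks: one complex point per output pixel
    tasks = [(complex(-2 + 3 * x / width, -1 + 2 * y / height), max_iter)
             for y in range(height) for x in range(width)]
    with Pool() as pool:  # one worker process per CPU core by default
        counts = pool.map(mandelbrot_point, tasks, chunksize=1000)
```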

During the Disco problem we’ll visualise the results using Andreas’ word-cloud tool, we may also cover the use of map/reduce for social network exploration:

Word-cloud of Apple mentions

Install requirements will be announced closer to the tutorial along with the (recommended!) VirtualBox image. I’m probably providing more material than we can cover for my two sections (Mandelbrot, Disco – how far we get depends on the size and capabilities of the class); all the material will be provided for keen students to continue, and we’ll run an after-class session for those with more questions.

 



No Comments | Tags: Data science, Python

23 January 2013 - 0:09 Layers of “data science”?

The field of “data science” covers a lot of areas; it feels like there’s a continuum of layers to consider, and lumping them all together as “data science” is perhaps less helpful than it could be. Maybe by sharing my list you can offer me further insight. In terms of unlocking value in the underlying data, I see the layers running from least to most valuable:

  • Storing data
  • Making it searchable/accessible
  • Augmenting it to fashion new data and insights
  • Understanding what drives the trends in the data
  • Predicting the future

Storing a “large” amount of data has always been feasible (data warehouses of the 90s don’t sound all that different to our current Big Data processing needs). If you’re dealing with daily Terabyte dumps from telecoms, astro arrays or the LHC then storing it might not be economical, but it feels like more companies can easily store more data this decade than in previous decades.

Making the data instantly accessible is harder; this used to be the domain of commercial software and now we have the likes of postgres, mongodb and solr, which scale rather well (though there will always be room for higher-spec solutions that deal with things like fsync down to the platter level reliably regardless of power supply, and that model less usual data structures like graphs efficiently). Since CPUs are cheap, building a cluster of commodity high-spec machines is no longer a heavy task.

Augmenting our data can make it more valuable. For example – applying sentiment analysis to a public tweet stream and adding private demographic information gives YouGov’s SoMA (disclosure – I’m working on this via AdaptiveLab) an edge in the brand-analysis game. Once you start joining datasets you have to start dealing with the thorny problems – how do we deal with missing data? If the tools only work with some languages (e.g. English), how do we deal with other languages (e.g. the variants of Spanish) to offer a similarly good product? How do we accurately disambiguate a mention of “apple” between a fruit and a company?

Modelling textual data is somewhat mainstream (witness the availability of Sentiment, NER and categorisation tools). Doing the same for photographs (e.g. Instagram photos) is in the quite-hard domain (have you ever seen a food-identifier classifier for photos that actually works?). We rarely see any augmentations for video. For audio we have song identification and speech recognition; I don’t recall coming across dog-bark/aeroplane/giggling classifiers (which you might find in YouTube videos). Graph network analysis tools are at an interesting stage: we’re only just witnessing them scale to large amounts of data on commodity PCs, and tying this data to social networks or geographic networks still feels like the domain of commercial tools.

Understanding the trends and communicating them – combining different views on the data to understand what’s really occurring – is hard; it still seems to involve a fair bit of art and experience. Visualisations seem to take us a long way towards intuitively understanding what’s happening. I’ve started to play with a few for tweets, social graphs and email (unpublished as yet). Visualising many dimensions in 2 or 3D plots is rather tricky, doubly so when your data set contains millions of points.

Predicting the future – in ecommerce this would be the pinnacle – understanding the underlying trends well enough to be able to predict future outcomes from hypothesised actions. Here we need mathematical models that are strong enough to stand up to rigorous testing (financial prediction is an obvious example; another would be inventory planning). This requires serious model building and thought, and is solidly the realm of the statistician.

Currently we just talk about “data science”, when often we should be specifying more clearly which sub-domain we’re involved with. Personally I sit somewhere in the middle of this stack, with a goal to move towards the statistical end. I’m not sure how to name these layers; I’d welcome insight.

This is probably too simple a way of thinking about the field – if you have thoughts I’d be most happy to receive them.



2 Comments | Tags: ArtificialIntelligence, Data science, Life

13 January 2013 - 20:10 Map/Reduce (Disco) on millions of tweets

Whilst working on data sciencey problems for AdaptiveLab I’m becoming more involved in simple visualisations for proof-of-concepts for clients. This ties in nicely with my PyCon Parallel Computing tutorial with Minesh. I’ve been prototyping a Disco map/reduce tutorial (part 2 for PyCon) using tweets collected over the life of SocialTies during 2011-2012.

Using 11,645,331 tweets on 1 machine running through Disco with a modified word_count example it is easy to filter to keep tweets with a certain word (“loving” in this case) and to plot a word cloud (thanks Andreas!) of the remaining tweets:

Words in “loving” tweets

Tweet analysis often shows a self-referential nature – here we see “i’m” as one of the most popular words. It is nice to see “:)” making an appearance. Brands mentioned include “Google”, “iPhone”, “iPad”. We also see “thanks”, “love”, “nice” and “watching” along with “London” and “music”. Annoyingly I’m not cleaning the words so we see “it!”, “it.”, “(via” (with erroneous brackets) and the like which clutter the results a bit.

Next I’ve applied “hating” as the filter to the same set:

Words in “hating” tweets

One of the most mentioned words is “people” which is a bit of a shame, along with “i’m”. Thankfully we see some “love” and “loving” there. “apple” appears more frequently than “twitter” or “google”. Lots of related negative words also appear e.g. “stupid”, “hate”, “shit”, “fuck”, “bitch”.

Interestingly few of the terms shown include Twitter users or hashtags.

Finally I tried the same using “apple” on an earlier smaller set (859,157 tweets):

Words in “apple” tweets

Unsurprisingly we see “store”, “iphone”, “ipad” and “steve”. Hashtags include “#wwdc”, “#apple” and “#ipad”. The Twitter accounts shown are errors due to string-matching on “apple”, except for @techcrunch.

I find it interesting to see competitor brands being mentioned in the same tweets (e.g. “google”, “microsoft”, “android”, “samsung”, “amazon”, “nokia”), although the firms are obviously related to “apple”.

An improvement would be to remove words from the chart that match the original pattern (hence removing words like “apple” and “#apple” but keeping everything else). Removing near-duplicate terms (e.g. “apple”, “apples”, “apple’”) and performing common string clean-ups (removing punctuation) would also help.

It would also be good to change the colour channels – perhaps using red for commonly-negative words and green for commonly-positive words, with the rest in a neutral colour. Maybe we could also colour the neutral words differently if they’re commonly associated with the key word (e.g. brands of the key word).

Getting started with Disco was easy enough. The installation takes a few hours (the Disco project instructions assume a certain familiarity with networked systems), after that editing the examples is straightforward. Visualising using Andreas’ code was very straight-forward. The source will be posted around the time of my PyCon tutorial in March.
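
Until then, here is the filter-then-count logic reduced to a single-machine sketch using collections.Counter rather than the Disco API (the tweet file format is an assumption):

```python
import json
from collections import Counter

counts = Counter()
with open("tweets.json") as f:  # assumed format: one tweet JSON object per line
    for line in f:
        try:
            text = json.loads(line).get("text", "")
        except ValueError:
            continue
        words = text.lower().split()
        if "loving" in words:   # keep only tweets containing the filter word
            counts.update(words)

# The (word, count) pairs then drive the word-cloud rendering
print(counts.most_common(50))
```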



4 Comments | Tags: ArtificialIntelligence, Data science, Python

13 December 2012 - 0:23 Office social graph connectivity using NetworkX

I wanted an excuse to play with the Python NetworkX graph visualisation library, and recently I joined AdaptiveLab to consult on some data science & visualisation problems. Thus formed the question – how are we all connected? I figured that looking at who follows us all would yield a little insight into the people we have in common. I’m particularly interested in this question seeing as I was living in Brighton, then lived in Chile for most of the year and have only recently moved to London – my social graph is likely to be disjoint from that of the existing London-based team.

Below I show the follower graph with my new colleagues at the top (James, Kat, Ben, Mark, Steve), Emily, Jon and myself in the middle and my collaborator Balthazar at the bottom:

sample_full_network_thumb

I chose to visualise followers rather than who-we-follow as I cared about the graph of who-pays-(some)-attention-to-us. I figure this is a good surrogate for people who might actually know us, suggesting a good chance that we have friends and colleagues in common.

Balthazar worked in France with me on StrongSteam (whilst I was in Chile); he’s followed by almost nobody from my usual network. Emily and I are a couple, so we’re followed by a lot of the same people. Our friend Jon lives in Brighton and runs the central co-working environment (where we were for 10 years); he is followed by many of the people who follow us. The top of the graph shows that my colleagues are followed by only a few people who follow others in the company (so we all have different social networks), with the exception of boss-James, who shares a set of followers with Jon and myself (I guess because we’re all outspoken in the UK tech scene).

In the above graph I deliberately reduced the number of nodes drawn if they were only connected to one person in the network. Seeing as a few of us have over a thousand followers the graph got  too busy too quickly. Below is a subsampled version of the early network with no limit on the number of one-edge-only nodes:

sample_network_thumb

The subsampled network looks nicely organic, like living cells.

The code is on github as twitter-social-graph-networkx, it includes some patches that have just been added back to the python-twitter module to enable whole-graph downloading. You can use this code to download the follower graph for your own network, then plot it using NetworkX (it is configured to use GraphViz as the plots are faster, you can use pure NetworkX if you don’t have GraphViz). The git project has pickles of my social network so if you satisfy the dependencies, you should be good to plot straight away.
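
If you just want to see the plotting side, a minimal NetworkX sketch with a made-up follower edge list looks like this (the repo handles the Twitter downloading and uses GraphViz layouts for speed; spring_layout keeps the sketch dependency-light):

```python
import networkx as nx
import matplotlib.pyplot as plt

# Hypothetical follower edge list: (follower, followed)
edges = [
    ("alice", "ian"), ("alice", "emily"), ("bob", "ian"),
    ("carol", "jon"), ("carol", "ian"), ("dave", "balthazar"),
]

G = nx.DiGraph()
G.add_edges_from(edges)

# Optionally drop the one-edge-only nodes, as in the first (reduced) plot above
G.remove_nodes_from([n for n in G if G.degree(n) == 1])

pos = nx.spring_layout(G)
nx.draw(G, pos, with_labels=True, node_size=300, font_size=8)
plt.show()
```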



No Comments | Tags: Data science, Life, Python