Ian Ozsvald picture

This is Ian Ozsvald's blog, I'm an entrepreneurial geek, a Data Science/ML/NLP/AI consultant, founder of the Annotate.io social media mining API, author of O'Reilly's High Performance Python book, co-organiser of PyDataLondon, co-founder of the SocialTies App, author of the A.I.Cookbook, author of The Screencasting Handbook, a Pythonista, co-founder of ShowMeDo and FivePoundApps and also a Londoner. Here's a little more about me.

View Ian Ozsvald's profile on LinkedIn Visit Ian Ozsvald's data science consulting business Protecting your bits. Open Rights Group

16 April 2014 - 21:112nd Early Release of High Performance Python (we added a chapter)

Here’s a quick book update – we just released a second Early Release of High Performance Python which adds a chapter on lists, tuples, dictionaries and sets. This is available to anyone who has bought it already (login into O’Reilly to get the update). Shortly we’ll follow with chapters on Matrices and the Multiprocessing module.

One bit of feedback we’ve had is that the images needed to be clearer for small-screen devices – we’ve increased the font sizes and removed the grey backgrounds, the updates will follow soon. If you’re curious about how much paper is involved in writing a book, here’s a clue:

We announce each updates along with requests for feedback via our mailing list.

I’m also planning on running some private training in London later in the year, please contact me if this is interesting? Both High Performance and Data Science are possible.

In related news – the PyDataLondon conference videos have just been released and you can see me talking on the High Performance Python landscape here.

Ian applies Data Science as an AI/Data Scientist for companies in Mor Consulting, founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

No Comments | Tags: High Performance Python Book, Life, pydata, Python

1 November 2013 - 12:10“Introducing Python for Data Science” talk at SkillsMatter

On Wednesday Bart and I spoke at SkillsMatter to 75 Pythonistas with an Introduction to Data Science using Python. A video of the 4 talks is now online. We covered:

Since the group is more of a general programming community we wanted to talk at a high level on the various ways that Python can be used for data science, it was lovely to have such a large turn-out and the following pub conversation was much fun.

Ian applies Data Science as an AI/Data Scientist for companies in Mor Consulting, founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

16 Comments | Tags: Data science, Life, Python

8 October 2013 - 10:23What confusion leads from self driving vehicles and their talking to each other?

This is a light follow-up from my “Do self driving cars make the courier redundant?”  post from January. I’m wondering which first- and second-order effects occur from self-driving cars talking to each other.

Let’s assume they can self-drive and self-park and that they have some ability to communicate with each other. Noting their speed and intent should help self-driving cars make better utilisation of the road (they could drive closer together), they could quickly signal if they have a failure (e.g. “My brake readings have just become odd – everyone pull back! I’m slowing using the secondary brake system”), they can signal that e.g. they intend to reverse park and that other cars should slow further back along the road to avoid having to halt. It is hard to see how a sensibly designed system of self-driving cars could be worse than a similar sized pack of normal humans (who might be tired, overconfident, in a rush etc) behind the wheel.

Would cars deliberately lie? There are many running jokes about drivers (often “elsewhere” in the world) where some may signal one way and then exploit nearby gaps regardless of their signalled intention. Might cars do the same? By design or by poor coding? I’d guess people might mod their driving computer to help them get somewhere faster – maybe they’d ask it to be less cautious in its manoeuvres  (taking turns quicker, giving less distance between other vehicles) or hypermile more closely than a human would. Manufacturers would fight back as these sorts of modifications would increase their liabilities and accidents would damage their brand.

What about poorly implemented protocols? On the Internet with TCP/IP we suffer from bufferbloat – many intermediate devices between packet destinations have varying sized buffers, they all try to cache to manage traffic but we end up with lower throughput and odd jams that are rather unpredictable and contrary to the design goal. Cars could have poor implementations of communication protocols (just as some smartphones and laptop brands have trouble with certain WiFi routers), so they’d fail to talk or maybe talk with errors.

Maybe cars would not communicate directly but would implement some boids-like behaviours based on local sensing (probably more robust but also less efficient due to no longer-range negotiation). Even so local odd behaviours might emerge – two cars backing off from each other, then accelerating to close the gap, then repeating – maybe a group of cars get into an unstable ‘dance’ whilst driving down the motorway. This might only be visible from the air and would look rather inhuman.

Presumably self-driving cars would have to avoid hitting humans at all costs. This might make humans less observant as they cross the road – why look if you know that a car is always anticipating (and avoiding) your arrival into the road? This presumably leaves self-driving cars at the mercy of mischievous humans – leaving out human-like dolls in the road that cause slow-and-avoid behaviours, just for kicks.

Governments are likely to introduce some kind of control overrides into the cars in the name of safety and national security (NSA/GCHQ – looking at you). This is likely to be as secure as the “unbreakable” DVD encryption, since any encryption system released into the wild is subject to various attacks. Having people steal cars or subvert their behaviours once the backdoors and overrides are noticed seems inevitable.

I wonder what sort of second order effects we’d see? I suspect that self-driving delivery vehicles would shift to more night work (when the roads are less congested and possibly petrol is dynamically priced to be cheaper), so roads could be less congested by day (and so could be filled by more humans as they commute longer distances to work?). Maybe people en-mass forget how to drive? More people will never have to drive a car, so we’d need fewer driving instructors. Maybe we’d need fewer parking spaces as cars could self-park elsewhere and return when summoned – maybe the addition of intelligence helps us use parking resources more efficiently?

If we have self-driving trucks then maybe the cost of removals and deliveries drop. No longer would I need to hire a large truck with a driver, instead the truck would drive itself (it’d still need loading/unloading of course). This would mean fewer people taking the larger-vehicle licensing exams, so fewer test centres (just as for driving schools) would be needed.

An obvious addition – if cars can self-drive then repair centres don’t need to be small and local. Whither the local street of car mechanics (inevitably of varying quality and, sadly, honesty)? I’d guess larger, out of town centralised garages more closely monitored by the manufacturers will surface (along with a fleet of pick-up trucks for broken-down vehicles). What happens to the local street of car mechanic shops? More hackspaces and assembly shops? Conversion to housing seems more likely.

If we need less parking spaces (e.g. in Hove [1927 photo!] there are huge boulevards – see Grand Avenue lanes here) then maybe we get more cycle lanes and maybe we can repurpose some of the road space for other usages – communal green patches (for kids and/or for growing stuff?).

The NYTimes has a good article on how driverles cars could reshape cities.

Charles Stross has a nice thread on geo-political consequences of self-driving cars. One comment alludes to improved social lives – if we can get to and from a party/restaurant/pub/nice social scene very easily (without e.g. hoping for the last Tube train home in London or a less pleasant bus journey), maybe our social dimension increases? The comment on flying vs driving  is interesting – you’d probably drive further rather than fly if you could sleep for much of the journey, so that hurts flight companies and increases the burden on road maintenance (but maybe preserves motorway service stations that might otherwise get less business since you’d be less in need of a break if you’re not concentrating on driving all the time!).

Hmmm…drone networks look like they might do interesting things for delivery to non-road locations, but drones have a limited range. What about coupling an HGV ‘mother truck’ with a drone fleet for the distribution of goods to remote locations, with the ‘mother truck’ containing a generator and a large storage unit of stuff-to-distribute. I’m thinking about feeding animals in winter that are stuck in fields, reaching hurricane survivors, more extreme running races (and hopefully helping to avoid deaths) or even supplying people living out of cities and in remote areas (maybe Amazon-by-drone deliveries whilst living up a mountain become feasible?).

Ian applies Data Science as an AI/Data Scientist for companies in Mor Consulting, founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

2 Comments | Tags: ArtificialIntelligence, Life

7 October 2013 - 17:10Future Cities Hackathon (@ds_ldn) Oct 2013 on Parking Usage Inefficiencies

On Saturday six of us attended the Future Cities Hackathon organised by Carlos and DataScienceLondon (@ds_ldn). I counted about 100 people in the audience (see lots of photos, original meetup thread), from asking around there seemed to be a very diverse skill set (Python and R as expected, lots of Java/C, Excel and other tools). There were several newly-released data sets to choose from. We spoke with Len Anderson of SocITM who works with Local Government, he suggested that the parking datasets for Westminster Ward might be interesting as results with an economic outcome might actually do something useful for Government Policy. This seemed like a sensible reason to tackle the data. Other data sets included flow-of-people and ASBO/dog-mess/graffiti recordings.

Overall we won ‘honourable mention’ for proposing the idea that the data supported a method of changing parking behaviour whilst introducing the idea of a dynamic pricing model so that parking spaces might be better utilised and used to generate increased revenue for the council. I suspect that there are more opportunities for improving the efficiency of static systems as the government opens more data here in the UK.

Sidenote – previously I’ve thought about the replacement of delivery drivers with self-driving cars and other outcomes of self-driving vehicles, the efficiencies discussed here connect with those ideas.

With the parking datasets we have over 4 million lines of cashless parking-meter payments for 2012-13 in Westminster to analyse, tagged with duration (you buy a ticket at a certain time for fixed periods of time like 30 minutes, 2 hours etc) and a latitude/longitude for location. We also had a smaller dataset with parking offence tickets (with date/time and location – but only street name, not latitude/longitude) and a third set with readings from the small number of parking sensors in Westminster.

Ultimately we produced a geographic plot of over 1000 parking bays, coloured by average percentage occupancy in Westminster. The motivation was to show that some bays are well used (i.e. often have a car parked in them) whilst other areas are under-utilised and could take a higher load (darker means better utilised):

Westminster Parking Bays by Percentage Occupancy

At first we thought we’d identified a striking result. After a few more minutes hacking (around 9.30pm on the Saturday) we pulled out the variance in pricing per bay and noted that this was actually quite varied and confusing, so a visitor to the area would have a hard time figuring out which bays were likely to be both under-utilised and cheap (darker means more expensive):

Westminster parking bays by cost

If we’d have had more time we’d have checked to see which bays were likely to be under-utilised and cheap and ranked the best bays in various areas. One can imagine turning this into a smartphone app to help visitors and locals find available parking.

The video below shows the cost and availability of parking over the course of the day. Opacity (how see-through it is) represents the expense – darker means more expensive (so you want to find very-see-through areas). Size represents the number of free spaces, bigger means more free space, smaller (i.e. during the working day) shows that there are few free spaces:

Behind this model we captured the minute-by-minute stream of ticket purchases by lat/lng to model the occupancy of bays, the data also records the number of bays that can be maximally used (but the payment machines don’t know how many are in use – we had to model this). Using Pandas we modelled usage over time (+1 for each ticket purchase and -1 for each expiry), the red line shows the maximum number of bays that are available, the sections over the line suggest that people aren’t parking for their full allocation (e.g. you might buy an hour’s ticket but only stay for 20 minutes, then someone else buys a ticket and uses the same bay):


We extended the above model for one Tuesday over all the 1000+ plus parking bays in Westminster.

Additionally this analysis by shows the times and days when parking tickets are most likely to be issued. The 1am and 3am results were odd, Sunday (day 6) is clearly the quietest, weekdays at 9am are obviously the worst:



We believe that the carrot and stick approach to parking management (showing where to park – and noting that you’ll likely get fined if you don’t do it properly) should increase the correct utilisation of parking bays in Westminster which would help to reduce congestion and decrease driver-frustration, whilst increasing income for the local council.

Update – at least one parking area in New Zealand are experimenting with truly dynamic demand-based pricing.

We also believe the data could be used by Traffic Wardens to better patrol the high-risk areas to deter poor parking (e.g. double-parking) which can be a traffic hazard (e.g. by obstructing a road for larger vehicles like Fire Engines). The static dataset we used could certainly be processed for use in a smartphone app for easy use, and updated as new data sets are released.

Our code is available in this github repo: ParkingWestminster.

Here’s our presentation:


Tools used:

  • Python and IPython
  • Pandas
  • QGIS (visualisation of shapefiles backed by OpenLayers maps from Google and OSM)
  • pyshp to handle shapefiles
  • Excel (quick analysis of dates and times, quick visualisation of lat/lng co-ords)
  • HackPad (useful for lightweight note/URL sharing and code snippet collaboration)

 Some reflections for future hackathons:

  • Pre-cleaning of data would speed team productivity (we all hacked various approaches to fixing the odd Date and separate Time fields in the CSV data and I suspect many in the room all solved this same problem over the first hour or two…we should have flagged this issue early on and a couple of us solved it and written out a new 1.4GB fixed CSV file for all to share)
  • Decide early on on a goal – for us it was “work to show that a dynamic pricing model is feasible” – that lets you frame and answer early questions (quite possibly an hour in we’d have discovered that the data didn’t support our hypothesis – thankfully it did!)
  • Always visualise quickly – whilst I wrote a new shapefile to represent the lat/lng data Bart just loaded it into Excel and did a scatter plot – super quick and easy (and shortly after I added the Map layer via QGIS so we could line up street names and validate we had sane data)
  • Check for outliers and odd data – we discovered lots of NaN lines (easily caught and either deleted or fixed using Pandas), these if output and visualised were interpreted by QGIS as an extreme but legal value and so early on we had some odd visuals, until we eyeballed the generated CSV files. Always watch for NaNs!
  • It makes sense to print a list of extreme and normal values for a column, again as a sanity check – histograms are useful, also sets of unique values if you have categories
  • Question whether the result we see actually would match reality – having spent hours on a problem it is nice to think you’ve visualised something new and novel but probably the data you’re drawing is already integrated (e.g. in our case at least some drivers in Westminster would know where the cheap/under-utilised parking spaces would be – so there shouldn’t be too many)
  • Setup a github repo early and make sure all the team can contribute (some of our team weren’t experienced with github so we deferred this step and ended up emailing code…that was a poor use of time!)
  • Go visit the other teams – we hacked so intently we forgot to talk to anyone else…I’m sure we’d have learned and skill-shared had we actually stepped away from our keyboards!

Update – Stephan Hügel has a nice article on various Python tools for making maps of London wards, his notes are far more in-depth than the approach we took here.

Update – nice picture of London house prices by postcode, this isn’t strictly related to the above but it is close enough. Visualising the workings of the city feels rather powerful. I wonder how the house prices track availability of public transport and local amenities?

Ian applies Data Science as an AI/Data Scientist for companies in Mor Consulting, founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

6 Comments | Tags: Data science, Life, Python

17 September 2013 - 23:00Writing a High Performance Python book

I’m terribly excited to announce that I’m co-authoring an O’Reilly book on High Performance Python, to be published next year. My co-author is the talented Micha Gorelick (github @mynameisfiber) of bit.ly, he’s already written a few chapters, I’ll be merging an updated version of my older eBook and adding content based on past tutorials (PyCon 2013, PyCon 2012, EuroSciPy 2012, EuroPython 2011), along with a big pile of new content from us both.

I setup a mailing list a year back with a plan to write such a book, I’ll be sending list members a survey tomorrow to validate the topics we plan to cover (and to spot the things we missed!). Please join the list (no spam, just Python HPC stuff occasionally) to participate. We’ll be sending out subsequent surveys and requests for feedback as we go.

Our snake is a Fer-de-Lance (which even has its own unofficial flag) and which also happens to be a ship from the classic spacefaring game Elite.

We plan to develop the book in a collaborative way based on some lessons I learned last time.

Ian applies Data Science as an AI/Data Scientist for companies in Mor Consulting, founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

37 Comments | Tags: Life, Python

24 August 2013 - 9:35EuroSciPy 2013 write-up

The conference is over, tomorrow I’m sticking around to Sprint on scikit-learn. As last year it has been a lot of fun to catch up with colleagues out here in Brussels. Here’s Logilab’s write-up.

Yesterday I spoke on Building an Open Source Data Science company. Topics included how companies benefit from open sourcing their tools, how individuals benefit by contributing to open source and how to build a consultancy or products.



This led to good questions over lunch. It seems that many people are evaluating whether their future is as predictable as it once was, especially for some in academia.

One question that repeatedly surfaced was “I’m an academic scientist – how do I figure out if my skills are needed in industry?”. My suggestion was simply to phone some nearby recruiters and have an introductory chat (probably via a Google search). Stay in control of the conversation, start from the position that you’re considering a move into industry (so you commit to nothing), build a relationship with the recruiter if you like them via several phone calls (and weed out the idiots – stay in control and just politely get rid of the ones who waste your time).



Almost certainly your science skills will translate into industrial problems (so you won’t have to retrain as a web programmer – a fear expressed by several). One recruitment group I’ve been talking with are the Hydrogen Group, they have contracts for data science throughout Europe. Contact Nick there and mention my name. If you’re in London then talk to Thayer of TeamPrime or look at TechCityJobs and filter by sensible searches.

Another approach is to use a local jobs board (e.g. in London there is TechCityJobs) which lists a healthy set of data science jobs. You can also augment your LinkedIn profile (here’s mine) with the term “data science” as it seems to be the term recruiters know to use to find you. Write a bullet-point list of your skills for data and the tools you use (e.g. Python, R, SPSS, gnuplot, mongodb, Amazon EC2 etc) to held with keyword searches and see who comes to find you (it’ll take months to get a good feel of what people are searching for to find you). In LinkedIn add any talks, open source projects etc that you contribute to as these are easy for someone to check to verify your skill level.

(Sidenote – I’m in the Sprint publishing this, I’ve just had a very interesting chat with a nascent company about how much they want to open source and the benefits and trade offs of doing so in their optics industry. Knowing why you attract user-attention, what you might give away to competitors, how much time you might lose in supporting non-commercial users whilst demonstrating your competence through open source is critical to making a reasoned decision. Related to this chat – posts on switching from [L]GPL to BSD 1, 2)

Next on the Friday I was invited to join a panel discussion asking “How do we make more programmers?”. It was nice to discuss some lessons learned teaching millions of beginners through ShowMeDo and by teaching at the intermediate/expert level at Python conferences. Thoughts covered the uses of the IPython Notebook, the depth of tuition to fit the needs of a group and the wealth of teaching material that’s freely available (e.g. pyvideo.org and the pytutor list).


This morning Peter Wang gave the keynote looking at a future for data analysis with Python. The Continuum tool chain is looking very nice, Bokeh and Blaze look to be worth testing now. I’m still curious about the limitations of Numba, I suspect that common use cases are still a way from being covered.

During the conference I got to learn about cartopy (a bit of a pain to setup but they promise that process will improve) which is a very compelling replacement for basemap, vispy is a cool looking OpenGL based visualiser for large datasets and I learned how to install the IPython Notebook in one go using ‘pip install ipython[notebook]‘.

Overall I’ve had fun again and am very grateful to be part of such a smart and welcoming community.

Ian applies Data Science as an AI/Data Scientist for companies in Mor Consulting, founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

7 Comments | Tags: Life, Python

9 July 2013 - 13:55Some Natural Language Processing and ML Papers

After I spoke at DataScienceLondon in June I was given a set of paper references by a couple of people (the bulk were by Levente Török) – thanks to all. They’re listed below. Along the same lines I have one machine learning paper aimed at beginners to recommend (“A Few Useful Things to Know about Machine Learning” – Pedro Domingos), it gives a set of real-world examples to work off, useful for someone short on experience who wants to learn whilst avoiding some of the worse mistakes.

Selection of references in no particular order:

Deep Learning for Efficient Discriminative Parsing, Ronan Collobert

A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning, Ronan Collobert

Latent Dirichlet Allocation (old article)
Fast Collapsed Gibbs Sampling For Latent Dirichlet Allocation

Rethinking LDA: Why priors matter (How to tune the hyper parameters which shouldn’t matter.)
Dynamic Topic Models and the Document Influence Model (in which they deal with the change of the hidden topics ( HMM))

Semi supervised topic model notes:

Semi-supervised Extraction of Entity Aspects using Topic Models

Hierarchically Supervised Latent Dirichlet Allocation

Melting the huge difference between the topic models and the bag of words approach:

Beyond Bag of words (presentation)

A note on Topical N-grams

PCFGs, Topic Models

Integrating Topics with Syntax

Syntactic Topic Models

Collective Latent Dirichlet Allocation (might be useful for Tweet collections)

R packages (from Levente):

topicmodels for R

lda for R

R Text Tools package (noted as most advanced package, website offline when I visited it)

Ian applies Data Science as an AI/Data Scientist for companies in Mor Consulting, founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

1 Comment | Tags: Life, SocialMediaBrandDisambiguator

20 June 2013 - 19:15Visualising the internals of Logistic Regression on a Text Matrix

Below I have some plots that visualise the term matrix (as a binary matrix and as a TF-IDF matrix) for the brand disambiguation project followed by a visualisation of the coefficients used in scikit-learn’s LogisticRegression classifier using l1 and l2 penalties.

Using a CountVectorizer with binary=True we can mark the absence or presence of a token in a tweet. This is generated using learn1.py with the –termmatrix argument. If you open the full version of the image you’ll see that Class 0 is the bottom half (below the red line) of the rows, Class 1 is the top half (with 1168 rows in total, equally split between the classes). The x-axis shows 1238 features (formed of all unigrams and bigrams by the default tokenizer with a minimum document frequency of 2). The strong white line on the left is for the token ‘apple’ which is present in all tweets. If you look carefully you can see that some terms occur more frequently in only one of the two classes (as we’d hope).


The repeated rows are due to retweets – they have the same terms so we get repeated sets of the same binary features. This probably distorts the learning a bit and these will be removed in a later experiment.

Next we do the same operation but use TF-IDF (wikipedia) to scale the values (so they’re in the range 0-1.0), higher values mean that the tokens are rarer and so should have more importance. Often you’d use TF-IDF to normalise for document length (longer documents have more words and so in a binary matrix you’d have more 1s represented). Tweets are of roughly the same length so I don’t think this is so useful, I don’t know what the effect will be on precision and recall in later testing. As you’d expect the most common terms (e.g. ‘apple’) now have a low value, a few rare words now have a high (bright) value.


An obvious question given the above plots is whether we can easily remove a number of the tokens due to them contributing little towards a classification, either because they’re mentioned equally for both classes (e.g. common English words would be mentioned roughly equally and would have no bearing on the classification problem) or because they occur so rarely that we don’t know if they truly represent a feature that should identify a class. This will be investigated soon.

Next we create a LogisticRegression classifier with learn1_coefficients.py and train it first with the default l2 penalty and then with an l1 penalty. We can see that the l1 penalty sets many of the coefficients to 0. In both plots we’re looking at the coefficients for each of 5 cross-fold models (the dark lines mean more models agree on the importance of the feature, each model is plotted with an alpha blend). For the l1 penalty I’ve annotated the 10 biggest coefficients for the positive and negative coefficients. This chart is plotted using a Binary CountVectorizer (from the first of the two examples above) – if I switch to a TF-IDF Vectorizer then I get a very similar visual output.


As you might expect for the apple-the-brand coefficients we see “cook, company, google, ipad, iphone, macbook, market, samsung, store, vatican” (I’ll explain “vatican” in a moment). For not-apple-the-brand we see “candy, caramel, cinnamon, eat, eye, girl, juice, orange, pie, tree“.

The inclusion of “vatican” happens because these tweets occur with the announcement of the latest Pope – various wags tweeted about topics like the “iPope” and so “vatican” is discussed alongside apple-the-brand. This also highlights the over-fitting that has inevitably occurred due to the current small sample of tweets for this experiment.

Clearly we could use the l1 penalty to perform feature selection, we can also use other methods. This is to follow.

Ian applies Data Science as an AI/Data Scientist for companies in Mor Consulting, founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

2 Comments | Tags: Life, Python, SocialMediaBrandDisambiguator

17 June 2013 - 20:13Demonstrating the first Brand Disambiguator (a hacky, crappy classifier that does something useful)

Last week I had the pleasure of talking at both BrightonPython and DataScienceLondon to about 150 people in total (Robin East wrote-up the DataScience night). The updated code is in github.

The goal is to disambiguate the word-sense of a token (e.g. “Apple”) in a tweet as being either the-brand-I-care-about (in this case – Apple Inc.) or anything-else (e.g. apple sauce, Shabby Apple clothing, apple juice etc). This is related to named entity recognition, I’m exploring simple techniques for disambiguation. In both talks people asked if this could classify an arbitrary tweet as being “about Apple Inc or not” and whilst this is possible, for this project I’m restricting myself to the (achievable, I think) goal of robust disambiguation within the 1 month timeline I’ve set myself.

Below are the slides from the longer of the two talks at BrightonPython:

As noted in the slides for week 1 of the project I built a trivial LogisticRegression classifier using the default CountVectorizer, applied a threshold and tested the resulting model on a held-out validation set. Now I have a few more weeks to build on the project before returning to consulting work.

Currently I use a JSON file of tweets filtered on the term ‘apple’, obtained using the free streaming API from Twitter using cURL. I then annotate the tweets as being in-class (apple-the-brand) or out-of-class (any other use of the term “apple”). I used the Chromium Language Detector to filter non-English tweets and also discard English tweets that I can’t disambiguate for this data set. In total I annotated 2014 tweets. This set contains many duplicates (e.g. retweets) which I’ll probably thin out later, possibly they over-represent the real frequency of important tokens.

Next I built a validation set using 100 in- and 100 out-of-class tweets at random and created a separate test/train set with 584 tweets of each class (a balanced set from the two classes but ignoring the issue of duplicates due to retweets inside each class).

To convert the tweets into a dense matrix for learning I used the CountVectorizer with all the defaults (simple tokenizer [which is not great for tweets], minimum document frequency=1, unigrams only).

Using the simplest possible approach that could work – I trained a LogisticRegression classifier with all its defaults on the dense matrix of 1168 inputs. I then apply this classifier to the held-out validation set using a confidence threshold (>92% for in-class, anything less is assumed to be out-of-class). It classifies 51 of the 100 in-class examples as in-class and makes no errors (100% precision, 51% recall). This threshold was chosen arbitrarily on the validation set rather than deriving it from the test/train set (poor hackery on my part), but it satisfied me that this basic approach was learning something useful from this first data set.

The strong (but not generalised at all!) result for the very basic LogisticRegression classifier will be due to token artefacts in the time period I chose (March 13th 2013 around 7pm for the 2014 tweets). Extracting the top features from LogisticRegression shows that it is identifying terms like “Tim”, “Cook”, “CEO” as significant features (along with other features that you’d expect to see like “iphone” and “sauce” and “juice”) – this is due to their prevalence in this small dataset (in this set examples like this are very frequent). Once a larger dataset is used this advantage will disappear.

I’ve added some TODO items to the README, maybe someone wants to tinker with the code? Building an interface to the open source DBPediaSpotlight (based on WikiPedia data using e.g. this python wrapper) would be a great start for validating progress, along with building some naive classifiers (a capital-letter-detecting one and a more complex heuristic-based one, to use as controls against the machine learning approach).

Looking at the data 6% of the out-of-class examples are retweets and 20% of the in-class examples are retweets. I suspect that the repeated strings are distorting each class so I think they need to be thinned out so we just have one unique example of each tweet.

Counting the number of capital letters in-class and out-of-class might be useful, in this set a count of <5 capital letters per tweet suggests an out-of-class example:

This histogram of tweet lengths for in-class and out-of-class tweets might also suggest that shorter tweets are more likely to be out-of-class (though the evidence is much weaker):


Next I need to:

  • Update the docs so that a contributor can play with the code, this includes exporting a list of tweet-ids and class annotations so the data can be archived and recreated
  • Spend some time looking at the most-important features (I want to properly understand the numbers so I know what is happening), I’ll probably also use a Decision Tree (and maybe RandomForests) to see what they identify (since they’re much easier to debug)
  • Improve the tokenizer so that it respects some of the structure of tweets (preserving #hashtags and @users would be a start, along with URLs)
  • Build a bigger data set that doesn’t exhibit the easily-fitted unigrams that appear in the current set

Longer term I’ve got a set of Homeland tweets (to disambiguate the TV show vs references to the US Department and various sayings related to the term) which I’d like to play with – I figure making some progress here opens the door to analysing media commentary in tweets.

Ian applies Data Science as an AI/Data Scientist for companies in Mor Consulting, founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

2 Comments | Tags: ArtificialIntelligence, Data science, Life, Python, SocialMediaBrandDisambiguator

17 June 2013 - 14:39Active Countermeasures for Privacy in a Social Networking age?

This is a bit of a rambling post covering some thoughts on data privacy, mobile phones and social networking.

A general and continued decrease in personal privacy  seems inevitable in our age of data (NSA Files at The Guardian). We generate a lot of data, we rarely know how or where it is stored and we don’t understand how easy it is to make certain inferences based on aggregated forms of our data. Cory Doctorow has some points on why we should care about this topic.

Will we now see the introduction of active countermeasures in a data stream by way of protest or camouflage by regular folk?

Update – hat tip to Kyran for prism-break.org, listing open-source alternatives to Operating Systems and communication clients/systems. I had a play earlier today with the Tor-powered Orweb on Android – it Just Worked and whatsmyip.org didn’t know where my device was coming from (running traceroute went from whatsmyip to the Tor entry node and [of course] no further). It seems that installing Tor on a raspberrypi or Tor on EC2 is pretty easy too (Tor runs faster when more people start Tor relays [which carry the internal encrypted traffic, so there's none of the fear of running an edge nodes that sends the traffic onto the unencrypted Internet]). Here are some Tor network statistic graphs.

I’ve long been unhappy with the fact that my email is known to be transmitted and stored in the clear (accepting that I turn on HTTPS-only in Gmail). I’d really like for it to be readable only for the recipient, not for anyone (sysadmin or Government agency) along the chain. Maybe someone can tell me if adding PGP into Gmail via the browser and Android phone is an easy thing to do?

I’m curious to see how long it’ll be before we have a cypherpunk mobile OS, preconfigured with sensible defaults. CyanogenMod is an open build of Android (so you could double-check for Government backdoors [if you took the time]), there’s no good reason why a distro couldn’t be setup that uses Tor, HTTPSEverywhere (eff.org post on this combo, this Tor blog post comments on Tor vs PRISM) and Incognito Mode by default as a start for private web usage. Add on a secure and open source VoIP client (not Skype) and an IM tool and you’re most of the way there for better-than-normal-folk privacy.

Compared to an iOS device it’ll be a bit clunky (so maybe my mum won’t use it) but I’d like the option, even if I have to jump through a few hoops. You might also choose not to trust your handset provider, we’re just starting to see designs for build-it-yourself cellphones (albeit very basic non-data phones at present).

Maybe we’ll start to consider the dangers of entrusting our data to near-monopolies in the hope that they do no evil (and aren’t subject to US Government secret & uninvestigable
disclosures to people who we personally may or may not trust, and may or may not be decent, upright, solid, incorruptible citizens). Perhaps far-sighted governments in other countries will start to educate their citizens about the dangers of trusting US Data BigCorps (“Loose Lips Sink Ships“)?

So what about active countermeasures? For the social networking example above we’d look at communications traffic (‘friends’ are cheap to acquire but communication takes effort). What if we started to lie about who we talk to? What if my email client builds a commonly-communicated-with list and picks someone from outside of that list, then starts to send them reasonably sensible-looking emails automatically? Perhaps it contains a pre-agreed codeword, then their client responds at a sensible frequency with more made-up but intelligible text. Suddenly they appear to be someone I closely communicate with, but that’s a lie.

My email client knows this so I’m not bothered by it but an eavesdropper has to process this text. It might not pass human inspection but it ought to tie up more resources, forcing more humans to get involved, driving up the cost and slowing down response times. Maybe our email clients then seed these emails with provocative keywords in innocuous phrases (“I’m going to get the bomb now! The bomb is of course the name for my football”) which tie up simple keyword scanners.

The above will be a little like the war on fake website signups for spam being defeated by CAPTCHAs (and in turn defeating the CAPTCHAs), driving perhaps improvements in NLP technologies. I seem to recall that Hari Seldon in Asimov’s Foundation novels used auto-generated plausible speech generators to mask private in-person communications from external eavesdropping (I can’t find a reference – am I making this up?), this stuff doesn’t feel like science fiction any more.

Maybe with FourSquare people will practice fake check-ins. Maybe during a protest you comfortably sit at home and take part in remote virtual check-ins to spots that’ll upset the police (“quick! join the mass check-in in the underground coffee shop! the police will have to spend resources visiting it to see if we’re actually there!”). Maybe you’ll physically be in the protest but will send spoofed GPS co-ords with your check-ins pretending to be elsewhere.

Maybe people start to record and replay another person’s check-ins, a form of ‘identify theft’ where they copy the behaviour of another to mask their own movements?

Maybe we can extend this idea to photo sharing. Some level of face detection and recognition already exists and it is pretty good, especially if you bound the face recognition problem to a known social group. What if we use a graphical smart-paste to blend a person-of-interest’s face into some of our group photos? Maybe Julian Assange appears in background shots around London or a member of Barack Obama’s Government in photos from Iranian photobloggers?

The photos could be small and perhaps reasonably well disguised so they’re not obvious to humans, but obvious enough to good face detection & recognition algorithms. Again this ties up resources (and computer vision algorithms are terribly CPU-expensive). It would no doubt upset the intelligence services if it impacted their automated analysis, maybe this becomes a form of citizen protest?

Hidden Mickeys appear in lots of places (did you spot the one in Tron?), yet we don’t notice them. I’m pretty sure a smart paste could hide a small or distorted or rotated or blended image of a face in some photos, without too much degradation.

Figuring out who is doing what given the absence of information is another interesting area. With SocialTies (built by Emily and I) I could track who was at a conference via their Lanyrd sign-up, and also track people via nearby FourSquare check-ins and geo-tagged tweets (there are plenty of geo-tagged tweets in London…). Inferring where you were was quite possible, even if you only tweeted (and had geo-locations enabled). Double checking your social group and seeing that friends are marked as attending the event that you are near only strengthens the assertion that you’re also present.

Facebook typically knows the address book of your friends, so even if you haven’t joined the service it’ll still have your email. If 5 members of Facebook have your email address then that’s 5 directed edges in a social network graph pointing at a not-yet-active profile with your name on it. You might never join Facebook but they still have your email, name and some of your social connections. You can’t make those edges disappear. You just leaked your social connectivity without ever going near the service.

Anyhow, enough with the prognostications. Privacy is dead. C’est la vie. As long as we trust the good guys to only be good, nothing bad can happen.

Ian applies Data Science as an AI/Data Scientist for companies in Mor Consulting, founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

2 Comments | Tags: Life