This is Ian Ozsvald's blog, I'm an entrepreneurial geek, a Data Science/ML/NLP/AI consultant, founder of the Annotate.io social media mining API, author of O'Reilly's High Performance Python book, co-organiser of PyDataLondon, co-founder of the SocialTies App, author of the A.I.Cookbook, author of The Screencasting Handbook, a Pythonista, co-founder of ShowMeDo and FivePoundApps and also a Londoner. Here's a little more about me.

1 November 2013 - 12:10 “Introducing Python for Data Science” talk at SkillsMatter

On Wednesday Bart and I spoke at SkillsMatter to 75 Pythonistas with an Introduction to Data Science using Python. A video of the 4 talks is now online. We covered:

Since the group is more of a general programming community we wanted to talk at a high level about the various ways that Python can be used for data science. It was lovely to have such a large turn-out and the pub conversation that followed was much fun.

Ian applies Data Science as an AI/Data Scientist for companies in Mor Consulting, founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

16 Comments | Tags: Data science, Life, Python

8 October 2013 - 10:23 What confusion leads from self driving vehicles and their talking to each other?

This is a light follow-up from my “Do self driving cars make the courier redundant?”  post from January. I’m wondering which first- and second-order effects occur from self-driving cars talking to each other.

Let’s assume they can self-drive and self-park and that they have some ability to communicate with each other. Noting their speed and intent should help self-driving cars make better utilisation of the road (they could drive closer together), they could quickly signal if they have a failure (e.g. “My brake readings have just become odd – everyone pull back! I’m slowing using the secondary brake system”), they can signal that e.g. they intend to reverse park and that other cars should slow further back along the road to avoid having to halt. It is hard to see how a sensibly designed system of self-driving cars could be worse than a similar sized pack of normal humans (who might be tired, overconfident, in a rush etc) behind the wheel.

Would cars deliberately lie? There are many running jokes about drivers (often “elsewhere” in the world) where some may signal one way and then exploit nearby gaps regardless of their signalled intention. Might cars do the same? By design or by poor coding? I’d guess people might mod their driving computer to help them get somewhere faster – maybe they’d ask it to be less cautious in its manoeuvres  (taking turns quicker, giving less distance between other vehicles) or hypermile more closely than a human would. Manufacturers would fight back as these sorts of modifications would increase their liabilities and accidents would damage their brand.

What about poorly implemented protocols? On the Internet with TCP/IP we suffer from bufferbloat – many intermediate devices between packet destinations have varying sized buffers, they all try to cache to manage traffic but we end up with lower throughput and odd jams that are rather unpredictable and contrary to the design goal. Cars could have poor implementations of communication protocols (just as some smartphones and laptop brands have trouble with certain WiFi routers), so they’d fail to talk or maybe talk with errors.

Maybe cars would not communicate directly but would implement some boids-like behaviours based on local sensing (probably more robust but also less efficient due to no longer-range negotiation). Even so local odd behaviours might emerge – two cars backing off from each other, then accelerating to close the gap, then repeating – maybe a group of cars get into an unstable ‘dance’ whilst driving down the motorway. This might only be visible from the air and would look rather inhuman.

Presumably self-driving cars would have to avoid hitting humans at all costs. This might make humans less observant as they cross the road – why look if you know that a car is always anticipating (and avoiding) your arrival into the road? This presumably leaves self-driving cars at the mercy of mischievous humans – leaving out human-like dolls in the road that cause slow-and-avoid behaviours, just for kicks.

Governments are likely to introduce some kind of control overrides into the cars in the name of safety and national security (NSA/GCHQ – looking at you). This is likely to be as secure as the “unbreakable” DVD encryption, since any encryption system released into the wild is subject to various attacks. Having people steal cars or subvert their behaviours once the backdoors and overrides are noticed seems inevitable.

I wonder what sort of second order effects we’d see? I suspect that self-driving delivery vehicles would shift to more night work (when the roads are less congested and possibly petrol is dynamically priced to be cheaper), so roads could be less congested by day (and so could be filled by more humans as they commute longer distances to work?). Maybe people en masse forget how to drive? More people will never have to drive a car, so we’d need fewer driving instructors. Maybe we’d need fewer parking spaces as cars could self-park elsewhere and return when summoned – maybe the addition of intelligence helps us use parking resources more efficiently?

If we have self-driving trucks then maybe the cost of removals and deliveries drops. No longer would I need to hire a large truck with a driver; instead the truck would drive itself (it’d still need loading/unloading of course). This would mean fewer people taking the larger-vehicle licensing exams, so fewer test centres (just as for driving schools) would be needed.

An obvious addition – if cars can self-drive then repair centres don’t need to be small and local. Whither the local street of car mechanics (inevitably of varying quality and, sadly, honesty)? I’d guess larger, out of town centralised garages more closely monitored by the manufacturers will surface (along with a fleet of pick-up trucks for broken-down vehicles). What happens to the local street of car mechanic shops? More hackspaces and assembly shops? Conversion to housing seems more likely.

If we need fewer parking spaces (e.g. in Hove [1927 photo!] there are huge boulevards – see Grand Avenue lanes here) then maybe we get more cycle lanes and maybe we can repurpose some of the road space for other usages – communal green patches (for kids and/or for growing stuff?).

The NYTimes has a good article on how driverless cars could reshape cities.

Charles Stross has a nice thread on geo-political consequences of self-driving cars. One comment alludes to improved social lives – if we can get to and from a party/restaurant/pub/nice social scene very easily (without e.g. hoping for the last Tube train home in London or a less pleasant bus journey), maybe our social dimension increases? The comment on flying vs driving  is interesting – you’d probably drive further rather than fly if you could sleep for much of the journey, so that hurts flight companies and increases the burden on road maintenance (but maybe preserves motorway service stations that might otherwise get less business since you’d be less in need of a break if you’re not concentrating on driving all the time!).

Hmmm…drone networks look like they might do interesting things for delivery to non-road locations, but drones have a limited range. What about coupling an HGV ‘mother truck’ with a drone fleet for the distribution of goods to remote locations, with the ‘mother truck’ containing a generator and a large storage unit of stuff-to-distribute. I’m thinking about feeding animals in winter that are stuck in fields, reaching hurricane survivors, more extreme running races (and hopefully helping to avoid deaths) or even supplying people living out of cities and in remote areas (maybe Amazon-by-drone deliveries whilst living up a mountain become feasible?).

2 Comments | Tags: ArtificialIntelligence, Life

7 October 2013 - 17:10 Future Cities Hackathon (@ds_ldn) Oct 2013 on Parking Usage Inefficiencies

On Saturday six of us attended the Future Cities Hackathon organised by Carlos and DataScienceLondon (@ds_ldn). I counted about 100 people in the audience (see lots of photos, original meetup thread). From asking around there seemed to be a very diverse skill set (Python and R as expected, lots of Java/C, Excel and other tools). There were several newly-released data sets to choose from. We spoke with Len Anderson of SocITM, who works with Local Government. He suggested that the parking datasets for Westminster Ward might be interesting, as results with an economic outcome might actually do something useful for Government Policy. This seemed like a sensible reason to tackle the data. Other data sets included flow-of-people and ASBO/dog-mess/graffiti recordings.

Overall we won ‘honourable mention’ for proposing the idea that the data supported a method of changing parking behaviour whilst introducing the idea of a dynamic pricing model so that parking spaces might be better utilised and used to generate increased revenue for the council. I suspect that there are more opportunities for improving the efficiency of static systems as the government opens more data here in the UK.

Sidenote – previously I’ve thought about the replacement of delivery drivers with self-driving cars and other outcomes of self-driving vehicles, the efficiencies discussed here connect with those ideas.

With the parking datasets we have over 4 million lines of cashless parking-meter payments for 2012-13 in Westminster to analyse, tagged with duration (you buy a ticket at a certain time for fixed periods of time like 30 minutes, 2 hours etc) and a latitude/longitude for location. We also had a smaller dataset with parking offence tickets (with date/time and location – but only street name, not latitude/longitude) and a third set with readings from the small number of parking sensors in Westminster.

Ultimately we produced a geographic plot of over 1000 parking bays, coloured by average percentage occupancy in Westminster. The motivation was to show that some bays are well used (i.e. often have a car parked in them) whilst other areas are under-utilised and could take a higher load (darker means better utilised):

Westminster Parking Bays by Percentage Occupancy

At first we thought we’d identified a striking result. After a few more minutes hacking (around 9.30pm on the Saturday) we pulled out the variance in pricing per bay and noted that this was actually quite varied and confusing, so a visitor to the area would have a hard time figuring out which bays were likely to be both under-utilised and cheap (darker means more expensive):

Westminster parking bays by cost

If we’d had more time we’d have checked to see which bays were likely to be under-utilised and cheap and ranked the best bays in various areas. One can imagine turning this into a smartphone app to help visitors and locals find available parking.

The video below shows the cost and availability of parking over the course of the day. Opacity (how see-through it is) represents the expense – darker means more expensive (so you want to find very-see-through areas). Size represents the number of free spaces, bigger means more free space, smaller (i.e. during the working day) shows that there are few free spaces:

Behind this model we captured the minute-by-minute stream of ticket purchases by lat/lng to model the occupancy of bays, the data also records the number of bays that can be maximally used (but the payment machines don’t know how many are in use – we had to model this). Using Pandas we modelled usage over time (+1 for each ticket purchase and -1 for each expiry), the red line shows the maximum number of bays that are available, the sections over the line suggest that people aren’t parking for their full allocation (e.g. you might buy an hour’s ticket but only stay for 20 minutes, then someone else buys a ticket and uses the same bay):


We extended the above model for one Tuesday over all the 1000+ parking bays in Westminster.
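The +1/-1 counting described above can be sketched in a few lines of Pandas. This is a minimal illustration, not the code from our repo: the ticket times, the column names (`start`, `minutes`, `end`) and the capacity of 2 bays are all made up for the example.

```python
import pandas as pd

# Hypothetical ticket purchases for a single parking location
# (illustrative values only, not the Westminster data).
tickets = pd.DataFrame({
    "start": pd.to_datetime([
        "2013-01-08 09:00", "2013-01-08 09:10",
        "2013-01-08 09:30", "2013-01-08 10:05",
    ]),
    "minutes": [60, 30, 120, 30],
})
tickets["end"] = tickets["start"] + pd.to_timedelta(tickets["minutes"], unit="min")

# One +1 event per purchase and one -1 event per expiry.
events = pd.concat([
    pd.Series(1, index=pd.DatetimeIndex(tickets["start"])),
    pd.Series(-1, index=pd.DatetimeIndex(tickets["end"])),
])

# Sum events that share a timestamp, then accumulate over time
# (groupby sorts the timestamps for us).
occupancy = events.groupby(level=0).sum().cumsum()

max_bays = 2  # assumed capacity, the equivalent of the red line in the plot
over_capacity = occupancy[occupancy > max_bays]

print(occupancy)
print("Times where paid-for tickets exceed the bays available:")
print(over_capacity)
```

Any timestamp in `over_capacity` corresponds to a section above the red line, i.e. more valid tickets than bays, which suggests somebody left before their ticket expired.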

Additionally this analysis shows the times and days when parking tickets are most likely to be issued. The 1am and 3am results were odd, Sunday (day 6) is clearly the quietest, weekdays at 9am are obviously the worst:
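As a rough illustration of how a day-by-hour tally like this might be produced with Pandas, here is a small sketch. The timestamps and the `issued` column name are invented for the example; it assumes days are numbered the way Pandas does it (Monday is 0, Sunday is 6), which matches the "day 6" above.

```python
import pandas as pd

# Hypothetical offence timestamps (not the real ticket data).
offences = pd.DataFrame({"issued": pd.to_datetime([
    "2013-01-07 09:05",  # Monday
    "2013-01-07 09:40",  # Monday
    "2013-01-08 09:15",  # Tuesday
    "2013-01-08 14:00",  # Tuesday
    "2013-01-13 11:30",  # Sunday
])})

offences["day"] = offences["issued"].dt.dayofweek  # Monday=0 ... Sunday=6
offences["hour"] = offences["issued"].dt.hour

# Rows are days, columns are hours, cells are ticket counts.
counts = offences.groupby(["day", "hour"]).size().unstack(fill_value=0)
print(counts)
```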



We believe that the carrot and stick approach to parking management (showing where to park – and noting that you’ll likely get fined if you don’t do it properly) should increase the correct utilisation of parking bays in Westminster which would help to reduce congestion and decrease driver-frustration, whilst increasing income for the local council.

Update – at least one parking area in New Zealand is experimenting with truly dynamic demand-based pricing.

We also believe the data could be used by Traffic Wardens to better patrol the high-risk areas to deter poor parking (e.g. double-parking) which can be a traffic hazard (e.g. by obstructing a road for larger vehicles like Fire Engines). The static dataset we used could certainly be processed for use in a smartphone app for easy use, and updated as new data sets are released.

Our code is available in this github repo: ParkingWestminster.

Here’s our presentation:


Tools used:

  • Python and IPython
  • Pandas
  • QGIS (visualisation of shapefiles backed by OpenLayers maps from Google and OSM)
  • pyshp to handle shapefiles
  • Excel (quick analysis of dates and times, quick visualisation of lat/lng co-ords)
  • HackPad (useful for lightweight note/URL sharing and code snippet collaboration)

 Some reflections for future hackathons:

  • Pre-cleaning of data would speed team productivity (we all hacked various approaches to fixing the odd Date and separate Time fields in the CSV data and I suspect many in the room solved this same problem over the first hour or two…we should have flagged this issue early on, had a couple of us solve it and written out a new 1.4GB fixed CSV file for all to share)
  • Decide early on on a goal – for us it was “work to show that a dynamic pricing model is feasible” – that lets you frame and answer early questions (quite possibly an hour in we’d have discovered that the data didn’t support our hypothesis – thankfully it did!)
  • Always visualise quickly – whilst I wrote a new shapefile to represent the lat/lng data Bart just loaded it into Excel and did a scatter plot – super quick and easy (and shortly after I added the Map layer via QGIS so we could line up street names and validate we had sane data)
  • Check for outliers and odd data – we discovered lots of NaN lines (easily caught and either deleted or fixed using Pandas), these if output and visualised were interpreted by QGIS as an extreme but legal value and so early on we had some odd visuals, until we eyeballed the generated CSV files. Always watch for NaNs!
  • It makes sense to print a list of extreme and normal values for a column, again as a sanity check – histograms are useful, also sets of unique values if you have categories
  • Question whether the result we see actually would match reality – having spent hours on a problem it is nice to think you’ve visualised something new and novel but probably the data you’re drawing is already integrated (e.g. in our case at least some drivers in Westminster would know where the cheap/under-utilised parking spaces would be – so there shouldn’t be too many)
  • Set up a github repo early and make sure all the team can contribute (some of our team weren’t experienced with github so we deferred this step and ended up emailing code…that was a poor use of time!)
  • Go visit the other teams – we hacked so intently we forgot to talk to anyone else…I’m sure we’d have learned and skill-shared had we actually stepped away from our keyboards!
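Several of the reflections above (merging the Date and Time fields, watching for NaNs, printing extreme values and unique categories) take only a few lines of Pandas. The sketch below uses a tiny inline CSV with made-up column names (`Date`, `Time`, `lat`, `lng`, `tariff`) and assumes a day-first date format; the real files may well differ.

```python
import io
import pandas as pd

# A tiny stand-in for the parking CSV (hypothetical columns and values).
raw = io.StringIO(
    "Date,Time,lat,lng,tariff\n"
    "08/01/2013,09:00,51.5136,-0.1365,2hr\n"
    "08/01/2013,09:10,,,30min\n"
    "09/01/2013,17:45,51.5101,-0.1340,2hr\n"
)
df = pd.read_csv(raw)

# Merge the separate Date and Time fields into a single timestamp.
df["timestamp"] = pd.to_datetime(
    df["Date"] + " " + df["Time"], format="%d/%m/%Y %H:%M"
)

# Count rows with missing coordinates before they reach a GIS tool.
n_missing = int(df[["lat", "lng"]].isna().any(axis=1).sum())
print("rows with NaN co-ords:", n_missing)
clean = df.dropna(subset=["lat", "lng"])

# Sanity checks: extremes for numeric columns, unique values for categories.
print(clean[["lat", "lng"]].describe())
print(clean["tariff"].value_counts())
```

Writing `clean` back out once (e.g. with `clean.to_csv(...)`) is the "fixed CSV for all to share" step that would have saved the team time.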

Update – Stephan Hügel has a nice article on various Python tools for making maps of London wards, his notes are far more in-depth than the approach we took here.

Update – nice picture of London house prices by postcode, this isn’t strictly related to the above but it is close enough. Visualising the workings of the city feels rather powerful. I wonder how the house prices track availability of public transport and local amenities?

6 Comments | Tags: Data science, Life, Python

7 October 2013 - 14:44 Public Python survey for “High Performance Python” book – your input much appreciated!

If you’re a Pythonista and you’re interested in reading our forthcoming High Performance Python book from O’Reilly we’d really appreciate 5-10 minutes of your time in our survey so we can discover what you want to learn about. Please mail this link to whoever you think would be interested (and ReTweet etc!).

We’ve already conducted a first survey with the people who are on our mailing list (see earlier post), if you’ve filled that survey in then there’s no need to do this additional survey. This second survey has some refinements to the first and is public (we’re interested in the variation in results from the mailing list I’ve collected in the last year and this more public survey now). You don’t need to sign-up, you just visit the site and spend 5-10 minutes ticking some boxes and writing as much (or little) as you want.

If you’d like to be notified about our progress and to help with the creation of the book please join our very-lightly-used mailing list.

18 Comments | Tags: Python

22 September 2013 - 12:13 PyConUK 2013

I’m just finishing with PyConUK, it has been a fun 3 days (and the sprints carry on tomorrow).


Yesterday I presented a lightly tweaked version of my Brand Disambiguation with scikit-learn talk on natural language processing for social media processing. I had 65 people in the room (cripes!), 2/3 had used ML or NLP for their own projects though only a handful of the participants had used either ‘in anger’ for commercial work. The slides below are slightly updated from my DataScienceLondon talk earlier in the year, there’s more on this blog over the last 2 months that I hadn’t integrated into the talk.



The project is in github if you’re interested, I’m looking for new collaborators and I can share the dataset of hand-tagged tweets.

I’d like to see more scientific talks at PyConUK, a lightning talk for later today will introduce EuroSciPy 2014 which will take place in Cambridge. I’d love to see more Pythonistas talking about scientific work, numerical computing and parallel computing (rather than quite so much web and db development). I also met David Miller who spoke on censorship (giving a call-out to the OpenRightsGroup – you too should pay them a tenner a month to support digital freedoms in the UK), but looked over a long period of censorship in the UK and the English language. As ever, there were a ton of interesting folk to meet.

David mentioned the Andrews and Arnold ISP who pledge not to censor their broadband, apparently the only ISP in the UK to put up a strong pledge. This is interesting.

Shortly in London I’ll organise (or co-opt) some sort of Natural Language Processing meetup, I’m keen to meet others (Pythonistas, R, Matlab, whoever) who are involved in the field. I’ll announce it here when I’ve figured something out.

3 Comments | Tags: ArtificialIntelligence, Python, SocialMediaBrandDisambiguator

17 September 2013 - 23:00 Writing a High Performance Python book

I’m terribly excited to announce that I’m co-authoring an O’Reilly book on High Performance Python, to be published next year. My co-author is the talented Micha Gorelick (github @mynameisfiber) of bit.ly, he’s already written a few chapters, I’ll be merging an updated version of my older eBook and adding content based on past tutorials (PyCon 2013, PyCon 2012, EuroSciPy 2012, EuroPython 2011), along with a big pile of new content from us both.

I set up a mailing list a year back with a plan to write such a book, and I’ll be sending list members a survey tomorrow to validate the topics we plan to cover (and to spot the things we missed!). Please join the list (no spam, just Python HPC stuff occasionally) to participate. We’ll be sending out subsequent surveys and requests for feedback as we go.

Our snake is a Fer-de-Lance (which even has its own unofficial flag) and which also happens to be a ship from the classic spacefaring game Elite.

We plan to develop the book in a collaborative way based on some lessons I learned last time.

37 Comments | Tags: Life, Python

24 August 2013 - 9:35 EuroSciPy 2013 write-up

The conference is over, tomorrow I’m sticking around to Sprint on scikit-learn. As last year it has been a lot of fun to catch up with colleagues out here in Brussels. Here’s Logilab’s write-up.

Yesterday I spoke on Building an Open Source Data Science company. Topics included how companies benefit from open sourcing their tools, how individuals benefit by contributing to open source and how to build a consultancy or products.



This led to good questions over lunch. It seems that many people are evaluating whether their future is as predictable as it once was, especially for some in academia.

One question that repeatedly surfaced was “I’m an academic scientist – how do I figure out if my skills are needed in industry?”. My suggestion was simply to phone some nearby recruiters (probably found via a Google search) and have an introductory chat. Stay in control of the conversation, start from the position that you’re considering a move into industry (so you commit to nothing), build a relationship with the recruiter if you like them via several phone calls (and weed out the idiots – stay in control and just politely get rid of the ones who waste your time).



Almost certainly your science skills will translate into industrial problems (so you won’t have to retrain as a web programmer – a fear expressed by several). One recruitment group I’ve been talking with are the Hydrogen Group, they have contracts for data science throughout Europe. Contact Nick there and mention my name. If you’re in London then talk to Thayer of TeamPrime or look at TechCityJobs and filter by sensible searches.

Another approach is to use a local jobs board (e.g. in London there is TechCityJobs) which lists a healthy set of data science jobs. You can also augment your LinkedIn profile (here’s mine) with the term “data science” as it seems to be the term recruiters know to use to find you. Write a bullet-point list of your skills for data and the tools you use (e.g. Python, R, SPSS, gnuplot, mongodb, Amazon EC2 etc) to help with keyword searches and see who comes to find you (it’ll take months to get a good feel of what people are searching for to find you). In LinkedIn add any talks, open source projects etc that you contribute to as these are easy for someone to check to verify your skill level.

(Sidenote – I’m in the Sprint publishing this, I’ve just had a very interesting chat with a nascent company about how much they want to open source and the benefits and trade offs of doing so in their optics industry. Knowing why you attract user-attention, what you might give away to competitors, how much time you might lose in supporting non-commercial users whilst demonstrating your competence through open source is critical to making a reasoned decision. Related to this chat – posts on switching from [L]GPL to BSD 1, 2)

Next on the Friday I was invited to join a panel discussion asking “How do we make more programmers?”. It was nice to discuss some lessons learned teaching millions of beginners through ShowMeDo and by teaching at the intermediate/expert level at Python conferences. Thoughts covered the uses of the IPython Notebook, the depth of tuition to fit the needs of a group and the wealth of teaching material that’s freely available (e.g. pyvideo.org and the pytutor list).


This morning Peter Wang gave the keynote looking at a future for data analysis with Python. The Continuum tool chain is looking very nice, Bokeh and Blaze look to be worth testing now. I’m still curious about the limitations of Numba, I suspect that common use cases are still a way from being covered.

During the conference I got to learn about cartopy (a bit of a pain to set up but they promise that process will improve) which is a very compelling replacement for basemap. vispy is a cool looking OpenGL based visualiser for large datasets, and I learned how to install the IPython Notebook in one go using ‘pip install ipython[notebook]’.

Overall I’ve had fun again and am very grateful to be part of such a smart and welcoming community.

7 Comments | Tags: Life, Python

9 July 2013 - 13:55 Some Natural Language Processing and ML Papers

After I spoke at DataScienceLondon in June I was given a set of paper references by a couple of people (the bulk were by Levente Török) – thanks to all. They’re listed below. Along the same lines I have one machine learning paper aimed at beginners to recommend (“A Few Useful Things to Know about Machine Learning” – Pedro Domingos), it gives a set of real-world examples to work off, useful for someone short on experience who wants to learn whilst avoiding some of the worse mistakes.

Selection of references in no particular order:

Deep Learning for Efficient Discriminative Parsing, Ronan Collobert

A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning, Ronan Collobert

Latent Dirichlet Allocation (old article)
Fast Collapsed Gibbs Sampling For Latent Dirichlet Allocation

Rethinking LDA: Why priors matter (How to tune the hyper parameters which shouldn’t matter.)
Dynamic Topic Models and the Document Influence Model (in which they deal with the change of the hidden topics (HMM))

Semi supervised topic model notes:

Semi-supervised Extraction of Entity Aspects using Topic Models

Hierarchically Supervised Latent Dirichlet Allocation

Melting the huge difference between the topic models and the bag of words approach:

Beyond Bag of words (presentation)

A note on Topical N-grams

PCFGs, Topic Models

Integrating Topics with Syntax

Syntactic Topic Models

Collective Latent Dirichlet Allocation (might be useful for Tweet collections)

R packages (from Levente):

topicmodels for R

lda for R

R Text Tools package (noted as most advanced package, website offline when I visited it)

1 Comment | Tags: Life, SocialMediaBrandDisambiguator

7 July 2013 - 17:52 Overfitting with a Decision Tree

Below is a plot of Training versus Testing errors using a Precision metric (actually 1.0-precision, so lower is better) that shows how easy it is to over-fit a decision tree to the detriment of generalisation. It is important to check that a classifier isn’t overfitting to the training data such that it is just learning the training set, rather than generalising to the true patterns that make up the entire dataset. It will only be a good predictor on unseen data if it has generalised to the true patterns.


Looking at the first column (depth 1 decision tree) the training error (red) is around 0.29 (so the Precision is around 71%). If we look at the exported depth 1 decision tree (1 page pdf) we see that it picks out 1 feature (“http”) as the most informative feature to split the dataset (ignore the threshold, that’s held at a constant 0.5 as we only have 0 or 1 values in our training matrix). It has 935 samples in the dataset with 465 in class 0 (not-a-brand) and 470 in class 1 (is-the-brand).

The right sub-tree is chosen if the term “http” is seen in the tweet. In that case the training set is left with 331 samples of which 95 are class 0 and 236 are class 1, so the precision is 1.0/331*236, or about 71%. If “http” isn’t seen then the left branch is taken where 234 class 1 samples are given a false negative labelling.
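The arithmetic for the depth 1 tree can be checked directly from the counts quoted above. The recall figure at the end is not stated in the post; it is simply derived here from the same numbers.

```python
# Counts quoted for the depth 1 tree that splits on "http".
class1_total = 470   # is-the-brand samples in the training set
right_total = 331    # tweets containing "http" (predicted class 1)
right_class1 = 236   # true positives in the right branch
right_class0 = 95    # false positives in the right branch

precision = right_class1 / right_total
train_error = 1.0 - precision                   # the value plotted in the first column
false_negatives = class1_total - right_class1   # class 1 tweets sent down the left branch
recall = right_class1 / class1_total            # derived, not stated in the post

print(f"precision   {precision:.3f}")
print(f"1-precision {train_error:.3f}")
print(f"false negatives {false_negatives}")
print(f"recall      {recall:.3f}")
```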

As we allow greater depth in the decision tree we see both the training and the testing errors improve. By around depth 35 we have a very low training error and (roughly) the optimum testing error. By allowing the decision tree to add new branches it overfits, becoming a great predictor for the training set (the error goes to 0) but with worsening testing errors (the thin green line is the average – it increases past a depth of 35 layers). Decision trees tend to overfit due to their greedy nature.
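A depth sweep of this kind can be sketched with scikit-learn as below. This is not learn1_biasvar.py: it uses a synthetic dataset from `make_classification` rather than the tweet data and a single train/test split rather than cross validation, and the depths and other settings are chosen purely for illustration, so the exact numbers will differ from the plot.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, slightly noisy two-class data standing in for the tweet features.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

depths = [1, 2, 5, 10, 20, 50]
train_errors, test_errors = [], []
for depth in depths:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    # Error is 1.0 - precision, so lower is better (as in the plot).
    train_errors.append(1.0 - precision_score(y_train, tree.predict(X_train)))
    test_errors.append(1.0 - precision_score(y_test, tree.predict(X_test)))

for depth, tr, te in zip(depths, train_errors, test_errors):
    print(f"depth {depth:2d}  train error {tr:.3f}  test error {te:.3f}")
```

Comparing the two columns as depth grows is the quick check for overfitting: a training error that keeps shrinking while the testing error stalls or rises is the warning sign.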

I’ve added an example of a depth 50 (1 page pdf) decision tree if you’re curious. The social media disambiguator project has example code (learn1_biasvar.py) to generate this plot.

4 Comments | Tags: ArtificialIntelligence, Python, SocialMediaBrandDisambiguator

28 June 2013 - 18:52 Visualising True Positives and False Positives against Features with scikit-learn

Here I’m starting to look into the errors caused in the social media brand disambiguator project. Below I look at true and false positives (correct and mistaken is-a-brand classifications) and plot them against the number of features that two different classifiers can use to calculate their class membership probabilities.

First I’m using the default LogisticRegression classifier. For both of these examples I’m using (1,3) n-grams (uni-, bi- and tri-grams) and a minimum document frequency of 2 occurrences for a term when building the Binary Vectorizer. The Vectorizer is constructed inside a 5-fold cross validation loop, so the number of features found varies a little per fold (you can see this in the two image titles – the title is generated using the final CV Vectorizer).
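For reference, vectorizer settings like these can be expressed with scikit-learn’s `CountVectorizer`. This is a sketch under the assumption that the “Binary Vectorizer” is a `CountVectorizer` with `binary=True`; the four example tweets are invented and the cross validation loop is left out.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented example tweets (not from the hand-tagged dataset).
tweets = [
    "apple launches new iphone http",
    "i ate an apple pie",
    "apple launches new ipad http",
    "apple pie recipe",
]

# (1,3) n-grams, terms must occur in at least 2 documents, 0/1 values only.
vectorizer = CountVectorizer(ngram_range=(1, 3), min_df=2, binary=True)
X = vectorizer.fit_transform(tweets)

print(sorted(vectorizer.vocabulary_))
print(X.toarray())
```

Terms seen in only one tweet (such as “iphone”) are dropped by `min_df=2`, while repeated bi- and tri-grams such as “apple launches new” become features of their own, which is where the redundancy between overlapping n-grams comes from.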


Class 1 (is-a-brand) results are ‘light blue’, they cluster towards the top of the graph (towards probability of 1 of being-in-class-1). Class 0 (is-not-a-brand) results cluster towards the bottom (towards a probability of 0 of being-in-class-1). There’s a lot of mixing around P(0.5) as the two classes aren’t separated terribly well.

We can see that the majority of the points (each circle ignoring which class it is in) have 1 to 10 features by looking along the x-axis, a few go up to over 50 features. Since the features include bi- and tri-grams we’ll see a lot of redundant features for these examples.

If we imagine drawing a threshold for is-class-1 above 0.89 then between all the cross validation test results (584 items across the 5 folds) I’d have 349 true positives (giving 100% precision, 59% recall). If I set the threshold to 0.78 then I’d have 422 true positives and 4 false positives (the 4 black dots above 0.78 giving 99% precision and 72% recall).
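The effect of moving the threshold can be reproduced on any list of class-1 probabilities and true labels. The helper below, `precision_recall_at`, is a hypothetical name written for this illustration, and the eight probabilities and labels are toy values rather than the cross validation results.

```python
def precision_recall_at(threshold, probabilities, labels):
    """Precision and recall when 'probability above threshold' means class 1."""
    predicted = [p > threshold for p in probabilities]
    tp = sum(1 for pred, y in zip(predicted, labels) if pred and y == 1)
    fp = sum(1 for pred, y in zip(predicted, labels) if pred and y == 0)
    fn = sum(1 for pred, y in zip(predicted, labels) if not pred and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy P(class 1) values and true labels (illustrative only).
probs = [0.95, 0.90, 0.80, 0.79, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 1, 0, 1, 0, 1, 0]

for threshold in (0.89, 0.78):
    precision, recall = precision_recall_at(threshold, probs, labels)
    print(f"threshold {threshold}: precision {precision:.2f}, recall {recall:.2f}")
```

Lowering the threshold admits the class 0 item at 0.79 as a false positive, trading some precision for extra recall, which is the same trade-off as in the 0.89 versus 0.78 comparison.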

Now I repeat the experiment with the same Vectorizer settings but changing the classifier to Bernoulli Naive Bayes. The diagram shows a much stronger separation between the two classes:


If I choose a threshold of 0.66 then I have 100% precision with 66% recall. If I choose 0.28 then I get 2 false positives giving 99.5% precision with 73% recall. It is nice to be able to visualise the class separations for each of the test rows, to both have a feel for how the classifier is doing and to view how changing the feature set (without modifying the classifier) changes the results.

Looking at these results I’d obviously want to diagnose what the false positive results look like, maybe that gives further ideas for features that could help to separate the two classes. The modifications to learn1_experiments.py are in this check-in on the github project.

3 Comments | Tags: Python, SocialMediaBrandDisambiguator