Ian Ozsvald picture

This is Ian Ozsvald's blog (@IanOzsvald), I'm an entrepreneurial geek, a Data Science/ML/NLP/AI consultant, author of O'Reilly's High Performance Python book, co-organiser of PyDataLondon, a Pythonista, co-founder of ShowMeDo and also a Londoner. Here's a little more about me.

High Performance Python book with O'Reilly

View Ian Ozsvald's profile on LinkedIn

ModelInsight Data Science Consultancy London Protecting your bits. Open Rights Group


28 June 2013 - 18:52Visualising True Positives and False Positives against Features with scikit-learn

Here I’m starting to look into the errors caused in the social media brand disambiguator project. Below I look at true and false positives (correct and mistaken is-a-brand classifications) and plot them against the number of features that two different classifiers can use to calculate their class membership probabilities.

First I’m using the default LogisticRegression classifier. For both of these examples I’m using (1,3) n-grams (uni-, bi- and tri-grams) and a minimum document frequency of 2 occurrences for a term when building the Binary Vectorizer. The Vectorizer is constructed inside a 5-fold cross validation loop, so the number of features found varies a little per fold (you can see this in the two image titles – the title is generated using the final CV Vectorizer).


Class 1 (is-a-brand) results are ‘light blue’, they cluster towards the top of the graph (towards probability of 1 of being-in-class-1). Class 0 (is-not-a-brand) results cluster towards the bottom (towards a probability of 0 of being-in-class-1). There’s a lot of mixing around P(0.5) as the two classes aren’t separated terribly well.

We can see that the majority of the points (each circle ignoring which class it is in) have 1 to 10 features by looking along the x-axis, a few go up to over 50 features. Since the features include bi- and tri-grams we’ll see a lot of redundant features for these examples.

If we imagine drawing a threshold for is-class-1 above 0.89 then between all the cross validation test results (584 items across the 5 folds) I’d have 349 true positives (giving 100% precision, 59% recall). If I set the threshold to 0.78 then I’d have 422 true positives and 4 false positives (the 4 black dots above 0.78 giving 99% precision and 72% recall).

Now I repeat the experiment with the same Vectorizer settings but changing the classifier to Bernoulli Naive Bayes. The diagram shows a much stronger separation between the two classes:


If I choose a threshold of 0.66 then I have 100% precision with 66% recall. If I choose 0.28 then I get 2 false positives giving 99.5% precision with 73% recall. It is nice to be able to visualise the class separations for each of the test rows, to both have a feel for how the classifier is doing and to view how changing the feature set (without modifying the classifier) changes the results.

Looking at these results I’d obviously want to diagnose what the false positive results look like, maybe that gives further ideas for features that could help to separate the two classes. The modifications to learn1_experiments.py are in this check-in on the github project.

Ian applies Data Science as an AI/Data Scientist for companies in ModelInsight, sign-up for Data Science tutorials in London. Historically Ian ran Mor Consulting. He also founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

3 Comments | Tags: Python, SocialMediaBrandDisambiguator

20 June 2013 - 19:15Visualising the internals of Logistic Regression on a Text Matrix

Below I have some plots that visualise the term matrix (as a binary matrix and as a TF-IDF matrix) for the brand disambiguation project followed by a visualisation of the coefficients used in scikit-learn’s LogisticRegression classifier using l1 and l2 penalties.

Using a CountVectorizer with binary=True we can mark the absence or presence of a token in a tweet. This is generated using learn1.py with the –termmatrix argument. If you open the full version of the image you’ll see that Class 0 is the bottom half (below the red line) of the rows, Class 1 is the top half (with 1168 rows in total, equally split between the classes). The x-axis shows 1238 features (formed of all unigrams and bigrams by the default tokenizer with a minimum document frequency of 2). The strong white line on the left is for the token ‘apple’ which is present in all tweets. If you look carefully you can see that some terms occur more frequently in only one of the two classes (as we’d hope).


The repeated rows are due to retweets – they have the same terms so we get repeated sets of the same binary features. This probably distorts the learning a bit and these will be removed in a later experiment.

Next we do the same operation but use TF-IDF (wikipedia) to scale the values (so they’re in the range 0-1.0), higher values mean that the tokens are rarer and so should have more importance. Often you’d use TF-IDF to normalise for document length (longer documents have more words and so in a binary matrix you’d have more 1s represented). Tweets are of roughly the same length so I don’t think this is so useful, I don’t know what the effect will be on precision and recall in later testing. As you’d expect the most common terms (e.g. ‘apple’) now have a low value, a few rare words now have a high (bright) value.


An obvious question given the above plots is whether we can easily remove a number of the tokens due to them contributing little towards a classification, either because they’re mentioned equally for both classes (e.g. common English words would be mentioned roughly equally and would have no bearing on the classification problem) or because they occur so rarely that we don’t know if they truly represent a feature that should identify a class. This will be investigated soon.

Next we create a LogisticRegression classifier with learn1_coefficients.py and train it first with the default l2 penalty and then with an l1 penalty. We can see that the l1 penalty sets many of the coefficients to 0. In both plots we’re looking at the coefficients for each of 5 cross-fold models (the dark lines mean more models agree on the importance of the feature, each model is plotted with an alpha blend). For the l1 penalty I’ve annotated the 10 biggest coefficients for the positive and negative coefficients. This chart is plotted using a Binary CountVectorizer (from the first of the two examples above) – if I switch to a TF-IDF Vectorizer then I get a very similar visual output.


As you might expect for the apple-the-brand coefficients we see “cook, company, google, ipad, iphone, macbook, market, samsung, store, vatican” (I’ll explain “vatican” in a moment). For not-apple-the-brand we see “candy, caramel, cinnamon, eat, eye, girl, juice, orange, pie, tree“.

The inclusion of “vatican” happens because these tweets occur with the announcement of the latest Pope – various wags tweeted about topics like the “iPope” and so “vatican” is discussed alongside apple-the-brand. This also highlights the over-fitting that has inevitably occurred due to the current small sample of tweets for this experiment.

Clearly we could use the l1 penalty to perform feature selection, we can also use other methods. This is to follow.

Ian applies Data Science as an AI/Data Scientist for companies in ModelInsight, sign-up for Data Science tutorials in London. Historically Ian ran Mor Consulting. He also founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

2 Comments | Tags: Life, Python, SocialMediaBrandDisambiguator

17 June 2013 - 20:13Demonstrating the first Brand Disambiguator (a hacky, crappy classifier that does something useful)

Last week I had the pleasure of talking at both BrightonPython and DataScienceLondon to about 150 people in total (Robin East wrote-up the DataScience night). The updated code is in github.

The goal is to disambiguate the word-sense of a token (e.g. “Apple”) in a tweet as being either the-brand-I-care-about (in this case – Apple Inc.) or anything-else (e.g. apple sauce, Shabby Apple clothing, apple juice etc). This is related to named entity recognition, I’m exploring simple techniques for disambiguation. In both talks people asked if this could classify an arbitrary tweet as being “about Apple Inc or not” and whilst this is possible, for this project I’m restricting myself to the (achievable, I think) goal of robust disambiguation within the 1 month timeline I’ve set myself.

Below are the slides from the longer of the two talks at BrightonPython:

As noted in the slides for week 1 of the project I built a trivial LogisticRegression classifier using the default CountVectorizer, applied a threshold and tested the resulting model on a held-out validation set. Now I have a few more weeks to build on the project before returning to consulting work.

Currently I use a JSON file of tweets filtered on the term ‘apple’, obtained using the free streaming API from Twitter using cURL. I then annotate the tweets as being in-class (apple-the-brand) or out-of-class (any other use of the term “apple”). I used the Chromium Language Detector to filter non-English tweets and also discard English tweets that I can’t disambiguate for this data set. In total I annotated 2014 tweets. This set contains many duplicates (e.g. retweets) which I’ll probably thin out later, possibly they over-represent the real frequency of important tokens.

Next I built a validation set using 100 in- and 100 out-of-class tweets at random and created a separate test/train set with 584 tweets of each class (a balanced set from the two classes but ignoring the issue of duplicates due to retweets inside each class).

To convert the tweets into a dense matrix for learning I used the CountVectorizer with all the defaults (simple tokenizer [which is not great for tweets], minimum document frequency=1, unigrams only).

Using the simplest possible approach that could work – I trained a LogisticRegression classifier with all its defaults on the dense matrix of 1168 inputs. I then apply this classifier to the held-out validation set using a confidence threshold (>92% for in-class, anything less is assumed to be out-of-class). It classifies 51 of the 100 in-class examples as in-class and makes no errors (100% precision, 51% recall). This threshold was chosen arbitrarily on the validation set rather than deriving it from the test/train set (poor hackery on my part), but it satisfied me that this basic approach was learning something useful from this first data set.

The strong (but not generalised at all!) result for the very basic LogisticRegression classifier will be due to token artefacts in the time period I chose (March 13th 2013 around 7pm for the 2014 tweets). Extracting the top features from LogisticRegression shows that it is identifying terms like “Tim”, “Cook”, “CEO” as significant features (along with other features that you’d expect to see like “iphone” and “sauce” and “juice”) – this is due to their prevalence in this small dataset (in this set examples like this are very frequent). Once a larger dataset is used this advantage will disappear.

I’ve added some TODO items to the README, maybe someone wants to tinker with the code? Building an interface to the open source DBPediaSpotlight (based on WikiPedia data using e.g. this python wrapper) would be a great start for validating progress, along with building some naive classifiers (a capital-letter-detecting one and a more complex heuristic-based one, to use as controls against the machine learning approach).

Looking at the data 6% of the out-of-class examples are retweets and 20% of the in-class examples are retweets. I suspect that the repeated strings are distorting each class so I think they need to be thinned out so we just have one unique example of each tweet.

Counting the number of capital letters in-class and out-of-class might be useful, in this set a count of <5 capital letters per tweet suggests an out-of-class example:

This histogram of tweet lengths for in-class and out-of-class tweets might also suggest that shorter tweets are more likely to be out-of-class (though the evidence is much weaker):


Next I need to:

  • Update the docs so that a contributor can play with the code, this includes exporting a list of tweet-ids and class annotations so the data can be archived and recreated
  • Spend some time looking at the most-important features (I want to properly understand the numbers so I know what is happening), I’ll probably also use a Decision Tree (and maybe RandomForests) to see what they identify (since they’re much easier to debug)
  • Improve the tokenizer so that it respects some of the structure of tweets (preserving #hashtags and @users would be a start, along with URLs)
  • Build a bigger data set that doesn’t exhibit the easily-fitted unigrams that appear in the current set

Longer term I’ve got a set of Homeland tweets (to disambiguate the TV show vs references to the US Department and various sayings related to the term) which I’d like to play with – I figure making some progress here opens the door to analysing media commentary in tweets.

Ian applies Data Science as an AI/Data Scientist for companies in ModelInsight, sign-up for Data Science tutorials in London. Historically Ian ran Mor Consulting. He also founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

2 Comments | Tags: ArtificialIntelligence, Data science, Life, Python, SocialMediaBrandDisambiguator

17 June 2013 - 14:39Active Countermeasures for Privacy in a Social Networking age?

This is a bit of a rambling post covering some thoughts on data privacy, mobile phones and social networking.

A general and continued decrease in personal privacy  seems inevitable in our age of data (NSA Files at The Guardian). We generate a lot of data, we rarely know how or where it is stored and we don’t understand how easy it is to make certain inferences based on aggregated forms of our data. Cory Doctorow has some points on why we should care about this topic.

Will we now see the introduction of active countermeasures in a data stream by way of protest or camouflage by regular folk?

Update – hat tip to Kyran for prism-break.org, listing open-source alternatives to Operating Systems and communication clients/systems. I had a play earlier today with the Tor-powered Orweb on Android – it Just Worked and whatsmyip.org didn’t know where my device was coming from (running traceroute went from whatsmyip to the Tor entry node and [of course] no further). It seems that installing Tor on a raspberrypi or Tor on EC2 is pretty easy too (Tor runs faster when more people start Tor relays [which carry the internal encrypted traffic, so there’s none of the fear of running an edge nodes that sends the traffic onto the unencrypted Internet]). Here are some Tor network statistic graphs.

I’ve long been unhappy with the fact that my email is known to be transmitted and stored in the clear (accepting that I turn on HTTPS-only in Gmail). I’d really like for it to be readable only for the recipient, not for anyone (sysadmin or Government agency) along the chain. Maybe someone can tell me if adding PGP into Gmail via the browser and Android phone is an easy thing to do?

I’m curious to see how long it’ll be before we have a cypherpunk mobile OS, preconfigured with sensible defaults. CyanogenMod is an open build of Android (so you could double-check for Government backdoors [if you took the time]), there’s no good reason why a distro couldn’t be setup that uses Tor, HTTPSEverywhere (eff.org post on this combo, this Tor blog post comments on Tor vs PRISM) and Incognito Mode by default as a start for private web usage. Add on a secure and open source VoIP client (not Skype) and an IM tool and you’re most of the way there for better-than-normal-folk privacy.

Compared to an iOS device it’ll be a bit clunky (so maybe my mum won’t use it) but I’d like the option, even if I have to jump through a few hoops. You might also choose not to trust your handset provider, we’re just starting to see designs for build-it-yourself cellphones (albeit very basic non-data phones at present).

Maybe we’ll start to consider the dangers of entrusting our data to near-monopolies in the hope that they do no evil (and aren’t subject to US Government secret & uninvestigable
disclosures to people who we personally may or may not trust, and may or may not be decent, upright, solid, incorruptible citizens). Perhaps far-sighted governments in other countries will start to educate their citizens about the dangers of trusting US Data BigCorps (“Loose Lips Sink Ships“)?

So what about active countermeasures? For the social networking example above we’d look at communications traffic (‘friends’ are cheap to acquire but communication takes effort). What if we started to lie about who we talk to? What if my email client builds a commonly-communicated-with list and picks someone from outside of that list, then starts to send them reasonably sensible-looking emails automatically? Perhaps it contains a pre-agreed codeword, then their client responds at a sensible frequency with more made-up but intelligible text. Suddenly they appear to be someone I closely communicate with, but that’s a lie.

My email client knows this so I’m not bothered by it but an eavesdropper has to process this text. It might not pass human inspection but it ought to tie up more resources, forcing more humans to get involved, driving up the cost and slowing down response times. Maybe our email clients then seed these emails with provocative keywords in innocuous phrases (“I’m going to get the bomb now! The bomb is of course the name for my football”) which tie up simple keyword scanners.

The above will be a little like the war on fake website signups for spam being defeated by CAPTCHAs (and in turn defeating the CAPTCHAs), driving perhaps improvements in NLP technologies. I seem to recall that Hari Seldon in Asimov’s Foundation novels used auto-generated plausible speech generators to mask private in-person communications from external eavesdropping (I can’t find a reference – am I making this up?), this stuff doesn’t feel like science fiction any more.

Maybe with FourSquare people will practice fake check-ins. Maybe during a protest you comfortably sit at home and take part in remote virtual check-ins to spots that’ll upset the police (“quick! join the mass check-in in the underground coffee shop! the police will have to spend resources visiting it to see if we’re actually there!”). Maybe you’ll physically be in the protest but will send spoofed GPS co-ords with your check-ins pretending to be elsewhere.

Maybe people start to record and replay another person’s check-ins, a form of ‘identify theft’ where they copy the behaviour of another to mask their own movements?

Maybe we can extend this idea to photo sharing. Some level of face detection and recognition already exists and it is pretty good, especially if you bound the face recognition problem to a known social group. What if we use a graphical smart-paste to blend a person-of-interest’s face into some of our group photos? Maybe Julian Assange appears in background shots around London or a member of Barack Obama’s Government in photos from Iranian photobloggers?

The photos could be small and perhaps reasonably well disguised so they’re not obvious to humans, but obvious enough to good face detection & recognition algorithms. Again this ties up resources (and computer vision algorithms are terribly CPU-expensive). It would no doubt upset the intelligence services if it impacted their automated analysis, maybe this becomes a form of citizen protest?

Hidden Mickeys appear in lots of places (did you spot the one in Tron?), yet we don’t notice them. I’m pretty sure a smart paste could hide a small or distorted or rotated or blended image of a face in some photos, without too much degradation.

Figuring out who is doing what given the absence of information is another interesting area. With SocialTies (built by Emily and I) I could track who was at a conference via their Lanyrd sign-up, and also track people via nearby FourSquare check-ins and geo-tagged tweets (there are plenty of geo-tagged tweets in London…). Inferring where you were was quite possible, even if you only tweeted (and had geo-locations enabled). Double checking your social group and seeing that friends are marked as attending the event that you are near only strengthens the assertion that you’re also present.

Facebook typically knows the address book of your friends, so even if you haven’t joined the service it’ll still have your email. If 5 members of Facebook have your email address then that’s 5 directed edges in a social network graph pointing at a not-yet-active profile with your name on it. You might never join Facebook but they still have your email, name and some of your social connections. You can’t make those edges disappear. You just leaked your social connectivity without ever going near the service.

Anyhow, enough with the prognostications. Privacy is dead. C’est la vie. As long as we trust the good guys to only be good, nothing bad can happen.

Ian applies Data Science as an AI/Data Scientist for companies in ModelInsight, sign-up for Data Science tutorials in London. Historically Ian ran Mor Consulting. He also founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

2 Comments | Tags: Life

17 June 2013 - 11:02Open Sourcing “The Screencasting Handbook”

Back in 2010 I released the finished version of my first commercial eBook The Screencasting Handbook. It was 129 pages of distilled knowledge for the budding screencaster, written in part to introduce my (then) screencasting company ProCasts to the world (which I sold years back) and based on experience teaching through ShowMeDo. Today I release the Handbook under a Creative Commons License. After 3 years the content is showing its age (the procedures are good, the software-specific information is well out of date), I moved out of screencasting a while back and have no plans to update this book.

The download link for the open sourced version is at thescreencastinghandbook.com.

I’m using the Creative Commons Unported license – it allows anyone to derive a new version and/or make commercial usage without requiring any additional permissions from me, it does require attribution. This is the most open license I can give that still gives me a little bit of value (by way of attribution). The license must not be modified.

If someone would like to derive an updated version (with or without a price tag) you are very welcome to – just remember to attribute back to the original site and to this site with my name please (as noted at the download point). You can not change the license (but if you wanted to make a derived and non-open-source version of the book for commercial use, I’m sure we can come to an arrangement).

Previously I’ve discussed how I wrote the Handbook in an open, collaborative fashion (with monthly chapter releases to the preview audience), this was a good procedure that I’d use again. Other posts discussing the Handbook are under the “screencasting-handbook” tag.

Ian applies Data Science as an AI/Data Scientist for companies in ModelInsight, sign-up for Data Science tutorials in London. Historically Ian ran Mor Consulting. He also founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

8 Comments | Tags: The Screencasting Handbook

3 June 2013 - 20:24Social Media Brand Disambiguator first steps

As noted a few days back I’m spending June working on a social-media focused brand disambiguator using Python, NLTK and scikit-learn. This project has grown out of frustrations using existing Named Entity Recognition tools (like OpenCalais and DBPediaSpotlight) to recognise brands in social media messages. These tools are generally trained to work on long-form clean text and tweets are anything but long or cleanly written!

The problem is this: in a short tweet (e.g. “Loving my apple, like how it werks with the iphon”) we have little context to differentiate the sense of the word “apple”. As a human we see the typos and deliberate spelling errors and know that this use of “apple” is for the brand, not for the fruit. Existing APIs don’t make this distinction, typically they want a lot more text with fewer text errors. I’m hypothesising that with a supervised learning system (using scikit-learn and NLTK) and hand tagged data I can outperform the existing APIs.

I started on Saturday (freshly back from honeymoon), a very small github repo is online. Currently I can ingest tweets from a JSON file (captured using curl), marking the ones with a brand and those with the same word but not-a-brand (in-class and out-of-class) in a SQLite db. I’ll benchmark my results against my hand-tagged Gold Standard to see how I do.

Currently I’m using my Python template to allow environment-variable controlled configurations, simple logging, argparse and unittests. I’ll also be using the twitter text python module that I’m now supporting to parse some structure out of the tweets.

I’ll be presenting my progress next week at Brighton Python, my goal is to have a useful MIT-licensed tool that is pre-trained with some obvious brands (e.g. Apple, Orange, Valve, Seat) and software names (e.g. Python, vine, Elite) by the end of this month, with instructions so anyone can train their own models. Assuming all goes well I can then plumb it into my planned annotate.io online service later.

Ian applies Data Science as an AI/Data Scientist for companies in ModelInsight, sign-up for Data Science tutorials in London. Historically Ian ran Mor Consulting. He also founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

1 Comment | Tags: ArtificialIntelligence, Python, SocialMediaBrandDisambiguator