As noted a few days back I’m spending June working on a social-media focused brand disambiguator using Python, NLTK and scikit-learn. This project has grown out of frustrations using existing Named Entity Recognition tools (like OpenCalais and DBPediaSpotlight) to recognise brands in social media messages. These tools are generally trained to work on long-form clean text and tweets are anything but long or cleanly written!
The problem is this: in a short tweet (e.g. “Loving my apple, like how it werks with the iphon”) we have little context to differentiate the sense of the word “apple”. As a human we see the typos and deliberate spelling errors and know that this use of “apple” is for the brand, not for the fruit. Existing APIs don’t make this distinction, typically they want a lot more text with fewer text errors. I’m hypothesising that with a supervised learning system (using scikit-learn and NLTK) and hand tagged data I can outperform the existing APIs.
I started on Saturday (freshly back from honeymoon), a very small github repo is online. Currently I can ingest tweets from a JSON file (captured using curl), marking the ones with a brand and those with the same word but not-a-brand (in-class and out-of-class) in a SQLite db. I’ll benchmark my results against my hand-tagged Gold Standard to see how I do.
Currently I’m using my Python template to allow environment-variable controlled configurations, simple logging, argparse and unittests. I’ll also be using the twitter text python module that I’m now supporting to parse some structure out of the tweets.
I’ll be presenting my progress next week at Brighton Python, my goal is to have a useful MIT-licensed tool that is pre-trained with some obvious brands (e.g. Apple, Orange, Valve, Seat) and software names (e.g. Python, vine, Elite) by the end of this month, with instructions so anyone can train their own models. Assuming all goes well I can then plumb it into my planned annotate.io online service later.
Ian applies Data Science as an AI/Data Scientist for companies in ModelInsight and in his Mor Consulting, sign-up for Data Science tutorials in London. He also founded the image and text annotation API Annotate.io, lives in London and is a consumer of fine coffees.