Having returned from Chile last year, settled in to consulting in London, got married and now on honeymoon I’m planning on a change for June.
I’m taking the month off from clients to work on my own project, an open sourced brand disambiguator for social media. As an example this will detect that the following tweet mentions Apple-the-brand:
“I love my apple, though leopard can be a pain”
and that this tweet does not:
“Really enjoying this apple, very tasty”
I’ve used AlchemyAPI, OpenCalais, DBPedia Spotlight and others for client projects and it turns out that these APIs expect long-form text (e.g. Reuters articles) written with good English.
Tweets are short-form, messy, use colloquialisms, can be compressed (e.g. using contractions) and rely on local context (both local in time and social group). Linguistically a lot is expressed in 140 characters and it doesn’t look like”good English”.
A second problem with existing APIs is that they cannot be trained and often don’t know about European brands, products, people and places. I plan to build a classifier that learns whatever you need to classify.
Examples for disambiguation will include Apple vs apple (brand vs e.g. fruit/drink/pie), Seat vs seat (brand vs furniture), cold vs cold (illness vs temperature), ba (when used as an abbreviation for British Airways).
The goal of the June project will be to out-perform existing Named Entity Recognition APIs for well-specified brands on Tweets, developed openly with a liberal licence. The aim will be to solve new client problems that can’t be solved with existing APIs.
I’ll be using Python, NLTK, scikit-learn and Tweet data. I’m speaking on progress at BrightonPy and DataScienceLondon in June.
Probably for now I should focus on having no computer on my honeymoon…
Ian is a Chief Interim Data Scientist via his Mor Consulting. Sign-up for Data Science tutorials in London and to hear about his data science thoughts and jobs. He lives in London, is walked by his high energy Springer Spaniel and is a consumer of fine coffees.