About


This is Ian Ozsvald's blog (@IanOzsvald). I'm an entrepreneurial geek, a Data Science/ML/NLP/AI consultant, author of O'Reilly's High Performance Python book, co-organiser of PyDataLondon, a Pythonista, co-founder of ShowMeDo and also a Londoner. Here's a little more about me.


27 January 2015 - 23:51

Annotate.io self-learning text cleaner demo online

A few weeks ago I posted some notes on a self-learning text cleaning system, to be used by data scientists who don’t want to invest time cleaning their data by hand. I have a first demo online over at annotate.io (the demo code is on GitHub).

The intuition behind this is that we currently divert a lot of mental resource early in a project to cleaning data and a bunch of that can be spent just figuring out which libraries will help with the cleaning. What if we could just let the machine do that for us? We can then focus on digging into new data and figuring out how to solve the bigger problems.

With annotate.io you give it a list of “data you have” and a list of “data you want”, and it figures out how to transform the former into the latter. It generates a recipe; you then feed in new data and the recipe performs the cleaning for you. You don’t have to install any of the libraries it might use (that’s all server-side).
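To make the "data you have" / "data you want" idea concrete, here is a toy sketch that runs entirely locally. It is not Annotate.io's algorithm or API: the pool of cleaning steps, the function names `learn_recipe` and `apply_recipe`, and the brute-force search are all assumptions made up for illustration. It tries short sequences of candidate steps until one maps every example input to its desired output, then reuses that sequence on new data.

```python
# Toy illustration only: NOT Annotate.io's actual algorithm or API.
# All names below (CANDIDATE_STEPS, learn_recipe, apply_recipe) are hypothetical.
import re
from itertools import permutations

# A small, made-up pool of candidate cleaning steps.
CANDIDATE_STEPS = {
    "strip": str.strip,
    "lower": str.lower,
    "drop_currency": lambda s: re.sub(r"[£$€]", "", s),
    "drop_commas": lambda s: s.replace(",", ""),
}


def apply_recipe(recipe, text):
    """Apply each named step in order to a single string."""
    for name in recipe:
        text = CANDIDATE_STEPS[name](text)
    return text


def learn_recipe(have, want, max_steps=3):
    """Return the shortest step sequence mapping every 'have' item
    to its matching 'want' item, or None if no sequence fits."""
    for n in range(max_steps + 1):
        for recipe in permutations(CANDIDATE_STEPS, n):
            if all(apply_recipe(recipe, h) == w for h, w in zip(have, want)):
                return recipe
    return None


# "Data you have" and "data you want".
have = [" £1,200 ", "$35", "€4,000,000"]
want = ["1200", "35", "4000000"]

recipe = learn_recipe(have, want)
print(recipe)   # ('strip', 'drop_currency', 'drop_commas')

# Reuse the learned recipe on data it has not seen before.
cleaned = [apply_recipe(recipe, s) for s in ["  £9,999", "$12"]]
print(cleaned)  # ['9999', '12']
```

A brute-force search like this only works for a tiny pool of steps; it is here to show the shape of the workflow (examples in, recipe out, recipe applied to new data), not how a real service would scale it.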

Using Python 2.7 or 3.4 you can run the demo on GitHub (you need the requests library). You can sign up to the announce list if you’d like to be kept informed of developments.


Ian applies Data Science as an AI/Data Scientist for companies in ModelInsight; sign up for Data Science tutorials in London. Historically Ian ran Mor Consulting. He also founded the image and text annotation API Annotate.io, co-authored SocialTies, programs in Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

Tags: ArtificialIntelligence, Python