A few weeks ago I posted some notes on a self-learning text cleaning system, to be used by data scientists who don’t want to invest time cleaning their data by hand. I have a first demo online over at annotate.io (the demo code is here on GitHub).
The intuition behind this is that we currently divert a lot of mental effort early in a project to cleaning data, and much of that goes on just figuring out which libraries will help with the cleaning. What if we could let the machine do that for us? We could then focus on digging into new data and figuring out how to solve the bigger problems.
With annotate.io you give it a list of “data you have” and “data you want”, and it figures out how to transform the former into the latter. You then feed new data into the recipe it generates and it performs the cleaning for you. You don’t have to install any of the libraries it might use (that’s all server-side).
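To make the “data you have” / “data you want” idea concrete, here is a toy sketch of learning a recipe from examples. It is not annotate.io’s implementation or API: the candidate transforms, the names `CANDIDATES`, `learn_recipe` and `apply_recipe`, and the brute-force search are all assumptions made up for illustration, and everything runs locally rather than server-side.

```python
from itertools import permutations

# Hypothetical pool of cleaning steps; a real system would have far more.
CANDIDATES = {
    "strip": lambda s: s.strip(),
    "lower": lambda s: s.lower(),
    "remove_commas": lambda s: s.replace(",", ""),
    "drop_currency": lambda s: s.lstrip("£$"),
}


def apply_recipe(recipe, text):
    """Apply each named transform in order to a single string."""
    for name in recipe:
        text = CANDIDATES[name](text)
    return text


def learn_recipe(have, want, max_steps=3):
    """Brute-force search for the shortest sequence of transforms that
    maps every item in `have` to the matching item in `want`."""
    for n in range(max_steps + 1):
        for recipe in permutations(CANDIDATES, n):
            if all(apply_recipe(recipe, h) == w for h, w in zip(have, want)):
                return list(recipe)
    return None  # nothing in the pool explains the examples


# Training examples: data you have, and data you want.
have = [" £1,200 ", "$30 "]
want = ["1200", "30"]
recipe = learn_recipe(have, want)

# Reuse the learned recipe on new, unseen data.
cleaned = [apply_recipe(recipe, s) for s in ["£4,500", " $7"]]
print(recipe, cleaned)
```

The point of the sketch is only the workflow: examples go in, a reusable recipe comes out, and new data is cleaned by replaying that recipe.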
Using Python 2.7 or 3.4 you can run the demo on GitHub (you need the requests library). You can sign up to the announce list if you’d like to be kept informed of developments.
Ian is a Chief Interim Data Scientist via his Mor Consulting. Sign up for Data Science tutorials in London and to hear about his data science thoughts and jobs. He lives in London, is walked by his high-energy Springer Spaniel and is a consumer of fine coffees.