At this week’s PyDataLondon I gave a 5-minute lightning talk on Annotate, the text-cleaning service for data scientists that I recently made live. It was good to have a couple of chats afterwards with others who are similarly bored of cleaning their text data.
The goal is to make it quick and easy to clean data so you don’t have to figure out a method yourself. Behind the scenes it uses ftfy to fix broken Unicode, unidecode to transliterate non-ASCII characters to ASCII if needed, and a mix of regular expressions that are written on the fly depending on the data submitted.
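To give a feel for that kind of pipeline, here is a rough sketch that uses only the Python standard library. It is not Annotate's code: `fix_mojibake`, `to_ascii`, `collapse_whitespace` and `clean` are hypothetical names, and the first two are much simpler stand-ins for what ftfy and unidecode do (they only handle UTF-8 misread as Latin-1, and accent stripping rather than full transliteration).

```python
import re
import unicodedata


def fix_mojibake(text):
    """Undo UTF-8 text that was wrongly decoded as Latin-1.

    A tiny stand-in for ftfy, which handles many more cases.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not this kind of mojibake, so leave it alone.
        return text


def to_ascii(text):
    """Strip accents and drop anything that is still non-ASCII.

    A crude stand-in for unidecode, which transliterates instead of dropping.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def collapse_whitespace(text):
    """Replace runs of whitespace with a single space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def clean(text, ascii_only=False):
    """Run the cleaning steps in order; ASCII conversion is optional."""
    text = fix_mojibake(text)
    if ascii_only:
        text = to_ascii(text)
    return collapse_whitespace(text)


print(clean("  cafÃ©   au  lait "))
print(clean("  cafÃ©   au  lait ", ascii_only=True))
```

Keeping each step as its own small function makes it easy to switch a step off, or to swap the stand-ins for the real libraries.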
I suspect that adding some datetime-fixers will be the next step (dealing with UK data is a pain when tools often assume the US format, so that 1/3/13 is read as 3rd January rather than 1st March). Maybe a fact-extractor will follow.
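The date ambiguity is easy to show with the standard library's `datetime.strptime`. This is only an illustration of the problem, not a planned Annotate feature: `parse_date` and its `day_first` flag are hypothetical, and the sketch assumes two-digit years and `/` separators.

```python
from datetime import datetime


def parse_date(text, day_first=True):
    """Parse a d/m/yy or m/d/yy string, depending on the chosen convention."""
    # UK-style data is day-first; many tools default to the US month-first order.
    fmt = "%d/%m/%y" if day_first else "%m/%d/%y"
    return datetime.strptime(text, fmt).date()


uk = parse_date("1/3/13", day_first=True)
us = parse_date("1/3/13", day_first=False)
print(uk.isoformat())  # 2013-03-01
print(us.isoformat())  # 2013-01-03
```

The same string gives two valid but different dates, so the convention has to come from knowledge of the data source rather than from the string itself.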
Ian is a Chief Interim Data Scientist via his Mor Consulting. Sign up for Data Science tutorials in London and to hear about his data science thoughts and jobs. He lives in London, is walked by his high-energy Springer Spaniel and is a consumer of fine coffees.