At this week’s PyDataLondon I did a 5 minute lightning talk on the Annotate text-cleaning service for data scientists that I made live recently. It was good to have a couple of chats after with others who are similarly bored of cleaning their text data.
The goal is to make it quick and easy to clean data so you don’t have to figure out a method yourself. Behind the scenes it uses ftfy to fix broken unicode, unidecode to remove foreign characters if needed and a mix of regular-expressions that are written on the fly depending on the data submitted.
I suspect that adding some datetime-fixers will be a next step (dealing with UK data when tools often assume that 1/3/13 is 3rd January in US-format is a pain), maybe a fact-extractor will follow.
Ian applies Data Science as an AI/Data Scientist for companies in ModelInsight and in his Mor Consulting, sign-up for Data Science tutorials in London. He also founded the image and text annotation API Annotate.io, lives in London and is a consumer of fine coffees.