I spoke in the morning on Data Cleaning on Text to Prepare for Data Analysis and Machine Learning (which is a terribly verbose title, sorry!). I covered 10 years of lessons learned working with NLP on (often crappy) text data, and ways to clean it up to make it easier to work with. Topics covered:
- decoding bytes into Unicode (including chardet, ftfy and Chromium's Compact Language Detector) to get past the dreaded UnicodeDecodeError
- validating that a new dataset looks like a previous, trusted dataset (I’m thinking of writing a tool for this – would that be useful to you?)
- automatically transforming data from “what I have” to “what I want” with annotate.io (now public!), without writing regexps
- manual approaches to normalisation (the stuff I do that started me thinking about annotate.io)
- visualisation with GlueViz, Seaborn and csv-fingerprint
- starting your first ML project
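To make the decoding point concrete, here's a minimal, standard-library-only sketch of the first step: try a few likely encodings in order before reaching for heavier tools. The helper name and encoding list are my own illustration, not code from the talk; in practice you'd often ask chardet for a guess first and run ftfy.fix_text over the result to repair mojibake.

```python
def decode_bytes(raw, encodings=("utf-8", "cp1252", "latin-1")):
    """Decode raw bytes by trying likely encodings in order.

    latin-1 maps every byte to a character, so with it as the last
    entry this never raises - though the result may be mojibake,
    which a tool like ftfy can often repair afterwards.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # wrong guess - try the next encoding
    raise ValueError(f"could not decode with any of {encodings}")

# decode_bytes(b"caf\xc3\xa9") -> ("café", "utf-8")
# decode_bytes(b"caf\xe9")     -> ("café", "cp1252"); utf-8 fails on 0xe9
```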
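The dataset-validation idea (does this new data look like the trusted data I built my model on?) could start as simply as comparing per-column profiles. This is a hypothetical sketch of what such a tool might check, not an existing library; all names and the drift tolerance are my own.

```python
def column_profile(values):
    """Summarise one column: missing rate, cardinality, observed types."""
    present = [v for v in values if v is not None]
    return {
        "missing_frac": 1 - len(present) / len(values),
        "n_unique": len(set(present)),
        "types": {type(v).__name__ for v in present},
    }

def looks_like(trusted, new, tol=0.1):
    """Flag columns whose profile has drifted from the trusted snapshot."""
    problems = []
    for col in trusted:
        t, n = trusted[col], new[col]
        if t["types"] != n["types"]:
            problems.append((col, "type change"))
        if abs(t["missing_frac"] - n["missing_frac"]) > tol:
            problems.append((col, "missing-rate drift"))
    return problems
```

A real tool would also compare value ranges, string patterns and distributions, but even this much catches the common "someone exported the column as strings this month" failure.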
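And for the manual normalisation topic, the pipeline often boils down to a few standard-library steps like these (an illustrative sketch of common steps, not the exact code from the talk):

```python
import re
import unicodedata

def normalise(text):
    """One common manual normalisation pipeline (illustrative, not exhaustive)."""
    text = unicodedata.normalize("NFKC", text)  # fold compatibility forms, e.g. "ﬁ" -> "fi"
    text = text.replace("\u2018", "'").replace("\u2019", "'")  # curly -> straight quotes
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    text = re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace
    return text.lower()
```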
Here are the slides:
Thanks to Enthought and the org-team for a lovely conference!
Ian is a Chief Interim Data Scientist via his Mor Consulting. Sign up for Data Science tutorials in London and to hear about his data science thoughts and jobs. He lives in London, is walked by his high-energy Springer Spaniel and is a consumer of fine coffees.