About

This is Ian Ozsvald's blog (@IanOzsvald). I'm an entrepreneurial geek, a Data Science/ML/NLP/AI consultant, author of O'Reilly's High Performance Python book, co-organiser of PyDataLondon, a Pythonista, co-founder of ShowMeDo and also a Londoner. Here's a little more about me.

28 August 2015 - 11:27

EuroSciPy 2015 and Data Cleaning on Text for ML (talk)

I’m at EuroSciPy 2015, where we have 2 days of Pythonistic Science in Cambridge. Next year's conference will be in Bavaria; you can sign up for announcements.

I spoke in the morning on Data Cleaning on Text to Prepare for Data Analysis and Machine Learning (which is a terribly verbose title, sorry!). I covered 10 years of lessons learned working with NLP on (often crappy) text data, and ways to clean it up to make it easy to work with. Topics covered:

  • decoding bytes into Unicode (including chardet, ftfy and Chromium's language detector) to step past the UnicodeDecodeError
  • validating that a new dataset looks like a previous+trusted dataset (I’m thinking of writing a tool for this – would that be useful to you?)
  • automatically transforming data from “what I have” to “what I want” with annotate.io without writing regexps (now public)!
  • manual approaches to normalisation (the stuff I do that started me thinking on annotate.io)
  • visualisation with GlueViz, Seaborn and csv-fingerprint
  • starting your first ML project
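
To make the decoding and manual normalisation topics concrete, here is a minimal sketch that is not taken from the talk itself. It uses only the Python standard library: where the talk mentions chardet and ftfy for guessing encodings and repairing mojibake, this sketch simply tries a fixed list of encodings in order. The names `CANDIDATE_ENCODINGS`, `decode_bytes` and `normalise` are hypothetical, and the fallback order is an assumption you would tune for your own data.

```python
import unicodedata

# Hypothetical fallback order (an assumption); tune it for the data you receive.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252")


def decode_bytes(raw, encodings=CANDIDATE_ENCODINGS):
    """Return (text, encoding_used) for the first encoding that decodes cleanly."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this never raises,
    # but the result may be mojibake and should be flagged for review.
    return raw.decode("latin-1"), "latin-1"


def normalise(text):
    """Apply NFKC normalisation and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.split())


if __name__ == "__main__":
    for raw in (b"caf\xc3\xa9", b"caf\xe9", b"\x81"):
        text, used = decode_bytes(raw)
        print(used, repr(normalise(text)))
```

Returning the encoding alongside the text lets you count how often each fallback fires, which is a cheap signal that a new batch of data differs from what you have seen before.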

Here are the slides:

 

Thanks to Enthought and the org-team for a lovely conference!


Ian applies Data Science as an AI/Data Scientist for companies at ModelInsight; sign up for Data Science tutorials in London. Historically Ian ran Mor Consulting. He also founded the image and text annotation API Annotate.io, co-authored SocialTies, programs in Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

Tags: Data science, pydata, Python