I spoke in the morning on Data Cleaning on Text to Prepare for Data Analysis and Machine Learning (which is a terribly verbose title, sorry!). I covered 10 years of lessons learned working with NLP on (often crappy) text data, and ways to clean it up to make it easier to work with. Topics covered:
- decoding bytes into Unicode (including chardet, ftfy and Chromium's Compact Language Detector) to get past the dreaded UnicodeDecodeError
- validating that a new dataset looks like a previous, trusted dataset (I’m thinking of writing a tool for this – would that be useful to you?)
- automatically transforming data from “what I have” to “what I want” with annotate.io (now public!), without writing regexps
- manual approaches to normalisation (the stuff I do that started me thinking about annotate.io)
- visualisation with GlueViz, Seaborn and csv-fingerprint
- starting your first ML project
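To make the decoding point concrete, here's a minimal, standard-library-only sketch of the first step: try a few likely encodings in order before reaching for heavier tools. The helper name and encoding list are my own illustration, not code from the talk; in practice you'd often ask chardet for a guess first and run ftfy.fix_text over the result to repair mojibake.

```python
def decode_bytes(raw, encodings=("utf-8", "cp1252", "latin-1")):
    """Decode raw bytes by trying likely encodings in order.

    latin-1 maps every byte to a character, so with it as the last
    entry this never raises - though the result may be mojibake,
    which a tool like ftfy can often repair afterwards.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # wrong guess - try the next encoding
    raise ValueError(f"could not decode with any of {encodings}")

# decode_bytes(b"caf\xc3\xa9") -> ("café", "utf-8")
# decode_bytes(b"caf\xe9")     -> ("café", "cp1252"); utf-8 fails on 0xe9
```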
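The dataset-validation idea (does this new data look like the trusted data I built my model on?) could start as simply as comparing per-column profiles. This is a hypothetical sketch of what such a tool might check, not an existing library; all names and the drift tolerance are my own.

```python
def column_profile(values):
    """Summarise one column: missing rate, cardinality, observed types."""
    present = [v for v in values if v is not None]
    return {
        "missing_frac": 1 - len(present) / len(values),
        "n_unique": len(set(present)),
        "types": {type(v).__name__ for v in present},
    }

def looks_like(trusted, new, tol=0.1):
    """Flag columns whose profile has drifted from the trusted snapshot."""
    problems = []
    for col in trusted:
        t, n = trusted[col], new[col]
        if t["types"] != n["types"]:
            problems.append((col, "type change"))
        if abs(t["missing_frac"] - n["missing_frac"]) > tol:
            problems.append((col, "missing-rate drift"))
    return problems
```

A real tool would also compare value ranges, string patterns and distributions, but even this much catches the common "someone exported the column as strings this month" failure.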
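And for the manual normalisation topic, the pipeline often boils down to a few standard-library steps like these (an illustrative sketch of common steps, not the exact code from the talk):

```python
import re
import unicodedata

def normalise(text):
    """One common manual normalisation pipeline (illustrative, not exhaustive)."""
    text = unicodedata.normalize("NFKC", text)  # fold compatibility forms, e.g. "ﬁ" -> "fi"
    text = text.replace("\u2018", "'").replace("\u2019", "'")  # curly -> straight quotes
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    text = re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace
    return text.lower()
```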
Here are the slides:
Thanks to Enthought and the org-team for a lovely conference!
Ian is a Chief Interim Data Scientist via his Mor Consulting. Sign up for Data Science tutorials in London and to hear about his data science thoughts and jobs. He lives in London, is walked by his high-energy Springer Spaniel and is a consumer of fine coffees.