Data science, Life, pydata, Python
April 3, 2015

PyDataParis 2015 and “Cleaning Confused Collections of Characters”

I’m at PyDataParis, the first PyData in France, and we have a 300-strong turn-out. In my talk I asked about the split of academic and industrial folk: about 70% are industrialists (at least among the 70 or so people in my talk). The bulk of the attendees are in the Intro track and maybe the split is different in there. All slides are up and videos are following; see them here.

Gael gave a really nice opening keynote on scikit-learn.

I spoke on data cleaning with text data. I packed quite a bit into my 40 minutes and got a nice set of questions. The slides are below; the talk covers:

Data extraction from text files, PDF, HTML/XML and images
Merging on columns of data
Correctly processing datetimes from files and the dangers of relying on the pandas defaults
Normalising text columns so we could join on otherwise messy data
Automated data transformation using my annotate.io (Python demo)
Ideas on automated feature extraction
Ideas on automating visualisation for new, messy datasets to get a “bird’s eye view”
Tips on getting started – make a Gold Standard!
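
To make the "normalising text columns" item above concrete, here is a minimal sketch using only the standard library. The function name `normalise_key` and the sample company names are hypothetical, made up for illustration; the talk itself may have used a different approach.

```python
import unicodedata


def normalise_key(text: str) -> str:
    """Reduce a messy text value to a canonical join key (illustrative only)."""
    # NFKD splits accented characters into a base character plus combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks so that "é" and "e" compare equal
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a more aggressive lower(); split/join collapses whitespace
    return " ".join(stripped.casefold().split())


# Hypothetical data: two sources that spell the same names differently
left = {"Société  Générale": 1, " ACME Ltd ": 2}
right = {"societe generale": "FR", "Acme LTD": "UK"}

# Index the right-hand side by normalised key, then look up each left-hand name
right_by_key = {normalise_key(name): value for name, value in right.items()}
joined = {
    name: (value, right_by_key.get(normalise_key(name)))
    for name, value in left.items()
}
```

Stripping accents and case is lossy, so it is worth checking the joined rows against a small hand-labelled Gold Standard before trusting the result.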

One question concerned the parsing of datetime strings from unusual sources. I’d mentioned dateutil’s parser in the talk; a second parser is delorean. I’ve also seen arrow (an extension of the standard datetime), which has a set of parsers including one for ISO 8601. The parsedatetime module has an NLP component that converts statements like “tomorrow” into a datetime.

I don’t know of other, better parsers. Do you? In particular I want one that’ll take a list of datetimes and return one consistent converter that isn’t confused by individually ambiguous instances (e.g. “1/1” could be read as MM/DD or DD/MM).
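
As a rough sketch of the "one consistent converter" idea, the following tries a fixed set of candidate formats against the whole list and keeps only those that parse every value. It uses the standard library's `datetime.strptime`; the candidate formats, the function names `infer_formats` and `parse_all`, and the sample dates are assumptions for illustration, not an existing library's API.

```python
from datetime import datetime

# Candidate formats are an assumption for this sketch; extend as needed
CANDIDATE_FORMATS = ("%d/%m/%Y", "%m/%d/%Y")


def infer_formats(values, candidates=CANDIDATE_FORMATS):
    """Return the candidate formats that successfully parse *every* value."""
    surviving = []
    for fmt in candidates:
        try:
            for value in values:
                datetime.strptime(value, fmt)
        except ValueError:
            continue  # at least one value contradicts this format
        surviving.append(fmt)
    return surviving


def parse_all(values, candidates=CANDIDATE_FORMATS):
    """Parse all values with the single format the whole list agrees on."""
    formats = infer_formats(values, candidates)
    if len(formats) != 1:
        # Zero formats: nothing fits. Several: the list is still ambiguous.
        raise ValueError(f"expected exactly one consistent format, found {formats}")
    return [datetime.strptime(value, formats[0]) for value in values]


# "25/12/2014" can only be day-first, which settles the ambiguous rows too
dates = ["01/01/2015", "03/04/2015", "25/12/2014"]
parsed = parse_all(dates)
```

If every value in the list is ambiguous, `parse_all` raises rather than guessing, which seems safer than silently picking a default.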

I’m also asking for feedback on the subject of automated feature extraction and automated column-join tools for messy data. If you’ve got ideas on these subjects I’d love to hear from you.

I was also reminded of DiffBot, which uses computer vision and NLP to extract meaning from web pages. I’ve never tried it; can any of you comment on its effectiveness? Olivier Grisel mentioned pyquery to me; it builds on lxml and lets you make jQuery-like queries on HTML.

Update: I should have mentioned chardet. It detects encodings (UTF-8, CP1252 etc.) from raw bytes, which is very useful if you’re trying to figure out the encoding for a collection of bytes from a random data source! libextract (write-up) looks like a young but nice tool for extracting text blocks from HTML/XML sources, as does goose. boltons is a nice collection of bolt-on tools for the standard library (e.g. timeutils, strutils, tableutils). mETL is possibly a useful tool for thinking about the extract, transform and load process.
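
When a detector like chardet isn't to hand, a much cruder fallback is to try a short list of likely encodings in order. This is a sketch of that simpler technique and not how chardet works (chardet makes a statistical guess); the function name `decode_with_fallback` and the candidate list are assumptions for illustration.

```python
def decode_with_fallback(raw: bytes, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this candidate cannot represent the bytes; try the next
    raise ValueError("none of the candidate encodings could decode the input")


# UTF-8 is tried first because non-UTF-8 bytes rarely decode as valid UTF-8
text_utf8, enc_utf8 = decode_with_fallback("café".encode("utf-8"))
text_cp, enc_cp = decode_with_fallback("café".encode("cp1252"))
```

Order matters here: latin-1 accepts any byte sequence, so it has to come last, and a clean decode only shows that the bytes are valid in that encoding, not that it is the right one.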

Update: It might also be worth noting some useful data sources from which you can extract semi-structured data, e.g. “tech tags” from stackexchange’s forums (and I also see a new hackernews dump). Here’s a big list of “awesome public datasets”.

Update: Peadar Coyle (@springcoil) gave a nice talk at PyConItaly 2015 on “Data Products – how to get models into production” which is related.

Camilla Montonen has just spoken on Rush Hour Dynamics, visualising London Underground behaviour. She noted graph-tool, a nice graphing/viz library I’d not seen before. Fabian has just shown me his new project, which collects and lists NLP IPython Notebooks and tries to extract titles or summaries (which is a gnarly sub-problem!). The AXA Data Innovation Lab have a nice talk on explaining machine learned models.

Gilles Louppe’s slides for his ML/sklearn talk on trees and boosting are online, as are Alexandre Gramfort’s on sklearn linear models.

Ian is a Chief Interim Data Scientist via his Mor Consulting. Sign up for Data Science tutorials in London and to hear about his data science thoughts and jobs. He lives in London, is walked by his high-energy Springer Spaniel and is a consumer of fine coffees.

18 Comments

Vasudev Ram
April 10, 2015 at 7:52 pm
Interesting post, Ian, thanks. libextract looks interesting, checking it out. graph-tool is good. I had blogged a bit about it here: http://jugad2.blogspot.in/2013/01/graph-tool-python-module-for-graph.html
Vasudev Ram
April 10, 2015 at 7:55 pm
Also, related to extracting text blocks from HTML sources, I wrote a small such tool here: http://jugad2.blogspot.in/2015/01/html-text-to-pdf-with-beautiful-soup.html
ianozsvald
May 21, 2015 at 1:44 pm
@maryumk1 @christianf21 cheers, had libextract in my @PyDataParis write-up but have added the write-up: http://t.co/zZEdhpYkPc
maryumk1
May 21, 2015 at 1:52 pm
@ianozsvald @christianf21 @PyDataParis ah okay already on top of it! (ofcourse!)
