About

Ian Ozsvald picture

This is Ian Ozsvald's blog, I'm an entrepreneurial geek, a Data Science/ML/NLP/AI consultant, founder of the Annotate.io social media mining API, author of O'Reilly's High Performance Python book, co-organiser of PyDataLondon, co-founder of the SocialTies App, author of the A.I.Cookbook, author of The Screencasting Handbook, a Pythonista, co-founder of ShowMeDo and FivePoundApps and also a Londoner. Here's a little more about me.

High Performance Python book with O'Reilly View Ian Ozsvald's profile on LinkedIn Visit Ian Ozsvald's data science consulting business Protecting your bits. Open Rights Group

30 January 2011 - 19:49Review for Python Text Processing with NLTK 2.0 Cookbook (Packt, 2010)

Python Text Processing with NLTK 2.0 Cookbook (Amazon US, UK) is a cookbook for Python’s Natural Language Processing Toolkit. I’d suggest that this book is seen as a companion for O’Reilly’s Natural Language Processing with Python (available for free at nltk.org). The older O’Reilly book gives a lot of explanation for how to use NLTK’s component, Packt’s new book shows you lots of little recipes which build to larger projects giving you a great hands-on toolkit.

Overall the book is easy to read, has a huge set of sample recipes and feels very useful. I’ll be referring to it for our upcoming @socialties mobile app.

You’ll need to download NLTK, you can also refer to some sample articles at Packt’s site and get Chapter 3 as a free PDF (see below). The author is Jacob Perkins, his blog links to many related articles, he also has a nice ‘how it started‘ article.

Here are my thoughts on the book. Disclosure – I was sent a free copy of the book by Packt for review, the thoughts below are entirely my own.

Chapter 1: Tokenizing Text and WordNet Basics

If you haven’t tried tokenising text before you may not realise how complicated it can be (expressing even basic rules for English is jolly hard!). This chapter has a good overview of tokenisation and the excellent WordNet library. Filtering stopwords (low value words like ‘the’, ‘of’) and synsets approaches (synonym groups in WordNet) are also covered. The word similarity measure was new to me, the book certainly throws up nice nuggets.

Chapter 2: Replacing and Correcting Words

Stemming approaches are covered, the goal is to find common root words (e.g. “running”, “runs” and “run” can each have “run” as their stem) to simplify your input text. Synonym replacement (e.g. converting “bday” to “birthday”) and negating words using antonyms are nicely treated. Babelfish is provided through NLTK for translation and the PyEnchant spellchecker is introduced.

Chapter 3: Creating Custom Corpora (sample PDF chapter)

This chapter discusses MongoDB (a NoSQL document store) as a way to store your own corpora in NLTK’s format, it also introduces part of speech tagging. File locking using lockfile is mentioned in case you’re using multiple processes (discussed later).

Chapter 4: Part-of-Speech Tagging and Chapter 5: Extracting Chunks

I was less interested in this part, I’ve had to extract Named Entities before and there’s a nice discussion in Chapter 5.

Chapter 6: Transforming Chunks and Trees

The section on filtering out insignificant words using part of speech tags was interesting (i.e. using the Determiner tag DT to filter words like “a”, “all”, “an”, “that”, “that”). Cardinals (numbers) are discussed, I liked the recipe for swapping noun cardinal phrases so e.g. “Dec 10″ becomes “10 Dec” (whilst “10 Dec” doesn’t change).

Chapter 7: Text Classification

This feels like it will be useful – bag of words classification and the Naive Bayes Classifier are discussed (along with some other classifiers). Here the author starts to build a movie rating classifier. Precision and Recall are explained nicely. A high-information classifier is built, this is useful as we can then remove low-information words (those that aren’t biased to a single class in the classifier) which can improve classification results. Combining classifiers to further improve results is also covered.

Chapter 8: Distributed Processing and Handling Large Datasets

This chapter has promise – I wasn’t aware of the share-nothing distributed execution engine execnet. Redis is also used, Jacob builds towards a distributed word scoring engine which uses Redis as a single storage system. I’ve yet to use Redis but really want to hook it into our future @socialtiesapp, distributed processing will definitely be on the agenda too.

Chapter 9: Parsing Specific Data

This is a little gem, tucked at the end of the book. Ages ago I’d come across a date parsing module (which I then forgot about), having needed it recently I was super-happy to see dateutil discussed. It makes the parsing of different date formats incredibly easy and also handles timezones.

The timex module in NLTK is introduced (I’d never heard of it before) – it takes a fuzzy reference to a date or time and marks it up. An example would be “let’s go sometime <TIMEX2>this week</TIMEX2>”, you can then extract the fuzzy reference and decide how to interpret it in your application.

lxml, Beautiful Soup and chardet (another gem) are used to write a web page scraper.

Overall I recommend this book, if you have the original O’Reilly book (and you really ought to) then this makes for a great companion. I also spotted these two other reviews.


Ian applies Data Science as an AI/Data Scientist for companies in ModelInsight, sign-up for Data Science tutorials in London. Historically Ian ran Mor Consulting. He also founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

6 Comments | Tags: ArtificialIntelligence, Python

1 January 2011 - 16:02And on with 2011

Earlier this year I switched back to my long-running artificial intelligence and CUDA high performance computing consultancy work through Mor Consulting. Having sold ProCasts earlier in the year and moved entirely away from screencasting (leaving ShowMeDo in Kyran’s capable hands) I wanted to get back to the nitty gritty of low level algorithms and implementations.

My A.I.Cookbook project is coming along, albeit very slowly over the last few months. I’m looking forward to starting a few new projects in the Cookbook, the OCR project against the OpenPlaque images needs finishing first (I see more prizes ahead there…).

Emily and I have some joint A.I./mobile projects to publish this year and I’m always on the look-out for interesting parallel, high performance computing and artificial intelligence problems – if you have a problem in this area please do get in touch.

Lee Tucknott did the design for Mor Consulting and has provided a design for the A.I.Cookbook (which I’ve yet to get implemented), if you need a beautiful design do go see his work.

And now, to push on with Social Ties, our first A.I./mobile product for January which’ll help you find interesting people at events…


Ian applies Data Science as an AI/Data Scientist for companies in ModelInsight, sign-up for Data Science tutorials in London. Historically Ian ran Mor Consulting. He also founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

No Comments | Tags: Life