Review for Python Text Processing with NLTK 2.0 Cookbook (Packt, 2010)

Python Text Processing with NLTK 2.0 Cookbook (Amazon US, UK) is a cookbook for Python’s Natural Language Toolkit (NLTK). I’d suggest treating this book as a companion to O’Reilly’s Natural Language Processing with Python (available for free at nltk.org). The older O’Reilly book gives a lot of explanation of how to use NLTK’s components; Packt’s new book shows you lots of little recipes which build to larger projects, giving you a great hands-on toolkit.

Overall the book is easy to read, has a huge set of sample recipes and feels very useful. I’ll be referring to it for our upcoming @socialties mobile app.

You’ll need to download NLTK. You can also refer to some sample articles at Packt’s site and get Chapter 3 as a free PDF (see below). The author is Jacob Perkins; his blog links to many related articles and he also has a nice ‘how it started’ article.

Here are my thoughts on the book. Disclosure – I was sent a free copy of the book by Packt for review; the thoughts below are entirely my own.

Chapter 1: Tokenizing Text and WordNet Basics

If you haven’t tried tokenising text before you may not realise how complicated it can be (expressing even basic rules for English is jolly hard!). This chapter has a good overview of tokenisation and the excellent WordNet library. Filtering stopwords (low-value words like ‘the’, ‘of’) and synsets (synonym groups in WordNet) are also covered. The word similarity measure was new to me; the book certainly throws up nice nuggets.
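To give a flavour of the idea, here is a rough sketch of tokenising and stopword filtering using only the standard library. It is my own illustration and not code from the book: the regular expression, the tiny STOPWORDS set and the function names are all assumptions, and NLTK's own tokenizers and stopword corpus are far more thorough.

```python
import re

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch);
# NLTK ships a much fuller one.
STOPWORDS = {"the", "of", "a", "an", "and", "is", "to", "in"}


def tokenize(text):
    """Split text into lowercase word tokens, keeping contractions like "can't" whole."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop low-value words that appear in the stopword set."""
    return [t for t in tokens if t not in stopwords]


tokens = tokenize("The author of the book can't stop tokenising!")
filtered = remove_stopwords(tokens)
print(filtered)
```

Even this toy version shows why the problem is fiddly: the contraction needs a special case in the pattern, and real text has many more such cases.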

Chapter 2: Replacing and Correcting Words

Stemming approaches are covered; the goal is to find common root words (e.g. “running”, “runs” and “run” can each have “run” as their stem) to simplify your input text. Synonym replacement (e.g. converting “bday” to “birthday”) and negating words using antonyms are nicely treated. Babelfish is provided through NLTK for translation and the PyEnchant spellchecker is introduced.
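The two ideas above can be sketched in a few lines of plain Python. This is a toy illustration under my own assumptions, not the book's recipes: naive_stem only knows two suffix rules (real stemmers such as NLTK's Porter stemmer have many more), and the REPLACEMENTS mapping and both function names are hypothetical.

```python
# Hypothetical replacement table (an assumption for this sketch).
REPLACEMENTS = {"bday": "birthday", "pls": "please"}


def naive_stem(word):
    """Strip a trailing 'ing' or 's' using two crude rules; not a real stemmer."""
    if word.endswith("ing") and len(word) > 5:
        stem = word[:-3]
        # Undo consonant doubling, e.g. "runn" -> "run".
        if len(stem) > 2 and stem[-1] == stem[-2]:
            stem = stem[:-1]
        return stem
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def replace_words(tokens, mapping=REPLACEMENTS):
    """Swap any token found in the mapping for its replacement."""
    return [mapping.get(t, t) for t in tokens]


stems = [naive_stem(w) for w in ["running", "runs", "run"]]
replaced = replace_words(["happy", "bday", "pls", "come"])
print(stems, replaced)
```

The dictionary lookup is the whole trick for replacement; the stemmer is where the real work lives, which is why a library implementation is worth using.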

Chapter 3: Creating Custom Corpora (sample PDF chapter)

This chapter discusses MongoDB (a NoSQL document store) as a way to store your own corpora in NLTK’s format; it also introduces part of speech tagging. File locking using lockfile is mentioned in case you’re using multiple processes (discussed later).

Chapter 4: Part-of-Speech Tagging and Chapter 5: Extracting Chunks

I was less interested in this part, though I’ve had to extract Named Entities before and there’s a nice discussion of that in Chapter 5.

Chapter 6: Transforming Chunks and Trees

The section on filtering out insignificant words using part of speech tags was interesting (i.e. using the Determiner tag DT to filter words like “a”, “all”, “an”, “that”). Cardinals (numbers) are discussed; I liked the recipe for swapping noun cardinal phrases so e.g. “Dec 10” becomes “10 Dec” (whilst “10 Dec” doesn’t change).
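Both transformations can be sketched on a list of (word, tag) pairs. This is my own minimal version, not the book's code: the function names, the default tag_suffixes value and the choice to swap only the first noun/cardinal pair are assumptions made for illustration.

```python
def filter_insignificant(chunk, tag_suffixes=("DT",)):
    """Drop (word, tag) pairs whose tag ends with one of tag_suffixes."""
    suffixes = tuple(tag_suffixes)
    return [(word, tag) for (word, tag) in chunk if not tag.endswith(suffixes)]


def swap_noun_cardinal(chunk):
    """Move the first cardinal (CD) in front of the noun directly before it."""
    result = list(chunk)
    for i in range(1, len(result)):
        tag = result[i][1]
        prev_tag = result[i - 1][1]
        if tag == "CD" and prev_tag.startswith("NN"):
            result[i - 1], result[i] = result[i], result[i - 1]
            break  # assumption: only the first matching pair is swapped
    return result


print(filter_insignificant([("the", "DT"), ("terrible", "JJ"), ("movie", "NN")]))
print(swap_noun_cardinal([("Dec", "NNP"), ("10", "CD")]))
print(swap_noun_cardinal([("10", "CD"), ("Dec", "NNP")]))
```

Because the swap only fires when a noun is immediately followed by a cardinal, a phrase already in “10 Dec” order passes through untouched.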

Chapter 7: Text Classification

This feels like it will be useful – bag of words classification and the Naive Bayes Classifier are discussed (along with some other classifiers). Here the author starts to build a movie rating classifier. Precision and Recall are explained nicely. A high-information classifier is built; this is useful as we can then remove low-information words (those that aren’t biased to a single class in the classifier), which can improve classification results. Combining classifiers to further improve results is also covered.
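As a reminder of what precision and recall measure, here is a small sketch using Python sets. It is my own illustration rather than the book's code: the function name and the example item ids are made up. Precision is the fraction of items the classifier labelled positive that really are positive; recall is the fraction of truly positive items that the classifier found.

```python
def precision_recall(reference, predicted):
    """Return (precision, recall) for two sets of item ids labelled positive."""
    true_positives = len(reference & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical ids of reviews that are truly positive vs. predicted positive.
reference_pos = {1, 2, 3, 4}
predicted_pos = {3, 4, 5}

precision, recall = precision_recall(reference_pos, predicted_pos)
print(precision, recall)
```

Here two of the three predictions are correct (precision 2/3) and two of the four true positives are found (recall 0.5).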

Chapter 8: Distributed Processing and Handling Large Datasets

This chapter has promise – I wasn’t aware of the share-nothing distributed execution engine execnet. Redis is also used; Jacob builds towards a distributed word scoring engine which uses Redis as a single storage system. I’ve yet to use Redis but really want to hook it into our future @socialtiesapp; distributed processing will definitely be on the agenda too.

Chapter 9: Parsing Specific Data

This is a little gem, tucked at the end of the book. Ages ago I’d come across a date parsing module (which I then forgot about); having needed it recently, I was super-happy to see dateutil discussed. It makes the parsing of different date formats incredibly easy and also handles timezones.

The timex module in NLTK is introduced (I’d never heard of it before) – it takes a fuzzy reference to a date or time and marks it up. An example would be “let’s go sometime <TIMEX2>this week</TIMEX2>”; you can then extract the fuzzy reference and decide how to interpret it in your application.
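To show the mark-up-then-extract idea without relying on NLTK itself, here is a rough standard-library sketch. It is not the timex module and does not use its API: the FUZZY_TIME pattern covers only a handful of phrases, and the tag_times and extract_times names are my own assumptions.

```python
import re

# Hypothetical, very small set of fuzzy time phrases (an assumption for this
# sketch); a real tagger recognises far more expressions.
FUZZY_TIME = re.compile(
    r"\b(?:this|next|last)\s+(?:week|month|year)\b|\b(?:today|tomorrow|yesterday)\b",
    re.IGNORECASE,
)


def tag_times(text):
    """Wrap each recognised time phrase in <TIMEX2> tags."""
    return FUZZY_TIME.sub(lambda m: "<TIMEX2>" + m.group(0) + "</TIMEX2>", text)


def extract_times(tagged):
    """Pull the marked-up phrases back out of tagged text."""
    return re.findall(r"<TIMEX2>(.*?)</TIMEX2>", tagged)


tagged = tag_times("let's go sometime this week")
phrases = extract_times(tagged)
print(tagged)
print(phrases)
```

Once the phrase is isolated, how to turn “this week” into concrete dates is an application decision, which is exactly the split the mark-up approach gives you.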

lxml, Beautiful Soup and chardet (another gem) are used to write a web page scraper.

Overall I recommend this book. If you have the original O’Reilly book (and you really ought to) then this makes for a great companion. I also spotted these two other reviews.


Ian applies Data Science as an AI/Data Scientist for companies in ModelInsight and in his Mor Consulting, sign-up for Data Science tutorials in London. He also founded the image and text annotation API Annotate.io, lives in London and is a consumer of fine coffees.

6 Comments

  • Nice post, came here from py reddit ... will definitely have to have a go, nltk being on my list of things to get round to for a while now :)
  • Instead of doing the unicode conversion manually, like described in the book, it would probably be a better idea to use the UnicodeDammit class that comes with BeautifulSoup. This will give you the best of both worlds: a fast html parser (lxml) and a good unicode conversion (beautifulsoup). The class also uses chardet, if it's installed. More here: http://codespeak.net/lxml/elementsoup.html#using-only-the-encoding-detection
  • Thanks for the note, UnicodeDammit does look rather cool :-)
  • cool stuff, would be interested to see some results comparing NLTK and n-gram in accuracy and performance for different tasks.
  • Hey thanks for the review, I also came from reddit. I really need to find time to start learning this stuff, it has interested me for a while now but i never seem to get around to it. Cheers! - Billy Miller