Python Text Processing with NLTK 2.0 Cookbook (Amazon US, UK) is a cookbook for Python’s Natural Language Toolkit (NLTK). I’d suggest treating this book as a companion to O’Reilly’s Natural Language Processing with Python (available for free at nltk.org). The older O’Reilly book gives a lot of explanation of how to use NLTK’s components, whereas Packt’s new book shows you lots of little recipes which build to larger projects, giving you a great hands-on toolkit.
Overall the book is easy to read, has a huge set of sample recipes and feels very useful. I’ll be referring to it for our upcoming @socialties mobile app.
You’ll need to download NLTK. You can also refer to some sample articles at Packt’s site and get Chapter 3 as a free PDF (see below). The author is Jacob Perkins; his blog links to many related articles and he also has a nice ‘how it started’ article.
Here are my thoughts on the book. Disclosure: I was sent a free copy of the book by Packt for review, but the thoughts below are entirely my own.
Chapter 1: Tokenizing Text and WordNet Basics
If you haven’t tried tokenising text before you may not realise how complicated it can be (expressing even basic rules for English is jolly hard!). This chapter has a good overview of tokenisation and the excellent WordNet library. Filtering stopwords (low-value words like ‘the’ and ‘of’) and working with synsets (synonym groups in WordNet) are also covered. The word similarity measure was new to me; the book certainly throws up nice nuggets.
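To make the tokenising and stopword filtering ideas concrete, here is a minimal sketch that uses only Python's standard library rather than NLTK's own tokenizers. The regular expression, the tiny `STOPWORDS` set and both function names are assumptions made for illustration; NLTK ships far more capable tokenizers and a much longer stopword list.

```python
import re

# A tiny illustrative stopword set (an assumption for this sketch);
# NLTK's stopwords corpus is much longer and covers many languages.
STOPWORDS = {"the", "of", "a", "an", "and", "is", "to", "in"}

# A word may carry an internal apostrophe (e.g. "can't"); any other
# non-space, non-alphanumeric character becomes its own token.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\sA-Za-z0-9]")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def remove_stopwords(tokens):
    """Drop tokens whose lowercase form is in STOPWORDS."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


tokens = tokenize("The cat can't open the door of the house.")
filtered = remove_stopwords(tokens)
print(tokens)
print(filtered)
```

Even this toy version shows why the rules get complicated quickly: contractions, punctuation and case all need a decision.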
Chapter 2: Replacing and Correcting Words
Stemming approaches are covered; the goal is to find common root words (e.g. “running”, “runs” and “run” can each have “run” as their stem) to simplify your input text. Synonym replacement (e.g. converting “bday” to “birthday”) and negating words using antonyms are nicely treated. Babelfish is provided through NLTK for translation and the PyEnchant spellchecker is introduced.
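Synonym replacement of the “bday” to “birthday” kind can be sketched with a plain dictionary lookup. This is not the book's recipe; the `SYNONYMS` mapping and the `replace_synonyms` function are hypothetical names, and only the “bday” entry comes from the example above.

```python
# Hypothetical mapping for illustration; only "bday" comes from the review.
SYNONYMS = {"bday": "birthday", "gr8": "great"}


def replace_synonyms(tokens, mapping=SYNONYMS):
    """Replace each token that has a known synonym, leaving others untouched."""
    return [mapping.get(token.lower(), token) for token in tokens]


print(replace_synonyms(["Happy", "bday", "to", "you"]))
```

Normalising variants like this before classification reduces the number of distinct words the later stages have to deal with.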
Chapter 3: Creating Custom Corpora (sample PDF chapter)
This chapter discusses MongoDB (a NoSQL document store) as a way to store your own corpora in NLTK’s format. It also introduces part-of-speech tagging. File locking using lockfile is mentioned in case you’re using multiple processes (discussed later).
Chapter 4: Part-of-Speech Tagging and Chapter 5: Extracting Chunks
I was less interested in this part, though I’ve had to extract Named Entities before and there’s a nice discussion of that in Chapter 5.
Chapter 6: Transforming Chunks and Trees
The section on filtering out insignificant words using part-of-speech tags was interesting (e.g. using the Determiner tag DT to filter words like “a”, “all”, “an” and “that”). Cardinals (numbers) are discussed; I liked the recipe for swapping noun cardinal phrases so that e.g. “Dec 10” becomes “10 Dec” (whilst “10 Dec” doesn’t change).
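Both ideas can be sketched on a flat list of (word, tag) pairs. This is a simplified illustration under stated assumptions, not the book's recipe (which works on chunks): the function names are hypothetical, the input is assumed to be already tagged with Penn Treebank style tags, and a noun is taken to be any tag starting with “NN”.

```python
def filter_determiners(tagged):
    """Drop every (word, tag) pair tagged as a determiner (DT)."""
    return [(word, tag) for word, tag in tagged if tag != "DT"]


def swap_noun_cardinal(tagged):
    """Move a cardinal (CD) in front of the noun it immediately follows."""
    result = []
    i = 0
    while i < len(tagged):
        follows_noun = (
            i + 1 < len(tagged)
            and tagged[i][1].startswith("NN")
            and tagged[i + 1][1] == "CD"
        )
        if follows_noun:
            # Emit the cardinal first, then the noun, and skip both.
            result.extend([tagged[i + 1], tagged[i]])
            i += 2
        else:
            result.append(tagged[i])
            i += 1
    return result


print(filter_determiners([("the", "DT"), ("book", "NN"), ("is", "VBZ"), ("a", "DT"), ("gem", "NN")]))
print(swap_noun_cardinal([("Dec", "NNP"), ("10", "CD")]))
print(swap_noun_cardinal([("10", "CD"), ("Dec", "NNP")]))
```

The second call reorders “Dec 10” while the third leaves “10 Dec” alone, matching the behaviour described above.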
Chapter 7: Text Classification
This feels like it will be useful: bag-of-words classification and the Naive Bayes Classifier are discussed (along with some other classifiers). Here the author starts to build a movie rating classifier. Precision and Recall are explained nicely. A high-information classifier is built; this is useful as we can then remove low-information words (those that aren’t biased to a single class in the classifier), which can improve classification results. Combining classifiers to further improve results is also covered.
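Precision and recall are easy to compute by hand once you have the set of items that truly belong to a class and the set the classifier assigned to it. The sketch below uses plain Python sets; the function name and the review ids are made up for illustration.

```python
def precision_recall(reference, predicted):
    """Return (precision, recall) for one class.

    reference: set of item ids that truly belong to the class
    predicted: set of item ids the classifier assigned to the class
    """
    true_positives = len(reference & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical ids of movie reviews: four are truly positive,
# and the classifier labelled three reviews as positive.
truly_positive = {1, 2, 3, 4}
labelled_positive = {3, 4, 5}

precision, recall = precision_recall(truly_positive, labelled_positive)
print(precision, recall)
```

Here two of the three predictions are right (precision of two thirds) and two of the four real positives were found (recall of one half).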
Chapter 8: Distributed Processing and Handling Large Datasets
This chapter has promise; I wasn’t aware of the share-nothing distributed execution engine execnet. Redis is also used, and Jacob builds towards a distributed word scoring engine which uses Redis as a single storage system. I’ve yet to use Redis but really want to hook it into our future @socialtiesapp, and distributed processing will definitely be on the agenda too.
Chapter 9: Parsing Specific Data
This is a little gem, tucked at the end of the book. Ages ago I’d come across a date parsing module (which I then forgot about), and having needed it recently I was super-happy to see dateutil discussed. It makes the parsing of different date formats incredibly easy and also handles timezones.
The timex module in NLTK is introduced (I’d never heard of it before). It takes a fuzzy reference to a date or time and marks it up. An example would be “let’s go sometime <TIMEX2>this week</TIMEX2>”; you can then extract the fuzzy reference and decide how to interpret it in your application.
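Once the text is marked up, pulling the fuzzy references back out is a small regular expression job. The sketch below works on an already tagged string and does not call the timex module itself; it assumes the tags are plain `<TIMEX2>` and `</TIMEX2>` with no attributes, as in the example above, and the function names are hypothetical.

```python
import re

# Assumes plain, attribute-free tags as in the example sentence.
TIMEX_RE = re.compile(r"<TIMEX2>(.*?)</TIMEX2>")


def extract_timex(tagged_text):
    """Return every fuzzy date/time reference found between TIMEX2 tags."""
    return TIMEX_RE.findall(tagged_text)


def strip_timex(tagged_text):
    """Remove the TIMEX2 tags but keep the text they wrapped."""
    return TIMEX_RE.sub(r"\1", tagged_text)


tagged = "let's go sometime <TIMEX2>this week</TIMEX2>"
print(extract_timex(tagged))
print(strip_timex(tagged))
```

What “this week” should resolve to is still up to your application, for example relative to the date the message was written.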
Ian is a Chief Interim Data Scientist via his Mor Consulting. He lives in London, is walked by his high energy Springer Spaniel and is a consumer of fine coffees.