Tesseract optical character recognition to read plaques

The Tesseract engine (wikipedia) is a very capable OCR package; I’m playing with it after a thought for my AI Handbook plan. OCR is a pretty interesting subject: it drove a lot of early computer research, as it was used to automate paper filing for banks and companies like Reader’s Digest. This TesseractOSCON paper gives a nice summary of how it works.

Update – almost perfect recognition results are possible; see OCR Webservice work-in-progress for details.

As it states on the website:

“The Tesseract OCR engine was one of the top 3 engines in the 1995 UNLV Accuracy test. Between 1995 and 2006 it had little work done on it, but it is probably one of the most accurate open source OCR engines available. The source code will read a binary, grey or color image and output text. A tiff reader is built in that will read uncompressed TIFF images, or libtiff can be added to read compressed images.”

I wanted to see how well it might extract the text from English Heritage plaques for the openplaques.org project. At the weekend I took this photo:

On the command line I ran:

tesseract SwissGardensPlaque.tif output -l eng

and the result in output.txt was:

"VICTORIAN
 1=e—1.EAsuRE masonrr
_ _ THE
SWISS GARDENS
FOUNDED HERE
IN 1838 E
x BY
JAMES BRITTON BALLEY
SHIPBUILDER
9/ 1789 — 1863 N
x
égpis COQQE
USSEX c0UN“"

Obviously the result isn’t brilliant but all the major text is present – this is without any training or preparation.
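
The command-line run above could also be scripted. The following is a minimal Python sketch, assuming the tesseract binary is installed and on the PATH; the helper names build_tesseract_command and run_tesseract are made up for illustration. It relies on Tesseract appending .txt to the output base name it is given.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Return the argument list for: tesseract <image> <output_base> -l <lang>."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def run_tesseract(image_path, output_base, lang="eng"):
    """Run Tesseract on one image and return the recognised text.

    Assumes the tesseract binary is installed and on the PATH.
    """
    command = build_tesseract_command(image_path, output_base, lang)
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(command, check=True)
    # Tesseract writes its result to <output_base>.txt.
    # errors="replace" keeps going if the output has bytes that are not valid UTF-8.
    return Path(f"{output_base}.txt").read_text(encoding="utf-8", errors="replace")
```

For the photo above this would be called as run_tesseract("SwissGardensPlaque.tif", "output").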

As a pre-processing test I flattened the image to a bit depth of 1 (black and white), rotated it a little to make the text straight and cropped some of the unnecessary parts. The recognition improved a small amount, but the speckling and the bent text are still a problem:

"W,R¤¤AM a
‘ A
VICTORlAN
i ‘ P-LEASURE RESORT
SWISS GARDENS
FOUNDED mama
IN 1838
BY
JAMES\ BRITTON BALLEY
SHIPBUILDER
1789 - 1863
lb (9*,
6*7 S 006
USSEX couw 1"
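
Two of the pre-processing steps above, flattening to 1 bit and cropping, can be sketched in plain Python on a grayscale image held as a list of rows of 0-255 values. This is only an illustration: the threshold of 128, the function names and the tiny sample image are all assumptions, the small rotation is left out, and in practice an image editor or imaging library would do all three steps.

```python
def to_one_bit(pixels, threshold=128):
    """Map each grayscale value (0-255) to 0 (black) or 1 (white)."""
    return [[1 if value >= threshold else 0 for value in row] for row in pixels]


def crop(pixels, left, top, right, bottom):
    """Keep rows top..bottom-1 and columns left..right-1."""
    return [row[left:right] for row in pixels[top:bottom]]


# A tiny made-up 3x4 "image": dark text pixels on a lighter background.
sample = [
    [200, 210, 40, 220],
    [190, 30, 35, 215],
    [205, 200, 45, 90],
]

binary = to_one_bit(sample)
trimmed = crop(binary, left=1, top=0, right=3, bottom=2)
```

A single fixed threshold like this would not be expected to remove speckling on its own, which fits the noise that is still visible in the output above.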

I tried a few other plaques and the results were similar – generally all the pertinent text came through along with some noise.
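
One way to put a rough number on how close a result is would be to compare each recognised line with a hand-typed transcription of the plaque. The sketch below uses difflib from the Python standard library; the function name and the choice to ignore case and extra whitespace are assumptions, and nothing like this was measured in the tests above.

```python
from difflib import SequenceMatcher


def similarity(expected, recognised):
    """Return a 0.0-1.0 similarity score, ignoring case and extra whitespace."""
    a = " ".join(expected.lower().split())
    b = " ".join(recognised.lower().split())
    return SequenceMatcher(None, a, b).ratio()
```

For example, similarity("FOUNDED HERE", "FOUNDED mama") scores one line from the first run against the same line from the second run; a score of 1.0 means the two strings match exactly after normalisation.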

On my MacBook it took 20 minutes to get started. I downloaded:

  • tesseract-2.04.tar.gz
  • tesseract-2.00.eng.tar.gz

As noted in the README I extracted the .eng data files into tessdata/, ran ‘./configure’, ‘make’, ‘sudo make install’ and that was all.

For future research there are other OCR systems with SDKs. The algorithms used for number plate recognition might be an interesting place to start further research.

Update – the blog for the A.I. Cookbook is now active, more A.I. and robot updates will occur there.


Ian is a Chief Interim Data Scientist via his Mor Consulting. He lives in London, is walked by his high energy Springer Spaniel and is a consumer of fine coffees.

1 Comment

  • I have always been fascinated by OCR technology. Very interesting reading the results of your experiment. One of the areas that particularly interests me is spam captchas. The better the OCR gets the more obscure image captchas will need to become, it will get to a point where the captchas aren't human readable and we will need a new way to defeat spam. I read this article recently on 'emergence imaging', http://whatisartificialintelligence.com/772/new-captcha-technology-to-prevent-robot-hackers/