This is Ian Ozsvald's blog (@IanOzsvald), I'm an entrepreneurial geek, a Data Science/ML/NLP/AI consultant, author of O'Reilly's High Performance Python book, co-organiser of PyDataLondon, a Pythonista, co-founder of ShowMeDo and also a Londoner. Here's a little more about me.

4 April 2010 - 22:36
Tesseract optical character recognition to read plaques

The Tesseract engine (see Wikipedia) is a very capable OCR package; I’m playing with it after a thought for my AI Handbook plan. OCR is a pretty interesting subject: it drove a lot of early computer research as it was used to automate paper filing for banks and companies like Reader’s Digest. This TesseractOSCON paper gives a nice summary of how it works.

Update – almost 100% perfect recognition results are possible, see OCR Webservice work-in-progress for an update.

As it states on the website:

“The Tesseract OCR engine was one of the top 3 engines in the 1995 UNLV Accuracy test. Between 1995 and 2006 it had little work done on it, but it is probably one of the most accurate open source OCR engines available. The source code will read a binary, grey or color image and output text. A tiff reader is built in that will read uncompressed TIFF images, or libtiff can be added to read compressed images.”

I wanted to see how well it might extract the text from English Heritage plaques for the openplaques.org project. At the weekend I took this photo:

On the command line I ran:

tesseract SwissGardensPlaque.tif output -l eng

and the result in output.txt was:

 1=e—1.EAsuRE masonrr
_ _ THE
IN 1838 E
x BY
9/ 1789 — 1863 N
égpis COQQE

Obviously the result isn’t brilliant but all the major text is present – this is without any training or preparation.
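
If you want to run the same command over a batch of photos, a small Python wrapper is one way to do it. This is only a minimal sketch: the function names build_tesseract_command and ocr_image are made up for illustration, and it assumes that tesseract is on your PATH and that it appends ".txt" to the output base name it is given.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Return the argument list for: tesseract <image> <output_base> -l <lang>."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_image(image_path, output_base, lang="eng"):
    """Run tesseract on one image and return the recognised text.

    Assumption: tesseract writes its result to <output_base>.txt.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    # check=True raises CalledProcessError if tesseract exits with an error
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8", errors="replace")


# Example use (needs tesseract installed and the image present):
#   text = ocr_image("SwissGardensPlaque.tif", "output")
```

Keeping the command construction in its own function makes it easy to check the arguments without having tesseract installed.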

As a pre-processing test I flattened the image to a bit depth of 1 (black and white), rotated the image a little to make the text straight and cropped some of the unnecessary parts of the image. The recognition improves a small amount, but the speckling and the bent text are still a problem:

"W,R¤¤AM a
‘ A
IN 1838
1789 - 1863
lb (9*,
6*7 S 006
USSEX couw 1"

I tried a few other plaques and the results were similar – generally all the pertinent text came through along with some noise.
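
The flatten-and-crop pre-processing described above was done by hand in an image editor, but the idea can be sketched in a few lines of plain Python. This is a toy illustration, not what I actually ran: it assumes the image is already loaded as rows of 0-255 grey values, uses a fixed threshold of 128, treats dark pixels as ink, and crops automatically to the bounding box of the ink rather than by eye. The names binarise and crop_to_ink are hypothetical, and the rotation step is left out. For real photos you would use an imaging library instead.

```python
def binarise(gray, threshold=128):
    """Flatten a greyscale image to 1 bit per pixel.

    gray is a list of rows of 0-255 ints; the result uses 1 for dark
    pixels (assumed to be ink) and 0 for light background.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def crop_to_ink(bits):
    """Crop a 1-bit image to the bounding box of its dark pixels."""
    if not bits or not bits[0]:
        return []
    # Indices of rows and columns that contain at least one dark pixel
    rows = [y for y, row in enumerate(bits) if any(row)]
    cols = [x for x in range(len(bits[0])) if any(row[x] for row in bits)]
    if not rows:
        return []  # nothing dark in the image
    return [row[cols[0]:cols[-1] + 1] for row in bits[rows[0]:rows[-1] + 1]]


# A tiny 4x4 "image": light border with a few dark pixels in the middle
gray = [
    [250, 250, 250, 250],
    [250,  30, 200, 250],
    [250,  40,  20, 250],
    [250, 250, 250, 250],
]
cropped = crop_to_ink(binarise(gray))
```

A fixed threshold will not remove the speckling on its own, so the choice of threshold is worth experimenting with for each photo.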

On my MacBook it took 20 minutes to get started. I downloaded:

  • tesseract-2.04.tar.gz
  • tesseract-2.00.eng.tar.gz

As noted in the README I extracted the .eng data files into tessdata/, ran ‘./configure’, ‘make’, ‘sudo make install’ and that was all.

For future research there are other OCR systems with SDKs. The algorithms used for number plate recognition might be an interesting place to start further research.

Update – the blog for the A.I. Cookbook is now active, more A.I. and robot updates will occur there.

Ian applies Data Science as an AI/Data Scientist for companies in ModelInsight, sign-up for Data Science tutorials in London. Historically Ian ran Mor Consulting. He also founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

1 Comment | Tags: ArtificialIntelligence

4 April 2010 - 19:39
New book/wiki – a practical artificial intelligence ‘cookbook’

Having almost completed The Screencasting Handbook I’m now thinking about my next project. I’ve been involved in the field of artificial intelligence since my first computer (a Commodore 64 back in the 80s) and I’ve been paid to work in this area since the end of the 90s.

Update – as mentioned below the new project has started – read more at the A.I. Cookbook blog.

My goal now is to write a collaborative book (probably using a wiki) that takes a very practical look at the use of artificial intelligence in web-apps and desktop software. The big goal would be to teach you how to effectively use A.I. techniques in your job and for your own research. Here are a few of the topics that could be covered:

  • Using open source and commercial tools for face, object and speech recognition
  • Playing with open source and commercial text-to-speech tools (e.g. the open source Festival)
  • Automated control of driving and flight simulators with artificial brains
  • Building chatbot systems using tools like AIML, CHAT-L and natural language parsing kits
  • Using natural language parsing to add some smarts to apps – maybe for reading and identifying interesting people in Twitter and on blogs
  • Building useful demos around techniques like neural networks and evolutionary optimisation
  • Adding brains to real robots with some Arduinos and open source robot kits
  • Teaching myself machine learning and pattern matching (an area I’m weak on) along with useful techniques like Bayesian classification (Python’s Reverend library is great for this)
  • Parallel computation engines like Amazon’s EC2, libcloud and GPU programming with CUDA and OpenCL
  • Using Python and C++ for prototyping (along with Matlab and some other relevant languages)
  • and a whole bunch of other stuff – your input is very welcome

I’ve noticed that there are an awful lot of open source (and commercial) toolkits but very few practical guides to using them in your own software. What I want to encourage are some fun projects that’ll run for a month or two. Here are some ideas:

  • Using optical character recognition engines to augment projects like OpenPlaques.org with free meta data from real-world photos (for a start see my Tesseract OCR post)
  • Collaborating in real-world competitions like the Simulated Car Racing Competition 2010: Demolition Derby (they’re running a simulated project that’s not unlike the DARPA Grand Challenge)
  • Applying face recognition algorithms to flickr photos so we can track who is posting images of us for identity management
  • Creating a Twitter bot that responds to questions and maybe can have a chat (checking the weather should be easy, some memory could be useful – using Twitter as an interface to tools like OCR for plaques might be fun too) – I have one of these in development right now
  • Build a Zork-solving bot (using NLP and tools like ConceptNet) that can play interactive fiction, build maps and try to solve puzzles
  • Using evolutionary optimisation techniques like genetic algorithms on the traveling salesman problem
  • Building Braitenberg-like brains for open source robot kits (like those by Steve at BotBuilder)
  • Create a QR code and barcode reader, tied to a camera

LinkedIn has my history, and here’s my work site (please forgive it being a little…simple): Mor Consulting Ltd. I’m the AI Consultant for Qtara.com and I used to be the Senior Programmer for the UK R&D arm of MasaGroup.net/BlueKaizen.com.

I don’t have a definite timeline for the book; I’ll be making that up with you and everyone else once I’ve finished The Screencasting Handbook (end of April).

The Artificial Intelligence Cookbook project has started – the blog is currently active (along with the @aicookbook Twitter account). There is a mailing list to join for occasional updates – email AICookbook@Aweber.com to join.

It will be a commercial project and I will be looking to make it very relevant to however you’re using AI. Sign-up and you’ll get some notifications from me as the project develops.

2 Comments | Tags: ArtificialIntelligence, Programming, Python