Saturday, March 10, 2012

Optical Character Recognition Freeware: JOCR and FreeOCR

I had some PNG images of text files.  I wanted to do optical character recognition (OCR) to convert their contents into text.

I already had Adobe Acrobat, which was capable of doing OCR, so I tried printing the PNGs as PDFs in Acrobat.  Unfortunately, its recognition was poor, even when I set its PDF printer to 1200 DPI.

Next, I turned to a fairly recent review of desktop OCR software.  It looked like the best-known freeware OCR engines might be Cuneiform, Tesseract, and SimpleOCR.  I checked Softpedia for programs using those engines.  Cuneiform was virtually unknown there, and SimpleOCR was making a mediocre showing, but FreeOCR (using Tesseract) seemed relatively popular.

That review also directed me toward JOCR.  The reviews of JOCR at CNET and Softpedia were underwhelming.  But because it was apparently designed to do OCR directly from the screen, I decided to compare it against one of the others.  FreeOCR seemed to be the most likely candidate.  I might have gotten different results from one of the other OCR engines, or from another implementation of the Tesseract engine.

I let JOCR and FreeOCR try their luck with a screenshot taken from a maximized Notepad display of a text file, upsampled to 300 DPI.  (FreeOCR had frozen with a 600 DPI file, which JOCR had been able to handle without difficulty.)
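
To give a rough sense of how large such upsampled images get, here is a small Python sketch of the arithmetic.  The 1920 x 1080 screenshot size and the 96 DPI starting resolution are assumptions for illustration only; I did not record the actual figures, and the function name is hypothetical.

```python
def upsampled_size(width_px, height_px, source_dpi, target_dpi):
    """Return the pixel dimensions after resampling from source_dpi to target_dpi."""
    scale = target_dpi / source_dpi
    return round(width_px * scale), round(height_px * scale)


# Assumed values: a 1920 x 1080 screenshot captured at 96 DPI.
for dpi in (300, 600):
    w, h = upsampled_size(1920, 1080, 96, dpi)
    print(f"{dpi} DPI: {w} x {h} pixels")
```

Under those assumptions, doubling the DPI quadruples the pixel count, which may be part of why a 600 DPI file was harder on FreeOCR than a 300 DPI one, though I did not investigate the cause of the freeze.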

Briefly, at least within the parameters of this test, there was no question that the FreeOCR output was visibly inferior to the JOCR output.  Of course, this was a test with text in a screenshot image, the kind of input for which JOCR was apparently designed.

Compared against the original text, the primary problem with the JOCR output was capitalization.  Otherwise, the recognized text was generally pretty accurate, with few dropped letters or other errors.  Overall, its output would have made a bad impression if pasted directly into the body of a professional letter or memorandum, but it was quite good for the archival purpose of capturing the wording in an image.
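
I made the comparison above by eye.  For anyone wanting a number instead, the following Python sketch shows one possible way to score OCR output against the original text, with and without regard to capitalization.  It uses difflib from the Python standard library; the sample strings are made up for illustration and are not actual JOCR output, and the similarity function is a hypothetical helper.

```python
from difflib import SequenceMatcher


def similarity(reference, recognized, ignore_case=False):
    """Return a 0.0-1.0 similarity ratio between the original and the OCR text."""
    if ignore_case:
        reference, recognized = reference.lower(), recognized.lower()
    # autojunk=False keeps difflib from discarding frequent characters
    # when the texts are long.
    return SequenceMatcher(None, reference, recognized, autojunk=False).ratio()


# Made-up example: wording is right, but capitalization has been lost.
reference = "The Quick Brown Fox Jumps Over The Lazy Dog"
recognized = "the quick brown fox jumps over the lazy dog"

strict = similarity(reference, recognized)
relaxed = similarity(reference, recognized, ignore_case=True)
print(f"case-sensitive:   {strict:.3f}")
print(f"case-insensitive: {relaxed:.3f}")
```

A large gap between the two scores would suggest that capitalization, rather than dropped or misread letters, accounts for most of the errors.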