
Tesseract
Pure JavaScript OCR library.
- Free • Open Source
- Mac
- Windows
- Linux
What is Tesseract?
Tesseract.js is a JavaScript library that extracts words in almost any language from images.
The Tesseract OCR engine was one of the top three engines in the 1995 UNLV Accuracy test. Little work was done on it between 1995 and 2006, but it is probably one of the most accurate open source OCR engines available. The engine reads a binary, grey, or color image and outputs text. A built-in TIFF reader handles uncompressed TIFF images, and libtiff can be added to read compressed ones. Language files are available for many languages, including text set in Fraktur and blackletter typefaces.
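As a rough illustration of the image-in, text-out flow described above, here is a minimal sketch for Node.js. It assumes the `tesseract.js` npm package is installed and that its `Tesseract.recognize(image, langs)` call resolves to an object with a `data.text` field. The helper names `languageSpec` and `imageToText` are hypothetical and exist only for this example.

```javascript
// Hypothetical helper: join language codes into the "eng+deu" form
// that Tesseract accepts when several languages are requested.
function languageSpec(codes) {
  return codes.join('+');
}

// Hypothetical helper: run OCR on one image and return the recognized text.
// Assumes `npm install tesseract.js`; the package is loaded lazily so that
// this file can be loaded even when it is not installed.
async function imageToText(imagePath, codes = ['eng']) {
  const Tesseract = require('tesseract.js');
  const { data } = await Tesseract.recognize(imagePath, languageSpec(codes));
  return data.text;
}
```

A call such as `imageToText('scan.png', ['eng', 'deu'])` would then return a promise for the recognized text, where `scan.png` is a placeholder file name. The language data for each requested code has to be available to the library at run time.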
Tesseract information
Supported Languages
- English
Comments and Reviews
Said about Tesseract as an alternative
samsuffit
GImageReader is a very good tool that in fact uses the 'Tesseract' JavaScript library.
Tags
- Drag selection
- text-recognition
In terms of OCR, Tesseract is fantastic. I compared it to ABBYY 14, and Tesseract had fewer errors on dictionary words. While it doesn't offer layout preservation with the OCR (i.e. converting into an editable document that should print similarly), you'll likely make up for that in the reduced time needed to fix OCR errors.
To handle PDFs, you'll first need to convert them to image files, for example with pdftopng (an open source tool that can be found in the Xpdf project).
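The PDF workflow described above could be sketched roughly as follows in Node.js. This assumes that Xpdf's `pdftopng` is on the PATH, that it takes a PDF file and an output root and writes one PNG per page with that root as the file-name prefix, and that the `tesseract.js` npm package is installed. The helper names `pdftopngArgs` and `pdfToText`, the default root `page`, and the 300 DPI setting are illustrative choices, not part of either tool.

```javascript
const { execFileSync } = require('child_process');
const fs = require('fs');
const path = require('path');

// Hypothetical helper: build the argument list for
//   pdftopng -r <dpi> <input.pdf> <output-root>
function pdftopngArgs(pdfPath, outRoot, dpi = 300) {
  return ['-r', String(dpi), pdfPath, outRoot];
}

// Hypothetical helper: convert every page of a PDF to PNG, then OCR each page.
async function pdfToText(pdfPath, outRoot = 'page', lang = 'eng') {
  // Step 1: rasterize the PDF (assumes pdftopng is installed).
  execFileSync('pdftopng', pdftopngArgs(pdfPath, outRoot));

  // Step 2: collect the generated page images by their shared prefix.
  const dir = path.dirname(outRoot);
  const prefix = path.basename(outRoot) + '-';
  const pages = fs.readdirSync(dir)
    .filter((name) => name.startsWith(prefix) && name.endsWith('.png'))
    .sort();

  // Step 3: OCR each page in order (assumes `npm install tesseract.js`).
  const Tesseract = require('tesseract.js');
  const texts = [];
  for (const page of pages) {
    const { data } = await Tesseract.recognize(path.join(dir, page), lang);
    texts.push(data.text);
  }
  return texts.join('\n');
}
```

Rendering at a higher resolution than the tool's default generally gives the OCR engine more pixels per character to work with, which is why the sketch passes `-r` explicitly. The right value depends on the document.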
I downloaded it from SourceForge. I'm looking for a simple installer for Windows. I see so many likes for this program; I would guess they are from Linux users.