Multi-function OCR tool

4/30/2023

Since version 3.00 Tesseract has supported output text formatting, hOCR positional information and page-layout analysis. Support for a number of new image formats was added using the Leptonica library. Tesseract can detect whether text is monospaced or proportionally spaced.

The initial versions of Tesseract could only recognize English-language text. Tesseract v2 added six additional Western languages (French, Italian, German, Spanish, Brazilian Portuguese, Dutch). Version 3 extended language support significantly to include ideographic (Chinese and Japanese) and right-to-left (e.g. Arabic, Hebrew) languages, as well as many more scripts. New languages included Arabic, Bulgarian, Catalan, Chinese (Simplified and Traditional), Croatian, Czech, Danish, German (Fraktur script), Greek, Finnish, Hebrew, Hindi, Hungarian, Indonesian, Japanese, Korean, Latvian, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak (standard and Fraktur script), Slovenian, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian and Vietnamese.

V3.04, released in July 2015, added 39 language/script combinations, bringing the total count of supported languages to over 100. New language codes included: amh (Amharic), asm (Assamese), aze_cyrl (Azerbaijani in Cyrillic script), bod (Tibetan), bos (Bosnian), ceb (Cebuano), cym (Welsh), dzo (Dzongkha), fas (Persian), gle (Irish), guj (Gujarati), hat (Haitian and Haitian Creole), iku (Inuktitut), jav (Javanese), kat (Georgian), kat_old (Old Georgian), kaz (Kazakh), khm (Central Khmer), kir (Kyrgyz), kur (Kurdish), lao (Lao), lat (Latin), mar (Marathi), mya (Burmese), nep (Nepali), ori (Oriya), pan (Punjabi), pus (Pashto), san (Sanskrit), sin (Sinhala), srp_latn (Serbian in Latin script), syr (Syriac), tgk (Tajik), tir (Tigrinya), uig (Uyghur), urd (Urdu), uzb (Uzbek), uzb_cyrl (Uzbek in Cyrillic script), yid (Yiddish).
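As a rough illustration of how the language codes and hOCR output mentioned here are typically used, the sketch below assembles a command line for the tesseract program. It assumes a Tesseract 3.x or later install with the relevant language data present; the helper name build_tesseract_command and the file names are made up for this example, and the code only builds the argument list without running the engine.

```python
def build_tesseract_command(image_path, output_base, languages, hocr=False):
    """Return the argument list for one tesseract CLI call (hypothetical helper)."""
    if not languages:
        raise ValueError("at least one language code is required")
    # Several language codes are joined with '+', for example "eng+deu".
    command = ["tesseract", image_path, output_base, "-l", "+".join(languages)]
    if hocr:
        # Naming the "hocr" config asks for hOCR (positional) output
        # instead of plain text.
        command.append("hocr")
    return command


if __name__ == "__main__":
    # File names below are placeholders, not files from the original post.
    cmd = build_tesseract_command("page.png", "page", ["eng", "deu"], hocr=True)
    print(" ".join(cmd))  # prints: tesseract page.png page -l eng+deu hocr
```

The resulting list can be handed to subprocess.run on a machine where tesseract is installed; a code such as yid or kat_old only works if the matching language data file is available.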
Tesseract was released as open source in 2005 by Hewlett Packard and the University of Nevada, Las Vegas (UNLV). Its development has been sponsored by Google since 2006. Version 4 added an LSTM-based OCR engine and models for many additional languages and scripts, bringing the total to 116 languages. Models are also provided per script, so it is possible, for example, to recognize text with a mix of Western and Central European languages by using the model for the Latin script it is written in. Version 5 was released in 2021, after more than two years of testing and development.

Tesseract was in the top three OCR engines in terms of character accuracy in 1995. It is available for Linux, Windows and Mac OS X. However, due to limited resources, it is only rigorously tested by developers under Windows and Ubuntu.

Tesseract up to and including version 2 could only accept TIFF images of simple one-column text as inputs. These early versions did not include layout analysis, so inputting multi-columned text, images, or equations produced garbled output.

Originally developed by Hewlett-Packard as proprietary software in the 1980s, Tesseract was released as open source in 2005, and development has been sponsored by Google since 2006. In 2006, Tesseract was considered one of the most accurate open-source OCR engines available.

The Tesseract engine was originally developed as proprietary software at Hewlett Packard labs in Bristol, England and Greeley, Colorado between 19, with more changes made in 1996 to port to Windows, and some migration from C to C++ in 1998. A lot of the code was written in C, and then some more was written in C++. Since then, all the code has been converted to at least compile with a C++ compiler. Very little work was done in the following decade.

Tesseract is an optical character recognition engine for various operating systems. It is free software, released under the Apache License. The languages listed below are supported; more can be added using the included training files.

Afrikaans, Albanian, Arabic, Azerbaijani, Basque, Belarusian, Bengali, Bulgarian, Catalan, Czech, Cherokee, Croatian, Danish, Dutch, English, Esperanto, Estonian, Finnish, French, Galician, German, Greek, Hindi, Hebrew, Hungarian, Indonesian, Italian, Japanese, Kannada, Korean, Latvian, Lithuanian, Malayalam, Macedonian, Maltese, Malay, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Telugu, Thai, Turkish, Ukrainian & Vietnamese
