OCR / OCV - Reading plain text with a camera

OCR stands for Optical Character Recognition, and OCV for Optical Character Verification; in other words, reading plain text and verifying plain text. Machine reading used to require a specific font, but this is no longer the case. Passports are a good example of the older approach: they carry a machine-readable line that had to be printed in exactly such a font. In recent years, OCR systems have advanced to the point where things are possible that would have been unthinkable some time ago. Today, OCR can be used reliably on documents without training, thanks to the standardization of Windows fonts. Even narrow proportional fonts can be read. A modern OCR system can also recognize the layout of a text, so that even multi-column documents can be processed automatically.

What is OCR actually?

Optical character recognition (OCR) is a technology that converts various kinds of documents, such as PDF files, paper documents or digital images, into searchable and editable files. If you want to extract relevant information from a brochure, a newspaper article or a contract in order to reproduce it in Word or edit it in an Excel file, a scanner alone is not enough. The scanner only outputs a copy, or image, of the document: a collection of pixels that can be white, black or colored. The document may of course also contain tables or raster graphics.

OCR software is required to read and process these documents. It turns documents, PDFs or digital images into words and sentences. This allows information to be stored in a readable and searchable format. Further processing is also possible.

Text recognition in practice

Most optical input devices, such as digital cameras, scanners or fax machines, can only output raster graphics, that is, dots arranged in rows and columns and colored differently, the so-called pixels. For text recognition, however, letters must be recognized as letters. Each character has to be identified so that it can then be assigned the numerical value defined for it by a text encoding such as Unicode or ASCII.
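
As a small illustration of that last step, the following Python sketch shows how characters that have already been recognized map to numerical values. The sample string and the helper name code_points are assumptions made for this example; the recognition itself is not shown.

```python
def code_points(text):
    """Return the Unicode code point of every character in the text."""
    return [ord(ch) for ch in text]


recognized = "OCR 8B"  # hypothetical output of a recognition step

# For these characters the Unicode code points equal the ASCII values.
print(code_points(recognized))

# Encoding turns the characters into bytes for storage or transmission.
ascii_bytes = recognized.encode("ascii")
# Characters outside ASCII, such as "ö" and "ß", take two bytes each in UTF-8.
utf8_bytes = "Größe".encode("utf-8")
print(len(ascii_bytes), len(utf8_bytes))
```

Note that the digit 8 and the letter B receive different values, which is why a misread character ends up as a genuinely different character in the stored text.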

In German, the terms OCR and automatic text recognition are used as synonyms. Technically speaking, however, this is incorrect, because OCR describes only the recognition of individual characters in separated image parts. This step is preceded by structure recognition: text blocks are first separated from graphic elements, then the line structures are recognized, and finally the individual characters are separated. The decision as to which character is present is made by algorithms that take linguistic context into account.
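
The line and character separation can be sketched in a few lines of Python. The example below uses projection profiles (checking which rows and columns contain ink), which is one simple classic technique and not necessarily what any particular OCR product does. It assumes a clean binary image without graphics, so the separation of text blocks from graphic elements is skipped, and the tiny PAGE bitmap and all function names are made up for illustration.

```python
def runs(flags):
    """Return (start, end) index pairs for each consecutive run of True values."""
    spans, start = [], None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(flags)))
    return spans


def segment_lines(image):
    """Find text lines: runs of rows that contain at least one ink pixel."""
    return runs([any(row) for row in image])


def segment_characters(image, line):
    """Find characters in one line: runs of columns that contain ink."""
    top, bottom = line
    width = len(image[0])
    col_has_ink = [any(image[y][x] for y in range(top, bottom)) for x in range(width)]
    return runs(col_has_ink)


# Hypothetical 10x8 bitmap with two text lines; "#" is ink, "." is background.
PAGE = [
    "..........",
    ".##..#..#.",
    ".#.#.#..#.",
    ".##..####.",
    "..........",
    ".#...##...",
    ".#...#.#..",
    "..........",
]
image = [[1 if c == "#" else 0 for c in row] for row in PAGE]

lines = segment_lines(image)
chars = [segment_characters(image, line) for line in lines]
print(lines)
print(chars)
```

Each resulting span marks an image part that would then be handed to the actual character recognition. Real scans need extra handling for skew, noise and touching characters, which this sketch ignores.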

In the past, automatic text recognition required specially designed fonts. Many people will remember the bottom line of a check form. Its font was designed so that a special OCR reader could distinguish and read the characters very quickly and with little computing effort. This font, OCR-A, is characterized by the fact that very similar characters, such as the zero and the capital O, are drawn so that they are clearly distinguishable. OCR-B, by contrast, resembles a non-proportional sans-serif font, while OCR-H was modeled on handwritten letters and numbers. Now that computers are far more powerful and algorithms have improved, it is possible to recognize ordinary printed fonts and even handwriting.

What modern OCR software can do

Modern text recognition software can carry out a context analysis. With the help of ICR (Intelligent Character Recognition), the result can be corrected: a character originally recognized as the digit 8, for example, is automatically converted into a B because it appears inside a word. A misread such as "8ook" thus becomes "Book".
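
A very reduced version of such a context correction might look like the following Python sketch. It is only a rule of thumb under stated assumptions, not how ICR engines actually work: the look-alike table DIGIT_TO_LETTER and the "more letters than digits" rule are invented for this example, and real systems typically rely on dictionaries or language models.

```python
# Hypothetical table of digits that are easily confused with letters.
DIGIT_TO_LETTER = {"8": "B", "0": "O", "5": "S"}


def correct_token(token):
    """Replace look-alike digits with letters when the token is mostly letters."""
    letters = sum(ch.isalpha() for ch in token)
    digits = sum(ch.isdigit() for ch in token)
    if letters == 0 or digits >= letters:
        return token  # probably a real number or code, so leave it untouched
    return "".join(DIGIT_TO_LETTER.get(ch, ch) for ch in token)


def correct_text(text):
    """Correct every whitespace-separated token (runs of spaces are collapsed)."""
    return " ".join(correct_token(tok) for tok in text.split())


print(correct_text("The 8ook costs 88 euros"))
```

The digit inside the word is replaced, while the standalone number 88 is kept because it contains no letters at all.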

Text recognition is mainly used by larger companies, for example to process incoming mail automatically, where documents have to be sorted as they arrive. This task does not require analyzing the entire content. It is usually sufficient to differentiate by rough characteristics, such as a specific invoice or form layout, a company logo or other distinctive features. Classification is then carried out using pattern recognition that looks only at defined areas rather than at the entire document.
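
The idea of classifying by a defined area can be sketched as follows in Python. This is a deliberately naive illustration: it compares one fixed region of a binary page image with stored templates pixel by pixel. The region coordinates, the template bitmaps, the labels and the 0.9 threshold are all assumptions made for the example, and production systems use far more robust features.

```python
def crop(image, region):
    """Cut out a rectangular region given as (top, left, bottom, right)."""
    top, left, bottom, right = region
    return [row[left:right] for row in image[top:bottom]]


def similarity(a, b):
    """Fraction of pixels that agree between two equally sized binary patches."""
    total = sum(len(row) for row in a)
    same = sum(pa == pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return same / total


def classify(image, region, templates, threshold=0.9):
    """Return the label of the best matching template, or "unknown"."""
    patch = crop(image, region)
    best_label, best_score = None, 0.0
    for label, template in templates.items():
        score = similarity(patch, template)
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= threshold else "unknown"


# Hypothetical 2x3 "logo" patterns expected in the top-left corner of a page.
TEMPLATES = {
    "invoice": [[1, 1, 1], [1, 0, 1]],
    "form": [[0, 1, 0], [1, 1, 1]],
}
LOGO_REGION = (0, 0, 2, 3)

invoice_page = [[1, 1, 1, 0, 0], [1, 0, 1, 0, 0], [0, 0, 0, 0, 0]]
blank_page = [[0] * 5 for _ in range(3)]

print(classify(invoice_page, LOGO_REGION, TEMPLATES))
print(classify(blank_page, LOGO_REGION, TEMPLATES))
```

Because only the small logo region is compared, the rest of the page never has to be read, which matches the point that sorting does not require analyzing the entire content.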

Advantages of OCR

OCR is primarily used to save time and costs when creating, processing and reusing a wide variety of documents. With OCR software, a scanned paper document can later be edited in a Word document or an Excel file, for example, and then forwarded. Text passages from journals and books can also be used in your own documents, working papers and studies without having to retype the quote or passage.

Even when on the move, it is now possible to capture text from timetables, posters or banners using a simple cell phone camera and use the resulting information in a document. The same of course also applies to text passages from books and paper documents if a scanner is not available. The software can also be used to create searchable archives. Modern programs now work so quickly that data conversion only takes a few seconds.

Further information:

https://en.wikipedia.org/wiki/Optical_character_recognition
