How to do Optical Character Recognition (OCR) of non-English documents in R using Tesseract?

One of the many great packages from rOpenSci wraps the open-source OCR engine Tesseract. Optical character recognition (OCR) is used to digitize written or typed documents, i.e. photos or scans of text documents are “translated” into digital text on your computer. This might seem like a trivial task at first glance because it is so easy for our human brains: when reading text, we make use of a built-in word and sentence “autocomplete” that we learned from experience. But it is really quite difficult for a computer to recognize typed words correctly, especially if the document is of low quality. One of the best open-source engines today is Tesseract. You can run tesseract from the command line or, with the help of rOpenSci’s tesseract package, run it conveniently from within R! Tesseract uses language…
Original Post: How to do Optical Character Recognition (OCR) of non-English documents in R using Tesseract?