algorithm - What is the typical method to separate connected letters in a word using OCR

Question

asked Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I am very new to OCR and know almost nothing about the algorithms used to recognize words. I am just getting familiar with the field.

Could anybody please advise on the typical method used to recognize and separate individual characters in connected form, that is, in a word where all the letters are linked together? Setting handwriting aside and supposing the letters are joined in a known font, what is the best method to identify each individual character in a word? When characters are written separately there is no problem, but when they are joined, we need to know where each character starts and ends before we can go on to the next step and match each one to a letter. Is there a known algorithm for that?

1 Answer

深蓝 · Answer 1 · 2021-10-23T19:30:49+0000

The standard term for this process is "character segmentation"; segmentation is the image-processing term for breaking an image into grouped regions for recognition. Searching Google Scholar for "Arabic character segmentation" turns up a lot of hits if you want to learn more.
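To make "segmentation" a little more concrete, here is a minimal sketch of one classical baseline, a vertical projection profile. It is not something the Tesseract documents describe, just an illustration: it assumes a binarized word image given as a list of rows with 1 for ink and 0 for background, and it assumes the strokes joining letters are thinner than the letters themselves. The function names and the `max_ink` threshold are made up for this example.

```python
from typing import List


def column_profile(image: List[List[int]]) -> List[int]:
    """Count ink pixels (value 1) in each column of a binary image."""
    if not image:
        return []
    width = len(image[0])
    return [sum(row[x] for row in image) for x in range(width)]


def candidate_cuts(profile: List[int], max_ink: int = 1) -> List[int]:
    """Return the middle column of every interior run of thin columns.

    A column is "thin" when it holds at most max_ink ink pixels, which is
    the assumed signature of a joining stroke between two letters.
    """
    cuts = []
    x, n = 0, len(profile)
    while x < n:
        if profile[x] <= max_ink:
            start = x
            while x < n and profile[x] <= max_ink:
                x += 1
            # Ignore runs that touch the left or right edge of the word.
            if start > 0 and x < n:
                cuts.append((start + x - 1) // 2)
        else:
            x += 1
    return cuts


# Two thick "letters" joined by a one-pixel-high stroke.
word = [
    [1, 1, 0, 0, 0, 1, 1],
    [1, 1, 1, 1, 1, 1, 1],
    [1, 1, 0, 0, 0, 1, 1],
]
print(candidate_cuts(column_profile(word)))
```

Cuts found this way are only candidates. Thin parts inside a single letter trigger false cuts, which is one reason real systems tend to over-segment and let the recognizer decide which pieces belong together.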

I'd encourage you to look at Tesseract, an open-source OCR implementation, and especially at its documentation.

The glossary entry for "Feature" has a bit on this, and the Tesseract documentation as a whole has a great deal more.

Basically, Tesseract solves the problem (per "How Tesseract Works") by looking at blobs rather than letters and then combining those blobs into words. This avoids the problem you describe, though it creates new problems of its own.
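As a rough illustration of what a "blob" is, the sketch below finds connected regions of ink and returns their bounding boxes. This is plain connected-component labeling, not Tesseract's actual code, which is considerably more elaborate. The input format (a list of rows with 1 for ink and 0 for background), the use of 8-connectivity, and the name `find_blobs` are all assumptions made for the example.

```python
from collections import deque
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), inclusive


def find_blobs(image: List[List[int]]) -> List[Box]:
    """Return the bounding box of every 8-connected region of ink pixels."""
    height = len(image)
    width = len(image[0]) if image else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill starting from an unvisited ink pixel.
            seen[y][x] = True
            queue = deque([(x, y)])
            left, top, right, bottom = x, y, x, y
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        inside = 0 <= nx < width and 0 <= ny < height
                        if inside and image[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return boxes


page = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
print(find_blobs(page))
```

Note that a fully connected word comes out as a single blob, which is exactly why working at the blob level sidesteps letter-by-letter cutting and shifts the difficulty to the later grouping and recognition steps.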

For Arabic (as you point out), Tesseract doesn't work. I don't know much about this area, but this paper seems to imply that Dynamic Time Warping (DTW) is a useful technique. DTW tries to stretch words to match them against known words, so it again works in word space rather than letter space.
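For a feel of how the "stretching" works, here is a minimal sketch of the textbook DTW distance between two numeric sequences. I don't know which features the paper actually uses, so the example simply assumes each word image has been reduced to a one-dimensional sequence (for instance, ink count per column), and the function name is mine.

```python
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic dynamic-programming DTW with absolute difference as local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances while b[j-1] is reused
                cost[i][j - 1],      # b advances while a[i-1] is reused
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]


# The second sequence is the first with its middle stretched.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

To classify an unknown word you would compute this distance against a stored sequence for each known word and pick the smallest. The cost is quadratic in the sequence lengths, so practical systems usually constrain the warping window.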
