[comp.graphics] Help: Separating Bitmap Characters

brian@ucselx.sdsu.edu (Brian Ho) (08/28/90)

Hello out there,
  I have an interesting problem, and maybe you can give me a hand or
  a hint.

  I am currently working on an OCR (Optical Character Recognition) project.
  I am now at the stage where I need to scan a page of a document and separate
  each character that appears in it.  The image of the document from the
  scanner is converted into a (binary) bitmap format, e.g.

0000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000
0001111111100000000000000111000000000000000000000000000000000000000000
0001110000000000000000000111000000000110000001101110000000000000000000
0001100000000111111100001111111000011111100001111110000000000000000000
0001100000000111101110001111000000110000110001110000000000000000000111
0001111110000111000110000111000001111111110001100000000000000000011110
0001100000000111000110000110000001111111110001100000000000000000001111
0001100000000111000110000111000001110000000001100000000000000000000011
0001100000000111000110000111000000111001110001100000000000000000000000
0001111111100111000110000011111000011111100001100000000000000000000000
0001111111100010000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000

  
  I have a function that can separate each character from the document.
  My function works fine when two characters are separated by one or more
  blank columns, as in the example shown above.
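
  For readers following along, a blank-column split of this kind might look
  roughly like the sketch below. It is not the poster's actual function; the
  name split_on_blank_columns and the list-of-strings bitmap representation
  are assumptions made for illustration.

```python
def split_on_blank_columns(rows):
    """Return (start, end) column spans, end exclusive, that contain ink.

    rows is assumed to be a list of strings made of '0' and '1'.
    """
    width = max(len(r) for r in rows)
    # A column counts as inked if any row has a '1' in it.
    inked = [any(c < len(r) and r[c] == "1" for r in rows)
             for c in range(width)]
    spans, start = [], None
    for c, on in enumerate(inked):
        if on and start is None:
            start = c                    # a new character span begins
        elif not on and start is not None:
            spans.append((start, c))     # a blank column closes the span
            start = None
    if start is not None:
        spans.append((start, width))     # span runs to the right edge
    return spans
```

  Each returned span is one candidate character, which is exactly why this
  approach merges two characters that share even a single inked column.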

  My problem is that when two characters are not separated by at least one
  blank column, I cannot distinguish/separate them. (P.S. the characters
  have unknown size.) E.g.


000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000
000111111110000000000001110000000000000000000000000000000000000000
000111000000000000000001110000000110000001101110000000000000000000
000110000001111111000011111110011111100001111110000000000000000000
000110000001111011100011110000110000110001110000000000000000000111
000111111001110001100001110001111111110001100000000000000000011110
000110000001110001100001100001111111110001100000000000000000001111
000110000001110001100001110001110000000001100000000000000000000011
000110000001110001100001110000111001110001100000000000000000000000
000111111111110001100000111110011111100001100000000000000000000000
000111111110100000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000


Here "E" and "n", and likewise "t" and "e", end up directly side by side,
with no blank column between them.


I am wondering if anybody out there can give me some advice on how to solve
this problem.  Or, if someone is facing the same type of problem, I'd like
to hear about it. Thank you.


Thanks in advance.

Brian Ho

Contact me at:

brian@yucatec.sdsu.edu
brian@ucselx.sdsu.edu

mark@calvin..westford.ccur.com (Mark Thompson) (08/28/90)

In article <1990Aug27.172757.18703@ucselx.sdsu.edu> brian@ucselx.sdsu.edu (Brian Ho) writes:
>  My problem is when two characters are sperated less than one blank column,
>  I can not distinguish/sperate the two character. (P.S. the character has
>  unknown size) e.g.
>000000000000000000000000000000000000000000000000000000000000000000
>000111111110000000000001110000000000000000000000000000000000000000
>000111000000000000000001110000000110000001101110000000000000000000
>000110000001111111000011111110011111100001111110000000000000000000
>000110000001111011100011110000110000110001110000000000000000000111
>000111111001110001100001110001111111110001100000000000000000011110
>000110000001110001100001100001111111110001100000000000000000001111
>000110000001110001100001110001110000000001100000000000000000000011
>000110000001110001100001110000111001110001100000000000000000000000
>000111111111110001100000111110011111100001100000000000000000000000
>000111111110100000000000000000000000000000000000000000000000000000
>000000000000000000000000000000000000000000000000000000000000000000
>
>
>The characters "En" and "te" are eventually appears side by side with the
>other character.

Well, provided that none of your characters exhibit "breaks" in the
bitmap (with the exception, of course, of letters such as "i"), a 3x3
neighborhood operation could clear up the "te" case by rejecting
1's that are not horizontally, vertically, or diagonally connected
to the character being extracted.
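
One way to read the 3x3 neighborhood idea is as 8-connected component
labeling: flood outward from each unvisited 1 through its eight neighbors,
so that pixels which merely share a column but never touch land in different
groups. The sketch below is a minimal version under that reading; the
function names and the list-of-strings bitmap are assumptions.

```python
from collections import deque

def label_components(rows):
    """Group '1' pixels into 8-connected components (lists of (row, col))."""
    height = len(rows)
    seen = set()
    components = []
    for r in range(height):
        for c in range(len(rows[r])):
            if rows[r][c] != "1" or (r, c) in seen:
                continue
            # Breadth-first flood fill from this unvisited inked pixel.
            queue = deque([(r, c)])
            seen.add((r, c))
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                # Visit the 3x3 neighborhood around (y, x).
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < len(rows[ny])
                                and rows[ny][nx] == "1"
                                and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            queue.append((ny, nx))
            components.append(pixels)
    return components

def bounding_box(pixels):
    """Return (left, top, right, bottom), right and bottom exclusive."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    return min(xs), min(ys), max(xs) + 1, max(ys) + 1
```

As noted above, a letter with a detached dot such as "i" comes out as two
components and would need to be merged afterwards, for example when one
bounding box sits directly above another.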

The "En" case, however, is not so easily dealt with, and if the character
size is allowed to vary greatly within a single word, it may be impossible.
I would try looking at horizontal run lengths and sudden changes in line
gradients to attempt to determine a reasonable separation point for
overlapping characters. Obviously you wouldn't resort to this unless
no match could be made for the unsplit shape.
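
As a rough illustration of picking a separation point, the sketch below uses
a simpler stand-in for the run-length and gradient analysis described above:
it counts the inked pixels in each column of a touching blob and proposes a
cut at the column with the least ink. The function names and the margin
parameter are hypothetical, and the heuristic can easily cut in the wrong
place, so a cut should be kept only if both halves then match something.

```python
def column_counts(rows, start, end):
    """Number of '1' pixels in each column of the span [start, end)."""
    return [sum(1 for r in rows if c < len(r) and r[c] == "1")
            for c in range(start, end)]

def best_cut(rows, start, end, margin=2):
    """Return the column with the least ink, or None if the span is too narrow.

    Columns closer than `margin` to either edge of the span are skipped so
    that the cut does not shave a sliver off one side. On ties the leftmost
    candidate column wins.
    """
    counts = column_counts(rows, start, end)
    candidates = range(margin, len(counts) - margin)
    if not candidates:
        return None
    offset = min(candidates, key=lambda i: counts[i])
    return start + offset
```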

Another alternative is to look intelligently at the word and its context
and make a guess at what the character(s) are. Many non-font-oriented
OCRs do this.
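
A very small example of using word context, assuming a known vocabulary is
available: take the raw string the recognizer produced and snap it to the
closest known word. This uses Python's standard difflib module; the
guess_word name, the vocabulary, and the cutoff value are illustrative
assumptions, not a description of how any particular OCR does it.

```python
import difflib

def guess_word(raw, vocabulary, cutoff=0.6):
    """Return the vocabulary word most similar to raw, or raw if none is close."""
    matches = difflib.get_close_matches(raw, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else raw
```

For instance, a misread such as "Entcr" would be mapped to "Enter" if that
word is in the vocabulary, while a string with no close match is returned
unchanged.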
+--------------------------------------------------------------------------+
|  Mark Thompson                                                           |
|  mark@westford.ccur.com                                                  |
|  ...!{decvax,uunet}!masscomp!mark   Designing high performance graphics  |
|  (508)392-2480                      engines today for a better tomorrow. |
+--------------------------------------------------------------------------+