Touching syllable segmentation

Abstract

The most challenging task in a character recognition system is segmenting the individual components of the script with maximum efficiency. This process is relatively easy for stroke-based and standard scripts. Cursive scripts are more complex: they possess a large number of overlapping and touching objects, so the statistical behaviour of their topological properties has to be studied extensively to achieve high accuracy. A certain amount of similarity exists between unconstrained handwritten text and South Indian scripts in terms of topology, component combinations, and overlapping and merging characteristics. The concept of syllables and their formation adds further complexity in Indian scripts. This paper presents the statistical behaviour of a cursive script, Telugu. The topological properties, in terms of zones, component combinations and behavioural aspects of syllables, are studied and adopted in the segmentation process, and the statistical behaviour of cursive components is evaluated. A Split Profile Algorithm is proposed for handling touching components. The proposed algorithm is evaluated on different fonts and sizes, and its performance is compared with two other methods, namely the aspect ratio and syllable width approaches.

Keywords: Segmentation, Connected Components, Syllable Segmentation, Touching Syllable, Split Profile Algorithm.

Introduction

Document image analysis and Optical Character Recognition (OCR) systems have been under continuous research for decades. Transforming paper media into searchable and revisable text gives a great boost to research in language technologies. Automated content creation from printed or handwritten documents, both ancient and more recent, is the major area of OCR research. Achieving accurate results under all possible conditions remains a challenging task. The first step in this process is to achieve maximum efficiency in character segmentation, which in turn is reflected in OCR accuracy.

Script topology plays a dominant role in the segmentation process. Structural features describe the patterns of topology and geometry through global and local properties. White spaces and pitch information are useful primitive parameters for any segmentation system. Detecting the vertical white space between successive characters is an important concept when dissecting images of machine print and even handwritten documents. Apart from this, other topological features such as height, width and orientation are useful parameters. For fixed-width characters, pitch information provides effective segmentation. However, variable-width characters are found in almost all scripts owing to the large number of font designers. As a result, various segmentation approaches have been proposed in the literature [1] to handle this complexity.

The structural properties of a natural language script are another useful piece of information to consider when choosing the segmentation approach. Scripts can be further classified as stroke based, cursive and hybrid. Stroke-based scripts can be segmented by making use of properties such as horizontal, vertical and slant line information.

Segmentation approaches for cursive scripts are complex compared with those for stroke-based scripts. The character shapes of these scripts possess variable widths and sizes and originate from a combination of more than one component. The topological or structural properties of individual components, and the way they associate with other components, transform the final shape, occasionally leading to overlapping and touching phenomena. Segmentation issues in these scripts have to be addressed by considering common statistical properties along with the specificities of the respective class of formations. The hybrid model, on the other hand, is associated with a set of strokes as well as curves. The primitive shape (glyph) is treated as the main focus of this model.

Review

Various segmentation methods adopted in document image processing are described by Richard G. Casey and Eric Lecolinet [1]. The profile-based approach proposed by K. Ohta [2] is considered a simple and effective method for segmenting a print line. Such approaches are reported to be effective for non-cursive writing systems and still find application even in handwriting recognition systems. White spaces can be detected effectively on a structured text image, and the identification and extraction of vertical strokes is made simple using this method. Analysis of the peaks and valleys of profile patterns extended the scope of the profile method to partial adaptation for touching character segmentation. In [1], the profile is first obtained, and the ratio of the second derivative of this profile to its height is then used as the criterion for identifying segment separation. Peaks of this derivative, which are associated with projection minima, locate the splitting points along thin horizontal lines.
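
The white-space detection step of a profile method can be illustrated with a minimal sketch. It assumes a binary image stored as a list of rows in which 1 marks a black pixel; the function names and the toy image are hypothetical and are not taken from [1] or [2].

```python
def vertical_profile(image):
    """Count black pixels (value 1) in each column of a binary image."""
    return [sum(column) for column in zip(*image)]


def white_space_segments(profile):
    """Return (start, end) column ranges, end exclusive, separated by empty columns."""
    segments, start = [], None
    for x, count in enumerate(profile):
        if count > 0 and start is None:
            start = x                      # first inked column of a segment
        elif count == 0 and start is not None:
            segments.append((start, x))    # an empty column closes the segment
            start = None
    if start is not None:                  # segment running to the right edge
        segments.append((start, len(profile)))
    return segments


# Toy print line: three glyphs separated by empty columns.
line = [
    [1, 1, 0, 0, 1, 0, 1, 1],
    [1, 0, 0, 0, 1, 0, 0, 1],
    [1, 1, 0, 0, 1, 0, 1, 1],
]
profile = vertical_profile(line)
segments = white_space_segments(profile)
print(profile)
print(segments)
```

Such a cut works only where a fully empty column exists between characters, which is why touching characters need the additional criteria discussed in this review.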

A peak-to-valley function, with further improvements, is proposed by Y. Lu [3]. Spatial domain characteristics based on the topological features of the script are explored. Valleys between successive peaks are extracted from the profile function. An average function is used to identify the exact segmentation point, with specific reference to touching characters. The criterion for selecting the segmentation boundary is associated with a discriminating function of the topological features of individual characters.
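
A peak-to-valley score can be sketched as follows. The formula used here, pv(x) = (V(lp) - 2 V(x) + V(rp)) / (V(x) + 1), is the form in which this function is commonly quoted; the exact definition in [3] should be checked before relying on it. Taking lp and rp as the highest profile values to the left and right of x is an assumption of this sketch, and the profile values are invented for illustration.

```python
def peak_to_valley(profile):
    """Score each interior column; high scores mark deep valleys between peaks."""
    scores = [0.0] * len(profile)
    for x in range(1, len(profile) - 1):
        left_peak = max(profile[:x])        # assumed: highest value left of x
        right_peak = max(profile[x + 1:])   # assumed: highest value right of x
        scores[x] = (left_peak - 2 * profile[x] + right_peak) / (profile[x] + 1)
    return scores


# Hypothetical vertical profile of two touching characters.
profile = [2, 6, 7, 5, 1, 4, 8, 6, 3]
scores = peak_to_valley(profile)
best_cut = max(range(len(scores)), key=scores.__getitem__)
print(best_cut, scores[best_cut])
```

The column with the highest score is a candidate cut; a practical system would still verify the candidate against topological rules, as the paragraph above notes.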

A bounding box approach is proposed by M. Cesar and R. Shinghal [4] as an alternative to the profile method. They report that it is effective on stroke-based scripts. The adjacency relationships between bounding boxes, and their sizes and aspect ratios, are explored for the splitting mechanism. High segmentation accuracy is reported even at low computational cost. Splitting and merging of character components is also reported in [4]. The connected component approach proposed in that work concentrates mainly on defining specific rules using the height and width parameters of the bounding box. An extension of connected component approaches is proposed by G. Seni and E. Cohen [5] for segmenting handwritten words in a document image.

CJK scripts are predominantly stroke based, and the relationships among the strokes are well structured [6]. Latin and other European scripts are likewise dominated by strokes. The linear property of strokes is explored in [2], using a profile-based approach, for segmenting characters of all the above scripts.

North Indian scripts are hybrid in nature, combining strokes and curves, with the curves dominating the strokes. A linear spatial relationship in the form of the shirorekha (a straight line joining components) can be found within the topological structure. The resultant form of this linear relation is treated as a zone, which is used to establish correlation among strokes and curves within the syllables. The top zone of the character has a stroke-like geometry. The position of a zone is identified by finding the linear region in the profile function of the script line. Segmentation is achieved by exploring the statistical properties of the zone information [7,8,9,10,14] using profile-based methods.

Arabic and South Indian scripts are dominated by curve-like components. In Arabic scripts, character formation is nonlinear and the baseline is identified by the peak of the vertical profile function [11]. South Indian scripts are derived from the writing style used on palm leaves and are cursive in nature in machine print as well as in handwriting. Characters are formed by combining components with a zero width joiner, and sometimes with a non-joiner, which leads to overlapping phenomena. Character formation in these scripts (a character is also known as a syllable) involves two-part glyphs in certain cases, deviating from the linear process. A notable number of non-linear combinations exist in these scripts. Segmentation therefore has to take all processes into consideration, linear as well as non-linear combinations. In this context, the statistical behaviour of component shapes within the boundaries of a text line, a word and even a syllable influences the segmentation strategy.

In South Indian scripts, curves are more predominant and the extraction of zone information is complex. A syllable is formed by a set of curves with a high degree of similarity among them. The individual components of the syllable are extracted and their relationship is studied using zone information. The extraction of zone information is complex because of the non-linear distribution of glyphs in the upper and bottom regions. Common properties are reported in [12,13], based on extensive statistical evaluation of the profile function.

The profile method [11] has found use not only in printed text but also in handwritten text. Handwritten profile information of the script line is used to identify the linear portions of the script characters. The profile information of a word differs from that of a line. At the same time, multiple word combinations of a script line possess linear behaviour in their profile patterns. Extensive statistical evaluation of various script lines is necessary when formulating rules for the zone identification process. Similar studies are extended to syllable segmentation in the present paper. A detailed description of the segmentation model is presented in the following section.

Segmentation Model

The segmentation model in this paper explores the statistical patterns of the profile vector, which signifies the topology and geometry of printed text, with specific reference to the cursive script Telugu. Preprocessing steps such as binarisation and skew correction are carried out on the document image before the segmentation algorithm is applied.

The segmentation process comprises four phases. The first phase deals with line segmentation and extraction of zone information; the second with syllable segmentation; the third with classification of the segmented syllables into touching and non-touching objects; and the fourth with segmentation of touching objects using the Split Profile Algorithm (SPA).

In the first two phases, a connected component approach is adopted for syllable segmentation. The syllable model proposed by Pratap et al. [13], presented in Fig. 1, is adopted in the present phase.

The topology of a syllable can be decomposed into glyph-like component objects. One base symbol object, also treated as the essential component, is the minimum topology. In a complex conjunct syllable, a maximum of four other components are positioned around it, as in Fig. 1. The number of components therefore varies between 1 and 5. However, the topological features in the form of zones are difficult to extract because of the inherent nature of the zero width joiner and non-joiner between components. This phenomenon is reflected in the form of touching syllables, which are predominant at various font sizes.

Using the above model, syllable segmentation and classification of touching syllables are carried out in phases 2 and 3 respectively. In the last phase, segmentation of touching syllables is addressed with the help of SPA. The topology of the various syllable components is studied after splitting the profile, and a segmentation threshold is predicted for separating touching syllables.

Line and Zone Separation

Different scripts possess varied structural features. However, machine-printed document images are structured in nature, with a similar distribution of script lines. The linear property of the pixel distribution in the Horizontal Projection Profile (HPP) is adopted for line segmentation.

In the ideal case, white spaces between text lines are treated as delimiters. Under the influence of noise, however, the profile distribution between lines reflects the random nature of the noise. In the present case we consider the ideal scenario, in which the noise component is negligible. The starting and ending points of a script line are found from the presence of a certain amount of black pixels, and the lines are segmented accordingly.
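
A minimal sketch of HPP-based line segmentation under the ideal, noise-free assumption is given below. The image is assumed to be a list of rows with 1 for black pixels; the `min_black` threshold stands in for the "certain amount" of black pixels mentioned above and is a hypothetical parameter, as are the function names.

```python
def horizontal_profile(image):
    """Count black pixels (value 1) in each row of a binary image."""
    return [sum(row) for row in image]


def segment_lines(image, min_black=1):
    """Return (start, end) row ranges, end exclusive, one per script line."""
    profile = horizontal_profile(image)
    lines, start = [], None
    for y, count in enumerate(profile):
        if count >= min_black and start is None:
            start = y                     # script line begins
        elif count < min_black and start is not None:
            lines.append((start, y))      # white row ends the script line
            start = None
    if start is not None:                 # line running to the bottom edge
        lines.append((start, len(profile)))
    return lines


# Toy page: two script lines separated by white rows.
page = [
    [0, 0, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
lines = segment_lines(page)
print(lines)
```

With noisy scans, raising `min_black` above 1 is one simple way to ignore stray pixels between lines, although the paper itself restricts attention to the noise-free case.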

The pixel distribution of script lines is studied on various document images. A certain amount of linear behaviour is found in the form of peaks and valleys, reflecting the zone information. The geometry of an individual syllable does not match the zones, which is also the case for certain words, whereas combinations of multiple words are found to be linear. One peak is observed in the first half of the profile distribution; it matches the zone separation line between the top and middle zones. The zone separation line between the middle and bottom zones, however, appears as the maximum slope in the latter half of the profile.
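
The two observations above can be turned into a small sketch. It assumes that "maximum slope" means the largest drop between consecutive rows of the HPP, which is one reading of the text rather than a rule stated in the paper; the function name and the profile values are hypothetical.

```python
def zone_boundaries(profile):
    """Estimate zone separation rows from the HPP of a single script line.

    Returns (top_middle, middle_bottom): the peak row in the first half and
    the first row after the steepest drop in the second half.
    """
    if len(profile) < 3:
        raise ValueError("profile too short to locate zones")
    half = len(profile) // 2
    # Peak in the first half: boundary between top and middle zones.
    top_middle = max(range(half), key=lambda y: profile[y])
    # Assumed reading of "maximum slope": largest fall between adjacent rows.
    steepest = max(range(half, len(profile) - 1),
                   key=lambda y: profile[y] - profile[y + 1])
    return top_middle, steepest + 1


# Hypothetical HPP of one text line (black-pixel count per row).
hpp = [2, 5, 14, 9, 10, 11, 10, 3, 2, 1]
top_middle, middle_bottom = zone_boundaries(hpp)
print(top_middle, middle_bottom)
```

As the paragraph notes, this behaviour is reliable for a whole line or several words together; applied to the profile of a single syllable the peak and slope need not coincide with the zones.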

Syllable Segmentation

In an ideal scenario, individual glyph components (Fig. 1) can be decomposed using zone information. The canonical space is extracted from the text document using a connected component approach, with a reported segmentation efficiency of 95.72% [13] when touching syllables are not addressed. A similar approach is adopted in the present phase. The component objects that are separable are identified with the help of a labelling approach. Core and non-core components are then grouped while segmenting syllable objects. These objects may also include touching syllables.
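
The labelling step can be sketched with a plain breadth-first flood fill that returns one bounding box per component. The choice of 8-connectivity and all names here are assumptions of this sketch; the grouping of core and non-core components into syllables, which depends on the model of [13], is not shown.

```python
from collections import deque


def connected_components(image):
    """Return inclusive bounding boxes (top, left, bottom, right), one per
    8-connected component of black pixels (value 1), in scan order."""
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                # Visit all eight neighbours of the current pixel.
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Toy image with three separable components.
image = [
    [1, 1, 0, 0, 0, 1],
    [1, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
]
boxes = connected_components(image)
print(boxes)
```

Two syllables that touch form a single connected component and therefore a single box, which is why the labelling step alone cannot separate them.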

Classification of Correctly Segmented Syllables

To improve segmentation efficiency, the correctly segmented syllables have to be distinguished from the others. The syllable objects extracted in the previous stage are a mixture of touching and non-touching syllable objects. Aspect ratio (the relation between component height and width) is a simple approach adopted for this purpose, and is defined in Eq. (2).
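
Since Eq. (2) is not reproduced here, the following sketch only illustrates the idea. It assumes the ratio is taken as width divided by height of the bounding box and that an unusually wide box indicates a touching object; both the orientation of the ratio and the threshold `max_ratio` are hypothetical choices, not the paper's values.

```python
def aspect_ratio(box):
    """Width / height of an inclusive bounding box (top, left, bottom, right)."""
    top, left, bottom, right = box
    return (right - left + 1) / (bottom - top + 1)


def split_by_aspect_ratio(boxes, max_ratio=1.5):
    """Separate boxes into (non_touching, touching) using an assumed threshold."""
    non_touching, touching = [], []
    for box in boxes:
        if aspect_ratio(box) > max_ratio:
            touching.append(box)       # too wide for one syllable: flag it
        else:
            non_touching.append(box)
    return non_touching, touching


# Hypothetical syllable boxes: one of ordinary width, one suspiciously wide.
boxes = [(0, 0, 19, 17), (0, 30, 19, 75)]
non_touching, touching = split_by_aspect_ratio(boxes)
print(non_touching, touching)
```

A fixed threshold of this kind is sensitive to font and size, which is consistent with the lower efficiency reported for the aspect ratio approach in the results.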

Results

The proposed algorithm is evaluated on 1,11,582 syllables in the Anupama, Gowthami and Priyanka fonts. Segmentation is carried out at font sizes 12, 14, 16 and 19. The syllable segmentation efficiencies of the aspect ratio, syllable width and Split Profile Algorithm approaches for the Anupama font are presented in Figures 15 to 18. SPA performs best, with 100% segmentation efficiency on the sample sets of sizes 14, 16 and 19. The syllable width based approach shows an average segmentation efficiency of 98%, and the aspect ratio based approach shows segmentation efficiencies ranging from 97.93% to 98.04%. For font size 12, SPA reaches a maximum segmentation efficiency of 92.98%, against 84.15% and 76.12% for the syllable width and aspect ratio approaches respectively. When evaluated on samples of font sizes 12, 14, 16 and 19, the average segmentation efficiency of SPA is 92.98%, 100%, 99.96% and 100%, whereas that of the syllable width approach is 84.15%, 98.65%, 98.77% and 98.98% and that of the aspect ratio approach is 76.12%, 97.93%, 97.63% and 98.04% respectively. A comparison of segmentation efficiencies for different fonts and sizes is presented in Tables 3, 4 and 5.
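
The per-size averages quoted in the last comparison above can be summarised with a short calculation. The means computed here are unweighted over the four font sizes, because the number of syllables per size is not given; they are therefore not the per-font averages reported by the paper, only a rough summary of the quoted figures.

```python
# Per-size segmentation efficiencies (%) as quoted in the text above.
efficiency = {
    "SPA": {12: 92.98, 14: 100.0, 16: 99.96, 19: 100.0},
    "syllable width": {12: 84.15, 14: 98.65, 16: 98.77, 19: 98.98},
    "aspect ratio": {12: 76.12, 14: 97.93, 16: 97.63, 19: 98.04},
}


def unweighted_mean(values):
    """Plain arithmetic mean; ignores how many syllables each size contains."""
    values = list(values)
    return sum(values) / len(values)


means = {name: unweighted_mean(by_size.values())
         for name, by_size in efficiency.items()}

# Margin of SPA over the other two approaches at the hardest size (12).
gain_at_12 = {name: efficiency["SPA"][12] - by_size[12]
              for name, by_size in efficiency.items() if name != "SPA"}

for name, mean in means.items():
    print(f"{name}: unweighted mean {mean:.3f}%")
for name, gain in gain_at_12.items():
    print(f"SPA margin over {name} at size 12: {gain:.2f} points")
```

On these figures the gap between the methods is concentrated at font size 12, where SPA leads the syllable width approach by 8.83 points and the aspect ratio approach by 16.86 points; at the larger sizes all three methods are within about 2.4 points of each other.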

Conclusions

Topology and geometry are observed to be among the most important information about any script. Extensive study of the statistical properties of topology is crucial for improving segmentation accuracy. In this paper an attempt is made in this direction on the popular cursive script Telugu. The profile function is used to separate linear regions from non-linear portions in the script line as well as in touching syllables. A general approach on these scripts (the connected component approach) is compared with the proposed Split Profile Algorithm. The highest average segmentation efficiency observed with SPA is 99.98%, 99.47% and 99.05% on the Anupama, Gowthami and Priyanka fonts respectively. Experimental evaluation of the proposed algorithm on small font sizes is in progress, as is its extension to segmentation and classification with a priori knowledge.

References

  1. Richard G. Casey and Eric Lecolinet, "A survey of methods and strategies in character segmentation", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, no. 7, pp. 690-706, July 1996.
  2. K. Ohta, I. Kaneko, Y. Itamoto, and Y. Nishijima, "Character Segmentation of Address Reading/Letter Sorting Machine" for the Ministry of Posts and Telecommunications of Japan, NEC Research and Development, vol. 34, no. 2, pp. 248-256, Apr. 1993
  3. Y. Lu, "On the Segmentation of Touching Characters," Int'l Conf. Document Analysis and Recognition, Tsukuba, Japan, pp. 440-443, Oct. 1993.
  4. M. Cesar and R. Shinghal, "Algorithm for Segmenting Handwritten Postal Codes," Int'l J. Man Machine Studies, vol. 33, no. 1, pp. 63-80, July 1990.
  5. G. Seni and E. Cohen, "External Word Segmentation of Off- Line Handwritten Text Lines," Pattern Recognition, vol. 27, no. 1, pp. 41-52, Jan. 1994.
  6. K.W. Gan, K.T. Lua, "A new approach to stroke and feature point extraction in Chinese character recognition". Pattern Recognition Letters, Vol. 12 , no. 6, pp 381-386 , June 1991
  7. B. B. Chaudhuri, U. Pal, "A complete printed Bangla OCR system" Pattern Recognition, vol.31, No. 5, pp. 531- 549, March 1998.
  8. Veena Bansal, R. M. K. Sinha, "Integrating knowledge sources in Devanagari text recognition system", IEEE Transactions on Systems, Man, and Cybernetics, Part A : Systems and Human, vol. 30, no. 4, pp 500-505, July 2000.
  9. M. K. Jindal, G. S. Lehal, R. K. Sharma, "A Study of Touching Characters in degraded Gurmukhi Script", World Academy of Science, Engineering and Technology, vol.4,pp 121-124, 2005
  10. U. Pal and Sagarika Datta, "Segmentation of Bangla Unconstrained Handwritten Text", Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR 2003),August 2003.
  11. Liana M. Lorigo, Venu Govindaraju, "Offline Arabic Handwriting Recognition: A Survey" IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 5, pp. 712-724, May 2006
  12. Pratap Reddy L, Satyaprasad L, ASCS Sastry, "Middle Zone Component Extraction and Recognition of Telugu Document Image", Ninth International Conference on Document Analysis and Recognition,(ICDAR 2007), Vol 2, pp 584 - 588, September 2007.
  13. Pratap Reddy, L. Sastry, A.S.C. Rao, A.V.S. Venkat Rao, N., "Canonical syllable segmentation of Telugu document images", TENCON 2008 - IEEE Region 10 Conference , pp 1-5, Nov-2008.
  14. M. K. Jindal, R. K. Sharma, G. S. Lehal "Segmentation of touching characters in upper zone in printed Gurmukhi script", ACM Annual Bangalore Compute Conference, Article No.: 9, 2009 .
  15. L. Pratap Reddy, "A New Scheme for Information Interchange in Telugu through Computer Networks", Ph.D. thesis, Department Electronics and Communication, JNTUniversity, Hyderabad, INDIA, May 2001
