Handwriting recognition techniques fall into two distinct categories: on-line and off-line. In off-line handwriting recognition, an image of the handwriting is stored in a digital format, for example a handwritten letter which has been digitised using a scanner. When a recognition algorithm is run on this type of data, the only information available is the pixel co-ordinates of the pen strokes which make up each character or word.
In on-line handwriting recognition, the character data is collected whilst the user is actually writing. This kind of data can be captured using a tool such as a graphics tablet, touch screen or digital pen. Capturing data in this way preserves information such as character stroke order and stroke velocity, which can be useful for training a recognition algorithm and provides a significant improvement in recognition accuracy over off-line techniques. Often, off-line and on-line features are combined to maximise the likelihood of a correct classification.
Handwriting recognition is an important and active area of informatics research, offering automated solutions to many monotonous and time-consuming problems such as signature verification, postal-address interpretation and bank cheque processing. One of the more recent applications of this technology is the automatic transcription of meeting notes, and it is on this particular area that the project focuses.
The IAM On-Line Handwriting Database is the chosen corpus and contains forms of handwritten English text acquired from an electronic interactive whiteboard. The database is stored in XML format, including the writer ID, the transcription and the setting of the recording. For each writer, the gender, the native language and some other facts which could be useful for analysis are also stored in the database. For the purposes of this project, however, only the pen stroke data is used.
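As a rough illustration of how pen stroke data might be read from such an XML file, the following minimal sketch parses a small inline sample. The element and attribute names used here (`StrokeSet`, `Stroke`, `Point`, `x`, `y`, `time`) and the sample values are assumptions made for illustration and should be checked against the actual database files.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: element names, attribute names and values are
# assumptions for illustration, not taken from the database itself.
SAMPLE_XML = """
<WhiteboardCaptureSession>
  <StrokeSet>
    <Stroke>
      <Point x="10" y="20" time="0.00"/>
      <Point x="12" y="24" time="0.02"/>
    </Stroke>
    <Stroke>
      <Point x="30" y="20" time="0.50"/>
      <Point x="31" y="26" time="0.52"/>
      <Point x="33" y="30" time="0.54"/>
    </Stroke>
  </StrokeSet>
</WhiteboardCaptureSession>
"""


def parse_strokes(xml_text):
    """Return a list of strokes; each stroke is a list of (x, y, t) tuples."""
    root = ET.fromstring(xml_text.strip())
    strokes = []
    for stroke_el in root.iter("Stroke"):
        # Keep only the pen positions and timestamps, ignoring writer metadata.
        points = [
            (float(p.get("x")), float(p.get("y")), float(p.get("time")))
            for p in stroke_el.iter("Point")
        ]
        strokes.append(points)
    return strokes


strokes = parse_strokes(SAMPLE_XML)
print(len(strokes), [len(s) for s in strokes])
```

Representing each stroke as an ordered list of timestamped points keeps the stroke order and timing available to later feature extraction.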
The IAM On-Line Handwriting Database consists of:
- 221 writers
- More than 1700 acquired forms
- 13,049 isolated and labelled text lines in on-line and off-line format
- 86,272 word instances from an 11,059-word dictionary
Handwriting recognition systems generally assume that the written text is a realisation of some message encoded as a sequence of one or more symbols.
Description of the meaning of pen prosodic data and why it is hoped to improve recognition accuracy
Up until now most handwriting recognition techniques have focussed on
Converting handwritten notes, taken at meetings, into a machine-readable format is essential for developing automated meeting browsers. Since notes are often scribbled and fragmented, and the image resolution of the handwriting is limited by the hardware constraints of the electronic pen devices used, recognising the notes accurately remains a challenging task for state-of-the-art automatic character recognition.
The project will investigate the use of on-line and off-line features of written characters, especially prosody-like features such as pen-up / pen-down durations and pen speed.
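To make the prosody-like features concrete, the sketch below computes pen-down durations, pen-up durations and a mean pen speed from timestamped strokes. The stroke representation (lists of `(x, y, t)` tuples), the function names and the sample values are assumptions for illustration, not the project's actual feature set.

```python
import math


def pen_down_durations(strokes):
    """Time the pen spends on the surface during each stroke."""
    return [s[-1][2] - s[0][2] for s in strokes]


def pen_up_durations(strokes):
    """Time between the end of one stroke and the start of the next."""
    return [nxt[0][2] - cur[-1][2] for cur, nxt in zip(strokes, strokes[1:])]


def mean_pen_speed(stroke):
    """Path length divided by duration, or None when speed is undefined."""
    duration = stroke[-1][2] - stroke[0][2]
    if len(stroke) < 2 or duration <= 0:
        return None  # a single point or zero duration gives no usable speed
    length = sum(
        math.hypot(x1 - x0, y1 - y0)
        for (x0, y0, _), (x1, y1, _) in zip(stroke, stroke[1:])
    )
    return length / duration


# Hypothetical strokes: (x, y, time in seconds)
example = [
    [(0.0, 0.0, 0.0), (3.0, 4.0, 0.5)],
    [(10.0, 0.0, 1.25), (10.0, 6.0, 1.75), (18.0, 6.0, 2.25)],
]
down = pen_down_durations(example)
up = pen_up_durations(example)
speeds = [mean_pen_speed(s) for s in example]
print(down, up, speeds)
```

Note that the speed is returned as `None` when it cannot be computed, which reflects the fact that some of these features are not always observable.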
Hidden Markov models (HMMs) are widely used statistical models for characterising sequences of speech spectra and have often been applied successfully to handwritten character recognition. Standard HMMs can be classified as either continuous or discrete, modelling continuous vectors or discrete symbols respectively. However, neither the conventional discrete HMM nor the conventional continuous HMM can be applied to observation sequences which consist of both continuous values and discrete symbols.
Prosodic features of handwriting are not continuous and are sometimes unobservable, whereas non-prosodic features can be continuous, so standard HMMs cannot model both kinds of feature together. It is proposed that a multi-space distribution HMM (MSD-HMM), which is an extended version of the HMM, can be employed to address this problem and may prove to be a more accurate classification technique.
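The idea behind a multi-space distribution can be sketched for a single state and a single feature. In the general formulation, each space has a weight, the weights sum to one, and a zero-dimensional space (used for an unobservable value) has its density defined as one. The example below assumes just two spaces, one zero-dimensional and one with a one-dimensional Gaussian. The function names and parameter values are hypothetical, and this is a simplified illustration rather than the classifier the project will build.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def msd_emission(obs, w_unobserved, w_continuous, mean, var):
    """Emission probability for one state with two spaces.

    obs is None when the feature is unobservable (zero-dimensional space,
    density fixed at 1), otherwise a float in the one-dimensional space.
    """
    if not math.isclose(w_unobserved + w_continuous, 1.0):
        raise ValueError("space weights must sum to 1")
    if obs is None:
        return w_unobserved
    return w_continuous * gaussian_pdf(obs, mean, var)


# Hypothetical parameters: the feature is unobservable 30% of the time
# in this state, and is otherwise Gaussian with mean 10 and variance 4.
p_missing = msd_emission(None, 0.3, 0.7, mean=10.0, var=4.0)
p_at_mean = msd_emission(10.0, 0.3, 0.7, mean=10.0, var=4.0)
print(p_missing, p_at_mean)
```

A single observation sequence can therefore mix observable and unobservable frames without replacing the missing values with an arbitrary number.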
The project will use the AMI Corpus, which consists of a large collection of over 100 hours of meeting notes captured using electronic whiteboards and digital pens. This database provides a collection of handwritten characters stored in XML format which can be used to train and test the final classifier. The collection is currently not labelled, so some time will need to be spent annotating it to construct a useful dataset.
The classifier itself will be constructed using the Hidden Markov Model Toolkit (HTK), which provides a collection of sophisticated tools for HMM training, testing and analysis. The toolkit currently supports both discrete and continuous probability distributions but will need to be extended to include multi-space distributions. The project aims to compare the overall accuracy and efficiency of the MSD-HMM technique with those of the standard continuous and discrete HMMs mentioned previously.