Project Title: Title Page Recognition

Character recognition remains a frontier area of study in pattern recognition and image processing. Although extensive studies have been carried out on foreign scripts such as English, Chinese and Japanese, very little work can be traced for Indian scripts, particularly South Indian scripts such as Telugu and Kannada. Telugu is one of the oldest and most popular languages of India. Recognition of these scripts is a challenging task.

In this project we aim to identify the title of a book from a scanned digital image of its front page, which contains Telugu text and images. A special feature of the system is that it is designed to handle multiple font sizes and multiple fonts. Extracting information from images and scanned documents is, even today, often regarded as something close to science fiction.

The project starts by scanning the front page of the book and saving it as an image. The subsequent steps are preprocessing, segmentation, feature selection and recognition. Once the characters are recognized, they are categorized as title, author, publisher and so on.
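The preprocessing and segmentation steps above can be illustrated with a minimal sketch. It assumes a grayscale page stored as an int[][] array, a fixed binarization threshold, and text lines separated by rows that contain no ink. The class LineSegmenter and its method names are hypothetical and are not taken from the project.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (an assumption, not the project's code): binarization and text-line segmentation. */
final class LineSegmenter {

    /** Marks a pixel as ink (true) when its gray value is below the given threshold. */
    static boolean[][] binarize(int[][] gray, int threshold) {
        boolean[][] ink = new boolean[gray.length][];
        for (int y = 0; y < gray.length; y++) {
            ink[y] = new boolean[gray[y].length];
            for (int x = 0; x < gray[y].length; x++) {
                ink[y][x] = gray[y][x] < threshold;
            }
        }
        return ink;
    }

    /** Horizontal projection profile: the number of ink pixels in each row. */
    static int[] rowProfile(boolean[][] ink) {
        int[] profile = new int[ink.length];
        for (int y = 0; y < ink.length; y++) {
            for (boolean pixel : ink[y]) {
                if (pixel) {
                    profile[y]++;
                }
            }
        }
        return profile;
    }

    /** Returns {firstRow, lastRow} (inclusive) for every run of consecutive rows that contain ink. */
    static List<int[]> segmentLines(int[] profile) {
        List<int[]> lines = new ArrayList<>();
        int start = -1; // -1 means we are currently between text lines
        for (int y = 0; y < profile.length; y++) {
            if (profile[y] > 0 && start < 0) {
                start = y;
            } else if (profile[y] == 0 && start >= 0) {
                lines.add(new int[] {start, y - 1});
                start = -1;
            }
        }
        if (start >= 0) {
            lines.add(new int[] {start, profile.length - 1}); // line touching the bottom edge
        }
        return lines;
    }
}
```

A real front page would probably need noise removal and skew correction before this step, and an adaptive threshold instead of a fixed one. The sketch only shows how a projection profile separates text lines.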

Technologies to be used:

Operating System     : Windows

Programming Language : Java

Commercial OCR packages are already available for languages such as English, and considerable work has also been done for languages such as Japanese and Chinese. Recently, work has begun on the development of OCR systems for Indian languages. Telugu is one of the popular languages of India and is spoken by more than 66 million people, especially in South India. Work on Telugu character recognition is not yet substantial. Recognition of Telugu script is much more complicated because of the huge number of combinations of characters and modifiers. Basic symbols are identified as the unit of recognition in Telugu script, and histograms are used in a feature-based recognition scheme for these basic symbols.
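The kind of histogram feature mentioned above can be sketched as follows. The source does not say which histograms or which classifier are used, so this sketch assumes row and column ink histograms resampled to a fixed number of bins (so that symbols of different sizes give comparable features) and a simple nearest-template match. The class HistogramFeatures and its methods are hypothetical.

```java
/** Hypothetical sketch (an assumption, not the project's code): histogram features for one binary symbol. */
final class HistogramFeatures {

    /**
     * Builds a feature vector of length 2 * bins from a non-empty rectangular bitmap.
     * The first half is the row histogram, the second half the column histogram.
     * Each half is scaled so that it sums to 1 when the symbol contains ink.
     */
    static double[] extract(boolean[][] symbol, int bins) {
        int height = symbol.length;
        int width = symbol[0].length;
        double[] features = new double[2 * bins];
        int total = 0;
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                if (symbol[y][x]) {
                    features[y * bins / height]++;       // row histogram bin
                    features[bins + x * bins / width]++; // column histogram bin
                    total++;
                }
            }
        }
        if (total > 0) {
            for (int i = 0; i < features.length; i++) {
                features[i] /= total;
            }
        }
        return features;
    }

    /** Returns the index of the template with the smallest squared Euclidean distance to the query. */
    static int nearest(double[] query, double[][] templates) {
        int best = -1;
        double bestDistance = Double.MAX_VALUE;
        for (int t = 0; t < templates.length; t++) {
            double distance = 0.0;
            for (int i = 0; i < query.length; i++) {
                double diff = query[i] - templates[t][i];
                distance += diff * diff;
            }
            if (distance < bestDistance) {
                bestDistance = distance;
                best = t;
            }
        }
        return best;
    }
}
```

Because the bins are relative to the symbol's own height and width, the same shape scanned at two sizes maps to the same vector in simple cases. Handling multiple fonts would likely need richer features or several templates per symbol.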

The main objective of the Title Page Recognition project is to identify the titles of textbooks from scanned digital images that contain Telugu text and images. A special feature of the system is that it is designed to handle multiple font sizes and multiple fonts. Extracting information from images and scanned documents is, even today, often regarded as something close to science fiction. The project starts by scanning the front page of the book and saving it as an image. The subsequent steps are preprocessing, segmentation, feature selection and recognition. Once the characters are recognized, they are categorized as title, author, publisher and so on, which is useful for easy storage and retrieval.
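The source does not describe how recognized text is assigned to title, author or publisher. As one purely illustrative possibility, the sketch below assumes that the tallest text line on the front page is the title, the bottom-most remaining line is the publisher, and all other lines belong to the author. The class FieldLabeler and this layout heuristic are assumptions, not the project's method.

```java
/** Hypothetical layout heuristic (an assumption, not the project's method) for labeling text lines. */
final class FieldLabeler {

    enum Field { TITLE, AUTHOR, PUBLISHER }

    /** Height in pixels of a line given as {firstRow, lastRow}, both inclusive. */
    static int height(int[] line) {
        return line[1] - line[0] + 1;
    }

    /**
     * Labels text lines ordered from top to bottom.
     * The tallest line becomes TITLE, the last line becomes PUBLISHER
     * (unless it is the title), and every other line becomes AUTHOR.
     */
    static Field[] label(int[][] lines) {
        Field[] labels = new Field[lines.length];
        if (lines.length == 0) {
            return labels;
        }
        int tallest = 0;
        for (int i = 1; i < lines.length; i++) {
            if (height(lines[i]) > height(lines[tallest])) {
                tallest = i;
            }
        }
        for (int i = 0; i < lines.length; i++) {
            if (i == tallest) {
                labels[i] = Field.TITLE;
            } else if (i == lines.length - 1) {
                labels[i] = Field.PUBLISHER;
            } else {
                labels[i] = Field.AUTHOR;
            }
        }
        return labels;
    }
}
```

Real covers vary widely in layout, so a heuristic of this kind would need to be checked against actual Telugu title pages and probably combined with the recognized text itself.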

CONCLUSION

Document analysis and understanding for Indian languages has lagged behind the recognition systems available for languages such as English, and considerable work has also been done for languages such as Japanese and Chinese. Recently, work has begun on the development of OCR systems for Indian languages. Telugu is one of the popular languages of India. In this work we propose a recognizer for Telugu text that extracts characters from the scanned image and identifies the title and the author of the respective Telugu book.

We have analyzed the Telugu script and its language rules in order to build a language model for better document understanding. We have explored pattern recognition algorithms for designing the feature extraction, classification and post-processing mechanisms used in the recognition of document images.

Further work includes implementing and comparing various pattern recognition and image processing algorithms for recognizing the titles and authors of Telugu textbooks.

Reference books:

  1. Principles of Digital Image Processing: Fundamental Techniques by Wilhelm Burger and Mark J. Burge.
  2. Fundamentals of Digital Image Processing by Anil K. Jain.