Deep learning for the development of an OCR for old Tibetan books
Kirill Brodt,
Novosibirsk State University,
Russia
Oleg Rinchinov,
Institute for Mongolian, Buddhist and Tibetan Studies, SB RAS,
Russia
Andrei Bazarov,
Institute for Mongolian, Buddhist and Tibetan Studies, SB RAS,
Russia
Aleksei Okunev,
Novosibirsk State University,
Russia

Motivation and Goals: The holdings of the Center of Oriental Manuscripts
and Xylographs of the Institute for Mongolian, Buddhist and Tibetan Studies of
the Siberian Branch of the Russian Academy of Sciences contain one of the richest
collections of the Buddhist heritage in manuscripts and woodblock prints, numbering
about 100 thousand books in Tibetan and Mongolian. Manual processing of
these writings and their conversion into machine-readable form is tedious
and time-consuming. Automation of optical character recognition (OCR) allows us
to preserve these cultural treasures for future generations.
Methods and Algorithms: We apply advances in computer vision and deep
learning to build an OCR system for old Tibetan books. Experts in Tibetan studies at
the Institute for Mongolian, Buddhist and Tibetan Studies of the Siberian
Branch of the Russian Academy of Sciences annotated a dataset of scanned rare
woodblock prints, namely the Jone (Chone) Kangyur of the early 18th century. These
works were chosen because they represent the woodcutting and printing
quality typical of most Tibetan books. The dataset includes 500 images of book
pages, each with text annotations. We split the dataset into 420 training
images and 30 testing images. The RGB images are 2048 pixels wide, with an
average height of 650 pixels, and contain 8 text lines on average. Each
line is framed with a bounding box, and the corresponding text annotation is
transliterated into Latin characters. The text length per line is 300 characters
on average. First, we detect rectangular bounding boxes containing text in the image.
We then crop the boxes from the image and extract the text with an OCR model.
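To illustrate this two-stage pipeline, the following sketch (in Python, using PIL for image handling) shows how detected line boxes could be cropped from a page and passed to a recognizer. The detect_lines and recognize functions and the file name are hypothetical placeholders, not the project's actual models.

    # Sketch of the two-stage pipeline: detect text-line boxes, crop them,
    # then run an OCR model on each crop. detect_lines() and recognize()
    # are hypothetical stubs standing in for the real trained models.
    from PIL import Image

    def detect_lines(image):
        # Placeholder: a real detector would return one (left, top, right, bottom)
        # box per text line (about 8 lines per page in this dataset).
        return [(0, 0, image.width, image.height // 8)]

    def recognize(line_image):
        # Placeholder: a real OCR model would return the Latin transliteration
        # of the Tibetan text in the cropped line image.
        return "bkra shis bde legs"

    def ocr_page(path):
        image = Image.open(path).convert("RGB")
        lines = []
        for box in detect_lines(image):
            crop = image.crop(box)          # cut the text line out of the page
            lines.append(recognize(crop))   # transliterated text for this line
        return "\n".join(lines)

    # Example usage (with a hypothetical page scan):
    # print(ocr_page("page_0001.png"))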
Results: We evaluate the performance of the baseline model on the validation
dataset, which consists of 30 high-resolution images, each containing 8 text lines.
We use standard OCR metrics. The model reaches a character-by-character
recall (char_recall) of 0.9495, a character-by-character precision
(char_precision) of 0.9539, and a normalized edit distance (1-N.E.D) of 0.9397.
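For reference, one common way to compute these metrics (a sketch in Python, not necessarily the project's exact evaluation code) is based on the Levenshtein edit distance between predicted and ground-truth lines: 1-N.E.D averages 1 - ED / max(|gt|, |pred|) over all lines, while character precision and recall count matched characters relative to the lengths of the prediction and the ground truth, respectively.

    # Sketch of the OCR metrics: character precision/recall from matched
    # characters, and normalized edit distance (1-N.E.D). This follows one
    # common formulation and may differ from the project's exact code.
    from difflib import SequenceMatcher

    def edit_distance(a, b):
        # Standard Levenshtein distance via dynamic programming.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def evaluate(predictions, ground_truths):
        matched = pred_len = gt_len = 0
        ned_sum = 0.0
        for pred, gt in zip(predictions, ground_truths):
            blocks = SequenceMatcher(None, pred, gt).get_matching_blocks()
            matched += sum(b.size for b in blocks)   # characters found in both strings
            pred_len += len(pred)
            gt_len += len(gt)
            ned_sum += 1.0 - edit_distance(pred, gt) / max(len(pred), len(gt), 1)
        return {
            "char_precision": matched / pred_len,
            "char_recall": matched / gt_len,
            "1-N.E.D": ned_sum / len(ground_truths),
        }

    # Example usage with toy transliterated lines:
    print(evaluate(["bkra shis"], ["bkra shis bde legs"]))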
Conclusion: We have created a full-featured OCR system for the
Tibetan script and implemented a processing pipeline that decodes scanned Tibetan
text. The results of the project were included in the report
of the President of the Russian Academy of Sciences, Academician A.M. Sergeev,
presented to the President of the Russian Federation V.V. Putin.
Acknowledgements: We acknowledge the financial support of MTS AI LLC. We
also thank the Presidium of the SB RAS for the idea and organizational support.
Key words: deep learning, Tibetan xylographs, optical character recognition (OCR)