Historical Document Digitization through Layout Analysis and Deep Content Classification
Abstract: Document layout segmentation and recognition is an important task in the creation of digitized document collections, especially when dealing with historical documents. This paper presents a hybrid approach to layout segmentation, together with a strategy for classifying document regions, and applies both to the digitization of a historical encyclopedia. Our layout analysis method merges a classic top-down approach with a bottom-up classification process based on local geometrical features, while regions are classified by means of features extracted from a Convolutional Neural Network and combined in a Random Forest classifier. Experiments are conducted on the first volume of the "Enciclopedia Treccani", a large dataset containing 999 manually annotated pages from the historical Italian encyclopedia.
Citation: Corbelli, Andrea; Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita. "Historical Document Digitization through Layout Analysis and Deep Content Classification". Proceedings of the 23rd International Conference on Pattern Recognition, Cancun, Mexico, 4-8 Dec 2016. DOI: 10.1109/ICPR.2016.7900272
- DOI: 10.1109/ICPR.2016.7900272