Layout and Content Analysis in Digitized Books

Automatic layout analysis is an essential step when digitizing large document collections. We developed a complete pipeline for layout analysis and content classification that introduces an SVM-aided layout segmentation process. The final output of the automatic analysis is a complete, structured annotation in JSON format. It contains the digitized text and references to all the illustrations on the input page, and it can be consumed by both visualization and annotation interfaces.
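As a rough illustration of what such a structured annotation could look like, the sketch below assembles a per-page JSON document from already classified regions. The schema is an assumption made for this example only: the field names (`volume`, `page`, `text_blocks`, `illustrations`, `bbox`, `image_ref`), the helper `make_page_annotation`, and the sample values are hypothetical and do not describe the actual output format of the pipeline.

```python
import json


def make_page_annotation(volume, page, regions):
    """Build a structured annotation for one page from classified regions.

    All field names are hypothetical; they only illustrate the idea of
    storing digitized text alongside references to the page illustrations.
    """
    annotation = {
        "volume": volume,
        "page": page,
        "text_blocks": [],
        "illustrations": [],
    }
    for region in regions:
        # Bounding box as [x, y, width, height] in pixels (assumed convention).
        entry = {"bbox": region["bbox"]}
        if region["label"] == "text":
            entry["text"] = region["text"]
            annotation["text_blocks"].append(entry)
        else:
            entry["image_ref"] = region["image_ref"]
            annotation["illustrations"].append(entry)
    return annotation


# Made-up regions standing in for the output of layout segmentation.
regions = [
    {"label": "text", "bbox": [40, 60, 500, 300], "text": "Example lemma text."},
    {"label": "illustration", "bbox": [40, 380, 500, 260],
     "image_ref": "vol1_p12_img0.png"},
]

annotation = make_page_annotation(1, 12, regions)
serialized = json.dumps(annotation, indent=2)
print(serialized)
```

Keeping text blocks and illustrations in separate lists, each with its own bounding box, is one simple way to let an interface overlay the extracted content on the page image.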

Treccani Annotation Interface

Besides the content analysis algorithm, two tools have been developed. The first is an annotation tool that lets a user view the result of the analysis and, if needed, modify the segmentation results. This tool serves two purposes: it makes it possible to create an annotated dataset for the subsequent learning and evaluation processes, and it lets users correct the processed data.


Treccani Visualization Interface


The second tool is a visualization interface used to present and browse the content of the original book, making all the information easily accessible. It lets users access the content at different levels and from different points of view: the encyclopedia can be browsed page by page, lemma by lemma, and image by image within each volume. The full text is also accessible and readable in HTML format. Hovering the cursor over a page shows the underlying extracted content, and double-clicking on it takes the user to a different view that displays the digitized version of the document.
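One way to support browsing by page, by lemma, and by image is to build a separate lookup index for each view from the per-page annotations. The following is a minimal sketch under stated assumptions: the page dictionaries, the `lemma` and `image_ref` fields, and the function `build_indexes` are hypothetical and are not taken from the actual interface.

```python
from collections import defaultdict


def build_indexes(pages):
    """Build page, lemma, and image lookups from per-page annotations.

    The input structure is an assumption made for this sketch.
    """
    by_page = {}
    by_lemma = defaultdict(list)
    by_image = {}
    for page in pages:
        key = (page["volume"], page["page"])
        by_page[key] = page
        for block in page["text_blocks"]:
            # A lemma may span several pages; record each page only once.
            if key not in by_lemma[block["lemma"]]:
                by_lemma[block["lemma"]].append(key)
        for illustration in page["illustrations"]:
            by_image[illustration["image_ref"]] = key
    return by_page, dict(by_lemma), by_image


# Made-up annotations for two consecutive pages of one volume.
pages = [
    {"volume": 1, "page": 12,
     "text_blocks": [{"lemma": "Abaco", "text": "First part of the entry."}],
     "illustrations": [{"image_ref": "vol1_p12_img0.png"}]},
    {"volume": 1, "page": 13,
     "text_blocks": [{"lemma": "Abaco", "text": "Second part of the entry."},
                     {"lemma": "Abate", "text": "Another entry."}],
     "illustrations": []},
]

by_page, by_lemma, by_image = build_indexes(pages)
print(by_lemma["Abaco"])              # pages on which the lemma appears
print(by_image["vol1_p12_img0.png"])  # page that contains the image
```

With the three indexes in place, each browsing mode becomes a dictionary lookup that resolves to a `(volume, page)` key, which the page view can then display.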



1 Corbelli, Andrea; Baraldi, Lorenzo; Balducci, Fabrizio; Grana, Costantino; Cucchiara, Rita. "Layout analysis and content classification in digitized books". Digital Libraries and Multimedia Archives, vol. 701, Firenze, pp. 153-165, Feb. 4-5, 2017 | DOI: 10.1007/978-3-319-56300-8_14 | Conference
2 Corbelli, Andrea; Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita. "Historical Document Digitization through Layout Analysis and Deep Content Classification". Proceedings of the 23rd International Conference on Pattern Recognition, Cancun, Mexico, Dec. 4-8, 2016 | DOI: 10.1109/ICPR.2016.7900272 | Conference
3 Grana, Costantino; Serra, Giuseppe; Manfredi, Marco; Coppi, Dalia; Cucchiara, Rita. "Layout analysis and content enrichment of digitized books". Multimedia Tools and Applications, vol. 75, pp. 3879-3900, 2016 | DOI: 10.1007/s11042-014-2360-0 | Journal
4 Coppi, Dalia; Grana, Costantino; Cucchiara, Rita. "Illustrations Segmentation in Digitized Documents Using Local Correlation Features". Proceedings of the 10th Italian Research Conference on Digital Libraries, Procedia Computer Science, vol. 38, Padova, pp. 76-83, Jan. 30-31, 2014 | DOI: 10.1016/j.procs.2014.10.014 | Conference

