A Run Length Smoothing-based Algorithm for Non-Manhattan Document Segmentation
Abstract
Layout analysis is a fundamental step in automatic document processing, because its outcome affects all subsequent processing steps. Many different techniques have been proposed to perform this task. In this work, we propose a general bottom-up strategy for the layout analysis of (possibly) non-Manhattan documents, and two specializations of it that handle bitmap and PS/PDF sources, respectively. A well-known approach to layout analysis in the literature is the Run Length Smoothing Algorithm (RLSA). Here we consider a variant of RLSA, called RLSO (short for “Run-Length Smoothing with OR”), that uses the logical OR operator instead of AND and is particularly suited to identifying frames in non-Manhattan layouts. Like RLSA, RLSO relies on thresholds, but they are set according to criteria different from those used in RLSA. Since setting such thresholds is a hard and unnatural task even for expert users, and no single threshold can fit all documents, we developed a technique that automatically determines the thresholds for each specific document, based on the distribution of spacing within it. Application to selected sample documents, which cover a significant range of real cases, showed that the approach is satisfactory for documents that use a uniform text font size.
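To make the AND/OR distinction concrete, the following is a minimal sketch of run-length smoothing on a small binary image (1 = black, 0 = white). It is not the authors' implementation: the function names, the choice to fill only white runs enclosed between two black pixels, and the threshold values are all assumptions made for illustration. The automatic threshold selection mentioned in the abstract is not covered, because the abstract does not describe it in detail.

```python
def smooth_runs(line, threshold):
    """Fill white runs (0s) of length <= threshold enclosed between two black pixels (1s)."""
    out = list(line)
    last_black = None
    for i, pixel in enumerate(line):
        if pixel == 1:
            if last_black is not None and 0 < i - last_black - 1 <= threshold:
                for j in range(last_black + 1, i):
                    out[j] = 1
            last_black = i
    return out


def smooth_horizontal(image, threshold):
    """Apply run smoothing to every row."""
    return [smooth_runs(row, threshold) for row in image]


def smooth_vertical(image, threshold):
    """Apply run smoothing to every column by transposing, smoothing, transposing back."""
    columns = [list(col) for col in zip(*image)]
    smoothed = [smooth_runs(col, threshold) for col in columns]
    return [list(row) for row in zip(*smoothed)]


def combine(image, h_threshold, v_threshold, use_or):
    """Combine the horizontal and vertical smoothings with OR (RLSO-style) or AND (RLSA-style)."""
    h = smooth_horizontal(image, h_threshold)
    v = smooth_vertical(image, v_threshold)
    if use_or:
        return [[a | b for a, b in zip(hr, vr)] for hr, vr in zip(h, v)]
    return [[a & b for a, b in zip(hr, vr)] for hr, vr in zip(h, v)]


# Hypothetical 3x4 image and thresholds, chosen only to show the difference.
image = [
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
]
and_result = combine(image, h_threshold=2, v_threshold=1, use_or=False)
or_result = combine(image, h_threshold=2, v_threshold=1, use_or=True)
```

In this toy case the AND combination leaves the image unchanged, while the OR combination keeps every pixel filled by either pass, so the four corner pixels are joined into a single closed outline.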
Apulian author
All authors
-
ESPOSITO F.;FERILLI S.
Volume/journal title
Not available
Year of publication
2012
ISSN
Not available
ISBN
Not available
Number of WoS citations
No citations
Last citation update
Not available
Number of Scopus citations
Not available
Last citation update
Not available
ERC sectors
Not available
ASJC codes
Not available