A Histogram-Based Technique for Automatic Threshold Assessment in a Run Length Smoothing-based Algorithm
Abstract
Document layout analysis is crucial in the automatic document processing workflow, because its outcome affects all subsequent processing steps. A first problem concerns the ability to deal not only with documents having a simple layout, but with so-called non-Manhattan layout documents as well. Another problem is that most available techniques can be applied only to scanned documents, because in previous decades the emphasis was on digitizing legacy documents. Conversely, nowadays most documents come directly in digital format, and thus new techniques must be developed. A famous approach proposed in the literature for layout analysis is RLSA, suitable for scanned black-and-white images and based on the application of Run Length Smoothing and the AND logical operator. A recent variant is based on the application of the OR operator, for which reason it has been called RLSO. It exploits a bottom-up approach that proved able to handle even non-Manhattan layouts, on both scanned and natively digital documents. Like RLSA, it is based on the definition of thresholds for the smoothing operator, but its different approach requires criteria different from those used in RLSA to define proper values. Since this is a hard and unnatural task for a user, even an expert one, this paper proposes a technique to automatically define such thresholds for each single document, based on the distribution of spacing therein. Application to selected samples of documents, chosen to cover a significant landscape of real cases, revealed that the approach is satisfactory for documents characterized by the use of a uniform text font size. It can also provide a useful basis for handling more complex cases.
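The idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (white_runs, smooth_row, rlso, spacing_histogram, pick_threshold) are hypothetical, the image is assumed to be a list of rows of 0/1 pixels (1 = foreground), only background runs enclosed between two foreground pixels are assumed to be smoothed, and the threshold rule (take the most frequent run length) is merely a placeholder, since the abstract does not state the actual selection criterion.

```python
from collections import Counter


def white_runs(row):
    """Lengths of background (0) runs enclosed between two foreground (1) pixels."""
    runs, count, seen_fg = [], 0, False
    for px in row:
        if px:
            if seen_fg and count:
                runs.append(count)
            seen_fg, count = True, 0
        else:
            count += 1
    return runs


def smooth_row(row, threshold):
    """Run length smoothing: fill enclosed background runs no longer than threshold."""
    out = list(row)
    last_fg = None
    for i, px in enumerate(row):
        if px:
            if last_fg is not None and 0 < i - last_fg - 1 <= threshold:
                for j in range(last_fg + 1, i):
                    out[j] = 1
            last_fg = i
    return out


def rlso(image, th_h, th_v):
    """Smooth rows and columns separately, then combine the results with OR."""
    horiz = [smooth_row(r, th_h) for r in image]
    cols = [smooth_row(c, th_v) for c in zip(*image)]
    vert = [list(r) for r in zip(*cols)]
    return [[h | v for h, v in zip(hr, vr)] for hr, vr in zip(horiz, vert)]


def spacing_histogram(rows):
    """Histogram of enclosed background run lengths over the given rows."""
    return Counter(run for row in rows for run in white_runs(row))


def pick_threshold(hist):
    """Placeholder rule (an assumption): the most frequent run length,
    preferring the shorter length on ties; 0 if no spacing was found."""
    if not hist:
        return 0
    return max(hist, key=lambda length: (hist[length], -length))
```

For horizontal spacing, spacing_histogram would be called on the image itself; for vertical spacing, on its transpose, list(zip(*image)). The two resulting thresholds are then passed to rlso as th_h and th_v, so that each document gets its own values rather than user-supplied ones.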
Apulian author
All authors
-
ESPOSITO F.;FERILLI S.;BASILE T.M.
Volume/Journal title
Not available
Year of publication
2010
ISSN
Not available
ISBN
978-1-60558-773-8
Number of WoS citations
No citations
Last citation update
Not available
Number of Scopus citations
6
Last citation update
Not available
ERC sectors
Not available
ASJC codes
Not available