Automatic Document Layout Analysis through Relational Machine Learning
Abstract
The current spread of digital documents raised the need of effective content-based retrieval techniques. Since manual indexing is infeasible and subjective, automatic techniques are the obvious solution. In particular, the ability of properly identifying and understanding a document’s structure is crucial, in order to focus on the most significant components only. At a geometrical level, this task is known as Layout Analysis, and thoroughly studied in the literature. On suitable descriptions of the document layout, Machine Learning techniques can be applied to automatically infer models of classes of documents and of their components. Indeed, organizing the documents on the grounds of the knowledge they contain is fundamental for being able to correctly access them according to the user’s needs. Thus, the quality of the layout analysis outcome biases the next understanding steps. Unfortunately, due to the variety of document styles and formats, the automatically found structure often needs to be manually adjusted. We propose the application of supervised Machine Learning techniques to infer correction rules to be applied to forthcoming documents. A first-order logic representation is suggested, because corrections often depend on the relationships of the wrong components with the surrounding ones. Moreover, as a consequence of the continuous flow of documents, the learned models often need to be updated and refined, which calls for incremental abilities. The proposed technique, embedded in a prototypical version of the document processing system DOMINUS, using the incremental first-order logic learner INTHELEX, revealed good performance in real-world experiments.
Autore Pugliese
Tutti gli autori
-
ESPOSITO F.;DI MAURO N.;FERILLI S.;BASILE T.M.
Titolo volume/Rivista
Non Disponibile
Anno di pubblicazione
2011
ISSN
1860-949X
ISBN
978-3-642-22912-1
Numero di citazioni Wos
Nessuna citazione
Ultimo Aggiornamento Citazioni
Non Disponibile
Numero di citazioni Scopus
Non Disponibile
Ultimo Aggiornamento Citazioni
Non Disponibile
Settori ERC
Non Disponibile
Codici ASJC
Non Disponibile
Condividi questo sito sui social