Apulia Research Gate Publications

Extracting General Lists from Web Documents: A Hybrid Approach

Abstract

The problem of extracting structured data (e.g. lists, record sets, tables) from the Web has traditionally been approached by taking into account either the underlying markup structure of a Web page or its visual structure. However, empirical results show that considering the HTML structure and the visual cues of a Web page independently does not generalize well. We propose a new hybrid method to extract general lists from the Web. It employs both general assumptions about the visual rendering of lists and the structural representation of the items they contain. We show that our method significantly outperforms existing methods across a varied Web corpus.

Apulian author

Malerba Donato

All authors

MALERBA D.

Volume title/Journal

Not available

Year of publication

2011

ISSN

Not available

ISBN

978-3-642-21821-7

Number of WoS citations

No citations

Last citation update

Not available

Number of Scopus citations

Not available

Last citation update

Not available

ERC sectors

Not available

ASJC codes

Not available