Teresa Maria Basile
Role: Researcher
Organization: Università degli Studi di Bari Aldo Moro
Department: DIPARTIMENTO INTERATENEO DI FISICA
Scientific Area: AREA 09 - Industrial and Information Engineering
Scientific Disciplinary Sector: ING-INF/05 - Information Processing Systems
ERC Sector, 1st level: Not available
ERC Sector, 2nd level: Not available
ERC Sector, 3rd level: Not available
Information Retrieval in large digital document repositories is at once a hard and a crucial task. While the primary type of information available in documents is usually text, images play a very important role because they pictorially describe the concepts dealt with in the document. Unfortunately, the semantic gap separating such visual content from the underlying meaning is very wide. Additionally, image processing techniques are usually very demanding in terms of computational resources. Hence, only recently has the area of Content-Based Image Retrieval gained wider attention. In this paper we describe a new technique to identify known objects in a picture, based on a comparison of their shapes to known models. The comparison works by progressive approximations to save computational resources, and relies on novel algorithmic and representational solutions to improve the preliminary shape extraction.
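The progressive-approximation idea can be illustrated with a small sketch, assuming OpenCV 4 (the paper's actual algorithm and shape representation are not reproduced here): a contour is compared against a stored model at progressively finer polygonal approximations, so clearly dissimilar models are rejected before any fine-grained comparison is reached.

```python
# Illustrative coarse-to-fine shape matching sketch (not the paper's exact
# method): reject cheaply at coarse resolution, refine only when promising.
import cv2

def extract_contour(gray_image):
    """Binarize the image and return its largest external contour."""
    _, binary = cv2.threshold(gray_image, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    return max(contours, key=cv2.contourArea)

def progressive_match(contour, model_contour, threshold=0.1):
    """Compare shapes at coarser approximations first; stop early on mismatch."""
    perimeter = cv2.arcLength(contour, True)
    for epsilon_ratio in (0.05, 0.02, 0.005):        # coarse -> fine
        approx = cv2.approxPolyDP(contour, epsilon_ratio * perimeter, True)
        score = cv2.matchShapes(approx, model_contour,
                                cv2.CONTOURS_MATCH_I1, 0.0)
        if score > threshold:                        # clearly different: reject
            return False
    return True                                      # survived all refinement levels
```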
Document layout analysis is crucial in the automatic document processing workflow, because its outcome affects all subsequent processing steps. A first problem concerns the possibility of dealing not only with documents having a simple layout, but with so-called non-Manhattan layout documents as well. Another problem is that most available techniques can be applied only to scanned documents, because the emphasis in previous decades was on the digitization of legacy documents. Conversely, nowadays most documents come directly in digital format, and thus new techniques must be developed. A famous approach proposed in the literature for layout analysis is the RLSA, suitable for scanned black-and-white images and based on the application of Run Length Smoothing and the logical AND operator. A recent variant thereof is based on the application of the OR operator, for which reason it has been called RLSO. It exploits a bottom-up approach that proved able to handle even non-Manhattan layouts, on both scanned and natively digital documents. Like RLSA, it is based on the definition of thresholds for the smoothing operator, but the different approach requires criteria for defining proper values different from those that work for RLSA. Since this is a hard and unnatural task even for an expert user, this paper proposes a technique to automatically define such thresholds for each single document, based on the distribution of spacing therein. Application to selected samples of documents, aimed at covering a significant landscape of real cases, revealed that the approach is satisfactory for documents characterized by a uniform text font size, and can provide a useful basis for handling more complex cases as well.
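A minimal sketch of how a per-document threshold could be derived from the spacing distribution follows (illustrative only; the paper's actual criterion is not reproduced). It shows the horizontal pass only: full RLSO also smooths vertically and combines the passes with the logical OR. The factor of 2 over the dominant gap is an assumption of this sketch.

```python
# Sketch: estimate a smoothing threshold from the white-run-length histogram,
# then fill white runs shorter than it (ink = 1, background = 0).
import numpy as np

def white_run_lengths(binary_row):
    """Lengths of consecutive background (0) runs in one image row."""
    runs, count = [], 0
    for pixel in binary_row:
        if pixel == 0:
            count += 1
        elif count:
            runs.append(count)
            count = 0
    if count:
        runs.append(count)
    return runs

def estimate_threshold(binary_image):
    """Pick a threshold just above the dominant (intra-word) gap."""
    all_runs = [r for row in binary_image for r in white_run_lengths(row)]
    hist = np.bincount(all_runs)
    dominant_gap = int(np.argmax(hist[1:])) + 1    # most frequent white run
    return 2 * dominant_gap                        # assumed separation factor

def horizontal_smooth(binary_image, threshold):
    """RLSO-style smoothing: fill white runs shorter than the threshold."""
    smoothed = binary_image.copy()
    for row in smoothed:
        start = None
        for i, pixel in enumerate(row):
            if pixel == 0 and start is None:
                start = i
            elif pixel == 1 and start is not None:
                if i - start < threshold:
                    row[start:i] = 1
                start = None
    return smoothed
```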
Automatic processing of text documents requires techniques that can go beyond the lexical level and handle the semantics underlying natural language sentences. Support for such techniques can be provided by taxonomies that connect terms to the underlying concepts, and concepts to each other according to different kinds of relationships. An outstanding example of such resources is WordNet. On the other hand, whenever automatic inferences are to be made on a given domain, a generalization technique, with corresponding operational procedures, is needed. This paper proposes a generalization technique for taxonomic information and applies it to WordNet, providing examples that show its behavior to be sensible and effective.
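As a concrete illustration, the most natural generalization primitive WordNet offers is the lowest common hypernym of two concepts; a sketch using NLTK follows (the operator actually proposed in the paper may well differ from this one).

```python
# Sketch: generalize two terms to the most specific WordNet concept
# subsuming both of their (first) senses.
from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')

def generalize(term_a, term_b):
    """Return the most specific common ancestor of the two terms' first senses."""
    synset_a = wn.synsets(term_a)[0]
    synset_b = wn.synsets(term_b)[0]
    common = synset_a.lowest_common_hypernyms(synset_b)
    return common[0] if common else None

# Example: generalize('dog', 'cat') -> Synset('carnivore.n.01'),
# a sensible common concept for both terms.
```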
The purpose of this paper is to develop a diagnostic tool that can analyze light microscope images of human oocytes and derive a description of the oocyte cytoplasm that is useful for quality assessment in assisted insemination. The proposed approach includes three main phases: 1) segmentation; 2) feature extraction; and 3) clustering. In the segmentation phase, a region of interest inside the cytoplasm is extracted through morphological operators and the Hough transform. In the second phase, the regions resulting from segmentation are processed through a multiresolution texture analysis to extract a set of features that describe different levels of cytoplasm granularity; to this aim, we evaluate some statistics in the Haar wavelet transform domain. Finally, the extracted features are used to cluster oocytes according to different levels of granularity, which is achieved through fuzzy clustering. Experimental results on a collection of microscope images of oocytes are reported to show the effectiveness of the proposed approach. In addition, a comparison with alternative methods for feature extraction and clustering is performed.
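The wavelet-based texture step can be sketched with PyWavelets; mean magnitude and energy of the Haar detail subbands per decomposition level are used here as example statistics, which may not coincide with those evaluated in the paper.

```python
# Sketch: multiresolution Haar texture descriptors for a cytoplasm region.
import numpy as np
import pywt

def haar_texture_features(region, levels=3):
    """Per-level statistics of the Haar wavelet detail subbands of a region."""
    coeffs = pywt.wavedec2(region.astype(float), 'haar', level=levels)
    features = []
    for detail_level in coeffs[1:]:           # skip the approximation subband
        for subband in detail_level:          # horizontal, vertical, diagonal
            features.append(np.mean(np.abs(subband)))   # mean magnitude
            features.append(np.mean(subband ** 2))      # energy
    return np.array(features)
```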
For many real-world applications it is important to choose the right representation language. While the setting of First-Order Logic (FOL) is the most suitable one to model the multi-relational data of real and complex domains, it raises the question of the computational complexity of the knowledge induction process. A way of tackling the complexity of such real domains, in which many relationships are required to model the objects involved, is to use a method that reformulates a multi-relational learning task into an attribute-value one. In this chapter we present an approximate reasoning method able to keep the complexity of a relational problem low by using a stochastic inference procedure. The complexity of the relational language is decreased by means of a propositionalization technique, while the NP-completeness of deduction is tackled using an approximate query evaluation. The proposed approximate reasoning technique has been used to solve the problem of relational rule induction as well as the task of relational clustering. An anytime algorithm, implemented by a population-based method able to efficiently extract knowledge from relational data, has been used for the induction, while the clustering task, both unsupervised and supervised, has been solved using the Partitioning Around Medoids (PAM) clustering algorithm. The validity of the proposed techniques has been proven through an empirical evaluation on real-world datasets.
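The PAM step can be sketched over a precomputed dissimilarity matrix, which fits the relational setting where pairwise (dis)similarities between first-order descriptions are computed up front. Initialization and the swap strategy below are simplified with respect to the original algorithm.

```python
# Sketch: k-medoids clustering (simplified PAM) over a dissimilarity matrix.
import numpy as np

def pam(dissimilarity, k, max_iter=100, seed=0):
    """Cluster n objects into k groups by iteratively updating medoids."""
    rng = np.random.default_rng(seed)
    n = dissimilarity.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(max_iter):
        labels = np.argmin(dissimilarity[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):                    # best medoid within each cluster
            members = np.where(labels == c)[0]
            if members.size == 0:
                continue
            costs = dissimilarity[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break                              # converged
        medoids = new_medoids
    return medoids, labels
```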
The current spread of digital documents has raised the need for effective content-based retrieval techniques. Since manual indexing is infeasible and subjective, automatic techniques are the obvious solution. In particular, the ability to properly identify and understand a document’s structure is crucial, in order to focus on the most significant components only. At the geometrical level, this task is known as Layout Analysis and has been thoroughly studied in the literature. On suitable descriptions of the document layout, Machine Learning techniques can be applied to automatically infer models of classes of documents and of their components. Indeed, organizing documents on the grounds of the knowledge they contain is fundamental for being able to access them correctly according to the user’s needs. Thus, the quality of the layout analysis outcome conditions the subsequent understanding steps. Unfortunately, due to the variety of document styles and formats, the automatically found structure often needs to be manually adjusted. We propose the application of supervised Machine Learning techniques to infer correction rules to be applied to forthcoming documents. A first-order logic representation is suggested, because corrections often depend on the relationships of the wrong components with the surrounding ones. Moreover, as a consequence of the continuous flow of documents, the learned models often need to be updated and refined, which calls for incremental abilities. The proposed technique, embedded in a prototypical version of the document processing system DOMINUS and using the incremental first-order logic learner INTHELEX, revealed good performance in real-world experiments.
The coalition structure generation problem represents an active research area in multi-agent systems. A coalition structure is defined as a partition of the agents involved in a system into disjoint coalitions. The problem of finding the optimal coalition structure is NP-complete. In order to find the optimal solution of a combinatorial optimization problem it is theoretically possible to enumerate all the solutions and evaluate each of them, but this approach is infeasible since the number of solutions often grows exponentially with the size of the problem. In this paper we present a greedy randomized adaptive search procedure (GRASP) to efficiently search the space of coalition structures in order to find an optimal one.
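A compact sketch of the GRASP scheme for this problem follows (illustrative; the paper's construction heuristic and neighborhood are not reproduced): each iteration builds a coalition structure by randomized greedy insertion of agents, then improves it by moving single agents between coalitions. The characteristic function `value` is problem-specific and assumed given.

```python
# Sketch: GRASP for coalition structure generation.
import random

def grasp_csg(agents, value, iterations=100, rcl_size=3, seed=0):
    """value(frozenset) -> worth of a coalition; total worth is maximized."""
    rng = random.Random(seed)
    best, best_val = None, float('-inf')
    for _ in range(iterations):
        structure = local_search(construct(agents, value, rng, rcl_size), value)
        total = sum(value(frozenset(c)) for c in structure)
        if total > best_val:
            best, best_val = structure, total
    return best, best_val

def construct(agents, value, rng, rcl_size):
    """Randomized greedy construction: each agent joins one of the best
    insertion points, drawn from a restricted candidate list (RCL)."""
    structure = []
    for agent in agents:
        candidates = [(insertion_gain(structure, i, agent, value), i)
                      for i in range(len(structure) + 1)]  # last = new coalition
        candidates.sort(reverse=True)
        _, choice = rng.choice(candidates[:rcl_size])
        if choice == len(structure):
            structure.append({agent})
        else:
            structure[choice].add(agent)
    return structure

def insertion_gain(structure, i, agent, value):
    """Marginal value of adding `agent` to coalition i (or to a new one)."""
    if i == len(structure):
        return value(frozenset({agent}))
    old = frozenset(structure[i])
    return value(old | {agent}) - value(old)

def local_search(structure, value):
    """Hill climbing: move single agents between coalitions while it helps."""
    improved = True
    while improved:
        improved = False
        for src in structure:
            for agent in list(src):
                for dst in structure:
                    if dst is src or not src:
                        continue
                    before = (value(frozenset(src))
                              + (value(frozenset(dst)) if dst else 0.0))
                    src.discard(agent)
                    dst.add(agent)
                    after = ((value(frozenset(src)) if src else 0.0)
                             + value(frozenset(dst)))
                    if after > before:
                        improved = True
                        break                  # agent has moved; try next agent
                    dst.discard(agent)         # undo the move
                    src.add(agent)
        structure[:] = [c for c in structure if c]   # drop emptied coalitions
    return structure
```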
The main goal of the project was the development of a District Service Center for the SMEs of the Textile and Clothing sector. In particular, it investigated the introduction of innovative technologies to improve the process/product innovation of the sector. In this direction, the research unit's proposal consisted of introducing document processing and indexing techniques for a variety of document formats (in both structure and content), with the aim of improving the exchange of data among companies and the semantic content-based retrieval for the companies’ real needs.
The activities of most organizations, and of universities in particular, involve the need to store, process and manage collections of different kinds of documents. Examples that require advanced solutions to such issues include the management of libraries, scientific conferences and research projects. DOMINUSplus is an open project born with the aim of harmonizing the Artificial Intelligence approaches developed at the LACAM laboratory with the research on Digital Libraries in a general software backbone for document processing and management, extensible with ad-hoc solutions for specific problems and contexts (such as universities).
In Artificial Intelligence, Coalition Structure Generation (CSG) refers to those complex cooperative problems that require finding an optimal partition, maximizing a social welfare, of the set of entities involved in a system. The solution of the CSG problem finds applications in many fields such as Machine Learning (set covering machines, clustering), Data Mining (decision trees, discretization), Graph Theory, Natural Language Processing (aggregation), the Semantic Web (service composition), and Bioinformatics. The problem of finding the optimal coalition structure is NP-complete. In this paper we present a greedy randomized adaptive search procedure (GRASP) with path-relinking to efficiently search the space of coalition structures. Experiments and comparisons with other algorithms prove the validity of the proposed method in solving this hard combinatorial problem.
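The path-relinking step can be sketched on top of a GRASP local optimum (see the GRASP sketch above): agents are reassigned one at a time toward an elite "guiding" structure, keeping the best structure met along the path. Structures are encoded here as label vectors (labels[i] = coalition of agent i); aligning coalition labels between the two encodings is glossed over in this sketch.

```python
# Sketch: path-relinking between two coalition structures.
def path_relinking(current, guiding, total_value):
    """total_value(labels) -> worth of the coalition structure; maximized."""
    labels = list(current)
    best, best_val = list(labels), total_value(labels)
    moves = [i for i in range(len(labels)) if labels[i] != guiding[i]]
    while moves:
        # Greedily pick the reassignment yielding the highest resulting value.
        def value_after(i):
            trial = list(labels)
            trial[i] = guiding[i]
            return total_value(trial)
        i = max(moves, key=value_after)
        labels[i] = guiding[i]
        moves.remove(i)
        v = total_value(labels)
        if v > best_val:                 # keep the best intermediate structure
            best, best_val = list(labels), v
    return best, best_val
```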
Users of Digital Libraries require more intelligent interaction functionalities to satisfy their needs. In this perspective, the most important features are flexibility and the capability of adapting these functionalities to specific users. However, the main problem of current systems is their inability to support the different needs of individual users, due both to their inability to identify those needs and, more importantly, to insufficient mapping of those needs to the available resources/services. The approaches considered in this paper to tackle such problems concern the use of Machine Learning techniques to adapt the set of user stereotypes, with the aim of modelling user interests and behaviour in order to provide the most suitable service. A purposely designed simulation scenario was exploited to show the applicability of the proposal.
Information retrieval effectiveness has become a crucial issue with the enormous growth of available digital documents and the spread of Digital Libraries. Search and retrieval are mostly carried out on the textual content of documents, and traditionally only at the lexical level. However, pure term-based queries are very limited, because most of the information in natural language is carried by the syntactic and logical structure of sentences. To take such structure into account, powerful relational languages, such as first-order logic, must be exploited. However, the constituents of logic formulæ are typically uninterpreted (they are considered as purely syntactic entities), whereas words in natural language express underlying concepts that involve several implicit relationships, such as those expressed in a taxonomy. This problem can be tackled by providing the logic interpreter with suitable taxonomic knowledge. This work proposes the exploitation of a similarity framework that includes both structural and taxonomic features to assess the similarity between First-Order Logic (Horn clause) descriptions of texts in natural language, in order to support more sophisticated information retrieval approaches than simple term-based queries. Evaluation on a sample case shows the viability of the solution, although further work is still needed to study the framework more deeply and to refine it further.
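A minimal sketch of the underlying idea, assuming NLTK's WordNet interface: strict syntactic matching of predicate arguments is softened by a taxonomic similarity between the words they denote. The averaging scheme and the structural component of the actual framework are placeholders of this sketch, not the paper's definitions.

```python
# Sketch: taxonomy-aware comparison of predicate arguments.
from nltk.corpus import wordnet as wn

def taxonomic_similarity(word_a, word_b):
    """WordNet path similarity between first noun senses (0 if unrelated)."""
    sa, sb = wn.synsets(word_a, 'n'), wn.synsets(word_b, 'n')
    if not sa or not sb:
        return 0.0
    return sa[0].path_similarity(sb[0]) or 0.0

def argument_similarity(args_a, args_b):
    """Average word-level similarity over aligned predicate arguments."""
    scores = [1.0 if a == b else taxonomic_similarity(a, b)
              for a, b in zip(args_a, args_b)]
    return sum(scores) / max(len(args_a), len(args_b))
```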
Horn clause Logic is a powerful representation language exploited in Logic Programming as a computer programming framework, and in Inductive Logic Programming as a formalism for expressing examples and learned theories in domains where relations among objects must be expressed to fully capture the relevant information. While the predicates that make up the description language are defined by the knowledge engineer and handled only syntactically by the interpreters, they sometimes express information that can be properly exploited only with reference to suitable background knowledge, in order to capture unexpressed, underlying relationships among the concepts described. This is typical when the representation includes numerical information, such as single values or intervals, for which simple syntactic matching is not sufficient. This work proposes an extension of an existing framework for similarity assessment between First-Order Logic Horn clauses that is able to handle numeric information in the descriptions. The viability of the solution is demonstrated on sample problems.
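One simple way to realize such numeric handling, as an assumption of this sketch rather than the paper's actual measure, is to score intervals by their relative overlap and single values by their relative closeness:

```python
# Sketch: similarity scores for numeric arguments in Horn clause descriptions.
def interval_similarity(a, b):
    """Overlap of intervals a=(lo,hi), b=(lo,hi) relative to their joint span."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    span = max(a[1], b[1]) - min(a[0], b[0])
    if span == 0:                    # two identical point values
        return 1.0
    return max(overlap, 0.0) / span

def numeric_similarity(x, y):
    """Closeness of two single values, scaled by their magnitude (in [0, 1])."""
    if x == y:
        return 1.0
    return 1.0 - abs(x - y) / (abs(x) + abs(y))
```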
One of the most appreciated functionalities of computers nowadays is their serving as a means of communication and information sharing among people. With the spread of the Internet, several complex interactions have taken place among people, giving rise to huge Information Networks based on these interactions. Social Networks potentially represent an invaluable source of information that can be exploited for scientific and commercial purposes. On the other hand, due to their distinguishing peculiarities (huge size and an inherently relational setting) with respect to all previous information extraction tasks faced in Computer Science, they require new techniques to gather this information. Social Network Mining (SNM) is the corresponding research area, aimed at extracting information about the network objects and behavior that cannot be obtained from the explicit/implicit description of the objects alone, ignoring their explicit/implicit relationships. Statistical Relational Learning (SRL) is a very promising approach to SNM, since it combines expressive representation formalisms, able to model complex relational networks, with statistical methods able to handle uncertainty about objects and relations. This paper is a survey of some SRL formalisms and techniques adopted to solve some SNM tasks.
Reaching high precision and recall rates in the results of term-based queries on text collections is becoming more and more crucial, as the amount of available documents increases and their quality tends to decrease. In particular, retrieval techniques based on the strict correspondence between terms in the query and terms in the documents miss important and relevant documents in which the terms selected by the authors just happen to be slightly different from those used by the end user issuing the query. Our proposal is to explicitly consider term co-occurrences when building the vector space. Indeed, the presence in a document of terms different from, but related to, those in the query should strengthen the confidence that the document is relevant as well: missing a query term in a document, but finding several terms strictly related to it, should equally support the hypothesis that the document is actually relevant. The computational perspective that embeds such relatedness consists of matrix operations that capture direct or indirect term co-occurrence in the collection. We propose two different approaches to implement this perspective, and run preliminary experiments on a prototypical implementation, suggesting that the technique is potentially profitable.
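The matrix formulation can be sketched with NumPy: given a term-document matrix A, the term-term matrix C = A Aᵀ counts direct co-occurrences (and its powers capture indirect, multi-step ones); spreading query weight through a normalized C lets documents that use related vocabulary still match the query. The normalization below is a simplification, not the paper's exact scheme.

```python
# Sketch: query expansion through term co-occurrence in the vector space.
import numpy as np

def expand_and_rank(A, query):
    """A: terms x docs binary matrix; query: term weight vector."""
    C = A @ A.T                                   # direct term co-occurrence counts
    C = C / np.maximum(C.sum(axis=1, keepdims=True), 1)   # row-normalize
    # C @ C would capture indirect (two-step) co-occurrence as well.
    expanded = query + C @ query                  # add co-occurrence-induced weight
    scores = expanded @ A                         # score each document
    return np.argsort(scores)[::-1]               # documents by decreasing relevance
```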