Nicola Di Mauro
Role
Researcher
Organization
Università degli Studi di Bari Aldo Moro
Department
Dipartimento di Informatica (Department of Computer Science)
Scientific Area
Area 01 - Mathematical and Computer Sciences
Scientific Disciplinary Sector
INF/01 - Computer Science
ERC Sector, Level 1
Not Available
ERC Sector, Level 2
Not Available
ERC Sector, Level 3
Not Available
Automatic processing of text documents requires techniques that go beyond the lexical level and can handle the semantics underlying natural language sentences. Such techniques can be supported by taxonomies that connect terms to the underlying concepts, and concepts to each other according to different kinds of relationships; an outstanding example of such a resource is WordNet. On the other hand, whenever automatic inferences are to be made on a given domain, a generalization technique, and corresponding operational procedures, are needed. This paper proposes a generalization technique for taxonomic information and applies it to WordNet, providing examples that show its behavior to be sensible and effective.
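A minimal sketch of the idea of taxonomy-based generalization, using WordNet through the NLTK interface (not the paper's own operator): two terms are generalized to their lowest common ancestor concept in the hypernym hierarchy.

```python
# Illustrative only: generalize two terms to a shared ancestor concept in
# WordNet's hypernym hierarchy. Requires: nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def generalize(term_a: str, term_b: str):
    """Return the lowest common hypernyms of the first synsets of two terms."""
    syn_a = wn.synsets(term_a)[0]
    syn_b = wn.synsets(term_b)[0]
    return syn_a.lowest_common_hypernyms(syn_b)

print(generalize("dog", "cat"))    # e.g. [Synset('carnivore.n.01')]
print(generalize("dog", "table"))  # a far more general common concept
```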
In this paper we propose to apply the Information Bottleneck (IB) approach to the sub-class of Statistical Relational Learning (SRL) languages that are reducible to Bayesian networks. When the resulting networks involve hidden variables, learning these languages requires techniques for learning from incomplete data, such as the Expectation Maximization (EM) algorithm. Recently, the IB approach was shown to be able to avoid some of the local maxima in which EM can get trapped when learning with hidden variables. Here we present Relational Information Bottleneck (RIB), an algorithm that learns the parameters of SRL languages reducible to Bayesian networks. In particular, we present the specialization of RIB to a language belonging to the family of languages based on the distribution semantics, Logic Programs with Annotated Disjunctions (LPADs). This language is prototypical for the family, and its equivalent Bayesian networks contain hidden variables. RIB is evaluated on the IMDB, Cora and artificial datasets and compared with LeProbLog, EM, Alchemy and PRISM. The experimental results show that RIB performs well especially when some logical atoms are unobserved, and that it is particularly suitable when learning from interpretations that share the same Herbrand base.
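For reference, this is the classical IB objective that RIB adapts to the relational setting: a compressed representation $T$ of the input $X$ is sought that remains informative about the target $Y$,

```latex
\min_{p(t \mid x)} \; I(X;T) \;-\; \beta\, I(T;Y)
```

where $I(\cdot\,;\cdot)$ denotes mutual information and $\beta$ trades off compression against preserved relevant information.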
The recent explosion in Internet usage and the growing amount of digital images caused by the increasingly ubiquitous presence of digital cameras have created a demand for effective and flexible techniques for automatic image retrieval. As the volume of the data increases, memory and processing requirements need to grow at the same rapid pace, which is often prohibitively expensive. Image collections on this scale make even the most common and simple image processing and machine learning tasks non-trivial. In this paper we present a method to reduce the computational complexity of a widely known method for image indexing and retrieval based on a second-order statistical measure. The aim of the paper is twofold: Q1) is it possible to efficiently extract an approximate distribution of the image features with a resulting low error? Q2) how does the resulting approximate distribution affect the similarity-based accuracy? In particular, we propose a sampling method to approximate the distribution of correlograms, adopting a Monte Carlo approach that computes the distribution on a subset of pixels uniformly sampled from the original image. A further variant also samples the neighborhood of each pixel. Validation on the Caltech 101 dataset showed that the proposed approximate distribution, obtained with a considerable reduction in computational time, has a very low error with respect to the exact distribution. Results obtained in the second experiment, on a similarity-based ranking task, are encouraging.
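A minimal sketch of the Monte Carlo idea described above, with illustrative names: estimating a color autocorrelogram on a uniform random subset of pixels (here counting matches within L-infinity distance d, a simplification of the exact-distance definition).

```python
import numpy as np

def approx_autocorrelogram(img, n_colors=8, d=1, n_samples=1000, seed=None):
    """img: 2-D array of quantized color indices in [0, n_colors)."""
    rng = np.random.default_rng(seed)
    h, w = img.shape
    counts = np.zeros(n_colors)
    totals = np.zeros(n_colors)
    ys = rng.integers(0, h, n_samples)   # uniformly sampled pixel coordinates
    xs = rng.integers(0, w, n_samples)
    for y, x in zip(ys, xs):
        c = img[y, x]
        # neighborhood within L-inf distance d, clipped at the image border
        window = img[max(0, y - d):y + d + 1, max(0, x - d):x + d + 1]
        totals[c] += window.size - 1
        counts[c] += np.count_nonzero(window == c) - 1  # exclude the pixel itself
    return np.divide(counts, totals, out=np.zeros(n_colors), where=totals > 0)
```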
For many real-world applications it is important to choose the right representation language. While the setting of First Order Logic (FOL) is the most suitable one to model the multi-relational data of real and complex domains, it raises the issue of the computational complexity of the knowledge induction process. A way of tackling the complexity of such real domains, in which many relationships are required to model the objects involved, is to use a method that reformulates a multi-relational learning task into an attribute-value one. In this chapter we present an approximate reasoning method able to keep the complexity of a relational problem low by using a stochastic inference procedure. The complexity of the relational language is decreased by means of a propositionalization technique, while the NP-completeness of the deduction is tackled using an approximate query evaluation. The proposed approximate reasoning technique has been used to solve the problem of relational rule induction as well as the task of relational clustering. An anytime algorithm, implemented by a population-based method, has been used for the induction, able to efficiently extract knowledge from relational data, while the clustering task, both unsupervised and supervised, has been solved using a Partitioning Around Medoids (PAM) clustering algorithm. The validity of the proposed techniques has been demonstrated through an empirical evaluation on real-world datasets.
The current spread of digital documents has raised the need for effective content-based retrieval techniques. Since manual indexing is infeasible and subjective, automatic techniques are the obvious solution. In particular, the ability to properly identify and understand a document's structure is crucial, in order to focus on the most significant components only. At a geometrical level, this task is known as Layout Analysis, and has been thoroughly studied in the literature. On suitable descriptions of the document layout, Machine Learning techniques can be applied to automatically infer models of classes of documents and of their components. Indeed, organizing documents on the grounds of the knowledge they contain is fundamental for accessing them correctly according to the user's needs. Thus, the quality of the layout analysis outcome affects the subsequent understanding steps. Unfortunately, due to the variety of document styles and formats, the automatically found structure often needs to be manually adjusted. We propose the application of supervised Machine Learning techniques to infer correction rules to be applied to forthcoming documents. A first-order logic representation is suggested, because corrections often depend on the relationships of the wrong components with the surrounding ones. Moreover, as a consequence of the continuous flow of documents, the learned models often need to be updated and refined, which calls for incremental abilities. The proposed technique, embedded in a prototypical version of the document processing system DOMINUS and using the incremental first-order logic learner INTHELEX, showed good performance in real-world experiments.
Probabilistic logic programming can be used to model domains with complex and uncertain relationships among entities. While the problem of learning the parameters of such programs has been considered by various authors, the problem of learning their structure is yet to be explored in depth. In this work we present an approximate search method based on a one-player game approach, called LEMUR. It casts the problem of learning the structure of a probabilistic logic program as a multi-armed bandit problem, relying on the Monte Carlo tree search algorithm UCT, which combines the precision of tree search with the generality of random sampling. LEMUR works by modifying the UCT algorithm in a fashion similar to FUSE, considering a finite unknown horizon and dealing with the problem of a huge branching factor. The proposed system has been tested on various real-world datasets and has shown good performance with respect to other state-of-the-art statistical relational learning approaches in terms of classification abilities.
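A minimal sketch of the UCB1 selection rule at the heart of UCT; in a LEMUR-style search each candidate revision of the current program plays the role of a bandit arm (names and structure are illustrative, not the system's implementation).

```python
import math

def uct_select(children, c=math.sqrt(2)):
    """children: list of (mean_reward, visits) pairs; returns the chosen index."""
    total_visits = sum(n for _, n in children)
    def ucb(stats):
        mean, n = stats
        if n == 0:
            return float("inf")  # force exploration of unvisited arms
        # exploitation term + exploration bonus shrinking with visits
        return mean + c * math.sqrt(math.log(total_visits) / n)
    return max(range(len(children)), key=lambda i: ucb(children[i]))
```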
Statistical Relational Learning (SRL) is a growing field in Machine Learning that aims at integrating logic-based learning approaches with probabilistic graphical models. Markov Logic Networks (MLNs) are one of the state-of-the-art SRL models; they combine first-order logic and Markov networks (MNs) by attaching weights to first-order formulas and viewing these as templates for features of MNs. Learning models in SRL consists of learning the structure (logical clauses in MLNs) and the parameters (weights for each clause in MLNs). Structure learning of MLNs is performed by maximizing a likelihood function (or a function thereof) over relational databases, and MLNs have been successfully applied to problems in relational and uncertain domains. However, most complex domains are characterized by incomplete data. Until now, SRL models have mostly used Expectation-Maximization (EM) for learning statistical parameters under missing values. Multistrategic learning in the relational setting has been a successful approach to dealing with complex problems where multiple inference mechanisms can help solve different subproblems. Abduction is an inference strategy that has proven useful for completing missing values in observations. In this paper we propose two frameworks for integrating abduction in SRL models. The first tightly integrates logical abduction with structure and parameter learning of MLNs in a single step. During structure search guided by conditional likelihood, clause evaluation is performed by first trying to logically abduce missing values in the data and then learning optimal pseudo-likelihood parameters on the completed data. The second approach integrates abduction with the Structural EM of [17], performing logical abductive inference in the E-step and then maximizing the parameters in the M-step.
The coalition structure generation problem represents an active research area in multi-agent systems. A coalition structure is defined as a partition of the agents involved in a system into disjoint coalitions. The problem of finding the optimal coalition structure is NP-complete. In order to find the optimal solution of a combinatorial optimization problem it is theoretically possible to enumerate all solutions and evaluate each one, but this approach is infeasible since the number of solutions often grows exponentially with the size of the problem. In this paper we present a greedy randomized adaptive search procedure (GRASP) to efficiently search the space of coalition structures in order to find an optimal one.
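A minimal sketch of a GRASP construction phase for coalition structure generation, assuming a characteristic function v over coalitions is given as input; the local search phase is omitted, and all names are illustrative rather than the paper's implementation.

```python
import random

def grasp_construct(agents, v, alpha=0.3, rng=random):
    """agents: iterable of hashables; v: function frozenset -> value."""
    structure = []  # list of coalitions (sets of agents)
    for a in agents:
        # candidate moves: join an existing coalition, or open a singleton
        moves = [(v(frozenset(c | {a})) - v(frozenset(c)), i)
                 for i, c in enumerate(structure)]
        moves.append((v(frozenset({a})), None))
        best = max(g for g, _ in moves)
        worst = min(g for g, _ in moves)
        threshold = best - alpha * (best - worst)
        rcl = [m for m in moves if m[0] >= threshold]  # restricted candidate list
        _, idx = rng.choice(rcl)                       # randomized greedy choice
        if idx is None:
            structure.append({a})
        else:
            structure[idx].add(a)
    return structure
```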
The main goal of the project was the development of a District Service Center for the SMEs of the textile and clothing sector. In particular, it investigated the introduction of innovative technologies to improve the process/product innovation of the sector. In this direction, the research unit's proposal consisted of introducing document processing and indexing techniques for a variety of document formats (in both structure and content), with the aim of improving the exchange of data among companies and enabling semantic content-based retrieval that meets the companies' real needs.
In Artificial Intelligence, Coalition Structure Generation (CSG) refers to those complex cooperative problems that require finding an optimal partition (one maximizing a social welfare) of a set of entities involved in a system. The solution of the CSG problem finds applications in many fields, such as Machine Learning (set covering machines, clustering), Data Mining (decision trees, discretization), Graph Theory, Natural Language Processing (aggregation), the Semantic Web (service composition), and Bioinformatics. The problem of finding the optimal coalition structure is NP-complete. In this paper we present a greedy randomized adaptive search procedure (GRASP) with path-relinking to efficiently search the space of coalition structures. Experiments and comparisons to other algorithms demonstrate the validity of the proposed method in solving this hard combinatorial problem.
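A minimal sketch of the path-relinking intensification step added to GRASP here: walk from one elite coalition structure toward another, reassigning one agent at a time and keeping the best intermediate solution (the dict representation and names are illustrative assumptions).

```python
def path_relink(source, target, score):
    """source, target: dicts agent -> coalition label; score: dict -> value."""
    current = dict(source)
    best, best_val = dict(current), score(current)
    for agent in current:
        if current[agent] != target[agent]:
            current[agent] = target[agent]      # one relinking move
            val = score(current)
            if val > best_val:                  # keep the best structure seen
                best, best_val = dict(current), val
    return best, best_val
```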
With the increasing amount of information in electronic form, the fields of Machine Learning (ML) and Data Mining (DM) continue to grow by providing new advances in theory, applications and systems. The aim of this paper is to consider some recent theoretical aspects and approaches to ML and DM, with an emphasis on Italian research.
The rising interest in tractable Probabilistic Graphical Models is due to the guarantees on inference feasibility they provide. Among them, Cutset Networks (CNets) have recently been introduced as models embedding Pearl's cutset conditioning algorithm in the form of weighted probabilistic model trees with tree-structured models as leaves. Learning the structure of CNets has been tackled as a greedy search leveraging heuristics from decision tree learning. Although efficient, the learned models are far from accurate in terms of likelihood. Here, we exploit the decomposable score of CNets to learn their structure and parameters by directly maximizing the likelihood, also considering the BIC criterion and informative priors on smoothing parameters. In addition, we show how to create mixtures of CNets by adopting a well-known bagging method from the discriminative framework as an effective and cheap alternative to classical EM. We compare our algorithms against the original variants on a set of standard benchmarks for graphical model structure learning, empirically proving our claims.
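For reference, a minimal sketch of the BIC-penalized log-likelihood commonly used as a decomposable structure score (the exact scoring used for CNets may differ): the data log-likelihood minus a complexity penalty that grows with the number of free parameters.

```python
import math

def bic_score(log_likelihood, n_params, n_samples):
    """Score to maximize: higher likelihood, fewer parameters preferred."""
    return log_likelihood - 0.5 * n_params * math.log(n_samples)
```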
Tables are among the most informative components of documents, because they are exploited to represent data compactly and intuitively, typically for understandability purposes. Two needs arise: identifying and extracting tables from documents, and extracting the data they contain; the latter task involves understanding the table structure. Due to the variability in style, size, and aims of tables, algorithmic approaches to this task can be insufficient, and the exploitation of machine learning systems may represent an effective solution. This paper proposes the exploitation of a first-order logic representation, able to capture the complex spatial relationships involved in a table structure, and of a learning system that can mix the power of this representation with the flexibility of statistical approaches. The encouraging results obtained suggest further investigation and refinement of the proposal.
The need to deal with the inherent uncertainty in real-world relational or networked data leads to the proposal of new probabilistic models, such as probabilistic graphs. Every edge in a probabilistic graph is associated with a probability whose value represents the likelihood of its existence, or the strength of the relation between the entities it connects. The aim of this paper is to propose two machine learning techniques for the link classification problem in relational data that exploit the probabilistic graph representation. Both methods use a language-constrained reachability procedure to infer the probability of possible hidden relationships that may exist between two nodes in a probabilistic graph. Each hidden relationship between two nodes may be viewed as a feature (or a factor), with its corresponding probability as its weight, while an observed relationship is considered a positive instance for its corresponding link label. Given a training set of observed links, the first learning approach uses a propositionalization technique, adopting an L2-regularized Logistic Regression to learn a model able to predict unobserved link labels. Since in some cases the edge probabilities may not be known in advance, or may not be precisely defined for a classification task, the second proposed approach exploits the inference method and uses a mean squared error technique to learn the edge probabilities. Both methods have been evaluated on real-world datasets, and the corresponding results proved their validity.
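A minimal sketch of approximating reachability probabilities in a probabilistic graph by sampling possible worlds; the papers' method constrains paths with an edge-label language, which is omitted here, and all names are illustrative.

```python
import random
from collections import deque

def reachability_prob(edges, src, dst, max_len=3, n_samples=1000, rng=random):
    """edges: dict (u, v) -> probability of edge existence (undirected here)."""
    hits = 0
    for _ in range(n_samples):
        # sample one possible world by flipping a coin per edge
        world = {}
        for (u, v), p in edges.items():
            if rng.random() < p:
                world.setdefault(u, set()).add(v)
                world.setdefault(v, set()).add(u)
        # breadth-first search bounded by max_len hops
        frontier, seen = deque([(src, 0)]), {src}
        while frontier:
            node, depth = frontier.popleft()
            if node == dst:
                hits += 1
                break
            if depth < max_len:
                for nxt in world.get(node, ()):
                    if nxt not in seen:
                        seen.add(nxt)
                        frontier.append((nxt, depth + 1))
    return hits / n_samples
```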
Information retrieval effectiveness has become a crucial issue with the enormous growth of available digital documents and the spread of Digital Libraries. Search and retrieval are mostly carried out on the textual content of documents, and traditionally only at the lexical level. However, pure term-based queries are very limited, because most of the information in natural language is carried by the syntactic and logic structure of sentences. To take such structure into account, powerful relational languages, such as first-order logic, must be exploited. However, logic formulæ constituents are typically uninterpreted (they are considered as purely syntactic entities), whereas words in natural language express underlying concepts that involve several implicit relationships, such as those expressed in a taxonomy. This problem can be tackled by providing the logic interpreter with suitable taxonomic knowledge. This work proposes the exploitation of a similarity framework that includes both structural and taxonomic features to assess the similarity between First-Order Logic (Horn clause) descriptions of texts in natural language, in order to support more sophisticated information retrieval approaches than simple term-based queries. Evaluation on a sample case shows the viability of the solution, although further work is needed to study and refine the framework.
Horn clause logic is a powerful representation language, exploited in Logic Programming as a computer programming framework and in Inductive Logic Programming as a formalism for expressing examples and learned theories in domains where relations among objects must be expressed to fully capture the relevant information. While the predicates that make up the description language are defined by the knowledge engineer and handled only syntactically by the interpreters, they sometimes express information that can be properly exploited only with reference to suitable background knowledge, in order to capture unexpressed and underlying relationships among the concepts described. This is typical when the representation includes numerical information, such as single values or intervals, for which simple syntactic matching is not sufficient. This work proposes an extension of an existing framework for similarity assessment between First-Order Logic Horn clauses that can handle numeric information in the descriptions. The viability of the solution is demonstrated on sample problems.
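A minimal, purely illustrative sketch of how numeric arguments could be compared when assessing clause similarity; this is a plausible stand-in for such a component, not the paper's actual measure.

```python
def interval_similarity(a, b):
    """a, b: (low, high) tuples; Jaccard-style overlap of the two intervals."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    inter = max(0.0, hi - lo)
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    if union <= 0:                       # both intervals are single points
        return 1.0 if a == b else 0.0
    return inter / union

def value_similarity(x, y, scale=1.0):
    """Closeness of two single values, decaying with their distance."""
    return 1.0 / (1.0 + abs(x - y) / scale)
```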
In multiagent adversarial environments, the adversary consists of a team of opponents that may interfere with the achievement of goals. In this domain, agents must be able to quickly adapt to the environment and infer knowledge from other agents' deportment in order to identify the future behaviors of opponents. We present a relational model to characterize adversary teams based on their behavior. A team's deportment is represented by a set of relational sequences of basic actions extracted from their observed behaviors. On this basis, we present a similarity measure to classify the teams' behavior. The sequence extraction and classification are implemented in the domain of simulated robotic soccer, and experimental results are presented.
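A minimal sketch of one simple way to compare two relational action sequences, via a normalized longest common subsequence; the paper's similarity measure over relational sequences is richer, and this only conveys the idea of scoring teams by how much observed behavior they share.

```python
def lcs_similarity(seq_a, seq_b):
    """Normalized longest-common-subsequence score in [0, 1]."""
    m, n = len(seq_a), len(seq_b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if seq_a[i] == seq_b[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n] / max(m, n) if max(m, n) else 1.0

# e.g. sequences of ground atoms rendered as strings
a = ["pass(p1,p2)", "dribble(p2)", "shoot(p2)"]
b = ["pass(p1,p2)", "pass(p2,p3)", "shoot(p3)"]
print(lcs_similarity(a, b))  # 0.33...
```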
The need for feasible inference in Probabilistic Graphical Models (PGMs) has led to tractable models like Sum-Product Networks (SPNs). Their high expressive power and their ability to provide exact and tractable inference make them very attractive for several real-world applications, from computer vision to NLP. Recently, great attention around SPNs has focused on structure learning, leading to different algorithms able to learn both the network and its parameters from data. Here, we enhance one of the best structure learners, LearnSPN, aiming to improve both the structural quality of the learned networks and their achieved likelihoods. Our algorithmic variations are able to learn simpler, deeper and more robust networks. These results have been obtained by exploiting some insights into the building process of LearnSPN, by hybridizing the networks with tree-structured models as leaves, and by blending bagging estimations into mixture creation. We prove our claims by empirically evaluating the learned SPNs on several benchmark datasets against other competitive SPN and PGM structure learners.
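A minimal sketch of why SPN inference is tractable: a query is answered by a single bottom-up pass, taking weighted sums at sum nodes and products at product nodes (the dictionary-based structure below is illustrative).

```python
import math

def eval_spn(node, evidence):
    kind = node["type"]
    if kind == "leaf":
        # leaf holds a univariate distribution; here a lookup table P(var=value)
        return node["dist"][evidence[node["var"]]]
    values = [eval_spn(c, evidence) for c in node["children"]]
    if kind == "product":
        return math.prod(values)
    return sum(w * v for w, v in zip(node["weights"], values))  # sum node

spn = {"type": "sum", "weights": [0.3, 0.7], "children": [
    {"type": "product", "children": [
        {"type": "leaf", "var": "A", "dist": {0: 0.8, 1: 0.2}},
        {"type": "leaf", "var": "B", "dist": {0: 0.4, 1: 0.6}}]},
    {"type": "product", "children": [
        {"type": "leaf", "var": "A", "dist": {0: 0.1, 1: 0.9}},
        {"type": "leaf", "var": "B", "dist": {0: 0.5, 1: 0.5}}]}]}
print(eval_spn(spn, {"A": 1, "B": 0}))  # 0.3*0.2*0.4 + 0.7*0.9*0.5 = 0.339
```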
One of the most appreciated functionalities of computers nowadays is serving as a means of communication and information sharing among people. With the spread of the Internet, several complex interactions have taken place among people, giving rise to huge Information Networks based on these interactions. Social Networks potentially represent an invaluable source of information that can be exploited for scientific and commercial purposes. On the other hand, due to their distinguishing peculiarities (huge size and an inherently relational setting) with respect to all previous information extraction tasks faced in Computer Science, they require new techniques to gather this information. Social Network Mining (SNM) is the corresponding research area, aimed at extracting information about the network objects and behavior that cannot be obtained from the explicit/implicit description of the objects alone, ignoring their explicit/implicit relationships. Statistical Relational Learning (SRL) is a very promising approach to SNM, since it combines expressive representation formalisms, able to model complex relational networks, with statistical methods able to handle uncertainty about objects and relations. This paper surveys some SRL formalisms and techniques adopted to solve SNM tasks.
Collaborative Filtering (CF) aims at predicting the user interest in a given item. In CF systems, the rating of a given user on a given item is predicted from the ratings of a set of users who have already rated that item and whose preferences are similar to the user's. In this paper we propose a framework based on uncertain graphs for dealing with collaborative filtering problems. In this framework, relationships among users and items, along with their corresponding likelihoods, are encoded in an uncertain graph, which can then be used to infer the probability of existence of a link between a user and an item in the graph. In order to solve CF tasks, the framework uses an approximate inference method adopting a constrained simple path query language. The aim of the paper is to verify whether uncertain graphs are a valuable tool for CF by solving classical, complex and structured problems. The performance of the proposed approach is reported on a real-world domain.
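Reusing the sampled-reachability sketch given earlier for probabilistic graphs, CF can be framed as estimating the probability of a user-item link over short paths; the toy data and names below are illustrative, not the paper's dataset or API.

```python
# `reachability_prob` is the function from the earlier probabilistic-graph sketch.
edges = {
    ("u1", "i1"): 0.9,  # u1 liked i1 with high confidence
    ("u2", "i1"): 0.8,
    ("u2", "i2"): 0.7,  # u2 also liked i2 -> evidence that u1 may like i2
}
score = reachability_prob(edges, "u1", "i2", max_len=3, n_samples=5000)
print(f"estimated P(u1 -- i2) = {score:.2f}")  # roughly 0.9 * 0.8 * 0.7
```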