Michelangelo Ceci
Role
Associate Professor
Organization
Università degli Studi di Bari Aldo Moro
Department
DIPARTIMENTO DI INFORMATICA
Scientific Area
AREA 09 - Industrial and Information Engineering
Scientific Disciplinary Sector
ING-INF/05 - Information Processing Systems
ERC Sector, 1st level
Not available
ERC Sector, 2nd level
Not available
ERC Sector, 3rd level
Not available
Over the last decade, advances in high-throughput omic technologies have made it possible to profile tumor cells at different levels, fostering the discovery of new biological data and the proliferation of a large number of bio-technological databases. In this paper we describe a framework for enabling interoperability among different biological data sources and, ultimately, for supporting expert users in the complex process of extracting, navigating and visualizing the precious knowledge hidden in such a huge quantity of data. In this framework, a key role is played by the Connectivity Map, a databank which relates diseases, physiological processes, and the action of drugs. The system will be used in a pilot study on Multiple Myeloma (MM).
microRNAs (miRNAs) are a class of small non-coding RNAs which have been recognized as ubiquitous post-transcriptional regulators. The analysis of interactions between different miRNAs and their target genes is necessary for understanding the role of miRNAs in the control of cell life and death. In this paper we propose a novel data mining algorithm, called HOCCLUS2, specifically designed to bicluster miRNAs and target messenger RNAs (mRNAs) on the basis of their experimentally verified and/or predicted interactions. Indeed, existing biclustering approaches, typically used to analyze gene expression data, fail when applied to miRNA:mRNA interactions, since they usually do not extract possibly overlapping biclusters (miRNAs and their target genes may have multiple roles), extract a huge number of biclusters (difficult to browse and rank on the basis of their importance) and work on similarities of feature values (they do not limit the analysis to reliable interactions). Results: To overcome these limitations, HOCCLUS2 i) extracts possibly overlapping biclusters, to catch the multiple roles of both miRNAs and their target genes; ii) extracts hierarchically organized biclusters, to facilitate bicluster browsing and to distinguish between universe and pathway-specific miRNAs; iii) extracts highly cohesive biclusters, to consider only reliable interactions; iv) ranks biclusters according to functional similarities, computed on the basis of the Gene Ontology, to facilitate bicluster analysis. Conclusions: Our results show that HOCCLUS2 is a valid tool to support biologists in the identification of context-specific miRNA regulatory modules and in the detection of possibly unknown miRNA target genes.
Indeed, results prove that HOCCLUS2 is able to extract cohesiveness-preserving biclusters, when compared with competitive approaches, and statistically confirm (at a confidence level of 99%) that mRNAs which belong to the same biclusters are, on average, more functionally similar than mRNAs which belong to different biclusters. Finally, the hierarchy of biclusters provides useful insights to understand the intrinsic hierarchical organization of miRNAs and their potential multiple interactions on target genes.
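The overlap-allowing, cohesiveness-based grouping described above can be sketched on toy data. This is only an illustration of the idea (pairing regulators whose target sets are cohesive enough), not the actual HOCCLUS2 algorithm; the function names, the toy interaction table and the Jaccard-based cohesion measure are assumptions:

```python
# Toy sketch: extract possibly overlapping, cohesive biclusters from a
# miRNA -> target-mRNA interaction table. Illustrative only; NOT the
# real HOCCLUS2 procedure (no hierarchy, no GO-based ranking here).

def jaccard(a, b):
    """Target-set overlap, used here as a simple cohesiveness measure."""
    return len(a & b) / len(a | b)

def toy_biclusters(interactions, min_cohesion=0.5):
    """Greedily pair miRNAs whose target sets overlap enough.

    Returns a list of (miRNA set, shared mRNA set) biclusters; a miRNA
    may appear in several biclusters, i.e. overlap is allowed.
    """
    mirnas = sorted(interactions)
    found = []
    for i, m1 in enumerate(mirnas):
        for m2 in mirnas[i + 1:]:
            t1, t2 = interactions[m1], interactions[m2]
            if jaccard(t1, t2) >= min_cohesion:
                found.append(({m1, m2}, t1 & t2))
    return found
```

On a table where two miRNAs share two of four targets, a cohesion threshold of 0.5 groups them and leaves an unrelated miRNA out.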
The amount of data produced by ubiquitous computing applications is quickly growing, due to the pervasive presence of small devices endowed with sensing, computing and communication capabilities. Heterogeneity and strong interdependence, which characterize 'ubiquitous data', require a (multi-) relational approach to their analysis. However, relational data mining algorithms do not scale well and very large data sets are hardly processable. In this paper we propose an extension of a relational algorithm for multi-level frequent pattern discovery, which resorts to data sampling and distributed computation in Grid environments, in order to overcome the computational limits of the original serial algorithm. The set of patterns discovered by the new algorithm approximates the set of exact solutions found by the serial algorithm. The quality of approximation depends on three parameters: the proportion of data in each sample, the minimum support thresholds and the number of samples in which a pattern has to be frequent in order to be considered globally frequent. Considering that the first two parameters are hardly controllable, we focus our investigation on the third one. Theoretically derived conclusions are also experimentally confirmed. Moreover, an additional application in the context of event log mining proves the viability of the proposed approach to relational frequent pattern mining from very large data sets.
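The third parameter discussed above — the number of samples in which a pattern must be locally frequent to be considered globally frequent — can be sketched on toy propositional transactions (a simplification of the relational, Grid-distributed setting; all names and the size-two itemset limit are illustrative):

```python
# Sketch of sampling-based approximate frequent pattern discovery:
# a pattern counts as globally frequent when it is locally frequent
# in at least `min_samples` of the data samples. Toy propositional
# version of the relational setting described in the abstract.
from itertools import combinations

def frequent_in_sample(sample, min_support):
    """Itemsets (size <= 2) whose relative support in one sample meets
    the threshold -- a stand-in for the local mining step on one node."""
    counts = {}
    for transaction in sample:
        items = sorted(transaction)
        for size in (1, 2):
            for combo in combinations(items, size):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    n = len(sample)
    return {p for p, c in counts.items() if c / n >= min_support}

def globally_frequent(samples, min_support, min_samples):
    """Vote across samples; keep patterns frequent in enough of them."""
    votes = {}
    for sample in samples:
        for p in frequent_in_sample(sample, min_support):
            votes[p] = votes.get(p, 0) + 1
    return {p for p, v in votes.items() if v >= min_samples}
```

Raising `min_samples` makes the approximation stricter (fewer false positives, more false negatives), which is exactly the trade-off the abstract investigates.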
Traditional pattern discovery approaches identify frequent patterns expressed as conjunctions of items, representing their frequent co-occurrences. Although such approaches have proved effective in descriptive knowledge discovery tasks, they can miss interesting combinations of items which do not necessarily occur together. To overcome this limitation, we propose a method for discovering interesting patterns that include disjunctions of items which would otherwise be pruned in the search. The method works in the relational data mining setting and preserves anti-monotonicity properties that allow the search space to be pruned. Disjunctions are obtained by joining relations which can occur simultaneously or alternatively, namely relations deemed similar in the application domain. Experiments and comparisons prove the viability of the proposed approach.
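The effect of allowing disjunctions of similar items can be sketched in a toy propositional form (the paper works in the richer relational setting; the similarity table and names below are assumptions):

```python
# Toy sketch: support of a pattern whose items may act as disjunctions.
# Each pattern item matches itself OR any item declared similar to it,
# so combinations that never co-occur literally can still be frequent.

def disjunctive_support(transactions, pattern, similar):
    """Relative support of `pattern`, where an item also matches any of
    the items listed as similar to it in `similar`."""
    def matches(item, transaction):
        return item in transaction or any(
            s in transaction for s in similar.get(item, ()))
    hits = sum(all(matches(i, t) for i in pattern) for t in transactions)
    return hits / len(transactions)
```

With "bus" declared similar to "train", the single-item pattern {"bus"} covers transactions containing either, so its support rises above what the purely conjunctive count would give.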
Longitudinal data consist of the repeated measurements of some variables which describe a process (or phenomenon) over time. They can be analyzed to unearth information on the dynamics of the process. In this paper we propose a temporal data mining framework to analyze these data and acquire knowledge, in the form of temporal patterns, on the events which can frequently trigger particular stages of the dynamic process. The application to a biomedical scenario is addressed. The goal is to analyze biosignal data in order to discover patterns of events, expressed in terms of breathing and cardiovascular system time-annotated disorders, which may trigger particular stages of the human central nervous system during sleep.
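The core idea — finding events that frequently precede a stage of interest within a time window — can be sketched as follows (a simplified stand-in for the paper's temporal pattern mining; the event labels and window semantics are illustrative):

```python
# Toy sketch: which time-annotated events frequently occur within a
# fixed window before the onsets of a stage of interest. Illustrative
# simplification of the temporal-pattern framework in the abstract.

def triggering_events(events, stage_onsets, window, min_freq):
    """events: list of (time, label) pairs; stage_onsets: times at which
    the stage of interest begins. A label is reported when it occurs
    within `window` time units before at least a fraction `min_freq`
    of the onsets."""
    hits = {}
    for onset in stage_onsets:
        seen = {label for t, label in events if onset - window <= t < onset}
        for label in seen:
            hits[label] = hits.get(label, 0) + 1
    n = len(stage_onsets)
    return {label for label, c in hits.items() if c / n >= min_freq}
```

In the biosignal scenario, the labels would be breathing or cardiovascular disorders and the onsets would mark sleep stages.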
The problem of accurately predicting the energy production from renewable sources has recently received increasing attention from both the industrial and the research communities. It presents several challenges, such as coping with the high rate at which data are provided by sensors, the heterogeneity of the collected data and the efficiency of power plants, as well as uncontrollable factors such as weather conditions and user consumption profiles. In this paper we describe Vi-POC (Virtual Power Operating Center), a project conceived to assist energy producers and, more generally, decision makers in the energy market. We present the Vi-POC project and how we address the challenges posed by the specific application domain. The solutions we propose have roots both in big data management and in stream data mining.
A key task in data mining and information retrieval is learning preference relations. Most of the methods reported in the literature learn preference relations between objects represented by attribute-value pairs or feature vectors (propositional representation). The growing interest in data mining techniques able to directly deal with more sophisticated representations of complex objects motivates the investigation of relational learning methods for learning preference relations. In this paper, we present a probabilistic relational data mining method which makes it possible to model preference relations between complex objects. Preference relations are then used to rank objects. Experiments on two ranking problems for scientific literature mining prove the effectiveness of the proposed method.
Spatial autocorrelation is the correlation among data values which is strictly due to the relative spatial proximity of the objects that the data refer to. Inappropriate treatment of data with spatial dependencies, where spatial autocorrelation is ignored, can obfuscate important insights. In this paper, we propose a data mining method that explicitly considers spatial autocorrelation in the values of the response (target) variable when learning predictive clustering models. The method is based on the concept of predictive clustering trees (PCTs), according to which hierarchies of clusters of similar data are identified and a predictive model is associated to each cluster. In particular, our approach is able to learn predictive models for both a continuous response (regression task) and a discrete response (classification task). We evaluate our approach on several real-world problems of spatial regression and spatial classification. Taking autocorrelation into account improves the predictions, which are consistently clustered in space, with clusters that tend to preserve the spatial arrangement of the data, while at the same time providing a multi-level insight into the spatial autocorrelation phenomenon. The evaluation of SCLUS in several ecological domains (e.g. predicting outcrossing rates within a conventional field due to the surrounding genetically modified fields, as well as predicting pollen dispersal rates from two lines of plants) confirms its capability of building spatially aware models which capture the spatial distribution of the target variable. In general, the maps obtained by using SCLUS do not require further post-smoothing of the results if we want to use them in practice.
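The spatial autocorrelation that SCLUS exploits is conventionally quantified by global Moran's I; a minimal implementation of that statistic (not of the SCLUS method itself, which embeds autocorrelation in the tree-induction heuristic) is:

```python
# Global Moran's I: the standard measure of spatial autocorrelation.
# Values near +1 indicate that nearby locations carry similar values;
# values near -1 indicate dissimilarity among neighbors.

def morans_i(values, weights):
    """values: list of observations; weights[i][j]: spatial proximity
    of locations i and j (conventionally 0 on the diagonal)."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    w_sum = sum(sum(row) for row in weights)
    return (n / w_sum) * (num / den)
```

For two pairs of adjacent locations, each pair sharing the same value, the statistic reaches its maximum of 1.0.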
Spatial data is common in ecological studies; however, one major problem with spatial data is the presence of spatial autocorrelation. This phenomenon indicates that data measured at locations relatively close to each other tend to have more similar values than data measured at locations farther apart. Spatial autocorrelation violates the statistical assumption that the analyzed data are independent and identically distributed. This chapter focuses on the effects of spatial autocorrelation when predicting gene flow from Genetically Modified (GM) to non-GM maize fields under real multi-field crop management practices at a regional scale. We present the SCLUS method, an extension of the method CLUS (Blockeel et al., 1998), which learns spatially aware predictive clustering trees (PCTs). The method can consider the effects of spatial autocorrelation both locally and globally, and can deal with the "ecological fallacy" problem (Robinson, 1950). The chapter concludes with a presentation of an application of this approach to gene flow modeling.
Analyzing biosignal data is an activity of great importance which can unearth information on the course of a disease. In this paper we propose a temporal data mining approach to analyze these data and acquire knowledge, in the form of temporal patterns, on the physiological events which can frequently trigger particular stages of disease. The proposed approach is realized through a four-step computational solution: first, disease stages are determined; then, a subset of stages of interest is identified; subsequently, physiological time-annotated events which can trigger those stages are detected; finally, patterns are discovered from the extracted events. The approach is applied to the sleep sickness scenario to discover patterns of events, expressed in terms of breathing and cardiovascular system time-annotated disorders, which may trigger particular sleep stages.
Studying Greek and Latin cultural heritage has always been considered essential to the understanding of important aspects of the roots of current European societies. However, only a small fraction of the total production of texts from ancient Greece and Rome has survived up to the present, leaving many gaps in the historiographic records. Epigraphy, which is the study of inscriptions (epigraphs), helps to fill these gaps. In particular, the goal of epigraphy is to clarify the meanings of epigraphs; to classify their uses according to their dating and cultural contexts; and to study aspects of the writing, the writers, and their "consumers." Although several research projects have recently been promoted for digitally storing and retrieving data and metadata about epigraphs, there has actually been no attempt to apply data mining technologies to discover previously unknown cultural aspects. In this context, we propose to exploit the temporal dimension associated with epigraphs (dating) by applying a data mining method for novelty detection. The main goal is to discover relational novelty patterns—that is, patterns expressed as logical clauses describing significant variations (in frequency) over the different epochs, in terms of relevant features such as language, writing style, and material. As a case study, we considered the set of Inscriptiones Christianae Vrbis Romae stored in Epigraphic Database Bari, an epigraphic repository. Some patterns discovered by the data mining method were easily deciphered by experts since they captured relevant cultural changes, whereas others disclosed unexpected variations, which might be used to formulate new questions, thus expanding the research opportunities in the field of epigraphy.
The automatic discovery of process models can help to gain insight into various perspectives (e.g., the control flow or data perspective) of the process executions traced in an event log. Frequent pattern mining offers a means to build human-understandable representations of these process models. This paper describes the application of a multi-relational method of frequent pattern discovery to process mining. Multi-relational data mining is demanded by the variety of activities and actors involved in the process executions traced in an event log, which leads to a relational (or structural) representation of the process executions. The peculiarity of this work lies in the integration of disjunctive forms into relational patterns discovered from event logs. The introduction of disjunctive forms enables relational patterns to express frequent variants of process models. The effectiveness of using relational patterns with disjunctions to describe process models with variants is assessed on real logs of process executions.
Most of the works on learning from networked data assume that the network is static. In this paper we consider a different scenario, where the network is dynamic, i.e. nodes/relationships can be added or removed and relationships can change in their type over time. We assume that the “core” of the network is more stable than the “marginal” part of the network, nevertheless it can change with time. These changes are of interest for this work, since they reflect a crucial step in the network evolution. Indeed, we tackle the problem of discovering evolution chains, which express the temporal evolution of the “core” of the network. To describe the “core” of the network, we follow a frequent pattern-mining approach, with the critical difference that the frequency of a pattern is computed along a time-period and not on a static dataset. The proposed method proceeds in two steps: 1) identification of changes through the discovery of emerging patterns; 2) composition of evolution chains by joining emerging patterns. We test the effectiveness of the method on both real and synthetic data.
Bisociations represent interesting relationships between seemingly unconnected concepts from two or more contexts. Most of the existing approaches for discovering bisociations from data rely on the assumption that contexts are static or can be treated as unchangeable domains. Actually, several real-world domains are intrinsically dynamic and can change over time. The same domain can change until it becomes completely different from what it was before: a dynamic domain observed at different time points can present different representations and can reasonably be assimilated to a series of distinct static domains. In this work, we investigate the task of linking concepts from a dynamic domain through the discovery of bisociations which link concepts over time. This provides us with a means to unearth linkages which would not be discovered when observing the domain as static, but which may have developed over time. We propose a computational solution which, assuming a time interval-based discretization of the domain, explores the spaces of association rules mined in the intervals and chains the rules on the basis of concept generalization and information theory criteria. The application to literature-based discovery shows how the method can re-discover known connections in biomedical terminology. Experiments and comparisons with alternative techniques highlight the additional peculiarities of this work.
The discovery of new and potentially meaningful relationships between named entities in the biomedical literature can greatly benefit from the application of multi-relational data mining approaches to text mining. This is motivated by the ability of multi-relational data mining to express and manipulate relationships between entities. We investigate the application of such an approach to the task of identifying informative syntactic structures which are frequent in biomedical abstract corpora. Initially, named entities are annotated in text corpora according to some biomedical dictionary (e.g. the MeSH taxonomy). Tagged entities are then integrated in syntactic structures with the role of subject and/or object of the corresponding verb. These structures are represented in a first-order language. The multi-relational approach to frequent pattern discovery makes it possible to identify the verb-based relationships between named entities which frequently occur in the corpora. Preliminary experiments with a collection of abstracts obtained by querying Medline on a specific disease are reported.
In Document Image Understanding, one of the fundamental tasks is recognizing semantically relevant components in the layout extracted from a document image. This process can be automated by learning classifiers able to automatically label such components. However, the learning process assumes the availability of a huge set of documents whose layout components have been previously manually labeled. This contrasts with the more common situation in which we have only a few labeled documents and an abundance of unlabeled ones. In addition, labeling layout components introduces further complexity due to their multi-modal nature (textual and spatial information may coexist). In this work, we investigate the application of a relational classifier that works in the transductive setting. The relational setting is justified by the multi-modal nature of the data we are dealing with, while transduction is justified by the possibility of exploiting the large amount of information conveyed in the unlabeled layout components. The classifier bootstraps the labeling process in an iterative way: reliable classifications are used in subsequent iterations as training examples. The proposed computational solution has been evaluated on document images of scientific literature.
Link prediction in network data is a data mining task which is receiving significant attention due to its applicability in various domains. An example can be found in social network analysis, where the goal is to identify connections between users. Another application can be found in computational biology, where the goal is to identify previously unknown relationships among biological entities. For example, the identification of regulatory activities (links) among genes would allow biologists to discover possible gene regulatory networks. In the literature, several approaches for link prediction can be found, but they often fail in simultaneously considering all the possible criteria (e.g. network topology, node properties, autocorrelation among nodes). In this paper we present a semi-supervised data mining approach which learns to combine the scores returned by several link prediction algorithms. The proposed solution exploits both a small set of validated examples of links and a huge set of unlabeled links. The application we consider regards the identification of links between genes and miRNAs, which can contribute to the understanding of their roles in many biological processes. The specific application requires learning from only positively labeled examples of links and coping with the high imbalance between labeled and unlabeled examples. Results show a significant improvement with respect to single prediction algorithms and with respect to a baseline combination.
Classical Greek and Latin culture is the very foundation of the identity of modern Europe. Today, a variety of modern subjects and disciplines have their roots in the classical world: from philosophy to architecture, from geometry to law. However, only a small fraction of the total production of texts from ancient Greece and Rome has survived up to the present day, leaving ample gaps in the historiographic records. Epigraphy, which is the study of inscriptions (epigraphs), aims to fill these gaps. In particular, the goal of epigraphy is to clarify the meanings of epigraphs, classifying their uses according to dates and cultural contexts, and drawing conclusions about the writing and the writers. Epigraphs are a form of cultural heritage for which several research projects have recently been promoted for the purposes of preservation, storage, indexing and online usage. In this paper, we describe the system EDB (Epigraphic Database Bari), which stores about 40,000 Christian inscriptions of Rome, including those published in the Inscriptiones Christianae Vrbis Romae septimo saeculo antiquiores, nova series editions. EDB provides, in addition to the possibility of storing metadata, the possibility of i) supporting information retrieval through a thesaurus-based query engine, ii) supporting time-based analysis of epigraphs in order to detect and represent novelties, and iii) geo-referencing epigraphs by exploiting a spatial database.
In traditional OLAP systems, roll-up and drill-down operations over data cubes exploit fixed hierarchies defined on discrete attributes, which play the role of dimensions, and operate along them. New emerging application scenarios, such as sensor networks, have stimulated research on OLAP systems where even continuous attributes are considered as dimensions of analysis, and hierarchies are defined over continuous domains. The goal is to avoid the prior definition of an ad-hoc discretization hierarchy along each OLAP dimension. Following this research trend, in this paper we propose a novel method, founded on a density-based hierarchical clustering algorithm, to support roll-up and drill-down operations over OLAP data cubes with continuous dimensions. The method hierarchically clusters dimension instances while also taking fact-table measures into account, thereby enhancing the clustering effect with respect to the intended analyses. Experiments on two well-known multidimensional datasets clearly show the advantages of the proposed solution.
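The idea of deriving roll-up levels from the data instead of a fixed discretization can be sketched with a simple gap-based grouping over one continuous dimension (this is a much cruder stand-in for the paper's density-based hierarchical clustering; the function and parameter names are illustrative):

```python
# Toy sketch: build one grouping per gap threshold over a continuous
# dimension. Sorted values are split wherever the gap between
# consecutive values exceeds the threshold, so looser thresholds give
# coarser roll-up levels -- a crude analogue of a density-based
# hierarchy over a continuous OLAP dimension.

def rollup_levels(values, gap_thresholds):
    """Returns one list of groups per threshold, fine to coarse."""
    ordered = sorted(values)
    levels = []
    for gap in sorted(gap_thresholds):
        groups, current = [], [ordered[0]]
        for prev, v in zip(ordered, ordered[1:]):
            if v - prev > gap:
                groups.append(current)
                current = [v]
            else:
                current.append(v)
        groups.append(current)
        levels.append(groups)
    return levels
```

Drilling down then means moving from a coarse level to a finer one, without any hierarchy having been fixed in advance.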
Spatial autocorrelation is the correlation among data values which is strictly due to the relative location proximity of the objects that the data refer to. This statistical property clearly indicates a violation of the assumption of observation independence, a pre-condition assumed by most data mining and statistical models. Inappropriate treatment of data with spatial dependencies, where spatial autocorrelation is ignored, can obfuscate important insights. In this paper, we propose a data mining method that explicitly considers autocorrelation when building the models. The method is based on the concept of predictive clustering trees (PCTs). The proposed approach combines the possibility of capturing both global and local effects with the ability to deal with positive spatial autocorrelation. The discovered models adapt to local properties of the data, providing at the same time spatially smoothed predictions. Results show the effectiveness of the proposed solution.
microRNAs (miRNAs) are an important class of regulatory factors controlling gene expression at the post-transcriptional level. Studies on interactions between different miRNAs and their target genes are of utmost importance to understand the role of miRNAs in the control of biological processes. This paper contributes to these studies by proposing a method for the extraction of co-clusters of miRNAs and messenger RNAs (mRNAs). Unlike several existing co-clustering algorithms, our approach efficiently extracts a set of possibly overlapping, exhaustive and hierarchically organized co-clusters. The algorithm is well-suited for the task at hand since: i) mRNAs and miRNAs can be involved in different regulatory networks that may or may not be co-active under some conditions, ii) exhaustive co-clusters guarantee that possible co-regulations are not lost, iii) hierarchical browsing of co-clusters assists biologists in the interpretation of results. Results on synthetic and on real human miRNA:mRNA data show the effectiveness of the approach.
The problem of accurately predicting the energy production from renewable sources has recently received increasing attention from both the industrial and the research communities. It presents several challenges, such as coping with the rate at which data are provided by sensors, the heterogeneity of the collected data and the efficiency of power plants, as well as uncontrollable factors such as weather conditions and user consumption profiles. In this paper we describe Vi-POC (Virtual Power Operating Center), a project conceived to assist energy producers and decision makers in the energy market. We present the Vi-POC project and how we address the challenges posed by the specific application. The solutions we propose have roots both in big data management and in stream data mining.
Background: MicroRNAs (miRNAs) are small non-coding RNAs which play a key role in the post-transcriptional regulation of many genes. Elucidating miRNA-regulated gene networks is crucial for understanding the mechanisms and functions of miRNAs in many biological processes, such as cell proliferation, development, differentiation and cell homeostasis, as well as in many types of human tumors. To this aim, we have recently presented the biclustering method HOCCLUS2 for the discovery of miRNA regulatory networks. Experiments on predicted interactions revealed that the statistical and biological consistency of the obtained networks is negatively affected by the poor reliability of the output of miRNA target prediction algorithms. Recently, some learning approaches have been proposed to learn to combine the outputs of distinct prediction algorithms and improve their accuracy. However, the application of classical supervised learning algorithms presents two challenges: i) the presence of only positive examples in datasets of experimentally verified interactions and ii) the unbalanced number of labeled and unlabeled examples. Results: We present a learning algorithm that learns to combine the scores returned by several prediction algorithms, by exploiting information conveyed by (only positively labeled) validated and unlabeled examples of interactions. To address these two related challenges, we resort to a semi-supervised ensemble learning setting. Results obtained using miRTarBase as the set of labeled (positive) interactions and mirDIP as the set of unlabeled interactions show a significant improvement, over competitive approaches, in the quality of the predictions. This solution also improves the effectiveness of HOCCLUS2 in discovering biologically realistic miRNA:mRNA regulatory networks from large-scale prediction data.
Using the miR-17-92 gene cluster family as a reference system and comparing results with previous experiments, we find a large increase in the number of biclusters significantly enriched in pathways, consistent with miR-17-92 functions. Conclusion: The proposed approach proves to be fundamental for the computational discovery of miRNA regulatory networks from large-scale predictions. This paves the way to the systematic application of HOCCLUS2 for a comprehensive reconstruction of all the possible multiple interactions established by miRNAs in regulating the expression of gene networks, which would otherwise be impossible to reconstruct by considering only experimentally validated interactions.
The Geographically Weighted Regression (GWR) is a method of spatial statistical analysis which allows the exploration of geographical differences in the linear effect of one or more predictor variables upon a response variable. The parameters of this linear regression model are locally determined for every point of the space by processing a sample of distance decay weighted neighboring observations. While this use of locally linear regression has proved appealing in the area of spatial econometrics, it also presents some limitations. First, the form of the GWR regression surface is globally defined over the whole sample space, although the parameters of the surface are locally estimated for every space point. Second, the GWR estimation is founded on the assumption that all predictor variables are equally relevant in the regression surface, without dealing with spatially localized collinearity problems. Third, time dependence among observations taken at consecutive time points is not considered as information-bearing for future predictions. In this paper, a tree-structured approach is adapted to recover the functional form of a GWR model only at the local level. A stepwise approach is employed to determine the local form of each GWR model by selecting only the most promising predictors. Parameters of these predictors are estimated at every point of the local area. Finally, a time-space transfer technique is tailored to capitalize on the time dimension of GWR trees learned in the past and to adapt them towards the present. Experiments confirm that the tree-based construction of GWR models improves both the local estimation of parameters of GWR and the global estimation of parameters performed by classical model trees. Furthermore, the effectiveness of the time-space transfer technique is investigated.
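The per-point estimation step of GWR — a weighted least-squares fit with distance-decay weights around a focal location — can be sketched for the single-predictor case (the paper's tree-structured and stepwise extensions are not shown; the Gaussian kernel and all names here are illustrative):

```python
# Minimal GWR building block: fit a weighted least-squares line
# (intercept, slope) at one focal point, with Gaussian distance-decay
# weights over 1-D locations. Single predictor only; the full GWR
# machinery (bandwidth selection, per-point surfaces) is omitted.
import math

def gwr_fit_at(point, locations, x, y, bandwidth):
    """Returns (intercept, slope) of the locally weighted regression
    of y on x, weighting each observation by its spatial proximity
    to `point`."""
    w = [math.exp(-((loc - point) ** 2) / (2 * bandwidth ** 2))
         for loc in locations]
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    var = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    slope = cov / var
    return my - slope * mx, slope
```

Repeating this fit at every point of the space yields the locally varying parameter surface that GWR explores.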
We present an algorithm for hierarchical multi-label classification (HMC) in a network context. It is able to classify instances that may belong to multiple classes at the same time and consider the hierarchical organization of the classes. It assumes that the instances are placed in a network and uses information on the network connections during the learning of the predictive model. Many real world prediction problems have classes that are organized hierarchically and instances that can have pairwise connections. One example is web document classification, where topics (classes) are typically organized into a hierarchy and documents are connected by hyperlinks. Another example, which is considered in this paper, is gene/protein function prediction, where genes/proteins are connected and form protein-to-protein interaction (PPI) networks. Network datasets are characterized by a form of autocorrelation, where the value of a variable at a given node depends on the values of variables at the nodes it is connected with. Combining the hierarchical multi-label classification task with network prediction is thus not trivial and requires the introduction of the new concept of network autocorrelation for HMC. The proposed algorithm is able to profitably exploit network autocorrelation when learning a tree-based prediction model for HMC. The learned model is in the form of a Predictive Clustering Tree (PCT) and predicts multiple (hierarchically organized) labels at the leaves. Experiments show the effectiveness of the proposed approach for different problems of gene function prediction, considering different PPI networks. The results show that different networks introduce different benefits in different problems of gene function prediction.
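One defining constraint of HMC is that a class can only be predicted together with its ancestors. A minimal sketch of enforcing that constraint on per-class scores (not the PCT learning algorithm itself; the class names and the max-propagation rule are illustrative assumptions) is:

```python
# Toy sketch of hierarchy consistency for multi-label scores: lift each
# class score to its ancestors so that a parent's score is at least the
# maximum of its descendants'. Thresholding the result can then never
# predict a class without also predicting its ancestors.

def depth(cls, parent):
    """Number of ancestors of `cls` in the hierarchy."""
    d = 0
    while parent.get(cls) is not None:
        cls = parent[cls]
        d += 1
    return d

def hierarchy_consistent(scores, parent):
    """scores: class -> score; parent: class -> parent class (or None).
    Assumes every class mentioned in `parent` also has a score."""
    fixed = dict(scores)
    # process classes bottom-up so scores propagate along whole chains
    for cls in sorted(fixed, key=lambda c: -depth(c, parent)):
        p = parent.get(cls)
        if p is not None:
            fixed[p] = max(fixed[p], fixed[cls])
    return fixed
```

With a confident prediction for a leaf function and a weak one for its parent, the parent's score is lifted so the predicted label set stays hierarchy-consistent.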
Link prediction in network data is a data mining task which is receiving significant attention due to its applicability in various domains. An example can be found in social network analysis, where the goal is to identify connections between users. Another application can be found in computational biology, where the goal is to identify previously unknown relationships among biological entities. For example, the identification of regulatory activities (links) among genes would allow biologists to discover possible gene regulatory networks. In the literature, several approaches for link prediction can be found, but they often fail to simultaneously consider all the possible criteria (e.g. network topology, node properties, autocorrelation among nodes). In this paper we present a semi-supervised data mining approach which learns to combine the scores returned by several link prediction algorithms. The proposed solution exploits both a small set of validated examples of links and a huge set of unlabeled links. The application we consider regards the identification of links between genes and miRNAs, which can contribute to the understanding of their roles in many biological processes. The specific application requires learning from only positively labeled examples of links and dealing with the high imbalance between labeled and unlabeled examples. Results show a significant improvement with respect to single prediction algorithms and with respect to a baseline combination.
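As a rough illustration of this kind of score combination, the sketch below trains a logistic model on a few labeled positive links against the unlabeled pool (a common positive-unlabeled heuristic), using the scores of several base predictors as features. This is not the paper's algorithm; all names and the synthetic scores are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Scores from three hypothetical base link predictors for 1000 candidate links.
# Hidden true links get systematically higher scores from predictors 0 and 2.
true_link = rng.random(1000) < 0.2
scores = rng.random((1000, 3))
scores[true_link, 0] += 0.5
scores[true_link, 2] += 0.3

# Only a handful of positives are validated; everything else is "unlabeled"
labeled_pos = np.where(true_link)[0][:30]
y = np.zeros(1000)
y[labeled_pos] = 1.0

# PU-style logistic regression (positives vs. unlabeled), via gradient descent
X = np.column_stack([np.ones(1000), scores])
w = np.zeros(4)
for _ in range(2000):
    p = 1 / (1 + np.exp(-X @ w))
    w -= 0.1 * X.T @ (p - y) / len(y)

combined = X @ w        # learned combination used to rank candidate links
```

The learned weights reflect which base predictors agree with the validated links, so the combined score ranks hidden true links above noise even though most positives were never labeled.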
Networked data are, nowadays, collected in various application domains such as social networks, biological networks, sensor networks, spatial networks, peer-to-peer networks etc. Recently, the application of data stream mining to networked data, in order to study their evolution over time, is receiving increasing attention in the research community. Following this main stream of research, we propose an algorithm for mining ranking models from networked data which may evolve over time. In order to properly deal with the concept drift problem, the algorithm exploits an ensemble learning approach which allows us to weight the importance of learned ranking models from past data when ranking new data. Learned models are able to take the network autocorrelation into account, that is, the statistical dependency between the values of the same attribute on related nodes. Empirical results prove the effectiveness of the proposed algorithm and show that it performs better than other approaches proposed in the literature.
Technologies in available biomedical repositories do not yet provide adequate mechanisms to support the understanding and analysis of the stored content. In this project we investigate this problem under different perspectives. Our contribution is the design of computational solutions for the analysis of biomedical documents and images. These integrate sophisticated technologies and innovative approaches of Information Extraction, Data Mining and Machine Learning to perform descriptive tasks of knowledge discovery from biomedical repositories.
The special issue of the Journal of Intelligent Information Systems (JIIS) features papers from the first International Workshop on New Frontiers in Mining Complex Patterns (NFMCP 2012), which was held in Bristol, UK, on September 24th, 2012, in conjunction with the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD 2012). The first paper, 'Link Classification with Probabilistic Graphs', by Nicola Di Mauro, Claudio Taranto and Floriana Esposito, proposes two machine learning techniques for the link classification problem in relational data exploiting the probabilistic graph representation. The second paper, 'Hierarchical Object-Driven Action Rules', by Ayman Hajja, Zbigniew W. Ras, and Alicja A. Wieczorkowska, proposes a hybrid action rule extraction approach that combines key elements from both the classical action rule mining approach and the object-driven action rule extraction approach to discover action rules from object-driven information systems.
One of the recently addressed research directions focuses on the issues raised by the diffusion of highly dynamic on-line information, particularly on the problem of mining topic evolutions from news. Among several applications, risk identification and analysis may exploit mining topic evolution from news in order to support law enforcement officers in risk and threat assessment. By assimilating the concept of a topic to that of a crime typology, represented by a group of "similar" criminals, it is possible to apply topic evolution mining techniques to discover evolutions of criminal behaviors over time. To this aim, we incrementally analyze streams of publicly available news about criminals (e.g. daily police reports, public court records, legal instruments) in order to identify clusters of similar criminals and represent their evolution over time. Experimental results on both real world and synthetically generated datasets prove the effectiveness of the proposed approach.
In recent years, improvements in ubiquitous technologies and sensor networks have motivated the application of data mining techniques to network organized data. Network data describe entities represented by nodes, which may be connected with (related to) each other by edges. Many network datasets are characterized by a form of autocorrelation where the value of a variable at a given node depends on the values of variables at the nodes it is connected with. This phenomenon is a direct violation of the assumption that data are independently and identically distributed (i.i.d.). At the same time, it offers the unique opportunity to improve the performance of predictive models on network data, as inferences about one entity can be used to improve inferences about related entities. In this work, we propose a method for learning to rank from network data when the data distribution may change over time. The learned models can be used to predict the ranking of nodes in the network for new time periods. The proposed method modifies the SVMRank algorithm in order to emphasize the importance of models learned in time periods in which the data distribution is similar to that observed in the new time period. We evaluate our approach on several real world problems of learning to rank from network data, coming from the area of sensor networks.
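The weighting idea — giving more influence to ranking models learned in periods whose data distribution resembles the current one — can be sketched as follows. This is a simplified stand-in (least-squares scorers, mean-vector distance as the similarity measure), not the modified SVMRank of the paper; all names and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def fit_linear_ranker(X, y):
    """Least-squares scorer, a stand-in for a per-period ranking model."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Two historical periods whose data follow different distributions/concepts
X_a = rng.normal(0, 1, (200, 3))
y_a = X_a @ np.array([1.0, 0.5, 0.0])
X_b = rng.normal(3, 1, (200, 3))
y_b = X_b @ np.array([0.0, 0.2, 1.0])
models = [fit_linear_ranker(X_a, y_a), fit_linear_ranker(X_b, y_b)]

# The new period resembles period A; weight each past model by how close
# its period's feature distribution is to the new one (mean-vector distance)
X_new = rng.normal(0, 1, (50, 3))
weights = np.array([np.exp(-np.linalg.norm(X.mean(axis=0) - X_new.mean(axis=0)))
                    for X in (X_a, X_b)])
weights /= weights.sum()

scores = sum(w * (X_new @ m) for w, m in zip(weights, models))
ranking = np.argsort(-scores)   # ensemble ranking of the new period's nodes
```

Because period A's distribution matches the new period, its model dominates the ensemble, so the combined ranking tracks the concept that is still valid.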
One of the recently addressed research directions focuses on the problem of mining topic evolutions from textual documents. Following this main stream of research, in this paper we face the different, but related, problem of mining the topic evolution of entities (persons, companies, etc.) mentioned in the documents. To this aim, we incrementally analyze streams of time-stamped documents in order to identify clusters of similar entities and represent their evolution over time. The proposed solution is based on the concept of temporal profiles of entities extracted at periodic instants in time. Experiments performed on both synthetic and real world datasets prove that the proposed framework is a valuable tool to discover underlying evolutions of entities; results show significant improvements over the considered baseline methods.
Motif discovery in biological sequences is an important field in bioinformatics. Most of the scientific research focuses on the de novo discovery of single motifs, but biological activities are typically co-regulated by several factors and this feature is properly reflected by higher order structures, called composite motifs, or cis-regulatory modules or simply modules. A module is a set of motifs, constrained both in number and location, which is statistically overrepresented and hence may be indicative of a biological function. Several methods have been studied for the de novo discovery of modules. We propose an alternative approach based on the discovery of rules that define strong spatial associations between single motifs and suggest the structure of a module. Single motifs involved in the mined rules might be either de novo discovered by motif discovery algorithms or taken from databases of single motifs. Rules are expressed in a first-order logic formalism and are mined by means of an inductive logic programming system. We also propose computational solutions to two issues: the hard discretization of numerical inter-motif distances and the choice of a minimum support threshold. All methods have been implemented and integrated in a tool designed to support biologists in the discovery and characterization of composite motifs. A case study is reported in order to show the potential of the tool.
Multi-Relational Data Mining (MRDM) refers to the process of discovering implicit, previously unknown and potentially useful information from data scattered in multiple tables of a relational database. Following the mainstream of MRDM research, we tackle the regression task, where the goal is to examine samples of past experience with known continuous answers (response) and generalize to future cases through an inductive process. Mr-SMOTI, the solution we propose, resorts to the structural approach in order to recursively partition data stored in a tightly-coupled database and build a multi-relational model tree which captures the linear dependence between the response variable and one or more explanatory variables. The model tree is top-down induced by choosing, at each step, either to partition the training space or to introduce a regression variable in the linear models at the leaves. The tight coupling with the database makes the knowledge on data structures (foreign keys) available free of charge to guide the search in the multi-relational pattern space. Experiments on artificial and real databases demonstrate that in general Mr-SMOTI outperforms both SMOTI and M5', which are two propositional model tree induction systems, and TILDE-RT, which is a state-of-the-art structural model tree induction system.
Network reconstruction from data is a data mining task which is receiving significant attention due to its applicability in several domains. For example, it can be applied in social network analysis, where the goal is to identify connections among users and, thus, sub-communities. Another example can be found in computational biology, where the goal is to identify previously unknown relationships among biological entities and, thus, relevant interaction networks. This task is usually solved by adopting methods for link prediction and for the identification of relevant sub-networks. Focusing on the biological domain, in [4] and [3] we proposed two methods for learning to combine the output of several link prediction algorithms and for the identification of biologically significant interaction networks involving two important types of RNA molecules, i.e. microRNAs (miRNAs) and messenger RNAs (mRNAs). The relevance of this application comes from the importance of identifying (previously unknown) regulatory and cooperation activities for the understanding of the biological roles of miRNAs and mRNAs. In this paper, we review the contribution given by the combination of the proposed methods for network reconstruction and the solutions we adopt in order to meet specific challenges coming from the specific domain we consider.
Regression inference in network data is a challenging task in machine learning and data mining. Network data describe entities represented by nodes, which may be connected with (related to) each other by edges. Many network datasets are characterized by a form of autocorrelation where the values of the response variable at a given node depend on the values of the variables (predictor and response) at the nodes connected to the given node. This phenomenon is a direct violation of the assumption of independent and identically distributed (i.i.d.) observations. At the same time, it offers a unique opportunity to improve the performance of predictive models on network data, as inferences about one entity can be used to improve inferences about related entities. In this paper, we propose a data mining method that explicitly considers autocorrelation when building regression models from network data. The method is based on the concept of predictive clustering trees (PCTs), which can be used both for clustering and predictive tasks: PCTs are decision trees viewed as hierarchies of clusters and provide symbolic descriptions of the clusters. In addition, PCTs can be used for multi-objective prediction problems, including multi-target regression and multi-target classification. Empirical results on real world problems of network regression show that the proposed extension of PCTs performs better than traditional decision tree induction when autocorrelation is present in the data.
Network data describe entities represented by nodes, which may be connected with (related to) each other by edges. Many network datasets are characterized by a form of autocorrelation, where the value of a variable at a given node depends on the values of variables at the nodes it is connected with. This phenomenon is a direct violation of the assumption that data are independently and identically distributed. At the same time, it offers a unique opportunity to improve the performance of predictive models on network data, as inferences about one entity can be used to improve inferences about related entities. Regression inference in network data is a challenging task. While many approaches for network classification exist, there are very few approaches for network regression. In this paper, we propose a data mining algorithm, called NCLUS, that explicitly considers autocorrelation when building regression models from network data. The algorithm is based on the concept of predictive clustering trees (PCTs) that can be used for clustering, prediction and multi-target prediction, including multi-target regression and multi-target classification. We evaluate our approach on several real world problems of network regression, coming from the areas of social and spatial networks. Empirical results show that our algorithm performs better than PCTs learned by completely disregarding network information, as well as PCTs that are tailored for spatial data but do not take autocorrelation into account, and a variety of other existing approaches.
In traditional OLAP systems, roll-up and drill-down operations over data cubes exploit fixed hierarchies defined on discrete attributes that play the roles of dimensions, and operate along them. However, in recent years a new tendency has emerged, mostly due to novel application scenarios like sensor and data stream management tools: even continuous attributes are considered as dimensions, and hierarchy members become continuous accordingly. A clear advantage of this emerging approach is that it avoids the beforehand definition of an ad-hoc discretization hierarchy along each OLAP dimension. Following this latest trend, in this paper we propose a novel method for effectively and efficiently supporting roll-up and drill-down operations over OLAP data cubes with continuous dimensions via a density-based hierarchical clustering algorithm. This algorithm allows us to hierarchically cluster together dimension instances by also taking fact-table measures into account, in order to enhance the clustering effect with respect to the possible analysis. Experiments on two well-known multidimensional datasets clearly show the advantages of the proposed solution.
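A hierarchy over a continuous dimension can be illustrated with a simple 1-D clustering that cuts the largest gaps between sorted values: cutting fewer gaps yields the coarser (roll-up) level, more gaps the finer (drill-down) level. This is a sketch of the general idea only — the paper's algorithm is density-based and also exploits fact-table measures, which the toy code below ignores; all names and data are invented.

```python
import numpy as np

def hierarchical_bins(values, levels):
    """Build a roll-up hierarchy over a continuous dimension by cutting the
    largest gaps between sorted values (a crude 1-D single-linkage clustering).
    `levels` lists the number of bins at each hierarchy level."""
    order = np.argsort(values)
    v = values[order]
    gaps = np.diff(v)
    hierarchy = []
    for k in levels:
        cut_idx = np.sort(np.argsort(-gaps)[:k - 1])   # positions of the k-1 largest gaps
        labels = np.zeros(len(v), dtype=int)
        for c in cut_idx:
            labels[c + 1:] += 1                        # increment label after each cut
        out = np.empty(len(v), dtype=int)
        out[order] = labels                            # map labels back to original order
        hierarchy.append(out)
    return hierarchy

# A continuous "temperature" dimension with three natural groups
rng = np.random.default_rng(3)
temps = np.concatenate([rng.normal(m, 0.3, 40) for m in (10, 20, 30)])
coarse, fine = hierarchical_bins(temps, levels=[2, 3])
# roll-up moves from `fine` to `coarse`; drill-down goes the other way
```

Because the coarse level's cut points are a subset of the fine level's, every fine bin nests inside exactly one coarse bin, which is what roll-up/drill-down requires.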
A paper document processing system is an information system component which transforms information on printed or handwritten documents into a computer-revisable form. In intelligent systems for paper document processing this information capture process is based on knowledge of the specific layout and logical structures of the documents. In this project we design a framework which combines technologies for the acquisition and storage of printed documents with knowledge-based techniques to represent and understand the information they contain. The innovative aspects of this work strengthen its applicability to tools that have been developed for building digital libraries.
Document summarization involves reducing a text document to a short set of phrases or sentences that convey the main meaning of the text. In digital libraries, summaries can be used as concise descriptions which the user can read for a rapid comprehension of the retrieved documents. Most existing approaches rely on classification algorithms that tend to generate “crisp” summaries, where the phrases are considered equally relevant and no information on their degree of importance or factor of significance is provided. Motivated by this, we present a probabilistic relational data mining method to model preference relations on sentences of document images. Preference relations are then used to rank the sentences which will form the final summary. We empirically evaluate the method on real document images.
Networks are data structures more and more frequently used for modeling interactions in social and biological phenomena, as well as between various types of devices, tools and machines. They can be either static or dynamic, depending on whether the modeled interactions are fixed or change over time. Static networks have been extensively investigated in data mining, while fewer studies have focused on dynamic networks and how to discover complex patterns in large, evolving networks. In this paper we focus on the task of discovering changes in evolving networks and we overcome some limits of existing methods (i) by resorting to a relational approach for representing networks characterized by heterogeneous nodes and/or heterogeneous relationships, and (ii) by proposing a novel algorithm for discovering changes in the structure of a dynamic network over time. Experimental results and comparisons with existing approaches on real-world datasets prove the effectiveness and efficiency of the proposed solution and provide some insights on the effect of some parameters in discovering and modeling the evolution of the whole network, or a subpart of it.
The rapid growth in the amount of spatial data available in Geographical Information Systems has given rise to substantial demand of data mining tools which can help uncover interesting spatial patterns. We advocate the relational mining approach to spatial domains, due to both various forms of spatial correlation which characterize these domains and the need to handle spatial relationships in a systematic way. We present some major achievements in this research direction and point out some open problems.
The task of gene regulatory network reconstruction from high-throughput data is receiving increasing attention in recent years. As a consequence, many inference methods for solving this task have been proposed in the literature. It has been recently observed, however, that no single inference method performs optimally across all datasets. It has also been shown that the integration of predictions from multiple inference methods is more robust and shows high performance across diverse datasets. Inspired by this research, in this paper, we propose a machine learning solution which learns to combine predictions from multiple inference methods. While this approach adds additional complexity to the inference process, we expect it would also carry substantial benefits. These would come from the automatic adaptation to patterns in the outputs of individual inference methods, so that it is possible to identify regulatory interactions more reliably when these patterns occur. This article demonstrates the benefits (in terms of accuracy of the reconstructed networks) of the proposed method, which exploits an iterative, semi-supervised ensemble-based algorithm. The algorithm learns to combine the interactions predicted by many different inference methods in the multi-view learning setting. The empirical evaluation of the proposed algorithm on a prokaryotic model organism (E. coli) and on a eukaryotic model organism (S. cerevisiae) clearly shows improved performance over state-of-the-art methods. The results indicate that gene regulatory network reconstruction for the real datasets is more difficult for S. cerevisiae than for E. coli. The software, all the datasets used in the experiments and all the results are available for download at the following link: http://figshare.com/articles/Semi_supervised_Multi_View_Learning_for_Gene_Network_Reconstruction/1604827.
In recent years, a growing interest has been devoted to trajectory data mining applications that support mobility prediction, with the aim of anticipating or pre-fetching possible services. Proposed approaches typically consider only the spatiotemporal information provided by the collected trajectories. However, in some scenarios, such as tourist support, semantic information expressing the needs and interests of the user (tourist) should be taken into account. This semantic information can be extracted from textual documents already consulted by the tourists. In this paper, we present the application of a time-slice density estimation approach that makes it possible to suggest/predict the next destination of the tourist. In particular, time-slice density estimation measures the rate of change of the tourist's interests at a given geographical position over a user-defined time horizon. Tourist interests depend both on the geographical position of the tourist with respect to a reference system and on the semantic information provided by geo-referenced documents associated with the visited sites.
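The time-slice idea — estimating a density from only the visits falling within a recent time window, and scoring candidate destinations against it — can be sketched with a plain Gaussian kernel density. This is an illustrative simplification (no document-derived semantics; invented names and data), not the system described in the paper.

```python
import numpy as np

def time_slice_density(visits, times, query_points, t_now, horizon, bandwidth=1.0):
    """Kernel density of recent visits, restricted to the time slice
    [t_now - horizon, t_now]; returns one density value per candidate point."""
    recent = (times >= t_now - horizon) & (times <= t_now)
    pts = visits[recent]
    # squared distances from each candidate to each recent visit
    d2 = ((query_points[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth ** 2)).sum(axis=1)

# A tourist drifting from the station at (0,0) toward the old town at (5,5)
times = np.arange(20.0)
visits = np.column_stack([times * 0.25, times * 0.25])
candidates = np.array([[0.0, 0.0], [5.0, 5.0]])    # hypothetical next destinations

dens = time_slice_density(visits, times, candidates, t_now=19, horizon=5)
next_site = candidates[np.argmax(dens)]            # predicted next destination
```

Restricting the kernel sum to the recent slice is what makes the estimate track the *rate of change* of interest: older visits near the station no longer contribute, so the density mass follows the tourist.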
Advances in high-throughput technologies have yielded the possibility to investigate healthy and diseased human cells at different levels. This has made possible the discovery of new biological and biomedical data and the proliferation of a large number of databases. In this paper, we describe the IS-BioBank (Integrated Semantic Biological Data Bank) proposal. It consists of the realization of a framework for enabling the interoperability among different biological data sources and for ultimately supporting expert users in the complex process of extraction, navigation and visualization of the precious knowledge hidden in such a huge quantity of data. In this framework, a key role is played by the Connectivity Map, a databank which relates diseases, physiological processes, and the action of drugs. The system will be used in a pilot study on Multiple Myeloma (MM).
Over the last decade, the advances in the high-throughput omic technologies have given the possibility to profile tumor cells at different levels, fostering the discovery of new biological data and the proliferation of a large number of bio-technological databases. In this paper we describe a framework for enabling the interoperability among different biological data sources and for ultimately supporting expert users in the complex process of extraction, navigation and visualization of the precious knowledge hidden in such a huge quantity of data. The system will be used in a pilot study on the Multiple Myeloma (MM).
Learning classifiers of spatial data presents several issues, such as the heterogeneity of spatial objects, the implicit definition of spatial relationships among objects, the spatial autocorrelation and the abundance of unlabelled data which potentially convey a large amount of information. The first three issues are due to the inherent structure of spatial units of analysis, which can be easily accommodated if a (multi-)relational data mining approach is considered. The fourth issue demands the adoption of a transductive setting, which aims to make predictions for a given set of unlabelled data. Transduction is also motivated by the contiguity of the concept of positive autocorrelation, which typically affects spatial phenomena, with the smoothness assumption which characterizes the transductive setting. In this work, we investigate a relational approach to spatial classification in a transductive setting. Computational solutions to the main difficulties met in this approach are presented. In particular, a relational upgrade of the naïve Bayes classifier is proposed as a discriminative model, an iterative algorithm is designed for the transductive classification of unlabelled data, and a distance measure between relational descriptions of spatial objects is defined in order to determine the k-nearest neighbors of each example in the dataset. Computational solutions have been tested on two real-world spatial datasets. The transformation of spatial data into a multi-relational representation and experimental results are reported and commented.
Many spatial phenomena are characterized by positive autocorrelation, i.e., variables take similar values at pairs of close locations. This property is strongly related to the smoothness assumption made in transductive learning, according to which if points in a high-density region are close, corresponding outputs should also be close. This observation, together with the prior availability of large sets of unlabelled data, which is typical in spatial applications, motivates the investigation of transductive learning for spatial data mining. The task considered in this work is spatial regression. We apply the co-training technique in order to iteratively learn two separate models, such that each model is used to make predictions on unlabeled data for the other. One model is built on the set of attribute-value observations measured at specific sites, while the other is built on the set of aggregated values measured for the same attributes in nearby sites. Experiments prove the effectiveness of the proposed approach on spatial domains.
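The co-training loop described here — two regressors on complementary views (site-level attributes vs. values aggregated over nearby sites), with unlabeled sites pseudo-labeled round by round — can be sketched as follows. This is a deliberately simplified illustration with least-squares learners and synthetic, spatially smooth data; the confidence-based example selection of the actual method is omitted, and all names are invented.

```python
import numpy as np

rng = np.random.default_rng(4)

def fit(X, y):
    """Ordinary least squares with intercept, standing in for each view's learner."""
    Xb = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(Xb, y, rcond=None)[0]

def predict(X, beta):
    return np.column_stack([np.ones(len(X)), X]) @ beta

# A spatially smooth attribute: nearby sites take similar values (positive
# autocorrelation), so the aggregated view carries much the same signal
coords = rng.uniform(0, 10, (300, 2))
x_site = np.sin(0.8 * coords[:, 0]) + 0.05 * rng.normal(size=300)
nn = np.argsort(((coords[:, None] - coords[None]) ** 2).sum(-1), axis=1)[:, 1:6]
x_agg = x_site[nn].mean(axis=1)            # view 2: mean over the 5 nearest sites
y_true = 3 * x_site + 0.1 * rng.normal(size=300)

labeled = np.zeros(300, dtype=bool)
labeled[:30] = True                        # only 30 sites carry labels
y_work = np.where(labeled, y_true, np.nan)

for _ in range(5):                         # co-training rounds
    m1 = fit(x_site[labeled, None], y_work[labeled])
    m2 = fit(x_agg[labeled, None], y_work[labeled])
    pool = np.where(~labeled)[0][:25]      # a batch of unlabeled sites per round
    # each view's model labels the pool; here we simply average the two views
    y_work[pool] = (predict(x_site[pool, None], m1) + predict(x_agg[pool, None], m2)) / 2
    labeled[pool] = True

rmse = np.sqrt(np.mean((predict(x_site[:, None], m1) - y_true) ** 2))
```

The positive autocorrelation is what makes the second view informative: because neighboring values are similar, the aggregated attribute approximates the site attribute, so each model produces pseudo-labels the other can learn from.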
In many textual repositories, documents are organized in a hierarchy of categories to support a thematic search by browsing topics of interest. In this paper we present a novel approach for the automatic classification of documents into a hierarchy of categories that works in the transductive setting and exploits relevant example selection. While resorting to the transductive learning setting makes it possible to classify repositories where only a few examples are labelled, by exploiting information potentially conveyed by unlabelled data, relevant example selection helps to tame the complexity of the task and increase the rate of learning by focusing only on informative examples. Results on real world datasets show the effectiveness of the proposed solutions.
A fundamental task of document image understanding is to recognize semantically relevant components in the layout extracted from a document image. This task can be automated by learning classifiers to label such components. The application of inductive learning algorithms assumes the availability of a large set of documents, whose layout components have been previously labeled through manual annotation. This contrasts with the more common situation in which we have only few labeled documents and an abundance of unlabeled ones. A further degree of complexity of the learning task is represented by the importance of spatial relationships between layout components, which cannot be adequately represented by feature vectors. To face these problems, we investigate the application of a relational classifier that works in the transductive setting. Transduction is justified by the possibility of exploiting the large amount of information conveyed in the unlabeled documents and by the contiguity of the concept of positive autocorrelation with the smoothness assumption which characterizes the transductive setting. The classifier takes advantage of discovered emerging patterns that permit us to qualitatively characterize classes. Computational solutions have been tested on document images of scientific literature and the experimental results show the advantages and drawbacks of the approach.
Consider a multi-relational database to be used for classification that contains a large amount of unlabeled data, for which the cost of labeling is prohibitive. Transductive learning, which learns from labeled as well as from unlabeled data already known at learning time, is highly suited to address this scenario. In this paper, we construct multiple views of a relational database by considering different subsets of its tables. These views are used to boost the classification of examples in a co-training schema. The automatically generated views allow us to overcome the independence problem that negatively affects the performance of co-training methods. Our experimental evaluation empirically shows that co-training is beneficial in the transductive learning setting when mining multi-relational data and that our approach works well with only a small amount of labeled data.
Background: Ontologies and catalogs of gene functions, such as the Gene Ontology (GO) and MIPS-FUN, assume that functional classes are organized hierarchically, that is, general functions include more specific ones. This has recently motivated the development of several machine learning algorithms for gene function prediction that leverage this hierarchical organization, where instances may belong to multiple classes. In addition, it is possible to exploit relationships among examples, since it is plausible that related genes tend to share functional annotations. Although these relationships have been identified and extensively studied in the area of protein-protein interaction (PPI) networks, they have not received much attention in hierarchical and multi-class gene function prediction. Relations between genes introduce autocorrelation in functional annotations and violate the assumption that instances are independently and identically distributed (i.i.d.), which underlies most machine learning algorithms. Although the explicit consideration of these relations brings additional complexity to the learning process, we expect substantial benefits in the predictive accuracy of learned classifiers. Results: This article demonstrates the benefits (in terms of predictive accuracy) of considering autocorrelation in multi-class gene function prediction. We develop a tree-based algorithm for considering network autocorrelation in the setting of Hierarchical Multi-label Classification (HMC). We empirically evaluate the proposed algorithm, called NHMC (Network Hierarchical Multi-label Classification), on 12 yeast datasets using each of the MIPS-FUN and GO annotation schemes and exploiting 2 different PPI networks.
The results clearly show that taking autocorrelation into account improves the predictive performance of the learned models for predicting gene function. Conclusions: Our newly developed method for HMC takes into account network information in the learning phase: When used for gene function prediction in the context of PPI networks, the explicit consideration of network autocorrelation increases the predictive performance of the learned models. Overall, we found that this holds for different gene features/descriptions, functional annotation schemes, and PPI networks: Best results are achieved when the PPI network is dense and contains a large proportion of function-relevant interactions.
Motivation: Catalogs, such as the Gene Ontology (GO) and MIPS-FUN, assume that functional classes are organized hierarchically (general functions include more specific functions). This has recently motivated the development of several machine learning algorithms under the assumption that instances may belong to multiple hierarchically organized classes. Besides relationships among classes, it is also possible to identify relationships among examples. Although such relationships have been identified and extensively studied in the area of protein-to-protein interaction (PPI) networks, they have not received much attention in hierarchical protein function prediction. The use of such relationships between genes introduces autocorrelation and violates the assumption that instances are independently and identically distributed, which underlies most machine learning algorithms. While this consideration introduces additional complexity to the learning process, we expect it would also carry substantial benefits. Results: This article demonstrates the benefits (in terms of predictive accuracy) of considering autocorrelation in multi-class gene function prediction. We develop a tree-based algorithm for considering network autocorrelation in the setting of Hierarchical Multi-label Classification (HMC). The empirical evaluation of the proposed algorithm, called NHMC, on 24 yeast datasets using MIPS-FUN and GO annotations and exploiting three different PPI networks, clearly shows that taking autocorrelation into account improves performance. Conclusions: Our results suggest that explicitly taking network autocorrelation into account increases the predictive capability of the models, especially when the underlying PPI network is dense. Furthermore, NHMC can be used as a tool to assess network data and the information it provides with respect to the gene function.