Donato Malerba
Role
Full Professor
Organization
Università degli Studi di Bari Aldo Moro
Department
Department of Computer Science
Scientific Area
AREA 09 - Industrial and Information Engineering
Scientific Disciplinary Sector
ING-INF/05 - Information Processing Systems
ERC Sector, 1st level
Not available
ERC Sector, 2nd level
Not available
ERC Sector, 3rd level
Not available
Process mining refers to the discovery, conformance and enhancement of process models from event logs currently produced by several information systems (e.g. workflow management systems). By tightly coupling event logs and process models, process mining makes it possible to detect deviations, predict delays, support decision making and recommend process redesigns. Event logs are data sets containing the executions (called traces) of a business process. Several process mining algorithms have been defined to mine event logs and deliver valuable models (e.g. Petri nets) of how logged processes are being executed. However, they often generate spaghetti-like process models, which can be hard to understand. This is caused by the inherent complexity of real-life processes, which tend to be less structured and more flexible than what the stakeholders typically expect. In particular, spaghetti-like process models are discovered when all possible behaviors are shown in a single model as a result of considering the set of traces in the event log all at once. To mitigate this problem, trace clustering can be used as a preprocessing step: it splits up an event log into clusters of similar traces, so as to handle variability in the recorded behavior and facilitate process model discovery. In this paper, we investigate a multiple view aware approach to trace clustering, based on a co-training strategy. In an assessment using benchmark event logs, we show that the presented algorithm is able to discover a clustering pattern of the log such that related traces are appropriately clustered. We evaluate the significance of the formed clusters using established machine learning and process mining metrics.
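To make the preprocessing step concrete, the sketch below clusters traces by their activity-frequency profiles, a minimal single-view baseline; the algorithm presented in the paper extends this idea to multiple views of the traces combined through a co-training strategy. The toy event log, the number of clusters and the use of k-means are illustrative assumptions, not the paper's actual method.

```python
# Minimal single-view trace clustering: traces are represented by
# bag-of-activities profiles and grouped with k-means. The paper's
# algorithm extends this idea to multiple views (e.g. activity and
# transition profiles) combined through co-training.
from collections import Counter

from sklearn.cluster import KMeans
from sklearn.feature_extraction import DictVectorizer

# Each trace is the sequence of activity names of one process execution.
event_log = [
    ["register", "check", "approve", "notify"],
    ["register", "check", "reject", "notify"],
    ["register", "approve", "notify"],
    ["register", "check", "check", "approve", "notify"],
]

# One view: the activity-frequency profile of each trace.
profiles = [Counter(trace) for trace in event_log]
X = DictVectorizer(sparse=False).fit_transform(profiles)

# Split the log into clusters of similar traces; a process model can
# then be discovered separately for each cluster.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for trace, label in zip(event_log, labels):
    print(label, trace)
```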
microRNAs (miRNAs) are a class of small non-coding RNAs which have been recognized as ubiquitous post-transcriptional regulators. The analysis of interactions between different miRNAs and their target genes is necessary for understanding the role of miRNAs in the control of cell life and death. In this paper we propose a novel data mining algorithm, called HOCCLUS2, specifically designed to bicluster miRNAs and target messenger RNAs (mRNAs) on the basis of their experimentally verified and/or predicted interactions. Indeed, existing biclustering approaches, typically used to analyze gene expression data, fail when applied to miRNA:mRNA interactions, since they usually do not extract possibly overlapping biclusters (miRNAs and their target genes may have multiple roles), extract a huge number of biclusters (difficult to browse and rank on the basis of their importance) and work on similarities of feature values (without limiting the analysis to reliable interactions). Results: To overcome these limitations, HOCCLUS2 i) extracts possibly overlapping biclusters, to catch the multiple roles of both miRNAs and their target genes; ii) extracts hierarchically organized biclusters, to facilitate bicluster browsing and to distinguish between universe and pathway-specific miRNAs; iii) extracts highly cohesive biclusters, to consider only reliable interactions; iv) ranks biclusters according to functional similarities, computed on the basis of the Gene Ontology, to facilitate bicluster analysis. Conclusions: Our results show that HOCCLUS2 is a valid tool to support biologists in the identification of context-specific miRNA regulatory modules and in the detection of possibly unknown miRNA target genes. Indeed, the results prove that HOCCLUS2 is able to extract cohesiveness-preserving biclusters, when compared with competitive approaches, and statistically confirm (at a confidence level of 99%) that mRNAs which belong to the same biclusters are, on average, more functionally similar than mRNAs which belong to different biclusters. Finally, the hierarchy of biclusters provides useful insights to understand the intrinsic hierarchical organization of miRNAs and their potential multiple interactions on target genes.
Recently, several algorithms based on the MapReduce framework have been proposed for frequent pattern mining in Big Data. However, the proposed solutions come with their own technical challenges, such as inter-communication costs, in-process synchronizations, balanced data distribution and input parameter tuning, which negatively affect the computation time. In this paper we present MrAdam, a novel parallel, distributed algorithm which addresses these problems. The key principle underlying the design of MrAdam is that one can make reasonable decisions in the absence of perfect answers. Indeed, given the classical minimum support threshold and a user-specified error bound, MrAdam exploits the Chernoff bound to mine "approximate" frequent itemsets with statistical error guarantees on their actual supports. These itemsets are generated in parallel and independently from subsets of the input dataset, by exploiting the MapReduce parallel computation framework. The resulting collections of frequent itemsets from each subset are aggregated and filtered using a novel technique that provides a single collection in output. MrAdam can scale well on gigabytes of data and tens of machines, as experimentally proven on real datasets. In the experiments we also show that the proposed algorithm returns a good, statistically bounded approximation of the exact results.
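As a reference for how such guarantees are typically obtained, one standard Chernoff-Hoeffding form used in sample-based frequent itemset mining is sketched below; the exact variant employed by MrAdam may differ.

```latex
% One standard Chernoff-Hoeffding form used in sample-based frequent
% itemset mining (the exact variant employed by MrAdam may differ).
% For an itemset with true support p and support \hat{p} estimated on
% a sample of n transactions:
\[
  \Pr\bigl(|\hat{p} - p| \geq \varepsilon\bigr) \;\leq\; 2\,e^{-2 n \varepsilon^{2}},
  \qquad\text{so that}\qquad
  n \;\geq\; \frac{1}{2\varepsilon^{2}}\,\ln\frac{2}{\delta}
\]
% guarantees that the estimated support deviates from the true one by
% more than \varepsilon with probability at most \delta.
```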
The amount of data produced by ubiquitous computing applications is quickly growing, due to the pervasive presence of small devices endowed with sensing, computing and communication capabilities. Heterogeneity and strong interdependence, which characterize 'ubiquitous data', require a (multi-)relational approach to their analysis. However, relational data mining algorithms do not scale well, and very large data sets can hardly be processed. In this paper we propose an extension of a relational algorithm for multi-level frequent pattern discovery, which resorts to data sampling and distributed computation in Grid environments, in order to overcome the computational limits of the original serial algorithm. The set of patterns discovered by the new algorithm approximates the set of exact solutions found by the serial algorithm. The quality of the approximation depends on three parameters: the proportion of data in each sample, the minimum support thresholds, and the number of samples in which a pattern has to be frequent in order to be considered globally frequent. Considering that the first two parameters are hardly controllable, we focus our investigation on the third one. Theoretically derived conclusions are also experimentally confirmed. Moreover, an additional application in the context of event log mining proves the viability of the proposed approach to relational frequent pattern mining from very large data sets.
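A minimal sketch of the voting criterion investigated here (the third parameter): a pattern is considered globally frequent when it is locally frequent in at least a given number of samples. The toy data, the naive local miner and the thresholds are illustrative.

```python
# Sample-voting sketch: a pattern is globally frequent when it is
# locally frequent in at least `min_votes` of the samples mined in
# parallel. Data and thresholds are illustrative.
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Local mining on one sample (here: naive 1- and 2-itemsets)."""
    counts = Counter()
    for t in transactions:
        for k in (1, 2):
            counts.update(combinations(sorted(set(t)), k))
    n = len(transactions)
    return {i for i, c in counts.items() if c / n >= min_support}

samples = [
    [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}],
    [{"a", "b"}, {"b", "c"}, {"a", "b"}],
    [{"a"}, {"a", "b"}, {"c"}],
]
min_votes = 2  # frequent in at least 2 of the 3 samples

votes = Counter()
for sample in samples:
    votes.update(frequent_itemsets(sample, min_support=0.5))

globally_frequent = {p for p, v in votes.items() if v >= min_votes}
print(globally_frequent)
```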
Traditional pattern discovery approaches make it possible to identify frequent patterns expressed as conjunctions of items, representing their frequent co-occurrences. Although such approaches have proven effective in descriptive knowledge discovery tasks, they can miss interesting combinations of items which do not necessarily occur together. To avoid this limitation, we propose a method for discovering interesting patterns that include disjunctions of items which would otherwise be pruned during the search. The method works in the relational data mining setting and preserves anti-monotonicity properties that allow the search space to be pruned. Disjunctions are obtained by joining relations which can occur simultaneously or alternatively, namely relations deemed similar in the application domain. Experiments and comparisons prove the viability of the proposed approach.
Longitudinal data consist of the repeated measurements of some variables which describe a process (or phenomenon) over time. They can be analyzed to unearth information on the dynamics of the process. In this paper we propose a temporal data mining framework to analyze these data and acquire knowledge, in the form of temporal patterns, on the events which can frequently trigger particular stages of the dynamic process. The application to a biomedical scenario is addressed. The goal is to analyze biosignal data in order to discover patterns of events, expressed in terms of breathing and cardiovascular system time-annotated disorders, which may trigger particular stages of the human central nervous system during sleep.
The rising need for energy to improve the quality of life has paved the way for the development and promotion of different kinds of renewable energy technologies. In particular, the recent increase in the number of installed PhotoVoltaic (PV) plants has boosted the marketing of new monitoring systems designed to keep the energy production of PV plants under control. In this paper, we present an intelligent monitoring system, called SUNInspector, which resorts to spatio-temporal data mining techniques in order to monitor the energy production of PV plants and detect possible plant faults in real time. SUNInspector uses spatio-temporal patterns, called trend clusters, to model how the energy production of a PV plant varies depending on the region where it is installed (spatial dependence) and the period of the year of the measurements (temporal dependence). Each time a PV plant transmits its energy production measurement, the risk of a plant fault is measured by evaluating the persistence of a high difference between the real production and the expected production. A case study with PV plants distributed over the South of Italy is illustrated.
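A minimal sketch of the fault-risk rule described above, assuming the expected productions come from the trend-cluster model; the tolerance and persistence values are illustrative.

```python
# Fault-risk sketch: an alarm is raised when the gap between measured
# and expected production stays above a tolerance for several
# consecutive transmissions. The expected values would come from the
# trend-cluster model; here they are illustrative constants.
def fault_risk(measured, expected, tolerance, persistence):
    """True if |measured - expected| > tolerance holds for
    `persistence` consecutive readings."""
    run = 0
    for m, e in zip(measured, expected):
        run = run + 1 if abs(m - e) > tolerance else 0
        if run >= persistence:
            return True
    return False

expected = [4.0, 4.2, 4.5, 4.4, 4.1]   # kWh from the trend model
measured = [3.9, 3.0, 3.1, 3.2, 3.0]   # kWh actually transmitted
print(fault_risk(measured, expected, tolerance=0.8, persistence=3))  # True
```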
The analysis of spatial autocorrelation has defined a new paradigm in ecology. Attention to spatial pattern leads to insights that would otherwise be overlooked, while ignoring space may lead to false conclusions about ecological relationships. In this paper, we propose an intelligent forecasting technique which explicitly accounts for the property of spatial autocorrelation when learning linear autoregressive (ARIMA) models of spatially correlated ecological time series. The forecasting algorithm makes use of an autoregressive statistical technique which achieves accurate forecasts of future data by taking into account the temporal and spatial dimensions of ecological data. It uses a novel spatially aware inference procedure which learns the autoregressive model by processing the time series in a neighborhood (spatial lags). The parameters of the forecasting models are jointly learned on spatial lags of time series. Experiments with ecological data investigate the accuracy of the proposed spatially aware forecasting model with respect to the traditional one.
In this paper, we face the problem of extracting spatial relationships from geographical entities mentioned in textual documents. This is part of a research project which aims at geo-referencing document contents, hence making the realization of a Geographical Information Retrieval system possible. The driving factor of this research is the huge number of Web documents which mention geographic places and relate them spatially. Several approaches have been proposed for the extraction of spatial relationships. However, they all assume the availability of either a large set of manually annotated documents or complex hand-crafted rules; in both cases, a rather tedious and time-consuming activity is required of domain experts. We propose an alternative approach based on the combined use of a spatial ontology, which defines the topological relationships (classes) to be identified within the text, and a nearest-prototype classifier, which helps to recognize instances of the topological relationships. This approach is unsupervised, so it does not need annotated data. Moreover, it is based on an ontology, which avoids the hand-crafting of ad hoc rules. Experimental results on real datasets show the viability of this approach.
The problem of accurately predicting the energy production from renewable sources has recently received increasing attention from both the industrial and the research communities. It presents several challenges, such as coping with the high rate at which data are provided by sensors, the heterogeneity of the collected data and power plant efficiency, as well as uncontrollable factors, such as weather conditions and user consumption profiles. In this paper we describe Vi-POC (Virtual Power Operating Center), a project conceived to assist energy producers and, more generally, decision makers in the energy market. We present the Vi-POC project and how we address the challenges posed by the specific application domain. The solutions we propose have roots both in big data management and in stream data mining.
A spatio-temporal data stream is a sequence of time-stamped, geo-referenced data elements which arrive at consecutive time points. In addition to the information-bearing spatial and temporal dimensions, streams pose further challenges to data mining: avoiding multiple scans of the entire data set, optimizing memory usage, and mining only the most recent patterns. In this paper, we address the challenge of mining spatio-temporal data streams for a new class of space-time patterns, called trend clusters. These patterns combine spatial clustering and trend discovery in stream environments. In particular, we propose a novel algorithm, called TRUST, which retrieves groups of spatially contiguous, geo-referenced data whose values vary according to a similar trend polyline over the recent time window. Experiments demonstrate the effectiveness of the proposed algorithm.
In predictive data mining tasks on network data, we should account for the autocorrelation of both the independent variables and the dependent variable, which can be observed between a target node and its neighborhood. The prediction on a target node should be based on the values of its neighbours, which might even be unavailable. To address this problem, the values of the neighbours should be inferred collectively. We present a novel computational solution to perform collective inference in a network regression task. We define an iterative algorithm which makes regression inferences about multiple nodes simultaneously and feeds the most reliable predictions made by the previous models back into the labeled network. Experiments investigate the effectiveness of the proposed algorithm in spatial networks.
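A minimal sketch of the collective inference loop, under simplifying assumptions: predictions are plain neighborhood averages and all of them are fed back at each iteration, whereas the paper's algorithm retains only the most reliable ones.

```python
# Iterative collective regression sketch: each node's prediction is
# refined using the current predictions of its neighbours. Network,
# labels and the update rule are illustrative.
import numpy as np

adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
known = {0: 10.0, 3: 2.0}          # labeled nodes
pred = {n: np.mean(list(known.values())) for n in adjacency}  # init
pred.update(known)

for _ in range(20):
    new_pred = dict(pred)
    for node in adjacency:
        if node in known:
            continue  # labeled nodes keep their true value
        neigh = [pred[m] for m in adjacency[node]]
        new_pred[node] = np.mean(neigh)  # neighbourhood-based estimate
    pred = new_pred

print({n: round(v, 2) for n, v in pred.items()})
```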
A key task in data mining and information retrieval is learning preference relations. Most of the methods reported in the literature learn preference relations between objects which are represented by attribute-value pairs or feature vectors (propositional representation). The growing interest in data mining techniques which are able to directly deal with more sophisticated representations of complex objects motivates the investigation of relational learning methods for learning preference relations. In this paper, we present a probabilistic relational data mining method which can model preference relations between complex objects. The preference relations are then used to rank objects. Experiments on two ranking problems for scientific literature mining prove the effectiveness of the proposed method.
Trend cluster discovery retrieves areas of spatially close sensors which measure a numeric random field with a prominent data trend along a time horizon. We propose a computation-preserving algorithm which employs an incremental learning strategy to continuously maintain sliding-window trend clusters across a sensor network. Our proposal reduces the amount of data to be processed and, as a consequence, saves computation time. An empirical study proves the effectiveness of the proposed algorithm in keeping the computation cost of detecting sliding-window trend clusters under control.
Emerging real-life applications, such as environmental compliance, ecological studies and meteorology, are characterized by real-time data acquisition through a number of (wireless) remote sensors. Operatively, remote sensors are installed across a spatially distributed network; they gather information along a number of attribute dimensions and periodically feed a central server with the measured data. The server is required to monitor these data, issue possible alarms or compute fast aggregates. As data analysis requests, which are submitted to a server, may concern both present and past data, the server is forced to store the entire stream. But, in the case of massive streams (large networks and/or frequent transmissions), the limited storage capacity of a server may force a reduction in the amount of data stored on disk. One solution to address the storage limits is to compute summaries of the data as they arrive and use these summaries to interpolate the real data, which are discarded instead. On any future demand for further analysis of the discarded data, the server pieces together the data from the summaries stored in the database and processes them according to the requests. This work introduces the multiple possibilities and facets of a recently defined spatio-temporal pattern, called trend cluster, and its applications to summarize, interpolate and identify anomalies in a sensor network. As an example application, the authors illustrate the use of trend cluster discovery to monitor the efficiency of photovoltaic power plants. The work closes with remarks on the new possibilities for surveillance gained by recent developments in sensing technology, and with an outline of future challenges.
Despite the growing ubiquity of sensor deployments and the advances in sensor data analysis technology, relatively little attention has been paid to the spatial non-stationarity of sensed data, which is an intrinsic property of geographically distributed data. In this paper we deal with the non-stationarity of geographically distributed data in the task of regression. For this purpose, we extend the Geographically Weighted Regression (GWR) method, which permits the exploration of geographical differences in the linear effect of one or more predictor variables upon a response variable. The parameters of this linear regression model are locally determined for every point of the space by processing a sample of weighted neighboring observations. Although the use of locally linear regression has proved appealing in the area of sensor data analysis, it also poses some problems. The parameters of the surface are locally estimated for every space point, but the form of the GWR regression surface is globally defined over the whole sample space. Moreover, the GWR estimation is founded on the assumption that all predictor variables are equally relevant in the regression surface, without dealing with spatially localized phenomena of collinearity. Our proposal overcomes these limitations with a novel tree-based approach which is adapted to the aim of recovering the functional form of the regression model only at the local level. A stepwise approach is then employed to determine the local form of each regression model by selecting only the most promising predictors and providing a mechanism to estimate their parameters at every point of the local area. Experiments with several geographically distributed datasets confirm that the tree-based construction of GWR models improves both the local estimation of parameters performed by GWR and the global estimation of parameters performed by classical model trees.
Spatial autocorrelation is the correlation among data values which is strictly due to the relative spatial proximity of the objects that the data refer to. Inappropriate treatment of data with spatial dependencies, where spatial autocorrelation is ignored, can obfuscate important insights. In this paper, we propose a data mining method that explicitly considers spatial autocorrelation in the values of the response (target) variable when learning predictive clustering models. The method is based on the concept of predictive clustering trees (PCTs), according to which hierarchies of clusters of similar data are identified and a predictive model is associated with each cluster. In particular, our approach is able to learn predictive models for both a continuous response (regression task) and a discrete response (classification task). We evaluate our approach on several real-world problems of spatial regression and spatial classification. Considering the autocorrelation in the models yields predictions that are consistently clustered in space, with clusters that tend to preserve the spatial arrangement of the data, while at the same time providing a multi-level insight into the spatial autocorrelation phenomenon. The evaluation of SCLUS in several ecological domains (e.g. predicting outcrossing rates within a conventional field due to the surrounding genetically modified fields, as well as predicting pollen dispersal rates from two lines of plants) confirms its capability of building spatially aware models which capture the spatial distribution of the target variable. In general, the maps obtained by using SCLUS do not require further post-smoothing of the results if we want to use them in practice.
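For reference, global Moran's I is a standard statistic for the degree of spatial autocorrelation of a variable observed at a set of locations; the heuristic actually embedded in the PCT induction may differ from this formulation.

```latex
% Global Moran's I, a standard measure of spatial autocorrelation for a
% variable x observed at N locations with spatial weights w_{ij}
% (given here as a reference; the clustering heuristic used in the
% paper may adopt a different formulation).
\[
  I \;=\; \frac{N}{\sum_{i}\sum_{j} w_{ij}} \cdot
  \frac{\sum_{i}\sum_{j} w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}
       {\sum_{i} (x_i - \bar{x})^{2}}
\]
```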
Spatial data is common in ecological studies; however, one major problem with spatial data is the presence of spatial autocorrelation. This phenomenon indicates that data measured at locations relatively close to each other tend to have more similar values than data measured at locations further apart. Spatial autocorrelation violates the statistical assumption that the analyzed data are independent and identically distributed. This chapter focuses on the effects of spatial autocorrelation when predicting gene flow from Genetically Modified (GM) to non-GM maize fields under real multi-field crop management practices at a regional scale. We present the SCLUS method, an extension of the CLUS method (Blockeel et al., 1998), which learns spatially aware predictive clustering trees (PCTs). The method can consider the effects of spatial autocorrelation both locally and globally, and can deal with the “ecological fallacy” problem (Robinson, 1950). The chapter concludes with a presentation of an application of this approach to gene flow modeling.
Anomaly detection and change analysis are challenging tasks in stream data mining. We illustrate a novel method that addresses both these tasks in geophysical applications. The method is designed for numeric data routinely sampled through a sensor network. It extends traditional time series forecasting theory by accounting for the spatial information of geophysical data. In particular, a forecasting model is computed incrementally by accounting for the temporal correlation of data which exhibit a spatial correlation in the recent past. For each sensor, the observed value is compared to its spatially aware forecast in order to identify outliers. Finally, the spatial correlation of the outliers is analyzed in order to classify changes and reduce the number of false anomalies. The performance of the presented method is evaluated on both artificial and real data streams.
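A minimal sketch of the outlier/change distinction: readings far from their forecasts are outliers, and an outlier surrounded by other outliers is classified as a change rather than a false anomaly. Thresholds and the neighborhood structure are illustrative.

```python
# Outlier/change classification sketch: isolated outliers are flagged
# as anomalies, while a spatially correlated group of outliers is
# classified as a change. Data and thresholds are illustrative.
def classify(readings, forecasts, neighbours, thr, min_corr):
    outliers = {s for s in readings
                if abs(readings[s] - forecasts[s]) > thr}
    result = {}
    for s in outliers:
        shared = sum(1 for n in neighbours[s] if n in outliers)
        frac = shared / max(len(neighbours[s]), 1)
        result[s] = "change" if frac >= min_corr else "anomaly"
    return result

readings = {"s1": 9.0, "s2": 9.2, "s3": 4.1, "s4": 4.0}
forecasts = {"s1": 4.0, "s2": 4.1, "s3": 4.0, "s4": 4.2}
neighbours = {"s1": ["s2"], "s2": ["s1", "s3"],
              "s3": ["s2", "s4"], "s4": ["s3"]}
print(classify(readings, forecasts, neighbours, thr=2.0, min_corr=0.5))
# s1 and s2 are both outliers and mutual neighbours -> "change"
```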
Analyzing biosignal data is an activity of great importance which can unearth information on the course of a disease. In this paper we propose a temporal data mining approach to analyze these data and acquire knowledge, in the form of temporal patterns, on the physiological events which can frequently trigger particular stages of a disease. The proposed approach is realized through a four-step computational solution: first, disease stages are determined; then, a subset of stages of interest is identified; subsequently, physiological time-annotated events which can trigger those stages are detected; finally, patterns are discovered from the extracted events. The application to the sleep sickness scenario is addressed to discover patterns of events, in terms of breathing and cardiovascular system time-annotated disorders, which may trigger particular sleep stages.
Studying Greek and Latin cultural heritage has always been considered essential to the understanding of important aspects of the roots of current European societies. However, only a small fraction of the total production of texts from ancient Greece and Rome has survived up to the present, leaving many gaps in the historiographic records. Epigraphy, which is the study of inscriptions (epigraphs), helps to fill these gaps. In particular, the goal of epigraphy is to clarify the meanings of epigraphs; to classify their uses according to their dating and cultural contexts; and to study aspects of the writing, the writers, and their “consumers.” Although several research projects have recently been promoted for digitally storing and retrieving data and metadata about epigraphs, there has been no real attempt to apply data mining technologies to discover previously unknown cultural aspects. In this context, we propose to exploit the temporal dimension associated with epigraphs (dating) by applying a data mining method for novelty detection. The main goal is to discover relational novelty patterns, that is, patterns expressed as logical clauses describing significant variations (in frequency) over the different epochs, in terms of relevant features such as language, writing style, and material. As a case study, we considered the set of Inscriptiones Christianae Vrbis Romae stored in Epigraphic Database Bari, an epigraphic repository. Some patterns discovered by the data mining method were easily deciphered by experts since they captured relevant cultural changes, whereas others disclosed unexpected variations, which might be used to formulate new questions, thus expanding the research opportunities in the field of epigraphy.
The automatic discovery of process models can help to gain insight into various perspectives (e.g., the control flow or data perspective) of the process executions traced in an event log. Frequent pattern mining offers a means to build human-understandable representations of these process models. This paper describes the application of a multi-relational method of frequent pattern discovery to process mining. Multi-relational data mining is demanded by the variety of activities and actors involved in the process executions traced in an event log, which leads to a relational (or structural) representation of the process executions. A peculiarity of this work is the integration of disjunctive forms into the relational patterns discovered from event logs. The introduction of disjunctive forms enables relational patterns to express frequent variants of process models. The effectiveness of using relational patterns with disjunctions to describe process models with variants is assessed on real logs of process executions.
Most of the works on learning from networked data assume that the network is static. In this paper we consider a different scenario, where the network is dynamic, i.e. nodes/relationships can be added or removed and relationships can change their type over time. We assume that the “core” of the network is more stable than the “marginal” part of the network; nevertheless, it can change with time. These changes are of interest for this work, since they reflect a crucial step in the network evolution. Indeed, we tackle the problem of discovering evolution chains, which express the temporal evolution of the “core” of the network. To describe the “core” of the network, we follow a frequent pattern mining approach, with the critical difference that the frequency of a pattern is computed along a time period and not on a static dataset. The proposed method proceeds in two steps: 1) identification of changes through the discovery of emerging patterns; 2) composition of evolution chains by joining emerging patterns. We test the effectiveness of the method on both real and synthetic data.
Nowadays sensors are deployed everywhere in order to support real-time data applications. They periodically gather information along a number of attribute dimensions (e.g., temperature and humidity). Applications typically require monitoring these data, computing fast aggregates, predicting unknown data, or issuing alarms. To this aim, this paper introduces a recently defined spatio-temporal pattern, called trend cluster, and its multiple applications to summarize, interpolate and detect outliers in sensor network data. As an example, we illustrate the application of trend cluster discovery to air climate data monitoring.
In Document Image Understanding, one of the fundamental tasks is that of recognizing semantically relevant components in the layout extracted from a document image. This process can be automated by learning classifiers able to automatically label such components. However, the learning process assumes the availability of a huge set of documents whose layout components have been previously manually labeled. This contrasts with the more common situation in which we have only a few labeled documents and an abundance of unlabeled ones. In addition, labeling layout documents introduces further complexity due to the multi-modal nature of the components (textual and spatial information may coexist). In this work, we investigate the application of a relational classifier that works in the transductive setting. The relational setting is justified by the multi-modal nature of the data we are dealing with, while transduction is justified by the possibility of exploiting the large amount of information conveyed in the unlabeled layout components. The classifier bootstraps the labeling process in an iterative way: reliable classifications are used in subsequent iterative steps as training examples. The proposed computational solution has been evaluated on document images from the scientific literature.
Classical Greek and Latin culture is the very foundation of the identity of modern Europe. Today, a variety of modern subjects and disciplines have their roots in the classical world: from philosophy to architecture, from geometry to law. However, only a small fraction of the total production of texts from ancient Greece and Rome has survived to the present day, leaving many gaps in the historiographic records. Epigraphy, which is the study of inscriptions (epigraphs), aims to fill this gap. In particular, the goal of epigraphy is to clarify the meanings of epigraphs, classifying their uses according to dates and cultural contexts, and drawing conclusions about the writing and the writers. Indeed, epigraphs are a kind of cultural heritage for which several research projects have recently been promoted for the purposes of preservation, storage, indexing and on-line usage. In this paper, we describe the system EDB (Epigraphic Database Bari), which stores about 40,000 Christian inscriptions of Rome, including those published in the Inscriptiones Christianae Vrbis Romae septimo saeculo antiquiores, nova series editions. In addition to storing metadata, EDB provides the possibility of i) supporting information retrieval through a thesaurus-based query engine, ii) supporting time-based analysis of epigraphs in order to detect and represent novelties, and iii) geo-referencing epigraphs by exploiting a spatial database.
In traditional OLAP systems, roll-up and drill-down operations over data cubes exploit fixed hierarchies defined on discrete attributes, which play the roles of dimensions, and operate along them. New emerging application scenarios, such as sensor networks, have stimulated research on OLAP systems, where even continuous attributes are considered as dimensions of analysis, and hierarchies are defined over continuous domains. The goal is to avoid the prior definition of an ad-hoc discretization hierarchy along each OLAP dimension. Following this research trend, in this paper we propose a novel method, founded on a density-based hierarchical clustering algorithm, to support roll-up and drill-down operations over OLAP data cubes with continuous dimensions. The method hierarchically clusters dimension instances by also taking fact-table measures into account. Thus, we enhance the clustering effect with respect to the possible analysis. Experiments on two well-known multidimensional datasets clearly show the advantages of the proposed solution.
The task addressed in this paper consists of forecasting the future value of a time series variable at a certain geographical location, based on historical data of this variable collected at both this and other locations. In general, this time series forecasting task can be performed by using machine learning models which transform the original problem into a regression task. The target variable is the future value of the series, while the predictors are the past values of the series within a time window of length p. In this paper, we convey information on both the spatial and temporal historical data to the predictive models, with the goal of improving their forecasting ability. We build technical indicators, which are summaries of certain properties of the spatio-temporal data, grouped in spatio-temporal clusters, and use them to enhance the forecasting ability of regression models. A case study with air temperature data is presented.
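A minimal sketch of this feature construction, with synthetic data standing in for the target and neighboring series, a random forest standing in for the regression model, and illustrative indicator choices (neighborhood mean and standard deviation):

```python
# Lagged temporal predictors plus spatial indicators for forecasting.
# Data, window length and the neighbour set are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
target = rng.normal(20, 2, 200)               # series at the target site
neighbour = target + rng.normal(0, 0.5, 200)  # correlated nearby series
w = 3                                         # length of the lag window

X, y = [], []
for t in range(w, len(target)):
    lags = target[t - w:t]                        # temporal predictors
    indicators = [neighbour[t - w:t].mean(),      # spatial summaries
                  neighbour[t - w:t].std()]
    X.append(np.concatenate([lags, indicators]))
    y.append(target[t])                           # value to forecast

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(np.array(X[:-20]), np.array(y[:-20]))
print(model.score(np.array(X[-20:]), np.array(y[-20:])))
```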
The problem of extracting structured data (i.e. lists, record sets, tables, etc.) from the Web has traditionally been approached by taking into account either the underlying markup structure of a Web page or its visual structure. However, empirical results show that considering the HTML structure and the visual cues of a Web page independently does not generalize well. We propose a new hybrid method to extract general lists from the Web. It employs both general assumptions on the visual rendering of lists and the structural representation of the items contained in them. We show that our method significantly outperforms existing methods across a varied Web corpus.
Sequential pattern mining is an important data mining task with applications in basket analysis, the World Wide Web, medicine and telecommunications. This task is challenging because sequence databases are usually large, with many long sequences, and the number of possible sequential patterns to mine can be exponential. We propose a new sequential pattern mining algorithm, called FAST, which employs a representation of the dataset with indexed sparse id-lists to speed up the support counting of sequential patterns. We also use a lexicographic tree to improve the efficiency of candidate generation. FAST mines the complete set of patterns by greatly reducing the effort for support counting and candidate sequence generation. Experimental results on artificial and real data show that our method outperforms existing methods in the literature by up to one or two orders of magnitude on large datasets.
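A minimal sketch of id-list based support counting on toy data: each item is mapped to the sequences and positions where it occurs, so the support of a candidate pattern can be computed without rescanning the database. The id-list layout here is illustrative, not FAST's actual data structure.

```python
# Id-list support counting sketch: the support of a candidate like
# <a, b> is computed by intersecting per-item id-lists instead of
# rescanning the sequence database. Data are illustrative.
from collections import defaultdict

database = {0: ["a", "b", "c"], 1: ["a", "c", "b"], 2: ["b", "a", "b"]}

# Build id-lists: item -> {sequence id -> positions of the item}.
idlists = defaultdict(lambda: defaultdict(list))
for sid, seq in database.items():
    for pos, item in enumerate(seq):
        idlists[item][sid].append(pos)

def support(pattern):
    """Count sequences containing the items of `pattern` in order."""
    count = 0
    for sid in set.intersection(*(set(idlists[i]) for i in pattern)):
        last = -1
        ok = True
        for item in pattern:
            nxt = [p for p in idlists[item][sid] if p > last]
            if not nxt:
                ok = False
                break
            last = min(nxt)
        count += ok
    return count

print(support(["a", "b"]))  # 3: <a, b> occurs in all three sequences
```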
Spatial autocorrelation is the correlation among data values, strictly due to the relative location proximity of the objects that the data refer to. This statistical property clearly indicates a violation of the assumption of observation independence - a pre-condition assumed by most of the data mining and statistical models. Inappropriate treatment of data with spatial dependencies could obfuscate important insights when spatial autocorrelation is ignored. In this paper, we propose a data mining method that explicitly considers autocorrelation when building the models. The method is based on the concept of predictive clustering trees (PCTs). The proposed approach combines the possibility of capturing both global and local effects and dealing with positive spatial autocorrelation. The discovered models adapt to local properties of the data, providing at the same time spatially smoothed predictions. Results show the effectiveness of the proposed solution.
microRNAs (miRNAs) are an important class of regulatory factors controlling gene expressions at post-transcriptional level. Studies on interactions between different miRNAs and their target genes are of utmost importance to understand the role of miRNAs in the control of biological processes. This paper contributes to these studies by proposing a method for the extraction of co-clusters of miRNAs and messenger RNAs (mRNAs). Different from several already available co-clustering algorithms, our approach efficiently extracts a set of possibly overlapping, exhaustive and hierarchically organized co-clusters. The algorithm is well-suited for the task at hand since: i) mRNAs and miRNAs can be involved in different regulatory networks that may or may not be co-active under some conditions, ii) exhaustive co-clusters guarantee that possible co-regulations are not lost, iii) hierarchical browsing of co-clusters facilitates biologists in the interpretation of results. Results on synthetic and on real human miRNA:mRNA data show the effectiveness of the approach.
The problem of accurately predicting the energy production from renewable sources has recently received increasing attention from both the industrial and the research communities. It presents several challenges, such as coping with the rate at which data are provided by sensors, the heterogeneity of the collected data and power plant efficiency, as well as uncontrollable factors, such as weather conditions and user consumption profiles. In this paper we describe Vi-POC (Virtual Power Operating Center), a project conceived to assist energy producers and, more generally, decision makers in the energy market. We present the Vi-POC project and how we address the challenges posed by the specific application domain. The solutions we propose have roots both in big data management and in stream data mining.
Clustering geosensor data is a problem that has recently attracted a large amount of research. In this paper, we focus on clustering geophysical time series data measured by a geo-sensor network. Clusters are built by accounting for both the spatial and the temporal information of the data. We use clusters to produce globally meaningful information from the time series obtained by individual sensors. The cluster information is integrated into the ARIMA model in order to yield accurate forecasting results. Experiments investigate the trade-off between accuracy and efficiency of the proposed algorithm.
Background: MicroRNAs (miRNAs) are small non-coding RNAs which play a key role in the post-transcriptional regulation of many genes. Elucidating miRNA-regulated gene networks is crucial for understanding the mechanisms and functions of miRNAs in many biological processes, such as cell proliferation, development, differentiation and cell homeostasis, as well as in many types of human tumors. To this aim, we have recently presented the biclustering method HOCCLUS2 for the discovery of miRNA regulatory networks. Experiments on predicted interactions revealed that the statistical and biological consistency of the obtained networks is negatively affected by the poor reliability of the output of miRNA target prediction algorithms. Recently, some learning approaches have been proposed to learn to combine the outputs of distinct prediction algorithms and improve their accuracy. However, the application of classical supervised learning algorithms presents two challenges: i) the presence of only positive examples in datasets of experimentally verified interactions and ii) the unbalanced number of labeled and unlabeled examples. Results: We present a learning algorithm that learns to combine the scores returned by several prediction algorithms, by exploiting the information conveyed by (only positively labeled) validated and unlabeled examples of interactions. To face the two related challenges, we resort to a semi-supervised ensemble learning setting. Results obtained using miRTarBase as the set of labeled (positive) interactions and mirDIP as the set of unlabeled interactions show a significant improvement, over competitive approaches, in the quality of the predictions. This solution also improves the effectiveness of HOCCLUS2 in discovering biologically realistic miRNA:mRNA regulatory networks from large-scale prediction data. Using the miR-17-92 gene cluster family as a reference system and comparing the results with previous experiments, we find a large increase in the number of biclusters significantly enriched in pathways, consistent with miR-17-92 functions. Conclusions: The proposed approach proves to be fundamental for the computational discovery of miRNA regulatory networks from large-scale predictions. This paves the way to the systematic application of HOCCLUS2 for a comprehensive reconstruction of all the possible multiple interactions established by miRNAs in regulating the expression of gene networks, which would otherwise be impossible to reconstruct by considering only experimentally validated interactions.
Information acquisition in a pervasive sensor network is often affected by faults due to power outages at nodes, wrong time synchronizations, interference, network transmission failures, sensor hardware issues or excessive energy consumption for communications. These issues impose a trade-off between the precision of the measurements and the costs of communication and processing, which are directly proportional to the number of sensors and/or transmissions. We present a spatio-temporal interpolation technique which allows an accurate estimation of missing sensor network data by computing the inverse distance weighting of the trend cluster representation of the transmitted data. The trend cluster interpolation has been evaluated on a real climate sensor network in order to prove the efficacy of our solution in reducing the amount of transmissions while guaranteeing accurate estimation of missing data.
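A minimal sketch of the inverse distance weighting step, assuming each trend cluster is summarized by a representative position and a trend value at the time point to be interpolated; coordinates, values and the power parameter are illustrative.

```python
# Inverse distance weighting over trend-cluster summaries: a missing
# reading at location q is estimated from surrounding trend values,
# weighted by the inverse of their distance to q.
import math

def idw(query, sites, values, power=2):
    """Inverse-distance-weighted estimate at `query` from known sites."""
    num = den = 0.0
    for site, v in zip(sites, values):
        d = math.dist(query, site)
        if d == 0:
            return v  # query coincides with a known site
        w = 1.0 / d ** power
        num += w * v
        den += w
    return num / den

# Representative positions of three trend clusters and their trend
# values at the time point to be interpolated (illustrative).
sites = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
values = [21.5, 19.0, 23.2]
print(round(idw((2.0, 3.0), sites, values), 2))
```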
With the increasing amount of information in electronic form, the fields of Machine Learning and Data Mining continue to grow by providing new advances in theory, applications and systems. The aim of this paper is to consider some recent theoretical aspects of and approaches to ML and DM, with an emphasis on Italian research.
The Geographically Weighted Regression (GWR) is a method of spatial statistical analysis which allows the exploration of geographical differences in the linear effect of one or more predictor variables upon a response variable. The parameters of this linear regression model are locally determined for every point of the space by processing a sample of distance decay weighted neighboring observations. While this use of locally linear regression has proved appealing in the area of spatial econometrics, it also presents some limitations. First, the form of the GWR regression surface is globally defined over the whole sample space, although the parameters of the surface are locally estimated for every space point. Second, the GWR estimation is founded on the assumption that all predictor variables are equally relevant in the regression surface, without dealing with spatially localized collinearity problems. Third, time dependence among observations taken at consecutive time points is not considered as information-bearing for future predictions. In this paper, a tree-structured approach is adapted to recover the functional form of a GWR model only at the local level. A stepwise approach is employed to determine the local form of each GWR model by selecting only the most promising predictors. Parameters of these predictors are estimated at every point of the local area. Finally, a time-space transfer technique is tailored to capitalize on the time dimension of GWR trees learned in the past and to adapt them towards the present. Experiments confirm that the tree-based construction of GWR models improves both the local estimation of parameters of GWR and the global estimation of parameters performed by classical model trees. Furthermore, the effectiveness of the time-space transfer technique is investigated.
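For reference, the standard GWR estimator underlying this line of work is reported below; the Gaussian kernel is one common weighting choice, and the tree-based method of the paper determines the local form of the model rather than applying this estimator globally.

```latex
% Reference form of the GWR estimator: at each location u_i the model
% parameters are obtained by weighted least squares, with weights that
% decay with the distance d_{ij} between u_i and the observation sites
% (a Gaussian kernel with bandwidth h is one common choice).
\[
  \hat{\beta}(u_i) \;=\; \bigl(X^{\top} W(u_i)\, X\bigr)^{-1} X^{\top} W(u_i)\, y,
  \qquad
  W_{jj}(u_i) \;=\; \exp\!\left(-\frac{d_{ij}^{2}}{2h^{2}}\right)
\]
```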
We present an algorithm for hierarchical multi-label classification (HMC) in a network context. It is able to classify instances that may belong to multiple classes at the same time and consider the hierarchical organization of the classes. It assumes that the instances are placed in a network and uses information on the network connections during the learning of the predictive model. Many real world prediction problems have classes that are organized hierarchically and instances that can have pairwise connections. One example is web document classification, where topics (classes) are typically organized into a hierarchy and documents are connected by hyperlinks. Another example, which is considered in this paper, is gene/protein function prediction, where genes/proteins are connected and form protein-to-protein interaction (PPI) networks. Network datasets are characterized by a form of autocorrelation, where the value of a variable at a given node depends on the values of variables at the nodes it is connected with. Combining the hierarchical multi-label classification task with network prediction is thus not trivial and requires the introduction of the new concept of network autocorrelation for HMC. The proposed algorithm is able to profitably exploit network autocorrelation when learning a tree-based prediction model for HMC. The learned model is in the form of a Predictive Clustering Tree (PCT) and predicts multiple (hierarchically organized) labels at the leaves. Experiments show the effectiveness of the proposed approach for different problems of gene function prediction, considering different PPI networks. The results show that different networks introduce different benefits in different problems of gene function prediction.
Link prediction in network data is a data mining task which is receiving significant attention due to its applicability in various domains. An example can be found in social network analysis, where the goal is to identify connections between users. Another application can be found in computational biology, where the goal is to identify previously unknown relationships among biological entities. For example, the identification of regulatory activities (links) among genes would allow biologists to discover possible gene regulatory networks. In the literature, several approaches for link prediction can be found, but they often fail to simultaneously consider all the possible criteria (e.g. network topology, node properties, autocorrelation among nodes). In this paper we present a semi-supervised data mining approach which learns to combine the scores returned by several link prediction algorithms. The proposed solution exploits both a small set of validated examples of links and a huge set of unlabeled links. The application we consider regards the identification of links between genes and miRNAs, which can contribute to the understanding of their roles in many biological processes. The specific application requires learning from only positively labeled examples of links and coping with the high imbalance between labeled and unlabeled examples. Results show a significant improvement with respect to single prediction algorithms and with respect to a baseline combination.
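A minimal sketch of the score-combination idea, using one common positive-unlabeled heuristic (sampling tentative negatives from the unlabeled set); the scores, the sampling rule and the logistic meta-classifier are illustrative stand-ins, not the paper's actual learner.

```python
# Score combination sketch: the outputs of several link predictors
# become features of a meta-classifier trained on validated (positive)
# links plus a sample of unlabeled links treated as tentative
# negatives; a common heuristic for positive-unlabeled settings.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Rows: candidate gene-miRNA links; columns: scores of 3 predictors.
positives = rng.uniform(0.5, 1.0, size=(30, 3))   # validated links
unlabeled = rng.uniform(0.0, 1.0, size=(300, 3))  # everything else

# Sample tentative negatives from the unlabeled set.
idx = rng.choice(len(unlabeled), 30, replace=False)
tentative_neg = unlabeled[idx]

X = np.vstack([positives, tentative_neg])
y = np.array([1] * len(positives) + [0] * len(tentative_neg))

meta = LogisticRegression().fit(X, y)
combined_scores = meta.predict_proba(unlabeled)[:, 1]  # ranked links
print(combined_scores[:5].round(3))
```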
Networked data are nowadays collected in various application domains, such as social networks, biological networks, sensor networks, spatial networks, peer-to-peer networks, etc. Recently, the application of data stream mining to networked data, in order to study their evolution over time, has been receiving increasing attention in the research community. Following this main stream of research, we propose an algorithm for mining ranking models from networked data which may evolve over time. In order to properly deal with the concept drift problem, the algorithm exploits an ensemble learning approach which allows us to weight the importance of ranking models learned from past data when ranking new data. The learned models are able to take the network autocorrelation into account, that is, the statistical dependency between the values of the same attribute on related nodes. Empirical results prove the effectiveness of the proposed algorithm and show that it performs better than other approaches proposed in the literature.
Nowadays ubiquitous sensor stations are deployed worldwide in order to measure several geophysical variables (e.g. temperature, humidity, light) for a growing number of ecological and industrial processes. Although these variables are, in general, measured over large zones and long (potentially unbounded) periods of time, stations cannot cover every spatial location. On the other hand, due to their huge volume, the data produced cannot be entirely recorded for future analysis. In this scenario, summarization, i.e. the computation of aggregates of data, can be used to reduce the amount of produced data stored on disk, while interpolation, i.e. the estimation of unknown data at each location of interest, can be used to supplement station records. We illustrate a novel data mining solution, named interpolative clustering, that has the merit of addressing both these tasks in time-evolving, multivariate geophysical applications. It yields a time-evolving clustering model in order to summarize geophysical data, and computes a weighted linear combination of cluster prototypes in order to predict data. Clustering is done by accounting for the local presence of the spatial autocorrelation property in the geophysical data. The weights of the linear combination are defined so as to reflect the inverse distance of the unseen data to each cluster geometry. The cluster geometry is represented through shape-dependent sampling of the geographic coordinates of the clustered stations. Experiments performed with several data collections investigate the trade-off between the summarization capability and the predictive accuracy of the presented interpolative clustering algorithm.
In this paper we propose a new knowledge management task which aims to map Web pages to their corresponding records in a structured database. For example, the DBLP database contains records for many computer scientists, and most of these persons have public Web pages; if we can map a database record to the appropriate Web page, then the new information could be used to further describe the person's database record. To accomplish this goal we employ link paths, which contain anchor texts from multiple paths through the Web ending at the Web page in question. We hypothesize that the information from these link paths can be used to generate an accurate Web page to database record mapping. Experiments on two large, real-world data sets (DBLP and IMDB for the structured data; computer science faculty members' Web pages and official movie homepages for the Web page data) show that our method does provide an accurate mapping. Finally, we conclude by issuing a call for further research on this promising new task.
Technologies in available biomedical repositories do not yet provide adequate mechanisms to support the understanding and analysis of the stored content. In this project we investigate this problem under different perspectives. Our contribution is the design of computational solutions for the analysis of biomedical documents and images. These integrate sophisticated technologies and innovative approaches of Information Extraction, Data Mining and Machine Learning to perform descriptive tasks of knowledge discovery from biomedical repositories.
The growing integration of wind turbines into the power grid can only be balanced with precise forecasts of upcoming energy production. This information serves as the basis for operation and management strategies for a reliable and economical integration into the power grid. A precise forecast needs to overcome the problems of variable energy production caused by fluctuating weather conditions. In this paper, we define a data mining approach which processes a past set of wind power measurements of a wind turbine and extracts a robust prediction model. We resort to a time series clustering algorithm in order to extract a compact, informative representation of the time series of wind power measurements in the past set. We use the cluster prototypes for predicting the upcoming wind power of the turbine. We illustrate a case study with real data collected from a wind turbine installed in the Apulia region.
The detection of congested areas can play an important role in the development of traffic management systems. Usually, the problem is investigated under two main perspectives, which concern the representation of space and the shape of the dense regions, respectively. However, the adoption of movement tracking technologies enables the generation of mobility data in a streaming style, which adds an aspect of complexity not yet addressed in the literature. We propose a computational solution to mine dense regions in the urban space from mobility data streams. Our proposal adopts a stream data mining strategy which enables the detection of two types of dense regions, one based on spatial closeness, the other based on temporal proximity. We prove the viability of the approach on vehicular data streams in the urban space.
One of the recently addressed research directions focuses on the issues raised by the diffusion of highly dynamic on-line information, particularly on the problem of mining topic evolutions from news. Among several applications, risk identification and analysis may exploit the mining of topic evolutions from news in order to support law enforcement officers in risk and threat assessment. Assimilating the concept of topic to the concept of crime typology, represented by a group of "similar" criminals, it is possible to apply topic evolution mining techniques to discover evolutions of criminal behaviors over time. To this aim, we incrementally analyze streams of publicly available news about criminals (e.g. daily police reports, public court records, legal instruments) in order to identify clusters of similar criminals and represent their evolution over time. Experimental results on both real-world and synthetically generated datasets prove the effectiveness of the proposed approach.
In recent years, improvements in ubiquitous technologies and sensor networks have motivated the application of data mining techniques to network-organized data. Network data describe entities represented by nodes, which may be connected with (related to) each other by edges. Many network datasets are characterized by a form of autocorrelation, where the value of a variable at a given node depends on the values of variables at the nodes it is connected with. This phenomenon is a direct violation of the assumption that data are independently and identically distributed (i.i.d.). At the same time, it offers a unique opportunity to improve the performance of predictive models on network data, as inferences about one entity can be used to improve inferences about related entities. In this work, we propose a method for learning to rank from network data when the data distribution may change over time. The learned models can be used to predict the ranking of the nodes in the network for new time periods. The proposed method modifies the SVMRank algorithm in order to emphasize the importance of models learned in time periods during which the data follow a distribution that is similar to that observed in the new time period. We evaluate our approach on several real-world problems of learning to rank from network data, coming from the area of sensor networks.
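A minimal sketch of the weighting idea, with a histogram-overlap similarity standing in for the actual distribution comparison and toy ranking scores standing in for the SVMRank-based models:

```python
# Time-weighted ensemble sketch: ranking models learned in past periods
# are combined with weights that grow with the similarity between each
# period's data distribution and the current one. The similarity
# measure, data and scores are illustrative.
import numpy as np

def similarity(past, current, bins=10, lo=0.0, hi=1.0):
    """Histogram overlap between two samples of a feature."""
    h1, _ = np.histogram(past, bins=bins, range=(lo, hi), density=True)
    h2, _ = np.histogram(current, bins=bins, range=(lo, hi), density=True)
    return np.minimum(h1, h2).sum() * (hi - lo) / bins

gen = np.random.default_rng(1)
current = gen.beta(2, 5, 500)                   # current period data
past_periods = [gen.beta(2, 5, 500),            # similar distribution
                gen.beta(5, 2, 500)]            # drifted distribution
past_scores = [np.array([0.9, 0.2, 0.5]),       # each past model's
               np.array([0.1, 0.8, 0.4])]       # scores for 3 nodes

weights = np.array([similarity(p, current) for p in past_periods])
weights /= weights.sum()
ensemble = sum(w * s for w, s in zip(weights, past_scores))
print(weights.round(2), np.argsort(-ensemble))  # final node ranking
```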
One of the recently addressed research directions focuses on the problem of mining topic evolutions from textual documents. Following this main stream of research, in this paper we face the different, but related, problem of mining the topic evolution of entities (persons, companies, etc.) mentioned in the documents. To this aim, we incrementally analyze streams of time-stamped documents in order to identify clusters of similar entities and represent their evolution over time. The proposed solution is based on the concept of temporal profiles of entities, extracted at periodic instants in time. Experiments performed both on synthetic and real-world datasets prove that the proposed framework is a valuable tool to discover underlying evolutions of entities, and results show significant improvements over the considered baseline methods.
Motif discovery in biological sequences is an important field in bioinformatics. Most of the scientific research focuses on the de novo discovery of single motifs, but biological activities are typically co-regulated by several factors and this feature is properly reflected by higher order structures, called composite motifs, or cis-regulatory modules or simply modules. A module is a set of motifs, constrained both in number and location, which is statistically overrepresented and hence may be indicative of a biological function. Several methods have been studied for the de novo discovery of modules. We propose an alternative approach based on the discovery of rules that define strong spatial associations between single motifs and suggest the structure of a module. Single motifs involved in the mined rules might be either de novo discovered by motif discovery algorithms or taken from databases of single motifs. Rules are expressed in a first-order logic formalism and are mined by means of an inductive logic programming system. We also propose computational solutions to two issues: the hard discretization of numerical inter-motif distances and the choice of a minimum support threshold. All methods have been implemented and integrated in a tool designed to support biologists in the discovery and characterization of composite motifs. A case study is reported in order to show the potential of the tool.
Recent advances in tracking technologies enable the collection of spatio-temporal data in the form of trajectories. The analysis of such data can convey knowledge in prominent applications, and mining groups of moving objects turns out to be a valuable means to model their movement. Existing approaches pay particular attention to groups where objects are close and move together or follow similar trajectories, assuming that movement cannot change over time. Instead, we observe that groups can be of interest also when objects are spatially distant and have different but inter-related movements: objects can start from different places and join together to move towards a common location. To take inter-related movements into account, we have to analyze the objects jointly, follow their respective movements and consider changes of movement over time. Motivated by this, we introduce the notion of communities and propose a computational solution to discover them. The method is structured in three steps. The first step performs feature extraction to elicit the inter-related movements between the objects. The second one leverages a tree structure in order to group objects with similar inter-related movements. In the third step, these groupings are used to mine communities as groups of objects which exhibit inter-related movements over time. We evaluate our approach on real datasets and compare it with existing algorithms.
Multi-Relational Data Mining (MRDM) refers to the process of discovering implicit, previously unknown and potentially useful information from data scattered in multiple tables of a relational database. Following the mainstream of MRDM research, we tackle the regression task, where the goal is to examine samples of past experience with known continuous answers (response) and generalize to future cases through an inductive process. Mr-SMOTI, the solution we propose, resorts to the structural approach in order to recursively partition data stored in a tightly-coupled database and build a multi-relational model tree which captures the linear dependence between the response variable and one or more explanatory variables. The model tree is induced top-down by choosing, at each step, either to partition the training space or to introduce a regression variable in the linear models associated with the leaves. The tight coupling with the database makes the knowledge on data structures (foreign keys) available free of charge to guide the search in the multi-relational pattern space. Experiments on artificial and real databases demonstrate that in general Mr-SMOTI outperforms both SMOTI and M5', which are two propositional model tree induction systems, as well as TILDE-RT, a state-of-the-art structural model tree induction system.
Network reconstruction from data is a data mining task which is receiving significant attention due to its applicability in several domains. For example, it can be applied in social network analysis, where the goal is to identify connections among users and, thus, sub-communities. Another example can be found in computational biology, where the goal is to identify previously unknown relationships among biological entities and, thus, relevant interaction networks. This task is usually solved by adopting methods for link prediction and for the identification of relevant sub-networks. Focusing on the biological domain, in [4] and [3] we proposed two methods for learning to combine the output of several link prediction algorithms and for the identification of biologically significant interaction networks involving two important types of RNA molecules, i.e. microRNAs (miRNAs) and messenger RNAs (mRNAs). The relevance of this application comes from the importance of identifying (previously unknown) regulatory and cooperation activities for the understanding of the biological roles of miRNAs and mRNAs. In this paper, we review the contribution given by the combination of the proposed methods for network reconstruction, and the solutions we adopt in order to meet specific challenges coming from the domain we consider.
In predictive data mining tasks on network data, we should account for the autocorrelation of both the independent variables and the dependent variable, which can be observed in the neighborhood of a target node and on the node itself. The prediction for a target node should be based on the values of its neighbours, which might themselves be unavailable. To address this problem, the values of the neighbours should be inferred collectively. We present a novel computational solution to perform collective inference in a network regression task. We define an iterative algorithm that makes regression inferences about the predictions of multiple nodes simultaneously and feeds back the more reliable predictions made by the previous models in the labeled network. Experiments investigate the effectiveness of the proposed algorithm on spatial networks.
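A minimal sketch of the iterative collective-inference loop follows, under stated assumptions: unknown nodes are predicted from the mean of their labeled neighbours, reliability is measured by how many labeled neighbours a node has, and only the most reliable predictions are fed back each round. The paper's actual regression model and reliability criterion may differ.

```python
import numpy as np

def collective_regression(adj, values, known, n_iter=10, keep=0.2):
    """adj: n x n boolean adjacency matrix; values: node values (any
    placeholder where unknown); known: boolean mask of labeled nodes."""
    v = values.copy()
    labeled = known.copy()
    for _ in range(n_iter):
        preds, conf, idx = [], [], []
        for i in np.where(~labeled)[0]:
            nbrs = np.where(adj[i] & labeled)[0]
            if len(nbrs) == 0:
                continue
            preds.append(v[nbrs].mean())   # neighbourhood average as prediction
            conf.append(len(nbrs))         # more labeled neighbours = more reliable
            idx.append(i)
        if not idx:
            break
        # Feed back only the most reliable predictions for the next round.
        order = np.argsort(conf)[::-1][:max(1, int(keep * len(idx)))]
        for k in order:
            v[idx[k]] = preds[k]
            labeled[idx[k]] = True
    return v
```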
In traditional OLAP systems, roll-up and drill-down operations over data cubes exploit fixed hierarchies defined on discrete attributes, which play the role of dimensions, and operate along them. In recent years, however, mostly due to novel and emerging application scenarios such as sensor and data stream management tools, a new tendency has emerged of considering even continuous attributes as dimensions, with hierarchical members becoming continuous accordingly. A clear advantage of this emerging approach is that it avoids the beforehand definition of an ad-hoc discretization hierarchy along each OLAP dimension. Following this trend, in this paper we propose a novel method for effectively and efficiently supporting roll-up and drill-down operations over OLAP data cubes with continuous dimensions, via a density-based hierarchical clustering algorithm. The algorithm hierarchically clusters dimension instances together, also taking fact-table measures into account in order to enhance the clustering effect with respect to the possible analyses. Experiments on two well-known multidimensional datasets clearly show the advantages of the proposed solution.
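The idea of replacing a predefined discretization hierarchy with a clustering-derived one can be sketched briefly. Here agglomerative (Ward) clustering stands in for the paper's density-based algorithm, and the continuous "temperature" dimension is synthetic: cutting the dendrogram at different depths yields nested groupings usable as roll-up and drill-down levels.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic continuous dimension values (e.g. temperatures of fact rows).
temps = np.sort(np.random.default_rng(2).uniform(10, 40, 50)).reshape(-1, 1)

Z = linkage(temps, method="ward")                # hierarchy over the dimension
fine = fcluster(Z, t=8, criterion="maxclust")    # drill-down level: ~8 members
coarse = fcluster(Z, t=3, criterion="maxclust")  # roll-up level: ~3 members
print(len(set(fine)), len(set(coarse)))          # nested groupings, no ad-hoc bins
```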
Emerging real-life applications, such as environmental compliance, ecological studies and meteorology, are characterized by real-time data acquisition through remote sensor networks. The most important aspect of the sensor readings is that they comprise a space dimension and a time dimension, both of which are information bearing. Additionally, readings usually arrive at a rapid rate in a continuous, unbounded stream. Streaming prevents us from storing all readings and performing multiple scans of the entire data set. The drift of the data distribution poses the additional problem of mining patterns which may change over time. We address these challenges for trend cluster discovery, that is, the discovery of clusters of spatially close sensors which transmit readings whose temporal variation, called trend polyline, is similar along the time horizon of a window. We present a stream framework which segments the stream into equally-sized windows, computes intra-window trend clusters online and stores these trend clusters in a database. Trend clusters can then be queried offline at any time, to determine trend clusters along larger windows (i.e. windows of windows). Experiments with several streams demonstrate the effectiveness of the proposed framework in discovering trend clusters that are accurate and relevant to human analysts.
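The following is an illustrative sketch of window-based trend cluster discovery: sensors are grouped when they are spatially close and their series over the window follow a similar trend. The greedy seed-based grouping, the z-normalized trend comparison and the thresholds are assumptions for illustration, not the framework's actual procedure.

```python
import numpy as np

def trend_clusters(coords, window, d_max=1.0, t_max=0.5):
    """coords: n x 2 sensor positions; window: n x w readings per sensor."""
    n = len(coords)
    unassigned, clusters = set(range(n)), []
    while unassigned:
        seed = unassigned.pop()
        cluster = [seed]
        for j in list(unassigned):
            close = np.linalg.norm(coords[seed] - coords[j]) <= d_max
            # Compare temporal variations as z-normalized series.
            zs = (window[seed] - window[seed].mean()) / (window[seed].std() + 1e-9)
            zj = (window[j] - window[j].mean()) / (window[j].std() + 1e-9)
            similar = np.abs(zs - zj).mean() <= t_max
            if close and similar:
                cluster.append(j)
                unassigned.remove(j)
        # The cluster's trend polyline is the mean series of its members.
        clusters.append((cluster, window[cluster].mean(axis=0)))
    return clusters
```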
Nowadays, ubiquitous sensor stations are deployed to measure geophysical fields for several ecological and environmental processes. Although these fields are measured at the specific locations of the stations, geo-statistical problems demand inference processes to supplement, smooth and standardize the recorded data. We study how predictive regional trees can supplement data sampled periodically in a ubiquitous sensing scenario. Data records that are similar to each other are clustered according to a rectangular decomposition of the region of analysis; a predictive model is associated to the region covered by each cluster. The cluster model depicts the spatial variation of the data over a map, while the predictive model supplements any unknown record that is recognized as belonging to a cluster region. We illustrate an incremental algorithm to yield time-evolving predictive regional trees, which accounts for the fact that the statistical properties of the recorded data may change over time. This algorithm is evaluated with spatio-temporal data collections.
Process mining techniques are able to extract knowledge from event logs commonly available in today's information systems. These techniques provide new means to discover, monitor, and improve processes in a variety of application domains. There are two main drivers for the growing interest in process mining. On the one hand, more and more events are being recorded, thus, providing detailed information about the history of processes. On the other hand, there is a need to improve and support business processes in competitive and rapidly changing environments. This manifesto is created by the IEEE Task Force on Process Mining and aims to promote the topic of process mining. Moreover, by defining a set of guiding principles and listing important challenges, this manifesto hopes to serve as a guide for software developers, scientists, consultants, business managers, and end-users. The goal is to increase the maturity of process mining as a new tool to improve the (re)design, control, and support of operational business processes.
Processes are everywhere in our daily lives. More and more information about the executions of processes is recorded in event logs by several information systems. Process mining techniques are used to analyze the historic information hidden in event logs and to provide surprising insights for managers, system developers, auditors, and end users. While existing process mining techniques mainly analyze full process instances (cases), this paper extends the analysis to running cases, which have not yet completed. For running cases, process mining can be used to predict future events. This forecasting ability can provide insights for conformance checking and support decision making. This paper details a process mining approach which uses predictive clustering to equip an execution scenario with a prediction model. This model accounts for the recent events of running cases to predict the characteristics of future events. Several tests with benchmark logs investigate the viability of the proposed approach.
A paper document processing system is an information system component which transforms information on printed or handwritten documents into a computer-revisable form. In intelligent systems for paper document processing this information capture process is based on knowledge of the specific layout and logical structures of the documents. In this project we design a framework which combines technologies for the acquisition and storage of printed documents with knowledge-based techniques to represent and understand the information they contain. The innovative aspects of this work strengthen its applicability to tools that have been developed for building digital libraries.
Networks are data structures used more and more frequently for modeling interactions in social and biological phenomena, as well as between various types of devices, tools and machines. They can be either static or dynamic, depending on whether the modeled interactions are fixed or change over time. Static networks have been extensively investigated in data mining, while fewer studies have focused on dynamic networks and on how to discover complex patterns in large, evolving networks. In this paper we focus on the task of discovering changes in evolving networks, and we overcome some limits of existing methods (i) by resorting to a relational approach for representing networks characterized by heterogeneous nodes and/or heterogeneous relationships, and (ii) by proposing a novel algorithm for discovering changes in the structure of a dynamic network over time. Experimental results and comparisons with existing approaches on real-world datasets prove the effectiveness and efficiency of the proposed solution, and provide some insights on the effect of some parameters in discovering and modeling the evolution of the whole network, or a subpart of it.
In spatial domains, objects present high heterogeneity and are connected by several relationships to form complex networks. Mining spatial networks can provide information on both the objects and their interactions. In this work we propose a descriptive data mining approach to discover relational disjunctive patterns in spatial networks. Relational disjunctive patterns make it possible to represent spatial relationships that occur simultaneously with, or alternatively to, other relationships. Pruning of the search space is based on the anti-monotonicity property of support. The application to the problem of urban accessibility proves the viability of the proposal.
The rapid growth in the amount of spatial data available in Geographical Information Systems has given rise to a substantial demand for data mining tools which can help uncover interesting spatial patterns. We advocate the relational mining approach to spatial domains, due to both the various forms of spatial correlation which characterize these domains and the need to handle spatial relationships in a systematic way. We present some major achievements in this research direction and point out some open problems.
We define a new kind of stream cube, called geo-trend stream cube, which uses trends to aggregate a numeric measure streamed by a sensor network and organized around space and time dimensions. We specify space-time roll-up and drill-down operations to explore trends at coarser and finer grained hierarchical views.
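A hedged sketch of a space roll-up on trend summaries: the trend of a coarser region is derived from the trends of its child cells, here simply by averaging, which is an assumption made for illustration rather than the cube's defined aggregation operator.

```python
import numpy as np

def roll_up(child_trends, hierarchy):
    """child_trends: {cell id: trend as 1-D array};
    hierarchy: {parent region: [child cell ids]}."""
    return {parent: np.mean([child_trends[c] for c in cells], axis=0)
            for parent, cells in hierarchy.items()}

cells = {"c1": np.array([1., 2., 3.]), "c2": np.array([3., 2., 1.])}
print(roll_up(cells, {"district": ["c1", "c2"]}))  # {'district': array([2., 2., 2.])}
```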
In recent years, growing interest has been devoted to trajectory data mining applications that support mobility prediction, with the aim of anticipating or pre-fetching possible services. Proposed approaches typically consider only the spatio-temporal information provided by the collected trajectories. However, in some scenarios, such as tourist support, semantic information expressing the needs and interests of the user (tourist) should be taken into account. This semantic information can be extracted from textual documents already consulted by the tourists. In this paper, we present the application of a time-slice density estimation approach that permits suggesting/predicting the next destination of the tourist. In particular, time-slice density estimation measures the rate of change of the tourist's interests at a given geographical position over a user-defined time horizon. Tourist interests depend both on the geographical position of the tourist with respect to a reference system and on the semantic information provided by geo-referenced documents associated with the visited sites.
We consider distributed computing environments where geo-referenced sensors feed a unique central server with numeric, uni-dimensional data streams. Knowledge discovery from these geographically distributed data streams poses several challenges, including the requirement of data summarization in order to store the streamed data in a central server with limited memory. We propose an enhanced segmentation algorithm which groups data sources in the same spatial cluster if they stream data which evolve according to a close trajectory over time. A trajectory is constructed by tracking only the data points which represent a change of trend in the associated spatial cluster. Clusters of trajectories are discovered on-the-fly and stored in the database. Experiments prove the effectiveness and accuracy of our approach.
Advances in pervasive computing and sensor technologies have paved the way for the explosive living ubiquity of geo-physical data streams. The management of the massive and unbounded streams of sensor data produced poses several challenges, including the real-time application of summarization techniques, which should allow the storage and query of this amount of georeferenced and timestamped data in a server with limited memory. In order to face this issue, we have designed a summarization technique, called SUMATRA, which segments the stream into windows, computes summaries window-by-window and stores these summaries in a database. Trend clusters are discovered as summaries of each window. They are clusters of georeferenced data which vary according to a similar trend along the window time horizon. Several compression techniques are also investigated to derive a compact, but accurate representation of these trends for storage in the database. A learning strategy to automatically choose the best trend compression technique is designed. Finally, an in-network modality for tree-based trend cluster discovery is investigated in order to achieve an efficacious aggregation schema which drastically reduces the number of bytes transmitted across the network and maintains a longer network lifespan. This schema is mapped onto the routing structure of a tree-based WSN topology. Experiments performed with several data streams of real sensor networks assess the summarization capability, the accuracy and the efficiency of the proposed summarization schema.
This paper faces the problem of harvesting geographic information from Web documents, specifically, extracting facts on spatial relations among geographic places. The motivation is twofold. First, researchers on Spatial Data Mining often assume that spatial data are already available, thanks to current GIS and positioning technologies. Nevertheless, this is not applicable to the case of spatial information embedded in data without an explicit spatial modeling, such as documents. Second, despite the huge amount of Web documents conveying useful geographic information, there is not much work on how to harvest spatial data from these documents. The problem is particularly challenging because of the lack of annotated documents, which prevents the application of supervised learning techniques. In this paper, we propose to harvest facts on geographic places through an unsupervised approach which recognizes spatial relations among geographic places without supposing the availability of annotated documents. The proposed approach is based on the combined use of a spatial ontology and a prototype-based classifier. A case study on topological and directional relations is reported and commented.
Learning classifiers for spatial data presents several issues, such as the heterogeneity of spatial objects, the implicit definition of spatial relationships among objects, spatial autocorrelation and the abundance of unlabelled data which potentially convey a large amount of information. The first three issues are due to the inherent structure of the spatial units of analysis, which can be easily accommodated if a (multi-)relational data mining approach is considered. The fourth issue demands the adoption of a transductive setting, which aims to make predictions for a given set of unlabelled data. Transduction is also motivated by the contiguity of the concept of positive autocorrelation, which typically affects spatial phenomena, with the smoothness assumption which characterizes the transductive setting. In this work, we investigate a relational approach to spatial classification in a transductive setting. Computational solutions to the main difficulties met in this approach are presented. In particular, a relational upgrade of the naïve Bayes classifier is proposed as discriminative model, an iterative algorithm is designed for the transductive classification of unlabelled data, and a distance measure between relational descriptions of spatial objects is defined in order to determine the k-nearest neighbours of each example in the dataset. The computational solutions have been tested on two real-world spatial datasets. The transformation of spatial data into a multi-relational representation and the experimental results are reported and commented.
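As a rough illustration of the iterative transductive classification step, the sketch below propagates labels to unlabelled examples from their k nearest neighbours over a few iterations. A plain Euclidean k-NN majority vote stands in for the paper's relational naïve Bayes model and relational distance measure; all parameters are assumptions.

```python
import numpy as np
from collections import Counter

def transductive_knn(X, y, labeled, k=3, iters=5):
    """X: n x d features; y: labels (placeholders where unlabeled);
    labeled: boolean mask of initially labeled examples."""
    y = y.copy()
    known = labeled.copy()
    for _ in range(iters):
        for i in np.where(~labeled)[0]:
            d = np.linalg.norm(X - X[i], axis=1)
            nbrs = np.argsort(d)[1:k + 1]          # skip the example itself
            votes = [y[j] for j in nbrs if known[j]]
            if votes:
                # Majority label of currently-known neighbours; refined
                # across iterations as more examples acquire labels.
                y[i] = Counter(votes).most_common(1)[0][0]
                known[i] = True
    return y
```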
Many spatial phenomena are characterized by positive autocorrelation, i.e., variables take similar values at pairs of close locations. This property is strongly related to the smoothness assumption made in transductive learning, according to which if points in a high-density region are close, corresponding outputs should also be close. This observation, together with the prior availability of large sets of unlabelled data, which is typical in spatial applications, motivates the investigation of transductive learning for spatial data mining. The task considered in this work is spatial regression. We apply the co-training technique in order to iteratively learn two separate models, such that each model is used to make predictions on unlabeled data for the other. One model is built on the set of attribute-value observations measured at specific sites, while the other is built on the set of aggregated values measured for the same attributes in nearby sites. Experiments prove the effectiveness of the proposed approach on spatial domains.
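A minimal co-training sketch for spatial regression along the lines described above: one regressor is trained on site attributes, the other on neighbourhood aggregates of the same attributes, and each round the two models jointly pseudo-label a batch of unlabeled sites. The model choice (plain linear regression), batching and averaging of the two pseudo-labels are simplifying assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def co_train(X_site, X_nbr, y, labeled, rounds=5, batch=10):
    """X_site: attributes at each site; X_nbr: aggregated values of the
    same attributes in nearby sites; labeled: boolean mask."""
    lab = labeled.copy()
    y = y.copy()
    for _ in range(rounds):
        m1 = LinearRegression().fit(X_site[lab], y[lab])  # site-attribute view
        m2 = LinearRegression().fit(X_nbr[lab], y[lab])   # neighbourhood view
        unl = np.where(~lab)[0]
        if len(unl) == 0:
            break
        take = unl[:batch]
        # Each view labels data for the other; averaged here for simplicity.
        y[take] = 0.5 * (m1.predict(X_site[take]) + m2.predict(X_nbr[take]))
        lab[take] = True
    return m1, m2
```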
A fundamental task of document image understanding is to recognize semantically relevant components in the layout extracted from a document image. This task can be automatized by learning classifiers to label such components. The application of inductive learning algorithms assumes the availability of a large set of documents, whose layout components have been previously labeled through manual annotation. This contrasts with the more common situation in which we have only few labeled documents and an abundance of unlabeled ones. A further degree of complexity of the learning task is represented by the importance of spatial relationships between layout components, which cannot be adequately represented by feature vectors. To face these problems, we investigate the application of a relational classifier that works in the transductive setting. Transduction is justified by the possibility of exploiting the large amount of information conveyed in the unlabeled documents and by the contiguity of the concept of positive autocorrelation with the smoothness assumption which characterizes the transductive setting. The classifier takes advantage of discovered emerging patterns that permit us to qualitatively characterize classes. Computational solutions have been tested on document images of scientific literature and the experimental results show the advantages and drawbacks of the approach.
Consider a multi-relational database, to be used for classification, that contains a large number of unlabeled data, since the cost of labeling such data is prohibitive. Transductive learning, which learns from labeled as well as from unlabeled data already known at learning time, is highly suited to this scenario. In this paper, we construct multiple views of a relational database by considering different subsets of the tables it contains. These views are used to boost the classification of examples in a co-training schema. The automatically generated views allow us to overcome the view-independence problem that negatively affects the performance of co-training methods. Our experimental evaluation empirically shows that co-training is beneficial in the transductive learning setting when mining multi-relational data, and that our approach works well with only a small amount of labeled data.
The information acquisition in a pervasive sensor network is often affected by faults due to power outages at nodes, wrong time synchronizations, interference, network transmission failures, sensor hardware issues or excessive energy consumption for communications. These issues impose a trade-off between the precision of the measurements and the costs of communication and processing, which are directly proportional to the number of sensors.
In many real-time applications, such as wireless sensor network monitoring, traffic control or health monitoring systems, it is necessary to analyze continuous and unbounded, geographically distributed streams of data (e.g. temperature or humidity measurements transmitted by the sensors of weather stations). Storing and querying geo-referenced stream data poses specific challenges both in time (real-time processing) and in space (limited storage capacity). Summarization algorithms can be used to reduce the amount of data to be permanently stored in a data warehouse without losing information for subsequent analysis. In this paper we present a framework in which data streams are seen as time-varying realizations of stochastic processes. Signal compression techniques, based on transformed domains, are applied and compared with a geometrical segmentation in terms of compression efficiency and accuracy in the subsequent reconstruction.
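A small illustration of a transform-domain compression of the kind compared in the paper: keep only the k largest DFT coefficients of a signal and measure the reconstruction error. The signal, k, and the RMSE error measure are illustrative assumptions, not the framework's actual techniques.

```python
import numpy as np

def dft_compress(x, k):
    """Keep the k largest-magnitude DFT coefficients, zero the rest."""
    X = np.fft.rfft(x)
    keep = np.argsort(np.abs(X))[::-1][:k]
    Xc = np.zeros_like(X)
    Xc[keep] = X[keep]
    return np.fft.irfft(Xc, n=len(x))

# Synthetic sensor signal: a slow oscillation plus noise.
x = np.sin(np.linspace(0, 6 * np.pi, 128)) \
    + 0.1 * np.random.default_rng(1).standard_normal(128)
x_hat = dft_compress(x, k=5)   # 5 complex coefficients instead of 128 samples
print("RMSE:", np.sqrt(np.mean((x - x_hat) ** 2)))
```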
Spatio-temporal data collected in sensor networks are often affected by faults due to power outages at nodes, wrong time synchronizations, interference, network transmission failures, sensor hardware issues or high energy consumption during communications. Therefore, the acquisition of information by wireless sensor networks is a challenging step in monitoring physical ubiquitous phenomena (e.g. weather, pollution, traffic). This issue gives rise to a fundamental trade-off: a higher density of sensors provides more data, higher resolution and better accuracy, but requires more communication and processing. A data mining approach to reduce communication and energy requirements is investigated: the number of transmitting sensors is decreased as much as possible, while keeping a reasonable degree of data accuracy. Kriging techniques and trend cluster discovery are employed to estimate unknown data at any unsampled location in space and at any time point in the past. Kriging is a group of statistical interpolation techniques, suited for spatial data, which estimates the unknown value at any spatial location as a properly weighted mean of nearby observed data. Trend clusters are stream patterns which compactly represent sensor data by means of spatial clusters having prominent data trends in time. Kriging is here applied to estimate unknown data taking into account a spatial correlation model of the sensor network. Trends are used as a guideline to transfer this model across the time horizon of the trend itself. Experiments are performed with a real sensor network, in order to evaluate this interpolation technique and demonstrate that Kriging and trend clusters outperform, in terms of accuracy, interpolation competitors such as Nearest Neighbour and Inverse Distance Weighting.
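For concreteness, here is a compact ordinary-kriging-style estimator: the weights come from solving the linear system induced by a spatial covariance model between observed locations, with a Lagrange multiplier enforcing unbiasedness. The exponential covariance and its range parameter are assumptions for illustration; the paper's spatial correlation model may differ.

```python
import numpy as np

def kriging_estimate(coords, values, target, range_=1.0):
    """coords: n x 2 observed locations; values: n readings;
    target: 2-D location to estimate."""
    def cov(d):
        return np.exp(-d / range_)  # assumed exponential covariance model
    d_obs = np.linalg.norm(coords[:, None] - coords[None, :], axis=2)
    d_tgt = np.linalg.norm(coords - target, axis=1)
    n = len(coords)
    # Ordinary kriging system with a Lagrange multiplier for unbiasedness.
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = cov(d_obs)
    A[n, n] = 0.0
    b = np.ones(n + 1)
    b[:n] = cov(d_tgt)
    w = np.linalg.solve(A, b)[:n]
    return w @ values  # weighted mean of nearby observed data

stations = np.array([[0., 0.], [1., 0.], [0., 1.]])
print(kriging_estimate(stations, np.array([10., 20., 30.]), np.array([.5, .5])))
```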
The discovery and extraction of general lists on the Web continues to be an important problem facing the Web mining community. There have been numerous studies that claim to automatically extract structured data (i.e. lists, record sets, tables, etc.) from the Web for various purposes. Our own recent experiences have shown that the list-finding methods used as part of these larger frameworks do not generalize well and therefore ought to be reevaluated. This paper briefly describes some of the current approaches, and tests them on various list-pages. Based on our findings, we conclude that analyzing a Web page’s DOM-structure is not sufficient for the general list finding task.
With the development of the Automatic Identification System (AIS), more and more vessels are equipped with AIS technology. Vessels' reports (e.g. position in geodetic coordinates, speed, course), periodically transmitted via AIS, have become an abundant and inexpensive source of ubiquitous motion information for maritime surveillance. In this study, we investigate the problem of processing the ubiquitous data enclosed in the AIS messages of a vessel in order to display an interpolation of the vessel's itinerary. We define a graph-aware itinerary mining strategy which uses the spatio-temporal knowledge enclosed in each AIS message to constrain the itinerary search. Experiments investigate the impact of the proposed spatio-temporal data mining algorithm on the accuracy and efficiency of the itinerary interpolation process, also when reducing the amount of AIS messages processed per vessel.
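The flavour of a graph-constrained itinerary search can be sketched as follows: the interpolated itinerary between two AIS reports is a shortest path in a navigation graph, and candidate extensions are pruned when the distance travelled could not respect the two reported timestamps at an assumed maximum speed. The graph representation, the speed bound and the pruning rule are all illustrative assumptions, not the paper's strategy.

```python
import heapq

def itinerary(graph, src, dst, t_src, t_dst, max_speed):
    """graph: {node: [(neighbour, edge_distance), ...]};
    returns the shortest time-feasible path from src to dst, or None."""
    budget = (t_dst - t_src) * max_speed  # max distance coverable in the interval
    pq, best = [(0.0, src, [src])], {}
    while pq:
        d, u, path = heapq.heappop(pq)
        if u == dst:
            return path
        if best.get(u, float("inf")) <= d:
            continue
        best[u] = d
        for v, w in graph.get(u, []):
            if d + w <= budget:  # spatio-temporal feasibility constraint
                heapq.heappush(pq, (d + w, v, path + [v]))
    return None
```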
Background: Ontologies and catalogs of gene functions, such as the Gene Ontology (GO) and MIPS-FUN, assume that functional classes are organized hierarchically, that is, general functions include more specific ones. This has recently motivated the development of several machine learning algorithms for gene function prediction that leverage this hierarchical organization, where instances may belong to multiple classes. In addition, it is possible to exploit relationships among examples, since it is plausible that related genes tend to share functional annotations. Although these relationships have been identified and extensively studied in the area of protein-protein interaction (PPI) networks, they have not received much attention in hierarchical and multi-class gene function prediction. Relations between genes introduce autocorrelation in functional annotations and violate the assumption that instances are independently and identically distributed (i.i.d.), which underlies most machine learning algorithms. Although the explicit consideration of these relations brings additional complexity to the learning process, we expect substantial benefits in the predictive accuracy of the learned classifiers. Results: This article demonstrates the benefits (in terms of predictive accuracy) of considering autocorrelation in multi-class gene function prediction. We develop a tree-based algorithm for considering network autocorrelation in the setting of Hierarchical Multi-label Classification (HMC). We empirically evaluate the proposed algorithm, called NHMC (Network Hierarchical Multi-label Classification), on 12 yeast datasets using each of the MIPS-FUN and GO annotation schemes and exploiting 2 different PPI networks. The results clearly show that taking autocorrelation into account improves the predictive performance of the learned models for predicting gene function. Conclusions: Our newly developed method for HMC takes network information into account in the learning phase: when used for gene function prediction in the context of PPI networks, the explicit consideration of network autocorrelation increases the predictive performance of the learned models. Overall, we found that this holds for different gene features/descriptions, functional annotation schemes, and PPI networks: best results are achieved when the PPI network is dense and contains a large proportion of function-relevant interactions.
Ubiquitous sensor stations continuously measure several geophysical fields over large zones and long (potentially unbounded) periods of time. However, observations can never cover every location or every time point. In addition, due to its huge volume, the data produced cannot be entirely recorded for future analysis. In this scenario, interpolation, i.e., the estimation of unknown data at each location or time of interest, can be used to supplement station records. Although in GIScience there has been a tendency to treat space and time separately, integrating them can yield better results when interpolating geophysical fields. Following this idea, we describe a spatio-temporal interpolation process which accounts for both space and time. It operates in two phases. First, the exploration phase addresses the problem of interaction. This phase is performed on-line, using data recorded from a network throughout a time window. The trend cluster discovery process determines prominent data trends and geographically-aware station interactions in the window. The result of this process is produced before a new data window is recorded. Second, the estimation phase uses the inverse distance weighting approach both to approximate observed data and to estimate missing data. The proposed technique has been evaluated using two large real climate sensor networks. The experiments empirically demonstrate that, in spite of a notable reduction in the volume of data, the technique guarantees accurate estimation of missing data.
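Inverse distance weighting, used in the estimation phase above, admits a very short sketch: a missing reading is approximated by a distance-weighted mean of the observed stations. The power parameter p and the example data are assumptions made for illustration.

```python
import numpy as np

def idw(coords, values, target, p=2.0, eps=1e-12):
    """coords: n x 2 station positions; values: n readings;
    target: 2-D location where the value is missing."""
    d = np.linalg.norm(coords - target, axis=1)
    if np.any(d < eps):              # exact hit on a station: return its reading
        return values[np.argmin(d)]
    w = 1.0 / d**p                   # closer stations weigh more
    return np.sum(w * values) / np.sum(w)

stations = np.array([[0., 0.], [1., 0.], [0., 1.]])
readings = np.array([10.0, 20.0, 30.0])
print(idw(stations, readings, np.array([0.5, 0.5])))
```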