Annalisa Appice
Role
Associate Professor
Organization
Università degli Studi di Bari Aldo Moro
Department
Department of Computer Science (Dipartimento di Informatica)
Scientific Area
AREA 09 - Industrial and Information Engineering
Academic Discipline
ING-INF/05 - Information Processing Systems
ERC Sector, 1st Level
Not Available
ERC Sector, 2nd Level
Not Available
ERC Sector, 3rd Level
Not Available
Photovoltaics (PV) is the field of technology and research related to the application of solar cells to convert sunlight directly into electricity. In the last decade, PV plants have become ubiquitous in several countries of the European Union (EU). This paves the way for marketing new smart systems designed to monitor the energy production of a PV park grid and to supply intelligent services for customer and production applications. In this paper, we describe a new business intelligence system developed to monitor the efficiency of the energy production of a PV park. The system includes services for data collection, summarization (based on trend cluster discovery), synthetic data generation, supervisory monitoring, report building and visualization.
Process mining refers to the discovery, conformance and enhancement of process models from event logs currently produced by several information systems (e.g. workflow management systems). By tightly coupling event logs and process models, process mining makes it possible to detect deviations, predict delays, support decision making and recommend process redesigns. Event logs are data sets containing the executions (called traces) of a business process. Several process mining algorithms have been defined to mine event logs and deliver valuable models (e.g. Petri nets) of how logged processes are being executed. However, they often generate spaghetti-like process models, which can be hard to understand. This is caused by the inherent complexity of real-life processes, which tend to be less structured and more flexible than what the stakeholders typically expect. In particular, spaghetti-like process models are discovered when all possible behaviors are shown in a single model as a result of considering the set of traces in the event log all at once. To minimize this problem, trace clustering can be used as a preprocessing step. It splits up an event log into clusters of similar traces, so as to handle variability in the recorded behavior and facilitate process model discovery. In this paper, we investigate a multiple-view-aware approach to trace clustering, based on a co-training strategy. In an assessment using benchmark event logs, we show that the presented algorithm is able to discover a clustering pattern of the log such that related traces are appropriately clustered. We evaluate the significance of the formed clusters using established machine learning and process mining metrics.
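To make the preprocessing idea concrete, here is a minimal sketch of trace clustering; it is not the multiple-view, co-training algorithm of the paper, but the common single-view baseline it builds on: each trace becomes an activity-frequency profile, and a leader-style pass groups profiles by cosine similarity. The log, threshold and function names are all illustrative.

```python
from collections import Counter
from math import sqrt

def trace_profile(trace, activities):
    """Represent a trace (list of activity names) as an activity-frequency vector."""
    counts = Counter(trace)
    return [counts.get(a, 0) for a in activities]

def cosine(u, v):
    """Cosine similarity between two frequency vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu, nv = sqrt(sum(x * x for x in u)), sqrt(sum(x * x for x in v))
    return dot / (nu * nv) if nu and nv else 0.0

def cluster_traces(log, threshold=0.8):
    """Leader clustering: assign each trace to the first cluster whose
    representative profile is similar enough, else start a new cluster."""
    activities = sorted({a for trace in log for a in trace})
    clusters = []  # list of (representative_profile, member_traces)
    for trace in log:
        p = trace_profile(trace, activities)
        for rep, members in clusters:
            if cosine(p, rep) >= threshold:
                members.append(trace)
                break
        else:
            clusters.append((p, [trace]))
    return [members for _, members in clusters]

log = [["a", "b", "c"], ["a", "b", "c", "c"], ["x", "y"], ["x", "y", "y"]]
clusters = cluster_traces(log)  # two behavioral variants -> two clusters
```

Each resulting cluster can then be fed separately to a discovery algorithm, yielding simpler models than mining the whole log at once.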
The amount of data produced by ubiquitous computing applications is quickly growing, due to the pervasive presence of small devices endowed with sensing, computing and communication capabilities. Heterogeneity and strong interdependence, which characterize 'ubiquitous data', require a (multi-)relational approach to their analysis. However, relational data mining algorithms do not scale well, and very large data sets are hardly processable. In this paper, we propose an extension of a relational algorithm for multi-level frequent pattern discovery, which resorts to data sampling and distributed computation in Grid environments in order to overcome the computational limits of the original serial algorithm. The set of patterns discovered by the new algorithm approximates the set of exact solutions found by the serial algorithm. The quality of the approximation depends on three parameters: the proportion of data in each sample, the minimum support thresholds, and the number of samples in which a pattern has to be frequent in order to be considered globally frequent. Considering that the first two parameters are hardly controllable, we focus our investigation on the third one. Theoretically derived conclusions are also experimentally confirmed. Moreover, an additional application in the context of event log mining proves the viability of the proposed approach to relational frequent pattern mining from very large data sets.
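The role of the third parameter can be sketched in a few lines; this is a hedged illustration of the voting principle on toy itemset data, not the relational, Grid-distributed algorithm of the paper. All data and thresholds are invented for the example.

```python
def frequent_in_sample(pattern, sample, min_support):
    """A pattern is locally frequent if its relative support in a sample
    reaches the minimum support threshold."""
    matches = sum(1 for transaction in sample if pattern <= transaction)
    return matches / len(sample) >= min_support

def globally_frequent(pattern, samples, min_support, min_samples):
    """Approximate global frequency: the pattern must be locally frequent
    in at least `min_samples` of the data samples (the third parameter)."""
    votes = sum(frequent_in_sample(pattern, s, min_support) for s in samples)
    return votes >= min_samples

samples = [
    [{"a", "b"}, {"a", "b", "c"}, {"c"}],
    [{"a", "b"}, {"b"}, {"a", "b", "c"}],
    [{"c"}, {"b", "c"}, {"a"}],
]
# {"a", "b"} reaches support 0.5 in 2 of the 3 samples
```

Raising `min_samples` trades recall for precision: a stricter vote admits fewer spurious patterns but may miss patterns that are globally frequent yet unlucky in the sampling.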
The rising need for energy to improve the quality of life has paved the way for the development and promotion of different kinds of renewable energy technologies. In particular, the recent increase in the number of installed PhotoVoltaic (PV) plants has boosted the marketing of new monitoring systems designed to keep the energy production of PV plants under control. In this paper, we present an intelligent monitoring system, called SUNInspector, which resorts to spatio-temporal data mining techniques in order to monitor the energy production of PV plants and detect possible plant faults in real time. SUNInspector uses spatio-temporal patterns, called trend clusters, to model the trends according to which the energy production of a PV plant varies depending on the region where it is installed (spatial dependence) and the period of the year of the measurements (temporal dependence). Each time a PV plant transmits its energy production measurement, the risk of a plant fault is measured by evaluating the persistence of a high difference between the real production and the expected production. A case study with PV plants distributed over the South of Italy is illustrated.
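The persistence idea can be sketched as follows; this is a minimal stand-in for SUNInspector's risk evaluation, assuming a simple rule (flag a fault when the relative deviation stays above a tolerance for a run of consecutive readings). The tolerance and window values are illustrative, not those of the system.

```python
def fault_risk(actual, expected, tolerance=0.2, window=3):
    """Flag a possible plant fault when the relative deviation between actual
    and expected production stays high for `window` consecutive readings."""
    streak = 0
    for real, exp in zip(actual, expected):
        deviation = abs(real - exp) / exp if exp else 0.0
        streak = streak + 1 if deviation > tolerance else 0
        if streak >= window:
            return True
    return False

expected = [10.0, 10.0, 10.0, 10.0, 10.0]   # e.g. trend-cluster prediction
healthy  = [9.5, 10.2, 9.8, 10.1, 9.9]      # small, transient deviations
faulty   = [9.5, 6.0, 5.5, 5.0, 4.8]        # persistent under-production
```

Requiring persistence rather than a single large deviation is what keeps isolated cloudy readings from raising false alarms.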
The analysis of spatial autocorrelation has defined a new paradigm in ecology. Attention to spatial pattern leads to insights that would otherwise be overlooked, while ignoring space may lead to false conclusions about ecological relationships. In this paper, we propose an intelligent forecasting technique which explicitly accounts for the property of spatial autocorrelation when learning linear autoregressive (ARIMA) models of spatially correlated ecological time series. The forecasting algorithm makes use of an autoregressive statistical technique which achieves accurate forecasts of future data by taking into account both the temporal and the spatial dimension of ecological data. It uses a novel spatial-aware inference procedure which learns the autoregressive model by processing the time series in a neighborhood (spatial lags). The parameters of the forecasting models are jointly learned on spatial lags of time series. Experiments with ecological data investigate the accuracy of the proposed spatial-aware forecasting model with respect to the traditional one.
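The joint-learning idea can be illustrated with the simplest possible case: a sketch that pools lagged pairs from a target series and its spatial neighbors to estimate a single AR(1) coefficient by least squares. This is an assumption-laden simplification (one lag, shared coefficient, synthetic data), not the paper's inference procedure.

```python
def ar1_coefficient(series_list):
    """Estimate one AR(1) coefficient by pooling (previous, current) pairs
    from a target series and its spatial neighbors: the parameter is jointly
    learned on the spatial lags of the time series."""
    num = den = 0.0
    for series in series_list:
        for prev, curr in zip(series, series[1:]):
            num += prev * curr
            den += prev * prev
    return num / den if den else 0.0

def forecast_next(series, phi):
    """One-step-ahead AR(1) forecast."""
    return phi * series[-1]

# Three neighboring sensors observing roughly the same decaying process
target    = [8.0, 4.0, 2.0, 1.0]
neighbor1 = [6.0, 3.0, 1.5, 0.75]
neighbor2 = [10.0, 5.0, 2.5, 1.25]
phi = ar1_coefficient([target, neighbor1, neighbor2])
```

Pooling the neighborhood triples the number of lagged pairs behind the estimate, which is exactly the benefit spatial autocorrelation offers: nearby series carry information about the same underlying process.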
A spatio-temporal data stream is a sequence of time-stamped, geo-referenced data elements which arrive at consecutive time points. In addition to the spatial and temporal dimensions, which are information bearing, a stream poses further challenges to data mining: avoiding multiple scans of the entire data set, optimizing memory usage, and mining only the most recent patterns. In this paper, we address the challenges of mining spatio-temporal data streams for a new class of space-time patterns, called trend clusters. These patterns combine spatial clustering and trend discovery in stream environments. In particular, we propose a novel algorithm, called TRUST, which retrieves groups of spatially continuous geo-referenced data which vary according to a similar trend polyline over the recent past window. Experiments demonstrate the effectiveness of the proposed algorithm.
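A toy version of the pattern may clarify what a trend cluster is; the sketch below groups sensors that are both spatially close and share a similar trend (here reduced to the least-squares slope of the recent window). It is a deliberately simplified stand-in for TRUST, with invented thresholds and data.

```python
def slope(values):
    """Least-squares slope of a series over time steps 0..n-1."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

def trend_clusters(sensors, max_dist=1.5, max_slope_diff=0.5):
    """Greedy grouping of sensors that are spatially close AND follow a
    similar trend in the recent window."""
    clusters = []  # each: [seed_position, seed_slope, member_ids]
    for sid, (pos, window) in sensors.items():
        s = slope(window)
        for seed_pos, seed_slope, members in clusters:
            dist = ((pos[0] - seed_pos[0]) ** 2 + (pos[1] - seed_pos[1]) ** 2) ** 0.5
            if dist <= max_dist and abs(s - seed_slope) <= max_slope_diff:
                members.append(sid)
                break
        else:
            clusters.append([pos, s, [sid]])
    return [members for _, _, members in clusters]

sensors = {
    "s1": ((0, 0), [1.0, 2.0, 3.0]),   # rising trend
    "s2": ((1, 0), [1.1, 2.1, 3.2]),   # rising trend, adjacent to s1
    "s3": ((0, 1), [5.0, 3.0, 1.0]),   # falling trend, also adjacent
}
```

Note that s3 is spatially close to s1 yet lands in its own cluster: both conditions, proximity and trend similarity, must hold, which is what distinguishes trend clusters from plain spatial clusters.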
In predictive data mining tasks, we should account for autocorrelation of both the independent variables and the dependent variable, which we can observe in the neighborhood of a target node and at the node itself. The prediction on a target node should be based on the values of its neighbours, which might even be unavailable. To address this problem, the values of the neighbours should be inferred collectively. We present a novel computational solution to perform collective inference in a network regression task. We define an iterative algorithm which makes regression inferences about the predictions of multiple nodes simultaneously and feeds back the most reliable predictions made by the previous models into the labeled network. Experiments investigate the effectiveness of the proposed algorithm in spatial networks.
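The core loop of collective inference can be sketched in a few lines; this minimal version simply re-estimates each unlabeled node as the mean of its neighbors' current values and feeds the predictions back on every round. It is an illustration of the iterative feed-back principle, not the paper's algorithm (which also scores the reliability of predictions).

```python
def collective_inference(graph, known, iterations=10):
    """Iteratively predict unlabeled nodes from their neighbors' current
    values, feeding predictions back into the network at each round."""
    values = dict(known)
    unlabeled = [n for n in graph if n not in known]
    global_mean = sum(known.values()) / len(known)
    for n in unlabeled:
        values[n] = global_mean  # initialize unknowns with the global mean
    for _ in range(iterations):
        for n in unlabeled:
            neigh = graph[n]
            values[n] = sum(values[m] for m in neigh) / len(neigh)
    return values

# A 4-node path graph a - b - c - d, labeled only at the endpoints
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
values = collective_inference(graph, {"a": 0.0, "d": 3.0})
```

On the path the iterations converge to the linear interpolation between the two labeled endpoints, which shows how information propagates through unlabeled nodes that could never be predicted in isolation.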
A key task in data mining and information retrieval is learning preference relations. Most of the methods reported in the literature learn preference relations between objects which are represented by attribute-value pairs or feature vectors (propositional representation). The growing interest in data mining techniques able to deal directly with more sophisticated representations of complex objects motivates the investigation of relational learning methods for learning preference relations. In this paper, we present a probabilistic relational data mining method which models preference relations between complex objects. Preference relations are then used to rank objects. Experiments on two ranking problems for scientific literature mining prove the effectiveness of the proposed method.
Trend cluster discovery retrieves areas of spatially close sensors which measure a numeric random field having a prominent data trend along a time horizon. We propose a computation-preserving algorithm which employs an incremental learning strategy to continuously maintain sliding-window trend clusters across a sensor network. Our proposal reduces the amount of data to be processed and, as a consequence, saves computation time. An empirical study proves the effectiveness of the proposed algorithm in keeping the computation cost of detecting sliding-window trend clusters under control.
Emerging real-life applications, such as environmental compliance, ecological studies and meteorology, are characterized by real-time data acquisition through a number of (wireless) remote sensors. Operatively, remote sensors are installed across a spatially distributed network; they gather information along a number of attribute dimensions and periodically feed a central server with the measured data. The server is required to monitor these data, issue possible alarms or compute fast aggregates. As the data analysis requests submitted to a server may concern both present and past data, the server is forced to store the entire stream. But, in the case of massive streams (large networks and/or frequent transmissions), the limited storage capacity of a server may force a reduction of the amount of data stored on disk. One solution to address the storage limits is to compute summaries of the data as they arrive and use these summaries to interpolate the real data, which are discarded instead. On any future demand for further analysis of the discarded data, the server pieces the data together from the summaries stored in the database and processes them according to the requests. This work introduces the multiple possibilities and facets of a recently defined spatio-temporal pattern, called trend cluster, and its applications to summarize, interpolate and identify anomalies in a sensor network. As an example application, the authors illustrate the application of trend cluster discovery to monitor the efficiency of photovoltaic power plants. The work closes with remarks on new possibilities for surveillance gained by recent developments in sensing technology, and with an outline of future challenges.
Despite the growing ubiquity of sensor deployments and the advances in sensor data analysis technology, relatively little attention has been paid to the spatial non-stationarity of sensed data, which is an intrinsic property of geographically distributed data. In this paper, we deal with the non-stationarity of geographically distributed data for the task of regression. To this end, we extend the Geographically Weighted Regression (GWR) method, which permits the exploration of the geographical differences in the linear effect of one or more predictor variables upon a response variable. The parameters of this linear regression model are locally determined for every point of the space by processing a sample of weighted neighboring observations. Although the use of locally linear regression has proved appealing in the area of sensor data analysis, it also poses some problems. The parameters of the surface are locally estimated for every space point, but the form of the GWR regression surface is globally defined over the whole sample space. Moreover, the GWR estimation is founded on the assumption that all predictor variables are equally relevant in the regression surface, without dealing with spatially localized phenomena of collinearity. Our proposal overcomes these limitations with a novel tree-based approach which is adapted to the aim of recovering the functional form of a regression model only at the local level. A stepwise approach is then employed to determine the local form of each regression model by selecting only the most promising predictors and providing a mechanism to estimate the parameters of these predictors at every point of the local area. Experiments with several geographically distributed datasets confirm that the tree-based construction of GWR models improves both the local estimation of parameters of GWR and the global estimation of parameters performed by classical model trees.
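The core GWR step, a locally weighted least-squares fit at one space point with Gaussian distance-decay weights, can be sketched for a single predictor; the closed-form weighted-slope formula below is standard, while the bandwidth and the two-region data are invented for illustration.

```python
from math import exp

def gwr_fit_at(point, locations, x, y, bandwidth=1.0):
    """Fit y = b0 + b1*x at one space point by weighted least squares,
    with Gaussian distance-decay weights: the core GWR idea (one predictor)."""
    w = []
    for (px, py) in locations:
        d2 = (px - point[0]) ** 2 + (py - point[1]) ** 2
        w.append(exp(-d2 / (2 * bandwidth ** 2)))
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    num = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    den = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    b1 = num / den
    b0 = my - b1 * mx
    return b0, b1

# Two regions where the same predictor has opposite effects on the response
locations = [(0, 0), (0, 1), (0, 2), (10, 0), (10, 1), (10, 2)]
x = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0, -1.0, -2.0, -3.0]   # slope +2 near x=0, slope -1 near x=10
b0_a, b1_a = gwr_fit_at((0, 1), locations, x, y)
b0_b, b1_b = gwr_fit_at((10, 1), locations, x, y)
```

The fitted slope flips sign between the two query points, which is precisely the non-stationarity that a single global regression would average away.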
Spatial autocorrelation is the correlation among data values which is strictly due to the relative spatial proximity of the objects that the data refer to. Inappropriate treatment of data with spatial dependencies, where spatial autocorrelation is ignored, can obfuscate important insights. In this paper, we propose a data mining method that explicitly considers spatial autocorrelation in the values of the response (target) variable when learning predictive clustering models. The method is based on the concept of predictive clustering trees (PCTs), according to which hierarchies of clusters of similar data are identified and a predictive model is associated with each cluster. In particular, our approach is able to learn predictive models for both a continuous response (regression task) and a discrete response (classification task). We evaluate our approach on several real-world problems of spatial regression and spatial classification. Considering the autocorrelation in the models yields predictions that are consistently clustered in space, with clusters that tend to preserve the spatial arrangement of the data, while at the same time providing a multi-level insight into the spatial autocorrelation phenomenon. The evaluation of SCLUS in several ecological domains (e.g. predicting outcrossing rates within a conventional field due to the surrounding genetically modified fields, as well as predicting pollen dispersal rates from two lines of plants) confirms its capability of building spatially aware models which capture the spatial distribution of the target variable. In general, the maps obtained by using SCLUS do not require further post-smoothing of the results if we want to use them in practice.
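The property itself is usually quantified with Moran's I, which the sketch below computes from scratch for a binary adjacency structure; the six-site line and its two value patterns are invented to show the two extremes.

```python
def morans_i(values, neighbors):
    """Moran's I statistic: positive for spatially clustered values, negative
    for a checkerboard pattern, near zero under spatial randomness."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    # numerator: sum of cross-products over all ordered neighbor pairs
    num = sum(dev[i] * dev[j] for i, js in neighbors.items() for j in js)
    w_total = sum(len(js) for js in neighbors.values())  # sum of weights
    den = sum(d * d for d in dev)
    return (n / w_total) * (num / den)

# Six sites on a line; each adjacent site is a neighbor (symmetric weights)
line = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
clustered   = [1, 1, 1, 5, 5, 5]   # low values together, high values together
alternating = [1, 5, 1, 5, 1, 5]   # checkerboard
```

A positive value on the clustered pattern and a negative one on the checkerboard make concrete what "spatial autocorrelation in the response variable" means before any model is learned on it.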
Spatial data are common in ecological studies; however, one major problem with spatial data is the presence of spatial autocorrelation. This phenomenon indicates that data measured at locations relatively close to each other tend to have more similar values than data measured at locations further apart. Spatial autocorrelation violates the statistical assumption that the analyzed data are independent and identically distributed. This chapter focuses on the effects of spatial autocorrelation when predicting gene flow from Genetically Modified (GM) to non-GM maize fields under real multi-field crop management practices at a regional scale. We present the SCLUS method, an extension of the CLUS method (Blockeel et al., 1998), which learns spatially aware predictive clustering trees (PCTs). The method can consider the effects of spatial autocorrelation both locally and globally, and can deal with the "ecological fallacy" problem (Robinson, 1950). The chapter concludes with a presentation of an application of this approach to gene flow modeling.
Anomaly detection and change analysis are challenging tasks in stream data mining. We illustrate a novel method that addresses both these tasks in geophysical applications. The method is designed for numeric data routinely sampled through a sensor network. It extends the traditional time series forecasting theory by accounting for the spatial information of geophysical data. In particular, a forecasting model is computed incrementally by accounting for the temporal correlation of data which exhibit a spatial correlation in the recent past. For each sensor, the observed value is compared to its spatial-aware forecast in order to identify outliers. Finally, the spatial correlation of the outliers is analyzed, in order to classify changes and reduce the number of false anomalies. The performance of the presented method is evaluated on both artificial and real data streams.
The automatic discovery of process models can help to gain insight into various perspectives (e.g., the control flow or data perspective) of the process executions traced in an event log. Frequent pattern mining offers a means to build human-understandable representations of these process models. This paper describes the application of a multi-relational method of frequent pattern discovery to process mining. Multi-relational data mining is demanded by the variety of activities and actors involved in the process executions traced in an event log, which leads to a relational (or structural) representation of the process executions. The peculiarity of this work lies in the integration of disjunctive forms into the relational patterns discovered from event logs. The introduction of disjunctive forms enables relational patterns to express frequent variants of process models. The effectiveness of using relational patterns with disjunctions to describe process models with variants is assessed on real logs of process executions.
Nowadays, sensors are deployed everywhere in order to support real-time data applications. They periodically gather information along a number of attribute dimensions (e.g., temperature and humidity). Applications typically require monitoring these data, computing fast aggregates, predicting unknown data, or issuing alarms. To this aim, this paper introduces a recently defined spatio-temporal pattern, called trend cluster, and its multiple applications to summarize, interpolate and detect outliers in sensor network data. As an example, we illustrate the application of trend cluster discovery to air climate data monitoring.
The discovery of new and potentially meaningful relationships between named entities in the biomedical literature can take great advantage of the application of multi-relational data mining approaches in text mining. This is motivated by the ability of multi-relational data mining to express and manipulate relationships between entities. We investigate the application of such an approach to the task of identifying informative syntactic structures which are frequent in biomedical abstract corpora. Initially, named entities are annotated in text corpora according to some biomedical dictionary (e.g. the MeSH taxonomy). Tagged entities are then integrated in syntactic structures with the role of subject and/or object of the corresponding verb. These structures are represented in a first-order language. The multi-relational approach to frequent pattern discovery allows the identification of the verb-based relationships between named entities which frequently occur in the corpora. Preliminary experiments with a collection of abstracts obtained by querying Medline on a specific disease are reported.
The task addressed in this paper is forecasting the future value of a time series variable at a certain geographical location, based on historical data of this variable collected at both this and other locations. In general, this time series forecasting task can be performed by using machine learning models which transform the original problem into a regression task. The target variable is the future value of the series, while the predictors are the past values of the series up to a certain p-length time window. In this paper, we convey information on both the spatial and the temporal historical data to the predictive models, with the goal of improving their forecasting ability. We build technical indicators, which are summaries of certain properties of the spatio-temporal data grouped in spatio-temporal clusters, and use them to enhance the forecasting ability of regression models. A case study with air temperature data is presented.
A smart grid can be seen as a sensor network, with immense amounts of grid sensor data continuously transmitted from various sensors. Mining knowledge from these data in order to enrich the grid with knowledge-based services is a challenging task due to the massive scale and spatial coupling therein. In this paper, we present a novel knowledge-based fault diagnosis service which is designed to detect faults in the energy production of a grid of PhotoVoltaic (PV) plants. A case study with a grid of PV plants distributed over the South of Italy is illustrated.
Spatial autocorrelation is the correlation among data values, strictly due to the relative location proximity of the objects that the data refer to. This statistical property clearly indicates a violation of the assumption of observation independence, a pre-condition assumed by most data mining and statistical models. Inappropriate treatment of data with spatial dependencies, where spatial autocorrelation is ignored, can obfuscate important insights. In this paper, we propose a data mining method that explicitly considers autocorrelation when building the models. The method is based on the concept of predictive clustering trees (PCTs). The proposed approach combines the possibility of capturing both global and local effects with the ability to deal with positive spatial autocorrelation. The discovered models adapt to local properties of the data, providing at the same time spatially smoothed predictions. Results show the effectiveness of the proposed solution.
Clustering geosensor data is a problem that has recently attracted a large amount of research. In this paper, we focus on clustering geophysical time series data measured by a geo-sensor network. Clusters are built by accounting for both the spatial and the temporal information of the data. We use clusters to produce globally meaningful information from the time series obtained by individual sensors. The cluster information is integrated into the ARIMA model, in order to yield accurate forecasting results. Experiments investigate the trade-off between the accuracy and the efficiency of the proposed algorithm.
Information acquisition in a pervasive sensor network is often affected by faults due to power outages at nodes, wrong time synchronizations, interference, network transmission failures, sensor hardware issues or excessive energy consumption for communications. These issues impose a trade-off between the precision of the measurements and the costs of communication and processing, which are directly proportional to the number of sensors and/or transmissions. We present a spatio-temporal interpolation technique which allows an accurate estimation of missing sensor network data by computing the inverse distance weighting of the trend cluster representation of the transmitted data. The trend cluster interpolation has been evaluated on a real climate sensor network, in order to prove the efficacy of our solution in reducing the number of transmissions while guaranteeing an accurate estimation of missing data.
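Inverse distance weighting itself is the standard estimator sketched below; note that the paper applies it to the trend cluster representation rather than, as here, directly to raw (location, value) pairs, so this is only the building block, with invented data.

```python
def idw_estimate(target, known, power=2):
    """Inverse distance weighting: estimate the value at `target` from
    (location, value) pairs of the sensors that did transmit."""
    num = den = 0.0
    for (x, y), value in known:
        d2 = (x - target[0]) ** 2 + (y - target[1]) ** 2
        if d2 == 0:
            return value  # exact hit: return the measured value
        w = 1.0 / d2 ** (power / 2)
        num += w * value
        den += w
    return num / den

known = [((0.0, 0.0), 10.0), ((2.0, 0.0), 20.0)]
estimate = idw_estimate((1.0, 0.0), known)  # midpoint: equal weights
```

The `power` parameter controls how quickly a sensor's influence decays with distance; larger values make the estimate track the nearest sensor more closely.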
This paper describes the principles and implementation of an algorithm for the classification of hyperspectral remote sensing images. The proposed approach is novel and can be included within the category of the spectral–spatial classification algorithms. The elements of novelty of the algorithm are as follows: 1) the implementation of two classifiers that work iteratively, each one exploiting the decision of the other to improve the training phase, and 2) the use of relational features based on the current labeling and on the spatial structure of the image. The two classifiers are fed with the spectral features and with the spatial features, respectively. The spatial features are built using the relative abundance of each class in a neighborhood of the pixel (homogeneity index), where the neighborhood is properly defined. An important contribution to the success of the method is the adoption of a multiclass classifier, the multinomial logistic regression, and a proper use of the posterior probabilities to infer the class labeling and build the relational data. The results of the two classifiers are eventually combined by means of an ensemble decision. The algorithm has been successfully tested on three standard hyperspectral images taken from the Airborne Visible–Infrared Imaging Spectrometer and ROSIS airborne sensors and compared with classification algorithms recently proposed in the literature.
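The homogeneity index mentioned above, the relative abundance of each class in a pixel's neighborhood, can be sketched as follows; the square window, class names and label grid are illustrative stand-ins, since the paper defines the neighborhood more carefully.

```python
def homogeneity_index(labels, row, col, classes, radius=1):
    """Relative abundance of each class in the (2*radius+1)^2 window around
    a pixel: the spatial feature vector fed to the second classifier."""
    rows, cols = len(labels), len(labels[0])
    counts = {c: 0 for c in classes}
    total = 0
    for r in range(max(0, row - radius), min(rows, row + radius + 1)):
        for c in range(max(0, col - radius), min(cols, col + radius + 1)):
            counts[labels[r][c]] += 1
            total += 1
    return [counts[c] / total for c in classes]

# Current labeling of a tiny 3x3 image (e.g. from the spectral classifier)
labels = [
    ["water", "water", "soil"],
    ["water", "soil",  "soil"],
    ["soil",  "soil",  "soil"],
]
features = homogeneity_index(labels, 1, 1, ["water", "soil"])
```

Recomputing these features from the current labeling at each iteration is what lets the two classifiers exchange information: the spatial classifier sees how the spectral classifier's decisions are arranged around each pixel.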
The Geographically Weighted Regression (GWR) is a method of spatial statistical analysis which allows the exploration of geographical differences in the linear effect of one or more predictor variables upon a response variable. The parameters of this linear regression model are locally determined for every point of the space by processing a sample of distance decay weighted neighboring observations. While this use of locally linear regression has proved appealing in the area of spatial econometrics, it also presents some limitations. First, the form of the GWR regression surface is globally defined over the whole sample space, although the parameters of the surface are locally estimated for every space point. Second, the GWR estimation is founded on the assumption that all predictor variables are equally relevant in the regression surface, without dealing with spatially localized collinearity problems. Third, time dependence among observations taken at consecutive time points is not considered as information-bearing for future predictions. In this paper, a tree-structured approach is adapted to recover the functional form of a GWR model only at the local level. A stepwise approach is employed to determine the local form of each GWR model by selecting only the most promising predictors. Parameters of these predictors are estimated at every point of the local area. Finally, a time-space transfer technique is tailored to capitalize on the time dimension of GWR trees learned in the past and to adapt them towards the present. Experiments confirm that the tree-based construction of GWR models improves both the local estimation of parameters of GWR and the global estimation of parameters performed by classical model trees. Furthermore, the effectiveness of the time-space transfer technique is investigated.
Nowadays, ubiquitous sensor stations are deployed worldwide in order to measure several geophysical variables (e.g. temperature, humidity, light) for a growing number of ecological and industrial processes. Although these variables are, in general, measured over large zones and long (potentially unbounded) periods of time, stations cannot cover every spatial location. On the other hand, due to their huge volume, the produced data cannot be entirely recorded for future analysis. In this scenario, summarization, i.e. the computation of aggregates of the data, can be used to reduce the amount of produced data stored on disk, while interpolation, i.e. the estimation of unknown data at each location of interest, can be used to supplement station records. We illustrate a novel data mining solution, named interpolative clustering, that has the merit of addressing both these tasks in time-evolving, multivariate geophysical applications. It yields a time-evolving clustering model in order to summarize geophysical data, and computes a weighted linear combination of cluster prototypes in order to predict data. Clustering is done by accounting for the local presence of the spatial autocorrelation property in the geophysical data. The weights of the linear combination are defined so as to reflect the inverse distance of the unseen data to each cluster geometry. The cluster geometry is represented through shape-dependent sampling of the geographic coordinates of the clustered stations. Experiments performed with several data collections investigate the trade-off between the summarization capability and the predictive accuracy of the presented interpolative clustering algorithm.
Technologies in available biomedical repositories do not yet provide adequate mechanisms to support the understanding and analysis of the stored content. In this project we investigate this problem under different perspectives. Our contribution is the design of computational solutions for the analysis of biomedical documents and images. These integrate sophisticated technologies and innovative approaches of Information Extraction, Data Mining and Machine Learning to perform descriptive tasks of knowledge discovery from biomedical repositories.
The growing integration of wind turbines into the power grid can only be balanced with precise forecasts of upcoming energy production. This information serves as the basis for operation and management strategies for a reliable and economical integration into the power grid. A precise forecast needs to overcome the problems of variable energy production caused by fluctuating weather conditions. In this paper, we define a data mining approach which processes a past set of wind power measurements of a wind turbine and extracts a robust prediction model. We resort to a time series clustering algorithm in order to extract a compact, informative representation of the time series of wind power measurements in the past set. We use the cluster prototypes for predicting the upcoming wind power of the turbine. We illustrate a case study with real data collected from a wind turbine installed in the Apulia region.
This special issue of the Journal of Intelligent Information Systems (JIIS) features papers from the first International Workshop on New Frontiers in Mining Complex Patterns (NFMCP 2012), which was held in Bristol, UK, on September 24th, 2012, in conjunction with the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD 2012). The first paper, 'Link Classification with Probabilistic Graphs', by Nicola Di Mauro, Claudio Taranto and Floriana Esposito, proposes two machine learning techniques for the link classification problem in relational data, exploiting the probabilistic graph representation. The second paper, 'Hierarchical Object-Driven Action Rules', by Ayman Hajja, Zbigniew W. Ras and Alicja A. Wieczorkowska, proposes a hybrid action rule extraction approach that combines key elements from both the classical action rule mining approach and the object-driven action rule extraction approach, in order to discover action rules from object-driven information systems.
Multi-Relational Data Mining (MRDM) refers to the process of discovering implicit, previously unknown and potentially useful information from data scattered across multiple tables of a relational database. Following the mainstream of MRDM research, we tackle the regression task, where the goal is to examine samples of past experience with known continuous answers (response) and generalize to future cases through an inductive process. Mr-SMOTI, the solution we propose, resorts to the structural approach in order to recursively partition data stored in a tightly coupled database and build a multi-relational model tree which captures the linear dependence between the response variable and one or more explanatory variables. The model tree is top-down induced by choosing, at each step, either to partition the training space or to introduce a regression variable in the linear models at the leaves. The tight coupling with the database makes the knowledge of data structures (foreign keys) available free of charge to guide the search in the multi-relational pattern space. Experiments on artificial and real databases demonstrate that, in general, Mr-SMOTI outperforms both SMOTI and M5', which are two propositional model tree induction systems, as well as TILDE-RT, which is a state-of-the-art structural model tree induction system.
Regression inference in network data is a challenging task in machine learning and data mining. Network data describe entities represented by nodes, which may be connected with (related to) each other by edges. Many network datasets are characterized by a form of autocorrelation where the values of the response variable at a given node depend on the values of the variables (predictor and response) at the nodes connected to the given node. This phenomenon is a direct violation of the assumption of independent and identically distributed (i.i.d.) observations. At the same time, it offers a unique opportunity to improve the performance of predictive models on network data, as inferences about one entity can be used to improve inferences about related entities. In this paper, we propose a data mining method that explicitly considers autocorrelation when building regression models from network data. The method is based on the concept of predictive clustering trees (PCTs), which can be used both for clustering and predictive tasks: PCTs are decision trees viewed as hierarchies of clusters and provide symbolic descriptions of the clusters. In addition, PCTs can be used for multi-objective prediction problems, including multi-target regression and multi-target classification. Empirical results on real world problems of network regression show that the proposed extension of PCTs performs better than traditional decision tree induction when autocorrelation is present in the data.
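As a toy illustration of the variance-reduction criterion underlying PCT-style splits (simplified to a single predictor and ignoring the autocorrelation term the paper adds), one can pick the threshold that minimizes the summed within-partition variance of the target:

```python
def variance(vals):
    """Population variance of a list of numbers."""
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals) / len(vals)

def best_split(x, y):
    """Threshold on predictor x minimizing summed within-partition variance of y."""
    best_score, best_t = None, None
    for t in sorted(set(x))[:-1]:  # candidate thresholds: all but the max value
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        score = len(left) * variance(left) + len(right) * variance(right)
        if best_score is None or score < best_score:
            best_score, best_t = score, t
    return best_t
```

Each split produces two tighter clusters of the response; recursing on the partitions yields the tree-shaped hierarchy of clusters that PCTs formalize.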
In predictive data mining tasks, we should account for the autocorrelation of both the independent variables and the dependent variable, which can be observed in the neighborhood of a target node as well as at the node itself. The prediction at a target node should be based on the values of its neighbours, which may even be unavailable. To address this problem, the values of the neighbours should be inferred collectively. We present a novel computational solution that performs collective inference in a network regression task. We define an iterative algorithm that makes regression inferences about multiple nodes simultaneously and feeds the more reliable predictions of the previous models back into the labeled network. Experiments investigate the effectiveness of the proposed algorithm on spatial networks.
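A minimal sketch of the collective-inference idea, under an assumed simple smoothing scheme of our own (not the paper's algorithm): each node starts from a local estimate and is iteratively pulled toward the mean of its neighbours' current predictions, so predictions for related nodes inform one another.

```python
def collective_regression(local_est, edges, alpha=0.5, iters=50):
    """local_est: {node: initial local prediction}; edges: {node: [neighbours]}.
    Iteratively blends each node's local estimate with its neighbours'
    current predictions (weight alpha on the neighbourhood)."""
    pred = dict(local_est)
    for _ in range(iters):
        new_pred = {}
        for node in pred:
            nbrs = edges.get(node, [])
            if nbrs:
                nbr_mean = sum(pred[n] for n in nbrs) / len(nbrs)
                new_pred[node] = (1 - alpha) * local_est[node] + alpha * nbr_mean
            else:
                new_pred[node] = local_est[node]
        pred = new_pred
    return pred
```

On a connected graph this iteration converges to a fixed point where each prediction balances local evidence against neighbourhood consensus, which is the essence of exploiting positive autocorrelation.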
Network data describe entities represented by nodes, which may be connected with (related to) each other by edges. Many network datasets are characterized by a form of autocorrelation, where the value of a variable at a given node depends on the values of variables at the nodes it is connected with. This phenomenon is a direct violation of the assumption that data are independently and identically distributed. At the same time, it offers a unique opportunity to improve the performance of predictive models on network data, as inferences about one entity can be used to improve inferences about related entities. Regression inference in network data is a challenging task. While many approaches for network classification exist, there are very few approaches for network regression. In this paper, we propose a data mining algorithm, called NCLUS, that explicitly considers autocorrelation when building regression models from network data. The algorithm is based on the concept of predictive clustering trees (PCTs), which can be used for clustering, prediction and multi-target prediction, including multi-target regression and multi-target classification. We evaluate our approach on several real world problems of network regression, coming from the areas of social and spatial networks. Empirical results show that our algorithm performs better than PCTs learned by completely disregarding network information, as well as PCTs that are tailored for spatial data but do not take autocorrelation into account, and a variety of other existing approaches.
Emerging real life applications, such as environmental compliance, ecological studies and meteorology, are characterized by real-time data acquisition through remote sensor networks. The most important aspect of the sensor readings is that they comprise a space dimension and a time dimension, both of which are information bearing. Additionally, they usually arrive at a rapid rate in a continuous, unbounded stream. Streaming prevents us from storing all readings and performing multiple scans of the entire data set. The drift of the data distribution poses the additional problem of mining patterns which may change over time. We address these challenges for trend cluster discovery, that is, the discovery of clusters of spatially close sensors which transmit readings whose temporal variation, called the trend polyline, is similar along the time horizon of a window. We present a stream framework which segments the stream into equally-sized windows, computes intra-window trend clusters online and stores these trend clusters in a database. Trend clusters can be queried offline at any time, to determine trend clusters along larger windows (i.e. windows of windows). Experiments with several streams demonstrate the effectiveness of the proposed framework in discovering accurate and humanly relevant trend clusters.
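The intra-window step can be sketched as follows, assuming a greedy grouping by trend similarity; the spatial-closeness constraint and the paper's actual similarity measure are omitted for brevity, so this is an illustration rather than the framework's algorithm.

```python
def trend_clusters(readings, threshold):
    """readings: {sensor: [v_t0, v_t1, ...]} over one window.
    Returns a list of (sensor_list, trend_polyline) pairs."""
    clusters = []
    for sensor, series in readings.items():
        for members, trend in clusters:
            # join the first cluster whose trend stays within the threshold
            if max(abs(a - b) for a, b in zip(series, trend)) <= threshold:
                members.append(sensor)
                n = len(members)
                # update the trend polyline as a running pointwise mean
                trend[:] = [t + (v - t) / n for t, v in zip(trend, series)]
                break
        else:
            clusters.append(([sensor], list(series)))
    return clusters
```

Each cluster's polyline is the compact summary stored per window; querying larger windows then amounts to concatenating or re-clustering these summaries.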
Nowadays, ubiquitous sensor stations are deployed to measure geophysical fields for several ecological and environmental processes. Although these fields are measured at the specific locations of the stations, geo-statistical problems demand inference processes to supplement, smooth and standardize recorded data. We study how predictive regional trees can supplement data sampled periodically in a ubiquitous sensing scenario. Data records that are similar to each other are clustered according to a rectangular decomposition of the region of analysis; a predictive model is associated with the region covered by each cluster. The cluster model depicts the spatial variation of data over a map, while the predictive model supplements any unknown record that is recognized as belonging to a cluster region. We illustrate an incremental algorithm to yield time-evolving predictive regional trees that account for the fact that the statistical properties of the recorded data may change over time. This algorithm is evaluated with spatio-temporal data collections.
Processes are everywhere in our daily lives. More and more information about executions of processes is recorded in event logs by several information systems. Process mining techniques are used to analyze historic information hidden in event logs and to provide surprising insights for managers, system developers, auditors, and end users. While existing process mining techniques mainly analyze full process instances (cases), this paper extends the analysis to running cases, which have not yet completed. For running cases, process mining can be used to predict future events. This forecasting ability can provide insights for conformance checking and support decision making. This paper details a process mining approach which uses predictive clustering to equip an execution scenario with a prediction model. This model accounts for the recent events of running cases to predict the characteristics of future events. Several tests with benchmark logs investigate the viability of the proposed approach.
The rapid growth in the amount of spatial data available in Geographical Information Systems has given rise to substantial demand for data mining tools which can help uncover interesting spatial patterns. We advocate the relational mining approach to spatial domains, due to both the various forms of spatial correlation which characterize these domains and the need to handle spatial relationships in a systematic way. We present some major achievements in this research direction and point out some open problems.
We define a new kind of stream cube, called the geo-trend stream cube, which uses trends to aggregate a numeric measure that is streamed by a sensor network and organized around space and time dimensions. We specify space-time roll-up and drill-down operations to explore trends at coarse-grained and fine-grained hierarchical views.
In recent years, growing interest has been devoted to trajectory data mining applications that support mobility prediction, with the aim of anticipating or pre-fetching possible services. Proposed approaches typically consider only the spatiotemporal information provided by the collected trajectories. However, in some scenarios, such as tourist support, semantic information which expresses the needs and interests of the user (tourist) should be taken into account. This semantic information can be extracted from textual documents already consulted by the tourists. In this paper, we present the application of a time-slice density estimation approach that suggests/predicts the next destination of the tourist. In particular, time-slice density estimation measures the rate of change of the tourist's interests at a given geographical position over a user-defined time horizon. Tourist interests depend both on the geographical position of the tourist with respect to a reference system and on semantic information provided by geo-referenced documents associated with the visited sites.
We consider distributed computing environments where geo-referenced sensors feed a unique central server with numeric, uni-dimensional data streams. Knowledge discovery from these geographically distributed data streams poses several challenges, including the requirement of data summarization in order to store the streamed data in a central server with limited memory. We propose an enhanced segmentation algorithm that groups data sources in the same spatial cluster if they stream data which evolve according to a close trajectory over time. A trajectory is constructed by tracking only the data points which represent a change of trend in the associated spatial cluster. Clusters of trajectories are discovered on-the-fly and stored in the database. Experiments prove the effectiveness and accuracy of our approach.
Advances in pervasive computing and sensor technologies have paved the way for the explosive living ubiquity of geo-physical data streams. The management of the massive and unbounded streams of sensor data produced poses several challenges, including the real-time application of summarization techniques, which should allow the storage and query of this amount of georeferenced and timestamped data in a server with limited memory. In order to face this issue, we have designed a summarization technique, called SUMATRA, which segments the stream into windows, computes summaries window-by-window and stores these summaries in a database. Trend clusters are discovered as summaries of each window. They are clusters of georeferenced data which vary according to a similar trend along the window time horizon. Several compression techniques are also investigated to derive a compact, but accurate representation of these trends for storage in the database. A learning strategy to automatically choose the best trend compression technique is designed. Finally, an in-network modality for tree-based trend cluster discovery is investigated in order to achieve an efficacious aggregation schema which drastically reduces the number of bytes transmitted across the network and maintains a longer network lifespan. This schema is mapped onto the routing structure of a tree-based WSN topology. Experiments performed with several data streams of real sensor networks assess the summarization capability, the accuracy and the efficiency of the proposed summarization schema.
Advances in pervasive computing and sensor technologies have paved the way for the explosive living ubiquity of geo-physical data streams. The management of the massive and unbounded streams of sensor data produced poses several challenges, including the real-time application of summarization techniques, which should allow the storage and query of this amount of georeferenced and timestamped data in a server with limited memory. In order to face this issue, we have designed a summarization technique, called SUMATRA, which segments the stream into windows, computes summaries window-by-window and stores these summaries in a database. Trend clusters are discovered as summaries of each window. They are clusters of georeferenced data which vary according to a similar trend along the window time horizon. Several compression techniques are also investigated to derive a compact, but accurate representation of these trends for storage in the database. A learning strategy to automatically choose the best trend compression technique is designed. Finally, an in-network modality for tree-based trend cluster discovery is investigated in order to achieve an efficacious aggregation schema which drastically reduces the number of bytes transmitted across the network and maintains a longer network lifespan. This schema is mapped onto the routing structure of a tree-based WSN topology. Experiments performed with several data streams of real sensor networks assess the summarization capability, the accuracy and the efficiency of the proposed summarization schema.
Keywords: classification, transductive learning, collective learning
Learning classifiers of spatial data presents several issues, such as the heterogeneity of spatial objects, the implicit definition of spatial relationships among objects, spatial autocorrelation and the abundance of unlabelled data which potentially convey a large amount of information. The first three issues are due to the inherent structure of the spatial units of analysis, which can be easily accommodated if a (multi-)relational data mining approach is considered. The fourth issue demands the adoption of a transductive setting, which aims to make predictions for a given set of unlabelled data. Transduction is also motivated by the contiguity of the concept of positive autocorrelation, which typically affects spatial phenomena, with the smoothness assumption which characterizes the transductive setting. In this work, we investigate a relational approach to spatial classification in a transductive setting. Computational solutions to the main difficulties met in this approach are presented. In particular, a relational upgrade of the naïve Bayes classifier is proposed as a discriminative model, an iterative algorithm is designed for the transductive classification of unlabelled data, and a distance measure between relational descriptions of spatial objects is defined in order to determine the k-nearest neighbors of each example in the dataset. The computational solutions have been tested on two real-world spatial datasets. The transformation of spatial data into a multi-relational representation and the experimental results are reported and commented upon.
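A simplified propositional stand-in for the transductive step (the paper uses relational descriptions, a relational naïve Bayes classifier and a relational distance; here we assume plain numeric features and a k-NN vote that iteratively labels the unlabelled examples):

```python
import math
from collections import Counter

def knn_transduce(labeled, unlabeled, k=3, iters=5):
    """labeled: [(x, y)] with numeric-tuple x; unlabeled: [x].
    Repeatedly labels the unlabeled points by a k-NN majority vote over all
    examples labeled so far (given plus pseudo-labeled)."""
    pool = list(labeled)
    pred = {}
    for _ in range(iters):
        for i, x in enumerate(unlabeled):
            nbrs = sorted(pool, key=lambda xy: math.dist(x, xy[0]))[:k]
            pred[i] = Counter(y for _, y in nbrs).most_common(1)[0][0]
        # feed the pseudo-labels back into the pool for the next iteration
        pool = list(labeled) + [(unlabeled[i], y) for i, y in pred.items()]
    return [pred[i] for i in range(len(unlabeled))]
```

Feeding pseudo-labels back into the neighbour pool is what makes the procedure transductive: predictions on unlabelled data smooth one another, mirroring positive autocorrelation.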
Many spatial phenomena are characterized by positive autocorrelation, i.e., variables take similar values at pairs of close locations. This property is strongly related to the smoothness assumption made in transductive learning, according to which if points in a high-density region are close, corresponding outputs should also be close. This observation, together with the prior availability of large sets of unlabelled data, which is typical in spatial applications, motivates the investigation of transductive learning for spatial data mining. The task considered in this work is spatial regression. We apply the co-training technique in order to iteratively learn two separate models, such that each model is used to make predictions on unlabeled data for the other. One model is built on the set of attribute-value observations measured at specific sites, while the other is built on the set of aggregated values measured for the same attributes in nearby sites. Experiments prove the effectiveness of the proposed approach on spatial domains.
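A minimal co-training sketch for regression with two numeric views (standing in for the attribute view and the aggregated-neighbourhood view), using 1-nearest-neighbour regressors for brevity. Averaging the two pseudo-labels into a shared pool is our simplification of the paper's scheme, in which each model's labels feed only the other model.

```python
def one_nn_predict(train, x):
    """1-NN regression on a single numeric feature: train is [(feature, target)]."""
    return min(train, key=lambda fv: abs(fv[0] - x))[1]

def co_train(view_a, view_b, y, unlabeled, rounds=2):
    """view_a/view_b: {i: feature}; y: {i: target} for the labeled i only.
    Labels the queued unlabeled examples one per round."""
    y = dict(y)
    queue = list(unlabeled)
    for _ in range(rounds):
        if not queue:
            break
        i = queue.pop(0)
        # each view pseudo-labels the point from the examples labeled so far
        pa = one_nn_predict([(view_a[j], y[j]) for j in y], view_a[i])
        pb = one_nn_predict([(view_b[j], y[j]) for j in y], view_b[i])
        y[i] = (pa + pb) / 2
    return y
```

The strength of the approach lies in the two views disagreeing in informative ways: when the attribute view is ambiguous, the neighbourhood view can still pin the prediction down, and vice versa.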
Consider a multi-relational database, to be used for classification, that contains a large amount of unlabeled data, since the cost of labeling such data is prohibitive. Transductive learning, which learns from labeled as well as from unlabeled data already known at learning time, is highly suited to this scenario. In this paper, we construct multiple views from a relational database by considering different subsets of the tables contained in it. These views are used to boost the classification of examples in a co-training schema. The automatically generated views allow us to overcome the independence problem that negatively affects the performance of co-training methods. Our experimental evaluation empirically shows that co-training is beneficial in the transductive learning setting when mining multi-relational data and that our approach works well with only a small amount of labeled data.
The information acquisition in a pervasive sensor network is often affected by faults due to power outage at nodes, wrong time synchronizations, interference, network transmission failures, sensor hardware issues or excessive energy consumption for communications. These issues impose a trade-off between the precision of the measurements and the costs of communication and processing, which are directly proportional to the number of sensors.
In many real-time applications, such as wireless sensor network monitoring, traffic control or health monitoring systems, it is required to analyze continuous and unbounded geographically distributed streams of data (e.g. temperature or humidity measurements transmitted by sensors of weather stations). Storing and querying geo-referenced stream data poses specific challenges both in time (real-time processing) and in space (limited storage capacity). Summarization algorithms can be used to reduce the amount of data to be permanently stored in a data warehouse without losing information for subsequent analysis. In this paper, we present a framework in which data streams are seen as time-varying realizations of stochastic processes. Signal compression techniques, based on transformed domains, are applied and compared with a geometrical segmentation in terms of compression efficiency and accuracy in the subsequent reconstruction.
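One of the simplest summarization baselines in this spirit is a piecewise (segment-mean) approximation of a stream window; the sketch below is illustrative and not the framework's actual compression technique.

```python
def paa_compress(series, n_segments):
    """Summarize a series by the mean of each of n_segments (near-)equal chunks."""
    step = len(series) / n_segments
    return [
        sum(series[int(i * step):int((i + 1) * step)])
        / (int((i + 1) * step) - int(i * step))
        for i in range(n_segments)
    ]

def paa_reconstruct(means, length):
    """Rebuild an approximate series by repeating each segment mean."""
    step = length / len(means)
    return [means[min(int(t / step), len(means) - 1)] for t in range(length)]
```

Comparing the reconstruction against the original window gives exactly the kind of compression-ratio versus accuracy trade-off that the paper measures for transform-based techniques and geometrical segmentation.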
Spatio-temporal data collected in sensor networks are often affected by faults due to power outages at nodes, wrong time synchronizations, interference, network transmission failures, sensor hardware issues or high energy consumption during communications. Therefore, the acquisition of information by wireless sensor networks is a challenging step in monitoring physical ubiquitous phenomena (e.g. weather, pollution, traffic). This issue gives rise to a fundamental trade-off: a higher density of sensors provides more data, higher resolution and better accuracy, but requires more communication and processing. A data mining approach to reduce communication and energy requirements is investigated: the number of transmitting sensors is decreased as much as possible, while keeping a reasonable degree of data accuracy. Kriging techniques and trend cluster discovery are employed to estimate unknown data at any un-sampled location in space and at any time point in the past. Kriging is a group of statistical interpolation techniques, suited to spatial data, which estimate the unknown data at any spatial location by a proper weighted mean of nearby observed data. Trend clusters are stream patterns which compactly represent sensor data by means of spatial clusters having prominent data trends in time. Kriging is here applied to estimate unknown data taking into account a spatial correlation model of the sensor network. Trends are used as a guideline to transfer this model across the time horizon of the trend itself. Experiments are performed with a real sensor data network, in order to evaluate this interpolation technique and demonstrate that Kriging and trend clusters outperform, in terms of accuracy, interpolation competitors like Nearest Neighbor or Inverse Distance Weighting.
With the development of AIS (Automatic Identification System), more and more vessels are equipped with AIS technology. Vessels' reports (e.g. position in geodetic coordinates, speed, course), periodically transmitted by AIS, have become an abundant and inexpensive source of ubiquitous motion information for the maritime surveillance. In this study, we investigate the problem of processing the ubiquitous data, which are enclosed in the AIS messages of a vessel, in order to display an interpolation of the itinerary of the vessel. We define a graph-aware itinerary mining strategy, which uses spatio-temporal knowledge enclosed in each AIS message to constrain the itinerary search. Experiments investigate the impact of the proposed spatio-temporal data mining algorithm on the accuracy and efficiency of the itinerary interpolation process, also when reducing the amount of AIS messages processed per vessel.
Ubiquitous sensor stations continuously measure several geophysical fields over large zones and long (potentially unbounded) periods of time. However, observations can never cover every location nor every time. In addition, due to its huge volume, the data produced cannot be entirely recorded for future analysis. In this scenario, interpolation, i.e., the estimation of unknown data in each location or time of interest, can be used to supplement station records. Although in GIScience there has been a tendency to treat space and time separately, integrating space and time could yield better results than treating them separately when interpolating geophysical fields. According to this idea, a spatiotemporal interpolation process, which accounts for both space and time, is described here. It operates in two phases. First, the exploration phase addresses the problem of interaction. This phase is performed on-line using data recorded from a network throughout a time window. The trend cluster discovery process determines prominent data trends and geographically-aware station interactions in the window. The result of this process is given before a new data window is recorded. Second, the estimation phase uses the inverse distance weighting approach both to approximate observed data and to estimate missing data. The proposed technique has been evaluated using two large real climate sensor networks. The experiments empirically demonstrate that, in spite of a notable reduction in the volume of data, the technique guarantees accurate estimation of missing data.
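The estimation phase's inverse distance weighting can be sketched as follows: a missing value is a distance-weighted mean of the observed station values, with closer stations counting more. The power parameter and the toy coordinates are illustrative assumptions.

```python
import math

def idw(known, target, power=2.0):
    """known: [((x, y), value)] observed stations; target: (x, y) to estimate."""
    num = den = 0.0
    for (x, y), v in known:
        d = math.dist((x, y), target)
        if d == 0:
            return v  # exact hit: return the observed value directly
        w = 1.0 / d ** power  # closer stations get larger weights
        num += w * v
        den += w
    return num / den
```

In the two-phase scheme described above, the trend clusters discovered online would restrict `known` to the stations interacting with the target, so the weighting respects the geographically-aware station groups.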