Italo Epicoco
Role
Researcher
Organization
Università del Salento
Department
Dipartimento di Ingegneria dell'Innovazione
Scientific Area
Area 09 - Industrial and Information Engineering
Scientific Disciplinary Sector
ING-INF/05 - Information Processing Systems
ERC Sector, Level 1
PE - Physical sciences and engineering
ERC Sector, Level 2
PE6 Computer Science and Informatics: Informatics and information systems, computer science, scientific computing, intelligent systems
ERC Sector, Level 3
PE6_2 Computer systems, parallel/distributed systems, sensor networks, embedded systems, cyberphysical systems
In the High Performance Computing context, the performance evaluation of a parallel algorithm is carried out mainly by considering the elapsed time of the parallel application for different numbers of cores and different problem sizes (for scaled speedup). Typically, parallel applications embed mechanisms to use the allocated resources efficiently, guaranteeing, for example, good load balancing and reducing the parallel overhead. Unfortunately, this assumption does not hold for coupled models. These models were born from the coupling of stand-alone climate models. The component models are developed independently of each other and follow different development roadmaps. Moreover, they are characterized by different levels of parallelization as well as different workload requirements, and each has its own scalability curve. Considering a coupled model as a single parallel application, we note the lack of a policy for balancing the computational load on the available resources. This work addresses the issues related to the performance evaluation of a coupled model and answers the following questions: once a given number of processors has been allocated for the whole coupled model, how should the run be configured in order to balance the workload? How many processors must be assigned to each of the component models? The methodology described here has been applied to evaluate the scalability of the CMCC-MED coupled model designed by the ANS Division of the CMCC. The evaluation has been carried out on two different computational architectures: a scalar cluster, based on IBM Power6 processors, and a vector cluster, based on NEC SX-9 processors.
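As a hedged illustration of the load-balancing question posed above, the following Python sketch picks, for a fixed total core count, the split between two components that minimizes the runtime of the slower one. The component names and timing functions are hypothetical placeholders, not the CMCC-MED methodology or measured data.

```python
# Hypothetical illustration: split P total cores between two coupled
# components so that their runtimes are as balanced as possible.
# The timing models below are made-up placeholders, not measured data.

def t_ocean(p):
    # assumed Amdahl-like model: serial part + parallel part
    return 120.0 + 3600.0 / p

def t_atmosphere(p):
    return 300.0 + 7200.0 / p

def best_split(total_cores):
    best = None
    for p_ocean in range(1, total_cores):
        p_atm = total_cores - p_ocean
        # the coupled step is paced by the slower component
        elapsed = max(t_ocean(p_ocean), t_atmosphere(p_atm))
        if best is None or elapsed < best[0]:
            best = (elapsed, p_ocean, p_atm)
    return best

if __name__ == "__main__":
    elapsed, p_ocean, p_atm = best_split(256)
    print(f"ocean: {p_ocean} cores, atmosphere: {p_atm} cores, "
          f"coupled step ~ {elapsed:.1f} s")
```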
In this paper we describe a grid problem solving environment we developed for financial applications. Its development is based on a portlet framework we specifically developed and on a set of Web APIs that encapsulate all grid control and computation logic. Even though nowadays grid portals are characterized by varied and different features and are implemented in very different programming languages and technologies, they have many structural aspects in common. For this reason we decided to design and implement a set of grid-specific Web APIs, which we called GRB WAPI. Through them, a portal developer does not have to deal with grid technical details and can work at a high level of design, concentrating on aspects that concern presentation, such as portal usability and functionality. We discarded the idea of developing a traditional library in order to free portal developers from a particular implementation technology. Thanks to this choice, the portal presentation logic can be implemented in any web technology and can run on a different server.
Portfolio optimisation is a crucial problem that every financial operator has to deal with. Nowadays, the possibility to process large amounts of data, to generate more confident forecasts and to solve more complex optimisation problems is powered by the adoption of advanced computing systems. This paper presents an efficient and effective decision support system for portfolio optimisation implemented on a grid platform. The system relies on a methodological kernel which efficiently integrates simulation and optimisation techniques. A large set of numerical experiments has been carried out to measure the performance of the system both in terms of computational efficiency and improved solution quality.
A Workflow Management System (WFMS) is a fundamental component enabling the integration of data, applications and a wide set of project resources. Although a number of scientific WFMSs support this task, many analysis pipelines require large-scale Grid computing infrastructures to cope with their high compute and storage requirements. Such scientific workflows complicate the management of resources, especially when they are offered by several resource providers and managed by different Grid middleware, since resource access must be synchronised in advance to allow reliable workflow execution. Different types of Grid middleware such as gLite, Unicore and Globus are used around the world and may cause interoperability issues if applications involve two or more of them. In this paper we describe the ProGenGrid Workflow Management System, whose main goal is to provide interoperability among these different Grid middleware when executing workflows. It allows the composition of batch, parameter sweep and MPI-based jobs. The ProGenGrid engine implements the logic to execute such jobs by using a standard OGF-compliant language, JSDL, which has been extended for this purpose. Currently, we are testing our system on some bioinformatics case studies in the International Laboratory of Bioinformatics (LIBI) Project (www.libi.it).
In this paper the use of augmented reality and cloud computing technology to enrich the scenes of cultural heritage contexts is proposed. The main objective is to develop a mobile application capable of improving the user's cultural experience during city sightseeing through the addition of detailed digital contents related to the site or monument being viewed. The Wikitude SDK is the software library and framework used for the mobile application development. The huge amount of cultural digital contents (mainly represented by images) justifies the exploitation of a cloud computing environment to obtain an innovative, multi-platform and user-friendly augmented reality solution. In particular, for the purpose of our research we have chosen KVM and the OpenNebula open-source cloud platform as private cloud technology, since it exhibits features like openness, flexibility, simplicity and scalability. The result of the work is validated through test beds realized both in the laboratory, using images captured from a PC display, and in a real environment.
We present four CUDA-based parallel implementations of the Space-Saving algorithm for determining frequent items on a GPU. The first variant exploits the open-source CUB library to simplify the implementation of a user-defined reduction, whilst the second is based on our own implementation of the parallel reduction. The third and the fourth, built on the previous variants, are meant to improve the performance by taking advantage of hardware-based atomic instructions. In particular, we implement a warp-based ballot mechanism to accelerate the Space-Saving updates. We show that our implementation of the parallel reduction, coupled with the ballot-based update mechanism, is the fastest, and we provide extensive experimental results regarding its performance.
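For readers unfamiliar with Space-Saving, the following minimal Python sketch shows the standard sequential update that the CUDA variants above parallelize; it illustrates the textbook algorithm only and is not the GPU code described in the paper.

```python
# Minimal sequential Space-Saving sketch (textbook version, not the CUDA code).
# Keeps at most k counters; when a new item arrives and the summary is full,
# it replaces the item with the minimum counter and inherits its count.

def space_saving(stream, k):
    counters = {}                            # item -> estimated count
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            min_item = min(counters, key=counters.get)
            min_count = counters.pop(min_item)
            counters[item] = min_count + 1   # overestimate bounded by min_count
    return counters

if __name__ == "__main__":
    data = list("abracadabra") * 3 + list("zzzz")
    print(space_saving(data, k=4))
```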
This work describes the optimization and parallelization of the OASIS3 coupler. Performance evaluation and profiling have been carried out by means of the CMCC-MED coupled model, developed at the Euro-Mediterranean Centre for Climate Change (CMCC) and currently running on a NEC SX9 cluster. Our experiments highlighted that the extrapolation (accomplished by the extrap function) and interpolation (implemented by the scriprmp function) transformations take the most time. Optimization concerned I/O operations, reducing coupling time by 27%. Parallelization of OASIS3 represents a further step towards the overall improvement of the whole coupled model. Our proposed parallel approach distributes fields over a pool of available processes. Each process applies coupling transformations to its assigned fields. This approach restricts the parallelization level to the number of coupling fields. However, it can be fully combined with a parallelization approach considering the geographical domain distribution. Finally, a quantitative comparison of the parallel coupler with the OASIS3 pseudo-parallel version is proposed.
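The field-over-processes distribution described above could look, in a schematic and hedged form, like the following mpi4py sketch. The field names and the round-robin assignment rule are illustrative assumptions, not the actual OASIS3 implementation.

```python
# Illustrative sketch (not OASIS3 code): distribute coupling fields over the
# available MPI processes in round-robin fashion, then let each process apply
# the coupling transformations only to the fields it owns.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# hypothetical list of coupling fields exchanged between components
coupling_fields = ["sst", "sea_ice", "heat_flux", "wind_stress_u",
                   "wind_stress_v", "freshwater_flux"]

# round-robin assignment: the parallelism is bounded by the number of fields
my_fields = [f for i, f in enumerate(coupling_fields) if i % size == rank]

for field in my_fields:
    # placeholder for the interpolation/extrapolation transformations
    print(f"rank {rank}: transforming field '{field}'")

comm.Barrier()
```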
The problem of mining Correlated Heavy Hitters (CHH) from a two-dimensional data stream has been introduced recently, and a deterministic algorithm based on the use of the Misra–Gries algorithm has been proposed by Lahiri et al. to solve it. In this paper we present a new counter-based algorithm for tracking CHHs, formally prove its error bounds and correctness and show, through extensive experimental results, that our algorithm outperforms the Misra–Gries based algorithm with regard to accuracy and speed whilst requiring asymptotically much less space.
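As a rough illustration of the counter-based idea (a schematic two-level summary, not the specific algorithm or error bounds analysed in the paper), one can nest a secondary frequent-items summary inside each primary counter: the primary summary tracks candidate heavy hitters on the first coordinate, and each of its counters tracks the candidate correlated values of the second coordinate.

```python
# Schematic two-level counter structure for Correlated Heavy Hitters.
# This is an illustrative nesting of Space-Saving-style summaries, not the
# exact algorithm proved correct in the paper.

def ss_update(counters, item, k):
    """Standard Space-Saving update on a dict of at most k counters."""
    if item in counters:
        counters[item] += 1
    elif len(counters) < k:
        counters[item] = 1
    else:
        victim = min(counters, key=counters.get)
        counters[item] = counters.pop(victim) + 1

def chh_update(primary, x, y, k1, k2):
    """Track x as a candidate heavy hitter; track y among values paired with x."""
    if x not in primary and len(primary) >= k1:
        victim = min(primary, key=lambda key: primary[key][0])
        count, _ = primary.pop(victim)
        primary[x] = [count, {}]      # inherit the evicted count, fresh secondary
    elif x not in primary:
        primary[x] = [0, {}]
    primary[x][0] += 1
    ss_update(primary[x][1], y, k2)

if __name__ == "__main__":
    primary = {}
    stream = [("ip1", 80), ("ip1", 80), ("ip2", 22), ("ip1", 443), ("ip2", 22)]
    for x, y in stream:
        chh_update(primary, x, y, k1=2, k2=2)
    print(primary)
```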
Nowadays grid portals are characterized by varied and different features and are implemented in very different programming languages and technologies, while still having many structural aspects in common. This paper describes a RESTful Web API, named GRB_WAPI, specifically developed for grid computing, which encapsulates all the grid control and computation logic needed to build a grid portal. Through the adoption of this API a portal developer does not have to deal with grid technical details and can focus on the high-level design of her system and on aspects that concern presentation, such as portal usability and functionality. The idea of developing a traditional library has been discarded in order to free portal developers from a particular implementation technology. Thanks to this choice, the portal presentation logic can be implemented in any web technology and can be deployed on a different server. The traditional Web Services and SOAP protocol approach has also been discarded in favour of a RESTful approach, to make the Web API lighter and to take advantage of some other aspects illustrated in the paper.
We present FDCMSS, a new sketch-based algorithm for mining frequent items in data streams. The algorithm cleverly combines key ideas borrowed from forward decay, the Count-Min and the Space Saving algorithms. It works in the time fading model, mining data streams according to the cash register model. We formally prove its correctness and show, through extensive experimental results, that our algorithm outperforms λ-HCount, a recently developed algorithm, with regard to speed, space used, precision attained and error committed on both synthetic and real datasets.
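To give a flavour of the forward-decay idea combined with a Count-Min sketch (a deliberately simplified illustration; FDCMSS's actual data structure augments the sketch cells further, as described in the paper), one can store forward-decayed weights in the sketch and normalize only at query time, so that old cells never need to be rescaled.

```python
# Simplified illustration of forward decay on top of a Count-Min sketch.
# FDCMSS's actual data structure is richer; this only shows the decay trick:
# store g(t_i - L) at update time, divide by g(t - L) at query time.
import math
import random

class DecayedCountMin:
    def __init__(self, width, depth, lam, landmark=0.0, seed=42):
        self.w, self.d, self.lam, self.L = width, depth, lam, landmark
        rng = random.Random(seed)
        self.salts = [rng.getrandbits(32) for _ in range(depth)]
        self.cells = [[0.0] * width for _ in range(depth)]

    def _g(self, age):
        return math.exp(self.lam * age)          # exponential decay function

    def update(self, item, timestamp):
        weight = self._g(timestamp - self.L)     # forward-decayed weight
        for row, salt in enumerate(self.salts):
            col = hash((salt, item)) % self.w
            self.cells[row][col] += weight

    def estimate(self, item, query_time):
        raw = min(self.cells[row][hash((salt, item)) % self.w]
                  for row, salt in enumerate(self.salts))
        return raw / self._g(query_time - self.L)  # normalize to decayed count

if __name__ == "__main__":
    sketch = DecayedCountMin(width=256, depth=4, lam=0.01)
    for t, item in enumerate(["a", "a", "b", "a", "c", "b"]):
        sketch.update(item, t)
    print(round(sketch.estimate("a", query_time=6), 3))
```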
We deal with the problem of detecting frequent items in a stream under the constraint that items are weighted, and recent items must be weighted more than older ones. This kind of problem naturally arises in a wide class of applications in which recent data is considered more useful and valuable than older, stale data. The weight assigned to an item is therefore a function of its arrival timestamp. As a consequence, whilst in traditional frequent item mining applications we need to estimate frequency counts, we are instead required to estimate decayed counts. These applications are said to work in the time fading model. Two sketch-based algorithms for processing time-decayed streams were published independently near the end of 2016. The FSSQ algorithm, besides a sketch, also uses an additional data structure called Quasi-Heap to maintain frequent items. FDCMSS, our algorithm, cleverly combines key ideas borrowed from forward decay, the Count-Min sketch and the Space Saving algorithm. Therefore, it makes sense to compare and contrast the two algorithms in order to fully understand their strengths and weaknesses. We show, through extensive experimental results, that FSSQ is better for detecting frequent items than for frequency estimation. The use of the Quasi-Heap data structure slows down the algorithm owing to the huge number of maintenance operations; therefore, FSSQ may not be able to cope with high-speed data streams. FDCMSS is better suited for frequency estimation; moreover, it is extremely fast and can be used in the context of high-speed data streams and for the detection of frequent items as well, since its recall is always greater than 99%, even when using an extremely tiny amount of space. Therefore, FDCMSS proves to be an overall good choice when recall, precision, average relative error and speed are considered jointly.
Given an array A of n elements and a value 2 ≤ k ≤ n, a frequent item or k-majority element is an element occurring in A more than n/k times. The k-majority problem requires finding all of the k-majority elements. In this paper, we deal with parallel shared-memory algorithms for frequent items; we present a shared-memory version of the Space Saving algorithm, and we study its behavior with regard to accuracy and performance on many-core and multi-core processors, including the Intel Xeon Phi accelerator. We also investigate a hybrid MPI/OpenMP version against a pure MPI-based version. Through extensive experimental results, we show that the hybrid MPI/OpenMP parallel version of the algorithm significantly enhances the performance of the earlier pure MPI version of the same algorithm. Results also show that, for this algorithm, the Intel Xeon Phi accelerator does not introduce any improvement with respect to the octa-core Xeon processor.
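As a plain restatement of the k-majority definition above, here is an exact, non-streaming check written for illustration only; it is not the Space Saving-based parallel algorithm studied in the paper.

```python
# Exact brute-force check of the k-majority definition: report every element
# occurring more than n/k times. Used only to illustrate the problem; the
# paper's algorithms use bounded-memory Space Saving summaries instead.
from collections import Counter

def k_majority(array, k):
    n = len(array)
    threshold = n / k
    return {item: count for item, count in Counter(array).items()
            if count > threshold}

if __name__ == "__main__":
    data = [1, 2, 1, 3, 1, 2, 1, 4, 2, 1]
    print(k_majority(data, k=3))   # elements occurring more than 10/3 times
```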
COSMO-CLM is a non-hydrostatic parallel atmospheric model, developed by the CLM-Community starting from the Local Model (LM) of the German Weather Service. Since 2005 it has been the reference model used by German researchers for climate studies on different temporal scales (from a few to hundreds of years) with a spatial resolution from 1 up to 50 kilometers. It is also used and developed by other meteorological research centres belonging to the Consortium for Small-scale Modelling (COSMO). The present work focuses on the analysis of the CCLM model from the computational point of view. The main goal is to verify whether the model can be optimised by means of an appropriate tuning of the input parameters, to identify the performance bottlenecks and to suggest possible approaches for further code optimisation. We started by analysing whether the strong scalability (which measures the improvement factor due to parallelism given a fixed domain size) can be improved by acting on some parameters such as the subdomain shape, the number of processes dedicated to the I/O operations, the output frequency and the communication strategies. Then we profiled the code to highlight the bottlenecks to scalability, and finally we performed a detailed performance analysis of the main kernels using the roofline model.
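For reference, the strong scalability mentioned above is usually quantified, for a fixed problem size, by the standard speedup and parallel efficiency metrics (a generic definition, not a formula specific to COSMO-CLM):

```latex
% Standard strong-scaling metrics for a fixed problem size:
% T(1) is the elapsed time on one process, T(p) the elapsed time on p processes.
\begin{align}
  S(p) &= \frac{T(1)}{T(p)} \qquad \text{(speedup)} \\
  E(p) &= \frac{S(p)}{p} = \frac{T(1)}{p\,T(p)} \qquad \text{(parallel efficiency)}
\end{align}
```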
The present work aims at evaluating the scalability performance of a high-resolution global ocean biogeochemistry model (PELAGOS025) on massively parallel architectures and the benefits in terms of time-to-solution reduction. PELAGOS025 is an on-line coupling between the Nucleus for European Modelling of the Ocean (NEMO) physical ocean model and the Biogeochemical Flux Model (BFM). Both models use a parallel domain decomposition along the horizontal dimension. The parallelisation is based on the message passing paradigm. The performance analysis has been done on two parallel architectures, an IBM BlueGene/Q at ALCF (Argonne Leadership Computing Facility) and an IBM iDataPlex with Sandy Bridge processors at the CMCC (Euro-Mediterranean Center on Climate Change). The outcome of the analysis demonstrated that the lack of scalability is due to several factors such as the I/O operations, the memory contention, the load imbalance due to the memory structure of the BFM component and, for the BlueGene/Q, the absence of a hybrid parallelisation approach.
Within the EU IS-ENES project, the deployment of an e-infrastructure providing climate scientists with an efficient virtual proximity to distributed data and distributed computing resources is required. The access point of this infrastructure is represented by the v.E.R.C. (virtual Earth system modelling Resource Centre) web portal. It allows Earth System Model (ESM) scientists to run complex distributed workflows for executing ESM experiments and accessing ESM data. This work describes the deployment of a grid prototype environment for running multi-model ensemble experiments. Considering existing grid infrastructures and services, the design of this grid prototype has been led by the necessity to build a framework that leverages the external services offered within the European HPC ecosystem, e.g. DEISA and PRACE. The prototype exploits advanced grid services, namely the GRB services developed at the University of Salento, Italy, and basic grid services offered by the Globus Toolkit middleware for submitting and monitoring the ensemble runs. The prototype has been deployed involving three sites: CMCC, DKRZ and BSC. A case study related to HRT159, a global coupled ocean-atmosphere general circulation model (AOGCM) developed by CMCC-INGV, has been considered.
The NEMO (Nucleus for European Modelling of the Ocean) oceanic model is one of the most widely used by the climate community. It is exploited with different configurations in more than 50 research projects for both long- and short-term simulations. The computational requirements of the model and its implementation limit the exploitation of the emerging computational infrastructures at peta- and exascale, so a deep revision and analysis of the model and its implementation were needed. The paper describes the performance evaluation of the latest release of the model, based on MPI parallelization, on the MareNostrum platform at the Barcelona Supercomputing Center. The scalability analysis has been carried out taking into account different factors, i.e. the I/O system available on the platform, the domain decomposition of the model and the level of parallelism. The analysis highlighted different bottlenecks due to the communication overhead. The code has been optimized by reducing the communication weight within some frequently called functions, and the parallelization has been improved by introducing a second level of parallelism based on the OpenMP shared memory paradigm.
The Successive Over Relaxation (SOR) is a variant of the iterative Gauss-Seidel method for solving a linear system of equations Ax = b. The SOR algorithm is used within the NEMO (Nucleus for European Modelling of the Ocean) ocean model for solving the elliptical equation for the barotropic stream function. The NEMO performance analysis shows that the SOR algorithm introduces a significant communication overhead. Its parallel implementation is based on the Red-Black method and foresees a communication step at each iteration. An enhanced parallel version of the algorithm has been developed by acting on the size of the overlap region to reduce the frequency of communications. The overlap size must be carefully tuned for reducing the communication overhead without increasing the computing time. This work describes an analytical performance model of the SOR algorithm that can be used for establishing the optimal size of the overlap region.
The successive over relaxation (SOR) is a variant of the iterative Gauss-Seidel method for solving a linear system of equations Ax = b. The SOR algorithm is used within the Nucleus for European Modelling of the Ocean (NEMO) model for solving the elliptical equation for the barotropic stream function. The NEMO performance analysis shows that the SOR algorithm introduces a significant communication overhead. Its parallel implementation is based on the red-black method and foresees a communication step at each iteration. An enhanced parallel version of the algorithm has been developed by acting on the size of the overlap region to reduce the frequency of communications. The overlap size must be carefully tuned for reducing the communication overhead without increasing the computing time. This work describes an analytical performance model of the SOR algorithm that can be used for establishing the optimal size of the overlap region.
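As a concrete, serial and simplified illustration of the red-black SOR iteration mentioned in the two abstracts above, the sketch below applies the classic scheme to a five-point Laplacian on a square grid. The NEMO solver targets the barotropic stream function equation and, in its parallel form, exchanges an overlap (halo) region between subdomains whose width controls how often communication is needed; the grid size, operator and tolerance used here are illustrative assumptions.

```python
# Serial red-black SOR sketch for a five-point Laplacian (illustrative only;
# the NEMO solver works on the barotropic stream function and, in parallel,
# widens the halo/overlap region so that halo exchanges happen every few
# iterations instead of every iteration).
import numpy as np

def red_black_sor(f, h, omega=1.7, tol=1e-6, max_iter=10_000):
    u = np.zeros_like(f)
    n, m = u.shape
    for it in range(max_iter):
        max_delta = 0.0
        for colour in (0, 1):                    # red sweep, then black sweep
            for i in range(1, n - 1):
                for j in range(1, m - 1):
                    if (i + j) % 2 != colour:
                        continue
                    gs = 0.25 * (u[i + 1, j] + u[i - 1, j] +
                                 u[i, j + 1] + u[i, j - 1] - h * h * f[i, j])
                    new = (1.0 - omega) * u[i, j] + omega * gs
                    max_delta = max(max_delta, abs(new - u[i, j]))
                    u[i, j] = new
        if max_delta < tol:
            return u, it + 1
    return u, max_iter

if __name__ == "__main__":
    n = 33
    h = 1.0 / (n - 1)
    f = np.ones((n, n))                          # constant right-hand side
    u, iters = red_black_sor(f, h)
    print(f"converged in {iters} iterations, max |u| = {np.abs(u).max():.4f}")
```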
The present work describes the analysis and optimisation of the PELAGOS025 configuration, based on the coupling of the NEMO physics component for the ocean dynamics and the BFM (Biogeochemical Flux Model), a sophisticated biogeochemical model that can simulate both pelagic and benthic processes. The methodology followed here is characterised by the performance analysis of the original parallel code in terms of strong scalability, the identification of the bottlenecks limiting the scalability when the number of processes increases, and the analysis of the features of the most computationally intensive kernels through the Roofline model, which provides an insightful visual performance model for multicore architectures and allows measuring and comparing the performance of one or more computational kernels run on different hardware architectures.
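For reference, the Roofline model mentioned above bounds the attainable performance of a kernel by its operational intensity (a generic statement of the model, not a result specific to PELAGOS025):

```latex
% Roofline bound: attainable performance of a kernel with operational
% intensity I (flop/byte) on a machine with peak compute rate P_peak (flop/s)
% and peak memory bandwidth B (byte/s).
\[
  P_{\mathrm{attainable}}(I) \;=\; \min\bigl(P_{\mathrm{peak}},\; B \cdot I\bigr),
  \qquad
  I \;=\; \frac{\text{floating-point operations}}{\text{bytes moved to/from memory}}
\]
```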