Marco Leo
Role
Level III - Researcher
Organization
Consiglio Nazionale delle Ricerche
Department
Not Available
Scientific Area
AREA 09 - Industrial and Information Engineering
Scientific Disciplinary Sector
ING-INF/05 - Information Processing Systems
ERC Sector Level 1
PE - PHYSICAL SCIENCES AND ENGINEERING
ERC Sector Level 2
PE6 Computer Science and Informatics: Informatics and information systems, computer science, scientific computing, intelligent systems
ERC Sector Level 3
PE6_11 Machine learning, statistical data processing and applications using signal processing (e.g. speech, image, video)
This paper presents a new method to automatically locate pupils in images (even low-resolution ones) containing human faces. In particular, pupils are localized by a two-step procedure: first, self-similarity information is extracted by considering the appearance variability of local regions; it is then combined with an estimator of circular shapes based on a modified version of the Circular Hough Transform. Experimental evidence of the effectiveness of the method was obtained on challenging databases containing facial images acquired under different lighting conditions and with different scales and poses.
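As an illustration of the circular-shape stage, the sketch below runs OpenCV's stock Circular Hough Transform on a periocular patch. The paper relies on a modified version of the transform combined with self-similarity scores, so the plain cv2.HoughCircles call and the parameter values here are illustrative assumptions only.

```python
import cv2
import numpy as np

def locate_pupil_candidates(eye_patch_gray):
    """Rough sketch of the circular-shape stage: detect circular
    structures in a (possibly low-resolution) periocular patch."""
    blurred = cv2.medianBlur(eye_patch_gray, 5)  # suppress noise before edge voting
    circles = cv2.HoughCircles(
        blurred, cv2.HOUGH_GRADIENT,
        dp=1, minDist=10,          # accumulator resolution / min distance between centers
        param1=80, param2=15,      # Canny high threshold / accumulator vote threshold
        minRadius=2, maxRadius=15  # plausible pupil radii at low resolution
    )
    return [] if circles is None else np.round(circles[0]).astype(int)
```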
Face indexing is a very popular research topic and has been investigated over the last 10 years. It can be used for a wide range of applications such as automatic video content analysis, data mining, video annotation and labeling, etc. In this work, a fully automated framework is presented that detects how many people are present in a generic video (even one with low resolution and/or taken from a mobile camera). It also extracts the intervals of frames in which each person appears. The main contribution of the proposed work is that neither initialization nor a priori knowledge about the scene contents is required. Moreover, this approach introduces a generalized version of the k-means method that, through different statistical indices, automatically determines the number of people in the scene. Copyright © 2014 SCITEPRESS - Science and Technology Publications. All rights reserved.
It has been proved that Autism Spectrum Disorders (ASD) are associated with amplified emotional responses and poor emotional control. The mechanisms and characteristics underlying these difficulties in using, sharing and responding to emotions are still not understood. Recent non-invasive technological frameworks based on computer vision can be applied to overcome this knowledge gap, and this paper aims precisely at demonstrating how facial measurements from images can be exploited to compare how children with ASD react to external stimuli with respect to a control group of children.
Soft biometric systems have spread in recent years, both to strengthen classical biometrics and as stand-alone solutions with application scopes ranging from digital signage to human-robot interaction. In particular, the temporal evolution of the human gaze has recently emerged as a possible soft biometric, and some works in the literature have explored this exciting research line by using expensive and (perhaps) unsafe devices that require user cooperation to be calibrated. This work is instead the first attempt to perform biometric identification of individuals on the basis of data acquired by a low-cost, non-invasive, safe and calibration-free gaze estimation framework, consisting of two main components, conveniently combined, that perform the user's head pose estimation and pupil localization on data acquired by an RGB-D device. Experimental evidence of the feasibility of using the proposed framework for soft biometrics is given on a set of users watching three heterogeneous benchmark videos in an unconstrained environment.
Gender recognition is a topic of high interest, especially in the growing field of audience measurement techniques for digital signage applications. Usually, supervised approaches are employed, and they require a preliminary training phase performed on large datasets of annotated facial images that are expensive to build (e.g. MORPH) and, in any case, cannot be updated to keep track of the continuous mutation of people's appearance due to changes in fashion and style (e.g. hairstyles or makeup). The use of small (and thus more easily updatable) datasets is therefore highly desirable but, unfortunately, when few examples are used for training, gender recognition performance decreases dramatically, since state-of-the-art classifiers are unable to handle the inherent data uncertainty in a reliable way by explicitly modeling the encountered distortions. To address this drawback, an innovative classification scheme for gender recognition is introduced in this work: its core is the Minimax approach, i.e. a classification framework that, encompassing a number of existing regularized regression models, allows robust classification even when few examples are used for training. This has been experimentally proved by comparing the proposed classification scheme with state-of-the-art classifiers (SVM, kNN and Random Forests) under various pre-processing methods.
Automatic sport team discrimination, that is, the correct assignment of each player to his team, is a fundamental step in high-level sport video analysis applications. In this work we propose a novel set of features based on a variation of classic color histograms, called Positional Histograms: these features try to overcome the main drawback of classic histograms, namely the absence of any relation between the spectral and spatial contents of the image. The basic idea is to extract histograms as a function of the position of points in the image, with the goal of maintaining a relationship between the color distribution and its position: this is necessary because the actors on a playing field often dress in a similar way, with just a different distribution of the same colors across the silhouettes. Furthermore, different unsupervised classifiers and different feature sets are jointly evaluated with the goal of investigating the feasibility of unsupervised techniques in sport video analysis.
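A minimal sketch of the positional-histogram idea, assuming a simple variant that splits a player's bounding box into horizontal bands and concatenates per-band color histograms (the authors' exact positional scheme may differ):

```python
import numpy as np

def positional_histogram(patch_rgb, n_bands=4, bins=8):
    """Concatenate per-band color histograms so that the descriptor keeps
    a coarse link between color distribution and vertical position."""
    h = patch_rgb.shape[0]
    feats = []
    for b in range(n_bands):
        band = patch_rgb[b * h // n_bands:(b + 1) * h // n_bands]
        hist, _ = np.histogramdd(
            band.reshape(-1, 3), bins=(bins, bins, bins),
            range=((0, 256),) * 3
        )
        feats.append(hist.ravel() / max(hist.sum(), 1))  # per-band normalization
    return np.concatenate(feats)
```

Two players wearing the same colors but with different shirt/shorts layouts now yield different descriptors, which is exactly what a global histogram cannot capture.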
This paper presents a survey of soccer video analysis systems for different applications: video summarization, provision of augmented information, and high-level analysis. Computer vision techniques have been adapted to be applicable in the challenging soccer context. Different semantic levels of interpretation are required according to the complexity of the corresponding applications. For each application area we analyze the computer vision methodologies and their strengths and weaknesses, and we investigate whether these approaches can be applied to extensive, real-time soccer video analysis.
Face indexing is a very popular research topic and has been investigated over the last 10 years. It can be used for a wide range of applications such as automatic video content analysis, data mining, video annotation and labeling, etc. In this work, a statistical approach to this challenging issue is presented: the number of persons present in a generic video (even one with low resolution and/or taken from a mobile camera) is automatically detected, and the intervals of frames in which each person appears are also extracted. The main contribution of the proposed work is that neither initialization nor a priori knowledge about the scene contents is required. Moreover, this approach introduces a generalized version of the k-means method that, through different statistical indices, automatically determines the number of people in the scene.
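The model-selection idea can be sketched as follows, assuming the silhouette index as one illustrative statistical criterion (the paper combines several indices, which are not detailed in the abstract):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def estimate_num_people(face_descriptors, k_max=10):
    """Pick the number of clusters (people) that maximizes a statistical
    index -- here the silhouette score, as one illustrative choice."""
    best_k, best_score = 1, -1.0
    for k in range(2, min(k_max, len(face_descriptors) - 1) + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(face_descriptors)
        score = silhouette_score(face_descriptors, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```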
This paper presents a detailed study of different algorithmic configurations for estimating soft biometric traits. In particular, a recently introduced common framework is the starting point of the study: it includes initial face detection, subsequent facial trait description, a data reduction step, and a final classification step. The algorithmic configurations are characterized by different descriptors and different strategies to build the training dataset and to scale the data fed to the classifier. Experimental evaluations have been carried out both on publicly available datasets and on image sequences specifically acquired to evaluate the performance under real-world conditions, i.e., in the presence of scaling and rotation.
In this work, a first attempt to undertake the difficult challenge of embedding a technological layer into a standardized protocol for Autism Spectrum Disorder (ASD) diagnosis and assessment is introduced. In particular, the Autism Diagnostic Observation Schedule (ADOS-2) is taken into consideration, and a technological framework is introduced to compute, in an objective and automatic way, the evaluation scores for some of the tasks involved. The proposed framework makes use of a hidden RGB-D device for scene acquisition. The acquired data then feed a cascade of algorithmic steps by which people and objects are detected and temporally tracked; the extracted information is then exploited by fitting a spatial and temporal model described by means of an ontology-based approach. The ontology metadata are finally processed to find a mapping between them and the behavioral tasks described in the protocol.
In the last decade, soccer video analysis has received a lot of attention from the scientific community. This increasing interest is motivated by possible applications over a wide spectrum of topics: indexing, summarization, video enhancement, team and player statistics, tactics analysis, referee support, etc. The application of computer vision methodologies in the soccer context requires many problems to be faced: the ball and the players have to be detected in the images under any light and weather conditions, localized in the field, tracked over time, and finally their interactions have to be detected and analyzed. The latter task is fundamental, especially for statistical and referee decision support purposes, but, unfortunately, it has not received adequate attention from the scientific community and a lot of research remains to be done. In this paper a multicamera system is presented to detect ball-player interactions during soccer matches. The proposed method extracts, by triangulation from multiple cameras, the 3D ball and player trajectories and, by estimating the trajectory intersections, detects the ball-player interactions. An inference process is then introduced to determine the player kicking the ball and to estimate the interaction frame. The system was tested during several matches of the Italian first division football championship, and experimental results demonstrated that the proposed method is robust and accurate.
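The geometric core of the multicamera step can be sketched with OpenCV's triangulation routine; the projection matrices, pixel measurements and interaction threshold below are assumptions for illustration:

```python
import cv2
import numpy as np

def triangulate_ball(P1, P2, uv1, uv2):
    """Triangulate one 3D ball position from its pixel coordinates in two
    calibrated cameras (P1, P2 are 3x4 projection matrices)."""
    pts1 = np.asarray(uv1, dtype=float).reshape(2, 1)
    pts2 = np.asarray(uv2, dtype=float).reshape(2, 1)
    X_h = cv2.triangulatePoints(P1, P2, pts1, pts2)  # homogeneous 4x1 result
    return (X_h[:3] / X_h[3]).ravel()                # Euclidean 3D point

def interaction_frames(ball_traj, player_traj, radius=1.0):
    """Flag frames where the ball-player 3D distance drops below a threshold
    (a simple stand-in for the trajectory-intersection analysis)."""
    d = np.linalg.norm(ball_traj - player_traj, axis=1)
    return np.where(d < radius)[0]
```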
Automatic sport video analysis has become one of the most attractive research fields in the areas of computer vision and multimedia technologies. In particular, there has been a boom in soccer video analysis research. This paper presents a new multi-step algorithm to automatically detect the soccer ball in image sequences acquired from static cameras. In each image, candidate ball regions are selected by analyzing edge circularity, and ball patterns are then extracted, representing locally affine-invariant regions around automatically highlighted distinctive points. The effectiveness of the proposed methodology is demonstrated through a large number of experiments using real balls under challenging conditions, as well as a favorable comparison with some of the leading approaches from the literature.
Mobility and multi-functionality have been recognized as basic requirements for the development of fully automated surveillance systems in realistic scenarios. Nevertheless, problems such as active control of heterogeneous mobile agents, integration of information from fixed and moving sensors for high-level scene interpretation, and mission execution remain open. This paper describes recent and current research by the authors concerning the design and implementation of a multi-agent surveillance system using static cameras and mobile robots. The proposed solution takes advantage of a distributed control architecture that allows the agents to autonomously handle general-purpose tasks as well as more complex surveillance issues. The various agents can either take decisions and act with some degree of autonomy, or cooperate with each other. This paper presents an overview of the system architecture and of the algorithms involved in developing such an autonomous, multi-agent surveillance system.
Recent improvements in the field of assistive technologies have led to innovative solutions aimed at increasing the capabilities of people with disabilities, helping them in daily activities with applications that span from cognitive impairments to developmental disabilities. In particular, in the case of Autism Spectrum Disorder (ASD), the need to obtain active feedback from which meaningful data can subsequently be extracted becomes of fundamental importance. In this work, a study on the possibility of understanding visual exploration in children with ASD is presented. In order to obtain an automatic evaluation, an algorithm for free gaze estimation (i.e., without constraints, additional hardware, infrared (IR) light sources or other intrusive methods) is employed. Furthermore, no initial calibration is required. The method allows the user to freely rotate the head in the field of view of the sensor, and it is insensitive to the presence of eyeglasses, hats or particular hairstyles. These relaxations of the constraints make this technique particularly suitable for the critical context of autism, where the child is certainly not inclined to wear invasive devices or to collaborate during calibration procedures. The evaluation of children's gaze trajectories through the proposed solution is presented for the purpose of an Early Start Denver Model (ESDM) program built on the child's spontaneous interests and game choice, delivered in a natural setting.
In this paper, an artificial olfactory system (Electronic Nose) that mimics the biological olfactory system is introduced. The device consists of a Large-Scale Chemical Sensor Array (16,384 sensors, made of 24 different kinds of conducting polymer materials) that supplies data to software modules performing advanced data processing. In particular, the paper concentrates on the software components, consisting, at first, of a crucial step that normalizes the heterogeneous sensor data and reduces their inherent noise. The cleaned data are then supplied as input to a data reduction procedure that extracts the most informative and discriminant directions, in order to obtain an efficient representation in a lower-dimensional space where it is possible to more easily find a robust mapping between the observed outputs and the characteristics of the odors presented to the device. Experimental qualitative proofs of the validity of the procedure are given by analyzing data acquired for two different pure analytes and their binary mixtures. Moreover, a classification task is performed in order to explore the possibility of automatically recognizing pure compounds and predicting binary mixture concentrations.
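A minimal sketch of the normalization and data reduction stages, assuming standardization followed by PCA as one common way to extract the "most informative directions" (the abstract does not name the exact reduction technique used):

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def reduce_sensor_data(sensor_responses, n_components=10):
    """Normalize the heterogeneous sensor channels, then project the
    measurements onto a lower-dimensional, more discriminant space.
    sensor_responses: (n_measurements, n_sensors) array from the array
    of 16,384 sensors."""
    pipeline = make_pipeline(StandardScaler(), PCA(n_components=n_components))
    embedded = pipeline.fit_transform(sensor_responses)
    return embedded, pipeline  # keep the fitted pipeline to embed new samples
```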
This paper investigates the possibility of accurately detecting and tracking human gaze by using an unconstrained and noninvasive approach based on the head pose information extracted by an RGB-D device. The main advantages of the proposed solution are that it can operate in a totally unconstrained environment, it does not require any initial calibration, and it can work in real time. These features make it suitable for assisting humans in everyday life (e.g., remote device control) or in specific actions (e.g., rehabilitation), and in general in all those applications where it is not possible to ask for user cooperation (e.g., when users with neurological impairments are involved). To evaluate gaze estimation accuracy, the proposed approach has been extensively tested, and the results are compared with the leading methods in the state of the art, which, in general, make use of strong constraints on people's movements, invasive/additional hardware and supervised pattern recognition modules. Experimental tests demonstrated that, in most cases, the gaze estimation errors are comparable to those of state-of-the-art methods, even though the approach works without additional constraints, calibration or supervised learning.
Automatic facial expression recognition is a topic of high interest, especially due to the growing diffusion of assistive computing applications, such as Human-Robot Interaction, where robust awareness of people's emotions is a key point. This paper proposes a novel automatic pipeline for facial expression recognition based on the analysis of the gradient distribution in a single image, in order to characterize the face deformation under different expressions. First, an accurate investigation of the optimal HOG parameters was carried out. Subsequently, a wide experimental session was performed, demonstrating a higher detection rate with respect to other state-of-the-art methods. Moreover, an on-line testing session was added in order to prove the robustness of our approach in real environments.
Deep Learning has become a popular and effective way to address a large set of issues. In particular, in computer vision, it has been exploited to obtain satisfying recognition performance in unconstrained conditions. However, this wild race towards ever better performance in extreme conditions has overshadowed an important step, i.e. the assessment of the impact of this new methodology on the traditional issues on which researchers had worked for years. This is particularly true for biometrics applications, where the evaluation of deep learning has been made directly on the newest, larger and more challenging datasets. This leads to a purely data-driven evaluation that makes it difficult to analyze the relationships between network configurations, the learning process and the observed outcomes. This paper tries to partially fill this gap by applying a DNN for gender recognition on the MORPH dataset and evaluating how a lower cardinality of the examples used for learning can bias the recognition performance.
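The ablation protocol suggested by the abstract can be sketched as follows; `model_factory` is a hypothetical stand-in for the actual DNN, used only to show how the training-set cardinality is varied while the test split stays fixed:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def cardinality_ablation(X, y, model_factory, fractions=(1.0, 0.5, 0.25, 0.1), seed=0):
    """Train on progressively smaller subsets of the training split and
    report test accuracy, exposing how example cardinality biases results."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    rng = np.random.default_rng(seed)
    results = {}
    for f in fractions:
        n = max(2, int(f * len(X_tr)))
        idx = rng.choice(len(X_tr), size=n, replace=False)  # random subsample
        model = model_factory()                             # fresh model per run
        model.fit(X_tr[idx], y_tr[idx])
        results[f] = model.score(X_te, y_te)
    return results
```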
In this paper, a new technique to reduce the noise in a reconstructed hologram image is proposed. Unlike the techniques in the literature, the proposed approach takes into account not only spatial information but also the temporal statistics associated with the pixels. This innovative solution enables, at first, the automatic detection of the areas of the image containing the objects (foreground). This way, all the pixels not belonging to any object are directly cleaned up, and the contrast between objects and background is consistently increased. The remaining pixels are then processed with a spatio-temporal filtering that cancels out the effects of speckle noise while preserving the structural details of the objects. The proposed approach has been compared with other common speckle denoising techniques and found to give better results, both visually and quantitatively.
Autism Spectrum Disorders (ASD) are a group of lifelong disabilities that affect people's communication and understanding of social cues. The state of the art witnesses how technology, and in particular robotics, may offer promising tools to strengthen the research and therapy of ASD. This work represents the first attempt to use machine-learning strategies during robot-ASD children interactions involving facial expression imitation, making an objective evaluation of children's behaviours possible and thus giving the possibility of introducing a metric for the effectiveness of the therapy. In particular, the work focuses on basic emotion recognition skills. In addition to the aforementioned applicative innovations, this work also contributes a facial expression recognition (FER) engine that automatically detects and tracks the child's face and then recognizes emotions by means of a machine learning pipeline based on the HOG descriptor and Support Vector Machines. Two different experimental sessions were carried out: the first tested the FER engine on publicly available datasets, demonstrating that the proposed pipeline outperforms existing strategies in terms of recognition accuracy. The second involved children with ASD and was a preliminary exploration of how the introduction of the FER engine in the therapeutic protocol can effectively be used to monitor children's behaviours.
In this paper, a system for automatic visual monitoring of the welding process in dry stainless steel kegs for food storage is proposed. In the considered manufacturing process, the upper and lower skirts are welded to the vessel by means of Tungsten Inert Gas (TIG) welding. During the process several problems can arise: 1) residuals on the bottom; 2) a darker weld; 3) excessive/poor penetration; and 4) outgrowths. The proposed system deals with all four of the aforementioned problems, and its inspection performance has been evaluated on a large set of kegs, demonstrating both its reliability in terms of defect detection and its suitability for introduction into the manufacturing system in terms of computational cost.
Ball recognition in soccer matches is a critical issue for automatic soccer video analysis. Unfortunately, because of the difficulty of the problem, the efforts of numerous researchers have still not produced fully satisfactory results in terms of accuracy. This paper proposes a ball recognition approach that introduces a double level of innovation. First, a randomized circle detection approach based on the local curvature information of the isophotes is used to identify the edge pixels belonging to the ball boundaries. Then, ball candidates are validated by a learning framework formulated as a three-layered model based on a variation of the conventional local binary pattern approach. Experimental results were obtained on a significant set of real soccer images acquired under challenging lighting conditions during Italian "Serie A" matches. The results have also been favorably compared with the leading state-of-the-art methods.
We show that a three-dimensional (3D) scene can be coded by joining different objects through the combination of multiple optically recorded color digital holograms. Recently, an adaptive scheme based on affine transformations, able to correct defocus aberrations of digital holograms, was demonstrated; we name it here Adaptive Transformation in Digital Holography (ATDH). We propose an effective framework to create a dynamic color holographic 3D scene by using a generalization of this scheme, aided by a speckle reduction method called Multilevel Bi-dimensional Empirical Mode Decomposition (MBEMD), used for the first time in color holography. We also demonstrate its applicability to the synthesis of multiple Color Computer Generated Holograms (CCGHs).
We propose a complete framework for the synthesis of a 3D holographic scene, combining multiple color holograms of different objects by applying adaptive transformations. In particular, it has been demonstrated that affine transformations of digital holograms can be employed to correct defocus and chromatic aberrations. By combining these two features we are able to synthesize a color scene where multiple objects are jointly multiplexed. Since hologram transformations could introduce artifacts in the holographic reconstructions, principally related to the presence of speckle noise, we also implement a denoising step in which the Bi-dimensional Empirical Mode Decomposition (BEMD) algorithm is employed. We test the proposed framework in two different scenarios, i.e. by coding color three-dimensional scenes and joining different objects that are (i) experimentally recorded and (ii) obtained as color computer generated holograms (CCGH).
In the last decades there has been a tremendous increase in demand for Assistive Technologies (AT) useful to overcome the functional limitations of individuals and to improve their quality of life. As a consequence, many research papers addressing the development of assistive technologies have appeared in the literature, pushing the need to organize and categorize them according to their assistive aims. Several surveys address the categorization problem for works concerning a specific need, giving an overview of the state-of-the-art technologies supporting the related function for the individual. Unfortunately, this "user-need oriented" way of categorizing considers each technology as a whole: a deep and critical explanation of the technical knowledge used to build the operative tasks, as well as a discussion of their cross-contextual applicability, is completely missing, making existing surveys unlikely to be technically inspiring for functional improvements or for exploring new technological frontiers. To overcome this critical drawback, this paper introduces an original "task oriented" way to categorize the state of the art of AT works: it relies on splitting the final assistive goals into tasks that are then used as pointers to the works in the literature in which each of them has been used as a component. In particular, this paper concentrates on a set of cross-application Computer Vision tasks that are set as the pivots to establish a categorization of the AT already used to assist some of the users' needs. For each task, the paper analyzes the Computer Vision algorithms recently involved in the development of AT and, finally, tries to catch a glimpse of the possible paths in the short and medium term that could allow a real improvement of assistive outcomes. The potential impact on the assessment of AT from the user, medical, economic and social perspectives is also addressed.
This paper revises the main advances in assistive computer vision recently fostered by deep learning. To this aim, we first discuss how the application of deep learning in computer vision has contributed to the development of assistive technologies, then analyze the recent advances in assistive technologies achieved in five main areas, namely object classification and localization, scene understanding, human pose estimation and tracking, action/event recognition, and anticipation. The paper is concluded with a discussion and insights for future directions.
Context analysis is a research field that has attracted growing interest in recent years, especially due to the encouraging results achieved by semantic-based approaches. However, semantic strategies require trackers that are robust to long-term occlusions, viewpoint changes and identity swaps, which represent the main problems of many tracking-by-detection solutions. This paper proposes a robust tracking-by-detection framework based on dense SIFT descriptors, combined with an ad-hoc target appearance model update, able to overcome the discussed issues. The obtained performance shows that our tracker competes with state-of-the-art results and manages occlusions, clutter, and changes in scale, rotation and appearance better than competing tracking methods.
Automatic facial expression recognition (FER) is a topic of growing interest, mainly due to the rapid spread of assistive technology applications, such as human-robot interaction, where robust emotional awareness is a key point for best accomplishing the assistive task. This paper proposes a comprehensive study on the application of the histogram of oriented gradients (HOG) descriptor to the FER problem, highlighting how this powerful technique can be effectively exploited for this purpose. In particular, this paper shows that a proper setting of the HOG parameters can make this descriptor one of the most suitable for characterizing facial expression peculiarities. A large experimental session, divided into three phases, was carried out using a consolidated algorithmic pipeline. The first experimental phase was aimed at proving the suitability of the HOG descriptor for characterizing facial expression traits; to do this, a successful comparison with the most commonly used FER frameworks was carried out. In the second experimental phase, different publicly available facial datasets were used to test the system on images acquired under different conditions (e.g. image resolution, lighting conditions, etc.). In the final phase, a test on continuous data streams was carried out on-line in order to validate the system under real-world operating conditions simulating real-time human-machine interaction.
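A minimal sketch of a HOG-plus-classifier FER pipeline with off-the-shelf tools; the cell/block parameters and the linear SVM below are illustrative assumptions, not the optimal setting identified in the paper:

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_descriptor(face_gray):
    """HOG descriptor of an aligned face crop; the cell/block sizes here
    are illustrative values, not the paper's tuned configuration."""
    return hog(face_gray, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm='L2-Hys')

def train_fer(train_faces, train_labels):
    """train_faces: iterable of pre-cropped grayscale expression images."""
    X = np.stack([hog_descriptor(f) for f in train_faces])
    return LinearSVC().fit(X, train_labels)  # predict() then maps HOG -> expression
```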
Moving object detection is a crucial step in many application contexts such as people detection, action recognition, and visual surveillance for safety and security. Recent advances in depth camera technology have suggested the possibility of exploiting multi-sensor information (color and depth) in order to achieve better results in video segmentation. In this paper, we present a technique that combines depth and color image information and demonstrate its effectiveness through experiments performed on real image sequences recorded by means of a stereo camera.
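One simple way to combine the two cues is sketched below, assuming a static background depth map and an AND-style fusion rule (the paper's actual combination strategy may differ):

```python
import cv2
import numpy as np

bg_color = cv2.createBackgroundSubtractorMOG2(detectShadows=True)

def foreground_mask(frame_bgr, depth, depth_bg, depth_thresh=60):
    """Fuse a color-based background-subtraction mask with a depth-difference
    mask; depth_bg is a previously captured background depth map in the same
    units as `depth` (e.g. millimeters)."""
    mask_color = bg_color.apply(frame_bgr)
    mask_color = (mask_color == 255).astype(np.uint8)   # drop the shadow label (127)
    mask_depth = (np.abs(depth.astype(int) - depth_bg.astype(int))
                  > depth_thresh).astype(np.uint8)
    fused = cv2.bitwise_and(mask_color, mask_depth)     # require agreement of both cues
    return cv2.morphologyEx(fused, cv2.MORPH_OPEN, np.ones((3, 3), np.uint8))
```

Requiring agreement of both cues suppresses color artifacts (shadows, illumination changes) and depth artifacts (sensor noise at object borders) at the same time.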
In order to perform automatic analysis of sport videos acquired from a multi-sensing environment, it is fundamental to face the problem of automatic football team discrimination. The correct assignment of each player to his team is a preliminary task that, together with player detection and tracking algorithms, can strongly affect any high-level semantic analysis. Supervised approaches to object classification require the construction of ad hoc models before processing, as well as a manual selection of different player patches belonging to the team classes. The idea of this paper is to collect player patches coming from six different cameras and, after a pre-processing step based on the CBTF (Cumulative Brightness Transfer Function), to study and compare different unsupervised classification methods. The CBTF-based pre-processing step has been implemented in order to mitigate differences in appearance between images acquired by different cameras. We tested three different unsupervised classification algorithms (MBSAS, a sequential clustering algorithm; BCLS, a competitive one; and k-means, a hard-clustering algorithm) on the transformed patches. Results obtained by comparing different sets of features with different classifiers are reported. Experimental results have been carried out on different real matches of the Italian Serie A.
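The brightness-transfer idea behind the CBTF pre-processing can be sketched as cumulative histogram matching between patches accumulated from two cameras; this is a generic BTF sketch under that assumption, not the authors' exact formulation:

```python
import numpy as np

def brightness_transfer_function(patches_cam_a, patches_cam_b, bins=256):
    """Estimate a per-level mapping from camera A brightness to camera B
    brightness by matching cumulative histograms accumulated over many
    patches (the 'cumulative' aspect of the CBTF)."""
    hist_a = np.zeros(bins)
    hist_b = np.zeros(bins)
    for p in patches_cam_a:
        hist_a += np.bincount(p.ravel(), minlength=bins)[:bins]
    for p in patches_cam_b:
        hist_b += np.bincount(p.ravel(), minlength=bins)[:bins]
    cdf_a = np.cumsum(hist_a) / hist_a.sum()
    cdf_b = np.cumsum(hist_b) / hist_b.sum()
    # for each brightness level in A, pick the level in B at the same quantile
    return np.searchsorted(cdf_b, cdf_a, side='left').clip(0, bins - 1).astype(np.uint8)

# usage: patch_mapped_to_b = btf[patch_from_a]  (per-channel lookup table)
```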
Iris segmentation is driven by three different quality factors: accuracy, usability and speed. Unfortunately, a deep analysis of the literature shows that the greatest efforts of researchers have mainly focused on accuracy and speed. The proposed solutions, in fact, do not meet the usability requirement, since they are based on specific optimizations related to the operating context and impose binding conditions on the sensors to be used for the acquisition of periocular images. This paper tries to fill this gap by introducing an innovative iris segmentation technique that can be used in unconstrained environments, under non-ideal imaging conditions and, above all, without requiring any interaction for adaptation to different operating conditions. Experimental results, carried out on challenging databases, demonstrate that the high usability of the proposed solution does not penalize segmentation accuracy which, in some respects, outperforms that of the leading approaches in the literature.
Automatic facial expression recognition is one of the most interesting problems, as it impacts important applications in the human-computer interaction area. Many applications in this field require real-time performance, but not all approaches are suitable to satisfy this requirement. Geometric features are usually the lightest in terms of computational load, but they sometimes rely on a huge number of features and still do not cover all the possible geometric aspects of a face. To address this problem, we propose an automatic pipeline for facial expression recognition that exploits a new set of 32 geometric facial features from a single face side, covering a wide set of geometric peculiarities. As a result, the proposed approach achieved a facial expression recognition accuracy of 95.46% with a six-class expression set and an accuracy of 94.24% with a seven-class expression set.
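A tiny illustrative subset of such geometric features, computed from 2D landmarks and normalized by the interocular distance for scale invariance; the landmark names and the chosen measurements are hypothetical, not the paper's actual 32-feature set:

```python
import numpy as np

def geometric_features(landmarks):
    """Example geometric facial features from 2D landmarks. `landmarks` is
    a dict of (x, y) points; all key names below are assumptions."""
    iod = np.linalg.norm(np.subtract(landmarks['eye_r'], landmarks['eye_l']))

    def d(a, b):  # interocular-normalized distance -> scale invariance
        return np.linalg.norm(np.subtract(landmarks[a], landmarks[b])) / iod

    v1 = np.subtract(landmarks['mouth_l'], landmarks['mouth_top'])
    v2 = np.subtract(landmarks['mouth_r'], landmarks['mouth_top'])
    cos_a = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    mouth_angle = np.arccos(np.clip(cos_a, -1.0, 1.0))

    return np.array([
        d('mouth_l', 'mouth_r'),      # mouth width
        d('mouth_top', 'mouth_bot'),  # mouth opening
        d('brow_l', 'eye_l'),         # eyebrow-eye distance
        mouth_angle,                  # mouth corner angle
    ])
```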
The detection of moving objects is a crucial step in many application contexts such as people detection, action recognition, and visual surveillance for safety and security. Recent advances in depth camera technology have suggested the possibility of exploiting multi-sensor information (color and depth) in order to achieve better results in video segmentation. In this paper, we present a technique that combines depth and color image information and demonstrate its effectiveness through experiments performed on real image sequences recorded by means of a stereo camera.
ASD diagnosis and assessment make use of medical protocols validated by the scientific community, which is still reluctant to accept new protocols introducing invasive technologies, such as robots or wearable devices, whose influence on the therapy has not been deeply investigated. This work attempts to undertake the difficult challenge of embedding a technological layer into the standardized ASD protocol known as the Autism Diagnostic Observation Schedule (ADOS-2). An intelligent video system is introduced to compute, in an objective and automatic way, the evaluation scores for some of the tasks involved in the protocol. It makes use of a hidden RGB-D device for scene acquisition, whose data feed a cascade of algorithmic steps by which people and objects are detected and temporally tracked; the extracted information is then exploited by fitting a spatial and temporal model described by means of an ontology-based approach. The ontology metadata are finally processed to find a mapping between them and the behavioral tasks described in the protocol.
Convolutional Neural Networks (CNNs) have attracted growing interest in recent years thanks to their high generalization capabilities, which are highly desirable especially for applications working in the wild. However, CNNs rely on a huge number of parameters that must be set during training sessions based on very large datasets in order to avoid over-fitting issues. As a consequence, the lack of training data is one of the greatest limits to the applicability of deep networks. Another problem is represented by the fixed scale of the filters in the first convolutional layer, which limits the analysis performed by the subsequent layers of the network.
The paper presents a new automatic technique for speckle reduction in the context of digital holography. Speckle noise is a superposition of unwanted spots over the objects of interest, due to the interaction of a coherent radiation source with the characteristics of the object surface. In the proposed denoising method, bidimensional empirical mode decomposition is used to decompose the image signal, which is then filtered through the Frost filter. The proposed technique was preliminarily tested on the "Lena" image for quality assessment in terms of peak signal-to-noise ratio. Its denoising capability was then assessed on different holographic images, on which a comparison (using both blind metrics and visual inspection) with the leading strategies in the state of the art was also favorably performed. (C) 2014 Society of Photo-Optical Instrumentation Engineers (SPIE)
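BEMD implementations are not part of standard libraries, so the sketch below covers only the second stage: a simplified Frost filter that could be applied to a BEMD mode. Wrap-around borders and the damping constant are simplifying assumptions:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def frost_filter(img, win=5, k=1.0):
    """Simplified Frost filter: neighbors are averaged with exponential
    weights that decay faster where the local coefficient of variation
    (a speckle indicator) is high, smoothing homogeneous areas while
    preserving edges."""
    img = img.astype(float)
    mean = uniform_filter(img, win)
    sq_mean = uniform_filter(img ** 2, win)
    var = np.maximum(sq_mean - mean ** 2, 0)
    cv_sq = var / np.maximum(mean ** 2, 1e-12)   # squared coefficient of variation
    r = win // 2
    num = np.zeros_like(img)
    den = np.zeros_like(img)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            dist = np.hypot(dy, dx)
            w = np.exp(-k * cv_sq * dist)        # damping driven by local texture
            shifted = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
            num += w * shifted
            den += w
    return num / den
```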
Assistive technology is a generic term for systems used to increase, help or improve the functional capabilities of people with disabilities. Recently, its employment has generated innovative solutions also in the field of Autism Spectrum Disorder (ASD), where it is extremely challenging to obtain feedback or to extract meaningful data. In this work, a study on the possibility of understanding visual exploration in children with ASD is presented. In order to obtain an automatic evaluation, an algorithm for free gaze estimation is employed. The proposed gaze estimation method works without constraints and without additional hardware, IR light sources or other intrusive methods. Furthermore, no initial calibration is required. These relaxations of the constraints make the technique particularly suitable for the critical context of autism, where the child is certainly not inclined to wear invasive devices. In particular, the technique is used in a scenario where a closet containing specific toys, neatly arranged by the therapist, is opened to the child. After a brief exploration of the environment, the child freely chooses the desired toy, which is subsequently used during therapy. Video acquisition is accomplished by a Microsoft Kinect sensor hidden inside the closet in order to obtain both RGB and depth images, which are processed by the estimation algorithm; gaze tracking is then computed by intersection with the known initial disposition of the toys. The system has been tested with children with ASD, making it possible to understand their choices and preferences and to optimize the toy disposition for cognitive-behavioural therapy.
In this paper, a real case study of a Goal Line Monitoring system is presented. The core of the paper is a refined ball detection algorithm that analyzes candidate ball regions to detect the ball. A decision-making approach, by means of camera calibration, decides on the occurrence of a goal event. Differently from other similar approaches, the proposed one provides, as unquestionable proof, the image sequence that records the goal event under consideration. Moreover, it is non-invasive: it does not require any change in the typical football devices (ball, goal posts, and so on). Extensive experiments were performed both on real matches acquired during the Italian Serie A championship and on specific evaluation tests by means of an artificial impact wall and a shooting machine for shot simulation. The encouraging experimental results confirmed that the system could help humans in ambiguous goal line event detection.
Deep Learning architectures have obtained significant results for human pose estimation in recent years. State-of-the-art studies usually focus their attention on estimating the pose of adults depicted in images. The estimation of the pose of children (infants, toddlers, children) is sparsely studied, despite being very useful in different application domains, such as Assistive Computer Vision (e.g. for early detection of autism spectrum disorder). Monitoring the pose of a child over time could reveal important information, especially during clinical trials. Human pose estimation methods have been benchmarked under a variety of challenging conditions, but studies highlighting performance specifically on children's poses are still missing. Infants, toddlers and children are not only smaller than adults but also significantly different in anatomical proportions. Moreover, in assistive contexts, the unusual poses assumed by children can be very challenging to infer. The objective of the study in this paper is to compare different state-of-the-art approaches for human pose estimation on a benchmark dataset, in order to understand their performance when the subjects are children. The results reveal that the accuracy of state-of-the-art methods drops significantly, opening new challenges for the research community.
Circle detection is a critical issue in image analysis and object detection. Although Hough transform based solvers are largely used, randomized approaches, based on iterative sampling of the edge pixels, are an active research subject aimed at providing less computationally expensive solutions. This work presents a randomized iterative workflow that exploits the geometric properties of isophotes in the image to select the most meaningful edge pixels and to classify them into subsets of equal isophote curvature. The analysis of candidate circles is then performed with a voting strategy based on kernel density estimation, followed by a refinement algorithm based on linear error compensation. The method has been applied to a set of real images, on which it has also been compared with two leading state-of-the-art approaches and with Hough transform based solutions. The achieved results show how, discarding up to 57% of unnecessary edge pixels, it is able to accurately detect circles within a limited number of iterations, maintaining sub-pixel accuracy even in the presence of high levels of noise.
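The isophote-curvature criterion at the heart of the edge-pixel selection can be written down directly: for a circular structure, the reciprocal of the curvature magnitude approximates the radius, which is what allows edge pixels to be grouped before any voting. A minimal sketch:

```python
import numpy as np

def isophote_curvature(img):
    """Curvature of the isophotes (iso-intensity curves) of an image:
    kappa = -(Ly^2*Lxx - 2*Lx*Ly*Lxy + Lx^2*Lyy) / (Lx^2 + Ly^2)^(3/2).
    1/|kappa| estimates the local radius, so pixels on the same circle
    share approximately the same curvature value."""
    Ly, Lx = np.gradient(img.astype(float))   # first derivatives (rows, cols)
    Lyy, Lyx = np.gradient(Ly)                # second derivatives
    Lxy, Lxx = np.gradient(Lx)
    num = Ly**2 * Lxx - 2 * Lx * Ly * Lxy + Lx**2 * Lyy
    den = (Lx**2 + Ly**2) ** 1.5
    return -num / np.maximum(den, 1e-12)      # avoid division by zero in flat areas
```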
In recent years, smart surveillance has been one of the most active research topics in computer vision because of the wide spectrum of promising applications. Its main point is the use of automatic video analysis technologies for surveillance purposes. In general, a processing framework for smart surveillance consists of a preliminary motion detection step in combination with high-level reasoning that allows automatic understanding of the evolution of the observed scene. In this paper, we propose a surveillance framework based on a set of reliable visual algorithms that perform different tasks: a motion analysis approach that segments foreground regions is followed by three procedures, which perform object tracking, homographic transformations and edge matching, in order to achieve real-time monitoring of forbidden areas and the detection of abandoned or removed objects. Several experiments have been performed on different real image sequences acquired from a Messapic museum (indoor context) and the nearby archaeological site (outdoor context) to demonstrate the effectiveness and the flexibility of the proposed approach.
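The homographic-transformation step for forbidden-area monitoring can be sketched as follows; the reference points, map coordinates and polygon are made-up example values:

```python
import cv2
import numpy as np

# four reference points matched between the camera image and the site plan
pts_image = np.float32([[120, 300], [520, 290], [600, 460], [80, 470]])
pts_map   = np.float32([[0, 0], [10, 0], [10, 8], [0, 8]])  # plan coordinates (meters)
H, _ = cv2.findHomography(pts_image, pts_map)

def in_forbidden_area(foot_point_px, forbidden_polygon_map):
    """Map the foot point of a tracked object onto the plan via the
    homography, then test it against the forbidden-area polygon."""
    p = cv2.perspectiveTransform(np.float32([[foot_point_px]]), H)[0, 0]
    return cv2.pointPolygonTest(
        forbidden_polygon_map.astype(np.float32),
        (float(p[0]), float(p[1])), False) >= 0   # >= 0: inside or on the boundary
```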
This work introduces a real-time system able to adapt a humanoid robot's behavior to the gender of the interacting person. It exploits the vision capabilities of the Aldebaran NAO humanoid robot by applying a gender prediction algorithm based on face analysis. The system can also manage multiple persons at the same time, recognizing whether the group is composed of men, women or both and, in the latter case, determining the exact number of males and females, customizing its response in each case. The system enables human-robot interaction applications requiring a high level of realism, such as rehabilitation or artificial intelligence.
The measurement of object dimensions as well as the detection and localization of external defects are of great importance for many sectors of industry, including agriculture, transportation and production. In this paper we investigate the feasibility of using commercial depth-sensing devices based on time-of-flight technology, such as the Kinect v2 camera, for the measurement and inspection of cuboidal objects (boxes). This paper presents a simplified system using only one Kinect sensor. At the beginning, object dimensions are roughly estimated by discovering the best-fit planes for the point cloud using a modified version of RANSAC (RANdom SAmple Consensus). The precise geometry and morphology of the objects are then obtained by a transformation from the depth to the RGB representation of the points estimated as belonging to the object. The RGB representation is finally processed (using scanlines on the RGB plane perpendicular to the initial edge estimate) to best approximate the contour of the bounding box. In addition, the paper proposes a method to automatically highlight defects on the objects' surfaces: this inspection task is performed through the analysis of both the 2D object contours and the histogram of the normalized depth values. The proposed methodology takes a few seconds to deliver the results for the monitored object, and it achieved encouraging results in terms of accuracy. Indeed, the system measured the dimensions of a set of cuboidal objects with an average error of 5 mm, and it was able to identify and locate defects and holes on the lateral and topmost surfaces. The experimental outcomes point out that the system could be effectively exploited within industrial inspection applications, all the more so if the low cost of the system is taken into consideration.
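A minimal sketch of the plane-discovery step, assuming a plain RANSAC loop over 3-point samples (the paper uses a modified version of RANSAC whose modifications are not detailed in the abstract):

```python
import numpy as np

def ransac_plane(points, n_iter=500, tol=0.005, seed=0):
    """Fit a dominant plane to a 3D point cloud: repeatedly pick 3 points,
    build the plane through them, and keep the plane with the most inliers
    within `tol` (same units as the cloud, e.g. meters)."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(points), dtype=bool)
    best_plane = None
    for _ in range(n_iter):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(normal)
        if norm < 1e-12:
            continue                      # degenerate (collinear) sample
        normal /= norm
        dist = np.abs((points - p0) @ normal)   # point-to-plane distances
        inliers = dist < tol
        if inliers.sum() > best_inliers.sum():
            best_inliers, best_plane = inliers, (normal, p0)
    return best_plane, best_inliers
```

Removing the inliers of each detected plane and re-running the loop yields the remaining box faces one at a time.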
Circle detection is a critical issue in image analysis: it is undoubtedly a fundamental step in different application contexts, among which one of the most challenging is the detection of the ball in soccer games. Hough Transform based circle detectors are largely used, but there is a large open research area attempting to provide more effective and less computationally expensive solutions based on randomized approaches, i.e. on iterative sampling of the edge pixels. To this end, this work presents an ad-hoc randomized iterative workflow that exploits a geometric property of isophotes, the curvature, to identify edge pixels belonging to the ball boundaries; this allows a large amount of edge pixels to be considered while limiting most of the time-consuming computation to a restricted subset of pixels with a high probability of lying on a circular structure. The method, coupled with a background suppression algorithm, has been applied to a set of real images acquired by a fixed camera, providing higher performance than a standard circular Hough transform solver, with a detection rate greater than 86%.
Soft biometric systems have spread in recent years, both to strengthen classical biometrics and as stand-alone solutions with application scopes ranging from digital signage to human-robot interaction. In particular, the temporal evolution of the human gaze has recently emerged as a possible soft biometric, and some recent works in the literature have explored this exciting research line by using expensive and (perhaps) unsafe devices which, moreover, require user cooperation to be calibrated. To our knowledge, the use of a low-cost, non-invasive, safe and calibration-free gaze estimator to obtain soft-biometric data has not been investigated yet. This paper fills this gap by analyzing the soft-biometric performance obtained by modeling the series of gazes estimated by exploiting the combination of head poses and pupil locations on data acquired by an off-the-shelf RGB-D device.
In this work, a real-time system able to automatically recognize soft-biometric traits is introduced and used to improve the capability of a humanoid robot to interact with humans. In particular, the proposed system is able to estimate the gender and age of humans in images acquired by the robot's embedded camera. This knowledge allows the robot to properly react with customized behaviors related to the gender/age of the interacting individuals. The system is able to handle multiple persons in the same acquired image, recognizing the age and gender of each person in the robot's field of view. These features make the robot particularly suitable for socially assistive applications.
Digital holography (DH) has emerged as one of the most effective coherent imaging technologies. The technological developments of digital sensors and optical elements have made DH the primary approach in several research fields, from quantitative phase imaging to optical metrology and 3D display technologies, to name a few. Like many other digital imaging techniques, DH must cope with the issue of speckle artifacts, due to the coherent nature of the required light sources. Despite the complexity of the recently proposed de-speckling methods, many have not yet attained the required level of effectiveness. That is, a universal denoising strategy for completely suppressing holographic noise has not yet been established. Thus the removal of speckle noise from holographic images represents a bottleneck for the entire optics and photonics scientific community. This review article provides a broad discussion about the noise issue in DH, with the aim of covering the best-performing noise reduction approaches that have been proposed so far. Quantitative comparisons among these approaches will be presented.
Information and Communication Technologies (ICT) have proved to have a great impact in enhancing social, communicative, and language development in children with Autism Spectrum Disorders (ASD), as demonstrated by the many effective technological tools reported in the literature for the diagnosis, assessment and treatment of such neurological conditions. On the contrary, there are very few works exploiting ICT to study the mechanisms that trigger behavioral patterns during specialized treatment sessions focused on social interaction stimulation. The study of the literature reveals that behavioral outcomes are evaluated qualitatively by therapists, making it impossible to assess, in a consistent manner, the worth of the supplied ASD treatments, whose assessment should be based on quantitative metrics that are not yet available for this purpose. Moreover, the rare attempts to use a methodological approach are limited to the study of one (of at least a couple) of the several behavioral cues involved. In order to fill this gap, this paper introduces a technological framework able to analyze and integrate multiple visual cues in order to capture the behavioral trend throughout an ASD treatment. It is based on an algorithmic pipeline involving face detection, landmark extraction, gaze estimation, head pose estimation and facial expression recognition, and it has been used to detect behavioral features during the interaction between different children affected by ASD and a humanoid robot. Experimental results demonstrated the superiority of the proposed framework in the specific application context with respect to the leading approaches in the literature, providing a reliable pathway to automatically building a quantitative report that could help therapists to better achieve either ASD diagnosis or assessment tasks.
A new method to automatically locate pupils in images (even with low resolution) containing near-frontal human faces is presented. In particular, pupils are localized by an unsupervised procedure consisting of two steps: at first, self-similarity information is extracted by considering the appearance variability of local regions, and then it is combined with an estimator of circular shapes based on a modified version of the circular Hough transform. Experimental evidence of the effectiveness of the method was achieved on challenging databases and video sequences containing facial images acquired under different lighting conditions and with different scales and poses. (C) 2013 SPIE and IS&T
The automatic detection and tracking of human eyes and, in particular, the precise localization of their centers (pupils), is a widely debated topic in the international scientific community. In fact, the extracted information can be effectively used in a large number of applications ranging from advanced interfaces to biometrics, including the estimation of gaze direction, the monitoring of human attention and the early screening of neurological pathologies. Independently of the application domain, the detection and tracking of eye centers are currently performed mainly using invasive devices. Cheaper and more versatile systems have only recently been introduced: they make use of image processing techniques working on periocular patches which can be specifically acquired or preliminarily cropped from facial images. In the latter case, the involved algorithms must work even under non-ideal acquisition conditions (e.g. in the presence of noise, low spatial resolution, non-uniform lighting, etc.) and without the user's awareness (thus with possible variations of the eye in scale, rotation and/or translation). Obtaining satisfying pupil localization results under such challenging operating conditions is still an open scientific topic in Computer Vision. Actually, the best-performing solutions in the literature are, unfortunately, based on supervised machine learning algorithms which require initial sessions to set the working parameters and to train the embedded learning models of the eye: this way, experienced operators have to work on the system each time it is moved from one operational context to another. It follows that the use of unsupervised approaches is more and more desirable but, unfortunately, their performance is still unsatisfactory and more investigation is required. To this end, this paper proposes a new unsupervised approach to automatically detect the center of the eye: its algorithmic core is a representation of the eye's shape obtained through a differential analysis of image intensities, subsequently combined with the local variability of the appearance represented by self-similarity coefficients. Experimental evidence of the effectiveness of the method was demonstrated on challenging databases containing facial images. Moreover, its capability to accurately detect the centers of the eyes was also favourably compared with that of the leading state-of-the-art methods.
This work introduces biometrics as a way to improve human-robot interaction. In particular, gender and age estimation algorithms are used to provide awareness of the user's biometrics to a humanoid robot (Aldebaran NAO), so that it can properly react with gender/age-specific behaviors. The system can also manage multiple persons at the same time, recognizing the age and gender of each participant. All the estimation algorithms employed have been validated through k-fold testing and subsequently tested in a real human-robot interaction environment, allowing for more natural interaction. Our system is able to work at a frame rate of 13 fps with 640×480 images taken from NAO's embedded camera. The proposed application is well suited for all assisted environments that consider the presence of a socially assistive robot, such as therapy with disabled people and rehabilitation for dementia, post-stroke conditions, Alzheimer's disease or autism.
This paper proposes an advanced technological framework that, through the use of UAVs, allows the monitoring of archaeological sites. In particular, it focuses on the development of computer vision techniques such as super-resolution and mosaicking, aimed at extracting detailed and panoramic views of the sites. Super-resolution aims at providing imagery solutions (from aerial and remote sensing platforms) that create higher-resolution views, making visible details that are not perceivable in the acquired images. Mosaicking aims, instead, at creating a single large still image from the sequence of video frames contained in a motion imagery clip. In this way, large areas can be observed and a global analysis of their temporal changes can be performed. In general, super-resolution and mosaicking can be exploited for both touristic and surveillance purposes. In particular, they can be used to support the enjoyment of cultural heritage through a fascinating visual experience, possibly enriched with augmented information, but also for surveillance tasks that can help detect or prevent illegal activities.
The present invention addresses the problem of the automatic detection of events on a sport field, in particular Goal/No-Goal events, by signalling them to the match management, which can autonomously take the final decision upon the event. The system is not invasive for the field structures, nor does it require interrupting the game or modifying its rules; it only aims at objectively detecting the occurrence of the event and at providing support for the referees' decisions by means of specific signalling of the detected events.
The present invention relates to a system for detecting and classifying events during motion actions, in particular the "offside" event in the football game. The system allows such events to be determined in a real-time and semi-automatic context, taking into account the variability of the environmental conditions and of the dynamics of the events which can be traced back to offside. The present invention offers a non-invasive technique, compatible with the usual course of the match.