Giorgio Grillo
Role
Level III - Technologist
Organization
Consiglio Nazionale delle Ricerche
Department
Not available
Scientific Area
AREA 05 - Biological Sciences
Scientific Disciplinary Sector
BIO/11 - Molecular Biology
ERC Sector, 1st level
LS - LIFE SCIENCES
ERC Sector, 2nd level
LS2 Genetics, Genomics, Bioinformatics and Systems Biology: Molecular and population genetics, genomics, transcriptomics, proteomics, metabolomics, bioinformatics, computational biology, biostatistics, biological modelling and simulation, systems biology, genetic epidemiology
ERC Sector, 3rd level
LS2_10 Bioinformatics
Cancer is a multi-stage process often driven by the progressive accumulation of genomic rearrangements, which can lead cells to acquire cancer properties such as invasive and metastatic behavior. Many cancer-associated genes are the result of complex somatic and inherited chromosomal rearrangements, which produce aberrant transcripts or defects in transcription [1-5]. Classical approaches for identifying genome rearrangements, such as G-banded cytogenetics, spectral karyotyping and FISH, have poor sensitivity, while copy number arrays can identify only imbalanced breakpoints and do not describe the genome structure produced by the events that may cause the breakpoints. The aim of this project is to obtain, by applying the paired-end mapping (PEM) approach to massively parallel sequencing, a high-resolution virtual karyotype of the genome of a breast cancer patient whose transcriptomic portrait we obtained previously [6]. The introduction of massively parallel high-throughput sequencing (HTS) techniques has created a broad range of new and exciting research applications by dramatically increasing the output of sequencing data. In recent years, continuous technical improvements in next-generation sequencing have made RNA sequencing (RNA-seq) particularly effective for detecting gene fusions, which are involved in several diseases. Gene fusions are found in many cancer types and have proved to be prognostic biomarkers in several studies [7-9].
In addition, gene fusions often have a direct functional impact on molecular processes in the cell [10]. Several analysis steps are needed to process the data provided by the sequencer and to use them for robust gene fusion detection. We propose a workflow to analyze NGS paired-end sequences in order to identify candidate fusions between different genes, looking for fusion events that occur on the same chromosome (intra-chromosomal rearrangements). The basic idea is to map the reads onto the reference genome, study the insert size distribution of the read pairs, locate its peak, and select all mapped pairs whose insert size is far from the observed peak. In this way we select paired-end sequences whose mates map to regions of the genome far from each other, connecting different genes.
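The selection step described above can be illustrated with a minimal sketch. It assumes the insert sizes have already been extracted from the alignments as plain integers; the function name `select_distant_pairs`, the bin width and the distance cut-off are hypothetical choices made for illustration, not parameters reported in the abstract.

```python
from collections import Counter

def select_distant_pairs(insert_sizes, bin_width=10, min_distance=1000):
    """Return the indices of read pairs whose insert size is far from the peak.

    bin_width and min_distance are illustrative values (assumptions).
    """
    if not insert_sizes:
        return []
    # Build a coarse histogram of absolute insert sizes.
    bins = Counter(abs(size) // bin_width for size in insert_sizes)
    # The peak is the most populated bin; ties go to the smaller insert size.
    peak_bin = max(bins, key=lambda b: (bins[b], -b))
    peak = peak_bin * bin_width + bin_width // 2
    # Keep the pairs whose insert size deviates from the peak by more than min_distance.
    return [i for i, size in enumerate(insert_sizes)
            if abs(abs(size) - peak) > min_distance]

# Toy data: most pairs cluster near 300 nt, two pairs span distant regions.
sizes = [298, 301, 305, 299, 310, 302, 25000, 303, 180000]
candidates = select_distant_pairs(sizes)
```

In a real workflow the cut-off would more plausibly be derived from the spread of the observed distribution than fixed in advance.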
The huge amount of transcript data produced by high-throughput sequencing requires the development and implementation of suitable bioinformatic workflows for its analysis and interpretation. These analysis workflows, which comprise different modules, should be designed specifically for the sequencing platform (Roche 454, Illumina, SOLiD) and the nature of the data (polyA or total RNA fraction, strand specificity). In the case of cDNA obtained from a total RNA preparation, a great variety of transcript sequences can be obtained in addition to polyadenylated protein-coding mRNAs, including ribosomal RNAs, mitochondrial transcripts and many functional non-coding RNAs (ncRNAs). To deal with these data, the analysis workflow should include specific modules that distinguish the ncRNA fractions from the large number of other functional protein-coding transcripts. To this aim we developed an analysis pipeline that, given as input a large collection of reads (particularly from Roche 454), provides the qualitative and quantitative expression profile of human mtDNA, ribosomal RNAs, ncRNAs and protein-coding mRNAs.
When the reads obtained from high-throughput RNA sequencing are mapped against a reference database, a significant proportion of them, known as multireads, can map to more than one reference sequence. These multireads originate from gene duplications, repetitive regions or overlapping genes. In RNA-Seq analyses, removing the multireads from the mapping results causes an underestimation of the read counts, while estimating the real read count can lead to false positives during the detection of differentially expressed sequences. Results: We present an innovative approach to deal with multireads and evaluate differential expression events, entirely based on fuzzy set theory. Since multireads cause uncertainty in the estimation of read counts during gene expression computation, they can also influence the reliability of differential expression analysis results by producing false positives. Our method manages the uncertainty in gene expression estimation by defining fuzzy read counts, and it evaluates the possibility that a gene is differentially expressed with three fuzzy concepts: over-expression, same-expression and under-expression. The output of the method is a list of differentially expressed genes enriched with information about the uncertainty of the results due to the presence of multireads. We tested the method on RNA-Seq data designed for case-control studies and compared the results with those of other existing tools for read count estimation and differential expression analysis. Conclusions: Managing multireads with fuzzy sets makes it possible to obtain a list of differential expression events that takes into account the uncertainty caused by the presence of multireads. Biologists can use this additional information when selecting the most relevant differential expression events to validate with laboratory assays.
Our method can be used to compute reliable differential expression events and to highlight possible false positives in the lists of differentially expressed genes computed with other tools.
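To make the effect of multireads on a differential expression call concrete, the sketch below uses a deliberately simplified interval representation: each gene count is bounded below by its uniquely mapped reads and above by unique reads plus all multireads that touch the gene. This is an assumption made for illustration and is not the fuzzy read count defined by the authors; the function names, the pseudocount and the threshold are hypothetical as well.

```python
import math

def count_interval(unique_reads, multireads):
    """Lower and upper bound of a gene read count (illustrative simplification)."""
    return (unique_reads, unique_reads + multireads)

def log2_fold_range(case, control, pseudo=1.0):
    """Smallest and largest log2 fold change compatible with two count intervals."""
    lowest = math.log2((case[0] + pseudo) / (control[1] + pseudo))
    highest = math.log2((case[1] + pseudo) / (control[0] + pseudo))
    return lowest, highest

def classify(case, control, threshold=1.0):
    """Label a gene, flagging calls that depend on where the multireads belong."""
    lowest, highest = log2_fold_range(case, control)
    if lowest > threshold:
        return "over-expression"
    if highest < -threshold:
        return "under-expression"
    if lowest >= -threshold and highest <= threshold:
        return "same-expression"
    return "uncertain"

# A gene with few multireads gets a clear call; one with many does not.
clear_call = classify(count_interval(400, 20), count_interval(50, 10))
ambiguous_call = classify(count_interval(100, 300), count_interval(100, 20))
```

The "uncertain" label marks the genes whose call changes depending on how the multireads are assigned, which is the kind of result the abstract proposes to flag rather than report as a confident positive.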
Gene expression regulatory elements are scattered in gene promoters and pre-mRNAs. In particular, RNA elements lying in untranslated regions (5′ and 3′ UTRs) are poorly studied because of their peculiar features (i.e., a combination of primary and secondary structure elements), which also pose remarkable computational challenges. Several years ago, we began collecting experimentally characterized UTR regulatory elements, developing the specialized database UTRsite. This paper describes the detailed guidelines to annotate cis-regulatory elements in 5′ and 3′ UnTranslated Regions (UTRs) by computational analyses, retracing all main steps used by UTRsite curators.
It is known from recent studies that more than 90% of human multi-exon genes are subject to Alternative Splicing (AS), a key molecular mechanism in which multiple transcripts may be generated from a single gene. It is widely recognized that a breakdown in AS mechanisms plays an important role in cellular differentiation and pathologies. Polymerase Chain Reactions, microarrays and sequencing technologies have been applied to the study of transcript diversity arising from alternative expression. Latest-generation Affymetrix GeneChip Human Exon 1.0 ST Arrays offer a more detailed view of the gene expression profile, providing information on the AS patterns. The exon array technology, with more than five million data points, can detect approximately one million exons, and it allows analyses at both gene and exon level. In this paper we describe BEAT, an integrated user-friendly bioinformatics framework to store, analyze and visualize exon array datasets. It combines a data warehouse approach with rigorous statistical methods for assessing the AS of genes involved in diseases. Meta statistics are proposed as a novel approach to explore the analysis results. BEAT is available at http://beat.ba.itb.cnr.it. Results: BEAT is a web tool which allows uploading and analyzing exon array datasets using standard statistical methods and an easy-to-use graphical web front-end. BEAT has been tested on a dataset with 173 samples and tuned using new datasets of exon array experiments from 28 colorectal cancer and 26 renal cell cancer samples produced at the Medical Genetics Unit of IRCCS Casa Sollievo della Sofferenza. To highlight all possible AS events, alternative names, accession IDs, Gene Ontology terms and biochemical pathway annotations are integrated with exon and gene level expression plots.
The user can customize the results by choosing custom thresholds for the statistical parameters and by exploiting the available clinical data of the samples for a multivariate AS analysis. Conclusions: Despite exon array chips being widely used for transcriptomics studies, there is a lack of analysis tools offering advanced statistical features and requiring no programming knowledge. BEAT provides a user-friendly platform for a comprehensive study of AS events in human diseases, displaying the analysis results with easily interpretable and interactive tables and graphics.
Multiple sclerosis (MS) is a complex disease of the CNS that usually affects young adults, although 3-5% of cases are diagnosed in childhood and adolescence (hence called pediatric MS, PedMS). Genetic predisposition, among other factors, seems to contribute to the risk of onset in pediatric as in adult ages, but few studies have investigated the genetic 'environmentally naïve' load of PedMS. The main goal of this study was to identify circulating markers (miRNAs), target genes (mRNAs) and functional pathways associated with PedMS; we also verified the impact of miRNAs on clinical features, i.e. disability and cognitive performance. The investigation was performed in 19 PedMS patients and 20 pediatric controls (PCs) using a High-Throughput Next-generation Sequencing (HT-NGS) approach followed by an integrated bioinformatics/biostatistics analysis. Twelve miRNAs were significantly upregulated (let-7a-5p, let-7b-5p, miR-25-3p, miR-125a-5p, miR-942-5p, miR-221-3p, miR-652-3p, miR-182-5p, miR-185-5p, miR-181a-5p, miR-320a, miR-99b-5p) and 1 miRNA was downregulated (miR-148b-3p) in PedMS compared with PCs. The interactions between the significant miRNAs and their targets uncovered predicted genes (e.g. TNFSF13B, TLR2, BACH2, KLF4) related to immunological functions, as well as genes involved in autophagy-related processes (e.g. ATG16L1, SORT1, LAMP2) and ATPase activity (e.g. ABCA1, GPX3). No significant molecular profiles were associated with any PedMS demographic/clinical features. miRNA and mRNA expression predicted the phenotypes (PedMS vs. PC) with an accuracy of 92% and 91%, respectively. In our view, this original strategy of simultaneous miRNA/mRNA analysis may help to shed light on the genetic background of the disease, suggesting further molecular investigations into novel pathogenic mechanisms.
In the scientific biodiversity community, the need to build a bridge between molecular and traditional biodiversity studies is increasingly perceived. We believe that information technology could have a preeminent role in integrating the information generated by these studies with the large amount of molecular data available in public bioinformatics databases. This work is primarily aimed at building a bioinformatic infrastructure for the integration of public and private biodiversity data through the development of GIDL, an Intelligent Data Loader coupled with the Molecular Biodiversity Database. The system presented here organizes in an ontological way, and stores locally, the sequence and annotation data contained in the GenBank primary database. Methods: The GIDL architecture consists of a relational database and intelligent data loader software. The relational database schema is designed to manage biodiversity information (Molecular Biodiversity Database) and is organized in four areas: MolecularData, Experiment, Collection and Taxonomy. The MolecularData area is inspired by an established standard in Generic Model Organism Databases, the Chado relational schema. The peculiarity of Chado, and also its strength, is the adoption of an ontological schema which makes use of the Sequence Ontology. The Intelligent Data Loader (IDL) component of GIDL is Extract, Transform and Load software able to parse data, to discover hidden information in the GenBank entries and to populate the Molecular Biodiversity Database. The IDL is composed of three main modules: the Parser, able to parse GenBank flat files; the Reasoner, which automatically builds CLIPS facts mapping the biological knowledge expressed by the Sequence Ontology; and the DBFiller, which translates the CLIPS facts into ordered SQL statements used to populate the database.
In GIDL, Semantic Web technologies have been adopted because of their advantages in data representation, integration and processing. Results and conclusions: Entries from the Virus (814,122), Plant (1,365,360) and Invertebrate (959,065) divisions of GenBank rel. 180 have been loaded into the Molecular Biodiversity Database by GIDL. Our system, combining the Sequence Ontology and the Chado schema, allows more expressive queries than the most commonly used sequence retrieval systems, such as Entrez or SRS.
Motivation: Around 50% of all human tumours carry point mutations in the p53 tumour suppressor gene, which alter p53 DNA binding specificity. In tumours with wild-type p53, p53 is often rendered functionally inert by the inactivation of its positive modulators or by the activation of negative factors, which block p53 transcriptional activities [1]. We identified a new p53 direct target gene, TRIM8, belonging to the Tripartite Motif (TRIM) protein family, defined by the presence of a RING domain, one or two B-boxes and a Coiled-Coil region. We found that TRIM8 overexpression leads, through a positive feedback loop, to p53 stabilization and p53-mediated suppression of cell proliferation. In order to identify the pathways activated by TRIM8 that lead to p53 stabilization, we transiently transfected the HCT116-p53 (wt) cell line with TRIM8 and sequenced the total transcriptome with an NGS run on a 454 GS FLX platform. Here we report some statistics and the preliminary results of: i) read mapping on the human genome and analysis of differentially expressed genes; ii) functional analysis of differentially expressed genes. Method: Total RNA was extracted from the HCT116-p53 (wt) cell line 48 h after transfection, depleted of rRNA, retro-transcribed, amplified and sequenced using the Roche GS FLX Titanium Series pyrosequencer. Genome mapping, statistics and differential expression analyses were performed using the "NGS-Trex" system (NGS Transcriptome profile Explorer) (Mignone F. et al., submitted), an automatic system designed for analyzing Next Generation Sequencing data generated from large-scale transcriptome studies. The overall procedure involves four steps: 1) creation of a project and upload of reads in multi-fasta format; 2) read mapping onto the reference genome after setup of appropriate parameters; 3) annotation of mapped reads; 4) data mining by using simple query forms.
TRIM8 and FLAG data were submitted to NGS-Trex using default parameters, which can be briefly summarized as follows: reads were mapped onto the human genome (min similarity 90% and min overlap 50 nt), discarding reads mapping onto more than 10 genomic regions. Mapped reads were compared to the annotation to assign reads to genes and to identify new splice variants. Differentially expressed genes and splicing events were identified by computing a P-value associated with a hypergeometric distribution. Housekeeping genes were used to normalise read counts before the identification of differentially expressed genes. The lists of genes showing differential expression in the two samples were then analysed using DAVID (v6.7), an integrated biological knowledgebase with analytic tools (text and pathway-mining tools) for the functional annotation of large gene lists [2,3]. An additional analysis of the TRIM8 and FLAG sequence samples was carried out to detect and annotate the ncRNA genome fraction. We used a bioinformatic analysis pipeline, developed by us, which is able to: 1) select ncRNA fro
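The abstract does not spell out how the hypergeometric P-value is parameterised, so the following is only one plausible reading, written as a minimal sketch: given N reads in total across the two samples, K of which come from the sample of interest, it asks how likely it is that at least k of the n reads assigned to a gene fall in that sample by chance. The function name and the toy numbers are hypothetical.

```python
from math import comb

def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: total reads in both samples (after normalisation)
    K: reads belonging to the sample of interest
    n: reads assigned to the gene in both samples
    k: reads assigned to the gene in the sample of interest
    """
    upper = min(n, K)
    total = comb(N, n)
    # Exact integer arithmetic for the tail sum, converted to float at the end.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total

# Toy example: 30 reads for a gene, 25 of them in one of two equally sized samples.
p_value = hypergeom_upper_tail(k=25, n=30, K=1000, N=2000)
```

Exact binomial coefficients are convenient for small toy counts; for library sizes in the millions a statistics library with a log-space implementation would be the practical choice.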
A holistic understanding of environmental communities is the new challenge of metagenomics. Accordingly, the amplicon-based or metabarcoding approach, largely applied to investigate bacterial microbiomes, is moving to the eukaryotic world too. Indeed, the analysis of metabarcoding data may provide a comprehensive assessment of both bacterial and eukaryotic composition in a variety of environments, including the human body. In this respect, whereas hypervariable regions of the 16S rRNA are the de facto standard barcode for bacteria, the Internal Transcribed Spacer 1 (ITS1) of the ribosomal RNA gene cluster has shown a high potential in discriminating eukaryotes at deep taxonomic levels. As metabarcoding data analysis relies on the availability of a well-curated barcode reference resource, a comprehensive collection of ITS1 sequences supplied with robust taxonomies is highly needed. To address this issue, we created ITSoneDB (available at http://itsonedb.cloud.ba.infn.it/), which in its current version hosts 985 240 ITS1 sequences spanning over 134 000 eukaryotic species. Each ITS1 is mapped on the NCBI reference taxonomy with its start and end positions precisely annotated. ITSoneDB has been developed in accordance with the FAIR guidelines by enabling users to query and download its content through a simple web interface and to access relevant metadata by cross-linking to the European Nucleotide Archive.
Motivations. Metagenomics is advancing rapidly thanks to the advent of high-throughput next-generation sequencing (NGS) technologies, which allow an unprecedented large-scale identification of microorganisms living in almost every environment. In particular, the use of the amplicon-based metagenomic approach to explore the diversity of fungal environmental communities is expanding. At the species level, a number of studies have used the non-conserved internal transcribed spacers (ITS) 1 and 2 of the ribosomal RNA gene cluster as genetic markers to explore fungal taxonomic diversity. ITS1 in particular is gaining popularity as a better species-discriminating marker in Fungi because of its higher variability compared to ITS2. Starting from the total DNA extracted from any environmental sample, this locus can be easily amplified with taxonomically universal primers and sequenced by means of high-throughput next-generation platforms. Reference databases and robust supporting taxonomies are crucial in assigning phylogenetic affiliation to the huge amount of sequences produced. Even though a large number of ITS1 sequences are collected in public databases, a specialized resource focused on this region, in which sequence identity, boundaries and taxonomic assignment are validated, is still needed. In this work we present ITSoneDB, a new comprehensive collection of ITS1 sequences belonging to the Fungi Kingdom. Methods. ITSoneDB has been generated and populated using a multi-step Python workflow. In the first step, the ribosomal RNA gene cluster sequences of Fungi including the target ITS1 region were retrieved from GenBank. Then, ITS1 start and end boundaries were extracted from the Feature Table annotations, if available.
In order to infer, validate and, where necessary, redesign the ITS1 location, Hidden Markov Model (HMM) profiles of the flanking genes for 18S and 5.8S ribosomal RNA, generated from their reference alignments stored in the RFAM database, were mapped on the entire collection of retrieved nucleotide sequences by means of the hmmsearch tool from the HMMER 3.0 package. Results. At present, ITSoneDB includes 405,433 taxonomically arranged sequence entries, each provided with ITS1 start and end positions defined by GenBank annotations and/or the HMM-based method. The ITSoneDB front-end is a JAVA platform-based website for data browsing and downloading. The database can be queried by species or taxon name, by GenBank accession ID or by "expanding" the target rank on a detailed fungal taxonomical tree. The complete ITS1 sequence dataset collected in ITSoneDB is available in Fasta format, and users can extract and locally save all or selected queried ITS1 sequences for further analysis.
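The boundary inference described in the Methods can be sketched as follows, assuming the hmmsearch hits have already been parsed into 1-based coordinates on the nucleotide sequence. The function names and the toy coordinates are hypothetical, and the actual ITSoneDB workflow may handle partial or missing hits differently.

```python
def its1_boundaries(ssu_hit_end, rrna58_hit_start):
    """Infer 1-based inclusive ITS1 coordinates from the flanking gene hits.

    ssu_hit_end: last position of the 18S rRNA HMM hit
    rrna58_hit_start: first position of the 5.8S rRNA HMM hit
    Returns None when the two hits leave no room for a spacer.
    """
    start = ssu_hit_end + 1
    end = rrna58_hit_start - 1
    if end < start:
        return None
    return start, end

def extract_its1(sequence, boundaries):
    """Slice the ITS1 region out of a sequence given 1-based inclusive boundaries."""
    start, end = boundaries
    return sequence[start - 1:end]

# Toy sequence: 10 nt of 18S, 6 nt of spacer, 8 nt of 5.8S.
toy = "A" * 10 + "GATTCC" + "T" * 8
bounds = its1_boundaries(ssu_hit_end=10, rrna58_hit_start=17)
spacer = extract_its1(toy, bounds)
```

Comparing the inferred pair with the boundaries annotated in the GenBank feature table, when both exist, is one way to realise the validation step mentioned in the text.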
When the reads obtained from high-throughput sequencing are mapped against a reference database, some of them, known as multireads, can map to more than one reference sequence. This occurs because genomes contain many repeated portions and reads are generally shorter than reference sequences. Removing the multireads from the mapping results causes an underestimation of the read counts, while estimating the real read count can lead to false positives during the detection of differentially expressed sequences.
Biological activities are typically co-regulated by several factors and this feature is properly reflected by higher-order structures called cis-regulatory modules (CRM) and represented by non-random clusters of regulatory motifs. Several methods have been proposed for the de novo discovery of modules. We propose an alternative approach based on the discovery of rules which define strong spatial associations between single motifs and suggest the structure of a module. Rules are expressed in a first-order logic formalism and are mined by means of an inductive logic programming (ILP) system. We also propose computational solutions to two issues: the hard discretization of numerical inter-motif distances and the choice of a minimum support threshold. All methods have been implemented and integrated in a prototypal tool designed to support biologists in the discovery and characterization of cis-regulatory modules.
High-throughput technologies (HT), such as microarray and especially Next-Generation Sequencing (NGS) technologies, have provided tremendous potential for profiling protein-coding and non-protein-coding RNAs (ncRNAs). Recent reports of the ENCODE project underline that while 80% of the human genome is transcribed, only 2% is protein coding, suggesting that the vast majority of the genome is transcribed as non-protein-coding RNA. We present the development of a web-based bioinformatics platform, nc-aReNA, for the mapping, classification and annotation of human and mouse ncRNAs from HT-NGS data. The platform is based on a data-warehouse approach and a workflow environment that includes data quality control, genome and nc-RNAome sequence alignment, differential expression profiling analysis and statistics of classified data. Methods: The nc-aReNA architecture is based on a modular analysis pipeline, flanked by a data warehouse, for the classification and annotation of small RNA-seq data. The pipeline takes as input the sequenced reads in FASTQ format. After the initial steps of adaptor removal and quality check, the input reads are mapped to an in-house non-redundant ncRNA reference database (http://ncRNAdb.ba.itb.cnr.it), which collects and integrates ncRNA gene lists from MGI (Mouse Genome Informatics) and HGNC (HUGO Gene Nomenclature Committee) with sequences and biotype annotations from VEGA (Vertebrate Genome Annotation), ENSEMBL, RefSeq, RFam (for tRNA sequences) and miRBase (for miRNAs). NGS reads mapped in this step are classified by using the Sequence Ontology (SO) (Eilbeck K. et al., 2005).
Unmapped reads are aligned to the reference genome and tagged with the corresponding genomic locus. Integrated statistics are used to calculate RPM (Reads Per Million), fold changes and False Discovery Rate (FDR) corrected p-values, and to perform differential expression analysis of all (or user-chosen) ncRNA classes by comparing two or more experimental conditions or time-course data. An additional module, called "miRNA identification", analyzes all unmapped miRNA-like reads by means of the miRDeep2 software. All the analysis results and annotations are stored in a data warehouse implemented with Infobright (http://www.infobright.org). A user-friendly web-based Graphical User Interface (GUI), developed using the JAVA platform, guides the user through the submission process and displays results in tables and graphs. Results: The main features of nc-aReNA are:
- identification and classification of reads in known functional ncRNA categories in SO;
- identification and filtering of reads mapping to ribosomal RNAs and mtDNA transcripts;
- RPM calculation for each known ncRNA;
- the export of user-selected classes of ncRNA for further specific investigation;
- quantification of ncRNA expression and differential expression analysis for all identified ncRNA classes;
- graphical visualization of sample expression profiles;
- additional annot
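The statistics named in the Methods (RPM, fold change and FDR-corrected p-values) can be sketched in a few lines. The abstract does not say which FDR procedure nc-aReNA uses, so the Benjamini-Hochberg adjustment below is an assumption, as are the function names and the pseudocount.

```python
import math

def rpm(count, total_mapped_reads):
    """Reads Per Million: a raw count scaled to a library of one million mapped reads."""
    return count * 1_000_000 / total_mapped_reads

def log2_fold_change(rpm_a, rpm_b, pseudo=1.0):
    """log2 ratio of two RPM values; the pseudocount (an assumption) avoids division by zero."""
    return math.log2((rpm_a + pseudo) / (rpm_b + pseudo))

def benjamini_hochberg(pvalues):
    """FDR-adjusted p-values (Benjamini-Hochberg), returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy example: one miRNA counted in two libraries of different depth.
rpm_case = rpm(150, 3_000_000)
rpm_control = rpm(50, 2_000_000)
lfc = log2_fold_change(rpm_case, rpm_control)
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Scaling to RPM before computing the fold change is what makes libraries of different sequencing depth comparable.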
MOTIVATION: The recent availability of next generation sequencing (NGS) technologies has provided the scientific community with an unprecedented opportunity for large-scale genome analysis in a large number of organisms. One of the most challenging tasks for bioinformaticians is to develop tools that provide biologists with easy access to curated and non-redundant collections of sequence data. Non-coding RNAs, for a long time believed to be non-functional, are emerging as the largest and most important family of gene regulators. METHODS: NonCode aReNA DataBase is a comprehensive and non-redundant source of manually curated and automatically annotated ncRNA transcripts collected from major public resources. The database is built through a set of automated ETL (Extraction Transformation Loading) processes which extract and collect data from VEGA, ENSEMBL, RefSeq, miRBase, GtRNAdb and piRNABank. The automatic process also guarantees recurring updates. Redundant sequences are identified by analyzing both cross-link references and sequence similarity. Furthermore, non-coding RNA sequences have been classified in diverse biotypes and associated with Sequence Ontology terms. NonCode aReNA DataBase was originally developed as a component of a bigger project, consisting of a data warehouse and an analysis workflow, for the functional annotation of ncRNAs from NGS data. RESULTS: NonCode aReNA Database is currently available as a web resource at http://ncrnadb.ba.itb.cnr.it/. The database can be queried using multi-criteria and ontological search through an easy-to-use web interface. Query results can be exported as non-redundant collections of ncRNA transcripts. Currently NonCode aReNA DataBase contains 134,908 human ncRNAs classified in 24 biotypes, and the next updates will include transcripts of Mus musculus and Arabidopsis thaliana.
PlantPIs is a web querying system for a database collection of plant protease inhibitor data. Protease inhibitors in plants are naturally occurring proteins that inhibit the function of endogenous and exogenous proteases. This paper reports the design and development of a web framework that provides a clear and very flexible way of querying plant protease inhibitor data. The web resource is based on a relational database, containing publicly accessible data on plant protease inhibitors, and a graphical user interface providing all the necessary browsing tools, including a data export function. PlantPIs contains information extracted principally from the MEROPS database, then filtered, annotated and compared with data stored in other public protein and gene databases, using both automated techniques and domain expert evaluations. The data are organized to allow flexible and easy access to the stored information. The database is accessible at http://www.plantpis.ba.itb.cnr.it/.
Metagenomics is providing unprecedented access to environmental microbial diversity. The amplicon-based metagenomics approach involves the PCR-targeted sequencing of a genetic locus fitting different features. Namely, it must be ubiquitous in the taxonomic range of interest, variable enough to discriminate between different species but flanked by highly conserved sequences, and of suitable size to be sequenced through next-generation platforms. The internal transcribed spacers 1 and 2 (ITS1 and ITS2) of the ribosomal DNA operon and one or more hyper-variable regions of the 16S ribosomal RNA gene are typically used to identify fungal and bacterial species, respectively. In this context, reliable reference databases and taxonomies are crucial to assign amplicon sequence reads to the correct phylogenetic ranks. Several resources provide consistent phylogenetic classification of publicly available 16S ribosomal DNA sequences, whereas the state of ribosomal internal transcribed spacer reference databases is notably less advanced. In this review, we aim to give an overview of existing reference resources for both types of markers, highlighting strengths and possible shortcomings of their use for metagenomics purposes. Moreover, we present a new database, ITSoneDB, of well annotated and phylogenetically classified ITS1 sequences to be used as a reference collection in metagenomic studies of environmental fungal communities. ITSoneDB is available for download and browsing at http://itsonedb.ba.itb.cnr.it/.
Motivation: Biodiversity research deals with data coming from many different domains (e.g., Biology, Geography, Evolutionary Studies, Genomics, Taxonomy, Environmental Sciences, etc.), which need to be integrated to produce valuable Biodiversity knowledge. Collecting and integrating data from so many heterogeneous resources is not a trivial task. Data are extremely scattered, heterogeneous in format and purpose, and protected in the repositories of several research institutes. Driven by the widespread web trend of sharing information through the aggregation of people with the same interests (social networks), and by the new type of database architecture defined as dynamic distributed federated database, we propose a new paradigm of data integration in the Biodiversity domain. Here we present a new approach for the development of a Knowledge Base for the collection, integration and analysis of biodiversity data, implemented as a product of the MBLab project. Methods: The implementation of the Biodiversity Knowledge Base is based on the integration of several components: a robust Database Management System (IBM DB2) managing the large volume of information from public databases like GenBank; a set of GaianDB nodes [1] to manage remote private collections of biodiversity data; and the IBM Federator Server to implement the general conceptual schema integrating all biodiversity databases available across the remote nodes of MBLab project partners. Results: GaianDB is a Dynamic Distributed Federated Database of sources whose growth is regulated by biologically inspired principles and graph theoretic methods. By means of the GaianDB network architecture, data remain on the remote research group servers, and each database owner is responsible for its integrity, availability and sharing.
Each vertex of this network is a suitable entry point that receives the user query and responds with an output aggregating different pieces of information retrieved from the data sources spread across the network. To integrate GenBank molecular data in the MBLabDB we built an efficient and reliable ETL (Extraction, Transformation and Load) module, implemented with the CLIPS Rule Based Programming Language. The ETL extracts information from the feature-based GenBank entries and fits it into the MBLabDB schema. Molecular data collections are structured following a Chado-like model [2], using Sequence Ontology entities and relations. This allows data to be retrieved using the biological concepts expressed by the Sequence Ontology [3]. The main result of this work is the development of a standard conceptual schema and a knowledge base architecture tailored to biodiversity data collection, integration and analysis. The database is modeled on six main sections: Taxonomic, Individual, Collection, Supply chain, Experimental molecular data. Currently two biodiversity data collections have been integrated by using GaianDB: the ITEM Collection [4] located at the I
The recent availability of high-throughput technologies, such as next generation sequencing (NGS) platforms, has provided the scientific community with an unprecedented opportunity for large-scale genome analysis in a large number of organisms. However, one of the most challenging tasks for bioinformaticians is to develop tools that give biologists easy access to curated and non-redundant collections of sequence data. Non-coding RNAs, long believed to be non-functional, are emerging as the largest and most important family of gene regulators. NonCode aReNA Database is a comprehensive and non-redundant source of manually curated and automatically annotated ncRNA transcripts. Originally developed as a component of a larger project, comprising a data warehouse for the functional annotation of ncRNAs from NGS data, NonCode aReNA DB is currently available as a web resource at http://ncrnadb.ba.itb.cnr.it/. Sequences have been classified into diverse biotypes and associated with Sequence Ontology terms. The database can be queried using multi-criteria and ontological search through an easy-to-use web interface, and data can be exported as non-redundant collections of transcripts annotated in VEGA, ENSEMBL, RefSeq, miRBase, GtRNAdb and piRNABank. The database is updated through an automatic pipeline, and the last update was in January 2015. At present, NonCode aReNA DB contains 134,908 human ncRNAs classified in 24 biotypes, and the next update will include transcripts of Mus musculus and Arabidopsis thaliana.
Acknowledgements
This work was supported by the Italian MIUR Flagship Project "Epigen".
The 5' and 3' untranslated regions of eukaryotic mRNAs (UTRs) play crucial roles in the post-transcriptional regulation of gene expression through the modulation of nucleo-cytoplasmic mRNA transport, translation efficiency, subcellular localization and message stability. UTRdb is a curated database of 5' and 3' untranslated sequences of eukaryotic mRNAs, derived from several sources of primary data. Experimentally validated functional motifs are annotated and also collated in the UTRsite database, which provides more specific information on the functional motifs and cross-links to interacting regulatory proteins. In the current update, the UTR entries have been organized in a gene-centric structure to better visualize and retrieve 5' and 3' UTR variants generated by alternative initiation and termination of transcription and by alternative splicing. Experimentally validated miRNA targets and conserved sequence elements are also annotated. The integration of UTRdb with genomic data has allowed the implementation of an efficient annotation system and a powerful retrieval resource for the selection and extraction of specific UTR subsets. All internet resources implemented for retrieval and functional analysis of 5' and 3' untranslated regions of eukaryotic mRNAs are accessible at http://utrdb.ba.itb.cnr.it/.
The structural and conformational organization of chromosomes is crucial for the regulation of gene expression in both eukaryotes and prokaryotes. To date, gene expression data generated using either microarrays or RNA sequencing are available for many bacterial genomes. However, differential gene expression is usually investigated with methods that consider each gene independently, and thus do not take into account the physical location of genes along a bacterial chromosome. Here, we present WoPPER, a web tool that integrates gene expression and genomic annotations to identify differentially expressed chromosomal regions in bacteria. RNA-sequencing or microarray-based gene expression data are provided as input, along with gene annotations. The user can select genomic annotations from an internal database including 2780 bacterial strains, or provide custom genomic annotations. The analysis produces as output the lists of positionally related genes showing a coordinated trend of differential expression. Graphical representations, including a circular plot of the analyzed chromosome, allow intuitive browsing of the results. The analysis procedure is based on our previously published R package PREDA. The release of this tool is timely and relevant for the scientific community, as WoPPER fills an existing gap in prokaryotic gene expression data analysis and visualization tools. WoPPER is open to all users and can be reached at the following URL: https://WoPPER.ba.itb.cnr.it.
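The idea of looking for positionally related genes with a coordinated trend of differential expression can be illustrated with a minimal sketch. This is not the PREDA procedure and not WoPPER's implementation: it swaps in a plain moving average and a fixed threshold purely for illustration, and the function names, the `(name, start, log2fc)` tuple layout and all parameter values are assumptions.

```python
from statistics import mean


def smooth(values, window):
    """Centered moving average over neighbouring genes (the window shrinks at the ends)."""
    half = window // 2
    return [mean(values[max(0, i - half): i + half + 1]) for i in range(len(values))]


def coordinated_regions(genes, window=3, threshold=1.0, min_genes=3):
    """Group adjacent genes whose smoothed log2 fold-change passes the threshold.

    genes: iterable of (name, start, log2fc) tuples (hypothetical layout).
    Returns a list of regions, each a list of gene names in chromosomal order.
    """
    ordered = sorted(genes, key=lambda g: g[1])           # order by position
    smoothed = smooth([g[2] for g in ordered], window)
    regions, current, sign = [], [], 0
    for gene, s in zip(ordered, smoothed):
        # +1 = up-regulated, -1 = down-regulated, 0 = below threshold
        s_sign = 1 if s >= threshold else -1 if s <= -threshold else 0
        if s_sign != 0 and s_sign == sign:
            current.append(gene[0])                       # extend the open region
        else:
            if len(current) >= min_genes:
                regions.append(current)                   # close the previous region
            current = [gene[0]] if s_sign != 0 else []
            sign = s_sign
    if len(current) >= min_genes:
        regions.append(current)
    return regions


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    toy = [("g1", 100, 0.0), ("g2", 200, 0.1), ("g3", 300, 2.0),
           ("g4", 400, 2.2), ("g5", 500, 1.8), ("g6", 600, 0.0)]
    print(coordinated_regions(toy))
```

A per-gene test would judge each of these genes separately, whereas the sketch reports a region only when several neighbouring genes move in the same direction, which is the distinction the abstract draws.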