Vito Flavio Licciulli
Role
Level III - Researcher
Organization
Consiglio Nazionale delle Ricerche
Department
Not available
Scientific Area
AREA 05 - Biological Sciences
Scientific Disciplinary Sector
BIO/11 - Molecular Biology
ERC Sector, Level 1
LS - LIFE SCIENCES
ERC Sector, Level 2
LS2 Genetics, Genomics, Bioinformatics and Systems Biology: Molecular and population genetics, genomics, transcriptomics, proteomics, metabolomics, bioinformatics, computational biology, biostatistics, biological modelling and simulation, systems biology, genetic epidemiology
ERC Sector, Level 3
LS2_10 Bioinformatics
Cancer is a multi-stage process often driven by the progressive accumulation of genomic rearrangements, through which cells can acquire cancer properties such as invasive and metastatic behavior. Many cancer-associated genes are the result of complex somatic and inherited chromosomal rearrangements that produce aberrant transcripts or defects in transcription [1-5]. The classical approaches for identifying genome rearrangements, such as G-banded cytogenetics, spectral karyotyping and FISH, have poor sensitivity, while copy number arrays can identify only imbalanced breakpoints and do not describe the genome structure resulting from the events that caused the breakpoints. The aim of this project is to obtain, by applying the paired-end mapping (PEM) approach to massively parallel sequencing, a high-resolution virtual karyotype of the genome of a breast cancer patient whose transcriptomic portrait we obtained previously [6]. The introduction of massively parallel high-throughput sequencing (HTS) techniques has created a broad range of new and exciting research applications by dramatically increasing the sequencing output. In recent years, the continuous technical improvements of next-generation sequencing technology have made RNA sequencing (RNA-seq) particularly effective for the detection of gene fusions, which are involved in several diseases. Gene fusions are found in many cancer types, and they have proved to be prognostic biomarkers in several studies [7-9].
In addition, gene fusions often have a direct functional impact on the molecular processes in the cell [10]. Several analysis steps are needed to process the data provided by the sequencer and to use them for robust gene fusion detection. We propose a workflow to analyze NGS paired-end sequences in order to identify candidate fusions between different genes, looking for fusion events occurring on the same chromosome (intra-chromosomal rearrangements). The basic idea is to map the reads onto the reference genome and study the insert size distribution of the read pairs: we locate its peak and select all mapped pairs whose insert size is far from it. In this way we select paired-end sequences whose mates map to regions of the genome far from each other and may therefore connect different genes.
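The selection step described above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the workflow's actual implementation: the tuple layout of a mapped pair, the function names, the use of the median as a stand-in for the peak of the distribution, and the cutoff of five median absolute deviations are all hypothetical choices.

```python
from statistics import median


def insert_size(pair):
    """Distance between the two mates of a pair, or None if they map to different chromosomes.

    `pair` is a hypothetical tuple: (pair_id, chrom1, pos1, chrom2, pos2).
    """
    _, chrom1, pos1, chrom2, pos2 = pair
    if chrom1 != chrom2:
        return None  # only intra-chromosomal pairs are considered here
    return abs(pos2 - pos1)


def discordant_pairs(pairs, n_mads=5.0):
    """Return same-chromosome pairs whose insert size lies far above the typical value.

    Assumption: the median approximates the peak of the insert size
    distribution, and "far" means more than `n_mads` median absolute
    deviations (MAD) above it.
    """
    sized = [(p, insert_size(p)) for p in pairs]
    sized = [(p, s) for p, s in sized if s is not None]
    if not sized:
        return []
    values = [s for _, s in sized]
    centre = median(values)
    # Fall back to 1 when all sizes are identical, to avoid a zero-width cutoff.
    mad = median([abs(v - centre) for v in values]) or 1
    cutoff = centre + n_mads * mad
    return [p for p, s in sized if s > cutoff]
```

Pairs returned by `discordant_pairs` would still need to be checked against a gene annotation to see whether the two mates actually fall in different genes.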
The huge amount of transcript data produced by high-throughput sequencing requires the development and implementation of suitable bioinformatic workflows for its analysis and interpretation. These analysis workflows, which comprise different modules, should be designed specifically for the sequencing platform (Roche 454, Illumina, SOLiD) and the nature of the data (polyA or total RNA fraction, strand specificity). In the case of cDNA obtained from a total RNA preparation, in addition to polyadenylated protein-coding mRNAs, a great variety of transcript sequences can be obtained, including ribosomal RNAs, mitochondrial transcripts and a large variety of functional non-coding RNAs (ncRNAs). To deal with these data, the analysis workflow should include specific modules to distinguish the ncRNA fractions from the large number of protein-coding transcripts. To this aim we developed an analysis pipeline that, given as input a large collection of reads (particularly from Roche 454), provides qualitative and quantitative expression profiles of human mtDNA, ribosomal RNAs, ncRNAs and protein-coding mRNAs.
Recent studies have demonstrated an unexpected complexity of transcription in eukaryotes. Indeed, the majority of the genome is transcribed and only a small fraction of these transcripts is annotated as protein-coding genes and their splice variants. Therefore high-throughput transcriptome sequencing continuously identifies novel RNAs and novel classes of RNAs, which are the result of antisense, overlapping and non-coding RNA expression, demonstrating that the transcriptome captures a level of complexity that the simple genome sequence may not (1). Among next-generation sequencing platforms, the latest series of the Roche 454 GS sequencer, the GS FLX Titanium FLX+, yields over a million reads per run, each up to 700 bases long. Sequences of such length provide connectivity information among splicing sites and, in addition to enabling accurate mapping and relative quantification of mRNAs, are particularly suitable for the characterization of full-length splicing variants that may be differentially expressed in physiopathological conditions (2). On the other hand, the higher throughput of the Illumina HiSeq 1000 (150 bp) and ABI SOLiD (75 bp) platforms makes them particularly suitable for transcript-level quantification and for small RNA sequencing. Irrespective of the NGS platform used, the first step required for transcriptome sequencing is the construction of a cDNA library.
Several protocols have been developed so far to this aim, and each of them is suitable for sequencing on one specific platform exclusively. Here we describe a new fast and simple method (Patent pending RM2010A000293-PCT/IB2011/052369) to prepare and amplify a representative and strand-specific cDNA library starting from low-input total RNA (500 ng) for RNA-Seq applications, which may be implemented with all major platforms currently available (Roche 454, Illumina, ABI/SOLiD). Our method includes the following steps: a) rRNA removal from total RNA; b) retrotranscription of the rRNA-depleted RNA to cDNA with custom-designed, 5'-phosphorylated Tag-random-octamers capable of preserving strand information; c) purification of single-strand cDNAs; d) ligation and amplification of the purified cDNAs, thus obtaining a high yield of concatamers around 20 kb long. These DNA molecules can be sequenced equally well on Illumina and Roche 454 sequencing platforms, allowing not only the quantitative but also the qualitative assessment of transcriptome complexity. Moreover, we developed a suitable bioinformatic pipeline for the analysis of the sequences produced with this protocol. Specifically, we developed an in-house Python script, named Tag_Find (available upon request), able to recognize the position and the type of tag found within the read sequence. The program returns two files: one containing the types of tags found and their positions in the reads, and one fastq file with non-tagged reads, cleaned of tags. The Tag_Find efficiency
Recent studies have demonstrated an unexpected complexity of transcription in eukaryotes. The majority of the genome is transcribed and only a small fraction of these transcripts is annotated as protein-coding genes and their splice variants. Indeed, most transcripts are the result of antisense, overlapping and non-coding RNA expression. In this frame, one of the key aims of high-throughput transcriptome sequencing is the detection of all RNA species present in the cell, and the first crucial step for RNA-seq users is the choice of the strategy for cDNA library construction. The protocols developed so far require the entire library to be used for a single sequencing run on a specific platform. Results: We set up a unique protocol to generate and amplify a strand-specific cDNA library representative of all RNA species that may be implemented with all major platforms currently available on the market (Roche 454, Illumina, ABI/SOLiD). Our method is reproducible, fast and easy to perform, and even allows starting from low-input total RNA. Furthermore, we provide a suitable bioinformatics tool for the analysis of the sequences produced following this protocol. Conclusion: We tested the efficiency of our strategy, showing that our method is platform-independent, thus allowing the simultaneous analysis of the same sample with different NGS technologies, and providing an accurate quantitative and qualitative portrait of complex whole transcriptomes.
High-throughput technologies have provided the scientific community with an unprecedented opportunity for large-scale analysis of genomes. Non-coding RNAs (ncRNAs), for a long time believed to be non-functional, are emerging as one of the most important and largest families of gene regulators and as key elements for genome maintenance. Functional studies have been able to assign to ncRNAs a wide spectrum of functions in primary biological processes, and for this reason they are assuming a growing importance as a potential new family of cancer therapeutic targets. Nevertheless, the number of functionally characterized ncRNAs is still very small compared to the number of newly discovered ncRNAs. Thus platforms able to merge information from available resources while addressing data integration issues are necessary but still insufficient to elucidate the biological roles of ncRNAs. RESULTS: In this paper, we describe a platform called Arena-Idb for the retrieval of comprehensive and non-redundant annotated ncRNA interactions. Arena-Idb provides a framework for network reconstruction of heterogeneous ncRNA interactions (i.e., with other types of molecules) and relationships with human diseases, which guide the integration of data extracted from different sources via mapping of entities and minimization of ambiguity. CONCLUSIONS: Arena-Idb provides a schema and a visualization system to integrate ncRNA interactions that assists in discovering ncRNA functions through the extraction of heterogeneous interaction networks. Arena-Idb is available at http://arenaidb.ba.itb.cnr.it.
Alternative splicing is emerging as a major mechanism for the expansion of transcriptome and proteome diversity, particularly in human and other vertebrates. However, the proportion of alternative transcripts and proteins actually endowed with functional activity is currently highly debated. We present here a new release of ASPicDB, which now provides a unique annotation resource of human protein variants generated by alternative splicing. A total of 256 939 protein variants from 17 191 multi-exon genes have been extensively annotated through state-of-the-art machine learning tools providing information on the protein type (globular and transmembrane), localization, presence of PFAM domains, signal peptides, GPI-anchor propeptides, transmembrane and coiled-coil segments. Furthermore, full-length variants can now be specifically selected based on the annotation of CAGE-tags and polyA signals and/or polyA sites, marking transcription initiation and termination sites, respectively. The retrieval can be carried out at gene, transcript, exon, protein or splice site level, allowing the selection of data sets fulfilling one or more features set by the user. The retrieval interface also enables the selection of protein variants showing specific differences in the annotated features. ASPicDB is available at http://www.caspur.it/ASPicDB/.
It is known from recent studies that more than 90% of human multi-exon genes are subject to Alternative Splicing (AS), a key molecular mechanism in which multiple transcripts may be generated from a single gene. It is widely recognized that a breakdown in AS mechanisms plays an important role in cellular differentiation and pathologies. Polymerase Chain Reactions, microarrays and sequencing technologies have been applied to the study of transcript diversity arising from alternative expression. The latest-generation Affymetrix GeneChip Human Exon 1.0 ST Arrays offer a more detailed view of the gene expression profile, providing information on the AS patterns. The exon array technology, with more than five million data points, can detect approximately one million exons, and it allows performing analyses at both gene and exon level. In this paper we describe BEAT, an integrated user-friendly bioinformatics framework to store, analyze and visualize exon array datasets. It combines a data warehouse approach with rigorous statistical methods for assessing the AS of genes involved in diseases. Meta statistics are proposed as a novel approach to explore the analysis results. BEAT is available at http://beat.ba.itb.cnr.it. Results: BEAT is a web tool which allows uploading and analyzing exon array datasets using standard statistical methods and an easy-to-use graphical web front-end. BEAT has been tested on a dataset with 173 samples and tuned using new datasets of exon array experiments from 28 colorectal cancer and 26 renal cell cancer samples produced at the Medical Genetics Unit of IRCCS Casa Sollievo della Sofferenza. To highlight all possible AS events, alternative names, accession IDs, Gene Ontology terms and biochemical pathway annotations are integrated with exon and gene level expression plots.
The user can customize the results by choosing custom thresholds for the statistical parameters and exploiting the available clinical data of the samples for a multivariate AS analysis. Conclusions: Despite exon array chips being widely used for transcriptomics studies, there is a lack of analysis tools offering advanced statistical features and requiring no programming knowledge. BEAT provides a user-friendly platform for a comprehensive study of AS events in human diseases, displaying the analysis results with easily interpretable and interactive tables and graphics.
Multiple sclerosis (MS) is a complex disease of the CNS that usually affects young adults, although 3-5% of cases are diagnosed in childhood and adolescence (hence called pediatric MS, PedMS). Genetic predisposition, among other factors, seems to contribute to the risk of onset in pediatric as in adult ages, but few studies have investigated the genetic 'environmentally naïve' load of PedMS. The main goal of this study was to identify circulating markers (miRNAs), target genes (mRNAs) and functional pathways associated with PedMS; we also verified the impact of miRNAs on clinical features, i.e. disability and cognitive performance. The investigation was performed in 19 PedMS patients and 20 pediatric controls (PCs) using a High-Throughput Next-generation Sequencing (HT-NGS) approach followed by an integrated bioinformatics/biostatistics analysis. Twelve miRNAs were significantly upregulated (let-7a-5p, let-7b-5p, miR-25-3p, miR-125a-5p, miR-942-5p, miR-221-3p, miR-652-3p, miR-182-5p, miR-185-5p, miR-181a-5p, miR-320a, miR-99b-5p) and 1 miRNA was downregulated (miR-148b-3p) in PedMS compared with PCs. The interactions between the significant miRNAs and their targets uncovered predicted genes (i.e. TNFSF13B, TLR2, BACH2, KLF4) related to immunological functions, as well as genes involved in autophagy-related processes (i.e. ATG16L1, SORT1, LAMP2) and ATPase activity (i.e. ABCA1, GPX3). No significant molecular profiles were associated with any PedMS demographic/clinical features. Both miRNA and mRNA expression predicted the phenotypes (PedMS-PC) with an accuracy of 92% and 91%, respectively. In our view, this original strategy of simultaneous miRNA/mRNA analysis may help to shed light on the genetic background of the disease, suggesting further molecular investigations into novel pathogenic mechanisms.
In plants, which are particularly sensitive to changes in environmental conditions, modulation of DNA methylation is a crucial mechanism for regulating gene expression in response to abiotic and biotic stresses. Monitoring of the plant immune system in response to bacterial pathogen infection has demonstrated that dynamic DNA methylation changes, and not only gene imprinting, have a regulatory effect in plant pathogen defense. Critical elements for epigenetic modifications of plant genomes are non-coding small RNAs (sRNAs); the same RNA family is also a hallmark of the plant reaction to virus infection. Interestingly, sRNAs have a central role in both plant genome methylation and resistance to virus infection; however, the interaction between sRNA expression and DNA methylation in regulating the immune system in response to virus infection has not been investigated so far. To correlate dynamic DNA methylation and differential sRNA expression in response to virus infection, we performed genome-wide methylation and sRNA expression profiling on Arabidopsis leaves systemically infected with either the DNA virus Cauliflower mosaic virus or the RNA virus Cucumber mosaic virus. We developed a software package to analyze the sRNA expression and DNA methylation profiles and to run a genome-wide comparison of control and infected samples, searching for regions that differ significantly in the methylation profile, in sRNA expression, or in both. In the regions where we observed a significant correlation between changes in methylation (mainly CHH methylation) and in sRNA expression, we found that both hypo- and hypermethylation correlated with downregulation of 21/24 nt sRNAs. These regions mostly comprised transposons, and a few of them contained promoter or coding sequences of genes involved, according to gene ontology, in DNA binding, DNA-dependent regulation of transcription and response to abiotic or biotic stimulus.
This confirms virus-induced regulation of sRNA and DNA methylation. Data analysis is still in progress, and more details about the correlation between virus-induced modifications of sRNA and DNA methylation levels will be reported.
The establishment and maintenance of DNA methylation are relatively well understood, whereas little is known about their dynamics and biological relevance in innate immunity [1-2]. In plants, modulation of DNA methylation might be an effective mechanism to regulate gene expression in response to abiotic and biotic stresses. Recent evidence from large-scale epigenomic approaches indicates that dynamic DNA methylation changes are not limited to gene imprinting but can regulate the plant's immune system in response to pathogens. In plants, virus infections trigger the expression of non-coding small RNAs (smRNAs) and also influence the epigenetic status of the host genome; however, the involvement of DNA methylation in the regulation of the plant immune system in response to virus infection has not been investigated so far. In this context, we are carrying out a study aiming to elucidate the impact of DNA and RNA virus infections on genomic DNA methylation in plants, and its correlation with the expression of small RNAs, by integrating the analysis of multiple "omics" datasets obtained with next-generation sequencing technologies. In this paper we present the results of the analysis of the methylation modifications induced by virus infection on the whole genome and on coding and non-coding gene regions.
Amyotrophic lateral sclerosis (ALS) is a progressive and fatal neurodegenerative disease. While genetics and other factors contribute to ALS pathogenesis, critical knowledge is still missing and validated biomarkers for monitoring disease activity have not yet been identified. To address those aspects we carried out this study with the primary aim of identifying possible miRNA/mRNA dysregulation associated with the sporadic form of the disease (sALS). Additionally, we explored miRNAs as modulating factors of the observed clinical features. The study included 56 sALS patients and 20 healthy controls (HCs). We analyzed the peripheral blood samples of sALS patients and HCs with high-throughput next-generation sequencing followed by an integrated bioinformatics/biostatistics analysis. Results showed that 38 miRNAs (let-7a-5p, let-7d-5p, let-7f-5p, let-7g-5p, let-7i-5p, miR-103a-3p, miR-106b-3p, miR-128-3p, miR-130a-3p, miR-130b-3p, miR-144-5p, miR-148a-3p, miR-148b-3p, miR-15a-5p, miR-15b-5p, miR-151a-5p, miR-151b, miR-16-5p, miR-182-5p, miR-183-5p, miR-186-5p, miR-22-3p, miR-221-3p, miR-223-3p, miR-23a-3p, miR-26a-5p, miR-26b-5p, miR-27b-3p, miR-28-3p, miR-30b-5p, miR-30c-5p, miR-342-3p, miR-425-5p, miR-451a, miR-532-5p, miR-550a-3p, miR-584-5p, miR-93-5p) were significantly downregulated in sALS. We also found that different miRNA profiles characterized the bulbar/spinal onset and the progression rate. This observation supports the hypothesis that miRNAs may impact the phenotypic expression of the disease. Genes known to be associated with ALS (e.g., PARK7, C9orf72, ALS2, MATR3, SPG11, ATXN2) were confirmed to be dysregulated in our study. We also identified other potential candidate genes such as LGALS3 (implicated in neuroinflammation) and PRKCD (activated in mitochondrial-induced apoptosis). Some of the downregulated genes are involved in molecular binding to ions (i.e., metals, zinc, magnesium) and in ion-related functions.
The genes that we found upregulated were involved in the immune response, oxidation-reduction, and apoptosis. These findings may have important implications, e.g., for the monitoring of sALS progression, and therefore represent a significant advance in the elucidation of the disease's underlying molecular mechanisms. The extensive multidisciplinary approach we applied in this study was critically important for its success, especially in complex disorders such as sALS, wherein access to the genetic background is a major limitation.
Motivation: Around 50% of all human tumours carry point mutations in the p53 tumour suppressor gene, which alter p53 DNA binding specificity. In tumours with wild-type p53, p53 is often rendered functionally inert by the inactivation of its positive modulators or by the activation of negative factors, which block p53 transcriptional activities [1]. We identified a new p53 direct target gene, TRIM8, belonging to the Tripartite Motif (TRIM) protein family, defined by the presence of a RING domain, one or two B-boxes and a Coiled-Coil region. We found that TRIM8 overexpression leads, through a positive feedback loop, to p53 stabilization and p53-mediated suppression of cell proliferation. In order to identify the pathways activated by TRIM8 leading to p53 stabilization, we transiently transfected the HCT116-p53 (wt) cell line with TRIM8 and sequenced the total transcriptome with an NGS run on a 454 GS FLX platform. Here we report some statistics and the preliminary results of: i) read mapping on the human genome and analysis of differentially expressed genes; ii) functional analysis of differentially expressed genes. Method: Total RNA was extracted from the HCT116-p53 (wt) cell line 48 h after transfection, depleted of rRNA, retro-transcribed, amplified and sequenced using the Roche GS FLX Titanium Series pyrosequencer. Genome mapping, statistics and differential expression analyses were performed using the "NGS-Trex" system (NGS Transcriptome profile Explorer) (Mignone F. et al., submitted), an automatic system designed for analyzing Next Generation Sequencing data generated from large-scale transcriptome studies. The overall procedure involves four steps: 1) creation of a project and upload of reads in multi-fasta format; 2) read mapping onto the reference genome after setup of appropriate parameters; 3) annotation of mapped reads; 4) data mining using simple query forms.
TRIM8 and FLAG data were submitted to NGS-Trex using default parameters, which can be briefly summarized as follows: reads were mapped onto the human genome (min similarity 90% and min overlap 50 nt), discarding reads mapping onto more than 10 genomic regions. Mapped reads were compared to the annotation to assign reads to genes and to identify new splice variants. Differentially expressed genes and splicing events were identified by computing a P-value associated with a hypergeometric distribution. Housekeeping genes were used to normalise read counts before identification of differentially expressed genes. The lists of genes showing differential expression in the two samples were then analysed using DAVID (v6.7), an integrated biological knowledgebase with analytic tools (text and pathway-mining tools) for functional annotation of large gene lists [2,3]. An additional analysis on the TRIM8 and FLAG sequence samples was made for the detection and annotation of the ncRNA genome fraction. We used a bioinformatic analysis pipeline, developed by us, which is able to: 1) select ncRNA fro
Introduction: Non-coding RNAs (ncRNAs) serve as regulatory molecules for a variety of biological processes. They are roughly classified by size into two major categories: small non-coding RNAs (sncRNAs), such as microRNAs (miRNAs), and long non-coding RNAs (lncRNAs). The lncRNAs have a broader spectrum of functions and are, therefore, a potential new class of cancer therapeutic targets [1,2]. In addition, there are other types of ncRNAs whose role is not yet clear: circular-RNA, lincRNA, scRNA, sense-intronic and vault-RNA. New advances in translational research will require an accurate understanding of the functional relationships between protein-coding and ncRNA categories, as well as sponge regulatory networks [3,4]. To achieve this goal, we have built an integrated bioinformatics knowledge base, collecting non-redundant annotations of human ncRNAs, sequences and interactors, which provides comprehensive access to all the knowledge available concerning ncRNAs, their interactions with other molecules and associated diseases. As key characteristics, the database overcomes the problem of different nomenclatures used by different sources and provides new clues about ncRNA functions through interactions inferred by network reconstruction [5]. Methods: ncRNA interactions include physical relationships (i.e. molecular binding between ncRNAs and DNA, RNAs or proteins) and functional relationships (i.e., co-expression, regulation, associated diseases, statistical and functional associations). Interactions stored in the database are in the form 'ncRNA-mate', where the mate entity belongs to one of the following types: ncRNA, protein-coding RNA (pcRNA), gene, protein, pseudogene and phenotype.
In order to ensure the data quality of our interaction database we have developed a series of Extraction, Transformation and Loading (ETL) modules able to extract, collect and integrate primary annotations, sequences and interactions from different public biological resources. The extracted biological entities and their relations are modelled as a network, a mathematical object composed of nodes (entities) and edges (relations) [5]. Entity redundancy has been identified by cross-link references and sequence similarity using the Cleanup software [6]. Non-coding RNAs are classified into biotypes, associated with Sequence Ontology terms [7] and integrated with data on protein-coding RNAs (pcRNAs), genes, proteins, pseudogenes and phenotypes. Furthermore, we extended the cross-reference network with data provided by Ensembl [8], using the biomaRt library of BioConductor [9]. Results: The total numbers of distinct entities collected in our interaction database are: 168,058 ncRNAs, 5,009 pcRNAs, 52,811 genes, 1,999 proteins, 15,940 pseudogenes and 849 phenotypes. Moreover, the total numbers of interactions, by mate type, are: 130,383 ncRNA-ncRNA, 55,048 ncRNA-pcRNA, 1,458,925 ncRNA-gene, 99,653 ncRNA-protein, 70,482 ncRNA-phenotype, 17,217 ncR
A holistic understanding of environmental communities is the new challenge of metagenomics. Accordingly, the amplicon-based or metabarcoding approach, largely applied to investigate bacterial microbiomes, is moving to the eukaryotic world too. Indeed, the analysis of metabarcoding data may provide a comprehensive assessment of both bacterial and eukaryotic composition in a variety of environments, including the human body. In this respect, whereas hypervariable regions of the 16S rRNA are the de facto standard barcode for bacteria, the Internal Transcribed Spacer 1 (ITS1) of the ribosomal RNA gene cluster has shown a high potential in discriminating eukaryotes at deep taxonomic levels. As metabarcoding data analysis relies on the availability of a well-curated barcode reference resource, a comprehensive collection of ITS1 sequences supplied with robust taxonomies is highly needed. To address this issue, we created ITSoneDB (available at http://itsonedb.cloud.ba.infn.it/), which in its current version hosts 985 240 ITS1 sequences spanning over 134 000 eukaryotic species. Each ITS1 is mapped on the NCBI reference taxonomy with its start and end positions precisely annotated. ITSoneDB has been developed in accordance with the FAIR guidelines by enabling users to query and download its content through a simple web interface and to access relevant metadata by cross-linking to the European Nucleotide Archive.
Motivations. Metagenomics is advancing rapidly thanks to the advent of high-throughput next-generation sequencing (NGS) technologies, which allow an unprecedented large-scale identification of microorganisms living in almost every environment. In particular, the use of the amplicon-based metagenomic approach to explore the diversity of fungal environmental communities is increasingly expanding. At the species level, a number of studies have used the non-conserved internal transcribed spacers (ITS) 1 and 2 of the ribosomal RNA gene cluster as genetic markers to explore fungal taxonomic diversity. In particular, ITS1 is gaining popularity as a better species-discriminating marker in Fungi because of its higher variability compared to ITS2. Starting from the total DNA extracted from any environmental sample, this locus can be easily amplified with taxonomically universal primers and sequenced by means of high-throughput next-generation platforms. Reference databases and robust supporting taxonomies are crucial in assigning phylogenetic affiliation to the huge amount of sequences produced. Even if a large number of ITS1 sequences are collected in public databases, a specialized resource focused on this region, where sequence identity, boundaries and taxonomic assignment are validated, is still needed. In this work we present ITSoneDB, a new comprehensive collection of ITS1 sequences belonging to the Fungi kingdom. Methods. ITSoneDB has been generated and populated using a multi-step Python workflow. In the first step the ribosomal RNA gene cluster sequences of Fungi including the target ITS1 region were retrieved from GenBank. Then, ITS1 start and end boundaries were extracted from the Feature Table annotations, if available.
In order to infer, validate and, where necessary, redefine the ITS1 location, Hidden Markov Model (HMM) profiles of the flanking genes for 18S and 5.8S ribosomal RNA, generated from their reference alignments stored in the RFAM database, were mapped on the entire collection of retrieved nucleotide sequences by means of the hmmsearch tool from the HMMER 3.0 package. Results. At present, ITSoneDB includes 405,433 taxonomically arranged sequence entries, each provided with ITS1 start and end positions defined by GenBank annotations and/or the HMM-based method. The ITSoneDB front-end is a Java platform-based website for data browsing and downloading. The database can be queried by species or taxon name, by GenBank accession ID, or by "expanding" the target rank on a detailed fungal taxonomical tree. The complete ITS1 sequence dataset collected in ITSoneDB is available in Fasta format, and users can extract and locally save all or selected queried ITS1 sequences for further analysis.
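The idea of inferring the ITS1 location from its flanking genes can be illustrated with a small Python sketch. It is not the ITSoneDB workflow itself: it assumes the 18S and 5.8S hits have already been obtained and parsed into 1-based, inclusive coordinates on the forward strand, and the `Hit` type and both function names are hypothetical.

```python
from typing import NamedTuple, Optional, Tuple


class Hit(NamedTuple):
    """Location of a flanking rRNA gene hit on a sequence (1-based, inclusive)."""
    start: int
    end: int


def its1_from_flanks(hit_18s: Optional[Hit], hit_58s: Optional[Hit]) -> Optional[Tuple[int, int]]:
    """Infer ITS1 as the region between the end of 18S and the start of 5.8S.

    Returns None when a flank is missing or the hits leave no room between them.
    """
    if hit_18s is None or hit_58s is None:
        return None
    start = hit_18s.end + 1
    end = hit_58s.start - 1
    if start > end:
        return None
    return (start, end)


def boundary_support(annotated: Optional[Tuple[int, int]],
                     inferred: Optional[Tuple[int, int]]) -> str:
    """Say which evidence supports a boundary: the annotation, the HMM hits, both, or none."""
    if annotated and inferred:
        return "both (agree)" if annotated == inferred else "both (disagree)"
    if annotated:
        return "annotation only"
    if inferred:
        return "HMM only"
    return "none"
```

For example, an 18S hit ending at position 1800 and a 5.8S hit starting at 2051 would place ITS1 at positions 1801 to 2050; entries where the two sources disagree are the ones that would need their boundaries revisited.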
Extracellular vesicles (EVs), nanoparticles originating from different cell types, seem to be implicated in several cellular activities. In the Central Nervous System (CNS), glia and neurons secrete EVs, and recent studies have demonstrated that the intercellular communication mediated by EVs has a versatile functional impact on cerebral homeostasis. This essential role may be due to their protein and RNA cargo, which may modify the phenotypes of the targeted cells. Despite the increasing importance of EVs, little is known about their fluctuations in physiological as well as in pathological conditions. Furthermore, only a few studies have investigated the contents of both EV subgroups (microvesicles, MVs, and exosomes, EXOs) at the same time with the purpose of discriminating between their features and functional roles. In order to shed light on these issues, we performed a pilot study in which MVs and EXOs extracted from serum samples of a small cohort of subjects (patients with the first clinical evidence of CNS demyelination, also known as Clinically Isolated Syndrome, and Healthy Controls) were submitted to deep small-RNA sequencing. Data were analysed with an in-house bioinformatics platform. In line with previous reports, distinct classes of non-coding RNAs were detected in both EV subsets, offering interesting suggestions on their origins and functions. We also verified the feasibility of this extensive molecular approach, thus supporting its use for the analysis of circulating biomarkers (e.g., microRNAs) in order to investigate and monitor specific diseases.
RNA-Seq by massively parallel sequencing is a powerful way to perform transcriptome and small non-coding RNA (ncRNA) analyses. Recent reports of the ENCODE project underline that while 80% of the human genome is transcribed, only 2% is protein coding, suggesting that the vast majority of the genome is transcribed as non-protein-coding RNA. We have developed an automated web-based platform, nc-aReNA, for the mapping, classification and annotation of human and mouse ncRNAs from RNA-Seq data. The platform is based on a data-warehouse approach and a workflow environment that includes data quality control, genome and nc-RNAome sequence alignment, differential expression profiling analysis and statistics of classified data.
High-throughput (HT) technologies, such as microarrays and especially Next-Generation Sequencing (NGS), have provided tremendous potential for profiling protein-coding and non-protein-coding RNAs (ncRNAs). Recent reports of the ENCODE project underline that while 80% of the human genome is transcribed, only 2% is protein coding, suggesting that the vast majority of the genome is transcribed as non-protein-coding RNA. We present the development of a web-based bioinformatics platform, nc-aReNA, for the mapping, classification and annotation of human and mouse ncRNAs from HT-NGS data. The platform is based on a data-warehouse approach and a workflow environment that includes data quality control, genome and nc-RNAome sequence alignment, differential expression profiling analysis and statistics of classified data. Methods. The nc-aReNA architecture is based on a modular analysis pipeline, flanked by a data warehouse, for the classification and annotation of small RNA-seq data. The pipeline takes as input the sequenced reads in FASTQ format. After the initial steps of adaptor removal and quality check, the input reads are mapped to an in-house non-redundant ncRNA reference database (http://ncRNAdb.ba.itb.cnr.it), which collects and integrates ncRNA gene lists from MGI (Mouse Genome Informatics) and HGNC (HUGO Gene Nomenclature Committee) with sequences and biotype annotations from VEGA (Vertebrate Genome Annotation), ENSEMBL, RefSeq, Rfam (for tRNA sequences) and miRBase (for miRNAs). NGS reads mapped in this step are classified by using the Sequence Ontology (SO) (Eilbeck K. et al., 2005).
Unmapped reads are aligned to the reference genome and tagged with the corresponding genomic locus. Integrated statistics are used to calculate RPM (Reads Per Million), fold changes and False Discovery Rate (FDR)-corrected p-values, and to perform differential expression analysis of all (or user-chosen) ncRNA classes by comparing two or more experimental conditions or time-course data. An additional module, called "miRNA identification", analyses all unmapped miRNA-like reads by means of the miRDeep2 software. All analysis results and annotations are stored in a data warehouse implemented with Infobright (http://www.infobright.org). A user-friendly web-based Graphical User Interface (GUI), developed on the Java platform, guides the user through the submission process and displays results in tables and graphs. Results. The main features of nc-aReNA are:
- identification and classification of reads in known functional ncRNA categories in SO;
- identification and filtering of reads mapping to ribosomal RNAs and mtDNA transcripts;
- RPM calculation for each known ncRNA;
- export of user-selected classes of ncRNA for further specific investigation;
- quantification of ncRNA expression and differential expression analysis for all identified ncRNA classes;
- graphical visualization of sample expression profiles;
- additional annot
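
The statistics named in the Methods (RPM, fold changes and FDR-corrected p-values) can be sketched as follows. This is only an illustration under stated assumptions: RPM is normalised by the total number of mapped reads, the fold change uses a pseudocount of 1, and the FDR correction is the Benjamini-Hochberg procedure. The text does not say which normaliser or correction nc-aReNA actually uses, and all function names are hypothetical.

```python
import math
from typing import Dict, List


def rpm(counts: Dict[str, int], total_mapped: int) -> Dict[str, float]:
    """Reads Per Million: raw count scaled to one million mapped reads (assumed normaliser)."""
    return {name: c * 1_000_000 / total_mapped for name, c in counts.items()}


def log2_fold_change(rpm_a: float, rpm_b: float, pseudocount: float = 1.0) -> float:
    """log2 ratio of condition B over condition A; the pseudocount avoids division by zero."""
    return math.log2((rpm_b + pseudocount) / (rpm_a + pseudocount))


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Example with made-up counts and p-values
print(rpm({"ncRNA_A": 500, "ncRNA_B": 20}, total_mapped=2_000_000))
print(log2_fold_change(3.0, 7.0))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```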
MOTIVATION: The recent availability of next-generation sequencing (NGS) technologies has provided the scientific community with an unprecedented opportunity for large-scale genome analysis in a large number of organisms. One of the most challenging tasks for bioinformaticians is to develop tools that provide biologists with easy access to curated and non-redundant collections of sequence data. Non-coding RNAs, long believed to be non-functional, are emerging as the largest and most important family of gene regulators. METHODS: NonCode aReNA DataBase is a comprehensive and non-redundant source of manually curated and automatically annotated ncRNA transcripts collected from major public resources. The database is built through a set of automated ETL (Extraction, Transformation, Loading) processes that extract and collect data from VEGA, ENSEMBL, RefSeq, miRBase, GtRNAdb and piRNABank. The automated process also guarantees recurring updates. Redundant sequences are identified by analyzing both cross-link references and sequence similarity. Furthermore, non-coding RNA sequences have been classified into diverse biotypes and associated with Sequence Ontology terms. NonCode aReNA DataBase was originally developed as a component of a larger project, comprising a data warehouse and an analysis workflow, for the functional annotation of ncRNAs from NGS data. RESULTS: NonCode aReNA DataBase is currently available as a web resource at http://ncrnadb.ba.itb.cnr.it/. The database can be queried by using multi-criteria and ontological search through an easy-to-use web interface. Query results can be exported as non-redundant collections of ncRNA transcripts. Currently, NonCode aReNA DataBase contains 134,908 human ncRNAs classified in 24 biotypes, and future updates will include transcripts of Mus musculus and Arabidopsis thaliana.
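The redundancy-removal step can be pictured with a deliberately simplified sketch. It assumes that two records are redundant when their sequences are identical after normalisation (upper case, U converted to T), which is a much stricter stand-in for the sequence-similarity and cross-reference analysis mentioned in the abstract. The function names and the accessions in the example are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def normalise(seq: str) -> str:
    """Upper-case the sequence and write RNA as DNA so both alphabets compare equal."""
    return seq.strip().upper().replace("U", "T")


def collapse_redundant(records: Iterable[Tuple[str, str, str]]) -> Dict[str, List[str]]:
    """Group (source, accession, sequence) records by normalised sequence.

    Returns a mapping: unique sequence -> list of 'source:accession' cross-references.
    Assumption: exact identity defines redundancy in this sketch.
    """
    merged: Dict[str, List[str]] = {}
    for source, accession, seq in records:
        merged.setdefault(normalise(seq), []).append(f"{source}:{accession}")
    return merged


# Example with made-up accessions
records = [
    ("miRBase", "ACC1", "UGAGGUAGUAGG"),
    ("ENSEMBL", "ACC2", "tgaggtagtagg"),
    ("RefSeq", "ACC3", "GGCCAATT"),
]
for sequence, xrefs in collapse_redundant(records).items():
    print(sequence, xrefs)
```

Keeping the list of cross-references for each unique sequence preserves the link back to every source database from which the transcript was collected.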
The establishment and maintenance of DNA methylation are relatively well understood, whereas little is known about its dynamics and biological relevance in innate immunity. In plants, modulation of DNA methylation might be an effective mechanism to regulate gene expression in response to abiotic and biotic stresses. Recent evidence from large-scale epigenomic approaches indicates that dynamic DNA methylation changes are not limited to gene imprinting but can regulate the plant's immune system in response to pathogens. In plants, virus infections trigger the expression and regulation of non-coding small RNAs, and genomic regions are epigenetically modified through the action of the same molecules; however, the involvement of DNA methylation in the regulation of the plant immune system in response to virus infection has not been investigated before. We have examined for the first time the impact of virus infections on genomic DNA methylation and its correlation with small RNA regulation and gene expression by integrating the analysis of multiple "omics" datasets based on next-generation sequencing platforms. To investigate the possibility that DNA methylation dynamically responds to virus infection, we performed whole-genome bisulfite sequencing on Arabidopsis leaves systemically infected with either the DNA virus Cauliflower mosaic virus (CaMV-Arabidopsis) or the RNA virus Cucumber mosaic virus (CMV-Arabidopsis). Single-base resolution methylome analysis revealed more than 3.7 million methyl-cytosines (mCs) in the control plant. Interestingly, in CMV-Arabidopsis we found 300,000 more mCs (hypermethylation) and in CaMV-Arabidopsis 700,000 fewer mCs (hypomethylation). Focusing on differentially methylated regions (DMRs, 250 nt in length), we observed a balanced distribution of hyper- and hypomethylation in the CG and CHH contexts in CMV-Arabidopsis (2,700 DMRs in total), whereas in CaMV-Arabidopsis the DMRs were predominantly hypomethylated in the CHH context (5,600 DMRs in total).
Gene features, including coding, non-coding and promoter sequences, were assigned to unique gene identifiers according to the TAIR nomenclature. Among differentially methylated gene features, promoter regions were the vast majority, accounting, in specific mC contexts, for up to 80% of the total. The whole gene ID dataset was subjected to functional enrichment analysis using the DAVID tool. Interestingly, specific functional categories such as "plant defense" and "auxin signalling pathway" were significantly enriched. The correlation between DNA methylation status and the transcriptional modulation of those genes is under investigation. A comparison between the methylation profiles induced by CaMV and CMV infections revealed conspicuous qualitative and quantitative differences. Taken together, our results indicate that infections by RNA- and DNA-genome viruses induce different regulation of DNA methylation and, at least in part, different immune responses in Arabidopsis.
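The notion of 250-nt differentially methylated regions used in this study can be made concrete with a toy sketch. It assumes fixed, non-overlapping 250-nt windows, pooled methylation levels per window, and an arbitrary minimum difference of 0.2 between infected and control samples. The abstract does not describe the actual DMR-calling procedure, so the threshold, the function names and the example data are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

WINDOW = 250  # DMR length in nt, as stated in the text

# position (1-based) -> (methylated reads, total reads) for one cytosine
Calls = Dict[int, Tuple[int, int]]


def window_levels(calls: Calls) -> Dict[int, float]:
    """Pooled methylation level for each 250-nt window (window index -> level)."""
    meth: Dict[int, int] = defaultdict(int)
    total: Dict[int, int] = defaultdict(int)
    for pos, (m, n) in calls.items():
        w = (pos - 1) // WINDOW
        meth[w] += m
        total[w] += n
    return {w: meth[w] / total[w] for w in total if total[w] > 0}


def call_dmrs(control: Calls, infected: Calls,
              min_delta: float = 0.2) -> List[Tuple[int, int, str]]:
    """Return (start, end, 'hyper' | 'hypo') for windows whose level changes by >= min_delta."""
    c, t = window_levels(control), window_levels(infected)
    dmrs = []
    for w in sorted(set(c) & set(t)):  # only windows covered in both samples
        delta = t[w] - c[w]
        if abs(delta) >= min_delta:
            label = "hyper" if delta > 0 else "hypo"
            dmrs.append((w * WINDOW + 1, (w + 1) * WINDOW, label))
    return dmrs


# Example with made-up read counts
control = {10: (2, 10), 300: (8, 10)}
infected = {10: (8, 10), 300: (1, 10)}
print(call_dmrs(control, infected))
```

A real analysis would treat the CG, CHG and CHH contexts separately and apply a statistical test per window rather than a fixed difference.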
PlantPIs is a web querying system for a database collection of plant protease inhibitor data. Protease inhibitors in plants are naturally occurring proteins that inhibit the function of endogenous and exogenous proteases. In this paper we report the design and development of a web framework providing a clear and very flexible way of querying plant protease inhibitor data. The web resource is based on a relational database containing publicly accessible data on plant protease inhibitors, and a graphical user interface providing all the necessary browsing tools, including a data export function. PlantPIs contains information extracted principally from the MEROPS database, filtered, annotated and compared with data stored in other public protein and gene databases, using both automated techniques and domain expert evaluations. The data are organized to allow flexible and easy access to the stored information. The database is accessible at http://www.plantpis.ba.itb.cnr.it/.
Metagenomics is providing unprecedented access to environmental microbial diversity. The amplicon-based metagenomics approach involves the PCR-targeted sequencing of a genetic locus that meets several requirements. Namely, it must be ubiquitous in the taxonomic range of interest, variable enough to discriminate between different species but flanked by highly conserved sequences, and of suitable size to be sequenced through next-generation platforms. The internal transcribed spacers 1 and 2 (ITS1 and ITS2) of the ribosomal DNA operon and one or more hyper-variable regions of the 16S ribosomal RNA gene are typically used to identify fungal and bacterial species, respectively. In this context, reliable reference databases and taxonomies are crucial to assign amplicon sequence reads to the correct phylogenetic ranks. Several resources provide consistent phylogenetic classification of publicly available 16S ribosomal DNA sequences, whereas the state of ribosomal internal transcribed spacer reference databases is notably less advanced. In this review, we aim to give an overview of existing reference resources for both types of markers, highlighting strengths and possible shortcomings of their use for metagenomics purposes. Moreover, we present a new database, ITSoneDB, of well annotated and phylogenetically classified ITS1 sequences to be used as a reference collection in metagenomic studies of environmental fungal communities. ITSoneDB is available for download and browsing at http://itsonedb.ba.itb.cnr.it/.
Biodiversity research deals with data coming from many different domains (e.g., Biology, Geography, Evolutionary Studies, Genomics, Taxonomy, Environmental Sciences), which need to be integrated to yield valuable biodiversity knowledge. Collecting and integrating data from so many heterogeneous resources is not a trivial task. Data are extremely scattered, heterogeneous in format and purpose, and protected in the repositories of several research institutes. Driven by the widespread web trend of sharing information among people with the same interests (social networks), and by the new type of database architecture defined as a dynamic distributed federated database, we propose a new paradigm of data integration in the biodiversity domain. Here we present a new approach for the development of a Knowledge Base aimed at the collection, integration and analysis of biodiversity data, implemented as a product of the MBLab project.
The recent availability of high-throughput technologies, like next-generation sequencing (NGS) platforms, has provided the scientific community with an unprecedented opportunity for large-scale genome analysis in a large number of organisms. However, one of the most challenging tasks for bioinformaticians is to develop tools that provide biologists with easy access to curated and non-redundant collections of sequence data. Non-coding RNAs, long believed to be non-functional, are emerging as the largest and most important family of gene regulators. NonCode aReNA Database is a comprehensive and non-redundant source of manually curated and automatically annotated ncRNA transcripts. Originally developed as a component of a bigger project, comprising a data warehouse for the functional annotation of ncRNAs from NGS data, NonCode aReNA DB is currently available as a web resource at http://ncrnadb.ba.itb.cnr.it/. Sequences have been classified into diverse biotypes and associated with Sequence Ontology terms. The database can be queried by using multi-criteria and ontological search through an easy-to-use web interface, and data can be exported as non-redundant collections of transcripts annotated in VEGA, ENSEMBL, RefSeq, miRBase, GtRNAdb and piRNABank. The database is updated through an automatic pipeline, and the last update was in January 2015. Presently, NonCode aReNA DB contains 134,908 human ncRNAs classified in 24 biotypes, and the next update will include transcripts of Mus musculus and Arabidopsis thaliana. Acknowledgements: This work was supported by the Italian MIUR Flagship Project "Epigen".
The 5' and 3' untranslated regions of eukaryotic mRNAs (UTRs) play crucial roles in the post-transcriptional regulation of gene expression through the modulation of nucleo-cytoplasmic mRNA transport, translation efficiency, subcellular localization and message stability. UTRdb is a curated database of 5' and 3' untranslated sequences of eukaryotic mRNAs, derived from several sources of primary data. Experimentally validated functional motifs are annotated and also collated as the UTRsite database, where more specific information on the functional motifs and cross-links to interacting regulatory proteins are provided. In the current update, the UTR entries have been organized in a gene-centric structure to better visualize and retrieve 5' and 3' UTR variants generated by alternative initiation and termination of transcription and alternative splicing. Experimentally validated miRNA targets and conserved sequence elements are also annotated. The integration of UTRdb with genomic data has allowed the implementation of an efficient annotation system and a powerful retrieval resource for the selection and extraction of specific UTR subsets. All internet resources implemented for retrieval and functional analysis of 5' and 3' untranslated regions of eukaryotic mRNAs are accessible at http://utrdb.ba.itb.cnr.it/.
The structural and conformational organization of chromosomes is crucial for gene expression regulation in both eukaryotes and prokaryotes. To date, gene expression data generated using either microarrays or RNA sequencing are available for many bacterial genomes. However, differential gene expression is usually investigated with methods that consider each gene independently, and thus do not take into account the physical localization of genes along a bacterial chromosome. Here, we present WoPPER, a web tool integrating gene expression and genomic annotations to identify differentially expressed chromosomal regions in bacteria. RNA-sequencing or microarray-based gene expression data are provided as input, along with gene annotations. The user can select genomic annotations from an internal database including 2780 bacterial strains, or provide custom genomic annotations. The analysis produces as output lists of positionally related genes showing a coordinated trend of differential expression. Graphical representations, including a circular plot of the analyzed chromosome, allow intuitive browsing of the results. The analysis procedure is based on our previously published R package PREDA. The release of this tool is timely and relevant for the scientific community, as WoPPER will fill an existing gap in prokaryotic gene expression data analysis and visualization tools. WoPPER is open to all users and can be reached at the following URL: https://WoPPER.ba.itb.cnr.it.
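The idea of positionally related genes with a coordinated expression trend can be illustrated with a simplified sketch. It orders genes by chromosomal position, smooths their log2 fold changes with a moving average, and reports runs of consecutive genes whose smoothed value stays above a threshold. This is not the PREDA procedure itself: the smoothing scheme, the threshold, the minimum run length and all names are assumptions, and the sketch ignores chromosome circularity and down-regulated regions.

```python
from typing import List, Tuple

Gene = Tuple[str, int, float]  # (gene_id, start_position, log2_fold_change)


def smooth(values: List[float], half_window: int = 1) -> List[float]:
    """Moving average over each gene and its neighbours (window truncated at the ends)."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - half_window)
        hi = min(len(values), i + half_window + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def coordinated_regions(genes: List[Gene], threshold: float = 1.0,
                        min_genes: int = 3, half_window: int = 1) -> List[List[str]]:
    """Return runs of adjacent genes whose smoothed log2 fold change is >= threshold."""
    ordered = sorted(genes, key=lambda g: g[1])  # order by chromosomal position
    smoothed = smooth([g[2] for g in ordered], half_window)
    regions: List[List[str]] = []
    current: List[str] = []
    for (gene_id, _, _), value in zip(ordered, smoothed):
        if value >= threshold:
            current.append(gene_id)
            continue
        if len(current) >= min_genes:
            regions.append(current)
        current = []
    if len(current) >= min_genes:
        regions.append(current)
    return regions


# Example with made-up genes, given in arbitrary order
genes = [
    ("g5", 500, 1.8), ("g1", 100, 0.0), ("g3", 300, 2.0), ("g8", 800, -0.1),
    ("g2", 200, 0.1), ("g4", 400, 2.2), ("g6", 600, 2.1), ("g7", 700, 0.0),
]
print(coordinated_regions(genes))
```

Because the smoothing uses neighbouring genes, an isolated gene with a large fold change is diluted, while a block of adjacent genes changing in the same direction is retained.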