nc-aReNA: an integrated bioinformatics platform for non-coding RNA-seq data classification and annotation

Abstract

High-throughput (HT) technologies, such as microarrays and especially Next-Generation Sequencing (NGS), have provided tremendous potential for profiling protein-coding and non-protein-coding RNAs (ncRNAs). Recent reports of the ENCODE project underline that while 80% of the human genome is transcribed, only 2% is protein coding, suggesting that the vast majority of the genome is transcribed as non-protein-coding RNA.

We present the development of a web-based bioinformatics platform, nc-aReNA, for the mapping, classification and annotation of human and mouse ncRNAs from HT-NGS data. The platform is based on a data-warehouse approach and a workflow environment that includes data quality control, genome and nc-RNAome sequence alignment, differential expression profiling analysis and statistics of classified data.

Methods

The nc-aReNA architecture is based on a modular analysis pipeline, flanked by a data-warehouse, for the classification and annotation of small RNA-seq data. The pipeline takes as input the sequenced reads in FASTQ format. After the initial steps of adaptor removal and quality check, the input reads are mapped to an in-house non-redundant ncRNA reference database (http://ncRNAdb.ba.itb.cnr.it), which collects and integrates ncRNA gene lists from MGI (Mouse Genome Informatics) and HGNC (HUGO Gene Nomenclature Committee) with sequences and biotype annotations from VEGA (Vertebrate Genome Annotation), Ensembl, RefSeq, Rfam (for tRNA sequences) and miRBase (for miRNAs). NGS reads mapped in this step are classified using the Sequence Ontology (SO) (Eilbeck K. et al., 2005). Unmapped reads are aligned to the reference genome and tagged to the corresponding genomic locus.

Integrated statistics modules calculate RPM (Reads Per Million), fold changes and False Discovery Rate (FDR)-corrected p-values, and perform differential expression analysis of all (or user-chosen) ncRNA classes by comparing two or more experimental conditions or time-course data.

An additional module, called "miRNA identification", analyses all unmapped miRNA-like reads by means of the miRDeep2 software.

All analysis results and annotations are stored in a data-warehouse implemented with Infobright (http://www.infobright.org). A user-friendly web-based Graphical User Interface (GUI), developed on the Java platform, guides the user through the submission process and displays results in tables and graphs.

Results

The main features of nc-aReNA are:

- identification and classification of reads in known functional ncRNA categories in SO;
- identification and filtering of reads mapping to ribosomal RNAs and mtDNA transcripts;
- RPM calculation for each known ncRNA;
- the export of user-selected classes of ncRNA for further specific investigation;
- quantification of ncRNA expression and differential expression analysis for all identified ncRNA classes;
- graphical visualization of sample expression profiles;
- additional annot
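
The RPM, fold-change and FDR quantities mentioned in the Methods can be illustrated with a minimal Python sketch. This is not the nc-aReNA implementation: the abstract does not state which statistical test or FDR procedure the platform uses, so the Benjamini-Hochberg adjustment, the pseudocount in the fold change, and all function names and example values below are assumptions made for illustration only.

```python
import math


def reads_per_million(counts):
    """Scale the raw read counts of one sample so that they sum to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no mapped reads")
    return {name: count * 1_000_000 / total for name, count in counts.items()}


def log2_fold_change(rpm_a, rpm_b, pseudocount=1.0):
    """log2(B / A) per ncRNA.

    The pseudocount is an assumption of this sketch; it avoids division
    by zero when an ncRNA is absent from one condition.
    """
    names = sorted(set(rpm_a) | set(rpm_b))
    return {
        name: math.log2(
            (rpm_b.get(name, 0.0) + pseudocount)
            / (rpm_a.get(name, 0.0) + pseudocount)
        )
        for name in names
    }


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    BH is one common FDR procedure; whether nc-aReNA uses it is not stated.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical read counts for two conditions (illustrative values only).
    control = {"mir-21": 1200, "snoRNA-U3": 300, "tRNA-Gly": 500}
    treated = {"mir-21": 2600, "snoRNA-U3": 280, "tRNA-Gly": 120}
    changes = log2_fold_change(reads_per_million(control), reads_per_million(treated))
    for name, value in changes.items():
        print(f"{name}\t{value:+.2f}")
    # Hypothetical p-values from some upstream test, adjusted for multiple testing.
    print(benjamini_hochberg([0.01, 0.04, 0.03]))
```

Normalizing to RPM before computing fold changes makes samples with different sequencing depths comparable; the p-values passed to the adjustment step would come from whichever differential-expression test is applied upstream.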


All authors

  • De Caro G.; Tulipano A.; Consiglio A.; D'Elia D.; Grillo G.; Marinaro M.; Liuni S.; Licciulli F.; Gisel A.

Year of publication

2014
