Dataset Information

Geoseq: a tool for dissecting deep-sequencing datasets.

ABSTRACT: BACKGROUND: Datasets generated on deep-sequencing platforms have been deposited in various public repositories such as the Gene Expression Omnibus (GEO), Sequence Read Archive (SRA) hosted by the NCBI, or the DNA Data Bank of Japan (DDBJ). Despite being rich data sources, they have not been used much due to the difficulty in locating and analyzing datasets of interest. RESULTS: Geoseq (http://geoseq.mssm.edu) provides a new method of analyzing short reads from deep-sequencing experiments. Instead of mapping the reads to reference genomes or sequences, Geoseq maps a reference sequence against the sequencing data. It is web-based and holds pre-computed data from public libraries. The analysis reduces the input sequence to tiles and measures the coverage of each tile in a sequence library through the use of suffix arrays. The user can upload custom target sequences or use gene/miRNA names for the search and get back results as plots and spreadsheet files. Geoseq organizes the public sequencing data using a controlled vocabulary, allowing identification of relevant libraries by organism, tissue and type of experiment. CONCLUSIONS: Geoseq simplifies the analysis of small sets of sequences against deep-sequencing datasets, as well as the identification of public datasets of interest. We applied Geoseq to (a) identify differential isoform expression in mRNA-seq datasets, (b) identify miRNAs (microRNAs) in libraries, along with their mature and star sequences, and (c) identify potentially mis-annotated miRNAs. The ease of using Geoseq for these analyses suggests its utility and uniqueness as an analysis tool.
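The tiling step described in the abstract can be illustrated with a minimal sketch. It is not Geoseq's actual implementation: the function names (`build_suffix_array`, `count_occurrences`, `tile_coverage`), the `$` read separator, the tile size, and the naive suffix-array construction are all assumptions made for illustration, and reverse-complement matches are ignored.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of `text`, sorted lexicographically.

    Naive construction for illustration only; a real library of reads
    would need a linear-time or disk-based algorithm.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text, sa, pattern):
    """Count exact occurrences of `pattern` in `text` using the suffix array `sa`."""
    m = len(pattern)

    # First suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo - start


def tile_coverage(reference, reads, k):
    """Split `reference` into overlapping tiles of length `k` (step 1) and
    return, for each tile, how many times it occurs in the read library."""
    # '$' (hypothetical separator) keeps a tile from matching across two reads.
    text = "$".join(reads) + "$"
    sa = build_suffix_array(text)
    tiles = [reference[i:i + k] for i in range(len(reference) - k + 1)]
    return [count_occurrences(text, sa, tile) for tile in tiles]


if __name__ == "__main__":
    library = ["ACGTAC", "GTACGT", "TTTT"]  # toy read library
    print(tile_coverage("ACGTACGG", library, k=4))
```

Because the suffix array is built once per library and each tile lookup is a binary search, the same pre-computed index can serve many query sequences, which is presumably why pre-computing data for public libraries is practical.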

SUBMITTER: Gurtowski J

PROVIDER: S-EPMC2972303 | biostudies-literature | 2010

REPOSITORIES: biostudies-literature

Publications

Geoseq: a tool for dissecting deep-sequencing datasets.

James Gurtowski, Anthony Cancio, Hardik Shah, Chaya Levovitz, Ajish George, Robert Homann, Ravi Sachidanandam

BMC Bioinformatics, 2010-10-12

PMID: 20939882

Similar Datasets

Project description: Rice represents one of the most important foods all over the world. In Europe, Italy is the leading rice producer, and Italian production is driven by tradition and quality. All main rice grain quality traits, like cooking properties, texture, gelatinization temperature, chalkiness and yield, are related to the content and composition of starch and seed-storage proteins in the endosperm and to grain shape. In addition, a number of nutraceutical compounds and allergens are known to have a significant effect on grain quality determination. To investigate the genetic bases underlying the qualitative differences that characterize traditional Italian rice cultivars, a comparative RNA-Seq-based transcriptomic analysis of developing caryopsis was conducted at 14 days after flowering on six popular Italian varieties (Carnaroli, Arborio, Balilla, Vialone Nano, Gigante Vercelli and Volano) that differ phenotypically in qualitative grain-related traits. Co-regulation analyses of differentially expressed genes showing the same expression patterns in the six genotypes highlighted clusters of loci up- or down-regulated in specific varieties with respect to the others. Among them, we detected loci involved in cell wall biosynthesis, protein metabolism and redox homeostasis, classes of genes that affect chalkiness determination. Moreover, loci encoding seed-storage proteins or allergens, or involved in the biosynthesis of specific nutraceutical compounds, were also present and specifically regulated in the different clusters. A wider investigation of all the DEGs detected in pair-wise comparisons revealed transcriptional variation among the six genotypes for quality-related loci involved in starch biosynthesis (e.g. GBSSI, starch synthases and AGPase), genes encoding transcription factors, additional seed-storage proteins and allergens, loci belonging to additional nutraceutical compound biosynthetic pathways, and loci affecting grain size. Putative functional SNPs associated with amylose content in starch, gelatinization temperature and grain size were also identified. The present work represents a more extended phenotypic characterization of a set of rice accessions that present wider genetic variability than currently described in the literature. The results provide the first transcriptional picture for several of the grain quality differences observed among the Italian rice varieties analyzed and reveal that each variety is characterized by the over-expression of a distinctive set of loci affecting grain appearance and quality. A list of candidates and SNPs affecting specific grain properties has been identified, offering a starting point for further work aimed at characterizing genes and molecular markers for breeding programs.

Project description: BACKGROUND: A strong focus of the post-genomic era is mining of the non-coding regulatory genome in order to unravel the function of regulatory elements that coordinate gene expression (Nat 489:57-74, 2012; Nat 507:462-70, 2014; Nat 507:455-61, 2014; Nat 518:317-30, 2015). Whole-genome approaches based on next-generation sequencing (NGS) have provided insight into the genomic location of regulatory elements throughout different cell types, organs and organisms. These technologies are now widespread and commonly used in laboratories from various fields of research. This highlights the need for fast and user-friendly software tools dedicated to extracting cis-regulatory information contained in these regulatory regions; for instance transcription factor binding site (TFBS) composition. Ideally, such tools should not require prior programming knowledge to ensure they are accessible for all users. RESULTS: We present TrawlerWeb, a web-based version of the Trawler_standalone tool (Nat Methods 4:563-5, 2007; Nat Protoc 5:323-34, 2010), to allow for the identification of enriched motifs in DNA sequences obtained from next-generation sequencing experiments in order to predict their TFBS composition. TrawlerWeb is designed for online queries with standard options common to web-based motif discovery tools. In addition, TrawlerWeb provides three unique new features: 1) TrawlerWeb allows the input of BED files directly generated from NGS experiments, 2) it automatically generates an input-matched biologically relevant background, and 3) it displays resulting conservation scores for each instance of the motif found in the input sequences, which assists the researcher in prioritising the motifs to validate experimentally. Finally, to date, this web-based version of Trawler_standalone remains the fastest online de novo motif discovery tool compared to other popular web-based software, while generating predictions with high accuracy. CONCLUSIONS: TrawlerWeb provides users with a fast, simple and easy-to-use web interface for de novo motif discovery. This will assist in rapidly analysing NGS datasets that are now being routinely generated. TrawlerWeb is freely available and accessible at: http://trawler.erc.monash.edu.au .

Project description: Nanopore sequencing is regarded as one of the most promising third-generation sequencing (TGS) technologies. Since 2014, Oxford Nanopore Technologies (ONT) has developed a series of devices based on nanopore sequencing to produce very long reads, with an expected impact on genomics. However, nanopore sequencing reads are susceptible to a fairly high error rate owing to the difficulty in identifying the DNA bases from the complex electrical signals. Although several basecalling tools have been developed for nanopore sequencing over the past years, it is still challenging to correct the sequences after applying the basecalling procedure. In this study, we developed an open-source DNA basecalling reviser, NanoReviser, based on a deep learning algorithm to correct the basecalling errors introduced by current basecallers provided by default. In our module, we re-segmented the raw electrical signals based on the basecalled sequences provided by the default basecallers. By employing convolutional neural networks (CNNs) and bidirectional long short-term memory (Bi-LSTM) networks, we took advantage of the information from the raw electrical signals and the basecalled sequences from the basecallers. Our results showed that NanoReviser, as a post-basecalling reviser, significantly improved the basecalling quality. After being trained on standard ONT sequencing reads from public E. coli and human NA12878 datasets, NanoReviser reduced the sequencing error rate by over 5% for both the E. coli dataset and the human dataset. The performance of NanoReviser was found to be better than that of all current basecalling tools. Furthermore, we analyzed the modified bases of the E. coli dataset and added the methylation information to train our module. With the methylation annotation, NanoReviser reduced the error rate by 7% for the E. coli dataset and specifically reduced the error rate by over 10% for the regions of the sequence rich in methylated bases. To the best of our knowledge, NanoReviser is the first post-processing tool after basecalling to accurately correct the nanopore sequences without the time-consuming procedure of building the consensus sequence. The NanoReviser package is freely available at https://github.com/pkubioinformatics/NanoReviser.

Project description: We conducted an unbiased metagenomics survey using plasma from patients with chronic hepatitis B, chronic hepatitis C, autoimmune hepatitis (AIH), non-alcoholic steatohepatitis (NASH), and patients without liver disease (control). RNA and DNA libraries were sequenced from plasma filtrates enriched in viral particles to catalog virus populations. Hepatitis viruses were readily detected at high coverage in patients with chronic viral hepatitis B and C, but only a limited number of sequences resembling other viruses were found. The exception was a library from a patient diagnosed with hepatitis C virus (HCV) infection that contained multiple sequences matching GB virus C (GBV-C). Abundant GBV-C reads were also found in plasma from patients with AIH, whereas Torque teno virus (TTV) was found at high frequency in samples from patients with AIH and NASH. After taxonomic classification of sequences by BLASTn, a substantial fraction in each library, ranging from 35% to 76%, remained unclassified. These unknown sequences were assembled into scaffolds along with virus, phage and endogenous retrovirus sequences and then analyzed by BLASTx against the non-redundant protein database. Nearly the full genome of a heretofore-unknown circovirus was assembled, along with many scaffolds that encoded proteins with similarity to plant, insect and mammalian viruses. The presence of this novel circovirus was confirmed by PCR. BLASTx also identified many polypeptides resembling nucleo-cytoplasmic large DNA virus (NCLDV) proteins. We re-evaluated these alignments with a profile hidden Markov method, HHblits, and observed inconsistencies in the target proteins reported by the different algorithms. This suggests that sequence alignments are insufficient to identify NCLDV proteins, especially when these alignments are only to small portions of the target protein. Nevertheless, we have now established a reliable protocol for the identification of viruses in plasma that can also be adapted to other patient samples such as urine, bile, saliva and other body fluids.
