
Dataset Information


SEWAL: an open-source platform for next-generation sequence analysis and visualization.


ABSTRACT: Next-generation DNA sequencing platforms provide exciting new possibilities for in vitro genetic analysis of functional nucleic acids. However, the size of the resulting data sets presents computational and analytical challenges. We present an open-source software package that employs a locality-sensitive hashing algorithm to enumerate all unique sequences in an entire Illumina sequencing run (∼10^8 sequences). The algorithm results in quasilinear time processing of entire Illumina lanes (∼10^7 sequences) on a desktop computer in minutes. To facilitate visual analysis of sequencing data, the software produces three-dimensional scatter plots similar in concept to Sewall Wright and John Maynard Smith's adaptive or fitness landscape. The software also contains functions that are particularly useful for doped selections such as mutation frequency analysis, information content calculation, multivariate statistical functions (including principal component analysis), sequence distance metrics, sequence searches and sequence comparisons across multiple Illumina data sets. Source code, executable files and links to sample data sets are available at http://www.sourceforge.net/projects/sewal.
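
To make the operations named in the abstract concrete, here is a minimal Python sketch of three of them: enumerating unique sequences, a sequence distance metric, and per-position information content. It is not SEWAL's code. It uses an ordinary exact hash table (`collections.Counter`) rather than the locality-sensitive hashing the abstract describes, assumes equal-length reads over the alphabet ACGT, and all function names and the sample reads are hypothetical.

```python
import math
from collections import Counter


def count_unique(reads):
    """Count each distinct read with an exact hash table (assumption: not LSH)."""
    return Counter(reads)


def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def information_content(reads, alphabet="ACGT"):
    """Per-position information content in bits.

    Computed as log2(alphabet size) minus the Shannon entropy of the
    observed base frequencies in each column.
    """
    max_bits = math.log2(len(alphabet))
    result = []
    for i in range(len(reads[0])):
        column = Counter(r[i] for r in reads)
        total = sum(column.values())
        entropy = -sum((n / total) * math.log2(n / total) for n in column.values())
        result.append(max_bits - entropy)
    return result


# Hypothetical toy input standing in for a lane of reads.
reads = ["ACGT", "ACGT", "ACGA", "TCGT"]
counts = count_unique(reads)
ic = information_content(reads)
```

Counting with a hash table takes time roughly proportional to the number of reads, which is why hashing-based enumeration scales to whole lanes. A fully conserved column (position 1 in the toy reads) reaches the 2-bit maximum, while variable columns score lower.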

SUBMITTER: Pitt JN 

PROVIDER: S-EPMC3001052 | biostudies-literature | 2010 Dec

REPOSITORIES: biostudies-literature


Publications

SEWAL: an open-source platform for next-generation sequence analysis and visualization.

Jason N. Pitt, Indika Rajapakse, Adrian R. Ferré-D'Amaré

Nucleic Acids Research, 2010-08-06, issue 22



Similar Datasets

| S-EPMC6223365 | biostudies-literature
2018-02-15 | ST001074 | MetabolomicsWorkbench
| S-EPMC6516198 | biostudies-literature
| S-EPMC3855844 | biostudies-literature
| S-EPMC7063475 | biostudies-literature
2019-08-20 | GSE135950 | GEO
| PRJNA560769 | ENA
| S-EPMC6247818 | biostudies-literature
| S-EPMC2784303 | biostudies-literature
| S-EPMC7815964 | biostudies-literature