
Dataset Information


Exploiting mid-range DNA patterns for sequence classification: binary abstraction Markov models.


ABSTRACT: Messenger RNA sequences possess specific nucleotide patterns that distinguish them from non-coding genomic sequences. In this study, we explore the use of modified Markov models to analyze sequences up to 44 bp, far beyond the 8-bp limit of conventional Markov models, for exon/intron discrimination. To analyze nucleotide sequences of this length, their information content is first reduced by converting them into shorter binary patterns using numerous abstraction schemes. After the genomic sequences are converted to binary strings, homogeneous Markov models trained on the binary sequences are used to discriminate between exons and introns. We term this approach the Binary Abstraction Markov Model (BAMM). High-quality abstraction schemes for exon/intron discrimination are selected using optimization algorithms on supercomputers. The best Markov model (MM) classifiers are then combined using support vector machines into a single classifier. With this approach, over 95% classification accuracy is achieved without taking reading frame into account. With further development, the BAMM approach can be applied to sequences lacking the genetic code, such as ncRNAs and 5'-untranslated regions.
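
The pipeline described in the abstract (abstract nucleotides to bits, train homogeneous Markov models on the binary strings, compare exon and intron models) can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration only: the purine/pyrimidine mapping stands in for the optimized abstraction schemes, which the abstract does not specify; the model order, add-one smoothing, toy sequences, and all function names are hypothetical; and a plain log-likelihood ratio replaces the support vector machine that the authors use to combine many classifiers.

```python
import math
from collections import defaultdict

# Assumed abstraction scheme (not from the source): purine -> 1, pyrimidine -> 0.
PURINE_SCHEME = {"A": "1", "G": "1", "C": "0", "T": "0"}


def abstract(seq, scheme=PURINE_SCHEME):
    """Convert a nucleotide string to a binary string, one bit per base."""
    return "".join(scheme[base] for base in seq.upper())


def train_markov(binary_seqs, order):
    """Estimate P(next bit | previous `order` bits) with add-one smoothing."""
    counts = defaultdict(lambda: [1, 1])  # pseudocount of 1 for bit 0 and bit 1
    for s in binary_seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][int(s[i])] += 1
    return {
        ctx: (c[0] / (c[0] + c[1]), c[1] / (c[0] + c[1]))
        for ctx, c in counts.items()
    }


def log_likelihood(binary_seq, model, order):
    """Sum log-probabilities of each bit given its preceding context."""
    total = 0.0
    for i in range(order, len(binary_seq)):
        # Contexts never seen in training fall back to a uniform guess.
        probs = model.get(binary_seq[i - order:i], (0.5, 0.5))
        total += math.log(probs[int(binary_seq[i])])
    return total


def classify(seq, exon_model, intron_model, order):
    """Label a sequence by the sign of the exon/intron log-likelihood ratio."""
    b = abstract(seq)
    score = log_likelihood(b, exon_model, order) - log_likelihood(b, intron_model, order)
    return "exon" if score > 0 else "intron"


# Toy training data, invented purely to exercise the code.
ORDER = 2
toy_exons = ["AGGAGAAGGAGGAAGA", "GAAGGAGAGGAAGGAG"]
toy_introns = ["CTTCTCCTTTCCTCTT", "TCCTTCTCTTCCTTCT"]
exon_model = train_markov([abstract(s) for s in toy_exons], ORDER)
intron_model = train_markov([abstract(s) for s in toy_introns], ORDER)

print(classify("GGAAGAGGAA", exon_model, intron_model, ORDER))
print(classify("TCTTCCTCTT", exon_model, intron_model, ORDER))
```

Because each base collapses to a single bit, a context of a given length needs far fewer parameters than a four-letter model over the same number of bases, which is the kind of reduction in information content that the abstract credits with making longer patterns tractable.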

SUBMITTER: Shepard SS 

PROVIDER: S-EPMC3367190 | biostudies-literature | 2012 Jun

REPOSITORIES: biostudies-literature


Publications

Exploiting mid-range DNA patterns for sequence classification: binary abstraction Markov models.

Samuel S Shepard, Andrew McSweeny, Gursel Serpen, Alexei Fedorov

Nucleic Acids Research, 2012-02-16, issue 11



Similar Datasets

| S-EPMC5963474 | biostudies-other
| S-EPMC5731601 | biostudies-literature
| S-EPMC2762791 | biostudies-literature
| S-EPMC3356369 | biostudies-literature
| S-EPMC2779198 | biostudies-literature
| S-EPMC4143468 | biostudies-literature
| S-EPMC5404901 | biostudies-other
| S-EPMC4067306 | biostudies-literature
| S-EPMC3157587 | biostudies-literature
| S-EPMC5148114 | biostudies-literature