Dataset Information

SPARK-MSNA: Efficient algorithm on Apache Spark for aligning multiple similar DNA/RNA sequences with supervised learning.


ABSTRACT: Multiple sequence alignment (MSA) is an integral part of molecular biology, but handling a massive number of large sequences is still a bottleneck for most state-of-the-art software tools. Knowledge-driven algorithms that exploit features of the input sequences, such as the high similarity typical of DNA sequences, can improve the efficiency of DNA MSA and so assist in phylogenetic tree construction, comparative genomics, etc. This article showcases the benefit of utilizing similarity features while performing the alignment. The algorithm uses a suffix tree to identify common substrings and a modified Needleman-Wunsch algorithm for pairwise alignments. To improve the efficiency of pairwise alignments, a knowledge base is created and supervised learning with a nearest-neighbor algorithm is used to guide the alignment. The algorithm achieved linear complexity, O(m), compared to O(m^2). Compared with state-of-the-art algorithms (e.g., HAlign II), SPARK-MSNA provided a 50% improvement in memory utilization when processing human mitochondrial genomes (mt. genomes, 100x, 1.1 GB), with better alignment accuracy in terms of average SP score and comparable execution time. The algorithm is implemented on the big data framework Apache Spark to improve scalability. The source code and test data are available at: https://sourceforge.net/projects/spark-msna/ .
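For orientation, the quadratic baseline that the abstract contrasts with its linear-time result can be sketched as the textbook Needleman-Wunsch global alignment. This is a minimal illustration only, not the modified variant used in SPARK-MSNA: the function name `needleman_wunsch` and the scoring values (match = 1, mismatch = -1, gap = -2) are assumptions chosen for the example, and the suffix-tree and nearest-neighbor steps are not shown.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Textbook global alignment; returns (score, aligned_a, aligned_b).

    Scoring values are illustrative assumptions, not those of SPARK-MSNA.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # First column and row: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Fill the table: every cell is visited once, hence O(len(a) * len(b)).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

Because the table has one cell per pair of positions, time and memory grow with the product of the two sequence lengths. That cost is what makes restricting full dynamic programming to the regions between shared substrings attractive when the inputs are highly similar.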

SUBMITTER: Vineetha V 

PROVIDER: S-EPMC6488671 | biostudies-other | 2019 Apr

REPOSITORIES: biostudies-other

Publications

SPARK-MSNA: Efficient algorithm on Apache Spark for aligning multiple similar DNA/RNA sequences with supervised learning.

Vineetha V, Biji CL, Nair AS

Scientific Reports, 29 Apr 2019, issue 1


Similar Datasets

| S-EPMC6113509 | biostudies-literature
| S-EPMC8022636 | biostudies-literature
| S-EPMC7537910 | biostudies-literature
| S-EPMC8756192 | biostudies-literature
| S-EPMC6334396 | biostudies-literature
| S-EPMC7199472 | biostudies-literature
| S-EPMC383317 | biostudies-literature
| S-EPMC6805285 | biostudies-literature
| S-EPMC4820126 | biostudies-literature
| S-EPMC1904114 | biostudies-literature