
Dataset Information


Optimizing high performance computing workflow for protein functional annotation.


ABSTRACT: Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas a high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optimized automated workflow to enable large-scale protein annotation. The workflow uses high performance computing architectures and a low complexity classification algorithm to assign proteins to existing clusters of orthologous groups of proteins. Based on the Position-Specific Iterative Basic Local Alignment Search Tool (PSI-BLAST), the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow utilizes highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment (XSEDE) supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data.
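
The abstract's requirement of at least 80% specificity and sensitivity can be illustrated with a minimal sketch. The function names, the example counts, and the way the threshold is checked are all assumptions made for illustration; the record does not describe how the workflow actually evaluates its classifications.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true group members that were assigned to the group."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of non-members that were correctly left out of the group."""
    return tn / (tn + fp)


def meets_threshold(tp: int, fp: int, tn: int, fn: int,
                    threshold: float = 0.80) -> bool:
    """Check both metrics against a threshold (0.80 mirrors the abstract)."""
    return (sensitivity(tp, fn) >= threshold
            and specificity(tn, fp) >= threshold)


# Hypothetical confusion-matrix counts for one orthologous group.
print(meets_threshold(tp=85, fp=10, tn=90, fn=15))
```

With these made-up counts, sensitivity is 85/100 and specificity is 90/100, so both clear the 0.80 bar.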

SUBMITTER: Stanberry L 

PROVIDER: S-EPMC4194055 | biostudies-other | 2014 Sep

REPOSITORIES: biostudies-other


Publications

Optimizing high performance computing workflow for protein functional annotation.

Larissa Stanberry, Bhanu Rekepalli, Yuan Liu, Paul Giblock, Roger Higdon, Elizabeth Montague, William Broomall, Natali Kolker, Eugene Kolker

Concurrency and Computation: Practice & Experience, 2014-09-01, issue 13



Similar Datasets

| S-EPMC3940597 | biostudies-literature
| S-EPMC9319598 | biostudies-literature
| S-EPMC4076281 | biostudies-literature
| S-EPMC6299036 | biostudies-literature
| S-EPMC4420499 | biostudies-literature
| S-EPMC7194241 | biostudies-literature
| S-EPMC4895710 | biostudies-literature
| S-EPMC7579964 | biostudies-literature
| S-EPMC3380734 | biostudies-literature
| S-EPMC4385699 | biostudies-other