
Dataset Information


Large language models improve annotation of viral proteins.


ABSTRACT: Viral sequences are poorly annotated in environmental samples, a major roadblock to understanding how viruses influence microbial community structure. Current annotation approaches rely on alignment-based sequence homology methods, which are limited by available viral sequences and sequence divergence in viral proteins. Here, we show that protein language model representations capture viral protein function beyond the limits of remote sequence homology by targeting two axes of viral sequence annotation: systematic labeling of protein families and function identification for biologic discovery. Protein language model representations capture protein functional properties specific to viruses and expand the annotated fraction of ocean virome viral protein sequences by 37%. Among unannotated viral protein families, we identify a novel DNA editing protein family that defines a new mobile element in marine picocyanobacteria. Protein language models thus significantly enhance remote homology detection of viral proteins and can be utilized to enable new biological discovery across diverse functional categories.

SUBMITTER: Flamholz ZN 

PROVIDER: S-EPMC10187409 | biostudies-literature | 2023 May

REPOSITORIES: biostudies-literature


Publications

Large language models improve annotation of viral proteins.

Flamholz Zachary N, Biller Steve J, Kelly Libusha

Research Square, 2023-05-02



Similar Datasets

| S-EPMC11311208 | biostudies-literature
| S-EPMC11326577 | biostudies-literature
| S-EPMC10689442 | biostudies-literature
| S-EPMC10246080 | biostudies-literature
| S-EPMC10589311 | biostudies-literature
| S-EPMC10834163 | biostudies-literature
| S-EPMC10449915 | biostudies-literature
| S-EPMC10846950 | biostudies-literature
| S-EPMC10909174 | biostudies-literature
| S-EPMC11335796 | biostudies-literature