Dataset Information

Learning supervised embeddings for large scale sequence comparisons.

ABSTRACT: Similarity-based search of sequence collections is a core task in bioinformatics, one dominated for most of the genomic era by exact and heuristic alignment-based algorithms. However, even efficient heuristics such as BLAST may not scale to the data sets now emerging, motivating a range of alignment-free alternatives exploiting the underlying lexical structure of each sequence. In this paper, we introduce two supervised approaches-SuperVec and SuperVecX-to learn sequence embeddings. These methods extend earlier Representation Learning (RepL) based methods to include class-related information for each sequence during training. Including class information ensures that related sequence fragments have proximal representations in the target space, better reflecting the structure of the domain. We show the quality of the embeddings learned through these methods on (i) sequence retrieval and (ii) classification tasks. We also propose an hierarchical tree-based approach specifically designed for the sequence retrieval problem. The resulting methods, which we term H-SuperVec or H-SuperVecX, according to their respective use of SuperVec or SuperVecX, learn embeddings across a range of feature spaces based on exclusive and exhaustive subsets of the class labels. Experiments show that the proposed methods perform better for retrieval and classification tasks over existing (unsupervised) RepL-based approaches. Further, the new methods are an order of magnitude faster than BLAST for the database retrieval task, supporting hybrid approaches that rapidly filter the collection so that only potentially relevant records remain. Such filtering of the original database allows slower but more accurate methods to be executed quickly over a far smaller dataset. Thus, we may achieve faster query processing and higher precision than before.

SUBMITTER: Kimothi D

PROVIDER: S-EPMC7069636 | biostudies-literature | 2020

REPOSITORIES: biostudies-literature

ACCESS DATA

Publications

Learning supervised embeddings for large scale sequence comparisons.

Kimothi Dhananjay D Biyani Pravesh P Hogan James M JM Soni Akshay A Kelly Wayne W

PloS one 20200313 3

Similarity-based search of sequence collections is a core task in bioinformatics, one dominated for most of the genomic era by exact and heuristic alignment-based algorithms. However, even efficient heuristics such as BLAST may not scale to the data sets now emerging, motivating a range of alignment-free alternatives exploiting the underlying lexical structure of each sequence. In this paper, we introduce two supervised approaches-SuperVec and SuperVecX-to learn sequence embeddings. These method ...[more]

PMID: 32168338

Similar Datasets

Project description:MotivationMetagenomics characterizes the taxonomic diversity of microbial communities by sequencing DNA directly from an environmental sample. One of the main challenges in metagenomics data analysis is the binning step, where each sequenced read is assigned to a taxonomic clade. Because of the large volume of metagenomics datasets, binning methods need fast and accurate algorithms that can operate with reasonable computing requirements. While standard alignment-based methods provide state-of-the-art performance, compositional approaches that assign a taxonomic class to a DNA read based on the k-mers it contains have the potential to provide faster solutions.ResultsWe propose a new rank-flexible machine learning-based compositional approach for taxonomic assignment of metagenomics reads and show that it benefits from increasing the number of fragments sampled from reference genome to tune its parameters, up to a coverage of about 10, and from increasing the k-mer size to about 12. Tuning the method involves training machine learning models on about 10(8) samples in 10(7) dimensions, which is out of reach of standard softwares but can be done efficiently with modern implementations for large-scale machine learning. The resulting method is competitive in terms of accuracy with well-established alignment and composition-based tools for problems involving a small to moderate number of candidate species and for reasonable amounts of sequencing errors. We show, however, that machine learning-based compositional approaches are still limited in their ability to deal with problems involving a greater number of species and more sensitive to sequencing errors. We finally show that the new method outperforms the state-of-the-art in its ability to classify reads from species of lineage absent from the reference database and confirm that compositional approaches achieve faster prediction times, with a gain of 2-17 times with respect to the BWA-MEM short read mapper, depending on the number of candidate species and the level of sequencing noise.Availability and implementationData and codes are available at http://cbio.ensmp.fr/largescalemetagenomicsContactpierre.mahe@biomerieux.comSupplementary informationSupplementary data are available at Bioinformatics online.

Project description:BackgroundPretraining labeled datasets, like ImageNet, have become a technical standard in advanced medical image analysis. However, the emergence of self-supervised learning (SSL), which leverages unlabeled data to learn robust features, presents an opportunity to bypass the intensive labeling process. In this study, we explored if SSL for pretraining on non-medical images can be applied to chest radiographs and how it compares to supervised pretraining on non-medical images and on medical images.MethodsWe utilized a vision transformer and initialized its weights based on the following: (i) SSL pretraining on non-medical images (DINOv2), (ii) supervised learning (SL) pretraining on non-medical images (ImageNet dataset), and (iii) SL pretraining on chest radiographs from the MIMIC-CXR database, the largest labeled public dataset of chest radiographs to date. We tested our approach on over 800,000 chest radiographs from 6 large global datasets, diagnosing more than 20 different imaging findings. Performance was quantified using the area under the receiver operating characteristic curve and evaluated for statistical significance using bootstrapping.ResultsSSL pretraining on non-medical images not only outperformed ImageNet-based pretraining (p < 0.001 for all datasets) but, in certain cases, also exceeded SL on the MIMIC-CXR dataset. Our findings suggest that selecting the right pretraining strategy, especially with SSL, can be pivotal for improving diagnostic accuracy of artificial intelligence in medical imaging.ConclusionsBy demonstrating the promise of SSL in chest radiograph analysis, we underline a transformative shift towards more efficient and accurate AI models in medical imaging.Relevance statementSelf-supervised learning highlights a paradigm shift towards the enhancement of AI-driven accuracy and efficiency in medical imaging. Given its promise, the broader application of self-supervised learning in medical imaging calls for deeper exploration, particularly in contexts where comprehensive annotated datasets are limited.

Dataset Information

Learning supervised embeddings for large scale sequence comparisons.

Publications

Learning supervised embeddings for large scale sequence comparisons.

Similar Datasets

OmicsDI is part of the ELIXIR infrastructure

Tweets