Dataset Information

Detecting and phasing minor single-nucleotide variants from long-read sequencing data.

ABSTRACT: Cellular genetic heterogeneity is common in many biological conditions, including cancer, the microbiome, and co-infection by multiple pathogens. Detecting and phasing minor variants plays an instrumental role in deciphering cellular genetic heterogeneity, but both remain difficult tasks because of technological limitations. Long-read sequencing technologies, including those by Pacific Biosciences and Oxford Nanopore, have recently provided an opportunity to tackle these challenges. However, high error rates make it difficult to take full advantage of these technologies. To fill this gap, we introduce iGDA, an open-source tool that can accurately detect and phase minor single-nucleotide variants (SNVs), with frequencies as low as 0.2%, from raw long-read sequencing data. We also demonstrate that iGDA can accurately reconstruct haplotypes in closely related strains of the same species (divergence ≥0.011%) from long-read metagenomic data.

SUBMITTER: Feng Z

PROVIDER: S-EPMC8144375 | biostudies-literature | 2021 May

REPOSITORIES: biostudies-literature

Publications

Detecting and phasing minor single-nucleotide variants from long-read sequencing data.

Zhixing Feng, Jose C. Clemente, Brandon Wong, Eric E. Schadt

Nature Communications, 24 May 2021, issue 1

PMID: 34031367

Similar Datasets

Detecting PKD1 variants in polycystic kidney disease patients by single-molecule long-read sequencing.

Project description: A genetic diagnosis of autosomal-dominant polycystic kidney disease (ADPKD) is challenging due to allelic heterogeneity, high GC content, and homology of the PKD1 gene with six pseudogenes. Short-read next-generation sequencing approaches, such as whole-genome sequencing and whole-exome sequencing, often fail at reliably characterizing complex regions such as PKD1. However, long-read single-molecule sequencing has been shown to be an alternative strategy that could overcome PKD1 complexities and discriminate between homologous regions of PKD1 and its pseudogenes. In this study, we present the increased power of resolution for complex regions using long-read sequencing to characterize a cohort of 19 patients with ADPKD. Our approach provided high sensitivity in identifying PKD1 pathogenic variants, diagnosing 94.7% of the patients. We show that reliable screening of ADPKD patients in a single test by direct long-read sequencing, without interference from PKD1 homologous sequences (commonly introduced by residual amplification of PKD1 pseudogenes), is now possible. This strategy can be implemented in diagnostics and is highly suitable to sequence and resolve complex genomic regions that are of clinical relevance.

| S-EPMC5488171 | biostudies-literature

Pitfalls of haplotype phasing from amplicon-based long-read sequencing.

Project description: The long-read sequencers from Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) offer the opportunity to phase mutations multiple kilobases apart directly from sequencing reads. In this study, we used long-range PCR with ONT and PacBio sequencing to phase two variants 9 kb apart in the RET gene. We also re-analysed data from a recent paper which had apparently successfully used ONT to phase clinically important haplotypes at the CYP2D6 and HLA loci. From these analyses, we demonstrate that PCR-chimera formation during PCR amplification and reference alignment bias are pitfalls that need to be considered when attempting to phase variants using amplicon-based long-read sequencing technologies. These methodological pitfalls need to be avoided if the opportunities provided by long-read sequencers are to be fully exploited.

| S-EPMC4756330 | biostudies-literature

Variant phasing and haplotypic expression from long-read sequencing in maize.

Project description: Haplotype phasing of maize genetic variants is important for genome interpretation, population genetic analysis and functional analysis of allelic activity. We performed an isoform-level phasing study using two maize inbred lines and their reciprocal crosses, based on single-molecule, full-length cDNA sequencing. To phase and analyze transcripts between hybrids and parents, we developed IsoPhase. Using this tool, we validated the majority of SNPs called against matching short-read data from embryo, endosperm and root tissues, and identified allele-specific, gene-level and isoform-level differential expression between the inbred parental lines and hybrid offspring. After phasing 6907 genes in the reciprocal hybrids, we annotated the SNPs and identified large-effect genes. In addition, we identified parent-of-origin isoforms, distinct novel isoforms in maize parent and hybrid lines, and imprinted genes from different tissues. Finally, we characterized variation in cis- and trans-regulatory effects. Our study provides measures of haplotypic expression that could increase accuracy in studies of allelic expression.

| S-EPMC7028979 | biostudies-literature

Long-read-based single sperm genome sequencing for chromosome-wide haplotype phasing of both SNPs and SVs.

Project description: Although localized haploid phasing can be achieved using long-read genome sequencing without parental data, reliable chromosome-scale phasing remains a great challenge. Given that sperm is a natural haploid cell, single-sperm genome sequencing can provide a chromosome-wide phase signal. Due to the limitation of read length, current short-read-based single-sperm genome sequencing methods can only achieve SNP haplotyping and come with difficulties in detecting and haplotyping structural variations (SVs) in complex genomic regions. To overcome these limitations, we developed a long-read-based single-sperm genome sequencing method and a corresponding data analysis pipeline that can accurately identify crossover events and chromosomal level aneuploidies in single sperm and efficiently detect SVs within individual sperm cells. Importantly, without parental genome information, our method can accurately conduct de novo phasing of heterozygous SVs as well as SNPs from male individuals at the whole chromosome scale. The accuracy for phasing of SVs was as high as 98.59% using 100 single sperm cells, and the accuracy for phasing of SNPs was as high as 99.95%. Additionally, our method reliably enabled deduction of the repeat expansions of haplotype-resolved STRs/VNTRs in single sperm cells. Our method provides a new opportunity for studying haplotype-related genetics in mammals.

| S-EPMC10450174 | biostudies-literature

A Long-Read Sequencing Approach for Direct Haplotype Phasing in Clinical Settings.

Project description: The reconstruction of individual haplotypes can facilitate the interpretation of disease risks; however, high costs and technical challenges still hinder their assessment in clinical settings. Second-generation sequencing is the gold standard for variant discovery but, because it produces short reads covering small genomic regions, allows only indirect haplotyping based on statistical methods. In contrast, third-generation methods such as the nanopore sequencing platform developed by Oxford Nanopore Technologies (ONT) generate long reads that can be used for direct haplotyping, with fewer drawbacks. However, robust standards for variant phasing in ONT-based target resequencing efforts are not yet available. In this study, we present a streamlined proof-of-concept workflow for variant calling and phasing based on ONT data in a clinically relevant 12-kb region of the APOE locus, a hotspot for variants and haplotypes associated with aging-related diseases and longevity. Starting with sequencing data from simple amplicons of the target locus, we demonstrated that ONT data allow for reliable single-nucleotide variant (SNV) calling and phasing from as few as 60 reads, although the recognition of indels is less efficient. Even so, we identified the best combination of ONT read sets (600) and software (BWA/Minimap2 and HapCUT2) that enables full haplotype reconstruction when both SNVs and indels have been identified previously using a highly accurate sequencing platform. In conclusion, we established a rapid and inexpensive workflow for variant phasing based on ONT long reads. This workflow allows multiple samples to be analyzed in parallel and can easily be implemented in routine clinical practice, including diagnostic testing.

| S-EPMC7731377 | biostudies-literature

CFTR haplotype phasing using long-read genome sequencing from ultralow input DNA.

Project description: Purpose: Newborn screening identifies rare diseases that result from the recessive inheritance of pathogenic variants in both copies of a gene. Long-read genome sequencing (LRS) is used for identifying and phasing genomic variants, but further efforts are needed to develop LRS for applications using low-yield DNA samples. Methods: In this study, genomic DNA with high molecular weight was obtained from 2 cystic fibrosis patients, comprising a whole-blood sample (CF1) and a newborn dried blood spot sample (CF2). Library preparation and genome sequencing (30-fold coverage) were performed using 20 ng of DNA input on both the PacBio Revio system and the Illumina NovaSeq short-read sequencer. Single-nucleotide variants, small indels, and structural variants were identified for each data set. Results: Our results indicated that the genotype concordance between long- and short-read genome sequencing data was higher for single-nucleotide variants than for small indels. Both technologies accurately identified known pathogenic variants in the CFTR gene (CF1: p.(Met607_Gln634del), p.(Phe508del); CF2: p.(Phe508del), p.(Ala455Glu)) with complete concordance for the polymorphic poly-TG and consecutive poly-T tracts. Using PacBio read-based haplotype phasing, we successfully determined the allelic phase and identified compound heterozygosity of pathogenic variants at genomic distances of 32.4 kb (CF1) and 10.8 kb (CF2). Conclusion: Haplotype phasing of rare pathogenic variants from minimal DNA input is achieved through LRS. This approach has the potential to eliminate the need for parental testing, thereby shortening the time to diagnosis in genetic disease screening.

| S-EPMC11869909 | biostudies-literature

JAFFAL: detecting fusion genes with long-read transcriptome sequencing.

Project description: In cancer, fusions are important diagnostic markers and targets for therapy. Long-read transcriptome sequencing allows the discovery of fusions with their full-length isoform structure. However, due to higher sequencing error rates, fusion finding algorithms designed for short reads do not work. Here we present JAFFAL, to identify fusions from long-read transcriptome sequencing. We validate JAFFAL using simulations, cell lines, and patient data from Nanopore and PacBio. We apply JAFFAL to single-cell data and find fusions spanning three genes demonstrating transcripts detected from complex rearrangements. JAFFAL is available at https://github.com/Oshlack/JAFFA/wiki .

| S-EPMC8739696 | biostudies-literature

Haplotype phasing in single-cell DNA-sequencing data.

Project description: Motivation: Current technologies for single-cell DNA sequencing require whole-genome amplification (WGA), as a single cell contains too little DNA for direct sequencing. Unfortunately, WGA introduces biases in the resulting sequencing data, including non-uniformity in genome coverage and high rates of allele dropout. These biases complicate many downstream analyses, including the detection of genomic variants. Results: We show that amplification biases have a potential upside: long-range correlations in rates of allele dropout provide a signal for phasing haplotypes at the lengths of amplicons from WGA, lengths which are generally longer than individual sequence reads. We describe a statistical test to measure concurrent allele dropout between single-nucleotide polymorphisms (SNPs) across multiple sequenced single cells. We use results of this test to perform haplotype assembly across a collection of single cells. We demonstrate that the algorithm predicts phasing between pairs of SNPs with higher accuracy than phasing from reads alone. Using whole-genome sequencing data from only seven neural cells, we obtain haplotype blocks that are orders of magnitude longer than with sequence reads alone (median length 10.2 kb versus 312 bp), with error rates <2%. We demonstrate similar advantages on whole-exome data from 16 cells, where we obtain haplotype blocks with median length 9.2 kb (comparable to typical gene lengths), compared with median lengths of 41 bp with sequence reads alone, with error rates <4%. Our algorithm will be useful for haplotyping of rare alleles and studies of allele-specific somatic aberrations. Availability and implementation: Source code is available at https://www.github.com/raphael-group. Supplementary information: Supplementary data are available at Bioinformatics online.

| S-EPMC6022575 | biostudies-literature

Sequencing and phasing cancer mutations in lung cancers using a long-read portable sequencer.

Project description: Here, we employed cDNA amplicon sequencing using a long-read portable sequencer, MinION, to characterize various types of mutations in cancer-related genes, namely, EGFR, KRAS, NRAS and NF1. For homozygous SNVs, the precision and recall rates were 87.5% and 91.3%, respectively. For previously reported hotspot mutations, the precision and recall rates reached 100%. The precise junctions of EML4-ALK, CCDC6-RET and five other gene fusions were also detected. Taking advantage of long-read sequencing, we conducted phasing of EGFR mutations and elucidated the mutational allelic backgrounds of anti-tumor drug-sensitive and resistant mutations, which could provide useful information for selecting therapeutic approaches. In the H1975 cells, 72% of the reads harbored both L858R and T790M mutations, and 22% of the reads harbored neither mutation. To ensure that the clinical requirements can be met in potentially low cancer cell populations, we further conducted a serial dilution analysis of the template for EGFR mutations. Mutant alleles at frequencies of several percent could be detected, depending on the yield and quality of the sequencing data. Finally, we characterized the mutation genotypes in eight clinical samples. This method could be a convenient long-read sequencing-based analytical approach and thus may change the current approaches used for cancer genome sequencing.

| S-EPMC5726485 | biostudies-literature

DNA read count calibration for single-molecule, long-read sequencing.

Project description: There are many applications in which quantitative information about DNA mixtures with different molecular lengths is important. Gene therapy vectors are much longer than can be sequenced individually via short-read NGS. However, vector preparations may contain smaller DNAs that behave differently during sequencing. We have used two library preparations each for Pacific Biosciences (PacBio) and Oxford Nanopore Technologies NGS to determine their suitability for quantitative assessment of DNAs of varying sizes. Equimolar length standards were generated from E. coli genomic DNA. Both PacBio library preparations provided a consistent length dependence, though with a complex pattern. This method is sufficiently sensitive that differences in genomic copy number between DNA from E. coli grown in exponential and stationary phase conditions could be detected. The transposase-based Oxford Nanopore library preparation provided a predictable length dependence, but the random sequence starts caused the loss of original length information. The ligation-based approach retained length information but read frequency was more variable. Modeling of E. coli versus lambda read frequency via cubic spline smoothing showed that the shorter genome could be used as a suitable internal spike-in for DNAs in the 200 bp to 10 kb range, allowing meaningful QC to be carried out with AAV preparations.

| S-EPMC9626564 | biostudies-literature
