Dataset Information

Reducing Sanger confirmation testing through false positive prediction algorithms.

ABSTRACT:

Purpose

Clinical genome sequencing (cGS) followed by orthogonal confirmatory testing is standard practice. While orthogonal testing significantly improves specificity, it also results in increased turnaround time and cost of testing. The purpose of this study is to evaluate machine learning models trained to identify false positive variants in cGS data to reduce the need for orthogonal testing.

Methods

We sequenced five reference human genome samples characterized by the Genome in a Bottle Consortium (GIAB) and compared the results with an established set of variants for each genome referred to as a truth set. We then trained machine learning models to identify variants that were labeled as false positives.

Results

After training, the models identified 99.5% of the false positive heterozygous single-nucleotide variants (SNVs) and heterozygous insertions/deletions variants (indels) while reducing confirmatory testing of nonactionable, nonprimary SNVs by 85% and indels by 75%. Employing the algorithm in clinical practice reduced overall orthogonal testing using dideoxynucleotide (Sanger) sequencing by 71%.

Conclusion

Our results indicate that a low false positive call rate can be maintained while significantly reducing the need for confirmatory testing. The framework that generated our models and results is publicly available at https://github.com/HudsonAlpha/STEVE .

SUBMITTER: Holt JM

PROVIDER: S-EPMC8257489 | biostudies-literature |

REPOSITORIES: biostudies-literature

ACCESS DATA

Similar Datasets

Project description:Many proteins work together with others in groups called complexes in order to achieve a specific function. Discovering protein complexes is important for understanding biological processes and predict protein functions in living organisms. Large-scale and throughput techniques have made possible to compile protein-protein interaction networks (PPI networks), which have been used in several computational approaches for detecting protein complexes. Those predictions might guide future biologic experimental research. Some approaches are topology-based, where highly connected proteins are predicted to be complexes; some propose different clustering algorithms using partitioning, overlaps among clusters for networks modeled with unweighted or weighted graphs; and others use density of clusters and information based on protein functionality. However, some schemes still require much processing time or the quality of their results can be improved. Furthermore, most of the results obtained with computational tools are not accompanied by an analysis of false positives. We propose an effective and efficient mining algorithm for discovering highly connected subgraphs, which is our base for defining protein complexes. Our representation is based on transforming the PPI network into a directed acyclic graph that reduces the number of represented edges and the search space for discovering subgraphs. Our approach considers weighted and unweighted PPI networks. We compare our best alternative using PPI networks from Saccharomyces cerevisiae (yeast) and Homo sapiens (human) with state-of-the-art approaches in terms of clustering, biological metrics and execution times, as well as three gold standards for yeast and two for human. Furthermore, we analyze false positive predicted complexes searching the PDBe (Protein Data Bank in Europe) database in order to identify matching protein complexes that have been purified and structurally characterized. Our analysis shows that more than 50 yeast protein complexes and more than 300 human protein complexes found to be false positives according to our prediction method, i.e., not described in the gold standard complex databases, in fact contain protein complexes that have been characterized structurally and documented in PDBe. We also found that some of these protein complexes have recently been classified as part of a Periodic Table of Protein Complexes. The latest version of our software is publicly available at http://doi.org/10.6084/m9.figshare.5297314.v1.

Project description:Objectives: With the worldwide spread of coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), various antibody detection kits have been developed to test for SARS-CoV-2- specific IgG, IgM, and total antibody. However, the use of different testing methods under various heat-inactivation conditions might affect the COVID-19 detection results. Methods: Seven different antibody detection kits produced by four manufacturers for detection of SARS-CoV-2 IgG, IgM, and total antibody were tested at Wuhan Huoshenshan Hospital, China. Most of the kits used the indirect immunity, capture, and double-antigen sandwich methods. The effects of various heat-inactivation conditions on SARS-CoV-2-specific IgG, IgM, and total antibody detection were analyzed for the different test methods. Results: Using the indirect immunity method, values for SARS-CoV-2 IgG antibody significantly increased and those for IgM antibody decreased with increasing temperature of heat-inactivation using indirect immunity method. However, values for SARS-CoV-2 IgM and total antibody showed no change when the capture and double-antigen sandwich methods were used. The changes in IgG and IgM antibody values with the indirect immunity method indicated that heat-inactivation could affect COVID-19 detection results obtained using this method. In particular, 18 (22.2%) SARS-CoV-2 IgM positive samples were detected as negative with heat-inactivation at 65°C for 30 min, and one (25%) IgG negative sample was detected as positive after heat-inactivation at 56°C for 60 min and 60°C for 30 min. Conclusions: Heat-inactivation could increase SARS-CoV-2 IgG antibody values, and decrease IgM antibody values, causing potential false-positive or false-negative results for COVID-19 antibody detection using the indirect immunity method. Thus, before conducting antibody testing, the testing platforms should be evaluated in accordance with the relevant requirements to ensure accurate COVID-19 detection results.

Project description:BackgroundThe strength of selective constraints operating on amino acid sites of proteins has a multifactorial nature. In fact, amino acid sites within proteins coevolve due to their functional and/or structural relationships. Different methods have been developed that attempt to account for the evolutionary dependencies between amino acid sites. Researchers have invested a significant effort to increase the sensitivity of such methods. However, the difficulty in disentangling functional co-dependencies from historical covariation has fuelled the scepticism over their power to detect biologically meaningful results. In addition, the biological parameters connecting linear sequence evolution to structure evolution remain elusive. For these reasons, most of the evolutionary studies aimed at identifying functional dependencies among protein domains have focused on the structural properties of proteins rather than on the information extracted from linear multiple sequence alignments (MSA). Non-parametric methods to detect coevolution have been reported to be especially susceptible to produce false positive results based on the properties of MSAs. However, no formal statistical analysis has been performed to definitively test the differential effects of these properties on the sensitivity of such methods.ResultsHere we test the effect that variations on the MSA properties have over the sensitivity of non-parametric methods to detect coevolution. We test the effect that the size of the MSA (number of sequences), mean pairwise amino acid distance per site and the strength of the coevolution signal have on the ability of non-parametric methods to detect coevolution. Our results indicate that all three factors have significant effects on the accuracy of non-parametric methods. Further, introducing statistical filters improves the sensitivity and increases the statistical power of the methods to detect functional coevolution. Statistical analysis of the physico-chemical properties of amino acid sites in the context of the protein structure reveals striking dependencies among amino acid sites. Results indicate a covariation trend in the hydrophobicities and molecular weight characteristics of amino acid sites when analysing a non-redundant set of 8000 protein structures. Using this biological information as filter in coevolutionary analyses minimises the false positive rate of these methods. Application of these filters to three different proteins with known functional domains supports the importance of using biological filters to detect coevolution.ConclusionCoevolutionary analyses using non-parametric methods have proved difficult and highly prone to provide spurious results depending on the properties of MSAs and on the strength of coevolution between amino acid sites. The application of statistical filters to the number of pairs detected as coevolving reduces significantly the number of artifactual results. Analysis of the physico-chemical properties of amino acid sites in the protein structure context reveals their structure-dependent covariation. The application of this known biological information to the analysis of covariation greatly enhances the functional coevolutionary signal and removes historical covariation. Simultaneous use of statistical and biological data is instrumental in the detection of functional amino acid sites dependencies and compensatory changes at the protein level.