Project description: Disease gene discovery has been transformed by affordable sequencing of exomes and genomes. Identification of disease-causing mutations requires sifting through a large number of sequence variants. A subset of these variants are unlikely to be good candidates for disease causation based on one or more of the following criteria: (1) being located in genomic regions known to be highly polymorphic, (2) having characteristics suggesting assembly misalignment, and/or (3) being labeled as variants based on misleading reference genome information. We analyzed exome sequence data from 118 individuals in 29 families seen in the NIH Undiagnosed Diseases Program (UDP) to create lists of variants and genes with these characteristics. Specifically, we identified sets of variant positions that are candidates for provisional exclusion during exome analysis: 23,389 positions with excess heterozygosity suggestive of alignment errors and 1,009 positions at which the hg18 human genome reference sequence appeared to contain a minor allele. Exclusion of such variants, which we provide in supplemental lists, will likely enhance identification of disease-causing mutations using exome sequence data.
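As a rough illustration of how such exclusion lists can be applied during variant triage, here is a minimal Python sketch that screens candidates against a position blacklist; the positions, variant records, and field names are hypothetical, not the paper's supplemental format.

```python
# Minimal sketch: screen candidate variants against provisional exclusion
# lists (highly polymorphic, misaligned, or reference-minor-allele
# positions). All positions and variants below are made up.

def filter_variants(variants, excluded_positions):
    """Drop variants whose (chrom, pos) appears on an exclusion list."""
    return [v for v in variants
            if (v["chrom"], v["pos"]) not in excluded_positions]

# In practice these would be loaded from the published exclusion lists.
excluded_positions = {("chr1", 1234567), ("chr7", 5550123)}

candidates = [
    {"chrom": "chr1", "pos": 1234567, "gene": "GENE_A"},  # excluded
    {"chrom": "chr2", "pos": 7654321, "gene": "GENE_B"},  # retained
]
print(filter_variants(candidates, excluded_positions))
```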
Project description: Errors in botanical surveying are a common problem. The presence of a species is easily overlooked, leading to false absences, while misidentifications and other mistakes lead to false-positive observations. While it is common knowledge that these errors occur, there are few data that can be used to quantify and describe them. Here we characterise false-positive errors for a controlled set of surveys conducted as part of a field identification test of botanical skill. Surveys were conducted at sites with a verified list of vascular plant species. The candidates were asked to list all the species they could identify in a defined botanically rich area. They were told beforehand that their final score would be the sum of the correct species they listed, but that false-positive errors would count against their overall grade. The number of errors varied considerably between people: some produced a high proportion of false-positive errors, and these people were scattered across all skill levels. Therefore, a person's ability to correctly identify a large number of species is not a safeguard against the generation of false-positive errors. There was no phylogenetic pattern to falsely observed species; however, rare species were more likely to be false positives, as were species from species-rich genera. Raising the threshold for the acceptance of an observation reduced false-positive observations dramatically, but at the expense of more false-negative errors. False-positive errors in field surveys of plants are more common than many people may appreciate. Greater stringency is required before accepting species as present at a site, particularly for rare species. Combining multiple surveys resolves the problem, but requires a considerable increase in effort to achieve the same sensitivity as a single survey. Therefore, other methods should be used to raise the threshold for the acceptance of a species. For example, digital data input systems that can verify, give feedback, and inform the user are likely to reduce false-positive errors significantly.
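The trade-off between false positives and false negatives when raising the acceptance threshold can be made concrete with a small binomial calculation; the per-survey probabilities below are invented for illustration and are not estimates from these surveys.

```python
# If a truly present species is recorded per survey with probability p_det,
# and a false-positive record arises with probability p_fp, requiring a
# species to appear in at least k of n surveys trades sensitivity against
# false positives. Numbers are illustrative only.
from math import comb

def at_least_k(n, k, p):
    """P(at least k successes in n independent trials of probability p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

p_det, p_fp, n = 0.7, 0.05, 3
for k in (1, 2, 3):
    print(f"k={k}: sensitivity={at_least_k(n, k, p_det):.3f}, "
          f"false-positive rate={at_least_k(n, k, p_fp):.4f}")
```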
Project description: The false positive rates (FPR) for surface-based group analysis of cortical thickness, surface area, and volume were evaluated for parametric and non-parametric clusterwise correction for multiple comparisons for a range of smoothing levels and cluster-forming thresholds (CFT) using real data under group assignments that should not yield significant results. For whole cortical surface analysis, thickness showed modest inflation in parametric FPRs above the nominal level (10% versus 5%). Surface area and volume FPRs were much higher (20-30%). In the analysis of interhemispheric thickness asymmetries, FPRs were well controlled by parametric correction, but FPRs for surface area and volume asymmetries were still inflated. In all cases, non-parametric permutation adequately controlled the FPRs. It was found that inflated parametric FPRs were caused by violations in the parametric assumptions, namely a heavier-than-Gaussian spatial correlation. The non-Gaussian spatial correlation originates from anatomical features unique to individuals (e.g., a patch of cortex slightly thicker or thinner than average) and is not a by-product of scanning or processing. Thickness performed better than surface area and volume because thickness does not require a Jacobian correction.
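A minimal sketch of the non-parametric approach, using synthetic data and a simple two-group vertexwise comparison rather than the study's actual pipeline: permuting group labels and thresholding against the max-statistic null distribution controls the family-wise error rate without Gaussian assumptions.

```python
# Max-statistic permutation test on synthetic "thickness" maps.
import numpy as np

rng = np.random.default_rng(0)
n_subj, n_vert = 40, 1000
data = rng.normal(size=(n_subj, n_vert))      # synthetic vertexwise maps
labels = np.array([0] * 20 + [1] * 20)        # arbitrary group assignment

def group_t(data, labels):
    """Two-sample t statistic at every vertex."""
    a, b = data[labels == 0], data[labels == 1]
    se = np.sqrt(a.var(0, ddof=1) / len(a) + b.var(0, ddof=1) / len(b))
    return (a.mean(0) - b.mean(0)) / se

obs = group_t(data, labels)
max_null = np.array([group_t(data, rng.permutation(labels)).max()
                     for _ in range(1000)])
thresh = np.quantile(max_null, 0.95)          # FWE-corrected threshold
print("significant vertices under the null:", int(np.sum(obs > thresh)))
```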
Project description: Recent reports of inflated false-positive rates (FPRs) in FMRI group analysis tools by Eklund and associates in 2016 became a major topic within (and outside) neuroimaging. They concluded that existing parametric methods for determining statistically significant clusters had greatly inflated FPRs ("up to 70%," mainly due to the faulty assumption that the noise spatial autocorrelation function is Gaussian shaped and stationary), calling into question potentially "countless" previous results; in contrast, nonparametric methods, such as their approach, accurately reflected nominal 5% FPRs. They also stated that AFNI showed "particularly high" FPRs compared to other software, largely due to a bug in 3dClustSim. We comment on these points using their own results and figures and by repeating some of their simulations. Briefly, while parametric methods show some FPR inflation in those tests (and assumptions of Gaussian-shaped spatial smoothness also appear to be generally incorrect), their emphasis on reporting the single worst result from thousands of simulation cases greatly exaggerated the scale of the problem. Importantly, FPRs depend on the "task" paradigm and the voxelwise p-value threshold; as such, we show how the results of their study provide useful suggestions for FMRI study design and analysis, rather than simply a catastrophic downgrading of the field's earlier results. Regarding AFNI (which we maintain), the effect of 3dClustSim's bug was greatly overstated: their own results show that AFNI results were not "particularly" worse than others. We describe further updates in AFNI for characterizing spatial smoothness more appropriately (greatly reducing FPRs, although some remain >5%); in addition, we outline two newly implemented permutation/randomization-based approaches producing FPRs clustered much more tightly about 5% for voxelwise p ≤ 0.01.
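For readers unfamiliar with how cluster-extent thresholds are calibrated, the following is a conceptual Monte Carlo sketch in the spirit of, but not reproducing, 3dClustSim; the grid size, smoothness, and iteration count are arbitrary, and real tools operate on masked brain volumes with estimated noise smoothness.

```python
# Simulate smooth null volumes, threshold voxelwise at p <= 0.01, and take
# the 95th percentile of the largest surviving cluster as the extent cutoff.
import numpy as np
from scipy.ndimage import gaussian_filter, label

rng = np.random.default_rng(1)
z_cut = 2.326                                # one-sided z for p <= 0.01
max_sizes = []
for _ in range(500):
    img = gaussian_filter(rng.normal(size=(48, 48, 48)), sigma=2.0)
    img /= img.std()                         # re-standardize after smoothing
    clusters, n = label(img > z_cut)
    sizes = np.bincount(clusters.ravel())[1:] if n else [0]
    max_sizes.append(max(sizes, default=0))
print("cluster-extent threshold (voxels):", np.quantile(max_sizes, 0.95))
```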
Project description: High-field asymmetric waveform ion mobility spectrometry (FAIMS) separates glycopeptides in the gas phase prior to mass spectrometry (MS) analysis, thus offering the potential to analyze glycopeptides without prior enrichment. Several studies have demonstrated the ability of FAIMS to enhance glycopeptide detection but have primarily focused on N-glycosylation. Here, we evaluated FAIMS for O-glycoprotein and mucin-domain glycoprotein analysis using samples of varying complexity. We demonstrated that FAIMS was useful in increasingly complex samples because it allowed for the identification of more glycosylated species. However, during our analyses, we observed a phenomenon called "in-FAIMS fragmentation" (IFF), akin to in-source fragmentation but occurring during FAIMS separation. FAIMS experiments showed a 2- to 5-fold increase in spectral matches from IFF compared with control experiments. These results were also replicated in previously published data, indicating that this is likely a systemic occurrence when using FAIMS. Our study highlights that although there are potential benefits to FAIMS separation, caution must be exercised in data analysis because of prevalent IFF, which may limit its applicability in the broader field of O-glycoproteomics.
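One way such IFF products might be flagged is sketched below, assuming IFF manifests as a precursor whose neutral mass equals an identified glycopeptide's mass minus one or more glycan residues; the matching logic, tolerance, and parent masses are our assumptions, although the monosaccharide residue masses are standard monoisotopic values.

```python
# Flag possible in-FAIMS fragmentation (IFF) products by mass arithmetic.
HEXNAC, HEX, NEUAC = 203.07937, 162.05282, 291.09542  # residue masses (Da)
TOL = 0.01  # Da

def possible_iff(candidate_mass, identified_masses):
    """Return (parent, loss) pairs consistent with a glycan-loss artifact."""
    losses = [HEXNAC, HEX, NEUAC, HEXNAC + HEX, HEX + NEUAC]
    return [(parent, loss)
            for parent in identified_masses
            for loss in losses
            if abs(parent - loss - candidate_mass) < TOL]

identified = [2456.0421, 3120.9988]           # hypothetical neutral masses
print(possible_iff(2456.0421 - HEXNAC, identified))
```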
Project description: Clinical sequencing is expanding, but causal variants are still not identified in the majority of cases. These unsolved cases can aid in gene discovery when individuals with similar phenotypes are identified in systems such as the Matchmaker Exchange. We describe the risk of false-positive associations in gene discovery from this growing set of unsolved cases. In a set of rare-disease cases with the same phenotype, it is not difficult to find two individuals who, by chance alone, carry variants in the same gene. We quantify the risk of false-positive association in a cohort of individuals with the same phenotype, using the prior probability of observing a variant in each gene estimated from over 60,000 individuals in the Exome Aggregation Consortium (ExAC). Based on the number of individuals with a variant in the gene, the cohort size, the specific gene, and the mode of inheritance, we calculate a P value for whether the match represents a true association. A match in two of 10 patients in MECP2 is statistically significant (P = 0.0014), whereas a match in TTN does not reach significance, as expected (P > 0.999). Finally, we analyze the probability of matching in clinical exome cases to estimate the number of cases needed to identify genes related to different disorders. We offer Rare Disease Match, an online tool to mitigate the uncertainty of false-positive associations.
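The underlying calculation can be sketched as a binomial tail probability; this is our formulation of the idea, and the per-individual variant probabilities below are placeholders rather than ExAC-derived values.

```python
# P-value-style chance of m or more same-gene matches among n cases, given
# the prior probability p that one individual carries a variant in the gene.
from math import comb

def p_chance_match(p, n_cases, m_matches):
    return sum(comb(n_cases, j) * p**j * (1 - p)**(n_cases - j)
               for j in range(m_matches, n_cases + 1))

# Placeholder probabilities, not ExAC values:
print(p_chance_match(0.0005, 10, 2))  # rarely varied gene: highly significant
print(p_chance_match(0.5, 10, 2))     # long, variant-tolerant gene: near 1
```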
Project description: Two-point linkage analyses of whole genome sequence data are a promising approach for identifying rare variants that segregate with complex diseases in large pedigrees because, in theory, the causal variants themselves have been genotyped. We used whole genome sequence data and simulated traits provided by Genetic Analysis Workshop 18 to evaluate the proportion of false-positive findings for a binary trait using classic two-point parametric linkage analysis. False-positive genome-wide significant logarithm of the odds (LOD) scores were identified in more than 80% of 50 replicates for a binary phenotype generated by dichotomizing a quantitative trait simulated with a polygenic component (one not based on any of the provided whole genome sequence genotypes). In contrast, when the trait was truly nongenetic (created by randomly assigning affected-unaffected status), the number of false-positive results was well controlled. These results suggest that when using two-point linkage analyses on whole genome sequence data, one should carefully examine regions yielding significant two-point LOD scores with multipoint analysis, and that a more stringent significance threshold may be needed.
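For reference, a textbook two-point LOD computation for phase-known, fully informative meioses (not the machinery of the workshop software) looks like this:

```python
# LOD(theta) compares the likelihood of r recombinants in n meioses under
# linkage at recombination fraction theta versus free recombination (0.5).
from math import log10

def lod(r, n, theta):
    return log10((theta**r * (1 - theta)**(n - r)) / 0.5**n)

r, n = 0, 10  # e.g., no recombinants observed in 10 informative meioses
best = max((lod(r, n, t / 100), t / 100) for t in range(1, 50))
print(f"LOD at theta=0.01: {lod(r, n, 0.01):.2f}; "
      f"max {best[0]:.2f} at theta={best[1]:.2f}")
```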
Project description: Virtual screening of the Maybridge library of ca. 70 000 compounds was performed using a similarity filter, docking, and molecular mechanics-generalized Born/surface area postprocessing to seek potential non-nucleoside inhibitors of human immunodeficiency virus-1 (HIV-1) reverse transcriptase (NNRTIs). Although known NNRTIs were retrieved well, purchase and assaying of representative, top-scoring compounds from the library failed to yield any active anti-HIV agents. However, the highest-ranked library compound, oxadiazole 1, was pursued as a potential "near-miss" with the BOMB program to seek constructive modifications. Subsequent synthesis and assaying of several polychloro-analogs did yield anti-HIV agents with EC50 values as low as 310 nM. The study demonstrates that it is possible to learn from a formally unsuccessful virtual-screening exercise and, with the aid of computational analyses, to efficiently evolve a false positive into a true active.
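As a sketch of what the similarity-filter stage of such a pipeline can look like, here is an RDKit-based stand-in; the fingerprints, threshold, and SMILES strings are placeholders, not the filter, reference actives, or compounds used in the study.

```python
# Tanimoto similarity prefilter over a tiny placeholder "library".
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

reference = Chem.MolFromSmiles("Cc1ccc(C)c(Oc2ccccc2)c1")  # placeholder
library = [Chem.MolFromSmiles(s)
           for s in ("c1ccccc1O", "Cc1ccccc1Oc1ccccc1")]

ref_fp = AllChem.GetMorganFingerprintAsBitVect(reference, 2, nBits=2048)
for mol in library:
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    sim = DataStructs.TanimotoSimilarity(ref_fp, fp)
    if sim >= 0.3:  # arbitrary cutoff for the sketch
        print(Chem.MolToSmiles(mol), round(sim, 3))
```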
Project description: Purpose: Clinical genome sequencing (cGS) followed by orthogonal confirmatory testing is standard practice. While orthogonal testing significantly improves specificity, it also increases turnaround time and the cost of testing. The purpose of this study is to evaluate machine learning models trained to identify false positive variants in cGS data, to reduce the need for orthogonal testing. Methods: We sequenced five reference human genome samples characterized by the Genome in a Bottle Consortium (GIAB) and compared the results with an established set of variants for each genome, referred to as a truth set. We then trained machine learning models to identify variants that were labeled as false positives. Results: After training, the models identified 99.5% of the false positive heterozygous single-nucleotide variants (SNVs) and heterozygous insertion/deletion variants (indels) while reducing confirmatory testing of nonactionable, nonprimary SNVs by 85% and indels by 75%. Employing the algorithm in clinical practice reduced overall orthogonal testing using dideoxynucleotide (Sanger) sequencing by 71%. Conclusion: Our results indicate that a low false positive call rate can be maintained while significantly reducing the need for confirmatory testing. The framework that generated our models and results is publicly available at https://github.com/HudsonAlpha/STEVE.
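A minimal stand-in for this kind of model, trained on synthetic QC features rather than real GIAB comparisons (the actual framework is available at the repository above), might look like:

```python
# Random-forest flagging of likely false-positive calls from QC features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n = 5000
# Hypothetical features: call quality, depth, allele balance, mapping quality
X = np.column_stack([rng.gamma(5, 10, n), rng.poisson(30, n),
                     rng.beta(5, 5, n), rng.normal(55, 5, n)])
# Synthetic "false positive" labels loosely tied to low call quality
y = (X[:, 0] < 25) | (rng.random(n) < 0.02)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("held-out false-positive recall:",
      round(recall_score(y_te, clf.predict(X_te)), 3))
```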
Project description: Background: When evaluating cancer screening it is important to estimate the cumulative risk of false positives from periodic screening. Because the data typically come from studies in which the number of screenings varies by subject, estimation must take into account dropouts. A previous approach to estimating the probability of at least one false positive in n screenings unrealistically assumed that the probability of dropout does not depend on prior false positives. Method: By redefining the random variables, we obviate the unrealistic dropout assumption. We also propose a relatively simple logistic regression and extend estimation to the expected number of false positives in n screenings. Results: We illustrate our methodology using data from women ages 40 to 64 who received up to four annual breast cancer screenings in the Health Insurance Plan of Greater New York study, which began in 1963. Covariates were age, time since previous screening, screening number, and whether or not a previous false positive had occurred. Defining a false positive as an unnecessary biopsy, the only statistically significant covariate was whether or not a previous false positive had occurred. Because the effect of screening number was not statistically significant, extrapolation beyond four screenings was reasonable. The estimated mean number of unnecessary biopsies in 10 years per woman screened is 0.11, with 95% confidence interval (0.10, 0.12). Defining a false positive as an unnecessary work-up, all the covariates were statistically significant, and the estimated mean number of unnecessary work-ups in 4 years per woman screened is 0.34, with 95% confidence interval (0.32, 0.36). Conclusion: Using data from multiple cancer screenings with dropouts, and allowing dropout to depend on previous history of false positives, we propose a logistic regression model to estimate both the probability of at least one false positive and the expected number of false positives associated with n cancer screenings. The methodology can be used both for informed decision making at the individual level and for planning health services.
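A simplified version of the cumulative calculation, collapsing the covariates to a single previous-false-positive effect (the probabilities below are placeholders, not estimates from the HIP data):

```python
# Propagate the history state across screens: the per-screen FP probability
# is p_first before any false positive and p_after once one has occurred.
def cumulative_fp(n, p_first, p_after):
    """Return (P(>=1 FP in n screens), E[number of FPs in n screens])."""
    p_none = 1.0   # probability that no FP has occurred so far
    expected = 0.0
    for _ in range(n):
        expected += p_none * p_first + (1 - p_none) * p_after
        p_none *= 1 - p_first
    return 1 - p_none, expected

print(cumulative_fp(10, p_first=0.03, p_after=0.06))
```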