Project description:Detection of SARS-CoV-2 using RT–PCR and other advanced methods can achieve high accuracy. However, their application is limited in countries that lack sufficient resources to handle large-scale testing during the COVID-19 pandemic. Here, we describe a method to detect SARS-CoV-2 in nasal swabs using matrix-assisted laser desorption/ionization mass spectrometry (MALDI-MS) and machine learning analysis. This approach uses equipment and expertise commonly found in clinical laboratories in developing countries. We obtained mass spectra from a total of 362 samples (211 SARS-CoV-2-positive and 151 negative by RT–PCR) without prior sample preparation from three different laboratories. We tested two feature selection methods and six machine learning approaches to identify the top performing analysis approaches and determine the accuracy of SARS-CoV-2 detection. The support vector machine model provided the highest accuracy (93.9%), with 7% false positives and 5% false negatives. Our results suggest that MALDI-MS and machine learning analysis can be used to reliably detect SARS-CoV-2 in nasal swab samples.
Project description:MicroRNAs (miRs) function primarily as post-transcriptional negative regulators of gene expression through binding to their mRNA targets. Reliable prediction of a miR’s targets is a considerable bioinformatic challenge of great importance for inferring the miR’s function. Sequence-based prediction algorithms have high false-positive rates, disagree with one another, and are not specific to biological context. Here we introduce CoSMic (Context-Specific MicroRNA analysis), an algorithm that combines sequence-based prediction with miR and mRNA expression data. CoSMic differs from existing methods: it identifies miRs that play active roles in the specific biological system of interest and predicts their functional targets with fewer false positives. We applied CoSMic to search for miRs that regulate the migratory response of human mammary cells to epidermal growth factor (EGF) stimulation. We identified several such miRs whose putative targets were significantly enriched for migration processes. We tested three of these miRs experimentally and showed that they indeed affected the migratory phenotype; we also tested three negative controls. In comparison to other algorithms, CoSMic filters out false positives and allows improved identification of context-specific targets. CoSMic can greatly facilitate miR research in general and, in particular, advance our understanding of individual miRs’ function in a specific context.
Project description:RNA-Seq data from 17 wild-type biological replicates of Arabidopsis thaliana, used to explore read count measurements across replicates along with the false discovery rate of differential gene expression (DGE) tools. Although A. thaliana has a relatively small genome, its transcriptome is similar in scale and complexity to that of model mammal species and its genome is extensively annotated, and the conclusions presented here provide useful guidance for work in other complex eukaryotes. The findings show that the negative binomial and log-normal distributions are both good choices as models for the cross-replicate variability of RNA-seq read counts. Six of nine DGE tools controlled their identification of false positives well even with only 3 replicates. Our results reinforce the conclusions reached by Schurch et al. (2015 RNA) in yeast.
Project description:Adenosine-to-inosine (A-to-I) RNA editing is a post-transcriptional processing event involved in diversifying the transcriptome responsible for various biological processes. In this context, we developed a new biochemical method that enriches the inosine-containing RNA. The objective is the accurate identification of A-to-I editing sites, eliminating false positives caused by RNA-DNA differences. This method was applied to three neurological diseases, demonstrating that A-to-I editing sites significantly decreased in neuronal activity genes.
Project description:Recent years have witnessed rapid progress in the field of epitranscriptomics. Functional interpretation of the epitranscriptome relies on mapping technologies that determine the localization and stoichiometry of various RNA modifications. However, different studies have produced contradictory results, calling into question the biological impact of certain RNA modifications. Here, we develop an approach for generating a synthetic RNA library that resembles the endogenous transcriptome but lacks modifications. By incorporating this modification-free RNA library as a negative control into established techniques, we obtain precise and quantitative maps of m6A and m5C after removing the pervasive false positives resulting from other elements such as specific sequence context and RNA secondary structure.
Project description:Detection of SARS-CoV-2 using RT-PCR and other advanced methods can achieve high accuracy. However, their application is limited in countries that lack sufficient resources to handle large-scale testing during the COVID-19 pandemic. Here, we describe a method to detect SARS-CoV-2 in nasal swabs using matrix-assisted laser desorption/ionization mass spectrometry (MALDI-MS) and machine learning analysis. This approach uses equipment and expertise commonly found in clinical laboratories in developing countries. We obtained mass spectra from a total of 362 samples (211 SARS-CoV-2-positive and 151 negative by RT-PCR) without prior sample preparation from three different laboratories. We tested two feature selection methods and six machine learning approaches to identify the top performing analysis approaches and determine the accuracy of SARS-CoV-2 detection. The support vector machine model provided the highest accuracy (93.9%), with 7% false positives and 5% false negatives. Our results suggest that MALDI-MS and machine learning analysis can be used to reliably detect SARS-CoV-2 in nasal swab samples.
| MSV000086175 | MassIVE
Project description:Metabarcoding of a mock community of soil invertebrates: DNA extraction, false-positives, and data filtration
Project description:Enterotoxin-producing C. perfringens type A is a common cause of food poisonings. The cpe gene encoding the enterotoxin can be chromosomal (genotype IS1470) or plasmid-borne (genotypes IS1470-like-cpe or IS1151-cpe). Chromosomal cpe-carrying C. perfringens strains are a more common cause of food poisonings than plasmid-borne cpe genotypes. Chromosomal cpe-carrying C. perfringens type A strains are also generally more resistant to most food-processing conditions than plasmid-borne cpe-carrying strains. On the other hand, the plasmid-borne cpe-positive genotypes are more commonly found in human feces than chromosomal cpe-positive genotypes, and humans seem to be a reservoir for plasmid-borne cpe-carrying strains. Thus, it is possible that the epidemiology of C. perfringens type A food poisonings caused by plasmid-borne and chromosomal cpe-carrying strains is different. A DNA microarray was designed for analysis of genetic relatedness between the different cpe-positive and cpe-negative genotypes of C. perfringens strains isolated from human, animal, environmental and food samples. The DNA microarray contained two probes for all protein-coding sequences in the three genome-sequenced strains (C. perfringens type A strains 13, ATCC13124, and SM101). The chromosomal and plasmid-borne C. perfringens genotypes were grouped into two distinct clusters, one consisting of the chromosomal cpe genotypes and the other consisting of plasmid-borne cpe genotypes. Analysis of the variable gene pool, complemented with growth studies, demonstrates different carbohydrate and amine metabolism in the chromosomal and plasmid-borne cpe-carrying strains, suggesting different epidemiology of the cpe-positive C. perfringens strain groups.
Project description:To evaluate the quantitative performance of different quantitative methods without bias, and to compare popular proteomics data processing workflows, we prepared a benchmark dataset in which E. coli proteome was spiked in at various levels, so that the true fold changes (1-, 1.5-, 2-, 2.5- and 3-fold) and the true identities of positives and negatives (E. coli proteins are true positives, while human proteins are true negatives) are known. To best mimic proteomics applications that compare multiple replicates, each fold-change group contains 4 replicates, giving 20 LC-MS/MS analyses in this benchmark dataset. To our knowledge, this is the largest spike-in benchmark dataset to date, encompassing 5 different spike levels, >500 true positive proteins, and >3000 true negative proteins (2-peptide criterion, 1% protein FDR), with a wide concentration dynamic range. The dataset is ideal for testing quantitative accuracy, precision, false-positive biomarker discovery, and missing-data levels.
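As a rough illustration of how a spike-in benchmark of this kind can be scored, the sketch below counts true and false positives by organism label, following the description's convention that E. coli proteins are true positives and human proteins are true negatives. The `Protein` record, the `benchmark_metrics` function, and the toy calls are hypothetical and are not taken from the dataset or from any particular workflow.

```python
from dataclasses import dataclass


@dataclass
class Protein:
    """Hypothetical record for one quantified protein."""
    name: str
    organism: str         # "ecoli" (spiked in, true positive) or "human" (true negative)
    called_changed: bool  # whether some workflow flagged it as differentially abundant


def benchmark_metrics(proteins):
    """Score a workflow's calls against the known organism labels."""
    tp = sum(p.called_changed and p.organism == "ecoli" for p in proteins)
    fp = sum(p.called_changed and p.organism == "human" for p in proteins)
    fn = sum((not p.called_changed) and p.organism == "ecoli" for p in proteins)
    tn = sum((not p.called_changed) and p.organism == "human" for p in proteins)

    called = tp + fp
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        # Fraction of flagged proteins that are actually human background.
        "false_discovery_proportion": fp / called if called else 0.0,
        # Fraction of spiked-in E. coli proteins that were recovered.
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy data (made up): 4 E. coli and 6 human proteins.
toy = (
    [Protein(f"ecoli_{i}", "ecoli", called_changed=(i < 3)) for i in range(4)]
    + [Protein(f"human_{i}", "human", called_changed=(i == 0)) for i in range(6)]
)
metrics = benchmark_metrics(toy)
print(metrics)
```

In the toy data, three of four E. coli proteins and one of six human proteins are flagged, so one of the four calls is a false discovery. A real comparison would run the same scoring per fold-change group and per workflow.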