Project description: In contrast to its common definition and calculation procedure, interpretations of the p-value differ among statisticians. Since the p-value is the basis of various methodologies, this divergence has led to distinct test methodologies as well as differing opinions on how to evaluate test results, producing a chaotic situation. Here, the origin of the divergence is traced to differences in Pr(H0 = true), the prior probability implicit in the definition of the p-value. The effects of differences in this prior probability on the character of p-values are investigated using microarray data and random numbers as subjects. The summarized expression levels of the genes are presented in the matrix files (linked below as supplementary files). A Student's t-test was applied between the two groups (0 h and 14 d); the resulting p-values are presented in the matrix files.
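As an illustration of the computation described above, here is a minimal sketch of a per-gene Student's t-test between two sample groups. The file name and the "0h"/"14d" column labels are hypothetical placeholders, not the study's actual matrix format.

```python
# Minimal sketch: per-gene two-sample Student's t-test between two groups.
# "expression_matrix.csv" and the column labels are hypothetical.
import pandas as pd
from scipy import stats

expr = pd.read_csv("expression_matrix.csv", index_col=0)   # genes x samples
group_0h = expr.filter(like="0h").to_numpy()               # assumed column naming
group_14d = expr.filter(like="14d").to_numpy()

# equal_var=True gives the classic Student's t-test (as opposed to Welch's)
t, p = stats.ttest_ind(group_0h, group_14d, axis=1, equal_var=True)
results = pd.DataFrame({"t": t, "p_value": p}, index=expr.index)
print(results.sort_values("p_value").head())
```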
Project description: Background: Up to now, microarray data have mostly been assessed in the context of only one or a few parameters characterizing the experimental conditions under study. More explicit experiment annotations, however, are highly useful for interpreting microarray data when available in a statistically accessible format. Results: We provide means to preprocess these additional data and to extract relevant traits corresponding to the transcription patterns under study. We found correspondence analysis particularly well suited for mapping such extracted traits. It visualizes associations both among and between the traits, the hereby annotated experiments, and the genes, revealing how they are all interrelated. Here, we apply our methods to the systematic interpretation of radioactive (single-channel) and two-channel data, stemming from model organisms such as yeast and Drosophila up to complex human cancer samples. Inclusion of technical parameters allows for the identification of artifacts and flaws in experimental design. Conclusion: Biological and clinical traits can act as landmarks in transcription space, systematically mapping the variance of large datasets from the predominant changes down toward intricate details.
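For readers unfamiliar with correspondence analysis, the following is a generic, self-contained sketch of the technique on a nonnegative genes-by-traits table. It illustrates only the standard SVD formulation, not the authors' pipeline; the input data here are synthetic.

```python
# Generic correspondence analysis (CA) on a nonnegative contingency-like table.
import numpy as np

def correspondence_analysis(N, n_components=2):
    """Return row and column principal coordinates for the first axes."""
    P = N / N.sum()                                  # correspondence matrix
    r = P.sum(axis=1)                                # row masses
    c = P.sum(axis=0)                                # column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardized residuals
    U, d, Vt = np.linalg.svd(S, full_matrices=False)
    # principal coordinates: rows and columns share the same low-dim map
    rows = (U[:, :n_components] * d[:n_components]) / np.sqrt(r)[:, None]
    cols = (Vt.T[:, :n_components] * d[:n_components]) / np.sqrt(c)[:, None]
    return rows, cols

counts = np.random.default_rng(0).poisson(5, size=(100, 6)).astype(float)
row_xy, col_xy = correspondence_analysis(counts)     # genes and traits, jointly mapped
```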
Project description: Objective: To develop and validate Medicare claims-based approaches for identifying abnormal screening mammography interpretation. Data sources: Mammography data and linked Medicare claims for 387,709 mammograms performed from 1999 to 2005 within the Breast Cancer Surveillance Consortium (BCSC). Study design: Split-sample validation of algorithms based on claims for breast imaging or biopsy following screening mammography. Data extraction methods: Medicare claims and BCSC mammography data were pooled at a central Statistical Coordinating Center. Principal findings: Presence of claims for subsequent imaging or biopsy had sensitivity of 74.9 percent (95 percent confidence interval [CI], 74.1-75.6) and specificity of 99.4 percent (95 percent CI, 99.4-99.5). A classification and regression tree improved sensitivity to 82.5 percent (95 percent CI, 81.9-83.2) but decreased specificity (96.6 percent; 95 percent CI, 96.6-96.8). Conclusions: Medicare claims may be a feasible data source for research or quality improvement efforts addressing high rates of abnormal screening mammography.
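A hedged sketch of the split-sample classification-and-regression-tree (CART) approach using scikit-learn. The indicator features and outcome below are synthetic stand-ins, not the actual BCSC/Medicare variables.

```python
# Split-sample validation of a CART classifier; data are synthetic placeholders.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 5))                        # hypothetical claims indicators
y = ((X[:, 0] + X[:, 1] + rng.random(1000)) > 1.5).astype(int)  # synthetic gold standard

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

cart = DecisionTreeClassifier(max_depth=4).fit(X_tr, y_tr)    # CART-style tree
pred = cart.predict(X_te)

sens = np.sum((pred == 1) & (y_te == 1)) / np.sum(y_te == 1)
spec = np.sum((pred == 0) & (y_te == 0)) / np.sum(y_te == 0)
print(f"sensitivity={sens:.3f}  specificity={spec:.3f}")
```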
Project description: Motivation: Studying the interplay between gene expression and metabolite levels can yield important information on the physiology of stress responses and adaptation strategies. Performing transcriptomics and metabolomics in parallel during time-series experiments represents a systematic way to gain such information. Several combined profiling datasets have been added to the public domain, and they form a valuable resource for hypothesis-generating studies. Unfortunately, detecting co-responses between transcript levels and metabolite abundances is non-trivial: they cannot be assumed to overlap directly with underlying biochemical pathways, and they may be subject to time delays and obscured by considerable noise. Results: Our aim was to predict pathway co-membership between metabolites and genes based on their co-responses to applied stress. We found that in the presence of strong noise and time-shifted responses, a hidden Markov model-based similarity outperforms the simpler Pearson correlation, but performs comparably or worse in their absence. We therefore propose a supervised method that applies pathway information to summarize the similarity statistics into a consensus statistic that is more informative than any of the single measures. Using four combined profiling datasets, we show that co-membership between metabolites and genes can be predicted for numerous KEGG pathways; this opens opportunities for the detection of transcriptionally regulated pathways and of novel metabolically related genes. Availability: A command-line software tool is available at http://www.cin.ufpe.br/~igcf/Metabolites. Contact: henning@psc.riken.jp; igcf@cin.ufpe.br
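The paper's HMM-based similarity is not reproduced here. As a much simpler stand-in for tolerating time-shifted responses, the sketch below takes the maximum Pearson correlation over a small window of lags between a transcript and a metabolite time series; the series themselves are synthetic.

```python
# Lag-tolerant similarity: max Pearson correlation over a window of time shifts.
import numpy as np

def lagged_pearson(x, y, max_lag=2):
    """Maximum Pearson correlation of y against x over lags -max_lag..max_lag."""
    best = -np.inf
    for lag in range(-max_lag, max_lag + 1):
        if lag > 0:
            a, b = x[lag:], y[:-lag]
        elif lag < 0:
            a, b = x[:lag], y[-lag:]
        else:
            a, b = x, y
        best = max(best, np.corrcoef(a, b)[0, 1])
    return best

rng = np.random.default_rng(0)
t = np.arange(12)
transcript = np.sin(t / 2) + rng.normal(0, 0.3, 12)
metabolite = np.sin((t - 2) / 2) + rng.normal(0, 0.3, 12)   # delayed co-response
print(lagged_pearson(transcript, metabolite, max_lag=3))
```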
Project description: For a long time, NMR chemical shifts have been used to identify protein secondary structures. Currently, this is accomplished by comparing the observed ¹Hα, ¹³Cα, ¹³Cβ, or ¹³C′ chemical shifts with the random-coil values. Here, we present a new protocol, based on the joint probability of each of the three secondary structural types (β-strand, α-helix, and random coil) derived from chemical-shift data, to identify the secondary structure. In combination with empirical smoothing filters/functions, this protocol shows significant improvements in the accuracy and confidence of identification. Updated chemical-shift statistics are reported, on the basis of which the reliability of using chemical shifts to identify protein secondary structure is evaluated for each nucleus. The reliability varies greatly among the 20 amino acids but, on average, follows the order ¹³Cα > ¹³C′ > ¹Hα > ¹³Cβ > ¹⁵N > ¹HN for distinguishing an α-helix from a random coil, and ¹Hα > ¹³Cβ > ¹HN ≈ ¹³Cα ≈ ¹³C′ ≈ ¹⁵N for distinguishing a β-strand from a random coil. Amide ¹⁵N and ¹HN chemical shifts, which are generally excluded from such applications, were in fact found to be helpful in distinguishing a β-strand from a random coil. In addition, the chemical-shift statistics are compared with those reported previously, and the results are discussed. A Java user-interface program has been developed to make the entire procedure fully automated and is available via http://ccsr3150-p3.stanford.edu.
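A toy illustration of the underlying idea (not the paper's joint-probability protocol): classify a residue by the deviation of its observed ¹³Cα shift from a random-coil reference, since positive deviations tend to indicate helix and negative deviations strand. The reference values and thresholds below are approximate and purely illustrative.

```python
# Toy secondary-shift classifier; reference ppm values and cutoffs are illustrative.
RANDOM_COIL_CA = {"ALA": 52.5, "GLY": 45.1, "LEU": 55.1}   # approximate random-coil 13Ca shifts (ppm)

def classify_residue(residue, observed_ca, helix_cut=0.7, strand_cut=-0.7):
    """Assign helix/strand/coil from the 13Ca secondary chemical shift."""
    delta = observed_ca - RANDOM_COIL_CA[residue]
    if delta > helix_cut:
        return "helix"
    if delta < strand_cut:
        return "strand"
    return "coil"

print(classify_residue("ALA", 54.2))   # positive deviation -> "helix"
```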
Project description: Background: Flow cytometry analysis is the method of choice for the differential diagnosis of hematologic disorders. It is typically performed by a trained hematopathologist through visual examination of bidimensional plots, making the analysis time-consuming and sometimes too subjective. Here, a pilot study applying genetic algorithms to flow cytometry data from normal and acute myeloid leukemia subjects is described. Subjects and Methods: Initially, Flow Cytometry Standard files from 316 normal and 43 acute myeloid leukemia subjects were transformed into multidimensional FITS image metafiles. Training was performed through the introduction of FITS metafiles from 4 normal and 4 acute myeloid leukemia subjects into the artificial intelligence system. Results: Two mathematical algorithms, termed 018330 and 025886, were generated. When tested against a cohort of 312 normal and 39 acute myeloid leukemia subjects, the two algorithms combined showed high discriminatory power, with an area under the receiver operating characteristic (ROC) curve of 0.912. Conclusions: The present results suggest that machine learning systems hold great promise for the interpretation of hematological flow cytometry data.
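A minimal sketch of how discriminatory power is quantified as the area under the ROC curve. The class sizes mirror the test cohort above, but the classifier scores are synthetic, not outputs of the study's algorithms.

```python
# ROC AUC for a binary normal-vs-AML discrimination task; scores are synthetic.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
y_true = np.r_[np.zeros(312), np.ones(39)]                 # 312 normal, 39 AML
scores = np.r_[rng.normal(0.0, 1.0, 312),                  # scores for normals
               rng.normal(1.5, 1.0, 39)]                   # higher scores for AML
print(f"AUC = {roc_auc_score(y_true, scores):.3f}")
```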
Project description: Background: Most machine learning approaches only provide a classification for binary responses, but probabilities are required for risk estimation based on individual patient characteristics. It has recently been shown that every statistical learning machine known to be consistent for a nonparametric regression problem is a probability machine that is provably consistent for this estimation problem. Objectives: The aim of this paper is to show how random forests and nearest neighbors can be used for the consistent estimation of individual probabilities. Methods: Two random forest algorithms and two nearest neighbor algorithms are described in detail for the estimation of individual probabilities. We discuss the consistency of random forests, nearest neighbors, and other learning machines in detail. We conduct a simulation study to illustrate the validity of the methods, and we exemplify the algorithms by analyzing two well-known data sets on the diagnosis of appendicitis and of diabetes in Pima Indians. Results: The simulations demonstrate the validity of the method. With the real-data applications, we show the accuracy and practicality of this approach. We provide sample code from R packages in which the probability estimation is already available, so all calculations can be performed using existing software. Conclusions: Random forest algorithms as well as nearest neighbor approaches are valid machine learning methods for estimating individual probabilities for binary responses; freely available implementations exist in R and may be used for applications.
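The paper provides sample code in R; the following is an analogous scikit-learn sketch of the probability-machine idea, in which a regression forest fit to a 0/1 outcome estimates P(Y = 1 | X) directly. The data-generating model here is a synthetic logistic example.

```python
# Probability machine via a regression forest on a binary outcome.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
p_true = 1 / (1 + np.exp(-X[:, 0]))        # true individual probabilities
y = rng.binomial(1, p_true)                # observed binary response

# A regression (not classification) forest on 0/1 labels estimates P(Y=1 | X)
rf = RandomForestRegressor(n_estimators=500, min_samples_leaf=10, random_state=0)
rf.fit(X, y)

p_hat = rf.predict(X[:5])                  # estimated probabilities, not class labels
print(np.c_[p_true[:5], p_hat])            # compare true vs estimated
```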
Project description: Single-cell RNA sequencing (scRNA-seq) has revealed an unprecedented degree of immune cell diversity. However, the consistent definition of cell subtypes and cell states across studies and diseases remains a major challenge. Here we generate reference T cell atlases for cancer and viral infection by multi-study integration, and we develop ProjecTILs, an algorithm for reference-atlas projection. In contrast to other methods, ProjecTILs allows not only accurate embedding of new scRNA-seq data into a reference without altering its structure, but also characterization of previously unknown cell states that "deviate" from the reference. ProjecTILs accurately predicts the effects of cell perturbations and identifies gene programs that are altered in different conditions and tissues. A meta-analysis of tumor-infiltrating T cells from several cohorts reveals strong conservation of T cell subtypes between human and mouse, providing a consistent basis for describing T cell heterogeneity across studies, diseases, and species.
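ProjecTILs itself is an R package with dedicated batch-correction and label-transfer steps; the sketch below illustrates only the core projection idea, namely fitting an embedding on the reference alone and projecting query cells into it without refitting, so the reference structure stays unchanged. The matrices here are synthetic placeholders for preprocessed expression data.

```python
# Conceptual reference-projection sketch (not the ProjecTILs algorithm).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
reference = rng.normal(size=(1000, 200))   # reference atlas: cells x genes (preprocessed)
query = rng.normal(size=(300, 200))        # new dataset in the same gene space

pca = PCA(n_components=30).fit(reference)  # embedding is learned from the reference only
ref_emb = pca.transform(reference)         # fixed reference coordinates
query_emb = pca.transform(query)           # query cells land in the reference space
```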
Project description: Mechanistic understanding of dynamic membrane proteins such as transporters, receptors, and channels requires accurate depictions of conformational ensembles and of the manner in which they interchange as a function of environmental factors, including substrates, lipids, and inhibitors. Spectroscopic techniques such as electron spin resonance (ESR) pulsed electron-electron double resonance (PELDOR), also known as double electron-electron resonance (DEER), complement atomistic structures obtained from X-ray crystallography or cryo-EM, since spectroscopic data reflect an ensemble and can be measured in more native solvents, unperturbed by a crystal lattice. However, attempts to interpret DEER data are frequently stymied by discrepancies with the structural data, which may arise from differences in conditions, the dynamics of the protein, or the flexibility of the attached paramagnetic spin labels. Recently, molecular simulation techniques such as EBMetaD have been developed that create a conformational ensemble matching an experimental distance distribution while applying the minimal possible bias. Moreover, it has been proposed that the work required during an EBMetaD simulation to match an experimentally determined distribution could be used as a metric with which to assign conformational states to a given measurement. Here, we demonstrate the application of this concept to a sodium-coupled transport protein, BetP. Because the probe, protein, and lipid bilayer are all represented in atomic detail, the different contributions to the work, such as the extent of protein backbone movement, can be separated. This study therefore illustrates how ranking simulations based on EBMetaD can help to bridge the gap between structural and biophysical data and thereby enhance our understanding of membrane protein conformational mechanisms.
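A hedged sketch of the ranking idea: since the EBMetaD bias work grows with the mismatch between the simulated spin-spin distance distribution and the experimental DEER distribution, a Kullback-Leibler divergence between the two histograms can serve as a simple proxy for that cost when comparing candidate conformations. The distributions below are synthetic Gaussians, not BetP data, and this is not the EBMetaD work calculation itself.

```python
# KL divergence as a proxy for distribution mismatch between candidate
# conformational ensembles and an experimental DEER distance distribution.
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two histograms on the same distance axis."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

r = np.linspace(20, 60, 200)                       # spin-spin distance axis (angstrom)
target = np.exp(-0.5 * ((r - 38) / 3.0) ** 2)      # "experimental" DEER distribution
sim_A = np.exp(-0.5 * ((r - 39) / 3.5) ** 2)       # candidate conformation A
sim_B = np.exp(-0.5 * ((r - 50) / 3.0) ** 2)       # candidate conformation B

# Conformation A, closer to the target, yields the smaller divergence and ranks higher.
print(kl_divergence(target, sim_A), kl_divergence(target, sim_B))
```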