Project description: We introduce GUARDIAN, a near-real-time (NRT) ionospheric monitoring software system for natural hazards warning. GUARDIAN's ultimate goal is to use NRT total electron content (TEC) time series to (1) allow users to explore ionospheric TEC perturbations due to natural and anthropogenic events on Earth, (2) automatically detect those perturbations, and (3) characterize potential natural hazards. The main goal of GUARDIAN is to augment existing natural hazards early warning systems (EWS). This contribution focuses mainly on objective (1): collecting GNSS measurements in NRT, computing TEC time series, and displaying them on a public website (https://guardian.jpl.nasa.gov). We validate the time series obtained in NRT against well-established post-processing methods. Furthermore, we present an inverse modeling proof of concept to obtain tsunami wave parameters from TEC time series, contributing significantly to objective (3). Note that objectives (2) and (3) are only introduced here as parts of the general architecture and are not currently operational. In its current implementation, the GUARDIAN system uses more than 70 GNSS ground stations distributed around the Pacific Ring of Fire and monitors four GNSS constellations (GPS, Galileo, BDS, and GLONASS). To the best of our knowledge, GUARDIAN is currently the only software capable of providing multi-GNSS NRT TEC time series over the Pacific region to the general public and scientific community. Supplementary information: The online version contains supplementary material available at 10.1007/s10291-022-01365-6.
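As a concrete illustration of the TEC-computation step in objective (1), the sketch below derives slant TEC from dual-frequency GNSS pseudoranges using the standard geometry-free combination. This is a minimal example assuming GPS L1/L2 observations; the function and variable names are illustrative and not part of GUARDIAN, and inter-frequency biases and measurement noise are ignored.

```python
# Minimal sketch: geometry-free (dual-frequency) slant TEC from GNSS pseudoranges.
# Assumes GPS L1/L2; names are illustrative, not GUARDIAN's actual code.
import numpy as np

F1 = 1575.42e6  # GPS L1 frequency (Hz)
F2 = 1227.60e6  # GPS L2 frequency (Hz)
K = 40.3        # ionospheric refraction constant (m^3 / s^2)

def slant_tec(p1: np.ndarray, p2: np.ndarray) -> np.ndarray:
    """Slant TEC in TEC units (1 TECU = 1e16 el/m^2) from dual-frequency
    pseudoranges p1, p2 in meters; biases and noise are ignored."""
    factor = (F1**2 * F2**2) / (K * (F1**2 - F2**2))
    return factor * (p2 - p1) / 1e16

# Example: a 2 m differential ionospheric delay corresponds to ~19 TECU
print(slant_tec(np.array([2.0e7]), np.array([2.0e7 + 2.0])))
```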
Project description: Enhancers and silencers often depend on the same transcription factors (TFs) and are imperfectly distinguished from each other by genomic assays of TF binding or chromatin state. To identify sequence features that define enhancers and silencers, we assayed massively parallel reporter libraries of genomic sequences targeted by the photoreceptor TF CRX and found sequences with enhancer, silencer, or no activity. Both enhancers and silencers contain more TF motifs than inactive sequences, but enhancers contain motifs from a more diverse collection of TFs. We developed a measure of information content that describes the number and diversity of motifs in a sequence and found that, while both enhancers and silencers depend on CRX motifs, enhancers have higher information content. Our results indicate that enhancers contain motifs for a diverse but degenerate collection of TFs, while silencers depend on a smaller and less diverse collection of TFs.
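To make "number and diversity of motifs" concrete, the sketch below scores a sequence by the log of the number of distinguishable orderings of its motif hits, a Boltzmann-entropy-style count that grows with both how many hits a sequence has and how many different TFs they come from. This is an illustrative measure under assumed inputs, not necessarily the exact metric defined in the study, and the motif names are only examples.

```python
# Illustrative (not the study's exact) information-content score for a sequence,
# given the list of TF motif hits found in it (e.g. from a PWM scan).
import math
from collections import Counter

def motif_information_content(motif_hits: list[str]) -> float:
    """Log2 of the multinomial coefficient n! / (c1! c2! ...), i.e. the number of
    distinguishable orderings of the hits; higher with more hits and more TFs."""
    counts = Counter(motif_hits)
    n = sum(counts.values())

    def log2_factorial(k: int) -> float:
        return math.lgamma(k + 1) / math.log(2)

    return log2_factorial(n) - sum(log2_factorial(c) for c in counts.values())

# Same number of hits, more diverse TFs -> higher score
print(motif_information_content(["CRX"] * 6))                  # 0.0
print(motif_information_content(["CRX", "NRL", "MEF2D"] * 2))   # ~6.5 bits
```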
Project description: Background: Extracting relevant biological information from large data sets is a major challenge in functional genomics research. Different aspects of the data hamper their biological interpretation. For instance, 5000-fold differences in concentration between different metabolites are present in a metabolomics data set, while these differences are not proportional to the biological relevance of these metabolites. However, data analysis methods are not able to make this distinction. Data pretreatment methods can correct for aspects that hinder the biological interpretation of metabolomics data sets by emphasizing the biological information in the data set and thus improving its biological interpretability. Results: Different data pretreatment methods, i.e., centering, autoscaling, pareto scaling, range scaling, vast scaling, log transformation, and power transformation, were tested on a real-life metabolomics data set. They were found to greatly affect the outcome of the data analysis and thus the ranking of the biologically most important metabolites. Furthermore, the stability of the ranking, the influence of technical errors on data analysis, and the tendency of data analysis methods to select highly abundant metabolites were all affected by the data pretreatment method used prior to analysis. Conclusion: Different pretreatment methods emphasize different aspects of the data, and each pretreatment method has its own merits and drawbacks. The choice of pretreatment method depends on the biological question to be answered, the properties of the data set, and the data analysis method selected. For the explorative analysis of the validation data set used in this study, autoscaling and range scaling performed better than the other pretreatment methods: they removed the dependence of the metabolite ranking on the average concentration and the magnitude of the fold changes, and gave biologically sensible results after principal component analysis (PCA). In conclusion, selecting a proper data pretreatment method is an essential step in the analysis of metabolomics data and greatly affects which metabolites are identified as the most important.
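For concreteness, the sketch below implements three of the scaling methods named above, applied column-wise to a samples-by-metabolites matrix. It is a minimal illustration with assumed variable names, not the study's own code.

```python
# Sketch of three common pretreatment methods for a samples-by-metabolites matrix X.
import numpy as np

def autoscale(X):
    """Mean-center each metabolite and divide by its standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def pareto_scale(X):
    """Mean-center and divide by the square root of the standard deviation,
    reducing (but not removing) the weight of large fold changes."""
    return (X - X.mean(axis=0)) / np.sqrt(X.std(axis=0, ddof=1))

def range_scale(X):
    """Mean-center and divide by the metabolite's range (max - min)."""
    return (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

X = np.random.lognormal(mean=3.0, sigma=1.0, size=(20, 5))  # toy concentrations
print(autoscale(X).std(axis=0, ddof=1))  # ~1 for every metabolite after autoscaling
```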
Project description: The spectrum of modern molecular high-throughput assaying includes diverse technologies such as microarray gene expression, miRNA expression, proteomics, and DNA methylation, among many others. Now that these technologies have matured and become increasingly accessible, the next frontier is to collect "multi-modal" data for the same set of subjects and conduct integrative, multi-level analyses. While multi-modal data does contain distinct biological information that can be useful for answering complex biological questions, its value for predicting clinical phenotypes and the contribution of each type of input remain unknown. We obtained 47 datasets/predictive tasks that in total span over 9 data modalities and executed analytic experiments for predicting various clinical phenotypes and outcomes. First, we analyzed each modality separately using uni-modal approaches based on several state-of-the-art supervised classification and feature selection methods. Then, we applied integrative multi-modal classification techniques. We found that gene expression is the most predictively informative modality. Other modalities, such as protein expression, miRNA expression, and DNA methylation, also provide highly predictive results, which are often statistically comparable but not superior to gene expression data. Integrative multi-modal analyses generally do not increase the predictive signal compared to gene expression data alone.
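The sketch below illustrates a simple "early integration" multi-modal baseline of the kind such comparisons rest on: concatenate per-modality feature matrices and cross-validate a classifier against a uni-modal model. The variable names, toy data, and choice of model are assumptions for illustration, not the study's actual pipeline.

```python
# Toy comparison of a uni-modal vs. a concatenation-based multi-modal classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 100
gene_expr = rng.normal(size=(n, 500))    # e.g. gene expression features
methylation = rng.normal(size=(n, 300))  # e.g. DNA methylation features
y = rng.integers(0, 2, size=n)           # clinical phenotype labels

X_multimodal = np.hstack([gene_expr, methylation])  # early integration
for name, X in [("gene expression only", gene_expr), ("multi-modal", X_multimodal)]:
    auc = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                          cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean AUC = {auc:.2f}")
```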
Project description: With the remarkable increase in genomic sequence data from various organisms, novel tools are needed for comprehensive analyses of the available big sequence data. We previously developed the Batch-Learning Self-Organizing Map (BLSOM), which can cluster genomic fragment sequences according to phylotype based solely on oligonucleotide composition, and applied it to genome and metagenomic studies. BLSOM is suitable for high-performance parallel computing and can analyze big data simultaneously, but a large-scale BLSOM requires large computational resources. We have developed the Self-Compressing BLSOM (SC-BLSOM) to reduce computation time, which allows us to carry out comprehensive analyses of big sequence data without the use of high-performance supercomputers. The strategy of SC-BLSOM is to construct BLSOMs hierarchically according to data class, such as phylotype. Each first-layer BLSOM was constructed with one piece of the divided input data representing a data subclass, such as a phylotype division, thereby compressing the number of data pieces. The second-layer BLSOM was then constructed from the combined weight vectors obtained from the first-layer BLSOMs. We compared SC-BLSOM with the conventional BLSOM by analyzing bacterial genome sequences. SC-BLSOM could be constructed faster than BLSOM and clustered the sequences according to phylotype with high accuracy, showing the method's suitability for efficient knowledge discovery from big sequence data.
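The two-layer construction described above can be sketched with an ordinary SOM implementation. The example below uses the third-party minisom package as a stand-in for BLSOM (an assumption; BLSOM is a batch-learning variant with its own implementation) and toy oligonucleotide-frequency vectors in place of real genomic fragments.

```python
# Sketch of the SC-BLSOM strategy: per-subclass first-layer SOMs, then a second-layer
# SOM trained on the pooled first-layer weight vectors. Uses `minisom` as a stand-in.
import numpy as np
from minisom import MiniSom

def train_som(data, grid=(10, 10), iters=1000):
    som = MiniSom(grid[0], grid[1], data.shape[1], sigma=1.0, learning_rate=0.5)
    som.train_random(data, iters)
    return som

# First layer: one small SOM per data subclass (e.g. per phylotype division),
# compressing each subclass into its grid of weight vectors.
subclasses = {name: np.random.rand(500, 256)  # toy oligonucleotide-frequency vectors
              for name in ["phylotype_A", "phylotype_B", "phylotype_C"]}
first_layer_weights = []
for name, data in subclasses.items():
    som = train_som(data)
    first_layer_weights.append(som.get_weights().reshape(-1, data.shape[1]))

# Second layer: a single SOM trained on the pooled first-layer weight vectors,
# which are far fewer than the original input vectors.
pooled = np.vstack(first_layer_weights)
second_layer = train_som(pooled, grid=(20, 20), iters=2000)
print(pooled.shape)  # (300, 256) instead of the original (1500, 256)
```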
Project description: Cases of a novel coronavirus were first reported in Wuhan, Hubei province, China, in December 2019 and have since spread across the world. Epidemiological studies have indicated human-to-human transmission in China and elsewhere. To aid the analysis and tracking of the COVID-19 epidemic, we collected and curated individual-level data from national, provincial, and municipal health reports, as well as additional information from online reports. All data are geo-coded and, where available, include symptoms, key dates (date of onset, admission, and confirmation), and travel history. The generation of detailed, real-time, and robust data for emerging disease outbreaks is important and can help to generate the evidence needed to support and inform public health decision making.
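To illustrate the shape of one individual-level line-list record described above, the sketch below defines a hypothetical schema. The field names and types are assumptions for illustration only, not the curated database's actual column names.

```python
# Hypothetical line-list record schema (field names are illustrative assumptions).
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class CaseRecord:
    case_id: str
    latitude: float                        # geo-coded location
    longitude: float
    date_onset: Optional[date] = None      # key dates, where available
    date_admission: Optional[date] = None
    date_confirmation: Optional[date] = None
    symptoms: list[str] = field(default_factory=list)
    travel_history: list[str] = field(default_factory=list)
    source: str = ""                       # originating health report or online report
```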
Project description: Long-range NMR data, namely residual dipolar couplings (RDCs) from external alignment and paramagnetic data, are becoming increasingly popular for characterizing the conformational heterogeneity of multidomain biomacromolecules and protein complexes. The question addressed here is how much information is contained in these averaged data. We have analyzed and compared the information content of conformationally averaged RDCs caused by steric alignment with that of both RDCs and pseudocontact shifts caused by paramagnetic alignment, and found that, despite their substantial differences, they contain a similar amount of information. Furthermore, using several synthetic tests, we find that both sets of data are equally good at recovering the major state(s) in conformational distributions.
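As a minimal reminder of what "conformationally averaged" means here, the sketch below computes a population-weighted average of per-conformer RDCs; the same averaging applies to pseudocontact shifts. Names and toy numbers are illustrative, not the study's data or code.

```python
# Toy population-weighted averaging of per-conformer RDCs.
import numpy as np

def averaged_rdc(weights, per_conformer_rdcs):
    """weights: (n_conformers,) populations summing to 1;
    per_conformer_rdcs: (n_conformers, n_couplings) RDCs predicted per conformer."""
    return np.asarray(weights) @ np.asarray(per_conformer_rdcs)

rdcs = np.array([[10.0, -4.0,  2.0],    # conformer 1
                 [-10.0, 4.0, -2.0]])   # conformer 2
print(averaged_rdc([0.5, 0.5], rdcs))   # [0. 0. 0.]   -- equal populations cancel
print(averaged_rdc([0.8, 0.2], rdcs))   # [6. -2.4 1.2] -- skewed toward conformer 1
```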
Project description: The complex of central problems in data analysis consists of three components: (1) detecting the dependence of variables using quantitative measures, (2) defining the significance of these dependence measures, and (3) inferring the functional relationships among dependent variables. We have argued previously that an information theory approach allows the detection problem to be separated from the problem of inferring functional form. Here we approach the third component, inferring functional forms based on the information encoded in the functions. We present a direct method for classifying the functional forms of discrete functions of three variables represented in data sets. Discrete variables are frequently encountered in data analysis, both as inherently categorical variables and through the binning of continuous numerical variables into discrete alphabets of values. The fundamental question of how much information is contained in a given function is answered for these discrete functions, and their surprisingly complex relationships are illustrated. The all-important effect of noise on the inference of function classes is found to be highly heterogeneous and reveals some unexpected patterns. We apply this classification approach to an important area of biological data analysis: the inference of genetic interactions. Genetic analysis provides a rich source of real and complex biological data analysis problems, and our general methods provide an analytical basis and tools for characterizing genetic problems and for analyzing genetic data. We illustrate the functional descriptions and classes of a number of common genetic interaction modes and show how different modes vary widely in their sensitivity to noise.
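To ground the question of how much information a discrete function contains, the sketch below evaluates one simple notion of it: the Shannon entropy of the output of a discrete function z = f(x, y) when its inputs are uniform. The toy "genetic-interaction-like" functions are illustrative assumptions, not the paper's classification scheme.

```python
# Entropy of the output of a discrete function of two inputs under uniform inputs,
# as one simple notion of a function's information content (illustrative only).
from collections import Counter
from itertools import product
import math

def output_entropy(f, alphabet):
    """Entropy (bits) of f(x, y) when x and y are i.i.d. uniform over `alphabet`."""
    counts = Counter(f(x, y) for x, y in product(alphabet, repeat=2))
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Toy interaction modes on a three-letter alphabet {0, 1, 2}
additive  = lambda x, y: min(x + y, 2)       # saturating additive effect
epistatic = lambda x, y: x if x > 0 else y   # x masks y when present
print(output_entropy(additive,  [0, 1, 2]))  # ~1.22 bits
print(output_entropy(epistatic, [0, 1, 2]))  # ~1.39 bits
```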
Project description: Background: Unsupervised compression algorithms applied to gene expression data extract latent or hidden signals representing technical and biological sources of variation. However, these algorithms require a user to select a biologically appropriate latent space dimensionality. In practice, most researchers fit a single algorithm at a single latent dimensionality. We sought to determine the extent to which selecting only one fit limits the biological features captured in the latent representations and, consequently, limits what can be discovered with subsequent analyses. Results: We compress gene expression data from three large datasets consisting of adult normal tissue, adult cancer tissue, and pediatric cancer tissue. We train many different models across a large range of latent space dimensionalities and observe various performance differences. We identify more curated pathway gene sets significantly associated with individual dimensions in denoising autoencoder and variational autoencoder models trained with an intermediate number of latent dimensions. Combining compressed features across algorithms and dimensionalities captures the most pathway-associated representations. When trained with different latent dimensionalities, models learn strongly associated and generalizable biological representations, including sex, neuroblastoma MYCN amplification, and cell types. Stronger signals, such as tumor type, are best captured in models trained at lower dimensionalities, while more subtle signals, such as pathway activity, are best identified in models trained with more latent dimensions. Conclusions: There is no single best latent dimensionality or compression algorithm for analyzing gene expression data. Instead, using features derived from different compression models across multiple latent space dimensionalities enhances biological representations.
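The sketch below illustrates the multi-dimensionality idea using PCA as a stand-in compression model (the study also used autoencoders and other algorithms): fit compressions at several latent dimensionalities and concatenate the resulting features rather than committing to a single fit. Variable names and the choice of PCA are assumptions for illustration.

```python
# Combine compressed features learned at several latent dimensionalities,
# using PCA as a stand-in for the compression algorithms discussed above.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(200, 5000)  # toy samples-by-genes expression matrix

compressed = []
for k in [2, 8, 32, 128]:               # range of latent space dimensionalities
    z = PCA(n_components=k).fit_transform(X)
    compressed.append(z)

# Ensemble representation: features from every dimensionality, side by side
ensemble = np.hstack(compressed)
print(ensemble.shape)  # (200, 170)
```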
Project description: The cost of maintaining the exabytes of data produced by sequencing experiments every year has become a major issue in today's genomic research. In spite of the increasing popularity of third-generation sequencing, existing algorithms for compressing long reads offer only a minor advantage over general-purpose gzip. We present CoLoRd, an algorithm able to reduce the size of third-generation sequencing data by an order of magnitude without affecting the accuracy of downstream analyses.