Dataset Information

Compressive Big Data Analytics: An ensemble meta-algorithm for high-dimensional multisource datasets.

ABSTRACT: Health advances are contingent on continuous development of new methods and approaches to foster data-driven discovery in the biomedical and clinical sciences. Open-science and team-based scientific discovery offer hope for tackling some of the difficult challenges associated with managing, modeling, and interpreting large, complex, and multisource data. Translating raw observations into useful information and actionable knowledge depends on effective domain-independent reproducibility, area-specific replicability, data curation, analysis protocols, organization, management, and sharing of health-related digital objects. This study expands the functionality and utility of an ensemble semi-supervised machine learning technique called Compressive Big Data Analytics (CBDA). Applied to high-dimensional data, CBDA (1) identifies salient features and key biomarkers enabling reliable and reproducible forecasting of binary, multinomial, and continuous outcomes (i.e., feature mining); and (2) suggests the most accurate algorithms/models for predictive analytics of the observed data (i.e., model mining). The method relies on iterative subsampling, combines function optimization and statistical inference, and generates ensemble predictions for observed univariate outcomes. The novelty of this study is highlighted by a new and expanded set of CBDA features, including (1) efficiently handling extremely large datasets (>100,000 cases and >1,000 features); (2) generalizing the internal and external validation steps; (3) expanding the set of base-learners for joint ensemble prediction; (4) introducing an automated selection of CBDA specifications; and (5) providing mechanisms to assess CBDA convergence, evaluate the prediction accuracy, and measure result consistency. To ground the mathematical model and the corresponding computational algorithm, CBDA 2.0 validation utilizes synthetic datasets as well as a population-wide census-like study.
Specifically, an empirical validation of the CBDA technique is based on translational health research using a large-scale clinical study (UK Biobank), which includes imaging, cognitive, and clinical assessment data. The UK Biobank archive presents several difficult challenges related to the aggregation, harmonization, modeling, and interrogation of the information. These problems are related to the complex longitudinal structure, variable heterogeneity, feature multicollinearity, incongruency, and missingness, as well as violations of classical parametric assumptions. Our results show the scalability, efficiency, and usability of CBDA to interrogate complex data and turn it into structural information leading to derived knowledge and translational action. Applying CBDA 2.0 to the UK Biobank case study allows predicting various outcomes of interest, e.g., mood disorders and irritability, and suggests new and exciting avenues of evidence-based research in the context of identifying, tracking, and treating mental health and aging-related diseases. Following open-science principles, we share the entire end-to-end protocol, source code, and results. This facilitates independent validation, result reproducibility, and team-based collaborative discovery.

SUBMITTER: Marino S

PROVIDER: S-EPMC7455041 | biostudies-literature | 2020

REPOSITORIES: biostudies-literature

Publications

Compressive Big Data Analytics: An ensemble meta-algorithm for high-dimensional multisource datasets.

Marino S, Zhao Y, Zhou N, Zhou Y, Toga AW, Zhao L, Jian Y, Yang Y, Chen Y, Wu Q, Wild J, Cummings B, Dinov ID

PLoS One, 2020-08-28, issue 8

PMID: 32857775

Similar Datasets

Project description: Introduction: Estimating PM2.5 concentrations and their prediction uncertainties at a high spatiotemporal resolution is important for air pollution health effect studies. This is particularly challenging for California, which has high variability in natural (e.g., wildfires, dust) and anthropogenic emissions, meteorology, topography (e.g., desert surfaces, mountains, snow cover) and land use. Methods: Using ensemble-based deep learning with big data fused from multiple sources we developed a PM2.5 prediction model with uncertainty estimates at a high spatial (1 km × 1 km) and temporal (weekly) resolution for a 10-year time span (2008-2017). We leveraged autoencoder-based full residual deep networks to model complex nonlinear interrelationships among PM2.5 emission, transport and dispersion factors and other influential features. These included remote sensing data (MAIAC aerosol optical depth (AOD), normalized difference vegetation index, impervious surface), MERRA-2 GMI Replay Simulation (M2GMI) output, wildfire smoke plume dispersion, meteorology, land cover, traffic, elevation, and spatiotemporal trends (geo-coordinates, temporal basis functions, time index). As one of the primary predictors of interest with substantial missing data in California related to bright surfaces, cloud cover and other known interferences, missing MAIAC AOD observations were imputed and adjusted for relative humidity and vertical distribution. Wildfire smoke contribution to PM2.5 was also calculated through HYSPLIT dispersion modeling of smoke emissions derived from MODIS fire radiative power using the Fire Energetics and Emissions Research version 1.0 model. Results: Ensemble deep learning to predict PM2.5 achieved an overall mean training RMSE of 1.54 μg/m³ (R²: 0.94) and test RMSE of 2.29 μg/m³ (R²: 0.87).
The top predictors included M2GMI carbon monoxide mixing ratio in the bottom layer, temporal basis functions, spatial location, air temperature, MAIAC AOD, and PM2.5 sea salt mass concentration. In an independent test using three long-term AQS sites and one short-term non-AQS site, our model achieved a high correlation (>0.8) and a low RMSE (<3 μg/m³). Statewide predictions indicated that our model can capture the spatial distribution and temporal peaks in wildfire-related PM2.5. The coefficient of variation indicated highest uncertainty over deciduous and mixed forests and open water land covers. Conclusion: Our method can be generalized to other regions, including those having a mix of major urban areas, deserts, intensive smoke events, snow cover and complex terrains, where PM2.5 has previously been challenging to predict. Prediction uncertainty estimates can also inform further model development and measurement error evaluations in exposure and health studies.

Project description: Introduction: Migraine is a common and debilitating pain disorder associated with dysfunction of the central nervous system. Advanced magnetic resonance imaging (MRI) studies have reported relevant pathophysiologic states in migraine. However, its molecular mechanistic processes are still poorly understood in vivo. This study examined migraine patients with a novel machine learning (ML) method based on their central μ-opioid and dopamine D2/D3 profiles, the most critical neurotransmitters in the brain for pain perception and its cognitive-motivational interface. Methods: We employed Compressive Big Data Analytics (CBDA) to identify migraineurs and healthy controls (HC) in a large positron emission tomography (PET) dataset. 198 PET volumes were obtained from 38 migraineurs and 23 HC during rest and thermal pain challenge. 61 subjects were scanned with the selective μ-opioid receptor (μOR) radiotracer [11C]Carfentanil, and 22 with the selective dopamine D2/D3 receptor (DOR) radiotracer [11C]Raclopride. PET scans were recast into a 1D array of 510,340 voxels with spatial and intensity filtering of non-displaceable binding potential (BPND), representing the receptor availability level. We then performed data reduction and CBDA to power rank the predictive brain voxels. Results: CBDA classified migraineurs from HC with accuracy, sensitivity, and specificity above 90% for whole-brain and region-of-interest (ROI) analyses. The most predictive ROIs for μOR were the insula (anterior), thalamus (pulvinar, medial-dorsal, and ventral lateral/posterior nuclei), and the putamen. The latter, putamen (anterior), was also the most predictive for migraine regarding DOR D2/D3 BPND levels. Discussion: CBDA of endogenous μ-opioid and D2/D3 dopamine dysfunctions in the brain can accurately identify a migraine patient based on their receptor availability across key sensory, motor, and motivational processing regions.
Our ML-based findings in the migraineur's brain neurotransmission partly explain the severe impact of migraine suffering and associated neuropsychiatric comorbidities.
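
For readers unfamiliar with the three classification metrics quoted in the project description above, the following minimal sketch shows how accuracy, sensitivity, and specificity are computed from binary labels. The function name and the label vectors are made up for illustration and are not data from the study.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, sensitivity, and specificity for 0/1 labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # share of positives correctly identified
        "specificity": tn / (tn + fp),  # share of negatives correctly identified
    }


if __name__ == "__main__":
    # Hypothetical labels: 1 = patient, 0 = healthy control.
    y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
    print(binary_metrics(y_true, y_pred))
```

Reporting all three matters when the classes are unbalanced, since a high accuracy alone can hide poor detection of the smaller class.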
