Dataset Information

Optimal distribution-preserving downsampling of large biomedical data sets (opdisDownsampling).

ABSTRACT:

Motivation

The size of today's biomedical data sets pushes computer equipment to its limits, even for seemingly standard analysis tasks such as data projection or clustering. Reducing large biomedical data by downsampling is therefore a common early step in data processing, often performed as random uniform class-proportional downsampling. In this report, we hypothesized that this can be optimized to obtain samples that better reflect the entire data set than those obtained using the current standard method.

Results

By repeating the random sampling and comparing the distribution of each drawn sample with the distribution of the original data, it was possible to establish a method for obtaining data subsets that reflect the entire data set better than the first randomly selected subsample, which is what the current standard method uses. Experiments on artificial and real biomedical data sets showed that the remaining data of the original data set could be reconstructed significantly better from the downsampled data. This was observed with both principal component analysis and autoencoding neural networks. The fidelity depended on both the number of cases drawn from the original data and the number of samples drawn.

Conclusions

Optimal distribution-preserving class-proportional downsampling yields data subsets that reflect the structure of the entire data better than those obtained with the standard method. By using distributional similarity as the only selection criterion, the proposed method does not in any way affect the results of a later planned analysis.
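
The repeated-sampling idea described in the abstract can be illustrated with a minimal Python sketch. It is not the opdisDownsampling implementation: the function names, the parameters `fraction`, `n_trials`, and `seed`, and the use of the worst per-variable two-sample Kolmogorov-Smirnov statistic as the measure of distributional similarity are all assumptions made for illustration.

```python
import numpy as np


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))


def class_proportional_indices(labels, fraction, rng):
    """Draw the same fraction of row indices from every class, uniformly at random."""
    picked = []
    for cls in np.unique(labels):
        members = np.flatnonzero(labels == cls)
        k = max(1, int(round(fraction * members.size)))
        picked.append(rng.choice(members, size=k, replace=False))
    return np.sort(np.concatenate(picked))


def distribution_preserving_downsample(data, labels, fraction=0.1, n_trials=100, seed=0):
    """Repeat class-proportional sampling and keep the trial closest to the full data.

    Closeness is an assumed criterion: the worst (largest) per-variable KS
    statistic between the sample and the full data set; smaller is better.
    """
    rng = np.random.default_rng(seed)
    best_idx, best_score = None, np.inf
    for _ in range(n_trials):
        idx = class_proportional_indices(labels, fraction, rng)
        score = max(ks_statistic(data[idx, j], data[:, j]) for j in range(data.shape[1]))
        if score < best_score:
            best_idx, best_score = idx, score
    return best_idx, best_score


# Hypothetical usage on synthetic data: one draw versus the best of 100 draws.
demo_rng = np.random.default_rng(1)
demo_data = demo_rng.normal(size=(500, 3))
demo_labels = np.repeat([0, 1], [400, 100])
_, single_score = distribution_preserving_downsample(demo_data, demo_labels, n_trials=1)
_, best_score = distribution_preserving_downsample(demo_data, demo_labels, n_trials=100)
print(single_score, best_score)
```

With `n_trials=1` the sketch reduces to ordinary random class-proportional downsampling. Because both calls use the same seed, the 100-trial run starts with the same first draw, so its score can never be worse than the single-draw score. The selection uses only distributional similarity and no information about any downstream analysis.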

SUBMITTER: Lötsch J

PROVIDER: S-EPMC8341664 | biostudies-literature | 2021

REPOSITORIES: biostudies-literature

Publications

Optimal distribution-preserving downsampling of large biomedical data sets (opdisDownsampling).

Lötsch J, Malkusch S, Ultsch A

PLoS One, 2021-08-05, issue 8

PMID: 34352006

Similar Datasets

Project description:

Background

Multi-dimensional scaling (MDS) aims to represent high-dimensional data in a low-dimensional space while preserving the similarities between data points. This reduction in dimensionality is crucial for analyzing and revealing the genuine structure hidden in the data. For noisy data, dimension reduction can effectively reduce the effect of noise on the embedded structure. For large data sets, dimension reduction can effectively reduce information retrieval complexity. Thus, MDS techniques are used in many applications of data mining and gene network research. However, although a number of studies have applied MDS techniques to genomics research, the number of analyzed data points was restricted by the high computational complexity of MDS. In general, a non-metric MDS method is faster than a metric MDS, but it does not preserve the true relationships. The computational complexity of most metric MDS methods is over O(N^2), so it is difficult to process a data set with a large number of genes N, as in the case of whole-genome microarray data.

Results

We developed a new rapid metric MDS method with a low computational complexity, making metric MDS applicable to large data sets. Computer simulation showed that the new method, split-and-combine MDS (SC-MDS), is fast, accurate and efficient. Our empirical studies using microarray data on the yeast cell cycle showed that the performance of K-means in the reduced dimensional space is similar to or slightly better than that of K-means in the original space, while the clustering results are obtained about three times faster. Our clustering results using SC-MDS are more stable than those in the original space. Hence, the proposed SC-MDS is useful for analyzing whole-genome data.

Conclusion

Our new method reduces the computational complexity from O(N^3) to O(N) when the dimension of the feature space is far less than the number of genes N, and it successfully reconstructs the low-dimensional representation, as does classical MDS. Its performance depends on the grouping method and the minimal number of intersection points between groups. Feasible grouping methods are suggested; each group must contain both neighboring and far-apart data points. Our method can represent large, high-dimensional data sets in a low-dimensional space not only efficiently but also effectively.

Project description:

Background

Three-dimensional biomedical image sets are becoming ubiquitous, along with the canonical atlases providing the necessary spatial context for analysis. To make full use of these 3D image sets, one must be able to present views for 2D display, either surface renderings or 2D cross-sections through the data. Typical display software is limited to presentations along one of the three orthogonal anatomical axes (coronal, horizontal, or sagittal). However, data sets precisely oriented along the major axes are rare. To make fullest use of these data sets, one must reasonably match the atlas' orientation; this involves resampling the atlas in planes matched to the data set. Traditionally, this requires that the atlas and browser reside on the user's desktop; unfortunately, in addition to being monolithic programs, these tools often require substantial local resources. In this article, we describe a network-capable, client-server framework to slice and visualize 3D atlases at off-axis angles, along with an open client architecture and development kit to support integration into complex data analysis environments.

Results

Here we describe the basic architecture of a client-server 3D visualization system, consisting of a thin Java client built on a development kit, and a computationally robust, high-performance server written in ANSI C++. The Java client components (NetOStat) support arbitrary-angle viewing and run as a downloadable Java application on readily available desktop computers running Mac OS X, Windows XP, or Linux. Using the NeuroTerrain Software Development Kit (NT-SDK), sophisticated atlas browsing can be added to any Java-compatible application with as little as 50 lines of Java glue code, making it eminently reusable and much more accessible to programmers building more complex biomedical data analysis tools. The NT-SDK separates the interactive GUI components from the server control and monitoring, so as to support development of non-interactive applications. The server implementation takes full advantage of a data center's high-performance hardware, where it can be co-located with centrally located 3D data set repositories, extending access to the research community throughout the Internet.

Conclusion

The combination of an optimized server and a modular, platform-independent client provides an ideal environment for viewing complex 3D biomedical data sets, taking full advantage of high-performance servers to prepare images and subsets of associated metadata for viewing, as well as the graphical capabilities of Java to display the data.

