Dataset Information

Efficient Database Search via Tensor Distribution Bucketing

ABSTRACT: In mass spectrometry-based proteomics, one needs to search billions of mass spectra against the human proteome with billions of amino acids, where many of the amino acids go through post-translational modifications. To account for novel modifications, all the spectra must be searched against all the peptides using a joint probabilistic model that can be learned from training data. For M spectra and N possible peptides, current state-of-the-art search methods have a runtime of O(MN). Here, we propose a novel bucketing method that sends pairs with high likelihood under the joint probabilistic model to the same bucket with higher probability than pairs with low likelihood. We demonstrate that the runtime of this method grows sub-linearly with the data size, and our results show that our method is orders of magnitude faster than methods from the locality sensitive hashing literature. Electronic supplementary material: The online version of this chapter (10.1007/978-3-030-47436-2_26) contains supplementary material, which is available to authorized users.
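
The general idea of bucketing, as opposed to scoring all M x N pairs, can be illustrated with a minimal sketch. This is not the paper's tensor distribution bucketing: it uses generic min-hash banding, a technique from the locality sensitive hashing family that the abstract names as its baseline. The function names, the encoding of spectra and peptides as non-empty sets of non-negative integer features (for example, binned peak masses), and the parameter values are all assumptions made for illustration.

```python
import random
from collections import defaultdict

PRIME = 2_147_483_647  # Mersenne prime 2^31 - 1; features are assumed smaller than this


def make_hashers(num_bands, rows_per_band, seed=0):
    """Build seeded (a, b) coefficients for hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [
        [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(rows_per_band)]
        for _ in range(num_bands)
    ]


def bucket_keys(features, hashers):
    """Return one bucket key per band: the tuple of min-hash values in that band."""
    keys = []
    for band_index, band in enumerate(hashers):
        signature = tuple(min((a * f + b) % PRIME for f in features) for a, b in band)
        keys.append((band_index, signature))
    return keys


def candidate_pairs(spectra, peptides, num_bands=8, rows_per_band=2, seed=0):
    """Return (spectrum_id, peptide_id) pairs that share at least one bucket.

    Only these candidates would later be scored with the expensive model,
    instead of all len(spectra) * len(peptides) pairs.
    """
    hashers = make_hashers(num_bands, rows_per_band, seed)
    buckets = defaultdict(list)
    for pep_id, feats in peptides.items():
        for key in bucket_keys(feats, hashers):
            buckets[key].append(pep_id)
    pairs = set()
    for spec_id, feats in spectra.items():
        for key in bucket_keys(feats, hashers):
            for pep_id in buckets.get(key, ()):
                pairs.add((spec_id, pep_id))
    return pairs


# Hypothetical toy data: integer features stand in for binned peak masses.
example_spectra = {"s1": {101, 215, 340}}
example_peptides = {"p1": {101, 215, 340}, "p2": {900, 905, 910}}
example_pairs = candidate_pairs(example_spectra, example_peptides)
```

In this toy example the spectrum and peptide with identical feature sets always land in the same bucket, while the peptide with a disjoint feature set never does, so only one of the two possible pairs needs to be scored. How well such a scheme separates likely from unlikely pairs on real spectra depends on the feature encoding and on the band and row settings, which are not taken from the source.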

SUBMITTER: Lauw H

PROVIDER: S-EPMC7206332 | biostudies-literature | 2020 Apr

REPOSITORIES: biostudies-literature

Similar Datasets

Diffusion tensor distribution imaging.

Project description:Conventional diffusion MRI yields voxel-averaged parameters that suffer from ambiguities for heterogeneous anisotropic materials such as brain tissue. Using principles from solid-state NMR spectroscopy, we have previously introduced the shape of the diffusion encoding tensor as a separate acquisition dimension that disentangles isotropic and anisotropic contributions to the observed diffusivities, thereby allowing for unconstrained data inversion into diffusion tensor distributions with "size," "shape," and orientation dimensions. Here we combine our recent non-parametric data inversion algorithm and data acquisition protocol with an imaging pulse sequence to demonstrate spatial mapping of diffusion tensor distributions using a previously developed composite phantom with multiple isotropic and anisotropic components. We propose a compact format for visualizing two-dimensional arrays of the distributions, new scalar parameters quantifying intra-voxel heterogeneity, and a binning procedure giving maps of all relevant parameters for each of the components resolved in the multidimensional distribution space.

| S-EPMC6593682 | biostudies-literature

BS-CP: Efficient streaming Bayesian tensor decomposition method via assumed density filtering.

Project description:Tensor data is common in real-world applications, such as recommendation systems and air quality monitoring. But such data is often sparse, noisy, and rapidly produced. CANDECOMP/PARAFAC (CP) is a popular tensor decomposition model, which is both theoretically advantageous and numerically stable. However, learning the CP model in a Bayesian framework, though promising for handling data sparsity and noise, is computationally challenging, especially with rapidly produced data streams. The paper mainly addresses the efficient processing of streaming tensor data. In this work, we propose BS-CP, a quick and accurate structure to dynamically update the posterior of latent factors when a new observation tensor is received. We first present the BS-CP1 algorithm, which is an efficient implementation using assumed density filtering (ADF). In addition, we propose the BS-CP2 algorithm, which uses the Gauss-Laguerre quadrature method to integrate the noise effect and shows better empirical results. We tested BS-CP1 and BS-CP2 on generic real recommendation system datasets, including Beijing-15k, Beijing-20k, MovieLens-1m and Fit Record. Compared with state-of-the-art methods, BS-CP1 achieves 31.8% and 33.3% RMSE improvement on the last two datasets, with a similar trend observed for BS-CP2. This evidence proves that our algorithm has better results on large datasets and is more suitable for real-world scenarios. Compared with most other comparison methods, our approach has demonstrated an improvement of over 10% and exhibits superior stability.

| S-EPMC11611110 | biostudies-literature

Efficient and Robust Search of Microbial Genomes via Phylogenetic Compression.

Project description:Comprehensive collections approaching millions of sequenced genomes have become central information sources in the life sciences. However, the rapid growth of these collections makes it effectively impossible to search these data using tools such as BLAST and its successors. Here, we present a technique called phylogenetic compression, which uses evolutionary history to guide compression and efficiently search large collections of microbial genomes using existing algorithms and data structures. We show that, when applied to modern diverse collections approaching millions of genomes, lossless phylogenetic compression improves the compression ratios of assemblies, de Bruijn graphs, and k-mer indexes by one to two orders of magnitude. Additionally, we develop a pipeline for a BLAST-like search over these phylogeny-compressed reference data, and demonstrate it can align genes, plasmids, or entire sequencing experiments against all sequenced bacteria until 2019 on ordinary desktop computers within a few hours. Phylogenetic compression has broad applications in computational biology and may provide a fundamental design principle for future genomics infrastructure.

| S-EPMC10153118 | biostudies-literature

Global Identification of Protein PTMs in a Single-pass Database Search

2015-07-31 | GSE59956 | GEO

Cycloquest: identification of cyclopeptides via database search of their mass spectra against genome databases.

Project description:Hundreds of ribosomally synthesized cyclopeptides have been isolated from all domains of life, the vast majority having been reported in the last 15 years. Studies of cyclic peptides have highlighted their exceptional potential both as stable drug scaffolds and as biomedicines in their own right. Despite this, computational techniques for cyclopeptide identification are still in their infancy, with many such peptides remaining uncharacterized. Tandem mass spectrometry has occupied a niche role in cyclopeptide identification, taking over from traditional techniques such as nuclear magnetic resonance spectroscopy (NMR). MS/MS studies require only picogram quantities of peptide (compared to milligrams for NMR studies) and are applicable to complex samples, abolishing the requirement for time-consuming chromatographic purification. While database search tools such as Sequest and Mascot have become standard tools for the MS/MS identification of linear peptides, they are not applicable to cyclopeptides, due to the parent mass shift resulting from cyclization and different fragmentation patterns of cyclic peptides. In this paper, we describe the development of a novel database search methodology to aid in the identification of cyclopeptides by mass spectrometry and evaluate its utility in identifying two peptide rings from Helianthus annuus, a bacterial cannibalism factor from Bacillus subtilis, and a θ-defensin from Rhesus macaque.

| S-EPMC3242011 | biostudies-literature

Global Identification of Protein PTMs in a Single-pass Database Search

Project description:Bottom-up proteomics database search algorithms used for peptide identification cannot comprehensively identify posttranslational modifications (PTMs) in a single-pass because of high false discovery rates (FDRs). A new approach to database searching enables Global PTM (G-PTM) identification by exclusively looking for curated PTMs, thereby avoiding the FDR penalty experienced during conventional variable modification searches. We identified nearly 2500 unique, high-confidence modified peptides comprising 31 different PTM types in single-pass database searches. Male C57BL/6J (B6) and CAST/EiJ (CAST) mice were purchased from The Jackson Laboratories (Bar Harbor, Maine) and housed in an environmentally controlled vivarium at the University of Wisconsin Biochemistry Department. Mice were provided standard rodent chow (Purina no. 5008) and water ad libitum, and maintained on a 12-hour light/dark cycle (6 AM – 6 PM). At 10 weeks of age, mice were sacrificed by CO2 asphyxiation. All animal procedures were preapproved by the University of Wisconsin Animal Care and Use Committee.

2015-07-31 | E-GEOD-59956 | biostudies-arrayexpress

A new framework for MR diffusion tensor distribution.

Project description:The ability to characterize heterogeneous and anisotropic water diffusion processes within macroscopic MRI voxels non-invasively and in vivo is a desideratum in biology, neuroscience, and medicine. While an MRI voxel may contain approximately a microliter of tissue, our goal is to examine intravoxel diffusion processes on the order of picoliters. Here we propose a new theoretical framework and efficient experimental design to describe and measure such intravoxel structural heterogeneity and anisotropy. We assume that a constrained normal tensor-variate distribution (CNTVD) describes the variability of positive definite diffusion tensors within a voxel which extends its applicability to a wide range of b-values while preserving the richness of diffusion tensor distribution (DTD) paradigm unlike existing models. We introduce a new Monte Carlo (MC) scheme to synthesize realistic 6D DTD numerical phantoms and invert the MR signal. We show that the signal inversion is well-posed and estimate the CNTVD parameters parsimoniously by exploiting the different symmetries of the mean and covariance tensors of CNTVD. The robustness of the estimation pipeline is assessed by adding noise to calculated MR signals and compared with the ground truth. A family of invariant parameters and glyphs which characterize microscopic shape, size and orientation heterogeneity within a voxel are also presented.

| S-EPMC7854653 | biostudies-literature

In depth search of the Sequence Read Archive database reveals global distribution of the emerging pathogenic fungus Scedosporium aurantiacum.

Project description:Scedosporium species are emerging opportunistic fungal pathogens causing various infections mainly in immunocompromised patients, but also in immunocompetent individuals, following traumatic injuries. Clinical manifestations range from local infections, such as subcutaneous mycetoma or bone and joint infections, to pulmonary colonization and severe disseminated diseases. They are commonly found in soil and other environmental sources. To date S. aurantiacum has been reported only from a handful of countries. To identify the worldwide distribution of this species we screened publicly available sequencing data from fungal metabarcoding studies in the Sequence Read Archive (SRA) of The National Centre for Biotechnology Information (NCBI) by multiple BLAST searches. S. aurantiacum was found in 26 countries and two islands, throughout every climatic region. This distribution is like that of other Scedosporium species. Several new environmental sources of S. aurantiacum including human and bovine milk, chicken and canine gut, freshwater, and feces of the giant white-tailed rat (Uromys caudimaculatus) were identified. This study demonstrated that raw sequence data stored in the SRA database can be repurposed using a big data analysis approach to answer biological questions of interest. Lay summary: To understand the distribution and natural habitat of S. aurantiacum, species-specific DNA sequences were searched in the SRA database. Our large-scale data analysis illustrates that S. aurantiacum is more widely distributed than previously thought and new environmental sources were identified.

| S-EPMC8994208 | biostudies-literature

WGDB: Wood Gene Database with search interface.

Project description:Wood quality can be defined in terms of a particular end use and involves several traits. Over the last fifteen years, researchers have assessed wood quality traits in forest trees. Wood quality was categorized by cell wall biochemical traits and fibre properties, including microfibril angle, density and stiffness in loblolly pine [1]. A user-friendly, open-access database named Wood Gene Database (WGDB) has been developed to describe wood genes along with protein information and published research articles. It contains 720 wood genes from species namely Pinus and Deodar, and from fast-growing trees namely Poplar and Eucalyptus. WGDB is designed to encompass the majority of publicly accessible genes coding for cellulose, hemicellulose and lignin in tree species that are responsive to wood formation and quality. It is an interactive platform for collecting, managing and searching specific wood genes; it also enables data mining related to genomic information, specifically in Arabidopsis thaliana, Populus trichocarpa, Eucalyptus grandis, Pinus taeda, Pinus radiata, Cedrus deodara, Cedrus atlantica. For user convenience, the database is cross-linked with the public databases NCBI, EMBL and Dendrome and with the Google search engine, and it provides the bioinformatics tools BLAST and COBALT. Availability: The database is freely available at www.wgdb.in.

| S-EPMC3916818 | biostudies-literature

ProteoStorm: An Ultrafast Metaproteomics Database Search Framework.

Project description:Shotgun metaproteomics has the potential to reveal the functional landscape of microbial communities but lacks appropriate methods for complex samples with unknown compositions. In the absence of prior taxonomic information, tandem mass spectra would be searched against large pan-microbial databases, which requires heavy computational workload and reduces sensitivity. We present ProteoStorm, an efficient database search framework for large-scale metaproteomics studies, which identifies high-confidence peptide-spectrum matches (PSMs) while achieving a two-to-three orders-of-magnitude speedup over popular tools. A reanalysis of a urinary tract infection (UTI) dataset of 110 individuals revealed a complex pattern of polymicrobial expression, including sub-types of UTIs, cases of bacterial vaginosis, and evidence of no underlying disease. Importantly, compared to the initial UTI study that restricted the search database to a manually curated list of 20 genera, ProteoStorm identified additional genera that were previously unreported, including a case of infection with the rare pathogen Propionimicrobium.

| S-EPMC6231400 | biostudies-literature
