Project description:We designed a metaproteomic analysis method (ComPIL) to accommodate the ever-increasing number of sequences against which experimental shotgun proteomics spectra can be accurately and rapidly queried. Our objective was to create large databases for the analysis of complex metasamples of unknown composition, including those derived from human, animal, and environmental microbiomes. The amount of high-throughput sequencing data has increased substantially since our original database was assembled in 2014. Here, we present a rebuild of the ComPIL libraries composed of updated, publicly disseminated sequence data, as well as a modified version of the search engine ProLuCID-ComPIL optimized for querying experimental spectra. ComPIL 2.0 consists of 113 million protein records and roughly 4.8 billion unique tryptic peptide sequences and is 2.3 times the size of our original version. We searched a data set collected from a healthy human gut microbiome proteomic sample and compared the results, which showed that ComPIL 2.0 substantially increased the number of unique identified peptides and proteins relative to the first ComPIL version. The high-confidence, accurate protein identification demonstrated with ComPIL 2.0 may encourage the method's application to large-scale proteomic annotation of complex protein systems.
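The notion of a "unique tryptic peptide" index can be illustrated with a minimal in-silico digestion sketch. It assumes the common trypsin rule (cleave after K or R unless the next residue is P), no missed cleavages, and a hypothetical `min_length` filter. None of these settings is stated in the abstract, and the function names are illustrative, not part of ComPIL.

```python
def tryptic_peptides(sequence, min_length=6):
    """Digest one protein in silico.

    Assumed rule: cleave after K or R unless the next residue is P,
    with no missed cleavages. min_length is a hypothetical filter.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # C-terminal peptide that does not end in K or R
        peptides.append(sequence[start:])
    return [p for p in peptides if len(p) >= min_length]


def unique_peptides(proteins, min_length=6):
    """Collect the set of distinct peptides across many protein sequences."""
    seen = set()
    for sequence in proteins:
        seen.update(tryptic_peptides(sequence, min_length))
    return seen
```

A set is used so that a peptide shared by several proteins is counted once, which is what makes the peptide count smaller than the sum over all proteins.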
Project description:ADP-ribosylation is a protein modification responsible for biological processes such as DNA repair, RNA regulation, cell cycle and biomolecular condensate formation. Dysregulation of ADP-ribosylation is implicated in cancer, neurodegeneration and viral infection. We developed ADPriboDB (adpribodb.leunglab.org) to facilitate studies that uncover the mechanisms and biological significance of ADP-ribosylation. ADPriboDB 2.0 serves as a one-stop repository comprising 48 346 entries and 9097 ADP-ribosylated proteins, of which 6708 were newly identified since the original database release. In this updated version, we provide information regarding the sites of ADP-ribosylation in 32 946 entries. This wealth of information allows us to interrogate existing databases or newly available data. For example, we found that ADP-ribosylated substrates are significantly associated with the recently identified human protein interaction networks associated with SARS-CoV-2, a virus that encodes a conserved protein domain, the macrodomain, that binds and removes ADP-ribosylation. In addition, we created a new interactive tool to visualize the local context of ADP-ribosylation, such as structural and functional features as well as other post-translational modifications (e.g. phosphorylation, methylation and ubiquitination). This information provides opportunities to explore the biology of ADP-ribosylation and generate new hypotheses for experimental testing.
Project description:Identification and annotation of the mutations involved in oncogenesis and tumor progression are crucial for both cancer biology and clinical applications. Previously, we developed a public resource, CanProVar, a human cancer proteome variation database for storing and querying single amino acid alterations in human cancers. Since the publication of CanProVar, extensive cancer genomics efforts have revealed the enormous genomic complexity of various types of human cancers. Thus, there is a pressing need to annotate these genomic alterations comprehensively at the protein level and to make such knowledge easily accessible. Here, we describe CanProVar 2.0, a significantly expanded version of CanProVar, in which the number of cancer-related variations and noncancer-specific variations has increased about 10-fold compared with the previous version. To facilitate the interpretation of the variations, we added to the database functional data on the potential impact of the cancer-related variations on 3D protein interaction and on the differential expression of the variant-bearing proteins between cancer and normal samples. The web interface allows for flexible queries based on gene or protein IDs, cancer types, chromosome locations, or pathways. An integrated protein sequence database containing the variations, which can be used directly for proteomics database searching, is available for download.
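To make the idea of a variation-containing sequence database concrete, the sketch below applies one single amino acid alteration to a reference protein sequence. The function name, its arguments, and the 1-based position convention are assumptions for illustration; the abstract does not describe how CanProVar builds its sequence file.

```python
def apply_variant(sequence, position, ref, alt):
    """Return a copy of `sequence` with one amino acid substituted.

    position is assumed to be 1-based; ref is the expected reference
    residue and alt is the replacement residue.
    """
    index = position - 1
    if not 0 <= index < len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    if sequence[index] != ref:
        raise ValueError(
            f"expected {ref} at position {position}, found {sequence[index]}"
        )
    return sequence[:index] + alt + sequence[index + 1:]
```

Checking the reference residue before substituting guards against variants annotated on a different isoform or sequence version.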
Project description:The study of palaeo-chronologies using fossil data provides evidence for past ecological and evolutionary processes, and is therefore useful for predicting patterns and impacts of future environmental change. However, the robustness of inferences made from fossil ages relies heavily on both the quantity and quality of available data. We compiled Quaternary non-human vertebrate fossil ages from Sahul published up to 2013. This, the FosSahul database, includes 9,302 fossil records from 363 deposits, for a total of 478 species within 215 genera, of which 27 are from extinct and extant megafaunal species (2,559 records). We also provide a rating of reliability of individual absolute age based on the dating protocols and association between the dated materials and the fossil remains. Our proposed rating system identified 2,422 records with high-quality ages (i.e., a reduction of 74%). There are many applications of the database, including disentangling the confounding influences of hypothetical extinction drivers, better spatial distribution estimates of species relative to palaeo-climates, and potentially identifying new areas for fossil discovery.
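The quoted 74% reduction follows from the two record counts in the abstract (2,422 high-quality records out of 9,302). A short check, with a hypothetical helper name:

```python
def percent_reduction(total, kept):
    """Percentage of records removed when only `kept` of `total` remain."""
    return 100.0 * (total - kept) / total


# Counts taken from the abstract: 9,302 records, 2,422 rated high quality.
reduction = percent_reduction(9302, 2422)
print(f"{reduction:.0f}% of records are excluded by the quality rating")
```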
Project description:Disulphide bonds are stabilizing crosslinks in proteins and serve to enhance their thermal stability. In proteins that are small and rich in disulphide bonds, these bonds could be the major determining factor for the choice of conformational state, since their constraints on backbone conformation can be substantial. Such crosslinks and their positional conservation could themselves enable protein family and functional association. Despite the importance of the field, there is no comprehensive, publicly available database on disulphide crosslinks. Herein we provide information on disulphides in DSDBASE2.0, an updated and significantly expanded database of native and modelled disulphides that is freely available, fully annotated and manually curated. The web interface also provides several useful computational tools that have been specifically developed for proteins containing disulphide crosslinks. The modelling of disulphide crosslinks is performed using stereochemical criteria, coded within our Modelling of Disulphides in Proteins (MODIP) algorithm. The inclusion of modelled disulphides potentially enhances the loop database substantially, thereby permitting the recognition of compatible polypeptide segments that could serve as templates for immediate modelling. The DSDBASE2.0 database has been updated to include 153,944 PDB entries, 216,096 native and 20,153,850 modelled disulphide bond segments from the PDB January 2021 release. The current database also provides a user-friendly search for loops containing multiple disulphide bonds, along with annotation of their function using GO and the subcellular localization of the query. Furthermore, it is possible to obtain three-dimensional models of disulphide-rich small proteins using an independent algorithm, RANMOD, that generates and examines random but allowed backbone conformations of the polypeptide. DSDBASE2.0 remains the largest open-access repository that organizes all disulphide bonds of proteins on a single platform. The database can be accessed from http://caps.ncbs.res.in/dsdbase2.
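A stereochemical screen for residue pairs that could host a modelled disulphide can be sketched as a pair of distance checks. This is not the MODIP implementation: the cutoffs `CA_MAX` and `CB_MAX`, the dictionary layout, and the function name are assumptions chosen for illustration, and the real criteria may include further geometric terms.

```python
import math

# Illustrative cutoffs in angstroms; assumed values, not taken from the abstract.
CA_MAX = 7.0
CB_MAX = 4.5


def could_bridge(res_a, res_b, ca_max=CA_MAX, cb_max=CB_MAX):
    """Check whether two residues are close enough to model a disulphide.

    Each residue is a dict with "CA" and "CB" keys holding (x, y, z) tuples.
    """
    ca_distance = math.dist(res_a["CA"], res_b["CA"])
    cb_distance = math.dist(res_a["CB"], res_b["CB"])
    return ca_distance <= ca_max and cb_distance <= cb_max
```

Applying a cheap distance filter first keeps the number of residue pairs that need full geometric evaluation small.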
Project description:MoonDB 2.0 (http://moondb.hb.univ-amu.fr/) is a database of predicted and manually curated extreme multifunctional (EMF) and moonlighting proteins, i.e. proteins that perform multiple unrelated functions. We have previously shown that such proteins can be predicted through the analysis of their molecular interaction subnetworks, their functional annotations and their association with distinct groups of proteins that are involved in unrelated functions. In MoonDB 2.0, we updated the set of human EMF proteins (238 proteins), using the latest functional annotations and protein-protein interaction networks. Furthermore, for the first time, we applied our method to four additional model organisms (mouse, fly, worm and yeast) and identified 54 novel EMF proteins in these species. In addition to novel predictions, this update contains 63 human and yeast proteins that were manually curated from the literature, including descriptions of moonlighting functions and associated references. Importantly, MoonDB's interface was fully redesigned and improved, and its entries are now cross-referenced in the UniProt Knowledgebase (UniProtKB). MoonDB will be updated once a year with the novel EMF candidates calculated from the latest available protein interactions and functional annotations.
Project description:ICEberg 2.0 (http://db-mml.sjtu.edu.cn/ICEberg/) is an updated database that provides comprehensive information about bacterial integrative and conjugative elements (ICEs). Compared with the previous version, three major improvements were made. First, with the aid of text mining and manual curation, it now records the details of 1032 ICEs, including 270 with experimental support and 762 from bioinformatics prediction. Second, as increasing evidence has shown that ICEs frequently mobilize so-called 'hitchhikers', such as integrative and mobilizable elements (IMEs) and cis-mobilizable elements (CIMEs), 83 known transfer interactions between 49 IMEs and 7 CIMEs with 19 ICEs taken from the literature were included and illustrated with visually intuitive directed graphs. An expanded collection of 260 chromosome-borne IMEs and 235 CIMEs was also added. Finally, ICEberg 2.0 provides an online tool, ICEfinder, to predict ICEs or IMEs in bacterial genome sequences. It combines a similarity search for the integrase, relaxase and/or type IV secretion system with a check for the co-localization of the corresponding homologous genes. With the recent updates, ICEberg 2.0 might provide better support for understanding the biological traits of ICEs, especially as their interaction with cognate mobilizable elements may further promote horizontal gene flow.
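The co-localization step described for ICEfinder can be sketched as a window scan over annotated homology hits. This is a hypothetical illustration, not ICEfinder's code: the hit format `(kind, start, end)`, the default `required` set, and the `max_span` window size are all assumptions.

```python
def colocalized_regions(hits, required=("integrase", "relaxase", "t4ss"),
                        max_span=100_000):
    """Find genomic windows in which every required gene type has a hit.

    hits: list of (kind, start, end) tuples on one replicon.
    Returns (region_start, region_end) for each window that starts at a
    hit and spans at most max_span bases (an assumed, illustrative limit).
    """
    ordered = sorted(hits, key=lambda hit: hit[1])
    regions = []
    for i, (_, window_start, _) in enumerate(ordered):
        kinds_seen = set()
        window_end = window_start
        for kind, start, end in ordered[i:]:
            if start - window_start > max_span:
                break  # hits are sorted by start, so later ones are farther away
            if end - window_start > max_span:
                continue  # this hit extends past the window
            kinds_seen.add(kind)
            window_end = max(window_end, end)
            if kinds_seen.issuperset(required):
                regions.append((window_start, window_end))
                break
    return regions
```

Because the abstract says "integrase, relaxase and/or type IV secretion system", a caller could pass a smaller `required` tuple to relax the rule.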
Project description:In prokaryotes, protein phosphorylation plays a critical role in regulating a broad spectrum of biological processes and occurs mainly on serine (S), threonine (T), tyrosine (Y), arginine (R), aspartic acid (D), histidine (H) and cysteine (C) residues of protein substrates. Through literature curation and public database integration, we report here an updated database of phosphorylation sites (p-sites) in prokaryotes (dbPSP 2.0) that contains 19,296 experimentally identified p-sites in 8,586 proteins from 200 prokaryotic organisms, which belong to 12 phyla of two kingdoms, bacteria and archaea. To carefully annotate these phosphoproteins and p-sites, we integrated knowledge from 88 publicly available resources that cover 9 aspects, namely, taxonomy annotation, genome annotation, function annotation, transcriptional regulation, sequence and structure information, family and domain annotation, interaction, orthologous information and biological pathway. In contrast to version 1.0 (~30 MB), dbPSP 2.0 contains ~9 GB of data, a 300-fold increase in volume. We anticipate that dbPSP 2.0 can serve as a useful data resource for further investigating phosphorylation events in prokaryotes. dbPSP 2.0 is freely accessible to all users at: http://dbpsp.biocuckoo.cn.
Project description:A large number of differentially expressed proteins (DEPs) have been identified in various cancer proteomics experiments. Curation and annotation of these proteins are important for deciphering their roles in oncogenesis and tumor progression, and may further help to discover potential protein biomarkers for clinical applications. In 2009, we published the first database of DEPs in human cancers (dbDEPC). In this 2011 update, dbDEPC 2.0 has more than doubled in size to over 4000 protein entries, curated from 331 experiments across 20 types of human cancers. This resource allows researchers to check whether their proteins of interest have been reported as changing in certain cancers, to compare their own proteomic discoveries with previous studies, to display an expression heatmap of selected proteins across multiple cancers and to relate protein expression changes to aberrations at other genetic levels. Important new developments include the addition of experiment design information, advanced filter tools for user-specified analysis and a network analysis tool. We expect dbDEPC 2.0 to be a much more powerful tool than its first release and to serve as a reference for both proteomics and cancer researchers. dbDEPC 2.0 is available at http://lifecenter.sgst.cn/dbdepc/index.do.
Project description:TADB2.0 (http://bioinfo-mml.sjtu.edu.cn/TADB2/) is an updated database that provides comprehensive information about bacterial type II toxin-antitoxin (TA) loci. Compared with the previous version, the database has been refined and a new data schema is employed. With the aid of text mining and manual curation, it records 6193 type II TA loci in 870 replicons of bacteria and archaea, including 105 experimentally validated TA loci. In addition, the newly developed tool TAfinder combines homolog searches with operon structure detection, allowing the prediction of type II TA pairs in bacterial genome sequences. It also helps to investigate the genomic context of predicted TA loci for putative virulence factors, antimicrobial resistance determinants and mobile genetic elements via alignments to specific public databases. Additionally, the module TAfinder-Compare allows the presence of given TA loci to be compared across closely related genomes. With the recent updates, TADB2.0 might provide better support for understanding the important roles of type II TA systems in prokaryotic life activities.
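The operon structure detection mentioned for TAfinder can be sketched as a search for adjacent, same-strand gene pairs separated by a short gap. This is a hypothetical illustration rather than TAfinder's implementation: the gene tuple format and the `max_gap` and `max_overlap` defaults are assumed values, and the homolog search step is omitted.

```python
def candidate_pairs(genes, max_gap=150, max_overlap=20):
    """List adjacent gene pairs that could form a two-gene operon.

    genes: list of (name, start, end, strand) tuples on one replicon.
    A pair qualifies when both genes are on the same strand and the
    distance between them lies in [-max_overlap, max_gap] bases
    (assumed, illustrative thresholds).
    """
    ordered = sorted(genes, key=lambda gene: gene[1])
    pairs = []
    for left, right in zip(ordered, ordered[1:]):
        gap = right[1] - left[2]  # negative when the genes overlap
        if left[3] == right[3] and -max_overlap <= gap <= max_gap:
            pairs.append((left[0], right[0]))
    return pairs
```

In a fuller pipeline, only pairs in which at least one gene has a toxin or antitoxin homolog would be kept as predicted TA loci.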