Project description:Dependent on concise, pre-defined protein sequence databases, traditional search algorithms perform poorly when analyzing mass spectra derived from wholly uncharacterized protein products. Conversely, de novo peptide sequencing algorithms can interpret mass spectra without relying on reference databases. However, such algorithms have been difficult to apply to complex protein mixtures, in part due to a lack of methods for automatically validating de novo sequencing results. Here, we present novel metrics for benchmarking de novo sequencing algorithm performance on large-scale proteomics datasets, and a method for accurately calibrating false discovery rates on de novo results. We also present a novel algorithm (LADS) that leverages experimentally disambiguated fragmentation spectra to boost sequencing accuracy and sensitivity. LADS improves sequencing accuracy on longer peptides relative to other algorithms and improves discriminability of correct and incorrect sequences. Using these advancements, we demonstrate accurate de novo identification of peptide sequences not identifiable using database search-based approaches.
Project description:Therapy-related acute myeloid leukemia (t-AML) is a severe complication of the cytotoxic therapy used for primary cancer treatment. The outcome of these patients is poor compared to that of patients who develop de novo acute myeloid leukemia (p-AML). Chromosome abnormalities in t-AML are partly dependent on the induction agent. Partial or total losses of chromosome 5 and/or 7 are observed after therapy with alkylating agents. Balanced translocations, most of which involve 11q23 with MLL rearrangement, are found after treatment with topoisomerase II inhibitors. Complex cases are also more frequent. The aim of this study was to compare t-AML to p-AML using high-resolution array CGH in order to identify gene-specific copy number abnormalities (CNA). Thirty t-AML versus thirty-six p-AML patient samples were studied. In t-AML, 99 CNAs were observed, with 63 losses and 36 gains and a mean number of 3.3 per case. In p-AML, 64 CNAs were observed, with 30 losses and 34 gains and a mean number of 1.78 per case. A few very complex cases (>8 chromosomal abnormalities) contributed considerably to the chromosomal burden in p-AML. Several minimal critical regions (MCR) that contain protein-coding and microRNA genes implicated in leukemogenesis were found in t-AML. On 7p15.2, a HOXA gene cluster involved in the processes of hematopoietic progenitor cell development and leukemogenesis was recurrently gained. Loss of a 5 Mb MCR located on 5q31.3q32 (142.91-148.19 Mb) was found distal to a previously described MCR; it harbored 29 genes. A 40 kb deleted MCR pointed to RUNX1 on 21q22, a gene coding for a transcription factor implicated in frequent rearrangements in leukemia and in familial thrombocytopenia with susceptibility to AML. Sequencing revealed no abnormality in three patients and a mutation in one patient, resulting in complete deficiency of RUNX1.
In de novo AML, a gain of 21q22 (38.41-39.36 Mb) harboring ERG and ETS2 was observed in two patients with very complex rearrangements.
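The per-case means reported in the description above follow directly from the stated loss and gain counts and cohort sizes. A minimal arithmetic check, where the helper name `mean_cna_per_case` is hypothetical and used only for illustration:

```python
def mean_cna_per_case(losses: int, gains: int, n_cases: int) -> float:
    """Mean number of copy number abnormalities (CNAs) per patient sample."""
    return (losses + gains) / n_cases


# Counts taken from the description: 30 t-AML and 36 p-AML samples.
t_aml_mean = mean_cna_per_case(losses=63, gains=36, n_cases=30)  # 99 CNAs
p_aml_mean = mean_cna_per_case(losses=30, gains=34, n_cases=36)  # 64 CNAs

print(f"t-AML: {t_aml_mean:.2f}, p-AML: {p_aml_mean:.2f}")
# prints "t-AML: 3.30, p-AML: 1.78"
```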
Project description:The Zika outbreak, spread by the Aedes aegypti mosquito, highlights the need to create high-quality assemblies of large genomes in a rapid and cost-effective fashion. Here, we combine Hi-C data with existing draft assemblies to generate chromosome-length scaffolds. We validate this method by assembling a human genome, de novo, from short reads alone (67X coverage, Sample GSM1551550). We then combine our method with draft sequences to create genome assemblies of the mosquito disease vectors Aedes aegypti and Culex quinquefasciatus, each consisting of three scaffolds corresponding to the three chromosomes in each species. These assemblies indicate that virtually all genomic rearrangements among these species occur within, rather than between, chromosome arms. The genome assembly procedure we describe is fast, inexpensive, accurate, and can be applied to many species.
Project description:De novo peptide sequencing is a fundamental research area in mass spectrometry (MS) based proteomics. However, de novo sequencing methods have often been evaluated using a couple of simple metrics that do not fully reflect their overall performance. Moreover, there has not been an established method to estimate the false discovery rate (FDR) and the significance of de novo peptide-spectrum matches (PSMs). Here we propose NovoBoard, a comprehensive framework to evaluate the performance of de novo peptide sequencing methods. The framework consists of diverse benchmark datasets (including tryptic, nontryptic, immunopeptidomics, and different species), and a standard set of accuracy metrics to evaluate the fragment ions, amino acids, and peptides of the de novo results. More importantly, a new approach is designed to evaluate de novo peptide sequencing methods on target-decoy spectra and to estimate their FDRs. Our results thoroughly reveal the strengths and weaknesses of different de novo peptide sequencing methods, and how their performance depends on the specific application and the type of data. Our FDR estimation also shows that some tools may perform better than others in distinguishing between de novo PSMs and random matches, and that it can be used to assess the significance of de novo PSMs.
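For orientation, the general idea behind a target-decoy FDR estimate can be sketched as follows. This is the conventional score-threshold estimate (decoys divided by targets above a cutoff, then converted to q-values), not NovoBoard's own procedure, which the description does not detail. The function name `target_decoy_q_values` and the `(score, is_decoy)` input layout are assumptions made for illustration.

```python
from typing import List, Tuple


def target_decoy_q_values(psms: List[Tuple[float, bool]]) -> List[Tuple[float, float]]:
    """Return (score, q_value) pairs, sorted by descending score.

    psms: hypothetical input, one (score, is_decoy) pair per match.
    FDR at a score cutoff is estimated as decoys / targets at or above it.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)

    # Running FDR estimate as the score threshold is lowered.
    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # q-value: smallest FDR at this threshold or any more permissive one.
    q_values = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q_values.append(running_min)
    q_values.reverse()

    return [(score, q) for (score, _), q in zip(ranked, q_values)]


example = [(0.9, False), (0.8, False), (0.7, True), (0.6, False), (0.5, True)]
for score, q in target_decoy_q_values(example):
    print(f"score={score:.1f}  q={q:.3f}")
```

Matches whose q-value falls below a chosen cutoff (for example 0.01) would be reported at that estimated FDR; the cutoff itself is a user choice, not something the description specifies.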
Project description:De novo copy number variations in cloned dogs from the same nuclear donor. In this study, we aimed to identify de novo post-cloning CNV events and to estimate the rate of CNV mosaicism in cloned dogs with an identical genetic background. We analyzed CNVs in seven cloned dogs by array CGH, using the nuclear donor genome as the reference.