Dataset Information

A phylogenetic approach to target selection for structural genomics: solution structure of YciH.

ABSTRACT: Structural genomics presents an enormous challenge with up to 100 000 protein targets in the human genome alone. At current rates of structure determination, judicious selection of targets is necessary. Here, a phylogenetic approach to target selection is described which makes use of the National Center for Biotechnology Information database of Clusters of Orthologous Groups (COGs). The strategy is designed so that each new protein structure is likely to provide novel sequence-fold information. To demonstrate this approach, the NMR solution structure of YciH (COG0023), a putative translation initiation factor from Escherichia coli, has been determined and its fold classified. YciH is an ortholog of eIF-1/SUI1, an integral component of the translation initiation complex in eukaryotes. The structure consists of two antiparallel alpha-helices packed against the same side of a five-stranded beta-sheet. The first 31 residues of the 11.5 kDa protein are unstructured in solution. Comparative analysis indicates that the folded portion of YciH resembles a number of structures with the alpha-beta plait topology, though its sequence is not homologous to any of them. Thus, the phylogenetic approach to target selection described here was used successfully to identify a new homologous superfamily within this topology.

SUBMITTER: Cort JR

PROVIDER: S-EPMC148669 | biostudies-other | 1999 Oct

REPOSITORIES: biostudies-other

Publications

A phylogenetic approach to target selection for structural genomics: solution structure of YciH.

Cort JR, Koonin EV, Bash PA, Kennedy MA

Nucleic Acids Research, 1999-10-01, issue 20

PMID: 10497266

Similar Datasets

Project description:Genomic epidemiology uses pathogens' whole-genome sequences to understand and manage the spread of infectious diseases. Whole-genome data can be used to monitor outbreaks and cluster formation, identify cross-community transmissions, and characterize drug resistance and immune evasion. Typically, bacteria are cultured from clinical samples to obtain DNA for sequencing to generate whole-genome data. However, culture-independent diagnostic methods are utilized for some fastidious bacteria for better diagnostic yield and rapid pathogen genomics. Whole-genome enrichment (WGE) using targeted DNA sequencing enables direct sequencing of clinical samples without having to culture pathogens. However, the cost of capture probes ("baits") limits the utility of this method for large-scale genomic epidemiology. We developed a cost-effective method named Circular Nucleic acid Enrichment Reagent synthesis (CNERs) to generate whole-genome enrichment probes. We demonstrated the method by producing probes for Mycobacterium tuberculosis, which we used to enrich M. tuberculosis DNA that had been spiked at concentrations as low as 0.01% and 100 genome copies against a human DNA background to 1,225-fold and 4,636-fold. Furthermore, we enriched DNA from different M. tuberculosis lineages and M. bovis and demonstrated the utility of the WGE-CNERs data for lineage identification and drug-resistance characterization using an established pipeline. The CNERs method for whole-genome enrichment will be a valuable tool for the genomic epidemiology of emerging and difficult-to-grow pathogens. IMPORTANCE Emerging infectious diseases require continuous pathogen monitoring. Rapid clinical diagnosis by nucleic acid amplification is limited to a small number of targets and may miss target detection due to new mutations in clinical isolates. 
Whole-genome sequencing (WGS) identifies genome-wide variations that may be used to determine a pathogen's drug resistance patterns and phylogenetically characterize isolates to track disease origin and transmission. WGS is typically performed using DNA isolated from cultured clinical isolates. Culturing clinical specimens increases turn-around time and may not be possible for fastidious bacteria. To overcome some of these limitations, direct sequencing of clinical specimens has been attempted using expensive capture probes to enrich the entire genomes of target pathogens. We present a method to produce a cost-effective, time-efficient, and large-scale synthesis of probes for whole-genome enrichment. We envision that our method can be used for direct clinical sequencing of a wide range of microbial pathogens for genomic epidemiology.

Project description:Significant advances in biotechnology have allowed for simultaneous measurement of molecular data across multiple genomic, epigenomic and transcriptomic levels from a single tumor/patient sample. This has motivated systematic data-driven approaches to integrate multi-dimensional structured datasets, since cancer development and progression are driven by numerous co-ordinated molecular alterations and the interactions between them. We propose a novel multi-scale Bayesian approach that combines integrative graphical structure learning from multiple sources of data with a variable selection framework to determine the key genomic drivers of cancer progression. The integrative structure learning is first accomplished through novel joint graphical models for heterogeneous (mixed scale) data, allowing for flexible and interpretable incorporation of prior existing knowledge. This subsequently informs a variable selection step to identify groups of co-ordinated molecular features within and across platforms associated with clinical outcomes of cancer progression, while making appropriate adjustments for multicollinearity and multiplicities. We evaluate our methods through rigorous simulations to establish superiority over existing methods that do not take the network and/or prior information into account. Our methods are motivated by and applied to a glioblastoma multiforme (GBM) dataset from The Cancer Genome Atlas to predict patient survival times integrating gene expression, copy number and methylation data. We find a high concordance between our selected prognostic gene network modules and known associations with GBM. In addition, our model discovers several novel cross-platform network interactions (both cis and trans acting) between gene expression, copy number variation associated gene dosing and epigenetic regulation through promoter methylation, some with known implications in the etiology of GBM. 
Our framework provides a useful tool for biomedical researchers, since clinical prediction using multi-platform genomic information is an important step towards personalized treatment of many cancers.
