Dataset Information

Towards the identification of essential genes using targeted genome sequencing and comparative analysis.

ABSTRACT: The identification of genes essential for survival is of theoretical importance in the understanding of the minimal requirements for cellular life, and of practical importance in the identification of potential drug targets in novel pathogens. With the great time and expense required for experimental studies aimed at constructing a catalog of essential genes in a given organism, a computational approach which could identify essential genes with high accuracy would be of great value.

We gathered numerous features which could be generated automatically from genome sequence data and assessed their relationship to essentiality, and subsequently utilized machine learning to construct an integrated classifier of essential genes in both S. cerevisiae and E. coli. When looking at single features, phyletic retention, a measure of the number of organisms an ortholog is present in, was the most predictive of essentiality. Furthermore, during construction of our phyletic retention feature we for the first time explored the evolutionary relationship among the set of organisms in which the presence of a gene is most predictive of essentiality. We found that in both E. coli and S. cerevisiae the optimal sets always contain host-associated organisms with small genomes which are closely related to the reference. Using five optimally selected organisms, we were able to improve predictive accuracy as compared to using all available sequenced organisms. We hypothesize the predictive power of these genomes is a consequence of the process of reductive evolution, by which many parasites and symbionts evolved their gene content. In addition, essentiality is measured in rich media, a condition which resembles the environments of these organisms in their hosts where many nutrients are provided. Finally, we demonstrate that integration of our most highly predictive features using a probabilistic classifier resulted in accuracies surpassing any individual feature.

Using features obtainable directly from sequence data, we were able to construct a classifier which can predict essential genes with high accuracy. Furthermore, our analysis of the set of genomes in which the presence of a gene is most predictive of essentiality may suggest ways in which targeted sequencing can be used in the identification of essential genes. In summary, the methods presented here can aid in the reduction of time and money invested in essential gene identification by targeting those genes for experimentation which are predicted as being essential with a high probability.
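The phyletic retention idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the gene and organism names are invented toy data, the function names are hypothetical, AUC is used here as a stand-in for whatever accuracy measure the paper reports, and the exhaustive subset search is an assumption about how an "optimal" organism set might be chosen.

```python
from itertools import combinations


def phyletic_retention(presence, organisms):
    """For each gene, count how many of the given organisms carry an ortholog."""
    return {
        gene: sum(1 for org in organisms if org in orgs_with_ortholog)
        for gene, orgs_with_ortholog in presence.items()
    }


def auc(scores, essential):
    """Rank-based AUC: chance that a random essential gene outscores a
    random non-essential gene, with ties counted as one half."""
    pos = [s for g, s in scores.items() if g in essential]
    neg = [s for g, s in scores.items() if g not in essential]
    if not pos or not neg:
        raise ValueError("need at least one essential and one non-essential gene")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def best_organism_set(presence, essential, candidates, k):
    """Exhaustively try every k-organism subset and keep the one whose
    phyletic retention score separates essential genes best (assumed criterion)."""
    return max(
        combinations(sorted(candidates), k),
        key=lambda subset: auc(phyletic_retention(presence, subset), essential),
    )


# Toy data (hypothetical): gene -> organisms in which an ortholog was found.
presence = {
    "geneA": {"org1", "org2", "org3"},
    "geneB": {"org1", "org2"},
    "geneC": {"org3", "org4"},
    "geneD": set(),
}
essential = {"geneA", "geneB"}
candidates = {"org1", "org2", "org3", "org4"}

auc_all = auc(phyletic_retention(presence, candidates), essential)
best_pair = best_organism_set(presence, essential, candidates, k=2)
auc_best = auc(phyletic_retention(presence, best_pair), essential)
print(auc_all, best_pair, auc_best)
```

In this toy case a well-chosen pair of organisms separates essential from non-essential genes better than the full set does, which mirrors the abstract's observation about using five optimally selected organisms. Exhaustive search is only practical for small candidate pools; a greedy or heuristic search would be needed for realistic numbers of genomes.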

SUBMITTER: Gustafson AM

PROVIDER: S-EPMC1624830 | biostudies-other | 2006 Oct

REPOSITORIES: biostudies-other

Publications

Towards the identification of essential genes using targeted genome sequencing and comparative analysis.

Gustafson AM, Snitkin ES, Parker SC, DeLisi C, Kasif S

BMC Genomics, 2006 Oct 19


PMID: 17052348

Similar Datasets

Project description: Hodgkin lymphoma (HL) is a lymphoproliferative malignancy of B-cell origin that accounts for 10% of all lymphomas. Despite evidence suggesting strong familial clustering of HL, there is no clear understanding of the contribution of genes predisposing to HL. In this study, whole genome sequencing (WGS) was performed on 7 affected and 9 unaffected family members from three HL-prone families and variants were prioritized using our Familial Cancer Variant Prioritization Pipeline (FCVPPv2). WGS identified a total of 98,564, 170,550, and 113,654 variants which were reduced by pedigree-based filtering to 18,158, 465, and 26,465 in families I, II, and III, respectively. In addition to variants affecting amino acid sequences, variants in promoters, enhancers, transcription factor binding sites, and microRNA seed sequences were identified from upstream, downstream, 5' and 3' untranslated regions. A panel of 565 cancer predisposing and other cancer-related genes and of 2,383 potential candidate HL genes were also screened in these families to aid further prioritization. Pathway analysis of segregating genes with Combined Annotation Dependent Depletion Tool (CADD) scores >20 was performed using Ingenuity Pathway Analysis software which implicated several candidate genes in pathways involved in B-cell activation and proliferation and in the network of "Cancer, Hematological disease and Immunological Disease." We used the FCVPPv2 for further in silico analyses and prioritized 45 coding and 79 non-coding variants from the three families. Further literature-based analysis allowed us to narrow this list to one rare germline variant each in families I and II and two in family III. Functional studies were conducted on the candidate from family I in a previous study, resulting in the identification and functional validation of a novel heterozygous missense variant in the tumor suppressor gene DICER1 as a potential HL predisposition factor.
We aim to identify the individual genes responsible for predisposition in the remaining two families and will functionally validate these in further studies.

