Dataset Information

Identification of novel DNA repair proteins via primary sequence, secondary structure, and homology.

ABSTRACT:

Background

DNA repair is the general term for the collection of critical mechanisms which repair many forms of DNA damage such as methylation or ionizing radiation. DNA repair has mainly been studied in experimental and clinical situations, and relatively few information-based approaches to extracting new DNA repair knowledge exist. As a first step, automatic detection of DNA repair proteins in genomes via informatics techniques is desirable; however, there are many forms of DNA repair, and identifying and classifying repair proteins with a single optimal method is not straightforward. We study the ability of homology and machine learning-based methods to identify and classify DNA repair proteins, and we scan vertebrate genomes for the presence of novel repair proteins. Combinations of primary sequence polypeptide frequency, secondary structure, and homology information are used as input features for a Support Vector Machine (SVM).
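As a rough illustration of the "primary sequence polypeptide frequency" features mentioned above, the following minimal sketch turns a protein sequence into a fixed-length vector of k-mer frequencies. The function name `kmer_frequencies`, the choice of k = 2 (dipeptides), the 20-letter alphabet, and the toy sequence are all assumptions made for illustration; the abstract does not specify how the study computed its features.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids (an assumed alphabet for this sketch).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_frequencies(sequence, k=2):
    """Return k-mer frequencies of a protein sequence as a fixed-length list.

    The vector has len(AMINO_ACIDS) ** k entries in lexicographic k-mer order.
    Windows containing non-standard residues are ignored.
    """
    sequence = sequence.upper()
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts[km] for km in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


# Hypothetical toy sequence, not taken from the dataset.
toy_sequence = "MKVLAAGIVK"
features = kmer_frequencies(toy_sequence, k=2)
```

A vector of this kind could be concatenated with secondary structure or homology-derived values before being passed to an SVM implementation of the reader's choice.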

Results

We find that SVM techniques are capable of identifying portions of DNA repair protein datasets without admitting false positives; at low levels of false positive tolerance, homology can also identify and classify proteins with good performance. Secondary structure information provides improved performance compared to using primary structure alone. Furthermore, we observe that machine learning methods incorporating homology information perform best when data is filtered by some clustering technique. Applying these methodologies to scan multiple vertebrate genomes confirms a positive correlation between the size of a genome and the number of DNA repair protein transcripts it is likely to contain, and simultaneously suggests that all organisms have a non-zero minimum number of repair genes. In addition, the scan result clusters several organisms' repair abilities in an evolutionarily consistent fashion. The analysis also identifies several functionally unconfirmed proteins that are highly likely to be involved in the repair process. A new web service, INTREPED, has been made available for the immediate search and annotation of DNA repair proteins in newly sequenced genomes.
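One way to read "identifying portions of DNA repair protein datasets without admitting false positives" is as recall measured at a decision threshold strict enough to reject every known negative. The sketch below computes that quantity from classifier scores. The function name and the toy scores and labels are hypothetical and are not results from the study.

```python
def recall_at_zero_false_positives(scores, labels):
    """Fraction of positives scoring strictly above the highest-scoring negative.

    scores: classifier decision values (higher means more repair-like).
    labels: 1 for a known repair protein, 0 for a known non-repair protein.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives:
        return 0.0
    # The strictest negative score defines the zero-false-positive threshold.
    threshold = max(negatives) if negatives else float("-inf")
    return sum(s > threshold for s in positives) / len(positives)


# Made-up decision values for six proteins (illustration only).
toy_scores = [2.1, 1.4, 0.3, -0.2, -1.0, 0.9]
toy_labels = [1, 1, 1, 0, 0, 0]
recall = recall_at_zero_false_positives(toy_scores, toy_labels)
```

Relaxing the threshold to admit a small number of negatives corresponds to the "low levels of false positive tolerance" setting described above.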

Conclusion

Despite complexity due to a multitude of repair pathways, combinations of sequence, structure, and homology with Support Vector Machines offer good methods, in addition to existing homology searches, for DNA repair protein identification and functional annotation. Most importantly, this study has uncovered relationships between the size of a genome and a genome's available repair repertoire, and offers a number of new predictions as well as a prediction service, both of which reduce the search time and cost for novel repair genes and proteins.

SUBMITTER: Brown JB

PROVIDER: S-EPMC2660303 | biostudies-literature | 2009 Jan

REPOSITORIES: biostudies-literature

Publications

Identification of novel DNA repair proteins via primary sequence, secondary structure, and homology.

Brown JB, Akutsu T

BMC Bioinformatics, 2009-01-20

PMID: 19154573

Similar Datasets

Project description: Background: The C9ORF72 hexanucleotide repeat expansion is the most common known genetic cause of amyotrophic lateral sclerosis (ALS) and frontotemporal dementia (FTD), two fatal age-related neurodegenerative diseases. The C9ORF72 expansion encodes five dipeptide repeat proteins (DPRs) that are produced through a non-canonical translation mechanism. Among the DPRs, proline-arginine (PR), glycine-arginine (GR), and glycine-alanine (GA) are the most neurotoxic and increase the frequency of DNA double strand breaks (DSBs). While the accumulation of these genotoxic lesions is increasingly recognized as a feature of disease, the mechanism(s) of DPR-mediated DNA damage are ill-defined and the effect of DPRs on the efficiency of each DNA DSB repair pathway has not been previously evaluated. Methods and results: Using DNA DSB repair assays, we evaluated the efficiency of specific repair pathways, and found that PR, GR and GA decrease the efficiency of non-homologous end joining (NHEJ), single strand annealing (SSA), and microhomology-mediated end joining (MMEJ), but not homologous recombination (HR). We found that PR inhibits DNA DSB repair, in part, by binding to the nucleolar protein nucleophosmin (NPM1). Depletion of NPM1 inhibited NHEJ and SSA, suggesting that NPM1 loss-of-function in PR expressing cells leads to impediments of both non-homologous and homology-directed DNA DSB repair pathways. By deleting NPM1 sub-cellular localization signals, we found that PR binds NPM1 regardless of the cellular compartment to which NPM1 was directed. Deletion of the NPM1 acidic loop motif, known to engage other arginine-rich proteins, abrogated PR and NPM1 binding. Using confocal and super-resolution immunofluorescence microscopy, we found that levels of RAD52, a component of the SSA repair machinery, were significantly increased in iPSC neurons relative to isogenic controls in which the C9ORF72 expansion had been deleted using CRISPR/Cas9 genome editing. Western analysis of post-mortem brain tissues confirmed that RAD52 immunoreactivity is significantly increased in C9ALS/FTD samples as compared to controls. Conclusions: Collectively, we characterized the inhibitory effects of DPRs on key DNA DSB repair pathways, identified NPM1 as a facilitator of DNA repair that is inhibited by PR, and revealed deficits in homology-directed DNA DSB repair pathways as a novel feature of C9ORF72-related disease.

Project description: Motivation: The structure of proteins is organized in a hierarchy in which the secondary structure elements, α-helix, β-strand and loop, are the basic bricks. The determination of secondary structure elements usually requires knowledge of the whole structure. Nevertheless, in numerous experimental circumstances, the protein structure is only partially known. The detection of secondary structures from these partial structures is hampered by the lack of information about connecting residues along the primary sequence. Results: We introduce a new methodology to estimate the secondary structure elements from the values of local distances and angles between the protein atoms. Our method uses a message passing neural network, named Sequoia, which allows the automatic prediction of secondary structure elements from the values of local distances and angles between the protein atoms. This neural network takes as input the topology of the given protein graph, where the vertices are protein residues, and the edges are weighted by values of distances and pseudo-dihedral angles generalizing the backbone angles ϕ and ψ. Any pair of residues, independently of its covalent bonds along the primary sequence of the protein, is tagged with this distance and angle information. Sequoia permits the automatic detection of the secondary structure elements, with an F1-score larger than 80% for most of the cases, when α helices and β strands are predicted. In contrast to the approaches classically used in structural biology, such as DSSP, Sequoia is able to capture the variations of geometry at the interface of adjacent secondary structure elements. Due to its general modeling frame, Sequoia is able to handle graphs containing only Cα atoms, which is particularly useful on low resolution structural input and in the frame of electron microscopy development. Availability and implementation: Sequoia source code can be found at https://github.com/Khalife/Sequoia with additional documentation. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
