Dataset Information

Splice site identification using probabilistic parameters and SVM classification.

ABSTRACT:

Background

Recent advances and automation in DNA sequencing technology have created a vast amount of DNA sequence data. This growth in sequence data demands better and more efficient analysis methods. Identifying genes in this newly accumulated data is an important issue in bioinformatics, and it requires the prediction of the complete gene structure. Accurate identification of splice sites in DNA sequences plays a central role in gene structure prediction in eukaryotes. Effective detection of splice sites requires knowledge of the characteristics, dependencies, and relationships of nucleotides in the region surrounding the splice site. Higher-order Markov models are generally regarded as a useful technique for modeling higher-order dependencies. However, their implementation requires estimating a large number of parameters, which is computationally expensive.
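To make the parameter growth concrete, the following sketch counts the conditional probabilities in a single transition table of an order-k Markov model over the four DNA bases. The function name `transition_table_size` is hypothetical and the arithmetic is illustrative only; it is not taken from the paper. A position-specific model would need one such table per modelled position in the window.

```python
def transition_table_size(order, alphabet_size=4):
    """Number of conditional probabilities P(next base | previous `order` bases).

    There are alphabet_size**order possible contexts, and each context has
    alphabet_size possible next symbols.
    """
    return alphabet_size ** (order + 1)


# Table size for orders 0 through 5 over the DNA alphabet {A, C, G, T}.
for k in range(6):
    print(k, transition_table_size(k))
```

Each extra order multiplies the table size by four, which is why a first-order model is much cheaper to estimate than a higher-order one.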

Results

The proposed method for splice site detection consists of two stages: a first-order Markov model (MM1) in the first stage and a support vector machine (SVM) with a polynomial kernel in the second. The MM1 takes DNA sequences as input and serves as a pre-processing step for the SVM. It models the compositional features and dependencies of nucleotides around splice site regions as probabilistic parameters. These parameters are then fed into the SVM, which combines them nonlinearly to predict splice sites. Compared with other existing standard splice site detection methods, the proposed MM1-SVM model shows superior performance in all cases.
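The first stage described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a position-specific first-order model estimated from true sites only, add-one smoothing, and toy sequences. The function names `fit_mm1` and `mm1_features` and the example data are hypothetical.

```python
ALPHABET = "ACGT"


def fit_mm1(sequences, pseudocount=1.0):
    """Estimate position-specific first-order Markov parameters.

    Returns (initial, transitions), where initial[b] = P(s_0 = b) and
    transitions[i][a][b] = P(s_i = b | s_{i-1} = a) for i >= 1.
    transitions[0] is a placeholder so that list indices match positions.
    The pseudocount (an assumption) avoids zero probabilities.
    """
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")

    init_counts = {b: pseudocount for b in ALPHABET}
    trans_counts = [
        {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
        for _ in range(length)
    ]
    for seq in sequences:
        init_counts[seq[0]] += 1
        for i in range(1, length):
            trans_counts[i][seq[i - 1]][seq[i]] += 1

    # Normalise counts into probabilities.
    total = sum(init_counts.values())
    initial = {b: c / total for b, c in init_counts.items()}
    transitions = []
    for pos in trans_counts:
        table = {}
        for a, row in pos.items():
            row_total = sum(row.values())
            table[a] = {b: c / row_total for b, c in row.items()}
        transitions.append(table)
    return initial, transitions


def mm1_features(seq, initial, transitions):
    """Map a DNA window to one probability per position (the SVM input)."""
    feats = [initial[seq[0]]]
    for i in range(1, len(seq)):
        feats.append(transitions[i][seq[i - 1]][seq[i]])
    return feats


# Toy donor-like windows (hypothetical data, not from the paper).
true_sites = ["AGGTAAGT", "AGGTGAGT", "CGGTAAGT", "AGGTAAGA"]
initial, transitions = fit_mm1(true_sites)
features = mm1_features("AGGTAAGT", initial, transitions)
print(len(features))
```

In the second stage, feature vectors computed this way for true and false sites would be used, together with their labels, to train a polynomial-kernel SVM. One possible choice is scikit-learn's `SVC(kernel="poly")`; the record does not name an SVM implementation, so that choice is an assumption.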

Conclusion

We proposed an effective pre-processing scheme for the SVM and applied it to the identification of splice sites. This simple yet effective splice site detection method shows better classification accuracy and computational speed than some more complex methods.

SUBMITTER: Baten AK

PROVIDER: S-EPMC1764471 | biostudies-literature | 2006 Dec

REPOSITORIES: biostudies-literature


Publications

Splice site identification using probabilistic parameters and SVM classification.

Baten AKMA, Chang BCH, Halgamuge SK, Li J

BMC Bioinformatics, 2006 Dec 18


PMID: 17254299

Similar Datasets

Project description:

Background: Breast cancer is a highly prevalent and destructive disease among women, characterised by varied tumour biology, molecular subgroups and diverse clinicopathological specifications. The potential of machine learning to transform complex medical data into meaningful knowledge has led to its application in breast cancer detection and prognostic evaluation.

Objective: Data-driven diagnostic models that assist clinicians in diagnostic decision making have attracted increasing interest in breast cancer identification and analysis. This motivated us to develop a data-driven model for more accurate breast cancer subtype classification.

Method: In this article, we proposed a firefly-support vector machine (SVM) breast cancer predictive model that uses clinicopathological and demographic data gathered from various tertiary care cancer hospitals or oncological centres to distinguish between patients with triple-negative breast cancer (TNBC) and non-triple-negative breast cancer (non-TNBC).

Results: The results of the firefly-support vector machine (firefly-SVM) predictive model were compared with those of the traditional grid search-support vector machine (Grid-SVM) model and the particle swarm optimisation-support vector machine (PSO-SVM) and genetic algorithm-support vector machine (GA-SVM) hybrid models through hyperparameter tuning. The findings show that the recommended firefly-SVM classification model outperformed other existing models in terms of prediction accuracy (93.4%, 86.6%, 69.6%) for automated SVM parameter selection. The effectiveness of the prediction model was also evaluated using well-known metrics, such as the F1-score, mean square error, area under the ROC curve, logarithmic loss and precision-recall curve.

Conclusion: The firefly-SVM predictive model may be treated as an alternative tool for breast cancer subgroup classification that would benefit clinicians in managing patients with proper treatment and diagnostic outcomes.
