Dataset Information

GeneMark-EP+: eukaryotic gene prediction with self-training in the space of genes and proteins.


ABSTRACT: We have made several steps toward creating a fast and accurate algorithm for gene prediction in eukaryotic genomes. First, we introduced an automated method for efficient ab initio gene finding, GeneMark-ES, with parameters trained in iterative unsupervised mode. Next, in GeneMark-ET, we proposed a method for integrating unsupervised training with information on intron positions revealed by mapping short RNA reads. Now we describe GeneMark-EP, a tool that utilizes another source of external information, a protein database, readily available prior to the start of a sequencing project. A new specialized pipeline, ProtHint, initiates massive protein mapping to the genome and extracts hints to splice sites and translation start and stop sites of potential genes. GeneMark-EP uses the hints to improve estimation of model parameters as well as to adjust coordinates of predicted genes if they disagree with the most reliable hints (the -EP+ mode). Tests of GeneMark-EP and -EP+ demonstrated improvements in gene prediction accuracy in comparison with GeneMark-ES, while GeneMark-EP+ showed higher accuracy than GeneMark-ET. The most pronounced improvements in gene prediction accuracy were observed in large eukaryotic genomes.

SUBMITTER: Bruna T

PROVIDER: S-EPMC7222226 | biostudies-literature |

REPOSITORIES: biostudies-literature

Similar Datasets

| S-EPMC7787252 | biostudies-literature
| S-EPMC1298918 | biostudies-literature
| S-EPMC8153819 | biostudies-literature
| S-EPMC5333190 | biostudies-literature
| S-EPMC5305209 | biostudies-literature
| S-EPMC3089888 | biostudies-literature
| S-EPMC4556409 | biostudies-literature
| S-EPMC7295766 | biostudies-literature
| S-EPMC5424156 | biostudies-literature
| S-EPMC2812018 | biostudies-literature