Dataset Information

A modified GC-specific MAKER gene annotation method reveals improved and novel gene predictions of high and low GC content in Oryza sativa.

ABSTRACT: Accurate structural annotation depends on well-trained gene prediction programs. Training data for gene prediction programs are often chosen randomly from a subset of high-quality genes that ideally represent the variation found within a genome. One aspect of gene variation is GC content, which differs across species and is bimodal in grass genomes. When gene prediction programs are trained on a subset of grass genes with random GC content, they are effectively being trained on two classes of genes at once, and this can be expected to result in poor results when genes are predicted in new genome sequences. We find that gene prediction programs trained on grass genes with random GC content do not completely predict all grass genes with extreme GC content. We show that gene prediction programs that are trained with grass genes with high or low GC content can make both better and unique gene predictions compared to gene prediction programs that are trained on genes with random GC content. By separately training gene prediction programs with genes from multiple GC ranges and using the programs within the MAKER genome annotation pipeline, we were able to improve the annotation of the Oryza sativa genome compared to using the standard MAKER annotation protocol. Gene structure was improved in over 13% of genes, and 651 novel genes were predicted by the GC-specific MAKER protocol. We present a new GC-specific MAKER annotation protocol to predict new and improved gene models and assess the biological significance of this method in Oryza sativa. We expect that this protocol will also be beneficial for gene prediction in any organism with bimodal or other unusual gene GC content.
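The idea of building separate training sets by GC content can be illustrated with a minimal Python sketch. It is not the authors' protocol: the function names `gc_content` and `split_by_gc` and the cutoffs `low_max=0.45` and `high_min=0.60` are hypothetical and chosen only for illustration; the abstract does not state which GC ranges were used.

```python
from typing import Dict, List


def gc_content(seq: str) -> float:
    """Return the fraction of G and C among unambiguous A/C/G/T bases."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # An empty or fully ambiguous sequence has no defined GC content; report 0.0.
    return gc / total if total else 0.0


def split_by_gc(
    genes: Dict[str, str],
    low_max: float = 0.45,   # hypothetical cutoff, not taken from the paper
    high_min: float = 0.60,  # hypothetical cutoff, not taken from the paper
) -> Dict[str, List[str]]:
    """Assign each gene ID to a 'low', 'mid', or 'high' GC training bin."""
    bins: Dict[str, List[str]] = {"low": [], "mid": [], "high": []}
    for gene_id, seq in genes.items():
        gc = gc_content(seq)
        if gc <= low_max:
            bins["low"].append(gene_id)
        elif gc >= high_min:
            bins["high"].append(gene_id)
        else:
            bins["mid"].append(gene_id)
    return bins


if __name__ == "__main__":
    # Toy sequences standing in for high-quality gene models.
    example = {
        "geneA": "ATATATGC",
        "geneB": "GCGCGCGC",
        "geneC": "ATGCATGC",
    }
    print(split_by_gc(example))
```

Each bin could then serve as the training set for a separate instance of a gene predictor, which is the general approach the abstract describes; how the resulting predictors are combined within MAKER is not covered by this sketch.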

SUBMITTER: Bowman MJ

PROVIDER: S-EPMC5702205 | biostudies-literature | 2017 Nov

REPOSITORIES: biostudies-literature

Publications

A modified GC-specific MAKER gene annotation method reveals improved and novel gene predictions of high and low GC content in Oryza sativa.

Megan J Bowman, Jane A Pulman, Tiffany L Liu, Kevin L Childs

BMC Bioinformatics, 2017 Nov 25, issue 1

PMID: 29178822

Similar Datasets

Project description:

Background: Rice, the most important crop in Asia, has been cultivated in Taiwan for more than 5000 years. The landraces preserved by indigenous peoples and brought by immigrants from China hundreds of years ago exhibit large variation in morphology, implying that they comprise rich genetic resources. Breeding goals set according to the preferences of farmers, consumers, and government policies also alter the gene pools and genetic diversity of improved varieties. Unveiling how genetic diversity is affected by natural, farmers', and breeders' selection is crucial for germplasm conservation and crop improvement.

Results: A diversity panel of 148 rice accessions, including 47 cultivars and 59 landraces from Taiwan and 42 accessions from other countries, was genotyped using 75 molecular markers, which revealed an average of 12.7 alleles per locus with a mean polymorphism information content of 0.72. These accessions could be grouped into five subpopulations corresponding to wild rice, japonica landraces, indica landraces, indica cultivars, and japonica cultivars. The genetic diversity within subpopulations was: wild rices > landraces > cultivars; and indica rice > japonica rice. Despite having less variation among cultivars, japonica landraces had greater genetic variation than indica landraces because the majority of Taiwanese japonica landraces preserved by indigenous peoples were classified as tropical japonica. Two major clusters of indica landraces were formed by phylogenetic analysis, in accordance with immigration from two origins. Genetic erosion had occurred in later japonica varieties due to a narrow selection of germplasm being incorporated into breeding programs for premium grain quality. Genetic differentiation between early and late cultivars was significant in japonica (FST = 0.3751) but not in indica (FST = 0.0045), indicating effects of different breeding goals on modern germplasm. Indigenous landraces with unique intermediate and admixed genetic backgrounds were untapped, representing valuable resources for rice breeding.

Conclusions: The genetic diversity of improved rice varieties has been substantially shaped by breeding goals, leading to differentiation between indica and japonica cultivars. Taiwanese landraces with different origins possess various and unique genetic backgrounds. Taiwanese rice germplasm provides diverse genetic variation for association mapping to unveil useful genes and is a precious genetic reservoir for rice improvement.
