Dataset Information

Improving genome-wide scans of positive selection by using protein isoforms of similar length.

ABSTRACT: Large-scale evolutionary studies often require the automated construction of alignments of a large number of homologous gene families. The majority of eukaryotic genes can produce different transcripts due to alternative splicing or transcription initiation, and many such transcripts encode different protein isoforms. As analyses tend to be gene centered, a single protein isoform per gene is selected for the alignment, with the de facto approach being to use the longest protein isoform per gene (Longest), presumably to avoid including partial sequences and to maximize sequence information. Here, we show that this approach is problematic because it increases the number of indels in the alignments due to the inclusion of nonhomologous regions, such as those derived from species-specific exons, increasing the number of misaligned positions. With the aim of ameliorating this problem, we have developed a novel heuristic, Protein ALignment Optimizer (PALO), which, for each gene family, selects the combination of protein isoforms that are most similar in length. We examine several evolutionary parameters inferred from alignments in which the only difference is the method used to select the protein isoform combination: Longest, PALO, the combination that results in the highest sequence conservation, and a randomly selected combination. We observe that Longest tends to overestimate both nonsynonymous and synonymous substitution rates when compared with PALO, which is most likely due to an excess of misaligned positions. The estimation of the fraction of genes that have experienced positive selection by maximum likelihood is very sensitive to the method of isoform selection employed, both when alignments are constructed with MAFFT and with Prank(+F). Longest performs better than a random combination but still estimates up to 3 times more positively selected genes than the combination showing the highest conservation, indicating the presence of many false positives. We show that PALO can eliminate the majority of such false positives and thus that it is a more appropriate approach for large-scale analyses than Longest. A web server has been set up to facilitate the use of PALO given a user-defined set of gene families; it is available at http://evolutionarygenomics.imim.es/palo.
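The selection idea described in the abstract can be illustrated with a short sketch. This is not the PALO implementation: the abstract only says that PALO picks the isoform combination most similar in length, so the scoring used here (sum of pairwise absolute length differences) and the exhaustive search over all combinations are assumptions made for illustration. The function names and the example gene and isoform identifiers are hypothetical.

```python
from itertools import combinations, product


def select_similar_length_isoforms(family):
    """Pick one isoform per gene so that the chosen lengths are as similar as possible.

    `family` maps gene id -> {isoform id: protein length}.
    Assumption: similarity is scored as the sum of pairwise absolute
    length differences (lower is better); the real PALO criterion may differ.
    Exhaustive search, so only suitable for small families.
    """
    genes = sorted(family)
    best_combo, best_score = None, None
    # Enumerate every way of choosing one isoform from each gene.
    for combo in product(*(sorted(family[g].items()) for g in genes)):
        lengths = [length for _, length in combo]
        score = sum(abs(a - b) for a, b in combinations(lengths, 2))
        if best_score is None or score < best_score:
            best_combo, best_score = combo, score
    chosen = {g: iso for g, (iso, _) in zip(genes, best_combo)}
    return chosen, best_score


def select_longest(family):
    """Baseline ("Longest"): take the longest isoform of each gene."""
    return {g: max(isos, key=isos.get) for g, isos in family.items()}


# Hypothetical gene family with made-up isoform lengths (amino acids).
example_family = {
    "human_gene": {"h1": 512, "h2": 430},
    "mouse_gene": {"m1": 428, "m2": 300},
    "dog_gene": {"d1": 610, "d2": 433},
}

similar, spread = select_similar_length_isoforms(example_family)
longest = select_longest(example_family)
```

In this toy family the two strategies disagree: the longest-isoform baseline picks proteins of 512, 428 and 610 residues, while the length-similarity criterion picks 430, 428 and 433, which is the kind of combination the abstract argues produces fewer indels in the alignment.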

SUBMITTER: Villanueva-Canas JL

PROVIDER: S-EPMC3590775 | biostudies-literature | 2013

REPOSITORIES: biostudies-literature


Publications

Improving genome-wide scans of positive selection by using protein isoforms of similar length.

Villanueva-Cañas JL, Laurie S, Albà MM

Genome Biology and Evolution, 2013-01-01, issue 2


PMID: 23377868

Similar Datasets

Project description:

Background: The detection of signatures of selection in genomic regions provides insights into the evolutionary process, enabling discoveries regarding complex phenotypic traits. In this research, we focused on identifying genomic regions affected by different selection pressures, mainly highlighting recent positive selection, as well as understanding the candidate genes and functional pathways associated with the signatures of selection in the Mangalarga Marchador (MM) genome. We also aim to guide the discussion of genes and traits of importance in this breed, especially traits related to the type and quality of gait, temperament, conformation, and the locomotor system.

Results: Three different methods were used to search for signals of selection: Tajima's D (TD), the integrated haplotype score (iHS), and runs of homozygosity (ROH). The samples were composed of males (n = 62) and females (n = 130) that were initially chosen considering well-defined phenotypes for gait: picada (n = 86) and batida (n = 106). All horses were genotyped using a 670 k Axiom® Equine Genotyping Array (Axiom MNEC670). In total, 27, 104 (chosen), and 38 candidate genes were observed within the signatures of selection identified in TD, iHS, and ROH analyses, respectively. These genes act in essential biological processes. The enrichment analysis highlighted the following functions: anterior/posterior pattern for the set of genes (GLI3, HOXC9, HOXC6, HOXC5, HOXC4, HOXC13, HOXC11, and HOXC10); limb morphogenesis, skeletal system, proximal/distal pattern formation, JUN kinase activity (CCL19 and MAP3K6); and muscle stretch response (MAPK14). Other candidate genes were associated with energy metabolism, bronchodilator response, NADH regeneration, reproduction, keratinization, and the immunological system.

Conclusions: Our findings revealed evidence of signatures of selection in the MM breed that encompass genes acting on athletic performance, limb development, and energy for muscle activity, with the particular involvement of the HOX family genes. The genome of MM is marked by recent positive selection. However, Tajima's D and iHS results also point to the presence of balancing selection in specific regions of the genome.

