Dataset Information

PhyloHerb: A high-throughput phylogenomic pipeline for processing genome skimming data.

ABSTRACT:

Premise

The application of high-throughput sequencing, especially to herbarium specimens, is rapidly accelerating biodiversity research. Low-coverage sequencing of total genomic DNA (genome skimming) is particularly promising and can simultaneously recover the plastid, mitochondrial, and nuclear ribosomal regions across hundreds of species. Here, we introduce PhyloHerb, a bioinformatic pipeline to efficiently assemble phylogenomic data sets derived from genome skimming.

Methods and results

PhyloHerb uses either a built-in database or user-specified references to extract orthologous sequences from all three genomes using a BLAST search. It outputs FASTA files and offers a suite of utility functions to assist with alignment, partitioning, concatenation, and phylogeny inference. The program is freely available at https://github.com/lmcai/PhyloHerb/.

Conclusions

We demonstrate that PhyloHerb can accurately identify genes using a published data set from Clusiaceae. We also show via simulations that our approach is effective for highly fragmented assemblies from herbarium specimens and is scalable to thousands of species.

SUBMITTER: Cai L

PROVIDER: S-EPMC9215275 | biostudies-literature | 2022 May-Jun

REPOSITORIES: biostudies-literature

ACCESS DATA

Publications

PhyloHerb: A high-throughput phylogenomic pipeline for processing genome skimming data.

Cai Liming L Zhang Hongrui H Davis Charles C CC

Applications in plant sciences 20220501 3

<h4>Premise</h4>The application of high-throughput sequencing, especially to herbarium specimens, is rapidly accelerating biodiversity research. Low-coverage sequencing of total genomic DNA (genome skimming) is particularly promising and can simultaneously recover the plastid, mitochondrial, and nuclear ribosomal regions across hundreds of species. Here, we introduce PhyloHerb, a bioinformatic pipeline to efficiently assemble phylogenomic data sets derived from genome skimming.<h4>Methods and re ...[more]

PMID: 35774988

Similar Datasets

Project description:The rapid development of phenotyping technologies over the last years gave the opportunity to study plant development over time. The treatment of the massive amount of data collected by high-throughput phenotyping (HTP) platforms is however an important challenge for the plant science community. An important issue is to accurately estimate, over time, the genotypic component of plant phenotype. In outdoor and field-based HTP platforms, phenotype measurements can be substantially affected by data-generation inaccuracies or failures, leading to erroneous or missing data. To solve that problem, we developed an analytical pipeline composed of three modules: detection of outliers, imputation of missing values, and mixed-model genotype adjusted means computation with spatial adjustment. The pipeline was tested on three different traits (3D leaf area, projected leaf area, and plant height), in two crops (chickpea, sorghum), measured during two seasons. Using real-data analyses and simulations, we showed that the sequential application of the three pipeline steps was particularly useful to estimate smooth genotype growth curves from raw data containing a large amount of noise, a situation that is potentially frequent in data generated on outdoor HTP platforms. The procedure we propose can handle up to 50% of missing values. It is also robust to data contamination rates between 20 and 30% of the data. The pipeline was further extended to model the genotype time series data. A change-point analysis allowed the determination of growth phases and the optimal timing where genotypic differences were the largest. The estimated genotypic values were used to cluster the genotypes during the optimal growth phase. Through a two-way analysis of variance (ANOVA), clusters were found to be consistently defined throughout the growth duration. Therefore, we could show, on a wide range of scenarios, that the pipeline facilitated efficient extraction of useful information from outdoor HTP platform data. High-quality plant growth time series data is also provided to support breeding decisions. The R code of the pipeline is available at https://github.com/ICRISAT-GEMS/SpaTemHTP.

Project description:BackgroundGenome skimming is a popular method in plant phylogenomics that do not include a biased enrichment step, relying on random shallow sequencing of total genomic DNA. From these data the plastome is usually readily assembled and constitutes the bulk of phylogenetic information generated in these studies. Despite a few attempts to use genome skims to recover low copy nuclear loci for direct phylogenetic use, such endeavor remains neglected. Causes might include the trade-off between libraries with few reads and species with large genomes (i.e., missing data caused by low coverage), but also might relate to the lack of pipelines for data assembling.MethodsA pipeline and its companion R package designed to automate the recovery of low copy nuclear markers from genome skimming libraries are presented. Additionally, a series of analyses aiming to evaluate the impact of key assembling parameters, reference selection and missing data are presented.ResultsA substantial amount of putative low copy nuclear loci was assembled and proved useful to base phylogenetic inference across the libraries tested (4 to 11 times more data than previously assembled plastomes from the same libraries).DiscussionCritical aspects of assembling low copy nuclear markers from genome skims include the minimum coverage and depth of a sequence to be used. More stringent values of these parameters reduces the amount of assembled data and increases the relative amount of missing data, which can compromise phylogenetic inference, in turn relaxing the same parameters might increase sequence error. These issues are discussed in the text, and parameter tuning through multiple comparisons tracking their effects on support and congruence is highly recommended when using this pipeline. The skimmingLoci pipeline (https://github.com/mreginato/skimmingLoci) might stimulate the use of genome skims to recover nuclear loci for direct phylogenetic use, increasing the power of genome skimming data to resolve phylogenetic relationships, while reducing the amount of sequenced DNA that is commonly wasted.

Dataset Information

PhyloHerb: A high-throughput phylogenomic pipeline for processing genome skimming data.

Premise

Methods and results

Conclusions

Publications

PhyloHerb: A high-throughput phylogenomic pipeline for processing genome skimming data.

Similar Datasets

OmicsDI is part of the ELIXIR infrastructure

Tweets