Dataset Information

A rank-based sequence aligner with applications in phylogenetic analysis.

ABSTRACT: Recent tools for aligning short DNA reads have been designed to optimize the trade-off between correctness and speed. This paper introduces a method for assigning a set of short DNA reads to a reference genome, under Local Rank Distance (LRD). The rank-based aligner proposed in this work favors correctness over speed. However, some indexing strategies to speed up the aligner are also investigated. The LRD aligner is improved in terms of speed by storing k-mer positions in a hash table for each read. Another improvement, which produces an approximate LRD aligner, is to consider only the positions in the reference that are likely to represent a good positional match of the read. The proposed aligner is evaluated and compared to other state-of-the-art alignment tools in several experiments. One set of experiments is conducted to determine the precision and the recall of the proposed aligner in the presence of contaminated reads. In another set of experiments, the proposed aligner is used to find the order, the family, or the species of a new (or unknown) organism, given only a set of short Next-Generation Sequencing DNA reads. The empirical results show that the aligner proposed in this work is highly accurate from a biological point of view. Compared to the other evaluated tools, the LRD aligner has the important advantage of being very accurate even for a very low base coverage. Thus, the LRD aligner can be considered a good alternative to standard alignment tools, especially when the accuracy of the aligner is of high importance. Source code and UNIX binaries of the aligner are freely available for future development and use at http://lrd.herokuapp.com/aligners. The software is implemented in C++ and Java and is supported on UNIX and MS Windows.
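
The indexing idea in the abstract, a per-read hash table of positions for substrings of fixed length k (k-mers), can be illustrated with a minimal Python sketch. It is not the authors' C++/Java implementation: the function names `kmer_positions` and `candidate_offsets` are hypothetical, and the simple offset-voting used to shortlist reference positions is an assumption chosen for illustration, not the LRD-based criterion of the paper.

```python
from collections import defaultdict


def kmer_positions(read, k):
    """Map each k-mer of `read` to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(read) - k + 1):
        index[read[i:i + k]].append(i)
    return dict(index)


def candidate_offsets(read_index, reference, k):
    """Count, per reference offset, how many read k-mers would line up there.

    Illustrative assumption: an offset with many votes is a plausible
    placement of the read, so only such offsets would need to be scored
    with the full (and slower) distance computation.
    """
    votes = defaultdict(int)
    for j in range(len(reference) - k + 1):
        # Look up the reference k-mer in the read's hash table.
        for i in read_index.get(reference[j:j + k], ()):
            votes[j - i] += 1
    return dict(votes)


if __name__ == "__main__":
    read = "ACGTAC"
    reference = "TTACGTACGG"
    index = kmer_positions(read, 3)
    votes = candidate_offsets(index, reference, 3)
    best = max(votes, key=votes.get)
    print(best, votes[best])
```

In this toy input the read occurs verbatim in the reference starting at position 2, so that offset collects a vote from every k-mer of the read, while spurious offsets collect only one vote each.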

SUBMITTER: Dinu LP

PROVIDER: S-EPMC4136772 | biostudies-literature | 2014

REPOSITORIES: biostudies-literature

Publications

A rank-based sequence aligner with applications in phylogenetic analysis.

Dinu LP, Ionescu RT, Tomescu AI

PLoS One, 2014-08-18, issue 8

PMID: 25133391

Similar Datasets

Project description: Discovering and understanding patterns in networks of protein-protein interactions (PPIs) is a central problem in systems biology. Alignments between these networks aid functional understanding as they uncover important information, such as evolutionarily conserved pathways, protein complexes and functional orthologs. A few methods have been proposed for global PPI network alignments, but because of the NP-completeness of the underlying sub-graph isomorphism problem, producing topologically and biologically accurate alignments remains a challenge. We introduce a novel global network alignment tool, Lagrangian GRAphlet-based ALigner (L-GRAAL), which directly optimizes both the protein and the interaction functional conservations, using a novel alignment search heuristic based on integer programming and Lagrangian relaxation. We compare L-GRAAL with the state-of-the-art network aligners on the largest available PPI networks from BioGRID and observe that L-GRAAL uncovers the largest common sub-graphs between the networks, as measured by edge-correctness and symmetric sub-structures scores, which allow transferring more functional information across networks. We assess the biological quality of the protein mappings using the semantic similarity of their Gene Ontology annotations and observe that L-GRAAL best uncovers functionally conserved proteins. Furthermore, we introduce for the first time a measure of the semantic similarity of the mapped interactions and show that L-GRAAL also best uncovers functionally conserved interactions. In addition, we illustrate on the PPI networks of baker's yeast and human the ability of L-GRAAL to predict new PPIs. Finally, L-GRAAL's results are the first to show that topological information is more important than sequence information for uncovering functionally conserved interactions. L-GRAAL is coded in C++. Software is available at: http://bio-nets.doc.ic.ac.uk/L-GRAAL/. Contact: n.malod-dognin@imperial.ac.uk. Supplementary data are available at Bioinformatics online.

Project description: Leishmaniasis is a debilitating infectious disease that has a variety of clinical forms. In China, visceral leishmaniasis (VL) is the most common form, and L. donovani and/or L. infantum are the likely pathogens. In this study, multilocus sequence typing (MLST) of five enzyme-coding genes (fh, g6pdh, icd, mpi, pgd) and two conserved genes (hsp70, lack) was used to investigate the phylogenetic relationships of Chinese Leishmania strains. Concatenated alignment of the nucleotide sequences of the seven genes was analyzed and phylogenetic trees were constructed using neighbor-joining and maximum parsimony models. A set of additional sequences from 25 strains (24 strains belong to the L. donovani complex and one strain belongs to L. gerbilli) were retrieved from GenBank to infer the molecular evolutionary history of Leishmania from China and other endemic areas worldwide. Phylogenetic analyses consolidated Chinese Leishmania into four groups: (i) one clade A population comprised 13 isolates from different foci in China, which were pathogenic to humans and canines. This population was subdivided into two subclades, clade A1 and clade A2, which comprised sister organisms to the remaining members of the worldwide L. donovani complex; (ii) a population in clade B consisted of one reference strain of L. turanica and five Chinese strains from Xinjiang; (iii) clade C (SELF-7 and EJNI-154) formed a population that was closely related to clade B, and both isolates were identified as L. gerbilli; and (iv) the final group, clade D, included Sauroleishmania (LIZRD and KXG-E) and was distinct from the other strains. We hypothesize that the phylogeny of Chinese Leishmania is associated with the geographical origins rather than with the clinical forms (VL or CL) of leishmaniasis. To conclude, this study provides further molecular information on Chinese Leishmania isolates, and the Chinese isolates appear to have a more complex evolutionary history than previously thought.
