Dataset Information

Rapid multi-locus sequence typing direct from uncorrected long reads using Krocus.

ABSTRACT: Genome sequencing is rapidly being adopted in reference labs and hospitals for bacterial outbreak investigation and diagnostics, where time is critical. Seven-gene multi-locus sequence typing (MLST) is a standard tool for broadly classifying samples into sequence types (STs); in many cases this allows a sample to be ruled out of an outbreak, or general characteristics of a bacterial strain to be inferred. Long-read sequencing technologies, such as those from Oxford Nanopore, can produce read data within minutes of an experiment starting, unlike short-read sequencing technologies, which require many hours or days. However, the error rates of raw uncorrected long-read data are very high. We present Krocus, which can predict an ST directly from uncorrected long reads and was designed to consume read data as it is produced, providing results in minutes. It is the only tool that can do this from uncorrected long reads. We tested Krocus on over 700 isolates sequenced using long-read sequencing technologies from Pacific Biosciences and Oxford Nanopore. It provides STs for isolates within 90 s on average, with a sensitivity of 94% and a specificity of 97% on real sample data, directly from uncorrected raw sequence reads. The software is written in Python and is available under the open source license GNU GPL version 3.
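
As a rough illustration of what "classifying samples into sequence types" means in seven-gene multi-locus sequence typing, the sketch below looks up an ST from an allele profile. It is not Krocus's implementation, and it skips the hard part Krocus addresses (calling alleles from error-prone reads). The locus names, allele numbers, and profile table are all hypothetical.

```python
from typing import Dict, Optional, Tuple

# Hypothetical seven-locus scheme: names and numbers are invented for illustration.
LOCI: Tuple[str, ...] = (
    "locus1", "locus2", "locus3", "locus4", "locus5", "locus6", "locus7",
)

# Hypothetical profile table: one allele number per locus -> sequence type (ST).
PROFILES: Dict[Tuple[int, ...], int] = {
    (1, 1, 1, 1, 1, 1, 1): 1,
    (1, 2, 1, 1, 3, 1, 1): 2,
}


def sequence_type(alleles: Dict[str, int]) -> Optional[int]:
    """Return the ST for a complete allele profile, or None if it is not in the table."""
    # Order the allele numbers by the scheme's locus order before the lookup.
    key = tuple(alleles[locus] for locus in LOCI)
    return PROFILES.get(key)
```

A profile that matches no row of the table (for example, one containing a novel allele) returns `None` here; a real typing tool would report that case explicitly.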

SUBMITTER: Page AJ

PROVIDER: S-EPMC6074768 | biostudies-literature | 2018

REPOSITORIES: biostudies-literature


Publications

Rapid multi-locus sequence typing direct from uncorrected long reads using Krocus.

Page AJ, Keane JA

PeerJ, 2018-07-31


PMID: 30083440

Similar Datasets

Project description:

Background: Multi-locus sequence typing (MLST) has become the gold standard for population analyses of bacterial pathogens. This method focuses on the sequences of a small number of loci (usually seven) to divide the population and is simple, robust and facilitates comparison of results between laboratories and over time. Over the last decade, researchers and population health specialists have invested substantial effort in building up public MLST databases for nearly 100 different bacterial species, and these databases contain a wealth of important information linked to MLST sequence types such as time and place of isolation, host or niche, serotype and even clinical or drug resistance profiles. Recent advances in sequencing technology mean it is increasingly feasible to perform bacterial population analysis at the whole genome level. This offers massive gains in resolving power and genetic profiling compared to MLST, and will eventually replace MLST for bacterial typing and population analysis. However, given the wealth of data currently available in MLST databases, it is crucial to maintain backwards compatibility with MLST schemes so that new genome analyses can be understood in their proper historical context.

Results: We present a software tool, SRST, for quick and accurate retrieval of sequence types from short read sets, using inputs easily downloaded from public databases. SRST uses read mapping and an allele assignment score incorporating sequence coverage and variability, to determine the most likely allele at each MLST locus. Analysis of over 3,500 loci in more than 500 publicly accessible Illumina read sets showed SRST to be highly accurate at allele assignment. SRST output is compatible with common analysis tools such as eBURST, Clonal Frame or PhyloViz, allowing easy comparison between novel genome data and MLST data. Alignment, fastq and pileup files can also be generated for novel alleles.

Conclusions: SRST is a novel software tool for accurate assignment of sequence types using short read data. Several uses for the tool are demonstrated, including quality control for high-throughput sequencing projects, plasmid MLST and analysis of genomic data during outbreak investigation. SRST is open-source, requires Python, BWA and SamTools, and is available from http://srst.sourceforge.net.

Project description:

Background: Mycoplasma anserisalpingitidis is a waterfowl pathogen that mainly infects geese, can cause significant economic losses, and is present worldwide. With the advance of whole genome sequencing technologies, new methods are available to researchers; one emerging methodology is core genome multi-locus sequence typing (cgMLST). The core genome contains a high percentage of the coding DNA sequence (CDS) set of the studied strains. cgMLST schemas are powerful genotyping tools that allow the investigation of potential epidemics and the precise and reliable classification of strains. Although whole genome sequences of M. anserisalpingitidis strains are available, no cgMLST schema has been published for this species to date.

Results: In this study, Illumina short reads of 81 M. anserisalpingitidis strains were used, including samples from Hungary, Poland, Sweden, and China. Draft genomes were assembled with the SPAdes software and analysed with the chewBBACA program, which is available online. User-made modifications to the program enabled the analysis of mycoplasmas and gave results similar to those of the conventional SeqSphere+ software. The threshold for the presence of a CDS in the strains was set to 93% because of the quality of the draft genomes, which resulted in the most accurate and robust schema. Three hundred thirty-one CDSs constituted our cgMLST schema (representing 42.77% of the whole CDS set of M. anserisalpingitidis ATCC BAA-2147), and a neighbor-joining tree was created using the allelic profiles. A correlation was observed between the strains' cgMLST profiles and their geographical origin; however, strains from the same integration but different locations also showed a close relationship. Strains isolated from different tissue samples of the same animal had highly similar cgMLST profiles.

Conclusions: The neighbor-joining tree from the cgMLST schema closely resembled the real-life spatial and temporal relationships of the strains. The incongruences between background data and the cgMLST profiles of strains from the same integration may be due to the higher probability of contact between the flocks. This schema can help with epidemiological investigation and can be used as a basis for further studies.
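
The 93% presence threshold described above can be pictured as a simple filter over a locus presence table. The following is a minimal sketch under that reading, not the chewBBACA procedure itself; the function name, input layout, and example identifiers are assumptions made for illustration.

```python
from typing import Dict, List, Set


def core_loci(
    presence: Dict[str, Set[str]],
    n_strains: int,
    threshold: float = 0.93,
) -> List[str]:
    """Return loci found in at least `threshold` of the strains.

    `presence` maps a locus id to the set of strain ids in which that
    locus (CDS) was detected. This layout is hypothetical.
    """
    if n_strains <= 0:
        raise ValueError("n_strains must be positive")
    return sorted(
        locus
        for locus, strains in presence.items()
        if len(strains) / n_strains >= threshold
    )
```

Lowering the threshold admits loci that are missing from more draft genomes, which enlarges the schema at the cost of more missing data per strain.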

