Dataset Information

KmerKeys: a web resource for searching indexed genome assemblies and variants.

ABSTRACT: K-mers are short DNA sequences that are used for genome sequence analysis. Applications that use k-mers include genome assembly and alignment. However, the wider bioinformatic use of these short sequences faces challenges related to the massive scale of genomic sequence data. A single human genome assembly has billions of k-mers. As a result, the computational requirements for analyzing k-mer information are enormous, particularly when complete genome assemblies are involved. To address these issues, we developed a new indexing data structure based on a hash table tuned for the lookup of short sequence keys. This web application, referred to as KmerKeys, provides rapid query speeds for cloud computation on genome assemblies. We enable fuzzy as well as exact sequence searches of assemblies. To enable robust and speedy performance, the website implements cache-friendly hash tables, memory mapping and massively parallel processing. Our method employs a scalable and efficient data structure that can be used to jointly index and search a large collection of human genome assembly information. One can include variant databases and their associated metadata, such as the gnomAD population variant catalogue. This feature enables the incorporation of future genomic information into sequencing analysis. KmerKeys is freely accessible at https://kmerkeys.dgi-stanford.org.
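
The idea of indexing k-mers in a hash table and querying them exactly or approximately can be illustrated with a toy sketch. This is not the KmerKeys implementation: it uses a plain Python dictionary rather than a cache-friendly, memory-mapped table, the function names (build_kmer_index, exact_search, fuzzy_search) are hypothetical, and "fuzzy" is assumed here to mean a Hamming distance of at most one substitution, which the abstract does not specify.

```python
from collections import defaultdict


def build_kmer_index(sequences, k):
    """Map every k-mer to the (sequence name, 0-based offset) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in sequences.items():
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((name, i))
    return index


def exact_search(index, kmer):
    """Return all positions of an exact k-mer match (empty list if absent)."""
    return list(index.get(kmer.upper(), []))


def fuzzy_search(index, kmer, alphabet="ACGT"):
    """Return hits for the query and for every indexed k-mer that differs
    from it by one substitution (assumed definition of 'fuzzy')."""
    kmer = kmer.upper()
    hits = {}
    if kmer in index:
        hits[kmer] = list(index[kmer])
    for pos, base in enumerate(kmer):
        for alt in alphabet:
            if alt == base:
                continue
            neighbor = kmer[:pos] + alt + kmer[pos + 1:]
            if neighbor in index:
                hits[neighbor] = list(index[neighbor])
    return hits


if __name__ == "__main__":
    # Two made-up sequences standing in for assemblies.
    toy = {"seqA": "ACGTACGT", "seqB": "ACGAACGT"}
    idx = build_kmer_index(toy, k=4)
    print(exact_search(idx, "ACGT"))
    print(fuzzy_search(idx, "ACGT"))
```

Enumerating the 3k single-substitution neighbours keeps each fuzzy query to a fixed number of hash lookups, which is one simple way to reuse an exact-match index for approximate search; a production index at the scale of billions of k-mers would need a far more compact encoding than Python strings.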

SUBMITTER: Pavlichin DS

PROVIDER: S-EPMC9252721 | biostudies-literature

REPOSITORIES: biostudies-literature

Similar Datasets

Project description: For nearly 100 years serotyping has been the gold standard for the identification of Salmonella serovars. Despite the increasing adoption of DNA-based subtyping approaches, serotype information remains a cornerstone in food safety and public health activities aimed at reducing the burden of salmonellosis. At the same time, recent advances in whole-genome sequencing (WGS) promise to revolutionize our ability to perform advanced pathogen characterization in support of improved source attribution and outbreak analysis. We present the Salmonella In Silico Typing Resource (SISTR), a bioinformatics platform for rapidly performing simultaneous in silico analyses for several leading subtyping methods on draft Salmonella genome assemblies. In addition to performing serovar prediction by genoserotyping, this resource integrates sequence-based typing analyses for: Multi-Locus Sequence Typing (MLST), ribosomal MLST (rMLST), and core genome MLST (cgMLST). We show how phylogenetic context from cgMLST analysis can supplement the genoserotyping analysis and increase the accuracy of in silico serovar prediction to over 94.6% on a dataset comprised of 4,188 finished genomes and WGS draft assemblies. In addition to allowing analysis of user-uploaded whole-genome assemblies, the SISTR platform incorporates a database comprising over 4,000 publicly available genomes, allowing users to place their isolates in a broader phylogenetic and epidemiological context. The resource incorporates several metadata driven visualizations to examine the phylogenetic, geospatial and temporal distribution of genome-sequenced isolates. As sequencing of Salmonella isolates at public health laboratories around the world becomes increasingly common, rapid in silico analysis of minimally processed draft genome assemblies provides a powerful approach for molecular epidemiology in support of public health investigations. Moreover, this type of integrated analysis using multiple sequence-based methods of sub-typing allows for continuity with historical serotyping data as we transition towards the increasing adoption of genomic analyses in epidemiology. The SISTR platform is freely available on the web at https://lfz.corefacility.ca/sistr-app/.

Project description: Background: Within the past decade, Africa has faced several recurrent outbreaks of Ebola virus disease (EVD), including the 2014-2016 outbreak in West Africa and the recent 2018-2020 Kivu outbreak in the Democratic Republic of Congo. The study thus aimed at quantifying and mapping the scientific output of EVD research published within 2010-2020 through a bibliometric perspective. Methods: EVD-related publications from 2010 to 2020 were retrieved from the Web of Science (WoS) and Scopus databases by using the keywords 'Ebola', 'Ebola Virus Disease', 'Ebolas', and 'ebolavirus'. Biblioshiny software (using R-studio cloud) was used to categorise and evaluate authors', countries' and journals' contribution. VOSviewer was used for network visualisation. Results: According to the search strategy used, a total of 3865 and 3848 EVD documents were published in WoS and Scopus, respectively. The average citation per document was 16.1 (WoS) and 16.3 (Scopus). The results show an overall increase in the publication trend within the study period. The leading countries in EVD research were the USA and UK, with over 100 papers in both databases, including Nigeria and South Africa. NIAID and CDC-USA were the most influential institutions, while "Infectious Diseases" and "Medicine" were the most decisive research fields. The most contributing authors included Feldmann H and Qiu XG with over 60 papers in each database, while Journal of Infectious Diseases was the most crucial journal. The most cited article was from Aylward et al. published in 2014, while recent years displayed a keyword focus on "double-blind", "efficacy", "ring vaccination" and "drug effect". Conclusion: This bibliometric analysis provides an updated historical perspective of progress in EVD research and has highlighted the role played by various stakeholders. However, the contribution of African countries and institutions is not sufficiently reflected, implying a need for increased funding and focus on EVD research for effective prevention and control.
