Dataset Information

Automated download and clean-up of family-specific databases for kmer-based virus identification.

ABSTRACT:

Summary

Here, we present an automated pipeline for Download Of NCBI Entries (DONE) and continuous updating of a local sequence database based on user-specified queries. The database can be built from either protein or nucleotide sequences and can contain all entries or complete genomes only. The pipeline can automatically clean the database by removing entries that match a database of user-specified sequence contaminants. The default contaminant entries include sequences from the NCBI UniVec database of plasmids, marker genes and sequencing adapters, an E. coli genome, rRNA sequences, vectors and satellite sequences. Furthermore, duplicates are removed and the database is automatically screened for sequences from green fluorescent protein, luciferase and antibiotic resistance genes, which might be present in some GenBank viral entries and could lead to false positives in virus identification. For use of the database, we also present an option for dealing with possible human contamination. We show the applicability of DONE by downloading a virus database comprising 37 virus families. We observed an average increase of 16 776 new entries downloaded per month for the 37 families. In addition, we demonstrate the utility of a custom database compared to a standard reference database for classifying both simulated and real sequence data.
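The clean-up idea described in the summary (drop duplicates, then drop entries that match known contaminants) can be illustrated with a minimal sketch. This is not the DONE implementation: the function names, the exact-sequence definition of a duplicate, the k-mer size and the shared-fraction threshold are all assumptions chosen for illustration only.

```python
from typing import Iterable, List, Set, Tuple

Entry = Tuple[str, str]  # (entry name, sequence); hypothetical in-memory format


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all k-mers in a sequence (empty if shorter than k)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def clean_database(
    entries: Iterable[Entry],
    contaminants: Iterable[str],
    k: int = 8,                        # assumed k-mer size, not taken from DONE
    max_shared_fraction: float = 0.5,  # assumed cut-off, not taken from DONE
) -> List[Entry]:
    """Drop exact duplicate sequences and entries dominated by contaminant k-mers."""
    contaminant_kmers: Set[str] = set()
    for contaminant in contaminants:
        contaminant_kmers |= kmers(contaminant, k)

    seen: Set[str] = set()
    kept: List[Entry] = []
    for name, seq in entries:
        key = seq.upper()
        if key in seen:
            continue  # exact duplicate of an earlier entry
        seen.add(key)

        entry_kmers = kmers(key, k)
        if entry_kmers:
            shared = len(entry_kmers & contaminant_kmers) / len(entry_kmers)
            if shared >= max_shared_fraction:
                continue  # too similar to a contaminant sequence
        # Entries shorter than k cannot be screened and are kept as-is.
        kept.append((name, seq))
    return kept
```

In this sketch the first copy of a duplicated sequence is retained and later copies are discarded. A real pipeline would more likely rely on an alignment or k-mer mapping tool than on in-memory set intersections.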

Availability and implementation

The DONE pipeline for downloading and cleaning is deposited in a publicly available repository (https://bitbucket.org/genomicepidemiology/done/src/master/).

Supplementary information

Supplementary data are available at Bioinformatics online.

SUBMITTER: Allesoe RL

PROVIDER: S-EPMC8097684 | biostudies-literature

REPOSITORIES: biostudies-literature


Similar Datasets

Project description: Although it is widely recognized that microorganisms are essential for sustaining soil fertility, structure, nutrient cycling, groundwater purification, and other soil functions, soil microbial toxicity data were excluded from the derivation of Ecological Soil Screening Levels (Eco-SSL) in the United States. Among the reasons for such exclusion were claims that microbial toxicity tests were too difficult to interpret because of the high variability of microbial responses, uncertainty regarding the relevance of the various endpoints, and functional redundancy. Since the release of the first draft of the Eco-SSL Guidance document by the US Environmental Protection Agency in 2003, soil microbial toxicity testing and its use in ecological risk assessments have substantially improved. A wide range of standardized and nonstandardized methods became available for testing chemical toxicity to microbial functions in soil. Regulatory frameworks in the European Union and Australia have successfully incorporated microbial toxicity data into the derivation of soil threshold concentrations for ecological risk assessments. This article provides the 3-part rationale for including soil microbial processes in the development of soil clean-up values (SCVs): 1) presenting a brief overview of relevant test methods for assessing microbial functions in soil, 2) examining data sets for Cu, Ni, Zn, and Mo that incorporated soil microbial toxicity data into regulatory frameworks, and 3) offering recommendations on how to integrate the best available science into the method development for deriving site-specific SCVs that account for bioavailability of metals and metalloids in soil. Although the primary focus of this article is on the development of the approach for deriving SCVs for metals and metalloids in the United States, the recommendations provided in this article may also be applicable in other jurisdictions that aim at developing ecological soil threshold values for protection of microbial processes in contaminated soils.