Dataset Information

RAFTS³G: an efficient and versatile clustering software to analyses in large protein datasets.

ABSTRACT:

Background

Clustering methods are essential for partitioning biological samples and are useful for reducing the complexity of information in large datasets. Tools in this context usually generate data with greedy algorithms, which solve some Data Mining difficulties but can degrade biologically relevant information during the clustering process. The lack of standardized metrics and consistent bases also raises questions about the clustering efficiency of some methods. Benchmarks, for which a good choice of dataset is essential, are needed to explore the full potential of clustering methods, among which alignment-free methods stand out.
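To make the two terms used above concrete, here is a minimal sketch of a generic greedy, alignment-free clustering pass. It is not the RAFTS³G algorithm: the k-mer Jaccard similarity, the function names (`kmer_set`, `jaccard`, `greedy_cluster`) and the default `threshold` and `k` values are all assumptions chosen for illustration.

```python
from typing import Dict, List, Set


def kmer_set(seq: str, k: int = 3) -> Set[str]:
    """Return the set of overlapping k-mers of a sequence (no alignment needed)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two k-mer sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def greedy_cluster(
    seqs: Dict[str, str], threshold: float = 0.5, k: int = 3
) -> Dict[str, List[str]]:
    """Hypothetical greedy pass: each sequence joins the FIRST representative
    whose similarity reaches the threshold, otherwise it founds a new cluster."""
    # Process longest sequences first; ties are broken by name for determinism.
    order = sorted(seqs, key=lambda name: (-len(seqs[name]), name))
    rep_kmers: Dict[str, Set[str]] = {}
    clusters: Dict[str, List[str]] = {}
    for name in order:
        kmers = kmer_set(seqs[name], k)
        for rep, other in rep_kmers.items():
            if jaccard(kmers, other) >= threshold:
                clusters[rep].append(name)
                break
        else:
            # No representative was similar enough: start a new cluster.
            rep_kmers[name] = kmers
            clusters[name] = [name]
    return clusters
```

Because a sequence is assigned to the first representative that passes the threshold rather than to the most similar one, a pass like this can place a sequence in a suboptimal group, which is one way a greedy strategy can lose biologically relevant information.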

Results

Here we present a new approach to Data Mining in large protein sequence datasets, the Rapid Alignment Free Tool for Sequences Similarity Search to Groups (RAFTS³G), a clustering method that aims to lose less biological information while generating groups. The strategy developed in our algorithm is optimized to be more stringent, which is reflected in increased accuracy and sensitivity in the generation of clusters across a wide range of similarity. RAFTS³G is a better choice than three main methods when the user wants a more reliable result, even without knowing the ideal clustering threshold.

Conclusion

In general, RAFTS³G is able to group up to millions of biological sequences in large datasets, which makes it a remarkably efficient option for clustering. Compared to other "gold-standard" methods for clustering large biological data, RAFTS³G maintains the balance between reducing the redundancy of biological information and creating consistent groups. We apply the binary search concept to grouped sequences, which maintains the sensitivity/accuracy relation and reduces the time needed to generate data with the RAFTS³G process.

SUBMITTER: de Lima Nichio BT

PROVIDER: S-EPMC6631606 | biostudies-literature | 2019 Jul

REPOSITORIES: biostudies-literature

Publications

RAFTS³G: an efficient and versatile clustering software to analyses in large protein datasets.

de Lima Nichio, Bruno Thiago; de Oliveira, Aryel Marlus Repula; de Pierri, Camilla Reginatto; Santos, Leticia Graziela Costa; Lejambre, Alexandre Quadros; Vialle, Ricardo Assunção; da Rocha Coimbra, Nilson Antônio; Guizelini, Dieval; Marchaukoski, Jeroniza Nunes; de Oliveira Pedrosa, Fabio; Raittz, Roberto Tadeu

BMC Bioinformatics, 2019-07-15, 1

PMID: 31307371

Similar Datasets

Project description: The expanding and dynamic market of new psychoactive substances (NPSs) poses challenges for laboratories worldwide. The retrospective data analysis (RDA) of previously analyzed samples for new targets can be used to investigate analytes missed in the first data analysis. However, RDA has historically been unsuitable for routine evaluation because reprocessing and reevaluating large numbers of forensic samples is highly work- and time-consuming. In this project, we developed an efficient and scalable retrospective data analysis workflow that can easily be tailored and optimized for groups of NPSs. The objectives of the study were to establish a retrospective data analysis workflow for benzodiazepines in whole blood samples and apply it to previously analyzed driving-under-the-influence-of-drugs (DUID) cases. The RDA workflow was based on a training set of hits in ultrahigh-performance liquid chromatography-quadrupole time-of-flight-mass spectrometry (UHPLC-QTOF-MS) data files, corresponding to common benzodiazepines that had also been analyzed with a complementary UHPLC-tandem mass spectrometry (MS/MS) method. Quantitative results in the training set were used as the true condition to evaluate whether a hit in the UHPLC-QTOF-MS data file was a true or false positive. The training set was used to evaluate and set filters. The RDA was used to extract information on 47 designer and uncommon benzodiazepines (DBZDs) from 13,514 UHPLC-QTOF-MS data files from DUID cases analyzed from 2014 to 2020, with filters on the retention time window, count level, and mass error. Sixteen DBZDs were detected, where 47 identifications had been confirmed by using complementary methods when the case was open (confirmed positive finding), and 43 targets were not reported when the case was open (tentative positive finding). The most common tentative and confirmed findings were etizolam (n = 26), phenazepam (n = 13), lorazepam (n = 9), and flualprazolam (n = 8). This method efficiently found DBZDs in previously acquired UHPLC-QTOF-MS data files, with only nine false-positive hits. When the standard of an emerging DBZD becomes available, all previously acquired DUID data files can be screened in less than 1 min. Being able to perform a fast and accurate retrospective data analysis across previously acquired data files is a major technological advancement in monitoring NPS abuse.

OmicsDI is part of the ELIXIR infrastructure
