Dataset Information

RNANet: an automatically built dual-source dataset integrating homologous sequences and RNA structures.

ABSTRACT:

Motivation

Applied research in machine learning progresses faster when a clean dataset is available and ready to use. Several datasets have been proposed and released over the years for specific tasks such as image classification, speech recognition and, more recently, protein structure prediction. However, for the fundamental problem of RNA structure prediction, information is spread across several databases depending on the level of interest: sequence, secondary structure, 3D structure or interactions with other macromolecules. To speed up advances in machine-learning-based approaches for RNA secondary and/or 3D structure prediction, a dataset integrating all this information is required, so that time is not spent on data gathering and cleaning.

Results

Here, we propose the first attempt at a standardized and automatically generated dataset dedicated to RNA, combining RNA sequences, homology information (in the form of position-specific scoring matrices) and information derived from the annotation of available 3D structures (including secondary structure, canonical and non-canonical interactions and backbone torsion angles). The data are retrieved from the public databases PDB, Rfam and SILVA. The paper describes the procedure used to build such a dataset and the RNA structure descriptors we provide. Some statistical descriptions of the resulting dataset are also provided.
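
To illustrate what position-specific homology information can look like, the sketch below computes per-column nucleotide frequencies from a small aligned set of RNA sequences. It is only a minimal illustration: the function name `position_frequency_matrix`, the toy alignment and the choice to ignore gaps are assumptions made here, and the abstract does not state that RNANet computes or stores its matrices this way.

```python
from collections import Counter

ALPHABET = "ACGU"  # assumption: plain RNA alphabet, no ambiguity codes


def position_frequency_matrix(aligned_seqs):
    """Return one dict per alignment column mapping each nucleotide
    to its relative frequency. Gaps and other symbols are ignored."""
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    matrix = []
    for col in range(length):
        # Count every symbol in this column, then keep only A, C, G, U.
        counts = Counter(s[col].upper() for s in aligned_seqs)
        total = sum(counts[n] for n in ALPHABET)
        matrix.append(
            {n: (counts[n] / total if total else 0.0) for n in ALPHABET}
        )
    return matrix


# Hypothetical toy alignment ('-' marks a gap).
alignment = ["GGAU-C", "GGAUAC", "GCAU-C", "GGCUAC"]
pfm = position_frequency_matrix(alignment)
```

Each entry of `pfm` describes one alignment column, so a fully conserved position has a single nucleotide with frequency 1.0, while a variable position spreads the mass over several nucleotides.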

Availability and implementation

The dataset is updated every month and available online (in flat-text file format) on the EvryRNA software platform (https://evryrna.ibisc.univ-evry.fr/evryrna/rnanet). An efficient parallel pipeline to build the dataset is also provided for easy reproduction or modification.

Supplementary information

Supplementary data are available at Bioinformatics online.

SUBMITTER: Becquey L

PROVIDER: S-EPMC8189678 | biostudies-literature | 2021 Jun

REPOSITORIES: biostudies-literature

Publications

RNANet: an automatically built dual-source dataset integrating homologous sequences and RNA structures.

Becquey Louis, Angel Eric, Tahi Fariza

Bioinformatics (Oxford, England), 2021-06-01, issue 9

PMID: 33135044

Similar Datasets

CentroidHomfold-LAST: accurate prediction of RNA secondary structure using automatically collected homologous sequences.
| S-EPMC3125741 | biostudies-literature

Nh3D: a reference dataset of non-homologous protein structures.
| S-EPMC1182382 | biostudies-literature

Predicting pseudoknotted structures across two RNA sequences.
| S-EPMC3516145 | biostudies-literature

DecoyFinder: Identification of Contaminants in Sets of Homologous RNA Sequences.
| S-EPMC11507696 | biostudies-literature

RNA 3D structure prediction guided by independent folding of homologous sequences.
| S-EPMC6806525 | biostudies-literature

INTEGRATING MULTIPLE BUILT ENVIRONMENT DATA SOURCES.
| S-EPMC11600455 | biostudies-literature

A Biomedically oriented automatically annotated Twitter COVID-19 Dataset.
| S-EPMC8328063 | biostudies-literature

Data mining of functional RNA structures in genomic sequences.
| S-EPMC8301259 | biostudies-literature

Automatically Fixing Errors in Glycoprotein Structures with Rosetta.
| S-EPMC6616339 | biostudies-literature

Bochun: Automatically annotated stance detection dataset for Sorani Kurdish language.
| S-EPMC12266528 | biostudies-literature