Dataset Information

Bias-invariant RNA-sequencing metadata annotation.


ABSTRACT:

Background

Recent technological advances have resulted in an unprecedented increase in publicly available biomedical data, yet reuse of the data is often precluded by experimental bias and by a lack of annotation depth and consistency. Missing annotations make it impossible for researchers to find datasets specific to their needs.

Findings

Here, we investigate RNA-sequencing metadata prediction based on gene expression values. We present a deep-learning-based domain adaptation algorithm for the automatic annotation of RNA-sequencing metadata. We show, in multiple experiments, that our model integrates heterogeneous training data better than existing linear regression-based approaches, resulting in improved tissue type classification. Because its architecture is similar to Siamese networks, the algorithm can learn biases from datasets with few samples.
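
The Siamese-style idea mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the single-layer encoder, the contrastive loss, the margin value, and the simulated expression data are all assumptions chosen for illustration. The point it shows is that one shared set of weights embeds samples from two datasets, and the loss pulls pairs with the same tissue label together while pushing pairs with different labels apart, so a dataset-specific offset can be absorbed even when the biased dataset contributes only a few samples.

```python
import numpy as np


def encode(x, weights):
    """Shared encoder: the same weights embed every sample, whatever its dataset."""
    return np.tanh(x @ weights)


def contrastive_loss(emb_a, emb_b, same_label, margin=1.0):
    """Pull same-label pairs together; push other pairs at least `margin` apart."""
    dist = np.linalg.norm(emb_a - emb_b, axis=1)
    pos = same_label * dist ** 2
    neg = (1 - same_label) * np.maximum(0.0, margin - dist) ** 2
    return float(np.mean(pos + neg))


rng = np.random.default_rng(0)
n_genes, emb_dim = 20, 4  # hypothetical sizes, far smaller than real expression data
weights = rng.normal(scale=0.1, size=(n_genes, emb_dim))

# Simulated pairs: each row of `source` is paired with the same row of `target`.
source = rng.normal(size=(6, n_genes))
target = source + 0.5                       # constant shift standing in for dataset bias
target[3:] = rng.normal(size=(3, n_genes))  # last three pairs have different tissues
same_label = np.array([1, 1, 1, 0, 0, 0])

loss = contrastive_loss(encode(source, weights), encode(target, weights), same_label)
print(f"contrastive loss on the toy pairs: {loss:.4f}")
```

In a real setting the weights would be trained by minimizing this loss with a deep-learning framework, and the learned embedding would feed a tissue or sex classifier; the sketch only evaluates the loss once.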

Conclusion

Using our novel domain adaptation approach, we achieved metadata annotation accuracies up to 15.7% higher than those of a previously published method. Using the best model, we provide a list of >10,000 novel tissue and sex label annotations for 8,495 unique SRA samples. Our approach has the potential to revive idle datasets through automated annotation, making them more searchable.

SUBMITTER: Wartmann H 

PROVIDER: S-EPMC8559615 | biostudies-literature

REPOSITORIES: biostudies-literature

Similar Datasets

| S-EPMC8401820 | biostudies-literature
| S-EPMC10277029 | biostudies-literature
| S-EPMC4197826 | biostudies-literature
| S-EPMC3149584 | biostudies-literature
| S-EPMC4117970 | biostudies-literature
| S-EPMC5778030 | biostudies-literature
| S-EPMC4130647 | biostudies-literature
| S-EPMC7703774 | biostudies-literature
| S-EPMC5428526 | biostudies-literature
| S-EPMC3328248 | biostudies-literature