
Dataset Information


Genotype prediction of 336,463 samples from public expression data.


ABSTRACT: Tens of thousands of RNA-sequencing experiments comprising hundreds of thousands of individual samples have now been performed. These data represent a broad range of experimental conditions, sequencing technologies, and hypotheses under study. The Recount project has aggregated and uniformly processed hundreds of thousands of publicly available RNA-seq samples. Most of these samples include only RNA expression measurements; genotype data for the same samples would enable a wide range of analyses, including variant prioritization, eQTL analysis, and studies of allele-specific expression. Here, we developed a statistical model based on the existing reference and alternative read counts from the RNA-seq experiments available through Recount3 to predict genotypes at autosomal biallelic loci in coding regions. We demonstrate the accuracy of our model using large-scale studies that measured both gene expression and genotype genome-wide. We show that our predictive model is highly accurate, with 99.5% overall accuracy, 99.6% major allele accuracy, and 90.4% minor allele accuracy. Our model is robust to tissue and study effects, provided the coverage is high enough. We applied this model to genotype all the samples in Recount3 and provide the largest ready-to-use expression repository containing genotype information. We illustrate that the genotypes predicted from RNA-seq data are sufficient to unravel the underlying population structure of samples in Recount3 using principal component analysis.
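
The abstract does not specify the statistical model, so the following is only an illustrative sketch of the general idea of calling a genotype at a biallelic locus from reference and alternative read counts. It uses a plain binomial maximum-likelihood rule, which is a stand-in technique and not the authors' method. The names `call_genotype`, `ERROR_RATE`, and `MIN_COVERAGE`, and their values, are hypothetical; the coverage cutoff only mirrors the abstract's remark that accuracy depends on sufficient coverage.

```python
import math

# Hypothetical parameters chosen for illustration; not taken from the paper.
ERROR_RATE = 0.01   # assumed probability that a read shows the wrong allele
MIN_COVERAGE = 8    # assumed minimum read depth required to attempt a call


def log_binom_pmf(k, n, p):
    """Log probability of k successes in n trials with success probability p."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))


def call_genotype(ref_count, alt_count):
    """Return '0/0', '0/1', or '1/1', or None when coverage is too low."""
    depth = ref_count + alt_count
    if depth < MIN_COVERAGE:
        return None
    # Expected fraction of alternative-allele reads under each genotype.
    alt_fraction = {"0/0": ERROR_RATE, "0/1": 0.5, "1/1": 1 - ERROR_RATE}
    # Pick the genotype whose expected fraction best explains the counts.
    return max(alt_fraction,
               key=lambda g: log_binom_pmf(alt_count, depth, alt_fraction[g]))
```

Under these assumptions, a locus with 10 reference and 10 alternative reads is called heterozygous, while a locus with only 3 reads is left uncalled. A real RNA-seq model would also need to account for effects such as allele-specific expression, which this sketch ignores.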

SUBMITTER: Razi A 

PROVIDER: S-EPMC10979922 | biostudies-literature | 2024 Mar

REPOSITORIES: biostudies-literature


Publications

Genotype prediction of 336,463 samples from public expression data.

Razi A, Lo CC, Wang S, Leek JT, Hansen KD

bioRxiv: the preprint server for biology, 2024-03-13



Similar Datasets

| S-EPMC3490960 | biostudies-literature
| S-EPMC5961118 | biostudies-literature
| S-EPMC10864173 | biostudies-literature
| S-EPMC6456650 | biostudies-literature
| S-EPMC5009519 | biostudies-literature
| S-EPMC3526609 | biostudies-literature
2019-07-18 | PXD013455 | Pride
| S-EPMC7010235 | biostudies-literature
| S-EPMC8293825 | biostudies-literature
| S-ECPF-GEOD-36245 | biostudies-other