Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data.
ABSTRACT: Next-generation sequencing has become an important tool for genome-wide quantification of DNA and RNA. However, a major technical hurdle lies in the need to map short sequence reads back to their correct locations in a reference genome. Here we investigate the impact of SNP variation on the reliability of read-mapping in the context of detecting allele-specific expression (ASE). We generated sixteen million 35 bp reads from mRNA of each of two HapMap Yoruba individuals. When we mapped these reads to the human genome we found that, at heterozygous SNPs, there was a significant bias towards higher mapping rates of the allele in the reference sequence, compared to the alternative allele. Masking known SNP positions in the genome sequence eliminated the reference bias but, surprisingly, did not lead to more reliable results overall. We find that even after masking, ~5-10% of SNPs still have an inherent bias towards more effective mapping of one allele. Filtering out inherently biased SNPs removes 40% of the top signals of ASE. The remaining SNPs showing ASE are enriched in genes previously known to harbor cis-regulatory variation or known to show uniparental imprinting. Our results have implications for a variety of applications involving detection of alternate alleles from short-read sequence data. Scripts, written in Perl and R, for simulating short reads, masking SNP variation in a reference genome, and analyzing the simulation output are available upon request from JFD.
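The masking idea in the abstract can be illustrated with a small sketch. This is not the authors' Perl/R code (which is available only on request); it is a toy Python example under stated assumptions: a made-up reference string, a single hypothetical SNP, error-free reads, and mismatch counting in place of a real aligner. The function names `mask_snps`, `reads_overlapping_snp`, and `mismatches` are illustrative only.

```python
def mask_snps(reference, snp_positions, mask_char="N"):
    """Return a copy of `reference` with each 0-based SNP position masked."""
    seq = list(reference)
    for pos in snp_positions:
        seq[pos] = mask_char
    return "".join(seq)


def reads_overlapping_snp(reference, snp_pos, allele, read_len=35):
    """Return (start, read) for every error-free read of length `read_len`
    that covers `snp_pos` on a haplotype carrying `allele` at the SNP."""
    haplotype = reference[:snp_pos] + allele + reference[snp_pos + 1:]
    first = max(0, snp_pos - read_len + 1)
    last = min(snp_pos, len(haplotype) - read_len)
    return [(s, haplotype[s:s + read_len]) for s in range(first, last + 1)]


def mismatches(read, reference, start):
    """Count mismatches of `read` placed at `start`.
    A masked base ('N') in the reference mismatches every read base."""
    window = reference[start:start + len(read)]
    return sum(1 for r, g in zip(read, window) if r != g)


# Toy data (hypothetical): one SNP at position 10, reference allele C, alternative T.
reference = "ACGTTGCAAGCTTAGGCTAACGTTAGC"
snp_pos, ref_allele, alt_allele = 10, "C", "T"
masked = mask_snps(reference, [snp_pos])

ref_reads = reads_overlapping_snp(reference, snp_pos, ref_allele, read_len=8)
alt_reads = reads_overlapping_snp(reference, snp_pos, alt_allele, read_len=8)

# Against the unmasked reference, only alternative-allele reads carry a mismatch.
unmasked_ref = [mismatches(r, reference, s) for s, r in ref_reads]
unmasked_alt = [mismatches(r, reference, s) for s, r in alt_reads]
# Against the masked reference, both alleles carry exactly one mismatch.
masked_ref = [mismatches(r, masked, s) for s, r in ref_reads]
masked_alt = [mismatches(r, masked, s) for s, r in alt_reads]
```

In this toy setting masking equalizes the mismatch count of the two alleles. As the abstract notes, that alone does not guarantee equal mapping in real data, because a read carrying one allele may still match elsewhere in the genome better than a read carrying the other.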
Project description:RNA-Seq on two YRI HapMap cell lines. Each individual was sequenced on two lanes of the Illumina Genome Analyzer.
Project description:In this study we use RNAseq to explore allele-specific expression (ASE) in adipose tissue of male and female F1 mice produced from reciprocal crosses of the C57BL/6J and DBA/2J strains. Comparing the identified cis-eQTLs with local-eQTLs obtained from adipose tissue expression in two previous population-based studies in our laboratory yields poor overlap between the two mapping approaches, while the two local-eQTL studies show highly concordant results. Specifically, the local-eQTL studies show ~60% overlap between themselves, while only 15-20% of local-eQTLs are identified as cis by ASE, and less than 50% of ASE genes are recovered in the local-eQTL studies. Utilizing recently published ENCODE data, we also find that ASE genes show a significant, ASE-direction-specific bias in SNP prevalence in DNase I hypersensitive sites. We suggest a new approach to the analysis of allele-specific expression that is more sensitive and accurate than the commonly used Fisher or chi-square statistics. Our analysis indicates that technical differences between the cis and local-eQTL approaches, such as differences in genomic background or sex specificity, account for a relatively small fraction of the discrepancy. Therefore, we suggest that the differences between the two eQTL mapping approaches may facilitate sorting of SNP-eQTL interactions into true cis and trans, and that a considerable portion of local-eQTLs may actually represent trans interactions. Four samples (male and female, BxD and DxB; adipose RNA pooled from 3 animals per pool) were analyzed with high-coverage RNAseq.
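For orientation, the "commonly used" baseline mentioned above can be sketched as a test of whether reads at a heterozygous SNP split 50:50 between the two alleles. The sketch below is not the new approach the authors propose, which is not described here. It shows a chi-square goodness-of-fit statistic and an exact binomial test in plain Python; the function names and read counts are hypothetical.

```python
import math


def chi_square_ase(ref_count, alt_count):
    """Chi-square goodness-of-fit test (1 degree of freedom) against a 50:50 split.
    Returns (statistic, p_value)."""
    n = ref_count + alt_count
    if n == 0:
        raise ValueError("no reads cover this SNP")
    expected = n / 2
    stat = ((ref_count - expected) ** 2 + (alt_count - expected) ** 2) / expected
    # For 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))


def binomial_ase(ref_count, alt_count):
    """Exact two-sided binomial test against p = 0.5.
    The null distribution is symmetric, so the p-value is twice the smaller tail."""
    n = ref_count + alt_count
    if n == 0:
        raise ValueError("no reads cover this SNP")
    k = min(ref_count, alt_count)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts: 60 reads carry the C57BL/6J allele, 40 the DBA/2J allele.
stat, p_chi = chi_square_ase(60, 40)
p_exact = binomial_ase(60, 40)
```

Both tests treat each read as an independent draw, which is one reason simple count-based statistics can be miscalibrated for ASE when reads are overdispersed or mapping is biased.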
Project description:In this experiment, we asked how the allelic distribution of active and repressive chromatin marks in clonal cell lines relates to transcriptional allelic bias. A multiplexed padlock probe approach (Zhang et al., 2009) enabled us to assess allelic bias at heterozygous exonic SNPs in two clones with the GM12878 genotype and four clones from GM13130 cells. We used this approach to assess allelic bias in H3K27me3 and H3K36me3 ChIP samples simultaneously with cDNA from the same cells, as well as in ChIP input and genomic DNA controls. In order to pool data from two individuals, one of which (GM13130) lacked complete genotypes for the parents, we assessed SNP bias in terms of reference and alternative alleles (rather than maternal or paternal bias). SNPs in cDNA were assigned to one of three bins: reference allele bias; no bias; and alternative allele bias. For these groups, allelic bias in H3K27me3 (Fig. 4B) and H3K36me3 (Fig. 4C) was determined. In unbiased loci, both H3K27me3 and H3K36me3 were equally represented. In contrast, preferential expression of an allele was associated with elevated levels of H3K36me3 and decreased levels of H3K27me3 on that allele. Both effects were highly significant (p<2x10e-9). Genes predicted to have monoallelic expression (MAE) were about four-fold over-represented among genes where SNPs showed significant bias (Fig. 4D). SNPs with skewed H3K27me3 and H3K36me3 distribution were highly enriched in the genes predicted as MAE (p<10e-6 and p=0.01, respectively; two-tailed Fisher's exact test). This suggests that the asymmetric distribution of the histone modifications is to a large extent due to the genes that have the chromatin signature of monoallelic expression. Samples analyzed were: (A) polyclonal cell line GM12878 and clones derived from it (DF1 and DF2); (B) polyclonal GM13130 (H0) and clones derived from it (H7, H14 and H16). gDNA, cDNA, ChIP material and input were used.
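The enrichment statistic named above, a two-tailed Fisher's exact test on a 2x2 table, can be sketched in plain Python as follows. The table layout and the counts are hypothetical and are not taken from the study; the function name `fisher_exact_two_tailed` is illustrative.

```python
import math


def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed Fisher's exact test for the 2x2 table [[a, b], [c, d]].
    Sums the hypergeometric probabilities of all tables with the same margins
    that are no more likely than the observed table."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = math.comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return math.comb(row1, x) * math.comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance so ties are not lost to floating-point rounding.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical counts, for illustration only:
#                       skewed mark   unskewed mark
# MAE-predicted genes        8              2
# other genes                1              5
p_value = fisher_exact_two_tailed(8, 2, 1, 5)
```

With these toy counts the three tables at least as extreme as the observed one contribute 400 of the 11440 equally weighted arrangements, giving a p-value of about 0.035.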
Project description:Though sequence differences between alleles are often limited to a few polymorphisms, these differences can cause large and widespread allelic variation at the expression level. Such allele-specific expression (ASE) has been extensively explored at the level of transcription but not translation. Here we measured ASE in the diploid yeast Candida albicans at both the transcriptional and translational levels using RNA-seq and ribosome profiling, respectively. Since C. albicans is an obligate diploid, our analysis isolates ASE arising from cis elements in a natural, non-hybrid organism, where allelic effects reflect evolutionary forces. Importantly, we find that ASE arising from translation is of a similar magnitude as transcriptional ASE, both in terms of the number of genes affected and the magnitude of the bias. We further observe coordination between ASE at the levels of transcription and translation for single genes. Specifically, reinforcing relationships, in which transcription and translation favor the same allele, are more frequent than expected by chance, consistent with selective pressure tuning ASE at multiple regulatory steps. Finally, we parameterize alleles based on a range of properties and find that SNP location and predicted mRNA-structure stability are associated with translational ASE in cis. Since this analysis probes more than 4,000 allelic pairs spanning a broad range of variations, our data provide a genome-wide view into the relative impacts of cis elements that regulate translation. Two biological replicates of WT Candida albicans ribosome profiling and RNA-seq.
Project description:The aim of this study was to compare the power to detect associations between SNPs using cis-eQTL mapping and allele-specific expression (ASE) analysis.
Project description:A High Density Rice Array (HDRA) was developed as an Affymetrix Custom GeneChip Array by the McCouch Rice Lab at Cornell University. The HDRA assays 700,000 SNPs, or approximately one SNP every 0.54 kb across the rice genome (genome size = 380 Mb). It was designed to capture most of the haplotype variation observed in a discovery panel consisting of 16M SNPs (generated by sequencing 125 rice genomes at ~7X genome coverage) and to maximize the inclusion of non-synonymous SNPs. Six probes per SNP target were designed as 3 A-allele and 3 B-allele probes at offsets from center ranging from -6 to +6. A small fraction of SNPs have only 4 probes (2 A-allele, 2 B-allele). For all SNPs, the “A” allele is the reference allele (Os-Nipponbare-Reference-IRGSP-1.0 assembly). Additionally, we designed 23,656 25-bp probes complementary to invariant regions of the genome, which were used to normalize systematic differences between samples. An estimated 45% of HDRA SNPs map within genes, hitting all 39,045 unique, non-TE rice gene models (MSUv7 rice genome annotation, GFF3 file, Feb. 7, 2012, http://rice.plantbiology.msu.edu/), while 55% of SNPs map to intergenic regions. Non-synonymous SNPs are found in 91% of unique, non-TE gene models, and 57% of genic SNPs are distributed within exons, 36% within introns, 5% within 5’ UTRs and 2% within 3’ UTRs. Of the intergenic SNPs, 40% are located in putative regulatory regions within 2 kb of a transcriptional start site.
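The density figure quoted above follows from simple arithmetic, sketched here as a quick check. The variable names are illustrative, and the counts derived from the stated percentages are approximate because the percentages themselves are rounded estimates.

```python
# Figures as stated in the description.
genome_size_bp = 380_000_000   # rice genome size, 380 Mb
n_snps = 700_000               # SNPs assayed by the HDRA

# Average spacing between assayed SNPs: about 543 bp, i.e. roughly 0.54 kb.
spacing_kb = genome_size_bp / n_snps / 1000

# Approximate counts implied by the stated percentages.
genic_snps = round(n_snps * 0.45)                  # about 315,000 within genes
intergenic_snps = n_snps - genic_snps              # about 385,000 intergenic
near_tss_snps = round(intergenic_snps * 0.40)      # about 154,000 within 2 kb of a TSS
```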
Project description:Genome-wide association studies implicate multiple loci in risk for systemic lupus erythematosus (SLE), but few contain exonic variants, rendering systematic identification of non-coding variants essential to decoding SLE genetics. We utilized SNP-seq and bioinformatic enrichment to interrogate 2180 single-nucleotide polymorphisms (SNPs) from 87 SLE risk loci for potential binding of transcription factors and related proteins from B cells. 52 SNPs that passed initial screening were tested by electrophoretic mobility shift (EMSA) and luciferase reporter assays. To identify binding of transcription factors and/or other nuclear proteins in an allele-determined manner, we employed pulldown using nuclear extract from Daudi cells and silver staining in SNPs that had exhibited allele-specific differential binding by EMSA. Each pulldown product for each allele of the five high-probability SNPs (rs2297550 C/G, rs13213604 C/G, rs276461 T/C, rs9907955 C/T, rs7302634 T/C) was evaluated by mass spectrometry (MS) to identify binding nuclear proteins, yielding a set of candidate proteins for each.