
Dataset Information

IVT-seq reveals extreme bias in RNA-sequencing

ABSTRACT: Background: RNA sequencing (RNA-seq) is a powerful technique for identifying and quantifying transcription and splicing events, both known and novel. However, given its recent development and the proliferation of library construction methods, understanding the bias it introduces is incomplete but critical to realizing its value. Results: Here we present a method, in vitro transcription sequencing (IVT-seq), for identifying and assessing the technical biases in RNA-seq library generation and sequencing at scale. We created a pool of > 1000 in vitro transcribed (IVT) RNAs from a full-length human cDNA library and sequenced them with poly-A and total RNA-seq, the most common protocols. Because each cDNA is full length and we show IVT is incredibly processive, each base in each transcript should be equivalently represented. However, with common RNA-seq applications and platforms, we find ~50% of transcripts have > 2-fold and ~10% have > 10-fold differences in within-transcript sequence coverage. Strikingly, we also find > 6% of transcripts have regions of high, unpredictable sequencing coverage, where the same transcript varies dramatically in coverage between samples, confounding accurate determination of their expression. To get at causal factors, we used a combination of experimental and computational approaches to show that rRNA depletion is responsible for the most significant variability in coverage and that several sequence determinants also strongly influence representation. Conclusions: In sum, these results show the utility of IVT-seq in promoting better understanding of bias introduced by RNA-seq and suggest caution in its interpretation. Furthermore, we find that rRNA depletion is responsible for substantial, unappreciated biases in coverage. Perhaps most importantly, these coverage biases introduced during library preparation suggest exon-level expression analysis may be inadvisable.
Five rRNA-depleted samples with duplicates; one poly-A-selected sample, one total RNA sample, and one plasmid library, all without replicates.
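The within-transcript fold differences quoted in the abstract can be illustrated with a short Python sketch. The record does not say how the authors computed the fold difference, so the definition used here (highest over lowest per-base coverage, with a pseudocount) is an assumption, and the function names `coverage_fold_difference` and `fraction_above` are hypothetical.

```python
def coverage_fold_difference(coverage, pseudocount=1.0):
    """Ratio of the highest to the lowest per-base coverage in one transcript.

    ASSUMPTION: max/min with a pseudocount is an illustrative definition,
    not necessarily the one used in the IVT-seq study.
    """
    if not coverage:
        raise ValueError("coverage must contain at least one position")
    highest = max(coverage) + pseudocount
    lowest = min(coverage) + pseudocount
    return highest / lowest


def fraction_above(fold_values, threshold):
    """Fraction of transcripts whose fold difference exceeds the threshold."""
    if not fold_values:
        return 0.0
    return sum(1 for f in fold_values if f > threshold) / len(fold_values)


# Made-up fold differences for four transcripts, for illustration only.
folds = [1.5, 4.0, 12.0, 3.0]
share_over_2 = fraction_above(folds, 2)
share_over_10 = fraction_above(folds, 10)
```

With real data, `coverage` would be the per-base read depth of one IVT transcript, and the two fractions correspond to the "> 2-fold" and "> 10-fold" categories in the abstract.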

ORGANISM(S): Homo sapiens

SUBMITTER: Nicholas Lahens

PROVIDER: E-GEOD-50445 | biostudies-arrayexpress

REPOSITORIES: biostudies-arrayexpress


Similar Datasets

Project description: To study target sequence specificity, selectivity, and reaction kinetics of Streptococcus pyogenes Cas9 activity, we challenged libraries of random variant targets with purified Cas9::guide RNA complexes in vitro. Cleavage kinetics were nonlinear, with a burst of initial activity followed by slower sustained cleavage. Consistent with other recent analyses of Cas9 sequence specificity, we observe considerable (albeit incomplete) impairment of cleavage for targets mutated in the PAM sequence or in "seed" sequences matching the proximal 8 bp of the guide. A second target region requiring close homology was located at the other end of the guide::target duplex (positions 13-18 relative to the PAM). Strikingly, a subset of variants which broke homology in the intervening region consistently increased the capacity of Cas9 to cleave in extended reactions. Sequences flanking the guide+PAM region had measurable (albeit modest) effects on cleavage. Taken together, these studies provide both a basis for predicting effective cleavage targets and a basis for potential optimization of guide RNAs to yield efficiency beyond that of the simple perfect-match guides. 118 samples analyzed. Controls have "con" in the sample name. To quantitatively measure cleavage efficiency of a single gRNA, we created a population of random variant target sequences for two gRNA targets. The targets used were "unc-22A" (a sequence from the well-characterized unc-22 gene of Caenorhabditis elegans) and "protospacer 4" (ps4), a previously characterized sequence from a natural spacer from S. pyogenes MGAS10750. Using custom mixtures of oligonucleotide precursors for each base during chemical synthesis, a set of polymorphic target libraries ("Random Variant Libraries") was designed to have a baseline variation rate at each position. On each side of the gRNA homology and PAM regions, 6 bp of random sequence were added. The first base of intended gRNA homology is designated base 1.
The entire 35 bp random variant library mixture was cloned into a standard plasmid vector (pHRL-TK). Several thousand colonies from plates were washed in pools and prepared by standard plasmid preparation methods. The complexity of the libraries was estimated based on Illumina sequencing of the uncut libraries and filtering for the minimum representation expected from the pooling. Approximately 1500-3000 unique species were obtained in the unc-22A libraries and 5000 unique sequences in the ps4 library (see Materials and Methods). To assay cleavage, purified Cas9 was first incubated with gRNA, followed by incubation with the variant library for various time points and under various conditions. DNA template is among the conditions varied in the experiments. After protein removal, flanking sequences outside of the target region were used for PCR amplification, and plasmid cleavage was measured through loss of PCR products that span the region of interest. A set of perfectly matched targets and highly mutated versions present in the random variant library served as internal positive and negative controls, respectively. A log retention score for each sequence in each experiment was calculated by quantifying the representation of each sequence before and after addition of the Cas9 protein. Two approaches were used for normalization: first, we used a population of ps4 targets "spiked" into the library as an uncleaved control; second, we used a population of unc-22A targets with large numbers of variations from the perfect target (between 4 and 7), and hence likely limited, if any, cleavage. Equivalent results are obtained with these two normalization approaches (see Computational Methods for details). Retention scores are expressed as the log2 of the normalized ratio, so that a more negative retention score indicates efficient cleavage of substrate while a less negative score indicates less cleavage. Templates which are uncleaved will yield a retention score at or near zero.
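The retention score described above can be sketched in a few lines of Python. The description only states that the score is the log2 of a normalized before/after ratio, so the exact normalization used here (dividing each sequence's after/before ratio by the same ratio for an uncleaved control population) is an assumption, and `retention_score` is a hypothetical name.

```python
import math


def retention_score(before, after, control_before, control_after):
    """log2 of a sequence's after/before ratio, normalized to a control.

    ASSUMPTION: normalization divides by the after/before ratio of an
    uncleaved control population (for example the spiked-in ps4 targets).
    The exact procedure is given in the publication's Computational Methods.
    """
    if min(before, after, control_before, control_after) <= 0:
        raise ValueError("all counts must be positive")
    ratio = after / before
    control_ratio = control_after / control_before
    return math.log2(ratio / control_ratio)


# Made-up read counts, for illustration only.
cleaved = retention_score(1000, 125, 500, 500)    # strongly depleted target
uncleaved = retention_score(800, 800, 500, 500)   # unchanged target
```

A target whose representation drops eightfold relative to the control scores -3, while an unchanged target scores 0, matching the interpretation given in the text.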
Comparisons between multiple experiments indicate strong correlation between independent retention measurements. GSM1410678-GSM1410761: the AF_SOL*.dat files contain the calculated final retentions for each experiment. Each experiment is labeled "AF_SOL_###_t###", where "AF_SOL_###" is the experiment run ID and "t###" is the incubation time of the experiment. For example, AF_SOL_513_t360 corresponds to experiment 513 on the protospacer 4 guide and DNA target, with an incubation time of 360 min. The experimental conditions and ID can be found in the associated publication. GSM1544297-GSM1544332: each unc*.dat file is a tab-delimited file of all sequences considered in each experiment. The names of the files and the AF_SOL_# run number can be found in the associated publication (Supplementary Materials), along with the details of the conditions. Each filename starts with the type of gRNA used (either unc-22WT or the mutant version unc22C11G). The next number (#min) indicates the incubation time for the experiment and is followed either by #pcr_AF_SOL_# or by just AF_SOL_#. If present, #pcr indicates the number of PCR cycles used in the experiment. Finally, AF_SOL_# denotes the sequencing run ID number.
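The AF_SOL_###_t### labelling convention lends itself to a small parser. The Python sketch below handles only that pattern. It assumes every t### value is in minutes, as in the AF_SOL_513_t360 example, and it does not attempt the unc*.dat filenames, whose exact layout is not spelled out here. `RetentionLabel` and `parse_label` are hypothetical names.

```python
import re
from typing import NamedTuple


class RetentionLabel(NamedTuple):
    run_id: int          # experiment run ID (the ### after AF_SOL_)
    incubation_min: int  # incubation time; ASSUMED to be minutes


# Matches labels such as "AF_SOL_513_t360", with or without a ".dat" suffix.
_LABEL = re.compile(r"AF_SOL_(\d+)_t(\d+)")


def parse_label(name):
    """Extract the run ID and incubation time from an AF_SOL label."""
    match = _LABEL.search(name)
    if match is None:
        raise ValueError(f"not an AF_SOL_###_t### label: {name!r}")
    return RetentionLabel(int(match.group(1)), int(match.group(2)))


label = parse_label("AF_SOL_513_t360.dat")
```

Here `label.run_id` is 513 and `label.incubation_min` is 360, in line with the example in the description.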


