Project description:RNA sequencing (RNA-seq) is a widely used high-throughput method for characterizing transcriptomic dynamics spatiotemporally. However, typical RNA-seq data analysis pipelines depend on either a sequenced genome or reference transcripts. This constraint makes RNA-seq challenging to use for species that lack both a sequenced genome and reference transcripts. To solve this problem, we developed CRSP, an RNA-seq pipeline that integrates a multiple-comparative-species strategy and does not depend on a specific sequenced genome or set of reference transcripts. Benchmarking suggests that CRSP quantifies gene expression levels with high accuracy.
Project description:DNA methylation plays critical roles in gene regulation and cellular specification without altering DNA sequences. The wide application of reduced representation bisulfite sequencing (RRBS) and whole genome bisulfite sequencing (bis-seq) opens the door to study DNA methylation at single CpG site resolution. One challenging question is how best to test for significant methylation differences between groups of biological samples in order to minimize false positive findings. Current methods to analyze genome-wide bisulfite sequencing data use a smoothing approach or a simple statistical test based on the binomial distribution. Comparative DNA methylation profiling in AML blasts and normal CD34(+) control cells
Project description:Current pipelines used to map genetrap insertion sites are based on inverse- or splinkerette-PCR methods, which despite their efficacy are prone to artifacts and do not provide information on the impact of the genetrap on the expression of the targeted gene. We developed a new method, which we named TrapSeq, for the mapping of genetrap insertions based on paired-end RNA sequencing. By recognizing chimeric mRNAs containing genetrap sequences spliced to an endogenous exon, our method identifies insertions that lead to productive trapping.
Project description:We investigated the reported binding of the telomere-associated factors TERF1 and TERF2 to internal telomere sites using ChIP-Seq for these two factors in a lymphoblastoid cell line. We mapped over 40 million reads for each sample to a custom reference genome that incorporates our subtelomere assembly, and generated signal tracks both from uniquely mapping reads only and from a multimapping pipeline we developed. We find that peaks are misshapen and made up of reads that cannot be distinguished from true telomere sequence. Removing reads identified as telomeric removes all internal signal. Examination of TRF1 and TRF2
Project description:RNA-Sequencing is a transformative method that captures the quantitative dynamics of a transcriptome with exquisite sensitivity and single-base resolution. There are, however, few computational pipelines for RNA-Seq whose statistical tests are robust and powerful enough for the difficult combination of small sample sizes and high variability in sequence read counts. To this end, we developed GENE-counter, a complete software pipeline for analyzing RNA-Seq data for genome-wide expression differences between replicated treatment groups. One important component of GENE-counter is a statistical test based on the NBP parameterization of the negative binomial distribution for identifying differentially expressed genome features. We used GENE-counter to analyze RNA-Seq data derived from Arabidopsis thaliana infected with a strain of defense-eliciting bacteria. We identified 308 genes that were differentially induced. Using alternative methods, we provided support for the induced expression and biological relevance of a substantial proportion of the genes. These results suggest that the NBP parameterization of the negative binomial distribution is well suited for explaining RNA-Seq data and that the statistical test makes GENE-counter a powerful pipeline for studying genome-wide expression changes. GENE-counter is freely available at http://changlab.cgrb.oregonstate.edu/. Our RNA-seq data are available in the NCBI Short Read Archive (SRA) under accession SRA025952.
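The abstract above names the NBP parameterization without stating it. As a rough illustration only, the sketch below assumes the mean-variance form commonly given for NBP in the literature, Var(Y) = mu + phi * mu^alpha, which is not spelled out in this record. The function names `nbp_variance` and `nbp_dispersion` are hypothetical and are not part of GENE-counter.

```python
import math


def nbp_variance(mu: float, phi: float, alpha: float) -> float:
    """Variance of a read count with mean mu under the ASSUMED NBP form.

    Assumption (not taken from the abstract): Var(Y) = mu + phi * mu**alpha.
    With alpha = 2 this reduces to the usual negative binomial variance
    mu + phi * mu**2; the mu term alone is the Poisson variance.
    """
    if mu < 0 or phi < 0:
        raise ValueError("mu and phi must be non-negative")
    return mu + phi * mu ** alpha


def nbp_dispersion(mu: float, phi: float, alpha: float) -> float:
    """Dispersion implied by the same assumed form: phi * mu**(alpha - 2)."""
    if mu <= 0:
        raise ValueError("mu must be positive")
    return phi * mu ** (alpha - 2)


if __name__ == "__main__":
    # Illustrative values only; these are not estimates from the study.
    for mean_count in (10.0, 100.0, 1000.0):
        var = nbp_variance(mean_count, phi=0.1, alpha=2.0)
        print(f"mean={mean_count:>7.1f}  variance={var:.1f}")
```

Under this assumed form, the extra parameter alpha lets the dispersion change with the mean count, which is one way to accommodate the high variability in read counts that the abstract mentions.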
Project description:Metagenomic data compression is increasingly important because metagenomic projects now face both larger data volumes per sample and more samples. Reference-based compression is a promising method for obtaining a high compression ratio. However, existing microbial reference genome databases are not suitable for direct use as compression references because of their large size and redundancy, and different metagenomic cohorts often have different microbial compositions. We present a novel pipeline that generates simplified and tailored reference genomes for large metagenomic cohorts, enabling reference-based compression of metagenomic data. We constructed customized reference genomes, ranging from 2.4 to 3.9 GB, for 29 real metagenomic datasets and evaluated their compression performance. Reference-based compression achieved a compression ratio of over 20 for human whole-genome data and up to 33.8 for all samples, a 4.5-fold improvement over standard Gzip compression. Our method provides new insights into reference-based metagenomic data compression and has broad application potential for faster and cheaper data transfer, storage, and analysis.
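For readers unfamiliar with the metric, the sketch below shows how a compression ratio is conventionally computed (original size divided by compressed size), using Python's standard `gzip` module as the baseline. It does not implement the reference-based method described above. The helper names and the toy input are hypothetical, and the assumption that the abstract uses this same definition of ratio is ours.

```python
import gzip


def compression_ratio(raw: bytes, compressed: bytes) -> float:
    """Original size divided by compressed size (higher is better)."""
    if not compressed:
        raise ValueError("compressed data must not be empty")
    return len(raw) / len(compressed)


def gzip_ratio(raw: bytes) -> float:
    """Compression ratio achieved by gzip on the given bytes."""
    return compression_ratio(raw, gzip.compress(raw))


if __name__ == "__main__":
    # Toy, highly repetitive sequence; real metagenomic reads compress far less.
    toy_reads = b"ACGTACGTTAGGCATG" * 4096
    print(f"gzip ratio on toy data: {gzip_ratio(toy_reads):.1f}")
```

A ratio measured this way on real FASTQ data would be the baseline against which a reference-based compressor is compared.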
Project description:DamID is a powerful technique for identifying regions of the genome bound by a DNA-binding (or DNA-associated) protein. Currently no method exists for automatically processing next-generation sequencing DamID (DamID-seq) data, and the use of DamID-seq datasets with normalisation based on read-counts alone can lead to high background and the loss of bound signal. DamID-seq thus presents novel challenges in terms of normalisation and background minimisation. We describe here damidseq_pipeline, a software pipeline that performs automatic normalisation and background reduction on multiple DamID-seq FASTQ or BAM datasets. Single replicate profiling of pol II occupancy in 3rd instar larval neuroblasts of Drosophila
Project description:Primary objectives: The primary objective is to investigate circulating tumor DNA (ctDNA) via deep sequencing for mutation detection and by whole genome sequencing for copy number analyses before start (baseline) with regorafenib and at defined time points during administration of regorafenib for treatment efficacy in colorectal cancer patients in terms of overall survival (OS).
Primary endpoints: circulating tumor DNA (ctDNA) via deep sequencing for mutation detection and by whole genome sequencing for copy number analyses before start (baseline) with regorafenib and at defined time points during administration of regorafenib for treatment efficacy in colorectal cancer patients in terms of overall survival (OS).
Project description:Motivation: Metagenomics is a powerful tool for assaying the DNA from every genome present in an environment. Recent advances in bioinformatics have enabled the rapid assembly of near-complete metagenome-assembled genomes (MAGs), and there is a need for reproducible pipelines that can annotate and characterize thousands of genomes simultaneously, to enable identification and functional characterization. Results: Here we present MAGpy, a scalable and reproducible pipeline that takes multiple genome assemblies as FASTA and compares them to several public databases, checks quality, suggests a taxonomy and draws a phylogenetic tree. Availability and implementation: MAGpy is available on github: https://github.com/WatsonLab/MAGpy. Supplementary information: Supplementary data are available at Bioinformatics online.