Project description:In Europe, ticks are the most important vectors of diseases threatening humans, livestock, wildlife and companion animals. Nevertheless, genomic sequence information and functional annotation of proteins for the most important European tick, Ixodes ricinus, are limited. Here we present the first analysis of the I. ricinus genome and of the transcriptome of the unfed I. ricinus midgut. We combined and integrated data from genome, transcriptome and proteome. The de novo assembly of 1 billion paired-end sequences identified 6,415 putative genes, providing an unprecedented insight into the I. ricinus genome. By mapping our midgut mRNA reads to the assembled contigs, we estimate that the assembly covers around two thirds of the unique genomic sequences. In addition, more than 10,000 transcripts from naïve midgut were annotated for function and/or localization. By combining an alignment-based with a motif-search-based annotation approach, we could double the number of annotations throughout all groups without shifting the dataset. Moreover, 1,175 proteins expressed in the naïve midgut were identified by mass spectrometry, confirming the high completeness of our transcriptome database, and 608 were significantly annotated for function and/or localization. This multiple-omics study vastly extends the publicly available DNA, RNA and protein databases for I. ricinus and ticks in general.
Project description:The incomplete genome annotation of non-model organisms hampers molecular and proteomic studies. Proteomics informed by transcriptomics (PIT) is suited to non-model organisms because peptides are identified using transcriptomic, not genomic, data. Aedes aegypti is the mosquito vector for the (re-)emerging dengue, chikungunya, yellow fever and Zika viruses. An Ae. aegypti genome sequence is available; however, experimental evidence is lacking for >90% of the Ae. aegypti proteome and for the activity of the transposable elements (TEs) that constitute 50% of the Ae. aegypti genome. We used PIT to characterise the proteome of the Ae. aegypti-derived cell line Aag2. We identified hotspots of incomplete genome annotation that are not explained by poor sequence and assembly quality. We developed criteria for the characterisation of proteomically active TEs and demonstrate that protein expression does not correlate with a TE’s genomic abundance. Finally, we identify Phasi Charoen-like virus as an unrecognised contaminant of Aag2 cells. We therefore present the first proteomic characterisation of mobile genetic elements, and provide proof-of-principle that PIT can evaluate a genome’s annotation to guide annotation efforts.
Project description:Macaque species share over 93% genome homology with humans and develop many disease phenotypes similar to those of humans, making them valuable animal models for the study of human diseases (e.g., HIV and neurodegenerative diseases). However, the quality of genome assembly and annotation for several macaque species lags behind the human genome effort. To close this gap and enhance functional genomics approaches, we employed a combination of de novo linked-read assembly and scaffolding using proximity ligation assay (HiC) to assemble the pig-tailed macaque (Macaca nemestrina) genome. This combinatorial method yielded chromosome-level scaffolds with a scaffold N50 of 127.5 Mb; the 23 largest scaffolds covered 90% of the entire genome. This assembly revealed large-scale rearrangements between pig-tailed macaque chromosomes 7, 12, and 13 and human chromosomes 2, 14, and 15. We subsequently annotated the genome using transcriptome and proteomics data from personalized induced pluripotent stem cells (iPSCs) derived from the same animal. Reconstruction of the evolutionary tree using whole genome annotation and orthologous comparisons among three macaque species, human and mouse genomes revealed extensive homology between human and pig-tailed macaques with regard to both pluripotent stem cell genes and innate immune gene pathways. Our results confirm that rhesus and cynomolgus macaques exhibit a closer evolutionary distance to each other than either species exhibits to humans or pig-tailed macaques. These findings demonstrate that pig-tailed macaques can serve as an excellent animal model for the study of many human diseases, particularly with regard to pluripotency and innate immune pathways.
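The scaffold N50 and "largest scaffolds covering 90% of the genome" statistics quoted in the macaque assembly description can be illustrated with a minimal sketch. It assumes the usual definition of N50 (the length of the scaffold at which the running total of scaffold lengths, sorted from longest to shortest, first reaches half of the assembly size). The function names and the toy lengths are hypothetical and are not taken from the project's data or pipeline.

```python
from typing import Sequence


def nxx(lengths: Sequence[int], fraction: float = 0.5) -> int:
    """Return the Nxx statistic (N50 when fraction is 0.5).

    Scaffolds are sorted from longest to shortest; the result is the
    length of the scaffold at which the cumulative length first reaches
    `fraction` of the total assembly length.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    return ordered[-1]


def scaffolds_to_cover(lengths: Sequence[int], fraction: float = 0.9) -> int:
    """Return how many of the largest scaffolds are needed to cover
    `fraction` of the total assembly length."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return count
    return len(ordered)


# Hypothetical toy assembly (lengths in Mb), not the macaque scaffolds.
toy_lengths = [100, 80, 60, 40, 20]
toy_n50 = nxx(toy_lengths)                        # N50 of the toy assembly
toy_count = scaffolds_to_cover(toy_lengths, 0.9)  # scaffolds needed for 90%
print(toy_n50, toy_count)
```

Applied to real scaffold lengths, the same two calls would give figures of the kind reported above (a scaffold N50 and the number of largest scaffolds covering 90% of the genome).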
Project description:With the creation of accurate, chromosome-scale genomes, the next challenge facing the genomics community is the accurate identification of transcriptional units, distinguishing them from aberrant transcriptional noise. This has proven difficult because annotation by traditional means, such as short-read RNA-seq followed by transcriptome assembly, is prone to generating in silico artifacts. To address this issue, we took advantage of epigenomic data in the form of ChIP-seq to annotate plant genomes in an unbiased manner, identifying both potential annotation issues and novel genes. Histone modifications appear in the genome in a reproducible and predictable manner, making them an ideal resource to use in annotation. Trimethylation of histone 3 lysine 4 (H3K4me3), as well as acetylation of histone 3 lysine 56, is well documented to coincide with initiation of transcription by polymerase II (Pol II) at promoter sequences. These initiation marks, paired with marks deposited across the gene body during transcriptional elongation, such as histone 3 lysine 36 tri-methylation (H3K36me3) and histone 3 lysine 4 mono-methylation (H3K4me1), offer a framework to begin identifying complete transcriptional units. We leveraged these data on a genome-wide scale, allowing for identification of annotations discordant with empirical data. In total, 13,159 potential annotation issues were found in Zea mays across three different tissues, which were corroborated using complementary RNA-based approaches. Upon correction and validation, genes were extended by an average of 2,128 base pairs, and the length of discovered novel genes was 1,962 base pairs. Application of this method to five additional plant genomes revealed a variety of novel gene annotations, including 13,836 in Asparagus officinalis, 2,724 in Setaria viridis, 2,446 in Sorghum bicolor, 8,631 in Glycine max, and 2,585 in Phaseolus vulgaris.
Project description:The skin commensal yeast Malassezia is associated with several skin disorders. To establish a reference resource, we sought to determine the complete genome sequence of Malassezia sympodialis and identify its protein-coding genes. A novel genome annotation workflow combining RNA sequencing, proteomics, and manual curation was developed to determine gene structures with high accuracy.
Project description:With the emergence of zebrafish as an important model organism, a concerted effort has been made to study its transcriptome. This effort is limited by gaps in zebrafish annotation, which are especially pronounced for transcripts dynamically expressed during zygotic genome activation (ZGA). To date, short-read sequencing has been the principal technology for zebrafish transcriptome annotation. In part because these sequence reads are too short for assembly methods to resolve the full complexity of the transcriptome, the current annotation is rudimentary. By providing direct observation of full-length transcripts, recently refined long-read sequencing platforms can dramatically improve annotation coverage and accuracy. Here, we leveraged the SMRT platform to study the early ZGA-stage zebrafish transcriptome. Our analysis revealed additional novelty and complexity in the zebrafish transcriptome, identifying 2748 high-confidence novel transcripts that originated from previously unannotated loci and 1835 new isoforms in previously annotated genes.
Project description:This dataset includes RNAseq data of 7 tissues/developmental stages of Lathyrus sativus genotype LSWT11 and 2 tissues with drought and well-watered treatments of Lathyrus sativus genotypes LS007 and Mahateora. These data were used in the functional annotation pipeline of the Rbp1.0 genome assembly of LS007. The multi-tissue transcriptome was also used to support gene candidate identification by mRNA abundance. Also included is Hi-C sequencing data used to scaffold the assembly into pseudochromosomes.