Dataset Information

Direct determination of diploid genome sequences.

ABSTRACT: Determining the genome sequence of an organism is challenging, yet fundamental to understanding its biology. Over the past decade, thousands of human genomes have been sequenced, contributing deeply to biomedical research. In the vast majority of cases, these have been analyzed by aligning sequence reads to a single reference genome, biasing the resulting analyses, and in general, failing to capture sequences novel to a given genome. Some de novo assemblies have been constructed free of reference bias, but nearly all were constructed by merging homologous loci into single "consensus" sequences, generally absent from nature. These assemblies do not correctly represent the diploid biology of an individual. In exactly two cases, true diploid de novo assemblies have been made, at great expense. One was generated using Sanger sequencing, and one using thousands of clone pools. Here, we demonstrate a straightforward and low-cost method for creating true diploid de novo assemblies. We make a single library from ~1 ng of high molecular weight DNA, using the 10x Genomics microfluidic platform to partition the genome. We applied this technique to seven human samples, generating low-cost HiSeq X data, then assembled these using a new "pushbutton" algorithm, Supernova. Each computation took 2 d on a single server. Each yielded contigs longer than 100 kb, phase blocks longer than 2.5 Mb, and scaffolds longer than 15 Mb. Our method provides a scalable capability for determining the actual diploid genome sequence in a sample, opening the door to new approaches in genomic biology and medicine.

SUBMITTER: Weisenfeld NI

PROVIDER: S-EPMC5411770 | biostudies-literature | 2017 May

REPOSITORIES: biostudies-literature

Publications

Direct determination of diploid genome sequences.

Weisenfeld NI, Kumar V, Shah P, Church DM, Jaffe DB

Genome Research, 2017-04-05, issue 5

PMID: 28381613

Similar Datasets

Project description:

Background: Almost all genome sequencing projects neglect the fact that diploid organisms contain two genome copies and consequently what is published is a composite of the two. This means that the relationship between alternate alleles at two or more linked loci is lost. We have developed a simplified method of directly obtaining the haploid sequences of each genome copy from an individual organism.

Results: The diploid sequences of three groups of cattle samples were obtained using a simple sample preparation procedure requiring only a microscope and a haemocytometer. Samples were: 1) lymphocytes from a single Angus steer; 2) sperm cells from an Angus bull; 3) lymphocytes from East African Zebu (EAZ) cattle collected and processed in a field laboratory in Eastern Kenya. Haploid sequence from a fosmid library prepared from lymphocytes of an EAZ cow was used for comparison. Cells were serially diluted to a concentration of one cell per microlitre by counting with a haemocytometer at each dilution. One microlitre samples, each potentially containing a single cell, were lysed and divided into six aliquots (except for the sperm samples, which were not divided into aliquots). Each aliquot was amplified with phi29 polymerase and sequenced. Contigs were obtained by mapping to the bovine UMD3.1 reference genome assembly, and scaffolds were assembled by joining adjacent contigs that were within a threshold distance of each other. Scaffolds that appeared to contain artefacts of CNV or repeats were filtered out, leaving scaffolds with an N50 length of 27-133 kb and 88-98% genome coverage. SNP haplotypes were assembled with the Single Individual Haplotyper program to generate an N50 size of 97-201 kb but only ~27-68% genome coverage. This method can be used in any laboratory with no special equipment at only slightly higher costs than conventional diploid genome sequencing. A substantial body of software for analysis and workflow management was written and is available as supplementary data.

Conclusions: We have developed a set of laboratory protocols and software tools that will enable any laboratory to obtain haplotype sequences at only modestly greater cost than traditional mixed diploid sequences.
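The scaffolding step and the N50 statistic mentioned in the description above can be illustrated with a minimal sketch. This is not the authors' software: the function names `join_contigs` and `n50`, the interval representation, and the `max_gap` value are hypothetical, and the actual threshold distance is not stated in the description.

```python
from typing import List, Tuple

# Assumption: a contig is a half-open (start, end) interval on one chromosome.
Interval = Tuple[int, int]


def join_contigs(contigs: List[Interval], max_gap: int) -> List[Interval]:
    """Join position-sorted contigs into scaffolds when the gap to the
    previous scaffold end is at most max_gap (hypothetical threshold)."""
    scaffolds: List[Interval] = []
    for start, end in sorted(contigs):
        if scaffolds and start - scaffolds[-1][1] <= max_gap:
            prev_start, prev_end = scaffolds[-1]
            # Extend the current scaffold to cover this contig.
            scaffolds[-1] = (prev_start, max(prev_end, end))
        else:
            # Gap too large (or first contig): start a new scaffold.
            scaffolds.append((start, end))
    return scaffolds


def n50(lengths: List[int]) -> int:
    """Largest length L such that pieces of length >= L together
    cover at least half of the total length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


# Toy coordinates, invented for illustration only.
contigs = [(0, 100), (150, 300), (1000, 1200), (5000, 5050)]
scaffolds = join_contigs(contigs, max_gap=100)
scaffold_n50 = n50([end - start for start, end in scaffolds])
```

With these toy values the first two contigs are joined because their 50 bp gap is within the assumed 100 bp threshold, while the remaining contigs stay separate.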
