Dataset Information

Global repeat discovery and estimation of genomic copy number in a large, complex genome using a high-throughput 454 sequence survey.

ABSTRACT: BACKGROUND: Extensive computational and database tools are available to mine genomic and genetic databases for model organisms, but little genomic data is available for many species of ecological or agricultural significance, especially those with large genomes. Genome surveys using conventional sequencing techniques are powerful, particularly for detecting sequences present in many copies per genome. However, these methods are time-consuming and have potential drawbacks. High-throughput 454 sequencing provides an alternative method by which much information can be gained quickly and cheaply from high-coverage surveys of genomic DNA. RESULTS: We sequenced 78 million base pairs of randomly sheared soybean DNA that passed our quality criteria. Computational analysis of the survey sequences provided global information on the abundant repetitive sequences in soybean. The sequence was used to determine the copy number across regions of large genomic clones or contigs and to discover higher-order structures within satellite repeats. We have created an annotated, online database of sequences present in multiple copies in the soybean genome. The low bias of pyrosequencing against repeat sequences is demonstrated by the overall composition of the survey data, which matches well with past estimates of repetitive DNA content obtained by DNA re-association kinetics (Cot analysis). CONCLUSION: This approach provides a potential aid to conventional or shotgun genome assembly, by allowing rapid assessment of copy number in any clone or clone-end sequence. In addition, we show that partial sequencing can provide access to partial protein-coding sequences.
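The abstract does not give the formula used to turn survey reads into copy number. The sketch below shows one generic read-depth approach, not necessarily the authors' method: divide the depth of survey sequence observed over a region by the depth expected for a single-copy sequence. The function names, the genome size, and the example numbers are illustrative assumptions; only the 78 Mbp survey size comes from the abstract.

```python
def expected_coverage(survey_bp: float, genome_bp: float) -> float:
    """Average depth a single-copy sequence would receive from the survey."""
    return survey_bp / genome_bp


def estimate_copy_number(matched_bp: float, region_length_bp: float,
                         survey_bp: float, genome_bp: float) -> float:
    """Estimate genomic copy number of a region from survey read depth.

    matched_bp       -- total survey bases aligning to the region
    region_length_bp -- length of the clone/contig region examined
    survey_bp        -- total survey size (78 Mbp in the abstract)
    genome_bp        -- genome size (ASSUMED value, not from the abstract)
    """
    observed_depth = matched_bp / region_length_bp
    return observed_depth / expected_coverage(survey_bp, genome_bp)


if __name__ == "__main__":
    SURVEY_BP = 78_000_000        # from the abstract
    GENOME_BP = 1_000_000_000     # hypothetical round number for illustration
    # Hypothetical region: 1 kb window with 7.8 kb of matching survey sequence.
    print(estimate_copy_number(7_800, 1_000, SURVEY_BP, GENOME_BP))
```

Because the survey covers only a small fraction of the genome, estimates of this kind are informative mainly for sequences present in many copies; single-copy regions will often receive no reads at all.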

SUBMITTER: Swaminathan K

PROVIDER: S-EPMC1894642 | biostudies-literature | 2007

REPOSITORIES: biostudies-literature

Publications

Global repeat discovery and estimation of genomic copy number in a large, complex genome using a high-throughput 454 sequence survey.

Kankshita Swaminathan, Kranthi Varala, Matthew E. Hudson

BMC Genomics, 2007-05-24

PMID: 17524145

Similar Datasets

Project description:BACKGROUND: The beta-defensin gene cluster (DEFB) at chromosome 8p23.1 is one of the most copy number (CN) variable regions of the human genome. Whereas individual DEFB CNs have been suggested as independent genetic risk factors for several diseases (e.g. psoriasis and Crohn's disease), the role of multisite sequence variations (MSV) is less well understood and to date has only been reported for prostate cancer. Simultaneous assessment of MSVs and CNs can be achieved by PCR, cloning and Sanger sequencing, however, these methods are labour and cost intensive as well as prone to methodological bias introduced by bacterial cloning. Here, we demonstrate that amplicon sequencing of pooled individual PCR products by the 454 technology allows in-depth determination of MSV haplotypes and estimation of DEFB CNs in parallel. RESULTS: Six PCR products spread over approximately 87 kb of DEFB and harbouring 24 known MSVs were amplified from 11 DNA samples, pooled and sequenced on a Roche 454 GS FLX sequencer. From approximately 142,000 reads, approximately 120,000 haplotype calls (HC) were inferred that identified 22 haplotypes ranging from 2 to 7 per amplicon. In addition to the 24 known MSVs, two additional sequence variations were detected. Minimal CNs were estimated from the ratio of HCs and compared to absolute CNs determined by alternative methods. Concordance in CNs was found for 7 samples, the CNs differed by one in 2 samples and the estimated minimal CN was half of the absolute in one sample. For 7 samples and 2 amplicons, the 454 haplotyping results were compared to those by cloning/Sanger sequencing. Intrinsic problems related to chimera formation during PCR and differences between haplotyping by 454 and cloning/Sanger sequencing are discussed. 
CONCLUSION: Deep amplicon sequencing using the 454 technology yields thousands of HCs per amplicon for an affordable price and may represent an effective method for parallel haplotyping and CN estimation in small to medium-sized cohorts. The obtained haplotypes represent a valuable resource to facilitate further studies of the biomedical impact of highly CN variable loci such as the beta-defensin locus.
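The description states that minimal CNs were estimated from the ratio of haplotype calls but does not spell out the procedure. One way to read that idea, offered only as a sketch under assumed parameters: find the smallest total copy number for which every haplotype's share of the calls is close to a whole number of copies. The function name, the tolerance `tol`, the cap `max_cn`, and the example counts are all hypothetical.

```python
def minimal_copy_number(hap_counts, max_cn=12, tol=0.15):
    """Smallest total copy number consistent with haplotype call ratios.

    hap_counts -- dict mapping haplotype label to number of haplotype calls
    max_cn     -- largest total copy number to try (assumed cap)
    tol        -- allowed deviation, in copies, from a whole number (assumed)
    Returns (total_copies, copies_per_haplotype) or None if nothing fits.
    """
    total_calls = sum(hap_counts.values())
    fractions = {h: c / total_calls for h, c in hap_counts.items()}
    # Each observed haplotype needs at least one copy, so start there.
    for n in range(len(hap_counts), max_cn + 1):
        copies = {h: round(f * n) for h, f in fractions.items()}
        if min(copies.values()) < 1 or sum(copies.values()) != n:
            continue
        if all(abs(fractions[h] * n - copies[h]) <= tol for h in copies):
            return n, copies
    return None


if __name__ == "__main__":
    # Hypothetical amplicon: calls split roughly 2:1:1 among three haplotypes.
    print(minimal_copy_number({"hapA": 590, "hapB": 310, "hapC": 300}))
```

The result is a lower bound: call ratios of 2:1:1 fit four copies equally as well as eight, which is consistent with the description's report of one sample whose estimated minimal CN was half of the absolute CN.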

Project description:Carbapenemase production is one of the leading mechanisms of carbapenem resistance in Gram-negative bacteria. An increase in carbapenemase gene (blaCarb) copies is an important mechanism of carbapenem resistance. No currently available bioinformatics tools allow for reliable detection and reporting of carbapenemase gene copy numbers. Here, we describe the carbapenemase-encoding gene copy number estimator (CCNE), a ready-to-use bioinformatics tool that was developed to estimate blaCarb copy numbers from whole-genome sequencing data. Its performance on Klebsiella pneumoniae carbapenemase gene (blaKPC) copy number estimation was evaluated by simulation and quantitative PCR (qPCR), and the results were compared with available algorithms. CCNE has two components, CCNE-acc and CCNE-fast. CCNE-acc detects blaCarb copy number in a comprehensive and high-accuracy way, while CCNE-fast rapidly screens blaCarb copy numbers. CCNE-acc achieved the best accuracy (100%) and the lowest root mean squared error (RMSE; 0.07) in simulated noise data sets, compared to the assembly-based method (23.4% accuracy, 1.697 RMSE) and the OrthologsBased method (78.9% accuracy, 0.395 RMSE). In the qPCR validation, a high consistency was observed between the blaKPC copy number determined by qPCR and that determined with CCNE. Reverse transcription-qPCR transcriptional analysis of 40 isolates showed that blaKPC expression was positively correlated with the blaKPC copy numbers detected by CCNE (P < 0.001). An association study of 357 KPC-producing K. pneumoniae isolates and their antimicrobial susceptibility identified a significant association between the estimated blaKPC copy number and MICs of imipenem (P < 0.001) and ceftazidime-avibactam (P < 0.001). Overall, CCNE is a useful genomic tool for the analysis of antimicrobial resistance genes copy number; it is available at https://github.com/biojiang/ccne. 
IMPORTANCE Globally disseminated carbapenem-resistant Enterobacterales is an urgent threat to public health. The most common carbapenem resistance mechanism is the production of carbapenemases. Carbapenemase-producing isolates often exhibit a wide range of carbapenem MICs. Higher carbapenem MICs have been associated with treatment failure. The increase of carbapenemase gene (blaCarb) copy numbers contributes to increased carbapenem MICs. However, blaCarb gene copy number detection is not routinely conducted during a genomic analysis, in part due to the lack of optimal bioinformatics tools. In this study, we describe a ready-to-use tool we developed and designated the carbapenemase-encoding gene copy number estimator (CCNE) that can be used to estimate the blaCarb copy number directly from whole-genome sequencing data, and we extended the data to support the analysis of all known blaCarb genes and some other antimicrobial resistance genes. Furthermore, CCNE can be used to interrogate the correlations between genotypes and susceptibility phenotypes and to improve our understanding of antimicrobial resistance mechanisms.

Project description:BACKGROUND: Detection of DNA copy number alterations (CNAs) is critical to understand genetic diversity, genome evolution and pathological conditions such as cancer. Cancer genomes are plagued with widespread multi-level structural aberrations of chromosomes that pose challenges to discover CNAs of different length scales, and distinct biological origins and functions. Although several computational tools are available to identify CNAs using read depth (RD) signal, they fail to distinguish between large-scale and focal alterations due to inaccurate modeling of the RD signal of cancer genomes. Additionally, RD signal is affected by overdispersion-driven biases at low coverage, which significantly inflate false detection of CNA regions. RESULTS: We have developed the CNAtra framework to hierarchically discover and classify 'large-scale' and 'focal' copy number gain/loss from a single whole-genome sequencing (WGS) sample. CNAtra first utilizes a multimodal-based distribution to estimate the copy number (CN) reference from the complex RD profile of the cancer genome. We implemented a Savitzky-Golay smoothing filter and Modified Varri segmentation to capture the change points of the RD signal. We then developed a CN state-driven merging algorithm to identify the large segments with distinct copy numbers. Next, we identified focal alterations in each large segment using coverage-based thresholding to mitigate the adverse effects of signal variations. Using cancer cell lines and patient datasets, we confirmed CNAtra's ability to detect and distinguish the segmental aneuploidies and focal alterations. We used realistic simulated data for benchmarking the performance of CNAtra against other single-sample detection tools, where we artificially introduced CNAs in the original cancer profiles. We found that CNAtra is superior in terms of precision, recall and f-measure. 
CNAtra shows the highest sensitivity of 93 and 97% for detecting large-scale and focal alterations respectively. Visual inspection of CNAs revealed that CNAtra is the most robust detection tool for low-coverage cancer data. CONCLUSIONS: CNAtra is a single-sample CNA detection tool that provides an analytical and visualization framework for CNA profiling without relying on any reference control. It can detect chromosome-level segmental aneuploidies and high-confidence focal alterations, even from low-coverage data. CNAtra is open-source software implemented in MATLAB®. It is freely available at https://github.com/AISKhalil/CNAtra.
