Dataset Information

IterCluster: a barcode clustering algorithm for long fragment read analysis.

ABSTRACT: Recent advances in long fragment read (LFR) technologies, also known as linked-read or read-cloud technologies, such as single tube long fragment reads (stLFR), 10X Genomics Chromium reads, and TruSeq synthetic long-reads, have enabled efficient haplotyping and genome assembly. However, in the case of stLFR and 10X Genomics Chromium reads, the long fragments of a genome are covered sparsely by reads in each barcode, and most barcodes are shared by multiple long fragments from different regions, which results in inefficient assembly when using long-range information. Methods that address these shortcomings are therefore vital for capitalizing on the additional information obtained using these technologies. We designed IterCluster, a novel, alignment-free clustering algorithm that clusters barcodes from the same target region of a genome, using k-mer frequency-based features and a Markov Cluster (MCL) approach to identify enough reads in a target region to ensure sufficient sequencing depth for that region. IterCluster was validated using BGI stLFR and 10X Genomics Chromium read datasets. It achieved higher precision and recall on BGI stLFR data than on 10X Genomics Chromium read data. In addition, we demonstrated how IterCluster improves de novo assembly results when using a divide-and-conquer strategy on a human genome dataset (scaffold/contig N50 = 13.2 kbp/7.1 kbp before vs. 17.1 kbp/11.9 kbp after IterCluster). IterCluster provides a new way of determining LFR barcode enrichment and a novel approach for de novo assembly using LFR data. IterCluster is open source and available at https://github.com/JianCong-WENG/IterCluster.
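The general idea in the abstract (alignment-free barcode comparison by k-mer content, followed by Markov clustering) can be illustrated with a minimal sketch. This is not the IterCluster implementation: the function names, the Jaccard similarity on k-mer sets, the `min_sim` threshold, and the fixed iteration count are all assumptions made for illustration, and reverse complements are ignored for brevity.

```python
def kmer_set(reads, k):
    """Collect every distinct k-mer seen in the reads of one barcode."""
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two k-mer sets (assumed feature, not from the paper)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def normalize_columns(m):
    """Scale each column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                m[i][j] /= total
    return m

def mcl(sim, inflation=2.0, iterations=30, eps=1e-6):
    """Basic Markov clustering: alternate expansion and inflation."""
    n = len(sim)
    m = normalize_columns([row[:] for row in sim])
    for _ in range(iterations):
        # Expansion: square the matrix (two-step random walk).
        m = [[sum(m[i][x] * m[x][j] for x in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: raise entries to a power, then renormalize columns.
        m = normalize_columns([[v ** inflation for v in row] for row in m])
    # Each row that keeps non-negligible mass lists the members of one cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > eps)
        if members:
            clusters.add(members)
    return clusters

def cluster_barcodes(reads_by_barcode, k=4, min_sim=0.1):
    """Group barcodes whose reads share k-mers; returns sorted lists of barcodes."""
    barcodes = sorted(reads_by_barcode)
    sets = [kmer_set(reads_by_barcode[b], k) for b in barcodes]
    n = len(barcodes)
    # Self-loops (diagonal of 1.0) keep isolated barcodes as singleton clusters.
    sim = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            s = jaccard(sets[i], sets[j])
            if s >= min_sim:
                sim[i][j] = sim[j][i] = s
    return sorted(sorted(barcodes[i] for i in c) for c in mcl(sim))

if __name__ == "__main__":
    toy = {
        "bc1": ["ACGTACGTAC"], "bc2": ["ACGTACGTAC"],  # same toy region
        "bc3": ["TTGGCCAATT"], "bc4": ["TTGGCCAATT"],  # a different toy region
    }
    print(cluster_barcodes(toy))
```

On the toy input, the two barcodes that share all their k-mers end up together, giving `[['bc1', 'bc2'], ['bc3', 'bc4']]`. Real linked-read data would need much larger k, handling of reverse complements and sequencing errors, and a sparse matrix representation; those are left out of this sketch.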

SUBMITTER: Weng J

PROVIDER: S-EPMC7100596 | biostudies-literature | 2020

REPOSITORIES: biostudies-literature

Publications

IterCluster: a barcode clustering algorithm for long fragment read analysis.

Weng Jiancong, Chen Tian, Xie Yinlong, Xu Xun, Zhang Gengyun, Peters Brock A, Drmanac Radoje

PeerJ, 2020-03-24

PMID: 32231869

Similar Datasets

Project description:

Background: Three issues in bibliometrics need to be addressed: (1) the lack of a clear definition of author collaborations in cluster analysis that takes into account collaborations with and without self-connections; (2) the need for a simple yet effective clustering algorithm for use in coword analysis; and (3) the inadequacy of general bibliometrics for comparing research achievements and identifying articles that are worth reading and recommending to readers. The study aimed to put forth a clustering algorithm for cluster analysis (called following leader clustering [FLCA], a follower-leading clustering algorithm), examine the dissimilarities in cluster outcomes when considering collaborations with and without self-connections, and demonstrate the application of the clustering algorithm in bibliometrics.

Methods: Articles and review articles published in JMIR Medical Informatics between 2016 and 2022 were retrieved from the Web of Science Core Collection. The FLCA algorithm was used to identify author collaborations (ACs) and themes over the past 7 years. The study had 3 objectives: (1) comparing the results obtained with and without self-connections; (2) applying the FLCA algorithm to ACs and themes; and (3) reporting the findings using traditional bibliometric approaches based on counts and citations. All plots were created using R.

Results: The study found a significant difference in cluster outcomes between the 2 scenarios with and without self-connections, with a 53.8% overlap (14 out of the top 20 countries in ACs). Over the past 7 years, the top clusters were led by Yonsei University in South Korea among institutes, Grang Luo from the US among authors, and "model" among themes. The top entities with the most publications in JMIR Medical Informatics were the United States, Yonsei University in South Korea, Medical School, and Grang Luo from the US.

Conclusion: The FLCA algorithm proposed in this study offers researchers a comprehensive approach to exploring and comprehending the complex connections among authors or keywords. The study suggests that future research on ACs with cluster analysis should employ FLCA and R visualizations.

Project description:

Motivation: Deep sequencing of the transcripts of regulatory non-coding RNA generates footprints of post-transcriptional processes. After obtaining sequence reads, the short reads are mapped to a reference genome, and specific mapping patterns, called read mapping profiles, can be detected; these are distinct from random non-functional degradation patterns. These patterns reflect the maturation processes that lead to the production of shorter RNA sequences. Recent next-generation sequencing studies have revealed not only the typical maturation process of miRNAs but also the various processing mechanisms of small RNAs derived from tRNAs and snoRNAs.

Results: We developed an algorithm termed SHARAKU to align two read mapping profiles of next-generation sequencing outputs for non-coding RNAs. In contrast with previous work, SHARAKU incorporates the primary and secondary sequence structures into an alignment of read mapping profiles to allow for the detection of common processing patterns. Using a benchmark simulated dataset, SHARAKU exhibited superior performance to previous methods for correctly clustering the read mapping profiles with respect to 5'-end processing and 3'-end processing from degradation patterns and in detecting similar processing patterns in deriving the shorter RNAs. Further, using experimental data of small RNA sequencing for the common marmoset brain, SHARAKU succeeded in identifying the significant clusters of read mapping profiles for similar processing patterns of small derived RNA families expressed in the brain.

Availability and implementation: The source code of our program SHARAKU is available at http://www.dna.bio.keio.ac.jp/sharaku/, and the simulated dataset used in this work is available at the same link.

Accession code: The sequence data from the whole RNA transcripts in the hippocampus of the left brain used in this work is available from the DNA DataBank of Japan (DDBJ) Sequence Read Archive (DRA) under the accession number DRA004502.

Contact: yasu@bio.keio.ac.jp

Supplementary information: Supplementary data are available at Bioinformatics online.
