Unknown

Dataset Information

0

Augmented Interval List: a novel data structure for efficient genomic interval search.


ABSTRACT: MOTIVATION:Genomic data is frequently stored as segments or intervals. Because this data type is so common, interval-based comparisons are fundamental to genomic analysis. As the volume of available genomic data grows, developing efficient and scalable methods for searching interval data is necessary. RESULTS:We present a new data structure, the Augmented Interval List (AIList), to enumerate intersections between a query interval q and an interval set R. An AIList is constructed by first sorting R as a list by the interval start coordinate, then decomposing it into a few approximately flattened components (sublists), and then augmenting each sublist with the running maximum interval end. The query time for AIList is O(log2N+n+m), where n is the number of overlaps between R and q, N is the number of intervals in the set R and m is the average number of extra comparisons required to find the n overlaps. Tested on real genomic interval datasets, AIList code runs 5-18 times faster than standard high-performance code based on augmented interval-trees, nested containment lists or R-trees (BEDTools). For large datasets, the memory-usage for AIList is 4-60% of other methods. The AIList data structure, therefore, provides a significantly improved fundamental operation for highly scalable genomic data analysis. AVAILABILITY AND IMPLEMENTATION:An implementation of the AIList data structure with both construction and search algorithms is available at http://ailist.databio.org. SUPPLEMENTARY INFORMATION:Supplementary data are available at Bioinformatics online.

SUBMITTER: Feng J 

PROVIDER: S-EPMC6901075 | biostudies-literature | 2019 Dec

REPOSITORIES: biostudies-literature

altmetric image

Publications

Augmented Interval List: a novel data structure for efficient genomic interval search.

Feng Jianglin J   Ratan Aakrosh A   Sheffield Nathan C NC  

Bioinformatics (Oxford, England) 20191201 23


<h4>Motivation</h4>Genomic data is frequently stored as segments or intervals. Because this data type is so common, interval-based comparisons are fundamental to genomic analysis. As the volume of available genomic data grows, developing efficient and scalable methods for searching interval data is necessary.<h4>Results</h4>We present a new data structure, the Augmented Interval List (AIList), to enumerate intersections between a query interval q and an interval set R. An AIList is constructed b  ...[more]

Similar Datasets

| S-EPMC6434014 | biostudies-literature
| S-EPMC9587027 | biostudies-literature
| S-EPMC3618241 | biostudies-literature
| S-EPMC3740633 | biostudies-literature
| S-EPMC3530906 | biostudies-other
| S-EPMC4987887 | biostudies-other
| S-EPMC5547444 | biostudies-other
| S-EPMC6420049 | biostudies-literature
| S-EPMC7198746 | biostudies-literature
| S-EPMC15548 | biostudies-literature