Augmented Interval List: a novel data structure for efficient genomic interval search.
Ontology highlight
ABSTRACT: MOTIVATION:Genomic data is frequently stored as segments or intervals. Because this data type is so common, interval-based comparisons are fundamental to genomic analysis. As the volume of available genomic data grows, developing efficient and scalable methods for searching interval data is necessary. RESULTS:We present a new data structure, the Augmented Interval List (AIList), to enumerate intersections between a query interval q and an interval set R. An AIList is constructed by first sorting R as a list by the interval start coordinate, then decomposing it into a few approximately flattened components (sublists), and then augmenting each sublist with the running maximum interval end. The query time for AIList is O(log2N+n+m), where n is the number of overlaps between R and q, N is the number of intervals in the set R and m is the average number of extra comparisons required to find the n overlaps. Tested on real genomic interval datasets, AIList code runs 5-18 times faster than standard high-performance code based on augmented interval-trees, nested containment lists or R-trees (BEDTools). For large datasets, the memory-usage for AIList is 4-60% of other methods. The AIList data structure, therefore, provides a significantly improved fundamental operation for highly scalable genomic data analysis. AVAILABILITY AND IMPLEMENTATION:An implementation of the AIList data structure with both construction and search algorithms is available at http://ailist.databio.org. SUPPLEMENTARY INFORMATION:Supplementary data are available at Bioinformatics online.
SUBMITTER: Feng J
PROVIDER: S-EPMC6901075 | biostudies-literature | 2019 Dec
REPOSITORIES: biostudies-literature
ACCESS DATA