Dataset Information

Jaccard/Tanimoto similarity test and estimation methods for biological presence-absence data.


ABSTRACT:

Background

Surveys of the presence and absence of specific species across multiple biogeographic units (or bioregions) are used in a broad range of biological studies, from ecology to microbiology. Using binary presence-absence data, we evaluate species co-occurrences that help elucidate relationships among organisms and environments. To summarize the similarity between occurrences of two species, we routinely use the Jaccard/Tanimoto coefficient, which is the ratio of the size of their intersection to the size of their union. It is natural, then, to identify statistically significant Jaccard/Tanimoto coefficients, which suggest non-random co-occurrences of species. However, statistical hypothesis testing using this similarity coefficient has seldom been used or studied.
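As a small illustration of the coefficient described above, the following Python sketch computes the ratio of intersection to union for two presence-absence vectors. The function name `jaccard`, the example vectors, and the convention of returning 0.0 when neither species occurs anywhere are assumptions made for this example; they are not taken from the paper or from the R package.

```python
def jaccard(x, y):
    """Jaccard/Tanimoto coefficient of two equal-length 0/1 sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    # Sites where both species are present (intersection).
    both = sum(1 for a, b in zip(x, y) if a and b)
    # Sites where at least one species is present (union).
    either = sum(1 for a, b in zip(x, y) if a or b)
    # Assumed convention: define the coefficient as 0.0 for an empty union.
    return both / either if either else 0.0


# Hypothetical presence-absence records for two species across six sites.
species_a = [1, 1, 0, 1, 0, 0]
species_b = [1, 0, 0, 1, 1, 0]
print(jaccard(species_a, species_b))  # 2 shared sites / 4 occupied sites = 0.5
```

Sites where both species are absent do not enter the calculation, which is why the coefficient is often preferred for sparse presence-absence data.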

Results

We introduce a hypothesis test for similarity in biological presence-absence data, using the Jaccard/Tanimoto coefficient. Several key improvements are presented, including unbiased estimation of the expectation and centered Jaccard/Tanimoto coefficients that account for occurrence probabilities. The exact and asymptotic solutions are derived. To overcome the computational burden of high dimensionality, we propose the bootstrap and measurement concentration algorithms to efficiently estimate the statistical significance of binary similarity. Comprehensive simulation studies demonstrate that our proposed methods produce accurate p-values and false discovery rates. The proposed estimation methods are orders of magnitude faster than the exact solution, particularly as dimensionality increases. We showcase their applications in evaluating co-occurrences of bird species on 28 islands of Vanuatu and fish species in 3347 freshwater habitats in France. The proposed methods are implemented in an open source R package called jaccard (https://cran.r-project.org/package=jaccard).
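To make the idea of centering and resampling-based significance concrete, here is a rough Python sketch. It is not the package's algorithm: it swaps in a plain permutation test in place of the bootstrap and measurement concentration methods named above, and it assumes that the expectation under independence takes the form p_x p_y / (p_x + p_y - p_x p_y), where p_x and p_y are the occurrence probabilities. All function names and the example data are hypothetical.

```python
import random


def jaccard(x, y):
    """Ratio of intersection to union for two 0/1 sequences."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return both / either if either else 0.0


def expected_jaccard(px, py):
    """Assumed expectation of the coefficient under independence."""
    denom = px + py - px * py
    return px * py / denom if denom > 0 else 0.0


def centered_jaccard(x, y):
    """Observed coefficient minus its assumed expectation (x, y hold ints 0/1)."""
    m = len(x)
    return jaccard(x, y) - expected_jaccard(sum(x) / m, sum(y) / m)


def permutation_pvalue(x, y, n_resamples=2000, seed=0):
    """Two-sided p-value from shuffling y; a stand-in, not the paper's method."""
    rng = random.Random(seed)
    observed = abs(centered_jaccard(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(y_perm)  # breaks any association between x and y
        if abs(centered_jaccard(x, y_perm)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical data: two species that occupy exactly the same 10 of 20 sites.
x = [1] * 10 + [0] * 10
y = [1] * 10 + [0] * 10
p_value = permutation_pvalue(x, y)
print(centered_jaccard(x, y), p_value)
```

Shuffling keeps each species' number of occupied sites fixed, so the assumed expectation is the same for every resample and only the overlap varies.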

Conclusion

We introduce a suite of statistical methods for the Jaccard/Tanimoto similarity coefficient for binary data that enable straightforward incorporation of probabilistic measures in the analysis of species co-occurrences. Due to their generality, the proposed methods and implementations are applicable to a wide range of binary data arising from genomics, biochemistry, and other areas of science.

SUBMITTER: Chung NC 

PROVIDER: S-EPMC6929325 | biostudies-literature | 2019 Dec

REPOSITORIES: biostudies-literature

Publications

Jaccard/Tanimoto similarity test and estimation methods for biological presence-absence data.

Neo Christopher Chung, Błażej Miasojedow, Michał Startek, Anna Gambin

BMC Bioinformatics, 2019 Dec 24, Suppl 15

Similar Datasets

| S-EPMC7050271 | biostudies-literature
| S-EPMC6755604 | biostudies-literature
| S-EPMC7660437 | biostudies-literature
| S-EPMC10505501 | biostudies-literature
| S-EPMC5541006 | biostudies-literature
| S-EPMC6479077 | biostudies-literature
| S-EPMC9235470 | biostudies-literature
| S-EPMC3773873 | biostudies-literature
| S-EPMC3857202 | biostudies-literature
| S-EPMC4984511 | biostudies-literature