Dataset Information

The Fermi-Dirac distribution provides a calibrated probabilistic output for binary classifiers.

ABSTRACT: Binary classification is one of the central problems in machine-learning research and, as such, investigations of its general statistical properties are of interest. We studied the ranking statistics of items in binary classification problems and observed that there is a formal and surprising relationship between the probability of a sample belonging to one of the two classes and the Fermi-Dirac distribution determining the probability that a fermion occupies a given single-particle quantum state in a physical system of noninteracting fermions. Using this equivalence, it is possible to compute a calibrated probabilistic output for binary classifiers. We show that the area under the receiver operating characteristics curve (AUC) in a classification problem is related to the temperature of an equivalent physical system. In a similar manner, the optimal decision threshold between the two classes is associated with the chemical potential of an equivalent physical system. Using our framework, we also derive a closed-form expression to calculate the variance for the AUC of a classifier. Finally, we introduce FiDEL (Fermi-Dirac-based ensemble learning), an ensemble learning algorithm that uses the calibrated nature of the classifier's output probability to combine possibly very different classifiers.
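The correspondence described in the abstract can be illustrated with a minimal sketch. It assumes the standard Fermi-Dirac occupation form with the item's rank r playing the role of energy (rank 1 = highest classifier score), so that P(positive | r) = 1 / (1 + exp(beta * r - alpha)), with temperature T = 1/beta and chemical potential mu = alpha/beta (the rank at which the probability equals 1/2). The two constraints used to fix beta and alpha (the expected number of positives, and the mean rank of positives implied by the AUC through the Mann-Whitney relation), the bisection solver, and all function names are assumptions made for illustration and are not taken from the paper's code.

```python
import math


def fermi_dirac(rank, beta, alpha):
    """P(positive | rank) = 1 / (1 + exp(beta * rank - alpha)), evaluated stably."""
    x = beta * rank - alpha
    if x >= 0:
        e = math.exp(-x)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(x))


def _bisect(f, lo, hi, iters):
    """Root of an increasing function f on [lo, hi] by plain bisection."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def solve_alpha(beta, n, n_pos):
    """alpha such that the probabilities over ranks 1..n sum to n_pos."""
    def excess(alpha):
        return sum(fermi_dirac(r, beta, alpha) for r in range(1, n + 1)) - n_pos
    return _bisect(excess, -50.0, beta * n + 50.0, iters=100)


def fit_fermi_dirac(n, n_pos, auc):
    """Hypothetical helper: find (beta, alpha) for 0.5 < auc < 1.

    Assumed constraint: with rank 1 = best score, the mean rank of the
    positives is (n_pos + 1) / 2 + n_neg * (1 - auc).
    """
    n_neg = n - n_pos
    target = (n_pos + 1) / 2 + n_neg * (1 - auc)

    def gap(beta):
        alpha = solve_alpha(beta, n, n_pos)
        mean_rank = sum(
            r * fermi_dirac(r, beta, alpha) for r in range(1, n + 1)
        ) / n_pos
        # Mean rank of positives falls as beta grows, so gap increases with beta.
        return target - mean_rank

    beta = _bisect(gap, 1e-9, 50.0, iters=60)
    return beta, solve_alpha(beta, n, n_pos)


# Example: 100 ranked items, 30 of them positive, classifier AUC of 0.8.
beta, alpha = fit_fermi_dirac(n=100, n_pos=30, auc=0.8)
print("temperature T = 1/beta:", 1.0 / beta)
print("chemical potential mu = alpha/beta:", alpha / beta)
print("P(positive | rank 1):", fermi_dirac(1, beta, alpha))
```

In this sketch a higher AUC yields a larger beta (a lower temperature) and therefore a sharper transition around the rank mu, which is the sense in which AUC and decision threshold map to temperature and chemical potential.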

SUBMITTER: Kim SC

PROVIDER: S-EPMC8403970 | biostudies-literature | 2021 Aug

REPOSITORIES: biostudies-literature


Publications

The Fermi-Dirac distribution provides a calibrated probabilistic output for binary classifiers.

Kim Sung-Cheol, Arun Adith S, Ahsen Mehmet Eren, Vogel Robert, Stolovitzky Gustavo

Proceedings of the National Academy of Sciences of the United States of America, 2021 Aug 1, issue 34


PMID: 34413191

Similar Datasets

Project description: Advances in information technology and computing speed have led to the generation of ever more medical data, and applying artificial intelligence to these data to support the medical industry is an active research topic. Cytomegalovirus (CMV) is a widespread virus with strict species specificity, and the infection rate among Chinese adults exceeds 95%. Detecting CMV is important because, apart from a few patients with clinical symptoms, the vast majority of infected patients remain in a state of inapparent infection. In this study, we present a new method to detect CMV infection status by analyzing high-throughput sequencing results of T cell receptor beta chains (TCRβ). Based on the high-throughput sequencing data of 640 subjects from cohort 1, Fisher's exact test was performed to evaluate the relationship between TCRβ sequences and CMV status. The number of subjects with these correlated sequences, to different degrees, in cohort 1 and cohort 2 was then measured to build binary classifier models that identify whether a subject is CMV positive or negative. We selected four binary classification algorithms for side-by-side comparison: logistic regression (LR), support vector machine (SVM), random forest (RF), and linear discriminant analysis (LDA). Based on the performance of each algorithm at different thresholds, four optimal binary classification models were obtained. Logistic regression performs best when the Fisher's exact test threshold is 10^-5, with a sensitivity of 87.5% and a specificity of 96.88%. The RF algorithm also performs best at a threshold of 10^-5, with a sensitivity of 87.5% and a specificity of 90.63%. The SVM algorithm achieves high accuracy at a threshold of 10^-5 as well, with a sensitivity of 85.42% and a specificity of 96.88%. The LDA algorithm achieves high accuracy, with 95.83% sensitivity and 90.63% specificity, when the threshold is 10^-4. This is probably because the two-dimensional distribution of the CMV data samples is linearly separable, so linear models such as LDA are more effective, while nonlinear algorithms such as random forest separate the classes less accurately. This finding may offer a potential diagnostic method for CMV and may even be applicable to other viruses, for example in detecting a history of infection with the new coronavirus.

