
Measuring data drift with the unstable population indicator

Haas, M.R.
Sibbald, L.
Abstract
Measuring data drift is essential in machine learning applications where model scoring (evaluation) is done on data samples that differ from those used in training. The Kullback-Leibler divergence is a common measure of shifted probability distributions, for which discretized versions have been devised to deal with binned or categorical data. We present the Unstable Population Indicator, a robust, flexible and numerically stable discretized implementation of Jeffrey's divergence, along with a Python package that handles continuous, discrete, ordinal and nominal data in a variety of popular data types. We show its numerical and statistical properties in controlled experiments. We advise against a single common cut-off to distinguish stable from unstable populations; the cut-off should instead depend on the use case.
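The abstract describes the indicator as a discretized, numerically stable implementation of Jeffrey's divergence, i.e. the symmetrised Kullback-Leibler divergence between a reference (training) distribution and a scoring distribution over bins. The paper's actual formula and the package's API are not reproduced here; the following is a minimal sketch under the assumption that numerical stability is achieved by smoothing empty bins (the function name and `eps` parameter are illustrative, not the package's):

```python
import numpy as np

def jeffreys_divergence(expected_counts, observed_counts, eps=1e-6):
    """Symmetrised KL (Jeffrey's) divergence between two binned
    distributions, given raw bin counts.

    `eps` smoothing is an assumed mechanism: it keeps empty bins
    from producing infinite log-ratios.
    """
    p = np.asarray(expected_counts, dtype=float) + eps
    q = np.asarray(observed_counts, dtype=float) + eps
    p /= p.sum()  # normalise counts to probabilities
    q /= q.sum()
    # KL(p||q) + KL(q||p) collapses to a single symmetric sum
    return float(np.sum((p - q) * np.log(p / q)))

# Identical binned populations give (near-)zero divergence;
# redistributed mass gives a strictly positive value.
stable = jeffreys_divergence([10, 20, 30], [10, 20, 30])
drifted = jeffreys_divergence([10, 20, 30], [30, 20, 10])
```

Consistent with the abstract's closing advice, the positive value returned for a drifted population would then be compared against a use-case-specific cut-off rather than a universal threshold.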
Date
2024
Keywords
data drift, data shift, machine learning, kl-divergence
Citation
Haas, M R & Sibbald, L 2024, 'Measuring data drift with the unstable population indicator', Data Science, vol. 7, no. 1, 1, pp. 1-12. https://doi.org/10.3233/DS-240059