CONNECTION SCIENCE
2024, VOL. 36, NO. 1, 2325496
https://doi.org/10.1080/09540091.2024.2325496
Explainable data mining model for hyperinsulinemia
diagnostics
Nevena Rankovic a , Dragica Rankovic b , Mirjana Ivanovic c and Igor Lukic d
a Department of Cognitive Science and Artificial Intelligence, Tilburg University, Tilburg, The Netherlands;
b Department of Mathematics, Informatics and Statistics, Union University “Nikola Tesla”, Nis, Serbia;
c Department of Mathematics and Informatics, University of Novi Sad, Novi Sad, Serbia; d Department of
Preventive Medicine, University of Kragujevac, Kragujevac, Serbia
ABSTRACT
In our research, we present a data mining model for the early diag-
nosis of hyperinsulinemia, potentially reducing the risk of diabetes,
heart disease, and other chronic conditions. The dataset, gathered
from 2019 to 2022 by Serbia’s Healthcare Center through an obser-
vational cross-sectional study, includes 1008 adolescents. Medical
datasets are often highly imbalanced and may contain irrelevant
features that hinder predictive performance. To address these chal-
lenges in the medical data analysis, we propose a model employ-
ing Functional Principal Component Analysis (FPCA), which also
accounts for outliers that could otherwise lead to the inclusion
of irrelevant features. K-Means clustering is sensitive to the initial
positions of the cluster centers, which influence the final outcome;
to mitigate this, our model integrates FPCA, rather than standard
Principal Component Analysis (PCA), with K-Means clustering to
improve the preprocessing stage. Additionally,
we have incorporated the post-hoc explanatory method SHAP
(SHapley Additive exPlanations) alongside algorithms such as Ran-
dom Forest, XGBoost, and LightGBM to provide deeper insights into
our model, identifying the most contributory features for the devel-
opment of hyperinsulinemia. Experimental results showed that com-
bining FPCA with K-Means clustering enhances the accuracy of the
XGBoost classifier, with this model achieving an accuracy score of
0.99.
ARTICLE HISTORY
Received 2 September 2023
Accepted 26 February 2024
KEYWORDS
PCA; FPCA; K-Means; SHAP;
Hyperinsulinemia
1. Introduction
Hyperinsulinemia represents a state of pre-type 2 diabetes mellitus and is characterised by
significantly elevated insulin levels in the blood (Figure 1). As this pathological entity can
persist for years without pronounced symptoms, proper identification and verification of
its presence, along with potential risk factors, during early adolescence can hold excep-
tional significance in preventing various conditions stemming directly from such a state.
Hyperinsulinemia is a disorder that can emerge at any age, including adolescence. It
CONTACT Nevena Rankovic n.rankovic@uvt.nl Department of Cognitive Science and Artificial Intelligence,
Tilburg University, Tilburg 5037AB, The Netherlands
© 2024 The Author(s). Published by Informa UK Limited, trading as Taylor & Francis Group.
This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (http://
creativecommons.org/licenses/by-nc/4.0/), which permits unrestricted non-commercial use, distribution, and reproduction in any medium,
provided the original work is properly cited. The terms on which this article has been published allow the posting of the Accepted Manuscript
in a repository by the author(s) or with their consent.
Figure 1. Hyperinsulinemia process.
is imperative during these formative years to be vigilant about the potential onset of this
disorder and to take proactive steps to prevent it. The adolescent phase is characterised
by significant hormonal fluctuations, which can adversely affect pancreatic functionality
and insulin regulation in the bloodstream. Moreover, this is a time when unhealthy lifestyle
habits are prone to take root, including suboptimal dietary patterns and a lack of physical
activity, both of which can substantially heighten the risk of various metabolic syndromes
such as hyperinsulinemia. Adolescents at higher risk for hyperinsulinemia often include
those with a family history of type 2 diabetes, elevated body mass, insulin resistance, inad-
equate physical activity, and unhealthy dietary habits (IDF Diabetes Atlas, 2021; Thomas
et al., 2019).
According to the International Diabetes Federation (IDF), if insulin values are greater
than 15 µU/ml at 0 min and insulin values after the Oral Glucose Tolerance Test (OGTT) are
greater than 75 µU/ml at 120 min, or if the cumulative insulin value exceeds 300 µU/ml,
hyperinsulinemia is diagnosed (Sun et al., 2022). As the prevalence of hyperinsulinemia
accelerates globally and also in our region during this life period, the results of this research
can hold considerable scientific and practical importance for pediatricians (Ryder et al.,
2020). They can aid in strategizing the application of preventive and timely corrective mea-
sures to avert the onset of the mentioned pathological entity and the development of
potential complications, primarily type 2 diabetes and cardiovascular diseases, extending
into adulthood (Andes et al., 2020; Horta & de Lima, 2019).
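The diagnostic thresholds quoted above can be written as a small rule. The following is a minimal sketch, not the authors' implementation: the `InsulinReadings` container and the function name are hypothetical, and the grouping of the conditions as "(fasting and 120-min) or cumulative" is one reading of the sentence above.

```python
from dataclasses import dataclass


@dataclass
class InsulinReadings:
    """Insulin values in µU/ml from an OGTT (hypothetical container)."""
    fasting: float      # value at 0 min
    at_120_min: float   # value at 120 min
    cumulative: float   # cumulative insulin value over the test


def meets_hyperinsulinemia_criteria(r: InsulinReadings) -> bool:
    """Apply the thresholds quoted in the text.

    Assumed grouping: (fasting > 15 AND 120-min > 75) OR cumulative > 300.
    All comparisons are strict, matching "greater than" / "exceeds".
    """
    return (r.fasting > 15 and r.at_120_min > 75) or r.cumulative > 300
```

A reading exactly at a threshold (for example, a fasting value of 15 µU/ml) does not satisfy the rule under this sketch, because the text uses strict inequalities.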
Enhancing the diagnosis of hyperinsulinemia necessitates the adoption of innovative
techniques. Data mining, with its capacity to uncover vital yet hidden patterns within
vast databases, holds the promise of transforming the landscape of medical diagnostics
(Savić et al., 2023; Vrbančič et al., 2022). Numerous data mining techniques and algorithms
have been tailored to distil knowledge from medical databases for disease diagnosis (Chen
et al., 2023; Sun et al., 2022). PCA, a simple yet potent non-parametric method, provides
a pathway to extract pertinent insights from intricate datasets (Thenappan et al., 2022).
In scenarios requiring the categorisation of vast datasets into user-defined clusters, the K-
Means algorithm (Edeh et al., 2022) facilitates this by minimising the squared error function.
However, susceptibility to outliers and elevated time complexity hinder its efficacy. There-
fore, PCA plays a pivotal role in reducing dataset dimensions while conserving paramount
information, thereby refining cluster centroids for improved accuracy. Importantly, the
effectiveness of K-Means clustering is rooted in its capacity to group similar entities within
a dataset. Any clusters that significantly deviate from the norm, indicating outliers, are
identified as anomalies and then removed. To address these challenges, Functional-based
Principal Component Analysis (FPCA) (Gecili et al., 2021) emerges as advantageous. FPCA
identifies and eliminates irrelevant features, a key factor for unbiased outcomes and expe-
dited training. Remarkably, FPCA’s strength lies in minimal data loss despite dimension
reduction, resulting in enhanced classification accuracy and computational efficiency com-
pared to classical PCA (Pan et al., 2023). Within this landscape, the integration of machine
learning algorithms like Random Forest, XGBoost, and LightGBM stands as the logical
next step. Enriched by the insights from FPCA and K-Means clustering, these algorithms
adeptly navigate the complexities of hyperinsulinemia diagnostics. Finally, for a deeper
understanding and augmented transparency of the proposed model, the SHAP method
is employed. It serves to elucidate the outcomes of intricate ensemble models such as
Random Forest, XGBoost, and LightGBM, thus providing enriched insights. This multidi-
mensional approach not only refines accuracy but also transforms the medical diagnosis
landscape by harnessing the latent power of advanced technologies. To highlight the con-
tribution of this study, the following main research objective with two sub-questions is
posed:
RO: To what extent can unsupervised techniques such as PCA and FPCA improve K-Means
clustering and provide a foundation for the input values of Random Forest, XGBoost and LightGBM,
contributing to higher classifier accuracy?
RQ1: Among supervised models such as Random Forest, XGBoost, and LightGBM, which per-
forms better compared to the baseline model, in this case, Logistic Regression, in terms of
Recall, Precision, Accuracy, F1 score, and MCC?
RQ2: Using the SHAP method, what are the most informative features that the best-performing
model among the three supervised models - Random Forest, XGBoost, and LightGBM - consid-
ers when predicting hyperinsulinemia?
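For reference, the five evaluation metrics named in RQ1 can all be derived from the binary confusion matrix. The sketch below uses the standard textbook definitions; the function name and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration, not details taken from the study.

```python
import math


def binary_metrics(y_true, y_pred):
    """Compute Recall, Precision, Accuracy, F1 and MCC for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def safe_div(num, den):
        # Assumed convention: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    accuracy = safe_div(tp + tn, len(pairs))
    f1 = safe_div(2 * precision * recall, precision + recall)
    mcc = safe_div(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    return {
        "recall": recall,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
        "mcc": mcc,
    }
```

MCC is worth reporting alongside accuracy on imbalanced medical data because it uses all four cells of the confusion matrix, so a classifier that mostly predicts the majority class does not score well on it.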
As the prevalence of hyperinsulinemia in adolescence rapidly increases, and the sig-
nificance and widespread application of a combination of unsupervised techniques and
supervised models such as Random Forest, XGBoost, and LightGBM become evident, this
study’s findings hold exceptional scientific and practical value for pediatricians in craft-
ing strategies for preventive and timely corrective measures. These strategies aim to
prevent the onset of this pathophysiological entity and the development of potential com-
plications, primarily type 2 diabetes mellitus and cardiovascular diseases, in later adult
life.
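The preprocessing idea described in this section (reduce dimensionality, cluster with K-Means, then discard clusters that look like outliers) can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: it uses ordinary PCA via SVD rather than FPCA, a deterministic farthest-point initialisation to sidestep K-Means' sensitivity to starting centers, and a hypothetical rule that treats any cluster holding less than `min_fraction` of the samples as an outlier cluster.

```python
import numpy as np


def pca_project(X, n_components):
    """Project rows of X onto the top principal components (ordinary PCA via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T


def kmeans(Z, k, n_iter=100):
    """Lloyd's algorithm minimising the within-cluster squared error.

    Initialisation (an assumption of this sketch): start from the first row,
    then repeatedly add the point farthest from the centers chosen so far.
    """
    centers = [Z[0]]
    for _ in range(1, k):
        d = np.min([np.sum((Z - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(Z[np.argmax(d)])
    centers = np.array(centers, dtype=float)

    labels = np.zeros(len(Z), dtype=int)
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        dists = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster is empty.
        new_centers = np.array([
            Z[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


def drop_small_clusters(labels, min_fraction=0.05):
    """Return a boolean mask keeping rows whose cluster is not tiny.

    Hypothetical outlier rule: clusters with fewer than min_fraction of all
    samples are treated as anomalies and masked out.
    """
    counts = np.bincount(labels)
    keep_cluster = counts / labels.size >= min_fraction
    return keep_cluster[labels]
```

A typical use would be `Z = pca_project(X, 2)`, then `labels, _ = kmeans(Z, k)`, then `X_clean = X[drop_small_clusters(labels)]`, with the cleaned rows passed on to a supervised classifier.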
The results of this research can significantly contribute to a better understanding of the
risk factors that influence the occurrence of hyperinsulinemia with elevated glycemia in
adolescents, especially those with excessive body weight, poor dietary habits, insufficient