SlideShare a Scribd company logo
International Journal of Electrical and Computer Engineering (IJECE)
Vol. 11, No. 3, June 2021, pp. 2407~2413
ISSN: 2088-8708, DOI: 10.11591/ijece.v11i3.pp2407-2413  2407
Journal homepage: http://ijece.iaescore.com
Prediction model of algal blooms using logistic regression and
confusion matrix
Hongwon Yun
Department of Computer Software Engineering, Silla University, Busan, Republic of Korea
Article Info ABSTRACT
Article history:
Received Jul 31, 2020
Revised Sep 22, 2020
Accepted Oct 8, 2020
Algal blooms data are collected and refined as experimental data for algal
blooms prediction. Refined algal blooms dataset is analyzed by logistic
regression analysis, and statistical tests and regularization are performed to
find the marine environmental factors affecting algal blooms. The predicted
value of algal bloom is obtained through logistic regression analysis using
marine environment factors affecting algal blooms. The actual values and the
predicted values of algal blooms dataset are applied to the confusion matrix.
By improving the decision boundary of the existing logistic regression, and
accuracy, sensitivity and precision for algal blooms prediction are improved.
In this paper, the algal blooms prediction model is established by the
ensemble method using logistic regression and confusion matrix. Algal
blooms prediction is improved, and this is verified through big data analysis.
Keywords:
Algal blooms
Confusion matrix
Ensemble method
Logistic regression
Prediction model This is an open access article under the CC BY-SA license.
Corresponding Author:
Hongwon Yun
Department of Computer Software Engineering
Silla University
Busan 46958, Republic of Korea
Email: hwyun@silla.ac.kr
1. INTRODUCTION
Logistic regression is a special case of a typical model and is similar to linear regression, however it
has a difference in the relationship between dependent and independent variables. The dependent variable of
logistic regression can be binary or continuous, and it is used as a model for classification or prediction when
the dependent variable is binary [1, 2]. If the dependent variable of logistic regression is binary, the range of
its value is limited to the bivariate and the distribution of conditional probability follows the Bernoulli
distribution. Logistic regression allows dependent variable values to be between 0 and 1 regardless of the
range of independent variable values, so it is possible to classify the result of data into a specific
classification when input data is given and predict the likelihood of an event occurring [3-5].
In logistic regression, where the dependent variable is binary, the predicted value can be calculated
using a linear combination of the independent variables. However, since the value of the dependent variable
is classified as pass or fail around the decision boundary, the value close to the decision boundary may be
less accurate [6-8]. In binary logistic regression, since the actual value of the dependent variable is present
and the predicted value can be calculated, the predicted value can be applied to a confusion matrix that can be
compared to the target value [9, 10]. It can be obtained sensitivity and precision from the confusion matrix
using the actual and predicted values of the logistic regression, and apply it to algal blooms to create a
summary of indicators such as sensitivity and precision including accuracy [11-13].
Sensitivity and precision are as important as accuracy in predicting algal bloom occurrence. This is
because high sensitivity and precision can provide indicators that can prevent massive property damage
[14-17]. The elements of the marine environment that cause algal blooms are generally known, but no study
 ISSN: 2088-8708
Int J Elec & Comp Eng, Vol. 11, No. 3, June 2021 : 2407 - 2413
2408
can be found to analyze the influence of each element on algal blooms and predict algal blooms. In this
study, the predicted value of logistic regression is calculated by machine learning. The actual value used in
logistic regression analysis and the prediction calculated through machine running are applied to the
confusion matrix to create a prediction model for algal blooms.
This paper is organized as follows. The logistic regression and confusion matrix as the background
theory of this study are describe in section 2. In section 3, we describe the algal blooms prediction model
using the ensemble method of the logistic regression and confusion matrix proposed in this study. Here we
describe the process of extracting marine environmental elements using logistic regression, obtaining red tide
prediction values, applying improved decision boundaries to logistic regression, and how to improve
accuracy, sensitivity and precision through confusion matrix. In section 4, we verify the proposed algal
blooms prediction model using the algal blooms dataset, and conclusions are described in section 5.
2. LOGISTIC REGRESSION AND CONFUTION MATRIX
2.1. Logistic regression
Linear regression is a model that estimates a regression coefficient that can linearly express the
relationship between independent variables X and dependent variables Y with continuous values. If the
dependent variable Y is a binary variable, logistic regression is used because linear regression cannot be
applied directly. Some regression algorithms can be used for classification, and logistic regression is widely
used to estimate the probability that a sample belongs to a particular class. If the estimated probability
exceeds 0.5, the sample is predicted to belong to the class, and if it is less than 0.5, it is used as a binary
classifier to predict that the sample does not belong to the class [18, 19]. To estimate the probability, logistic
regression calculates the weighted sum of the input characteristics, but instead of outputting the result
immediately such as linear regression, it outputs the logistic of the result value. Logistic is a sigmoid function
that outputs a value between 0 and 1 [20]. The logistic function has the effect of limiting the output result to
always between 0 and 1 for numerical values x, and its expression is defined as follows.
𝑦 =
1
1+𝑒−𝑓(𝑥) (1)
In (1), 𝑓(𝑥) can be either a simple linear function or a multiple linear function. For classification
problems with two categories, if 𝑓(𝑥) > 0 is classified as 𝑦 → 1 and 𝑓(𝑥) < 0 is classified as 𝑦 → 0. The
decision boundary of the logistic regression model is the 𝑓(𝑥) = 0 in hyperplane and becomes 𝑦 = 0.5.
Errors in prediction usually occur around the decision boundary [21, 22].
2.2. Confusion matrix
The confusion matrix is a tool that easily and effectively shows the performance of the classifier and
has the advantage of being easy to interpret the results. A confusion matrix can be used to evaluate the
performance of any models or algorithms. As shown in Table 1, the rows in the confusion matrix represent
the values of the predictive class and the columns represent the values of the actual class. Each cell is one of
the possible combinations of prediction and actuality. In the 2×2 confusion matrix, there are true positive
(TP), false positive (FP), false negative (FN), and true false (TF) [23].
The perfect model will only have values on the diagonal, the rest of the cells will be all zeros, and
the bad model will be evenly distributed in all cells. The error matrix tells us how bad a model is when it is
bad. The value of each cell can identify a misclassified pattern [24].
Table 1. Confusion matrix
Confusion matrix
True class (Actual)
P N
Hypothesized class
(Predicted)
Y True Positives False Positives
N False Negatives True Negatives
Methods for summarizing the results of the confusion matrix include accuracy, precision, and recall.
𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦 =
𝑇𝑃+𝑇𝑁
𝑇𝑃+𝑇𝑁+𝐹𝑃+𝐹𝑁
(2)
The accuracy is obtained by dividing the accurately predicted number (TP+TN) by the total number
of samples, and is represented by (2). Among the methods for summarizing the results in the confusion
Int J Elec & Comp Eng ISSN: 2088-8708 
Prediction model of algal blooms using logistic regression and confusion matrix (Hongwon Yun)
2409
matrix, the most frequently used precision and sensitivity are as shown in (3) and (4), respectively.
𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 =
𝑇𝑃
𝑇𝑃+𝐹𝑃
(3)
Precision is a positive predictive value that measures how many of the samples (TP+FP) predicted
to be positive are true positives (TP). Precision is used as a performance indicator when the goal is to reduce
the number of false positives (FP).
𝑆𝑒𝑛𝑠𝑖𝑡𝑖𝑣𝑖𝑡𝑦 =
𝑇𝑃
𝑇𝑃+𝐹𝑁
(4)
Sensitivity measures how many of the total positive samples (TP+FN) are classified as positive classes (TP).
3. PREDICTION MODEL
After collecting algal blooms dataset from the National Institute of Fisheries Science, it was cleaned
and refined. The first multiple logistic regression analysis was performed on the refined algal blooms dataset,
and some attributes were removed through a statistical test. A second multiple logistic regression analysis
was performed with the exception of the attributes removed and then the regularization was applied. After
applying the regularization, a third multiple logistic regression analysis is performed and the results are
applied to the confusion matrix. Figure 1 shows this process.
Figure 1. Prediction process of algal blooms
The probability of occurrence of harmful algal blooms with two or more independent variables is
defined as p(x) and the odds as =
𝑝
1−𝑝
. When the range of input values is [0, 1], log it transformation is
performed to adjust the range of output values to (−∞, ∞), resulting in log(𝑜𝑑𝑑𝑠) = log (
𝑝(𝑥)
1−𝑝(𝑥)
) = 𝛽0 +
𝛽1𝑥1 + 𝛽2𝑥2 + ⋯ + 𝛽𝑖𝑥𝑖 [25]. Therefore, for multiple independent variables that affect harmful algal blooms,
the multiple logistic function that allows the dependent variable range to be between [0, 1] is as shown in (5).
In (5) calculates the effect of each element of the ocean observation data, which is an independent variable,
on the occurrence of a harmful algal blooms as a dependent variable. This is a basic model for estimating the
probability of occurrence of harmful algal blooms.
𝑝(𝑥) =
1
1+𝑒−(𝛽0+𝛽1𝑥1+𝛽2𝑥2+⋯+𝛽𝑖𝑥𝑖) (5)
The maximum likelihood estimation is used to estimate parameter 𝛽 in regression expression
𝛽0 + 𝛽1𝑥1 + 𝛽2𝑥2 + ⋯ + 𝛽𝑖𝑥𝑖 by logit transformation. The log likelihood function can be obtained from the
likelihood function [26] expressed as the product of Bernoulli's probability function, and is expressed as (6).
 ISSN: 2088-8708
Int J Elec & Comp Eng, Vol. 11, No. 3, June 2021 : 2407 - 2413
2410
The parameter that maximizes the log likelihood function in (6) is determined from multiple independent
variables that affect the harmful algal blooms.
ln𝐿 = ∑ 𝑦𝑖(𝛽0 + 𝛽1𝑥1 + 𝛽2𝑥2 + ⋯ + 𝛽𝑖𝑥𝑖)
𝑖 + ∑ ln
𝑖 (1 + 𝑒𝛽0+𝛽1𝑥1+𝛽2𝑥2+⋯+𝛽𝑖𝑥𝑖 ) (6)
The L1 regularization [27] used to eliminate low-impact independent variables among multiple independent
variables that affect harmful algal blooms is shown in (7).
argmin ∑ (𝑦𝑖 − ∑ 𝑥𝑖𝑗𝛽𝑗)2
+ 𝜆 ∑ |𝛽𝑗|
𝑝
𝑗=1
𝑝
𝑗=1
𝑛
𝑖=1 (7)
The properties in the marine environment observation dataset are shown in Table 2 and used as
independent variables in logistic regression.
Table 2. Multiple independent variables for logistic regression
Variables Comments
T Temperature
S Salinity
DO Dissolved Oxygen
P Phosphate Phosphorus
NA Nitrous Acid Nitrogen
N Nitric Acid Nitrogen
SA Silicic Acid Silicon
In (8) is obtained by applying seven independent variables, such as water temperature, salinity,
dissolved oxygen, phosphate phosphorus, nitrous acid nitrogen, nitric acid nitrogen, silicic acid silicon, to the
basic model of multiple logistic regression (5).
log
𝑝(𝑥)
1−𝑝(𝑥)
= 𝛽0 + 𝛽1𝑇𝑖 + 𝛽2𝑆𝑖 + 𝛽3𝐷𝑂𝑖 + 𝛽4𝑃𝑖 + 𝛽5𝑁𝐴𝑖 + 𝛽6𝑁𝑖 + 𝛽7𝑆𝐴𝑖 (8)
P-value is used to determine if any independent variable was statistically significant in the results of
multiple logistic regression analysis on the training dataset, and independent variables with a P-value of 0.05
or higher are excluded. The parameters for statistically significant independent variables are as shown in (9).
log
𝑝(𝑥)
1−𝑝(𝑥)
= 𝛽0 + 𝛽1𝑇𝑖 + 𝛽2𝑆𝑖 + 𝛽3𝑃𝑖 + 𝛽4𝑁𝑖 (9)
The regulation for removing an independent variable close to zero in order to make some coefficients zero is
as shown in (7). The result is as shown in (10) when (7) is applied to the result of (9).
log
𝑝(𝑥)
1−𝑝(𝑥)
= 𝛽0 + 𝛽1𝑇𝑖 + 𝛽2𝑆𝑖 + 𝛽3𝑃𝑖 (10)
In (11) is the logistic regression model for algal blooms prediction obtained by applying the above process to
the algal blooms dataset.
𝑝(𝑥) =
1
1+𝑒−(−5.89+0.34𝑇𝑖−0.12𝑆𝑖+0.35𝑃𝑖) (11)
The normalization process from the (8) to the (11) is from Step 2 to Step 5 among the algorithms in
Table 3, respectively. The algal blooms prediction model was normalized while performing experiments
based on the algorithm in Table 3. The detailed experimental process is described in section 4. The equation
for obtaining a decision boundary to increase the sensitivity and precision is defined as shown in (12).
𝛥 = |0.5 ±
1
2
(
𝑇𝑃
𝑇𝑃+𝐹𝑁
+
𝑇𝑃
𝑇𝑃+𝐹𝑃
)| (12)
Table 3 shows algorithm for establishing algal blooms prediction model. This algorithm shows the
process of performing multiple logistic regression first in a refined dataset, then statistical tests on the results,
Int J Elec & Comp Eng ISSN: 2088-8708 
Prediction model of algal blooms using logistic regression and confusion matrix (Hongwon Yun)
2411
and then removing low-weight independent variables, finally setting up a logistic regression model, and
finding the decision boundary finally.
Table 3. Algorithm for establishing algal blooms prediction model
Step Statements
1 Extraction, Transformation and Loading from collected dataset
Prepare training dataset
2 Perform multiple regression analysis using (1) on the training dataset
Output regression coefficients and statistical tests
3 Perform a statistical significance test
∙attributes P-value > 0.5 are excluded in the training dataset
4 Perform multiple regression analysis for the training dataset with attributes whose P-value <= 0.5
Output regression coefficients and statistical tests for attributes whose P-value <= 0.5
5 Regularize regression coefficients from step 4 using
argmin∑ (𝑦𝑖 − ∑ 𝑥𝑖𝑗𝛽𝑗)2
+ 𝜆 ∑ |𝛽𝑗|
𝑝
𝑗=1
𝑝
𝑗=1
𝑛
𝑖=1
Perform multiple regression analysis using a regularized regression formula
Output test dataset
6 Input test dataset from step 5
Predict probability using decision boundary
𝛥 = |0.5 ±
1
2
(
𝑇𝑃
𝑇𝑃+𝐹𝑁
+
𝑇𝑃
𝑇𝑃+𝐹𝑃
)|based on confusion matrix
4. EXPERIMENT
Multiple logistic regression analysis (8) can be performed on the training dataset to obtain the
results shown in Table 4. In Table 4, p-value is used to determine whether any independent variable is
statistically significant, and independent variables with a p-value of 0.05 or higher are excluded. Parameters
𝛽 are determined for statistically significant independent variables in Table 4 and L1 regularization is applied
and then the results shown in Table 5 can be obtained. In (11) of the logistic regression model for algal
blooms prediction is obtained from the coefficients in Table 5.
Table 4. 1st
multiple logistic regression analysis on training dataset
Input variables Coefficient Std. error P-value
Constant -5.35 1.49 0.00
Temperature 0.33 0.02 0.00
Salinity -0.12 0.03 0.00
Dissolved Oxygen -0.05 0.08 0.54
Phosphate Phosphorus 0.38 0.14 0.01
Nitrous Acid Nitrogen -0.07 0.12 0.58
Nitric Acid Nitrogen -0.06 0.02 0.00
Silicic Acid Silicon -0.02 0.01 0.16
Table 5. Coefficients of logistic regression model for algal blooms prediction
Input variables Coefficient Std. error P-value
Constant -5.89 1.21 0.00
Temperature 0.34 0.01 0.00
Salinity -0.12 0.03 0.00
Phosphate Phosphorus 0.35 0.14 0.01
Predicting the occurrence of algal blooms from (11) gives 91.84% accuracy. Accuracy alone may
not be sufficient to assess the predicted performance of algal blooms. We utilize the confusion matrix since
we do not know false negatives or false positives of algal blooms. A confusion matrix for algal blooms
shown in Table 6 is obtained from algal blooms dataset.
Table 6. Confusion matrix for algal blooms
Confusion matrix
(Error matrix)
Actual values of algal blooms
P(Occurrence) N(Not occurrence)
Predicted values of
algal blooms
Y (0.5 or higher) True Positive
tp=222
False Positive
fp=205
N (less than 0.5 ) False Negative
fn=599
True Negative
tn=8828
 ISSN: 2088-8708
Int J Elec & Comp Eng, Vol. 11, No. 3, June 2021 : 2407 - 2413
2412
Table 7 shows the sensitivity, specificity, and precision are obtained based on the decision boundary
0.5 using the values of the confusion matrix in Table 6. The prediction rate of false negative of algal blooms
is low as the sensitivity is 27.04%, and the prediction rate of false positive is also low because the precision is
51.99%. Since the sensitivity and precision are low in case of the decision boundary is 0.5, we apply
proposed the decision boundary (12) in order to solve these problems, and results are as shown in Table 8.
When the decision boundary proposed in this paper is applied, the decision boundary becomes
𝛥 = |0.5 ± 0.25|. When this is used as a decision boundary, TP=494, TN=9026, FN=327, FP=7, the
sensitivity is 60.17%, and the precision is 98.6% as shown in Table 8.
Table 7. Resulting confusion matrix based on decision boundary 0.5 (unit: %)
TPR TNR PPV FPR ACC F1
Sensitivity Specificity Precision Fallout Accuracy F1 Score
27.04 97.73 51.99 2.27 91.84 35.58
Table 8. Resulting confusion matrix based on proposed decision boundary 𝛥 (unit: %)
TPR TNR PPV FPR ACC F1
Sensitivity Specificity Precision Fallout Accuracy F1 Score
60.17 99.92 98.6 0.08 96.61 74.74
5. CONCLUSION
In this paper, logistic regression and confusion matrix were used to predict the occurrence of algal
blooms. Algal blooms datasets were collected and refined for experimental analysis of algal blooms
prediction. Logistic regression analysis was performed on refined algal blooms dataset and main marine
environmental factors affecting algal blooms were found through statistical test and regularization processes.
Logistic regression was performed using the marine environmental factors that were influential on algal
blooms and the accuracy of algal bloom occurrence was obtained. The values of the confustion matrix were
obtained using the dataset for algal blooms prediction and the predicted values obtained from logistic
regression. Although the sensitivity and precision for the occurrence of algal blooms can be obtained from
the values of the confusion matrix, the sensitivity and precision were low when the existing decision
boundary was 0.5. Sensitivity and precision were improved by using the decision boundary proposed in this
study. In this paper, the algal blooms prediction model was established by the ensemble method using logistic
regression analysis and confusion matrix. Also, the accuracy, sensitivity, and precision for algal blooms
prediction were improved, and these were verified through big data analysis.
REFERENCES
[1] Hosmer Jr., et al., “Applied logistic regression,” John Wiley & Sons, vol. 398, 2013.
[2] M. Chang, et al., “Selection of Transformations of Continuous Predictors in Logistic Regression,” Information
Technology-New Generations, pp. 443-447, 2018.
[3] J. W. Osborne, “Simple Linear Models with Categorical Dependent Variables: Binary Logistic Regression,” SAGE,
pp. 97-132, 2017.
[4] J. Tolles and J. M. William, “Logistic regression: relating patient characteristics to outcomes,” Jama, vol. 316,
no. 5, pp. 533-534, 2016.
[5] A. D. Caigny, et al., “A new hybrid classification algorithm for customer churn prediction based on logistic
regression and decision trees,” European Journal of Operational Research, vol. 269, no. 2, pp. 760-772, 2018.
[6] X. Wan, “The Influence of Polynomial Order in Logistic Regression on Decision Boundary,” IOP Conference
Series: Earth and Environmental Science, vol. 267, no. 4, pp. 1-4, 2019.
[7] S. Ghazaal and A. Hakan, “Using the Distance in Logistic Regression Models for Predictor Ranking in Diabetes
Detection,” International Conference on Medical and Biological Engineering, 2019, pp. 665-670.
[8] J. Friedman, et al., “Additive logistic regression: A statistical view of boosting,” Annals of statistics, vol. 28, no. 2,
pp. 337-374, 2000.
[9] H. M. Ramos, et al., “A new explanatory index for evaluating the binary logistic regression based on the sensitivity
of the estimated model,” Statistics and Probability Letters, vol. 120, pp. 135-140, 2017.
[10] M. Ohsaki, et al., “Confusion-matrix-based kernel logistic regression for imbalanced data classification,” IEEE
Transactions on Knowledge and Data Engineering, vol. 29, no. 9, pp. 1806-1819, 2017.
[11] J. A. McGowan, et al., “Predicting coastal algal blooms in southern California,” Ecology, vol. 98, no. 5,
pp. 1419-1433, 2017.
[12] N. F. Manning, et al., “Extending the forecast model: Predicting Western Lake Erie harmful algal blooms at
multiple spatial scales,” Journal of Great Lakes Research, vol. 45, no. 3 pp. 587-595, 2019.
Int J Elec & Comp Eng ISSN: 2088-8708 
Prediction model of algal blooms using logistic regression and confusion matrix (Hongwon Yun)
2413
[13] N. Mellios, et al., “Machine Learning Approaches for Predicting Health Risk of Cyanobacterial Blooms in Northern
European Lakes,” Water, vol. 12, no. 4, p. 1191, 2020.
[14] L. Wang, et al., “An approach of improved Multivariate Timing-Random Deep Belief Net modelling for algal
bloom prediction,” Biosystems engineering, vol. 177, pp. 130-138, 2019.
[15] Ghatkar, et al., “Classification of algal bloom species from remote sensing data using an extreme gradient boosted
decision tree model,” International Journal of Remote Sensing, vol. 40, no. 24, pp. 9412-9438, 2019.
[16] S. Lee and D. Lee, “Improved prediction of harmful algal blooms in four Major South Korea’s Rivers using deep
learning models,” International journal of environmental research and public health, vol. 15, no. 7, p. 1322, 2018.
[17] X. Sun, et al., “A Bayesian structural model for predicting algal blooms,” Journal of Forecasting, vol. 38, no. 8,
pp. 788-802, 2019.
[18] F. Thabtah et al., “A machine learning autism classification based on logistic regression analysis,” Health
information science and systems, vol. 7, no. 1, p. 12, 2019.
[19] D. Menezes, et al., “Data classification with binary response through the Boosting algorithm and logistic
regression,” Expert Systems with Applications, vol. 69, pp. 62-73, 2017.
[20] J. M. Hilbe, “Practical guide to logistic regression,” CRC Press, 2016.
[21] C. Fernández and F. Provost, “Causal Classification: Treatment Effect vs. Outcome Prediction,” Outcome
Prediction, 2019.
[22] K. Lee, et al., “Unbalanced data, type II error, and nonlinearity in predicting M&A failure,” Journal of Business
Research, vol. 109, pp. 271-287, 2010.
[23] S. V. Stehman, “Selecting and interpreting measures of thematic classification accuracy,” Remote Sensing of
Environment, vol. 62, no. 1, pp. 77-89, 1997.
[24] D. Powers, “Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness &
Correlation,” Journal of Machine Learning Technologies, vol. 2, no. 1, pp. 37-63, 2011.
[25] J. S. Cramer, “The origins and development of the logit model,” Cambridge UP, pp. 1-19, 2003.
[26] I. M. Myung, “Tutorial on Maximum Likelihood Estimation,” Journal of Mathematical Psychology, vol. 47, no. 1,
pp. 90-100, 2003.
[27] F. Santosa and W. Symes, “Linear inversion of band-limited reflection seismograms,” SIAM Journal on Scientific
and Statistical Computing, SIAM, vol. 7, no. 4, pp. 1307-1330, 1986.
BIOGRAPHY OF AUTHOR
Hongwon Yun is a Professor with the Department of Computer Software Engineering at Silla
University, Busan, South Korea. He received his B.S. and the Ph.D. degrees at the Department of
Computer Science from Pusan National University, South Korea. His research interests include
database system, big data analysis, and machine learning.

More Related Content

What's hot (16)

PDF
IRJET - Movie Genre Prediction from Plot Summaries by Comparing Various C...
IRJET Journal
 
PDF
BPSO&1-NN algorithm-based variable selection for power system stability ident...
IJAEMSJORNAL
 
PDF
Sensitivity analysis in a lidar camera calibration
csandit
 
PPTX
Chi square
Jacob John Panicker
 
PDF
A New SDM Classifier Using Jaccard Mining Procedure (CASE STUDY: RHEUMATIC FE...
Soaad Abd El-Badie
 
PDF
A Genetic Algorithm on Optimization Test Functions
IJMERJOURNAL
 
PDF
A new sdm classifier using jaccard mining procedure case study rheumatic feve...
ijbbjournal
 
PDF
G03405049058
theijes
 
PDF
A Threshold Fuzzy Entropy Based Feature Selection: Comparative Study
IJMER
 
PDF
Data Mining using SAS
Tanu Puri
 
PDF
FurtherInvestegationOnProbabilisticErrorBounds_final
Mohammad Abdo
 
PDF
2014 IIAG Imputation Assessments
KROTOASA FOUNDATION
 
DOC
Missing Value imputation, Poor man's
Leonardo Auslender
 
PDF
Effect of Feature Selection on Gene Expression Datasets Classification Accura...
IJECEIAES
 
PDF
Simulation Study of Hurdle Model Performance on Zero Inflated Count Data
Ian Camacho
 
PDF
A Comparative Analysis on the Evaluation of Classification Algorithms in the ...
IJECEIAES
 
IRJET - Movie Genre Prediction from Plot Summaries by Comparing Various C...
IRJET Journal
 
BPSO&1-NN algorithm-based variable selection for power system stability ident...
IJAEMSJORNAL
 
Sensitivity analysis in a lidar camera calibration
csandit
 
A New SDM Classifier Using Jaccard Mining Procedure (CASE STUDY: RHEUMATIC FE...
Soaad Abd El-Badie
 
A Genetic Algorithm on Optimization Test Functions
IJMERJOURNAL
 
A new sdm classifier using jaccard mining procedure case study rheumatic feve...
ijbbjournal
 
G03405049058
theijes
 
A Threshold Fuzzy Entropy Based Feature Selection: Comparative Study
IJMER
 
Data Mining using SAS
Tanu Puri
 
FurtherInvestegationOnProbabilisticErrorBounds_final
Mohammad Abdo
 
2014 IIAG Imputation Assessments
KROTOASA FOUNDATION
 
Missing Value imputation, Poor man's
Leonardo Auslender
 
Effect of Feature Selection on Gene Expression Datasets Classification Accura...
IJECEIAES
 
Simulation Study of Hurdle Model Performance on Zero Inflated Count Data
Ian Camacho
 
A Comparative Analysis on the Evaluation of Classification Algorithms in the ...
IJECEIAES
 

Similar to Prediction model of algal blooms using logistic regression and confusion matrix (20)

PDF
the unconditional Logistic Regression .pdf
mikaelgirum
 
PDF
Smooth Support Vector Machine for Suicide-Related Behaviours Prediction
IJECEIAES
 
PDF
Short-term load forecasting with using multiple linear regression
IJECEIAES
 
PDF
PREDICTING CLASS-IMBALANCED BUSINESS RISK USING RESAMPLING, REGULARIZATION, A...
IJMIT JOURNAL
 
PPTX
correction maximum likelihood estimation method
qazikhanzla
 
PDF
L033054058
ijceronline
 
PDF
Multinomial Logistic Regression.pdf
AlemAyahu
 
PDF
Correation, Linear Regression and Multilinear Regression using R software
shrikrishna kesharwani
 
PDF
Quantile Regression with Q1/Q3 Anchoring: A Robust Alternative for Outlier-Re...
mlaij
 
PDF
Quantile Regression with Q1/Q3 Anchoring: A Robust Alternative for Outlier-Re...
mlaij
 
PDF
A Comparison of Accuracy Measures for Remote Sensing Image Classification: Ca...
CSCJournals
 
PDF
Evaluation measures for models assessment over imbalanced data sets
Alexander Decker
 
PDF
KNN and ARL Based Imputation to Estimate Missing Values
ijeei-iaes
 
PDF
Parametric Sensitivity Analysis of a Mathematical Model of Two Interacting Po...
IOSR Journals
 
PDF
Supervised Learning.pdf
gadissaassefa
 
PDF
USE OF PLS COMPONENTS TO IMPROVE CLASSIFICATION ON BUSINESS DECISION MAKING
IJDKP
 
PDF
Data Mining for Integration and Verification of Socio-Geographical Trend Stat...
gerogepatton
 
PDF
Data Mining for Integration and Verification of Socio-Geographical Trend Stat...
ijaia
 
PDF
International Journal of Pharmaceutica Analytica Acta
SciRes Literature LLC. | Open Access Journals
 
PDF
B04460815
IOSR-JEN
 
the unconditional Logistic Regression .pdf
mikaelgirum
 
Smooth Support Vector Machine for Suicide-Related Behaviours Prediction
IJECEIAES
 
Short-term load forecasting with using multiple linear regression
IJECEIAES
 
PREDICTING CLASS-IMBALANCED BUSINESS RISK USING RESAMPLING, REGULARIZATION, A...
IJMIT JOURNAL
 
correction maximum likelihood estimation method
qazikhanzla
 
L033054058
ijceronline
 
Multinomial Logistic Regression.pdf
AlemAyahu
 
Correation, Linear Regression and Multilinear Regression using R software
shrikrishna kesharwani
 
Quantile Regression with Q1/Q3 Anchoring: A Robust Alternative for Outlier-Re...
mlaij
 
Quantile Regression with Q1/Q3 Anchoring: A Robust Alternative for Outlier-Re...
mlaij
 
A Comparison of Accuracy Measures for Remote Sensing Image Classification: Ca...
CSCJournals
 
Evaluation measures for models assessment over imbalanced data sets
Alexander Decker
 
KNN and ARL Based Imputation to Estimate Missing Values
ijeei-iaes
 
Parametric Sensitivity Analysis of a Mathematical Model of Two Interacting Po...
IOSR Journals
 
Supervised Learning.pdf
gadissaassefa
 
USE OF PLS COMPONENTS TO IMPROVE CLASSIFICATION ON BUSINESS DECISION MAKING
IJDKP
 
Data Mining for Integration and Verification of Socio-Geographical Trend Stat...
gerogepatton
 
Data Mining for Integration and Verification of Socio-Geographical Trend Stat...
ijaia
 
International Journal of Pharmaceutica Analytica Acta
SciRes Literature LLC. | Open Access Journals
 
B04460815
IOSR-JEN
 
Ad

More from IJECEIAES (20)

PDF
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
IJECEIAES
 
PDF
Embedded machine learning-based road conditions and driving behavior monitoring
IJECEIAES
 
PDF
Advanced control scheme of doubly fed induction generator for wind turbine us...
IJECEIAES
 
PDF
Neural network optimizer of proportional-integral-differential controller par...
IJECEIAES
 
PDF
An improved modulation technique suitable for a three level flying capacitor ...
IJECEIAES
 
PDF
A review on features and methods of potential fishing zone
IJECEIAES
 
PDF
Electrical signal interference minimization using appropriate core material f...
IJECEIAES
 
PDF
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
IJECEIAES
 
PDF
Bibliometric analysis highlighting the role of women in addressing climate ch...
IJECEIAES
 
PDF
Voltage and frequency control of microgrid in presence of micro-turbine inter...
IJECEIAES
 
PDF
Enhancing battery system identification: nonlinear autoregressive modeling fo...
IJECEIAES
 
PDF
Smart grid deployment: from a bibliometric analysis to a survey
IJECEIAES
 
PDF
Use of analytical hierarchy process for selecting and prioritizing islanding ...
IJECEIAES
 
PDF
Enhancing of single-stage grid-connected photovoltaic system using fuzzy logi...
IJECEIAES
 
PDF
Enhancing photovoltaic system maximum power point tracking with fuzzy logic-b...
IJECEIAES
 
PDF
Adaptive synchronous sliding control for a robot manipulator based on neural ...
IJECEIAES
 
PDF
Remote field-programmable gate array laboratory for signal acquisition and de...
IJECEIAES
 
PDF
Detecting and resolving feature envy through automated machine learning and m...
IJECEIAES
 
PDF
Smart monitoring technique for solar cell systems using internet of things ba...
IJECEIAES
 
PDF
An efficient security framework for intrusion detection and prevention in int...
IJECEIAES
 
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
IJECEIAES
 
Embedded machine learning-based road conditions and driving behavior monitoring
IJECEIAES
 
Advanced control scheme of doubly fed induction generator for wind turbine us...
IJECEIAES
 
Neural network optimizer of proportional-integral-differential controller par...
IJECEIAES
 
An improved modulation technique suitable for a three level flying capacitor ...
IJECEIAES
 
A review on features and methods of potential fishing zone
IJECEIAES
 
Electrical signal interference minimization using appropriate core material f...
IJECEIAES
 
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
IJECEIAES
 
Bibliometric analysis highlighting the role of women in addressing climate ch...
IJECEIAES
 
Voltage and frequency control of microgrid in presence of micro-turbine inter...
IJECEIAES
 
Enhancing battery system identification: nonlinear autoregressive modeling fo...
IJECEIAES
 
Smart grid deployment: from a bibliometric analysis to a survey
IJECEIAES
 
Use of analytical hierarchy process for selecting and prioritizing islanding ...
IJECEIAES
 
Enhancing of single-stage grid-connected photovoltaic system using fuzzy logi...
IJECEIAES
 
Enhancing photovoltaic system maximum power point tracking with fuzzy logic-b...
IJECEIAES
 
Adaptive synchronous sliding control for a robot manipulator based on neural ...
IJECEIAES
 
Remote field-programmable gate array laboratory for signal acquisition and de...
IJECEIAES
 
Detecting and resolving feature envy through automated machine learning and m...
IJECEIAES
 
Smart monitoring technique for solar cell systems using internet of things ba...
IJECEIAES
 
An efficient security framework for intrusion detection and prevention in int...
IJECEIAES
 
Ad

Recently uploaded (20)

PPTX
Day2 B2 Best.pptx
helenjenefa1
 
PPTX
Server Side Web Development Unit 1 of Nodejs.pptx
sneha852132
 
PPTX
Thermal runway and thermal stability.pptx
godow93766
 
PPTX
Arduino Based Gas Leakage Detector Project
CircuitDigest
 
PPTX
Introduction to Neural Networks and Perceptron Learning Algorithm.pptx
Kayalvizhi A
 
PDF
Water Design_Manual_2005. KENYA FOR WASTER SUPPLY AND SEWERAGE
DancanNgutuku
 
PPTX
Solar Thermal Energy System Seminar.pptx
Gpc Purapuza
 
PDF
Unified_Cloud_Comm_Presentation anil singh ppt
anilsingh298751
 
PPTX
Element 7. CHEMICAL AND BIOLOGICAL AGENT.pptx
merrandomohandas
 
DOCX
CS-802 (A) BDH Lab manual IPS Academy Indore
thegodhimself05
 
PPTX
Green Building & Energy Conservation ppt
Sagar Sarangi
 
PPTX
MPMC_Module-2 xxxxxxxxxxxxxxxxxxxxx.pptx
ShivanshVaidya5
 
PPTX
Shinkawa Proposal to meet Vibration API670.pptx
AchmadBashori2
 
PPTX
The Role of Information Technology in Environmental Protectio....pptx
nallamillisriram
 
PPTX
原版一样(Acadia毕业证书)加拿大阿卡迪亚大学毕业证办理方法
Taqyea
 
PPTX
Heart Bleed Bug - A case study (Course: Cryptography and Network Security)
Adri Jovin
 
PDF
Basic_Concepts_in_Clinical_Biochemistry_2018كيمياء_عملي.pdf
AdelLoin
 
PDF
MAD Unit - 2 Activity and Fragment Management in Android (Diploma IT)
JappanMavani
 
PDF
Ethics and Trustworthy AI in Healthcare – Governing Sensitive Data, Profiling...
AlqualsaDIResearchGr
 
DOCX
8th International Conference on Electrical Engineering (ELEN 2025)
elelijjournal653
 
Day2 B2 Best.pptx
helenjenefa1
 
Server Side Web Development Unit 1 of Nodejs.pptx
sneha852132
 
Thermal runway and thermal stability.pptx
godow93766
 
Arduino Based Gas Leakage Detector Project
CircuitDigest
 
Introduction to Neural Networks and Perceptron Learning Algorithm.pptx
Kayalvizhi A
 
Water Design_Manual_2005. KENYA FOR WASTER SUPPLY AND SEWERAGE
DancanNgutuku
 
Solar Thermal Energy System Seminar.pptx
Gpc Purapuza
 
Unified_Cloud_Comm_Presentation anil singh ppt
anilsingh298751
 
Element 7. CHEMICAL AND BIOLOGICAL AGENT.pptx
merrandomohandas
 
CS-802 (A) BDH Lab manual IPS Academy Indore
thegodhimself05
 
Green Building & Energy Conservation ppt
Sagar Sarangi
 
MPMC_Module-2 xxxxxxxxxxxxxxxxxxxxx.pptx
ShivanshVaidya5
 
Shinkawa Proposal to meet Vibration API670.pptx
AchmadBashori2
 
The Role of Information Technology in Environmental Protectio....pptx
nallamillisriram
 
原版一样(Acadia毕业证书)加拿大阿卡迪亚大学毕业证办理方法
Taqyea
 
Heart Bleed Bug - A case study (Course: Cryptography and Network Security)
Adri Jovin
 
Basic_Concepts_in_Clinical_Biochemistry_2018كيمياء_عملي.pdf
AdelLoin
 
MAD Unit - 2 Activity and Fragment Management in Android (Diploma IT)
JappanMavani
 
Ethics and Trustworthy AI in Healthcare – Governing Sensitive Data, Profiling...
AlqualsaDIResearchGr
 
8th International Conference on Electrical Engineering (ELEN 2025)
elelijjournal653
 

Prediction model of algal blooms using logistic regression and confusion matrix

  • 1. International Journal of Electrical and Computer Engineering (IJECE) Vol. 11, No. 3, June 2021, pp. 2407~2413 ISSN: 2088-8708, DOI: 10.11591/ijece.v11i3.pp2407-2413  2407 Journal homepage: http://ijece.iaescore.com Prediction model of algal blooms using logistic regression and confusion matrix Hongwon Yun Department of Computer Software Engineering, Silla University, Busan, Republic of Korea Article Info ABSTRACT Article history: Received Jul 31, 2020 Revised Sep 22, 2020 Accepted Oct 8, 2020 Algal blooms data are collected and refined as experimental data for algal blooms prediction. Refined algal blooms dataset is analyzed by logistic regression analysis, and statistical tests and regularization are performed to find the marine environmental factors affecting algal blooms. The predicted value of algal bloom is obtained through logistic regression analysis using marine environment factors affecting algal blooms. The actual values and the predicted values of algal blooms dataset are applied to the confusion matrix. By improving the decision boundary of the existing logistic regression, and accuracy, sensitivity and precision for algal blooms prediction are improved. In this paper, the algal blooms prediction model is established by the ensemble method using logistic regression and confusion matrix. Algal blooms prediction is improved, and this is verified through big data analysis. Keywords: Algal blooms Confusion matrix Ensemble method Logistic regression Prediction model This is an open access article under the CC BY-SA license. Corresponding Author: Hongwon Yun Department of Computer Software Engineering Silla University Busan 46958, Republic of Korea Email: [email protected] 1. INTRODUCTION Logistic regression is a special case of a typical model and is similar to linear regression, however it has a difference in the relationship between dependent and independent variables. The dependent variable of logistic regression can be binary or continuous, and it is used as a model for classification or prediction when the dependent variable is binary [1, 2]. If the dependent variable of logistic regression is binary, the range of its value is limited to the bivariate and the distribution of conditional probability follows the Bernoulli distribution. Logistic regression allows dependent variable values to be between 0 and 1 regardless of the range of independent variable values, so it is possible to classify the result of data into a specific classification when input data is given and predict the likelihood of an event occurring [3-5]. In logistic regression, where the dependent variable is binary, the predicted value can be calculated using a linear combination of the independent variables. However, since the value of the dependent variable is classified as pass or fail around the decision boundary, the value close to the decision boundary may be less accurate [6-8]. In binary logistic regression, since the actual value of the dependent variable is present and the predicted value can be calculated, the predicted value can be applied to a confusion matrix that can be compared to the target value [9, 10]. It can be obtained sensitivity and precision from the confusion matrix using the actual and predicted values of the logistic regression, and apply it to algal blooms to create a summary of indicators such as sensitivity and precision including accuracy [11-13]. Sensitivity and precision are as important as accuracy in predicting algal bloom occurrence. This is because high sensitivity and precision can provide indicators that can prevent massive property damage [14-17]. The elements of the marine environment that cause algal blooms are generally known, but no study
  • 2.  ISSN: 2088-8708 Int J Elec & Comp Eng, Vol. 11, No. 3, June 2021 : 2407 - 2413 2408 can be found to analyze the influence of each element on algal blooms and predict algal blooms. In this study, the predicted value of logistic regression is calculated by machine learning. The actual value used in logistic regression analysis and the prediction calculated through machine running are applied to the confusion matrix to create a prediction model for algal blooms. This paper is organized as follows. The logistic regression and confusion matrix as the background theory of this study are describe in section 2. In section 3, we describe the algal blooms prediction model using the ensemble method of the logistic regression and confusion matrix proposed in this study. Here we describe the process of extracting marine environmental elements using logistic regression, obtaining red tide prediction values, applying improved decision boundaries to logistic regression, and how to improve accuracy, sensitivity and precision through confusion matrix. In section 4, we verify the proposed algal blooms prediction model using the algal blooms dataset, and conclusions are described in section 5. 2. LOGISTIC REGRESSION AND CONFUTION MATRIX 2.1. Logistic regression Linear regression is a model that estimates a regression coefficient that can linearly express the relationship between independent variables X and dependent variables Y with continuous values. If the dependent variable Y is a binary variable, logistic regression is used because linear regression cannot be applied directly. Some regression algorithms can be used for classification, and logistic regression is widely used to estimate the probability that a sample belongs to a particular class. If the estimated probability exceeds 0.5, the sample is predicted to belong to the class, and if it is less than 0.5, it is used as a binary classifier to predict that the sample does not belong to the class [18, 19]. To estimate the probability, logistic regression calculates the weighted sum of the input characteristics, but instead of outputting the result immediately such as linear regression, it outputs the logistic of the result value. Logistic is a sigmoid function that outputs a value between 0 and 1 [20]. The logistic function has the effect of limiting the output result to always between 0 and 1 for numerical values x, and its expression is defined as follows. 𝑦 = 1 1+𝑒−𝑓(𝑥) (1) In (1), 𝑓(𝑥) can be either a simple linear function or a multiple linear function. For classification problems with two categories, if 𝑓(𝑥) > 0 is classified as 𝑦 → 1 and 𝑓(𝑥) < 0 is classified as 𝑦 → 0. The decision boundary of the logistic regression model is the 𝑓(𝑥) = 0 in hyperplane and becomes 𝑦 = 0.5. Errors in prediction usually occur around the decision boundary [21, 22]. 2.2. Confusion matrix The confusion matrix is a tool that easily and effectively shows the performance of the classifier and has the advantage of being easy to interpret the results. A confusion matrix can be used to evaluate the performance of any models or algorithms. As shown in Table 1, the rows in the confusion matrix represent the values of the predictive class and the columns represent the values of the actual class. Each cell is one of the possible combinations of prediction and actuality. In the 2×2 confusion matrix, there are true positive (TP), false positive (FP), false negative (FN), and true false (TF) [23]. The perfect model will only have values on the diagonal, the rest of the cells will be all zeros, and the bad model will be evenly distributed in all cells. The error matrix tells us how bad a model is when it is bad. The value of each cell can identify a misclassified pattern [24]. Table 1. Confusion matrix Confusion matrix True class (Actual) P N Hypothesized class (Predicted) Y True Positives False Positives N False Negatives True Negatives Methods for summarizing the results of the confusion matrix include accuracy, precision, and recall. 𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦 = 𝑇𝑃+𝑇𝑁 𝑇𝑃+𝑇𝑁+𝐹𝑃+𝐹𝑁 (2) The accuracy is obtained by dividing the accurately predicted number (TP+TN) by the total number of samples, and is represented by (2). Among the methods for summarizing the results in the confusion
  • 3. Int J Elec & Comp Eng ISSN: 2088-8708  Prediction model of algal blooms using logistic regression and confusion matrix (Hongwon Yun) 2409 matrix, the most frequently used precision and sensitivity are as shown in (3) and (4), respectively. 𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 = 𝑇𝑃 𝑇𝑃+𝐹𝑃 (3) Precision is a positive predictive value that measures how many of the samples (TP+FP) predicted to be positive are true positives (TP). Precision is used as a performance indicator when the goal is to reduce the number of false positives (FP). 𝑆𝑒𝑛𝑠𝑖𝑡𝑖𝑣𝑖𝑡𝑦 = 𝑇𝑃 𝑇𝑃+𝐹𝑁 (4) Sensitivity measures how many of the total positive samples (TP+FN) are classified as positive classes (TP). 3. PREDICTION MODEL After collecting algal blooms dataset from the National Institute of Fisheries Science, it was cleaned and refined. The first multiple logistic regression analysis was performed on the refined algal blooms dataset, and some attributes were removed through a statistical test. A second multiple logistic regression analysis was performed with the exception of the attributes removed and then the regularization was applied. After applying the regularization, a third multiple logistic regression analysis is performed and the results are applied to the confusion matrix. Figure 1 shows this process. Figure 1. Prediction process of algal blooms The probability of occurrence of harmful algal blooms with two or more independent variables is defined as p(x) and the odds as = 𝑝 1−𝑝 . When the range of input values is [0, 1], log it transformation is performed to adjust the range of output values to (−∞, ∞), resulting in log(𝑜𝑑𝑑𝑠) = log ( 𝑝(𝑥) 1−𝑝(𝑥) ) = 𝛽0 + 𝛽1𝑥1 + 𝛽2𝑥2 + ⋯ + 𝛽𝑖𝑥𝑖 [25]. Therefore, for multiple independent variables that affect harmful algal blooms, the multiple logistic function that allows the dependent variable range to be between [0, 1] is as shown in (5). In (5) calculates the effect of each element of the ocean observation data, which is an independent variable, on the occurrence of a harmful algal blooms as a dependent variable. This is a basic model for estimating the probability of occurrence of harmful algal blooms. 𝑝(𝑥) = 1 1+𝑒−(𝛽0+𝛽1𝑥1+𝛽2𝑥2+⋯+𝛽𝑖𝑥𝑖) (5) The maximum likelihood estimation is used to estimate parameter 𝛽 in regression expression 𝛽0 + 𝛽1𝑥1 + 𝛽2𝑥2 + ⋯ + 𝛽𝑖𝑥𝑖 by logit transformation. The log likelihood function can be obtained from the likelihood function [26] expressed as the product of Bernoulli's probability function, and is expressed as (6).
  • 4.  ISSN: 2088-8708 Int J Elec & Comp Eng, Vol. 11, No. 3, June 2021 : 2407 - 2413 2410 The parameter that maximizes the log likelihood function in (6) is determined from multiple independent variables that affect the harmful algal blooms. ln𝐿 = ∑ 𝑦𝑖(𝛽0 + 𝛽1𝑥1 + 𝛽2𝑥2 + ⋯ + 𝛽𝑖𝑥𝑖) 𝑖 + ∑ ln 𝑖 (1 + 𝑒𝛽0+𝛽1𝑥1+𝛽2𝑥2+⋯+𝛽𝑖𝑥𝑖 ) (6) The L1 regularization [27] used to eliminate low-impact independent variables among multiple independent variables that affect harmful algal blooms is shown in (7). argmin ∑ (𝑦𝑖 − ∑ 𝑥𝑖𝑗𝛽𝑗)2 + 𝜆 ∑ |𝛽𝑗| 𝑝 𝑗=1 𝑝 𝑗=1 𝑛 𝑖=1 (7) The properties in the marine environment observation dataset are shown in Table 2 and used as independent variables in logistic regression. Table 2. Multiple independent variables for logistic regression Variables Comments T Temperature S Salinity DO Dissolved Oxygen P Phosphate Phosphorus NA Nitrous Acid Nitrogen N Nitric Acid Nitrogen SA Silicic Acid Silicon In (8) is obtained by applying seven independent variables, such as water temperature, salinity, dissolved oxygen, phosphate phosphorus, nitrous acid nitrogen, nitric acid nitrogen, silicic acid silicon, to the basic model of multiple logistic regression (5). log 𝑝(𝑥) 1−𝑝(𝑥) = 𝛽0 + 𝛽1𝑇𝑖 + 𝛽2𝑆𝑖 + 𝛽3𝐷𝑂𝑖 + 𝛽4𝑃𝑖 + 𝛽5𝑁𝐴𝑖 + 𝛽6𝑁𝑖 + 𝛽7𝑆𝐴𝑖 (8) P-value is used to determine if any independent variable was statistically significant in the results of multiple logistic regression analysis on the training dataset, and independent variables with a P-value of 0.05 or higher are excluded. The parameters for statistically significant independent variables are as shown in (9). log 𝑝(𝑥) 1−𝑝(𝑥) = 𝛽0 + 𝛽1𝑇𝑖 + 𝛽2𝑆𝑖 + 𝛽3𝑃𝑖 + 𝛽4𝑁𝑖 (9) The regulation for removing an independent variable close to zero in order to make some coefficients zero is as shown in (7). The result is as shown in (10) when (7) is applied to the result of (9). log 𝑝(𝑥) 1−𝑝(𝑥) = 𝛽0 + 𝛽1𝑇𝑖 + 𝛽2𝑆𝑖 + 𝛽3𝑃𝑖 (10) In (11) is the logistic regression model for algal blooms prediction obtained by applying the above process to the algal blooms dataset. 𝑝(𝑥) = 1 1+𝑒−(−5.89+0.34𝑇𝑖−0.12𝑆𝑖+0.35𝑃𝑖) (11) The normalization process from the (8) to the (11) is from Step 2 to Step 5 among the algorithms in Table 3, respectively. The algal blooms prediction model was normalized while performing experiments based on the algorithm in Table 3. The detailed experimental process is described in section 4. The equation for obtaining a decision boundary to increase the sensitivity and precision is defined as shown in (12). 𝛥 = |0.5 ± 1 2 ( 𝑇𝑃 𝑇𝑃+𝐹𝑁 + 𝑇𝑃 𝑇𝑃+𝐹𝑃 )| (12) Table 3 shows algorithm for establishing algal blooms prediction model. This algorithm shows the process of performing multiple logistic regression first in a refined dataset, then statistical tests on the results,
  • 5. Int J Elec & Comp Eng ISSN: 2088-8708  Prediction model of algal blooms using logistic regression and confusion matrix (Hongwon Yun) 2411 and then removing low-weight independent variables, finally setting up a logistic regression model, and finding the decision boundary finally. Table 3. Algorithm for establishing algal blooms prediction model Step Statements 1 Extraction, Transformation and Loading from collected dataset Prepare training dataset 2 Perform multiple regression analysis using (1) on the training dataset Output regression coefficients and statistical tests 3 Perform a statistical significance test ∙attributes P-value > 0.5 are excluded in the training dataset 4 Perform multiple regression analysis for the training dataset with attributes whose P-value <= 0.5 Output regression coefficients and statistical tests for attributes whose P-value <= 0.5 5 Regularize regression coefficients from step 4 using argmin∑ (𝑦𝑖 − ∑ 𝑥𝑖𝑗𝛽𝑗)2 + 𝜆 ∑ |𝛽𝑗| 𝑝 𝑗=1 𝑝 𝑗=1 𝑛 𝑖=1 Perform multiple regression analysis using a regularized regression formula Output test dataset 6 Input test dataset from step 5 Predict probability using decision boundary 𝛥 = |0.5 ± 1 2 ( 𝑇𝑃 𝑇𝑃+𝐹𝑁 + 𝑇𝑃 𝑇𝑃+𝐹𝑃 )|based on confusion matrix 4. EXPERIMENT Multiple logistic regression analysis (8) can be performed on the training dataset to obtain the results shown in Table 4. In Table 4, p-value is used to determine whether any independent variable is statistically significant, and independent variables with a p-value of 0.05 or higher are excluded. Parameters 𝛽 are determined for statistically significant independent variables in Table 4 and L1 regularization is applied and then the results shown in Table 5 can be obtained. In (11) of the logistic regression model for algal blooms prediction is obtained from the coefficients in Table 5. Table 4. 1st multiple logistic regression analysis on training dataset Input variables Coefficient Std. error P-value Constant -5.35 1.49 0.00 Temperature 0.33 0.02 0.00 Salinity -0.12 0.03 0.00 Dissolved Oxygen -0.05 0.08 0.54 Phosphate Phosphorus 0.38 0.14 0.01 Nitrous Acid Nitrogen -0.07 0.12 0.58 Nitric Acid Nitrogen -0.06 0.02 0.00 Silicic Acid Silicon -0.02 0.01 0.16 Table 5. Coefficients of logistic regression model for algal blooms prediction Input variables Coefficient Std. error P-value Constant -5.89 1.21 0.00 Temperature 0.34 0.01 0.00 Salinity -0.12 0.03 0.00 Phosphate Phosphorus 0.35 0.14 0.01 Predicting the occurrence of algal blooms from (11) gives 91.84% accuracy. Accuracy alone may not be sufficient to assess the predicted performance of algal blooms. We utilize the confusion matrix since we do not know false negatives or false positives of algal blooms. A confusion matrix for algal blooms shown in Table 6 is obtained from algal blooms dataset. Table 6. Confusion matrix for algal blooms Confusion matrix (Error matrix) Actual values of algal blooms P(Occurrence) N(Not occurrence) Predicted values of algal blooms Y (0.5 or higher) True Positive tp=222 False Positive fp=205 N (less than 0.5 ) False Negative fn=599 True Negative tn=8828
  • 6.  ISSN: 2088-8708 Int J Elec & Comp Eng, Vol. 11, No. 3, June 2021 : 2407 - 2413 2412 Table 7 shows the sensitivity, specificity, and precision are obtained based on the decision boundary 0.5 using the values of the confusion matrix in Table 6. The prediction rate of false negative of algal blooms is low as the sensitivity is 27.04%, and the prediction rate of false positive is also low because the precision is 51.99%. Since the sensitivity and precision are low in case of the decision boundary is 0.5, we apply proposed the decision boundary (12) in order to solve these problems, and results are as shown in Table 8. When the decision boundary proposed in this paper is applied, the decision boundary becomes 𝛥 = |0.5 ± 0.25|. When this is used as a decision boundary, TP=494, TN=9026, FN=327, FP=7, the sensitivity is 60.17%, and the precision is 98.6% as shown in Table 8. Table 7. Resulting confusion matrix based on decision boundary 0.5 (unit: %) TPR TNR PPV FPR ACC F1 Sensitivity Specificity Precision Fallout Accuracy F1 Score 27.04 97.73 51.99 2.27 91.84 35.58 Table 8. Resulting confusion matrix based on proposed decision boundary 𝛥 (unit: %) TPR TNR PPV FPR ACC F1 Sensitivity Specificity Precision Fallout Accuracy F1 Score 60.17 99.92 98.6 0.08 96.61 74.74 5. CONCLUSION In this paper, logistic regression and confusion matrix were used to predict the occurrence of algal blooms. Algal blooms datasets were collected and refined for experimental analysis of algal blooms prediction. Logistic regression analysis was performed on refined algal blooms dataset and main marine environmental factors affecting algal blooms were found through statistical test and regularization processes. Logistic regression was performed using the marine environmental factors that were influential on algal blooms and the accuracy of algal bloom occurrence was obtained. The values of the confustion matrix were obtained using the dataset for algal blooms prediction and the predicted values obtained from logistic regression. Although the sensitivity and precision for the occurrence of algal blooms can be obtained from the values of the confusion matrix, the sensitivity and precision were low when the existing decision boundary was 0.5. Sensitivity and precision were improved by using the decision boundary proposed in this study. In this paper, the algal blooms prediction model was established by the ensemble method using logistic regression analysis and confusion matrix. Also, the accuracy, sensitivity, and precision for algal blooms prediction were improved, and these were verified through big data analysis. REFERENCES [1] Hosmer Jr., et al., “Applied logistic regression,” John Wiley & Sons, vol. 398, 2013. [2] M. Chang, et al., “Selection of Transformations of Continuous Predictors in Logistic Regression,” Information Technology-New Generations, pp. 443-447, 2018. [3] J. W. Osborne, “Simple Linear Models with Categorical Dependent Variables: Binary Logistic Regression,” SAGE, pp. 97-132, 2017. [4] J. Tolles and J. M. William, “Logistic regression: relating patient characteristics to outcomes,” Jama, vol. 316, no. 5, pp. 533-534, 2016. [5] A. D. Caigny, et al., “A new hybrid classification algorithm for customer churn prediction based on logistic regression and decision trees,” European Journal of Operational Research, vol. 269, no. 2, pp. 760-772, 2018. [6] X. Wan, “The Influence of Polynomial Order in Logistic Regression on Decision Boundary,” IOP Conference Series: Earth and Environmental Science, vol. 267, no. 4, pp. 1-4, 2019. [7] S. Ghazaal and A. Hakan, “Using the Distance in Logistic Regression Models for Predictor Ranking in Diabetes Detection,” International Conference on Medical and Biological Engineering, 2019, pp. 665-670. [8] J. Friedman, et al., “Additive logistic regression: A statistical view of boosting,” Annals of statistics, vol. 28, no. 2, pp. 337-374, 2000. [9] H. M. Ramos, et al., “A new explanatory index for evaluating the binary logistic regression based on the sensitivity of the estimated model,” Statistics and Probability Letters, vol. 120, pp. 135-140, 2017. [10] M. Ohsaki, et al., “Confusion-matrix-based kernel logistic regression for imbalanced data classification,” IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 9, pp. 1806-1819, 2017. [11] J. A. McGowan, et al., “Predicting coastal algal blooms in southern California,” Ecology, vol. 98, no. 5, pp. 1419-1433, 2017. [12] N. F. Manning, et al., “Extending the forecast model: Predicting Western Lake Erie harmful algal blooms at multiple spatial scales,” Journal of Great Lakes Research, vol. 45, no. 3 pp. 587-595, 2019.
  • 7. Int J Elec & Comp Eng ISSN: 2088-8708  Prediction model of algal blooms using logistic regression and confusion matrix (Hongwon Yun) 2413 [13] N. Mellios, et al., “Machine Learning Approaches for Predicting Health Risk of Cyanobacterial Blooms in Northern European Lakes,” Water, vol. 12, no. 4, p. 1191, 2020. [14] L. Wang, et al., “An approach of improved Multivariate Timing-Random Deep Belief Net modelling for algal bloom prediction,” Biosystems engineering, vol. 177, pp. 130-138, 2019. [15] Ghatkar, et al., “Classification of algal bloom species from remote sensing data using an extreme gradient boosted decision tree model,” International Journal of Remote Sensing, vol. 40, no. 24, pp. 9412-9438, 2019. [16] S. Lee and D. Lee, “Improved prediction of harmful algal blooms in four Major South Korea’s Rivers using deep learning models,” International journal of environmental research and public health, vol. 15, no. 7, p. 1322, 2018. [17] X. Sun, et al., “A Bayesian structural model for predicting algal blooms,” Journal of Forecasting, vol. 38, no. 8, pp. 788-802, 2019. [18] F. Thabtah et al., “A machine learning autism classification based on logistic regression analysis,” Health information science and systems, vol. 7, no. 1, p. 12, 2019. [19] D. Menezes, et al., “Data classification with binary response through the Boosting algorithm and logistic regression,” Expert Systems with Applications, vol. 69, pp. 62-73, 2017. [20] J. M. Hilbe, “Practical guide to logistic regression,” CRC Press, 2016. [21] C. Fernández and F. Provost, “Causal Classification: Treatment Effect vs. Outcome Prediction,” Outcome Prediction, 2019. [22] K. Lee, et al., “Unbalanced data, type II error, and nonlinearity in predicting M&A failure,” Journal of Business Research, vol. 109, pp. 271-287, 2010. [23] S. V. Stehman, “Selecting and interpreting measures of thematic classification accuracy,” Remote Sensing of Environment, vol. 62, no. 1, pp. 77-89, 1997. [24] D. Powers, “Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation,” Journal of Machine Learning Technologies, vol. 2, no. 1, pp. 37-63, 2011. [25] J. S. Cramer, “The origins and development of the logit model,” Cambridge UP, pp. 1-19, 2003. [26] I. M. Myung, “Tutorial on Maximum Likelihood Estimation,” Journal of Mathematical Psychology, vol. 47, no. 1, pp. 90-100, 2003. [27] F. Santosa and W. Symes, “Linear inversion of band-limited reflection seismograms,” SIAM Journal on Scientific and Statistical Computing, SIAM, vol. 7, no. 4, pp. 1307-1330, 1986. BIOGRAPHY OF AUTHOR Hongwon Yun is a Professor with the Department of Computer Software Engineering at Silla University, Busan, South Korea. He received his B.S. and the Ph.D. degrees at the Department of Computer Science from Pusan National University, South Korea. His research interests include database system, big data analysis, and machine learning.