Signal & Image Processing: An International Journal (SIPIJ) Vol.16, No.6, December 2024
DOI: 10.5121/sipij.2024.15602
RANDOM FOREST APPLICATION FOR
CROP YIELD PREDICTION
Abbas Maazallahi 1, Sreehari Thota 1, Naga Prasad Kondaboina 1, Vineetha Muktineni 1,
Deepthi Annem 1, Abhi Stephen Rokkam 1, Mohammad Hossein Amini 2, Mohammad Amir Salari 1,
Payam Norouzzadeh 1, Eli Snir 2, and Bahareh Rahmani 1
1 Saint Louis University, Computer Science, Saint Louis, USA
2 Washington University in Saint Louis, Business School, Saint Louis, USA
ABSTRACT
This study analyzes crop yield prediction in India from 1997 to 2020, focusing on various crops and key
factors including crop type, crop year, cropping season, state, cultivated area, production quantity,
annual rainfall, and fertilizer and pesticide usage. We applied machine learning techniques including
Logistic Regression, Decision Tree, KNN, Naïve Bayes, K-Means Clustering, and Random Forest to predict
agricultural yields. The main goal of this study is to identify the best model for predicting crop
yields. In our experiments, Random Forest achieves the highest accuracy, and Naïve Bayes shows
comparatively high precision, which reflects the quality of the positive predictions made by that model.
If growers know in advance that their yield is likely to decrease next year, they can take steps to
increase it.
KEYWORDS
Crop Yield Prediction, Machine Learning Algorithms, Random Forest, Agricultural Data Analysis,
Precision Agriculture
1. INTRODUCTION
Machine learning has significantly influenced agricultural practices, particularly in crop yield
prediction.
In this study, various machine learning techniques have been employed to enhance the accuracy
and efficiency of forecasts. These techniques offer valuable insights into the complex nature of
agricultural data and the prediction of crop yields.
Logistic Regression models the probabilistic relationships between variables. The
simplicity and interpretability of Logistic Regression make it a popular choice in many fields,
including agriculture [1]. The Decision Tree is a robust method in the classification and
regression toolkit. It supports clear decision-making by splitting data into branches based on
variable values [2].
K-Nearest Neighbors (KNN), a non-parametric method, identifies the similarities between new
and existing data points, making it suitable for classification and regression problems. Random
Forest (RF) is a popular ensemble machine learning algorithm that combines the output of several
decision trees to classify and predict outcomes. As the field continues to evolve,
exploring and implementing these techniques remains critical in addressing the complexities of
crop yield prediction.
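To make the KNN idea above concrete, the following is a minimal sketch of classification by majority vote among the nearest neighbors. The helper name `knn_predict`, the two features, and all data values are hypothetical illustrations and are not taken from the study's code or dataset.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbors and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical normalized (rainfall, fertilizer) pairs with yield labels.
points = [(0.10, 0.20), (0.15, 0.25), (0.20, 0.10),
          (0.80, 0.90), (0.85, 0.80), (0.90, 0.95)]
labels = ["Low", "Low", "Low", "High", "High", "High"]

prediction = knn_predict(points, labels, query=(0.75, 0.85), k=3)
```

The query point lies close to the three 'High' examples, so those neighbors decide the vote.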
2. LITERATURE REVIEW
A study featured in Nature's Scientific Reports presents an "Interaction Regression Model for
Crop Yield Prediction." This research selects robust features and interactions to predict crop
yields, utilizing an elastic net regularization model. This model is instrumental in identifying
high-quality features across various environmental and management categories, reducing
the risk of overfitting and increasing the robustness of predictions across different geographic
locations and timeframes [2].
"The Elements of Statistical Learning," reviewed by Ziegel in Technometrics (2003), remains a
foundational text in the field of machine learning and statistical modeling. The book
provides an extensive overview of various statistical learning techniques, including regression,
classification, and ensemble methods. It is particularly relevant to the domain of crop
yield prediction, as it lays the theoretical groundwork for many of the advanced ML algorithms
employed in agricultural research today. The comprehensive nature of this text makes it an
essential reference for understanding the underlying principles and applications of statistical
learning in diverse fields, including agriculture [3].
Another research paper, "Using Machine Learning for Crop Yield Prediction in the Past or the
Future," published in Frontiers, offers a unique perspective by simulating sunflower and wheat
yields over a twenty-year period from 2000 to 2020. This research emphasizes the significance of
continuous nutrient and water balance in the simulation process and explores the impact of
changes in cultivars and planting densities on crop yields. The detailed simulation models
provide valuable insights into long-term yield prediction and resource management, marking a
significant advancement in the field [4].
The study "Analysis of Crop Yield Prediction using Machine Learning Algorithms" in IEEE
Xplore reports on the uncertainties of weather and their impact on farming. The paper evaluates the
efficacy of machine learning algorithms—K-Nearest Neighbors (KNN), Random Forest, and
Linear Regression—using parameters like state, crop, temperature, and rainfall to predict crop
yields. The results showcase a remarkable 97% accuracy for KNN, outshining the Random
Forest's 75% and Linear Regression's 54%, highlighting the promise of KNN in predictive
agriculture and offering a data-driven example for enhancing agricultural productivity [5].
The article "A Machine Learning Approach to Predict Crop Yield and Success Rate" from IEEE
Xplore details a study within India's agricultural sector. Focusing on improving
farmers' decision-making by predicting crop yields, this research employs neural network
regression modeling with an extensive dataset drawn from government sources. The researchers
reported a 45% accuracy using the RMSprop optimizer, which improved substantially to 90% after
refining the network architecture and shifting to the Adam optimizer. The model applies a
3-layer neural network with the Rectified Linear Unit (ReLU) activation function and uses
both forward and backward propagation to establish a robust model for crop yield
prediction [6].
Moreover, "Utilizing Naïve Bayes Algorithm for Crop Yield Prediction" explores the application
of the Naïve Bayes algorithm in predicting crop yields based on various agricultural parameters,
including weather information, soil characteristics, and crop management practices. This study
shows that Naïve Bayes can accurately predict crop yields across different regions and crop
varieties, highlighting its potential as a valuable tool for agricultural decision-making [7].
Another study, "Enhancing Crop Yield Prediction through Random Forest Algorithm,"
investigates the use of the Random Forest algorithm to improve the accuracy of crop yield prediction. By
constructing an ensemble of decision trees and aggregating their predictions, Random Forest
leverages the strength of multiple models to capture complex nonlinear relationships between
predictor variables and crop yields. This research demonstrates the superior performance of
Random Forest over traditional regression models, making it an asset for precision agriculture
[8].
Zhang et al. (2023) investigate the use of the Random Forest algorithm to improve crop yield
prediction accuracy. By constructing an ensemble of decision trees and aggregating their
predictions, Random Forest captures complex nonlinear relationships between predictor variables
and crop yields. This research illustrates the superior performance of Random Forest over
traditional regression models, making it an asset for precision agriculture and informed decision-
making in farming practices [9].
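The aggregation step described in the two paragraphs above can be sketched as a majority vote over the predictions of several trees. This sketch covers only the voting step: the three "trees" are hand-written, hypothetical rules standing in for trained decision trees, whereas a real Random Forest also fits each tree on a bootstrap sample with random feature subsets. Feature names and thresholds are assumptions for illustration.

```python
from collections import Counter

# Three hand-written stand-ins for trained decision trees (hypothetical rules).
def tree_a(sample):
    return "High" if sample["fertilizer"] > 0.5 else "Low"

def tree_b(sample):
    return "High" if sample["rainfall"] > 0.6 else "Low"

def tree_c(sample):
    return "High" if sample["area"] > 0.4 and sample["fertilizer"] > 0.3 else "Low"

def forest_predict(trees, sample):
    """Return the class predicted by the largest number of trees."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]

# Hypothetical normalized feature values for one field.
sample = {"area": 0.7, "rainfall": 0.3, "fertilizer": 0.8}
prediction = forest_predict([tree_a, tree_b, tree_c], sample)
```

Here two of the three stand-in trees vote 'High', so the ensemble outvotes the single dissenting tree.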
Dhaliwal and Williams (2024) provide an insightful exploration into the prediction of sweet corn
yield using machine learning models and field-level data. Their study, published in Precision
Agriculture, utilizes a combination of ML algorithms to enhance yield prediction accuracy. By
integrating extensive field-level data, including soil properties, weather conditions, and crop
management practices, the researchers demonstrate a robust framework for predicting sweet corn
yields. Their findings emphasize the importance of high-resolution field data in improving the
predictive performance of ML models in agriculture [10].
Rashid et al. (2021) offer a comprehensive review of crop yield prediction using machine
learning approaches, with a particular emphasis on palm oil yield prediction. Published in IEEE
Access, this review synthesizes existing research and methodologies, providing a detailed
analysis of various ML techniques applied to crop yield prediction. The authors discuss the
challenges and advantages of different ML models, highlighting how advanced algorithms like
deep learning and ensemble methods have been successfully employed to predict yields in
complex agricultural systems. This review serves as a valuable resource for researchers and
practitioners aiming to leverage ML for enhanced agricultural productivity [11].
Hussain, Sarfraz, and Javed (2021) conducted a systematic review on crop-yield prediction
through Unmanned Aerial Vehicles (UAVs) presented at the 16th International Conference on
Emerging Technologies (ICET 2021). The study highlights the prevalent use of Random Forest
(RF), Support Vector Machine (SVM), and Convolutional Neural Networks (CNN) in crop yield
prediction. The review underscores the significance of these algorithms and their adoption in
developing countries, reflecting the growing reliance on UAVs for data collection and analysis in
agriculture [12].
Sarvaiya, Chaudhari, and Verma (2022) discussed the challenges of crop yield prediction and
crop selection based on climatic sensor data and historical yield data in their book section,
"Monitoring Agricultural Essentials," from the "Application of Machine Learning in
Agriculture." The authors emphasize the importance of machine learning in addressing major
agricultural problems and improving crop yield predictions by leveraging climatic and past data
[13].
Van Wart et al. (2015) explored the creation of long-term weather data for crop simulation
modeling in their article published in Agricultural and Forest Meteorology. This study highlights
the necessity of high-quality daily weather data, such as uncorrected gridded solar radiation, for
accurate crop yield simulation and variability prediction. The authors demonstrate how
propagating long-term weather data significantly enhances the reliability of crop simulation
models [14].
Mahmood (1998) conducted a comparative study on air temperature variations and rice
productivity in Bangladesh, published in Ecological Modelling. The study compares the
performance of the YIELD and CERES-rice models, finding that boro rice productivity
predictions at Mymensingh are higher using the YIELD model. This research underscores the
critical role of accurate temperature data in predicting crop productivity [15].
Venkatesh and Saravanan (2022) investigated the prediction of crop yield using Simple Linear
Regression (SLR) and Polynomial Regression (PR) in their study presented at the 3rd
International Conference on Smart Electronics and Communication (ICOSEC 2022). Their
findings suggest that SLR significantly outperforms PR in predicting crop yields, indicating the
effectiveness of simpler models for specific types of agricultural data [16]. In related work, we
applied machine learning methods to predict weather patterns [17] and customer churn [18].
These studies underscore the dynamic and evolving nature of crop yield prediction research.
They not only highlight the potential of machine learning in agriculture but also set a foundation
for future studies. Our research aims to build upon these methodologies, introducing novel
approaches to further enhance the precision and applicability of crop yield predictions.
3. DATA DESCRIPTION
The dataset used in this study, available at Kaggle
(https://www.kaggle.com/datasets/akshatgupta7/crop-yield-in-indian-states-dataset), includes
extensive agricultural data from India from 1997 to 2020. It covers a wide range of crops grown
across different Indian states. Data includes crop types and years, cropping seasons, specific
details for each state, areas of cultivation, production quantities, annual rainfall, and the usage of
fertilizers and pesticides. The data features are:
Crop: This field identifies the crop type. The dataset includes a diverse array of 55 crops,
reflecting India's rich agricultural variety, including rice, maize, onion, potato, coconut, and
banana.
Crop Year: The dataset covers crop years from 1997 to 2020, providing a comprehensive
temporal view of agricultural trends over 24 years.
Season: The data categorizes cultivation into 4 distinct seasons, including the major seasons Autumn
and Spring, to analyze the seasonal impacts on agriculture.
State: Includes data from 30 Indian states, which offers a wide geographical perspective for
identifying regional agricultural patterns.
Area: Represents the land area under cultivation in hectares. The mean area is approximately
179,926 hectares, ranging from a minimal 0.5 hectares to a vast 50.8 million hectares, indicating
the varied scale of farming practices across regions.
Production: The quantity of crop production, measured in metric tons, shows an average of
around 16.4 million tons. It varies greatly, with a maximum recorded production of about 6.3
billion tons.
Annual Rainfall: This feature, measured in millimeters, indicates the climatic conditions
affecting crop growth. The average annual rainfall is about 1,438 mm, ranging from 301.3 mm to
roughly 6,000 mm.
Fertilizer: The total amount of fertilizer used, in kilograms, with an average of around 24.1
million kg. It shows a diverse nutrient management strategy across different crops and regions.
Pesticide: This field presents the total pesticide usage in kilograms. On average, around 48,848
kg of pesticides are used, with a maximum of 15.75 million kg.
Yield: This attribute indicates production per unit area, with an average of approximately 79.95
and an extremely varied range, peaking at 21,105. This metric evaluates the efficiency of
agricultural practices.
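As a small illustration of the Yield definition above (production per unit area), the sketch below divides production by cultivated area. The function name `crop_yield` and the two records are hypothetical and are not rows from the dataset.

```python
def crop_yield(production_tons, area_hectares):
    """Return production per unit area (metric tons per hectare)."""
    if area_hectares <= 0:
        raise ValueError("area must be positive")
    return production_tons / area_hectares

# Hypothetical records, not taken from the Kaggle dataset.
records = [
    {"crop": "Rice", "production": 4500.0, "area": 1500.0},
    {"crop": "Maize", "production": 800.0, "area": 400.0},
]
yields = {r["crop"]: crop_yield(r["production"], r["area"]) for r in records}
```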
4. STATISTICAL ANALYSIS
Figure 1. Scatter Plot of Key Features with Crop Categories
These statistical insights provide a deeper understanding of the dataset, highlighting the complexity
and diversity of agricultural practices in India.
The scatter plot matrix shows the relationships between key features of the agricultural dataset,
with crops differentiated by color. Each subplot in the matrix compares two features, such as area vs
production or annual rainfall vs fertilizer (Figure 1). From the scatter plot, we can
observe the following:
Area vs Production: There is a positive correlation between the area of cultivation and the
production for most crops, which is expected as larger cultivation areas generally lead to higher
production volumes.
Annual Rainfall vs Production: The relationship between annual rainfall and production varies
among crops, suggesting that some crops may be more sensitive to rainfall than others.
Fertilizer vs Production: There seems to be a positive correlation for some crops, indicating that
increased fertilizer usage may be associated with higher production. However, this relationship does
not hold uniformly across all crop types.
Pesticide vs Production: Pesticide usage does not show a clear correlation with production in this
visualization, which suggests that the effectiveness or necessity of pesticides may vary depending on
the crop.
Yield: The yield scatter plots across different features show varied patterns for different crops,
indicating that yield is influenced by a complex interplay of factors, not just a single feature.
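The visual correlations noted above can also be checked numerically. The following is a minimal sketch that computes the Pearson correlation coefficient between two features. The paper reports these relationships only from the scatter plots, so the choice of Pearson's coefficient and the sample values below are assumptions for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical area (hectares) and production (tons) for one crop.
area = [100.0, 200.0, 300.0, 400.0, 500.0]
production = [220.0, 410.0, 590.0, 830.0, 990.0]
r = pearson(area, production)
```

A value of `r` close to 1 indicates a strong positive linear relationship; computing it separately per crop would show which crops depart from that pattern.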
Figure 2. Aggregated Pesticide Usage by Crop and Year
Each crop type, represented by a unique color, exhibits its own pattern of distribution and
correlation across the different features, which can inform targeted agricultural practices and
policies. The data points for crops like coconut are notably distinct due to high-volume output,
which skews the distribution.
The bar chart offers a view of the trends in pesticide use across different
crops over the years (Figure 2). It shows how pesticide usage has varied over time for each crop
type. This chart helps identify patterns and potential correlations between pesticide
use and other factors like crop yield, cultivation practices, or environmental changes. It
serves as a tool for understanding the dynamics of pesticide management in agriculture,
aiding in developing more sustainable and efficient farming practices.
4.1. Features Distribution
The feature distribution plots for area, production, annual rainfall, fertilizer, and pesticide from
the agricultural dataset provide a visual summary of the underlying data characteristics and
variability (Figure 3). The histograms reveal the frequency distribution of values for each feature.
The Area histogram shows a concentration of values in smaller land areas, suggesting that most
of the crop cultivation occurs on relatively small plots. The Production histogram is
right-skewed, with a few instances of very high values, indicative of a small number of highly
productive operations. Annual Rainfall appears more consistently distributed, suggesting a level
of predictability in this environmental factor. Fertilizer and Pesticide usage are both right-skewed,
indicating that lower usage rates are more common across the dataset.
Figure 3. Histogram, Density and Box plot of selected features
Density plots provide a smoothed representation of the data distribution, revealing the probability
density of the different values. These plots show the likelihood of specific values occurring
within the dataset and highlight the central tendencies and the spread of data more clearly than
histograms. For area, production, fertilizer, and pesticide, the peaks of the density plots suggest
the most common values and verify the skewness seen in the histograms.
Box plots offer a summary of the data’s statistical distribution, including the median, quartiles,
and outliers. The box plots for area and annual rainfall do not show significant
outliers, which indicates a more homogeneous distribution within the interquartile range,
whereas the production, fertilizer, and pesticide box plots display several upper-end outliers. These
outliers represent values that are exceptionally higher than the typical range of the data and may
correspond to instances of intensive farming practices or atypical environmental conditions.
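The outliers that a box plot draws beyond its whiskers can be reproduced with a simple interquartile-range rule. The sketch below assumes the common 1.5 x IQR convention; the paper does not state which rule its plots use, and the fertilizer values are hypothetical.

```python
import statistics

def iqr_outliers(values):
    """Return values lying more than 1.5 * IQR outside the quartiles."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical fertilizer usage with one exceptionally high value.
fertilizer = [10, 12, 14, 15, 16, 18, 20, 22, 25, 400]
outliers = iqr_outliers(fertilizer)
```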
5. NORMALIZATION AND LABELING OF CROP YIELD DATA (TARGET
VARIABLE)
Normalization per crop is an essential preprocessing step in agricultural data analysis. This process
involves scaling the yield data for each crop type within a specified range (commonly 0 to 1) to
ensure a uniform scale across various crops. The primary reasons for this normalization include:
Comparability: Different crops may have inherently different yield scales due to varying
biological and cultivation factors. Normalization allows for a fair comparison of yields across
diverse crop types on a common scale.
Outlier Mitigation: Some crops might have extreme yield values (either high or low) that can
skew the overall analysis. Normalization helps in mitigating the impact of such outliers.
Uniformity in Analysis: It ensures that the yield data across all crops are treated uniformly,
making the subsequent analysis more robust and less biased towards crops with larger or smaller
yield values.
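Per-crop scaling to the range 0 to 1 can be sketched with min-max normalization applied within each crop group. Min-max scaling is an assumption here: the text states the target range but not the exact formula. The field names and yield values are hypothetical.

```python
from collections import defaultdict

def normalize_per_crop(records):
    """Return copies of the records with 'yield_norm' scaled to [0, 1] within each crop."""
    # Collect the yields of each crop to find its own minimum and maximum.
    by_crop = defaultdict(list)
    for rec in records:
        by_crop[rec["crop"]].append(rec["yield"])
    bounds = {crop: (min(vals), max(vals)) for crop, vals in by_crop.items()}

    normalized = []
    for rec in records:
        low, high = bounds[rec["crop"]]
        span = high - low
        # A crop whose yields are all identical has no spread; map it to 0.0.
        value = (rec["yield"] - low) / span if span > 0 else 0.0
        normalized.append({**rec, "yield_norm": value})
    return normalized

# Hypothetical yields on very different scales for two crops.
records = [
    {"crop": "Rice", "yield": 2.0},
    {"crop": "Rice", "yield": 3.0},
    {"crop": "Rice", "yield": 4.0},
    {"crop": "Coconut", "yield": 5000.0},
    {"crop": "Coconut", "yield": 9000.0},
    {"crop": "Coconut", "yield": 13000.0},
]
normalized = normalize_per_crop(records)
```

After scaling, both crops span 0 to 1 even though their raw yields differ by several orders of magnitude, which is what makes the values comparable across crops.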
5.1. Purpose of Labeling into Four Classes
Labeling the normalized yield data into four distinct classes is a method of discretization that
simplifies complex continuous data into categorical segments. This is beneficial for several
reasons:
Simplification of Data: It simplifies the continuous range of yield values into distinct categories,
making it easier to analyze and understand patterns within the data.
Facilitates Classification Analysis: By converting yields into classes, the data is prepared for
classification algorithms in machine learning, predictive modeling or trend analysis.
Enhanced Interpretability: Labeling yields into categories like 'Low', 'Medium', 'High', and
'Very High' provides a more intuitive understanding of the yield performance for each crop.
Labeling is often done using quartiles, dividing the data into four equal parts based on their
distribution. This method ensures that each class has an equal number of data points, providing a
balanced categorization of the yield data.
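Quartile-based labeling as described above can be sketched as follows. The class names follow the text; the helper name `label_by_quartile` and the sample values are hypothetical, and the cut points come from Python's default quantile method, which may differ slightly from the one used in the study.

```python
import bisect
import statistics
from collections import Counter

CLASS_NAMES = ["Low", "Medium", "High", "Very High"]

def label_by_quartile(values):
    """Assign each value one of four classes using the sample quartiles as cut points."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    cuts = statistics.quantiles(values, n=4)
    # bisect_left counts how many cut points lie strictly below the value (0 to 3).
    return [CLASS_NAMES[bisect.bisect_left(cuts, v)] for v in values]

# Hypothetical normalized yields for one crop.
normalized_yields = [0.05, 0.10, 0.20, 0.35, 0.50, 0.65, 0.80, 0.95]
labels = label_by_quartile(normalized_yields)
class_sizes = Counter(labels)
```

With eight distinct values, each class receives two samples, which illustrates the balanced categorization that quartile cut points produce.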
Figure 4. Yield Distribution per Crop, Before and After Normalization
Figure 4 illustrates the impact of normalization on yield data across various crops. On the left, we
see the yield distribution before normalization, where each crop's yield values span a wide and
disparate range, making it difficult to compare between crops. Outliers and variances are
prominent, and the scales are imbalanced, with some crops showing yields reaching 20,000 units.
On the right, after normalization, all yields are scaled between 0 and 1. This transformation
standardizes the data, bringing all crops onto an even playing field and highlighting the relative
distribution within each crop type without being overshadowed by the absolute yield values. This
normalized view allows for more straightforward comparisons across different crops and a
clearer interpretation of yield performance relative to each crop's potential.
Figure 5. Normalized Yield Distribution per Crop
Figure 5 presents a boxplot illustrating the normalized yield distribution for a variety of crops,
with yield data segmented into two distinct classes: Low and High. Each crop type is represented
by a series of boxplots along the horizontal axis, with the normalized yield values plotted on the
vertical axis ranging from 0 to 1. The Low-yield class is depicted in blue, and the High-yield
class in red, allowing for a clear visual distinction between the two categories. For each class, the
box plots show the median yield value (the line within the box), the interquartile range (the box
itself), and potential outliers (the individual points beyond the whiskers). This graph effectively
communicates the variability in yield within each crop type, as well as between the two yield
classes, providing insights into the distribution patterns of agricultural productivity across
different crops.
6. METHODOLOGY
This section evaluates and compares the performance of various machine learning classifiers on the
crop yield dataset. The dataset, preprocessed with feature normalization, includes key agricultural
indicators such as area, production, annual rainfall, fertilizer, and pesticide usage, all normalized
to ensure uniformity and comparability across different scales. The target variable,
'Yield_Class_Int', represents yield categories encoded as integers, facilitating a multi-class
classification approach.
The selected classifiers include a diverse array of algorithms: Logistic Regression, Decision Tree,
Random Forest, Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Naive Bayes,
and Gradient Boosting. These methods cover a spectrum from simple linear models to more
complex ensemble methods, each with its strengths in handling different types of data
distributions and relationships. The dataset is split into training and testing sets, with 80% of the
data used for training and the remaining 20% for testing to ensure a robust evaluation framework.
Figure 6 shows one tree of Random Forest, K-Nearest Neighbor (K=3), and Naïve Bayes
Classifiers.
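The 80/20 split described above can be sketched as a shuffled partition of the rows. The helper `split_train_test`, the fixed seed, and the placeholder rows are hypothetical; the paper does not state whether the split was shuffled or stratified.

```python
import random

def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split it into (train, test) lists."""
    shuffled = rows[:]
    # A seeded generator makes the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder rows standing in for dataset records.
rows = list(range(100))
train_rows, test_rows = split_train_test(rows)
```

Every row lands in exactly one of the two sets, so the test set stays unseen during training.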
Figure 6. left to right: One tree of Random Forest, K-Nearest Neighbor (K=3), Naïve Bayes
Classifiers
Each model is trained on the training set and then evaluated on the test set. Performance metrics
such as accuracy, precision, recall, f1-score, and the confusion matrix are computed for each
model. These metrics provide a multi-dimensional view of the models' performance, with
accuracy indicating the overall correctness, precision and recall offering insights into the models'
ability to identify each class correctly, and the f1-score presenting a balance between precision
and recall. The confusion matrix further elucidates the specific areas of strength and weakness for
each classifier, by showing the distribution of predictions across actual classes. This rigorous
assessment allows for a detailed comparison of the models, highlighting their efficacy and fitness
for the crop yield classification task.
7. RESULTS
Figure 7 compares various machine learning models used for the classification task. The metrics
include Accuracy, Precision, Recall, and F1 Score.
Accuracy reflects the overall rate of correctly predicted class labels. Precision indicates the
proportion of true positives among all positive predictions. Precision is a key measure when the
cost of a false positive is high. Recall measures the proportion of actual positives that were
identified correctly, which is particularly important when missing a positive is costly. The F1
Score is the harmonic mean of precision and recall, providing a single metric that balances
false positives and false negatives.
A model with high precision but lower recall might be conservative in its positive predictions but
miss several actual positives. In contrast, a model with high recall but lower precision
might capture most of the positives but at the cost of increased false positives. The F1 Score
helps to balance these aspects and is often a crucial metric when choosing the best model for
deployment.
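The trade-off described above can be made concrete by computing precision, recall, and F1 from raw counts. This is a minimal binary sketch with hypothetical counts; how the per-class scores are averaged for the multi-class values reported in this paper is not stated and is not reproduced here.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts: a cautious model and a permissive model.
conservative = precision_recall_f1(tp=30, fp=3, fn=30)
liberal = precision_recall_f1(tp=55, fp=45, fn=5)
```

The cautious model has the higher precision and the permissive model the higher recall, while their F1 scores summarize each trade-off in a single number.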
In the confusion matrix shown in Figure 8, the horizontal axis represents the predicted
classifications, while the vertical axis represents the actual classifications, each divided into
'Positive' and 'Negative' categories. The top left quadrant represents true positives (TP), where
the model correctly predicts the positive class. The bottom right quadrant represents true negatives
(TN), where the model correctly predicts the negative class. The top right quadrant shows false
negatives (FN), where the model incorrectly predicts the negative class, and the
bottom left quadrant shows false positives (FP), where the model incorrectly predicts the positive
class.
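The quadrant layout described above can be sketched by counting the four outcomes from paired actual and predicted labels. Treating 'High' as the positive class, along with the label lists themselves, is a hypothetical choice for illustration.

```python
def confusion_counts(actual, predicted, positive="High"):
    """Return [[TP, FN], [FP, TN]] with rows as actual and columns as predicted classes."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1  # actual positive, predicted positive
        elif a == positive:
            fn += 1  # actual positive, predicted negative
        elif p == positive:
            fp += 1  # actual negative, predicted positive
        else:
            tn += 1  # actual negative, predicted negative
    return [[tp, fn], [fp, tn]]

# Hypothetical test labels and model predictions.
actual = ["High", "High", "High", "Low", "Low", "Low", "Low", "High"]
predicted = ["High", "High", "Low", "Low", "Low", "High", "Low", "High"]

matrix = confusion_counts(actual, predicted)
accuracy = (matrix[0][0] + matrix[1][1]) / len(actual)
```

Accuracy is the share of samples on the main diagonal (TP and TN), which is why darker diagonal cells in the figure correspond to a more accurate model.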
Figure 7. Performance Metrics of Different Methods
Figure 8. Confusion Matrix
The intensity of the colors corresponds to the number of observations in each category, with
darker colors typically representing higher numbers. This visualization helps in quickly assessing
the model's performance, particularly in terms of its ability to distinguish between the classes. For
example, if the TP and TN quadrants are much darker than the FN and FP quadrants, this
indicates a high level of accuracy, as in Figure 8.
Our explorations into machine learning models for agricultural yield prediction have yielded
significant insights. Based on Figure 7 and Table 1, the Random Forest model demonstrated the
highest accuracy among the models tested, correctly classifying 73% of test samples when
considering crucial features such as area and production. This result
underscores the model's capability to handle the discrete nature of our data effectively.
Table 1: Performance Metrics of Different Models Based on Figure 7
Model                  Accuracy   Precision   Recall   F1 Score
Logistic Regression    0.52       0.58        0.55     0.50
Decision Tree          0.62       0.62        0.62     0.62
Random Forest          0.73       0.73        0.73     0.73
SVM                    0.51       0.58        0.51     0.48
KNN                    0.52       0.55        0.52     0.52
Naive Bayes            0.51       0.64        0.51     0.39
Gradient Boosting      0.61       0.60        0.61     0.60
8. DISCUSSION
Our project's findings indicate that Random Forest models surpass the other models in accuracy on
discrete data sets, a characteristic that is particularly relevant to our agricultural domain. With its
ensemble approach, the Random Forest model complements the probabilistic predictions of Naïve
Bayes, offering a robust alternative for yield classification.
Throughout this project, we have not only applied various machine learning techniques but also
honed our ability to discern the most appropriate methods for our dataset. The process has
enhanced our analytical skills, enabling us to create informative visualizations that succinctly
convey the efficacy of different machine learning strategies. In future work, we will apply
boosting methods to increase the accuracy of the predictor.
Compliance with Ethical Standards
Conflict of Interest Statement
The authors declare no conflict of interest. All authors have reviewed and agreed with
the manuscript. We state that the submission is an original paper and is not under review at any
other journal.
Research Involving Human Participants and/or Animals
There are no humans or animals participating in this project.
Consent to Participate
The authors consent to participate in this project and understand that the research may not have
direct benefit to us. Our participation is entirely voluntary. There is a right to withdraw from the
project at any time without any consequences.
Data Availability
The dataset used for this project is collected from Kaggle.
https://www.kaggle.com/datasets/akshatgupta7/crop-yield-in-indian-states-dataset
Funding
No funding was received for this project.
Ethical Approval
All subjects gave their informed consent for inclusion before they participated in the study.
Consent to Publish
We give our consent for the publication of the details within the manuscript, including figures and
tables, in Computational Brain & Behavior.
REFERENCES
[1] V. A. Barbur, D. C. Montgomery, and E. A. Peck, “Introduction to Linear Regression Analysis.,”
The Statistician, vol. 43, no. 2, 1994, doi: 10.2307/2348362.
[2] J. Ansarifar, L. Wang, and S. V. Archontoulis, “An interaction regression model for crop yield
prediction,” Sci Rep, vol. 11, no. 1, 2021, doi: 10.1038/s41598-021-97221-7.
[3] E. R. Ziegel, “The Elements of Statistical Learning,” Technometrics, vol. 45, no. 3, 2003, doi:
10.1198/tech.2003.s770.
[4] A. Morales and F. J. Villalobos, “Using machine learning for crop yield prediction in the past or the
future,” Front Plant Sci, vol. 14, 2023, doi: 10.3389/fpls.2023.1128388.
[5] V. Krishna, T. Reddy, S. Harsha, K. Ramar, S. Hariharan, and A. Bhanuprasad, “Analysis of Crop
Yield Prediction using Machine Learning algorithms,” in Proceedings - 2022 2nd International
Conference on Innovative Sustainable Computational Technologies, CISCT 2022, 2022. doi:
10.1109/CISCT55310.2022.10046581.
[6] S. S. Kale and P. S. Patil, “A Machine Learning Approach to Predict Crop Yield and Success Rate,”
in 2019 IEEE Pune Section International Conference, PuneCon 2019, 2019. doi:
10.1109/PuneCon46936.2019.9105741.
[7] M. Gupta, B. V. Santhosh Krishna, B. Kavyashree, H. R. Narapureddy, N. Surapaneni, and K.
Varma, “Various Crop Yield Prediction Techniques Using Machine Learning Algorithms,” in
Proceedings of the 2nd International Conference on Artificial Intelligence and Smart Energy, ICAIS
2022, 2022. doi: 10.1109/ICAIS53314.2022.9742903.
[8] M. Manafifard, “A new hyperparameter to random forest: application of remote sensing in yield
prediction,” Earth Sci Inform, vol. 17, no. 1, 2024, doi: 10.1007/s12145-023-01156-8.
[9] Q. Zhang et al., “Maize yield prediction using federated random forest,” Comput Electron Agric,
vol. 210, 2023, doi: 10.1016/j.compag.2023.107930.
[10] D. S. Dhaliwal and M. M. Williams, “Sweet corn yield prediction using machine learning models
and field-level data,” Precis Agric, vol. 25, no. 1, 2024, doi: 10.1007/s11119-023-10057-1.
[11] M. Rashid, B. S. Bari, Y. Yusup, M. A. Kamaruddin, and N. Khan, “A Comprehensive Review of
Crop Yield Prediction Using Machine Learning Approaches with Special Emphasis on Palm Oil
Yield Prediction,” IEEE Access, 2021. doi: 10.1109/ACCESS.2021.3075159.
[12] N. Hussain, S. Sarfraz, and S. Javed, “A Systematic Review on Crop-Yield Prediction through
Unmanned Aerial Vehicles,” in ICET 2021 - 16th International Conference on Emerging
Technologies 2021, Proceedings, 2021. doi: 10.1109/ICET54505.2021.9689838.
[13] J. P. Sarvaiya, A. P. Chaudhari, and J. P. Verma, “Monitoring agricultural essentials,” in
Application of Machine Learning in Agriculture, 2022. doi: 10.1016/B978-0-323-90550-3.00004-7.
[14] J. Van Wart, P. Grassini, H. Yang, L. Claessens, A. Jarvis, and K. G. Cassman, “Creating long-term
weather data from thin air for crop simulation modeling,” Agric For Meteorol, vol. 209–210,
2015, doi: 10.1016/j.agrformet.2015.02.020.
[15] R. Mahmood, “Air temperature variations and rice productivity in Bangladesh: A comparative study
of the performance of the YIELD and the CERES-rice models,” Ecol Modell, vol. 106, no. 2–3,
1998, doi: 10.1016/S0304-3800(97)00192-0.
[16] N. G. E. Hackland and R. I. Jones, “Voo spelling van seasonableproduce van ergotiscurvelet,”
Proceedings of the Annual Congresses of the Grassland Society of Southern Africa, vol. 15, no. 1,
1980, doi: 10.1080/00725560.1980.9648891.
[17] K. Miller, G. Yi, E. Snir, and B. Rahmani, “Precipitation analysis and forecasting weather of
Texas, United States,” International Journal of Information Technology, Springer Nature, vol. 15,
no. 2, pp. 549-556.
[18] Y. M. Meda, B. M. S. Bokka, H. Jamallamudi, P. Norouzzadeh, E. Snir, B. Rahmani, Customer
Churn Prediction with Machine Learning, A. Maazallahi, the proceedings of International Business
Analytics Conference (IBAC).

RANDOM FOREST APPLICATION FOR CROP YIELD PREDICTION

  • 1. Signal & Image Processing: An International Journal (SIPIJ) Vol.16, No.6, December 2024 DOI: 10.5121/sipij.2024.15602 23 RANDOM FOREST APPLICATION FOR CROP YIELD PREDICTION Abbas Maazallahi 1 , Sreehari Thota 1 , Naga Prasad Kondaboina 1 , Vineetha Muktineni 1 , Deepthi Annem 1 , Abhi Stephen Rokkam 1 , Mohammad Hossein Amini 2 , Mohammad Amir Salari 1 , Payam Norouzzadeh 1 , Eli Snir 2 , and Bahareh Rahmani 1 1 Saint Louis University, Computer Science, Saint Louis University, Saint Louis, USA 2 Washington University in Saint Louis, Business School, Saint Louis, USA ABSTRACT This study analyzes crop yield prediction in India from 1997 to 2020, focusing on various crops and key environmental factors including crop types and years, cropping seasons, specific details for each state, areas of cultivation, production quantities, annual rainfall, and the usage of fertilizers and pesticides. We applied advanced machine learning techniques like Logistic Regression, Decision Tree, KNN, Naïve Bayes, K-Mean Clustering, and Random Forest to predict agricultural yields. The main goal of this study is offering the best model to predict crop yields. Based on our study, Random Forest demonstrates almost high accuracy. Naïve Bayes shows high precision indicating the high quality of a positive prediction made by this model. In this study, we are discovering the best machine learning models to predict the crop yield. If people know that their yield will be decreased next year, they find a way increase the crop yields. KEYWORDS Crop Yield Prediction, Machine Learning Algorithms, Random Forest, Agricultural Data Analysis, Precision Agriculture 1. INTRODUCTION Machine learning has significantly influenced agricultural practices, particularly in crop yield prediction. In this study, various machine learning techniques have been employed to enhance the accuracy and efficiency of forecasts. 
These techniques offer valuable insights into the complex nature of agricultural data and the prediction of crop yields. Logistic Regression explores show the probabilistic relationships between variables. The simplicity and interpretability of Logistic Regression make it a popular choice in many fields, including agriculture [1].The Decision Tree emerges as a robust method in classification and regression toolkit. It aids clear decision-making by splitting data into branches based on variable values [2]. K-Nearest Neighbors (KNN) as a non-parametric method, identifies the similarities between new and existing data points, making it suitable for classification and regression problems. Random Forest (RF) is a popular ensemble machine learning algorithm to combine the output of several decision trees to classify and predict the future outcomes. As the field continues to evolve, exploring and implementing these techniques remain critical in addressing the complexities of
  • 2. Signal & Image Processing: An International Journal (SIPIJ) Vol.16, No.6, December 2024 24 crop yield prediction. 2. LITERATURE REVIEW A study featured in Nature's Scientific Reports presents an "Interaction Regression Model for Crop Yield Prediction." This research selects robust features and interactions to predict crop yields, utilizing an elastic net regularization model. This model is instrumental in identifying high-quality features across various environmental and management categories, due to reducing the risk of overfitting and increasing the robustness of predictions across different geographic locations and timeframes [2]. Ziegel's seminal work, "The Elements of Statistical Learning" (2003), remains a foundational text in the field of machine learning and statistical modeling. Published in Techno-metrics, this book provides an extensive overview of various statistical learning techniques, including regression, classification, and ensemble methods. Ziegel’s work is particularly relevant to the domain of crop yield prediction, as it lays the theoretical groundwork for many of the advanced ML algorithms employed in agricultural research today. The comprehensive nature of this text makes it an essential reference for understanding the underlying principles and applications of statistical learning in diverse fields, including agriculture [3]. Another research paper, "Using Machine Learning for Crop Yield Prediction in the Past or the Future," published in Frontiers, offers a unique perspective by simulating sunflower and wheat yields over a twenty-year period from 2000 to 2020. This research emphasizes the significance of continuous nutrient and water balance in the simulation process and explores the impact of changes in cultivars and planting densities on crop yields. The detailed simulation models provide valuable insights into long-term yield prediction and resource management, marking a significant advancement in the field [4]. 
The study "Analysis of Crop Yield Prediction using Machine Learning Algorithms" in IEEE Xplore reporst the uncertainties of weather and its impact on farming. The paper evaluates the efficacy of machine learning algorithms—K-Nearest Neighbors (KNN), Random Forest, and Linear Regression—using parameters like state, crop, temperature, and rainfall to predict crop yields. The results showcase a remarkable 97% accuracy for KNN, outshining the Random Forest's 75% and Linear Regression's 54%, highlighting the promise of KNN in predictive agriculture and offering a data-driven example for enhancing agricultural productivity [5]. The article "A Machine Learning Approach to Predict Crop Yield and Success Rate" from IEEE Xplore details an innovative study within India's agricultural sector.. Focusing on improving farmers' decision-making by predicting crop yields, this research employs neural network regression modeling with an extensive dataset drawn from government sources. The researchers reported a 45% accuracy using RMSprop optimizer, which was substantially improved to 90% by refining the network architecture and shifting to the Adam optimizer. The model applies a 3- Layer Neural Network with the Rectified Linear Activation Unit (ReLU) function, and leverages both backward and forward propagation techniques to establish a robust model for crop yield prediction [6]. Moreover, "Utilizing Naïve Bayes Algorithm for Crop Yield Prediction" explores the application of Naïve Bayes algorithm in predicting crop yields based on various agricultural parameters containing weather information, soil characteristics, and crop management practices. This study shows that Naïve Bayes in accurately predicting crop yields across different regions and crop varieties, highlighting its potential as a valuable tool for agricultural decision-making [7].
  • 3. Signal & Image Processing: An International Journal (SIPIJ) Vol.16, No.6, December 2024 25 Another study "Enhancing Crop Yield Prediction through Random Forest Algorithm" investigates the use of Random Forest algorithm to improve crop yield prediction’s accuracy. By constructing an ensemble of decision trees and aggregating their predictions, Random Forest leverages the strength of multiple models to capture complex nonlinear relationships between predictor variables and crop yields. This research demonstrates the superior performance of Random Forest over traditional regression models, making it an asset for precision agriculture [8]. Zhang et al. (2023) investigate the use of the Random Forest algorithm to improve crop yield prediction accuracy. By constructing an ensemble of decision trees and aggregating their predictions, Random Forest captures complex nonlinear relationships between predictor variables and crop yields. This research illustrates the superior performance of Random Forest over traditional regression models, making it an asset for precision agriculture and informed decision- making in farming practices [9]. Dhaliwal and Williams (2024) provide an insightful exploration into the prediction of sweet corn yield using machine learning models and field-level data. Their study, published in Precision Agriculture, utilizes a combination of ML algorithms to enhance yield prediction accuracy. By integrating extensive field-level data, including soil properties, weather conditions, and crop management practices, the researchers demonstrate a robust framework for predicting sweet corn yields. Their findings emphasize the importance of high-resolution field data in improving the predictive performance of ML models in agriculture [10]. Rashid et al. (2021) offer a comprehensive review of crop yield prediction using machine learning approaches, with a particular emphasis on palm oil yield prediction. 
Published in IEEE Access, this review synthesizes existing research and methodologies, providing a detailed analysis of various ML techniques applied to crop yield prediction. The authors discuss the challenges and advantages of different ML models, highlighting how advanced algorithms like deep learning and ensemble methods have been successfully employed to predict yields in complex agricultural systems. This review serves as a valuable resource for researchers and practitioners aiming to leverage ML for enhanced agricultural productivity [11]. Hussain, Sarfraz, and Javed (2021) conducted a systematic review on crop-yield prediction through Unmanned Aerial Vehicles (UAVs) presented at the 16th International Conference on Emerging Technologies (ICET 2021). The study highlights the prevalent use of Random Forest (RF), Support Vector Machine (SVM), and Convolutional Neural Networks (CNN) in crop yield prediction. The review underscores the significance of these algorithms and their adoption in developing countries, reflecting the growing reliance on UAVs for data collection and analysis in agriculture [12]. Sarvaiya, Chaudhari, and Verma (2022) discussed the challenges of crop yield prediction and crop selection based on climatic sensor data and historical yield data in their book section, "Monitoring Agricultural Essentials," from the "Application of Machine Learning in Agriculture." The authors emphasize the importance of machine learning in addressing major agricultural problems and improving crop yield predictions by leveraging climatic and past data [13]. Van Wart et al. (2015) explored the creation of long-term weather data for crop simulation modeling in their article published in Agricultural and Forest Meteorology. This study highlights the necessity of high-quality daily weather data, such as uncorrected gridded solar radiation, for accurate crop yield simulation and variability prediction. 
The authors demonstrate how propagating long-term weather data significantly enhances the reliability of crop simulation
  • 4. Signal & Image Processing: An International Journal (SIPIJ) Vol.16, No.6, December 2024 26 models [14]. Mahmood (1998) conducted a comparative study on air temperature variations and rice productivity in Bangladesh, published in Ecological Modelling. The study compares the performance of the YIELD and CERES-rice models, finding that boro rice productivity predictions at Mymensingh are higher using the YIELD model. This research underscores the critical role of accurate temperature data in predicting crop productivity[15]. Venkatesh and Saravanan (2022) investigated the prediction of crop yield using Simple Linear Regression (SLR) and Polynomial Regression (PR) in their study presented at the 3rd International Conference on Smart Electronics and Communication (ICOSEC 2022). Their findings suggest that SLR significantly outperforms PR in predicting crop yields, indicating the effectiveness of simpler models for specific types of agricultural data[16]. In similar works we applied machine learning methods to predict weather patterns[17], and customer churn [18]. These studies underscore the dynamic and evolving nature of crop yield prediction research. They not only highlight the potential of machine learning in agriculture but also set a foundation for future studies. Our research aims to build upon these methodologies, introducing novel approaches to further enhance the precision and applicability of crop yield predictions. 3. DATA DESCRIPTION The dataset used in this study, available at Kaggle (https://www.kaggle.com/datasets/akshatgupta7/crop-yield-in-indian-states-dataset), includes extensive agricultural data from India from 1997 to 2020. It covers a wide range of crops grown across different Indian states. Data includes crop types and years, cropping seasons, specific details for each state, areas of cultivation, production quantities, annual rainfall, and the usage of fertilizers and pesticides. 
The data features are: Crop: This field identifies the crop type. The dataset includes a diverse array of 55 crops, reflecting India's 55 rich agricultural variety including rice, maize, onion, potato, coconut, and banana. Crop Year: The dataset covers crop years from 1997 to 2020, providing a comprehensive temporal view of agricultural trends over 24 years. Season: The data categorizes cultivation of 4 distinct seasons, including major seasons Autumn and Spring to analyzethe seasonal impacts on agriculture. State: Includes data from 30 Indian states which offers a wide geographical perspective, to find the regional agricultural patterns. Area: Represents the land area under cultivation in hectares. The mean of area is approximately 179,926 hectares, ranging from a minimal 0.5 hectares to a vast 50.8 million hectares toindicate the varied scale of farming practices across regions. Production: The quantity of crop production, measured in metric tons, shows an average of around 16.4 million tons. It varies greatly, with a maximum recorded production of about 6.3 billion tons. Annual Rainfall: This feature, measured in millimeters, indicates the climatic conditions affecting crop growth. The average annual rainfall is about 1,438 mm, ranging from 301.3 mm to
  • 5. Signal & Image Processing: An International Journal (SIPIJ) Vol.16, No.6, December 2024 27 a significant 6k mm. Fertilizer: The total amount of fertilizer used, in kilograms, with an average of around 24.1 million kg. It shows a diverse nutrient management strategy across different crops and regions. Pesticide: This field presents the total pesticide usage in kilograms. On average, around 48,848 kg of pesticides are used, with the maximum 15.75 million kg. Yield: This attribute indicates production per unit area with an average of approximately 79.95 and an extremely varied range, peaking at 21,105. This metric is evaluating the efficiency of agricultural practices. 4. STATISTICAL ANALYSIS Figure 1. Scatter Plot of Key Features with Crop Categories These statistical insights provide a more understanding of the dataset, highlighting the complexity and diversity of agricultural practices in India. The scatter plot shows the relationships between various key features of the agricultural dataset differentiated by color. Each subplot in the matrix compares two different features,area vs production, annual rainfall vs fertilizer, and so on (Figure 1).From the scatter plot, we can observe the following: Area vs Production: There is a positive correlation between the area of cultivation and the production formost crops, which is expected as larger cultivation areas generally lead to higher production volumes. Annual Rainfall vs Production: The relationship between annual rainfall and production varies among crops, suggesting that some crops may be more sensitive to rainfall than others. Fertilizer vs Production: There seems to be a positive correlation for some crops, indicating that increased fertilizer usage may cause higher production. However, this relationship does not hold uniformly across all crop types. 
Pesticide vs Production: Pesticide usage does not show a clear correlation with production in this visualization which means the effectiveness or necessity of pesticides may vary depending on the
  • 6. Signal & Image Processing: An International Journal (SIPIJ) Vol.16, No.6, December 2024 28 crop. Yield: The yield scatter plots across different features show varied patterns for different crops, indicating that yield is influenced by a complex interplay of factors, not just a single feature. Figure 2. Aggregated Pesticide Usage by Crop and Year Each crop type, represented by a unique color, exhibits its own pattern of distribution and correlation across the different features, which can inform targeted agricultural practices and policies. The data points for crops like coconut are notably distinct due to high-volume output, which skews the distribution. The bar chart would offer a comprehensive view of the trends in pesticide use across different crops over the years (Figure 2). It shows how pesticide usage has varied over time for each crop type. This chart is an instrument to identify patterns and potential correlations between pesticide use and other factors like crop yield, cultivation practices, or environmental changes. It would serve as a critical tool for understanding the dynamics of pesticide management in agriculture, aiding in developing more sustainable and efficient farming practices. 4.1. Features Distribution The feature distribution plots for area, production, annual rainfall, fertilizer, and pesticide from the agricultural dataset provide a visual summary of the underlying data characteristics and variability (Figure 3).The histograms reveal the frequency distribution of values for each feature. The Area histogram shows a concentration of values in smaller land areas, suggesting that most of the crop cultivation occurs in relatively smaller lands. The Production histogram is rightly skewed with a few instances of very high yields, indicative of a small number of highly productive operations. Annual Rainfall appears more consistently distributed, suggesting a level of predictability in this environmental factor. 
Fertilizer and Pesticide usage are both right skewed, indicating that lower usage rates are more common across the dataset.
  • 7. Signal & Image Processing: An International Journal (SIPIJ) Vol.16, No.6, December 2024 29 Figure 3. Histogram, Density and Box plot of selected features Density plots provide a smoothed representation of the data distribution, revealing the probability density of the different values. These plots show the likelihood of specific values occurring within the dataset and highlight the central tendencies and the spread of data more clearly than histograms. For area, production, fertilizer, and pesticide, the peaks of the density plots suggest the most common values and verify the skewness seen in the histograms. Box plots offer a summary of the data’s statistical distribution, including the median, quartiles, and outliers. The box plots for area and annual fertilizer and pesticide do not show significant outliers, which indicates a more homogeneous distribution within the interquartile range whereproduction, fertilizer, and pesticide box plots display several upper-end outliers. These outliers represent values that are exceptionally higher than the typical range of the data and may correspond to instances of intensive farming practices or atypical environmental conditions. 5. NORMALIZATION AND LABELING OF CROP YIELD DATA (TARGET VARIABLE) Normalization per crop is a essential preprocessing step in agricultural data analysis. This process involves scaling the yield data for each crop type within a specified range (commonly 0 to 1) to ensure a uniform scale across various crops. The primary reasons for this normalization include: Comparability: Different crops may have inherently different yield scales due to varying biological and cultivation factors. Normalization allows for a fair comparison of yields across diverse crop types on a common scale. Outlier Mitigation: Some crops might have extreme yield values (either high or low) that can skew the overall analysis. Normalization helps in mitigating the impact of such outliers. 
Uniformity in Analysis: It ensures that the yield data across all crops are treated uniformly, making the subsequent analysis more robust and less biased towards crops with larger or smaller yield values.
  • 8. Signal & Image Processing: An International Journal (SIPIJ) Vol.16, No.6, December 2024 30 5.1. Purpose of Labeling into Four Classes Labeling the normalized yield data into four distinct classes is a method of discretization that simplifies complex continuous data into categorical segments. This is beneficial for several reasons: Simplification of Data: It simplifies the continuous range of yield values into distinct categories, making it easier to analyze and understand patterns within the data. Facilitates Classification Analysis: By converting yields into classes, the data is prepared for classification algorithms in machine learning, predictive modeling or trend analysis. Enhanced Interpretability: Labeling yields into categories like 'Low', 'Medium', 'High', and 'Very High' provides a more intuitive understanding of the yield performance for each crop. Labeling is often done using quartiles, dividing the data into four equal parts based on their distribution. This method ensures that each class has an equal number of data points, providing a balanced categorization of the yield data. Figure 4. Yield Distribution per Crop, Before and After Normalization Figure 4 illustrates the impact of normalization on yield data across various crops. On the left, we see the yield distribution before normalization, where each crop's yield values span a wide and disparate range, making it difficult to compare between crops. Outliers and variances are prominent, and the scales are imbalanced, with some crops showing yields reaching 20,000 units. On the right, after normalization, all yields are scaled between 0 and 1. This transformation standardizes the data, bringing all crops onto an even playing field and highlighting the relative distribution within each crop type without being overshadowed by the absolute yield values. 
This normalized view allows for more straightforward comparisons across different crops and a clearer interpretation of yield performance relative to each crop's potential.
  • 9. Signal & Image Processing: An International Journal (SIPIJ) Vol.16, No.6, December 2024 31 Figure 5. Normalized Yield Distribution per Crop Figure 5 presents a boxplot illustrating the normalized yield distribution for a variety of crops, with yield data segmented into two distinct classes: Low and High. Each crop type is represented by a series of boxplots along the horizontal axis, with the normalized yield values plotted on the vertical axis ranging from 0 to 1. The Low-yield class is depicted in blue, and the High-yield class in red, allowing for a clear visual distinction between the two categories. For each class, the box plots show the median yield value (the line within the box), the interquartile range (the box itself), and potential outliers (the individual points beyond the whiskers). This graph effectively communicates the variability in yield within each crop type, as well as between the two yield classes, providing insights into the distribution patterns of agricultural productivity across different crops. 6. METHODOLOGY This section evaluates and compares the performance of various machine learning classifiers on a crop yield. The dataset, preprocessed with feature normalization, includes key agricultural indicators such as area, production, annual rainfall, fertilizer, and pesticide usage, all normalized to ensure uniformity and comparability across different scales. The target variable, 'Yield_Class_Int', represents yield categories encoded as integers, facilitating a multi-class classification approach. The selected classifiers include a diverse array of algorithms: Logistic Regression, Decision Tree, Random Forest, Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Naive Bayes, and Gradient Boosting. These methods cover a spectrum from simple linear models to more complex ensemble methods, each with its strengths in handling different types of data distributions and relationships. 
The dataset is split into training and testing sets, with 80% of the data used for training and the remaining 20% for testing to ensure a robust evaluation framework. Figure 6 shows one tree of Random Forest, K-Nearest Neighbor (K=3), and Naïve Bayes Classifiers.
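As a rough sketch of the 80/20 split described above, the following uses only the Python standard library. The helper name `train_test_split`, the fixed seed, and the synthetic records are assumptions made for illustration; the paper does not say which tool or random seed was used.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle rows reproducibly and split them into (train, test)."""
    shuffled = list(rows)  # copy so the caller's list is not modified
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical records: (feature_vector, yield_class) pairs.
records = [([i / 100.0], i % 2) for i in range(100)]
train, test = train_test_split(records)
print(len(train), len(test))  # 80 20
```

Shuffling before splitting avoids a test set made up only of the last rows of the file, which could otherwise over-represent particular years or states.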
Figure 6. Left to right: one tree of Random Forest, K-Nearest Neighbor (K=3), and Naïve Bayes classifiers
Each model is trained on the training set and then evaluated on the test set. Performance metrics such as accuracy, precision, recall, F1 score, and the confusion matrix are computed for each model. These metrics provide a multi-dimensional view of the models' performance, with accuracy indicating the overall correctness, precision and recall offering insights into the models' ability to identify each class correctly, and the F1 score presenting a balance between precision and recall. The confusion matrix further elucidates the specific areas of strength and weakness for each classifier by showing the distribution of predictions across actual classes. This assessment allows for a detailed comparison of the models, highlighting their efficacy and fitness for the crop yield classification task.
7. RESULTS
Figure 7 compares various machine learning models used for classification tasks. The metrics include Accuracy, Precision, Recall, and F1 Score. Accuracy reflects the overall rate of correctly predicted class labels. Precision indicates the proportion of true positives among all positive predictions and is a key measure when the cost of a false positive is high. Recall measures the proportion of actual positives that were identified correctly, which is particularly important when missing a positive is costly. The F1 Score is the harmonic mean of precision and recall, providing a single metric that balances false positives and false negatives. A model with high precision but lower recall might be conservative in its positive predictions but miss several actual positives. In contrast, a model with high recall but lower precision might capture most of the positives but at the cost of increased false positives. The F1 Score helps to balance these aspects and is often a crucial metric when choosing the best model for deployment.
In the confusion matrix shown in Figure 8, the horizontal axis represents the predicted classifications, while the vertical axis represents the actual classifications, each divided into 'Positive' and 'Negative' categories. The top left quadrant represents true positives (TP), where the model correctly predicts the positive class. The bottom right quadrant represents true negatives (TN), where the model correctly predicts the negative class. The top right quadrant shows false negatives (FN), where the model incorrectly predicts the negative class, and the bottom left quadrant shows false positives (FP), where the model incorrectly predicts the positive class.
Figure 7. Performance Metrics of Different Methods
Figure 8. Confusion Matrix
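To make the relationship between the four confusion-matrix quadrants and the reported metrics concrete, here is a minimal standard-library sketch for the binary case. The function names and the ten example labels are hypothetical and are not taken from the study's data.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FN, FP, TN for a binary classification task."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1  # actual positive, predicted positive
        elif a == positive:
            fn += 1  # actual positive, predicted negative
        elif p == positive:
            fp += 1  # actual negative, predicted positive
        else:
            tn += 1  # actual negative, predicted negative
    return tp, fn, fp, tn


def classification_metrics(tp, fn, fp, tn):
    """Derive accuracy, precision, recall, and F1 from the four counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels (1 = High yield, 0 = Low yield), for illustration only.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fn, fp, tn = confusion_counts(actual, predicted)
print(tp, fn, fp, tn)  # 3 1 1 5
print(classification_metrics(tp, fn, fp, tn))
```

For a task with more than two classes, these per-class values are typically averaged across classes (for example, macro or weighted averaging), so an averaged F1 score need not equal the harmonic mean of the averaged precision and recall.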
The intensity of the colors corresponds to the number of observations in each category, with darker colors typically representing higher numbers. This visualization helps in quickly assessing the model's performance, particularly its ability to distinguish between the classes. For example, if the TP and TN quadrants are much darker than the FN and FP quadrants, this indicates a high level of accuracy, as is the case in Figure 8.
Our explorations into machine learning models for agricultural yield prediction have yielded significant insights. Based on Figure 7 and Table 1, the Random Forest model achieved the highest accuracy of the models tested, correctly predicting the yield class in 73% of cases when considering crucial features such as area and production. This result underscores the model's capability to handle the discrete nature of our data effectively.
Table 1: Performance Metrics of Different Models Based on Figure 7
Model                 Accuracy   Precision   Recall   F1 Score
Logistic Regression   0.52       0.58        0.55     0.5
Decision Tree         0.62       0.62        0.62     0.62
Random Forest         0.73       0.73        0.73     0.73
SVM                   0.51       0.58        0.51     0.48
KNN                   0.52       0.55        0.52     0.52
Naive Bayes           0.51       0.64        0.51     0.39
Gradient Boosting     0.61       0.6         0.61     0.6
8. DISCUSSION
Our project's findings indicate that Random Forest models excel in accuracy on discrete datasets, a characteristic that is particularly relevant to our agricultural domain. With its ensemble approach, the Random Forest model complements the probabilistic predictions of Naïve Bayes, offering a robust alternative for yield classification. Throughout this project, we have not only applied various machine learning techniques but also honed our ability to discern the most appropriate methods for our dataset.
The process has enhanced our analytical skills, enabling us to create informative visualizations that succinctly convey the efficacy of different machine learning strategies. In future work, we will apply boosting methods to increase the accuracy of the predictor.
Compliance with Ethical Standards
Conflict of Interest Statement
The authors declare no conflict of interest. All authors have reviewed and agreed with the manuscript. We state that the submission is an original paper and is not under review at any other journal.
Research Involving Human Participants and/or Animals
There are no humans or animals participating in this project.
Consent to Participate
The authors consent to participate in this project and understand that the research may not be of direct benefit to us, that our participation is entirely voluntary, and that we have the right to withdraw from the project at any time without any consequences.
Data Availability
The dataset used for this project was collected from Kaggle: https://www.kaggle.com/datasets/akshatgupta7/crop-yield-in-indian-states-dataset
Funding
No funding was applied for this project.
Ethical Approval
All subjects gave their informed consent for inclusion before they participated in the study.
Consent to Publish
We give our consent for the publication of exclusive details, which could include figures, tables, and details within the manuscript, to be published in Computational Brain & Behavior.