Machine Learning Predictive Modeling of Agricultural Sustainability Indicators

Modern-day researchers have access to abundant data, which comes with a drawback: increased analysis complexity. Traditional data analysis techniques provide only partial solutions to this complexity. This research offers analytical and predictive models based on machine‐learning algorithms (linear regression, random forest, and generalized additive model) that can be used to assess and improve the impact of the Common Agricultural Policy (CAP) on agricultural sustainability in European Union (EU) countries, and to identify proper instruments that EU policymakers and the CAP Council can adopt in the financial management of the policy. The methodology builds custom‐developed models on a dataset containing 22 relevant indicators, covering the three main dimensions that contribute to the EU sustainable agriculture development goals in the CAP context: social, environment, and economic. The results showed that sustainable agriculture parameters influenced by the relevant indicators could be modeled with both linear and non-linear regression approaches by applying machine learning to real-time data. The predictive analytic models provide satisfactory performance and could be adopted by researchers and practitioners as policy impact monitoring and controlling tools, not only in the EU but also in other countries that have or plan to adopt similar agricultural policies.


INTRODUCTION
The United Nations Sustainable Development Goals (UN SDGs) optimistically commit to ensuring year-round access to safe, nutritious, and sufficient food (SDG target 2.1) and to eradicating all forms of malnutrition (SDG target 2.2) for all people by 2030. Yet, according to the Food and Agriculture Organization (UN FAO), the world has not moved any closer to these targets [1]. The 2020 State of Food Security and Nutrition in the World (SOFI) reported that the COVID-19 pandemic continues to expose weaknesses in the world's food system, which threaten the food security of many [2] and will result in devastating impacts on health, productivity, and the economy. Sustainable agriculture practices are needed to end world hunger and achieve food security, but recent studies showed that current agriculture practices are far from sustainable [3].
Submitted: 03 August 2022; Accepted: 18 January 2024
Sustainability has three dimensions: social, environment, and economic [4]. As a sector with strong influence on all three dimensions, agriculture is considered one of the engines of sustainable development. For over six decades, in a partnership between agriculture and society, the Common Agricultural Policy (CAP) has shaped the European countries' agricultural sector [5].
The CAP is structured into two pillars that delineate the policy's elements: the first focuses on direct payments and market measures to support agricultural productivity, and the second promotes rural development, competitiveness, and the sustainable management of natural resources [46]. Since agriculture is an essential sector with significant influence on rural areas, the implementation of the CAP and its positive impact on rural employment and economic growth [6] demonstrate that it is indeed shaping rural development despite the criticism. Research interest in agricultural policy reforms has increased, especially in investigating the CAP and its effects on different output performances. However, the scientific debates around the CAP have been fragmented because of the different approaches and perspectives that evolved on the topic [5]. To help policy-makers evaluate the indicators used in the current policy, this research aims to enrich the body of knowledge of the common monitoring and evaluation framework (CMEF). The CMEF, which is updated annually [54], provides policy-makers with a set of evidence and guidelines for decision-making to improve the CAP's utility, efficiency, transparency, effectiveness, and learning.
Machine learning (ML) has been applied in several agricultural studies to model and predict specific scenarios. It is considered a useful approach for better understanding the complexity behind the input, process, and output of a studied phenomenon, especially when working with real-time data, which are non-linear in nature. Statistical modeling approaches based on ML algorithms can be used as an alternative to traditional methods and can overcome their limitations [6]. While several previous studies have shown that non-linear approaches give better predictive performance than linear modeling [6]-[9], real-time changes that alter data behavior need to be handled and interpreted with caution. Hence, the available equations need further tuning to ensure that they are reproducible. This research aims to propose predictive analytic models of the CAP's impact on agricultural production support (Pillar I) and rural development (Pillar II) using non-linear (random forest and generalized additive model) and linear (multiple linear regression) algorithms. It also gives insight into which variables best represent the predictive models for assessing the effects of the CAP, and offers future researchers and policymakers valuable information, based on digital technologies, that is needed to develop a similar agricultural support program for non-European countries with lower Agriculture Orientation Index (AOI) levels [10], which are considered far behind in achieving the agriculture-related goals of sustainable development.

Data Exploration
To investigate the CAP's impact on the agricultural sector and to identify proper indicators accounting for the instruments that can be used by policymakers, especially in CAP financial management, this research developed ML models based on the variables described in Table 1. The data, covering the 2011-2019 financial period, were collected from official databases such as Eurostat, the European Commission, and FAO. For relevance and comparability with the present research's primary objective of analyzing the CAP: 2014-20, and with respect to data availability for all variables observed, this time range is deemed appropriate.
The agricultural sector's economic situation is related to sustainability: it should be economically viable and not degrade the environment [11]. In [12], the authors focused on agriculture's sustainability dimensions and used a series of variables such as employment rate and income, workforce stability, agriculture inputs, insured area, agricultural risks, economic share of GDP, and economic dependence on agricultural activity. Another study [7], which took a different perspective on agricultural economic development, used agriculture's share of GDP, GVA, farming subsidies, and agricultural production as variables. Therefore, to assess the sector's rural environment and economic importance, financial variables such as consumer price index (CPI), GDP, GDP in rural areas (GDP_Rural), GVA, agricultural GVA (GVA_Agri), and GVA registered for rural areas (GVA_Rural) were included, as well as productivity variables consisting of crop yield, fertilizer usage, labor productivity (LPA), total factor productivity (TFP), and ammonia (NH3) and carbon dioxide (CO2) emissions from farming.
The European countries had high AOI levels, reflecting their governments' strong orientation toward the sector, measured as the share of spending relative to the agricultural GVA [10]. This helped increase productivity and growth by reducing budget constraints and increasing capital [13]. One of the spending schemes is research and development (R&D). One way to narrow the gap of possible future food insecurity caused by the projected high calorie demand in 2050 is the adoption of advanced technology by agricultural producers [14]. Hence, agriculture R&D (ARD) is included as an indicator. The CAP's social goal is to tackle unemployment and poverty in rural areas, since both were reported to be higher than in urban areas [15]. Recent evidence showed that the CAP successfully increased GDP per capita and the rural employment rate, and decreased the rural poverty rate [6]. In this sense, the following variables were considered to capture the social impact of the CAP: direct payment (DP), agricultural entrepreneurial income (AEI), agricultural factor income (AFI), degree of rural poverty (DRP), rural employment rate (RER), and agriculture employment (AgriEmployment).

Data Preprocessing
This research used an 80:20 ratio for data splitting. The training data were used to fit the models or train the algorithms. The testing data were used to validate the fitted models or trained algorithms for prediction purposes [16]. The 10-fold cross-validation approach was chosen as it delivers good model assessment in ML [17] (Figure 1). Three model fit parameters were assessed: root mean square error (RMSE), mean absolute percentage error (MAPE), and mean absolute error (MAE).
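The splitting scheme and the three error metrics described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' R workflow; the function names, the fixed random seed, and the contiguous fold layout are assumptions made for the example.

```python
import math
import random


def train_test_split(rows, test_ratio=0.2, seed=0):
    """Shuffle rows and split them into (train, test); 0.2 gives an 80:20 split."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seed is an arbitrary, assumed value
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


def k_fold_indices(n, k=10):
    """Yield (train_idx, val_idx) lists for k contiguous, near-equal folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent; y_true must be non-zero."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)
```

In such a setup the 10 folds would be drawn from the training portion only, so the 20% test set stays untouched until the final evaluation.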

Linear Modeling
Regression analysis is used to model the relationship between a single dependent variable and one (simple regression) or more (multiple regression) independent variables [17]. In multiple linear regression (MLR), the model being fitted is linear (Equation 1); it was computed in RStudio using the lm function and the stepAIC function (MASS package). A feature selection step based on stepwise regression (stepAIC) was used to identify the optimal linear models from the variables being studied [16]. To determine the best-fitted model, regression-related metrics were analyzed: R-sq and adjusted R-sq.
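For reference, the two fit metrics named above, and the criterion that stepwise selection minimizes, can be written out as a short sketch. It uses plain Python rather than the R functions used in the study; the function names are hypothetical, and the AIC expression is the usual least-squares form stated only up to an additive constant.

```python
import math


def r_squared(y_true, y_pred):
    """Share of variance explained: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_obs, n_predictors):
    """Penalize R-sq for the number of predictors (intercept not counted)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


def aic_gaussian(rss, n_obs, n_params):
    """AIC of a least-squares fit, up to an additive constant: n*ln(RSS/n) + 2k."""
    return n_obs * math.log(rss / n_obs) + 2 * n_params
```

A stepwise procedure adds or drops one predictor at a time and keeps the change only if the criterion decreases, which is why a model with a slightly lower R-sq but fewer predictors can be preferred.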

Non-Linear Modeling
Many scholars have adopted non-linear ML methods, which are more flexible and useful where econometric models fail to deliver relevant interpretability [18]. Developing ML models requires the selection of optimal 'features' (variables/predictors) to obtain better learning performance and a more cost-effective data generation process [19]. Feature selection was performed to reduce model overfit, improve model accuracy, and yield interpretable predictors. This research used random forest (RF) and generalized additive model (GAM), computed in RStudio with the libraries randomForest, gam, and mgcv. RF is an ensemble of decision trees: it collects the prediction of each tree and returns the most voted class as the result [16], [20] (for regression targets, the tree predictions are averaged instead). RF has several hyperparameters that must be set in advance through parameter tuning [21] for a better learning process [22]. The result of parameter tuning was used in the algorithm for data training and testing in each RF model. GAM modeling has been applied in agricultural studies on ecology, land allocation, and climate-related issues [23], [24], and for predictive analytics of crop yield [25] or pesticide use [26]. The advantage of GAM stems from its interpretability, which is similar to that of MLR: the model is presented as a sum of functions obtained from fitting the additive model, replacing the beta coefficients of linear regression with flexible functions that allow non-linear relationships to be modeled [17].
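The two prediction rules described above can be illustrated with a small sketch, assuming a regression setting. The "trees" and "smooth functions" below are hypothetical hand-written stand-ins, not fitted models; in the study they would come from the randomForest and mgcv fits.

```python
from statistics import mean


def forest_predict(trees, x):
    """Regression forest output: the average of the individual tree predictions."""
    return mean(tree(x) for tree in trees)


def additive_predict(intercept, smooths, x):
    """GAM-style predictor: intercept plus a sum of one-variable functions."""
    return intercept + sum(f(x[name]) for name, f in smooths.items())


# Hypothetical stand-ins for fitted trees (simple threshold rules).
trees = [
    lambda x: 10.0 if x["DP"] < 5 else 20.0,
    lambda x: 12.0 if x["DP"] < 7 else 22.0,
    lambda x: 8.0 if x["LPA"] < 3 else 18.0,
]

# Hypothetical smooth terms; a fitted GAM would estimate these shapes from data.
smooths = {
    "DP": lambda v: 0.5 * v,          # a straight line, the MLR special case
    "LPA": lambda v: (v - 3.0) ** 2,  # a non-linear term
}

x = {"DP": 6.0, "LPA": 4.0}
rf_value = forest_predict(trees, x)              # mean of 20.0, 12.0, 18.0
gam_value = additive_predict(1.0, smooths, x)    # 1.0 + 3.0 + 1.0
```

Because each term in the additive predictor depends on a single variable, its contribution can be plotted and read on its own, which is the interpretability advantage the text attributes to GAM.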

Gross Value Added (GVA). The analysis of GVA revealed notable disparities between EU countries (Figure 3b). The influence of the GVA registered from the agricultural sector (agriculture, forestry, fishing, and hunting) on the national economy, represented by GDP, is considered moderate (Figure 3c). This emphasizes that most EU countries were not entirely dependent on agricultural activities. The result is similar to that of a previous study [7], which evaluated rural economic accounts with financial data from 2007 to 2013. Hence, the overall regional growth rate was stagnant. The CAP: 2014-20 reform has had a positive impact on rural GVA (Figure 3a), observed in all studied EU countries.
Agricultural Land and Productivity. In terms of average cereal crop yield, the best results were recorded in Belgium, followed by Finland and Austria (Figure 4a). This is consistent with a previous study [7], in which Central and Western EU produced high crop yields, even though not all countries in the region live up to the same claim. For example, in the Western EU, France shows a low yield although it had a larger land area and production quantity (Figure 3e). In this case, other crops might contribute more to the total production. Thus, crop yield as a single measure of productivity could give a misleading indication of the degree of agricultural productivity [27]. The International Food Policy Research Institute (IFPRI) stated that the TFP index is considered the most suitable indicator to measure the agricultural system. From the data, the Eastern EU region had a higher TFP index (Figure 3e), even with lower crop yields. This might be explained by lower production factor prices in the Eastern EU, as it had smaller agricultural land to manage (Figure 3f).
Direct Payment. Direct payment (DP) is considered the main instrument in the CAP aimed at supporting farmers' income and contributing to rural vitality. Many farms depend heavily on this subsidy as farming is a risky and costly business [28]. From the data, countries with high DP had higher agricultural entrepreneurial income (AEI) (Figure 3d). This is also consistent with the size of agricultural land and the production quantity of all crops (Figure 3f). Similar to the previous study [7], Western EU received the largest amount of DP and had higher AEI.
Labor Productivity, Income, Fertilizer Use, and Emissions. Agricultural labor productivity (LPA) plays a fundamental role in economic development processes, since higher LPA frees agricultural labor from food production for the production of other goods and services [29]. High labor and land productivity are related to fertilizer usage and high income, respectively. High productivity was observed in Eastern EU countries, consistent with their high TFP (Figure 3e, 3h) and fertilizer usage (Figure 3g). However, the agricultural factor income (AFI) was lower. A possible explanation is their smaller agricultural land, which is linked to the smaller DP subsidies granted. Fertilizer usage moved in step with emissions from production activities (Figure 3g), except in Bulgaria, where emissions were low relative to fertilizer usage. This might indicate a more sustainable use of resources. This finding further emphasizes the country's high TFP index despite receiving a smaller amount of DP subsidies.
Consumer Price Index, Rural Poverty, Employment Rate, and R&D. Both the degree of rural poverty (DRP) and the consumer price evolution index (CPI) (Figure 3i), which in this context refers to food products, were lower in countries with lower AFI (Figure 3h) and higher DP subsidies (Figure 3d), such as Germany, France, and Denmark. In Eastern EU countries, both DRP and CPI were higher. This reveals that the rural economy is strongly related to agricultural production activities, as high food prices may affect the affordability of consumption, labor demand, and income. Noticeably, in Figure 3j, the Western EU countries had a higher government budget allocated for agriculture expenditure and R&D spending than others. The findings agree with a previous study [7], indicating that the trend has not changed even during the CAP: 2014-20. To improve this situation, R&D to execute innovation is vital, and the government was proven to be the enabler for R&D outcomes [30].

The Social Dimension
The degree of rural poverty (DRP) is an important aspect to understand, as its reduction could bring benefits not only for the rural population but for the whole country. The linear regression model (Equation 2) shows very good accuracy metrics, with R-sq at 87.2% and adjusted R-sq at 85.9%. The result indicates that DRP decreases with increases in the female employment rate, employment in agriculture, CPI, and GDP per capita registered for rural areas. A previous study [31] identified the agricultural GVA and RER as important factors in alleviating rural poverty. However, considering Equation (2) below, DP and LPA adversely influence the DRP: their coefficients are positive, so DRP rises with them. This might be due to the high employment rate, which can decrease the sector's average income as well as its labor productivity [7].
Ln(DRP) = 0.442 - 0.008 Ln(AgriEmployment) - 1.384 Ln(CPI) - 0.0007 Ln(GDP_Rural) + 0.002 Ln(DP) + 0.045 Ln(LPA) + 0.025 Ln(RER_M) - 0.031 Ln(RER_F) (2)
GDP per capita (GDP) is a measure of a nation's prosperity. Hence, it is an important parameter to study. The model has very good accuracy metrics: R-sq of 85.5% and adjusted R-sq of 84.8%. Considering Equation (3), a reduction in the DRP can improve the GDP per capita. This indicator is also affected by the consumer price evolution of food products, or the consumer price index (CPI) [7]. When the rural population's income remains constant and prices increase, consumption decreases. This leads to a direct negative impact on the GDP. As presented in Figure 3c, which compares the contribution of GVA_Agri to GDP, most EU countries were not dependent on their agricultural activities. In terms of employment rate and GDP per capita, other authors [32] reported a strong relation between the two.
The 'GVA_Rural' parameter was modeled with random forest (RF). Through feature selection analysis, it was concluded that the following five parameters were the most important (based on their weights): DRP, GDP, Fertilizer, GDP_Rural, AFI (Table 2). 'GDP' and 'Fertilizer' display higher weights, indicating that agriculture played a vital role in the development of agricultural activities in rural areas. 'AFI' is also observed to have a significant influence on rural GVA, reinforcing the hypothesis for European rural development. AFI is best suited for evaluating the impact of changes in the level of public support on the capacity of farmers to reimburse capital, pay wages, and reward their production (Bank 2008).
The CAP, in the World Trade Organization definition, is a multifunctional policy: it gives numerous benefits to a country. In the agricultural context, these include food security, environmental protection, and rural employment [33]. As illustrated in Figure 3i, the RER for women is noticeably lower than that for men. Additionally, the countries with the highest RER, such as France and Germany, also had higher DP subsidies. Through RF, this research presents three rural employment rate parameters (RER, RER_M, RER_F), whose characteristics are described in Table 2. According to [34], large-scale commercial farms show little impact on the reduction of rural poverty compared to areas dominated by small farms, which provide job opportunities for the locals. Thus, targeting a fairer distribution of DP in the CAP: 2023-27 reform [35] was the right move. DP appears as one of the important variables in the RER model (Table 2).
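To make the log-log form of Equation (2) concrete, the sketch below evaluates it and shows how a coefficient translates into a percentage effect. The coefficients are copied from Equation (2); the function names are hypothetical, and the units of the inputs are an open assumption because the text does not state them, so absolute predictions are only meaningful with the authors' original scaling.

```python
import math

# Coefficients copied from Equation (2); every variable enters as a natural log.
EQ2_INTERCEPT = 0.442
EQ2_COEFFS = {
    "AgriEmployment": -0.008,
    "CPI": -1.384,
    "GDP_Rural": -0.0007,
    "DP": 0.002,
    "LPA": 0.045,
    "RER_M": 0.025,
    "RER_F": -0.031,
}


def predict_drp(values):
    """Evaluate Equation (2) and back-transform from Ln(DRP) to DRP.

    `values` maps each predictor name to a positive number in the same
    (unstated) units the authors used.
    """
    log_drp = EQ2_INTERCEPT + sum(
        coef * math.log(values[name]) for name, coef in EQ2_COEFFS.items()
    )
    return math.exp(log_drp)


def relative_change(name, pct_increase):
    """Percent change in DRP implied by raising one predictor by pct_increase,
    holding the others fixed: (1 + p/100) ** coefficient - 1."""
    return 100.0 * ((1.0 + pct_increase / 100.0) ** EQ2_COEFFS[name] - 1.0)
```

Under this reading, each coefficient acts approximately as an elasticity: `relative_change("CPI", 1.0)` is close to -1.4, meaning a 1% rise in CPI is associated with a fall in DRP of about 1.4%, while the DP and LPA terms imply small increases.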

The Environment Dimension
Farmers use fertilizers to help crops produce higher yields. However, the excessive use of fertilizer could result in environmental degradation [36]. 'Fertilizer' was modeled using MLR. The model in Equation 4 below displayed a satisfactory R-sq of 98.5% and an adjusted R-sq of 97.9%.
Ln(Fertilizer) = -0.00006 - 0.425 Ln(GDP) + 0.143 Ln(GVA) + 0.001 Ln(LPA) + 6.330 Ln(NH3E) + 0.004 Ln(TFP) (4)
Theoretically, higher TFP values are related to lower usage of inputs (e.g., fertilizer) due to efficiency improvements [37]. Although the model's beta coefficient for TFP is low, it is worth investigating. A study [38] that used US Department of Agriculture data stated that measuring TFP growth is challenging and that knowledge gaps still exist in understanding the indicators that contribute to the TFP index measurement, as there are inputs other than fertilizer usage.
This research emphasizes that the agricultural sector's ammonia (NH3) emission is directly affected by agricultural investment, such as DP. Additionally, fertilizer usage, labor productivity, GDP, and GVA positively influence the rate of NH3 emission. A possible explanation is that when DP was high, farmers purchased more agricultural inputs, which resulted in higher productivity. However, this model has peculiarities. The DRP and employment in agriculture are observed to have a negative influence on NH3 emission. This implies that an increase in the agricultural employment rate will decrease emissions. A possible explanation is that when more people are employed in the agricultural sector, the use of fertilizers decreases as farming depends more on physical labor. A similar finding was reported in [39], where farmers used less fertilizer when the rural employment rate was high. Equation (5) delivers excellent accuracy metrics, with R-sq at 99.7% and adjusted R-sq at 99.6%.
The environmental dimension includes carbon dioxide (CO2) emissions, as several studies showed that agricultural activities are responsible for an overall increase in carbon emissions [40], [41], much more than the total NH3 emitted from the sector (Figure 6a). A study [40] stressed the need for reliable quantification to assess the necessary resource allocation for climate research and land-use management in the agricultural context.

The Economic Dimension
The present research modeled GVA in agriculture (GVA_Agri) with the GAM method, as MLR gave a low accuracy level due to the non-normal data distribution. The 'GVA_Agri' model gave a satisfactory fit, with an adjusted R-sq of 98.6%. As the model shows, the GVA_Agri parameter is significantly influenced by DP, LPA, NH3 emission, crop yield, and employment in agriculture (Equation 7).
GVA_Agri = s(DP) + s(LPA) + s(NH3E) + s(Yield) + s(AgriEmployment) (7)
In [42], the authors conducted a study across EU-27 countries using the k-means clustering method, which resulted in the agricultural production intensities after extra investments were given.
The overall 'GVA' parameter can be expressed through MLR with satisfactory model accuracy: R-sq at 98.9% and adjusted R-sq at 98.8%. According to the model, an increase in GDP, DP, employment in agriculture, and the sector's GVA would lead to an increase in the overall 'GVA'. An increase in government expenditure on R&D will decrease the GVA. A possible explanation for this finding is that the contribution of R&D to agricultural productivity is not clear-cut. It is reported [43] that while there is a positive relationship between knowledge created from R&D investment and productivity growth, it remains a challenge to exploit that knowledge with the capability to capture the full benefits of the investment. Therefore, a high level of investment in R&D does not guarantee a positive return in GVA.
Ln(GVA) = 0.016 - 0.0007 Ln(CO2E) - 0.212 Ln(ARD) + 0.009 Ln(GDP) + 4.097 Ln(DP) + 0.113 Ln(GVA_Agri) + 0.531 Ln(AgriEmployment) (8)
From Equation (8) above, it is observed that CO2 emission negatively impacts GVA. A similar finding is reported in [44], where higher GVA in agriculture negatively impacted CO2 emission, whereas the industrial and service sectors positively impacted it.
Crop yield measures the agricultural production harvested per unit of land area (kg/ha). It helps optimize food security programs as part of the sustainable development goals. Modeling crop yield is necessary as it offers policymakers timely information for rapid decision-making during the growing season in relation to trade and policy changes related to food security. Previous authors [30] modeled crop yield through RF, and the current research adopted the same method. The 'Yield' predictive analytic model consists of the following important indicators, based on their weights: TFP, GDP_Rural, AEI, LPA, AgriEmployment (Table 2). Intensive agricultural activity associated with a high crop yield will generate higher LPA due to the high degree of automation of the production processes involved [7]. This implies that a higher TFP index results in higher yield, as it measures productivity, a critical condition for the sector's development and the overall economy.
Thus, 'TFP' was modeled with RF, resulting in ARD, Yield, CPI, AEI, and CO2 as the important indicators to study (Table 2). Similar findings from [9] support this result, showing that agriculture R&D clearly influences the TFP index. The unavailability of appropriate machinery and farmers' dependency on traditional methods were found to be the main barriers to production growth [14].
The CAP supports the development of the agriculture sector through a combination of DP schemes for farmers, financial aid for rural development, and environmental protection measures. DP is meant to support European farmers and received even more attention during the CAP: 2014-20 to accelerate the transition towards green agriculture [45], [46]. Other authors [47] examined the role of government intervention in the financial efficiency of organic farms and found that DP indeed plays a vital role in their financial viability.
Ln(DP) = 0.398 + 0.086 Ln(GVA) + 0.0005 Ln(CO2E) - 0.152 Ln(AgriEmployment) - 0.007 Ln(AFI) + 0.185 Ln(RER) - 0.001 Ln(GDP) (9)
Based on the previously described dataset, the 'DP' parameter was modeled through the MLR approach. The model (Equation 9) shows excellent accuracy metrics, with R-sq at 98.9% and adjusted R-sq at 98.8%. According to [48], DP can generate negative factors regarding rural sustainability; this might explain the negative influence of GDP in the model. A significant financial influx into unprepared economies might cause a deficit of skilled labor, which eventually causes a decrease in GDP. Similar to [7], the DP value increases with increases in GVA, CO2, and RER, which represent productivity. However, the negative influence of AFI on DP might explain the need for a fairer DP redistribution [35] in the future CAP: 2023-27 reform.
The importance of agriculture R&D (ARD) investment was emphasized in several studies [30]. The 'ARD' parameter was modeled using GAM and shows excellent accuracy, with an adjusted R-sq of 99.9%. In Equation (10), it can be noticed that overall GVA, DP, GDP per capita registered for rural areas, agriculture GVA, and CO2 emission from agricultural production are important indicators in predicting ARD. Agricultural R&D is critical for ensuring food security in the coming decades; thus, policymakers should pay more attention to R&D and its impact on agriculture sector performance and overall economic growth.
ARD = s(GVA) + s(DP) + s(GDP_Rural) + s(GVA_Agri) + s(CO2E) + s(GDP) (10)
Although the development and practice of the CAP have used instruments that cover a wide range of policy options, issues with its implementation remain, which have led to the sector's low economic, social, and environmental sustainability [49], [50]. ML techniques can offer the ability to measure the policy's impact in a real-time and geo-specific manner. The challenge for policymakers at different administrative levels is to capture all heterogeneous territorial contexts in the EU and to include the spatial dimension in future policy development [51]. The synergic approach of applying ML and big data could reduce information asymmetry and the transactional cost of policy implementation [52]. As stated in [53], an integrated CAP is key to achieving food security both in the EU and under the UN sustainable development goals.

CONCLUSION
The present research proposed predictive analytic models, both linear and non-linear, characterized by very good to excellent performance metrics computed through machine learning algorithms. In this context, future iterations of this research should consider re-training the models on an extended dataset to reconfirm and enhance their predictive strength. Besides contributing to the agriculture and sustainability literature, this research offers managerial implications in the form of possible policy impact monitoring and controlling tools for strategy evaluation. Additionally, non-European countries that have or plan to adopt policies similar to the CAP could refer to these findings for policy development.
This study is not without limitations. Firstly, since the CAP has multidimensional implications, future research should consider exploring more variables and modeling the phenomenon with another approach, such as structural equation modeling, with the aim of capturing its complexity. Secondly, the CAP is specific to European countries, and so are the results. However, the developed models can be extended, and the research methodology can be adapted for further geo-specific elaboration or more advanced model development. The frameworks developed were intended to be general, since they are categorized as monitoring and controlling tools based on predictive analytic algorithms. Therefore, other countries that plan to develop a similar policy could refer to CAP studies, including this research.

Figure 1. Dataset proportion for data training, testing, and cross-validation

Table 2. Random Forest Feature Importance and Its Weight