Motivation

Agriculture and food production make up about 11% of global CO2 emission.

It needs scrutiny to understand this block better and therefore the following questions shall be answered in this work:
1. What are the main area inside agro-food industry that drive CO2 emissions?
2. Which countries contribute the most to these emissions?
3. How did emissions and temperatures evolve over time?
4. How are emissions and temperatures probably evolving during the next years?

Data preparation and exploration

Features explanation

Features

Savanna fires: Emissions from fires in savanna ecosystems.
Forest fires: Emissions from fires in forested areas.
Crop Residues: Emissions from burning or decomposing leftover plant material after crop harvesting.
Rice Cultivation: Emissions from methane released during rice cultivation.
Drained organic soils (CO2): Emissions from carbon dioxide released when draining organic soils.
Pesticides Manufacturing: Emissions from the production of pesticides.
Food Transport: Emissions from transporting food products.
Forestland: Land covered by forests.
Net Forest conversion: Change in forest area due to deforestation and afforestation.
Food Household Consumption: Emissions from food consumption at the household level.
Food Retail: Emissions from the operation of retail establishments selling food.
On-farm Electricity Use: Electricity consumption on farms.
Food Packaging: Emissions from the production and disposal of food packaging materials.
Agrifood Systems Waste Disposal: Emissions from waste disposal in the agrifood system.
Food Processing: Emissions from processing food products.
Fertilizers Manufacturing: Emissions from the production of fertilizers.
IPPU: Emissions from industrial processes and product use.
Manure applied to Soils: Emissions from applying animal manure to agricultural soils.
Manure left on Pasture: Emissions from animal manure on pasture or grazing land.
Manure Management: Emissions from managing and treating animal manure.
Fires in organic soils: Emissions from fires in organic soils.
Fires in humid tropical forests: Emissions from fires in humid tropical forests.
On-farm energy use: Energy consumption on farms.
Rural population: Number of people living in rural areas.
Urban population: Number of people living in urban areas.
Total Population - Male: Total number of male individuals in the population.
Total Population - Female: Total number of female individuals in the population.
total_emission: Total greenhouse gas emissions from various sources.
Average Temperature °C: The average increasing of temperature (by year) in degrees Celsius,

Preparation and exploration

library(data.table)
library(naniar)
library(WDI)
library(dplyr)
library(ggplot2)
library(geomtextpath)
library(caTools)
library(gbm)
library(rpart)
library(ipred) 
library(stats)
library(caret)
library(forecast)
library(relaimpo)
library(tidyr)
library(treemapify)
library(corrplot)
library(fabletools)
library(DT)
# Set working directory
setwd(dirname(rstudioapi::getSourceEditorContext()$path))

# Reading the data
cotwo = fread(paste0(getwd(), "/Agrofood_co2_emission.csv"))
addit = WDI(indicator = "NY.GDP.MKTP.CD", start = 1990, end = 2020, extra = TRUE)
# Change feature names for merging
colnames(addit)[colnames(addit) == "year"] = "Year"
colnames(addit)[colnames(addit) == "country"] = "Country"
colnames(addit)[colnames(addit) == "NY.GDP.MKTP.CD"] = "GDP"
cotwo = cotwo %>% rename(Country = Area,
                         temp_change = `Average Temperature °C`,
                         savanna_fires = `Savanna fires`,
                         forest_fires = `Forest fires`,
                         crop_residues = `Crop Residues`,
                         rice_cultiv = `Rice Cultivation`,
                         drained_organ_soil = `Drained organic soils (CO2)`,
                         pesticides_man = `Pesticides Manufacturing`,
                         food_transport = `Food Transport`,
                         net_forest_conversion = `Net Forest conversion`,
                         food_household_cons = `Food Household Consumption`,
                         food_retail = `Food Retail`,
                         onfarm_el_use = `On-farm Electricity Use`,
                         food_packaging = `Food Packaging`,
                         agrifood_sys_waste_disp = `Agrifood Systems Waste Disposal`,
                         food_processing = `Food Processing`,
                         fertil_manuf = `Fertilizers Manufacturing`,
                         manure_applied_soils = `Manure applied to Soils`,
                         manure_left_pasture = `Manure left on Pasture`,
                         manure_management = `Manure Management`,
                         fires_org_soils = `Fires in organic soils`,
                         fires_hum_trop = `Fires in humid tropical forests`,
                         of_en_use = `On-farm energy use`,
                         rural_population = `Rural population`,
                         urban_population = `Urban population`,
                         total_pop_M = `Total Population - Male`,
                         total_pop_F = `Total Population - Female`)
cotwo$Country = replace(cotwo$Country, cotwo$Country == "United Kingdom of Great Britain and Northern Ireland", "United Kingdom")
cotwo$Country = replace(cotwo$Country, cotwo$Country == "United Republic of Tanzania", "Tanzania")
cotwo$Country = replace(cotwo$Country, cotwo$Country == "United States of America", "United States")
cotwo$Country = replace(cotwo$Country, cotwo$Country == "United States Virgin Islands", "Virgin Islands (U.S.)")
cotwo$Country = replace(cotwo$Country, cotwo$Country == "Bolivia (Plurinational State of)", "Bolivia")
cotwo$Country = replace(cotwo$Country, cotwo$Country == "Venezuela (Bolivarian Republic of)", "Venezuela")
cotwo$Country = replace(cotwo$Country, cotwo$Country == "Iran (Islamic Republic of)", "Iran")
cotwo$Country = replace(cotwo$Country, cotwo$Country == "Republic of Korea", "South Korea")
cotwo$Country = replace(cotwo$Country, cotwo$Country == "Democratic People's Republic of Korea", "North Korea")
cotwo$Country = replace(cotwo$Country, cotwo$Country == "United Kingdom", "UK")
addit$Country = replace(addit$Country, addit$Country == "Venezuela, RB", "Venezuela")
addit$Country = replace(addit$Country, addit$Country == "Congo, Dem. Rep.", "Congo")
addit$Country = replace(addit$Country, addit$Country == "Egypt, Arab Rep.", "Egypt")
addit$Country = replace(addit$Country, addit$Country == "Iran, Islamic Rep.", "Iran")
addit$Country = replace(addit$Country, addit$Country == "Korea, Rep.", "South Korea")
addit$Country = replace(addit$Country, addit$Country == "United Kingdom", "UK")

cotwo = merge(cotwo, addit, by = c("Country", "Year")) # merge the emission dataset and the dataset with GDP and country codes
datatable(cotwo)
# Overview about the data
summary(cotwo)
##    Country               Year      savanna_fires        forest_fires     
##  Length:5872        Min.   :1990   Min.   :     0.00   Min.   :    0.00  
##  Class :character   1st Qu.:1998   1st Qu.:     0.00   1st Qu.:    0.00  
##  Mode  :character   Median :2005   Median :     3.84   Median :    1.36  
##                     Mean   :2005   Mean   :  1188.94   Mean   :  885.62  
##                     3rd Qu.:2013   3rd Qu.:   194.78   3rd Qu.:   89.05  
##                     Max.   :2020   Max.   :114616.40   Max.   :52227.63  
##                                                        NA's   :62        
##  crop_residues       rice_cultiv       drained_organ_soil pesticides_man   
##  Min.   :    0.00   Min.   :     0.0   Min.   :     0.0   Min.   :    0.0  
##  1st Qu.:   11.79   1st Qu.:   134.1   1st Qu.:     0.0   1st Qu.:    5.0  
##  Median :   98.32   Median :   335.2   Median :     0.0   Median :   16.0  
##  Mean   :  918.58   Mean   :  3965.5   Mean   :  3978.7   Mean   :  331.3  
##  3rd Qu.:  410.39   3rd Qu.:  1410.5   3rd Qu.:   980.1   3rd Qu.:  131.2  
##  Max.   :33490.07   Max.   :164915.3   Max.   :241025.1   Max.   :16459.0  
##  NA's   :834                                                               
##  food_transport       Forestland        net_forest_conversion
##  Min.   :    0.00   Min.   :-797183.1   Min.   :      0      
##  1st Qu.:   40.53   1st Qu.:  -3578.1   1st Qu.:      0      
##  Median :  255.59   Median :    -89.5   Median :    110      
##  Mean   : 2092.62   Mean   : -17863.6   Mean   :  17797      
##  3rd Qu.: 1336.29   3rd Qu.:      0.0   3rd Qu.:   5870      
##  Max.   :67945.76   Max.   : 171121.1   Max.   :1605106      
##                     NA's   :307         NA's   :307          
##  food_household_cons  food_retail        onfarm_el_use      
##  Min.   :     0.0    Min.   :     0.00   Min.   :     0.00  
##  1st Qu.:    12.9    1st Qu.:    26.79   1st Qu.:     8.34  
##  Median :   194.1    Median :   194.39   Median :    29.73  
##  Mean   :  4256.2    Mean   :  2051.42   Mean   :  1533.90  
##  3rd Qu.:  1517.5    3rd Qu.:  1115.03   3rd Qu.:   467.21  
##  Max.   :466288.2    Max.   :133784.07   Max.   :165676.30  
##  NA's   :194                                                
##  food_packaging      agrifood_sys_waste_disp food_processing   
##  Min.   :     0.00   Min.   :     0.34       Min.   :     0.0  
##  1st Qu.:    67.63   1st Qu.:   148.03       1st Qu.:   209.6  
##  Median :    67.63   Median :  1062.47       Median :   320.8  
##  Mean   :  1328.23   Mean   :  5921.27       Mean   :  3431.7  
##  3rd Qu.:   277.28   3rd Qu.:  3403.95       3rd Qu.:  1282.2  
##  Max.   :175741.31   Max.   :213289.70       Max.   :274253.5  
##                                                                
##   fertil_manuf            IPPU           manure_applied_soils
##  Min.   :     0.03   Min.   :      0.0   Min.   :    0.106   
##  1st Qu.:   356.49   1st Qu.:     52.6   1st Qu.:   22.888   
##  Median :   954.39   Median :    980.2   Median :  130.097   
##  Mean   :  2677.34   Mean   :  17266.8   Mean   :  862.938   
##  3rd Qu.:  2240.39   3rd Qu.:   6594.5   3rd Qu.:  506.727   
##  Max.   :170826.42   Max.   :1861640.7   Max.   :29865.389   
##                      NA's   :433         NA's   :617         
##  manure_left_pasture manure_management  fires_org_soils  fires_hum_trop    
##  Min.   :    0.0     Min.   :    0.57   Min.   :     0   Min.   :    0.00  
##  1st Qu.:  181.2     1st Qu.:   52.09   1st Qu.:     0   1st Qu.:    0.00  
##  Median : 1164.4     Median :  308.46   Median :     0   Median :    0.00  
##  Mean   : 3589.5     Mean   : 2129.13   Mean   :  1436   Mean   :  612.50  
##  3rd Qu.: 2742.0     3rd Qu.: 1188.42   3rd Qu.:     0   3rd Qu.:   23.89  
##  Max.   :92630.8     Max.   :70592.65   Max.   :991718   Max.   :51771.26  
##                      NA's   :617                         NA's   :124       
##    of_en_use         rural_population    urban_population   
##  Min.   :     0.03   Min.   :        0   Min.   :     3661  
##  1st Qu.:    18.30   1st Qu.:   222210   1st Qu.:   375578  
##  Median :   200.48   Median :  2105166   Median :  2841196  
##  Mean   :  2721.55   Mean   : 16658218   Mean   : 16416135  
##  3rd Qu.:  1281.21   3rd Qu.:  9057854   3rd Qu.:  8793654  
##  Max.   :139388.92   Max.   :900099113   Max.   :902077760  
##  NA's   :773                                                
##   total_pop_M         total_pop_F        total_emission     temp_change     
##  Min.   :     4410   Min.   :     4579   Min.   :-391884   Min.   :-1.4158  
##  1st Qu.:   421068   1st Qu.:   393446   1st Qu.:   5514   1st Qu.: 0.5202  
##  Median :  2858045   Median :  2832316   Median :  13441   Median : 0.8518  
##  Mean   : 16764119   Mean   : 16553710   Mean   :  60877   Mean   : 0.8876  
##  3rd Qu.: 10051261   3rd Qu.: 10161008   3rd Qu.:  40593   3rd Qu.: 1.2190  
##  Max.   :743586579   Max.   :713341908   Max.   :3115114   Max.   : 3.5581  
##                                                                             
##     iso2c              iso3c                GDP               status         
##  Length:5872        Length:5872        Min.   :9.543e+06   Length:5872       
##  Class :character   Class :character   1st Qu.:3.190e+09   Class :character  
##  Mode  :character   Mode  :character   Median :1.362e+10   Mode  :character  
##                                        Mean   :2.743e+11                     
##                                        3rd Qu.:9.505e+10                     
##                                        Max.   :2.152e+13                     
##                                        NA's   :211                           
##  lastupdated           region            capital           longitude        
##  Length:5872        Length:5872        Length:5872        Length:5872       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##    latitude            income            lending         
##  Length:5872        Length:5872        Length:5872       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
## 
cotwo = cotwo[, -c("iso2c", "status", "capital", "lending", "lastupdated")] # remove unneeded columns
# Change from character to numeric
cotwo$longitude = as.numeric(cotwo$longitude)
cotwo$latitude = as.numeric(cotwo$latitude)

Data consistency check

# Check for consistency in the data
(rowSums(cotwo[, 3:25], na.rm = TRUE) == cotwo$total_emission)[1:10] # get first impression
##  [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE
check = data.table("sum" = rowSums(cotwo[, 3:25], na.rm = TRUE),
                   "actual" = cotwo$total_emission) # get deeper insights
sum(round(rowSums(cotwo[, 3:25], na.rm = TRUE), 1) == round(cotwo$total_emission, 1)) / length(round(rowSums(cotwo[, 3:25], na.rm = TRUE), 1) == round(cotwo$total_emission, 1)) # proof that everything fits
## [1] 1
rm(check)


The single features values sum up to the total value.

Missing values

# Missing values
gg_miss_var(cotwo)

There are missing values which will not be removed per sé, but dealt with in accordance with the necessities in the specific application.

Feature engineering

# Feature engineering
cotwo = merge(x = cotwo, y = cotwo[, mean(temp_change), by = Year], by = "Year")
setorder(cotwo, Country, Year)
colnames(cotwo)[length(cotwo)] = "world_temp_change" # change for every year, worldwide average
cotwo$gdp_capita = cotwo$GDP / (cotwo$total_pop_M + cotwo$total_pop_F) # GDP/capita
cotwo$total_pop = cotwo$total_pop_M + cotwo$total_pop_F # total population

Drivers of emissions

Direct

The items that are under scrutiny, here, are parts of complete emissions and sum up to the total emission as shown in the data preparation.

What are the main areas that contribute directly to emissions?

# Main drivers of emissions
# 1. Direct
driver_direct = data.table("name" = colnames(cotwo)[3:25],
                           "sum" = 0) # take the directly contributing features
for (i in 3:25){
  driver_direct[i-2,2] = sum(cotwo[,..i], na.rm = TRUE)
}

driver_direct = driver_direct[sum > 0] # filter out those countries that hav a negative balance
driver_direct = driver_direct[order(-rank(sum), name)]
driver_direct$name = factor(driver_direct$name, levels = rev(as.character(driver_direct$name)))
non_grey_names = levels(driver_direct$name)[15:length(levels(driver_direct$name))]
# Plot the result
ggplot(driver_direct, aes(x="", y=sum, fill=name)) +
  geom_bar(width = 1, size = 1, color = "white", stat = "identity") +
  coord_polar(theta = "y", start=0) +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        panel.grid  = element_blank(),
        legend.position = "right") +
  labs(x = NULL, y = NULL, fill = NULL, 
       title = "Contributing areas") +
  guides(fill = guide_legend(reverse = TRUE)) +
  scale_fill_manual(values = c("cyan2", "cadetblue", "dodgerblue", "darkorchid", "blueviolet", "blue", "#254290", "black", rep("grey", 14)), breaks = non_grey_names) +
  geom_text(aes(label = c(scales::percent(round(sum / sum(driver_direct$sum),2))[1:8], rep("",14)), x = 1.65),
            position = position_stack(vjust = 0.5)
            )



Which types of fire and where are especially prominent when it comes to emissions?
Fires contribute with 6.7% to agricultural emissions, which is a lot but considering that agricultural emissions themselves make up 11% of overall emissions.

fire_data = cotwo[, c("Year", "Country", "region", "savanna_fires", "forest_fires", "fires_org_soils", "fires_hum_trop", "temp_change", "world_temp_change")] # take all the fire-relevant feature
fire_data2 = data.table(fire_data[, -c("temp_change", "world_temp_change", "sum_fire", "Year", "Country")] %>% gather(key = "variable", value = "value", -region)) # re-shape the data
fire_data2 = fire_data2[, sum(value, na.rm = TRUE), by = c("variable","region")] # aggregate by fire-type (variable) and region
fire_data2 = fire_data2[complete.cases(fire_data2), ] # remove cases with NAs
colnames(fire_data2)[3] = "Value"
print(paste("Contribution of fires to agricultural CO2 emissions:", round((sum(fire_data2$Value, na.rm = TRUE)/sum(cotwo$total_emission))*100,2), "%"))
## [1] "Contribution of fires to agricultural CO2 emissions: 6.72 %"
print(paste("Contribution of fires to overall CO2 emissions:", round((sum(fire_data2$Value,na.rm = TRUE)/sum(cotwo$total_emission))*0.11*100,2), "%"))
## [1] "Contribution of fires to overall CO2 emissions: 0.74 %"
ggplot(fire_data2, aes(area = Value, fill = region, label = variable, subgroup = region)) + 
  geom_treemap(layout = "squarified") + 
  geom_treemap_text(place = "centre", size = 12) +
  scale_fill_manual(values = c("South Asia" = "blue",
                               "Europe & Central Asia" = "red",
                               "Middle East & North Africa" = "green",
                               "East Asia & Pacific" = "yellow",
                               "Sub-Saharan Africa" = "purple",
                               "Latin America & Caribbean" = "brown",
                               "North America" = "orange")) +
  labs(title = "Absolute contribution to CO2 emissions")


The largest emission caused by fire of any kind in the agriculture can be found in East Asia & Pacific and Sub-Saharan Africa.

Indirect


Those drivers are not parts of the overall sum but relate to the characteristics of a country, like gdp or capita.
They will be assessed in several machine learning models to quantify their explanatory power.

# 2. Indirect

# Preparation of for ML models
set.seed(1)
ml_drivers = cotwo[, -c("of_en_use")]

# Feature engineering
ml_drivers$income = as.factor(ml_drivers$income)
ml_drivers$region = as.factor(ml_drivers$region)
ml_drivers$GDP = ml_drivers$GDP / 1000000000 # otherwise the function will give an error
ml_drivers$total_pop = ml_drivers$total_pop_F + ml_drivers$total_pop_M
ml_drivers$rural_population = ml_drivers$rural_population / (ml_drivers$urban_population + ml_drivers$rural_population)
ml_drivers$total_pop_M = ml_drivers$total_pop_M / (ml_drivers$total_pop_M + ml_drivers$total_pop_F)
ml_drivers = ml_drivers[complete.cases(ml_drivers)]
ml_drivers = cbind(Id = seq(from = 1, to = nrow(ml_drivers)), ml_drivers)

# Prepare training and test data
split = sample.split(ml_drivers$Id, SplitRatio = 0.7)
train = ml_drivers[split, ]
test = ml_drivers[!split, ]

ml_summary = data.table("true" = test$total_emission,
                        "lm" = 0,
                        "gbm" = 0)

M = cor(ml_drivers[complete.cases(ml_drivers), c("Year", "rural_population", "total_pop_M", "GDP", "longitude" , "latitude", "gdp_capita")])
col = colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
corrplot(M, method="color", col=col(200),  
         type="upper", order="hclust", 
         addCoef.col = "black", # Add coefficient of correlation
         tl.col="black", tl.srt=45, #Text label color and rotation
         # Combine with significance
         sig.level = 0.01, insig = "blank", 
         # hide correlation coefficient on the principal diagonal
         diag=FALSE 
         )


The correlation matrix looks acceptable. There are surprising or concerning correlations between the predictors. A positive correlation between latitude and GDP per capita can be observed what makes intuitively sense as northern countries tend to be industrialized and developed. In addition, it seems that the higher the percentage of rural population, the lower is the gdp per capita what is also intuitive.

Linear Model

# LM
lm_mod = lm(total_emission ~ Year + rural_population + total_pop_M + GDP + region + income + longitude + latitude + gdp_capita,
         data = train) # fit the model
summary(lm_mod)
## 
## Call:
## lm(formula = total_emission ~ Year + rural_population + total_pop_M + 
##     GDP + region + income + longitude + latitude + gdp_capita, 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -965084  -55204  -11346   28869 1663085 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      -1.988e+05  7.467e+05  -0.266  0.79012    
## Year                              2.542e+01  3.707e+02   0.069  0.94533    
## rural_population                 -4.165e+04  2.415e+04  -1.725  0.08461 .  
## total_pop_M                       8.464e+05  1.582e+05   5.351 9.36e-08 ***
## GDP                               1.044e+02  2.810e+00  37.175  < 2e-16 ***
## regionEurope & Central Asia      -2.150e+05  2.144e+04 -10.028  < 2e-16 ***
## regionLatin America & Caribbean  -3.044e+05  4.201e+04  -7.246 5.35e-13 ***
## regionMiddle East & North Africa -2.474e+05  2.177e+04 -11.365  < 2e-16 ***
## regionNorth America              -4.718e+05  4.772e+04  -9.889  < 2e-16 ***
## regionSouth Asia                 -9.758e+04  1.965e+04  -4.965 7.24e-07 ***
## regionSub-Saharan Africa         -2.416e+05  2.664e+04  -9.070  < 2e-16 ***
## incomeLow income                  4.388e+04  1.805e+04   2.431  0.01511 *  
## incomeLower middle income         3.984e+04  1.358e+04   2.934  0.00337 ** 
## incomeNot classified              1.322e+05  4.388e+04   3.012  0.00261 ** 
## incomeUpper middle income         7.628e+04  1.100e+04   6.933 4.96e-12 ***
## longitude                        -9.020e+02  1.977e+02  -4.562 5.25e-06 ***
## latitude                         -2.798e+02  2.604e+02  -1.074  0.28269    
## gdp_capita                       -1.742e+00  3.048e-01  -5.716 1.19e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 173800 on 3193 degrees of freedom
## Multiple R-squared:  0.4259, Adjusted R-squared:  0.4228 
## F-statistic: 139.3 on 17 and 3193 DF,  p-value: < 2.2e-16
ml_summary$lm = predict(lm_mod, test) # make predictions

rel_imp = calc.relimp(lm_mod, type = "lmg", importance = TRUE) # get the relative importance
lm_summary = data.table("Feature" = rownames(as.data.frame(rel_imp@lmg)),
                        "Value" = as.numeric(rel_imp@lmg))

ggplot(lm_summary, aes(x = reorder(Feature, -Value), y = Value)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1),
        panel.background = element_rect(fill = "white", colour="white"),
        axis.line = element_line(colour = "black"))  +
  labs(title = "Feature importance for lm",
       x = "Features",
       y = "Importance")


The by far most important featur according to the linear model is the GDP. A minor influence has the region. All other features seem to be irrelevant.

Boosting

# Boosting
gb_mod = gbm(total_emission ~ Year + rural_population + total_pop_M + GDP + region + income + longitude + latitude + gdp_capita,
          data = train,
          distribution = "gaussian",
          n.trees = 1000, shrinkage = 0.01,
          interaction.depth = 4,
          bag.fraction = 0.7,
          n.minobsinnode = 5) # fit the model
gb_summary = data.table("Feature" = summary(gb_mod)$var,
                        "Value" = summary(gb_mod)$rel.inf)
ggplot(gb_summary, aes(x = reorder(Feature, -Value), y = Value)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1),
        panel.background = element_rect(fill = "white", colour="white"),
        axis.line = element_line(colour = "black"))  +
  labs(title = "Feature importance for boosting",
       x = "Features",
       y = "Importance")

ml_summary$gbm = predict(gb_mod, test) # make predictions


Boosting gives the greatest importance to the GDP as well, as could be observed for the linear model. GDP per capita is also considered very important. All other features do not seem to have a big influence.

Bagging

# Bagging
ba_mod = bagging( 
  formula = total_emission ~ Year + rural_population + total_pop_M + GDP + region + income + longitude + latitude + gdp_capita, 
  data = train, 
  nbagg = 50,    
  coob = TRUE, 
  control = rpart.control(minsplit = 2, cp = 0,
                          min_depth=2) 
) # fit the mdoel
summary(ba_mod)
##        Length Class      Mode   
## y      3211   -none-     numeric
## X         9   data.frame list   
## mtrees   50   -none-     list   
## OOB       1   -none-     logical
## comb      1   -none-     logical
## err       1   -none-     numeric
## call      6   -none-     call
ml_summary$ba = predict(ba_mod, test) # make predictions
ba_summary = varImp(ba_mod)

ggplot(ba_summary, aes(x = reorder(rownames(ba_summary), -Overall), y = Overall)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1),
        panel.background = element_rect(fill = "white", colour="white"),
        axis.line = element_line(colour = "black")) +
  labs(title = "Feature importance for bagging",
       x = "Features",
       y = "Importance")


In contrast to the models before, bagging gives nearly the same high importance to 5 features, GDP being one of them, but not the most important feature. Regionality and income seem to unimportant.

Summary

# Get the Mean Absolute Error of all methods
MAE(ml_summary$true-ml_summary$lm)
## [1] 79000.25
MAE(ml_summary$true-ml_summary$gbm)
## [1] 24439.34
MAE(ml_summary$true-ml_summary$ba)
## [1] 10168.13


Indeed, the bagging model delivers the lowest MAE (Mean Absolute Error) and therefore the features named by this model can be seen important.

Contributing countries

In this chapter, it shall be assessed, which countries contribute the most to agricultural CO2 emission.

# Treemap
treemap_data = cotwo[, sum(total_emission), by = Country]
colnames(treemap_data)[2] = "Sum"
treemap_data = merge(x = treemap_data, y = cotwo[,c("Country","region","iso3c","total_pop_M","total_pop_F")], by = "Country", all.x = TRUE)
treemap_data = as.data.frame(treemap_data[!duplicated(Country)])
treemap_data = treemap_data[complete.cases(treemap_data),]
treemap_data$emission_per_capita = treemap_data$Sum / (treemap_data$total_pop_M + treemap_data$total_pop_F)

ggplot(treemap_data["Sum" > 0, ], aes(area = Sum, fill = region, label = iso3c, subgroup = region)) + 
  geom_treemap(layout = "squarified") + 
  geom_treemap_text(place = "centre", size = 12) +
  scale_fill_manual(values = c("South Asia" = "blue",
                               "Europe & Central Asia" = "red",
                               "Middle East & North Africa" = "green",
                               "East Asia & Pacific" = "yellow",
                               "Sub-Saharan Africa" = "purple",
                               "Latin America & Caribbean" = "brown",
                               "North America" = "orange")) +
  labs(title = "Absolute contribution to CO2 emissions")


East Asia and Pacific together with Latin America and the Caribbean make up more than half of the global emissions, with China being the largest pollutant in absolute numbers.

Emissions over time

In this chapter, it shall be examined how emission in general in different regions developed over time.

How did overall emissions develop worldwide?

# Which countries contributed most to emissions?
options(scipen = 999) # avoid scientific notation
em_overall = cotwo[, sum(total_emission), by = Year] # aggregate total emissions by year (sum)
colnames(em_overall)[2] = "World"
em_regions = cotwo[, sum(total_emission), by = c("Year","region")]
em_regions = em_regions[complete.cases(em_regions),]

# Transform data so that they can be merged
em_regions = em_regions %>%
  mutate(region = as.factor(region)) %>%
  spread(key = region, value = V1)

em_regions = merge(x = em_overall, em_regions, by = "Year")

# Re-transform data
em_regions = em_regions %>%
  gather(key = "variable", value = "value", -Year)

em_regions = em_regions[seq(0, nrow(em_regions), 2),] # delete every second row
colnames(em_regions)[2] = "Region"

ggplot(em_regions, aes(x = Year, y = value)) +
  geom_point(aes(color = Region)) +
  geom_line(aes(color = Region)) +
  scale_color_manual(values = c("World" = "black",
                                "South Asia" = "blue",
                                "Europe & Central Asia" = "red",
                                "Middle East & North Africa" = "green",
                                "East Asia & Pacific" = "yellow",
                                "Sub-Saharan Africa" = "purple",
                                "Latin America & Caribbean" = "brown",
                                "North America" = "orange")) +
  labs(title = "Emission development over time",
       x = "Year",
       y = "Emission in kt") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
        panel.background = element_rect(fill = "white", colour = "white"),
        axis.line = element_line(colour = "black"))


There is a clear upward tendency for all bigger regions but most significant for East Asia and Pacific.

How did emissions change in absolute values for single countries?

# Development
emission_dev = data.table("Country" = cotwo$Country,
                          "Year" = cotwo$Year,
                          "emission" = cotwo$total_emission)
emission_dev = emission_dev[Year < 1993 | Year > 2017]
emission_dev$Year = ifelse(emission_dev$Year < 2000, "Beginning", "Ending")
emission_dev = emission_dev[,mean(emission), by = c("Country","Year")]
colnames(emission_dev)[3] = "Mean"
emission_dev2 = emission_dev[, diff(Mean), by = Country]
colnames(emission_dev2)[2] = "Abs"
emission_dev2 = merge(x = emission_dev2, y = emission_dev[, min(Mean), by = Country], by = "Country")
colnames(emission_dev2)[3] = "Min"
emission_dev2$Rel = emission_dev2$Abs / abs(emission_dev2$Min)

options(scipen = 999)
# Absolute
abs_plot = emission_dev2[order(emission_dev2$Abs),]
abs_plot = abs_plot[c(1:5, ((nrow(abs_plot)-4):nrow(abs_plot)))]
abs_plot$label = as.factor(abs_plot$Abs > 0)

ggplot(abs_plot, aes(x = reorder(Country, -Abs), y = Abs)) +
  geom_bar(stat = "identity", aes(fill = label)) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
        panel.background = element_rect(fill = "white", colour = "white"),
        axis.line = element_line(colour = "black"),
        legend.position="none") +
  labs(title = "Absolute change in emission for top and bottom countries",
       x = "Countries",
       y = "Change in kt") +
  scale_fill_manual(values = c("TRUE" = "red", "FALSE" = "darkgreen"))


From the beginning of the dataset until 2020, the biggest increase in emissions can be found in China, India and the USA. Brazil is the only country that could achieve a significant decrease.

How did emissions change in relative values for single countries?

# Relative
rel_plot = emission_dev2[order(emission_dev2$Rel),]
rel_plot = rel_plot[c(1:5, ((nrow(rel_plot)-4):nrow(rel_plot)))]
rel_plot$label = as.factor(rel_plot$Rel > 0)

ggplot(rel_plot, aes(x = reorder(Country, -Rel), y = Rel)) +
  geom_bar(stat = "identity", aes(fill = label)) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
        panel.background = element_rect(fill = "white", colour = "white"),
        axis.line = element_line(colour = "black"),
        legend.position="none") +
  labs(title = "Relative change in emission for top and bottom countries",
       x = "Countries",
       y = "Relative change in %") +
  scale_fill_manual(values = c("TRUE" = "red", "FALSE" = "darkgreen"))


Compared to the their starting point, Vietnam experienced the highest increase (by 10x). Several countries managed to cut emission compared to their starting point.

Temperature development

As the development of emissions has been examined before, accordingly will the temperature be examined closer in this paragraph.

How did the temperature in different regions of the world develop compared to the world-wide average?

# Which countries experience the biggest increase in temperature?

temp_regions = cotwo[, mean(temp_change), by = c("Year","region")]
temp_regions = temp_regions[complete.cases(temp_regions),]

temp_regions = temp_regions %>%
  mutate(region = as.factor(region)) %>%
  spread(key = region, value = V1)

temp_regions = data.table(merge(x = temp_regions, y = cotwo[,c("Year", "world_temp_change")], by = "Year", all.y = FALSE))
temp_regions = temp_regions[!duplicated(Year),]

temp_regions = temp_regions %>%
  gather(key = "variable", value = "value", -Year)

temp_regions = temp_regions[seq(0, nrow(temp_regions), 3),] # delete every second row
colnames(temp_regions)[2] = "Region"

ggplot(temp_regions, aes(x = Year, y = value)) +
  geom_point(aes(color = Region)) +
  geom_line(aes(color = Region, size = ifelse(Region == "world_temp_change", 0.5, 0.1))) +
  scale_color_manual(values = c("South Asia" = "blue",
                                "Europe & Central Asia" = "red",
                                "Middle East & North Africa" = "green",
                                "East Asia & Pacific" = "yellow",
                                "Sub-Saharan Africa" = "purple",
                                "Latin America & Caribbean" = "brown",
                                "North America" = "orange",
                                "world_temp_change" = "black")) +
  labs(title = "Temperature development over time",
       x = "Year",
       y = "Tmperature increase [K]") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
        panel.background = element_rect(fill = "white", colour = "white"),
        axis.line = element_line(colour = "black")) +
  guides(size = "none")


Europe, Middle East & Africa lay mostly above the world average while South Asia and Latin America & Caribean tend to be below.

temp_regions2 = cotwo[, mean(temp_change), by = "region"]
colnames(temp_regions2)[2] = "Average"
temp_regions2 = temp_regions2[complete.cases(temp_regions2),]
de = data.frame("World",mean(cotwo$world_temp_change))
names(de) = c("region","Average")
temp_regions2 = rbind(temp_regions2, de)
temp_regions2$class = ifelse(temp_regions2$Average > as.numeric(temp_regions2[region == "World",2]), "above", "below")
temp_regions2$class = ifelse(temp_regions2$Average == as.numeric(temp_regions2[region == "World",2]), "World", temp_regions2$class)

ggplot(temp_regions2, aes(x = reorder(region, -Average), y = Average, fill = class)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("above" = "red",
                                "World" = "grey",
                                "below" = "blue")) +
  labs(title = "Temperature increase in regions (1990-2020)",
       x = "Region",
       y = "Temperature increase [K]") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
        panel.background = element_rect(fill = "white", colour = "white"),
        axis.line = element_line(colour = "black")) +
  guides(fill = "none")


The results from the previous graph can be confirmed in the average of the temperature increase over the last 20 years.

Worldwide overview

How much CO2 in agriculture are single countries emitting?

# Worldmap
world = map_data("world")

world_data = cotwo[, sum(total_emission), by = Country]
colnames(world_data)[length(world_data)] = "Emission"
world_data = merge(x = world_data, y = cotwo[, mean(total_pop), by = Country])
colnames(world_data)[length(world_data)] = "Mean_pop"

world_data$Country = replace(world_data$Country, world_data$Country == "United States", "USA")
world$region = replace(world$region, world$region == "Democratic Republic of the Congo", "Congo")
world$region = replace(world$region, world$region == "Russia", "Russian Federation")

world_data = merge(x = world_data, y = world, by.x = "Country", by.y = "region")
world_data$Em_per_capita = world_data$Emission / world_data$Mean_pop
# world_data = world_data[!duplicated(Country)]

ggplot(data = world_data[Mean_pop > 1000000], mapping = aes(x = long, y = lat, group = group)) + 
  coord_fixed(1.3) +
  geom_polygon(aes(fill = Emission)) +
  scale_fill_distiller(palette ="RdBu", direction = -1) + # or direction=1
  ggtitle("Emissions by country") +
  theme(
  axis.text = element_blank(),
  axis.line = element_blank(),
  axis.ticks = element_blank(),
  panel.border = element_blank(),
  panel.grid = element_blank(),
  axis.title = element_blank(),
  panel.background = element_rect(fill = "white"),
  plot.title = element_text(hjust = 0.5)
)


China appears to have the by far biggest part, closely followed by Brazil and the US.

How much CO2 in agriculture are single countries emitting per capita?

ggplot(data = world_data[Mean_pop > 1000000], mapping = aes(x = long, y = lat, group = group)) + 
  coord_fixed(1.3) +
  geom_polygon(aes(fill = Em_per_capita)) +
  scale_fill_distiller(palette ="RdBu", direction = -1) + # or direction=1
  ggtitle("Emissions per capita by country") +
  theme(
  axis.text = element_blank(),
  axis.line = element_blank(),
  axis.ticks = element_blank(),
  panel.border = element_blank(),
  panel.grid = element_blank(),
  axis.title = element_blank(),
  panel.background = element_rect(fill = "white"),
  plot.title = element_text(hjust = 0.5)
)


One can see very many blue/low emitting countries but this is due to the fact that single countries like Botsuana have a very high value what strechtes the scale. From the industrialized countries, Canada and Australia also have a relatively high value.

How did temperatures change worldwide?

# Temperature
world_data = cotwo[, mean(temp_change), by = Country]
colnames(world_data)[length(world_data)] = "Temp"

world_data$Country = replace(world_data$Country, world_data$Country == "United States", "USA")
world$region = replace(world$region, world$region == "Democratic Republic of the Congo", "Congo")
world$region = replace(world$region, world$region == "Russia", "Russian Federation")

world_data = merge(x = world_data, y = world, by.x = "Country", by.y = "region")
# world_data = world_data[!duplicated(Country)]

ggplot(data = world_data, mapping = aes(x = long, y = lat, group = group)) + 
  coord_fixed(1.3) +
  geom_polygon(aes(fill = Temp)) +
  scale_fill_distiller(palette ="RdBu", direction = -1) + # or direction=1
  ggtitle("Temperature increase by country") +
  theme(
  axis.text = element_blank(),
  axis.line = element_blank(),
  axis.ticks = element_blank(),
  panel.border = element_blank(),
  panel.grid = element_blank(),
  axis.title = element_blank(),
  panel.background = element_rect(fill = "white"),
  plot.title = element_text(hjust = 0.5)
)


One can clearly observe that countries in the northern hemisphere suffer from a much greater temperature increase than the countries in the south.

Forecasting

How does the future of emissions in agriculture and temperature development look like?

# Transform the data to a timeseries object
forecasting_data = ts(data.table("Year" = unique(cotwo$Year),
                                 "temp" = unique(cotwo$world_temp_change),
                                 "emission" = cotwo[, sum(total_emission), by = Year]$V1))


# Emission
arima_em = auto.arima(forecasting_data[,3]) # fit the arima model
fc_em = forecast(arima_em, h = 10) # forecast 10 steps
plot_data = data.table("Year" = c(forecasting_data[,1], seq(2021,2030,1)),
                       "old" = c(forecasting_data[,3], fc_em$mean),
                       "Lower 80%" = c(rep(NA, nrow(forecasting_data)), fc_em$lower[1:10]),
                       "Lower 95%" = c(rep(NA, nrow(forecasting_data)), fc_em$lower[11:20]),
                       "Upper 80%" = c(rep(NA, nrow(forecasting_data)), fc_em$upper[1:10]),
                       "Upper 95%" = c(rep(NA, nrow(forecasting_data)), fc_em$upper[11:20]))

plot_data = plot_data %>% gather(key = "variable", value = "value", -Year)

ggplot(plot_data, aes(x = Year, y = value)) +
  geom_line(aes(color = variable)) +
  scale_color_manual(values = c("black" = "grey",
                                "Lower 80%" = "lightblue",
                                "Upper 80%" = "lightblue",
                                "Lower 95%" = "blue",
                                "Upper 95%" = "blue")) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
        panel.background = element_rect(fill = "white", colour = "white"),
        axis.line = element_line(colour = "black")) +
  labs(title = "Emission forecast",
       x = "Year",
       y = "Emission forecast") +
  guides(color = guide_legend(title = "Confidence level"))


There is no seasonality and the ARIMA model captures the trend appropriately. The predicted trend is positive not perfectly linear and so are the intervals.

arima_temp = auto.arima(forecasting_data[,2]) # fit the ARIMA model
fc_temp = forecast(arima_temp, h = 10) # forecast 10 steps
plot_data = data.table("Year" = c(forecasting_data[,1], seq(2021,2030,1)),
                       "old" = c(forecasting_data[,2], fc_temp$mean),
                       "Lower 80%" = c(rep(NA, nrow(forecasting_data)), fc_temp$lower[1:10]),
                       "Lower 95%" = c(rep(NA, nrow(forecasting_data)), fc_temp$lower[11:20]),
                       "Upper 80%" = c(rep(NA, nrow(forecasting_data)), fc_temp$upper[1:10]),
                       "Upper 95%" = c(rep(NA, nrow(forecasting_data)), fc_temp$upper[11:20]))

plot_data = plot_data %>% gather(key = "variable", value = "value", -Year)

ggplot(plot_data, aes(x = Year, y = value)) +
  geom_line(aes(color = variable)) +
  scale_color_manual(values = c("black" = "grey",
                                "Lower 80%" = "lightblue",
                                "Upper 80%" = "lightblue",
                                "Lower 95%" = "blue",
                                "Upper 95%" = "blue")) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
        panel.background = element_rect(fill = "white", colour = "white"),
        axis.line = element_line(colour = "black")) +
  labs(title = "Temperature increase forecast",
       x = "Year",
       y = "temperature increase [K]") +
  guides(color = guide_legend(title = "Confidence level"))


The ARIMA model clearly captures the trend. As there is not seasonality, just a straight line including a positive trend can be observed as forecast. So are the prediction intervals.

References

Packages

Barrett T, Dowle M, Srinivasan A, Gorecki J, Chirico M, Hocking T (2024). data.table: Extension of data.frame. R package version 1.15.2, https://CRAN.R-project.org/package=data.table.

Tierney N, Cook D (2023). “Expanding Tidy Data Principles to Facilitate Missing Data Exploration, Visualization and Assessment of Imputations.” Journal of Statistical Software, 105(7), 1-31. doi:10.18637/jss.v105.i07 https://doi.org/10.18637/jss.v105.i07.

Arel-Bundock V (2022). WDI: World Development Indicators and Other World Bank Data. R package version 2.7.8, https://CRAN.R-project.org/package=WDI.

Wickham H, François R, Henry L, Müller K, Vaughan D (2023). dplyr: A Grammar of Data Manipulation. R package version 1.1.4, https://CRAN.R-project.org/package=dplyr.

H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.

Cameron A, van den Brand T (2024). geomtextpath: Curved Text in ‘ggplot2’. R package version 0.1.4, https://CRAN.R-project.org/package=geomtextpath.

Tuszynski J (2024). caTools: Tools: Moving Window Statistics, GIF, Base64, ROC AUC, etc. R package version 1.18.3, https://CRAN.R-project.org/package=caTools.

Ridgeway G, Developers G (2024). gbm: Generalized Boosted Regression Models. R package version 2.2.2, https://CRAN.R-project.org/package=gbm.

Therneau T, Atkinson B (2023). rpart: Recursive Partitioning and Regression Trees. R package version 4.1.23, https://CRAN.R-project.org/package=rpart.

Peters A, Hothorn T (2024). ipred: Improved Predictors. R package version 0.9-15, https://CRAN.R-project.org/package=ipred.

R Core Team (2024). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.

Kuhn, M. (2008). Building Predictive Models in R Using the caret Package. Journal of Statistical Software, 28(5), 1–26. https://doi.org/10.18637/jss.v028.i05\
Hyndman R, Athanasopoulos G, Bergmeir C, Caceres G, Chhay L, O’Hara-Wild M, Petropoulos F, Razbash S, Wang E, Yasmeen F (2024). forecast: Forecasting functions for time series and linear models. R package version 8.23.0, https://pkg.robjhyndman.com/forecast/.

Hyndman RJ, Khandakar Y (2008). “Automatic time series forecasting: the forecast package for R.” Journal of Statistical Software, 27(3), 1-22. doi:10.18637/jss.v027.i03 https://doi.org/10.18637/jss.v027.i03.

Ulrike Groemping (2006). Relative Importance for Linear Regression in R: The Package relaimpo. Journal of Statistical Software, 17(1), 1–27.

Wickham H, Vaughan D, Girlich M (2024). tidyr: Tidy Messy Data. R package version 1.3.1, https://CRAN.R-project.org/package=tidyr.

Wilkins D (2023). treemapify: Draw Treemaps in ‘ggplot2’. R package version 2.5.6, https://CRAN.R-project.org/package=treemapify.

Taiyun Wei and Viliam Simko (2024). R package ‘corrplot’: Visualization of a Correlation Matrix (Version 0.95). Available from https://github.com/taiyun/corrplot\
O’Hara-Wild M, Hyndman R, Wang E (2024). fabletools: Core Tools for Packages in the ‘fable’ Framework. R package version 0.4.2, https://CRAN.R-project.org/package=fabletools.