Visual Analytics for Motor Insurance Portfolio Discovery

Author

Wu Jiayan

Published

May 10, 2026

Modified

May 19, 2026

1 Introduction

1.1 Background and Objectives

This report documents the complete visual analytics investigation of a real-world motor vehicle insurance portfolio from a Spanish non-life insurer. The dataset, sourced from Mendeley Data, contains policy-year records spanning 2022–2024 with information on policyholder demographics, vehicle characteristics, coverage types, premiums, and claims outcomes.

Investigation objectives:

  1. Apply univariate visual analytics to characterise the distributions and composition of numerical and categorical variables.
  2. Apply bivariate visual analytics to investigate and statistically validate factors influencing portfolio profitability (loss ratio, profit per policy) and volatility (claim frequency, claim severity).

1.2 Dataset Description

The dataset is organised into four variable groups as defined in the official data cookbook:

Policy and Insured Characteristics — variables describing the contract itself:

Variable Description
insured_id Unique sequential ID per policy number; repeated policies share the same ID
year Calendar year of the record
policy_type Coverage structure: TP (basic liability), TPG (liability + glass), CC (liability + 2+ extra covers), COMP_N (comprehensive without excess), COMP_E (comprehensive with excess)
policy_status Active (A) or Cancelled (C) at time of analysis
business_type New Business (NB) or Portfolio renewal (P)
payment_frequency Annual (A), Semi-annual (S), or Quarterly (Q) billing
bonus_score Past claims experience: Good/G (favourable), Neutral/N, Bad/B (poor history)

Driver and Vehicle Characteristics — risk-rating variables:

Variable Description
driver_age Age of the main driver in years
vehicle_age Age of the insured vehicle in years
age_driving_licence Calendar year the licence was issued (not years of experience held)
fuel_type Gasoline (G) or Diesel (D) only — other fuel types excluded
vehicle_value Declared insured value; some entries missing
seats Number of seats in the vehicle
power_to_weight_ratio Approximate kg per horsepower — proxy for performance
vehicle_brand Brand as reported in policy data
municipality_type Inland (I), Coastal (C), or Islands (IS)
circulation_area Predominant driving environment: Urban (U) or Rural (R)

Premiums — eight coverage-level premium components that sum to total_premium: liability, property damage, theft, fire, glass, legal protection, occupants, plus the total.

Exposure, Claims and Incurred — outcome variables: total_exposure (policy years on risk, 0–1 for partial years, may exceed 1 for multi-year contracts), claim counts per coverage line, and incurred costs (paid amounts plus outstanding reserves) per coverage line.

2 Getting Started

Following R packages will be used:

  • tidyverse: Core data manipulation (dplyr, tidyr, readr, forcats, stringr)

  • ggplot2: Grammar of graphics visualisation engine

  • scales: Axis formatting helpers (percent, comma, dollar)

  • patchwork: Multi-panel plot composition

  • ggridges: Ridge / joy plots for distributional comparison

  • ggcorrplot: Correlation matrix heat maps

  • ggrepel: Non-overlapping text labels knitr: Table rendering in Quarto

  • kableExtra: Html table styling

pacman::p_load(ggcorrplot, ggplot2, tidyverse, scales, patchwork, ggridges, ggrepel, knitr, kableExtra)

Following code chunk defines visual style and color palettes to ensure all charts in the report look consistent. 

# ── Global ggplot2 theme ─────────────────────────────────────────────────────
theme_ins <- function(base_size = 13) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title    = element_text(face = "bold", size = base_size + 1),
      plot.subtitle = element_text(colour = "grey50", size = base_size - 1),
      panel.grid.minor = element_blank(),
      legend.position  = "bottom"
    )
}

theme_set(theme_ins())

# ── Consistent colour palette ────────────────────────────────────────────────
pal_main  <- c("#1565C0","#1E88E5","#42A5F5","#90CAF9","#E3F2FD")
pal_div   <- c("#D32F2F","#EF5350","#EF9A9A","#BBDEFB","#42A5F5","#1565C0")
pal_cat5  <- c("#1565C0","#FF9800","#4CAF50","#F44336","#9C27B0")
pal_bonus <- c("Good (Favourable History)" = "#4CAF50", Neutral = "#FF9800", "Bad (Poor History)" = "#F44336")
pal_area  <- c(Urban = "#FF9800", Rural = "#1E88E5")
pal_fuel  <- c(Gasoline = "#FF9800", Diesel = "#1E88E5")

3 Data Preparation

3.1 Loading Data

Following code chunk loads data into R environment as df_raw, read_delim() is used instead of read_csv() because the file uses semicolons as separators — common in European CSVs where commas are used as decimal points. show_col_types = FALSE suppresses the column-type printout.

#Reading the data into R environment
df_raw <- read_delim(
  "data/Dataset_of_motor_insurance_portfolio.csv",
  delim = ";",
  show_col_types = FALSE
)

3.2 Variable Classification

Following code chunk builds a metadata table about the dataset itself, enriched with variable descriptions from the official data cookbook. Key functions:

  • names(df_raw) — extracts all column names

  • sapply(df_raw, class) — applies class() to every column to get its data type

  • case_when() — tidyverse’s multi-condition if-else, here used to classify each variable into the four analytical groups defined in the cookbook: Policy and insured characteristics, Driver and vehicle characteristics, Premiums, and Exposure, claims and incurred

  • str_detect(Variable, "premium") — regex match to catch all premium-related columns without listing them individually

  • colSums(is.na()) — counts missing values per column

Variable Classification and Missing Data Summary (Source: Mendeley Data Cookbook)
Variable Type Description Group Missing_N Missing_Pct
insured_id numeric Unique sequential identifier per policy number Policy & Insured Characteristics 0 0.00%
year numeric Calendar year of the policy record Policy & Insured Characteristics 0 0.00%
policy_type character Coverage structure: TP / TPG / CC / COMP_N / COMP_E Policy & Insured Characteristics 0 0.00%
policy_status character Active (A) or Cancelled (C) at time of analysis Policy & Insured Characteristics 0 0.00%
business_type character New Business (NB) or Portfolio renewal (P) Policy & Insured Characteristics 0 0.00%
payment_frequency character Billing frequency: Annual (A), Semi-annual (S), Quarterly (Q) Policy & Insured Characteristics 0 0.00%
bonus_score character Claims history indicator: Good (G), Neutral (N), Bad (B) Policy & Insured Characteristics 0 0.00%
driver_age numeric Age of the main driver in years Driver & Vehicle Characteristics 0 0.00%
vehicle_age numeric Age of the insured vehicle in years Driver & Vehicle Characteristics 2 0.00%
age_driving_licence numeric Calendar year licence was issued (not years held) Driver & Vehicle Characteristics 2 0.00%
fuel_type character Vehicle fuel: Gasoline (G) or Diesel (D) Driver & Vehicle Characteristics 1287 0.36%
vehicle_value numeric Declared insured value of the vehicle (some missing) Driver & Vehicle Characteristics 513 0.14%
seats numeric Number of seats in the insured vehicle Driver & Vehicle Characteristics 0 0.00%
power_to_weight_ratio numeric Approx. kg per horsepower — proxy for vehicle performance Driver & Vehicle Characteristics 0 0.00%
vehicle_brand character Brand of the insured vehicle Driver & Vehicle Characteristics 0 0.00%
municipality_type character Geographic classification: Inland (I), Coastal (C), Islands (IS) Driver & Vehicle Characteristics 0 0.00%
circulation_area character Predominant environment: Urban (U) or Rural (R) Driver & Vehicle Characteristics 0 0.00%
total_premium numeric Sum of all individual coverage premiums Premiums 0 0.00%
liability_premium numeric Premium charged for third-party liability coverage Premiums 0 0.00%
property_damage_premium numeric Premium for property damage coverage Premiums 0 0.00%
theft_premium numeric Premium for theft coverage Premiums 0 0.00%
fire_premium numeric Premium for fire coverage Premiums 0 0.00%
glass_premium numeric Premium for glass coverage Premiums 0 0.00%
legal_protection_premium numeric Premium for legal protection coverage Premiums 0 0.00%
occupants_premium numeric Premium for personal-injury coverage of occupants Premiums 0 0.00%
total_claims numeric Total number of claims reported during exposure period Exposure, Claims & Incurred 0 0.00%
liability_claims numeric Count of liability claims Exposure, Claims & Incurred 0 0.00%
liability_property_claims numeric Count of liability_property claims Exposure, Claims & Incurred 0 0.00%
liability_injury_claims numeric Count of liability_injury claims Exposure, Claims & Incurred 0 0.00%
property_claims numeric Count of property claims Exposure, Claims & Incurred 0 0.00%
theft_claims numeric Count of theft claims Exposure, Claims & Incurred 0 0.00%
fire_claims numeric Count of fire claims Exposure, Claims & Incurred 0 0.00%
glass_claims numeric Count of glass claims Exposure, Claims & Incurred 0 0.00%
legal_protection_claims numeric Count of legal_protection claims Exposure, Claims & Incurred 0 0.00%
occupants_claims numeric Count of occupants claims Exposure, Claims & Incurred 0 0.00%
total_incurred numeric Total incurred cost: paid amounts plus outstanding reserves Exposure, Claims & Incurred 0 0.00%
liability_incurred numeric Incurred cost (paid + reserved) for liability claims Exposure, Claims & Incurred 0 0.00%
liability_property_incurred numeric Incurred cost (paid + reserved) for liability_property claims Exposure, Claims & Incurred 0 0.00%
liability_injury_incurred numeric Incurred cost (paid + reserved) for liability_injury claims Exposure, Claims & Incurred 0 0.00%
property_incurred numeric Incurred cost (paid + reserved) for property claims Exposure, Claims & Incurred 0 0.00%
theft_incurred numeric Incurred cost (paid + reserved) for theft claims Exposure, Claims & Incurred 0 0.00%
fire_incurred numeric Incurred cost (paid + reserved) for fire claims Exposure, Claims & Incurred 0 0.00%
glass_incurred numeric Incurred cost (paid + reserved) for glass claims Exposure, Claims & Incurred 0 0.00%
legal_protection_incurred numeric Incurred cost (paid + reserved) for legal_protection claims Exposure, Claims & Incurred 0 0.00%
occupants_incurred numeric Incurred cost (paid + reserved) for occupants claims Exposure, Claims & Incurred 0 0.00%
total_exposure numeric Effective time-on-risk in policy years (0–1, may exceed 1) Exposure, Claims & Incurred 0 0.00%
liability_exposure numeric Exposure associated with liability coverage Exposure, Claims & Incurred 0 0.00%
var_types <- tibble(
  Variable = names(df_raw),
  Type = sapply(df_raw, class),
  Description = case_when(
    Variable == "insured_id"                  ~ "Unique sequential identifier per policy number",
    Variable == "year"                        ~ "Calendar year of the policy record",
    Variable == "policy_type"                 ~ "Coverage structure: TP / TPG / CC / COMP_N / COMP_E",
    Variable == "policy_status"               ~ "Active (A) or Cancelled (C) at time of analysis",
    Variable == "business_type"               ~ "New Business (NB) or Portfolio renewal (P)",
    Variable == "payment_frequency"           ~ "Billing frequency: Annual (A), Semi-annual (S), Quarterly (Q)",
    Variable == "bonus_score"                 ~ "Claims history indicator: Good (G), Neutral (N), Bad (B)",
    Variable == "driver_age"                  ~ "Age of the main driver in years",
    Variable == "vehicle_age"                 ~ "Age of the insured vehicle in years",
    Variable == "age_driving_licence"         ~ "Calendar year licence was issued (not years held)",
    Variable == "fuel_type"                   ~ "Vehicle fuel: Gasoline (G) or Diesel (D)",
    Variable == "vehicle_value"               ~ "Declared insured value of the vehicle (some missing)",
    Variable == "seats"                       ~ "Number of seats in the insured vehicle",
    Variable == "power_to_weight_ratio"       ~ "Approx. kg per horsepower — proxy for vehicle performance",
    Variable == "vehicle_brand"               ~ "Brand of the insured vehicle",
    Variable == "municipality_type"           ~ "Geographic classification: Inland (I), Coastal (C), Islands (IS)",
    Variable == "circulation_area"            ~ "Predominant environment: Urban (U) or Rural (R)",
    Variable == "total_premium"               ~ "Sum of all individual coverage premiums",
    Variable == "total_claims"                ~ "Total number of claims reported during exposure period",
    Variable == "total_incurred"              ~ "Total incurred cost: paid amounts plus outstanding reserves",
    Variable == "total_exposure"              ~ "Effective time-on-risk in policy years (0–1, may exceed 1)",
    Variable == "liability_exposure"          ~ "Exposure associated with liability coverage",
    str_detect(Variable, "^liability_p.*prem") ~ "Premium for liability property damage coverage",
    str_detect(Variable, "^liability_i.*prem") ~ "Premium for liability bodily injury coverage",
    str_detect(Variable, "liability_premium")  ~ "Premium charged for third-party liability coverage",
    str_detect(Variable, "property_damage_premium") ~ "Premium for property damage coverage",
    str_detect(Variable, "theft_premium")     ~ "Premium for theft coverage",
    str_detect(Variable, "fire_premium")      ~ "Premium for fire coverage",
    str_detect(Variable, "glass_premium")     ~ "Premium for glass coverage",
    str_detect(Variable, "legal_protection_premium") ~ "Premium for legal protection coverage",
    str_detect(Variable, "occupants_premium") ~ "Premium for personal-injury coverage of occupants",
    str_detect(Variable, "_claims$")          ~ paste0("Count of ", str_remove(Variable, "_claims"), " claims"),
    str_detect(Variable, "_incurred$")        ~ paste0("Incurred cost (paid + reserved) for ", str_remove(Variable, "_incurred"), " claims"),
    TRUE ~ Variable
  ),
  Group = case_when(
    Variable %in% c("insured_id","year","policy_type","policy_status",
                    "business_type","payment_frequency","bonus_score") ~ "Policy & Insured Characteristics",
    Variable %in% c("driver_age","vehicle_age","age_driving_licence","fuel_type",
                    "vehicle_value","seats","power_to_weight_ratio",
                    "vehicle_brand","municipality_type","circulation_area") ~ "Driver & Vehicle Characteristics",
    str_detect(Variable, "premium") ~ "Premiums",
    str_detect(Variable, "claims|incurred|exposure") ~ "Exposure, Claims & Incurred",
    TRUE ~ "Other"
  ),
  Missing_N = colSums(is.na(df_raw)),
  Missing_Pct = scales::percent(colSums(is.na(df_raw)) / nrow(df_raw), accuracy = 0.01)
)

kbl(var_types, caption = "Variable Classification and Missing Data Summary (Source: Mendeley Data Cookbook)") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) |>
  column_spec(4, bold = TRUE) |>
  scroll_box(height = "400px")

3.3 Feature Engineering

Derived variables are constructed to support the analytical objectives. The loss ratio (incurred ÷ premium) is the primary profitability metric in non-life insurance; claim indicator enables frequency modelling; profit facilitates absolute value analysis. Note that total_exposure — the effective time-on-risk expressed in policy years — ranges from 0 to 1 for partial-year policies and may exceed 1 for multi-year contracts. total_incurred captures both paid amounts and outstanding reserves, so it represents the full economic cost of claims, not just settled payments.

df <- df_raw |>
  mutate(
    # ── Temporal ──────────────────────────────────────────────────────────────
    year = factor(year),

    # ── Categorical recoding ──────────────────────────────────────────────────
    policy_type = factor(policy_type,
                         levels = c("TP","TPG","CC","COMP_N","COMP_E"),
                         labels = c("Third Party","Third Party+Glass",
                                    "Third Party+Multi-Cover","Comprehensive (No Excess)","Comprehensive (With Excess)")),
    policy_status = factor(policy_status, levels = c("A","C"),
                           labels = c("Active","Cancelled")),
    business_type = factor(business_type, levels = c("NB","P"),
                           labels = c("New Business","Portfolio")),
    payment_frequency = factor(payment_frequency,
                               levels = c("A","S","Q"),
                               labels = c("Annual","Semi-Annual","Quarterly")),
    bonus_score = factor(bonus_score, levels = c("G","N","B"),
                         labels = c("Good (Favourable History)","Neutral","Bad (Poor History)")),
    fuel_type = factor(fuel_type, levels = c("G","D"),
                       labels = c("Gasoline","Diesel")),
    municipality_type = factor(municipality_type,
                               levels = c("I","C","IS"),
                               labels = c("Inland","Coastal","Islands")),
    circulation_area = factor(circulation_area, levels = c("U","R"),
                              labels = c("Urban","Rural")),

    # ── Derived variables ─────────────────────────────────────────────────────
    has_claim   = total_claims > 0,
    loss_ratio  = if_else(total_premium > 0,
                          total_incurred / total_premium, NA_real_),
    profit      = total_premium - total_incurred,
    severity    = if_else(has_claim, total_incurred / total_claims, NA_real_),
    driver_age_group = cut(driver_age,
                           breaks = c(17,25,35,45,55,65,90),
                           labels = c("18-25","26-35","36-45","46-55","56-65","66+"),
                           right = TRUE),
    veh_value_q = ntile(vehicle_value, 5)
  )

# Summary of key derived variables
df |>
  select(has_claim, loss_ratio, profit, severity) |>
  summary()
 has_claim         loss_ratio            profit              severity       
 Mode :logical   Min.   :    0.000   Min.   :-571077.99   Min.   :     0.0  
 FALSE:303333    1st Qu.:    0.000   1st Qu.:     94.92   1st Qu.:   119.7  
 TRUE :50807     Median :    0.000   Median :    229.90   Median :   498.4  
                 Mean   :    0.787   Mean   :     88.14   Mean   :   848.0  
                 3rd Qu.:    0.000   3rd Qu.:    346.67   3rd Qu.:  1012.0  
                 Max.   :11348.859   Max.   :   5107.64   Max.   :244091.2  
                 NA's   :49                               NA's   :303333    

3.4 Data Quality Assessment

Variables with Missing Values
Variable Missing Count Missing %
vehicle_age 2 0.00%
age_driving_licence 2 0.00%
fuel_type 1287 0.36%
vehicle_value 513 0.14%
# Missing values by variable
missing_summary <- df_raw |>
  summarise(across(everything(), ~ sum(is.na(.)))) |>
  pivot_longer(everything(), names_to = "Variable", values_to = "Missing") |>
  filter(Missing > 0) |>
  mutate(Pct = Missing / nrow(df_raw))

kbl(missing_summary |>
      mutate(Pct = scales::percent(Pct, accuracy = 0.01)),
    caption = "Variables with Missing Values",
    col.names = c("Variable","Missing Count","Missing %")) |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)

Missing data is minimal: fuel_type (0.36%) and vehicle_age (< 0.001%) have negligible missingness and are handled by dropping NA rows in fuel-specific analyses. No imputation is required given the tiny proportion.


4 Univariate Visual Analytics

Univariate analysis examines each variable in isolation to understand its distribution, central tendency, dispersion, and shape. This establishes a baseline understanding of the portfolio before investigating relationships between variables.

4.1 Numerical Variables

4.1.1 Justification of Technique

For continuous numerical variables, histograms reveal the full distributional shape including skewness, modality, and outlier presence. They are preferred over box plots for initial exploration when sample sizes are large (n > 10,000) as they expose tail behaviour critical for insurance pricing. Density plots are overlaid where smooth distributional form aids interpretation.

4.1.2 Total Annual Premium

Distribution of Total Annual Premium (truncated at 99.5th percentile)
q995_prem <- quantile(df$total_premium, 0.995, na.rm = TRUE)

p1 <- df |>
  filter(total_premium > 0, total_premium <= q995_prem) |>
  ggplot(aes(total_premium)) +
  geom_histogram(aes(y = after_stat(density)), bins = 70,
                 fill = "#1E88E5", colour = NA, alpha = 0.85) +
  geom_density(colour = "#FF9800", linewidth = 0.9) +
  geom_vline(xintercept = median(df$total_premium[df$total_premium > 0], na.rm = TRUE),
             colour = "#F44336", linetype = "dashed", linewidth = 1) +
  annotate("text", x = median(df$total_premium[df$total_premium > 0], na.rm=TRUE) + 15,
           y = 0.002, label = paste0("Median: €",
             round(median(df$total_premium[df$total_premium>0],na.rm=TRUE))),
           colour = "#F44336", hjust = 0, size = 3.5) +
  scale_x_continuous(labels = scales::dollar_format(prefix = "€"),
                     name = "Total Annual Premium (€)") +
  scale_y_continuous(name = "Density") +
  labs(title = "Distribution of Total Annual Premium",
       subtitle = "Right-skewed; median ~€258; long upper tail from high-value covers")

p1

The premium distribution is strongly right-skewed with a long upper tail. The median annual premium of approximately €258 indicates most policies are competitively priced, while the extended tail reflects high-value comprehensive covers.

4.1.3 Total Incurred Claims

Following code chunk uses the ggridges package, it stacks density curves for each age group on the y-axis:

  • scale = 1.3 — controls how much the ridges overlap (higher = more overlap)

  • alpha = 0.75 — transparency so overlapping ridges remain visible

  • The x-axis is premium, y-axis is the categorical grouping

Distribution of Incurred Losses — claimants only (truncated at 99.5th percentile)
q995_inc <- quantile(df$total_incurred[df$total_incurred > 0], 0.995, na.rm = TRUE)

p2 <- df |>
  filter(total_incurred > 0, total_incurred <= q995_inc) |>
  ggplot(aes(total_incurred)) +
  geom_histogram(aes(y = after_stat(density)), bins = 70,
                 fill = "#F44336", colour = NA, alpha = 0.8) +
  geom_density(colour = "#FF9800", linewidth = 0.9) +
  scale_x_continuous(labels = scales::dollar_format(prefix = "€")) +
  labs(title = "Distribution of Total Incurred Claims (Claimants Only)",
       subtitle = "Extreme right skew; heavy upper tail; consistent with Pareto/log-normal loss models",
       x = "Incurred Claims (€)", y = "Density")

p2

Claims incurred among policyholders who made at least one claim exhibit extreme right skewness, consistent with actuarial loss models (log-normal, gamma, or Pareto distributions). The heavy upper tail represents catastrophic losses and has significant reserve implications.

4.1.4 Driver Age and Vehicle Age

Distributions of driver age (left) and vehicle age (right)
p_dage <- df |>
  ggplot(aes(driver_age)) +
  geom_histogram(bins = 50, fill = "#42A5F5", colour = NA, alpha = 0.85) +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Driver Age Distribution",
       subtitle = "Approximately normal; mean ~49 years; mode at age 40",
       x = "Driver Age (years)", y = "Count")

p_vage <- df |>
  filter(!is.na(vehicle_age)) |>
  ggplot(aes(vehicle_age)) +
  geom_histogram(bins = 50, fill = "#4CAF50", colour = NA, alpha = 0.85) +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Vehicle Age Distribution",
       subtitle = "Bimodal pattern; high frequency of older vehicles",
       x = "Vehicle Age (years)", y = "Count")

p_dage + p_vage

Driver age is approximately normally distributed with a mean of 49 years and mode at age 40, reflecting a mature policyholder base. Vehicle age shows a right-skewed distribution with the highest concentration of vehicles aged 18–22 years, suggesting the portfolio is dominated by older vehicles rather than new registrations. Note that age_driving_licence records the calendar year the licence was issued — not the number of years the licence has been held — and should be transformed (e.g., current year minus licence year) before use as an experience proxy in any model.

4.1.5 Numerical Summary Statistics

Descriptive Statistics for Numerical Variables
Variable Min Q1 Median Mean Q3 Max
driver_age 18.00 39.00 48.00 49.01 58.00 90.00
vehicle_age 0.00 17.00 25.00 25.92 35.00 68.00
vehicle_value 1630.12 18607.58 24729.60 26977.44 32252.32 374034.21
power_to_weight_ratio 0.00 10.74 12.05 12.33 13.73 64.81
total_premium 0.00 146.14 257.90 296.72 384.98 5107.64
total_incurred 0.00 0.00 0.00 208.58 0.00 571128.32
total_claims 0.00 0.00 0.00 0.22 0.00 18.00
total_exposure 0.00 0.44 0.84 0.71 1.00 1.00
df |>
  select(driver_age, vehicle_age, vehicle_value, power_to_weight_ratio,
         total_premium, total_incurred, total_claims, total_exposure) |>
  summarise(across(everything(),
                   list(
                     Min    = ~min(., na.rm=TRUE),
                     Q1     = ~quantile(., 0.25, na.rm=TRUE),
                     Median = ~median(., na.rm=TRUE),
                     Mean   = ~mean(., na.rm=TRUE),
                     Q3     = ~quantile(., 0.75, na.rm=TRUE),
                     Max    = ~max(., na.rm=TRUE)
                   ))) |>
  pivot_longer(everything(), names_sep = "_(?=[^_]+$)", names_to = c("Variable","Stat")) |>
  pivot_wider(names_from = Stat, values_from = value) |>
  mutate(across(where(is.numeric), ~round(., 2))) |>
  kbl(caption = "Descriptive Statistics for Numerical Variables") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = TRUE) |>
  scroll_box(height = "350px")

4.2 Categorical Variables

4.2.1 Justification of Technique

For categorical variables, bar charts provide the clearest representation of frequency distributions and proportional composition. Horizontal bars are preferred when category labels are long to avoid rotation. Stacked and proportional bar charts expose compositional relationships across grouping dimensions.

4.2.2 Coverage Type and Policy Status

Coverage type distribution and policy status composition
p_pt <- df |>
  count(policy_type, sort = TRUE) |>
  mutate(policy_type = fct_reorder(policy_type, n),
         pct = n / sum(n)) |>
  ggplot(aes(n, policy_type, fill = policy_type)) +
  geom_col(colour = NA, alpha = 0.9) +
  geom_text(aes(label = scales::percent(pct, accuracy = 0.1)),
            hjust = -0.1, size = 3.5) +
  scale_fill_manual(values = pal_cat5, guide = "none") +
  scale_x_continuous(labels = scales::comma,
                     expand = expansion(mult = c(0, 0.15))) +
  labs(title = "Policies by Coverage Type",
       x = "Number of Policies", y = NULL)

p_bs_status <- df |>
  count(policy_status, business_type) |>
  ggplot(aes(business_type, n, fill = policy_status)) +
  geom_col(position = "fill", colour = NA, width = 0.6) +
  scale_fill_manual(values = c(Active = "#4CAF50", Cancelled = "#F44336")) +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Policy Status by Business Type",
       x = "Business Type", y = "Proportion", fill = "Status")

p_pt + p_bs_status

Comprehensive with excess (COMP_E) dominates at 54% of policies, reflecting consumer preference for broad coverage. Third-party liability with glass (TPG) is the second most common at ~25%, catering to budget-conscious customers seeking basic plus glass cover. The CC category — third-party liability combined with two or more additional coverages such as theft, fire, or total loss — represents a mid-tier segment between basic TP and full comprehensive. Notably, Cancelled policies are slightly more prevalent among New Business than Portfolio renewals, which is expected given higher lapse rates in the first policy year.

4.2.3 Bonus Score and Payment Frequency

Bonus score distribution and payment frequency by coverage type
p_bonus <- df |>
  count(bonus_score) |>
  filter(!is.na(bonus_score)) |>
  mutate(pct = n / sum(n)) |>
  ggplot(aes(bonus_score, pct, fill = bonus_score)) +
  geom_col(colour = NA, width = 0.6) +
  geom_text(aes(label = scales::percent(pct, accuracy = 0.1)),
            vjust = -0.5, size = 4) +
  scale_fill_manual(values = pal_bonus, guide = "none") +
  scale_y_continuous(labels = scales::percent, limits = c(0, 1.05)) +
  labs(title = "Bonus Score Distribution",
       subtitle = "96.5% hold Good status — positive portfolio quality signal",
       x = "Bonus Score", y = "Proportion")

p_pf <- df |>
  count(payment_frequency, policy_type) |>
  ggplot(aes(payment_frequency, n, fill = policy_type)) +
  geom_col(position = "fill", colour = NA, width = 0.6) +
  scale_fill_manual(values = pal_cat5) +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Payment Frequency by Coverage",
       x = "Payment Frequency", y = "Proportion", fill = "Coverage")

p_bonus + p_pf

The bonus score reflects the policyholder’s past claims experience: “Good (Favourable History)” indicates a favourable claims history, “Neutral” a moderate profile, and “Bad (Poor History)” denotes a poor claims record. The distribution is heavily concentrated toward Good (96.5%), limiting the discriminatory power of this variable for the majority of the portfolio. However, the 3.5% with Neutral or Bad scores are disproportionately important for risk segmentation, as demonstrated in the bivariate analysis. Annual payment is the dominant method across all coverage types, with quarterly payment more common in budget products (TP, TPG).

4.2.4 Vehicle Fuel Type and Circulation Area

Fuel type and geographic area distributions
p_fuel <- df |>
  filter(!is.na(fuel_type)) |>
  count(fuel_type) |>
  mutate(pct = n / sum(n)) |>
  ggplot(aes(fuel_type, pct, fill = fuel_type)) +
  geom_col(colour = NA, width = 0.55) +
  geom_text(aes(label = scales::percent(pct, accuracy = 0.1)),
            vjust = -0.5, size = 4.5) +
  scale_fill_manual(values = pal_fuel, guide = "none") +
  scale_y_continuous(labels = scales::percent, limits = c(0, 0.65)) +
  labs(title = "Fuel Type Distribution",
       x = "Fuel Type", y = "Proportion")

p_area <- df |>
  count(circulation_area, municipality_type) |>
  ggplot(aes(circulation_area, n, fill = municipality_type)) +
  geom_col(position = "stack", colour = NA, width = 0.6) +
  scale_fill_manual(values = c("#1565C0","#42A5F5","#90CAF9")) +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Area × Municipality Type",
       x = "Circulation Area", y = "Policies", fill = "Municipality")

p_fuel + p_area

Gasoline vehicles (57%) slightly outnumber diesel (43%), broadly reflecting the Spanish vehicle fleet composition. Note that fuel_type records only these two categories — other fuel types (electric, hybrid) are not present in this dataset. The circulation_area variable classifies the predominant driving environment as Urban (U) or Rural (R). Continental urban policies dominate the geographic mix, consistent with Spain’s population concentration in metropolitan areas. The municipality_type distinguishes Inland (I), Coastal (C), and Islands (IS) — an important consideration for territorial risk analysis since island policies show distinct risk profiles from mainland ones.


5 Bivariate Visual Analytics

Bivariate analysis investigates relationships between pairs of variables to identify factors influencing portfolio profitability and volatility. Each analysis is accompanied by a justification of the chosen visualisation technique and a discussion of statistical implications.

5.2 Driver Age vs Claim Frequency and Severity

Technique justification: Grouped bar charts with ordered categories (young to old) clearly communicate the monotonic relationship between age and risk. Separating frequency (incidence) from severity (cost given claim) follows the standard two-part actuarial model, which is more informative than aggregated loss ratios alone.

Claim frequency and average severity by driver age group
age_summary <- df |>
  group_by(driver_age_group) |>
  summarise(
    n = n(),
    claim_freq   = mean(has_claim, na.rm = TRUE),
    avg_severity = mean(total_incurred[total_incurred > 0], na.rm = TRUE),
    loss_ratio   = sum(total_incurred, na.rm=TRUE) / sum(total_premium, na.rm=TRUE),
    .groups = "drop"
  )

p_freq_age <- age_summary |>
  ggplot(aes(driver_age_group, claim_freq, fill = claim_freq)) +
  geom_col(colour = NA, width = 0.7) +
  geom_text(aes(label = scales::percent(claim_freq, accuracy=0.1)),
            vjust = -0.4, size = 3.8, fontface = "bold") +
  scale_fill_gradient(low = "#42A5F5", high = "#F44336", guide = "none") +
  scale_y_continuous(labels = scales::percent, limits = c(0, 0.25)) +
  labs(title = "Claim Frequency by Driver Age Group",
       subtitle = "Monotonic decline from youngest (20.5%) to oldest (12.1%)",
       x = "Age Group", y = "Claim Frequency")

p_sev_age <- age_summary |>
  ggplot(aes(driver_age_group, avg_severity, fill = driver_age_group)) +
  geom_col(colour = NA, width = 0.7) +
  scale_fill_brewer(palette = "Blues", guide = "none") +
  geom_text(aes(label = scales::dollar(avg_severity, prefix = "€", accuracy = 1)),
            vjust = -0.4, size = 3.8) +
  scale_y_continuous(
    labels = scales::dollar_format(prefix = "€"),
    expand = expansion(mult = c(0, 0.15))   # 15% headroom above tallest bar
  ) +
  labs(title    = "Average Claim Severity by Driver Age Group",
       subtitle = "Severity does not mirror frequency — non-linear risk profile",
       x = "Age Group", y = "Avg Incurred per Claim (€)")

p_freq_age / p_sev_age

Premium distribution by driver age group (ridge plot)
df |>
  filter(total_premium > 0, total_premium < 1500, !is.na(driver_age_group)) |>
  ggplot(aes(total_premium, driver_age_group, fill = driver_age_group)) +
  geom_density_ridges(alpha = 0.75, scale = 1.3, colour = NA) +
  scale_fill_brewer(palette = "RdYlBu", direction = -1, guide = "none") +
  scale_x_continuous(labels = scales::dollar_format(prefix="€")) +
  labs(title = "Premium Distribution by Driver Age Group",
       subtitle = "Ridge plot reveals premium spread and pricing differentiation by age",
       x = "Total Premium (€)", y = "Driver Age Group")

Statistical insight: The monotonic decline in claim frequency from 20.5% (18–25) to 12.1% (66+) is statistically robust across the large sample. The ridge plot confirms that younger drivers (18–25) are charged substantially higher median premiums (~€474) than all other age groups (~€250–€262), validating that the pricing model incorporates age-related frequency risk. However, the severity analysis reveals an important counter-pattern: claim cost increases with driver age, with the 66+ cohort generating the highest average severity at ~€1,982 versus the lowest in the 26–35 group at ~€1,658. This means older drivers file claims less often but generate more expensive claims when they do — a frequency-severity trade-off that a simple loss ratio approach obscures. A two-part GLM (frequency × severity) is strongly recommended to correctly price both dimensions.

5.3 Bonus Score vs Loss Ratio

Technique justification: Side-by-side bar charts with explicit loss ratio labels enable direct comparison across the three bonus-score categories. The ridge plot supplement shows premium adequacy — whether higher-risk segments are charged commensurately.

Loss ratio and claim frequency by bonus score
bonus_summary <- df |>
  filter(!is.na(bonus_score)) |>
  group_by(bonus_score) |>
  summarise(
    n = n(),
    total_prem   = sum(total_premium, na.rm = TRUE),
    total_inc    = sum(total_incurred, na.rm = TRUE),
    claim_freq   = mean(has_claim, na.rm = TRUE),
    loss_ratio   = total_inc / total_prem,
    avg_premium  = mean(total_premium, na.rm = TRUE),
    .groups = "drop"
  )

p_bonus_lr <- bonus_summary |>
  ggplot(aes(bonus_score, loss_ratio, fill = bonus_score)) +
  geom_col(colour = NA, width = 0.6) +
  geom_text(aes(label = scales::percent(loss_ratio, accuracy = 0.1)),
            vjust = -0.5, size = 5, fontface = "bold") +
  scale_fill_manual(values = pal_bonus, guide = "none") +
  scale_y_continuous(labels = scales::percent, limits = c(0, 0.9)) +
  labs(title = "Loss Ratio by Bonus Score",
       x = "Bonus Score", y = "Loss Ratio (%)")

p_bonus_cf <- bonus_summary |>
  ggplot(aes(bonus_score, claim_freq, fill = bonus_score)) +
  geom_col(colour = NA, width = 0.6) +
  geom_text(aes(label = scales::percent(claim_freq, accuracy = 0.1)),
            vjust = -0.5, size = 5, fontface = "bold") +
  scale_fill_manual(values = pal_bonus, guide = "none") +
  scale_y_continuous(labels = scales::percent, limits = c(0, 0.4)) +
  labs(title = "Claim Frequency by Bonus Score",
       x = "Bonus Score", y = "Claim Frequency (%)")

p_bonus_lr + p_bonus_cf

Premium distribution by bonus score — are higher-risk segments adequately priced?
df |>
  filter(!is.na(bonus_score), total_premium > 0, total_premium < 1500) |>
  ggplot(aes(total_premium, bonus_score, fill = bonus_score)) +
  geom_density_ridges(alpha = 0.75, scale = 1.2, colour = NA) +
  scale_fill_manual(values = pal_bonus, guide = "none") +
  scale_x_continuous(labels = scales::dollar_format(prefix="€")) +
  labs(title = "Premium Distribution by Bonus Score",
       subtitle = "Bad-scored policyholders show slightly higher premium density at upper tail",
       x = "Annual Premium (€)", y = "Bonus Score")

Statistical insight: The “Bad (Poor History)” bonus score has a loss ratio of 79.0% versus 69.7% for “Good (Favourable History)” — a 9.3 percentage-point gap. Notably, the “Neutral” category has an even higher loss ratio at 81.4%, suggesting it captures policyholders in transition from Good to Bad rather than a genuinely moderate-risk group. The claim frequency gap between Bad (21.9%) and Good (14.2%) is substantial at 7.7 percentage points. The premium ridge plot confirms that Bad-scored policyholders pay higher premiums on average (mean €589 vs €290 for Good), but the persistent loss ratio gap indicates the current loading is still insufficient. A minimum 12–15% additional loading would be required to bring Bad scores to portfolio-average profitability.

5.4 Coverage Type vs Profitability

Technique justification: Horizontal bar charts ordered by loss ratio enable rapid identification of the most and least profitable product lines. The scatter plot of premium vs profit reveals volatility concentration — essential for capital allocation decisions.

Loss ratio and profitability metrics by coverage type
cov_summary <- df |>
  group_by(policy_type) |>
  summarise(
    n = n(),
    avg_premium  = mean(total_premium, na.rm = TRUE),
    avg_incurred = mean(total_incurred, na.rm = TRUE),
    avg_profit   = mean(profit, na.rm = TRUE),
    loss_ratio   = sum(total_incurred, na.rm=TRUE) / sum(total_premium, na.rm=TRUE),
    claim_freq   = mean(has_claim, na.rm = TRUE),
    .groups = "drop"
  ) |>
  mutate(policy_type = fct_reorder(policy_type, loss_ratio))

p_cov_lr <- cov_summary |>
  ggplot(aes(loss_ratio, policy_type, fill = loss_ratio)) +
  geom_col(colour = NA) +
  geom_vline(xintercept = sum(df$total_incurred,na.rm=TRUE)/sum(df$total_premium,na.rm=TRUE),
             colour = "#FF9800", linetype = "dashed", linewidth = 1) +
  geom_text(aes(label = scales::percent(loss_ratio, accuracy=0.1)),
            hjust = -0.1, size = 3.8) +
  scale_fill_gradient2(low = "#4CAF50", mid = "#FF9800", high = "#F44336",
                       midpoint = 0.70, guide = "none") +
  scale_x_continuous(labels = scales::percent,
                     expand = expansion(mult = c(0, 0.12))) +
  labs(title = "Loss Ratio by Coverage Type",
       subtitle = "Dashed line = portfolio average (70.3%)",
       x = "Loss Ratio", y = NULL)

p_profit_box <- df |>
  filter(profit > -5000, profit < 2000) |>
  ggplot(aes(policy_type, profit, fill = policy_type)) +
  geom_violin(alpha = 0.7, colour = NA, draw_quantiles = c(0.25, 0.5, 0.75)) +
  geom_hline(yintercept = 0, colour = "#F44336", linetype = "dashed") +
  scale_fill_manual(values = pal_cat5, guide = "none") +
  scale_y_continuous(labels = scales::dollar_format(prefix="€")) +
  coord_flip() +
  labs(title = "Profit Distribution by Coverage Type",
       subtitle = "Violin plot shows spread and skewness of per-policy profits",
       x = NULL, y = "Profit per Policy (€)")

p_cov_lr / p_profit_box

Statistical insight: Third Party achieves the best loss ratio (~54.3%) owing to its limited coverage scope, while Comprehensive (No Excess) (COMP_N) is the most loss-making at ~93.9% — far above the portfolio average of 70.3%. This is actuarially expected: COMP_N policyholders bear no excess on claims, removing the cost-sharing incentive that suppresses small claims. Third Party+Multi-Cover (CC) performs better than expected at ~63.4% despite its multi-coverage structure, likely due to selective underwriting of this mid-tier segment. Comprehensive (With Excess) (COMP_E) sits close to the portfolio average at ~73.8%, reflecting that the excess loading partially compensates for the broader coverage. The violin plots confirm that COMP_N and COMP_E carry the widest downside profit distributions, representing the greatest per-policy volatility.

5.5 Vehicle Value vs Loss Ratio

Technique justification: Quintile-based bar charts reduce noise from the continuous vehicle value variable while maintaining ordinal structure. This technique, known as “equal-frequency binning”, is standard in actuarial generalised linear modelling for rating factor visualisation.

Loss ratio by vehicle value quintile
veh_quintiles <- df |>
  filter(!is.na(vehicle_value), vehicle_value > 0) |>
  group_by(veh_value_q) |>
  summarise(
    min_val    = min(vehicle_value),
    max_val    = max(vehicle_value),
    n          = n(),
    loss_ratio = sum(total_incurred, na.rm=TRUE) / sum(total_premium, na.rm=TRUE),
    claim_freq = mean(has_claim, na.rm = TRUE),
    avg_premium = mean(total_premium, na.rm=TRUE),
    .groups = "drop"
  ) |>
  mutate(
    label = paste0("Q", veh_value_q, "\n€",
                   scales::comma(round(min_val/1000)),"k-",
                   scales::comma(round(max_val/1000)),"k")
  )

p_vval_lr <- veh_quintiles |>
  ggplot(aes(label, loss_ratio, fill = loss_ratio)) +
  geom_col(colour = NA, width = 0.7) +
  geom_text(aes(label = scales::percent(loss_ratio, accuracy=0.1)),
            vjust = -0.5, size = 4, fontface = "bold") +
  scale_fill_gradient2(low = "#4CAF50", mid = "#FF9800", high = "#F44336",
                       midpoint = 0.70, guide = "none") +
  scale_y_continuous(labels = scales::percent, limits = c(0, 0.88)) +
  labs(title = "Loss Ratio by Vehicle Value Quintile",
       subtitle = "Q1 = lowest value vehicles; Q5 = highest value vehicles",
       x = "Vehicle Value Quintile", y = "Loss Ratio (%)")

p_vval_cf <- veh_quintiles |>
  ggplot(aes(label, claim_freq, fill = claim_freq)) +
  geom_col(colour = NA, width = 0.7) +
  geom_text(aes(label = scales::percent(claim_freq, accuracy=0.1)),
            vjust = -0.5, size = 4) +
  scale_fill_gradient(low = "#90CAF9", high = "#1565C0", guide = "none") +
  scale_y_continuous(labels = scales::percent, limits = c(0, 0.22)) +
  labs(title = "Claim Frequency by Vehicle Value Quintile",
       x = "Vehicle Value Quintile", y = "Claim Frequency (%)")

p_vval_lr + p_vval_cf

Statistical insight: Contrary to the initial hypothesis, vehicle value does not produce a simple monotonic relationship with loss ratio. Loss ratios across quintiles are Q1: 72.3%, Q2: 69.3%, Q3: 69.0%, Q4: 67.2%, Q5: 73.3% — a U-shaped pattern where mid-value vehicles (Q3–Q4) are actually the most profitable. The top quintile (Q5, vehicles over ~€34,000) has a slightly elevated loss ratio at 73.3%, primarily driven by severity on expensive repairs. Claim frequency, however, does rise monotonically from 10.9% (Q1) to 17.9% (Q5), confirming higher-value vehicles are more frequently claimed. The vehicle value loading in theft and property damage premiums (correlation r ≈ 0.33–0.41) appears to partially but not fully compensate for the elevated frequency at the top end.

5.6 Urban vs Rural Dimension

Urban vs Rural loss ratio, claim frequency, and premium trends
area_year <- df |>
  group_by(circulation_area, year) |>
  summarise(
    n = n(),
    loss_ratio = sum(total_incurred,na.rm=TRUE)/sum(total_premium,na.rm=TRUE),
    claim_freq = mean(has_claim, na.rm=TRUE),
    avg_premium = mean(total_premium, na.rm=TRUE),
    .groups = "drop"
  )

p_lr_area_yr <- area_year |>
  ggplot(aes(year, loss_ratio, colour=circulation_area, group=circulation_area)) +
  geom_line(linewidth=1.4) +
  geom_point(size=4) +
  geom_text(aes(label=scales::percent(loss_ratio,accuracy=0.1)),
            vjust=-1.1, size=3.5) +
  scale_colour_manual(values=pal_area) +
  scale_y_continuous(labels=scales::percent, limits=c(0.6,0.82)) +
  labs(title="Loss Ratio Trend: Urban vs Rural",
       x="Year", y="Loss Ratio", colour="Area")

p_cf_age_area <- df |>
  group_by(circulation_area, driver_age_group) |>
  summarise(claim_freq=mean(has_claim,na.rm=TRUE), .groups="drop") |>
  ggplot(aes(driver_age_group, claim_freq, fill=circulation_area)) +
  geom_col(position="dodge", colour=NA, width=0.7) +
  scale_fill_manual(values=pal_area) +
  scale_y_continuous(labels=scales::percent) +
  labs(title="Claim Frequency: Urban vs Rural by Age",
       x="Age Group", y="Claim Frequency", fill="Area")

p_lr_area_yr + p_cf_age_area

Statistical insight: Urban policyholders have a higher overall loss ratio (71.4% vs 68.9% rural), a gap of 2.5 percentage points. However, the year-by-year breakdown reveals an important nuance: in 2022, rural policies (70.5%) were actually less profitable than urban (61.8%), with the positions stabilising in 2023 and then diverging sharply in 2024 (urban: 76.3%, rural: 72.6%). The age-stratified analysis confirms the urban premium persists across all age groups, with young urban drivers (18–25) showing the highest claim frequency at 22.2% versus 18.9% for young rural drivers. This urban-rural interaction with driver age supports a two-dimensional rating structure rather than treating the two factors independently.

5.7 Correlation Analysis

Technique justification: The Pearson correlation heatmap provides a compact overview of linear relationships across multiple numerical variables simultaneously. This is a standard first-pass tool in feature engineering for actuarial models, identifying redundant variables (high multicollinearity) and informative predictors (moderate correlations with claims outcomes).

Pearson correlation matrix for premium components and risk variables
cor_vars <- df |>
  select(
    Total_Premium   = total_premium,
    Liab_Prem       = liability_premium,
    Property_Prem   = property_damage_premium,
    Theft_Prem      = theft_premium,
    Glass_Prem      = glass_premium,
    Total_Incurred  = total_incurred,
    Total_Claims    = total_claims,
    Driver_Age      = driver_age,
    Vehicle_Age     = vehicle_age,
    Veh_Value       = vehicle_value,
    Pwr_Weight      = power_to_weight_ratio,
    Exposure        = total_exposure
  )

corr_mat <- cor(cor_vars, use = "complete.obs", method = "pearson")

ggcorrplot(
  corr_mat,
  method    = "square",
  type      = "lower",
  lab       = TRUE,
  lab_size  = 3,
  colors    = c("#D32F2F","#FAFAFA","#1565C0"),
  outline.color = "white",
  tl.cex    = 10,
  tl.col    = "grey20"
) +
  labs(title = "Pearson Correlation Matrix",
       subtitle = "Premium components, claims outcomes, and risk characteristics") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Statistical insight: Premium components are moderately to strongly correlated with total premium (liability r = 0.68, property damage r = 0.75, theft r = 0.49, glass r = 0.52), reflecting their derivation from a common pricing model. Total incurred claims show a very weak positive correlation with premium (r ≈ 0.07) — considerably weaker than commonly assumed — underscoring the highly stochastic nature of insurance outcomes where premium is a prospective expectation and incurred is a realised outcome dominated by rare large losses. Vehicle value is positively correlated with total premium (r ≈ 0.32) and with theft and property damage premiums (r ≈ 0.33–0.41), validating actuarially sound rating factor application, though the relationship is weaker than the original text implied. Driver age has a very weak negative correlation with claim counts (r ≈ −0.03), consistent with the frequency-age relationship identified in Section 5.2 but indicating that age alone explains minimal variance in claims — multivariate modelling is essential.


6 Summary of Visual Analytics Techniques

Visual Analytics Techniques Applied in This Investigation
Technique ggplot2 / Extension Variable Types Analytical Purpose
Histogram + Density Overlay geom_histogram + geom_density Continuous Reveal distributional shape and skewness
Ridge Plot (ggridges) geom_density_ridges (ggridges) Continuous × Categorical Compare premium/loss distributions across groups
Bar Chart (sorted) geom_col + fct_reorder Categorical Rank category frequencies; spot imbalances
Stacked/Proportional Bar geom_col (position=fill) Categorical × Categorical Reveal compositional differences across groups
Line Chart (temporal) geom_line + geom_point Continuous × Temporal Identify trends, acceleration, and inflection points
Grouped Bar Chart geom_col (position=dodge) Categorical × Continuous Compare risk metrics across ordered categories
Violin Plot geom_violin Categorical × Continuous Show full profit distribution including tails
Correlation Heatmap (ggcorrplot) ggcorrplot() Continuous × Continuous (multi) Identify collinearity and informative predictors
Scatter Plot geom_point Continuous × Continuous Explore premium-profit relationships and outliers

7 Key Findings and Analytical Conclusions

7.1 Profitability

  1. Overall loss ratio of 70.3% was stable at ~66% in 2022–2023, then spiked to 75% in 2024 — a 9 pp single-year deterioration requiring urgent root cause investigation before pricing action.
  2. Comprehensive (No Excess) (COMP_N) is the least profitable product line (~93.9% loss ratio); the absence of any excess removes the policyholder’s cost-sharing incentive.
  3. Third Party policies generate the best underwriting margins (~54.3%) but represent only 1.6% of the portfolio.
  4. Vehicle value shows a U-shaped relationship with loss ratio — mid-value vehicles (Q3–Q4) are the most profitable; claim frequency rises monotonically from 10.9% (Q1) to 17.9% (Q5).

7.2 Risk Segmentation Factors

  1. Bonus score discriminates clearly: both Bad (79.0%) and Neutral (81.4%) policyholders significantly exceed Good (69.7%), with Neutral actually the worst-performing segment.
  2. Driver age shows a monotonic inverse relationship with claim frequency (20.5% → 12.1%) but a positive relationship with severity (€1,658 for 26–35 up to €1,982 for 66+) — the two dimensions move in opposite directions and must be priced separately.
  3. Urban location adds ~2.5 pp to the overall loss ratio, with a sharp 2024 divergence (urban 76.3% vs rural 72.6%).

7.3 Volatility

  1. Claim severity is highly right-skewed — a small proportion of large claims drive total incurred costs disproportionately.
  2. The correlation between premium and incurred (r ≈ 0.07) is extremely weak, confirming individual policy outcomes are largely unpredictable and portfolio diversification is essential.