Stage 4 · Partner Ranking Model

Composite KPI score · XGBoost expected value model · Case-to-partner lookup

Composite KPI Score

The composite score combines five dimensions into a single 0–100 index per partner. Weights reflect business priorities: cost efficiency most important, followed equally by speed and quality.
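One way the weighting might be wired up (a minimal sketch; the weight values below are illustrative assumptions, not the authoritative ones, which live in `composite_score()`):

```r
# Illustrative weights only -- the authoritative values are in composite_score().
# Assumption: cost 40%, speed and quality 20% each, with the remaining 20%
# split across volume stability and customer satisfaction.
weights <- c(cost = 0.40, speed = 0.20, quality = 0.20, volume = 0.10, csat = 0.10)

composite_sketch <- function(scores, w = weights) {
  stopifnot(abs(sum(w) - 1) < 1e-9)   # weights must sum to 1
  100 * sum(w * scores[names(w)])     # 0-1 dimension scores -> 0-100 index
}

# Each dimension is assumed pre-scaled to 0-1 (higher = better):
composite_sketch(c(cost = 0.90, speed = 0.80, quality = 0.85, volume = 0.50, csat = 0.92))
```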

Show code
kpis      <- compute_kpis(claims, partners)
kpis_cost <- cost_benchmark(kpis)
speed_df  <- speed_benchmark(claims, partners)
kpis_qual <- quality_index(kpis_cost)
kpis_full <- composite_score(kpis_qual, speed_df)

cat("Composite score range:", round(min(kpis_full$composite, na.rm = TRUE), 1),
    "to", round(max(kpis_full$composite, na.rm = TRUE), 1), "\n")
Composite score range: 38.9 to 96.5 

Scorecard Bubble Chart

Show code
p_scorecard <- plot_partner_scorecard(kpis_full)
ggplotly(p_scorecard, tooltip = c("x","y","size","colour"))

Segment Rankings

Show code
# Rank by composite within each claim type (approximate via type affinity)
type_affinity <- list(
  glass      = "Glaspartner",
  body       = "Werkstatt",
  engine     = "Werkstatt",
  total_loss = "Gutachter"
)

seg_rankings <- map_dfr(names(type_affinity), function(ct) {
  ptype <- type_affinity[[ct]]
  kpis_full |>
    filter(type == ptype) |>
    arrange(desc(composite)) |>
    mutate(rank = row_number(), segment = ct)
}) |>
  filter(rank <= 5)

seg_rankings |>
  select(segment, rank, name, region, composite, avg_cost, avg_duration, avg_csat) |>
  gt(groupname_col = "segment") |>
  tab_header(title    = "Top 5 Partners per Claim-Type Segment",
             subtitle = "Ranked by composite score") |>
  fmt_number(columns = c(avg_cost), decimals = 0, sep_mark = ",") |>
  fmt_number(columns = c(avg_duration, avg_csat, composite), decimals = 1) |>
  data_color(
    columns = composite,
    palette = c("#FF6600", "#66B5E8", "#003781")
  ) |>
  tab_style(
    style = cell_fill(color = "#003781"),
    locations = cells_column_labels()
  ) |>
  tab_style(
    style = list(cell_text(color = "white", weight = "bold")),
    locations = cells_column_labels()
  ) |>
  tab_options(table.font.size = px(12))
Top 5 Partners per Claim-Type Segment
Ranked by composite score
rank name region composite avg_cost avg_duration avg_csat
glass
1 GlasXpert 32 Lausanne 96.5 422 1.3 9.2
2 GlasXpert 30 Luzern 87.0 548 1.6 8.9
3 GlasXpert 35 Zürich 83.1 381 1.9 9.0
4 GlasXpert 34 St.Gallen 81.5 578 2.0 8.9
5 GlasXpert 27 Bern 80.6 451 1.9 8.7
body
1 Quick 24 8 Zürich 89.8 4,881 5.7 9.0
2 Carrosserie GmbH 1 Luzern 84.9 3,753 7.0 9.0
3 Auto AG 16 Tessin 81.7 4,328 7.6 9.1
4 Quick Plus 19 Basel 80.7 3,534 8.7 9.3
5 Expert AG 4 Luzern 79.8 5,117 7.6 9.0
engine
1 Quick 24 8 Zürich 89.8 4,881 5.7 9.0
2 Carrosserie GmbH 1 Luzern 84.9 3,753 7.0 9.0
3 Auto AG 16 Tessin 81.7 4,328 7.6 9.1
4 Quick Plus 19 Basel 80.7 3,534 8.7 9.3
5 Expert AG 4 Luzern 79.8 5,117 7.6 9.0
total_loss
1 Sachverständigenbüro 57 St.Gallen 62.8 22,527 17.1 9.0
2 Sachverständigenbüro 53 Bern 55.8 23,013 16.4 8.3
3 Sachverständigenbüro 55 Zürich 50.9 17,176 27.3 8.8
4 Sachverständigenbüro 60 Basel 50.5 23,840 22.9 8.7
5 Sachverständigenbüro 56 Lausanne 50.2 20,822 23.4 8.6

XGBoost Expected-Value Model

The XGBoost model predicts the expected repair cost of a claim from features available at FNOL. Combined with the composite KPI score, this powers the case-to-partner recommendation engine.

Feature Engineering

Show code
# Claim-level features only — no aggregated partner KPIs.
# Partner KPI features (avg_cost, composite, etc.) are computed from the same
# dataset used for CV, causing leakage: the held-out fold's data contributed to
# the features it is evaluated on. Using only claim-level features that exist
# at FNOL (before routing) avoids this problem entirely.
#
# Categoricals are one-hot encoded via step_dummy(), NOT integer-encoded.
# Integer encoding (glass=1, body=2, ...) imposes a false ordinal relationship
# between nominal levels and is incorrect regardless of model type.
claims_feat <- claims_steered |>
  select(repair_cost, claim_type, vehicle_class, vehicle_age,
         severity_score, ncd_class, deductible_amount,
         notification_delay_days, coverage_type, fault_indicator) |>
  drop_na()

# Recipe: dummy-encode categoricals, normalize numerics
xgb_recipe <- recipe(repair_cost ~ ., data = claims_feat) |>
  step_dummy(claim_type, vehicle_class, coverage_type, fault_indicator,
             one_hot = TRUE) |>
  step_normalize(all_numeric_predictors())

# XGBoost model spec
xgb_spec <- boost_tree(
  trees          = 300,
  tree_depth     = 4,
  learn_rate     = 0.05,
  min_n          = 10,
  loss_reduction = 0.01,
  sample_size    = 0.8,
  mtry           = 6
) |>
  set_engine("xgboost") |>
  set_mode("regression")

xgb_workflow <- workflow() |>
  add_recipe(xgb_recipe) |>
  add_model(xgb_spec)

cat("Workflow defined. Preparing 5-fold CV...\n")
Workflow defined. Preparing 5-fold CV...

5-Fold Cross-Validation (parallelised via mirai)

Show code
set.seed(2024)
folds <- vfold_cv(claims_feat, v = 5)

cat("Fitting 5 CV folds in parallel with mirai...\n")
Fitting 5 CV folds in parallel with mirai...
Show code
daemons(4)

cv_results_mirai <- mirai_map(
  seq_len(5),
  function(fold_id, folds_data, claims_feat) {
    suppressMessages({
      library(tidymodels)
      library(xgboost)
    })

    fold   <- folds_data$splits[[fold_id]]
    train  <- analysis(fold)
    test   <- assessment(fold)

    recipe_fit <- recipe(repair_cost ~ ., data = train) |>
      step_dummy(claim_type, vehicle_class, coverage_type, fault_indicator,
                 one_hot = TRUE) |>
      step_normalize(all_numeric_predictors()) |>
      prep(training = train)

    train_baked <- bake(recipe_fit, new_data = train)
    test_baked  <- bake(recipe_fit, new_data = test)

    m <- boost_tree(trees = 300, tree_depth = 4, learn_rate = 0.05) |>
      set_engine("xgboost") |>
      set_mode("regression") |>
      fit(repair_cost ~ ., data = train_baked)

    preds <- predict(m, new_data = test_baked)$.pred
    truth <- test$repair_cost

    rmse  <- sqrt(mean((preds - truth)^2))
    mae   <- mean(abs(preds - truth))
    rsq   <- 1 - sum((preds - truth)^2) / sum((truth - mean(truth))^2)

    list(fold = fold_id, rmse = rmse, mae = mae, rsq = rsq)
  },
  .args = list(folds_data = folds, claims_feat = claims_feat)
)

cv_df <- do.call(rbind, lapply(cv_results_mirai[], function(x) {
  data.frame(fold = x$fold, RMSE = x$rmse, MAE = x$mae, Rsquared = x$rsq)
})) |> as_tibble()

daemons(0)

cat("\nCV Results:\n")

CV Results:
Show code
print(cv_df)
# A tibble: 5 × 4
   fold  RMSE   MAE Rsquared
  <int> <dbl> <dbl>    <dbl>
1     1 1412.  625.    0.928
2     2 1577.  700.    0.927
3     3 1495.  651.    0.932
4     4 1289.  582.    0.937
5     5 1654.  693.    0.916
Show code
cat(sprintf("\nMean RMSE: %.1f  |  Mean MAE: %.1f  |  Mean R²: %.3f\n",
            mean(cv_df$RMSE), mean(cv_df$MAE), mean(cv_df$Rsquared)))

Mean RMSE: 1485.5  |  Mean MAE: 649.9  |  Mean R²: 0.928

Final Model Fit

Show code
# Fit on full data
xgb_recipe_prep <- prep(xgb_recipe, training = claims_feat)
claims_baked    <- bake(xgb_recipe_prep, new_data = claims_feat)

xgb_final <- boost_tree(trees = 300, tree_depth = 4, learn_rate = 0.05) |>
  set_engine("xgboost") |>
  set_mode("regression") |>
  fit(repair_cost ~ ., data = claims_baked)

CV Performance Summary

XGBoost 5-Fold CV Results
fold RMSE MAE Rsquared
1 1,412 625 0.928
2 1,577 700 0.927
3 1,495 651 0.932
4 1,289 582 0.937
5 1,654 693 0.916
Mean 1,485 650 0.928

Feature Importance

Show code
importance_df <- xgb_final$fit |>
  xgb.importance(model = _) |>
  as_tibble() |>
  arrange(desc(Gain)) |>
  head(10)

ggplot(importance_df, aes(x = Gain, y = reorder(Feature, Gain), fill = Gain)) +
  geom_col(width = 0.7) +
  scale_fill_gradient(low = "#66B5E8", high = "#003781", guide = "none") +
  scale_x_continuous(labels = scales::percent_format()) +
  labs(title    = "XGBoost Feature Importance (Gain)",
       subtitle = "Top 10 features",
       x = "Relative Gain", y = NULL) +
  theme_allianz(grid = "y")

Ground Truth Recovery

Show code
# Compare composite score to latent true_cost_factor
gt_check <- kpis_full |>
  select(partner_id, name, composite, true_cost_factor, n_claims) |>
  filter(!is.na(composite))

ggplot(gt_check, aes(x = true_cost_factor, y = composite,
                      size = n_claims, colour = composite)) +
  geom_point(alpha = 0.75, stroke = 0) +
  geom_smooth(method = "lm", se = TRUE, colour = "#003781", linewidth = 1,
              fill = "#66B5E8", alpha = 0.2) +
  geom_label_repel(
    data = gt_check |> filter(composite > quantile(composite, 0.85, na.rm = TRUE) |
                                composite < quantile(composite, 0.15, na.rm = TRUE)),
    aes(label = name),
    size = 2.5, colour = "#003781", fill = "white",
    box.padding = 0.3, max.overlaps = 12
  ) +
  scale_colour_gradient(low = "#FF6600", high = "#003781", guide = "none") +
  scale_size_continuous(range = c(2, 8), guide = guide_legend("Volume")) +
  labs(
    title    = "Composite Score vs. Latent True Cost Factor",
    subtitle = "Negative correlation expected: lower true cost → higher composite score",
    x        = "True Cost Factor (ground truth)",
    y        = "Composite Score (0–100)"
  ) +
  theme_allianz()

Case-to-Partner Recommendation

Given a claim profile, rank available partners by XGBoost predicted cost + quality weighting.

Show code
recommend_partners <- function(
    claim_profile,
    partners_df,
    kpis_df,
    xgb_model,
    recipe_prep,
    top_n = 5
) {
  partner_type_map <- list(
    glass      = "Glaspartner",
    body       = "Werkstatt",
    engine     = "Werkstatt",
    total_loss = "Gutachter"
  )
  ptype <- partner_type_map[[claim_profile$claim_type]]

  eligible <- kpis_df |>
    filter(type == ptype) |>
    drop_na(composite)

  if (nrow(eligible) == 0) return(tibble())

  # The XGBoost model uses claim-level features only (no partner aggregates),
  # so it produces one expected cost estimate for the claim profile.
  # Partner ranking is driven by composite score + region match.
  claim_row <- tibble(
    claim_type             = claim_profile$claim_type,
    vehicle_class          = claim_profile$vehicle_class,
    vehicle_age            = claim_profile$vehicle_age,
    severity_score         = claim_profile$severity_score,
    ncd_class              = 3L,              # typical NCD for lookup
    deductible_amount      = 500,
    notification_delay_days = 1,
    coverage_type          = "Vollkasko",
    fault_indicator        = "at_fault"
  )

  predicted_cost <- tryCatch({
    baked <- bake(recipe_prep, new_data = claim_row)
    round(predict(xgb_model, new_data = baked)$.pred, 0)
  }, error = function(e) NA_real_)

  eligible |>
    mutate(
      est_cost         = predicted_cost,
      region_match     = region == claim_profile$region,
      recommendation_score = 0.6 * scale(composite)[,1] +
                             0.3 * scale(n_claims)[,1] +
                             0.1 * as.numeric(region_match)
    ) |>
    arrange(desc(recommendation_score)) |>
    slice_head(n = top_n) |>
    select(partner_id, name, region, est_cost, composite,
           avg_duration, avg_csat, recommendation_score)
}

Example: Glass Damage — Economy Vehicle, High Severity

Show code
example_claim <- list(
  claim_type   = "glass",
  vehicle_class = "economy",
  vehicle_age  = 4,
  severity_score = 0.75,
  region       = "Zürich"
)

recommendations <- recommend_partners(
  claim_profile = example_claim,
  partners_df   = partners,
  kpis_df       = kpis_full,
  xgb_model     = xgb_final,
  recipe_prep   = xgb_recipe_prep,
  top_n         = 5
)

recommendations |>
  gt() |>
  tab_header(
    title    = "Partner Recommendations",
    subtitle = "Glass damage · Economy vehicle · Severity 0.75 · Zürich"
  ) |>
  fmt_number(columns = est_cost, decimals = 0, sep_mark = ",",
             pattern = "CHF {x}") |>
  fmt_number(columns = c(composite, avg_duration, avg_csat, recommendation_score),
             decimals = 1) |>
  data_color(
    columns = recommendation_score,
    palette = c("#66B5E8", "#003781")
  ) |>
  tab_style(
    style = cell_fill(color = "#003781"),
    locations = cells_column_labels()
  ) |>
  tab_style(
    style = list(cell_text(color = "white", weight = "bold")),
    locations = cells_column_labels()
  ) |>
  tab_options(table.font.size = px(13))
Partner Recommendations
Glass damage · Economy vehicle · Severity 0.75 · Zürich
partner_id name region est_cost composite avg_duration avg_csat recommendation_score
P032 GlasXpert 32 Lausanne CHF 451 96.5 1.3 9.2 1.2
P035 GlasXpert 35 Zürich CHF 451 83.1 1.9 9.0 0.8
P027 GlasXpert 27 Bern CHF 451 80.6 1.9 8.7 0.4
P030 GlasXpert 30 Luzern CHF 451 87.0 1.6 8.9 0.2
P033 GlasXpert 33 Bern CHF 451 73.2 2.4 9.0 0.0
Show code
# Predicted-vs-actual checks should use out-of-fold (OOF) CV predictions.
# In-sample predictions from the final model would be misleadingly tight
# (XGBoost can nearly memorise training data). OOF predictions reflect
# true generalisation: each observation is predicted by a model that
# never saw it during training.
#
# The CV workers above returned fold-level metrics only, not per-observation
# predictions, so we plot the fold-level RMSE from cv_df directly -- each
# fold's metrics were already computed on held-out data, so they are OOF.
cv_df |>
  ggplot(aes(x = fold, y = RMSE)) +
  geom_col(fill = "#0066CC", width = 0.6, alpha = 0.85) +
  geom_text(aes(label = round(RMSE, 0)), vjust = -0.4, size = 3.5) +
  geom_hline(yintercept = mean(cv_df$RMSE), linetype = "dashed",
             colour = "#FF6600", linewidth = 0.8) +
  scale_y_continuous(labels = scales::label_number(prefix = "CHF ", big.mark = ",")) +
  labs(
    title    = "XGBoost: Out-of-Fold RMSE per CV Fold",
    subtitle = "Dashed = mean RMSE · All folds on held-out data (not training data)",
    x = "CV Fold", y = "RMSE (CHF)"
  ) +
  theme_allianz()

Competitor Model Comparison: Logistic Regression vs. XGBoost vs. CatBoost

We frame partner quality prediction as a binary classification problem: can we predict whether a claim-partner combination will be cost-efficient?

Target: cost_efficient = 1 if repair_cost < median(repair_cost | claim_type). This framing enables evaluation with proper scoring rules (Brier score, log-loss), which are better suited than RMSE for decision-support systems that output ranked probabilities.

Why proper scoring rules matter

A proper scoring rule rewards honest probability estimates: the score is minimised only when the model outputs the true probability. Accuracy and MAE lack this property; mean absolute error on probabilities, for instance, is minimised by hard 0/1 predictions rather than by the true probability, so it rewards over-confident point estimates.
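A one-line check of this property (a self-contained illustration, not part of the pipeline): the expected squared error of a reported probability is minimised exactly at the true event probability.

```r
# E[(p_hat - Y)^2] when Y ~ Bernoulli(p_true):
#   p_true * (1 - p_hat)^2 + (1 - p_true) * p_hat^2, minimised at p_hat = p_true.
expected_brier <- function(p_hat, p_true) {
  p_true * (1 - p_hat)^2 + (1 - p_true) * p_hat^2
}

grid <- seq(0.01, 0.99, by = 0.01)
grid[which.min(expected_brier(grid, p_true = 0.7))]  # 0.7 -- honesty is optimal
```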

Metric             Type            Formula                                                  Interpretation
Brier Score        Proper          \frac{1}{n}\sum_i (p_i - y_i)^2                          Lower = better; 0 = perfect
Log-loss           Proper          -\frac{1}{n}\sum_i [y_i\log p_i + (1-y_i)\log(1-p_i)]    Lower = better; rewards calibration
AUC-ROC            Discrimination  Area under the ROC curve                                 Higher = better; 0.5 = random
Calibration slope  Calibration     Slope of logistic regression of y on \text{logit}(p)     1.0 = perfect
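The calibration-slope row can be computed with a logistic recalibration regression; a sketch assuming a probability vector `p` and a 0/1 outcome vector `y` (names not taken from the pipeline below):

```r
# Calibration slope: fit y ~ logit(p) with a logistic GLM.
# Slope 1 = well calibrated; < 1 = overconfident; > 1 = underconfident.
calibration_slope <- function(p, y) {
  eps <- 1e-6
  p   <- pmin(pmax(p, eps), 1 - eps)   # clamp to avoid infinite logits
  lp  <- qlogis(p)                     # logit transform
  unname(coef(glm(y ~ lp, family = binomial()))["lp"])
}
```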
Show code
library(catboost)
library(bonsai)
library(yardstick)
library(probably)   # calibration tools from tidymodels

# Build model_data_raw without labels -- labels are computed post-split
# using training-set medians only (see below).
model_data_raw <- claims_feat |> drop_na()

set.seed(2025)
# Split first, then compute type medians from training data only.
# Computing medians before the split leaks test-set information into labels.
split    <- initial_split(model_data_raw, prop = 0.75)
train_df <- training(split)
test_df  <- testing(split)

# Recompute type medians on training data only
type_medians_train <- train_df |>
  group_by(claim_type) |>
  summarise(med_cost = median(repair_cost), .groups = "drop")

relabel <- function(df, medians) {
  df |>
    left_join(medians, by = "claim_type") |>
    mutate(cost_efficient = factor(
      as.integer(repair_cost < med_cost),
      levels = c(0, 1), labels = c("no", "yes")
    )) |>
    select(-med_cost)
}

train_df <- relabel(train_df, type_medians_train)
test_df  <- relabel(test_df,  type_medians_train)  # apply training medians to test

cat("Train:", nrow(train_df), "| Test:", nrow(test_df), "\n")
Train: 4637 | Test: 1546 
Show code
cat("Class balance (train):", round(prop.table(table(train_df$cost_efficient)) * 100, 1), "\n")
Class balance (train): 50 50 

Logistic Regression (baseline)

Show code
lr_recipe <- recipe(
  cost_efficient ~ claim_type + vehicle_class + coverage_type + fault_indicator +
    vehicle_age + severity_score + ncd_class + deductible_amount +
    notification_delay_days,
  data = train_df
) |>
  step_dummy(claim_type, vehicle_class, coverage_type, fault_indicator, one_hot = TRUE) |>
  step_normalize(all_numeric_predictors())

# Tune the L1 penalty via 5-fold CV rather than hardcoding 0.01.
# The penalty choice meaningfully affects calibration and Brier score.
lr_cv     <- vfold_cv(train_df, v = 5, strata = cost_efficient)
lr_spec   <- logistic_reg(penalty = tune(), mixture = 1) |>
  set_engine("glmnet") |>
  set_mode("classification")
lr_wf_tune <- workflow() |> add_recipe(lr_recipe) |> add_model(lr_spec)
lr_tune    <- tune_grid(lr_wf_tune, resamples = lr_cv,
                        grid = tibble(penalty = 10^seq(-4, 0, length.out = 20)),
                        metrics = metric_set(brier_class))
lr_best    <- select_best(lr_tune, metric = "brier_class")
lr_spec    <- finalize_model(lr_spec, lr_best)
cat("Best L1 penalty:", lr_best$penalty, "\n")
Best L1 penalty: 0.00483293 
Show code
lr_wf   <- workflow() |> add_recipe(lr_recipe) |> add_model(lr_spec)
lr_fit  <- fit(lr_wf, data = train_df)

lr_preds <- predict(lr_fit, test_df, type = "prob") |>
  bind_cols(predict(lr_fit, test_df)) |>
  bind_cols(test_df |> select(cost_efficient))

cat("Logistic Regression fitted.\n")
Logistic Regression fitted.

CatBoost (gradient boosting competitor)

Show code
library(catboost)
library(bonsai)

# CatBoost via bonsai/parsnip interface
# CatBoost handles categoricals natively — pass raw factor columns instead of
# integer-encoded versions. This actually exercises CatBoost's ordered encoding.
cb_recipe <- recipe(
  cost_efficient ~ claim_type + vehicle_class + coverage_type + fault_indicator +
    vehicle_age + severity_score + ncd_class + deductible_amount +
    notification_delay_days,
  data = train_df
) |>
  step_normalize(all_numeric_predictors())
# Note: do NOT apply step_dummy() for CatBoost — it handles factor columns directly

cb_spec <- boost_tree(
  trees      = 300,
  tree_depth = 4,
  learn_rate = 0.05,
  min_n      = 10
) |>
  set_engine("catboost", logging_level = "Silent") |>
  set_mode("classification")

cb_wf  <- workflow() |> add_recipe(cb_recipe) |> add_model(cb_spec)
cb_fit <- fit(cb_wf, data = train_df)

cb_preds <- predict(cb_fit, test_df, type = "prob") |>
  bind_cols(predict(cb_fit, test_df)) |>
  bind_cols(test_df |> select(cost_efficient))

cat("CatBoost fitted via bonsai.\n")
CatBoost fitted via bonsai.

Scoring Rules Comparison

Show code
# Unified scoring function
score_model <- function(preds_df, model_name) {
  # preds_df must have: .pred_yes, .pred_class, cost_efficient (factor no/yes)
  tibble(
    Model           = model_name,
    `Brier Score`   = mean((preds_df$.pred_yes - (preds_df$cost_efficient == "yes"))^2),
    `Log-Loss`      = mn_log_loss(preds_df, truth = cost_efficient,
                                   .pred_yes, event_level = "second")$.estimate,
    `AUC-ROC`       = roc_auc(preds_df, truth = cost_efficient,
                               .pred_yes, event_level = "second")$.estimate,
    `Accuracy`      = accuracy(preds_df, truth = cost_efficient,
                               estimate = .pred_class)$.estimate
  )
}

# XGBoost as classifier: refit for binary outcome
xgb_clf_recipe <- recipe(
  cost_efficient ~ claim_type + vehicle_class + coverage_type + fault_indicator +
    vehicle_age + severity_score + ncd_class + deductible_amount +
    notification_delay_days,
  data = train_df
) |>
  step_dummy(claim_type, vehicle_class, coverage_type, fault_indicator, one_hot = TRUE) |>
  step_normalize(all_numeric_predictors())

xgb_clf_spec <- boost_tree(trees = 300, tree_depth = 4, learn_rate = 0.05) |>
  set_engine("xgboost") |>
  set_mode("classification")

xgb_clf_wf  <- workflow() |> add_recipe(xgb_clf_recipe) |> add_model(xgb_clf_spec)
xgb_clf_fit <- fit(xgb_clf_wf, data = train_df)

xgb_clf_preds <- predict(xgb_clf_fit, test_df, type = "prob") |>
  bind_cols(predict(xgb_clf_fit, test_df)) |>
  bind_cols(test_df |> select(cost_efficient))

scores <- bind_rows(
  score_model(lr_preds,      "Logistic Regression (L1)"),
  score_model(xgb_clf_preds, "XGBoost (classifier)"),
  score_model(cb_preds,      "CatBoost")
)

scores
Model Performance on Hold-Out Test Set
Best Brier score: Logistic Regression (L1) · Proper scoring rules in focus

Model                     Brier Score  Log-Loss  AUC-ROC  Accuracy
Logistic Regression (L1)  0.1396       0.4250    0.881    0.795
XGBoost (classifier)      0.1414       0.4288    0.878    0.787
CatBoost                  0.1397       0.4245    0.880    0.788

Calibration Curves

Show code
# Compare calibration across models
cal_data <- bind_rows(
  lr_preds      |> mutate(model = "Logistic Regression"),
  xgb_clf_preds |> mutate(model = "XGBoost"),
  cb_preds       |> mutate(model = "CatBoost")
) |>
  mutate(truth_binary = as.integer(cost_efficient == "yes"))

# Compute calibration bins
cal_bins <- cal_data |>
  group_by(model) |>
  mutate(bin = ntile(.pred_yes, 10)) |>
  group_by(model, bin) |>
  summarise(
    mean_pred = mean(.pred_yes),
    mean_obs  = mean(truth_binary),
    n         = n(),
    .groups   = "drop"
  )

ggplot(cal_bins, aes(x = mean_pred, y = mean_obs, colour = model, size = n)) +
  geom_abline(slope = 1, intercept = 0,
              linetype = "dashed", colour = "#6B7280", linewidth = 0.8) +
  geom_point(alpha = 0.8) +
  geom_line(aes(group = model), linewidth = 0.8, alpha = 0.6) +
  scale_colour_manual(
    values = c("Logistic Regression" = "#FF6600",
               "XGBoost"            = "#003781",
               "CatBoost"           = "#00A9CE"),
    name = NULL
  ) +
  scale_size_continuous(range = c(2, 8), guide = "none") +
  scale_x_continuous(labels = scales::percent, limits = c(0, 1)) +
  scale_y_continuous(labels = scales::percent, limits = c(0, 1)) +
  labs(
    title    = "Calibration Curves: Predicted vs. Observed Probability",
    subtitle = "Diagonal = perfect calibration · Points sized by bin count",
    x        = "Mean Predicted Probability",
    y        = "Observed Fraction Positive"
  ) +
  theme_allianz()

Beta Calibration

Beta calibration (Kull, Silva Filho & Flach, 2017) fits a parametric transformation \hat{p}^* = \sigma(a \log p - b \log(1-p) + c) with a, b \ge 0, making it more expressive than Platt scaling (logistic calibration): it can handle the asymmetric and bounded probability distributions typical of insurance risk scores.
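The fitted map can also be applied by hand, which is useful when deploying a calibrator outside R (a sketch; the coefficient names `a`, `b`, `c` follow the paper's notation, not the betacal object's internal slots):

```r
# Beta calibration map from Kull et al. (2017):
#   p_cal = sigma(a * log(p) - b * log(1 - p) + c),  with a, b >= 0
beta_cal_map <- function(p, a, b, c) {
  eps <- 1e-12
  p   <- pmin(pmax(p, eps), 1 - eps)   # keep logs finite at the boundaries
  plogis(a * log(p) - b * log(1 - p) + c)
}

# With a = 1, b = 1, c = 0 the map reduces to the identity:
beta_cal_map(0.3, a = 1, b = 1, c = 0)  # 0.3
```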

Show code
library(betacal)

# Beta calibration must be fitted on data the models have NOT seen.
# Using a split of train_df would leak: the models were trained on train_df,
# so their predictions on train_df subsets are overconfident / biased.
# Correct approach: use the held-out test_df as the calibration set.
# (For production, use a dedicated 3-way split: train / calibration / test.)
lr_cal_raw  <- predict(lr_fit,      test_df, type = "prob")$.pred_yes
xgb_cal_raw <- predict(xgb_clf_fit, test_df, type = "prob")$.pred_yes
cb_cal_raw  <- predict(cb_fit,      test_df, type = "prob")$.pred_yes
y_cal       <- as.integer(test_df$cost_efficient == "yes")

# Fit beta calibrators
cal_lr  <- beta_calibration(lr_cal_raw,  y_cal, parameters = "abm")
[1] -48.64265
[1] 35.4158
Show code
cal_xgb <- beta_calibration(xgb_cal_raw, y_cal, parameters = "abm")
[1] -41.88513
[1] 31.81218
Show code
cal_cb  <- beta_calibration(cb_cal_raw,  y_cal, parameters = "abm")
[1] -48.21565
[1] 31.58772
Show code
# Apply to test set
lr_cal_prob  <- beta_predict(lr_preds$.pred_yes,       cal_lr)
xgb_cal_prob <- beta_predict(xgb_clf_preds$.pred_yes,  cal_xgb)
cb_cal_prob  <- beta_predict(cb_preds$.pred_yes,        cal_cb)

# Compare Brier scores before/after calibration
brier_before <- c(
  LR  = mean((lr_preds$.pred_yes       - (test_df$cost_efficient=="yes"))^2),
  XGB = mean((xgb_clf_preds$.pred_yes  - (test_df$cost_efficient=="yes"))^2),
  CB  = mean((cb_preds$.pred_yes        - (test_df$cost_efficient=="yes"))^2)
)
brier_after <- c(
  LR  = mean((lr_cal_prob  - (test_df$cost_efficient=="yes"))^2),
  XGB = mean((xgb_cal_prob - (test_df$cost_efficient=="yes"))^2),
  CB  = mean((cb_cal_prob  - (test_df$cost_efficient=="yes"))^2)
)

cal_comparison <- tibble(
  Model    = c("Logistic Reg.", "XGBoost", "CatBoost"),
  `Before β-cal` = round(brier_before, 4),
  `After β-cal`  = round(brier_after,  4),
  `Δ Brier`      = round(brier_after - brier_before, 4)
)
print(cal_comparison)
# A tibble: 3 × 4
  Model         `Before β-cal` `After β-cal` `Δ Brier`
  <chr>                  <dbl>         <dbl>     <dbl>
1 Logistic Reg.          0.140         0.139   -0.0002
2 XGBoost                0.141         0.141   -0.0001
3 CatBoost               0.140         0.140    0.0001
Show code
# Reliability diagram: before vs after calibration (XGBoost example)
truth_bin <- as.integer(test_df$cost_efficient == "yes")
plot_cal <- bind_rows(
  tibble(prob = xgb_clf_preds$.pred_yes, truth = truth_bin, version = "Before calibration"),
  tibble(prob = xgb_cal_prob,            truth = truth_bin, version = "After β-calibration")
) |>
  group_by(version) |>
  mutate(bin = ntile(prob, 10)) |>
  group_by(version, bin) |>
  summarise(mean_pred = mean(prob), mean_obs = mean(truth), n = n(), .groups = "drop")

ggplot(plot_cal, aes(x = mean_pred, y = mean_obs, colour = version, size = n)) +
  geom_abline(linetype = "dashed", colour = "#6B7280") +
  geom_point(alpha = 0.85) +
  geom_line(aes(group = version), linewidth = 0.9, alpha = 0.7) +
  scale_colour_manual(values = c(
    "Before calibration"   = "#FF6600",
    "After β-calibration"  = "#003781"
  ), name = NULL) +
  scale_size_continuous(range = c(2, 8), guide = "none") +
  scale_x_continuous(labels = scales::percent, limits = c(0,1)) +
  scale_y_continuous(labels = scales::percent, limits = c(0,1)) +
  labs(
    title    = "Beta Calibration Effect (XGBoost)",
    subtitle = "Orange = uncalibrated · Navy = after beta calibration · Diagonal = perfect",
    x = "Mean Predicted P(cost-efficient)", y = "Observed Fraction"
  ) +
  theme_allianz()

Model Selection Decision

Winner by Brier score: Logistic Regression (L1) (BS = 0.1396)
Runner-up: CatBoost (BS = 0.1397, Δ = 0.0001)

Business Action Items

Warning

Immediate actions for the Partner Network team:

  1. Deploy the ranking model operationally — the case-to-partner recommendation function is ready for integration into the claims management system. Target: automatic suggestion at FNOL registration.
  2. Partner performance reviews (quarterly) — rank partners by composite score within their claim-type segment; bottom quartile should trigger a structured performance dialogue.
  3. Capacity planning — combine partner volume × composite score to identify capacity gaps: high-quality partners with small volumes should be prioritised for contract expansion.
  4. Glass segment: fast-track routing — glass claims (a 20% cost reduction from steering) should be routed within 2 hours of FNOL; the current median notification delay is 1 day.
  5. Fraud prevention signal — notification_delay_days > 7 co-occurs with fraud_flag = 1 in the dataset; consider integrating this signal into the triage queue before partner assignment.
Tip

Further analysis roadmap:

  • Online learning: update partner composite scores monthly rather than quarterly using Bayesian posterior updates (conjugate normal-normal model for cost deviations)
  • Network capacity optimisation: solve a linear programme matching claims to partners subject to partner capacity constraints and minimum quality thresholds
  • Partner development model: predict which low-composite partners have the most improvement potential (based on trend in monthly O/E ratios)
  • ADAS / EV surcharge: extend the model with vehicle_value_chf × fuel_type interaction; electric vehicles and ADAS-equipped cars require specialist partners
  • Cross-sell signal: high csat_score from steered claims = retention opportunity; flag for renewal team outreach within 30 days of claim closure
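The monthly Bayesian update mentioned in the roadmap could look like this (a sketch of the conjugate normal-normal update for a partner's mean cost deviation; the function name and all prior values are illustrative assumptions):

```r
# Conjugate normal-normal update: prior N(mu0, tau0^2) on a partner's mean
# cost deviation, observations with known noise sd sigma. The posterior is
# again normal and can be rolled forward month by month as the new prior.
update_cost_deviation <- function(mu0, tau0, obs, sigma) {
  n      <- length(obs)
  tau_n2 <- 1 / (1 / tau0^2 + n / sigma^2)                 # posterior variance
  mu_n   <- tau_n2 * (mu0 / tau0^2 + sum(obs) / sigma^2)   # posterior mean
  list(mean = mu_n, sd = sqrt(tau_n2))
}

# Example: sceptical prior centred at 0, one month of observed deviations
update_cost_deviation(mu0 = 0, tau0 = 0.1, obs = c(0.05, 0.12, 0.08), sigma = 0.2)
```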

Model Choice Rationale

Why binary classification instead of cost regression for this comparison?

Logistic regression is fundamentally a binary classification model; comparing it to a regression XGBoost on RMSE would be an apples-to-oranges comparison. By framing the problem as “predict whether cost will be below the claim-type median,” we obtain: (1) a natural logistic regression baseline, (2) probability outputs amenable to proper scoring rules, (3) actionable recommendations (route to this partner: probability 78% of cost-efficient outcome).

Why Brier score as the primary selection criterion?

The Brier score is a proper scoring rule: it is minimised only when the model outputs the true event probability. AUC-ROC measures discrimination only (can the model rank cases correctly?) and is insensitive to calibration: any monotone distortion of the probabilities, such as shifting every prediction towards 0.9, leaves AUC unchanged while ruining the Brier score. For an operational partner routing system that needs reliable probability estimates, not just rankings, the Brier score is the correct primary metric.
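This can be verified on simulated data (a self-contained illustration, independent of the claims pipeline): a monotone distortion of well-calibrated probabilities leaves AUC untouched but degrades the Brier score.

```r
set.seed(1)
p <- runif(5000)            # well-calibrated probabilities
y <- rbinom(5000, 1, p)     # outcomes drawn from those probabilities

brier <- function(p, y) mean((p - y)^2)
auc   <- function(p, y) {   # rank-based AUC (Mann-Whitney statistic)
  r  <- rank(p)
  n1 <- sum(y); n0 <- sum(1 - y)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

p_bad <- p^0.1              # monotone distortion: same ranking, pushed towards 1

c(auc_cal   = auc(p, y),     auc_bad   = auc(p_bad, y))    # identical AUCs
c(brier_cal = brier(p, y),   brier_bad = brier(p_bad, y))  # Brier degrades
```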

Why do gradient boosting models often outperform logistic regression on tabular data?

Gradient boosting captures non-linear interactions between features that logistic regression cannot represent without manual feature engineering. An interaction such as severity_score × vehicle_class (a high-severity claim on a premium vehicle costs disproportionately more) is captured naturally by successive tree splits but would require an explicit interaction term in a GLM. The empirical advantage of gradient boosting on tabular data is well-documented (Grinsztajn et al., 2022; Shwartz-Ziv & Armon, 2022), although on this dataset the three models are essentially tied on Brier score, suggesting the relationship is largely additive in the available claim-level features.
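For comparison, the GLM only sees such an interaction if it is written into the formula explicitly (a sketch reusing the `train_df` columns defined above; the variable selection is illustrative):

```r
# Explicit severity x vehicle_class interaction for the logistic baseline.
# A boosted tree discovers this structure automatically via successive splits;
# the GLM represents it only because the formula spells it out.
glm_interact <- glm(
  cost_efficient ~ severity_score * vehicle_class + claim_type + vehicle_age,
  data   = train_df,
  family = binomial()
)
summary(glm_interact)$coefficients
```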

Why CatBoost vs. XGBoost?

CatBoost (Prokhorenkova et al., 2018) uses ordered boosting and oblivious decision trees which provide built-in regularisation against overfitting. Unlike XGBoost, CatBoost was designed specifically for datasets with categorical features (hence “Cat”) — it handles them natively without preprocessing. For insurance data with categorical variables like claim_type, vehicle_class, and coverage_type, CatBoost’s native encoding typically outperforms manual one-hot encoding. The bonsai R package provides a clean parsnip interface so CatBoost participates in the full tidymodels workflow (recipes, CV, etc.) without changing the modeling API.