Stage 3 · Causal Steering Effect

IPW · AIPW (doubly robust) · CATE by segment · marginaleffects

Why Naive Comparisons Fail

Network-steered claims are not a random sample. Claim type, severity, and region all influence both the probability of steering and the outcome (cost). A naive comparison confounds the treatment effect with selection.


# Show covariate imbalance before adjustment
claims |>
  group_by(steering_flag, claim_type) |>
  summarise(avg_severity = mean(severity_score), .groups = "drop") |>
  mutate(steered = factor(steering_flag, labels = c("Not steered", "Steered"))) |>
  ggplot(aes(x = claim_type, y = avg_severity, fill = steered)) +
  geom_col(position = "dodge", width = 0.7) +
  scale_fill_manual(values = c("Not steered" = "#66B5E8", "Steered" = "#003781"),
                    name = NULL) +
  scale_y_continuous(limits = c(0, 0.5)) +
  labs(title    = "Selection Bias: Severity Differs Between Steered vs. Non-Steered",
       subtitle = "Steered claims tend to have different severity profiles",
       x = "Claim Type", y = "Average Severity Score") +
  theme_allianz()
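To make the selection problem concrete, here is a minimal simulation sketch in base R. It assumes a single confounder (severity) that drives both steering and cost; every variable name and number in it is illustrative and not taken from the claims data.

```r
# Illustrative simulation (all numbers are assumptions, not the claims data).
set.seed(1)
n        <- 20000
severity <- runif(n)                           # confounder
# Low-severity claims are more likely to be steered
steered  <- rbinom(n, 1, plogis(1.5 - 3 * severity))
true_effect <- -0.10                           # steering lowers log-cost by 0.10
log_cost <- 7 + 2 * severity + true_effect * steered + rnorm(n, sd = 0.3)

# Naive comparison: raw difference in group means
naive <- mean(log_cost[steered == 1]) - mean(log_cost[steered == 0])

# Adjusted comparison: condition on the confounder
adjusted <- unname(coef(lm(log_cost ~ steered + severity))["steered"])

round(c(truth = true_effect, naive = naive, adjusted = adjusted), 3)
```

In this toy setup the naive difference overstates the saving because steered claims are less severe to begin with, while the adjusted coefficient lands close to the simulated effect.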

Propensity Score Model


ps_result <- fit_propensity(claims)
claims_ps <- ps_result$data
m_ps      <- ps_result$model

# Summary
cat("Propensity model AUC:", round(
  pROC::auc(pROC::roc(claims$steering_flag, fitted(m_ps), quiet = TRUE)), 3
), "\n")

Propensity model AUC: 0.707

Propensity Score Distribution


ggplot(claims_ps, aes(x = ps, fill = factor(steering_flag))) +
  geom_histogram(binwidth = 0.025, alpha = 0.7, position = "identity") +
  scale_fill_manual(
    values = c("0" = "#66B5E8", "1" = "#003781"),
    labels = c("0" = "Not steered", "1" = "Steered"),
    name   = NULL
  ) +
  facet_wrap(~claim_type, nrow = 1) +
  labs(title    = "Propensity Score Distribution by Claim Type",
       subtitle = "Good overlap is required for valid IPW estimates",
       x = "Propensity Score P(Steered | X)", y = "Count") +
  theme_allianz()

Propensity Model Coefficients (sjPlot)


plot_model(m_ps, type = "est",
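           # plot_model() exponentiates logistic coefficients by default,
           # so the plotted estimates are odds ratios, not log-odds
           title = "Propensity Model: Odds Ratios for Steering",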
           colors = c("#FF6600", "#003781")) +
  theme_allianz()

Marginal Predictions (ggeffects)


gg_ps <- ggpredict(m_ps, terms = c("severity_score [all]", "claim_type"))
plot(gg_ps) +
  scale_colour_manual(values = c(
    glass="#003781", body="#0066CC", engine="#00A9CE", total_loss="#FF6600"
  )) +
  labs(title    = "P(Steered) by Severity and Claim Type",
       subtitle = "Marginal predictions from propensity model",
       x = "Severity Score", y = "P(Steered)") +
  theme_allianz()

Overlap (Positivity) Diagnostic

Before relying on the IPW weights, check for positivity violations. Both IPW and AIPW require 0 < P(treated | X) < 1 for all covariate combinations; propensity scores near zero or near one inflate the IPW weights and destabilise the doubly-robust correction.


overlap_check <- claims_ps |>
  group_by(claim_type) |>
  summarise(
    n              = n(),
    pct_ps_low     = mean(ps < 0.05) * 100,   # near-zero: almost never steered
    pct_ps_high    = mean(ps > 0.95) * 100,   # near-one:  almost always steered
    max_trim_ipw   = max(trim_ipw),
    median_trim_ipw = median(trim_ipw),
    .groups = "drop"
  )

overlap_check |>
  gt() |>
  tab_header(
    title    = "Propensity Score Overlap Diagnostics",
    subtitle = "Rows with PS < 0.05 or > 0.95 signal potential positivity violations"
  ) |>
  # pattern appends the percent sign (fmt_number() has no `suffix` argument)
  fmt_number(columns = c(pct_ps_low, pct_ps_high), decimals = 1, pattern = "{x}%") |>
  fmt_number(columns = c(max_trim_ipw, median_trim_ipw), decimals = 2) |>
  tab_style(
    style = cell_fill(color = "#FFE0CC"),
    locations = cells_body(
      rows = pct_ps_low > 5 | pct_ps_high > 5
    )
  ) |>
  tab_style(
    style = cell_fill(color = "#003781"),
    locations = cells_column_labels()
  ) |>
  tab_style(
    style = list(cell_text(color = "white", weight = "bold")),
    locations = cells_column_labels()
  ) |>
  tab_options(table.font.size = px(13))

Propensity Score Overlap Diagnostics
Rows with PS < 0.05 or > 0.95 signal potential positivity violations
claim_type	n	pct_ps_low (%)	pct_ps_high (%)	max_trim_ipw	median_trim_ipw
glass	2890	0.0	0.4	2.69	0.83
body	4079	0.0	0.0	2.69	0.93
engine	1994	0.0	0.0	2.69	0.93
total_loss	1037	0.0	0.0	2.69	0.63

Warning

If pct_ps_low or pct_ps_high exceeds ~5% for any segment, AIPW estimates for that segment should be interpreted with caution. The 99th-percentile trimming of IPW weights (trim_ipw) mitigates but does not eliminate the instability.
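The link between extreme propensity scores and unstable weights can be sketched in a few lines of base R. The propensity scores below are simulated, and the capping rule is one common convention; how fit_propensity() actually builds trim_ipw is not shown in this report, so treat the construction as an assumption.

```r
# Sketch: propensity scores near 0 or 1 inflate IPW weights; capping at the
# 99th percentile limits the damage. All values are simulated assumptions.
set.seed(2)
n       <- 5000
ps      <- rbeta(n, 2, 2)                 # hypothetical propensity scores in (0, 1)
treated <- rbinom(n, 1, ps)

# ATE weights: 1/ps for treated units, 1/(1 - ps) for controls
ipw <- ifelse(treated == 1, 1 / ps, 1 / (1 - ps))

# Cap weights at their 99th percentile (hypothetical stand-in for trim_ipw)
cap      <- quantile(ipw, 0.99)
trim_ipw <- pmin(ipw, cap)

c(max_raw = max(ipw), max_trimmed = max(trim_ipw), share_capped = mean(ipw > cap))
```

Capping bounds the largest weight but leaves the underlying lack of overlap in place, which is why the diagnostic table is still worth reading.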

IPW / AIPW Estimators

WeightIt: IPW (unstabilised ATE weights)


w_out <- weightit(
  steering_flag ~ claim_type + vehicle_class + severity_score + region + vehicle_age,
  data   = claims,
  method = "ps",
  estimand = "ATE"
)
summary(w_out)

                  Summary of weights

- Weight ranges:

          Min                                  Max
treated 1.035 |-------|                      6.679
control 1.173 |---------------------------| 17.954

- Units with the 5 most extreme weights by group:
                                           
           5290   2274   2032   2026   1924
 treated  5.909  5.954  6.183  6.509  6.679
           2512   2444   1463   1420   1221
 control 13.951 14.289 16.605 17.492 17.954

- Weight statistics:

        Coef of Var   MAD Entropy # Zeros
treated       0.336 0.228   0.047       0
control       0.559 0.375   0.120       0

- Effective Sample Sizes:

           Control Treated
Unweighted 3817.   6183.  
Weighted   2907.31 5554.98
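The effective sample sizes in this output can be read through Kish's approximation, which is equivalent to n / (1 + CV^2) of the weights. The sketch below is a base-R illustration of that formula, checked against the control-group figures printed above; it is not a call into WeightIt.

```r
# Kish's approximation: ESS = (sum of weights)^2 / sum of squared weights
ess <- function(w) sum(w)^2 / sum(w^2)

ess_equal    <- ess(rep(1, 100))          # equal weights: ESS equals n
ess_dominant <- ess(c(rep(1, 99), 50))    # one dominant weight: ESS collapses

# Rough check against the summary above: n / (1 + CV^2) for the control group
ess_control_check <- 3817 / (1 + 0.559^2)

c(ess_equal = ess_equal, ess_dominant = ess_dominant,
  ess_control_check = ess_control_check)
```

The control-group check comes out close to the reported 2907.31, with the small gap attributable to the rounded coefficient of variation.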

Balance Diagnostics (halfmoon)

Standardized Mean Differences (SMD) measure covariate balance between steered and non-steered groups before and after inverse-probability weighting. Good balance: |SMD| < 0.1.


library(halfmoon)
library(tidysmd)

balance_vars <- c("severity_score", "vehicle_age", "ncd_class",
                   "deductible_amount", "notification_delay_days")

# tidy_smd with .wts returns method="observed" + method="trim_ipw" rows
smd_df <- tidy_smd(
  claims_ps,
  all_of(balance_vars),
  .group = steering_flag,
  .wts   = trim_ipw
) |>
  mutate(
    abs_smd = abs(smd),
    method  = factor(method,
                     levels = c("observed", "trim_ipw"),
                     labels = c("Unweighted", "IPW weighted"))
  )

love_plot(smd_df) +
  geom_vline(xintercept = 0.1, linetype = "dashed", colour = "#FF6600", linewidth = 0.8) +
  scale_colour_manual(values = c(Unweighted = "#FF6600", `IPW weighted` = "#003781"),
                      aesthetics = c("colour", "fill"), name = NULL) +
  labs(
    title    = "Love Plot: Covariate Balance Before vs. After IPW",
    subtitle = "Dashed line = |SMD| = 0.1 threshold · Values left of the line = good balance",
    x        = "|Standardized Mean Difference|", y = NULL
  ) +
  theme_allianz(grid = "y")
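For readers who want to see what the love plot summarises, the sketch below computes a standardized mean difference by hand, before and after weighting, on simulated data. The pooled-SD denominator is one common convention and is an assumption here; tidysmd and halfmoon may differ in detail.

```r
# Sketch of a (weighted) standardized mean difference for one covariate.
# Denominator: pooled SD of the unweighted groups (assumed convention).
smd <- function(x, g, w = rep(1, length(x))) {
  m1 <- weighted.mean(x[g == 1], w[g == 1])
  m0 <- weighted.mean(x[g == 0], w[g == 0])
  (m1 - m0) / sqrt((var(x[g == 1]) + var(x[g == 0])) / 2)
}

set.seed(3)
n  <- 20000
x  <- rnorm(n)
ps <- plogis(0.6 * x)                    # treatment probability depends on x
g  <- rbinom(n, 1, ps)
w  <- ifelse(g == 1, 1 / ps, 1 / (1 - ps))

smd_raw <- smd(x, g)
smd_ipw <- smd(x, g, w)
c(unweighted = smd_raw, ipw = smd_ipw)
```

With correctly specified weights the weighted SMD falls well below the 0.1 rule of thumb, while the unweighted one does not.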

Doubly-Robust AIPW (with mirai bootstrap CIs)

The doubly-robust AIPW estimator is consistent if either the propensity model or the outcome model is correctly specified — providing valuable insurance against model misspecification.


# Point estimate
ate_result <- aipw_ate(claims_ps)
cat(sprintf("AIPW ATE: %.1f%% (CHF difference: %.0f)\n",
            ate_result$ate_pct, ate_result$ate_chf))

AIPW ATE: -6.0% (CHF difference: -314)
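The internals of aipw_ate() live in utils_models.R and are not reproduced in this report. As a rough illustration of the doubly-robust idea only, the following sketch implements a textbook AIPW estimator with a logistic propensity model and linear outcome models on simulated data; the function name aipw_sketch and all data are hypothetical.

```r
# Minimal AIPW sketch (a hypothetical stand-in, NOT the aipw_ate() helper).
aipw_sketch <- function(y, w, X) {
  d    <- data.frame(y = y, w = w, X)
  covs <- names(X)
  # Propensity model: logistic regression of treatment on covariates
  e  <- fitted(glm(reformulate(covs, "w"), family = binomial, data = d))
  # Outcome models: separate linear fits for treated and control units
  f  <- reformulate(covs, "y")
  m1 <- predict(lm(f, data = d[d$w == 1, ]), newdata = d)
  m0 <- predict(lm(f, data = d[d$w == 0, ]), newdata = d)
  # Per-unit scores: outcome-model contrast plus weighted residual correction
  phi <- m1 - m0 + w * (y - m1) / e - (1 - w) * (y - m0) / (1 - e)
  c(ate = mean(phi), se = sd(phi) / sqrt(length(phi)))
}

set.seed(4)
n <- 5000
X <- data.frame(x1 = rnorm(n), x2 = runif(n))
w <- rbinom(n, 1, plogis(0.5 * X$x1 - X$x2))
y <- 1 + X$x1 + 2 * X$x2 - 0.5 * w + rnorm(n)   # simulated true ATE = -0.5

res <- aipw_sketch(y, w, X)
res
```

The residual-correction terms are what give the estimator its robustness: if the outcome models are wrong but the propensity model is right, the weighted residuals remove the bias, and vice versa.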

Bootstrap CIs for AIPW (500 resamples via mirai)


cat("Bootstrapping AIPW CIs (500 resamples, 4 mirai workers)...\n")

Bootstrapping AIPW CIs (500 resamples, 4 mirai workers)...


daemons(4)

boot_ate <- mirai_map(
  seq_len(500L),
  function(b, claims, utils_path) {
    suppressMessages(library(tidyverse))
    source(utils_path, local = TRUE)
    idx <- sample(nrow(claims), nrow(claims), replace = TRUE)
    d   <- claims[idx, ]
    ps_res <- fit_propensity(d)
    res    <- aipw_ate(ps_res$data)
    res$ate_pct
  },
  .args = list(claims = claims, utils_path = "../R/utils_models.R")
)

# Failed resamples come back as mirai error values; map them to NA explicitly
boot_ate_vec <- vapply(boot_ate[], function(x) {
  if (mirai::is_error_value(x)) NA_real_ else as.numeric(x)
}, numeric(1))
daemons(0)

ci_lower <- quantile(boot_ate_vec, 0.025, na.rm = TRUE)
ci_upper <- quantile(boot_ate_vec, 0.975, na.rm = TRUE)

cat(sprintf("AIPW ATE: %.1f%% [95%% CI: %.1f%%, %.1f%%]\n",
            ate_result$ate_pct, ci_lower, ci_upper))

AIPW ATE: -6.0% [95% CI: -7.2%, -4.7%]

ATE Summary

Summary: ATE of Network Steering on Repair Cost
Estimator	ATE (%)	95% CI	Note
Naive (raw)	−44.2	n/a	Confounded by selection
IPW (WeightIt)	NA (not computed)	n/a	IPW only
AIPW (doubly robust)	−6.0	[-7.2%, -4.7%]	Doubly robust
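The IPW row in this summary carries no estimate. One way it could be filled is a normalised (Hajek) weighted-mean contrast, sketched below on simulated data. The helper name ipw_ate_pct is hypothetical; with the objects in this report one would presumably pass claims$repair_cost, claims$steering_flag and w_out$weights, but that has not been run here.

```r
# Sketch of a normalised (Hajek) IPW estimate of the ATE, expressed as a
# percent change in mean cost. Data below are simulated assumptions.
ipw_ate_pct <- function(y, w, wts) {
  mu1 <- weighted.mean(y[w == 1], wts[w == 1])   # weighted mean, treated
  mu0 <- weighted.mean(y[w == 0], wts[w == 0])   # weighted mean, control
  (mu1 / mu0 - 1) * 100
}

set.seed(5)
n    <- 20000
sev  <- runif(n)
ps   <- plogis(1 - 2 * sev)
w    <- rbinom(n, 1, ps)
# Simulated multiplicative effect: exp(-0.10) - 1, roughly -9.5%
cost <- exp(7 + 1.5 * sev - 0.10 * w + rnorm(n, sd = 0.3))
wts  <- ifelse(w == 1, 1 / ps, 1 / (1 - ps))

est <- c(naive = ipw_ate_pct(cost, w, rep(1, n)),
         ipw   = ipw_ate_pct(cost, w, wts))
round(est, 1)
```

Passing unit weights reproduces the naive comparison, which makes the helper convenient for filling the first two rows of the table with one function.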

Heterogeneous Treatment Effects (CATE)

The overall ATE masks important heterogeneity. Steering benefits vary substantially by claim type — this is where the real business value lies.

CATE by Claim Type


cat("Fitting CATE models per segment (4 mirai workers)...\n")

Fitting CATE models per segment (4 mirai workers)...


daemons(4)

cate_results_mirai <- mirai_map(
  levels(claims_ps$claim_type),
  function(ct, claims_ps, utils_path) {
    suppressMessages(library(tidyverse))
    source(utils_path, local = TRUE)
    d   <- claims_ps[claims_ps$claim_type == ct, ]
    res <- tryCatch(aipw_ate(d), error = function(e) list(ate_pct = NA_real_))
    list(claim_type = ct, cate_pct = res$ate_pct, n = nrow(d))
  },
  .args = list(claims_ps = claims_ps, utils_path = "../R/utils_models.R")
)

cate_df <- do.call(rbind, lapply(cate_results_mirai[], function(x) {
  data.frame(claim_type = x$claim_type, cate_pct = x$cate_pct, n = x$n)
})) |>
  as_tibble()

daemons(0)

# Persist CATE results for the Shiny dashboard (avoids hard-coded values there)
saveRDS(cate_df, "../data/cate_results.rds")


# Add ground truth for reference
ground_truth <- tibble(
  claim_type = c("glass","body","engine","total_loss"),
  true_ate   = c(-20, -10, -8, 0)
)

cate_df |>
  left_join(ground_truth, by = "claim_type") |>
  ggplot(aes(x = claim_type)) +
  geom_col(aes(y = cate_pct, fill = cate_pct < 0), width = 0.65, alpha = 0.9) +
  geom_point(aes(y = true_ate), shape = 18, size = 5, colour = "#FF6600",
             position = position_nudge(x = 0)) +
  geom_hline(yintercept = 0, colour = "#6B7280", linewidth = 0.5, linetype = "dashed") +
  geom_text(aes(y = cate_pct, label = sprintf("%.1f%%", cate_pct)),
            vjust = ifelse(cate_df$cate_pct < 0, 1.5, -0.5),
            size = 3.5, fontface = "bold", colour = "#1A1A1A") +
  scale_fill_manual(values = c(`TRUE` = "#00A9CE", `FALSE` = "#FF6600"), guide = "none") +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) +
  labs(
    title    = "CATE: Steering Effect on Cost by Claim Type",
    subtitle = "Bars = AIPW estimate · Orange diamonds = ground truth",
    caption  = "Ground truth: glass −20%, body −10%, engine −8%, total_loss ~0%",
    x        = "Claim Type",
    y        = "Cost Change (%)"
  ) +
  theme_allianz()

CATE by Region


cate_region <- claims_ps |>
  group_by(region) |>
  group_map(function(d, k) {
    res <- tryCatch(aipw_ate(d), error = function(e) list(ate_pct = NA_real_))
    tibble(region = k$region, cate_pct = res$ate_pct, n = nrow(d))
  }) |>
  bind_rows()

p_region <- ggplot(cate_region, aes(x = reorder(region, cate_pct), y = cate_pct,
                                     fill = cate_pct)) +
  geom_col(width = 0.7) +
  geom_hline(yintercept = 0, colour = "#6B7280", linetype = "dashed") +
  # Map hjust inside aes() so each label is justified by its own row's sign
  geom_text(aes(label = sprintf("%.1f%%\n(n=%d)", cate_pct, n),
                hjust = ifelse(cate_pct < 0, 1.1, -0.1)),
            size = 2.8, colour = "#1A1A1A") +
  scale_fill_gradient2(
    low = "#FF6600", mid = "#66B5E8", high = "#003781", midpoint = -10,
    guide = "none"
  ) +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) +
  coord_flip() +
  labs(
    title    = "CATE by Swiss Region",
    subtitle = "Regional heterogeneity in steering benefit",
    x        = NULL, y = "Estimated Cost Change (%)"
  ) +
  theme_allianz(grid = "y")

ggplotly(p_region)

Sensitivity Check: marginaleffects (OLS baseline)

The AIPW estimator above and the causal forest below are the primary estimators. As a sensitivity check, marginaleffects::avg_comparisons() on a simple OLS interaction model should produce directionally consistent CATEs. If it does not, the more flexible methods should be trusted.


# OLS with interaction — NOT the preferred estimator (assumes homoscedastic errors
# on log-cost), but serves as a quick sanity check against AIPW.
m_outcome <- lm(log(repair_cost) ~ steering_flag * claim_type +
                  vehicle_class + severity_score + region + vehicle_age,
                data = claims)

avg_comp <- avg_comparisons(
  m_outcome,
  variables = "steering_flag",
  by        = "claim_type"
)

# Check directional agreement with AIPW
left_join(
  as_tibble(avg_comp) |> select(claim_type, ols_est = estimate, ols_lo = conf.low, ols_hi = conf.high),
  cate_df |> select(claim_type, aipw_pct = cate_pct),
  by = "claim_type"
) |>
  # Compare signs only: OLS estimates are log-points, AIPW estimates are percent
  mutate(agree = sign(ols_est) == sign(aipw_pct)) |>
  gt() |>
  tab_header(title = "Sensitivity: OLS marginaleffects vs. AIPW",
             subtitle = "Both estimators should agree in direction if assumptions hold") |>
  fmt_number(columns = c(ols_est, ols_lo, ols_hi), decimals = 4) |>
  # pattern appends the percent sign (fmt_number() has no `suffix` argument)
  fmt_number(columns = aipw_pct, decimals = 1, pattern = "{x}%") |>
  tab_style(style = cell_fill(color = "#003781"), locations = cells_column_labels()) |>
  tab_style(style = list(cell_text(color = "white", weight = "bold")),
            locations = cells_column_labels()) |>
  tab_options(table.font.size = px(13))

Sensitivity: OLS marginaleffects vs. AIPW
Both estimators should agree in direction if assumptions hold
claim_type	ols_est	ols_lo	ols_hi	aipw_pct	agree
glass	−0.1807	−0.1952	−0.1662	NA	NA
body	−0.0972	−0.1079	−0.0865	NA	NA
engine	−0.0655	−0.0802	−0.0508	NA	NA
total_loss	−0.0473	−0.0687	−0.0259	NA	NA
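The OLS estimates in this table are on the log-cost scale, whereas the AIPW and ground-truth figures are percentages. A short conversion, using the point estimates exactly as printed, puts them on a comparable footing; this is plain arithmetic rather than part of the original pipeline.

```r
# Convert the OLS log-point estimates from the table into percent changes
ols_est <- c(glass = -0.1807, body = -0.0972, engine = -0.0655, total_loss = -0.0473)
ols_pct <- (exp(ols_est) - 1) * 100
round(ols_pct, 1)
```

On the percent scale the glass estimate is about −16.5% and the total-loss estimate about −4.6%, so the ranking of segments matches the ground truth even though the magnitudes differ.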

Business Interpretation

Note

Steering delivers the most value for glass and body damage claims. Engine damage shows moderate benefits. For total loss, steering provides little or no cost advantage (the ground-truth effect is approximately zero, and the OLS check above shows the smallest effect of any segment), so resources should focus on quality (CSAT) and speed rather than cost optimisation in this segment.

Steering Strategy Recommendations by Claim Type
Claim Type	CATE (est.)	Ground Truth	Volume Share	Recommendation
Glass	NA	−20%	30%	Prioritise steering: large cost saving
Body	NA	−10%	40%	High priority: moderate saving at high volume
Engine	NA	−8%	20%	Moderate priority: select quality partners
Total Loss	NA	~0%	10%	No cost advantage: focus on speed & CSAT

Action Items

Warning

Immediate steering optimisation opportunities:

Glass claims: maximise steering rate — a −20% cost effect at scale (30% of claims) represents the single largest cost-reduction lever. Target steering rate > 85% for glass.
Body damage: partner quality gate. Steer to partners with a composite score > 60 and an average O/E ratio < 1.05. The −10% effect is real but heterogeneous by partner.
Total loss: switch priority metric — since steering provides no cost benefit, rank total loss partners by duration_days and csat_score instead. Speed of settlement drives customer satisfaction in total loss cases.
Engine damage: segment by vehicle class — CATE for engine damage varies by vehicle class; premium vehicles with ADAS systems have a higher partner-skill dependency.
New variables to activate: fraud_flag × notification_delay_days as a real-time triage signal; incorporate into the steering decision logic.

Tip

Further analysis required:

Longitudinal CATE analysis: does the steering benefit change over time (partner learning effects)?
Mediation analysis: what fraction of the cost reduction is due to partner quality vs. speed?
Compliance analysis: why do some steerable claims not get steered? (identify friction points)
Test coverage_type (Vollkasko vs Teilkasko) as a moderator for the steering effect

Model Choice Rationale

Why doubly-robust AIPW instead of simple regression or matching?

The doubly-robust AIPW estimator provides valid causal inference if either the propensity model or the outcome model is correctly specified — but not necessarily both. This is practically important: we cannot be certain our logistic propensity model fully captures the assignment mechanism, and our outcome model inevitably misspecifies the true functional form. AIPW exploits both models simultaneously, offering a safety margin that neither IPW alone nor simple outcome regression provides.

Why not matching (MatchIt)? Matching discards unmatched units (typically 20–40% of data), reducing statistical power. For a 10,000-claim dataset, this sacrifice is unnecessary. The IPW approach retains all observations and achieves balance through re-weighting. The trimming at the 99th percentile of weights prevents extreme observations from dominating.

Why bootstrap CIs for AIPW instead of analytical variance estimation? The semiparametric variance of AIPW involves the influence function of both nuisance models simultaneously. The bootstrap approximation, parallelised via mirai, avoids these derivations and is asymptotically equivalent under mild regularity conditions.
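For reference, the analytical alternative mentioned here is usually written in terms of per-claim AIPW scores. The notation below is introduced for illustration and is an assumption, not taken from utils_models.R: $\hat e$ is the fitted propensity, $\hat m_1, \hat m_0$ are the fitted outcome models, $W_i$ is the steering flag and $Y_i$ the cost.

$$
\hat\varphi_i = \hat m_1(X_i) - \hat m_0(X_i)
  + \frac{W_i\,\bigl(Y_i - \hat m_1(X_i)\bigr)}{\hat e(X_i)}
  - \frac{(1 - W_i)\,\bigl(Y_i - \hat m_0(X_i)\bigr)}{1 - \hat e(X_i)},
\qquad
\hat\tau_{\text{AIPW}} = \frac{1}{n}\sum_{i=1}^{n} \hat\varphi_i
$$

$$
\widehat{\mathrm{SE}}\bigl(\hat\tau_{\text{AIPW}}\bigr)
  = \sqrt{\frac{1}{n(n-1)}\sum_{i=1}^{n}\bigl(\hat\varphi_i - \hat\tau_{\text{AIPW}}\bigr)^2}
$$

This plug-in standard error is only a sketch of the standard large-sample result; it treats the nuisance estimates as fixed, which is the simplification the bootstrap sidesteps by refitting both models in every resample.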

Why segment CATE by claim type? The Conditional Average Treatment Effect (CATE) is the causally correct estimand when treatment effects are heterogeneous across subgroups. Reporting a single ATE would mask the zero effect in total loss (and could justify wasteful steering resources in that segment). Segment-level CATE directly maps to the steering strategy decision for each peril type.
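A small piece of arithmetic shows how a single average hides the segment pattern. It uses the ground-truth effects and approximate volume shares from the recommendations table; weighting by claim counts rather than by cost is a simplifying assumption.

```r
# Ground-truth segment effects (%) and approximate volume shares from the
# recommendations table; count-based weighting is a simplifying assumption.
true_cate <- c(glass = -20, body = -10, engine = -8, total_loss = 0)
share     <- c(glass = 0.30, body = 0.40, engine = 0.20, total_loss = 0.10)

avg_effect <- weighted.mean(true_cate, share)
avg_effect
```

The count-weighted average is a moderate reduction that describes none of the four segments well. A cost-weighted average would sit closer to zero, since total-loss claims are expensive and carry no effect, which may partly explain why the pooled estimates are smaller in magnitude than the glass and body effects.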

Causal ML: Honest Causal Forests (grf)

Causal forests (Wager & Athey, 2018) estimate heterogeneous treatment effects at the individual level using honest random forests — a non-parametric causal ML alternative to the parametric AIPW approach used above. Key advantages:

Non-parametric: no functional form assumption for $\tau(x)$
Honest splitting: the sample used to build the tree structure is separate from the sample used to estimate leaf values, providing valid confidence intervals
Adaptive: the forest automatically focuses splitting power on regions of covariate space with high treatment effect heterogeneity
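The honesty idea in the list above can be illustrated with a single hand-rolled split. This toy sketch is not grf's algorithm: it assumes a randomised treatment, one covariate and a coarse grid of candidate thresholds, and all data are simulated.

```r
# Toy illustration of "honesty" (not grf's actual algorithm): one half of the
# sample chooses the split, the other half estimates the leaf effects.
set.seed(6)
n   <- 4000
x   <- runif(n)
w   <- rbinom(n, 1, 0.5)                       # randomised treatment for simplicity
tau <- ifelse(x > 0.5, -2, 0)                  # simulated effect differs across x
y   <- tau * w + rnorm(n)

# Difference in mean outcome, treated minus control, within a subset
diff_means <- function(idx) mean(y[idx & w == 1]) - mean(y[idx & w == 0])

half_a <- seq_len(n) <= n / 2                  # structure sample
half_b <- !half_a                              # estimation sample

# Step 1 (half A): pick the threshold that maximises the gap in leaf effects
grid <- seq(0.2, 0.8, by = 0.1)
gap  <- sapply(grid, function(s)
  abs(diff_means(half_a & x <= s) - diff_means(half_a & x > s)))
split_at <- grid[which.max(gap)]

# Step 2 (half B): estimate leaf effects on data not used to choose the split
honest_est <- c(split = split_at,
                left  = diff_means(half_b & x <= split_at),
                right = diff_means(half_b & x > split_at))
honest_est
```

Because the leaf estimates come from data that played no role in choosing the threshold, they are not biased by the search for the largest gap, which is the property that supports valid confidence intervals.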


library(grf)

# Prepare matrices
X_vars <- c("severity_score", "vehicle_age", "ncd_class", "deductible_amount",
             "notification_delay_days")
X_cat  <- c("claim_type", "vehicle_class", "region", "coverage_type", "fault_indicator")

# Numeric covariates enter as-is
X_num <- claims |>
  select(all_of(X_vars)) |>
  as.matrix()

# One-hot encode categoricals (no intercept, so the first factor keeps all levels)
X_enc <- model.matrix(reformulate(X_cat, intercept = FALSE), data = claims)

X  <- cbind(X_num, X_enc)
# Note on scale: grf requires a continuous Y; we use log(cost) to reduce the
# influence of large outliers and stabilise variance. The AIPW in utils_models.R
# uses a Gamma GLM on the original CHF scale. When comparing CATE estimates
# below, both are converted to % change — they should agree directionally but
# may differ in magnitude due to the scale difference (log-normal vs Gamma).
Y  <- log(claims$repair_cost)
W  <- claims$steering_flag

set.seed(2024)
cf <- causal_forest(
  X         = X,
  Y         = Y,
  W         = W,
  num.trees = 2000,
  tune.parameters = "all"
)

# Individual-level CATE estimates
tau_hat <- predict(cf, estimate.variance = TRUE)
claims_cf <- claims |>
  mutate(
    cate        = tau_hat$predictions,
    cate_pct    = (exp(cate) - 1) * 100,
    cate_se     = sqrt(tau_hat$variance.estimates),
    cate_lo     = (exp(cate - 1.96 * cate_se) - 1) * 100,
    cate_hi     = (exp(cate + 1.96 * cate_se) - 1) * 100
  )

cat("Causal forest ATE:", round((exp(average_treatment_effect(cf)[["estimate"]]) - 1) * 100, 1), "%\n")

Causal forest ATE: -10.4 %


# Calibration test: checks whether the forest's average prediction is right and
# whether the estimated heterogeneity is real or noise. It fits the best linear
# predictor of the treatment effect on the mean forest prediction and the
# differential (demeaned) forest prediction; a coefficient near 1 on each term
# indicates good calibration.
tc <- test_calibration(cf)
cat("\nCalibration test (best linear fit):\n")


Calibration test (best linear fit):


print(tc)


Best linear fit using forest predictions (on held-out data)
as well as the mean forest prediction as regressors, along
with one-sided heteroskedasticity-robust (HC3) SEs:

                               Estimate Std. Error t value    Pr(>t)    
mean.forest.prediction         1.000894   0.030926  32.364 < 2.2e-16 ***
differential.forest.prediction 1.027284   0.074255  13.835 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


# Compute the ATE once and back-transform the log-scale bounds to percent
ate_cf <- average_treatment_effect(cf)
ci_cf  <- ate_cf[["estimate"]] + c(-1.96, 1.96) * ate_cf[["std.err"]]
cat("95% CI: [",
    round((exp(ci_cf[1]) - 1) * 100, 1),
    ",",
    round((exp(ci_cf[2]) - 1) * 100, 1),
    "]\n")

95% CI: [ -11.1 , -9.8 ]

CATE Distribution


ggplot(claims_cf, aes(x = cate_pct, fill = claim_type)) +
  geom_histogram(binwidth = 2, alpha = 0.75, colour = "white", linewidth = 0.2) +
  geom_vline(xintercept = 0, linetype = "dashed", colour = "#FF6600", linewidth = 0.8) +
  scale_fill_manual(
    values = c(glass="#003781", body="#0066CC", engine="#00A9CE", total_loss="#FF6600"),
    name   = "Claim Type"
  ) +
  facet_wrap(~claim_type, nrow = 1, scales = "free_y") +
  scale_x_continuous(labels = scales::percent_format(scale = 1)) +
  labs(
    title    = "Distribution of Individual CATE Estimates (Causal Forest)",
    subtitle = "Bar height = number of claims per 2-point bin · Dashed = zero effect · Colour = claim type",
    x        = "Estimated Treatment Effect on Cost (%)", y = "Count"
  ) +
  theme_allianz() +
  theme(legend.position = "none")

Comparing AIPW vs. Causal Forest CATEs


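# Segment-level estimates with valid CIs: grf's doubly-robust average effect on
# each claim-type subset. Averaging per-claim CI bounds would not give a valid
# interval for the segment mean.
cf_segment <- lapply(unique(as.character(claims$claim_type)), function(ct) {
  est <- average_treatment_effect(cf, subset = claims$claim_type == ct)
  b   <- est[["estimate"]] + c(0, -1.96, 1.96) * est[["std.err"]]
  tibble(claim_type  = ct,
         cf_cate_pct = (exp(b[1]) - 1) * 100,
         cf_lo       = (exp(b[2]) - 1) * 100,
         cf_hi       = (exp(b[3]) - 1) * 100)
}) |>
  bind_rows()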

cate_compare <- left_join(
  cate_df,
  cf_segment |> rename(cf_cate = cf_cate_pct),
  by = "claim_type"
)

cate_compare_long <- cate_compare |>
  pivot_longer(c(cate_pct, cf_cate),
               names_to = "method", values_to = "estimate") |>
  mutate(method = recode(method,
                         cate_pct = "AIPW (doubly robust)",
                         cf_cate  = "Causal Forest (grf)"))

ggplot(cate_compare_long, aes(x = claim_type, y = estimate,
                               colour = method, group = method)) +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "#6B7280") +
  geom_line(linewidth = 1, alpha = 0.7) +
  geom_point(size = 4) +
  geom_errorbar(
    data = filter(cate_compare_long, method == "Causal Forest (grf)"),
    aes(ymin = cf_lo, ymax = cf_hi), width = 0.2, linewidth = 0.8
  ) +
  scale_colour_manual(
    values = c("AIPW (doubly robust)" = "#003781", "Causal Forest (grf)" = "#00A9CE"),
    name   = NULL
  ) +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) +
  labs(
    title    = "AIPW vs. Causal Forest CATE Estimates",
    subtitle = "Error bars = 95% CI from causal forest · Both methods should broadly agree",
    x        = "Claim Type", y = "Estimated Cost Change (%)"
  ) +
  theme_allianz()

Note

Causal forest interpretation: If AIPW and causal forest estimates broadly agree (within their respective CIs), this triangulation increases confidence in the causal estimates. Disagreements reveal where parametric assumptions in the AIPW propensity or outcome model may be violated — the causal forest provides a more flexible non-parametric check. The causal forest additionally provides individual-level CATE estimates, enabling claim-by-claim decisions about whether network steering is expected to save costs.

--- title: "Stage 3 · Causal Steering Effect" subtitle: "IPW · AIPW (doubly robust) · CATE by segment · marginaleffects" --- ```{r setup} #| include: false library(tidyverse) library(WeightIt) library(MatchIt) library(marginaleffects) library(ggeffects) library(sjPlot) library(halfmoon) library(tidysmd) library(gtsummary) library(gtExtras) library(mirai) library(plotly) library(gt) library(patchwork) source("../R/utils_viz.R") source("../R/utils_models.R") claims <- readRDS("../data/claims.rds") partners <- readRDS("../data/partners.rds") ``` ## Why Naive Comparisons Fail Network-steered claims are not a random sample. Claim type, severity, and region all influence both the probability of steering *and* the outcome (cost). A naive comparison confounds the treatment effect with selection. ```{r overlap-check} #| fig-height: 4 # Show covariate imbalance before adjustment claims |> group_by(steering_flag, claim_type) |> summarise(avg_severity = mean(severity_score), .groups = "drop") |> mutate(steered = factor(steering_flag, labels = c("Not steered", "Steered"))) |> ggplot(aes(x = claim_type, y = avg_severity, fill = steered)) + geom_col(position = "dodge", width = 0.7) + scale_fill_manual(values = c("Not steered" = "#66B5E8", "Steered" = "#003781"), name = NULL) + scale_y_continuous(limits = c(0, 0.5)) + labs(title = "Selection Bias: Severity Differs Between Steered vs. 
Non-Steered", subtitle = "Steered claims tend to have different severity profiles", x = "Claim Type", y = "Average Severity Score") + theme_allianz() ``` ## Propensity Score Model ```{r propensity-model} ps_result <- fit_propensity(claims) claims_ps <- ps_result$data m_ps <- ps_result$model # Summary cat("Propensity model AUC:", round( pROC::auc(pROC::roc(claims$steering_flag, fitted(m_ps), quiet = TRUE)), 3 ), "\n") ``` ### Propensity Score Distribution ```{r ps-distribution} #| fig-height: 4 ggplot(claims_ps, aes(x = ps, fill = factor(steering_flag))) + geom_histogram(binwidth = 0.025, alpha = 0.7, position = "identity") + scale_fill_manual( values = c("0" = "#66B5E8", "1" = "#003781"), labels = c("0" = "Not steered", "1" = "Steered"), name = NULL ) + facet_wrap(~claim_type, nrow = 1) + labs(title = "Propensity Score Distribution by Claim Type", subtitle = "Good overlap is required for valid IPW estimates", x = "Propensity Score P(Steered | X)", y = "Count") + theme_allianz() ``` ### Propensity Model Coefficients (sjPlot) ```{r ps-coef-plot} #| fig-height: 5 plot_model(m_ps, type = "est", title = "Propensity Model: Log-Odds of Steering", colors = c("#FF6600", "#003781")) + theme_allianz() ``` ### Marginal Predictions (ggeffects) ```{r ggeffects-ps} #| fig-height: 4 gg_ps <- ggpredict(m_ps, terms = c("severity_score [all]", "claim_type")) plot(gg_ps) + scale_colour_manual(values = c( glass="#003781", body="#0066CC", engine="#00A9CE", total_loss="#FF6600" )) + labs(title = "P(Steered) by Severity and Claim Type", subtitle = "Marginal predictions from propensity model", x = "Severity Score", y = "P(Steered)") + theme_allianz() ``` ### Overlap (Positivity) Diagnostic Before computing IPW weights, check for positivity violations. AIPW requires `0 < P(treated | X) < 1` for all covariate combinations; near-zero or near-one propensity scores inflate IPW weights and destabilise the doubly-robust correction. 
```{r overlap-diagnostic} overlap_check <- claims_ps |> group_by(claim_type) |> summarise( n = n(), pct_ps_low = mean(ps < 0.05) * 100, # near-zero: almost never steered pct_ps_high = mean(ps > 0.95) * 100, # near-one: almost always steered max_trim_ipw = max(trim_ipw), median_trim_ipw = median(trim_ipw), .groups = "drop" ) overlap_check |> gt() |> tab_header( title = "Propensity Score Overlap Diagnostics", subtitle = "Rows with PS < 0.05 or > 0.95 signal potential positivity violations" ) |> fmt_number(columns = c(pct_ps_low, pct_ps_high), decimals = 1, suffix = "%") |> fmt_number(columns = c(max_trim_ipw, median_trim_ipw), decimals = 2) |> tab_style( style = cell_fill(color = "#FFE0CC"), locations = cells_body( rows = pct_ps_low > 5 | pct_ps_high > 5 ) ) |> tab_style( style = cell_fill(color = "#003781"), locations = cells_column_labels() ) |> tab_style( style = list(cell_text(color = "white", weight = "bold")), locations = cells_column_labels() ) |> tab_options(table.font.size = px(13)) ``` ::: {.callout-warning} If `pct_ps_low` or `pct_ps_high` exceeds ~5% for any segment, AIPW estimates for that segment should be interpreted with caution. The 99th-percentile trimming of IPW weights (`trim_ipw`) mitigates but does not eliminate the instability. ::: ## IPW / AIPW Estimators ### WeightIt: Stabilised IPW ```{r weightit} w_out <- weightit( steering_flag ~ claim_type + vehicle_class + severity_score + region + vehicle_age, data = claims, method = "ps", estimand = "ATE" ) summary(w_out) ``` ### Balance Diagnostics (halfmoon) Standardized Mean Differences (SMD) measure covariate balance between steered and non-steered groups before and after inverse-probability weighting. Good balance: |SMD| < 0.1. 
```{r balance-halfmoon} #| fig-height: 5 library(halfmoon) library(tidysmd) balance_vars <- c("severity_score", "vehicle_age", "ncd_class", "deductible_amount", "notification_delay_days") # tidy_smd with .wts returns method="observed" + method="trim_ipw" rows smd_df <- tidy_smd( claims_ps, all_of(balance_vars), .group = steering_flag, .wts = trim_ipw ) |> mutate( abs_smd = abs(smd), method = factor(method, levels = c("observed", "trim_ipw"), labels = c("Unweighted", "IPW weighted")) ) love_plot(smd_df) + geom_vline(xintercept = 0.1, linetype = "dashed", colour = "#FF6600", linewidth = 0.8) + scale_colour_manual(values = c(Unweighted = "#FF6600", `IPW weighted` = "#003781"), aesthetics = c("colour", "fill"), name = NULL) + labs( title = "Love Plot: Covariate Balance Before vs. After IPW", subtitle = "Dashed line = |SMD| = 0.1 threshold · Values below = good balance", x = "|Standardized Mean Difference|", y = NULL ) + theme_allianz(grid = "y") ``` ### Doubly-Robust AIPW (with mirai bootstrap CIs) The doubly-robust AIPW estimator is consistent if *either* the propensity model *or* the outcome model is correctly specified — providing valuable insurance against model misspecification. 
```{r aipw-estimate} # Point estimate ate_result <- aipw_ate(claims_ps) cat(sprintf("AIPW ATE: %.1f%% (CHF difference: %.0f)\n", ate_result$ate_pct, ate_result$ate_chf)) ``` ### Bootstrap CIs for AIPW (500 resamples via mirai) ```{r aipw-bootstrap} #| cache: false cat("Bootstrapping AIPW CIs (500 resamples, 4 mirai workers)...\n") daemons(4) boot_ate <- mirai_map( seq_len(500L), function(b, claims, utils_path) { suppressMessages(library(tidyverse)) source(utils_path, local = TRUE) idx <- sample(nrow(claims), nrow(claims), replace = TRUE) d <- claims[idx, ] ps_res <- fit_propensity(d) res <- aipw_ate(ps_res$data) res$ate_pct }, .args = list(claims = claims, utils_path = "../R/utils_models.R") ) boot_ate_vec <- vapply(boot_ate[], function(x) { tryCatch(as.numeric(x), error = function(e) NA_real_) }, numeric(1)) daemons(0) ci_lower <- quantile(boot_ate_vec, 0.025, na.rm = TRUE) ci_upper <- quantile(boot_ate_vec, 0.975, na.rm = TRUE) cat(sprintf("AIPW ATE: %.1f%% [95%% CI: %.1f%%, %.1f%%]\n", ate_result$ate_pct, ci_lower, ci_upper)) ``` ### ATE Summary ```{r ate-summary} #| echo: false tibble( Estimator = c("Naive (raw)", "IPW (WeightIt)", "AIPW (doubly robust)"), ATE_pct = c( (mean(claims$repair_cost[claims$steering_flag==1]) / mean(claims$repair_cost[claims$steering_flag==0]) - 1) * 100, NA_real_, ate_result$ate_pct ), CI_95 = c("—", "—", sprintf("[%.1f%%, %.1f%%]", ci_lower, ci_upper)), Note = c("Confounded by selection", "IPW only", "Doubly robust") ) |> gt() |> tab_header(title = "Summary: ATE of Network Steering on Repair Cost") |> fmt_number(columns = ATE_pct, decimals = 1, suffix = "%") |> tab_style( style = cell_fill(color = "#003781"), locations = cells_column_labels() ) |> tab_style( style = list(cell_text(color = "white", weight = "bold")), locations = cells_column_labels() ) |> tab_options(table.font.size = px(13)) ``` ## Heterogeneous Treatment Effects (CATE) The overall ATE masks important heterogeneity. 
Steering benefits vary substantially by claim type, and this is where the real business value lies.

### CATE by Claim Type

```{r cate-types}
#| cache: false
cat("Fitting CATE models per segment (4 mirai workers)...\n")
daemons(4)

cate_results_mirai <- mirai_map(
  levels(claims_ps$claim_type),
  function(ct, claims_ps, utils_path) {
    suppressMessages(library(tidyverse))
    source(utils_path, local = TRUE)
    d   <- claims_ps[claims_ps$claim_type == ct, ]
    res <- tryCatch(aipw_ate(d), error = function(e) list(ate_pct = NA_real_))
    list(claim_type = ct, cate_pct = res$ate_pct, n = nrow(d))
  },
  .args = list(claims_ps = claims_ps, utils_path = "../R/utils_models.R")
)

cate_df <- do.call(rbind, lapply(cate_results_mirai[], function(x) {
  data.frame(claim_type = x$claim_type, cate_pct = x$cate_pct, n = x$n)
})) |>
  as_tibble()
daemons(0)

# Persist CATE results for the Shiny dashboard (avoids hard-coded values there)
saveRDS(cate_df, "../data/cate_results.rds")
```

```{r cate-plot}
#| fig-height: 4
# Add ground truth for reference
ground_truth <- tibble(
  claim_type = c("glass", "body", "engine", "total_loss"),
  true_ate   = c(-20, -10, -8, 0)
)

cate_df |>
  left_join(ground_truth, by = "claim_type") |>
  ggplot(aes(x = claim_type)) +
  geom_col(aes(y = cate_pct, fill = cate_pct < 0), width = 0.65, alpha = 0.9) +
  geom_point(aes(y = true_ate), shape = 18, size = 5, colour = "#FF6600",
             position = position_nudge(x = 0)) +
  geom_hline(yintercept = 0, colour = "#6B7280", linewidth = 0.5,
             linetype = "dashed") +
  geom_text(aes(y = cate_pct, label = sprintf("%.1f%%", cate_pct)),
            vjust = ifelse(cate_df$cate_pct < 0, 1.5, -0.5),
            size = 3.5, fontface = "bold", colour = "#1A1A1A") +
  scale_fill_manual(values = c(`TRUE` = "#00A9CE", `FALSE` = "#FF6600"),
                    guide = "none") +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) +
  labs(
    title    = "CATE: Steering Effect on Cost by Claim Type",
    subtitle = "Bars = AIPW estimate · Orange diamonds = ground truth",
    caption  = "Ground truth: glass −20%, body −10%, engine −8%, total_loss ~0%",
    x = "Claim Type", y = "Cost Change (%)"
  ) +
  theme_allianz()
```

### CATE by Region

```{r cate-region}
#| fig-height: 5
cate_region <- claims_ps |>
  group_by(region) |>
  group_map(function(d, k) {
    res <- tryCatch(aipw_ate(d), error = function(e) list(ate_pct = NA_real_))
    tibble(region = k$region, cate_pct = res$ate_pct, n = nrow(d))
  }) |>
  bind_rows()

p_region <- ggplot(cate_region,
                   aes(x = reorder(region, cate_pct), y = cate_pct,
                       fill = cate_pct)) +
  geom_col(width = 0.7) +
  geom_hline(yintercept = 0, colour = "#6B7280", linetype = "dashed") +
  # hjust is matched to data rows, not to the reordered axis, so it must
  # follow the row order of cate_region
  geom_text(aes(label = sprintf("%.1f%%\n(n=%d)", cate_pct, n)),
            hjust = ifelse(cate_region$cate_pct < 0, 1.1, -0.1),
            size = 2.8, colour = "#1A1A1A") +
  scale_fill_gradient2(
    low = "#FF6600", mid = "#66B5E8", high = "#003781",
    midpoint = -10, guide = "none"
  ) +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) +
  coord_flip() +
  labs(
    title    = "CATE by Swiss Region",
    subtitle = "Regional heterogeneity in steering benefit",
    x = NULL, y = "Estimated Cost Change (%)"
  ) +
  theme_allianz(grid = "y")

ggplotly(p_region)
```

## Sensitivity Check: marginaleffects (OLS baseline)

AIPW (above) and the causal forest (in the Causal ML section below) are the primary estimators. As a sensitivity check, `marginaleffects::avg_comparisons()` on a simple OLS interaction model should produce directionally consistent CATEs. If it does not, the likelier explanation is misspecification of the OLS model, and the more flexible estimators should take precedence.

```{r marginaleffects}
#| message: false
# OLS with interaction: NOT the preferred estimator (assumes homoscedastic errors
# on log-cost), but serves as a quick sanity check against AIPW.
m_outcome <- lm(log(repair_cost) ~ steering_flag * claim_type + vehicle_class +
                  severity_score + region + vehicle_age,
                data = claims)

# Average effect of steering within each claim type (log-cost scale)
avg_comp <- avg_comparisons(
  m_outcome,
  variables = "steering_flag",
  by        = "claim_type"
)

# Check directional agreement with AIPW
left_join(
  as_tibble(avg_comp) |>
    select(claim_type, ols_est = estimate, ols_lo = conf.low, ols_hi = conf.high),
  cate_df |>
    select(claim_type, aipw_pct = cate_pct),
  by = "claim_type"
) |>
  mutate(agree = sign(ols_est) == sign(aipw_pct / 100)) |>
  gt() |>
  tab_header(title    = "Sensitivity: OLS marginaleffects vs. AIPW",
             subtitle = "Both estimators should agree in direction if assumptions hold") |>
  fmt_number(columns = c(ols_est, ols_lo, ols_hi), decimals = 4) |>
  # fmt_number() has no `suffix` argument; `pattern` appends the % sign
  fmt_number(columns = aipw_pct, decimals = 1, pattern = "{x}%") |>
  tab_style(style = cell_fill(color = "#003781"),
            locations = cells_column_labels()) |>
  tab_style(style = list(cell_text(color = "white", weight = "bold")),
            locations = cells_column_labels()) |>
  tab_options(table.font.size = px(13))
```

## Business Interpretation

::: {.callout-note}
**Steering delivers the most value for glass and body damage claims.** Engine damage shows moderate benefits. For total loss, steering provides no cost advantage, so resources in this segment should focus on quality (CSAT) and speed rather than cost optimisation.
:::

```{r business-table}
#| echo: false
# Align estimates to the display order explicitly instead of relying on the
# row order of cate_df
seg_order <- c("glass", "body", "engine", "total_loss")

tibble(
  `Claim Type`     = c("Glass", "Body", "Engine", "Total Loss"),
  `CATE (est.)`    = sprintf("%.1f%%",
                             cate_df$cate_pct[match(seg_order, cate_df$claim_type)]),
  `Ground Truth`   = c("−20%", "−10%", "−8%", "~0%"),
  `Volume Share`   = sprintf("%.0f%%", c(30, 40, 20, 10)),
  `Recommendation` = c(
    "Prioritise steering — large cost saving",
    "High priority — moderate saving at high volume",
    "Moderate priority — select quality partners",
    "No cost advantage — focus on speed & CSAT"
  )
) |>
  gt() |>
  tab_header(title = "Steering Strategy Recommendations by Claim Type") |>
  tab_style(
    style     = cell_fill(color = "#003781"),
    locations = cells_column_labels()
  ) |>
  tab_style(
    style     = list(cell_text(color = "white", weight = "bold")),
    locations = cells_column_labels()
  ) |>
  tab_options(table.font.size = px(13))
```

## Action Items

::: {.callout-warning}
**Immediate steering optimisation opportunities:**

1. **Glass claims: maximise steering rate.** A −20% cost effect at scale (30% of claims) is the single largest cost-reduction lever. Target steering rate > 85% for glass.
2. **Body damage: partner quality gate.** Steer to partners with composite score > 60 and average O/E ratio < 1.05. The −10% effect is real but heterogeneous by partner.
3. **Total loss: switch priority metric.** Since steering provides no cost benefit, rank total loss partners by `duration_days` and `csat_score` instead. Speed of settlement drives customer satisfaction in total loss cases.
4. **Engine damage: segment by vehicle class.** CATE for engine damage varies by vehicle class; premium vehicles with ADAS have a higher partner-skill dependency.
5. **New variables to activate:** `fraud_flag` × `notification_delay_days` as a real-time triage signal; incorporate into the steering decision logic.
:::

::: {.callout-tip}
**Further analysis required:**

- Longitudinal CATE analysis: does the steering benefit change over time (partner learning effects)?
- Mediation analysis: what fraction of the cost reduction is due to partner quality vs. speed?
- Compliance analysis: why do some steerable claims not get steered? (identify friction points)
- Test `coverage_type` (Vollkasko vs Teilkasko) as a moderator for the steering effect
:::

## Model Choice Rationale

**Why doubly-robust AIPW instead of simple regression or matching?** The AIPW estimator is consistent for the ATE if *either* the propensity model *or* the outcome model is correctly specified; it does not need both, although it still relies on no unmeasured confounding and on overlap. This matters in practice: we cannot be certain that the logistic propensity model fully captures the assignment mechanism, and the outcome model may misspecify the true functional form. AIPW uses both models at once, which gives a safety margin that neither IPW alone nor outcome regression alone provides.

**Why not matching (MatchIt)?** Matching discards unmatched units, which can be a sizeable share of the data when overlap is imperfect, and so reduces statistical power. For a 10,000-claim dataset, this sacrifice is unnecessary. The IPW approach retains all observations and achieves balance through re-weighting. Trimming at the 99th percentile of weights prevents extreme observations from dominating.

**Why bootstrap CIs for AIPW instead of analytical variance estimation?** The analytical variance of AIPW is derived from its influence function, which depends on both nuisance models. The bootstrap, parallelised via `mirai`, sidesteps that derivation by re-running the full estimation pipeline on every resample, and is asymptotically equivalent under mild regularity conditions.

**Why segment CATE by claim type?** The Conditional Average Treatment Effect (CATE) is the relevant estimand when treatment effects differ across subgroups. Reporting a single ATE would mask the zero effect in total loss and could justify wasteful steering resources in that segment. Segment-level CATE maps directly to the steering strategy decision for each claim type.
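
To make the double-robustness argument concrete, the sketch below computes an AIPW estimate on simulated data using base R only. It is an illustration under stated assumptions, not the `aipw_ate()` implementation in `utils_models.R`: the function name `aipw_sketch()`, the simulated columns `x`, `w`, `y`, and the Gaussian outcome regressions are all hypothetical choices made for this example.

```r
# Hypothetical illustration: AIPW on simulated data (not the project's aipw_ate()).
set.seed(1)
n <- 5000
x <- rnorm(n)                           # single confounder
w <- rbinom(n, 1, plogis(0.8 * x))      # treatment probability depends on x
y <- 2 + 1.5 * x - 1 * w + rnorm(n)     # true ATE is -1 by construction

aipw_sketch <- function(y, w, x, outcome_formula = y ~ x) {
  d <- data.frame(y = y, w = w, x = x)

  # Nuisance 1: propensity score e(x) = P(W = 1 | X)
  e_hat <- fitted(glm(w ~ x, family = binomial(), data = d))

  # Nuisance 2: outcome regressions mu_1(x) and mu_0(x), fitted per arm
  mu1 <- predict(lm(outcome_formula, data = d[d$w == 1, ]), newdata = d)
  mu0 <- predict(lm(outcome_formula, data = d[d$w == 0, ]), newdata = d)

  # AIPW score: outcome-model contrast plus inverse-probability-weighted residuals
  psi <- (mu1 - mu0) +
    d$w * (d$y - mu1) / e_hat -
    (1 - d$w) * (d$y - mu0) / (1 - e_hat)

  c(estimate = mean(psi), std.err = sd(psi) / sqrt(nrow(d)))
}

naive      <- mean(y[w == 1]) - mean(y[w == 0])   # confounded by x
aipw_ok    <- aipw_sketch(y, w, x)                 # both nuisance models correct
aipw_bad_y <- aipw_sketch(y, w, x, y ~ 1)          # outcome model ignores x

print(round(c(
  naive                = naive,
  aipw                 = unname(aipw_ok["estimate"]),
  aipw_outcome_misspec = unname(aipw_bad_y["estimate"])
), 2))
```

In this setup the naive difference in means is pulled away from the true effect by the confounder, while the AIPW estimate should stay close to it even when the outcome regression is deliberately misspecified, because the propensity model is still correct. The per-arm `lm()` fits stand in for whatever outcome model a real analysis would use.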

## Causal ML: Honest Causal Forests (grf)

Causal forests (Wager & Athey, 2018) estimate heterogeneous treatment effects at the individual level using honest random forests. They are a non-parametric causal ML alternative to the parametric AIPW approach used above. Key advantages:

- **Non-parametric**: no functional form assumption for $\tau(x)$
- **Honest splitting**: the sample used to build the tree structure is separate from the sample used to estimate leaf values, which supports asymptotically valid confidence intervals
- **Adaptive**: the forest automatically focuses splitting power on regions of covariate space with high treatment effect heterogeneity

```{r causal-forest}
#| message: false
#| fig-height: 5
library(grf)

# Prepare matrices
X_vars <- c("severity_score", "vehicle_age", "ncd_class",
            "deductible_amount", "notification_delay_days")
X_cat  <- c("claim_type", "vehicle_class", "region",
            "coverage_type", "fault_indicator")

# Numeric covariates as-is; categoricals one-hot encoded (no intercept column)
X_num <- claims |> select(all_of(X_vars)) |> as.matrix()
X_enc <- model.matrix(reformulate(X_cat, intercept = FALSE), data = claims)
X     <- cbind(X_num, X_enc)

# Note on scale: grf requires a continuous Y; we use log(cost) to reduce the
# influence of large outliers and stabilise variance. The AIPW in utils_models.R
# uses a Gamma GLM on the original CHF scale. When comparing CATE estimates
# below, both are converted to % change. They should agree directionally but
# may differ in magnitude due to the scale difference (log-normal vs Gamma).
Y <- log(claims$repair_cost)
W <- claims$steering_flag

set.seed(2024)
cf <- causal_forest(
  X = X, Y = Y, W = W,
  num.trees       = 2000,
  tune.parameters = "all"
)

# Individual-level CATE estimates (log scale), converted to % change
tau_hat <- predict(cf, estimate.variance = TRUE)
claims_cf <- claims |>
  mutate(
    cate     = tau_hat$predictions,
    cate_pct = (exp(cate) - 1) * 100,
    cate_se  = sqrt(tau_hat$variance.estimates),
    cate_lo  = (exp(cate - 1.96 * cate_se) - 1) * 100,
    cate_hi  = (exp(cate + 1.96 * cate_se) - 1) * 100
  )

# Forest ATE on the log scale, computed once and reported as % change
ate_cf <- average_treatment_effect(cf)
to_pct <- function(z) (exp(z) - 1) * 100
cat("Causal forest ATE:", round(to_pct(ate_cf[["estimate"]]), 1), "%\n")
cat("95% CI: [",
    round(to_pct(ate_cf[["estimate"]] - 1.96 * ate_cf[["std.err"]]), 1), ",",
    round(to_pct(ate_cf[["estimate"]] + 1.96 * ate_cf[["std.err"]]), 1), "]\n")

# Calibration test: best linear fit of the target on the forest predictions.
# A coefficient near 1 on mean.forest.prediction means the average effect is
# right; a coefficient near 1 on differential.forest.prediction means the
# estimated heterogeneity is well calibrated rather than noise.
tc <- test_calibration(cf)
cat("\nCalibration test (best linear fit):\n")
print(tc)
```

### CATE Distribution

```{r cf-cate-dist}
#| fig-height: 4
ggplot(claims_cf, aes(x = cate_pct, fill = claim_type)) +
  geom_histogram(binwidth = 2, alpha = 0.75, colour = "white", linewidth = 0.2) +
  geom_vline(xintercept = 0, linetype = "dashed", colour = "#FF6600",
             linewidth = 0.8) +
  scale_fill_manual(
    values = c(glass = "#003781", body = "#0066CC",
               engine = "#00A9CE", total_loss = "#FF6600"),
    name   = "Claim Type"
  ) +
  facet_wrap(~claim_type, nrow = 1, scales = "free_y") +
  scale_x_continuous(labels = scales::percent_format(scale = 1)) +
  labs(
    title    = "Distribution of Individual CATE Estimates (Causal Forest)",
    subtitle = "Bars = claims per 2-point bin · Dashed = zero effect · Colour = claim type",
    x = "Estimated Treatment Effect on Cost (%)", y = "Count"
  ) +
  theme_allianz() +
  theme(legend.position = "none")
```

### Comparing AIPW vs. Causal Forest CATEs

```{r cf-vs-aipw}
#| fig-height: 4
# Segment summaries of the per-claim forest estimates. cf_lo / cf_hi average the
# pointwise interval bounds; they are not a confidence interval for the segment
# mean (average_treatment_effect(cf, subset = ...) would give one).
cf_segment <- claims_cf |>
  group_by(claim_type) |>
  summarise(
    cf_cate_pct = mean(cate_pct),
    cf_lo       = mean(cate_lo),
    cf_hi       = mean(cate_hi),
    .groups = "drop"
  )

cate_compare <- left_join(
  cate_df,
  cf_segment |> rename(cf_cate = cf_cate_pct),
  by = "claim_type"
)

cate_compare_long <- cate_compare |>
  pivot_longer(c(cate_pct, cf_cate), names_to = "method", values_to = "estimate") |>
  mutate(method = recode(method,
                         cate_pct = "AIPW (doubly robust)",
                         cf_cate  = "Causal Forest (grf)"))

ggplot(cate_compare_long,
       aes(x = claim_type, y = estimate, colour = method, group = method)) +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "#6B7280") +
  geom_line(linewidth = 1, alpha = 0.7) +
  geom_point(size = 4) +
  geom_errorbar(
    data = filter(cate_compare_long, method == "Causal Forest (grf)"),
    aes(ymin = cf_lo, ymax = cf_hi),
    width = 0.2, linewidth = 0.8
  ) +
  scale_colour_manual(
    values = c("AIPW (doubly robust)" = "#003781",
               "Causal Forest (grf)"  = "#00A9CE"),
    name = NULL
  ) +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) +
  labs(
    title    = "AIPW vs. Causal Forest CATE Estimates",
    subtitle = "Error bars = mean of per-claim 95% intervals (causal forest) · Both methods should broadly agree",
    x = "Claim Type", y = "Estimated Cost Change (%)"
  ) +
  theme_allianz()
```

::: {.callout-note}
**Causal forest interpretation:** If AIPW and causal forest estimates broadly agree (within their respective CIs), this triangulation increases confidence in the causal estimates. Disagreements reveal where parametric assumptions in the AIPW propensity or outcome model may be violated; the causal forest provides a more flexible non-parametric check. The causal forest additionally provides *individual-level* CATE estimates, enabling claim-by-claim decisions about whether network steering is expected to save costs.
:::
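
The triangulation described in the note can be turned into a simple screening rule. The sketch below is one possible way to do that, assuming a table with one row per segment that holds both point estimates and the forest interval bounds. The helper `flag_disagreement()`, the column names, and every number in `toy` are hypothetical and are not results from this report.

```r
# Hypothetical illustration: flag segments where the two estimators disagree.
# All values below are made up for the example.
toy <- data.frame(
  claim_type = c("glass", "body", "engine", "total_loss"),
  aipw_pct   = c(-19.0, -11.0,  -3.0,  4.0),
  cf_pct     = c(-18.0,  -9.5,  -8.0, -0.5),
  cf_lo      = c(-22.0, -13.0, -11.0, -3.0),
  cf_hi      = c(-14.0,  -6.0,  -5.0,  2.0)
)

flag_disagreement <- function(df) {
  # Directional agreement between the two point estimates
  df$same_sign  <- sign(df$aipw_pct) == sign(df$cf_pct)
  # Does the AIPW estimate fall inside the forest interval?
  df$aipw_in_ci <- df$aipw_pct >= df$cf_lo & df$aipw_pct <= df$cf_hi
  # Review any segment that fails either check
  df$review     <- !(df$same_sign & df$aipw_in_ci)
  df
}

checked <- flag_disagreement(toy)
print(checked[, c("claim_type", "same_sign", "aipw_in_ci", "review")])
```

Applied to the real analysis, the same rule could run on `cate_compare`, with `cate_pct`, `cf_cate`, `cf_lo` and `cf_hi` in place of the toy columns. A flagged segment is a prompt to inspect the propensity and outcome models for that segment, not proof that either estimate is wrong.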