{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "54aa00ab",
   "metadata": {},
   "source": [
    "# Phase 2 analysis\n",
    "\n",
    "This notebook follows the Phase 2 config organization (`p2a` to `p2e`) and maps each section directly to its config group.\n",
    "It separates three concerns:\n",
    "\n",
    "1. **Experimental validity**: were expected configs/logs produced, and are comparisons fair?\n",
    "2. **Evidence**: what do the 5-fold CV metrics support?\n",
    "3. **Decision**: which preprocessing choices should move into Phase 3?\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "734db3ee",
   "metadata": {},
   "source": [
    "## Questions\n",
    "\n",
    "| Section | Config group | Question | Required evidence |\n",
    "|---|---|---|---|\n",
    "| 2A | `p2a_*` | Shortcut analysis: normalization + source holdout | `p2a_t1_original`, `p2a_t2_real_norm`, `p2a_t3_holdout_*` |\n",
    "| 2B | `p2b_*` | Does 224 improve over 128? | `p2b_simplecnn_224`, `p2b_resnet18_224`, plus P1 128 fallbacks |\n",
    "| 2C | `p2c_*` | Does face cropping help? | `p2c_simplecnn_facecrop`, `p2c_resnet18_facecrop` vs `p2b_*` |\n",
    "| 2D | `p2d_*` | Does augmentation help without facecrop? | `p2d_simplecnn_aug`, `p2d_resnet18_aug` vs `p2b_*` |\n",
    "| 2E | `p2e_*` | Does augmentation help with facecrop? | `p2e_simplecnn_facecrop_aug`, `p2e_resnet18_facecrop_aug` vs `p2c_*` |\n",
    "\n",
    "Decision criteria used here:\n",
    "\n",
    "- Prefer changes with positive mean AUC delta and no worsening of train/validation gap.\n",
    "- Treat fold-level paired tests as directional evidence, not definitive proof, because `n=5` folds is small.\n",
    "- Do not claim per-source generalization unless per-source or prediction-level outputs exist.\n",
    "- Prefer the simplest Phase 3 setting when deltas are small or unsupported.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1f4c04b3",
   "metadata": {},
   "outputs": [],
   "source": [
    "from __future__ import annotations\n",
    "\n",
    "import json\n",
    "import math\n",
    "import os\n",
    "import sys\n",
    "from dataclasses import dataclass\n",
    "from pathlib import Path\n",
    "from typing import Any\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "from scipy import stats\n",
    "\n",
    "try:\n",
    "    from IPython.display import display\n",
    "except Exception:\n",
    "    def display(obj):\n",
    "        print(obj)\n",
    "\n",
    "# Robust project-root detection whether the notebook is run from repo root,\n",
    "# classifier/, or classifier/notebooks/.\n",
    "def find_project_root(start: Path | None = None) -> Path:\n",
    "    start = (start or Path.cwd()).resolve()\n",
    "    for candidate in [start, *start.parents]:\n",
    "        if (candidate / \"classifier\" / \"v2.md\").exists() and (candidate / \"classifier\" / \"impl.md\").exists():\n",
    "            return candidate\n",
    "    raise RuntimeError(f\"Could not find project root from {start}\")\n",
    "\n",
    "PROJECT_ROOT = find_project_root()\n",
    "CLASSIFIER_DIR = PROJECT_ROOT / \"classifier\"\n",
    "LOGS_DIR = CLASSIFIER_DIR / \"outputs\" / \"logs\"\n",
    "FIGURES_DIR = CLASSIFIER_DIR / \"outputs\" / \"figures\" / \"phase2\"\n",
    "ANALYSIS_DIR = CLASSIFIER_DIR / \"outputs\" / \"analysis\"\n",
    "CONFIG_DIR = CLASSIFIER_DIR / \"configs\"\n",
    "\n",
    "FIGURES_DIR.mkdir(parents=True, exist_ok=True)\n",
    "ANALYSIS_DIR.mkdir(parents=True, exist_ok=True)\n",
    "\n",
    "if str(CLASSIFIER_DIR) not in sys.path:\n",
    "    sys.path.insert(0, str(CLASSIFIER_DIR))\n",
    "\n",
    "sns.set_theme(style=\"whitegrid\", context=\"notebook\")\n",
    "plt.rcParams.update({\n",
    "    \"figure.figsize\": (12, 7),\n",
    "    \"axes.spines.top\": False,\n",
    "    \"axes.spines.right\": False,\n",
    "})\n",
    "\n",
    "print(f\"Project root: {PROJECT_ROOT}\")\n",
    "print(f\"Logs:         {LOGS_DIR}\")\n",
    "print(f\"Figures:      {FIGURES_DIR}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "24830212",
   "metadata": {},
   "outputs": [],
   "source": [
    "@dataclass(frozen=True)\n",
    "class RunSpec:\n",
    "    run: str\n",
    "    label: str\n",
    "    section: str\n",
    "    model: str\n",
    "    condition: str\n",
    "    intended_role: str\n",
    "    fallback_for: str | None = None\n",
    "\n",
    "RUN_SPECS = [\n",
    "    # 2A: shortcut analysis (normalization + source holdout), ResNet18 only.\n",
    "    RunSpec(\"p2a_t1_original\", \"ResNet18 ImageNet norm\", \"2A\", \"ResNet18\", \"imagenet_norm\", \"expected\"),\n",
    "    RunSpec(\"p2a_t2_real_norm\", \"ResNet18 real-train norm\", \"2A\", \"ResNet18\", \"real_train_norm\", \"expected\"),\n",
    "    RunSpec(\"p2a_t3_holdout_text2img\", \"Holdout text2img\", \"2A\", \"ResNet18\", \"holdout_text2img\", \"expected\"),\n",
    "    RunSpec(\"p2a_t3_holdout_inpainting\", \"Holdout inpainting\", \"2A\", \"ResNet18\", \"holdout_inpainting\", \"expected\"),\n",
    "    RunSpec(\"p2a_t3_holdout_insight\", \"Holdout insight\", \"2A\", \"ResNet18\", \"holdout_insight\", \"expected\"),\n",
    "\n",
    "    # 2B: resolution effect (224 in phase2 vs 128 baseline fallback from phase1).\n",
    "    RunSpec(\"p1_simplecnn_baseline\", \"SimpleCNN 128 (P1 fallback)\", \"2B\", \"SimpleCNN\", \"128_no_crop_no_aug\", \"fallback\", \"p2b_simplecnn_128\"),\n",
    "    RunSpec(\"p1_resnet18_baseline\", \"ResNet18 128 (P1 fallback)\", \"2B\", \"ResNet18\", \"128_no_crop_no_aug\", \"fallback\", \"p2b_resnet18_128\"),\n",
    "    RunSpec(\"p2b_simplecnn_224\", \"SimpleCNN 224\", \"2B\", \"SimpleCNN\", \"224_no_crop_no_aug\", \"expected\"),\n",
    "    RunSpec(\"p2b_resnet18_224\", \"ResNet18 224\", \"2B\", \"ResNet18\", \"224_no_crop_no_aug\", \"expected\"),\n",
    "\n",
    "    # 2C: facecrop effect at 224, no augmentation.\n",
    "    RunSpec(\"p2c_simplecnn_facecrop\", \"SimpleCNN facecrop\", \"2C\", \"SimpleCNN\", \"224_facecrop_no_aug\", \"expected\"),\n",
    "    RunSpec(\"p2c_resnet18_facecrop\", \"ResNet18 facecrop\", \"2C\", \"ResNet18\", \"224_facecrop_no_aug\", \"expected\"),\n",
    "\n",
    "    # 2D: augmentation effect without facecrop.\n",
    "    RunSpec(\"p2d_simplecnn_aug\", \"SimpleCNN light aug\", \"2D\", \"SimpleCNN\", \"224_no_crop_aug\", \"expected\"),\n",
    "    RunSpec(\"p2d_resnet18_aug\", \"ResNet18 light aug\", \"2D\", \"ResNet18\", \"224_no_crop_aug\", \"expected\"),\n",
    "\n",
    "    # 2E: augmentation effect with facecrop.\n",
    "    RunSpec(\"p2e_simplecnn_facecrop_aug\", \"SimpleCNN facecrop + aug\", \"2E\", \"SimpleCNN\", \"224_facecrop_aug\", \"expected\"),\n",
    "    RunSpec(\"p2e_resnet18_facecrop_aug\", \"ResNet18 facecrop + aug\", \"2E\", \"ResNet18\", \"224_facecrop_aug\", \"expected\"),\n",
    "]\n",
    "\n",
    "# Use these aliases when synthetic 128 run IDs are requested for 2B.\n",
    "RUN_ALIASES = {\n",
    "    \"p2b_simplecnn_128\": \"p1_simplecnn_baseline\",\n",
    "    \"p2b_resnet18_128\": \"p1_resnet18_baseline\",\n",
    "}\n",
    "\n",
    "PLANNED_COMPARISONS = [\n",
    "    (\"2A\", \"ResNet18\", \"normalization\", \"p2a_t1_original\", \"p2a_t2_real_norm\", \"real_norm - imagenet_norm\"),\n",
    "    (\"2A\", \"ResNet18\", \"source_holdout\", \"p2a_t1_original\", \"p2a_t3_holdout_text2img\", \"holdout text2img - all-source\"),\n",
    "    (\"2A\", \"ResNet18\", \"source_holdout\", \"p2a_t1_original\", \"p2a_t3_holdout_inpainting\", \"holdout inpainting - all-source\"),\n",
    "    (\"2A\", \"ResNet18\", \"source_holdout\", \"p2a_t1_original\", \"p2a_t3_holdout_insight\", \"holdout insight - all-source\"),\n",
    "\n",
    "    (\"2B\", \"SimpleCNN\", \"resolution\", \"p2b_simplecnn_128\", \"p2b_simplecnn_224\", \"224 - 128\"),\n",
    "    (\"2B\", \"ResNet18\", \"resolution\", \"p2b_resnet18_128\", \"p2b_resnet18_224\", \"224 - 128\"),\n",
    "\n",
    "    (\"2C\", \"SimpleCNN\", \"facecrop\", \"p2b_simplecnn_224\", \"p2c_simplecnn_facecrop\", \"facecrop - no facecrop\"),\n",
    "    (\"2C\", \"ResNet18\", \"facecrop\", \"p2b_resnet18_224\", \"p2c_resnet18_facecrop\", \"facecrop - no facecrop\"),\n",
    "\n",
    "    (\"2D\", \"SimpleCNN\", \"augmentation\", \"p2b_simplecnn_224\", \"p2d_simplecnn_aug\", \"light aug - no aug\"),\n",
    "    (\"2D\", \"ResNet18\", \"augmentation\", \"p2b_resnet18_224\", \"p2d_resnet18_aug\", \"light aug - no aug\"),\n",
    "\n",
    "    (\"2E\", \"SimpleCNN\", \"facecrop + augmentation\", \"p2c_simplecnn_facecrop\", \"p2e_simplecnn_facecrop_aug\", \"facecrop+aug - facecrop\"),\n",
    "    (\"2E\", \"ResNet18\", \"facecrop + augmentation\", \"p2c_resnet18_facecrop\", \"p2e_resnet18_facecrop_aug\", \"facecrop+aug - facecrop\"),\n",
    "]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6e2ccd27",
   "metadata": {},
   "source": [
    "## Evidence audit\n",
    "\n",
    "Before comparing numbers, check whether the planned artifacts exist. Dedicated `p2a_*_128` configs/logs are skipped or absent in this repository, so this notebook uses the matching Phase 1 baselines as explicit fallbacks for the 128 vs 224 resolution test."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "53356e8b",
   "metadata": {},
   "outputs": [],
   "source": [
    "def load_json(path: Path) -> dict[str, Any] | None:\n",
    "    if not path.exists():\n",
    "        return None\n",
    "    with path.open() as f:\n",
    "        return json.load(f)\n",
    "\n",
    "\n",
    "def config_path_for(run: str) -> Path | None:\n",
    "    candidates = [\n",
    "        CONFIG_DIR / \"phase2\" / f\"{run}.json\",\n",
    "        CONFIG_DIR / \"phase2\" / f\"{run}.json.skip\",\n",
    "        CONFIG_DIR / \"phase1\" / f\"{run}.json\",\n",
    "        CONFIG_DIR / \"phase1\" / f\"{run}.json.skip\",\n",
    "    ]\n",
    "    return next((p for p in candidates if p.exists()), None)\n",
    "\n",
    "\n",
    "def log_path_for(run: str) -> Path:\n",
    "    return LOGS_DIR / f\"{run}.json\"\n",
    "\n",
    "\n",
    "def resolve_run(run: str) -> str:\n",
    "    return run if log_path_for(run).exists() else RUN_ALIASES.get(run, run)\n",
    "\n",
    "\n",
    "def load_results(run: str) -> dict[str, Any] | None:\n",
    "    resolved = resolve_run(run)\n",
    "    return load_json(log_path_for(resolved))\n",
    "\n",
    "\n",
    "def metric_values(results: dict[str, Any], metric: str = \"auc_roc\") -> np.ndarray:\n",
    "    vals = []\n",
    "    for fold in results.get(\"fold_results\", []):\n",
    "        value = fold.get(\"test_metrics\", {}).get(metric)\n",
    "        if value is not None:\n",
    "            vals.append(float(value))\n",
    "    return np.asarray(vals, dtype=float)\n",
    "\n",
    "\n",
    "def best_epoch_gap(fold: dict[str, Any], metric: str = \"auc\") -> float | None:\n",
    "    hist = fold.get(\"history\", {})\n",
    "    train_key = f\"train_{metric}\"\n",
    "    val_key = f\"val_{metric}\"\n",
    "    train = hist.get(train_key, [])\n",
    "    val = hist.get(val_key, [])\n",
    "    if not train or not val:\n",
    "        return None\n",
    "    idx = int(np.nanargmax(np.asarray(val, dtype=float)))\n",
    "    return float(train[idx] - val[idx])\n",
    "\n",
    "\n",
    "def final_epoch_gap(fold: dict[str, Any], metric: str = \"auc\") -> float | None:\n",
    "    hist = fold.get(\"history\", {})\n",
    "    train = hist.get(f\"train_{metric}\", [])\n",
    "    val = hist.get(f\"val_{metric}\", [])\n",
    "    if not train or not val:\n",
    "        return None\n",
    "    return float(train[-1] - val[-1])\n",
    "\n",
    "\n",
    "def summarize_run(spec: RunSpec) -> dict[str, Any]:\n",
    "    resolved = resolve_run(spec.run)\n",
    "    results = load_results(spec.run)\n",
    "    config_path = config_path_for(spec.run) or config_path_for(resolved)\n",
    "    cfg = load_json(config_path) if config_path else None\n",
    "\n",
    "    row = {\n",
    "        \"section\": spec.section,\n",
    "        \"run\": spec.run,\n",
    "        \"resolved_run\": resolved,\n",
    "        \"label\": spec.label,\n",
    "        \"model\": spec.model,\n",
    "        \"condition\": spec.condition,\n",
    "        \"role\": spec.intended_role,\n",
    "        \"fallback_for\": spec.fallback_for,\n",
    "        \"config_path\": str(config_path.relative_to(PROJECT_ROOT)) if config_path else None,\n",
    "        \"config_status\": \"present\" if config_path and config_path.suffix == \".json\" else (\"skipped\" if config_path else \"missing\"),\n",
    "        \"log_status\": \"present\" if log_path_for(spec.run).exists() else (\"fallback\" if resolved != spec.run and log_path_for(resolved).exists() else \"missing\"),\n",
    "        \"n_folds\": None,\n",
    "        \"auc_mean\": np.nan,\n",
    "        \"auc_std\": np.nan,\n",
    "        \"acc_mean\": np.nan,\n",
    "        \"f1_mean\": np.nan,\n",
    "        \"gap_best_mean\": np.nan,\n",
    "        \"gap_final_mean\": np.nan,\n",
    "        \"image_size\": None,\n",
    "        \"face_crop\": None,\n",
    "        \"augment\": None,\n",
    "        \"normalization\": None,\n",
    "        \"train_sources\": None,\n",
    "        \"eval_sources\": None,\n",
    "    }\n",
    "\n",
    "    if cfg:\n",
    "        row.update({\n",
    "            \"image_size\": cfg.get(\"image_size\"),\n",
    "            \"face_crop\": cfg.get(\"face_crop\"),\n",
    "            \"augment\": \"light\" if isinstance(cfg.get(\"augment\"), dict) else cfg.get(\"augment\"),\n",
    "            \"normalization\": cfg.get(\"normalization\"),\n",
    "            \"train_sources\": tuple(cfg.get(\"train_sources\", [])) or None,\n",
    "            \"eval_sources\": tuple(cfg.get(\"eval_sources\", [])) or None,\n",
    "        })\n",
    "\n",
    "    if results:\n",
    "        agg = results.get(\"aggregated_metrics\", {})\n",
    "        row.update({\n",
    "            \"n_folds\": results.get(\"n_folds\"),\n",
    "            \"auc_mean\": agg.get(\"auc_roc\", {}).get(\"mean\", np.nan),\n",
    "            \"auc_std\": agg.get(\"auc_roc\", {}).get(\"std\", np.nan),\n",
    "            \"acc_mean\": agg.get(\"accuracy\", {}).get(\"mean\", np.nan),\n",
    "            \"f1_mean\": agg.get(\"f1\", {}).get(\"mean\", np.nan),\n",
    "        })\n",
    "        best_gaps = [best_epoch_gap(f) for f in results.get(\"fold_results\", [])]\n",
    "        final_gaps = [final_epoch_gap(f) for f in results.get(\"fold_results\", [])]\n",
    "        best_gaps = [x for x in best_gaps if x is not None]\n",
    "        final_gaps = [x for x in final_gaps if x is not None]\n",
    "        row[\"gap_best_mean\"] = float(np.mean(best_gaps)) if best_gaps else np.nan\n",
    "        row[\"gap_final_mean\"] = float(np.mean(final_gaps)) if final_gaps else np.nan\n",
    "\n",
    "    return row\n",
    "\n",
    "runs_df = pd.DataFrame([summarize_run(spec) for spec in RUN_SPECS])\n",
    "\n",
    "# Prefer canonical rows for analysis: keep fallbacks only where expected rows are missing.\n",
    "canonical_runs_df = runs_df[runs_df[\"role\"] == \"expected\"].copy()\n",
    "for missing_run, fallback_run in RUN_ALIASES.items():\n",
    "    mask = canonical_runs_df[\"run\"].eq(missing_run) & canonical_runs_df[\"log_status\"].eq(\"missing\")\n",
    "    if mask.any():\n",
    "        fallback = runs_df[runs_df[\"run\"].eq(fallback_run)].copy()\n",
    "        if not fallback.empty:\n",
    "            fallback.loc[:, \"run\"] = missing_run\n",
    "            fallback.loc[:, \"label\"] = fallback.iloc[0][\"label\"].replace(\" (P1 fallback)\", \"\") + \" [P1 fallback]\"\n",
    "            fallback.loc[:, \"role\"] = \"expected_via_fallback\"\n",
    "            canonical_runs_df = pd.concat([canonical_runs_df[~mask], fallback], ignore_index=True)\n",
    "\n",
    "print(\"Artifact audit:\")\n",
    "display(runs_df[[\"section\", \"run\", \"resolved_run\", \"role\", \"config_status\", \"log_status\", \"n_folds\"]].sort_values([\"section\", \"run\"]))\n",
    "\n",
    "missing_expected = runs_df[(runs_df[\"role\"] == \"expected\") & (runs_df[\"log_status\"] == \"missing\")][\"run\"].tolist()\n",
    "print(f\"\\nExpected runs with no direct log: {missing_expected or 'none'}\")\n",
    "print(\"Fallbacks used:\", {k: v for k, v in RUN_ALIASES.items() if k in missing_expected})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b21a9faf",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Protocol consistency audit from loaded logs/configs.\n",
    "protocol_fields = [\n",
    "    \"cv_folds\", \"batch_size\", \"early_stopping_patience\", \"seed\", \"subsample\",\n",
    "    \"lr\", \"weight_decay\", \"T_max\", \"epochs\",\n",
    "]\n",
    "\n",
    "protocol_rows = []\n",
    "for _, row in canonical_runs_df.iterrows():\n",
    "    results = load_results(row[\"run\"])\n",
    "    cfg = (results or {}).get(\"config\", {})\n",
    "    protocol_rows.append({\"run\": row[\"run\"], **{k: cfg.get(k) for k in protocol_fields}})\n",
    "\n",
    "protocol_df = pd.DataFrame(protocol_rows)\n",
    "display(protocol_df)\n",
    "\n",
    "print(\"Field variability across loaded canonical runs:\")\n",
    "for field in protocol_fields:\n",
    "    vals = sorted({str(v) for v in protocol_df[field].dropna().unique()})\n",
    "    print(f\"  {field:28s}: {vals}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6802bcd9",
   "metadata": {},
   "source": [
    "## Results table\n",
    "\n",
    "The table below is ranked by AUC and includes two gap estimates:\n",
    "\n",
    "- `gap_best_mean`: train AUC minus validation AUC at each fold's best validation epoch. This is closest to the saved best checkpoint.\n",
    "- `gap_final_mean`: train AUC minus validation AUC at the final epoch. This is useful for diagnosing late overfit but is less aligned with test evaluation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "be1ec0ba",
   "metadata": {},
   "outputs": [],
   "source": [
    "analysis_df = canonical_runs_df[canonical_runs_df[\"log_status\"].isin([\"present\", \"fallback\"])].copy()\n",
    "analysis_df = analysis_df.sort_values(\"auc_mean\", ascending=False)\n",
    "\n",
    "cols = [\n",
    "    \"section\", \"label\", \"run\", \"resolved_run\", \"model\", \"condition\", \"log_status\",\n",
    "    \"auc_mean\", \"auc_std\", \"acc_mean\", \"f1_mean\", \"gap_best_mean\", \"gap_final_mean\",\n",
    "]\n",
    "\n",
    "display(\n",
    "    analysis_df[cols]\n",
    "    .style.format({\n",
    "        \"auc_mean\": \"{:.4f}\",\n",
    "        \"auc_std\": \"{:.4f}\",\n",
    "        \"acc_mean\": \"{:.4f}\",\n",
    "        \"f1_mean\": \"{:.4f}\",\n",
    "        \"gap_best_mean\": \"{:+.4f}\",\n",
    "        \"gap_final_mean\": \"{:+.4f}\",\n",
    "    })\n",
    "    .background_gradient(subset=[\"auc_mean\"], cmap=\"Greens\")\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1e0d21c1",
   "metadata": {},
   "outputs": [],
   "source": [
    "def paired_comparison(section: str, model: str, question: str, before: str, after: str, contrast: str) -> dict[str, Any]:\n",
    "    r0 = load_results(before)\n",
    "    r1 = load_results(after)\n",
    "    resolved_before = resolve_run(before)\n",
    "    resolved_after = resolve_run(after)\n",
    "    out = {\n",
    "        \"section\": section,\n",
    "        \"model\": model,\n",
    "        \"question\": question,\n",
    "        \"before\": before,\n",
    "        \"after\": after,\n",
    "        \"resolved_before\": resolved_before,\n",
    "        \"resolved_after\": resolved_after,\n",
    "        \"contrast\": contrast,\n",
    "        \"status\": \"ok\" if r0 and r1 else \"missing\",\n",
    "        \"n\": 0,\n",
    "        \"before_auc\": np.nan,\n",
    "        \"after_auc\": np.nan,\n",
    "        \"delta_auc\": np.nan,\n",
    "        \"delta_ci95\": np.nan,\n",
    "        \"ttest_p\": np.nan,\n",
    "        \"wilcoxon_p\": np.nan,\n",
    "        \"cohen_dz\": np.nan,\n",
    "        \"before_gap\": np.nan,\n",
    "        \"after_gap\": np.nan,\n",
    "        \"delta_gap\": np.nan,\n",
    "        \"interpretation\": \"insufficient data\",\n",
    "        \"caveat\": \"\",\n",
    "    }\n",
    "    if not (r0 and r1):\n",
    "        return out\n",
    "\n",
    "    v0 = metric_values(r0, \"auc_roc\")\n",
    "    v1 = metric_values(r1, \"auc_roc\")\n",
    "    n = min(len(v0), len(v1))\n",
    "    v0, v1 = v0[:n], v1[:n]\n",
    "    diff = v1 - v0\n",
    "\n",
    "    out.update({\n",
    "        \"n\": n,\n",
    "        \"before_auc\": float(np.mean(v0)),\n",
    "        \"after_auc\": float(np.mean(v1)),\n",
    "        \"delta_auc\": float(np.mean(diff)),\n",
    "    })\n",
    "\n",
    "    if n >= 2:\n",
    "        sd = float(np.std(diff, ddof=1))\n",
    "        se = sd / math.sqrt(n) if sd > 0 else 0.0\n",
    "        out[\"delta_ci95\"] = float(stats.t.ppf(0.975, df=n - 1) * se) if n > 1 else np.nan\n",
    "        if sd > 0:\n",
    "            out[\"cohen_dz\"] = float(np.mean(diff) / sd)\n",
    "            out[\"ttest_p\"] = float(stats.ttest_rel(v1, v0).pvalue)\n",
    "        if n >= 3 and not np.allclose(diff, 0):\n",
    "            try:\n",
    "                out[\"wilcoxon_p\"] = float(stats.wilcoxon(diff).pvalue)\n",
    "            except ValueError:\n",
    "                pass\n",
    "\n",
    "    gaps0 = [best_epoch_gap(f) for f in r0.get(\"fold_results\", [])]\n",
    "    gaps1 = [best_epoch_gap(f) for f in r1.get(\"fold_results\", [])]\n",
    "    gaps0 = np.asarray([x for x in gaps0 if x is not None], dtype=float)\n",
    "    gaps1 = np.asarray([x for x in gaps1 if x is not None], dtype=float)\n",
    "    if len(gaps0) and len(gaps1):\n",
    "        m = min(len(gaps0), len(gaps1))\n",
    "        out[\"before_gap\"] = float(np.mean(gaps0[:m]))\n",
    "        out[\"after_gap\"] = float(np.mean(gaps1[:m]))\n",
    "        out[\"delta_gap\"] = float(np.mean(gaps1[:m] - gaps0[:m]))\n",
    "\n",
    "    if question == \"source_holdout\":\n",
    "        out[\"caveat\"] = \"Aggregate holdout-run AUC only; not held-out-source vs in-source AUC.\"\n",
    "    if before != resolved_before or after != resolved_after:\n",
    "        out[\"caveat\"] = (out[\"caveat\"] + \" \" if out[\"caveat\"] else \"\") + \"Uses Phase 1 fallback for missing p2a 128 log.\"\n",
    "\n",
    "    if out[\"delta_auc\"] >= 0.01:\n",
    "        out[\"interpretation\"] = \"meaningful improvement\"\n",
    "    elif out[\"delta_auc\"] > 0.002:\n",
    "        out[\"interpretation\"] = \"small improvement\"\n",
    "    elif out[\"delta_auc\"] >= -0.002:\n",
    "        out[\"interpretation\"] = \"negligible change\"\n",
    "    elif out[\"delta_auc\"] > -0.01:\n",
    "        out[\"interpretation\"] = \"small drop\"\n",
    "    else:\n",
    "        out[\"interpretation\"] = \"meaningful drop\"\n",
    "    return out\n",
    "\n",
    "comparisons_df = pd.DataFrame([paired_comparison(*args) for args in PLANNED_COMPARISONS])\n",
    "\n",
    "# Benjamini-Hochberg correction across planned paired t-tests where available.\n",
    "valid_p = comparisons_df[\"ttest_p\"].notna()\n",
    "pvals = comparisons_df.loc[valid_p, \"ttest_p\"].to_numpy()\n",
    "qvals = np.full(len(comparisons_df), np.nan)\n",
    "if len(pvals):\n",
    "    order = np.argsort(pvals)\n",
    "    ranked = pvals[order]\n",
    "    adjusted = np.empty_like(ranked)\n",
    "    m = len(ranked)\n",
    "    running = 1.0\n",
    "    for i in range(m - 1, -1, -1):\n",
    "        running = min(running, ranked[i] * m / (i + 1))\n",
    "        adjusted[i] = running\n",
    "    qvals[np.where(valid_p)[0][order]] = adjusted\n",
    "comparisons_df[\"bh_q\"] = qvals\n",
    "\n",
    "display(\n",
    "    comparisons_df[[\n",
    "        \"section\", \"model\", \"question\", \"contrast\", \"before_auc\", \"after_auc\", \"delta_auc\",\n",
    "        \"delta_ci95\", \"ttest_p\", \"bh_q\", \"wilcoxon_p\", \"cohen_dz\", \"delta_gap\", \"interpretation\", \"caveat\",\n",
    "    ]].style.format({\n",
    "        \"before_auc\": \"{:.4f}\",\n",
    "        \"after_auc\": \"{:.4f}\",\n",
    "        \"delta_auc\": \"{:+.4f}\",\n",
    "        \"delta_ci95\": \"\u00b1{:.4f}\",\n",
    "        \"ttest_p\": \"{:.4f}\",\n",
    "        \"bh_q\": \"{:.4f}\",\n",
    "        \"wilcoxon_p\": \"{:.4f}\",\n",
    "        \"cohen_dz\": \"{:+.2f}\",\n",
    "        \"delta_gap\": \"{:+.4f}\",\n",
    "    }).background_gradient(subset=[\"delta_auc\"], cmap=\"RdYlGn\", vmin=-0.06, vmax=0.06)\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f20e5262",
   "metadata": {},
   "source": [
    "## Visual summary\n",
    "\n",
    "Two plots are most useful for decision-making:\n",
    "\n",
    "- Ranking all conditions by AUC shows the best observed configurations but can overstate duplicated/near-identical runs.\n",
    "- Paired delta plot shows the controlled effect of each preprocessing change and exposes uncertainty."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "42882c6a",
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_df = analysis_df.copy()\n",
    "plot_df[\"display_label\"] = plot_df[\"section\"] + \" | \" + plot_df[\"label\"]\n",
    "plot_df = plot_df.sort_values(\"auc_mean\", ascending=True)\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(11, max(7, 0.35 * len(plot_df))))\n",
    "colors = {\"2A\": \"#4C78A8\", \"2B\": \"#F58518\", \"2C\": \"#54A24B\", \"2D\": \"#E45756\", \"2E\": \"#B279A2\"}\n",
    "ax.barh(\n",
    "    plot_df[\"display_label\"],\n",
    "    plot_df[\"auc_mean\"],\n",
    "    xerr=plot_df[\"auc_std\"],\n",
    "    color=[colors.get(s, \"#999999\") for s in plot_df[\"section\"]],\n",
    "    alpha=0.85,\n",
    ")\n",
    "ax.set_xlim(0.65, 1.0)\n",
    "ax.set_xlabel(\"Mean AUC across CV folds\")\n",
    "ax.set_title(\"Phase 2 Conditions Ranked by AUC\")\n",
    "ax.axvline(0.95, color=\"black\", linewidth=1, linestyle=\"--\", alpha=0.4)\n",
    "for y, (_, row) in enumerate(plot_df.iterrows()):\n",
    "    ax.text(row[\"auc_mean\"] + 0.004, y, f\"{row['auc_mean']:.4f}\", va=\"center\", fontsize=9)\n",
    "fig.tight_layout()\n",
    "fig.savefig(FIGURES_DIR / \"ranked_auc.png\", dpi=200, bbox_inches=\"tight\")\n",
    "plt.show()\n",
    "\n",
    "forest = comparisons_df.copy()\n",
    "forest[\"display\"] = forest[\"section\"] + \" \" + forest[\"model\"] + \" - \" + forest[\"contrast\"]\n",
    "forest = forest.iloc[::-1]\n",
    "fig, ax = plt.subplots(figsize=(11, max(6, 0.45 * len(forest))))\n",
    "y = np.arange(len(forest))\n",
    "ax.errorbar(\n",
    "    forest[\"delta_auc\"], y,\n",
    "    xerr=forest[\"delta_ci95\"],\n",
    "    fmt=\"o\", color=\"#1F2937\", ecolor=\"#6B7280\", capsize=4,\n",
    ")\n",
    "ax.axvline(0, color=\"black\", linewidth=1)\n",
    "ax.axvspan(-0.002, 0.002, color=\"#9CA3AF\", alpha=0.18, label=\"negligible band\")\n",
    "ax.set_yticks(y)\n",
    "ax.set_yticklabels(forest[\"display\"])\n",
    "ax.set_xlabel(\"Delta AUC (after - before), paired by fold\")\n",
    "ax.set_title(\"Planned Phase 2 Effect Estimates\")\n",
    "ax.legend(loc=\"lower right\")\n",
    "fig.tight_layout()\n",
    "fig.savefig(FIGURES_DIR / \"planned_effects.png\", dpi=200, bbox_inches=\"tight\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e063cfc0",
   "metadata": {},
   "source": [
    "## 2A - Shortcut analysis\n",
    "\n",
    "Shortcut checks map to `p2a_*` configs:\n",
    "- `p2a_t1_original` vs `p2a_t2_real_norm` (normalization)\n",
    "- `p2a_t1_original` vs `p2a_t3_holdout_*` (source_holdout)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "910bd5bd",
   "metadata": {},
   "outputs": [],
   "source": [
    "def comparison_subset(section: str, question: str | None = None) -> pd.DataFrame:\n",
    "    df = comparisons_df[comparisons_df[\"section\"].eq(section)].copy()\n",
    "    if question:\n",
    "        df = df[df[\"question\"].eq(question)]\n",
    "    return df\n",
    "\n",
    "\n",
    "def print_comparison_readout(df: pd.DataFrame) -> None:\n",
    "    for _, row in df.iterrows():\n",
    "        print(f\"{row['section']} {row['model']} - {row['contrast']}\")\n",
    "        print(f\"  AUC: {row['before_auc']:.4f} -> {row['after_auc']:.4f} ({row['delta_auc']:+.4f})\")\n",
    "        print(f\"  paired t p={row['ttest_p']:.4f}, BH q={row['bh_q']:.4f}, CI95 delta=\u00b1{row['delta_ci95']:.4f}\")\n",
    "        print(f\"  gap delta: {row['delta_gap']:+.4f}; interpretation: {row['interpretation']}\")\n",
    "        if row['caveat']:\n",
    "            print(f\"  caveat: {row['caveat']}\")\n",
    "        print()\n",
    "\n",
    "print_comparison_readout(comparison_subset(\"2B\", \"resolution\"))\n",
    "\n",
    "res_plot = comparison_subset(\"2B\", \"resolution\")\n",
    "fig, ax = plt.subplots(figsize=(8, 5))\n",
    "for _, row in res_plot.iterrows():\n",
    "    r0, r1 = load_results(row[\"before\"]), load_results(row[\"after\"])\n",
    "    v0, v1 = metric_values(r0), metric_values(r1)\n",
    "    x = [0, 1]\n",
    "    for a, b in zip(v0, v1):\n",
    "        ax.plot(x, [a, b], color=\"#9CA3AF\", alpha=0.7)\n",
    "    ax.plot(x, [v0.mean(), v1.mean()], marker=\"o\", linewidth=3, label=row[\"model\"])\n",
    "ax.set_xticks([0, 1])\n",
    "ax.set_xticklabels([\"128\", \"224\"])\n",
    "ax.set_ylabel(\"AUC\")\n",
    "ax.set_title(\"2B Resolution: Fold-Paired AUC\")\n",
    "ax.legend()\n",
    "fig.tight_layout()\n",
    "fig.savefig(FIGURES_DIR / \"2b_resolution_paired.png\", dpi=200, bbox_inches=\"tight\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "530e8675",
   "metadata": {},
   "source": [
    "## 2B - Resolution impact\n",
    "\n",
    "This section compares 128 vs 224 using `p2b_*_224` and Phase 1 baselines as explicit 128 fallbacks.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "13304d38",
   "metadata": {},
   "outputs": [],
   "source": [
    "print_comparison_readout(comparison_subset(\"2C\", \"facecrop\"))\n",
    "\n",
    "face_df = canonical_runs_df[canonical_runs_df[\"section\"].eq(\"2C\")].copy()\n",
    "fig, axes = plt.subplots(1, 2, figsize=(12, 5), sharey=False)\n",
    "for ax, model in zip(axes, [\"SimpleCNN\", \"ResNet18\"]):\n",
    "    sub = face_df[face_df[\"model\"].eq(model)].sort_values(\"face_crop\")\n",
    "    ax.bar(sub[\"condition\"], sub[\"auc_mean\"], yerr=sub[\"auc_std\"], color=[\"#D97706\", \"#059669\"], alpha=0.85, capsize=5)\n",
    "    ax.set_title(model)\n",
    "    ax.set_ylim(0.70 if model == \"SimpleCNN\" else 0.94, 0.99)\n",
    "    ax.set_ylabel(\"AUC\")\n",
    "    ax.tick_params(axis=\"x\", rotation=20)\n",
    "fig.suptitle(\"2C Facecrop Impact\")\n",
    "fig.tight_layout()\n",
    "fig.savefig(FIGURES_DIR / \"2c_facecrop.png\", dpi=200, bbox_inches=\"tight\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8702d10d",
   "metadata": {},
   "source": [
    "## 2C - Facecrop impact\n",
    "\n",
    "This section compares `p2c_*_facecrop` against the matching `p2b_*_224` no-facecrop baselines.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ec5e03ef",
   "metadata": {},
   "outputs": [],
   "source": [
    "print_comparison_readout(comparison_subset(\"2A\"))\n\n# Inspect whether logs contain the per-source data needed by v2.md.\nsource_audit = []\nfor run in [\"p2a_t1_original\", \"p2a_t3_holdout_text2img\", \"p2a_t3_holdout_inpainting\", \"p2a_t3_holdout_insight\"]:\n    results = load_results(run)\n    has_per_source = False\n    has_records = False\n    example_keys = []\n    if results:\n        for fold in results.get(\"fold_results\", []):\n            tm = fold.get(\"test_metrics\", {})\n            example_keys = sorted(tm.keys())\n            has_per_source = has_per_source or any(k in tm for k in [\"per_source\", \"per_source_metrics\", \"pairwise_source_metrics\", \"source_metrics\", \"pair_metrics\"])\n            has_records = has_records or any(k in fold for k in [\"records\", \"predictions\", \"test_records\"])\n    source_audit.append({\n        \"run\": run,\n        \"has_per_source_metrics\": has_per_source,\n        \"has_prediction_records\": has_records,\n        \"test_metric_keys\": example_keys,\n    })\nsource_audit_df = pd.DataFrame(source_audit)\ndisplay(source_audit_df)\n\nholdout_runs = [\"p2a_t1_original\", \"p2a_t3_holdout_text2img\", \"p2a_t3_holdout_inpainting\", \"p2a_t3_holdout_insight\"]\nholdout_df = canonical_runs_df[canonical_runs_df[\"run\"].isin(holdout_runs)].copy()\nholdout_df[\"delta_vs_all_source\"] = holdout_df[\"auc_mean\"] - float(holdout_df.loc[holdout_df[\"run\"].eq(\"p2a_t1_original\"), \"auc_mean\"].iloc[0])\n\nfig, ax = plt.subplots(figsize=(9, 5))\nax.bar(holdout_df[\"label\"], holdout_df[\"auc_mean\"], yerr=holdout_df[\"auc_std\"], color=\"#54A24B\", alpha=0.85, capsize=5)\nax.set_ylim(0.88, 0.99)\nax.set_ylabel(\"Aggregate AUC\")\nax.set_title(\"2C Source Holdout Proxy: Aggregate Test AUC\")\nax.tick_params(axis=\"x\", rotation=20)\nfor i, (_, row) in enumerate(holdout_df.iterrows()):\n    ax.text(i, row[\"auc_mean\"] + 0.004, f\"{row['delta_vs_all_source']:+.3f}\", ha=\"center\", fontsize=9)\nfig.tight_layout()\nfig.savefig(FIGURES_DIR / \"2c_holdout_proxy.png\", dpi=200, bbox_inches=\"tight\")\nplt.show()\n\nprint(\"Geometry diagnostic evidence:\")\ngeometry_keys = []\nfor run in [\"p2a_t1_original\", \"p2a_t2_real_norm\"]:\n    results = load_results(run)\n    cfg = (results or {}).get(\"config\", {})\n    geometry_keys.append({\n        \"run\": run,\n        \"config_geometry_condition\": cfg.get(\"geometry_condition\"),\n        \"has_matched_geometry_metric\": any(\n            \"geometry\" in str(k).lower() or \"matched\" in str(k).lower()\n            for fold in (results or {}).get(\"fold_results\", [])\n            for k in fold.get(\"test_metrics\", {}).keys()\n        ),\n    })\ndisplay(pd.DataFrame(geometry_keys))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2c3b8812",
   "metadata": {},
   "source": [
    "## 2D / 2E - Augmentation impact and test-set integrity\n",
    "\n",
    "The augmentation question has two parts:\n",
    "\n",
    "- Does light augmentation help at 224 without facecrop?\n",
    "- Does it help once facecrop is enabled?\n",
    "\n",
    "The implementation also needs to guarantee that validation/test evaluation is not stochastic. The preprocessing pipeline keeps stochastic operations behind `self.train`, so `train=False` disables them even if augmentation settings exist."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f11c3257",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"2D (p2d): augmentation without facecrop\")\n",
    "print_comparison_readout(comparison_subset(\"2D\", \"augmentation\"))\n",
    "print(\"2E (p2e): augmentation with facecrop\")\n",
    "print_comparison_readout(comparison_subset(\"2E\", \"facecrop + augmentation\"))\n",
    "\n",
    "aug_sections = comparisons_df[comparisons_df[\"section\"].isin([\"2D\", \"2E\"])].copy()\n",
    "fig, ax = plt.subplots(figsize=(9, 5))\n",
    "labels = aug_sections[\"section\"] + \" \" + aug_sections[\"model\"]\n",
    "ax.bar(labels, aug_sections[\"delta_auc\"], yerr=aug_sections[\"delta_ci95\"], color=[\"#E45756\" if d < 0 else \"#059669\" for d in aug_sections[\"delta_auc\"]], alpha=0.85, capsize=5)\n",
    "ax.axhline(0, color=\"black\", linewidth=1)\n",
    "ax.set_ylabel(\"Delta AUC from adding augmentation\")\n",
    "ax.set_title(\"Augmentation Effects Across Facecrop Conditions\")\n",
    "ax.tick_params(axis=\"x\", rotation=20)\n",
    "fig.tight_layout()\n",
    "fig.savefig(FIGURES_DIR / \"2d_2e_augmentation_effects.png\", dpi=200, bbox_inches=\"tight\")\n",
    "plt.show()\n",
    "\n",
    "# Static and behavioral audit of eval stochasticity.\n",
    "try:\n",
    "    import inspect\n",
    "    from src.preprocessing.pipeline import DFFImagePipeline\n",
    "    from src.evaluation import evaluate as evaluate_module\n",
    "\n",
    "    pipeline_src = inspect.getsource(DFFImagePipeline)\n",
    "    build_transforms_src = inspect.getsource(evaluate_module.build_transforms)\n",
    "    stochastic_guards = {\n",
    "        \"flip_guarded_by_train\": \"if self.train and random.random() < self.hflip_p\" in pipeline_src,\n",
    "        \"rotate_guarded_by_train\": \"if self.train and self.rotation_degrees > 0\" in pipeline_src,\n",
    "        \"color_jitter_returns_when_not_train\": \"if not self.train:\" in pipeline_src,\n",
    "        \"blur_guarded_by_train\": \"if self.train and random.random() < self.blur_p\" in pipeline_src,\n",
    "        \"jpeg_guarded_by_train\": \"if self.train and random.random() < self.jpeg_p\" in pipeline_src,\n",
    "        \"erase_guarded_by_train\": \"if self.train and random.random() < self.erase_p\" in pipeline_src,\n",
    "        \"noise_guarded_by_train\": \"if self.train and random.random() < self.noise_p\" in pipeline_src,\n",
    "        \"cv_transform_uses_train_flag\": \"get_transforms(train=train\" in build_transforms_src,\n",
    "    }\n",
    "    display(pd.DataFrame([stochastic_guards]).T.rename(columns={0: \"passes\"}))\n",
    "except Exception as exc:\n",
    "    print(f\"Could not run transform audit: {exc}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "02e47658",
   "metadata": {},
   "source": [
    "## Decision synthesis\n",
    "\n",
    "This section converts the evidence into Phase 3 settings. It intentionally distinguishes a recommendation from a claim:\n",
    "\n",
    "- Recommendation: choose the setting that is best supported for the next experiment.\n",
    "- Claim: what the current evidence proves. Some Phase 2C claims remain incomplete without per-source or matched-geometry outputs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7034443c",
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_delta(question: str, model: str | None = None, section: str | None = None) -> pd.DataFrame:\n",
    "    df = comparisons_df[comparisons_df[\"question\"].eq(question)].copy()\n",
    "    if model:\n",
    "        df = df[df[\"model\"].eq(model)]\n",
    "    if section:\n",
    "        df = df[df[\"section\"].eq(section)]\n",
    "    return df\n",
    "\n",
    "resolution_resnet = get_delta(\"resolution\", \"ResNet18\").iloc[0]\n",
    "facecrop_resnet = get_delta(\"facecrop\", \"ResNet18\").iloc[0]\n",
    "facecrop_simple = get_delta(\"facecrop\", \"SimpleCNN\").iloc[0]\n",
    "aug_no_crop_resnet = get_delta(\"augmentation\", \"ResNet18\").iloc[0]\n",
    "aug_no_crop_simple = get_delta(\"augmentation\", \"SimpleCNN\").iloc[0]\n",
    "aug_crop_resnet = get_delta(\"facecrop + augmentation\", \"ResNet18\").iloc[0]\n",
    "aug_crop_simple = get_delta(\"facecrop + augmentation\", \"SimpleCNN\").iloc[0]\n",
    "norm = get_delta(\"normalization\", \"ResNet18\").iloc[0]\n",
    "\n",
    "recommendations = [\n",
    "    {\n",
    "        \"choice\": \"resolution\",\n",
    "        \"recommendation\": \"224x224\",\n",
    "        \"evidence\": f\"ResNet18 delta AUC {resolution_resnet.delta_auc:+.4f}; SimpleCNN does not determine Phase 3 capacity.\",\n",
    "        \"confidence\": \"high\" if resolution_resnet.delta_auc > 0.02 else \"medium\",\n",
    "    },\n",
    "    {\n",
    "        \"choice\": \"facecrop\",\n",
    "        \"recommendation\": \"use facecrop\",\n",
    "        \"evidence\": f\"Small positive deltas for both models: SimpleCNN {facecrop_simple.delta_auc:+.4f}, ResNet18 {facecrop_resnet.delta_auc:+.4f}.\",\n",
    "        \"confidence\": \"medium\",\n",
    "    },\n",
    "    {\n",
    "        \"choice\": \"augmentation\",\n",
    "        \"recommendation\": \"do not use light augmentation for Phase 3 at 20% data\",\n",
    "        \"evidence\": f\"SimpleCNN drops {aug_no_crop_simple.delta_auc:+.4f} without facecrop and {aug_crop_simple.delta_auc:+.4f} with facecrop; ResNet18 is neutral/slightly mixed ({aug_no_crop_resnet.delta_auc:+.4f}, {aug_crop_resnet.delta_auc:+.4f}).\",\n",
    "        \"confidence\": \"high for SimpleCNN, medium for ResNet18\",\n",
    "    },\n",
    "    {\n",
    "        \"choice\": \"normalization\",\n",
    "        \"recommendation\": \"ImageNet normalization\",\n",
    "        \"evidence\": f\"Real-train-only normalization delta AUC {norm.delta_auc:+.4f}; no useful gain and less standard for pretrained ResNet.\",\n",
    "        \"confidence\": \"medium\",\n",
    "    },\n",
    "    {\n",
    "        \"choice\": \"shortcut/source claims\",\n",
    "        \"recommendation\": \"do not overclaim; add per-source or prediction exports before final report\",\n",
    "        \"evidence\": \"Current CV logs lack held-out-source vs in-source AUC and matched-geometry test metrics.\",\n",
    "        \"confidence\": \"high\",\n",
    "    },\n",
    "]\n",
    "\n",
    "recommendations_df = pd.DataFrame(recommendations)\n",
    "display(recommendations_df)\n",
    "\n",
    "summary = {\n",
    "    \"phase\": \"phase2\",\n",
    "    \"source_documents\": [\"classifier/v2.md\", \"classifier/impl.md\"],\n",
    "    \"artifact_counts\": {\n",
    "        \"canonical_runs\": int(len(canonical_runs_df)),\n",
    "        \"loaded_canonical_runs\": int(canonical_runs_df[\"log_status\"].isin([\"present\", \"fallback\"]).sum()),\n",
    "        \"fallback_runs_used\": {k: v for k, v in RUN_ALIASES.items() if resolve_run(k) != k},\n",
    "    },\n",
    "    \"recommendations\": recommendations,\n",
    "    \"planned_comparisons\": comparisons_df.replace({np.nan: None}).to_dict(orient=\"records\"),\n",
    "    \"known_gaps\": [\n",
    "        \"Dedicated p2a_*_128 logs are absent/skipped; Phase 1 baselines are used as fallbacks.\",\n",
    "        \"Source holdout logs do not include prediction-level or per-source metrics, so held-out-source AUC vs in-source AUC cannot be computed.\",\n",
    "        \"No matched-geometry evaluation metric is present in p2c logs, so geometry shortcut analysis is incomplete.\",\n",
    "    ],\n",
    "}\n",
    "\n",
    "summary_path = ANALYSIS_DIR / \"phase2_analysis_summary.json\"\n",
    "with summary_path.open(\"w\") as f:\n",
    "    json.dump(summary, f, indent=2)\n",
    "\n",
    "print(f\"Saved summary: {summary_path.relative_to(PROJECT_ROOT)}\")\n",
    "print(f\"Saved figures: {FIGURES_DIR.relative_to(PROJECT_ROOT)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5a337f73",
   "metadata": {},
   "source": [
    "## Report-ready conclusion\n",
    "\n",
    "The strongest Phase 2 result is the resolution effect for ResNet18: moving to 224x224 substantially improves AUC under the controlled CV protocol. Face cropping gives a small positive effect and is reasonable to carry forward, especially because it aligns the model with face evidence rather than background context. Light augmentation is not supported at this 20% data setting: it strongly hurts SimpleCNN and provides no reliable gain for ResNet18, with or without face cropping. ImageNet normalization remains preferable because real-train-only normalization does not improve AUC and is less aligned with pretrained ResNet expectations.\n",
    "\n",
    "Recommended Phase 3 preprocessing: **224x224, facecrop enabled, no light augmentation, ImageNet normalization**.\n",
    "\n",
    "Limitations to fix before the final report: export prediction-level records or per-source pairwise metrics for source holdout, and add the matched-geometry evaluation required by the shortcut-analysis plan. Without those artifacts, Phase 2C can only support a limited shortcut analysis."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "drl",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}