Phase 4 classifier

2026-05-05 11:42:14 +01:00
225 changed files with 7833 additions and 24988 deletions
@@ -67,6 +67,3 @@ generator/outputs/samples/*
 .venv/
 .ipynb_checkpoints/
 __pycache__/
-
-#Presentation
-presentation_inputs.zip
@@ -1,264 +1,125 @@
-# Deep learning face project
+# DRL_PROJ — DeepFake Detection

-This repository contains a two-part deep learning project on the
-DeepFakeFace (DFF) dataset:
+Deep learning project for binary deepfake detection on the DeepFakeFace dataset.

-1. **Classifier:** detect whether a face image is real or fake.
-2. **Generator:** train generative models that produce new fake face images.
+## Project structure

-The project is written as an experimental report. The notebooks are the main
-deliverable: they show the pipeline, the intermediate failures, the ablations,
-the decisions, and the final models. Read them in order.
-
-## Project story
-
-The work follows the same principle in both parts: start with a simple
-baseline, inspect what fails, change one important factor at a time, and keep
-the evidence tied to saved logs and saved artifacts.
-
-For the **classifier**, the story moves from dataset understanding to
-preprocessing, baseline models, controlled ablations, Grad-CAM inspection,
-stronger model families, and data scaling. The final practical classifier is a
-ResNet50-style pipeline using face crops, 224×224 inputs, ImageNet/default
-normalization, and no stochastic augmentation at validation/test time.
-
-For the **generator**, the story starts with raw baseline failures, then locks
-the data pipeline before comparing three parallel model-family branches:
-GAN, VAE, and DDPM. The final comparison keeps quality versus speed central:
-DDPM gives the best saved FID and visual quality, GAN is the best
-quality-speed compromise, and VAE is the fastest but smoothest option.
-
-## How to read the project
-
-Start with the classifier notebooks, then read the generator notebooks. The
-generator has one linear setup stage followed by three parallel branches:
-GAN, VAE, and DDPM. Those branches are numbered in reading order, but they are
-conceptually parallel experiments after the pipeline is selected.
-
-### Classifier notebooks
-
-Read these first:
-
-1. `classifier/notebooks/01_eda.ipynb`  
-   Dataset composition, real/fake source mapping, image statistics, and
-   shortcut risks.
-2. `classifier/notebooks/02_preprocessing.ipynb`  
-   Deterministic preprocessing, train-only augmentation, face crops, and
-   normalization.
-3. `classifier/notebooks/03_phase1_analysis.ipynb`  
-   SimpleCNN and ResNet18 controlled baselines.
-4. `classifier/notebooks/04_phase2_analysis.ipynb`  
-   Resolution, normalization, source holdouts, facecrop, and augmentation
-   ablations.
-5. `classifier/notebooks/05_gradcam_analysis.ipynb`  
-   Qualitative localization analysis across the classifier pipeline.
-6. `classifier/notebooks/06_phase3_model_family_analysis.ipynb`  
-   Stronger pretrained model families and the ResNet50 practical choice.
-7. `classifier/notebooks/07_phase4_data_scaling_analysis.ipynb`  
-   Data scaling for strong backbones and the final classifier decision.
-
-### Generator notebooks
-
-Read these after the classifier:
-
-1. `generator/notebooks/01_baseline_sanity_check.ipynb`  
-   Raw baseline failures and why the data pipeline must be fixed first.
-2. `generator/notebooks/02_pipeline_selection.ipynb`  
-   Controlled pipeline ablations: resolution, alignment, augmentation, and
-   raw/aligned mixing.
-3. `generator/notebooks/03_gan_stability_progression.ipynb`  
-   GAN branch: DCGAN → WGAN-GP → spectral normalization + GroupNorm +
-   self-attention → 128×128 check.
-4. `generator/notebooks/04_vae_loss_progression.ipynb`  
-   VAE branch: MSE + KL → perceptual loss → PatchGAN adversarial loss.
-5. `generator/notebooks/05_ddpm_recipe_progression.ipynb`  
-   DDPM branch: linear schedule → cosine schedule → v-prediction → wider
-   backbone.
-6. `generator/notebooks/06_final_family_comparison.ipynb`  
-   Final comparison of the selected GAN, VAE, and DDPM recipes under saved
-   Phase 5 conditions.
-7. `generator/notebooks/07_final_sample_showcase.ipynb`  
-   Curated final sample examples from saved outputs. This is qualitative
-   showcase material, not a replacement for FID.
-
-## What the notebooks do
-
-The notebooks are analysis/report chapters. They load existing configs, logs,
-figures, saved sample grids, checkpoints, and prediction summaries. They are
-not intended to launch new training runs.
-
-When a notebook shows a plot or image grid, the surrounding markdown explains:
-
- what the artifact shows;
- why it is needed;
- how it supports the phase decision;
- what limitation remains.
-
-This is important because the project is evaluated not only by final
-performance, but by the documented evolution of the solution.
-
-## Repository layout
-
-```text
+```
 DRL_PROJ/
-  classifier/
-    configs/       experiment configs by phase
-    notebooks/     classifier report notebooks
-    outputs/       saved logs, figures, Grad-CAM panels, checkpoints
-    src/           classifier data, models, training, evaluation
-    tests/         unit and smoke tests
-    tools/         facecrop, Grad-CAM, inference, reevaluation helpers
-
-  generator/
-    configs/       generator configs by phase/family
-    notebooks/     generator report notebooks and notebook builder
-    outputs/       saved logs, sample grids, final showcase artifacts
-    src/           generator data, models, training, metrics
-    tests/         unit and smoke tests
-    tools/         sampling and utility scripts
-
-  data/            original DFF dataset root, not committed
-  cropped/         preprocessed face crops, not committed
-  docs/            project statement and supporting documents
-  pipeline/        optional remote/GPU orchestration helpers
+  classifier/       ← discriminative model (real vs. fake classifier)
+    src/            ← model definitions, training, evaluation, preprocessing
+    configs/        ← experiment configs organised by phase
+      phase1/       ← baseline models (SimpleCNN, ResNet18)
+      phase2/       ← architecture sweep (ResNet variants, face-crop)
+      phase3/       ← EfficientNet, ViT, frequency-aware training
+      phase4/       ← ensemble strategies
+    tools/          ← analyse.py, ensemble.py, inference.py, facecrop.py
+    notebooks/      ← EDA, preprocessing, evaluation, GradCAM
+    outputs/        ← models, logs, figures (gitignored except .pt/.json)
+    run.py          ← main training entry point
+  generator/        ← generative model (GAN / VAE / diffusion) — in progress
+  pipeline/         ← Vast.ai ephemeral GPU orchestration
+  data/             ← dataset root (gitignored)
+  cropped/          ← MTCNN pre-cropped faces (gitignored)
+    classifier/     ← bbox crops for the classifier
+    generator/      ← landmark-aligned crops for the generator
 ```

-## Rebuilding the generator notebooks
-
-The generator notebooks are generated from a single source file:
-
-```bash
-cd generator/notebooks
-python _build.py
-```
-
-That builder writes the numbered generator notebooks listed above. It uses
-existing saved logs and artifacts; it does not train models.
-
 ## Setup

-Create a conda environment and install the project requirements:
+Create a local environment when you want to run the code directly on a machine you control:

 ```bash
-conda create -n drl python=3.12
-conda activate drl
+python3 -m venv .venv
+source .venv/bin/activate
 python -m pip install --upgrade pip setuptools wheel
 python -m pip install -r requirements.txt
 ```

-Use **Python 3.12**; some dependencies (for example `facenet-pytorch`) are
-unreliable on 3.13+.
-
-The raw dataset should be placed under `data/`. Preprocessed crops are stored
-under `cropped/`. These folders are intentionally not committed. To download
-and extract the dataset:
+## Local Training

 ```bash
-python classifier/tools/fetch_ds.py
-python classifier/tools/fetch_ds.py --data-dir /path/to/DFF
+python3 classifier/run.py classifier/configs/phase2/p2_resnet18_facecrop.json
+python3 classifier/run.py classifier/configs/phase3/p3_efficientnet_b0.json
 ```

-Expected layout under the data root: `wiki/<identity>/*.jpg`,
-`inpainting/...`, `text2img/...`, `insight/...`.
+## Ephemeral Vast.ai Pipeline

-## Classifier — training
+The deployment/orchestration path now lives under [`pipeline/`](pipeline/README.md).

-From the repository root:
+One-time setup:

 ```bash
-# CPU (slow but valid)
-python classifier/run.py classifier/configs/phase4/p4_convnext_tiny_100pct.json
-
-# GPU when CUDA is available
-python classifier/run.py classifier/configs/phase4/p4_convnext_tiny_100pct.json --use-gpu
+cat > pipeline/.env <<'EOF'
+VAST_API_KEY=<your-api-key>
+VAST_SSH_PRIVATE_KEY=/home/your-user/.ssh/id_ed25519
+EOF
 ```

-Training uses 5-fold stratified group cross-validation. Per-fold checkpoints
-are saved as `classifier/outputs/models/{run_name}_fold{k}_best.pt` (and
-`_final.pt`). Override data or output locations with `--data-dir` and
-`--output-root`.
-
-**Primary delivery model** (best Phase 4 detector): config
-`classifier/configs/phase4/p4_convnext_tiny_100pct.json` with per-fold
-weights `classifier/outputs/models/p4_convnext_tiny_100pct_fold*_best.pt`.
-
-## Classifier — inference
-
-Classify a single image as real or fake:
+End-to-end ephemeral run:

 ```bash
-python classifier/tools/inference.py image.jpg classifier/configs/phase4/p4_convnext_tiny_100pct.json
+python3 -m pipeline run classifier/configs/phase2/p2_resnet18_facecrop.json --upload-data
 ```

-This loads the config and the matching checkpoint, runs the image through the
-model, and prints a result like:
-
-```
-Image : image.jpg
-Model : p4_convnext_tiny_100pct (convnext_tiny)
-Device: cuda
-Result: FAKE  (confidence: 74.7%)
-P(fake): 0.7466   P(real): 0.2534
-```
-
-If you omit `--checkpoint`, the tool automatically looks for a saved
-checkpoint under `classifier/outputs/models/` — first the single-run
-`{run_name}_best.pt`, then CV fold files `{run_name}_fold{k}_best.pt`, then
-`{run_name}_fold{k}_final.pt`. To use a specific fold:
+Interactive offer selection:

 ```bash
-python classifier/tools/inference.py image.jpg classifier/configs/phase4/p4_convnext_tiny_100pct.json \
-  --checkpoint classifier/outputs/models/p4_convnext_tiny_100pct_fold0_best.pt
+python3 -m pipeline offers --select-offer
 ```

-## Generator — training
-
-From the repository root:
+You can override the ranking mode per run:

 ```bash
-python generator/run.py generator/configs/phase0/p0_vae.json
-python generator/run.py generator/configs/phase0/p0_ddpm.json
+python3 -m pipeline offers --sort price
+python3 -m pipeline offers --sort performance
+python3 -m pipeline offers --sort performance --price 0.14
 ```

-Generator training expects real-face images (default source is `wiki`); use
-`--data-dir` to point at your dataset tree. Checkpoints are saved under
-`generator/outputs/models/{run_name}_final_ema.pt` (EMA shadow) and
-`{run_name}_best_ema.pt` (lowest-FID snapshot).
-
-## Generator — inference (sampling)
-
-Generate 4×4 sample grids from Phase 5 EMA checkpoints:
+You can also filter by region:

 ```bash
-python generator/tools/sampling.py --models p5_gan p5_vae p5_ddpm --samples 10
+python3 -m pipeline offers --select-offer --region europe
+python3 -m pipeline offers --select-offer --region Portugal
+python3 -m pipeline offers --select-offer --region US
+python3 -m pipeline offers --select-offer --region europe --price 0.14
 ```

-Options:
+To inspect which region strings are currently available from the search results:

- `--models` — which models to sample from (`p5_gan`, `p5_vae`, `p5_ddpm`;
-  defaults to all three).
- `--samples` — number of grids per model (default 10).
- `--output-dir` — where to write the PNGs (default
-  `generator/outputs/samples/final_comparison/`).
- `--truncation` — optional latent truncation for the GAN (lower = less
-  diversity but sharper).
- `--device` — `cuda` or `cpu` (default: auto-detect).
+```bash
+python3 -m pipeline offers --list-regions
+```

-Each grid is a 4×4 PNG of 16 images sampled from the model's EMA weights.
-GAN samples are drawn from random latent vectors, VAE samples decode from the
-learned prior, and DDPM samples use 50-step DDIM.
+The end-to-end `python3 -m pipeline run` command:
+- ensures your SSH public key is registered with Vast.ai
+- searches offers using the filters in `pipeline/defaults/vast.json`
+- creates an instance
+- waits for SSH readiness
+- syncs the repo
+- uploads `data/` when `--upload-data` is set
+- runs `python3 classifier/run.py ...`
+- downloads `classifier/outputs/`
+- for generator runs, rsyncs `generator/outputs/` back every 25 epochs and again at completion
+- destroys the instance automatically unless `--keep-on-failure` is set

-## Final takeaway
+Useful commands:

-The project is best understood as a sequence of controlled decisions:
+```bash
+python3 -m pipeline up
+python3 -m pipeline status <instance_id>
+python3 -m pipeline down <instance_id>
+```

-1. cleanly define the data and preprocessing;
-2. establish simple baselines;
-3. improve one factor at a time;
-4. compare model families using saved evidence;
-5. report both performance and limitations.
+To override the default Vast search/runtime settings, copy `pipeline/defaults/vast.json`, edit it, and pass:

-The classifier becomes reliable through source-aware preprocessing, stronger
-pretrained backbones, and scaling. The generator improves by first locking the
-face-aligned pipeline and then selecting the best recipe inside each model
-family before the final GAN/VAE/DDPM comparison.
+```bash
+python3 -m pipeline run classifier/configs/phase3/p3_efficientnet_b0.json --pipeline-config /path/to/vast.override.json
+```
+
+The default policy in `pipeline/defaults/vast.json` now targets:
+- `1x` GPU
+- `RTX 3090` or `RTX 3090 Ti`
+- `<= $0.20/hour`
+- offers sorted by `dlperf`, descending
+- `vastai/pytorch:latest` as the default image
@@ -1,6 +1,6 @@
 {
  "extends": "_base.json",
-  "run_name": "p4_convnext_tiny_100pct",
+  "run_name": "p4a_convnext_tiny_100pct",
  "backbone": "convnext_tiny",
  "subsample": 1.0
 }
@@ -1,6 +1,6 @@
 {
  "extends": "_base.json",
-  "run_name": "p4_convnext_tiny_50pct",
+  "run_name": "p4a_convnext_tiny_50pct",
  "backbone": "convnext_tiny",
  "subsample": 0.5
 }
@@ -1,6 +1,6 @@
 {
  "extends": "_base.json",
-  "run_name": "p4_efficientnet_b0_100pct",
+  "run_name": "p4a_efficientnet_b0_100pct",
  "backbone": "efficientnet_b0",
  "subsample": 1.0
 }
@@ -1,6 +1,6 @@
 {
  "extends": "_base.json",
-  "run_name": "p4_efficientnet_b0_50pct",
+  "run_name": "p4a_efficientnet_b0_50pct",
  "backbone": "efficientnet_b0",
  "subsample": 0.5
 }
@@ -1,6 +1,6 @@
 {
  "extends": "_base.json",
-  "run_name": "p4_resnet50_100pct",
+  "run_name": "p4a_resnet50_100pct",
  "backbone": "resnet50",
  "subsample": 1.0
 }
@@ -1,6 +1,6 @@
 {
  "extends": "_base.json",
-  "run_name": "p4_resnet50_50pct",
+  "run_name": "p4a_resnet50_50pct",
  "backbone": "resnet50",
  "subsample": 0.5
 }
@@ -1,18 +0,0 @@
-{
-  "extends": "../shared.json",
-  "run_name": "smoke",
-  "backbone": "simple_cnn",
-  "cnn_preset": "micro",
-  "dropout": 0.0,
-  "epochs": 1,
-  "cv_folds": 2,
-  "image_size": 64,
-  "batch_size": 8,
-  "num_workers": 0,
-  "early_stopping_patience": 0,
-  "subsample": 1.0,
-  "augment": false,
-  "lr": 0.001,
-  "T_max": 1,
-  "data_dir": "data"
-}
@@ -0,0 +1,702 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "# Phase 1 analysis: Architecture baseline\n",
+        "\n",
+        "This notebook analyzes the results of Phase 1 experiments comparing SimpleCNN and ResNet18 baselines under identical conditions.\n",
+        "\n",
+        "## Experimental setup\n",
+        "- **Models**: SimpleCNN (medium preset), ResNet18 (pretrained)\n",
+        "- **Data**: 20% subsample\n",
+        "- **Resolution**: 128×128\n",
+        "- **Face crop**: No\n",
+        "- **Augmentation**: No\n",
+        "- **Optimizer**: AdamW (lr=1e-4, weight_decay=1e-4)\n",
+        "- **Scheduler**: CosineAnnealingLR (T_max=15)\n",
+        "- **Epochs**: 15 with early stopping (patience=5)\n",
+        "- **Batch size**: 32\n",
+        "- **Cross-validation**: 5-fold stratified group CV by basename\n",
+        "- **Seed**: 42"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "import json\n",
+        "import numpy as np\n",
+        "import pandas as pd\n",
+        "import matplotlib.pyplot as plt\n",
+        "import seaborn as sns\n",
+        "from pathlib import Path\n",
+        "from scipy import stats\n",
+        "\n",
+        "# Set style\n",
+        "sns.set_style(\"whitegrid\")\n",
+        "plt.rcParams['figure.figsize'] = (12, 6)\n",
+        "plt.rcParams['font.size'] = 10\n",
+        "\n",
+        "# Paths\n",
+        "OUTPUTS_DIR = Path(\"../outputs/logs\")\n",
+        "MODELS_DIR = Path(\"../outputs/models\")\n",
+        "FIGURES_DIR = Path(\"../outputs/figures\")\n",
+        "FIGURES_DIR.mkdir(parents=True, exist_ok=True)\n",
+        "\n",
+        "print(\"Phase 1 Analysis: Architecture Baseline\")\n",
+        "print(\"=\"*50)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Load CV results"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "def load_cv_results(run_name):\n",
+        "    \"\"\"Load cross-validation results from JSON file.\"\"\"\n",
+        "    results_path = OUTPUTS_DIR / f\"{run_name}.json\"\n",
+        "    if not results_path.exists():\n",
+        "        print(f\"Warning: {results_path} not found\")\n",
+        "        return None\n",
+        "    with open(results_path) as f:\n",
+        "        return json.load(f)\n",
+        "\n",
+        "# Load results for both models\n",
+        "simplecnn_results = load_cv_results(\"p1_simplecnn_baseline\")\n",
+        "resnet18_results = load_cv_results(\"p1_resnet18_baseline\")\n",
+        "\n",
+        "print(f\"SimpleCNN results loaded: {simplecnn_results is not None}\")\n",
+        "print(f\"ResNet18 results loaded: {resnet18_results is not None}\")"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Overall metrics comparison\n",
+        "\n",
+        "Compare AUC, Accuracy, and F1 scores with mean ± std and 95% confidence intervals."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "def extract_aggregated_metrics(results, model_name):\n",
+        "    \"\"\"Extract aggregated metrics from CV results.\"\"\"\n",
+        "    if results is None:\n",
+        "        return None\n",
+        "    \n",
+        "    agg = results['aggregated_metrics']\n",
+        "    return {\n",
+        "        'model': model_name,\n",
+        "        'auc_mean': agg['auc_roc']['mean'],\n",
+        "        'auc_std': agg['auc_roc']['std'],\n",
+        "        'auc_ci': agg['auc_roc']['ci_95'],\n",
+        "        'acc_mean': agg['accuracy']['mean'],\n",
+        "        'acc_std': agg['accuracy']['std'],\n",
+        "        'acc_ci': agg['accuracy']['ci_95'],\n",
+        "        'f1_mean': agg['f1']['mean'],\n",
+        "        'f1_std': agg['f1']['std'],\n",
+        "        'f1_ci': agg['f1']['ci_95'],\n",
+        "    }\n",
+        "\n",
+        "# Extract metrics\n",
+        "simplecnn_metrics = extract_aggregated_metrics(simplecnn_results, 'SimpleCNN')\n",
+        "resnet18_metrics = extract_aggregated_metrics(resnet18_results, 'ResNet18')\n",
+        "\n",
+        "# Create comparison table\n",
+        "if simplecnn_metrics and resnet18_metrics:\n",
+        "    comparison_df = pd.DataFrame([simplecnn_metrics, resnet18_metrics])\n",
+        "    comparison_df.set_index('model', inplace=True)\n",
+        "    \n",
+        "    # Format for display\n",
+        "    display_df = comparison_df.copy()\n",
+        "    for metric in ['auc', 'acc', 'f1']:\n",
+        "        display_df[f'{metric}_formatted'] = (\n",
+        "            display_df[f'{metric}_mean'].apply(lambda x: f\"{x:.4f}\") + \" ± \" +\n",
+        "            display_df[f'{metric}_std'].apply(lambda x: f\"{x:.4f}\") +\n",
+        "            \" (95% CI: ±\" + display_df[f'{metric}_ci'].apply(lambda x: f\"{x:.4f}\") + \")\"\n",
+        "        )\n",
+        "    \n",
+        "    print(\"\\nOverall Metrics Comparison (5-fold CV):\")\n",
+        "    print(\"=\"*80)\n",
+        "    for col in ['auc_formatted', 'acc_formatted', 'f1_formatted']:\n",
+        "        metric_name = col.replace('_formatted', '').upper()\n",
+        "        print(f\"\\n{metric_name}:\")\n",
+        "        for model in display_df.index:\n",
+        "            print(f\"  {model}: {display_df.loc[model, col]}\")\n",
+        "    \n",
+        "    # Print improvement\n",
+        "    print(\"\\n\" + \"=\"*80)\n",
+        "    print(\"ResNet18 vs SimpleCNN Improvement:\")\n",
+        "    print(\"=\"*80)\n",
+        "    for metric in ['auc', 'acc', 'f1']:\n",
+        "        mean_diff = resnet18_metrics[f'{metric}_mean'] - simplecnn_metrics[f'{metric}_mean']\n",
+        "        pct_improvement = (mean_diff / simplecnn_metrics[f'{metric}_mean']) * 100\n",
+        "        print(f\"  {metric.upper()}: {mean_diff:+.4f} ({pct_improvement:+.2f}%)\")"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Visualization: Overall metrics comparison"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "if simplecnn_metrics and resnet18_metrics:\n",
+        "    fig, axes = plt.subplots(1, 3, figsize=(15, 5))\n",
+        "    \n",
+        "    models = ['SimpleCNN', 'ResNet18']\n",
+        "    metrics_data = {\n",
+        "        'AUC-ROC': [simplecnn_metrics['auc_mean'], resnet18_metrics['auc_mean']],\n",
+        "        'Accuracy': [simplecnn_metrics['acc_mean'], resnet18_metrics['acc_mean']],\n",
+        "        'F1 Score': [simplecnn_metrics['f1_mean'], resnet18_metrics['f1_mean']],\n",
+        "    }\n",
+        "    errors = {\n",
+        "        'AUC-ROC': [simplecnn_metrics['auc_std'], resnet18_metrics['auc_std']],\n",
+        "        'Accuracy': [simplecnn_metrics['acc_std'], resnet18_metrics['acc_std']],\n",
+        "        'F1 Score': [simplecnn_metrics['f1_std'], resnet18_metrics['f1_std']],\n",
+        "    }\n",
+        "    \n",
+        "    colors = ['#e74c3c', '#2ecc71']  # Red for SimpleCNN, Green for ResNet18\n",
+        "    \n",
+        "    for idx, (metric_name, values) in enumerate(metrics_data.items()):\n",
+        "        ax = axes[idx]\n",
+        "        bars = ax.bar(models, values, yerr=errors[metric_name], capsize=5, alpha=0.7, color=colors)\n",
+        "        ax.set_ylabel(metric_name)\n",
+        "        ax.set_title(f'{metric_name} Comparison')\n",
+        "        ax.set_ylim(0.5, 1.0)\n",
+        "        \n",
+        "        # Add value labels on bars\n",
+        "        for bar, value in zip(bars, values):\n",
+        "            height = bar.get_height()\n",
+        "            ax.text(bar.get_x() + bar.get_width()/2., height,\n",
+        "                   f'{value:.4f}',\n",
+        "                   ha='center', va='bottom', fontweight='bold')\n",
+        "    \n",
+        "    plt.tight_layout()\n",
+        "    plt.savefig(FIGURES_DIR / 'phase1_overall_metrics.png', dpi=300, bbox_inches='tight')\n",
+        "    plt.show()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Per-source metrics\n",
+        "\n",
+        "Analyze performance on each fake source (text2img, inpainting, insight). Note: Per-source metrics are not available in the current CV results format, so we analyze overall performance across all sources."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "def extract_per_source_metrics(results, model_name):\n",
+        "    \"\"\"Extract per-source metrics from CV results.\"\"\"\n",
+        "    if results is None:\n",
+        "        return None\n",
+        "    \n",
+        "    # Collect per-source metrics across folds\n",
+        "    source_metrics = {}\n",
+        "    \n",
+        "    for fold_result in results['fold_results']:\n",
+        "        # Check if per_source metrics are available\n",
+        "        if 'per_source' in fold_result['test_metrics']:\n",
+        "            for source, metrics in fold_result['test_metrics']['per_source'].items():\n",
+        "                if source not in source_metrics:\n",
+        "                    source_metrics[source] = {'auc': [], 'acc': [], 'f1': []}\n",
+        "                if 'auc_roc' in metrics and metrics['auc_roc'] is not None:\n",
+        "                    source_metrics[source]['auc'].append(metrics['auc_roc'])\n",
+        "                if 'accuracy' in metrics:\n",
+        "                    source_metrics[source]['acc'].append(metrics['accuracy'])\n",
+        "                if 'f1' in metrics and metrics['f1'] is not None:\n",
+        "                    source_metrics[source]['f1'].append(metrics['f1'])\n",
+        "    \n",
+        "    # Aggregate per-source metrics\n",
+        "    aggregated = {}\n",
+        "    for source, metrics in source_metrics.items():\n",
+        "        aggregated[source] = {\n",
+        "            'auc_mean': np.mean(metrics['auc']) if metrics['auc'] else None,\n",
+        "            'auc_std': np.std(metrics['auc']) if len(metrics['auc']) > 1 else 0,\n",
+        "            'acc_mean': np.mean(metrics['acc']) if metrics['acc'] else None,\n",
+        "            'acc_std': np.std(metrics['acc']) if len(metrics['acc']) > 1 else 0,\n",
+        "            'f1_mean': np.mean(metrics['f1']) if metrics['f1'] else None,\n",
+        "            'f1_std': np.std(metrics['f1']) if len(metrics['f1']) > 1 else 0,\n",
+        "        }\n",
+        "    \n",
+        "    return {'model': model_name, 'sources': aggregated}\n",
+        "\n",
+        "# Extract per-source metrics\n",
+        "simplecnn_source = extract_per_source_metrics(simplecnn_results, 'SimpleCNN')\n",
+        "resnet18_source = extract_per_source_metrics(resnet18_results, 'ResNet18')\n",
+        "\n",
+        "def fmt_metric(metrics, key):\n",
+        "    \"\"\"Format mean±std, or 'N/A' when the metric is missing.\"\"\"\n",
+        "    mean, std = metrics.get(f'{key}_mean'), metrics.get(f'{key}_std', 0)\n",
+        "    return 'N/A' if mean is None else f\"{mean:.4f}±{std:.4f}\"\n",
+        "\n",
+        "# Report per-source results only when at least one model actually has them\n",
+        "both_loaded = simplecnn_source is not None and resnet18_source is not None\n",
+        "all_sources = (set(simplecnn_source['sources']) | set(resnet18_source['sources'])) if both_loaded else set()\n",
+        "if all_sources:\n",
+        "    print(\"\\nPer-Source Metrics Comparison:\")\n",
+        "    print(\"=\"*80)\n",
+        "    for source in sorted(all_sources):\n",
+        "        print(f\"\\nSource: {source}\")\n",
+        "        print(\"-\" * 40)\n",
+        "        for label, src in (('SimpleCNN', simplecnn_source), ('ResNet18', resnet18_source)):\n",
+        "            s = src['sources'].get(source, {})\n",
+        "            print(f\"  {label}: AUC={fmt_metric(s, 'auc')}, Acc={fmt_metric(s, 'acc')}, F1={fmt_metric(s, 'f1')}\")\n",
+        "else:\n",
+        "    print(\"\\nNote: Per-source metrics not available in current CV results format.\")\n",
+        "    print(\"The models were evaluated on all sources combined.\")"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Train/Val/Test performance curves"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "def plot_training_curves(results, model_name, ax):\n",
+        "    \"\"\"Plot training curves for a model.\"\"\"\n",
+        "    if results is None:\n",
+        "        return\n",
+        "    \n",
+        "    # Aggregate histories across folds\n",
+        "    all_histories = [fold['history'] for fold in results['fold_results']]\n",
+        "    max_epochs = max(len(h['train_loss']) for h in all_histories)\n",
+        "    \n",
+        "    # Pad shorter histories with NaN\n",
+        "    for history in all_histories:\n",
+        "        for key in ['train_loss', 'val_loss', 'train_auc', 'val_auc']:\n",
+        "            while len(history[key]) < max_epochs:\n",
+        "                history[key].append(np.nan)\n",
+        "    \n",
+        "    # Compute mean and std across folds\n",
+        "    epochs = np.arange(1, max_epochs + 1)\n",
+        "    \n",
+        "    train_loss_mean = np.nanmean([h['train_loss'] for h in all_histories], axis=0)\n",
+        "    train_loss_std = np.nanstd([h['train_loss'] for h in all_histories], axis=0)\n",
+        "    val_loss_mean = np.nanmean([h['val_loss'] for h in all_histories], axis=0)\n",
+        "    val_loss_std = np.nanstd([h['val_loss'] for h in all_histories], axis=0)\n",
+        "    \n",
+        "    train_auc_mean = np.nanmean([h['train_auc'] for h in all_histories], axis=0)\n",
+        "    train_auc_std = np.nanstd([h['train_auc'] for h in all_histories], axis=0)\n",
+        "    val_auc_mean = np.nanmean([h['val_auc'] for h in all_histories], axis=0)\n",
+        "    val_auc_std = np.nanstd([h['val_auc'] for h in all_histories], axis=0)\n",
+        "    \n",
+        "    # Plot loss\n",
+        "    ax[0].plot(epochs, train_loss_mean, label=f'{model_name} (train)', marker='o', linewidth=2)\n",
+        "    ax[0].fill_between(epochs, train_loss_mean - train_loss_std, train_loss_mean + train_loss_std, alpha=0.2)\n",
+        "    ax[0].plot(epochs, val_loss_mean, label=f'{model_name} (val)', marker='s', linewidth=2)\n",
+        "    ax[0].fill_between(epochs, val_loss_mean - val_loss_std, val_loss_mean + val_loss_std, alpha=0.2)\n",
+        "    ax[0].set_xlabel('Epoch', fontweight='bold')\n",
+        "    ax[0].set_ylabel('Loss', fontweight='bold')\n",
+        "    ax[0].set_title('Training/Validation Loss', fontweight='bold')\n",
+        "    ax[0].legend()\n",
+        "    ax[0].grid(True, alpha=0.3)\n",
+        "    \n",
+        "    # Plot AUC\n",
+        "    ax[1].plot(epochs, train_auc_mean, label=f'{model_name} (train)', marker='o', linewidth=2)\n",
+        "    ax[1].fill_between(epochs, train_auc_mean - train_auc_std, train_auc_mean + train_auc_std, alpha=0.2)\n",
+        "    ax[1].plot(epochs, val_auc_mean, label=f'{model_name} (val)', marker='s', linewidth=2)\n",
+        "    ax[1].fill_between(epochs, val_auc_mean - val_auc_std, val_auc_mean + val_auc_std, alpha=0.2)\n",
+        "    ax[1].set_xlabel('Epoch', fontweight='bold')\n",
+        "    ax[1].set_ylabel('AUC-ROC', fontweight='bold')\n",
+        "    ax[1].set_title('Training/Validation AUC', fontweight='bold')\n",
+        "    ax[1].legend()\n",
+        "    ax[1].grid(True, alpha=0.3)\n",
+        "    ax[1].set_ylim(0.5, 1.0)\n",
+        "\n",
+        "# Plot curves for both models\n",
+        "fig, axes = plt.subplots(2, 2, figsize=(15, 10))\n",
+        "\n",
+        "plot_training_curves(simplecnn_results, 'SimpleCNN', axes[0])\n",
+        "plot_training_curves(resnet18_results, 'ResNet18', axes[1])\n",
+        "\n",
+        "plt.tight_layout()\n",
+        "plt.savefig(FIGURES_DIR / 'phase1_training_curves.png', dpi=300, bbox_inches='tight')\n",
+        "plt.show()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Confusion matrices"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "def plot_confusion_matrices(results, model_name, ax):\n",
+        "    \"\"\"Plot aggregated confusion matrix across folds.\"\"\"\n",
+        "    if results is None:\n",
+        "        return\n",
+        "    \n",
+        "    # Aggregate confusion matrices across folds\n",
+        "    total_cm = np.array([[0, 0], [0, 0]])\n",
+        "    \n",
+        "    for fold_result in results['fold_results']:\n",
+        "        cm = np.array(fold_result['test_metrics']['confusion_matrix'])\n",
+        "        total_cm += cm\n",
+        "    \n",
+        "    # Normalize each row (true class); guard against an empty class to avoid division by zero\n",
+        "    cm_normalized = total_cm / np.maximum(total_cm.sum(axis=1, keepdims=True), 1)\n",
+        "    \n",
+        "    # Plot\n",
+        "    im = ax.imshow(cm_normalized, interpolation='nearest', cmap=plt.cm.Blues, vmin=0, vmax=1)\n",
+        "    ax.figure.colorbar(im, ax=ax)\n",
+        "    \n",
+        "    # Add text annotations\n",
+        "    thresh = cm_normalized.max() / 2.\n",
+        "    for i in range(2):\n",
+        "        for j in range(2):\n",
+        "            ax.text(j, i, f'{total_cm[i, j]}\\n({cm_normalized[i, j]:.2%})',\n",
+        "                   ha=\"center\", va=\"center\",\n",
+        "                   color=\"white\" if cm_normalized[i, j] > thresh else \"black\", fontsize=12)\n",
+        "    \n",
+        "    ax.set_ylabel('True Label', fontweight='bold')\n",
+        "    ax.set_xlabel('Predicted Label', fontweight='bold')\n",
+        "    ax.set_title(f'{model_name} Confusion Matrix', fontweight='bold')\n",
+        "    ax.set_xticks([0, 1])\n",
+        "    ax.set_yticks([0, 1])\n",
+        "    ax.set_xticklabels(['Real', 'Fake'])\n",
+        "    ax.set_yticklabels(['Real', 'Fake'])\n",
+        "\n",
+        "# Plot confusion matrices\n",
+        "fig, axes = plt.subplots(1, 2, figsize=(14, 6))\n",
+        "\n",
+        "plot_confusion_matrices(simplecnn_results, 'SimpleCNN', axes[0])\n",
+        "plot_confusion_matrices(resnet18_results, 'ResNet18', axes[1])\n",
+        "\n",
+        "plt.tight_layout()\n",
+        "plt.savefig(FIGURES_DIR / 'phase1_confusion_matrices.png', dpi=300, bbox_inches='tight')\n",
+        "plt.show()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Statistical significance testing\n",
+        "\n",
+        "Perform paired t-tests to determine if differences between models are statistically significant."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "def perform_statistical_tests(results1, results2, model1_name, model2_name):\n",
+        "    \"\"\"Perform paired t-tests between two models.\"\"\"\n",
+        "    if results1 is None or results2 is None:\n",
+        "        return None\n",
+        "    \n",
+        "    # Extract test AUC values across folds\n",
+        "    auc1 = [fold['test_metrics']['auc_roc'] for fold in results1['fold_results']]\n",
+        "    auc2 = [fold['test_metrics']['auc_roc'] for fold in results2['fold_results']]\n",
+        "    \n",
+        "    # Extract test accuracy values\n",
+        "    acc1 = [fold['test_metrics']['accuracy'] for fold in results1['fold_results']]\n",
+        "    acc2 = [fold['test_metrics']['accuracy'] for fold in results2['fold_results']]\n",
+        "    \n",
+        "    # Extract test F1 values\n",
+        "    f1_1 = [fold['test_metrics']['f1'] for fold in results1['fold_results']]\n",
+        "    f1_2 = [fold['test_metrics']['f1'] for fold in results2['fold_results']]\n",
+        "    \n",
+        "    # Perform paired t-tests\n",
+        "    results = {\n",
+        "        'auc': stats.ttest_rel(auc1, auc2),\n",
+        "        'accuracy': stats.ttest_rel(acc1, acc2),\n",
+        "        'f1': stats.ttest_rel(f1_1, f1_2),\n",
+        "    }\n",
+        "    \n",
+        "    print(f\"\\nStatistical Significance Testing: {model1_name} vs {model2_name}\")\n",
+        "    print(\"=\"*80)\n",
+        "    print(f\"\\nPaired t-test ({len(auc1)} folds):\")\n",
+        "    print(f\"{'Metric':<15} {'t-statistic':<15} {'p-value':<15} {'Significant (α=0.05)':<25}\")\n",
+        "    print(\"-\"*80)\n",
+        "    \n",
+        "    for metric, test_result in results.items():\n",
+        "        is_significant = test_result.pvalue < 0.05\n",
+        "        sig_str = \"*** YES ***\" if is_significant else \"No\"\n",
+        "        print(f\"{metric.capitalize():<15} {test_result.statistic:<15.4f} {test_result.pvalue:<15.6f} {sig_str:<25}\")\n",
+        "    \n",
+        "    # Also compute effect size (Cohen's d with pooled SD; this treats the folds as independent samples, unlike the paired t-test)\n",
+        "    print(\"\\n\" + \"-\"*80)\n",
+        "    print(\"Effect Sizes (Cohen's d):\")\n",
+        "    print(\"-\"*80)\n",
+        "    \n",
+        "    def cohens_d(x1, x2):\n",
+        "        n1, n2 = len(x1), len(x2)\n",
+        "        var1, var2 = np.var(x1, ddof=1), np.var(x2, ddof=1)\n",
+        "        pooled_std = np.sqrt(((n1-1)*var1 + (n2-1)*var2) / (n1+n2-2))\n",
+        "        return (np.mean(x1) - np.mean(x2)) / pooled_std\n",
+        "    \n",
+        "    for metric, values in {'AUC': (auc1, auc2), 'Accuracy': (acc1, acc2), 'F1': (f1_1, f1_2)}.items():\n",
+        "        d = cohens_d(values[0], values[1])\n",
+        "        print(f\"  {metric}: {d:.4f} ({'large' if abs(d) > 0.8 else 'medium' if abs(d) > 0.5 else 'small'} effect)\")\n",
+        "    \n",
+        "    return results\n",
+        "\n",
+        "# Perform statistical tests; test_results stays None when either result set is missing\n",
+        "test_results = (\n",
+        "    perform_statistical_tests(simplecnn_results, resnet18_results, 'SimpleCNN', 'ResNet18')\n",
+        "    if simplecnn_results and resnet18_results else None\n",
+        ")"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Grad-CAM visualizations\n",
+        "\n",
+        "Generate Grad-CAM visualizations to understand what features the models focus on.\n",
+        "\n",
+        "**Note**: The code below is a scaffold. Running it requires:\n",
+        "1. Loading the trained model checkpoints\n",
+        "2. Selecting sample images from the test set\n",
+        "3. Running the Grad-CAM algorithm\n",
+        "\n",
+        "The calls stay commented out until model checkpoints are available."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "import sys\n",
+        "sys.path.insert(0, '..')\n",
+        "\n",
+        "from pathlib import Path\n",
+        "from src.data import DFFDataset, get_splits, build_transforms\n",
+        "from src.models import get_model\n",
+        "from src.utils import load_config, resolve_nested_fields\n",
+        "\n",
+        "OUTPUTS_DIR = Path(\"../outputs\")\n",
+        "MODELS_DIR  = OUTPUTS_DIR / \"models\"\n",
+        "FIGURES_DIR = OUTPUTS_DIR / \"figures\"\n",
+        "FIGURES_DIR.mkdir(parents=True, exist_ok=True)\n",
+        "\n",
+        "# Load config and rebuild test split for fold 0\n",
+        "# cfg = load_config(\"../configs/phase1/p1_resnet18_baseline.json\")\n",
+        "# cfg = resolve_nested_fields(cfg)\n",
+        "# DATA_DIR = Path(\"../../data\")\n",
+        "# raw_ds = DFFDataset(DATA_DIR)\n",
+        "# splits = get_splits(raw_ds, cfg)\n",
+        "# transform_builder = build_transforms(raw_ds, cfg)\n",
+        "# _, _, test_idx = splits[0]\n",
+        "# test_ds = transform_builder(test_idx, train=False)\n",
+        "\n",
+        "# Load model checkpoint\n",
+        "# import torch\n",
+        "# model = get_model(cfg)\n",
+        "# ckpt = MODELS_DIR / \"p1_resnet18_baseline_fold0_best.pt\"\n",
+        "# model.load_state_dict(torch.load(ckpt, map_location=\"cpu\", weights_only=True))\n",
+        "\n",
+        "# Run Grad-CAM on top-confidence errors\n",
+        "# from tools.gradcam import save_overlays\n",
+        "# records = [...]  # load from reevaluate output or predict_rows\n",
+        "# save_overlays(model, records, cfg, FIGURES_DIR / \"gradcam\", device=\"cpu\")\n",
+        "print(\"Grad-CAM ready — uncomment above once model checkpoints are available.\")\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Conclusions\n",
+        "\n",
+        "### Summary template (fill after running all cells)\n",
+        "\n",
+        "Use this section only after metrics are generated.\n",
+        "Replace placeholders (`<...>`) with measured values.\n",
+        "\n",
+        "#### 1. Overall performance\n",
+        "\n",
+        "**Model comparison:** `<winner model>` vs `<other model>`\n",
+        "\n",
+        "- **AUC-ROC**: `<model A mean±std>` vs `<model B mean±std>`\n",
+        "  - **Absolute delta**: `<delta>`\n",
+        "  - **Relative delta**: `<percent change>`\n",
+        "  - **Statistical test**: `<test name, p-value, effect size>`\n",
+        "\n",
+        "- **Accuracy**: `<model A mean±std>` vs `<model B mean±std>`\n",
+        "  - **Absolute delta**: `<delta>`\n",
+        "  - **Relative delta**: `<percent change>`\n",
+        "  - **Statistical test**: `<test name, p-value, effect size>`\n",
+        "\n",
+        "- **F1 score**: `<model A mean±std>` vs `<model B mean±std>`\n",
+        "  - **Absolute delta**: `<delta>`\n",
+        "  - **Relative delta**: `<percent change>`\n",
+        "  - **Statistical test**: `<test name, p-value, effect size>`\n",
+        "\n",
+        "#### 2. Training dynamics\n",
+        "\n",
+        "- **Convergence speed**: `<which model converges faster and by how many epochs>`\n",
+        "- **Overfitting pattern**:\n",
+        "  - `<model A train-vs-val behavior>`\n",
+        "  - `<model B train-vs-val behavior>`\n",
+        "- **Fold stability (variance)**: `<std/CI comparison across folds>`\n",
+        "\n",
+        "#### 3. Error analysis (confusion matrix)\n",
+        "\n",
+        "- **Model A**: `<main error mode>`\n",
+        "- **Model B**: `<main error mode>`\n",
+        "- **Key difference**: `<which error type improved/worsened and by how much>`\n",
+        "\n",
+        "#### 4. Why the better model likely performs better\n",
+        "\n",
+        "1. `<reason 1 tied to architecture/pretraining>`\n",
+        "2. `<reason 2 tied to optimization/generalization>`\n",
+        "3. `<reason 3 tied to feature capacity>`\n",
+        "\n",
+        "#### 5. Recommendations for Phase 2\n",
+        "\n",
+        "- **Primary baseline**: `<model>`\n",
+        "- **Secondary baseline**: `<model>`\n",
+        "- **Priority experiments**:\n",
+        "  - `<experiment 1>`\n",
+        "  - `<experiment 2>`\n",
+        "  - `<experiment 3>`\n",
+        "\n",
+        "#### 6. Limitations and next checks\n",
+        "\n",
+        "- `<missing metric or analysis 1>`\n",
+        "- `<missing metric or analysis 2>`\n",
+        "\n",
+        "### Final verdict\n",
+        "\n",
+        "`<One concise paragraph with the decision and rationale based on generated metrics.>`"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Save analysis results"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# Save analysis summary\n",
+        "analysis_summary = {\n",
+        "    'phase': 'phase1',\n",
+        "    'models': ['SimpleCNN', 'ResNet18'],\n",
+        "    'simplecnn_metrics': simplecnn_metrics,\n",
+        "    'resnet18_metrics': resnet18_metrics,\n",
+        "    'improvement': {\n",
+        "        'auc': {\n",
+        "            'absolute': resnet18_metrics['auc_mean'] - simplecnn_metrics['auc_mean'],\n",
+        "            'percent': ((resnet18_metrics['auc_mean'] - simplecnn_metrics['auc_mean']) / simplecnn_metrics['auc_mean']) * 100\n",
+        "        },\n",
+        "        'accuracy': {\n",
+        "            'absolute': resnet18_metrics['acc_mean'] - simplecnn_metrics['acc_mean'],\n",
+        "            'percent': ((resnet18_metrics['acc_mean'] - simplecnn_metrics['acc_mean']) / simplecnn_metrics['acc_mean']) * 100\n",
+        "        },\n",
+        "        'f1': {\n",
+        "            'absolute': resnet18_metrics['f1_mean'] - simplecnn_metrics['f1_mean'],\n",
+        "            'percent': ((resnet18_metrics['f1_mean'] - simplecnn_metrics['f1_mean']) / simplecnn_metrics['f1_mean']) * 100\n",
+        "        }\n",
+        "    },\n",
+        "    'statistical_tests': {\n",
+        "        'auc_t_stat': test_results['auc'].statistic if test_results else None,\n",
+        "        'auc_p_value': test_results['auc'].pvalue if test_results else None,\n",
+        "        'acc_t_stat': test_results['accuracy'].statistic if test_results else None,\n",
+        "        'acc_p_value': test_results['accuracy'].pvalue if test_results else None,\n",
+        "        'f1_t_stat': test_results['f1'].statistic if test_results else None,\n",
+        "        'f1_p_value': test_results['f1'].pvalue if test_results else None,\n",
+        "    } if test_results else None,\n",
+        "    'conclusions': {\n",
+        "        'best_model': 'ResNet18' if resnet18_metrics['auc_mean'] >= simplecnn_metrics['auc_mean'] else 'SimpleCNN',\n",
+        "        'reason': 'Higher mean test AUC across folds; see statistical_tests for significance',\n",
+        "        'recommendation': 'Use best_model as the primary baseline for Phase 2 experiments'\n",
+        "    }\n",
+        "}\n",
+        "\n",
+        "with open(OUTPUTS_DIR / 'phase1_analysis_summary.json', 'w') as f:\n",
+        "    json.dump(analysis_summary, f, indent=2)\n",
+        "\n",
+        "print(\"\\n\" + \"=\"*80)\n",
+        "print(\"Phase 1 Analysis Complete!\")\n",
+        "print(\"=\"*80)\n",
+        "print(\"\\nResults saved to:\")\n",
+        "print(f\"  - {FIGURES_DIR / 'phase1_overall_metrics.png'}\")\n",
+        "print(f\"  - {FIGURES_DIR / 'phase1_training_curves.png'}\")\n",
+        "print(f\"  - {FIGURES_DIR / 'phase1_confusion_matrices.png'}\")\n",
+        "print(f\"  - {OUTPUTS_DIR / 'phase1_analysis_summary.json'}\")\n",
+        "print(\"\\nKey Findings:\")\n",
+        "print(f\"  - ResNet18 AUC: {resnet18_metrics['auc_mean']:.4f}±{resnet18_metrics['auc_std']:.4f}\")\n",
+        "print(f\"  - SimpleCNN AUC: {simplecnn_metrics['auc_mean']:.4f}±{simplecnn_metrics['auc_std']:.4f}\")\n",
+        "print(f\"  - Improvement: {analysis_summary['improvement']['auc']['absolute']:+.4f} ({analysis_summary['improvement']['auc']['percent']:+.2f}%)\")\n",
+        "print(f\"  - Paired t-test on AUC: p = {test_results['auc'].pvalue:.4g}\" if test_results else \"  - Paired t-test on AUC: not run\")"
+      ]
+    }
+  ],
+  "metadata": {
+    "kernelspec": {
+      "display_name": "drl",
+      "language": "python",
+      "name": "python3"
+    },
+    "language_info": {
+      "codemirror_mode": {
+        "name": "ipython",
+        "version": 3
+      },
+      "file_extension": ".py",
+      "mimetype": "text/x-python",
+      "name": "python",
+      "nbconvert_exporter": "python",
+      "pygments_lexer": "ipython3",
+      "version": "3.12.13"
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 4
+}
@@ -0,0 +1,904 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "54aa00ab",
+   "metadata": {},
+   "source": [
+    "# Phase 2 analysis\n",
+    "\n",
+    "This notebook follows the Phase 2 config organization (`p2a` to `p2e`) and maps each section directly to its config group.\n",
+    "It separates three concerns:\n",
+    "\n",
+    "1. **Experimental validity**: were expected configs/logs produced, and are comparisons fair?\n",
+    "2. **Evidence**: what do the 5-fold CV metrics support?\n",
+    "3. **Decision**: which preprocessing choices should move into Phase 3?\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "734db3ee",
+   "metadata": {},
+   "source": [
+    "## Questions\n",
+    "\n",
+    "| Section | Config group | Question | Required evidence |\n",
+    "|---|---|---|---|\n",
+    "| 2A | `p2a_*` | Does the classifier rely on shortcuts (normalization, fake-image source)? | `p2a_t1_original`, `p2a_t2_real_norm`, `p2a_t3_holdout_*` |\n",
+    "| 2B | `p2b_*` | Does 224 improve over 128? | `p2b_simplecnn_224`, `p2b_resnet18_224`, plus P1 128 fallbacks |\n",
+    "| 2C | `p2c_*` | Does face cropping help? | `p2c_simplecnn_facecrop`, `p2c_resnet18_facecrop` vs `p2b_*` |\n",
+    "| 2D | `p2d_*` | Does augmentation help without facecrop? | `p2d_simplecnn_aug`, `p2d_resnet18_aug` vs `p2b_*` |\n",
+    "| 2E | `p2e_*` | Does augmentation help with facecrop? | `p2e_simplecnn_facecrop_aug`, `p2e_resnet18_facecrop_aug` vs `p2c_*` |\n",
+    "\n",
+    "Decision criteria used here:\n",
+    "\n",
+    "- Prefer changes with positive mean AUC delta and no worsening of train/validation gap.\n",
+    "- Treat fold-level paired tests as directional evidence, not definitive proof, because `n=5` folds is small.\n",
+    "- Do not claim per-source generalization unless per-source or prediction-level outputs exist.\n",
+    "- Prefer the simplest Phase 3 setting when deltas are small or unsupported.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1f4c04b3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from __future__ import annotations\n",
+    "\n",
+    "import json\n",
+    "import math\n",
+    "import os\n",
+    "import sys\n",
+    "from dataclasses import dataclass\n",
+    "from pathlib import Path\n",
+    "from typing import Any\n",
+    "\n",
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "import matplotlib.pyplot as plt\n",
+    "import seaborn as sns\n",
+    "from scipy import stats\n",
+    "\n",
+    "try:\n",
+    "    from IPython.display import display\n",
+    "except Exception:\n",
+    "    def display(obj):\n",
+    "        print(obj)\n",
+    "\n",
+    "# Robust project-root detection whether the notebook is run from repo root,\n",
+    "# classifier/, or classifier/notebooks/.\n",
+    "def find_project_root(start: Path | None = None) -> Path:\n",
+    "    start = (start or Path.cwd()).resolve()\n",
+    "    for candidate in [start, *start.parents]:\n",
+    "        if (candidate / \"classifier\" / \"v2.md\").exists() and (candidate / \"classifier\" / \"impl.md\").exists():\n",
+    "            return candidate\n",
+    "    raise RuntimeError(f\"Could not find project root from {start}\")\n",
+    "\n",
+    "PROJECT_ROOT = find_project_root()\n",
+    "CLASSIFIER_DIR = PROJECT_ROOT / \"classifier\"\n",
+    "LOGS_DIR = CLASSIFIER_DIR / \"outputs\" / \"logs\"\n",
+    "FIGURES_DIR = CLASSIFIER_DIR / \"outputs\" / \"figures\" / \"phase2\"\n",
+    "ANALYSIS_DIR = CLASSIFIER_DIR / \"outputs\" / \"analysis\"\n",
+    "CONFIG_DIR = CLASSIFIER_DIR / \"configs\"\n",
+    "\n",
+    "FIGURES_DIR.mkdir(parents=True, exist_ok=True)\n",
+    "ANALYSIS_DIR.mkdir(parents=True, exist_ok=True)\n",
+    "\n",
+    "if str(CLASSIFIER_DIR) not in sys.path:\n",
+    "    sys.path.insert(0, str(CLASSIFIER_DIR))\n",
+    "\n",
+    "sns.set_theme(style=\"whitegrid\", context=\"notebook\")\n",
+    "plt.rcParams.update({\n",
+    "    \"figure.figsize\": (12, 7),\n",
+    "    \"axes.spines.top\": False,\n",
+    "    \"axes.spines.right\": False,\n",
+    "})\n",
+    "\n",
+    "print(f\"Project root: {PROJECT_ROOT}\")\n",
+    "print(f\"Logs:         {LOGS_DIR}\")\n",
+    "print(f\"Figures:      {FIGURES_DIR}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "24830212",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "@dataclass(frozen=True)\n",
+    "class RunSpec:\n",
+    "    run: str\n",
+    "    label: str\n",
+    "    section: str\n",
+    "    model: str\n",
+    "    condition: str\n",
+    "    intended_role: str\n",
+    "    fallback_for: str | None = None\n",
+    "\n",
+    "RUN_SPECS = [\n",
+    "    # 2A: shortcut analysis (normalization + source holdout), ResNet18 only.\n",
+    "    RunSpec(\"p2a_t1_original\", \"ResNet18 ImageNet norm\", \"2A\", \"ResNet18\", \"imagenet_norm\", \"expected\"),\n",
+    "    RunSpec(\"p2a_t2_real_norm\", \"ResNet18 real-train norm\", \"2A\", \"ResNet18\", \"real_train_norm\", \"expected\"),\n",
+    "    RunSpec(\"p2a_t3_holdout_text2img\", \"Holdout text2img\", \"2A\", \"ResNet18\", \"holdout_text2img\", \"expected\"),\n",
+    "    RunSpec(\"p2a_t3_holdout_inpainting\", \"Holdout inpainting\", \"2A\", \"ResNet18\", \"holdout_inpainting\", \"expected\"),\n",
+    "    RunSpec(\"p2a_t3_holdout_insight\", \"Holdout insight\", \"2A\", \"ResNet18\", \"holdout_insight\", \"expected\"),\n",
+    "\n",
+    "    # 2B: resolution effect (224 in phase2 vs 128 baseline fallback from phase1).\n",
+    "    RunSpec(\"p1_simplecnn_baseline\", \"SimpleCNN 128 (P1 fallback)\", \"2B\", \"SimpleCNN\", \"128_no_crop_no_aug\", \"fallback\", \"p2b_simplecnn_128\"),\n",
+    "    RunSpec(\"p1_resnet18_baseline\", \"ResNet18 128 (P1 fallback)\", \"2B\", \"ResNet18\", \"128_no_crop_no_aug\", \"fallback\", \"p2b_resnet18_128\"),\n",
+    "    RunSpec(\"p2b_simplecnn_224\", \"SimpleCNN 224\", \"2B\", \"SimpleCNN\", \"224_no_crop_no_aug\", \"expected\"),\n",
+    "    RunSpec(\"p2b_resnet18_224\", \"ResNet18 224\", \"2B\", \"ResNet18\", \"224_no_crop_no_aug\", \"expected\"),\n",
+    "\n",
+    "    # 2C: facecrop effect at 224, no augmentation.\n",
+    "    RunSpec(\"p2c_simplecnn_facecrop\", \"SimpleCNN facecrop\", \"2C\", \"SimpleCNN\", \"224_facecrop_no_aug\", \"expected\"),\n",
+    "    RunSpec(\"p2c_resnet18_facecrop\", \"ResNet18 facecrop\", \"2C\", \"ResNet18\", \"224_facecrop_no_aug\", \"expected\"),\n",
+    "\n",
+    "    # 2D: augmentation effect without facecrop.\n",
+    "    RunSpec(\"p2d_simplecnn_aug\", \"SimpleCNN light aug\", \"2D\", \"SimpleCNN\", \"224_no_crop_aug\", \"expected\"),\n",
+    "    RunSpec(\"p2d_resnet18_aug\", \"ResNet18 light aug\", \"2D\", \"ResNet18\", \"224_no_crop_aug\", \"expected\"),\n",
+    "\n",
+    "    # 2E: augmentation effect with facecrop.\n",
+    "    RunSpec(\"p2e_simplecnn_facecrop_aug\", \"SimpleCNN facecrop + aug\", \"2E\", \"SimpleCNN\", \"224_facecrop_aug\", \"expected\"),\n",
+    "    RunSpec(\"p2e_resnet18_facecrop_aug\", \"ResNet18 facecrop + aug\", \"2E\", \"ResNet18\", \"224_facecrop_aug\", \"expected\"),\n",
+    "]\n",
+    "\n",
+    "# Use these aliases when synthetic 128 run IDs are requested for 2B.\n",
+    "RUN_ALIASES = {\n",
+    "    \"p2b_simplecnn_128\": \"p1_simplecnn_baseline\",\n",
+    "    \"p2b_resnet18_128\": \"p1_resnet18_baseline\",\n",
+    "}\n",
+    "\n",
+    "PLANNED_COMPARISONS = [\n",
+    "    (\"2A\", \"ResNet18\", \"normalization\", \"p2a_t1_original\", \"p2a_t2_real_norm\", \"real_norm - imagenet_norm\"),\n",
+    "    (\"2A\", \"ResNet18\", \"source_holdout\", \"p2a_t1_original\", \"p2a_t3_holdout_text2img\", \"holdout text2img - all-source\"),\n",
+    "    (\"2A\", \"ResNet18\", \"source_holdout\", \"p2a_t1_original\", \"p2a_t3_holdout_inpainting\", \"holdout inpainting - all-source\"),\n",
+    "    (\"2A\", \"ResNet18\", \"source_holdout\", \"p2a_t1_original\", \"p2a_t3_holdout_insight\", \"holdout insight - all-source\"),\n",
+    "\n",
+    "    (\"2B\", \"SimpleCNN\", \"resolution\", \"p2b_simplecnn_128\", \"p2b_simplecnn_224\", \"224 - 128\"),\n",
+    "    (\"2B\", \"ResNet18\", \"resolution\", \"p2b_resnet18_128\", \"p2b_resnet18_224\", \"224 - 128\"),\n",
+    "\n",
+    "    (\"2C\", \"SimpleCNN\", \"facecrop\", \"p2b_simplecnn_224\", \"p2c_simplecnn_facecrop\", \"facecrop - no facecrop\"),\n",
+    "    (\"2C\", \"ResNet18\", \"facecrop\", \"p2b_resnet18_224\", \"p2c_resnet18_facecrop\", \"facecrop - no facecrop\"),\n",
+    "\n",
+    "    (\"2D\", \"SimpleCNN\", \"augmentation\", \"p2b_simplecnn_224\", \"p2d_simplecnn_aug\", \"light aug - no aug\"),\n",
+    "    (\"2D\", \"ResNet18\", \"augmentation\", \"p2b_resnet18_224\", \"p2d_resnet18_aug\", \"light aug - no aug\"),\n",
+    "\n",
+    "    (\"2E\", \"SimpleCNN\", \"facecrop + augmentation\", \"p2c_simplecnn_facecrop\", \"p2e_simplecnn_facecrop_aug\", \"facecrop+aug - facecrop\"),\n",
+    "    (\"2E\", \"ResNet18\", \"facecrop + augmentation\", \"p2c_resnet18_facecrop\", \"p2e_resnet18_facecrop_aug\", \"facecrop+aug - facecrop\"),\n",
+    "]\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6e2ccd27",
+   "metadata": {},
+   "source": [
+    "## Evidence audit\n",
+    "\n",
+    "Before comparing numbers, check whether the planned artifacts exist. Dedicated `p2b_*_128` configs/logs are skipped or absent in this repository, so this notebook uses the matching Phase 1 baselines as explicit fallbacks for the 128 vs 224 resolution test."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "53356e8b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def load_json(path: Path) -> dict[str, Any] | None:\n",
+    "    if not path.exists():\n",
+    "        return None\n",
+    "    with path.open() as f:\n",
+    "        return json.load(f)\n",
+    "\n",
+    "\n",
+    "def config_path_for(run: str) -> Path | None:\n",
+    "    candidates = [\n",
+    "        CONFIG_DIR / \"phase2\" / f\"{run}.json\",\n",
+    "        CONFIG_DIR / \"phase2\" / f\"{run}.json.skip\",\n",
+    "        CONFIG_DIR / \"phase1\" / f\"{run}.json\",\n",
+    "        CONFIG_DIR / \"phase1\" / f\"{run}.json.skip\",\n",
+    "    ]\n",
+    "    return next((p for p in candidates if p.exists()), None)\n",
+    "\n",
+    "\n",
+    "def log_path_for(run: str) -> Path:\n",
+    "    return LOGS_DIR / f\"{run}.json\"\n",
+    "\n",
+    "\n",
+    "def resolve_run(run: str) -> str:\n",
+    "    return run if log_path_for(run).exists() else RUN_ALIASES.get(run, run)\n",
+    "\n",
+    "\n",
+    "def load_results(run: str) -> dict[str, Any] | None:\n",
+    "    resolved = resolve_run(run)\n",
+    "    return load_json(log_path_for(resolved))\n",
+    "\n",
+    "\n",
+    "def metric_values(results: dict[str, Any], metric: str = \"auc_roc\") -> np.ndarray:\n",
+    "    vals = []\n",
+    "    for fold in results.get(\"fold_results\", []):\n",
+    "        value = fold.get(\"test_metrics\", {}).get(metric)\n",
+    "        if value is not None:\n",
+    "            vals.append(float(value))\n",
+    "    return np.asarray(vals, dtype=float)\n",
+    "\n",
+    "\n",
+    "def best_epoch_gap(fold: dict[str, Any], metric: str = \"auc\") -> float | None:\n",
+    "    hist = fold.get(\"history\", {})\n",
+    "    train_key = f\"train_{metric}\"\n",
+    "    val_key = f\"val_{metric}\"\n",
+    "    train = hist.get(train_key, [])\n",
+    "    val = hist.get(val_key, [])\n",
+    "    if not train or not val:\n",
+    "        return None\n",
+    "    idx = int(np.nanargmax(np.asarray(val, dtype=float)))\n",
+    "    return float(train[idx] - val[idx])\n",
+    "\n",
+    "\n",
+    "def final_epoch_gap(fold: dict[str, Any], metric: str = \"auc\") -> float | None:\n",
+    "    hist = fold.get(\"history\", {})\n",
+    "    train = hist.get(f\"train_{metric}\", [])\n",
+    "    val = hist.get(f\"val_{metric}\", [])\n",
+    "    if not train or not val:\n",
+    "        return None\n",
+    "    return float(train[-1] - val[-1])\n",
+    "\n",
+    "\n",
+    "def summarize_run(spec: RunSpec) -> dict[str, Any]:\n",
+    "    resolved = resolve_run(spec.run)\n",
+    "    results = load_results(spec.run)\n",
+    "    config_path = config_path_for(spec.run) or config_path_for(resolved)\n",
+    "    cfg = load_json(config_path) if config_path else None\n",
+    "\n",
+    "    row = {\n",
+    "        \"section\": spec.section,\n",
+    "        \"run\": spec.run,\n",
+    "        \"resolved_run\": resolved,\n",
+    "        \"label\": spec.label,\n",
+    "        \"model\": spec.model,\n",
+    "        \"condition\": spec.condition,\n",
+    "        \"role\": spec.intended_role,\n",
+    "        \"fallback_for\": spec.fallback_for,\n",
+    "        \"config_path\": str(config_path.relative_to(PROJECT_ROOT)) if config_path else None,\n",
+    "        \"config_status\": \"present\" if config_path and config_path.suffix == \".json\" else (\"skipped\" if config_path else \"missing\"),\n",
+    "        \"log_status\": \"present\" if log_path_for(spec.run).exists() else (\"fallback\" if resolved != spec.run and log_path_for(resolved).exists() else \"missing\"),\n",
+    "        \"n_folds\": None,\n",
+    "        \"auc_mean\": np.nan,\n",
+    "        \"auc_std\": np.nan,\n",
+    "        \"acc_mean\": np.nan,\n",
+    "        \"f1_mean\": np.nan,\n",
+    "        \"gap_best_mean\": np.nan,\n",
+    "        \"gap_final_mean\": np.nan,\n",
+    "        \"image_size\": None,\n",
+    "        \"face_crop\": None,\n",
+    "        \"augment\": None,\n",
+    "        \"normalization\": None,\n",
+    "        \"train_sources\": None,\n",
+    "        \"eval_sources\": None,\n",
+    "    }\n",
+    "\n",
+    "    if cfg:\n",
+    "        row.update({\n",
+    "            \"image_size\": cfg.get(\"image_size\"),\n",
+    "            \"face_crop\": cfg.get(\"face_crop\"),\n",
+    "            \"augment\": \"light\" if isinstance(cfg.get(\"augment\"), dict) else cfg.get(\"augment\"),\n",
+    "            \"normalization\": cfg.get(\"normalization\"),\n",
+    "            \"train_sources\": tuple(cfg.get(\"train_sources\", [])) or None,\n",
+    "            \"eval_sources\": tuple(cfg.get(\"eval_sources\", [])) or None,\n",
+    "        })\n",
+    "\n",
+    "    if results:\n",
+    "        agg = results.get(\"aggregated_metrics\", {})\n",
+    "        row.update({\n",
+    "            \"n_folds\": results.get(\"n_folds\"),\n",
+    "            \"auc_mean\": agg.get(\"auc_roc\", {}).get(\"mean\", np.nan),\n",
+    "            \"auc_std\": agg.get(\"auc_roc\", {}).get(\"std\", np.nan),\n",
+    "            \"acc_mean\": agg.get(\"accuracy\", {}).get(\"mean\", np.nan),\n",
+    "            \"f1_mean\": agg.get(\"f1\", {}).get(\"mean\", np.nan),\n",
+    "        })\n",
+    "        best_gaps = [best_epoch_gap(f) for f in results.get(\"fold_results\", [])]\n",
+    "        final_gaps = [final_epoch_gap(f) for f in results.get(\"fold_results\", [])]\n",
+    "        best_gaps = [x for x in best_gaps if x is not None]\n",
+    "        final_gaps = [x for x in final_gaps if x is not None]\n",
+    "        row[\"gap_best_mean\"] = float(np.mean(best_gaps)) if best_gaps else np.nan\n",
+    "        row[\"gap_final_mean\"] = float(np.mean(final_gaps)) if final_gaps else np.nan\n",
+    "\n",
+    "    return row\n",
+    "\n",
+    "runs_df = pd.DataFrame([summarize_run(spec) for spec in RUN_SPECS])\n",
+    "\n",
+    "# Canonical rows: expected runs plus P1 fallbacks re-keyed under their 128 aliases (RUN_SPECS has no row named after an alias).\n",
+    "canonical_runs_df = runs_df[runs_df[\"role\"] == \"expected\"].copy()\n",
+    "for missing_run, fallback_run in RUN_ALIASES.items():\n",
+    "    if log_path_for(missing_run).exists() or canonical_runs_df[\"run\"].eq(missing_run).any():\n",
+    "        continue\n",
+    "    fallback = runs_df[runs_df[\"run\"].eq(fallback_run)].copy()\n",
+    "    if not fallback.empty:\n",
+    "        fallback.loc[:, \"run\"] = missing_run\n",
+    "        fallback.loc[:, \"label\"] = fallback.iloc[0][\"label\"].replace(\" (P1 fallback)\", \"\") + \" [P1 fallback]\"\n",
+    "        fallback.loc[:, \"role\"] = \"expected_via_fallback\"\n",
+    "        canonical_runs_df = pd.concat([canonical_runs_df, fallback], ignore_index=True)\n",
+    "\n",
+    "print(\"Artifact audit:\")\n",
+    "display(runs_df[[\"section\", \"run\", \"resolved_run\", \"role\", \"config_status\", \"log_status\", \"n_folds\"]].sort_values([\"section\", \"run\"]))\n",
+    "\n",
+    "missing_expected = runs_df[(runs_df[\"role\"] == \"expected\") & (runs_df[\"log_status\"] == \"missing\")][\"run\"].tolist()\n",
+    "print(f\"\\nExpected runs with no direct log: {missing_expected or 'none'}\")\n",
+    "print(\"Fallbacks used:\", {k: v for k, v in RUN_ALIASES.items() if k in missing_expected})"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b21a9faf",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Protocol consistency audit from loaded logs/configs.\n",
+    "protocol_fields = [\n",
+    "    \"cv_folds\", \"batch_size\", \"early_stopping_patience\", \"seed\", \"subsample\",\n",
+    "    \"lr\", \"weight_decay\", \"T_max\", \"epochs\",\n",
+    "]\n",
+    "\n",
+    "protocol_rows = []\n",
+    "for _, row in canonical_runs_df.iterrows():\n",
+    "    results = load_results(row[\"run\"])\n",
+    "    cfg = (results or {}).get(\"config\", {})\n",
+    "    protocol_rows.append({\"run\": row[\"run\"], **{k: cfg.get(k) for k in protocol_fields}})\n",
+    "\n",
+    "protocol_df = pd.DataFrame(protocol_rows)\n",
+    "display(protocol_df)\n",
+    "\n",
+    "print(\"Field variability across loaded canonical runs:\")\n",
+    "for field in protocol_fields:\n",
+    "    vals = sorted({str(v) for v in protocol_df[field].dropna().unique()})\n",
+    "    print(f\"  {field:28s}: {vals}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6802bcd9",
+   "metadata": {},
+   "source": [
+    "## Results table\n",
+    "\n",
+    "The table below is ranked by AUC and includes two gap estimates:\n",
+    "\n",
+    "- `gap_best_mean`: train AUC minus validation AUC at each fold's best validation epoch. This is closest to the saved best checkpoint.\n",
+    "- `gap_final_mean`: train AUC minus validation AUC at the final epoch. This is useful for diagnosing late overfitting but is less aligned with test evaluation."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "be1ec0ba",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "analysis_df = canonical_runs_df[canonical_runs_df[\"log_status\"].isin([\"present\", \"fallback\"])].copy()\n",
+    "analysis_df = analysis_df.sort_values(\"auc_mean\", ascending=False)\n",
+    "\n",
+    "cols = [\n",
+    "    \"section\", \"label\", \"run\", \"resolved_run\", \"model\", \"condition\", \"log_status\",\n",
+    "    \"auc_mean\", \"auc_std\", \"acc_mean\", \"f1_mean\", \"gap_best_mean\", \"gap_final_mean\",\n",
+    "]\n",
+    "\n",
+    "display(\n",
+    "    analysis_df[cols]\n",
+    "    .style.format({\n",
+    "        \"auc_mean\": \"{:.4f}\",\n",
+    "        \"auc_std\": \"{:.4f}\",\n",
+    "        \"acc_mean\": \"{:.4f}\",\n",
+    "        \"f1_mean\": \"{:.4f}\",\n",
+    "        \"gap_best_mean\": \"{:+.4f}\",\n",
+    "        \"gap_final_mean\": \"{:+.4f}\",\n",
+    "    })\n",
+    "    .background_gradient(subset=[\"auc_mean\"], cmap=\"Greens\")\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1e0d21c1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def paired_comparison(section: str, model: str, question: str, before: str, after: str, contrast: str) -> dict[str, Any]:\n",
+    "    r0 = load_results(before)\n",
+    "    r1 = load_results(after)\n",
+    "    resolved_before = resolve_run(before)\n",
+    "    resolved_after = resolve_run(after)\n",
+    "    out = {\n",
+    "        \"section\": section,\n",
+    "        \"model\": model,\n",
+    "        \"question\": question,\n",
+    "        \"before\": before,\n",
+    "        \"after\": after,\n",
+    "        \"resolved_before\": resolved_before,\n",
+    "        \"resolved_after\": resolved_after,\n",
+    "        \"contrast\": contrast,\n",
+    "        \"status\": \"ok\" if r0 and r1 else \"missing\",\n",
+    "        \"n\": 0,\n",
+    "        \"before_auc\": np.nan,\n",
+    "        \"after_auc\": np.nan,\n",
+    "        \"delta_auc\": np.nan,\n",
+    "        \"delta_ci95\": np.nan,\n",
+    "        \"ttest_p\": np.nan,\n",
+    "        \"wilcoxon_p\": np.nan,\n",
+    "        \"cohen_dz\": np.nan,\n",
+    "        \"before_gap\": np.nan,\n",
+    "        \"after_gap\": np.nan,\n",
+    "        \"delta_gap\": np.nan,\n",
+    "        \"interpretation\": \"insufficient data\",\n",
+    "        \"caveat\": \"\",\n",
+    "    }\n",
+    "    if not (r0 and r1):\n",
+    "        return out\n",
+    "\n",
+    "    v0 = metric_values(r0, \"auc_roc\")\n",
+    "    v1 = metric_values(r1, \"auc_roc\")\n",
+    "    n = min(len(v0), len(v1))\n",
+    "    v0, v1 = v0[:n], v1[:n]\n",
+    "    diff = v1 - v0\n",
+    "\n",
+    "    out.update({\n",
+    "        \"n\": n,\n",
+    "        \"before_auc\": float(np.mean(v0)),\n",
+    "        \"after_auc\": float(np.mean(v1)),\n",
+    "        \"delta_auc\": float(np.mean(diff)),\n",
+    "    })\n",
+    "\n",
+    "    if n >= 2:\n",
+    "        sd = float(np.std(diff, ddof=1))\n",
+    "        # Half-width of the 95% t-interval for the mean paired difference (0.0 when sd == 0).\n",
+    "        se = sd / math.sqrt(n)\n",
+    "        out[\"delta_ci95\"] = float(stats.t.ppf(0.975, df=n - 1) * se)\n",
+    "        if sd > 0:\n",
+    "            out[\"cohen_dz\"] = float(np.mean(diff) / sd)\n",
+    "            out[\"ttest_p\"] = float(stats.ttest_rel(v1, v0).pvalue)\n",
+    "        if n >= 3 and not np.allclose(diff, 0):\n",
+    "            try:\n",
+    "                out[\"wilcoxon_p\"] = float(stats.wilcoxon(diff).pvalue)\n",
+    "            except ValueError:\n",
+    "                pass\n",
+    "\n",
+    "    gaps0 = [best_epoch_gap(f) for f in r0.get(\"fold_results\", [])]\n",
+    "    gaps1 = [best_epoch_gap(f) for f in r1.get(\"fold_results\", [])]\n",
+    "    gaps0 = np.asarray([x for x in gaps0 if x is not None], dtype=float)\n",
+    "    gaps1 = np.asarray([x for x in gaps1 if x is not None], dtype=float)\n",
+    "    if len(gaps0) and len(gaps1):\n",
+    "        m = min(len(gaps0), len(gaps1))\n",
+    "        out[\"before_gap\"] = float(np.mean(gaps0[:m]))\n",
+    "        out[\"after_gap\"] = float(np.mean(gaps1[:m]))\n",
+    "        out[\"delta_gap\"] = float(np.mean(gaps1[:m] - gaps0[:m]))\n",
+    "\n",
+    "    if question == \"source_holdout\":\n",
+    "        out[\"caveat\"] = \"Aggregate holdout-run AUC only; not held-out-source vs in-source AUC.\"\n",
+    "    if before != resolved_before or after != resolved_after:\n",
+    "        out[\"caveat\"] = (out[\"caveat\"] + \" \" if out[\"caveat\"] else \"\") + \"Uses Phase 1 fallback for missing p2a 128 log.\"\n",
+    "\n",
+    "    if out[\"delta_auc\"] >= 0.01:\n",
+    "        out[\"interpretation\"] = \"meaningful improvement\"\n",
+    "    elif out[\"delta_auc\"] > 0.002:\n",
+    "        out[\"interpretation\"] = \"small improvement\"\n",
+    "    elif out[\"delta_auc\"] >= -0.002:\n",
+    "        out[\"interpretation\"] = \"negligible change\"\n",
+    "    elif out[\"delta_auc\"] > -0.01:\n",
+    "        out[\"interpretation\"] = \"small drop\"\n",
+    "    else:\n",
+    "        out[\"interpretation\"] = \"meaningful drop\"\n",
+    "    return out\n",
+    "\n",
+    "comparisons_df = pd.DataFrame([paired_comparison(*args) for args in PLANNED_COMPARISONS])\n",
+    "\n",
+    "# Benjamini-Hochberg correction across planned paired t-tests where available.\n",
+    "valid_p = comparisons_df[\"ttest_p\"].notna()\n",
+    "pvals = comparisons_df.loc[valid_p, \"ttest_p\"].to_numpy()\n",
+    "qvals = np.full(len(comparisons_df), np.nan)\n",
+    "if len(pvals):\n",
+    "    order = np.argsort(pvals)\n",
+    "    ranked = pvals[order]\n",
+    "    adjusted = np.empty_like(ranked)\n",
+    "    m = len(ranked)\n",
+    "    running = 1.0\n",
+    "    for i in range(m - 1, -1, -1):\n",
+    "        running = min(running, ranked[i] * m / (i + 1))\n",
+    "        adjusted[i] = running\n",
+    "    qvals[np.where(valid_p)[0][order]] = adjusted\n",
+    "comparisons_df[\"bh_q\"] = qvals\n",
+    "\n",
+    "display(\n",
+    "    comparisons_df[[\n",
+    "        \"section\", \"model\", \"question\", \"contrast\", \"before_auc\", \"after_auc\", \"delta_auc\",\n",
+    "        \"delta_ci95\", \"ttest_p\", \"bh_q\", \"wilcoxon_p\", \"cohen_dz\", \"delta_gap\", \"interpretation\", \"caveat\",\n",
+    "    ]].style.format({\n",
+    "        \"before_auc\": \"{:.4f}\",\n",
+    "        \"after_auc\": \"{:.4f}\",\n",
+    "        \"delta_auc\": \"{:+.4f}\",\n",
+    "        \"delta_ci95\": \"\u00b1{:.4f}\",\n",
+    "        \"ttest_p\": \"{:.4f}\",\n",
+    "        \"bh_q\": \"{:.4f}\",\n",
+    "        \"wilcoxon_p\": \"{:.4f}\",\n",
+    "        \"cohen_dz\": \"{:+.2f}\",\n",
+    "        \"delta_gap\": \"{:+.4f}\",\n",
+    "    }).background_gradient(subset=[\"delta_auc\"], cmap=\"RdYlGn\", vmin=-0.06, vmax=0.06)\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f20e5262",
+   "metadata": {},
+   "source": [
+    "## Visual summary\n",
+    "\n",
+    "Two plots are most useful for decision-making:\n",
+    "\n",
+    "- The ranked-AUC plot shows the best observed configurations but can overstate duplicated or near-identical runs.\n",
+    "- The paired-delta plot shows the controlled effect of each preprocessing change and exposes its uncertainty."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "42882c6a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "plot_df = analysis_df.copy()\n",
+    "plot_df[\"display_label\"] = plot_df[\"section\"] + \" | \" + plot_df[\"label\"]\n",
+    "plot_df = plot_df.sort_values(\"auc_mean\", ascending=True)\n",
+    "\n",
+    "fig, ax = plt.subplots(figsize=(11, max(7, 0.35 * len(plot_df))))\n",
+    "colors = {\"2A\": \"#4C78A8\", \"2B\": \"#F58518\", \"2C\": \"#54A24B\", \"2D\": \"#E45756\", \"2E\": \"#B279A2\"}\n",
+    "ax.barh(\n",
+    "    plot_df[\"display_label\"],\n",
+    "    plot_df[\"auc_mean\"],\n",
+    "    xerr=plot_df[\"auc_std\"],\n",
+    "    color=[colors.get(s, \"#999999\") for s in plot_df[\"section\"]],\n",
+    "    alpha=0.85,\n",
+    ")\n",
+    "ax.set_xlim(0.65, 1.0)\n",
+    "ax.set_xlabel(\"Mean AUC across CV folds\")\n",
+    "ax.set_title(\"Phase 2 Conditions Ranked by AUC\")\n",
+    "ax.axvline(0.95, color=\"black\", linewidth=1, linestyle=\"--\", alpha=0.4)\n",
+    "for y, (_, row) in enumerate(plot_df.iterrows()):\n",
+    "    ax.text(row[\"auc_mean\"] + 0.004, y, f\"{row['auc_mean']:.4f}\", va=\"center\", fontsize=9)\n",
+    "fig.tight_layout()\n",
+    "fig.savefig(FIGURES_DIR / \"ranked_auc.png\", dpi=200, bbox_inches=\"tight\")\n",
+    "plt.show()\n",
+    "\n",
+    "forest = comparisons_df.copy()\n",
+    "forest[\"display\"] = forest[\"section\"] + \" \" + forest[\"model\"] + \" - \" + forest[\"contrast\"]\n",
+    "forest = forest.iloc[::-1]\n",
+    "fig, ax = plt.subplots(figsize=(11, max(6, 0.45 * len(forest))))\n",
+    "y = np.arange(len(forest))\n",
+    "ax.errorbar(\n",
+    "    forest[\"delta_auc\"], y,\n",
+    "    xerr=forest[\"delta_ci95\"],\n",
+    "    fmt=\"o\", color=\"#1F2937\", ecolor=\"#6B7280\", capsize=4,\n",
+    ")\n",
+    "ax.axvline(0, color=\"black\", linewidth=1)\n",
+    "ax.axvspan(-0.002, 0.002, color=\"#9CA3AF\", alpha=0.18, label=\"negligible band\")\n",
+    "ax.set_yticks(y)\n",
+    "ax.set_yticklabels(forest[\"display\"])\n",
+    "ax.set_xlabel(\"Delta AUC (after - before), paired by fold\")\n",
+    "ax.set_title(\"Planned Phase 2 Effect Estimates\")\n",
+    "ax.legend(loc=\"lower right\")\n",
+    "fig.tight_layout()\n",
+    "fig.savefig(FIGURES_DIR / \"planned_effects.png\", dpi=200, bbox_inches=\"tight\")\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e063cfc0",
+   "metadata": {},
+   "source": [
+    "## 2A - Shortcut analysis\n",
+    "\n",
+    "Shortcut checks map to `p2a_*` configs:\n",
+    "- `p2a_t1_original` vs `p2a_t2_real_norm` (normalization)\n",
+    "- `p2a_t1_original` vs `p2a_t3_holdout_*` (source_holdout)\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "910bd5bd",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def comparison_subset(section: str, question: str | None = None) -> pd.DataFrame:\n",
+    "    df = comparisons_df[comparisons_df[\"section\"].eq(section)].copy()\n",
+    "    if question:\n",
+    "        df = df[df[\"question\"].eq(question)]\n",
+    "    return df\n",
+    "\n",
+    "\n",
+    "def print_comparison_readout(df: pd.DataFrame) -> None:\n",
+    "    for _, row in df.iterrows():\n",
+    "        print(f\"{row['section']} {row['model']} - {row['contrast']}\")\n",
+    "        print(f\"  AUC: {row['before_auc']:.4f} -> {row['after_auc']:.4f} ({row['delta_auc']:+.4f})\")\n",
+    "        print(f\"  paired t p={row['ttest_p']:.4f}, BH q={row['bh_q']:.4f}, CI95 delta=\u00b1{row['delta_ci95']:.4f}\")\n",
+    "        print(f\"  gap delta: {row['delta_gap']:+.4f}; interpretation: {row['interpretation']}\")\n",
+    "        if row['caveat']:\n",
+    "            print(f\"  caveat: {row['caveat']}\")\n",
+    "        print()\n",
+    "\n",
+    "print_comparison_readout(comparison_subset(\"2A\"))\n",
+    "\n",
+    "# Inspect whether logs contain the per-source data needed by v2.md.\n",
+    "holdout_runs = [\"p2a_t1_original\", \"p2a_t3_holdout_text2img\", \"p2a_t3_holdout_inpainting\", \"p2a_t3_holdout_insight\"]\n",
+    "source_audit = []\n",
+    "for run in holdout_runs:\n",
+    "    results = load_results(run)\n",
+    "    has_per_source = False\n",
+    "    has_records = False\n",
+    "    example_keys = []\n",
+    "    if results:\n",
+    "        for fold in results.get(\"fold_results\", []):\n",
+    "            tm = fold.get(\"test_metrics\", {})\n",
+    "            example_keys = sorted(tm.keys())\n",
+    "            has_per_source = has_per_source or any(k in tm for k in [\"per_source\", \"per_source_metrics\", \"pairwise_source_metrics\", \"source_metrics\", \"pair_metrics\"])\n",
+    "            has_records = has_records or any(k in fold for k in [\"records\", \"predictions\", \"test_records\"])\n",
+    "    source_audit.append({\n",
+    "        \"run\": run,\n",
+    "        \"has_per_source_metrics\": has_per_source,\n",
+    "        \"has_prediction_records\": has_records,\n",
+    "        \"test_metric_keys\": example_keys,\n",
+    "    })\n",
+    "source_audit_df = pd.DataFrame(source_audit)\n",
+    "display(source_audit_df)\n",
+    "\n",
+    "holdout_df = canonical_runs_df[canonical_runs_df[\"run\"].isin(holdout_runs)].copy()\n",
+    "holdout_df[\"delta_vs_all_source\"] = holdout_df[\"auc_mean\"] - float(holdout_df.loc[holdout_df[\"run\"].eq(\"p2a_t1_original\"), \"auc_mean\"].iloc[0])\n",
+    "\n",
+    "fig, ax = plt.subplots(figsize=(9, 5))\n",
+    "ax.bar(holdout_df[\"label\"], holdout_df[\"auc_mean\"], yerr=holdout_df[\"auc_std\"], color=\"#54A24B\", alpha=0.85, capsize=5)\n",
+    "ax.set_ylim(0.88, 0.99)\n",
+    "ax.set_ylabel(\"Aggregate AUC\")\n",
+    "ax.set_title(\"2A Source Holdout Proxy: Aggregate Test AUC\")\n",
+    "ax.tick_params(axis=\"x\", rotation=20)\n",
+    "for i, (_, row) in enumerate(holdout_df.iterrows()):\n",
+    "    ax.text(i, row[\"auc_mean\"] + 0.004, f\"{row['delta_vs_all_source']:+.3f}\", ha=\"center\", fontsize=9)\n",
+    "fig.tight_layout()\n",
+    "fig.savefig(FIGURES_DIR / \"2c_holdout_proxy.png\", dpi=200, bbox_inches=\"tight\")\n",
+    "plt.show()\n",
+    "\n",
+    "print(\"Geometry diagnostic evidence:\")\n",
+    "geometry_keys = []\n",
+    "for run in [\"p2a_t1_original\", \"p2a_t2_real_norm\"]:\n",
+    "    results = load_results(run)\n",
+    "    cfg = (results or {}).get(\"config\", {})\n",
+    "    geometry_keys.append({\n",
+    "        \"run\": run,\n",
+    "        \"config_geometry_condition\": cfg.get(\"geometry_condition\"),\n",
+    "        \"has_matched_geometry_metric\": any(\n",
+    "            \"geometry\" in str(k).lower() or \"matched\" in str(k).lower()\n",
+    "            for fold in (results or {}).get(\"fold_results\", [])\n",
+    "            for k in fold.get(\"test_metrics\", {}).keys()\n",
+    "        ),\n",
+    "    })\n",
+    "display(pd.DataFrame(geometry_keys))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "530e8675",
+   "metadata": {},
+   "source": [
+    "## 2B - Resolution impact\n",
+    "\n",
+    "This section compares 128 vs 224 using `p2b_*_224` and Phase 1 baselines as explicit 128 fallbacks.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "13304d38",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print_comparison_readout(comparison_subset(\"2B\", \"resolution\"))\n",
+    "\n",
+    "res_plot = comparison_subset(\"2B\", \"resolution\")\n",
+    "fig, ax = plt.subplots(figsize=(8, 5))\n",
+    "for _, row in res_plot.iterrows():\n",
+    "    r0, r1 = load_results(row[\"before\"]), load_results(row[\"after\"])\n",
+    "    if not (r0 and r1):\n",
+    "        continue  # skip comparisons whose logs are missing\n",
+    "    v0, v1 = metric_values(r0, \"auc_roc\"), metric_values(r1, \"auc_roc\")\n",
+    "    x = [0, 1]\n",
+    "    for a, b in zip(v0, v1):\n",
+    "        ax.plot(x, [a, b], color=\"#9CA3AF\", alpha=0.7)\n",
+    "    ax.plot(x, [v0.mean(), v1.mean()], marker=\"o\", linewidth=3, label=row[\"model\"])\n",
+    "ax.set_xticks([0, 1])\n",
+    "ax.set_xticklabels([\"128\", \"224\"])\n",
+    "ax.set_ylabel(\"AUC\")\n",
+    "ax.set_title(\"2B Resolution: Fold-Paired AUC\")\n",
+    "ax.legend()\n",
+    "fig.tight_layout()\n",
+    "fig.savefig(FIGURES_DIR / \"2b_resolution_paired.png\", dpi=200, bbox_inches=\"tight\")\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8702d10d",
+   "metadata": {},
+   "source": [
+    "## 2C - Facecrop impact\n",
+    "\n",
+    "This section compares `p2c_*_facecrop` against the matching `p2b_*_224` no-facecrop baselines.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ec5e03ef",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print_comparison_readout(comparison_subset(\"2C\", \"facecrop\"))\n",
+    "\n",
+    "face_df = canonical_runs_df[canonical_runs_df[\"section\"].eq(\"2C\")].copy()\n",
+    "fig, axes = plt.subplots(1, 2, figsize=(12, 5), sharey=False)\n",
+    "for ax, model in zip(axes, [\"SimpleCNN\", \"ResNet18\"]):\n",
+    "    sub = face_df[face_df[\"model\"].eq(model)].sort_values(\"face_crop\")\n",
+    "    ax.bar(sub[\"condition\"], sub[\"auc_mean\"], yerr=sub[\"auc_std\"], color=[\"#D97706\", \"#059669\"], alpha=0.85, capsize=5)\n",
+    "    ax.set_title(model)\n",
+    "    ax.set_ylim(0.70 if model == \"SimpleCNN\" else 0.94, 0.99)\n",
+    "    ax.set_ylabel(\"AUC\")\n",
+    "    ax.tick_params(axis=\"x\", rotation=20)\n",
+    "fig.suptitle(\"2C Facecrop Impact\")\n",
+    "fig.tight_layout()\n",
+    "fig.savefig(FIGURES_DIR / \"2c_facecrop.png\", dpi=200, bbox_inches=\"tight\")\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2c3b8812",
+   "metadata": {},
+   "source": [
+    "## 2D / 2E - Augmentation impact and test-set integrity\n",
+    "\n",
+    "The augmentation question has two parts:\n",
+    "\n",
+    "- Does light augmentation help at 224 without facecrop?\n",
+    "- Does it help once facecrop is enabled?\n",
+    "\n",
+    "The implementation also needs to guarantee that validation/test evaluation is not stochastic. The preprocessing pipeline keeps stochastic operations behind `self.train`, so `train=False` disables them even if augmentation settings exist."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f11c3257",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(\"2D (p2d): augmentation without facecrop\")\n",
+    "print_comparison_readout(comparison_subset(\"2D\", \"augmentation\"))\n",
+    "print(\"2E (p2e): augmentation with facecrop\")\n",
+    "print_comparison_readout(comparison_subset(\"2E\", \"facecrop + augmentation\"))\n",
+    "\n",
+    "aug_sections = comparisons_df[comparisons_df[\"section\"].isin([\"2D\", \"2E\"])].copy()\n",
+    "fig, ax = plt.subplots(figsize=(9, 5))\n",
+    "labels = aug_sections[\"section\"] + \" \" + aug_sections[\"model\"]\n",
+    "ax.bar(labels, aug_sections[\"delta_auc\"], yerr=aug_sections[\"delta_ci95\"], color=[\"#E45756\" if d < 0 else \"#059669\" for d in aug_sections[\"delta_auc\"]], alpha=0.85, capsize=5)\n",
+    "ax.axhline(0, color=\"black\", linewidth=1)\n",
+    "ax.set_ylabel(\"Delta AUC from adding augmentation\")\n",
+    "ax.set_title(\"Augmentation Effects Across Facecrop Conditions\")\n",
+    "ax.tick_params(axis=\"x\", rotation=20)\n",
+    "fig.tight_layout()\n",
+    "fig.savefig(FIGURES_DIR / \"2d_2e_augmentation_effects.png\", dpi=200, bbox_inches=\"tight\")\n",
+    "plt.show()\n",
+    "\n",
+    "# Static audit of eval stochasticity: string checks on the pipeline source, not a behavioral test.\n",
+    "try:\n",
+    "    import inspect\n",
+    "    from src.preprocessing.pipeline import DFFImagePipeline\n",
+    "    from src.evaluation import evaluate as evaluate_module\n",
+    "\n",
+    "    pipeline_src = inspect.getsource(DFFImagePipeline)\n",
+    "    build_transforms_src = inspect.getsource(evaluate_module.build_transforms)\n",
+    "    stochastic_guards = {\n",
+    "        \"flip_guarded_by_train\": \"if self.train and random.random() < self.hflip_p\" in pipeline_src,\n",
+    "        \"rotate_guarded_by_train\": \"if self.train and self.rotation_degrees > 0\" in pipeline_src,\n",
+    "        \"color_jitter_returns_when_not_train\": \"if not self.train:\" in pipeline_src,\n",
+    "        \"blur_guarded_by_train\": \"if self.train and random.random() < self.blur_p\" in pipeline_src,\n",
+    "        \"jpeg_guarded_by_train\": \"if self.train and random.random() < self.jpeg_p\" in pipeline_src,\n",
+    "        \"erase_guarded_by_train\": \"if self.train and random.random() < self.erase_p\" in pipeline_src,\n",
+    "        \"noise_guarded_by_train\": \"if self.train and random.random() < self.noise_p\" in pipeline_src,\n",
+    "        \"cv_transform_uses_train_flag\": \"get_transforms(train=train\" in build_transforms_src,\n",
+    "    }\n",
+    "    display(pd.DataFrame([stochastic_guards]).T.rename(columns={0: \"passes\"}))\n",
+    "except Exception as exc:\n",
+    "    print(f\"Could not run transform audit: {exc}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "02e47658",
+   "metadata": {},
+   "source": [
+    "## Decision synthesis\n",
+    "\n",
+    "This section converts the evidence into Phase 3 settings. It intentionally distinguishes a recommendation from a claim:\n",
+    "\n",
+    "- Recommendation: choose the setting that is best supported for the next experiment.\n",
+    "- Claim: what the current evidence supports. Some Phase 2A shortcut claims remain incomplete without per-source or matched-geometry outputs."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7034443c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def get_delta(question: str, model: str | None = None, section: str | None = None) -> pd.DataFrame:\n",
+    "    df = comparisons_df[comparisons_df[\"question\"].eq(question)].copy()\n",
+    "    if model:\n",
+    "        df = df[df[\"model\"].eq(model)]\n",
+    "    if section:\n",
+    "        df = df[df[\"section\"].eq(section)]\n",
+    "    return df\n",
+    "\n",
+    "resolution_resnet = get_delta(\"resolution\", \"ResNet18\").iloc[0]\n",
+    "facecrop_resnet = get_delta(\"facecrop\", \"ResNet18\").iloc[0]\n",
+    "facecrop_simple = get_delta(\"facecrop\", \"SimpleCNN\").iloc[0]\n",
+    "aug_no_crop_resnet = get_delta(\"augmentation\", \"ResNet18\").iloc[0]\n",
+    "aug_no_crop_simple = get_delta(\"augmentation\", \"SimpleCNN\").iloc[0]\n",
+    "aug_crop_resnet = get_delta(\"facecrop + augmentation\", \"ResNet18\").iloc[0]\n",
+    "aug_crop_simple = get_delta(\"facecrop + augmentation\", \"SimpleCNN\").iloc[0]\n",
+    "norm = get_delta(\"normalization\", \"ResNet18\").iloc[0]\n",
+    "\n",
+    "recommendations = [\n",
+    "    {\n",
+    "        \"choice\": \"resolution\",\n",
+    "        \"recommendation\": \"224x224\",\n",
+    "        \"evidence\": f\"ResNet18 delta AUC {resolution_resnet.delta_auc:+.4f}; SimpleCNN does not determine Phase 3 capacity.\",\n",
+    "        \"confidence\": \"high\" if resolution_resnet.delta_auc > 0.02 else \"medium\",\n",
+    "    },\n",
+    "    {\n",
+    "        \"choice\": \"facecrop\",\n",
+    "        \"recommendation\": \"use facecrop\",\n",
+    "        \"evidence\": f\"Small positive deltas for both models: SimpleCNN {facecrop_simple.delta_auc:+.4f}, ResNet18 {facecrop_resnet.delta_auc:+.4f}.\",\n",
+    "        \"confidence\": \"medium\",\n",
+    "    },\n",
+    "    {\n",
+    "        \"choice\": \"augmentation\",\n",
+    "        \"recommendation\": \"do not use light augmentation for Phase 3 at 20% data\",\n",
+    "        \"evidence\": f\"SimpleCNN drops {aug_no_crop_simple.delta_auc:+.4f} without facecrop and {aug_crop_simple.delta_auc:+.4f} with facecrop; ResNet18 is neutral/slightly mixed ({aug_no_crop_resnet.delta_auc:+.4f}, {aug_crop_resnet.delta_auc:+.4f}).\",\n",
+    "        \"confidence\": \"high for SimpleCNN, medium for ResNet18\",\n",
+    "    },\n",
+    "    {\n",
+    "        \"choice\": \"normalization\",\n",
+    "        \"recommendation\": \"ImageNet normalization\",\n",
+    "        \"evidence\": f\"Real-train-only normalization delta AUC {norm.delta_auc:+.4f}; no useful gain and less standard for pretrained ResNet.\",\n",
+    "        \"confidence\": \"medium\",\n",
+    "    },\n",
+    "    {\n",
+    "        \"choice\": \"shortcut/source claims\",\n",
+    "        \"recommendation\": \"do not overclaim; add per-source or prediction exports before final report\",\n",
+    "        \"evidence\": \"Current CV logs lack held-out-source vs in-source AUC and matched-geometry test metrics.\",\n",
+    "        \"confidence\": \"high\",\n",
+    "    },\n",
+    "]\n",
+    "\n",
+    "recommendations_df = pd.DataFrame(recommendations)\n",
+    "display(recommendations_df)\n",
+    "\n",
+    "summary = {\n",
+    "    \"phase\": \"phase2\",\n",
+    "    \"source_documents\": [\"classifier/v2.md\", \"classifier/impl.md\"],\n",
+    "    \"artifact_counts\": {\n",
+    "        \"canonical_runs\": int(len(canonical_runs_df)),\n",
+    "        \"loaded_canonical_runs\": int(canonical_runs_df[\"log_status\"].isin([\"present\", \"fallback\"]).sum()),\n",
+    "        \"fallback_runs_used\": {k: v for k, v in RUN_ALIASES.items() if resolve_run(k) != k},\n",
+    "    },\n",
+    "    \"recommendations\": recommendations,\n",
+    "    \"planned_comparisons\": comparisons_df.replace({np.nan: None}).to_dict(orient=\"records\"),\n",
+    "    \"known_gaps\": [\n",
+    "        \"Dedicated p2a_*_128 logs are absent/skipped; Phase 1 baselines are used as fallbacks.\",\n",
+    "        \"Source holdout logs do not include prediction-level or per-source metrics, so held-out-source AUC vs in-source AUC cannot be computed.\",\n",
+    "        \"No matched-geometry evaluation metric is present in p2a logs, so geometry shortcut analysis is incomplete.\",\n",
+    "    ],\n",
+    "}\n",
+    "\n",
+    "summary_path = ANALYSIS_DIR / \"phase2_analysis_summary.json\"\n",
+    "with summary_path.open(\"w\") as f:\n",
+    "    json.dump(summary, f, indent=2)\n",
+    "\n",
+    "print(f\"Saved summary: {summary_path.relative_to(PROJECT_ROOT)}\")\n",
+    "print(f\"Saved figures: {FIGURES_DIR.relative_to(PROJECT_ROOT)}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5a337f73",
+   "metadata": {},
+   "source": [
+    "## Report-ready conclusion\n",
+    "\n",
+    "The strongest Phase 2 result is the resolution effect for ResNet18: moving to 224x224 substantially improves AUC under the controlled CV protocol. Face cropping gives a small positive effect and is reasonable to carry forward, especially because it aligns the model with face evidence rather than background context. Light augmentation is not supported at this 20% data setting: it strongly hurts SimpleCNN and provides no reliable gain for ResNet18, with or without face cropping. ImageNet normalization remains preferable because real-train-only normalization does not improve AUC and is less aligned with pretrained ResNet expectations.\n",
+    "\n",
+    "Recommended Phase 3 preprocessing: **224x224, facecrop enabled, no light augmentation, ImageNet normalization**.\n",
+    "\n",
+    "Limitations to fix before the final report: export prediction-level records or per-source pairwise metrics for source holdout, and add the matched-geometry evaluation required by the shortcut-analysis plan. Without those artifacts, Phase 2A can only support a limited shortcut analysis."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "drl",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.13"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}