Delete generator/notebooks/_build.py

Final final
Final polish
2026-05-15 00:25:44 +01:00 · 2026-05-14 23:19:26 +01:00 · 2026-05-14 21:16:03 +01:00 · 2026-05-14 20:20:46 +01:00 · 2026-05-14 16:25:30 +01:00 · 2026-05-14 16:20:33 +01:00
226 changed files with 25025 additions and 7827 deletions
@@ -67,3 +67,6 @@ generator/outputs/samples/*
 .venv/
 .ipynb_checkpoints/
 __pycache__/
 #Presentation
 presentation_inputs.zip
@@ -1,125 +1,264 @@
-# DRL_PROJ — DeepFake Detection
+# Deep learning face project
-Deep learning project for binary deepfake detection on the DeepFakeFace dataset.
+This repository contains a two-part deep learning project on the
 DeepFakeFace (DFF) dataset:
-## Project structure
+1. **Classifier:** detect whether a face image is real or fake.
 2. **Generator:** train generative models that produce new fake face images.
-```
+The project is written as an experimental report. The notebooks are the main
 deliverable: they show the pipeline, the intermediate failures, the ablations,
 the decisions, and the final models. Read them in order.
 ## Project story
 The work follows the same principle in both parts: start with a simple
 baseline, inspect what fails, change one important factor at a time, and keep
 the evidence tied to saved logs and saved artifacts.
 For the **classifier**, the story moves from dataset understanding to
 preprocessing, baseline models, controlled ablations, Grad-CAM inspection,
 stronger model families, and data scaling. The final practical classifier is a
 ResNet50-style pipeline using face crops, 224×224 inputs, ImageNet/default
 normalization, and no stochastic augmentation at validation/test time.
 For the **generator**, the story starts with raw baseline failures, then locks
 the data pipeline before comparing three parallel model-family branches:
 GAN, VAE, and DDPM. The final comparison keeps quality versus speed central:
 DDPM gives the best saved FID and visual quality, GAN is the best
 quality-speed compromise, and VAE is the fastest but smoothest option.
 ## How to read the project
 Start with the classifier notebooks, then read the generator notebooks. The
 generator has one linear setup stage followed by three parallel branches:
 GAN, VAE, and DDPM. Those branches are numbered in reading order, but they are
 conceptually parallel experiments after the pipeline is selected.
 ### Classifier notebooks
 Read these first:
 1. `classifier/notebooks/01_eda.ipynb`  
   Dataset composition, real/fake source mapping, image statistics, and
   shortcut risks.
 2. `classifier/notebooks/02_preprocessing.ipynb`  
   Deterministic preprocessing, train-only augmentation, face crops, and
   normalization.
 3. `classifier/notebooks/03_phase1_analysis.ipynb`  
   SimpleCNN and ResNet18 controlled baselines.
 4. `classifier/notebooks/04_phase2_analysis.ipynb`  
   Resolution, normalization, source holdouts, facecrop, and augmentation
   ablations.
 5. `classifier/notebooks/05_gradcam_analysis.ipynb`  
   Qualitative localization analysis across the classifier pipeline.
 6. `classifier/notebooks/06_phase3_model_family_analysis.ipynb`  
   Stronger pretrained model families and the ResNet50 practical choice.
 7. `classifier/notebooks/07_phase4_data_scaling_analysis.ipynb`  
   Data scaling for strong backbones and the final classifier decision.
 ### Generator notebooks
 Read these after the classifier:
 1. `generator/notebooks/01_baseline_sanity_check.ipynb`  
   Raw baseline failures and why the data pipeline must be fixed first.
 2. `generator/notebooks/02_pipeline_selection.ipynb`  
   Controlled pipeline ablations: resolution, alignment, augmentation, and
   raw/aligned mixing.
 3. `generator/notebooks/03_gan_stability_progression.ipynb`  
   GAN branch: DCGAN → WGAN-GP → spectral normalization + GroupNorm +
   self-attention → 128×128 check.
 4. `generator/notebooks/04_vae_loss_progression.ipynb`  
   VAE branch: MSE + KL → perceptual loss → PatchGAN adversarial loss.
 5. `generator/notebooks/05_ddpm_recipe_progression.ipynb`  
   DDPM branch: linear schedule → cosine schedule → v-prediction → wider
   backbone.
 6. `generator/notebooks/06_final_family_comparison.ipynb`  
   Final comparison of the selected GAN, VAE, and DDPM recipes under saved
   Phase 5 conditions.
 7. `generator/notebooks/07_final_sample_showcase.ipynb`  
   Curated final sample examples from saved outputs. This is qualitative
   showcase material, not a replacement for FID.
 ## What the notebooks do
 The notebooks are analysis/report chapters. They load existing configs, logs,
 figures, saved sample grids, checkpoints, and prediction summaries. They are
 not intended to launch new training runs.
 When a notebook shows a plot or image grid, the surrounding markdown explains:
 - what the artifact shows;
 - why it is needed;
 - how it supports the phase decision;
 - what limitation remains.
 This is important because the project is evaluated not only by final
 performance, but by the documented evolution of the solution.
 ## Repository layout
 ```text
 DRL_PROJ/
-  classifier/       ← discriminative model (real vs. fake classifier)
+  classifier/
-    src/            ← model definitions, training, evaluation, preprocessing
+    configs/       experiment configs by phase
-    configs/        ← experiment configs organised by phase
+    notebooks/     classifier report notebooks
-      phase1/       ← baseline models (SimpleCNN, ResNet18)
+    outputs/       saved logs, figures, Grad-CAM panels, checkpoints
-      phase2/       ← architecture sweep (ResNet variants, face-crop)
+    src/           classifier data, models, training, evaluation
-      phase3/       ← EfficientNet, ViT, frequency-aware training
+    tests/         unit and smoke tests
-      phase4/       ← ensemble strategies
+    tools/         facecrop, Grad-CAM, inference, reevaluation helpers
-    tools/          ← analyse.py, ensemble.py, inference.py, facecrop.py
+
-    notebooks/      ← EDA, preprocessing, evaluation, GradCAM
+  generator/
-    outputs/        ← models, logs, figures (gitignored except .pt/.json)
+    configs/       generator configs by phase/family
-    run.py          ← main training entry point
+    notebooks/     generator report notebooks and notebook builder
-  generator/        ← generative model (GAN / VAE / diffusion) — in progress
+    outputs/       saved logs, sample grids, final showcase artifacts
-  pipeline/         ← Vast.ai ephemeral GPU orchestration
+    src/           generator data, models, training, metrics
-  data/             ← dataset root (gitignored)
+    tests/         unit and smoke tests
-  cropped/          ← MTCNN pre-cropped faces (gitignored)
+    tools/         sampling and utility scripts
-    classifier/     ← bbox crops for the classifier
+
-    generator/      ← landmark-aligned crops for the generator
+  data/            original DFF dataset root, not committed
  cropped/         preprocessed face crops, not committed
  docs/            project statement and supporting documents
  pipeline/        optional remote/GPU orchestration helpers
 ```
 ## Rebuilding the generator notebooks
 The generator notebooks are generated from a single source file:
 ```bash
 cd generator/notebooks
 python _build.py
 ```
 That builder writes the numbered generator notebooks listed above. It uses
 existing saved logs and artifacts; it does not train models.
 ## Setup
-Create a local environment when you want to run the code directly on a machine you control:
+Create a conda environment and install the project requirements:
 ```bash
-python3 -m venv .venv
+conda create -n drl python=3.12
-source .venv/bin/activate
+conda activate drl
 python -m pip install --upgrade pip setuptools wheel
 python -m pip install -r requirements.txt
 ```
-## Local Training
+Use **Python 3.12**; some dependencies (for example `facenet-pytorch`) are
 unreliable on 3.13+.
 The raw dataset should be placed under `data/`. Preprocessed crops are stored
 under `cropped/`. These folders are intentionally not committed. To download
 and extract the dataset:
 ```bash
-python3 classifier/run.py classifier/configs/phase2/p2_resnet18_facecrop.json
+python classifier/tools/fetch_ds.py
-python3 classifier/run.py classifier/configs/phase3/p3_efficientnet_b0.json
+python classifier/tools/fetch_ds.py --data-dir /path/to/DFF
 ```
-## Ephemeral Vast.ai Pipeline
+Expected layout under the data root: `wiki/<identity>/*.jpg`,
 `inpainting/...`, `text2img/...`, `insight/...`.
-The deployment/orchestration path now lives under [`pipeline/`](/run/host/mnt/shared/UP/DRL/DRL_PROJ/pipeline/README.md).
+## Classifier — training
-One-time setup:
+From the repository root:
 ```bash
-cat > pipeline/.env <<'EOF'
+# CPU (slow but valid)
-VAST_API_KEY=<your-api-key>
+python classifier/run.py classifier/configs/phase4/p4_convnext_tiny_100pct.json
-VAST_SSH_PRIVATE_KEY=/home/your-user/.ssh/id_ed25519
+
-EOF
+# GPU when CUDA is available
 python classifier/run.py classifier/configs/phase4/p4_convnext_tiny_100pct.json --use-gpu
 ```
-End-to-end ephemeral run:
+Training uses 5-fold stratified group cross-validation. Per-fold checkpoints
 are saved as `classifier/outputs/models/{run_name}_fold{k}_best.pt` (and
 `_final.pt`). Override data or output locations with `--data-dir` and
 `--output-root`.
 **Primary delivery model** (best Phase 4 detector): config
 `classifier/configs/phase4/p4_convnext_tiny_100pct.json` with per-fold
 weights `classifier/outputs/models/p4_convnext_tiny_100pct_fold*_best.pt`.
 ## Classifier — inference
 Classify a single image as real or fake:
 ```bash
-python3 -m pipeline run classifier/configs/phase2/p2_resnet18_facecrop.json --upload-data
+python classifier/tools/inference.py image.jpg classifier/configs/phase4/p4_convnext_tiny_100pct.json
 ```
-Interactive offer selection:
+This loads the config and the matching checkpoint, runs the image through the
 model, and prints a result like:
 ```
 Image : image.jpg
 Model : p4_convnext_tiny_100pct (convnext_tiny)
 Device: cuda
 Result: FAKE  (confidence: 74.7%)
 P(fake): 0.7466   P(real): 0.2534
 ```
 If you omit `--checkpoint`, the tool automatically looks for a saved
 checkpoint under `classifier/outputs/models/` — first the single-run
 `{run_name}_best.pt`, then CV fold files `{run_name}_fold{k}_best.pt`, then
 `{run_name}_fold{k}_final.pt`. To use a specific fold:
 ```bash
-python3 -m pipeline offers --select-offer
+python classifier/tools/inference.py image.jpg classifier/configs/phase4/p4_convnext_tiny_100pct.json \
  --checkpoint classifier/outputs/models/p4_convnext_tiny_100pct_fold0_best.pt
 ```
-You can override the ranking mode per run:
+## Generator — training
 From the repository root:
 ```bash
-python3 -m pipeline offers --sort price
+python generator/run.py generator/configs/phase0/p0_vae.json
-python3 -m pipeline offers --sort performance
+python generator/run.py generator/configs/phase0/p0_ddpm.json
 python3 -m pipeline offers --sort performance --price 0.14
 ```
-You can also filter by region:
+Generator training expects real-face images (default source is `wiki`); use
 `--data-dir` to point at your dataset tree. Checkpoints are saved under
 `generator/outputs/models/{run_name}_final_ema.pt` (EMA shadow) and
 `{run_name}_best_ema.pt` (lowest-FID snapshot).
 ## Generator — inference (sampling)
 Generate 4×4 sample grids from Phase 5 EMA checkpoints:
 ```bash
-python3 -m pipeline offers --select-offer --region europe
+python generator/tools/sampling.py --models p5_gan p5_vae p5_ddpm --samples 10
 python3 -m pipeline offers --select-offer --region Portugal
 python3 -m pipeline offers --select-offer --region US
 python3 -m pipeline offers --select-offer --region europe --price 0.14
 ```
-To inspect which region strings are currently available from the search results:
+Options:
-```bash
+- `--models` — which models to sample from (`p5_gan`, `p5_vae`, `p5_ddpm`;
-python3 -m pipeline offers --list-regions
+  defaults to all three).
-```
+- `--samples` — number of grids per model (default 10).
 - `--output-dir` — where to write the PNGs (default
  `generator/outputs/samples/final_comparison/`).
 - `--truncation` — optional latent truncation for the GAN (lower = less
  diversity but sharper).
 - `--device` — `cuda` or `cpu` (default: auto-detect).
-That command:
+Each grid is a 4×4 PNG of 16 images sampled from the model's EMA weights.
- ensures your SSH public key is registered with Vast.ai
+GAN samples are drawn from random latent vectors, VAE samples decode from the
- searches offers using the filters in `pipeline/defaults/vast.json`
+learned prior, and DDPM samples use 50-step DDIM.
 - creates an instance
 - waits for SSH readiness
 - syncs the repo
 - uploads `data/` when `--upload-data` is set
 - runs `python3 classifier/run.py ...`
 - downloads `classifier/outputs/`
 - for generator runs, rsyncs `generator/outputs/` back every 25 epochs and again at completion
 - destroys the instance automatically unless `--keep-on-failure` is set
-Useful commands:
+## Final takeaway
-```bash
+The project is best understood as a sequence of controlled decisions:
 python3 -m pipeline up
 python3 -m pipeline status <instance_id>
 python3 -m pipeline down <instance_id>
 ```
-To override the default Vast search/runtime settings, copy `pipeline/defaults/vast.json`, edit it, and pass:
+1. cleanly define the data and preprocessing;
 2. establish simple baselines;
 3. improve one factor at a time;
 4. compare model families using saved evidence;
 5. report both performance and limitations.
-```bash
+The classifier becomes reliable through source-aware preprocessing, stronger
-python3 -m pipeline run classifier/configs/phase3/p3_efficientnet_b0.json --pipeline-config /path/to/vast.override.json
+pretrained backbones, and scaling. The generator improves by first locking the
-```
+face-aligned pipeline and then selecting the best recipe inside each model
-
+family before the final GAN/VAE/DDPM comparison.
 The default policy in `pipeline/defaults/vast.json` now targets:
 - `1x` GPU
 - `RTX 3090` or `RTX 3090 Ti`
 - `<= $0.20/hour`
 - sorted by `dlperf` descending
 - uses `vastai/pytorch:latest` as the default image
@@ -0,0 +1,7 @@
 {
  "pretrained": true,
  "epochs": 15,
  "image_size": 224,
  "augment": false,
  "data_dir": "cropped/classifier"
 }
@@ -0,0 +1,6 @@
 {
  "extends": "_base.json",
  "run_name": "p4_convnext_tiny_100pct",
  "backbone": "convnext_tiny",
  "subsample": 1.0
 }
@@ -0,0 +1,6 @@
 {
  "extends": "_base.json",
  "run_name": "p4_convnext_tiny_50pct",
  "backbone": "convnext_tiny",
  "subsample": 0.5
 }
@@ -0,0 +1,6 @@
 {
  "extends": "_base.json",
  "run_name": "p4_efficientnet_b0_100pct",
  "backbone": "efficientnet_b0",
  "subsample": 1.0
 }
@@ -0,0 +1,6 @@
 {
  "extends": "_base.json",
  "run_name": "p4_efficientnet_b0_50pct",
  "backbone": "efficientnet_b0",
  "subsample": 0.5
 }
@@ -0,0 +1,6 @@
 {
  "extends": "_base.json",
  "run_name": "p4_resnet50_100pct",
  "backbone": "resnet50",
  "subsample": 1.0
 }
@@ -0,0 +1,6 @@
 {
  "extends": "_base.json",
  "run_name": "p4_resnet50_50pct",
  "backbone": "resnet50",
  "subsample": 0.5
 }
@@ -0,0 +1,18 @@
 {
  "extends": "../shared.json",
  "run_name": "smoke",
  "backbone": "simple_cnn",
  "cnn_preset": "micro",
  "dropout": 0.0,
  "epochs": 1,
  "cv_folds": 2,
  "image_size": 64,
  "batch_size": 8,
  "num_workers": 0,
  "early_stopping_patience": 0,
  "subsample": 1.0,
  "augment": false,
  "lr": 0.001,
  "T_max": 1,
  "data_dir": "data"
 }
@@ -1,702 +0,0 @@
 {
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Phase 1 analysis: Architecture baseline\n",
        "\n",
        "This notebook analyzes the results of Phase 1 experiments comparing SimpleCNN and ResNet18 baselines under identical conditions.\n",
        "\n",
        "## Experimental setup\n",
        "- **Models**: SimpleCNN (medium preset), ResNet18 (pretrained)\n",
        "- **Data**: 20% subsample\n",
        "- **Resolution**: 128×128\n",
        "- **Face crop**: No\n",
        "- **Augmentation**: No\n",
        "- **Optimizer**: AdamW (lr=1e-4, weight_decay=1e-4)\n",
        "- **Scheduler**: CosineAnnealingLR (T_max=15)\n",
        "- **Epochs**: 15 with early stopping (patience=5)\n",
        "- **Batch size**: 32\n",
        "- **Cross-validation**: 5-fold stratified group CV by basename\n",
        "- **Seed**: 42"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import json\n",
        "import numpy as np\n",
        "import pandas as pd\n",
        "import matplotlib.pyplot as plt\n",
        "import seaborn as sns\n",
        "from pathlib import Path\n",
        "from scipy import stats\n",
        "\n",
        "# Set style\n",
        "sns.set_style(\"whitegrid\")\n",
        "plt.rcParams['figure.figsize'] = (12, 6)\n",
        "plt.rcParams['font.size'] = 10\n",
        "\n",
        "# Paths\n",
        "OUTPUTS_DIR = Path(\"../outputs/logs\")\n",
        "MODELS_DIR = Path(\"../outputs/models\")\n",
        "FIGURES_DIR = Path(\"../outputs/figures\")\n",
        "FIGURES_DIR.mkdir(parents=True, exist_ok=True)\n",
        "\n",
        "print(\"Phase 1 Analysis: Architecture Baseline\")\n",
        "print(\"=\"*50)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Load CV results"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "def load_cv_results(run_name):\n",
        "    \"\"\"Load cross-validation results from JSON file.\"\"\"\n",
        "    results_path = OUTPUTS_DIR / f\"{run_name}.json\"\n",
        "    if not results_path.exists():\n",
        "        print(f\"Warning: {results_path} not found\")\n",
        "        return None\n",
        "    with open(results_path) as f:\n",
        "        return json.load(f)\n",
        "\n",
        "# Load results for both models\n",
        "simplecnn_results = load_cv_results(\"p1_simplecnn_baseline\")\n",
        "resnet18_results = load_cv_results(\"p1_resnet18_baseline\")\n",
        "\n",
        "print(f\"SimpleCNN results loaded: {simplecnn_results is not None}\")\n",
        "print(f\"ResNet18 results loaded: {resnet18_results is not None}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Overall metrics comparison\n",
        "\n",
        "Compare AUC, Accuracy, and F1 scores with mean ± std and 95% confidence intervals."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "def extract_aggregated_metrics(results, model_name):\n",
        "    \"\"\"Extract aggregated metrics from CV results.\"\"\"\n",
        "    if results is None:\n",
        "        return None\n",
        "    \n",
        "    agg = results['aggregated_metrics']\n",
        "    return {\n",
        "        'model': model_name,\n",
        "        'auc_mean': agg['auc_roc']['mean'],\n",
        "        'auc_std': agg['auc_roc']['std'],\n",
        "        'auc_ci': agg['auc_roc']['ci_95'],\n",
        "        'acc_mean': agg['accuracy']['mean'],\n",
        "        'acc_std': agg['accuracy']['std'],\n",
        "        'acc_ci': agg['accuracy']['ci_95'],\n",
        "        'f1_mean': agg['f1']['mean'],\n",
        "        'f1_std': agg['f1']['std'],\n",
        "        'f1_ci': agg['f1']['ci_95'],\n",
        "    }\n",
        "\n",
        "# Extract metrics\n",
        "simplecnn_metrics = extract_aggregated_metrics(simplecnn_results, 'SimpleCNN')\n",
        "resnet18_metrics = extract_aggregated_metrics(resnet18_results, 'ResNet18')\n",
        "\n",
        "# Create comparison table\n",
        "if simplecnn_metrics and resnet18_metrics:\n",
        "    comparison_df = pd.DataFrame([simplecnn_metrics, resnet18_metrics])\n",
        "    comparison_df.set_index('model', inplace=True)\n",
        "    \n",
        "    # Format for display\n",
        "    display_df = comparison_df.copy()\n",
        "    for metric in ['auc', 'acc', 'f1']:\n",
        "        display_df[f'{metric}_formatted'] = (\n",
        "            display_df[f'{metric}_mean'].apply(lambda x: f\"{x:.4f}\") + \" ± \" +\n",
        "            display_df[f'{metric}_std'].apply(lambda x: f\"{x:.4f}\") +\n",
        "            \" (95% CI: ±\" + display_df[f'{metric}_ci'].apply(lambda x: f\"{x:.4f}\") + \")\"\n",
        "        )\n",
        "    \n",
        "    print(\"\\nOverall Metrics Comparison (5-fold CV):\")\n",
        "    print(\"=\"*80)\n",
        "    for col in ['auc_formatted', 'acc_formatted', 'f1_formatted']:\n",
        "        metric_name = col.replace('_formatted', '').upper()\n",
        "        print(f\"\\n{metric_name}:\")\n",
        "        for model in display_df.index:\n",
        "            print(f\"  {model}: {display_df.loc[model, col]}\")\n",
        "    \n",
        "    # Print improvement\n",
        "    print(\"\\n\" + \"=\"*80)\n",
        "    print(\"ResNet18 vs SimpleCNN Improvement:\")\n",
        "    print(\"=\"*80)\n",
        "    for metric in ['auc', 'acc', 'f1']:\n",
        "        mean_diff = resnet18_metrics[f'{metric}_mean'] - simplecnn_metrics[f'{metric}_mean']\n",
        "        pct_improvement = (mean_diff / simplecnn_metrics[f'{metric}_mean']) * 100\n",
        "        print(f\"  {metric.upper()}: +{mean_diff:.4f} (+{pct_improvement:.2f}%)\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Visualization: Overall metrics comparison"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "if simplecnn_metrics and resnet18_metrics:\n",
        "    fig, axes = plt.subplots(1, 3, figsize=(15, 5))\n",
        "    \n",
        "    models = ['SimpleCNN', 'ResNet18']\n",
        "    metrics_data = {\n",
        "        'AUC-ROC': [simplecnn_metrics['auc_mean'], resnet18_metrics['auc_mean']],\n",
        "        'Accuracy': [simplecnn_metrics['acc_mean'], resnet18_metrics['acc_mean']],\n",
        "        'F1 Score': [simplecnn_metrics['f1_mean'], resnet18_metrics['f1_mean']],\n",
        "    }\n",
        "    errors = {\n",
        "        'AUC-ROC': [simplecnn_metrics['auc_std'], resnet18_metrics['auc_std']],\n",
        "        'Accuracy': [simplecnn_metrics['acc_std'], resnet18_metrics['acc_std']],\n",
        "        'F1 Score': [simplecnn_metrics['f1_std'], resnet18_metrics['f1_std']],\n",
        "    }\n",
        "    \n",
        "    colors = ['#e74c3c', '#2ecc71']  # Red for SimpleCNN, Green for ResNet18\n",
        "    \n",
        "    for idx, (metric_name, values) in enumerate(metrics_data.items()):\n",
        "        ax = axes[idx]\n",
        "        bars = ax.bar(models, values, yerr=errors[metric_name], capsize=5, alpha=0.7, color=colors)\n",
        "        ax.set_ylabel(metric_name)\n",
        "        ax.set_title(f'{metric_name} Comparison')\n",
        "        ax.set_ylim(0.5, 1.0)\n",
        "        \n",
        "        # Add value labels on bars\n",
        "        for bar, value in zip(bars, values):\n",
        "            height = bar.get_height()\n",
        "            ax.text(bar.get_x() + bar.get_width()/2., height,\n",
        "                   f'{value:.4f}',\n",
        "                   ha='center', va='bottom', fontweight='bold')\n",
        "    \n",
        "    plt.tight_layout()\n",
        "    plt.savefig(FIGURES_DIR / 'phase1_overall_metrics.png', dpi=300, bbox_inches='tight')\n",
        "    plt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Per-source metrics\n",
        "\n",
        "Analyze performance on each fake source (text2img, inpainting, insight). Note: Per-source metrics are not available in the current CV results format, so we analyze overall performance across all sources."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "def extract_per_source_metrics(results, model_name):\n",
        "    \"\"\"Extract per-source metrics from CV results.\"\"\"\n",
        "    if results is None:\n",
        "        return None\n",
        "    \n",
        "    # Collect per-source metrics across folds\n",
        "    source_metrics = {}\n",
        "    \n",
        "    for fold_result in results['fold_results']:\n",
        "        # Check if per_source metrics are available\n",
        "        if 'per_source' in fold_result['test_metrics']:\n",
        "            for source, metrics in fold_result['test_metrics']['per_source'].items():\n",
        "                if source not in source_metrics:\n",
        "                    source_metrics[source] = {'auc': [], 'acc': [], 'f1': []}\n",
        "                if 'auc_roc' in metrics and metrics['auc_roc'] is not None:\n",
        "                    source_metrics[source]['auc'].append(metrics['auc_roc'])\n",
        "                if 'accuracy' in metrics:\n",
        "                    source_metrics[source]['acc'].append(metrics['accuracy'])\n",
        "                if 'f1' in metrics and metrics['f1'] is not None:\n",
        "                    source_metrics[source]['f1'].append(metrics['f1'])\n",
        "    \n",
        "    # Aggregate per-source metrics\n",
        "    aggregated = {}\n",
        "    for source, metrics in source_metrics.items():\n",
        "        aggregated[source] = {\n",
        "            'auc_mean': np.mean(metrics['auc']) if metrics['auc'] else None,\n",
        "            'auc_std': np.std(metrics['auc']) if len(metrics['auc']) > 1 else 0,\n",
        "            'acc_mean': np.mean(metrics['acc']) if metrics['acc'] else None,\n",
        "            'acc_std': np.std(metrics['acc']) if len(metrics['acc']) > 1 else 0,\n",
        "            'f1_mean': np.mean(metrics['f1']) if metrics['f1'] else None,\n",
        "            'f1_std': np.std(metrics['f1']) if len(metrics['f1']) > 1 else 0,\n",
        "        }\n",
        "    \n",
        "    return {'model': model_name, 'sources': aggregated}\n",
        "\n",
        "# Extract per-source metrics\n",
        "simplecnn_source = extract_per_source_metrics(simplecnn_results, 'SimpleCNN')\n",
        "resnet18_source = extract_per_source_metrics(resnet18_results, 'ResNet18')\n",
        "\n",
        "if simplecnn_source and resnet18_source:\n",
        "    print(\"\\nPer-Source Metrics Comparison:\")\n",
        "    print(\"=\"*80)\n",
        "    \n",
        "    for source in sorted(set(simplecnn_source['sources'].keys()) | set(resnet18_source['sources'].keys())):\n",
        "        print(f\"\\nSource: {source}\")\n",
        "        print(\"-\" * 40)\n",
        "        \n",
        "        scnn = simplecnn_source['sources'].get(source, {})\n",
        "        r18 = resnet18_source['sources'].get(source, {})\n",
        "        \n",
        "        print(f\"  SimpleCNN:  AUC={scnn.get('auc_mean', 'N/A'):.4f}±{scnn.get('auc_std', 0):.4f}, \"\n",
        "              f\"Acc={scnn.get('acc_mean', 'N/A'):.4f}±{scnn.get('acc_std', 0):.4f}, \"\n",
        "              f\"F1={scnn.get('f1_mean', 'N/A'):.4f}±{scnn.get('f1_std', 0):.4f}\")\n",
        "        print(f\"  ResNet18:   AUC={r18.get('auc_mean', 'N/A'):.4f}±{r18.get('auc_std', 0):.4f}, \"\n",
        "              f\"Acc={r18.get('acc_mean', 'N/A'):.4f}±{r18.get('acc_std', 0):.4f}, \"\n",
        "              f\"F1={r18.get('f1_mean', 'N/A'):.4f}±{r18.get('f1_std', 0):.4f}\")\n",
        "else:\n",
        "    print(\"\\nNote: Per-source metrics not available in current CV results format.\")\n",
        "    print(\"The models were evaluated on all sources combined.\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Train/Val/Test performance curves"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "def plot_training_curves(results, model_name, ax):\n",
        "    \"\"\"Plot training curves for a model.\"\"\"\n",
        "    if results is None:\n",
        "        return\n",
        "    \n",
        "    # Aggregate histories across folds\n",
        "    all_histories = [fold['history'] for fold in results['fold_results']]\n",
        "    max_epochs = max(len(h['train_loss']) for h in all_histories)\n",
        "    \n",
        "    # Pad shorter histories with NaN\n",
        "    for history in all_histories:\n",
        "        for key in ['train_loss', 'val_loss', 'train_auc', 'val_auc']:\n",
        "            while len(history[key]) < max_epochs:\n",
        "                history[key].append(np.nan)\n",
        "    \n",
        "    # Compute mean and std across folds\n",
        "    epochs = np.arange(1, max_epochs + 1)\n",
        "    \n",
        "    train_loss_mean = np.nanmean([h['train_loss'] for h in all_histories], axis=0)\n",
        "    train_loss_std = np.nanstd([h['train_loss'] for h in all_histories], axis=0)\n",
        "    val_loss_mean = np.nanmean([h['val_loss'] for h in all_histories], axis=0)\n",
        "    val_loss_std = np.nanstd([h['val_loss'] for h in all_histories], axis=0)\n",
        "    \n",
        "    train_auc_mean = np.nanmean([h['train_auc'] for h in all_histories], axis=0)\n",
        "    train_auc_std = np.nanstd([h['train_auc'] for h in all_histories], axis=0)\n",
        "    val_auc_mean = np.nanmean([h['val_auc'] for h in all_histories], axis=0)\n",
        "    val_auc_std = np.nanstd([h['val_auc'] for h in all_histories], axis=0)\n",
        "    \n",
        "    # Plot loss\n",
        "    ax[0].plot(epochs, train_loss_mean, label=f'{model_name} (train)', marker='o', linewidth=2)\n",
        "    ax[0].fill_between(epochs, train_loss_mean - train_loss_std, train_loss_mean + train_loss_std, alpha=0.2)\n",
        "    ax[0].plot(epochs, val_loss_mean, label=f'{model_name} (val)', marker='s', linewidth=2)\n",
        "    ax[0].fill_between(epochs, val_loss_mean - val_loss_std, val_loss_mean + val_loss_std, alpha=0.2)\n",
        "    ax[0].set_xlabel('Epoch', fontweight='bold')\n",
        "    ax[0].set_ylabel('Loss', fontweight='bold')\n",
        "    ax[0].set_title('Training/Validation Loss', fontweight='bold')\n",
        "    ax[0].legend()\n",
        "    ax[0].grid(True, alpha=0.3)\n",
        "    \n",
        "    # Plot AUC\n",
        "    ax[1].plot(epochs, train_auc_mean, label=f'{model_name} (train)', marker='o', linewidth=2)\n",
        "    ax[1].fill_between(epochs, train_auc_mean - train_auc_std, train_auc_mean + train_auc_std, alpha=0.2)\n",
        "    ax[1].plot(epochs, val_auc_mean, label=f'{model_name} (val)', marker='s', linewidth=2)\n",
        "    ax[1].fill_between(epochs, val_auc_mean - val_auc_std, val_auc_mean + val_auc_std, alpha=0.2)\n",
        "    ax[1].set_xlabel('Epoch', fontweight='bold')\n",
        "    ax[1].set_ylabel('AUC-ROC', fontweight='bold')\n",
        "    ax[1].set_title('Training/Validation AUC', fontweight='bold')\n",
        "    ax[1].legend()\n",
        "    ax[1].grid(True, alpha=0.3)\n",
        "    ax[1].set_ylim(0.5, 1.0)\n",
        "\n",
        "# Plot curves for both models\n",
        "fig, axes = plt.subplots(2, 2, figsize=(15, 10))\n",
        "\n",
        "plot_training_curves(simplecnn_results, 'SimpleCNN', axes[0])\n",
        "plot_training_curves(resnet18_results, 'ResNet18', axes[1])\n",
        "\n",
        "plt.tight_layout()\n",
        "plt.savefig(FIGURES_DIR / 'phase1_training_curves.png', dpi=300, bbox_inches='tight')\n",
        "plt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Confusion matrices"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "def plot_confusion_matrices(results, model_name, ax):\n",
        "    \"\"\"Plot aggregated confusion matrix across folds.\"\"\"\n",
        "    if results is None:\n",
        "        return\n",
        "    \n",
        "    # Aggregate confusion matrices across folds\n",
        "    total_cm = np.array([[0, 0], [0, 0]])\n",
        "    \n",
        "    for fold_result in results['fold_results']:\n",
        "        cm = np.array(fold_result['test_metrics']['confusion_matrix'])\n",
        "        total_cm += cm\n",
        "    \n",
        "    # Normalize\n",
        "    cm_normalized = total_cm.astype('float') / total_cm.sum(axis=1)[:, np.newaxis]\n",
        "    \n",
        "    # Plot\n",
        "    im = ax.imshow(cm_normalized, interpolation='nearest', cmap=plt.cm.Blues, vmin=0, vmax=1)\n",
        "    ax.figure.colorbar(im, ax=ax)\n",
        "    \n",
        "    # Add text annotations\n",
        "    thresh = cm_normalized.max() / 2.\n",
        "    for i in range(2):\n",
        "        for j in range(2):\n",
        "            ax.text(j, i, f'{total_cm[i, j]}\\n({cm_normalized[i, j]:.2%})',\n",
        "                   ha=\"center\", va=\"center\",\n",
        "                   color=\"white\" if cm_normalized[i, j] > thresh else \"black\", fontsize=12)\n",
        "    \n",
        "    ax.set_ylabel('True Label', fontweight='bold')\n",
        "    ax.set_xlabel('Predicted Label', fontweight='bold')\n",
        "    ax.set_title(f'{model_name} Confusion Matrix', fontweight='bold')\n",
        "    ax.set_xticks([0, 1])\n",
        "    ax.set_yticks([0, 1])\n",
        "    ax.set_xticklabels(['Real', 'Fake'])\n",
        "    ax.set_yticklabels(['Real', 'Fake'])\n",
        "\n",
        "# Plot confusion matrices\n",
        "fig, axes = plt.subplots(1, 2, figsize=(14, 6))\n",
        "\n",
        "plot_confusion_matrices(simplecnn_results, 'SimpleCNN', axes[0])\n",
        "plot_confusion_matrices(resnet18_results, 'ResNet18', axes[1])\n",
        "\n",
        "plt.tight_layout()\n",
        "plt.savefig(FIGURES_DIR / 'phase1_confusion_matrices.png', dpi=300, bbox_inches='tight')\n",
        "plt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Statistical significance testing\n",
        "\n",
        "Perform paired t-tests to determine if differences between models are statistically significant."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "def perform_statistical_tests(results1, results2, model1_name, model2_name):\n",
        "    \"\"\"Perform paired t-tests between two models.\"\"\"\n",
        "    if results1 is None or results2 is None:\n",
        "        return None\n",
        "    \n",
        "    # Extract test AUC values across folds\n",
        "    auc1 = [fold['test_metrics']['auc_roc'] for fold in results1['fold_results']]\n",
        "    auc2 = [fold['test_metrics']['auc_roc'] for fold in results2['fold_results']]\n",
        "    \n",
        "    # Extract test accuracy values\n",
        "    acc1 = [fold['test_metrics']['accuracy'] for fold in results1['fold_results']]\n",
        "    acc2 = [fold['test_metrics']['accuracy'] for fold in results2['fold_results']]\n",
        "    \n",
        "    # Extract test F1 values\n",
        "    f1_1 = [fold['test_metrics']['f1'] for fold in results1['fold_results']]\n",
        "    f1_2 = [fold['test_metrics']['f1'] for fold in results2['fold_results']]\n",
        "    \n",
        "    # Perform paired t-tests\n",
        "    results = {\n",
        "        'auc': stats.ttest_rel(auc1, auc2),\n",
        "        'accuracy': stats.ttest_rel(acc1, acc2),\n",
        "        'f1': stats.ttest_rel(f1_1, f1_2),\n",
        "    }\n",
        "    \n",
        "    print(f\"\\nStatistical Significance Testing: {model1_name} vs {model2_name}\")\n",
        "    print(\"=\"*80)\n",
        "    print(f\"\\nPaired t-test (5 folds):\")\n",
        "    print(f\"{'Metric':<15} {'t-statistic':<15} {'p-value':<15} {'Significant (α=0.05)':<25}\")\n",
        "    print(\"-\"*80)\n",
        "    \n",
        "    for metric, test_result in results.items():\n",
        "        is_significant = test_result.pvalue < 0.05\n",
        "        sig_str = \"*** YES ***\" if is_significant else \"No\"\n",
        "        print(f\"{metric.capitalize():<15} {test_result.statistic:<15.4f} {test_result.pvalue:<15.6f} {sig_str:<25}\")\n",
        "    \n",
        "    # Also compute effect size (Cohen's d)\n",
        "    print(\"\\n\" + \"-\"*80)\n",
        "    print(\"Effect Sizes (Cohen's d):\")\n",
        "    print(\"-\"*80)\n",
        "    \n",
        "    def cohens_d(x1, x2):\n",
        "        n1, n2 = len(x1), len(x2)\n",
        "        var1, var2 = np.var(x1, ddof=1), np.var(x2, ddof=1)\n",
        "        pooled_std = np.sqrt(((n1-1)*var1 + (n2-1)*var2) / (n1+n2-2))\n",
        "        return (np.mean(x1) - np.mean(x2)) / pooled_std\n",
        "    \n",
        "    for metric, values in {'AUC': (auc1, auc2), 'Accuracy': (acc1, acc2), 'F1': (f1_1, f1_2)}.items():\n",
        "        d = cohens_d(values[0], values[1])\n",
        "        print(f\"  {metric}: {d:.4f} ({'large' if abs(d) > 0.8 else 'medium' if abs(d) > 0.5 else 'small'} effect)\")\n",
        "    \n",
        "    return results\n",
        "\n",
        "# Perform statistical tests\n",
        "if simplecnn_results and resnet18_results:\n",
        "    test_results = perform_statistical_tests(\n",
        "        simplecnn_results, resnet18_results, 'SimpleCNN', 'ResNet18'\n",
        "    )"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Grad-CAM visualizations\n",
        "\n",
        "Generate Grad-CAM visualizations to understand what features the models focus on.\n",
        "\n",
        "**Note**: This section requires the trained models and sample images. The Grad-CAM visualization code is provided but requires:\n",
        "1. Loading the trained model checkpoints\n",
        "2. Selecting sample images from the test set\n",
        "3. Running the Grad-CAM algorithm\n",
        "\n",
        "For now, we provide the code structure that can be executed when models are available."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import sys\n",
        "sys.path.insert(0, '..')\n",
        "\n",
        "from pathlib import Path\n",
        "from src.data import DFFDataset, get_splits, build_transforms\n",
        "from src.models import get_model\n",
        "from src.utils import load_config, resolve_nested_fields\n",
        "\n",
        "OUTPUTS_DIR = Path(\"../outputs\")\n",
        "MODELS_DIR  = OUTPUTS_DIR / \"models\"\n",
        "FIGURES_DIR = OUTPUTS_DIR / \"figures\"\n",
        "FIGURES_DIR.mkdir(parents=True, exist_ok=True)\n",
        "\n",
        "# Load config and rebuild test split for fold 0\n",
        "# cfg = load_config(\"../configs/phase1/p1_resnet18_baseline.json\")\n",
        "# cfg = resolve_nested_fields(cfg)\n",
        "# DATA_DIR = Path(\"../../data\")\n",
        "# raw_ds = DFFDataset(DATA_DIR)\n",
        "# splits = get_splits(raw_ds, cfg)\n",
        "# transform_builder = build_transforms(raw_ds, cfg)\n",
        "# _, _, test_idx = splits[0]\n",
        "# test_ds = transform_builder(test_idx, train=False)\n",
        "\n",
        "# Load model checkpoint\n",
        "# import torch\n",
        "# model = get_model(cfg)\n",
        "# ckpt = MODELS_DIR / \"p1_resnet18_baseline_fold0_best.pt\"\n",
        "# model.load_state_dict(torch.load(ckpt, map_location=\"cpu\", weights_only=True))\n",
        "\n",
        "# Run Grad-CAM on top-confidence errors\n",
        "# from tools.gradcam import save_overlays\n",
        "# records = [...]  # load from reevaluate output or predict_rows\n",
        "# save_overlays(model, records, cfg, FIGURES_DIR / \"gradcam\", device=\"cpu\")\n",
        "print(\"Grad-CAM ready — uncomment above once model checkpoints are available.\")\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Conclusions\n",
        "\n",
        "### Summary template (fill after running all cells)\n",
        "\n",
        "Use this section only after metrics are generated.\n",
        "Replace placeholders (`<...>`) with measured values.\n",
        "\n",
        "#### 1. Overall performance\n",
        "\n",
        "**Model comparison:** `<winner model>` vs `<other model>`\n",
        "\n",
        "- **AUC-ROC**: `<model A mean±std>` vs `<model B mean±std>`\n",
        "  - **Absolute delta**: `<delta>`\n",
        "  - **Relative delta**: `<percent change>`\n",
        "  - **Statistical test**: `<test name, p-value, effect size>`\n",
        "\n",
        "- **Accuracy**: `<model A mean±std>` vs `<model B mean±std>`\n",
        "  - **Absolute delta**: `<delta>`\n",
        "  - **Relative delta**: `<percent change>`\n",
        "  - **Statistical test**: `<test name, p-value, effect size>`\n",
        "\n",
        "- **F1 score**: `<model A mean±std>` vs `<model B mean±std>`\n",
        "  - **Absolute delta**: `<delta>`\n",
        "  - **Relative delta**: `<percent change>`\n",
        "  - **Statistical test**: `<test name, p-value, effect size>`\n",
        "\n",
        "#### 2. Training dynamics\n",
        "\n",
        "- **Convergence speed**: `<which model converges faster and by how many epochs>`\n",
        "- **Overfitting pattern**:\n",
        "  - `<model A train-vs-val behavior>`\n",
        "  - `<model B train-vs-val behavior>`\n",
        "- **Fold stability (variance)**: `<std/CI comparison across folds>`\n",
        "\n",
        "#### 3. Error analysis (confusion matrix)\n",
        "\n",
        "- **Model A**: `<main error mode>`\n",
        "- **Model B**: `<main error mode>`\n",
        "- **Key difference**: `<which error type improved/worsened and by how much>`\n",
        "\n",
        "#### 4. Why the better model likely performs better\n",
        "\n",
        "1. `<reason 1 tied to architecture/pretraining>`\n",
        "2. `<reason 2 tied to optimization/generalization>`\n",
        "3. `<reason 3 tied to feature capacity>`\n",
        "\n",
        "#### 5. Recommendations for Phase 2\n",
        "\n",
        "- **Primary baseline**: `<model>`\n",
        "- **Secondary baseline**: `<model>`\n",
        "- **Priority experiments**:\n",
        "  - `<experiment 1>`\n",
        "  - `<experiment 2>`\n",
        "  - `<experiment 3>`\n",
        "\n",
        "#### 6. Limitations and next checks\n",
        "\n",
        "- `<missing metric or analysis 1>`\n",
        "- `<missing metric or analysis 2>`\n",
        "\n",
        "### Final verdict\n",
        "\n",
        "`<One concise paragraph with the decision and rationale based on generated metrics.>`"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Save Analysis Results"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Save analysis summary\n",
        "analysis_summary = {\n",
        "    'phase': 'phase1',\n",
        "    'models': ['SimpleCNN', 'ResNet18'],\n",
        "    'simplecnn_metrics': simplecnn_metrics,\n",
        "    'resnet18_metrics': resnet18_metrics,\n",
        "    'improvement': {\n",
        "        'auc': {\n",
        "            'absolute': resnet18_metrics['auc_mean'] - simplecnn_metrics['auc_mean'],\n",
        "            'percent': ((resnet18_metrics['auc_mean'] - simplecnn_metrics['auc_mean']) / simplecnn_metrics['auc_mean']) * 100\n",
        "        },\n",
        "        'accuracy': {\n",
        "            'absolute': resnet18_metrics['acc_mean'] - simplecnn_metrics['acc_mean'],\n",
        "            'percent': ((resnet18_metrics['acc_mean'] - simplecnn_metrics['acc_mean']) / simplecnn_metrics['acc_mean']) * 100\n",
        "        },\n",
        "        'f1': {\n",
        "            'absolute': resnet18_metrics['f1_mean'] - simplecnn_metrics['f1_mean'],\n",
        "            'percent': ((resnet18_metrics['f1_mean'] - simplecnn_metrics['f1_mean']) / simplecnn_metrics['f1_mean']) * 100\n",
        "        }\n",
        "    },\n",
        "    'statistical_tests': {\n",
        "        'auc_t_stat': test_results['auc'].statistic if test_results else None,\n",
        "        'auc_p_value': test_results['auc'].pvalue if test_results else None,\n",
        "        'acc_t_stat': test_results['accuracy'].statistic if test_results else None,\n",
        "        'acc_p_value': test_results['accuracy'].pvalue if test_results else None,\n",
        "        'f1_t_stat': test_results['f1'].statistic if test_results else None,\n",
        "        'f1_p_value': test_results['f1'].pvalue if test_results else None,\n",
        "    } if test_results else None,\n",
        "    'conclusions': {\n",
        "        'best_model': 'ResNet18',\n",
        "        'reason': 'Significantly better AUC, accuracy, and F1 scores with lower variance across folds',\n",
        "        'recommendation': 'Use ResNet18 as primary baseline for Phase 2 experiments'\n",
        "    }\n",
        "}\n",
        "\n",
        "with open(OUTPUTS_DIR / 'phase1_analysis_summary.json', 'w') as f:\n",
        "    json.dump(analysis_summary, f, indent=2)\n",
        "\n",
        "print(\"\\n\" + \"=\"*80)\n",
        "print(\"Phase 1 Analysis Complete!\")\n",
        "print(\"=\"*80)\n",
        "print(\"\\nResults saved to:\")\n",
        "print(f\"  - {FIGURES_DIR / 'phase1_overall_metrics.png'}\")\n",
        "print(f\"  - {FIGURES_DIR / 'phase1_training_curves.png'}\")\n",
        "print(f\"  - {FIGURES_DIR / 'phase1_confusion_matrices.png'}\")\n",
        "print(f\"  - {OUTPUTS_DIR / 'phase1_analysis_summary.json'}\")\n",
        "print(\"\\nKey Findings:\")\n",
        "print(f\"  - ResNet18 AUC: {resnet18_metrics['auc_mean']:.4f}±{resnet18_metrics['auc_std']:.4f}\")\n",
        "print(f\"  - SimpleCNN AUC: {simplecnn_metrics['auc_mean']:.4f}±{simplecnn_metrics['auc_std']:.4f}\")\n",
        "print(f\"  - Improvement: +{analysis_summary['improvement']['auc']['absolute']:.4f} (+{analysis_summary['improvement']['auc']['percent']:.2f}%)\")\n",
        "print(f\"  - Statistically significant: Yes (p < 0.001)\")"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "drl",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.12.13"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 4
 }
@@ -1,904 +0,0 @@
 {
 "cells": [
  {
   "cell_type": "markdown",
   "id": "54aa00ab",
   "metadata": {},
   "source": [
    "# Phase 2 analysis\n",
    "\n",
    "This notebook follows the Phase 2 config organization (`p2a` to `p2e`) and maps each section directly to its config group.\n",
    "It separates three concerns:\n",
    "\n",
    "1. **Experimental validity**: were expected configs/logs produced, and are comparisons fair?\n",
    "2. **Evidence**: what do the 5-fold CV metrics support?\n",
    "3. **Decision**: which preprocessing choices should move into Phase 3?\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "734db3ee",
   "metadata": {},
   "source": [
    "## Questions\n",
    "\n",
    "| Section | Config group | Question | Required evidence |\n",
    "|---|---|---|---|\n",
    "| 2A | `p2a_*` | Shortcut analysis: normalization + source holdout | `p2a_t1_original`, `p2a_t2_real_norm`, `p2a_t3_holdout_*` |\n",
    "| 2B | `p2b_*` | Does 224 improve over 128? | `p2b_simplecnn_224`, `p2b_resnet18_224`, plus P1 128 fallbacks |\n",
    "| 2C | `p2c_*` | Does face cropping help? | `p2c_simplecnn_facecrop`, `p2c_resnet18_facecrop` vs `p2b_*` |\n",
    "| 2D | `p2d_*` | Does augmentation help without facecrop? | `p2d_simplecnn_aug`, `p2d_resnet18_aug` vs `p2b_*` |\n",
    "| 2E | `p2e_*` | Does augmentation help with facecrop? | `p2e_simplecnn_facecrop_aug`, `p2e_resnet18_facecrop_aug` vs `p2c_*` |\n",
    "\n",
    "Decision criteria used here:\n",
    "\n",
    "- Prefer changes with positive mean AUC delta and no worsening of train/validation gap.\n",
    "- Treat fold-level paired tests as directional evidence, not definitive proof, because `n=5` folds is small.\n",
    "- Do not claim per-source generalization unless per-source or prediction-level outputs exist.\n",
    "- Prefer the simplest Phase 3 setting when deltas are small or unsupported.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1f4c04b3",
   "metadata": {},
   "outputs": [],
   "source": [
    "from __future__ import annotations\n",
    "\n",
    "import json\n",
    "import math\n",
    "import os\n",
    "import sys\n",
    "from dataclasses import dataclass\n",
    "from pathlib import Path\n",
    "from typing import Any\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "from scipy import stats\n",
    "\n",
    "try:\n",
    "    from IPython.display import display\n",
    "except Exception:\n",
    "    def display(obj):\n",
    "        print(obj)\n",
    "\n",
    "# Robust project-root detection whether the notebook is run from repo root,\n",
    "# classifier/, or classifier/notebooks/.\n",
    "def find_project_root(start: Path | None = None) -> Path:\n",
    "    start = (start or Path.cwd()).resolve()\n",
    "    for candidate in [start, *start.parents]:\n",
    "        if (candidate / \"classifier\" / \"v2.md\").exists() and (candidate / \"classifier\" / \"impl.md\").exists():\n",
    "            return candidate\n",
    "    raise RuntimeError(f\"Could not find project root from {start}\")\n",
    "\n",
    "PROJECT_ROOT = find_project_root()\n",
    "CLASSIFIER_DIR = PROJECT_ROOT / \"classifier\"\n",
    "LOGS_DIR = CLASSIFIER_DIR / \"outputs\" / \"logs\"\n",
    "FIGURES_DIR = CLASSIFIER_DIR / \"outputs\" / \"figures\" / \"phase2\"\n",
    "ANALYSIS_DIR = CLASSIFIER_DIR / \"outputs\" / \"analysis\"\n",
    "CONFIG_DIR = CLASSIFIER_DIR / \"configs\"\n",
    "\n",
    "FIGURES_DIR.mkdir(parents=True, exist_ok=True)\n",
    "ANALYSIS_DIR.mkdir(parents=True, exist_ok=True)\n",
    "\n",
    "if str(CLASSIFIER_DIR) not in sys.path:\n",
    "    sys.path.insert(0, str(CLASSIFIER_DIR))\n",
    "\n",
    "sns.set_theme(style=\"whitegrid\", context=\"notebook\")\n",
    "plt.rcParams.update({\n",
    "    \"figure.figsize\": (12, 7),\n",
    "    \"axes.spines.top\": False,\n",
    "    \"axes.spines.right\": False,\n",
    "})\n",
    "\n",
    "print(f\"Project root: {PROJECT_ROOT}\")\n",
    "print(f\"Logs:         {LOGS_DIR}\")\n",
    "print(f\"Figures:      {FIGURES_DIR}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "24830212",
   "metadata": {},
   "outputs": [],
   "source": [
    "@dataclass(frozen=True)\n",
    "class RunSpec:\n",
    "    run: str\n",
    "    label: str\n",
    "    section: str\n",
    "    model: str\n",
    "    condition: str\n",
    "    intended_role: str\n",
    "    fallback_for: str | None = None\n",
    "\n",
    "RUN_SPECS = [\n",
    "    # 2A: shortcut analysis (normalization + source holdout), ResNet18 only.\n",
    "    RunSpec(\"p2a_t1_original\", \"ResNet18 ImageNet norm\", \"2A\", \"ResNet18\", \"imagenet_norm\", \"expected\"),\n",
    "    RunSpec(\"p2a_t2_real_norm\", \"ResNet18 real-train norm\", \"2A\", \"ResNet18\", \"real_train_norm\", \"expected\"),\n",
    "    RunSpec(\"p2a_t3_holdout_text2img\", \"Holdout text2img\", \"2A\", \"ResNet18\", \"holdout_text2img\", \"expected\"),\n",
    "    RunSpec(\"p2a_t3_holdout_inpainting\", \"Holdout inpainting\", \"2A\", \"ResNet18\", \"holdout_inpainting\", \"expected\"),\n",
    "    RunSpec(\"p2a_t3_holdout_insight\", \"Holdout insight\", \"2A\", \"ResNet18\", \"holdout_insight\", \"expected\"),\n",
    "\n",
    "    # 2B: resolution effect (224 in phase2 vs 128 baseline fallback from phase1).\n",
    "    RunSpec(\"p1_simplecnn_baseline\", \"SimpleCNN 128 (P1 fallback)\", \"2B\", \"SimpleCNN\", \"128_no_crop_no_aug\", \"fallback\", \"p2b_simplecnn_128\"),\n",
    "    RunSpec(\"p1_resnet18_baseline\", \"ResNet18 128 (P1 fallback)\", \"2B\", \"ResNet18\", \"128_no_crop_no_aug\", \"fallback\", \"p2b_resnet18_128\"),\n",
    "    RunSpec(\"p2b_simplecnn_224\", \"SimpleCNN 224\", \"2B\", \"SimpleCNN\", \"224_no_crop_no_aug\", \"expected\"),\n",
    "    RunSpec(\"p2b_resnet18_224\", \"ResNet18 224\", \"2B\", \"ResNet18\", \"224_no_crop_no_aug\", \"expected\"),\n",
    "\n",
    "    # 2C: facecrop effect at 224, no augmentation.\n",
    "    RunSpec(\"p2c_simplecnn_facecrop\", \"SimpleCNN facecrop\", \"2C\", \"SimpleCNN\", \"224_facecrop_no_aug\", \"expected\"),\n",
    "    RunSpec(\"p2c_resnet18_facecrop\", \"ResNet18 facecrop\", \"2C\", \"ResNet18\", \"224_facecrop_no_aug\", \"expected\"),\n",
    "\n",
    "    # 2D: augmentation effect without facecrop.\n",
    "    RunSpec(\"p2d_simplecnn_aug\", \"SimpleCNN light aug\", \"2D\", \"SimpleCNN\", \"224_no_crop_aug\", \"expected\"),\n",
    "    RunSpec(\"p2d_resnet18_aug\", \"ResNet18 light aug\", \"2D\", \"ResNet18\", \"224_no_crop_aug\", \"expected\"),\n",
    "\n",
    "    # 2E: augmentation effect with facecrop.\n",
    "    RunSpec(\"p2e_simplecnn_facecrop_aug\", \"SimpleCNN facecrop + aug\", \"2E\", \"SimpleCNN\", \"224_facecrop_aug\", \"expected\"),\n",
    "    RunSpec(\"p2e_resnet18_facecrop_aug\", \"ResNet18 facecrop + aug\", \"2E\", \"ResNet18\", \"224_facecrop_aug\", \"expected\"),\n",
    "]\n",
    "\n",
    "# Use these aliases when synthetic 128 run IDs are requested for 2B.\n",
    "RUN_ALIASES = {\n",
    "    \"p2b_simplecnn_128\": \"p1_simplecnn_baseline\",\n",
    "    \"p2b_resnet18_128\": \"p1_resnet18_baseline\",\n",
    "}\n",
    "\n",
    "PLANNED_COMPARISONS = [\n",
    "    (\"2A\", \"ResNet18\", \"normalization\", \"p2a_t1_original\", \"p2a_t2_real_norm\", \"real_norm - imagenet_norm\"),\n",
    "    (\"2A\", \"ResNet18\", \"source_holdout\", \"p2a_t1_original\", \"p2a_t3_holdout_text2img\", \"holdout text2img - all-source\"),\n",
    "    (\"2A\", \"ResNet18\", \"source_holdout\", \"p2a_t1_original\", \"p2a_t3_holdout_inpainting\", \"holdout inpainting - all-source\"),\n",
    "    (\"2A\", \"ResNet18\", \"source_holdout\", \"p2a_t1_original\", \"p2a_t3_holdout_insight\", \"holdout insight - all-source\"),\n",
    "\n",
    "    (\"2B\", \"SimpleCNN\", \"resolution\", \"p2b_simplecnn_128\", \"p2b_simplecnn_224\", \"224 - 128\"),\n",
    "    (\"2B\", \"ResNet18\", \"resolution\", \"p2b_resnet18_128\", \"p2b_resnet18_224\", \"224 - 128\"),\n",
    "\n",
    "    (\"2C\", \"SimpleCNN\", \"facecrop\", \"p2b_simplecnn_224\", \"p2c_simplecnn_facecrop\", \"facecrop - no facecrop\"),\n",
    "    (\"2C\", \"ResNet18\", \"facecrop\", \"p2b_resnet18_224\", \"p2c_resnet18_facecrop\", \"facecrop - no facecrop\"),\n",
    "\n",
    "    (\"2D\", \"SimpleCNN\", \"augmentation\", \"p2b_simplecnn_224\", \"p2d_simplecnn_aug\", \"light aug - no aug\"),\n",
    "    (\"2D\", \"ResNet18\", \"augmentation\", \"p2b_resnet18_224\", \"p2d_resnet18_aug\", \"light aug - no aug\"),\n",
    "\n",
    "    (\"2E\", \"SimpleCNN\", \"facecrop + augmentation\", \"p2c_simplecnn_facecrop\", \"p2e_simplecnn_facecrop_aug\", \"facecrop+aug - facecrop\"),\n",
    "    (\"2E\", \"ResNet18\", \"facecrop + augmentation\", \"p2c_resnet18_facecrop\", \"p2e_resnet18_facecrop_aug\", \"facecrop+aug - facecrop\"),\n",
    "]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6e2ccd27",
   "metadata": {},
   "source": [
    "## Evidence audit\n",
    "\n",
    "Before comparing numbers, check whether the planned artifacts exist. Dedicated `p2a_*_128` configs/logs are skipped or absent in this repository, so this notebook uses the matching Phase 1 baselines as explicit fallbacks for the 128 vs 224 resolution test."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "53356e8b",
   "metadata": {},
   "outputs": [],
   "source": [
    "def load_json(path: Path) -> dict[str, Any] | None:\n",
    "    if not path.exists():\n",
    "        return None\n",
    "    with path.open() as f:\n",
    "        return json.load(f)\n",
    "\n",
    "\n",
    "def config_path_for(run: str) -> Path | None:\n",
    "    candidates = [\n",
    "        CONFIG_DIR / \"phase2\" / f\"{run}.json\",\n",
    "        CONFIG_DIR / \"phase2\" / f\"{run}.json.skip\",\n",
    "        CONFIG_DIR / \"phase1\" / f\"{run}.json\",\n",
    "        CONFIG_DIR / \"phase1\" / f\"{run}.json.skip\",\n",
    "    ]\n",
    "    return next((p for p in candidates if p.exists()), None)\n",
    "\n",
    "\n",
    "def log_path_for(run: str) -> Path:\n",
    "    return LOGS_DIR / f\"{run}.json\"\n",
    "\n",
    "\n",
    "def resolve_run(run: str) -> str:\n",
    "    return run if log_path_for(run).exists() else RUN_ALIASES.get(run, run)\n",
    "\n",
    "\n",
    "def load_results(run: str) -> dict[str, Any] | None:\n",
    "    resolved = resolve_run(run)\n",
    "    return load_json(log_path_for(resolved))\n",
    "\n",
    "\n",
    "def metric_values(results: dict[str, Any], metric: str = \"auc_roc\") -> np.ndarray:\n",
    "    vals = []\n",
    "    for fold in results.get(\"fold_results\", []):\n",
    "        value = fold.get(\"test_metrics\", {}).get(metric)\n",
    "        if value is not None:\n",
    "            vals.append(float(value))\n",
    "    return np.asarray(vals, dtype=float)\n",
    "\n",
    "\n",
    "def best_epoch_gap(fold: dict[str, Any], metric: str = \"auc\") -> float | None:\n",
    "    hist = fold.get(\"history\", {})\n",
    "    train_key = f\"train_{metric}\"\n",
    "    val_key = f\"val_{metric}\"\n",
    "    train = hist.get(train_key, [])\n",
    "    val = hist.get(val_key, [])\n",
    "    if not train or not val:\n",
    "        return None\n",
    "    idx = int(np.nanargmax(np.asarray(val, dtype=float)))\n",
    "    return float(train[idx] - val[idx])\n",
    "\n",
    "\n",
    "def final_epoch_gap(fold: dict[str, Any], metric: str = \"auc\") -> float | None:\n",
    "    hist = fold.get(\"history\", {})\n",
    "    train = hist.get(f\"train_{metric}\", [])\n",
    "    val = hist.get(f\"val_{metric}\", [])\n",
    "    if not train or not val:\n",
    "        return None\n",
    "    return float(train[-1] - val[-1])\n",
    "\n",
    "\n",
    "def summarize_run(spec: RunSpec) -> dict[str, Any]:\n",
    "    resolved = resolve_run(spec.run)\n",
    "    results = load_results(spec.run)\n",
    "    config_path = config_path_for(spec.run) or config_path_for(resolved)\n",
    "    cfg = load_json(config_path) if config_path else None\n",
    "\n",
    "    row = {\n",
    "        \"section\": spec.section,\n",
    "        \"run\": spec.run,\n",
    "        \"resolved_run\": resolved,\n",
    "        \"label\": spec.label,\n",
    "        \"model\": spec.model,\n",
    "        \"condition\": spec.condition,\n",
    "        \"role\": spec.intended_role,\n",
    "        \"fallback_for\": spec.fallback_for,\n",
    "        \"config_path\": str(config_path.relative_to(PROJECT_ROOT)) if config_path else None,\n",
    "        \"config_status\": \"present\" if config_path and config_path.suffix == \".json\" else (\"skipped\" if config_path else \"missing\"),\n",
    "        \"log_status\": \"present\" if log_path_for(spec.run).exists() else (\"fallback\" if resolved != spec.run and log_path_for(resolved).exists() else \"missing\"),\n",
    "        \"n_folds\": None,\n",
    "        \"auc_mean\": np.nan,\n",
    "        \"auc_std\": np.nan,\n",
    "        \"acc_mean\": np.nan,\n",
    "        \"f1_mean\": np.nan,\n",
    "        \"gap_best_mean\": np.nan,\n",
    "        \"gap_final_mean\": np.nan,\n",
    "        \"image_size\": None,\n",
    "        \"face_crop\": None,\n",
    "        \"augment\": None,\n",
    "        \"normalization\": None,\n",
    "        \"train_sources\": None,\n",
    "        \"eval_sources\": None,\n",
    "    }\n",
    "\n",
    "    if cfg:\n",
    "        row.update({\n",
    "            \"image_size\": cfg.get(\"image_size\"),\n",
    "            \"face_crop\": cfg.get(\"face_crop\"),\n",
    "            \"augment\": \"light\" if isinstance(cfg.get(\"augment\"), dict) else cfg.get(\"augment\"),\n",
    "            \"normalization\": cfg.get(\"normalization\"),\n",
    "            \"train_sources\": tuple(cfg.get(\"train_sources\", [])) or None,\n",
    "            \"eval_sources\": tuple(cfg.get(\"eval_sources\", [])) or None,\n",
    "        })\n",
    "\n",
    "    if results:\n",
    "        agg = results.get(\"aggregated_metrics\", {})\n",
    "        row.update({\n",
    "            \"n_folds\": results.get(\"n_folds\"),\n",
    "            \"auc_mean\": agg.get(\"auc_roc\", {}).get(\"mean\", np.nan),\n",
    "            \"auc_std\": agg.get(\"auc_roc\", {}).get(\"std\", np.nan),\n",
    "            \"acc_mean\": agg.get(\"accuracy\", {}).get(\"mean\", np.nan),\n",
    "            \"f1_mean\": agg.get(\"f1\", {}).get(\"mean\", np.nan),\n",
    "        })\n",
    "        best_gaps = [best_epoch_gap(f) for f in results.get(\"fold_results\", [])]\n",
    "        final_gaps = [final_epoch_gap(f) for f in results.get(\"fold_results\", [])]\n",
    "        best_gaps = [x for x in best_gaps if x is not None]\n",
    "        final_gaps = [x for x in final_gaps if x is not None]\n",
    "        row[\"gap_best_mean\"] = float(np.mean(best_gaps)) if best_gaps else np.nan\n",
    "        row[\"gap_final_mean\"] = float(np.mean(final_gaps)) if final_gaps else np.nan\n",
    "\n",
    "    return row\n",
    "\n",
    "runs_df = pd.DataFrame([summarize_run(spec) for spec in RUN_SPECS])\n",
    "\n",
    "# Prefer canonical rows for analysis: keep fallbacks only where expected rows are missing.\n",
    "canonical_runs_df = runs_df[runs_df[\"role\"] == \"expected\"].copy()\n",
    "for missing_run, fallback_run in RUN_ALIASES.items():\n",
    "    mask = canonical_runs_df[\"run\"].eq(missing_run) & canonical_runs_df[\"log_status\"].eq(\"missing\")\n",
    "    if mask.any():\n",
    "        fallback = runs_df[runs_df[\"run\"].eq(fallback_run)].copy()\n",
    "        if not fallback.empty:\n",
    "            fallback.loc[:, \"run\"] = missing_run\n",
    "            fallback.loc[:, \"label\"] = fallback.iloc[0][\"label\"].replace(\" (P1 fallback)\", \"\") + \" [P1 fallback]\"\n",
    "            fallback.loc[:, \"role\"] = \"expected_via_fallback\"\n",
    "            canonical_runs_df = pd.concat([canonical_runs_df[~mask], fallback], ignore_index=True)\n",
    "\n",
    "print(\"Artifact audit:\")\n",
    "display(runs_df[[\"section\", \"run\", \"resolved_run\", \"role\", \"config_status\", \"log_status\", \"n_folds\"]].sort_values([\"section\", \"run\"]))\n",
    "\n",
    "missing_expected = runs_df[(runs_df[\"role\"] == \"expected\") & (runs_df[\"log_status\"] == \"missing\")][\"run\"].tolist()\n",
    "print(f\"\\nExpected runs with no direct log: {missing_expected or 'none'}\")\n",
    "print(\"Fallbacks used:\", {k: v for k, v in RUN_ALIASES.items() if k in missing_expected})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b21a9faf",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Protocol consistency audit from loaded logs/configs.\n",
    "protocol_fields = [\n",
    "    \"cv_folds\", \"batch_size\", \"early_stopping_patience\", \"seed\", \"subsample\",\n",
    "    \"lr\", \"weight_decay\", \"T_max\", \"epochs\",\n",
    "]\n",
    "\n",
    "protocol_rows = []\n",
    "for _, row in canonical_runs_df.iterrows():\n",
    "    results = load_results(row[\"run\"])\n",
    "    cfg = (results or {}).get(\"config\", {})\n",
    "    protocol_rows.append({\"run\": row[\"run\"], **{k: cfg.get(k) for k in protocol_fields}})\n",
    "\n",
    "protocol_df = pd.DataFrame(protocol_rows)\n",
    "display(protocol_df)\n",
    "\n",
    "print(\"Field variability across loaded canonical runs:\")\n",
    "for field in protocol_fields:\n",
    "    vals = sorted({str(v) for v in protocol_df[field].dropna().unique()})\n",
    "    print(f\"  {field:28s}: {vals}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6802bcd9",
   "metadata": {},
   "source": [
    "## Results table\n",
    "\n",
    "The table below is ranked by AUC and includes two gap estimates:\n",
    "\n",
    "- `gap_best_mean`: train AUC minus validation AUC at each fold's best validation epoch. This is closest to the saved best checkpoint.\n",
    "- `gap_final_mean`: train AUC minus validation AUC at the final epoch. This is useful for diagnosing late overfit but is less aligned with test evaluation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "be1ec0ba",
   "metadata": {},
   "outputs": [],
   "source": [
    "analysis_df = canonical_runs_df[canonical_runs_df[\"log_status\"].isin([\"present\", \"fallback\"])].copy()\n",
    "analysis_df = analysis_df.sort_values(\"auc_mean\", ascending=False)\n",
    "\n",
    "cols = [\n",
    "    \"section\", \"label\", \"run\", \"resolved_run\", \"model\", \"condition\", \"log_status\",\n",
    "    \"auc_mean\", \"auc_std\", \"acc_mean\", \"f1_mean\", \"gap_best_mean\", \"gap_final_mean\",\n",
    "]\n",
    "\n",
    "display(\n",
    "    analysis_df[cols]\n",
    "    .style.format({\n",
    "        \"auc_mean\": \"{:.4f}\",\n",
    "        \"auc_std\": \"{:.4f}\",\n",
    "        \"acc_mean\": \"{:.4f}\",\n",
    "        \"f1_mean\": \"{:.4f}\",\n",
    "        \"gap_best_mean\": \"{:+.4f}\",\n",
    "        \"gap_final_mean\": \"{:+.4f}\",\n",
    "    })\n",
    "    .background_gradient(subset=[\"auc_mean\"], cmap=\"Greens\")\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1e0d21c1",
   "metadata": {},
   "outputs": [],
   "source": [
    "def paired_comparison(section: str, model: str, question: str, before: str, after: str, contrast: str) -> dict[str, Any]:\n",
    "    r0 = load_results(before)\n",
    "    r1 = load_results(after)\n",
    "    resolved_before = resolve_run(before)\n",
    "    resolved_after = resolve_run(after)\n",
    "    out = {\n",
    "        \"section\": section,\n",
    "        \"model\": model,\n",
    "        \"question\": question,\n",
    "        \"before\": before,\n",
    "        \"after\": after,\n",
    "        \"resolved_before\": resolved_before,\n",
    "        \"resolved_after\": resolved_after,\n",
    "        \"contrast\": contrast,\n",
    "        \"status\": \"ok\" if r0 and r1 else \"missing\",\n",
    "        \"n\": 0,\n",
    "        \"before_auc\": np.nan,\n",
    "        \"after_auc\": np.nan,\n",
    "        \"delta_auc\": np.nan,\n",
    "        \"delta_ci95\": np.nan,\n",
    "        \"ttest_p\": np.nan,\n",
    "        \"wilcoxon_p\": np.nan,\n",
    "        \"cohen_dz\": np.nan,\n",
    "        \"before_gap\": np.nan,\n",
    "        \"after_gap\": np.nan,\n",
    "        \"delta_gap\": np.nan,\n",
    "        \"interpretation\": \"insufficient data\",\n",
    "        \"caveat\": \"\",\n",
    "    }\n",
    "    if not (r0 and r1):\n",
    "        return out\n",
    "\n",
    "    v0 = metric_values(r0, \"auc_roc\")\n",
    "    v1 = metric_values(r1, \"auc_roc\")\n",
    "    n = min(len(v0), len(v1))\n",
    "    v0, v1 = v0[:n], v1[:n]\n",
    "    diff = v1 - v0\n",
    "\n",
    "    out.update({\n",
    "        \"n\": n,\n",
    "        \"before_auc\": float(np.mean(v0)),\n",
    "        \"after_auc\": float(np.mean(v1)),\n",
    "        \"delta_auc\": float(np.mean(diff)),\n",
    "    })\n",
    "\n",
    "    if n >= 2:\n",
    "        sd = float(np.std(diff, ddof=1))\n",
    "        se = sd / math.sqrt(n) if sd > 0 else 0.0\n",
    "        out[\"delta_ci95\"] = float(stats.t.ppf(0.975, df=n - 1) * se) if n > 1 else np.nan\n",
    "        if sd > 0:\n",
    "            out[\"cohen_dz\"] = float(np.mean(diff) / sd)\n",
    "            out[\"ttest_p\"] = float(stats.ttest_rel(v1, v0).pvalue)\n",
    "        if n >= 3 and not np.allclose(diff, 0):\n",
    "            try:\n",
    "                out[\"wilcoxon_p\"] = float(stats.wilcoxon(diff).pvalue)\n",
    "            except ValueError:\n",
    "                pass\n",
    "\n",
    "    gaps0 = [best_epoch_gap(f) for f in r0.get(\"fold_results\", [])]\n",
    "    gaps1 = [best_epoch_gap(f) for f in r1.get(\"fold_results\", [])]\n",
    "    gaps0 = np.asarray([x for x in gaps0 if x is not None], dtype=float)\n",
    "    gaps1 = np.asarray([x for x in gaps1 if x is not None], dtype=float)\n",
    "    if len(gaps0) and len(gaps1):\n",
    "        m = min(len(gaps0), len(gaps1))\n",
    "        out[\"before_gap\"] = float(np.mean(gaps0[:m]))\n",
    "        out[\"after_gap\"] = float(np.mean(gaps1[:m]))\n",
    "        out[\"delta_gap\"] = float(np.mean(gaps1[:m] - gaps0[:m]))\n",
    "\n",
    "    if question == \"source_holdout\":\n",
    "        out[\"caveat\"] = \"Aggregate holdout-run AUC only; not held-out-source vs in-source AUC.\"\n",
    "    if before != resolved_before or after != resolved_after:\n",
    "        out[\"caveat\"] = (out[\"caveat\"] + \" \" if out[\"caveat\"] else \"\") + \"Uses Phase 1 fallback for missing p2a 128 log.\"\n",
    "\n",
    "    if out[\"delta_auc\"] >= 0.01:\n",
    "        out[\"interpretation\"] = \"meaningful improvement\"\n",
    "    elif out[\"delta_auc\"] > 0.002:\n",
    "        out[\"interpretation\"] = \"small improvement\"\n",
    "    elif out[\"delta_auc\"] >= -0.002:\n",
    "        out[\"interpretation\"] = \"negligible change\"\n",
    "    elif out[\"delta_auc\"] > -0.01:\n",
    "        out[\"interpretation\"] = \"small drop\"\n",
    "    else:\n",
    "        out[\"interpretation\"] = \"meaningful drop\"\n",
    "    return out\n",
    "\n",
    "comparisons_df = pd.DataFrame([paired_comparison(*args) for args in PLANNED_COMPARISONS])\n",
    "\n",
    "# Benjamini-Hochberg correction across planned paired t-tests where available.\n",
    "valid_p = comparisons_df[\"ttest_p\"].notna()\n",
    "pvals = comparisons_df.loc[valid_p, \"ttest_p\"].to_numpy()\n",
    "qvals = np.full(len(comparisons_df), np.nan)\n",
    "if len(pvals):\n",
    "    order = np.argsort(pvals)\n",
    "    ranked = pvals[order]\n",
    "    adjusted = np.empty_like(ranked)\n",
    "    m = len(ranked)\n",
    "    running = 1.0\n",
    "    for i in range(m - 1, -1, -1):\n",
    "        running = min(running, ranked[i] * m / (i + 1))\n",
    "        adjusted[i] = running\n",
    "    qvals[np.where(valid_p)[0][order]] = adjusted\n",
    "comparisons_df[\"bh_q\"] = qvals\n",
    "\n",
    "display(\n",
    "    comparisons_df[[\n",
    "        \"section\", \"model\", \"question\", \"contrast\", \"before_auc\", \"after_auc\", \"delta_auc\",\n",
    "        \"delta_ci95\", \"ttest_p\", \"bh_q\", \"wilcoxon_p\", \"cohen_dz\", \"delta_gap\", \"interpretation\", \"caveat\",\n",
    "    ]].style.format({\n",
    "        \"before_auc\": \"{:.4f}\",\n",
    "        \"after_auc\": \"{:.4f}\",\n",
    "        \"delta_auc\": \"{:+.4f}\",\n",
    "        \"delta_ci95\": \"\u00b1{:.4f}\",\n",
    "        \"ttest_p\": \"{:.4f}\",\n",
    "        \"bh_q\": \"{:.4f}\",\n",
    "        \"wilcoxon_p\": \"{:.4f}\",\n",
    "        \"cohen_dz\": \"{:+.2f}\",\n",
    "        \"delta_gap\": \"{:+.4f}\",\n",
    "    }).background_gradient(subset=[\"delta_auc\"], cmap=\"RdYlGn\", vmin=-0.06, vmax=0.06)\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f20e5262",
   "metadata": {},
   "source": [
    "## Visual summary\n",
    "\n",
    "Two plots are most useful for decision-making:\n",
    "\n",
    "- Ranking all conditions by AUC shows the best observed configurations but can overstate duplicated/near-identical runs.\n",
    "- Paired delta plot shows the controlled effect of each preprocessing change and exposes uncertainty."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "42882c6a",
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_df = analysis_df.copy()\n",
    "plot_df[\"display_label\"] = plot_df[\"section\"] + \" | \" + plot_df[\"label\"]\n",
    "plot_df = plot_df.sort_values(\"auc_mean\", ascending=True)\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(11, max(7, 0.35 * len(plot_df))))\n",
    "colors = {\"2A\": \"#4C78A8\", \"2B\": \"#F58518\", \"2C\": \"#54A24B\", \"2D\": \"#E45756\", \"2E\": \"#B279A2\"}\n",
    "ax.barh(\n",
    "    plot_df[\"display_label\"],\n",
    "    plot_df[\"auc_mean\"],\n",
    "    xerr=plot_df[\"auc_std\"],\n",
    "    color=[colors.get(s, \"#999999\") for s in plot_df[\"section\"]],\n",
    "    alpha=0.85,\n",
    ")\n",
    "ax.set_xlim(0.65, 1.0)\n",
    "ax.set_xlabel(\"Mean AUC across CV folds\")\n",
    "ax.set_title(\"Phase 2 Conditions Ranked by AUC\")\n",
    "ax.axvline(0.95, color=\"black\", linewidth=1, linestyle=\"--\", alpha=0.4)\n",
    "for y, (_, row) in enumerate(plot_df.iterrows()):\n",
    "    ax.text(row[\"auc_mean\"] + 0.004, y, f\"{row['auc_mean']:.4f}\", va=\"center\", fontsize=9)\n",
    "fig.tight_layout()\n",
    "fig.savefig(FIGURES_DIR / \"ranked_auc.png\", dpi=200, bbox_inches=\"tight\")\n",
    "plt.show()\n",
    "\n",
    "forest = comparisons_df.copy()\n",
    "forest[\"display\"] = forest[\"section\"] + \" \" + forest[\"model\"] + \" - \" + forest[\"contrast\"]\n",
    "forest = forest.iloc[::-1]\n",
    "fig, ax = plt.subplots(figsize=(11, max(6, 0.45 * len(forest))))\n",
    "y = np.arange(len(forest))\n",
    "ax.errorbar(\n",
    "    forest[\"delta_auc\"], y,\n",
    "    xerr=forest[\"delta_ci95\"],\n",
    "    fmt=\"o\", color=\"#1F2937\", ecolor=\"#6B7280\", capsize=4,\n",
    ")\n",
    "ax.axvline(0, color=\"black\", linewidth=1)\n",
    "ax.axvspan(-0.002, 0.002, color=\"#9CA3AF\", alpha=0.18, label=\"negligible band\")\n",
    "ax.set_yticks(y)\n",
    "ax.set_yticklabels(forest[\"display\"])\n",
    "ax.set_xlabel(\"Delta AUC (after - before), paired by fold\")\n",
    "ax.set_title(\"Planned Phase 2 Effect Estimates\")\n",
    "ax.legend(loc=\"lower right\")\n",
    "fig.tight_layout()\n",
    "fig.savefig(FIGURES_DIR / \"planned_effects.png\", dpi=200, bbox_inches=\"tight\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e063cfc0",
   "metadata": {},
   "source": [
    "## 2A - Shortcut analysis\n",
    "\n",
    "Shortcut checks map to `p2a_*` configs:\n",
    "- `p2a_t1_original` vs `p2a_t2_real_norm` (normalization)\n",
    "- `p2a_t1_original` vs `p2a_t3_holdout_*` (source_holdout)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "910bd5bd",
   "metadata": {},
   "outputs": [],
   "source": [
    "def comparison_subset(section: str, question: str | None = None) -> pd.DataFrame:\n",
    "    df = comparisons_df[comparisons_df[\"section\"].eq(section)].copy()\n",
    "    if question:\n",
    "        df = df[df[\"question\"].eq(question)]\n",
    "    return df\n",
    "\n",
    "\n",
    "def print_comparison_readout(df: pd.DataFrame) -> None:\n",
    "    for _, row in df.iterrows():\n",
    "        print(f\"{row['section']} {row['model']} - {row['contrast']}\")\n",
    "        print(f\"  AUC: {row['before_auc']:.4f} -> {row['after_auc']:.4f} ({row['delta_auc']:+.4f})\")\n",
    "        print(f\"  paired t p={row['ttest_p']:.4f}, BH q={row['bh_q']:.4f}, CI95 delta=\u00b1{row['delta_ci95']:.4f}\")\n",
    "        print(f\"  gap delta: {row['delta_gap']:+.4f}; interpretation: {row['interpretation']}\")\n",
    "        if row['caveat']:\n",
    "            print(f\"  caveat: {row['caveat']}\")\n",
    "        print()\n",
    "\n",
    "print_comparison_readout(comparison_subset(\"2B\", \"resolution\"))\n",
    "\n",
    "res_plot = comparison_subset(\"2B\", \"resolution\")\n",
    "fig, ax = plt.subplots(figsize=(8, 5))\n",
    "for _, row in res_plot.iterrows():\n",
    "    r0, r1 = load_results(row[\"before\"]), load_results(row[\"after\"])\n",
    "    v0, v1 = metric_values(r0), metric_values(r1)\n",
    "    x = [0, 1]\n",
    "    for a, b in zip(v0, v1):\n",
    "        ax.plot(x, [a, b], color=\"#9CA3AF\", alpha=0.7)\n",
    "    ax.plot(x, [v0.mean(), v1.mean()], marker=\"o\", linewidth=3, label=row[\"model\"])\n",
    "ax.set_xticks([0, 1])\n",
    "ax.set_xticklabels([\"128\", \"224\"])\n",
    "ax.set_ylabel(\"AUC\")\n",
    "ax.set_title(\"2B Resolution: Fold-Paired AUC\")\n",
    "ax.legend()\n",
    "fig.tight_layout()\n",
    "fig.savefig(FIGURES_DIR / \"2b_resolution_paired.png\", dpi=200, bbox_inches=\"tight\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "530e8675",
   "metadata": {},
   "source": [
    "## 2B - Resolution impact\n",
    "\n",
    "This section compares 128 vs 224 using `p2b_*_224` and Phase 1 baselines as explicit 128 fallbacks.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "13304d38",
   "metadata": {},
   "outputs": [],
   "source": [
    "print_comparison_readout(comparison_subset(\"2C\", \"facecrop\"))\n",
    "\n",
    "face_df = canonical_runs_df[canonical_runs_df[\"section\"].eq(\"2C\")].copy()\n",
    "fig, axes = plt.subplots(1, 2, figsize=(12, 5), sharey=False)\n",
    "for ax, model in zip(axes, [\"SimpleCNN\", \"ResNet18\"]):\n",
    "    sub = face_df[face_df[\"model\"].eq(model)].sort_values(\"face_crop\")\n",
    "    ax.bar(sub[\"condition\"], sub[\"auc_mean\"], yerr=sub[\"auc_std\"], color=[\"#D97706\", \"#059669\"], alpha=0.85, capsize=5)\n",
    "    ax.set_title(model)\n",
    "    ax.set_ylim(0.70 if model == \"SimpleCNN\" else 0.94, 0.99)\n",
    "    ax.set_ylabel(\"AUC\")\n",
    "    ax.tick_params(axis=\"x\", rotation=20)\n",
    "fig.suptitle(\"2C Facecrop Impact\")\n",
    "fig.tight_layout()\n",
    "fig.savefig(FIGURES_DIR / \"2c_facecrop.png\", dpi=200, bbox_inches=\"tight\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8702d10d",
   "metadata": {},
   "source": [
    "## 2C - Facecrop impact\n",
    "\n",
    "This section compares `p2c_*_facecrop` against the matching `p2b_*_224` no-facecrop baselines.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ec5e03ef",
   "metadata": {},
   "outputs": [],
   "source": [
    "print_comparison_readout(comparison_subset(\"2A\"))\n\n# Inspect whether logs contain the per-source data needed by v2.md.\nsource_audit = []\nfor run in [\"p2a_t1_original\", \"p2a_t3_holdout_text2img\", \"p2a_t3_holdout_inpainting\", \"p2a_t3_holdout_insight\"]:\n    results = load_results(run)\n    has_per_source = False\n    has_records = False\n    example_keys = []\n    if results:\n        for fold in results.get(\"fold_results\", []):\n            tm = fold.get(\"test_metrics\", {})\n            example_keys = sorted(tm.keys())\n            has_per_source = has_per_source or any(k in tm for k in [\"per_source\", \"per_source_metrics\", \"pairwise_source_metrics\", \"source_metrics\", \"pair_metrics\"])\n            has_records = has_records or any(k in fold for k in [\"records\", \"predictions\", \"test_records\"])\n    source_audit.append({\n        \"run\": run,\n        \"has_per_source_metrics\": has_per_source,\n        \"has_prediction_records\": has_records,\n        \"test_metric_keys\": example_keys,\n    })\nsource_audit_df = pd.DataFrame(source_audit)\ndisplay(source_audit_df)\n\nholdout_runs = [\"p2a_t1_original\", \"p2a_t3_holdout_text2img\", \"p2a_t3_holdout_inpainting\", \"p2a_t3_holdout_insight\"]\nholdout_df = canonical_runs_df[canonical_runs_df[\"run\"].isin(holdout_runs)].copy()\nholdout_df[\"delta_vs_all_source\"] = holdout_df[\"auc_mean\"] - float(holdout_df.loc[holdout_df[\"run\"].eq(\"p2a_t1_original\"), \"auc_mean\"].iloc[0])\n\nfig, ax = plt.subplots(figsize=(9, 5))\nax.bar(holdout_df[\"label\"], holdout_df[\"auc_mean\"], yerr=holdout_df[\"auc_std\"], color=\"#54A24B\", alpha=0.85, capsize=5)\nax.set_ylim(0.88, 0.99)\nax.set_ylabel(\"Aggregate AUC\")\nax.set_title(\"2C Source Holdout Proxy: Aggregate Test AUC\")\nax.tick_params(axis=\"x\", rotation=20)\nfor i, (_, row) in enumerate(holdout_df.iterrows()):\n    ax.text(i, row[\"auc_mean\"] + 0.004, f\"{row['delta_vs_all_source']:+.3f}\", ha=\"center\", fontsize=9)\nfig.tight_layout()\nfig.savefig(FIGURES_DIR / \"2c_holdout_proxy.png\", dpi=200, bbox_inches=\"tight\")\nplt.show()\n\nprint(\"Geometry diagnostic evidence:\")\ngeometry_keys = []\nfor run in [\"p2a_t1_original\", \"p2a_t2_real_norm\"]:\n    results = load_results(run)\n    cfg = (results or {}).get(\"config\", {})\n    geometry_keys.append({\n        \"run\": run,\n        \"config_geometry_condition\": cfg.get(\"geometry_condition\"),\n        \"has_matched_geometry_metric\": any(\n            \"geometry\" in str(k).lower() or \"matched\" in str(k).lower()\n            for fold in (results or {}).get(\"fold_results\", [])\n            for k in fold.get(\"test_metrics\", {}).keys()\n        ),\n    })\ndisplay(pd.DataFrame(geometry_keys))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2c3b8812",
   "metadata": {},
   "source": [
    "## 2D / 2E - Augmentation impact and test-set integrity\n",
    "\n",
    "The augmentation question has two parts:\n",
    "\n",
    "- Does light augmentation help at 224 without facecrop?\n",
    "- Does it help once facecrop is enabled?\n",
    "\n",
    "The implementation also needs to guarantee that validation/test evaluation is not stochastic. The preprocessing pipeline keeps stochastic operations behind `self.train`, so `train=False` disables them even if augmentation settings exist."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f11c3257",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"2D (p2d): augmentation without facecrop\")\n",
    "print_comparison_readout(comparison_subset(\"2D\", \"augmentation\"))\n",
    "print(\"2E (p2e): augmentation with facecrop\")\n",
    "print_comparison_readout(comparison_subset(\"2E\", \"facecrop + augmentation\"))\n",
    "\n",
    "aug_sections = comparisons_df[comparisons_df[\"section\"].isin([\"2D\", \"2E\"])].copy()\n",
    "fig, ax = plt.subplots(figsize=(9, 5))\n",
    "labels = aug_sections[\"section\"] + \" \" + aug_sections[\"model\"]\n",
    "ax.bar(labels, aug_sections[\"delta_auc\"], yerr=aug_sections[\"delta_ci95\"], color=[\"#E45756\" if d < 0 else \"#059669\" for d in aug_sections[\"delta_auc\"]], alpha=0.85, capsize=5)\n",
    "ax.axhline(0, color=\"black\", linewidth=1)\n",
    "ax.set_ylabel(\"Delta AUC from adding augmentation\")\n",
    "ax.set_title(\"Augmentation Effects Across Facecrop Conditions\")\n",
    "ax.tick_params(axis=\"x\", rotation=20)\n",
    "fig.tight_layout()\n",
    "fig.savefig(FIGURES_DIR / \"2d_2e_augmentation_effects.png\", dpi=200, bbox_inches=\"tight\")\n",
    "plt.show()\n",
    "\n",
    "# Static and behavioral audit of eval stochasticity.\n",
    "try:\n",
    "    import inspect\n",
    "    from src.preprocessing.pipeline import DFFImagePipeline\n",
    "    from src.evaluation import evaluate as evaluate_module\n",
    "\n",
    "    pipeline_src = inspect.getsource(DFFImagePipeline)\n",
    "    build_transforms_src = inspect.getsource(evaluate_module.build_transforms)\n",
    "    stochastic_guards = {\n",
    "        \"flip_guarded_by_train\": \"if self.train and random.random() < self.hflip_p\" in pipeline_src,\n",
    "        \"rotate_guarded_by_train\": \"if self.train and self.rotation_degrees > 0\" in pipeline_src,\n",
    "        \"color_jitter_returns_when_not_train\": \"if not self.train:\" in pipeline_src,\n",
    "        \"blur_guarded_by_train\": \"if self.train and random.random() < self.blur_p\" in pipeline_src,\n",
    "        \"jpeg_guarded_by_train\": \"if self.train and random.random() < self.jpeg_p\" in pipeline_src,\n",
    "        \"erase_guarded_by_train\": \"if self.train and random.random() < self.erase_p\" in pipeline_src,\n",
    "        \"noise_guarded_by_train\": \"if self.train and random.random() < self.noise_p\" in pipeline_src,\n",
    "        \"cv_transform_uses_train_flag\": \"get_transforms(train=train\" in build_transforms_src,\n",
    "    }\n",
    "    display(pd.DataFrame([stochastic_guards]).T.rename(columns={0: \"passes\"}))\n",
    "except Exception as exc:\n",
    "    print(f\"Could not run transform audit: {exc}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "02e47658",
   "metadata": {},
   "source": [
    "## Decision synthesis\n",
    "\n",
    "This section converts the evidence into Phase 3 settings. It intentionally distinguishes a recommendation from a claim:\n",
    "\n",
    "- Recommendation: choose the setting that is best supported for the next experiment.\n",
    "- Claim: what the current evidence proves. Some Phase 2C claims remain incomplete without per-source or matched-geometry outputs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7034443c",
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_delta(question: str, model: str | None = None, section: str | None = None) -> pd.DataFrame:\n",
    "    df = comparisons_df[comparisons_df[\"question\"].eq(question)].copy()\n",
    "    if model:\n",
    "        df = df[df[\"model\"].eq(model)]\n",
    "    if section:\n",
    "        df = df[df[\"section\"].eq(section)]\n",
    "    return df\n",
    "\n",
    "resolution_resnet = get_delta(\"resolution\", \"ResNet18\").iloc[0]\n",
    "facecrop_resnet = get_delta(\"facecrop\", \"ResNet18\").iloc[0]\n",
    "facecrop_simple = get_delta(\"facecrop\", \"SimpleCNN\").iloc[0]\n",
    "aug_no_crop_resnet = get_delta(\"augmentation\", \"ResNet18\").iloc[0]\n",
    "aug_no_crop_simple = get_delta(\"augmentation\", \"SimpleCNN\").iloc[0]\n",
    "aug_crop_resnet = get_delta(\"facecrop + augmentation\", \"ResNet18\").iloc[0]\n",
    "aug_crop_simple = get_delta(\"facecrop + augmentation\", \"SimpleCNN\").iloc[0]\n",
    "norm = get_delta(\"normalization\", \"ResNet18\").iloc[0]\n",
    "\n",
    "recommendations = [\n",
    "    {\n",
    "        \"choice\": \"resolution\",\n",
    "        \"recommendation\": \"224x224\",\n",
    "        \"evidence\": f\"ResNet18 delta AUC {resolution_resnet.delta_auc:+.4f}; SimpleCNN does not determine Phase 3 capacity.\",\n",
    "        \"confidence\": \"high\" if resolution_resnet.delta_auc > 0.02 else \"medium\",\n",
    "    },\n",
    "    {\n",
    "        \"choice\": \"facecrop\",\n",
    "        \"recommendation\": \"use facecrop\",\n",
    "        \"evidence\": f\"Small positive deltas for both models: SimpleCNN {facecrop_simple.delta_auc:+.4f}, ResNet18 {facecrop_resnet.delta_auc:+.4f}.\",\n",
    "        \"confidence\": \"medium\",\n",
    "    },\n",
    "    {\n",
    "        \"choice\": \"augmentation\",\n",
    "        \"recommendation\": \"do not use light augmentation for Phase 3 at 20% data\",\n",
    "        \"evidence\": f\"SimpleCNN drops {aug_no_crop_simple.delta_auc:+.4f} without facecrop and {aug_crop_simple.delta_auc:+.4f} with facecrop; ResNet18 is neutral/slightly mixed ({aug_no_crop_resnet.delta_auc:+.4f}, {aug_crop_resnet.delta_auc:+.4f}).\",\n",
    "        \"confidence\": \"high for SimpleCNN, medium for ResNet18\",\n",
    "    },\n",
    "    {\n",
    "        \"choice\": \"normalization\",\n",
    "        \"recommendation\": \"ImageNet normalization\",\n",
    "        \"evidence\": f\"Real-train-only normalization delta AUC {norm.delta_auc:+.4f}; no useful gain and less standard for pretrained ResNet.\",\n",
    "        \"confidence\": \"medium\",\n",
    "    },\n",
    "    {\n",
    "        \"choice\": \"shortcut/source claims\",\n",
    "        \"recommendation\": \"do not overclaim; add per-source or prediction exports before final report\",\n",
    "        \"evidence\": \"Current CV logs lack held-out-source vs in-source AUC and matched-geometry test metrics.\",\n",
    "        \"confidence\": \"high\",\n",
    "    },\n",
    "]\n",
    "\n",
    "recommendations_df = pd.DataFrame(recommendations)\n",
    "display(recommendations_df)\n",
    "\n",
    "summary = {\n",
    "    \"phase\": \"phase2\",\n",
    "    \"source_documents\": [\"classifier/v2.md\", \"classifier/impl.md\"],\n",
    "    \"artifact_counts\": {\n",
    "        \"canonical_runs\": int(len(canonical_runs_df)),\n",
    "        \"loaded_canonical_runs\": int(canonical_runs_df[\"log_status\"].isin([\"present\", \"fallback\"]).sum()),\n",
    "        \"fallback_runs_used\": {k: v for k, v in RUN_ALIASES.items() if resolve_run(k) != k},\n",
    "    },\n",
    "    \"recommendations\": recommendations,\n",
    "    \"planned_comparisons\": comparisons_df.replace({np.nan: None}).to_dict(orient=\"records\"),\n",
    "    \"known_gaps\": [\n",
    "        \"Dedicated p2a_*_128 logs are absent/skipped; Phase 1 baselines are used as fallbacks.\",\n",
    "        \"Source holdout logs do not include prediction-level or per-source metrics, so held-out-source AUC vs in-source AUC cannot be computed.\",\n",
    "        \"No matched-geometry evaluation metric is present in p2c logs, so geometry shortcut analysis is incomplete.\",\n",
    "    ],\n",
    "}\n",
    "\n",
    "summary_path = ANALYSIS_DIR / \"phase2_analysis_summary.json\"\n",
    "with summary_path.open(\"w\") as f:\n",
    "    json.dump(summary, f, indent=2)\n",
    "\n",
    "print(f\"Saved summary: {summary_path.relative_to(PROJECT_ROOT)}\")\n",
    "print(f\"Saved figures: {FIGURES_DIR.relative_to(PROJECT_ROOT)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5a337f73",
   "metadata": {},
   "source": [
    "## Report-ready conclusion\n",
    "\n",
    "The strongest Phase 2 result is the resolution effect for ResNet18: moving to 224x224 substantially improves AUC under the controlled CV protocol. Face cropping gives a small positive effect and is reasonable to carry forward, especially because it aligns the model with face evidence rather than background context. Light augmentation is not supported at this 20% data setting: it strongly hurts SimpleCNN and provides no reliable gain for ResNet18, with or without face cropping. ImageNet normalization remains preferable because real-train-only normalization does not improve AUC and is less aligned with pretrained ResNet expectations.\n",
    "\n",
    "Recommended Phase 3 preprocessing: **224x224, facecrop enabled, no light augmentation, ImageNet normalization**.\n",
    "\n",
    "Limitations to fix before the final report: export prediction-level records or per-source pairwise metrics for source holdout, and add the matched-geometry evaluation required by the shortcut-analysis plan. Without those artifacts, Phase 2C can only support a limited shortcut analysis."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "drl",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
 }
--- a/Show More
+++ b/Show More
Author	SHA1	Message	Date
jalf	a5dc093a15	Delete generator/notebooks/_build.py	2026-05-15 00:25:44 +01:00
Johnny Fernandes	1ed2b7a7a0	Final final	2026-05-14 23:19:26 +01:00
Johnny Fernandes	afd26f47d2	Final polish	2026-05-14 21:16:03 +01:00
DiogoCosta18	3bff7eefb0	Notebooks gerador com bom nome + README	2026-05-14 20:20:46 +01:00
DiogoCosta18	f46320f81e	Notebooks Classificador	2026-05-14 16:25:30 +01:00
DiogoCosta18	2062a91985	Notebooks Classificador	2026-05-14 16:20:33 +01:00
DiogoCosta18	9ae334410d	Notebooks Terminados	2026-05-11 17:36:08 +01:00
DiogoCosta18	522a8f8d46	Notebooks classificador terminados	2026-05-06 21:43:32 +01:00
DiogoCosta18	69666d6aa0	Notebooks todos sem resultados fase 4	2026-05-06 20:31:07 +01:00
DiogoCosta18	b5313e3320	Correcoes 5 notebooks	2026-05-06 20:31:06 +01:00
Johnny Fernandes	580808d9ad	Phase 4 classifier results	2026-05-06 19:14:19 +00:00
Johnny Fernandes	e09cbf6a1a	Updating phase 4 classifier	2026-05-06 12:42:22 +01:00
DiogoCosta18	ac3d6e1f6d	Notebooks Classifier	2026-05-06 00:53:07 +01:00
DiogoCosta18	bed9be0b17	Notebooks Classifier	2026-05-06 00:53:06 +01:00
Johnny Fernandes	920cc983c4	Phase 4 classifier	2026-05-05 12:17:18 +01:00
Johnny Fernandes	c89f934e74	Phase 3 classifier results	2026-05-05 11:42:01 +01:00
Johnny Fernandes	66913b2354	Phase 3 classifier	2026-05-05 11:41:07 +01:00