All notebooks, without Phase 4 results

2026-05-06 20:28:29 +01:00
parent b5313e3320
commit 69666d6aa0
16 changed files with 2312 additions and 533 deletions
@@ -702,16 +702,6 @@
    "For the report, this table supports the normalization ablation in Phase 2. The actual decision is made in `04_phase2_analysis.ipynb`, where `real_norm` is compared against ImageNet/default using the saved logs.\n"
   ]
  },
-  {
-   "cell_type": "markdown",
-   "id": "fb95d062",
-   "metadata": {},
-   "source": [
-    "## 5. Reproducibility note\n",
-    "\n",
-    "These checks are not preprocessing operations. They simply confirm that the experiment setup is safe: identity groups do not leak across splits, validation/test transforms are deterministic, source-specific metrics handle edge cases, and config inheritance works as expected. The full test suite lives in `classifier/tests/`.\n"
-   ]
-  },
  {
   "cell_type": "markdown",
   "id": "b02fd790",
@@ -818,22 +818,12 @@
    "The confusion matrices show the AUC story in error-count form. SimpleCNN correctly classifies about `71%` of real images and `70%` of fake images, so it misses many examples in both directions. ResNet18 improves both sides: about `81%` of real images are kept real, and about `88%` of fake images are detected as fake. The most important practical gain is fewer fake images predicted as real (`30%` -> `12%`), although the model still produces some false alarms on real images (`19%`).\n"
   ]
  },
-  {
-   "cell_type": "markdown",
-   "id": "0591bcea",
-   "metadata": {},
-   "source": [
-    "## 6. Reproducibility note\n",
-    "\n",
-    "These checks are not extra results. They simply support the credibility of the Phase 1 comparison: the same config-loading rules, grouped splits, deterministic evaluation transforms, and safe metric handling are covered by tests in `classifier/tests/`. This lets the baseline comparison focus on the intended difference: SimpleCNN versus pretrained ResNet18.\n"
-   ]
-  },
  {
   "cell_type": "markdown",
   "id": "624839d8",
   "metadata": {},
   "source": [
-    "## Report-ready conclusion\n",
+    "## Conclusion\n",
    "\n",
    "Under identical Phase 1 conditions, ResNet18 is the stronger baseline. SimpleCNN reaches AUC `0.7786`, accuracy `0.7039`, and F1 `0.7801`. ResNet18 reaches AUC `0.9366`, accuracy `0.8650`, and F1 `0.9073`. The mean fold-wise AUC improvement is about `+0.1580`.\n",
    "\n",
@@ -7,17 +7,19 @@
   "source": [
    "# 05 - Grad-CAM Interpretability Analysis\n",
    "\n",
-    "This final classifier notebook adds qualitative evidence. It does not train, tune, or reevaluate models. It loads existing configs, logs, and checkpoints, selects deterministic fold-0 examples, and renders fake-logit Grad-CAM overlays. Metrics reported in the report remain the canonical log values; checkpoint-derived candidate scores in this notebook are only used to choose visual examples.\n",
+    "This interpretability notebook adds qualitative evidence after the Phase 2 ablations. It does not train, tune, or reevaluate models. It loads existing configs, logs, and checkpoints, selects deterministic fold-0 examples, and renders fake-logit Grad-CAM overlays. Metrics reported in the report remain the canonical log values; checkpoint-derived candidate scores in this notebook are only used to choose visual examples.\n",
    "\n",
    "Grad-CAM answers a limited question: which spatial regions most support the model's fake-class logit for a selected image? It is useful for sanity checking localization, but it is not proof of causality and it should not override held-out metrics.\n",
    "\n",
    "A note on resolution: the visible Grad-CAM grid comes from the target convolutional feature map, not from the original image. ResNet18's final convolution is very coarse at 224x224 input, so its last-layer CAM is upsampled from a small grid and appears blockier than some SimpleCNN maps. That block size is architectural granularity, not model confidence. The notebook keeps the canonical last-conv CAM and also adds a finer ResNet18 diagnostic view from an earlier layer for readability.\n",
    "\n",
    "Story questions:\n",
-    "- Does the selected final run focus on facial evidence rather than background shortcuts?\n",
+    "- Does the selected Phase 2 run focus on facial evidence rather than background shortcuts?\n",
    "- Does facecrop change what the model can attend to?\n",
    "- Do augmentation and source-holdout runs reveal instability in attention?\n",
-    "- Are errors visually plausible, or do they suggest shortcut behavior?\n"
+    "- Are errors visually plausible, or do they suggest shortcut behavior?\n",
+    "\n",
+    "Roadmap link: after this qualitative check, `06_phase3_model_family_analysis.ipynb` compares stronger pretrained backbones and `07_phase4_data_scaling_analysis.ipynb` records the planned data-scaling analysis.\n"
   ]
  },
  {
@@ -1693,11 +1695,13 @@
   "id": "7a682e64",
   "metadata": {},
   "source": [
-    "## Report-ready conclusion\n",
+    "## Conclusion\n",
    "\n",
-    "Grad-CAM provides a qualitative final check on the classifier story. The selected metric setting remains `p2c_resnet18_facecrop`: 224x224 input, facecrop enabled, no augmentation, and ImageNet/default normalization. The overlays are most reassuring when they concentrate on facial regions, and most cautionary when errors or source-holdout examples show diffuse, background, or artifact-specific attention.\n",
+    "Grad-CAM provides a qualitative check on the Phase 2 classifier story. The selected metric setting remains `p2c_resnet18_facecrop`: 224x224 input, facecrop enabled, no augmentation, and ImageNet/default normalization. The overlays are most reassuring when they concentrate on facial regions, and most cautionary when errors or source-holdout examples show diffuse, background, or artifact-specific attention.\n",
    "\n",
-    "The key limitation from Phase 2 still stands: high in-distribution AUC does not guarantee source-agnostic generalization. The Grad-CAM panels help make that limitation visible, but the source-holdout pairwise AUC values are the primary quantitative evidence.\n"
+    "The key limitation from Phase 2 still stands: high in-distribution AUC does not guarantee source-agnostic generalization. The Grad-CAM panels help make that limitation visible, but the source-holdout pairwise AUC values are the primary quantitative evidence.\n",
+    "\n",
+    "Next: `06_phase3_model_family_analysis.ipynb` asks whether stronger pretrained model families improve on the selected Phase 2 pipeline.\n"
   ]
  }
 ],
@@ -0,0 +1,609 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 07 - Phase 4 Data-Scaling Analysis\n",
+    "\n",
+    "Phase 4 is the natural next question after Phase 3. Once the best model families have been identified at the 20% data setting, the experiment asks whether more training data improves performance and source generalization.\n",
+    "\n",
+    "In the repository state used to create this notebook, Phase 4 configs exist but no `p4_*` logs or checkpoints are present under `classifier/outputs`. For that reason, this notebook is result-gated: it documents the planned experiment matrix now, and the analysis cells automatically switch on when the corresponding logs are added later.\n",
+    "\n",
+    "No Phase 4 metric is claimed unless it is loaded from an existing `classifier/outputs/logs/p4_*.json` file.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Project root: c:\\Users\\diogo\\Documents\\MIA_UP\\2_Semestre\\DRL\\DRL_2\\DRL_PROJ\n"
+     ]
+    }
+   ],
+   "source": [
+    "from __future__ import annotations\n",
+    "\n",
+    "import json\n",
+    "import sys\n",
+    "from pathlib import Path\n",
+    "\n",
+    "import matplotlib.pyplot as plt\n",
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "\n",
+    "\n",
+    "def find_project_root(start: Path | None = None) -> Path:\n",
+    "    \"\"\"Find DRL_PROJ whether the notebook runs from repo root or classifier/notebooks.\"\"\"\n",
+    "    start = Path.cwd() if start is None else Path(start)\n",
+    "    for candidate in [start, *start.parents]:\n",
+    "        if (candidate / \"classifier\").is_dir() and (candidate / \"docs\" / \"DRL_Project.md\").exists():\n",
+    "            return candidate\n",
+    "    raise RuntimeError(\"Could not find DRL_PROJ root. Run this notebook from inside the repository.\")\n",
+    "\n",
+    "\n",
+    "PROJECT_ROOT = find_project_root()\n",
+    "CLASSIFIER_ROOT = PROJECT_ROOT / \"classifier\"\n",
+    "if str(CLASSIFIER_ROOT) not in sys.path:\n",
+    "    sys.path.insert(0, str(CLASSIFIER_ROOT))\n",
+    "\n",
+    "CONFIGS_DIR = CLASSIFIER_ROOT / \"configs\"\n",
+    "LOGS_DIR = CLASSIFIER_ROOT / \"outputs\" / \"logs\"\n",
+    "MODELS_DIR = CLASSIFIER_ROOT / \"outputs\" / \"models\"\n",
+    "FIGURES_DIR = CLASSIFIER_ROOT / \"outputs\" / \"figures\"\n",
+    "ANALYSIS_DIR = CLASSIFIER_ROOT / \"outputs\" / \"analysis\"\n",
+    "FIGURES_DIR.mkdir(parents=True, exist_ok=True)\n",
+    "ANALYSIS_DIR.mkdir(parents=True, exist_ok=True)\n",
+    "\n",
+    "print(f\"Project root: {PROJECT_ROOT}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Planned Phase 4 matrix\n",
+    "\n",
+    "The Phase 4 configs keep the Phase 3 setup: pretrained backbones, 224x224 input, facecropped classifier data, no augmentation, and the same cross-validation protocol. The only variable that changes is the data fraction: 20%, 50%, or 100%.\n",
+    "\n",
+    "The model families selected for scaling are ResNet50, EfficientNet-B0, and ConvNeXt-Tiny. They are the strongest Phase 3 families and represent different capacity/efficiency tradeoffs.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>run</th>\n",
+       "      <th>backbone</th>\n",
+       "      <th>subsample</th>\n",
+       "      <th>image_size</th>\n",
+       "      <th>data_dir</th>\n",
+       "      <th>augment</th>\n",
+       "      <th>pretrained</th>\n",
+       "      <th>epochs</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>p4_convnext_tiny_20pct</td>\n",
+       "      <td>convnext_tiny</td>\n",
+       "      <td>0.2</td>\n",
+       "      <td>224</td>\n",
+       "      <td>cropped/classifier</td>\n",
+       "      <td>False</td>\n",
+       "      <td>True</td>\n",
+       "      <td>15</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>p4_convnext_tiny_50pct</td>\n",
+       "      <td>convnext_tiny</td>\n",
+       "      <td>0.5</td>\n",
+       "      <td>224</td>\n",
+       "      <td>cropped/classifier</td>\n",
+       "      <td>False</td>\n",
+       "      <td>True</td>\n",
+       "      <td>15</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>p4_convnext_tiny_100pct</td>\n",
+       "      <td>convnext_tiny</td>\n",
+       "      <td>1.0</td>\n",
+       "      <td>224</td>\n",
+       "      <td>cropped/classifier</td>\n",
+       "      <td>False</td>\n",
+       "      <td>True</td>\n",
+       "      <td>15</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>p4_efficientnet_b0_20pct</td>\n",
+       "      <td>efficientnet_b0</td>\n",
+       "      <td>0.2</td>\n",
+       "      <td>224</td>\n",
+       "      <td>cropped/classifier</td>\n",
+       "      <td>False</td>\n",
+       "      <td>True</td>\n",
+       "      <td>15</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>5</th>\n",
+       "      <td>p4_efficientnet_b0_50pct</td>\n",
+       "      <td>efficientnet_b0</td>\n",
+       "      <td>0.5</td>\n",
+       "      <td>224</td>\n",
+       "      <td>cropped/classifier</td>\n",
+       "      <td>False</td>\n",
+       "      <td>True</td>\n",
+       "      <td>15</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>p4_efficientnet_b0_100pct</td>\n",
+       "      <td>efficientnet_b0</td>\n",
+       "      <td>1.0</td>\n",
+       "      <td>224</td>\n",
+       "      <td>cropped/classifier</td>\n",
+       "      <td>False</td>\n",
+       "      <td>True</td>\n",
+       "      <td>15</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>7</th>\n",
+       "      <td>p4_resnet50_20pct</td>\n",
+       "      <td>resnet50</td>\n",
+       "      <td>0.2</td>\n",
+       "      <td>224</td>\n",
+       "      <td>cropped/classifier</td>\n",
+       "      <td>False</td>\n",
+       "      <td>True</td>\n",
+       "      <td>15</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>8</th>\n",
+       "      <td>p4_resnet50_50pct</td>\n",
+       "      <td>resnet50</td>\n",
+       "      <td>0.5</td>\n",
+       "      <td>224</td>\n",
+       "      <td>cropped/classifier</td>\n",
+       "      <td>False</td>\n",
+       "      <td>True</td>\n",
+       "      <td>15</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>6</th>\n",
+       "      <td>p4_resnet50_100pct</td>\n",
+       "      <td>resnet50</td>\n",
+       "      <td>1.0</td>\n",
+       "      <td>224</td>\n",
+       "      <td>cropped/classifier</td>\n",
+       "      <td>False</td>\n",
+       "      <td>True</td>\n",
+       "      <td>15</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "                         run         backbone  subsample  image_size  \\\n",
+       "1     p4_convnext_tiny_20pct    convnext_tiny        0.2         224   \n",
+       "2     p4_convnext_tiny_50pct    convnext_tiny        0.5         224   \n",
+       "0    p4_convnext_tiny_100pct    convnext_tiny        1.0         224   \n",
+       "4   p4_efficientnet_b0_20pct  efficientnet_b0        0.2         224   \n",
+       "5   p4_efficientnet_b0_50pct  efficientnet_b0        0.5         224   \n",
+       "3  p4_efficientnet_b0_100pct  efficientnet_b0        1.0         224   \n",
+       "7          p4_resnet50_20pct         resnet50        0.2         224   \n",
+       "8          p4_resnet50_50pct         resnet50        0.5         224   \n",
+       "6         p4_resnet50_100pct         resnet50        1.0         224   \n",
+       "\n",
+       "             data_dir  augment  pretrained  epochs  \n",
+       "1  cropped/classifier    False        True      15  \n",
+       "2  cropped/classifier    False        True      15  \n",
+       "0  cropped/classifier    False        True      15  \n",
+       "4  cropped/classifier    False        True      15  \n",
+       "5  cropped/classifier    False        True      15  \n",
+       "3  cropped/classifier    False        True      15  \n",
+       "7  cropped/classifier    False        True      15  \n",
+       "8  cropped/classifier    False        True      15  \n",
+       "6  cropped/classifier    False        True      15  "
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th>subsample</th>\n",
+       "      <th>0.2</th>\n",
+       "      <th>0.5</th>\n",
+       "      <th>1.0</th>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>backbone</th>\n",
+       "      <th></th>\n",
+       "      <th></th>\n",
+       "      <th></th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>convnext_tiny</th>\n",
+       "      <td>p4_convnext_tiny_20pct</td>\n",
+       "      <td>p4_convnext_tiny_50pct</td>\n",
+       "      <td>p4_convnext_tiny_100pct</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>efficientnet_b0</th>\n",
+       "      <td>p4_efficientnet_b0_20pct</td>\n",
+       "      <td>p4_efficientnet_b0_50pct</td>\n",
+       "      <td>p4_efficientnet_b0_100pct</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>resnet50</th>\n",
+       "      <td>p4_resnet50_20pct</td>\n",
+       "      <td>p4_resnet50_50pct</td>\n",
+       "      <td>p4_resnet50_100pct</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "subsample                             0.2                       0.5  \\\n",
+       "backbone                                                              \n",
+       "convnext_tiny      p4_convnext_tiny_20pct    p4_convnext_tiny_50pct   \n",
+       "efficientnet_b0  p4_efficientnet_b0_20pct  p4_efficientnet_b0_50pct   \n",
+       "resnet50                p4_resnet50_20pct         p4_resnet50_50pct   \n",
+       "\n",
+       "subsample                              1.0  \n",
+       "backbone                                    \n",
+       "convnext_tiny      p4_convnext_tiny_100pct  \n",
+       "efficientnet_b0  p4_efficientnet_b0_100pct  \n",
+       "resnet50                p4_resnet50_100pct  "
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "def load_json(path: Path) -> dict:\n",
+    "    return json.loads(path.read_text(encoding=\"utf-8\"))\n",
+    "\n",
+    "\n",
+    "def resolve_config(path: Path) -> dict:\n",
+    "    cfg = load_json(path)\n",
+    "    parent = cfg.pop(\"extends\", None)\n",
+    "    if parent:\n",
+    "        base = resolve_config(path.parent / parent)\n",
+    "        base.update(cfg)\n",
+    "        cfg = base\n",
+    "    return cfg\n",
+    "\n",
+    "\n",
+    "phase4_configs = []\n",
+    "for path in sorted((CONFIGS_DIR / \"phase4\").glob(\"p4_*.json\")):\n",
+    "    cfg = resolve_config(path)\n",
+    "    phase4_configs.append({\n",
+    "        \"run\": cfg.get(\"run_name\", path.stem),\n",
+    "        \"backbone\": cfg.get(\"backbone\"),\n",
+    "        \"subsample\": cfg.get(\"subsample\"),\n",
+    "        \"image_size\": cfg.get(\"image_size\"),\n",
+    "        \"data_dir\": cfg.get(\"data_dir\"),\n",
+    "        \"augment\": cfg.get(\"augment\", False),\n",
+    "        \"pretrained\": cfg.get(\"pretrained\", True),\n",
+    "        \"epochs\": cfg.get(\"epochs\"),\n",
+    "    })\n",
+    "\n",
+    "config_df = pd.DataFrame(phase4_configs).sort_values([\"backbone\", \"subsample\"])\n",
+    "display(config_df)\n",
+    "\n",
+    "matrix = config_df.pivot(index=\"backbone\", columns=\"subsample\", values=\"run\")\n",
+    "display(matrix)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Result gate\n",
+    "\n",
+    "The cell below checks for `p4_*` logs. If none are present, Phase 4 remains a planned experiment and the notebook stops at the design/status interpretation. If logs are added later, the same notebook will load them and produce scaling curves, source diagnostics, and report-ready conclusions from those logs.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "No Phase 4 result logs found under classifier/outputs/logs.\n",
+      "Missing planned runs:\n",
+      "- p4_convnext_tiny_100pct\n",
+      "- p4_convnext_tiny_20pct\n",
+      "- p4_convnext_tiny_50pct\n",
+      "- p4_efficientnet_b0_100pct\n",
+      "- p4_efficientnet_b0_20pct\n",
+      "- p4_efficientnet_b0_50pct\n",
+      "- p4_resnet50_100pct\n",
+      "- p4_resnet50_20pct\n",
+      "- p4_resnet50_50pct\n"
+     ]
+    }
+   ],
+   "source": [
+    "def log_path(run_name: str) -> Path:\n",
+    "    return LOGS_DIR / f\"{run_name}.json\"\n",
+    "\n",
+    "\n",
+    "def load_run_if_present(run_name: str) -> dict | None:\n",
+    "    path = log_path(run_name)\n",
+    "    return load_json(path) if path.exists() else None\n",
+    "\n",
+    "\n",
+    "def agg_metric(results: dict, metric: str, field: str = \"mean\"):\n",
+    "    return results.get(\"aggregated_metrics\", {}).get(metric, {}).get(field, np.nan)\n",
+    "\n",
+    "\n",
+    "def checkpoint_mb(run_name: str) -> float:\n",
+    "    path = MODELS_DIR / f\"{run_name}_fold0_best.pt\"\n",
+    "    return path.stat().st_size / (1024 * 1024) if path.exists() else np.nan\n",
+    "\n",
+    "\n",
+    "result_rows = []\n",
+    "missing_runs = []\n",
+    "for row in phase4_configs:\n",
+    "    run_name = row[\"run\"]\n",
+    "    results = load_run_if_present(run_name)\n",
+    "    if results is None:\n",
+    "        missing_runs.append(run_name)\n",
+    "        continue\n",
+    "    cfg = {**row, **results.get(\"config\", {})}\n",
+    "    result_rows.append({\n",
+    "        \"run\": run_name,\n",
+    "        \"backbone\": cfg.get(\"backbone\"),\n",
+    "        \"subsample\": cfg.get(\"subsample\", row.get(\"subsample\")),\n",
+    "        \"auc\": agg_metric(results, \"auc_roc\"),\n",
+    "        \"auc_std\": agg_metric(results, \"auc_roc\", \"std\"),\n",
+    "        \"accuracy\": agg_metric(results, \"accuracy\"),\n",
+    "        \"f1\": agg_metric(results, \"f1\"),\n",
+    "        \"checkpoint_mb\": checkpoint_mb(run_name),\n",
+    "    })\n",
+    "\n",
+    "phase4_results_df = pd.DataFrame(result_rows)\n",
+    "if phase4_results_df.empty:\n",
+    "    print(\"No Phase 4 result logs found under classifier/outputs/logs.\")\n",
+    "    print(\"Missing planned runs:\")\n",
+    "    for run_name in missing_runs:\n",
+    "        print(f\"- {run_name}\")\n",
+    "else:\n",
+    "    phase4_results_df = phase4_results_df.sort_values([\"backbone\", \"subsample\"])\n",
+    "    display(\n",
+    "        phase4_results_df.style.format({\n",
+    "            \"subsample\": \"{:.1f}\",\n",
+    "            \"auc\": \"{:.4f}\",\n",
+    "            \"auc_std\": \"{:.4f}\",\n",
+    "            \"accuracy\": \"{:.4f}\",\n",
+    "            \"f1\": \"{:.4f}\",\n",
+    "            \"checkpoint_mb\": \"{:.1f}\",\n",
+    "        })\n",
+    "    )\n",
+    "    if missing_runs:\n",
+    "        print(\"Some planned Phase 4 runs are still missing:\")\n",
+    "        for run_name in missing_runs:\n",
+    "            print(f\"- {run_name}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If the status cell reports no logs, the correct interpretation is: Phase 4 has been designed but not yet analyzed. The report can describe the intended purpose, but it should not include Phase 4 performance claims.\n",
+    "\n",
+    "When logs exist, the next sections answer three questions: does more data improve each backbone, which backbone benefits most, and does scaling reduce source-specific weakness?\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Scaling curves, once results exist\n",
+    "\n",
+    "These cells are guarded. They produce figures only when at least one `p4_*` log is available.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Skipping scaling plots because no p4 logs are available yet.\n"
+     ]
+    }
+   ],
+   "source": [
+    "if phase4_results_df.empty:\n",
+    "    print(\"Skipping scaling plots because no p4 logs are available yet.\")\n",
+    "else:\n",
+    "    fig, axes = plt.subplots(1, 3, figsize=(13, 4), sharex=True)\n",
+    "    for ax, metric, title in zip(axes, [\"auc\", \"accuracy\", \"f1\"], [\"AUC\", \"Accuracy\", \"F1\"]):\n",
+    "        for backbone, sub in phase4_results_df.groupby(\"backbone\"):\n",
+    "            sub = sub.sort_values(\"subsample\")\n",
+    "            ax.plot(sub[\"subsample\"] * 100, sub[metric], marker=\"o\", label=backbone)\n",
+    "        ax.set_xlabel(\"Training data used (%)\")\n",
+    "        ax.set_ylabel(title)\n",
+    "        ax.set_title(f\"Phase 4 {title} scaling\")\n",
+    "        ax.grid(alpha=0.25)\n",
+    "    axes[0].legend(fontsize=8)\n",
+    "    fig.tight_layout()\n",
+    "    fig.savefig(FIGURES_DIR / \"07_phase4_scaling_curves.png\", dpi=180, bbox_inches=\"tight\")\n",
+    "    plt.show()\n",
+    "\n",
+    "    gains = []\n",
+    "    for backbone, sub in phase4_results_df.groupby(\"backbone\"):\n",
+    "        sub = sub.set_index(\"subsample\")\n",
+    "        if 0.2 in sub.index and 1.0 in sub.index:\n",
+    "            gains.append({\n",
+    "                \"backbone\": backbone,\n",
+    "                \"auc_20pct\": sub.loc[0.2, \"auc\"],\n",
+    "                \"auc_100pct\": sub.loc[1.0, \"auc\"],\n",
+    "                \"auc_gain\": sub.loc[1.0, \"auc\"] - sub.loc[0.2, \"auc\"],\n",
+    "                \"accuracy_gain\": sub.loc[1.0, \"accuracy\"] - sub.loc[0.2, \"accuracy\"],\n",
+    "                \"f1_gain\": sub.loc[1.0, \"f1\"] - sub.loc[0.2, \"f1\"],\n",
+    "            })\n",
+    "    gains_df = pd.DataFrame(gains)\n",
+    "    display(gains_df.style.format({\"auc_20pct\": \"{:.4f}\", \"auc_100pct\": \"{:.4f}\", \"auc_gain\": \"{:+.4f}\", \"accuracy_gain\": \"{:+.4f}\", \"f1_gain\": \"{:+.4f}\"}))\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. Source diagnostics, once results exist\n",
+    "\n",
+    "The key Phase 4 question is not only whether global AUC rises but also whether extra data improves the weak source pairs found earlier, especially generalization to the `text2img` and `insight` sources.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Skipping source diagnostics because no p4 logs are available yet.\n"
+     ]
+    }
+   ],
+   "source": [
+    "def pairwise_rows(run_name: str, results: dict) -> list[dict]:\n",
+    "    rows = []\n",
+    "    for pair, metrics in results.get(\"aggregated_pairwise\", {}).items():\n",
+    "        rows.append({\n",
+    "            \"run\": run_name,\n",
+    "            \"pair\": pair,\n",
+    "            \"pairwise_auc\": metrics.get(\"auc_roc\", {}).get(\"mean\", np.nan),\n",
+    "            \"pairwise_f1\": metrics.get(\"f1\", {}).get(\"mean\", np.nan),\n",
+    "            \"pairwise_accuracy\": metrics.get(\"accuracy\", {}).get(\"mean\", np.nan),\n",
+    "        })\n",
+    "    return rows\n",
+    "\n",
+    "\n",
+    "if phase4_results_df.empty:\n",
+    "    print(\"Skipping source diagnostics because no p4 logs are available yet.\")\n",
+    "else:\n",
+    "    pair_rows = []\n",
+    "    for _, row in phase4_results_df.iterrows():\n",
+    "        results = load_run_if_present(row[\"run\"])\n",
+    "        for pair_row in pairwise_rows(row[\"run\"], results):\n",
+    "            pair_rows.append({**pair_row, \"backbone\": row[\"backbone\"], \"subsample\": row[\"subsample\"]})\n",
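+    "    pair_df = pd.DataFrame(pair_rows, columns=[\"run\", \"pair\", \"pairwise_auc\", \"pairwise_f1\", \"pairwise_accuracy\", \"backbone\", \"subsample\"])  # fixed columns avoid a KeyError when logs have no pairwise block\n",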
+    "    display(pair_df.sort_values([\"pair\", \"backbone\", \"subsample\"]).style.format({\"subsample\": \"{:.1f}\", \"pairwise_auc\": \"{:.4f}\", \"pairwise_f1\": \"{:.4f}\", \"pairwise_accuracy\": \"{:.4f}\"}))\n",
+    "\n",
+    "    for pair in sorted(pair_df[\"pair\"].unique()):\n",
+    "        fig, ax = plt.subplots(figsize=(6, 3.6))\n",
+    "        sub_pair = pair_df[pair_df[\"pair\"] == pair]\n",
+    "        for backbone, sub in sub_pair.groupby(\"backbone\"):\n",
+    "            sub = sub.sort_values(\"subsample\")\n",
+    "            ax.plot(sub[\"subsample\"] * 100, sub[\"pairwise_auc\"], marker=\"o\", label=backbone)\n",
+    "        ax.set_title(f\"Phase 4 source-pair scaling: {pair}\")\n",
+    "        ax.set_xlabel(\"Training data used (%)\")\n",
+    "        ax.set_ylabel(\"Pairwise AUC\")\n",
+    "        ax.grid(alpha=0.25)\n",
+    "        ax.legend(fontsize=8)\n",
+    "        fig.tight_layout()\n",
+    "        fig.savefig(FIGURES_DIR / f\"07_phase4_{pair}_scaling.png\", dpi=180, bbox_inches=\"tight\")\n",
+    "        plt.show()\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Conclusion\n",
+    "\n",
+    "At the time this notebook was added, Phase 4 was a planned data-scaling analysis rather than a completed results chapter. The configs define a clean experiment: take the strongest Phase 3 families and train them at 20%, 50%, and 100% of the available facecropped data.\n",
+    "\n",
+    "The report should not make positive or negative Phase 4 performance claims until `p4_*` logs are present. Once those logs exist, the key result to look for is not just a higher global AUC. The stronger claim would be that more data also improves pairwise source behavior, especially for the sources that exposed generalization limits earlier.\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": ".venv",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.10"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
@@ -6,33 +6,32 @@
    {
      "choice": "input size",
      "decision": "224x224",
-      "evidence": "ResNet18 improves from 0.9366 to 0.9660 AUC.",
+      "evidence": "ResNet18 improves from AUC 0.9366 to 0.9660.",
      "confidence": "high"
    },
    {
      "choice": "face crop",
      "decision": "enable",
-      "evidence": "Best run is p2c_resnet18_facecrop with AUC 0.9755.",
+      "evidence": "Best Phase 2 run is p2c_resnet18_facecrop with AUC 0.9755.",
      "confidence": "medium-high"
    },
    {
      "choice": "augmentation",
      "decision": "disable for current 20% setting",
-      "evidence": "p2e_resnet18_facecrop_aug is 0.9737, below facecrop-only 0.9755; SimpleCNN drops sharply.",
+      "evidence": "p2e_resnet18_facecrop_aug reaches AUC 0.9737, below facecrop-only 0.9755; SimpleCNN drops sharply.",
      "confidence": "low"
    },
    {
      "choice": "normalization",
      "decision": "ImageNet/default",
-      "evidence": "real_norm is only +0.0018 and is less aligned with pretrained weights.",
+      "evidence": "real_norm is only +0.0018 AUC and is less aligned with pretrained ImageNet weights.",
      "confidence": "medium"
    },
    {
      "choice": "source generalization",
      "decision": "report as limitation and diagnostic target",
-      "evidence": "Holdout text2img and insight pairwise AUC drop to 0.7595 and 0.8421.",
+      "evidence": "Holding out text2img and insight drops pairwise AUC to 0.7595 and 0.8421.",
      "confidence": "high"
    }
-  ],
-  "note": "Generated by 04_phase2_analysis.ipynb when this cell is executed."
+  ]
 }