{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "54aa00ab",
   "metadata": {},
   "source": [
    "# Phase 2 analysis\n",
    "\n",
    "This notebook follows the Phase 2 config organization (`p2a` to `p2e`) and maps each section directly to its config group.\n",
    "It separates three concerns:\n",
    "\n",
    "1. **Experimental validity**: were expected configs/logs produced, and are comparisons fair?\n",
    "2. **Evidence**: what do the 5-fold CV metrics support?\n",
    "3. **Decision**: which preprocessing choices should move into Phase 3?\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "734db3ee",
   "metadata": {},
   "source": [
    "## Questions\n",
    "\n",
    "| Section | Config group | Question | Required evidence |\n",
    "|---|---|---|---|\n",
    "| 2A | `p2a_*` | Shortcut analysis: normalization + source holdout | `p2a_t1_original`, `p2a_t2_real_norm`, `p2a_t3_holdout_*` |\n",
    "| 2B | `p2b_*` | Does 224 improve over 128? | `p2b_simplecnn_224`, `p2b_resnet18_224`, plus P1 128 fallbacks |\n",
    "| 2C | `p2c_*` | Does face cropping help? | `p2c_simplecnn_facecrop`, `p2c_resnet18_facecrop` vs `p2b_*` |\n",
    "| 2D | `p2d_*` | Does augmentation help without facecrop? | `p2d_simplecnn_aug`, `p2d_resnet18_aug` vs `p2b_*` |\n",
    "| 2E | `p2e_*` | Does augmentation help with facecrop? | `p2e_simplecnn_facecrop_aug`, `p2e_resnet18_facecrop_aug` vs `p2c_*` |\n",
    "\n",
    "Decision criteria used here:\n",
    "\n",
    "- Prefer changes with a positive mean AUC delta and no worsening of the train/validation gap.\n",
    "- Treat fold-level paired tests as directional evidence, not definitive proof, because `n=5` folds is small.\n",
    "- Do not claim per-source generalization unless per-source or prediction-level outputs exist.\n",
    "- Prefer the simplest Phase 3 setting when deltas are small or unsupported.\n"
   ]
  },
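  {
   "cell_type": "markdown",
   "id": "paired-delta-sketch",
   "metadata": {},
   "source": [
    "A minimal sketch of the fold-paired test used throughout this notebook (illustrative AUC values, not results from this repository):\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from scipy import stats\n",
    "\n",
    "before = np.array([0.90, 0.91, 0.89, 0.92, 0.90])  # per-fold AUC, baseline\n",
    "after = np.array([0.92, 0.93, 0.90, 0.94, 0.91])   # per-fold AUC, changed setting\n",
    "diff = after - before                              # paired by fold index\n",
    "print(round(diff.mean(), 3))                       # mean delta AUC -> 0.016\n",
    "print(stats.ttest_rel(after, before).pvalue)       # paired t-test on the deltas\n",
    "```\n",
    "\n",
    "With only five paired folds the test has little power, which is why a small p-value is treated as directional support rather than proof.\n"
   ]
  },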
|
|
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1f4c04b3",
   "metadata": {},
   "outputs": [],
   "source": [
    "from __future__ import annotations\n",
    "\n",
    "import json\n",
    "import math\n",
    "import os\n",
    "import sys\n",
    "from dataclasses import dataclass\n",
    "from pathlib import Path\n",
    "from typing import Any\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "from scipy import stats\n",
    "\n",
    "try:\n",
    "    from IPython.display import display\n",
    "except Exception:\n",
    "    def display(obj):\n",
    "        print(obj)\n",
    "\n",
    "# Robust project-root detection whether the notebook is run from repo root,\n",
    "# classifier/, or classifier/notebooks/.\n",
    "def find_project_root(start: Path | None = None) -> Path:\n",
    "    start = (start or Path.cwd()).resolve()\n",
    "    for candidate in [start, *start.parents]:\n",
    "        if (candidate / \"classifier\" / \"v2.md\").exists() and (candidate / \"classifier\" / \"impl.md\").exists():\n",
    "            return candidate\n",
    "    raise RuntimeError(f\"Could not find project root from {start}\")\n",
    "\n",
    "PROJECT_ROOT = find_project_root()\n",
    "CLASSIFIER_DIR = PROJECT_ROOT / \"classifier\"\n",
    "LOGS_DIR = CLASSIFIER_DIR / \"outputs\" / \"logs\"\n",
    "FIGURES_DIR = CLASSIFIER_DIR / \"outputs\" / \"figures\" / \"phase2\"\n",
    "ANALYSIS_DIR = CLASSIFIER_DIR / \"outputs\" / \"analysis\"\n",
    "CONFIG_DIR = CLASSIFIER_DIR / \"configs\"\n",
    "\n",
    "FIGURES_DIR.mkdir(parents=True, exist_ok=True)\n",
    "ANALYSIS_DIR.mkdir(parents=True, exist_ok=True)\n",
    "\n",
    "if str(CLASSIFIER_DIR) not in sys.path:\n",
    "    sys.path.insert(0, str(CLASSIFIER_DIR))\n",
    "\n",
    "sns.set_theme(style=\"whitegrid\", context=\"notebook\")\n",
    "plt.rcParams.update({\n",
    "    \"figure.figsize\": (12, 7),\n",
    "    \"axes.spines.top\": False,\n",
    "    \"axes.spines.right\": False,\n",
    "})\n",
    "\n",
    "print(f\"Project root: {PROJECT_ROOT}\")\n",
    "print(f\"Logs: {LOGS_DIR}\")\n",
    "print(f\"Figures: {FIGURES_DIR}\")"
   ]
  },
|
|
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "24830212",
   "metadata": {},
   "outputs": [],
   "source": [
    "@dataclass(frozen=True)\n",
    "class RunSpec:\n",
    "    run: str\n",
    "    label: str\n",
    "    section: str\n",
    "    model: str\n",
    "    condition: str\n",
    "    intended_role: str\n",
    "    fallback_for: str | None = None\n",
    "\n",
    "RUN_SPECS = [\n",
    "    # 2A: shortcut analysis (normalization + source holdout), ResNet18 only.\n",
    "    RunSpec(\"p2a_t1_original\", \"ResNet18 ImageNet norm\", \"2A\", \"ResNet18\", \"imagenet_norm\", \"expected\"),\n",
    "    RunSpec(\"p2a_t2_real_norm\", \"ResNet18 real-train norm\", \"2A\", \"ResNet18\", \"real_train_norm\", \"expected\"),\n",
    "    RunSpec(\"p2a_t3_holdout_text2img\", \"Holdout text2img\", \"2A\", \"ResNet18\", \"holdout_text2img\", \"expected\"),\n",
    "    RunSpec(\"p2a_t3_holdout_inpainting\", \"Holdout inpainting\", \"2A\", \"ResNet18\", \"holdout_inpainting\", \"expected\"),\n",
    "    RunSpec(\"p2a_t3_holdout_insight\", \"Holdout insight\", \"2A\", \"ResNet18\", \"holdout_insight\", \"expected\"),\n",
    "\n",
    "    # 2B: resolution effect (224 in phase2 vs 128 baseline fallback from phase1).\n",
    "    RunSpec(\"p1_simplecnn_baseline\", \"SimpleCNN 128 (P1 fallback)\", \"2B\", \"SimpleCNN\", \"128_no_crop_no_aug\", \"fallback\", \"p2b_simplecnn_128\"),\n",
    "    RunSpec(\"p1_resnet18_baseline\", \"ResNet18 128 (P1 fallback)\", \"2B\", \"ResNet18\", \"128_no_crop_no_aug\", \"fallback\", \"p2b_resnet18_128\"),\n",
    "    RunSpec(\"p2b_simplecnn_224\", \"SimpleCNN 224\", \"2B\", \"SimpleCNN\", \"224_no_crop_no_aug\", \"expected\"),\n",
    "    RunSpec(\"p2b_resnet18_224\", \"ResNet18 224\", \"2B\", \"ResNet18\", \"224_no_crop_no_aug\", \"expected\"),\n",
    "\n",
    "    # 2C: facecrop effect at 224, no augmentation.\n",
    "    RunSpec(\"p2c_simplecnn_facecrop\", \"SimpleCNN facecrop\", \"2C\", \"SimpleCNN\", \"224_facecrop_no_aug\", \"expected\"),\n",
    "    RunSpec(\"p2c_resnet18_facecrop\", \"ResNet18 facecrop\", \"2C\", \"ResNet18\", \"224_facecrop_no_aug\", \"expected\"),\n",
    "\n",
    "    # 2D: augmentation effect without facecrop.\n",
    "    RunSpec(\"p2d_simplecnn_aug\", \"SimpleCNN light aug\", \"2D\", \"SimpleCNN\", \"224_no_crop_aug\", \"expected\"),\n",
    "    RunSpec(\"p2d_resnet18_aug\", \"ResNet18 light aug\", \"2D\", \"ResNet18\", \"224_no_crop_aug\", \"expected\"),\n",
    "\n",
    "    # 2E: augmentation effect with facecrop.\n",
    "    RunSpec(\"p2e_simplecnn_facecrop_aug\", \"SimpleCNN facecrop + aug\", \"2E\", \"SimpleCNN\", \"224_facecrop_aug\", \"expected\"),\n",
    "    RunSpec(\"p2e_resnet18_facecrop_aug\", \"ResNet18 facecrop + aug\", \"2E\", \"ResNet18\", \"224_facecrop_aug\", \"expected\"),\n",
    "]\n",
    "\n",
    "# Use these aliases when synthetic 128 run IDs are requested for 2B.\n",
    "RUN_ALIASES = {\n",
    "    \"p2b_simplecnn_128\": \"p1_simplecnn_baseline\",\n",
    "    \"p2b_resnet18_128\": \"p1_resnet18_baseline\",\n",
    "}\n",
    "\n",
    "PLANNED_COMPARISONS = [\n",
    "    (\"2A\", \"ResNet18\", \"normalization\", \"p2a_t1_original\", \"p2a_t2_real_norm\", \"real_norm - imagenet_norm\"),\n",
    "    (\"2A\", \"ResNet18\", \"source_holdout\", \"p2a_t1_original\", \"p2a_t3_holdout_text2img\", \"holdout text2img - all-source\"),\n",
    "    (\"2A\", \"ResNet18\", \"source_holdout\", \"p2a_t1_original\", \"p2a_t3_holdout_inpainting\", \"holdout inpainting - all-source\"),\n",
    "    (\"2A\", \"ResNet18\", \"source_holdout\", \"p2a_t1_original\", \"p2a_t3_holdout_insight\", \"holdout insight - all-source\"),\n",
    "\n",
    "    (\"2B\", \"SimpleCNN\", \"resolution\", \"p2b_simplecnn_128\", \"p2b_simplecnn_224\", \"224 - 128\"),\n",
    "    (\"2B\", \"ResNet18\", \"resolution\", \"p2b_resnet18_128\", \"p2b_resnet18_224\", \"224 - 128\"),\n",
    "\n",
    "    (\"2C\", \"SimpleCNN\", \"facecrop\", \"p2b_simplecnn_224\", \"p2c_simplecnn_facecrop\", \"facecrop - no facecrop\"),\n",
    "    (\"2C\", \"ResNet18\", \"facecrop\", \"p2b_resnet18_224\", \"p2c_resnet18_facecrop\", \"facecrop - no facecrop\"),\n",
    "\n",
    "    (\"2D\", \"SimpleCNN\", \"augmentation\", \"p2b_simplecnn_224\", \"p2d_simplecnn_aug\", \"light aug - no aug\"),\n",
    "    (\"2D\", \"ResNet18\", \"augmentation\", \"p2b_resnet18_224\", \"p2d_resnet18_aug\", \"light aug - no aug\"),\n",
    "\n",
    "    (\"2E\", \"SimpleCNN\", \"facecrop + augmentation\", \"p2c_simplecnn_facecrop\", \"p2e_simplecnn_facecrop_aug\", \"facecrop+aug - facecrop\"),\n",
    "    (\"2E\", \"ResNet18\", \"facecrop + augmentation\", \"p2c_resnet18_facecrop\", \"p2e_resnet18_facecrop_aug\", \"facecrop+aug - facecrop\"),\n",
    "]\n"
   ]
  },
|
|
  {
   "cell_type": "markdown",
   "id": "6e2ccd27",
   "metadata": {},
   "source": [
    "## Evidence audit\n",
    "\n",
    "Before comparing numbers, check whether the planned artifacts exist. Dedicated `p2b_*_128` configs/logs are skipped or absent in this repository, so this notebook uses the matching Phase 1 baselines as explicit fallbacks for the 128 vs 224 resolution test."
   ]
  },
|
|
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "53356e8b",
   "metadata": {},
   "outputs": [],
   "source": [
    "def load_json(path: Path) -> dict[str, Any] | None:\n",
    "    if not path.exists():\n",
    "        return None\n",
    "    with path.open() as f:\n",
    "        return json.load(f)\n",
    "\n",
    "\n",
    "def config_path_for(run: str) -> Path | None:\n",
    "    candidates = [\n",
    "        CONFIG_DIR / \"phase2\" / f\"{run}.json\",\n",
    "        CONFIG_DIR / \"phase2\" / f\"{run}.json.skip\",\n",
    "        CONFIG_DIR / \"phase1\" / f\"{run}.json\",\n",
    "        CONFIG_DIR / \"phase1\" / f\"{run}.json.skip\",\n",
    "    ]\n",
    "    return next((p for p in candidates if p.exists()), None)\n",
    "\n",
    "\n",
    "def log_path_for(run: str) -> Path:\n",
    "    return LOGS_DIR / f\"{run}.json\"\n",
    "\n",
    "\n",
    "def resolve_run(run: str) -> str:\n",
    "    return run if log_path_for(run).exists() else RUN_ALIASES.get(run, run)\n",
    "\n",
    "\n",
    "def load_results(run: str) -> dict[str, Any] | None:\n",
    "    resolved = resolve_run(run)\n",
    "    return load_json(log_path_for(resolved))\n",
    "\n",
    "\n",
    "def metric_values(results: dict[str, Any], metric: str = \"auc_roc\") -> np.ndarray:\n",
    "    vals = []\n",
    "    for fold in results.get(\"fold_results\", []):\n",
    "        value = fold.get(\"test_metrics\", {}).get(metric)\n",
    "        if value is not None:\n",
    "            vals.append(float(value))\n",
    "    return np.asarray(vals, dtype=float)\n",
    "\n",
    "\n",
    "def best_epoch_gap(fold: dict[str, Any], metric: str = \"auc\") -> float | None:\n",
    "    hist = fold.get(\"history\", {})\n",
    "    train_key = f\"train_{metric}\"\n",
    "    val_key = f\"val_{metric}\"\n",
    "    train = hist.get(train_key, [])\n",
    "    val = hist.get(val_key, [])\n",
    "    if not train or not val:\n",
    "        return None\n",
    "    idx = int(np.nanargmax(np.asarray(val, dtype=float)))\n",
    "    return float(train[idx] - val[idx])\n",
    "\n",
    "\n",
    "def final_epoch_gap(fold: dict[str, Any], metric: str = \"auc\") -> float | None:\n",
    "    hist = fold.get(\"history\", {})\n",
    "    train = hist.get(f\"train_{metric}\", [])\n",
    "    val = hist.get(f\"val_{metric}\", [])\n",
    "    if not train or not val:\n",
    "        return None\n",
    "    return float(train[-1] - val[-1])\n",
    "\n",
    "\n",
    "def summarize_run(spec: RunSpec) -> dict[str, Any]:\n",
    "    resolved = resolve_run(spec.run)\n",
    "    results = load_results(spec.run)\n",
    "    config_path = config_path_for(spec.run) or config_path_for(resolved)\n",
    "    cfg = load_json(config_path) if config_path else None\n",
    "\n",
    "    row = {\n",
    "        \"section\": spec.section,\n",
    "        \"run\": spec.run,\n",
    "        \"resolved_run\": resolved,\n",
    "        \"label\": spec.label,\n",
    "        \"model\": spec.model,\n",
    "        \"condition\": spec.condition,\n",
    "        \"role\": spec.intended_role,\n",
    "        \"fallback_for\": spec.fallback_for,\n",
    "        \"config_path\": str(config_path.relative_to(PROJECT_ROOT)) if config_path else None,\n",
    "        \"config_status\": \"present\" if config_path and config_path.suffix == \".json\" else (\"skipped\" if config_path else \"missing\"),\n",
    "        \"log_status\": \"present\" if log_path_for(spec.run).exists() else (\"fallback\" if resolved != spec.run and log_path_for(resolved).exists() else \"missing\"),\n",
    "        \"n_folds\": None,\n",
    "        \"auc_mean\": np.nan,\n",
    "        \"auc_std\": np.nan,\n",
    "        \"acc_mean\": np.nan,\n",
    "        \"f1_mean\": np.nan,\n",
    "        \"gap_best_mean\": np.nan,\n",
    "        \"gap_final_mean\": np.nan,\n",
    "        \"image_size\": None,\n",
    "        \"face_crop\": None,\n",
    "        \"augment\": None,\n",
    "        \"normalization\": None,\n",
    "        \"train_sources\": None,\n",
    "        \"eval_sources\": None,\n",
    "    }\n",
    "\n",
    "    if cfg:\n",
    "        row.update({\n",
    "            \"image_size\": cfg.get(\"image_size\"),\n",
    "            \"face_crop\": cfg.get(\"face_crop\"),\n",
    "            \"augment\": \"light\" if isinstance(cfg.get(\"augment\"), dict) else cfg.get(\"augment\"),\n",
    "            \"normalization\": cfg.get(\"normalization\"),\n",
    "            \"train_sources\": tuple(cfg.get(\"train_sources\", [])) or None,\n",
    "            \"eval_sources\": tuple(cfg.get(\"eval_sources\", [])) or None,\n",
    "        })\n",
    "\n",
    "    if results:\n",
    "        agg = results.get(\"aggregated_metrics\", {})\n",
    "        row.update({\n",
    "            \"n_folds\": results.get(\"n_folds\"),\n",
    "            \"auc_mean\": agg.get(\"auc_roc\", {}).get(\"mean\", np.nan),\n",
    "            \"auc_std\": agg.get(\"auc_roc\", {}).get(\"std\", np.nan),\n",
    "            \"acc_mean\": agg.get(\"accuracy\", {}).get(\"mean\", np.nan),\n",
    "            \"f1_mean\": agg.get(\"f1\", {}).get(\"mean\", np.nan),\n",
    "        })\n",
    "        best_gaps = [best_epoch_gap(f) for f in results.get(\"fold_results\", [])]\n",
    "        final_gaps = [final_epoch_gap(f) for f in results.get(\"fold_results\", [])]\n",
    "        best_gaps = [x for x in best_gaps if x is not None]\n",
    "        final_gaps = [x for x in final_gaps if x is not None]\n",
    "        row[\"gap_best_mean\"] = float(np.mean(best_gaps)) if best_gaps else np.nan\n",
    "        row[\"gap_final_mean\"] = float(np.mean(final_gaps)) if final_gaps else np.nan\n",
    "\n",
    "    return row\n",
    "\n",
    "runs_df = pd.DataFrame([summarize_run(spec) for spec in RUN_SPECS])\n",
    "\n",
    "# Prefer canonical rows for analysis: keep fallbacks only where expected rows are missing.\n",
    "canonical_runs_df = runs_df[runs_df[\"role\"] == \"expected\"].copy()\n",
    "for missing_run, fallback_run in RUN_ALIASES.items():\n",
    "    # The aliased 2B 128 runs have no \"expected\" spec of their own, so also add\n",
    "    # the Phase 1 fallback whenever no canonical row exists for the alias.\n",
    "    mask = canonical_runs_df[\"run\"].eq(missing_run) & canonical_runs_df[\"log_status\"].eq(\"missing\")\n",
    "    if mask.any() or not canonical_runs_df[\"run\"].eq(missing_run).any():\n",
    "        fallback = runs_df[runs_df[\"run\"].eq(fallback_run)].copy()\n",
    "        if not fallback.empty:\n",
    "            fallback.loc[:, \"run\"] = missing_run\n",
    "            fallback.loc[:, \"label\"] = fallback.iloc[0][\"label\"].replace(\" (P1 fallback)\", \"\") + \" [P1 fallback]\"\n",
    "            fallback.loc[:, \"role\"] = \"expected_via_fallback\"\n",
    "            canonical_runs_df = pd.concat([canonical_runs_df[~mask], fallback], ignore_index=True)\n",
    "\n",
    "print(\"Artifact audit:\")\n",
    "display(runs_df[[\"section\", \"run\", \"resolved_run\", \"role\", \"config_status\", \"log_status\", \"n_folds\"]].sort_values([\"section\", \"run\"]))\n",
    "\n",
    "missing_expected = runs_df[(runs_df[\"role\"] == \"expected\") & (runs_df[\"log_status\"] == \"missing\")][\"run\"].tolist()\n",
    "print(f\"\\nExpected runs with no direct log: {missing_expected or 'none'}\")\n",
    "print(\"Fallbacks used:\", {k: v for k, v in RUN_ALIASES.items() if k in missing_expected})"
   ]
  },
|
|
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b21a9faf",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Protocol consistency audit from loaded logs/configs.\n",
    "protocol_fields = [\n",
    "    \"cv_folds\", \"batch_size\", \"early_stopping_patience\", \"seed\", \"subsample\",\n",
    "    \"lr\", \"weight_decay\", \"T_max\", \"epochs\",\n",
    "]\n",
    "\n",
    "protocol_rows = []\n",
    "for _, row in canonical_runs_df.iterrows():\n",
    "    results = load_results(row[\"run\"])\n",
    "    cfg = (results or {}).get(\"config\", {})\n",
    "    protocol_rows.append({\"run\": row[\"run\"], **{k: cfg.get(k) for k in protocol_fields}})\n",
    "\n",
    "protocol_df = pd.DataFrame(protocol_rows)\n",
    "display(protocol_df)\n",
    "\n",
    "print(\"Field variability across loaded canonical runs:\")\n",
    "for field in protocol_fields:\n",
    "    vals = sorted({str(v) for v in protocol_df[field].dropna().unique()})\n",
    "    print(f\"  {field:28s}: {vals}\")"
   ]
  },
|
|
  {
   "cell_type": "markdown",
   "id": "6802bcd9",
   "metadata": {},
   "source": [
    "## Results table\n",
    "\n",
    "The table below is ranked by AUC and includes two gap estimates:\n",
    "\n",
    "- `gap_best_mean`: train AUC minus validation AUC at each fold's best validation epoch. This is closest to the saved best checkpoint.\n",
    "- `gap_final_mean`: train AUC minus validation AUC at the final epoch. This is useful for diagnosing late overfit but is less aligned with test evaluation."
   ]
  },
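  {
   "cell_type": "markdown",
   "id": "gap-toy-example",
   "metadata": {},
   "source": [
    "A toy history makes the difference between the two gaps concrete (illustrative numbers only, not real training curves):\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "train_auc = [0.80, 0.90, 0.97, 0.99]\n",
    "val_auc = [0.78, 0.88, 0.90, 0.86]  # peaks at epoch 2, degrades after\n",
    "\n",
    "best = int(np.nanargmax(val_auc))\n",
    "gap_best = train_auc[best] - val_auc[best]\n",
    "gap_final = train_auc[-1] - val_auc[-1]\n",
    "print(best, round(gap_best, 2), round(gap_final, 2))  # 2 0.07 0.13\n",
    "```\n",
    "\n",
    "Here the final-epoch gap (0.13) flags late overfitting that the best-epoch gap (0.07) understates, which is why the table reports both.\n"
   ]
  },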
|
|
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "be1ec0ba",
   "metadata": {},
   "outputs": [],
   "source": [
    "analysis_df = canonical_runs_df[canonical_runs_df[\"log_status\"].isin([\"present\", \"fallback\"])].copy()\n",
    "analysis_df = analysis_df.sort_values(\"auc_mean\", ascending=False)\n",
    "\n",
    "cols = [\n",
    "    \"section\", \"label\", \"run\", \"resolved_run\", \"model\", \"condition\", \"log_status\",\n",
    "    \"auc_mean\", \"auc_std\", \"acc_mean\", \"f1_mean\", \"gap_best_mean\", \"gap_final_mean\",\n",
    "]\n",
    "\n",
    "display(\n",
    "    analysis_df[cols]\n",
    "    .style.format({\n",
    "        \"auc_mean\": \"{:.4f}\",\n",
    "        \"auc_std\": \"{:.4f}\",\n",
    "        \"acc_mean\": \"{:.4f}\",\n",
    "        \"f1_mean\": \"{:.4f}\",\n",
    "        \"gap_best_mean\": \"{:+.4f}\",\n",
    "        \"gap_final_mean\": \"{:+.4f}\",\n",
    "    })\n",
    "    .background_gradient(subset=[\"auc_mean\"], cmap=\"Greens\")\n",
    ")"
   ]
  },
|
|
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1e0d21c1",
   "metadata": {},
   "outputs": [],
   "source": [
    "def paired_comparison(section: str, model: str, question: str, before: str, after: str, contrast: str) -> dict[str, Any]:\n",
    "    r0 = load_results(before)\n",
    "    r1 = load_results(after)\n",
    "    resolved_before = resolve_run(before)\n",
    "    resolved_after = resolve_run(after)\n",
    "    out = {\n",
    "        \"section\": section,\n",
    "        \"model\": model,\n",
    "        \"question\": question,\n",
    "        \"before\": before,\n",
    "        \"after\": after,\n",
    "        \"resolved_before\": resolved_before,\n",
    "        \"resolved_after\": resolved_after,\n",
    "        \"contrast\": contrast,\n",
    "        \"status\": \"ok\" if r0 and r1 else \"missing\",\n",
    "        \"n\": 0,\n",
    "        \"before_auc\": np.nan,\n",
    "        \"after_auc\": np.nan,\n",
    "        \"delta_auc\": np.nan,\n",
    "        \"delta_ci95\": np.nan,\n",
    "        \"ttest_p\": np.nan,\n",
    "        \"wilcoxon_p\": np.nan,\n",
    "        \"cohen_dz\": np.nan,\n",
    "        \"before_gap\": np.nan,\n",
    "        \"after_gap\": np.nan,\n",
    "        \"delta_gap\": np.nan,\n",
    "        \"interpretation\": \"insufficient data\",\n",
    "        \"caveat\": \"\",\n",
    "    }\n",
    "    if not (r0 and r1):\n",
    "        return out\n",
    "\n",
    "    v0 = metric_values(r0, \"auc_roc\")\n",
    "    v1 = metric_values(r1, \"auc_roc\")\n",
    "    n = min(len(v0), len(v1))\n",
    "    v0, v1 = v0[:n], v1[:n]\n",
    "    diff = v1 - v0\n",
    "\n",
    "    out.update({\n",
    "        \"n\": n,\n",
    "        \"before_auc\": float(np.mean(v0)),\n",
    "        \"after_auc\": float(np.mean(v1)),\n",
    "        \"delta_auc\": float(np.mean(diff)),\n",
    "    })\n",
    "\n",
    "    if n >= 2:\n",
    "        sd = float(np.std(diff, ddof=1))\n",
    "        se = sd / math.sqrt(n) if sd > 0 else 0.0\n",
    "        out[\"delta_ci95\"] = float(stats.t.ppf(0.975, df=n - 1) * se)\n",
    "        if sd > 0:\n",
    "            out[\"cohen_dz\"] = float(np.mean(diff) / sd)\n",
    "            out[\"ttest_p\"] = float(stats.ttest_rel(v1, v0).pvalue)\n",
    "    if n >= 3 and not np.allclose(diff, 0):\n",
    "        try:\n",
    "            out[\"wilcoxon_p\"] = float(stats.wilcoxon(diff).pvalue)\n",
    "        except ValueError:\n",
    "            pass\n",
    "\n",
    "    gaps0 = [best_epoch_gap(f) for f in r0.get(\"fold_results\", [])]\n",
    "    gaps1 = [best_epoch_gap(f) for f in r1.get(\"fold_results\", [])]\n",
    "    gaps0 = np.asarray([x for x in gaps0 if x is not None], dtype=float)\n",
    "    gaps1 = np.asarray([x for x in gaps1 if x is not None], dtype=float)\n",
    "    if len(gaps0) and len(gaps1):\n",
    "        m = min(len(gaps0), len(gaps1))\n",
    "        out[\"before_gap\"] = float(np.mean(gaps0[:m]))\n",
    "        out[\"after_gap\"] = float(np.mean(gaps1[:m]))\n",
    "        out[\"delta_gap\"] = float(np.mean(gaps1[:m] - gaps0[:m]))\n",
    "\n",
    "    if question == \"source_holdout\":\n",
    "        out[\"caveat\"] = \"Aggregate holdout-run AUC only; not held-out-source vs in-source AUC.\"\n",
    "    if before != resolved_before or after != resolved_after:\n",
    "        out[\"caveat\"] = (out[\"caveat\"] + \" \" if out[\"caveat\"] else \"\") + \"Uses Phase 1 fallback for missing p2b 128 log.\"\n",
    "\n",
    "    if out[\"delta_auc\"] >= 0.01:\n",
    "        out[\"interpretation\"] = \"meaningful improvement\"\n",
    "    elif out[\"delta_auc\"] > 0.002:\n",
    "        out[\"interpretation\"] = \"small improvement\"\n",
    "    elif out[\"delta_auc\"] >= -0.002:\n",
    "        out[\"interpretation\"] = \"negligible change\"\n",
    "    elif out[\"delta_auc\"] > -0.01:\n",
    "        out[\"interpretation\"] = \"small drop\"\n",
    "    else:\n",
    "        out[\"interpretation\"] = \"meaningful drop\"\n",
    "    return out\n",
    "\n",
    "comparisons_df = pd.DataFrame([paired_comparison(*args) for args in PLANNED_COMPARISONS])\n",
    "\n",
    "# Benjamini-Hochberg correction across planned paired t-tests where available.\n",
    "valid_p = comparisons_df[\"ttest_p\"].notna()\n",
    "pvals = comparisons_df.loc[valid_p, \"ttest_p\"].to_numpy()\n",
    "qvals = np.full(len(comparisons_df), np.nan)\n",
    "if len(pvals):\n",
    "    order = np.argsort(pvals)\n",
    "    ranked = pvals[order]\n",
    "    adjusted = np.empty_like(ranked)\n",
    "    m = len(ranked)\n",
    "    running = 1.0\n",
    "    for i in range(m - 1, -1, -1):\n",
    "        running = min(running, ranked[i] * m / (i + 1))\n",
    "        adjusted[i] = running\n",
    "    qvals[np.where(valid_p)[0][order]] = adjusted\n",
    "comparisons_df[\"bh_q\"] = qvals\n",
    "\n",
    "display(\n",
    "    comparisons_df[[\n",
    "        \"section\", \"model\", \"question\", \"contrast\", \"before_auc\", \"after_auc\", \"delta_auc\",\n",
    "        \"delta_ci95\", \"ttest_p\", \"bh_q\", \"wilcoxon_p\", \"cohen_dz\", \"delta_gap\", \"interpretation\", \"caveat\",\n",
    "    ]].style.format({\n",
    "        \"before_auc\": \"{:.4f}\",\n",
    "        \"after_auc\": \"{:.4f}\",\n",
    "        \"delta_auc\": \"{:+.4f}\",\n",
    "        \"delta_ci95\": \"\u00b1{:.4f}\",\n",
    "        \"ttest_p\": \"{:.4f}\",\n",
    "        \"bh_q\": \"{:.4f}\",\n",
    "        \"wilcoxon_p\": \"{:.4f}\",\n",
    "        \"cohen_dz\": \"{:+.2f}\",\n",
    "        \"delta_gap\": \"{:+.4f}\",\n",
    "    }).background_gradient(subset=[\"delta_auc\"], cmap=\"RdYlGn\", vmin=-0.06, vmax=0.06)\n",
    ")"
   ]
  },
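  {
   "cell_type": "markdown",
   "id": "bh-sketch",
   "metadata": {},
   "source": [
    "The Benjamini-Hochberg step-up adjustment used above can be checked in isolation. A self-contained sketch (toy p-values, not notebook results):\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def bh_adjust(pvals):\n",
    "    # Step-up BH: scale each sorted p by m/rank, then enforce monotonicity\n",
    "    # by carrying the running minimum from the largest p downward.\n",
    "    p = np.asarray(pvals, dtype=float)\n",
    "    order = np.argsort(p)\n",
    "    ranked = p[order]\n",
    "    m = len(p)\n",
    "    adj = np.empty(m)\n",
    "    running = 1.0\n",
    "    for i in range(m - 1, -1, -1):\n",
    "        running = min(running, ranked[i] * m / (i + 1))\n",
    "        adj[i] = running\n",
    "    out = np.empty(m)\n",
    "    out[order] = adj  # restore original ordering\n",
    "    return out\n",
    "\n",
    "print(bh_adjust([0.01, 0.04, 0.03, 0.20]))  # [0.04, 0.0533..., 0.0533..., 0.2]\n",
    "```\n",
    "\n",
    "Only the planned paired t-tests enter the correction; the Wilcoxon p-values are reported uncorrected as a secondary check.\n"
   ]
  },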
|
|
  {
   "cell_type": "markdown",
   "id": "f20e5262",
   "metadata": {},
   "source": [
    "## Visual summary\n",
    "\n",
    "Two plots are most useful for decision-making:\n",
    "\n",
    "- Ranking all conditions by AUC shows the best observed configurations but can overstate duplicated/near-identical runs.\n",
    "- The paired delta plot shows the controlled effect of each preprocessing change and exposes uncertainty."
   ]
  },
|
|
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "42882c6a",
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_df = analysis_df.copy()\n",
    "plot_df[\"display_label\"] = plot_df[\"section\"] + \" | \" + plot_df[\"label\"]\n",
    "plot_df = plot_df.sort_values(\"auc_mean\", ascending=True)\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(11, max(7, 0.35 * len(plot_df))))\n",
    "colors = {\"2A\": \"#4C78A8\", \"2B\": \"#F58518\", \"2C\": \"#54A24B\", \"2D\": \"#E45756\", \"2E\": \"#B279A2\"}\n",
    "ax.barh(\n",
    "    plot_df[\"display_label\"],\n",
    "    plot_df[\"auc_mean\"],\n",
    "    xerr=plot_df[\"auc_std\"],\n",
    "    color=[colors.get(s, \"#999999\") for s in plot_df[\"section\"]],\n",
    "    alpha=0.85,\n",
    ")\n",
    "ax.set_xlim(0.65, 1.0)\n",
    "ax.set_xlabel(\"Mean AUC across CV folds\")\n",
    "ax.set_title(\"Phase 2 Conditions Ranked by AUC\")\n",
    "ax.axvline(0.95, color=\"black\", linewidth=1, linestyle=\"--\", alpha=0.4)\n",
    "for y, (_, row) in enumerate(plot_df.iterrows()):\n",
    "    ax.text(row[\"auc_mean\"] + 0.004, y, f\"{row['auc_mean']:.4f}\", va=\"center\", fontsize=9)\n",
    "fig.tight_layout()\n",
    "fig.savefig(FIGURES_DIR / \"ranked_auc.png\", dpi=200, bbox_inches=\"tight\")\n",
    "plt.show()\n",
    "\n",
    "forest = comparisons_df.copy()\n",
    "forest[\"display\"] = forest[\"section\"] + \" \" + forest[\"model\"] + \" - \" + forest[\"contrast\"]\n",
    "forest = forest.iloc[::-1]\n",
    "fig, ax = plt.subplots(figsize=(11, max(6, 0.45 * len(forest))))\n",
    "y = np.arange(len(forest))\n",
    "ax.errorbar(\n",
    "    forest[\"delta_auc\"], y,\n",
    "    xerr=forest[\"delta_ci95\"],\n",
    "    fmt=\"o\", color=\"#1F2937\", ecolor=\"#6B7280\", capsize=4,\n",
    ")\n",
    "ax.axvline(0, color=\"black\", linewidth=1)\n",
    "ax.axvspan(-0.002, 0.002, color=\"#9CA3AF\", alpha=0.18, label=\"negligible band\")\n",
    "ax.set_yticks(y)\n",
    "ax.set_yticklabels(forest[\"display\"])\n",
    "ax.set_xlabel(\"Delta AUC (after - before), paired by fold\")\n",
    "ax.set_title(\"Planned Phase 2 Effect Estimates\")\n",
    "ax.legend(loc=\"lower right\")\n",
    "fig.tight_layout()\n",
    "fig.savefig(FIGURES_DIR / \"planned_effects.png\", dpi=200, bbox_inches=\"tight\")\n",
    "plt.show()"
   ]
  },
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "e063cfc0",
|
|
"metadata": {},
|
|
"source": [
|
|
"## 2A - Shortcut analysis\n",
|
|
"\n",
|
|
"Shortcut checks map to `p2a_*` configs:\n",
|
|
"- `p2a_t1_original` vs `p2a_t2_real_norm` (normalization)\n",
|
|
"- `p2a_t1_original` vs `p2a_t3_holdout_*` (source_holdout)\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "910bd5bd",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"def comparison_subset(section: str, question: str | None = None) -> pd.DataFrame:\n",
|
|
" df = comparisons_df[comparisons_df[\"section\"].eq(section)].copy()\n",
|
|
" if question:\n",
|
|
" df = df[df[\"question\"].eq(question)]\n",
|
|
" return df\n",
|
|
"\n",
|
|
"\n",
|
|
"def print_comparison_readout(df: pd.DataFrame) -> None:\n",
|
|
" for _, row in df.iterrows():\n",
|
|
" print(f\"{row['section']} {row['model']} - {row['contrast']}\")\n",
|
|
" print(f\" AUC: {row['before_auc']:.4f} -> {row['after_auc']:.4f} ({row['delta_auc']:+.4f})\")\n",
|
|
" print(f\" paired t p={row['ttest_p']:.4f}, BH q={row['bh_q']:.4f}, CI95 delta=\u00b1{row['delta_ci95']:.4f}\")\n",
|
|
" print(f\" gap delta: {row['delta_gap']:+.4f}; interpretation: {row['interpretation']}\")\n",
|
|
" if row['caveat']:\n",
|
|
" print(f\" caveat: {row['caveat']}\")\n",
|
|
" print()\n",
|
|
"\n",
|
|
"print_comparison_readout(comparison_subset(\"2B\", \"resolution\"))\n",
|
|
"\n",
|
|
"res_plot = comparison_subset(\"2B\", \"resolution\")\n",
|
|
"fig, ax = plt.subplots(figsize=(8, 5))\n",
|
|
"for _, row in res_plot.iterrows():\n",
|
|
" r0, r1 = load_results(row[\"before\"]), load_results(row[\"after\"])\n",
|
|
" v0, v1 = metric_values(r0), metric_values(r1)\n",
|
|
" x = [0, 1]\n",
|
|
" for a, b in zip(v0, v1):\n",
|
|
" ax.plot(x, [a, b], color=\"#9CA3AF\", alpha=0.7)\n",
|
|
" ax.plot(x, [v0.mean(), v1.mean()], marker=\"o\", linewidth=3, label=row[\"model\"])\n",
|
|
"ax.set_xticks([0, 1])\n",
|
|
"ax.set_xticklabels([\"128\", \"224\"])\n",
|
|
"ax.set_ylabel(\"AUC\")\n",
|
|
"ax.set_title(\"2B Resolution: Fold-Paired AUC\")\n",
|
|
"ax.legend()\n",
|
|
"fig.tight_layout()\n",
|
|
"fig.savefig(FIGURES_DIR / \"2b_resolution_paired.png\", dpi=200, bbox_inches=\"tight\")\n",
|
|
"plt.show()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "530e8675",
|
|
"metadata": {},
|
|
"source": [
|
|
"## 2B - Resolution impact\n",
|
|
"\n",
|
|
"This section compares 128 vs 224 using `p2b_*_224` and Phase 1 baselines as explicit 128 fallbacks.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "13304d38",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"print_comparison_readout(comparison_subset(\"2C\", \"facecrop\"))\n",
|
|
"\n",
|
|
"face_df = canonical_runs_df[canonical_runs_df[\"section\"].eq(\"2C\")].copy()\n",
|
|
"fig, axes = plt.subplots(1, 2, figsize=(12, 5), sharey=False)\n",
|
|
"for ax, model in zip(axes, [\"SimpleCNN\", \"ResNet18\"]):\n",
|
|
" sub = face_df[face_df[\"model\"].eq(model)].sort_values(\"face_crop\")\n",
|
|
" ax.bar(sub[\"condition\"], sub[\"auc_mean\"], yerr=sub[\"auc_std\"], color=[\"#D97706\", \"#059669\"], alpha=0.85, capsize=5)\n",
|
|
" ax.set_title(model)\n",
|
|
" ax.set_ylim(0.70 if model == \"SimpleCNN\" else 0.94, 0.99)\n",
|
|
" ax.set_ylabel(\"AUC\")\n",
|
|
" ax.tick_params(axis=\"x\", rotation=20)\n",
|
|
"fig.suptitle(\"2C Facecrop Impact\")\n",
|
|
"fig.tight_layout()\n",
|
|
"fig.savefig(FIGURES_DIR / \"2c_facecrop.png\", dpi=200, bbox_inches=\"tight\")\n",
|
|
"plt.show()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "8702d10d",
|
|
"metadata": {},
|
|
"source": [
|
|
"## 2C - Facecrop impact\n",
|
|
"\n",
|
|
"This section compares `p2c_*_facecrop` against the matching `p2b_*_224` no-facecrop baselines.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "ec5e03ef",
|
|
"metadata": {},
   "outputs": [],
   "source": [
    "print_comparison_readout(comparison_subset(\"2A\"))\n",
    "\n",
    "# Inspect whether logs contain the per-source data needed by v2.md.\n",
    "source_audit = []\n",
    "for run in [\"p2a_t1_original\", \"p2a_t3_holdout_text2img\", \"p2a_t3_holdout_inpainting\", \"p2a_t3_holdout_insight\"]:\n",
    "    results = load_results(run)\n",
    "    has_per_source = False\n",
    "    has_records = False\n",
    "    example_keys = []\n",
    "    if results:\n",
    "        for fold in results.get(\"fold_results\", []):\n",
    "            tm = fold.get(\"test_metrics\", {})\n",
    "            example_keys = sorted(tm.keys())\n",
    "            has_per_source = has_per_source or any(k in tm for k in [\"per_source\", \"per_source_metrics\", \"pairwise_source_metrics\", \"source_metrics\", \"pair_metrics\"])\n",
    "            has_records = has_records or any(k in fold for k in [\"records\", \"predictions\", \"test_records\"])\n",
    "    source_audit.append({\n",
    "        \"run\": run,\n",
    "        \"has_per_source_metrics\": has_per_source,\n",
    "        \"has_prediction_records\": has_records,\n",
    "        \"test_metric_keys\": example_keys,\n",
    "    })\n",
    "source_audit_df = pd.DataFrame(source_audit)\n",
    "display(source_audit_df)\n",
    "\n",
    "holdout_runs = [\"p2a_t1_original\", \"p2a_t3_holdout_text2img\", \"p2a_t3_holdout_inpainting\", \"p2a_t3_holdout_insight\"]\n",
    "holdout_df = canonical_runs_df[canonical_runs_df[\"run\"].isin(holdout_runs)].copy()\n",
    "holdout_df[\"delta_vs_all_source\"] = holdout_df[\"auc_mean\"] - float(holdout_df.loc[holdout_df[\"run\"].eq(\"p2a_t1_original\"), \"auc_mean\"].iloc[0])\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(9, 5))\n",
    "ax.bar(holdout_df[\"label\"], holdout_df[\"auc_mean\"], yerr=holdout_df[\"auc_std\"], color=\"#54A24B\", alpha=0.85, capsize=5)\n",
    "ax.set_ylim(0.88, 0.99)\n",
    "ax.set_ylabel(\"Aggregate AUC\")\n",
    "ax.set_title(\"2C Source Holdout Proxy: Aggregate Test AUC\")\n",
    "ax.tick_params(axis=\"x\", rotation=20)\n",
    "for i, (_, row) in enumerate(holdout_df.iterrows()):\n",
    "    ax.text(i, row[\"auc_mean\"] + 0.004, f\"{row['delta_vs_all_source']:+.3f}\", ha=\"center\", fontsize=9)\n",
    "fig.tight_layout()\n",
    "fig.savefig(FIGURES_DIR / \"2c_holdout_proxy.png\", dpi=200, bbox_inches=\"tight\")\n",
    "plt.show()\n",
    "\n",
    "print(\"Geometry diagnostic evidence:\")\n",
    "geometry_keys = []\n",
    "for run in [\"p2a_t1_original\", \"p2a_t2_real_norm\"]:\n",
    "    results = load_results(run)\n",
    "    cfg = (results or {}).get(\"config\", {})\n",
    "    geometry_keys.append({\n",
    "        \"run\": run,\n",
    "        \"config_geometry_condition\": cfg.get(\"geometry_condition\"),\n",
    "        \"has_matched_geometry_metric\": any(\n",
    "            \"geometry\" in str(k).lower() or \"matched\" in str(k).lower()\n",
    "            for fold in (results or {}).get(\"fold_results\", [])\n",
    "            for k in fold.get(\"test_metrics\", {}).keys()\n",
    "        ),\n",
    "    })\n",
    "display(pd.DataFrame(geometry_keys))"
   ]
  },
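  {
   "cell_type": "markdown",
   "id": "persrc-auc-sketch-md",
   "metadata": {},
   "source": [
    "The audit above checks whether prediction-level records exist in the logs. As a hedged sketch, the cell below shows how held-out-source vs in-source AUC *could* be computed once such records are exported. The record fields (`label`, `prob`, `source`) and the `rank_auc` / `per_source_auc` helpers are illustrative assumptions, not the current log schema."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "persrc-auc-sketch",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch under an assumed record schema: [{\"label\": 0/1, \"prob\": float, \"source\": str}, ...].\n",
    "def rank_auc(labels, scores):\n",
    "    # Mann-Whitney formulation of ROC AUC; ties count half.\n",
    "    pos = [s for y, s in zip(labels, scores) if y == 1]\n",
    "    neg = [s for y, s in zip(labels, scores) if y == 0]\n",
    "    if not pos or not neg:\n",
    "        return float(\"nan\")\n",
    "    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)\n",
    "    return wins / (len(pos) * len(neg))\n",
    "\n",
    "def per_source_auc(records):\n",
    "    # Group records by generator source, then score each group separately,\n",
    "    # so held-out-source AUC can be compared with in-source AUC.\n",
    "    by_source = {}\n",
    "    for r in records:\n",
    "        by_source.setdefault(r[\"source\"], []).append(r)\n",
    "    return {\n",
    "        src: rank_auc([r[\"label\"] for r in rs], [r[\"prob\"] for r in rs])\n",
    "        for src, rs in by_source.items()\n",
    "    }"
   ]
  },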
  {
   "cell_type": "markdown",
   "id": "2c3b8812",
   "metadata": {},
   "source": [
    "## 2D / 2E - Augmentation impact and test-set integrity\n",
    "\n",
    "The augmentation question has two parts:\n",
    "\n",
    "- Does light augmentation help at 224 without facecrop?\n",
    "- Does it help once facecrop is enabled?\n",
    "\n",
    "The implementation also needs to guarantee that validation/test evaluation is not stochastic. The preprocessing pipeline keeps stochastic operations behind `self.train`, so `train=False` disables them even if augmentation settings exist."
   ]
  },
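  {
   "cell_type": "markdown",
   "id": "train-guard-sketch-md",
   "metadata": {},
   "source": [
    "As a minimal sketch of the guard pattern described above (toy code, not the project's `DFFImagePipeline`; all names here are hypothetical), the next cell shows a transform whose stochastic op is gated on `self.train`, plus a behavioral check that `train=False` is deterministic."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "train-guard-sketch",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Toy illustration of the self.train guard; class and field names are hypothetical.\n",
    "import random\n",
    "\n",
    "class ToyPipeline:\n",
    "    def __init__(self, train: bool, noise_p: float = 0.5):\n",
    "        self.train = train\n",
    "        self.noise_p = noise_p\n",
    "\n",
    "    def __call__(self, x: float) -> float:\n",
    "        # The stochastic op only fires in training mode, even though noise_p is set.\n",
    "        if self.train and random.random() < self.noise_p:\n",
    "            x = x + random.gauss(0.0, 0.1)\n",
    "        return x\n",
    "\n",
    "# Behavioral check: repeated eval-mode calls must all produce the same output.\n",
    "eval_pipe = ToyPipeline(train=False)\n",
    "outputs = {eval_pipe(1.0) for _ in range(100)}\n",
    "assert outputs == {1.0}, \"eval transform must be deterministic\"\n",
    "print(\"train=False is deterministic:\", outputs)"
   ]
  },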
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f11c3257",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"2D (p2d): augmentation without facecrop\")\n",
    "print_comparison_readout(comparison_subset(\"2D\", \"augmentation\"))\n",
    "print(\"2E (p2e): augmentation with facecrop\")\n",
    "print_comparison_readout(comparison_subset(\"2E\", \"facecrop + augmentation\"))\n",
    "\n",
    "aug_sections = comparisons_df[comparisons_df[\"section\"].isin([\"2D\", \"2E\"])].copy()\n",
    "fig, ax = plt.subplots(figsize=(9, 5))\n",
    "labels = aug_sections[\"section\"] + \" \" + aug_sections[\"model\"]\n",
    "ax.bar(labels, aug_sections[\"delta_auc\"], yerr=aug_sections[\"delta_ci95\"], color=[\"#E45756\" if d < 0 else \"#059669\" for d in aug_sections[\"delta_auc\"]], alpha=0.85, capsize=5)\n",
    "ax.axhline(0, color=\"black\", linewidth=1)\n",
    "ax.set_ylabel(\"Delta AUC from adding augmentation\")\n",
    "ax.set_title(\"Augmentation Effects Across Facecrop Conditions\")\n",
    "ax.tick_params(axis=\"x\", rotation=20)\n",
    "fig.tight_layout()\n",
    "fig.savefig(FIGURES_DIR / \"2d_2e_augmentation_effects.png\", dpi=200, bbox_inches=\"tight\")\n",
    "plt.show()\n",
    "\n",
    "# Static audit of eval stochasticity: string checks that every stochastic op is gated on self.train.\n",
    "try:\n",
    "    import inspect\n",
    "    from src.preprocessing.pipeline import DFFImagePipeline\n",
    "    from src.evaluation import evaluate as evaluate_module\n",
    "\n",
    "    pipeline_src = inspect.getsource(DFFImagePipeline)\n",
    "    build_transforms_src = inspect.getsource(evaluate_module.build_transforms)\n",
    "    stochastic_guards = {\n",
    "        \"flip_guarded_by_train\": \"if self.train and random.random() < self.hflip_p\" in pipeline_src,\n",
    "        \"rotate_guarded_by_train\": \"if self.train and self.rotation_degrees > 0\" in pipeline_src,\n",
    "        \"color_jitter_returns_when_not_train\": \"if not self.train:\" in pipeline_src,\n",
    "        \"blur_guarded_by_train\": \"if self.train and random.random() < self.blur_p\" in pipeline_src,\n",
    "        \"jpeg_guarded_by_train\": \"if self.train and random.random() < self.jpeg_p\" in pipeline_src,\n",
    "        \"erase_guarded_by_train\": \"if self.train and random.random() < self.erase_p\" in pipeline_src,\n",
    "        \"noise_guarded_by_train\": \"if self.train and random.random() < self.noise_p\" in pipeline_src,\n",
    "        \"cv_transform_uses_train_flag\": \"get_transforms(train=train\" in build_transforms_src,\n",
    "    }\n",
    "    display(pd.DataFrame([stochastic_guards]).T.rename(columns={0: \"passes\"}))\n",
    "except Exception as exc:\n",
    "    print(f\"Could not run transform audit: {exc}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "02e47658",
   "metadata": {},
   "source": [
    "## Decision synthesis\n",
    "\n",
    "This section converts the evidence into Phase 3 settings. It intentionally distinguishes a recommendation from a claim:\n",
    "\n",
    "- Recommendation: choose the setting that is best supported for the next experiment.\n",
    "- Claim: what the current evidence proves. Some Phase 2C claims remain incomplete without per-source or matched-geometry outputs."
   ]
  },
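  {
   "cell_type": "markdown",
   "id": "decision-rule-sketch-md",
   "metadata": {},
   "source": [
    "The recommendation/claim distinction can be made mechanical. The helper below is an illustrative sketch of the decision criteria stated in the Questions section; the function name and the CI-based rule are assumptions for illustration, not project code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "decision-rule-sketch",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical helper: classify a preprocessing change from its fold-level delta.\n",
    "def classify_delta(delta_auc: float, ci95: float, gap_worsens: bool) -> str:\n",
    "    if gap_worsens:\n",
    "        return \"rejected\"  # violates the train/validation-gap criterion\n",
    "    if delta_auc > 0 and delta_auc - ci95 > 0:\n",
    "        return \"supported\"  # positive and the 95% CI excludes zero\n",
    "    if delta_auc > 0:\n",
    "        return \"directional\"  # positive but within noise at n=5 folds\n",
    "    return \"unsupported\"\n",
    "\n",
    "print(classify_delta(0.025, 0.010, False))  # supported\n",
    "print(classify_delta(0.004, 0.012, False))  # directional"
   ]
  },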
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7034443c",
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_delta(question: str, model: str | None = None, section: str | None = None) -> pd.DataFrame:\n",
    "    df = comparisons_df[comparisons_df[\"question\"].eq(question)].copy()\n",
    "    if model:\n",
    "        df = df[df[\"model\"].eq(model)]\n",
    "    if section:\n",
    "        df = df[df[\"section\"].eq(section)]\n",
    "    return df\n",
    "\n",
    "resolution_resnet = get_delta(\"resolution\", \"ResNet18\").iloc[0]\n",
    "facecrop_resnet = get_delta(\"facecrop\", \"ResNet18\").iloc[0]\n",
    "facecrop_simple = get_delta(\"facecrop\", \"SimpleCNN\").iloc[0]\n",
    "aug_no_crop_resnet = get_delta(\"augmentation\", \"ResNet18\").iloc[0]\n",
    "aug_no_crop_simple = get_delta(\"augmentation\", \"SimpleCNN\").iloc[0]\n",
    "aug_crop_resnet = get_delta(\"facecrop + augmentation\", \"ResNet18\").iloc[0]\n",
    "aug_crop_simple = get_delta(\"facecrop + augmentation\", \"SimpleCNN\").iloc[0]\n",
    "norm = get_delta(\"normalization\", \"ResNet18\").iloc[0]\n",
    "\n",
    "recommendations = [\n",
    "    {\n",
    "        \"choice\": \"resolution\",\n",
    "        \"recommendation\": \"224x224\",\n",
    "        \"evidence\": f\"ResNet18 delta AUC {resolution_resnet.delta_auc:+.4f}; SimpleCNN's result does not drive the Phase 3 capacity choice.\",\n",
    "        \"confidence\": \"high\" if resolution_resnet.delta_auc > 0.02 else \"medium\",\n",
    "    },\n",
    "    {\n",
    "        \"choice\": \"facecrop\",\n",
    "        \"recommendation\": \"use facecrop\",\n",
    "        \"evidence\": f\"Small positive deltas for both models: SimpleCNN {facecrop_simple.delta_auc:+.4f}, ResNet18 {facecrop_resnet.delta_auc:+.4f}.\",\n",
    "        \"confidence\": \"medium\",\n",
    "    },\n",
    "    {\n",
    "        \"choice\": \"augmentation\",\n",
    "        \"recommendation\": \"do not use light augmentation for Phase 3 at 20% data\",\n",
    "        \"evidence\": f\"SimpleCNN drops {aug_no_crop_simple.delta_auc:+.4f} without facecrop and {aug_crop_simple.delta_auc:+.4f} with facecrop; ResNet18 is neutral/slightly mixed ({aug_no_crop_resnet.delta_auc:+.4f}, {aug_crop_resnet.delta_auc:+.4f}).\",\n",
    "        \"confidence\": \"high for SimpleCNN, medium for ResNet18\",\n",
    "    },\n",
    "    {\n",
    "        \"choice\": \"normalization\",\n",
    "        \"recommendation\": \"ImageNet normalization\",\n",
    "        \"evidence\": f\"Real-train-only normalization delta AUC {norm.delta_auc:+.4f}; no useful gain and less standard for pretrained ResNet.\",\n",
    "        \"confidence\": \"medium\",\n",
    "    },\n",
    "    {\n",
    "        \"choice\": \"shortcut/source claims\",\n",
    "        \"recommendation\": \"do not overclaim; add per-source or prediction exports before final report\",\n",
    "        \"evidence\": \"Current CV logs lack held-out-source vs in-source AUC and matched-geometry test metrics.\",\n",
    "        \"confidence\": \"high\",\n",
    "    },\n",
    "]\n",
    "\n",
    "recommendations_df = pd.DataFrame(recommendations)\n",
    "display(recommendations_df)\n",
    "\n",
    "summary = {\n",
    "    \"phase\": \"phase2\",\n",
    "    \"source_documents\": [\"classifier/v2.md\", \"classifier/impl.md\"],\n",
    "    \"artifact_counts\": {\n",
    "        \"canonical_runs\": int(len(canonical_runs_df)),\n",
    "        \"loaded_canonical_runs\": int(canonical_runs_df[\"log_status\"].isin([\"present\", \"fallback\"]).sum()),\n",
    "        \"fallback_runs_used\": {k: v for k, v in RUN_ALIASES.items() if resolve_run(k) != k},\n",
    "    },\n",
    "    \"recommendations\": recommendations,\n",
    "    \"planned_comparisons\": comparisons_df.replace({np.nan: None}).to_dict(orient=\"records\"),\n",
    "    \"known_gaps\": [\n",
    "        \"Dedicated p2a_*_128 logs are absent/skipped; Phase 1 baselines are used as fallbacks.\",\n",
    "        \"Source holdout logs do not include prediction-level or per-source metrics, so held-out-source AUC vs in-source AUC cannot be computed.\",\n",
    "        \"No matched-geometry evaluation metric is present in the p2a shortcut-analysis logs, so geometry shortcut analysis is incomplete.\",\n",
    "    ],\n",
    "}\n",
    "\n",
    "summary_path = ANALYSIS_DIR / \"phase2_analysis_summary.json\"\n",
    "with summary_path.open(\"w\") as f:\n",
    "    json.dump(summary, f, indent=2)\n",
    "\n",
    "print(f\"Saved summary: {summary_path.relative_to(PROJECT_ROOT)}\")\n",
    "print(f\"Saved figures: {FIGURES_DIR.relative_to(PROJECT_ROOT)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5a337f73",
   "metadata": {},
   "source": [
    "## Report-ready conclusion\n",
    "\n",
    "The strongest Phase 2 result is the resolution effect for ResNet18: moving to 224x224 substantially improves AUC under the controlled CV protocol. Face cropping gives a small positive effect and is reasonable to carry forward, especially because it aligns the model with face evidence rather than background context. Light augmentation is not supported at this 20% data setting: it strongly hurts SimpleCNN and provides no reliable gain for ResNet18, with or without face cropping. ImageNet normalization remains preferable because real-train-only normalization does not improve AUC and is less aligned with pretrained ResNet expectations.\n",
    "\n",
    "Recommended Phase 3 preprocessing: **224x224, facecrop enabled, no light augmentation, ImageNet normalization**.\n",
    "\n",
    "Limitations to fix before the final report: export prediction-level records or per-source pairwise metrics for source holdout, and add the matched-geometry evaluation required by the shortcut-analysis plan. Without those artifacts, Phase 2C can only support a limited shortcut analysis."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "drl",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}