All notebooks, without Phase 4 results

DiogoCosta18
2026-05-06 20:28:29 +01:00
parent b5313e3320
commit 69666d6aa0
16 changed files with 2312 additions and 533 deletions
@@ -702,16 +702,6 @@
"For the report, this table supports the normalization ablation in Phase 2. The actual decision is made in `04_phase2_analysis.ipynb`, where `real_norm` is compared against ImageNet/default using the saved logs.\n"
]
},
-{
-"cell_type": "markdown",
-"id": "fb95d062",
-"metadata": {},
-"source": [
-"## 5. Reproducibility note\n",
-"\n",
-"These checks are not preprocessing operations. They simply confirm that the experiment setup is safe: identity groups do not leak across splits, validation/test transforms are deterministic, source-specific metrics handle edge cases, and config inheritance works as expected. The full test suite lives in `classifier/tests/`.\n"
-]
-},
{
"cell_type": "markdown",
"id": "b02fd790",
+1 -11
@@ -818,22 +818,12 @@
"The confusion matrices show the AUC story in error-count form. SimpleCNN correctly classifies about `71%` of real images and `70%` of fake images, so it misses many examples in both directions. ResNet18 improves both sides: about `81%` of real images are kept real, and about `88%` of fake images are detected as fake. The most important practical gain is fewer fake images predicted as real (`30%` -> `12%`), although the model still produces some false alarms on real images (`19%`).\n"
]
},
-{
-"cell_type": "markdown",
-"id": "0591bcea",
-"metadata": {},
-"source": [
-"## 6. Reproducibility note\n",
-"\n",
-"These checks are not extra results. They simply support the credibility of the Phase 1 comparison: the same config-loading rules, grouped splits, deterministic evaluation transforms, and safe metric handling are covered by tests in `classifier/tests/`. This lets the baseline comparison focus on the intended difference: SimpleCNN versus pretrained ResNet18.\n"
-]
-},
{
"cell_type": "markdown",
"id": "624839d8",
"metadata": {},
"source": [
-"## Report-ready conclusion\n",
+"## Conclusion\n",
"\n",
"Under identical Phase 1 conditions, ResNet18 is the stronger baseline. SimpleCNN reaches AUC `0.7786`, accuracy `0.7039`, and F1 `0.7801`. ResNet18 reaches AUC `0.9366`, accuracy `0.8650`, and F1 `0.9073`. The mean fold-wise AUC improvement is about `+0.1580`.\n",
"\n",
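As a quick cross-check of the conclusion cell in the hunk above, the following minimal sketch recomputes the SimpleCNN-to-ResNet18 gains. It uses only the aggregate values quoted there; the dictionary layout and the helper name `metric_gains` are illustrative assumptions, not code from the notebooks.

```python
# Aggregate Phase 1 metrics, copied from the conclusion cell quoted in the diff.
phase1 = {
    "SimpleCNN": {"auc": 0.7786, "accuracy": 0.7039, "f1": 0.7801},
    "ResNet18": {"auc": 0.9366, "accuracy": 0.8650, "f1": 0.9073},
}


def metric_gains(baseline: dict, candidate: dict) -> dict:
    """Return candidate minus baseline for every metric, rounded to 4 decimals."""
    return {name: round(candidate[name] - baseline[name], 4) for name in baseline}


gains = metric_gains(phase1["SimpleCNN"], phase1["ResNet18"])
for name, delta in gains.items():
    print(f"{name}: {delta:+.4f}")
```

The AUC difference matches the `+0.1580` quoted in the cell. Because both models are evaluated on the same folds, the mean of the fold-wise differences equals the difference of the fold means, so the aggregate check is consistent with the fold-wise statement.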
File diff suppressed because it is too large
+10 -6
@@ -7,17 +7,19 @@
"source": [
"# 05 - Grad-CAM Interpretability Analysis\n",
"\n",
-"This final classifier notebook adds qualitative evidence. It does not train, tune, or reevaluate models. It loads existing configs, logs, and checkpoints, selects deterministic fold-0 examples, and renders fake-logit Grad-CAM overlays. Metrics reported in the report remain the canonical log values; checkpoint-derived candidate scores in this notebook are only used to choose visual examples.\n",
+"This interpretability notebook adds qualitative evidence after the Phase 2 ablations. It does not train, tune, or reevaluate models. It loads existing configs, logs, and checkpoints, selects deterministic fold-0 examples, and renders fake-logit Grad-CAM overlays. Metrics reported in the report remain the canonical log values; checkpoint-derived candidate scores in this notebook are only used to choose visual examples.\n",
"\n",
"Grad-CAM answers a limited question: which spatial regions most support the model's fake-class logit for a selected image? It is useful for sanity checking localization, but it is not proof of causality and it should not override held-out metrics.\n",
"\n",
"A note on resolution: the visible Grad-CAM grid comes from the target convolutional feature map, not from the original image. ResNet18's final convolution is very coarse at 224x224 input, so its last-layer CAM is upsampled from a small grid and appears blockier than some SimpleCNN maps. That block size is architectural granularity, not model confidence. The notebook keeps the canonical last-conv CAM and also adds a finer ResNet18 diagnostic view from an earlier layer for readability.\n",
"\n",
"Story questions:\n",
-"- Does the selected final run focus on facial evidence rather than background shortcuts?\n",
+"- Does the selected Phase 2 run focus on facial evidence rather than background shortcuts?\n",
"- Does facecrop change what the model can attend to?\n",
"- Do augmentation and source-holdout runs reveal instability in attention?\n",
-"- Are errors visually plausible, or do they suggest shortcut behavior?\n"
+"- Are errors visually plausible, or do they suggest shortcut behavior?\n",
+"\n",
+"Roadmap link: after this qualitative check, `06_phase3_model_family_analysis.ipynb` compares stronger pretrained backbones and `07_phase4_data_scaling_analysis.ipynb` records the planned data-scaling analysis.\n"
]
},
{
@@ -1693,11 +1695,13 @@
"id": "7a682e64",
"metadata": {},
"source": [
-"## Report-ready conclusion\n",
+"## Conclusion\n",
"\n",
-"Grad-CAM provides a qualitative final check on the classifier story. The selected metric setting remains `p2c_resnet18_facecrop`: 224x224 input, facecrop enabled, no augmentation, and ImageNet/default normalization. The overlays are most reassuring when they concentrate on facial regions, and most cautionary when errors or source-holdout examples show diffuse, background, or artifact-specific attention.\n",
+"Grad-CAM provides a qualitative check on the Phase 2 classifier story. The selected metric setting remains `p2c_resnet18_facecrop`: 224x224 input, facecrop enabled, no augmentation, and ImageNet/default normalization. The overlays are most reassuring when they concentrate on facial regions, and most cautionary when errors or source-holdout examples show diffuse, background, or artifact-specific attention.\n",
"\n",
-"The key limitation from Phase 2 still stands: high in-distribution AUC does not guarantee source-agnostic generalization. The Grad-CAM panels help make that limitation visible, but the source-holdout pairwise AUC values are the primary quantitative evidence.\n"
+"The key limitation from Phase 2 still stands: high in-distribution AUC does not guarantee source-agnostic generalization. The Grad-CAM panels help make that limitation visible, but the source-holdout pairwise AUC values are the primary quantitative evidence.\n",
+"\n",
+"Next: `06_phase3_model_family_analysis.ipynb` asks whether stronger pretrained model families improve on the selected Phase 2 pipeline.\n"
]
}
],
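The resolution note in this notebook's introduction (coarse last-layer CAM at 224x224 input) can be illustrated with a little arithmetic. The sketch below assumes the standard torchvision ResNet18 layout, where the cumulative downsampling stride is 4, 8, 16, and 32 after `layer1` through `layer4`; the helper name `cam_grid` is hypothetical, and the diff does not say which earlier layer the notebook uses for its finer diagnostic view.

```python
# Cumulative downsampling stride after each residual stage of a standard
# torchvision-style ResNet18 (assumption: the project uses this layout).
RESNET18_STRIDES = {"layer1": 4, "layer2": 8, "layer3": 16, "layer4": 32}


def cam_grid(input_size: int, stride: int) -> tuple[int, int]:
    """Return (grid side length, input pixels covered per grid cell side)."""
    side = input_size // stride
    return side, input_size // side


for layer, stride in RESNET18_STRIDES.items():
    side, cell = cam_grid(224, stride)
    print(f"{layer}: {side}x{side} CAM grid, each cell spans about {cell}x{cell} input pixels")
```

At 224x224 the last stage yields a 7x7 grid, so every Grad-CAM cell is upsampled over roughly 32x32 pixels. This is why the blockiness reflects architecture rather than model confidence.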
File diff suppressed because one or more lines are too long
@@ -0,0 +1,609 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 07 - Phase 4 Data-Scaling Analysis\n",
"\n",
"Phase 4 is the natural next question after Phase 3. Once the best model families have been identified at the 20% data setting, the experiment asks whether more training data improves performance and source generalization.\n",
"\n",
"In the repository state used to create this notebook, Phase 4 configs exist but no `p4_*` logs or checkpoints are present under `classifier/outputs`. For that reason, this notebook is result-gated: it documents the planned experiment matrix now, and the analysis cells automatically switch on when the corresponding logs are added later.\n",
"\n",
"No Phase 4 metric is claimed unless it is loaded from an existing `classifier/outputs/logs/p4_*.json` file.\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Project root: c:\\Users\\diogo\\Documents\\MIA_UP\\2_Semestre\\DRL\\DRL_2\\DRL_PROJ\n"
]
}
],
"source": [
"from __future__ import annotations\n",
"\n",
"import json\n",
"import sys\n",
"from pathlib import Path\n",
"\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"\n",
"def find_project_root(start: Path | None = None) -> Path:\n",
" \"\"\"Find DRL_PROJ whether the notebook runs from repo root or classifier/notebooks.\"\"\"\n",
" start = Path.cwd() if start is None else Path(start)\n",
" for candidate in [start, *start.parents]:\n",
" if (candidate / \"classifier\").is_dir() and (candidate / \"docs\" / \"DRL_Project.md\").exists():\n",
" return candidate\n",
" raise RuntimeError(\"Could not find DRL_PROJ root. Run this notebook from inside the repository.\")\n",
"\n",
"\n",
"PROJECT_ROOT = find_project_root()\n",
"CLASSIFIER_ROOT = PROJECT_ROOT / \"classifier\"\n",
"if str(CLASSIFIER_ROOT) not in sys.path:\n",
" sys.path.insert(0, str(CLASSIFIER_ROOT))\n",
"\n",
"CONFIGS_DIR = CLASSIFIER_ROOT / \"configs\"\n",
"LOGS_DIR = CLASSIFIER_ROOT / \"outputs\" / \"logs\"\n",
"MODELS_DIR = CLASSIFIER_ROOT / \"outputs\" / \"models\"\n",
"FIGURES_DIR = CLASSIFIER_ROOT / \"outputs\" / \"figures\"\n",
"ANALYSIS_DIR = CLASSIFIER_ROOT / \"outputs\" / \"analysis\"\n",
"FIGURES_DIR.mkdir(parents=True, exist_ok=True)\n",
"ANALYSIS_DIR.mkdir(parents=True, exist_ok=True)\n",
"\n",
"print(f\"Project root: {PROJECT_ROOT}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Planned Phase 4 matrix\n",
"\n",
"The Phase 4 configs keep the Phase 3 preprocessing setup: pretrained backbones, 224x224 input, facecropped classifier data, no augmentation, and the same cross-validation protocol. The controlled variable is data fraction: 20%, 50%, and 100%.\n",
"\n",
"The model families selected for scaling are ResNet50, EfficientNet-B0, and ConvNeXt-Tiny. They are the strongest Phase 3 families and represent different capacity/efficiency tradeoffs.\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>run</th>\n",
" <th>backbone</th>\n",
" <th>subsample</th>\n",
" <th>image_size</th>\n",
" <th>data_dir</th>\n",
" <th>augment</th>\n",
" <th>pretrained</th>\n",
" <th>epochs</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>p4_convnext_tiny_20pct</td>\n",
" <td>convnext_tiny</td>\n",
" <td>0.2</td>\n",
" <td>224</td>\n",
" <td>cropped/classifier</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>15</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>p4_convnext_tiny_50pct</td>\n",
" <td>convnext_tiny</td>\n",
" <td>0.5</td>\n",
" <td>224</td>\n",
" <td>cropped/classifier</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>15</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>p4_convnext_tiny_100pct</td>\n",
" <td>convnext_tiny</td>\n",
" <td>1.0</td>\n",
" <td>224</td>\n",
" <td>cropped/classifier</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>15</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>p4_efficientnet_b0_20pct</td>\n",
" <td>efficientnet_b0</td>\n",
" <td>0.2</td>\n",
" <td>224</td>\n",
" <td>cropped/classifier</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>15</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>p4_efficientnet_b0_50pct</td>\n",
" <td>efficientnet_b0</td>\n",
" <td>0.5</td>\n",
" <td>224</td>\n",
" <td>cropped/classifier</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>15</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>p4_efficientnet_b0_100pct</td>\n",
" <td>efficientnet_b0</td>\n",
" <td>1.0</td>\n",
" <td>224</td>\n",
" <td>cropped/classifier</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>15</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>p4_resnet50_20pct</td>\n",
" <td>resnet50</td>\n",
" <td>0.2</td>\n",
" <td>224</td>\n",
" <td>cropped/classifier</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>15</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>p4_resnet50_50pct</td>\n",
" <td>resnet50</td>\n",
" <td>0.5</td>\n",
" <td>224</td>\n",
" <td>cropped/classifier</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>15</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>p4_resnet50_100pct</td>\n",
" <td>resnet50</td>\n",
" <td>1.0</td>\n",
" <td>224</td>\n",
" <td>cropped/classifier</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>15</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" run backbone subsample image_size \\\n",
"1 p4_convnext_tiny_20pct convnext_tiny 0.2 224 \n",
"2 p4_convnext_tiny_50pct convnext_tiny 0.5 224 \n",
"0 p4_convnext_tiny_100pct convnext_tiny 1.0 224 \n",
"4 p4_efficientnet_b0_20pct efficientnet_b0 0.2 224 \n",
"5 p4_efficientnet_b0_50pct efficientnet_b0 0.5 224 \n",
"3 p4_efficientnet_b0_100pct efficientnet_b0 1.0 224 \n",
"7 p4_resnet50_20pct resnet50 0.2 224 \n",
"8 p4_resnet50_50pct resnet50 0.5 224 \n",
"6 p4_resnet50_100pct resnet50 1.0 224 \n",
"\n",
" data_dir augment pretrained epochs \n",
"1 cropped/classifier False True 15 \n",
"2 cropped/classifier False True 15 \n",
"0 cropped/classifier False True 15 \n",
"4 cropped/classifier False True 15 \n",
"5 cropped/classifier False True 15 \n",
"3 cropped/classifier False True 15 \n",
"7 cropped/classifier False True 15 \n",
"8 cropped/classifier False True 15 \n",
"6 cropped/classifier False True 15 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>subsample</th>\n",
" <th>0.2</th>\n",
" <th>0.5</th>\n",
" <th>1.0</th>\n",
" </tr>\n",
" <tr>\n",
" <th>backbone</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>convnext_tiny</th>\n",
" <td>p4_convnext_tiny_20pct</td>\n",
" <td>p4_convnext_tiny_50pct</td>\n",
" <td>p4_convnext_tiny_100pct</td>\n",
" </tr>\n",
" <tr>\n",
" <th>efficientnet_b0</th>\n",
" <td>p4_efficientnet_b0_20pct</td>\n",
" <td>p4_efficientnet_b0_50pct</td>\n",
" <td>p4_efficientnet_b0_100pct</td>\n",
" </tr>\n",
" <tr>\n",
" <th>resnet50</th>\n",
" <td>p4_resnet50_20pct</td>\n",
" <td>p4_resnet50_50pct</td>\n",
" <td>p4_resnet50_100pct</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"subsample 0.2 0.5 \\\n",
"backbone \n",
"convnext_tiny p4_convnext_tiny_20pct p4_convnext_tiny_50pct \n",
"efficientnet_b0 p4_efficientnet_b0_20pct p4_efficientnet_b0_50pct \n",
"resnet50 p4_resnet50_20pct p4_resnet50_50pct \n",
"\n",
"subsample 1.0 \n",
"backbone \n",
"convnext_tiny p4_convnext_tiny_100pct \n",
"efficientnet_b0 p4_efficientnet_b0_100pct \n",
"resnet50 p4_resnet50_100pct "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"def load_json(path: Path) -> dict:\n",
" return json.loads(path.read_text(encoding=\"utf-8\"))\n",
"\n",
"\n",
"def resolve_config(path: Path) -> dict:\n",
" cfg = load_json(path)\n",
" parent = cfg.pop(\"extends\", None)\n",
" if parent:\n",
" base = resolve_config(path.parent / parent)\n",
" base.update(cfg)\n",
" cfg = base\n",
" return cfg\n",
"\n",
"\n",
"phase4_configs = []\n",
"for path in sorted((CONFIGS_DIR / \"phase4\").glob(\"p4_*.json\")):\n",
" cfg = resolve_config(path)\n",
" phase4_configs.append({\n",
" \"run\": cfg.get(\"run_name\", path.stem),\n",
" \"backbone\": cfg.get(\"backbone\"),\n",
" \"subsample\": cfg.get(\"subsample\"),\n",
" \"image_size\": cfg.get(\"image_size\"),\n",
" \"data_dir\": cfg.get(\"data_dir\"),\n",
" \"augment\": cfg.get(\"augment\", False),\n",
" \"pretrained\": cfg.get(\"pretrained\", True),\n",
" \"epochs\": cfg.get(\"epochs\"),\n",
" })\n",
"\n",
"config_df = pd.DataFrame(phase4_configs).sort_values([\"backbone\", \"subsample\"])\n",
"display(config_df)\n",
"\n",
"matrix = config_df.pivot(index=\"backbone\", columns=\"subsample\", values=\"run\")\n",
"display(matrix)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Result gate\n",
"\n",
"The cell below checks for `p4_*` logs. If none are present, Phase 4 remains a planned experiment and the notebook stops at the design/status interpretation. If logs are added later, the same notebook will load them and produce scaling curves, source diagnostics, and report-ready conclusions from those logs.\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"No Phase 4 result logs found under classifier/outputs/logs.\n",
"Missing planned runs:\n",
"- p4_convnext_tiny_100pct\n",
"- p4_convnext_tiny_20pct\n",
"- p4_convnext_tiny_50pct\n",
"- p4_efficientnet_b0_100pct\n",
"- p4_efficientnet_b0_20pct\n",
"- p4_efficientnet_b0_50pct\n",
"- p4_resnet50_100pct\n",
"- p4_resnet50_20pct\n",
"- p4_resnet50_50pct\n"
]
}
],
"source": [
"def log_path(run_name: str) -> Path:\n",
" return LOGS_DIR / f\"{run_name}.json\"\n",
"\n",
"\n",
"def load_run_if_present(run_name: str) -> dict | None:\n",
" path = log_path(run_name)\n",
" return load_json(path) if path.exists() else None\n",
"\n",
"\n",
"def agg_metric(results: dict, metric: str, field: str = \"mean\"):\n",
" return results.get(\"aggregated_metrics\", {}).get(metric, {}).get(field, np.nan)\n",
"\n",
"\n",
"def checkpoint_mb(run_name: str) -> float:\n",
" path = MODELS_DIR / f\"{run_name}_fold0_best.pt\"\n",
" return path.stat().st_size / (1024 * 1024) if path.exists() else np.nan\n",
"\n",
"\n",
"result_rows = []\n",
"missing_runs = []\n",
"for row in phase4_configs:\n",
" run_name = row[\"run\"]\n",
" results = load_run_if_present(run_name)\n",
" if results is None:\n",
" missing_runs.append(run_name)\n",
" continue\n",
" cfg = {**row, **results.get(\"config\", {})}\n",
" result_rows.append({\n",
" \"run\": run_name,\n",
" \"backbone\": cfg.get(\"backbone\"),\n",
" \"subsample\": cfg.get(\"subsample\", row.get(\"subsample\")),\n",
" \"auc\": agg_metric(results, \"auc_roc\"),\n",
" \"auc_std\": agg_metric(results, \"auc_roc\", \"std\"),\n",
" \"accuracy\": agg_metric(results, \"accuracy\"),\n",
" \"f1\": agg_metric(results, \"f1\"),\n",
" \"checkpoint_mb\": checkpoint_mb(run_name),\n",
" })\n",
"\n",
"phase4_results_df = pd.DataFrame(result_rows)\n",
"if phase4_results_df.empty:\n",
" print(\"No Phase 4 result logs found under classifier/outputs/logs.\")\n",
" print(\"Missing planned runs:\")\n",
" for run_name in missing_runs:\n",
" print(f\"- {run_name}\")\n",
"else:\n",
" phase4_results_df = phase4_results_df.sort_values([\"backbone\", \"subsample\"])\n",
" display(\n",
" phase4_results_df.style.format({\n",
" \"subsample\": \"{:.1f}\",\n",
" \"auc\": \"{:.4f}\",\n",
" \"auc_std\": \"{:.4f}\",\n",
" \"accuracy\": \"{:.4f}\",\n",
" \"f1\": \"{:.4f}\",\n",
" \"checkpoint_mb\": \"{:.1f}\",\n",
" })\n",
" )\n",
" if missing_runs:\n",
" print(\"Some planned Phase 4 runs are still missing:\")\n",
" for run_name in missing_runs:\n",
" print(f\"- {run_name}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the status cell reports no logs, the correct interpretation is: Phase 4 has been designed but not yet analyzed. The report can describe the intended purpose, but it should not include Phase 4 performance claims.\n",
"\n",
"When logs exist, the next sections answer three questions: does more data improve each backbone, which backbone benefits most, and does scaling reduce source-specific weakness?\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Scaling curves, once results exist\n",
"\n",
"These cells are guarded. They produce figures only when at least one `p4_*` log is available.\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Skipping scaling plots because no p4 logs are available yet.\n"
]
}
],
"source": [
"if phase4_results_df.empty:\n",
" print(\"Skipping scaling plots because no p4 logs are available yet.\")\n",
"else:\n",
" fig, axes = plt.subplots(1, 3, figsize=(13, 4), sharex=True)\n",
" for ax, metric, title in zip(axes, [\"auc\", \"accuracy\", \"f1\"], [\"AUC\", \"Accuracy\", \"F1\"]):\n",
" for backbone, sub in phase4_results_df.groupby(\"backbone\"):\n",
" sub = sub.sort_values(\"subsample\")\n",
" ax.plot(sub[\"subsample\"] * 100, sub[metric], marker=\"o\", label=backbone)\n",
" ax.set_xlabel(\"Training data used (%)\")\n",
" ax.set_ylabel(title)\n",
" ax.set_title(f\"Phase 4 {title} scaling\")\n",
" ax.grid(alpha=0.25)\n",
" axes[0].legend(fontsize=8)\n",
" fig.tight_layout()\n",
" fig.savefig(FIGURES_DIR / \"07_phase4_scaling_curves.png\", dpi=180, bbox_inches=\"tight\")\n",
" plt.show()\n",
"\n",
" gains = []\n",
" for backbone, sub in phase4_results_df.groupby(\"backbone\"):\n",
" sub = sub.set_index(\"subsample\")\n",
" if 0.2 in sub.index and 1.0 in sub.index:\n",
" gains.append({\n",
" \"backbone\": backbone,\n",
" \"auc_20pct\": sub.loc[0.2, \"auc\"],\n",
" \"auc_100pct\": sub.loc[1.0, \"auc\"],\n",
" \"auc_gain\": sub.loc[1.0, \"auc\"] - sub.loc[0.2, \"auc\"],\n",
" \"accuracy_gain\": sub.loc[1.0, \"accuracy\"] - sub.loc[0.2, \"accuracy\"],\n",
" \"f1_gain\": sub.loc[1.0, \"f1\"] - sub.loc[0.2, \"f1\"],\n",
" })\n",
" gains_df = pd.DataFrame(gains)\n",
" display(gains_df.style.format({\"auc_20pct\": \"{:.4f}\", \"auc_100pct\": \"{:.4f}\", \"auc_gain\": \"{:+.4f}\", \"accuracy_gain\": \"{:+.4f}\", \"f1_gain\": \"{:+.4f}\"}))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Source diagnostics, once results exist\n",
"\n",
"The most important Phase 4 question is not only whether AUC rises. It is whether extra data improves the weak source pairs found earlier, especially source generalization around `text2img` and `insight`.\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Skipping source diagnostics because no p4 logs are available yet.\n"
]
}
],
"source": [
"def pairwise_rows(run_name: str, results: dict) -> list[dict]:\n",
" rows = []\n",
" for pair, metrics in results.get(\"aggregated_pairwise\", {}).items():\n",
" rows.append({\n",
" \"run\": run_name,\n",
" \"pair\": pair,\n",
" \"pairwise_auc\": metrics.get(\"auc_roc\", {}).get(\"mean\", np.nan),\n",
" \"pairwise_f1\": metrics.get(\"f1\", {}).get(\"mean\", np.nan),\n",
" \"pairwise_accuracy\": metrics.get(\"accuracy\", {}).get(\"mean\", np.nan),\n",
" })\n",
" return rows\n",
"\n",
"\n",
"if phase4_results_df.empty:\n",
" print(\"Skipping source diagnostics because no p4 logs are available yet.\")\n",
"else:\n",
" pair_rows = []\n",
" for _, row in phase4_results_df.iterrows():\n",
" results = load_run_if_present(row[\"run\"])\n",
" for pair_row in pairwise_rows(row[\"run\"], results):\n",
" pair_rows.append({**pair_row, \"backbone\": row[\"backbone\"], \"subsample\": row[\"subsample\"]})\n",
" pair_df = pd.DataFrame(pair_rows)\n",
" display(pair_df.sort_values([\"pair\", \"backbone\", \"subsample\"]).style.format({\"subsample\": \"{:.1f}\", \"pairwise_auc\": \"{:.4f}\", \"pairwise_f1\": \"{:.4f}\", \"pairwise_accuracy\": \"{:.4f}\"}))\n",
"\n",
" for pair in sorted(pair_df[\"pair\"].unique()):\n",
" fig, ax = plt.subplots(figsize=(6, 3.6))\n",
" sub_pair = pair_df[pair_df[\"pair\"] == pair]\n",
" for backbone, sub in sub_pair.groupby(\"backbone\"):\n",
" sub = sub.sort_values(\"subsample\")\n",
" ax.plot(sub[\"subsample\"] * 100, sub[\"pairwise_auc\"], marker=\"o\", label=backbone)\n",
" ax.set_title(f\"Phase 4 source-pair scaling: {pair}\")\n",
" ax.set_xlabel(\"Training data used (%)\")\n",
" ax.set_ylabel(\"Pairwise AUC\")\n",
" ax.grid(alpha=0.25)\n",
" ax.legend(fontsize=8)\n",
" fig.tight_layout()\n",
" fig.savefig(FIGURES_DIR / f\"07_phase4_{pair}_scaling.png\", dpi=180, bbox_inches=\"tight\")\n",
" plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusion\n",
"\n",
"At the time this notebook was added, Phase 4 is a planned data-scaling analysis rather than a completed result chapter. The configs define a clean experiment: take the strongest Phase 3 families and train them at 20%, 50%, and 100% of the available facecropped data.\n",
"\n",
"The report should not make positive or negative Phase 4 performance claims until `p4_*` logs are present. Once those logs exist, the key result to look for is not just a higher global AUC. The stronger claim would be that more data also improves pairwise source behavior, especially for the sources that exposed generalization limits earlier.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
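The `resolve_config` helper in the new notebook applies an `extends` parent before the child's own keys. The sketch below reproduces that logic on two temporary files to show the merge behaviour. The file names and config values are hypothetical and chosen for illustration only; the project's own loader, covered by `classifier/tests/`, may behave differently.

```python
import json
import tempfile
from pathlib import Path


def resolve_config(path: Path) -> dict:
    """Mirror of the notebook helper: load a config, then layer it over its parent."""
    cfg = json.loads(path.read_text(encoding="utf-8"))
    parent = cfg.pop("extends", None)
    if parent:
        base = resolve_config(path.parent / parent)
        base.update(cfg)  # shallow merge: keys set in the child win
        cfg = base
    return cfg


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical base and child configs, not files from the repository.
    (root / "base.json").write_text(
        json.dumps({"backbone": "resnet50", "image_size": 224, "subsample": 0.2, "epochs": 15}),
        encoding="utf-8",
    )
    (root / "child.json").write_text(
        json.dumps({"extends": "base.json", "run_name": "demo_50pct", "subsample": 0.5}),
        encoding="utf-8",
    )
    resolved = resolve_config(root / "child.json")

print(resolved)
```

Because `dict.update` is a shallow merge, a nested dictionary in the child replaces the parent's nested dictionary as a whole rather than merging key by key. This matters only if the Phase 4 configs contain nested sections.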
@@ -6,33 +6,32 @@
{
"choice": "input size",
"decision": "224x224",
-"evidence": "ResNet18 improves from 0.9366 to 0.9660 AUC.",
+"evidence": "ResNet18 improves from AUC 0.9366 to 0.9660.",
"confidence": "high"
},
{
"choice": "face crop",
"decision": "enable",
-"evidence": "Best run is p2c_resnet18_facecrop with AUC 0.9755.",
+"evidence": "Best Phase 2 run is p2c_resnet18_facecrop with AUC 0.9755.",
"confidence": "medium-high"
},
{
"choice": "augmentation",
"decision": "disable for current 20% setting",
-"evidence": "p2e_resnet18_facecrop_aug is 0.9737, below facecrop-only 0.9755; SimpleCNN drops sharply.",
+"evidence": "p2e_resnet18_facecrop_aug reaches AUC 0.9737, below facecrop-only 0.9755; SimpleCNN drops sharply.",
"confidence": "low"
},
{
"choice": "normalization",
"decision": "ImageNet/default",
-"evidence": "real_norm is only +0.0018 and is less aligned with pretrained weights.",
+"evidence": "real_norm is only +0.0018 AUC and is less aligned with pretrained ImageNet weights.",
"confidence": "medium"
},
{
"choice": "source generalization",
"decision": "report as limitation and diagnostic target",
-"evidence": "Holdout text2img and insight pairwise AUC drop to 0.7595 and 0.8421.",
+"evidence": "Holding out text2img and insight drops pairwise AUC to 0.7595 and 0.8421.",
"confidence": "high"
}
-],
-"note": "Generated by 04_phase2_analysis.ipynb when this cell is executed."
+]
}
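The decision records in the JSON file above share four fields: `choice`, `decision`, `evidence`, and `confidence`. As a hedged sketch of how a report script might consume them, the snippet below orders the records by confidence. The ranking of the confidence labels and the in-memory list are assumptions made for illustration; the hunk does not show the file's top-level key, so the records are inlined here rather than loaded from disk.

```python
# Records copied from the diff (evidence strings omitted for brevity).
records = [
    {"choice": "input size", "decision": "224x224", "confidence": "high"},
    {"choice": "face crop", "decision": "enable", "confidence": "medium-high"},
    {"choice": "augmentation", "decision": "disable for current 20% setting", "confidence": "low"},
    {"choice": "normalization", "decision": "ImageNet/default", "confidence": "medium"},
    {"choice": "source generalization",
     "decision": "report as limitation and diagnostic target", "confidence": "high"},
]

# Assumed ordering of the confidence labels; not defined anywhere in the source.
CONFIDENCE_RANK = {"low": 0, "medium": 1, "medium-high": 2, "high": 3}

# sorted() is stable, so records with equal confidence keep their file order.
ordered = sorted(records, key=lambda r: CONFIDENCE_RANK[r["confidence"]], reverse=True)

for rec in ordered:
    print(f"[{rec['confidence']:>11}] {rec['choice']}: {rec['decision']}")
```

Listing the low-confidence choices last makes it easy to see which Phase 2 decisions, such as disabling augmentation, are the most likely to be revisited once more data is available.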
Binary file not shown. After: size 58 KiB
Binary file not shown. After: size 69 KiB
Binary file not shown. After: size 82 KiB
Binary file not shown. After: size 79 KiB
Binary file not shown. After: size 79 KiB
Binary file not shown. After: size 79 KiB
Binary file not shown. After: size 56 KiB
Binary file not shown. After: size 134 KiB
Binary file not shown. After: size 67 KiB