DRL_PROJ/classifier/notebooks/07_phase4_data_scaling_analysis.ipynb
2026-05-06 20:31:07 +01:00

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 07 - Phase 4 Data-Scaling Analysis\n",
"\n",
"Phase 4 asks the natural next question after Phase 3: once the best model families have been identified at the 20% data setting, does more training data improve performance and source generalization?\n",
"\n",
"In the repository state used to create this notebook, Phase 4 configs exist but no `p4_*` logs or checkpoints are present under `classifier/outputs`. For that reason, this notebook is result-gated: it documents the planned experiment matrix now, and the analysis cells automatically switch on when the corresponding logs are added later.\n",
"\n",
"No Phase 4 metric is claimed unless it is loaded from an existing `classifier/outputs/logs/p4_*.json` file.\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Project root: c:\\Users\\diogo\\Documents\\MIA_UP\\2_Semestre\\DRL\\DRL_2\\DRL_PROJ\n"
]
}
],
"source": [
"from __future__ import annotations\n",
"\n",
"import json\n",
"import sys\n",
"from pathlib import Path\n",
"\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"\n",
"def find_project_root(start: Path | None = None) -> Path:\n",
"    \"\"\"Find DRL_PROJ whether the notebook runs from repo root or classifier/notebooks.\"\"\"\n",
"    start = Path.cwd() if start is None else Path(start)\n",
"    for candidate in [start, *start.parents]:\n",
"        if (candidate / \"classifier\").is_dir() and (candidate / \"docs\" / \"DRL_Project.md\").exists():\n",
"            return candidate\n",
"    raise RuntimeError(\"Could not find DRL_PROJ root. Run this notebook from inside the repository.\")\n",
"\n",
"\n",
"PROJECT_ROOT = find_project_root()\n",
"CLASSIFIER_ROOT = PROJECT_ROOT / \"classifier\"\n",
"if str(CLASSIFIER_ROOT) not in sys.path:\n",
"    sys.path.insert(0, str(CLASSIFIER_ROOT))\n",
"\n",
"CONFIGS_DIR = CLASSIFIER_ROOT / \"configs\"\n",
"LOGS_DIR = CLASSIFIER_ROOT / \"outputs\" / \"logs\"\n",
"MODELS_DIR = CLASSIFIER_ROOT / \"outputs\" / \"models\"\n",
"FIGURES_DIR = CLASSIFIER_ROOT / \"outputs\" / \"figures\"\n",
"ANALYSIS_DIR = CLASSIFIER_ROOT / \"outputs\" / \"analysis\"\n",
"FIGURES_DIR.mkdir(parents=True, exist_ok=True)\n",
"ANALYSIS_DIR.mkdir(parents=True, exist_ok=True)\n",
"\n",
"print(f\"Project root: {PROJECT_ROOT}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Planned Phase 4 matrix\n",
"\n",
"The Phase 4 configs keep the Phase 3 preprocessing setup: pretrained backbones, 224x224 input, face-cropped classifier data, no augmentation, and the same cross-validation protocol. Everything else is held constant; the only variable that changes is the training-data fraction (`subsample`): 20%, 50%, and 100%.\n",
"\n",
"The model families selected for scaling are ResNet50, EfficientNet-B0, and ConvNeXt-Tiny. They are the strongest Phase 3 families and represent different capacity/efficiency tradeoffs.\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>run</th>\n",
" <th>backbone</th>\n",
" <th>subsample</th>\n",
" <th>image_size</th>\n",
" <th>data_dir</th>\n",
" <th>augment</th>\n",
" <th>pretrained</th>\n",
" <th>epochs</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>p4_convnext_tiny_20pct</td>\n",
" <td>convnext_tiny</td>\n",
" <td>0.2</td>\n",
" <td>224</td>\n",
" <td>cropped/classifier</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>15</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>p4_convnext_tiny_50pct</td>\n",
" <td>convnext_tiny</td>\n",
" <td>0.5</td>\n",
" <td>224</td>\n",
" <td>cropped/classifier</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>15</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>p4_convnext_tiny_100pct</td>\n",
" <td>convnext_tiny</td>\n",
" <td>1.0</td>\n",
" <td>224</td>\n",
" <td>cropped/classifier</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>15</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>p4_efficientnet_b0_20pct</td>\n",
" <td>efficientnet_b0</td>\n",
" <td>0.2</td>\n",
" <td>224</td>\n",
" <td>cropped/classifier</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>15</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>p4_efficientnet_b0_50pct</td>\n",
" <td>efficientnet_b0</td>\n",
" <td>0.5</td>\n",
" <td>224</td>\n",
" <td>cropped/classifier</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>15</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>p4_efficientnet_b0_100pct</td>\n",
" <td>efficientnet_b0</td>\n",
" <td>1.0</td>\n",
" <td>224</td>\n",
" <td>cropped/classifier</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>15</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>p4_resnet50_20pct</td>\n",
" <td>resnet50</td>\n",
" <td>0.2</td>\n",
" <td>224</td>\n",
" <td>cropped/classifier</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>15</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>p4_resnet50_50pct</td>\n",
" <td>resnet50</td>\n",
" <td>0.5</td>\n",
" <td>224</td>\n",
" <td>cropped/classifier</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>15</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>p4_resnet50_100pct</td>\n",
" <td>resnet50</td>\n",
" <td>1.0</td>\n",
" <td>224</td>\n",
" <td>cropped/classifier</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>15</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"                         run         backbone  subsample  image_size  \\\n",
"1     p4_convnext_tiny_20pct    convnext_tiny        0.2         224   \n",
"2     p4_convnext_tiny_50pct    convnext_tiny        0.5         224   \n",
"0    p4_convnext_tiny_100pct    convnext_tiny        1.0         224   \n",
"4   p4_efficientnet_b0_20pct  efficientnet_b0        0.2         224   \n",
"5   p4_efficientnet_b0_50pct  efficientnet_b0        0.5         224   \n",
"3  p4_efficientnet_b0_100pct  efficientnet_b0        1.0         224   \n",
"7          p4_resnet50_20pct         resnet50        0.2         224   \n",
"8          p4_resnet50_50pct         resnet50        0.5         224   \n",
"6         p4_resnet50_100pct         resnet50        1.0         224   \n",
"\n",
"             data_dir  augment  pretrained  epochs  \n",
"1  cropped/classifier    False        True      15  \n",
"2  cropped/classifier    False        True      15  \n",
"0  cropped/classifier    False        True      15  \n",
"4  cropped/classifier    False        True      15  \n",
"5  cropped/classifier    False        True      15  \n",
"3  cropped/classifier    False        True      15  \n",
"7  cropped/classifier    False        True      15  \n",
"8  cropped/classifier    False        True      15  \n",
"6  cropped/classifier    False        True      15  "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>subsample</th>\n",
" <th>0.2</th>\n",
" <th>0.5</th>\n",
" <th>1.0</th>\n",
" </tr>\n",
" <tr>\n",
" <th>backbone</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>convnext_tiny</th>\n",
" <td>p4_convnext_tiny_20pct</td>\n",
" <td>p4_convnext_tiny_50pct</td>\n",
" <td>p4_convnext_tiny_100pct</td>\n",
" </tr>\n",
" <tr>\n",
" <th>efficientnet_b0</th>\n",
" <td>p4_efficientnet_b0_20pct</td>\n",
" <td>p4_efficientnet_b0_50pct</td>\n",
" <td>p4_efficientnet_b0_100pct</td>\n",
" </tr>\n",
" <tr>\n",
" <th>resnet50</th>\n",
" <td>p4_resnet50_20pct</td>\n",
" <td>p4_resnet50_50pct</td>\n",
" <td>p4_resnet50_100pct</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"subsample                             0.2                       0.5  \\\n",
"backbone                                                              \n",
"convnext_tiny      p4_convnext_tiny_20pct    p4_convnext_tiny_50pct   \n",
"efficientnet_b0  p4_efficientnet_b0_20pct  p4_efficientnet_b0_50pct   \n",
"resnet50                p4_resnet50_20pct         p4_resnet50_50pct   \n",
"\n",
"subsample                              1.0  \n",
"backbone                                    \n",
"convnext_tiny      p4_convnext_tiny_100pct  \n",
"efficientnet_b0  p4_efficientnet_b0_100pct  \n",
"resnet50                p4_resnet50_100pct  "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"def load_json(path: Path) -> dict:\n",
"    return json.loads(path.read_text(encoding=\"utf-8\"))\n",
"\n",
"\n",
"def resolve_config(path: Path) -> dict:\n",
"    cfg = load_json(path)\n",
"    parent = cfg.pop(\"extends\", None)\n",
"    if parent:\n",
"        base = resolve_config(path.parent / parent)\n",
"        base.update(cfg)\n",
"        cfg = base\n",
"    return cfg\n",
"\n",
"\n",
"phase4_configs = []\n",
"for path in sorted((CONFIGS_DIR / \"phase4\").glob(\"p4_*.json\")):\n",
"    cfg = resolve_config(path)\n",
"    phase4_configs.append({\n",
"        \"run\": cfg.get(\"run_name\", path.stem),\n",
"        \"backbone\": cfg.get(\"backbone\"),\n",
"        \"subsample\": cfg.get(\"subsample\"),\n",
"        \"image_size\": cfg.get(\"image_size\"),\n",
"        \"data_dir\": cfg.get(\"data_dir\"),\n",
"        \"augment\": cfg.get(\"augment\", False),\n",
"        \"pretrained\": cfg.get(\"pretrained\", True),\n",
"        \"epochs\": cfg.get(\"epochs\"),\n",
"    })\n",
"\n",
"config_df = pd.DataFrame(phase4_configs).sort_values([\"backbone\", \"subsample\"])\n",
"display(config_df)\n",
"\n",
"matrix = config_df.pivot(index=\"backbone\", columns=\"subsample\", values=\"run\")\n",
"display(matrix)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Result gate\n",
"\n",
"The cell below checks for `p4_*` logs. If none are present, Phase 4 remains a planned experiment and the notebook stops after describing the design and the run status. If logs are added later, the same notebook will load them and produce scaling curves, source diagnostics, and report-ready conclusions from those logs.\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"No Phase 4 result logs found under classifier/outputs/logs.\n",
"Missing planned runs:\n",
"- p4_convnext_tiny_100pct\n",
"- p4_convnext_tiny_20pct\n",
"- p4_convnext_tiny_50pct\n",
"- p4_efficientnet_b0_100pct\n",
"- p4_efficientnet_b0_20pct\n",
"- p4_efficientnet_b0_50pct\n",
"- p4_resnet50_100pct\n",
"- p4_resnet50_20pct\n",
"- p4_resnet50_50pct\n"
]
}
],
"source": [
"def log_path(run_name: str) -> Path:\n",
"    return LOGS_DIR / f\"{run_name}.json\"\n",
"\n",
"\n",
"def load_run_if_present(run_name: str) -> dict | None:\n",
"    path = log_path(run_name)\n",
"    return load_json(path) if path.exists() else None\n",
"\n",
"\n",
"def agg_metric(results: dict, metric: str, field: str = \"mean\"):\n",
"    return results.get(\"aggregated_metrics\", {}).get(metric, {}).get(field, np.nan)\n",
"\n",
"\n",
"def checkpoint_mb(run_name: str) -> float:\n",
"    path = MODELS_DIR / f\"{run_name}_fold0_best.pt\"\n",
"    return path.stat().st_size / (1024 * 1024) if path.exists() else np.nan\n",
"\n",
"\n",
"result_rows = []\n",
"missing_runs = []\n",
"for row in phase4_configs:\n",
"    run_name = row[\"run\"]\n",
"    results = load_run_if_present(run_name)\n",
"    if results is None:\n",
"        missing_runs.append(run_name)\n",
"        continue\n",
"    cfg = {**row, **results.get(\"config\", {})}\n",
"    result_rows.append({\n",
"        \"run\": run_name,\n",
"        \"backbone\": cfg.get(\"backbone\"),\n",
"        \"subsample\": cfg.get(\"subsample\", row.get(\"subsample\")),\n",
"        \"auc\": agg_metric(results, \"auc_roc\"),\n",
"        \"auc_std\": agg_metric(results, \"auc_roc\", \"std\"),\n",
"        \"accuracy\": agg_metric(results, \"accuracy\"),\n",
"        \"f1\": agg_metric(results, \"f1\"),\n",
"        \"checkpoint_mb\": checkpoint_mb(run_name),\n",
"    })\n",
"\n",
"phase4_results_df = pd.DataFrame(result_rows)\n",
"if phase4_results_df.empty:\n",
"    print(\"No Phase 4 result logs found under classifier/outputs/logs.\")\n",
"    print(\"Missing planned runs:\")\n",
"    for run_name in missing_runs:\n",
"        print(f\"- {run_name}\")\n",
"else:\n",
"    phase4_results_df = phase4_results_df.sort_values([\"backbone\", \"subsample\"])\n",
"    display(\n",
"        phase4_results_df.style.format({\n",
"            \"subsample\": \"{:.1f}\",\n",
"            \"auc\": \"{:.4f}\",\n",
"            \"auc_std\": \"{:.4f}\",\n",
"            \"accuracy\": \"{:.4f}\",\n",
"            \"f1\": \"{:.4f}\",\n",
"            \"checkpoint_mb\": \"{:.1f}\",\n",
"        })\n",
"    )\n",
"    if missing_runs:\n",
"        print(\"Some planned Phase 4 runs are still missing:\")\n",
"        for run_name in missing_runs:\n",
"            print(f\"- {run_name}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the status cell reports no logs, the correct interpretation is: Phase 4 has been designed but not yet analyzed. The report can describe the intended purpose, but it should not include Phase 4 performance claims.\n",
"\n",
"When logs exist, the next sections answer three questions: does more data improve each backbone, which backbone benefits most, and does scaling reduce source-specific weakness?\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Scaling curves, once results exist\n",
"\n",
"The cell below is guarded. It produces figures only when at least one `p4_*` log is available, and the gain table lists a backbone only when both its 20% and 100% runs are present.\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Skipping scaling plots because no p4 logs are available yet.\n"
]
}
],
"source": [
"if phase4_results_df.empty:\n",
"    print(\"Skipping scaling plots because no p4 logs are available yet.\")\n",
"else:\n",
"    fig, axes = plt.subplots(1, 3, figsize=(13, 4), sharex=True)\n",
"    for ax, metric, title in zip(axes, [\"auc\", \"accuracy\", \"f1\"], [\"AUC\", \"Accuracy\", \"F1\"]):\n",
"        for backbone, sub in phase4_results_df.groupby(\"backbone\"):\n",
"            sub = sub.sort_values(\"subsample\")\n",
"            ax.plot(sub[\"subsample\"] * 100, sub[metric], marker=\"o\", label=backbone)\n",
"        ax.set_xlabel(\"Training data used (%)\")\n",
"        ax.set_ylabel(title)\n",
"        ax.set_title(f\"Phase 4 {title} scaling\")\n",
"        ax.grid(alpha=0.25)\n",
"    axes[0].legend(fontsize=8)\n",
"    fig.tight_layout()\n",
"    fig.savefig(FIGURES_DIR / \"07_phase4_scaling_curves.png\", dpi=180, bbox_inches=\"tight\")\n",
"    plt.show()\n",
"\n",
"    gains = []\n",
"    for backbone, sub in phase4_results_df.groupby(\"backbone\"):\n",
"        sub = sub.set_index(\"subsample\")\n",
"        if 0.2 in sub.index and 1.0 in sub.index:\n",
"            gains.append({\n",
"                \"backbone\": backbone,\n",
"                \"auc_20pct\": sub.loc[0.2, \"auc\"],\n",
"                \"auc_100pct\": sub.loc[1.0, \"auc\"],\n",
"                \"auc_gain\": sub.loc[1.0, \"auc\"] - sub.loc[0.2, \"auc\"],\n",
"                \"accuracy_gain\": sub.loc[1.0, \"accuracy\"] - sub.loc[0.2, \"accuracy\"],\n",
"                \"f1_gain\": sub.loc[1.0, \"f1\"] - sub.loc[0.2, \"f1\"],\n",
"            })\n",
"    gains_df = pd.DataFrame(gains)\n",
"    display(gains_df.style.format({\"auc_20pct\": \"{:.4f}\", \"auc_100pct\": \"{:.4f}\", \"auc_gain\": \"{:+.4f}\", \"accuracy_gain\": \"{:+.4f}\", \"f1_gain\": \"{:+.4f}\"}))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Source diagnostics, once results exist\n",
"\n",
"The key Phase 4 question is not just whether global AUC rises, but whether extra data improves the weak source pairs found earlier, especially generalization on the `text2img` and `insight` sources.\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Skipping source diagnostics because no p4 logs are available yet.\n"
]
}
],
"source": [
"def pairwise_rows(run_name: str, results: dict) -> list[dict]:\n",
"    rows = []\n",
"    for pair, metrics in results.get(\"aggregated_pairwise\", {}).items():\n",
"        rows.append({\n",
"            \"run\": run_name,\n",
"            \"pair\": pair,\n",
"            \"pairwise_auc\": metrics.get(\"auc_roc\", {}).get(\"mean\", np.nan),\n",
"            \"pairwise_f1\": metrics.get(\"f1\", {}).get(\"mean\", np.nan),\n",
"            \"pairwise_accuracy\": metrics.get(\"accuracy\", {}).get(\"mean\", np.nan),\n",
"        })\n",
"    return rows\n",
"\n",
"\n",
"if phase4_results_df.empty:\n",
"    print(\"Skipping source diagnostics because no p4 logs are available yet.\")\n",
"else:\n",
"    pair_rows = []\n",
"    for _, row in phase4_results_df.iterrows():\n",
"        results = load_run_if_present(row[\"run\"])\n",
"        for pair_row in pairwise_rows(row[\"run\"], results):\n",
"            pair_rows.append({**pair_row, \"backbone\": row[\"backbone\"], \"subsample\": row[\"subsample\"]})\n",
"    pair_df = pd.DataFrame(pair_rows)\n",
"    if pair_df.empty:\n",
"        # Logs without an aggregated_pairwise block yield no rows, so there are no columns to sort or plot.\n",
"        print(\"Loaded p4 logs contain no aggregated_pairwise metrics; skipping source diagnostics.\")\n",
"    else:\n",
"        display(pair_df.sort_values([\"pair\", \"backbone\", \"subsample\"]).style.format({\"subsample\": \"{:.1f}\", \"pairwise_auc\": \"{:.4f}\", \"pairwise_f1\": \"{:.4f}\", \"pairwise_accuracy\": \"{:.4f}\"}))\n",
"\n",
"        for pair in sorted(pair_df[\"pair\"].unique()):\n",
"            fig, ax = plt.subplots(figsize=(6, 3.6))\n",
"            sub_pair = pair_df[pair_df[\"pair\"] == pair]\n",
"            for backbone, sub in sub_pair.groupby(\"backbone\"):\n",
"                sub = sub.sort_values(\"subsample\")\n",
"                ax.plot(sub[\"subsample\"] * 100, sub[\"pairwise_auc\"], marker=\"o\", label=backbone)\n",
"            ax.set_title(f\"Phase 4 source-pair scaling: {pair}\")\n",
"            ax.set_xlabel(\"Training data used (%)\")\n",
"            ax.set_ylabel(\"Pairwise AUC\")\n",
"            ax.grid(alpha=0.25)\n",
"            ax.legend(fontsize=8)\n",
"            fig.tight_layout()\n",
"            fig.savefig(FIGURES_DIR / f\"07_phase4_{pair}_scaling.png\", dpi=180, bbox_inches=\"tight\")\n",
"            plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusion\n",
"\n",
"At the time this notebook was added, Phase 4 was a planned data-scaling analysis rather than a completed result chapter. The configs define a clean experiment: take the strongest Phase 3 families and train them at 20%, 50%, and 100% of the available face-cropped data.\n",
"\n",
"The report should not make positive or negative Phase 4 performance claims until `p4_*` logs are present. Once those logs exist, the key result to look for is not just a higher global AUC. The stronger claim would be that more data also improves pairwise source behavior, especially for the sources that exposed generalization limits earlier.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}