{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 07 - Phase 4 Data-Scaling Analysis\n",
    "\n",
    "Phase 4 addresses the natural follow-up to Phase 3: once the best model families have been identified at the 20% data setting, does more training data improve performance and source generalization?\n",
    "\n",
    "In the repository state used to create this notebook, Phase 4 configs exist but no `p4_*` logs or checkpoints are present under `classifier/outputs`. For that reason, this notebook is result-gated: it documents the planned experiment matrix now, and the analysis cells activate automatically once the corresponding logs are added.\n",
    "\n",
    "No Phase 4 metric is claimed unless it is loaded from an existing `classifier/outputs/logs/p4_*.json` file.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Project root: c:\\Users\\diogo\\Documents\\MIA_UP\\2_Semestre\\DRL\\DRL_2\\DRL_PROJ\n"
     ]
    }
   ],
   "source": [
    "from __future__ import annotations\n",
    "\n",
    "import json\n",
    "import sys\n",
    "from pathlib import Path\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "\n",
    "def find_project_root(start: Path | None = None) -> Path:\n",
    "    \"\"\"Find DRL_PROJ whether the notebook runs from repo root or classifier/notebooks.\"\"\"\n",
    "    start = Path.cwd() if start is None else Path(start)\n",
    "    for candidate in [start, *start.parents]:\n",
    "        if (candidate / \"classifier\").is_dir() and (candidate / \"docs\" / \"DRL_Project.md\").exists():\n",
    "            return candidate\n",
    "    raise RuntimeError(\"Could not find DRL_PROJ root. Run this notebook from inside the repository.\")\n",
    "\n",
    "\n",
    "PROJECT_ROOT = find_project_root()\n",
    "CLASSIFIER_ROOT = PROJECT_ROOT / \"classifier\"\n",
    "if str(CLASSIFIER_ROOT) not in sys.path:\n",
    "    sys.path.insert(0, str(CLASSIFIER_ROOT))\n",
    "\n",
    "CONFIGS_DIR = CLASSIFIER_ROOT / \"configs\"\n",
    "LOGS_DIR = CLASSIFIER_ROOT / \"outputs\" / \"logs\"\n",
    "MODELS_DIR = CLASSIFIER_ROOT / \"outputs\" / \"models\"\n",
    "FIGURES_DIR = CLASSIFIER_ROOT / \"outputs\" / \"figures\"\n",
    "ANALYSIS_DIR = CLASSIFIER_ROOT / \"outputs\" / \"analysis\"\n",
    "FIGURES_DIR.mkdir(parents=True, exist_ok=True)\n",
    "ANALYSIS_DIR.mkdir(parents=True, exist_ok=True)\n",
    "\n",
    "print(f\"Project root: {PROJECT_ROOT}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Planned Phase 4 matrix\n",
    "\n",
    "The Phase 4 configs keep the Phase 3 preprocessing setup: pretrained backbones, 224x224 input, face-cropped classifier data, no augmentation, and the same cross-validation protocol. The only variable that changes is the data fraction: 20%, 50%, and 100%.\n",
    "\n",
    "The model families selected for scaling are ResNet50, EfficientNet-B0, and ConvNeXt-Tiny. They are the strongest Phase 3 families and represent different capacity/efficiency tradeoffs.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "
<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\"><th></th><th>run</th><th>backbone</th><th>subsample</th><th>image_size</th><th>data_dir</th><th>augment</th><th>pretrained</th><th>epochs</th></tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr><th>1</th><td>p4_convnext_tiny_20pct</td><td>convnext_tiny</td><td>0.2</td><td>224</td><td>cropped/classifier</td><td>False</td><td>True</td><td>15</td></tr>\n",
       "    <tr><th>2</th><td>p4_convnext_tiny_50pct</td><td>convnext_tiny</td><td>0.5</td><td>224</td><td>cropped/classifier</td><td>False</td><td>True</td><td>15</td></tr>\n",
       "    <tr><th>0</th><td>p4_convnext_tiny_100pct</td><td>convnext_tiny</td><td>1.0</td><td>224</td><td>cropped/classifier</td><td>False</td><td>True</td><td>15</td></tr>\n",
       "    <tr><th>4</th><td>p4_efficientnet_b0_20pct</td><td>efficientnet_b0</td><td>0.2</td><td>224</td><td>cropped/classifier</td><td>False</td><td>True</td><td>15</td></tr>\n",
       "    <tr><th>5</th><td>p4_efficientnet_b0_50pct</td><td>efficientnet_b0</td><td>0.5</td><td>224</td><td>cropped/classifier</td><td>False</td><td>True</td><td>15</td></tr>\n",
       "    <tr><th>3</th><td>p4_efficientnet_b0_100pct</td><td>efficientnet_b0</td><td>1.0</td><td>224</td><td>cropped/classifier</td><td>False</td><td>True</td><td>15</td></tr>\n",
       "    <tr><th>7</th><td>p4_resnet50_20pct</td><td>resnet50</td><td>0.2</td><td>224</td><td>cropped/classifier</td><td>False</td><td>True</td><td>15</td></tr>\n",
       "    <tr><th>8</th><td>p4_resnet50_50pct</td><td>resnet50</td><td>0.5</td><td>224</td><td>cropped/classifier</td><td>False</td><td>True</td><td>15</td></tr>\n",
       "    <tr><th>6</th><td>p4_resnet50_100pct</td><td>resnet50</td><td>1.0</td><td>224</td><td>cropped/classifier</td><td>False</td><td>True</td><td>15</td></tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                         run         backbone  subsample  image_size  \\\n",
       "1     p4_convnext_tiny_20pct    convnext_tiny        0.2         224   \n",
       "2     p4_convnext_tiny_50pct    convnext_tiny        0.5         224   \n",
       "0    p4_convnext_tiny_100pct    convnext_tiny        1.0         224   \n",
       "4   p4_efficientnet_b0_20pct  efficientnet_b0        0.2         224   \n",
       "5   p4_efficientnet_b0_50pct  efficientnet_b0        0.5         224   \n",
       "3  p4_efficientnet_b0_100pct  efficientnet_b0        1.0         224   \n",
       "7          p4_resnet50_20pct         resnet50        0.2         224   \n",
       "8          p4_resnet50_50pct         resnet50        0.5         224   \n",
       "6         p4_resnet50_100pct         resnet50        1.0         224   \n",
       "\n",
       "             data_dir  augment  pretrained  epochs  \n",
       "1  cropped/classifier    False        True      15  \n",
       "2  cropped/classifier    False        True      15  \n",
       "0  cropped/classifier    False        True      15  \n",
       "4  cropped/classifier    False        True      15  \n",
       "5  cropped/classifier    False        True      15  \n",
       "3  cropped/classifier    False        True      15  \n",
       "7  cropped/classifier    False        True      15  \n",
       "8  cropped/classifier    False        True      15  \n",
       "6  cropped/classifier    False        True      15  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "
<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\"><th>subsample</th><th>0.2</th><th>0.5</th><th>1.0</th></tr>\n",
       "    <tr><th>backbone</th><th></th><th></th><th></th></tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr><th>convnext_tiny</th><td>p4_convnext_tiny_20pct</td><td>p4_convnext_tiny_50pct</td><td>p4_convnext_tiny_100pct</td></tr>\n",
       "    <tr><th>efficientnet_b0</th><td>p4_efficientnet_b0_20pct</td><td>p4_efficientnet_b0_50pct</td><td>p4_efficientnet_b0_100pct</td></tr>\n",
       "    <tr><th>resnet50</th><td>p4_resnet50_20pct</td><td>p4_resnet50_50pct</td><td>p4_resnet50_100pct</td></tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "subsample                             0.2                       0.5  \\\n",
       "backbone                                                              \n",
       "convnext_tiny      p4_convnext_tiny_20pct    p4_convnext_tiny_50pct   \n",
       "efficientnet_b0  p4_efficientnet_b0_20pct  p4_efficientnet_b0_50pct   \n",
       "resnet50                p4_resnet50_20pct         p4_resnet50_50pct   \n",
       "\n",
       "subsample                              1.0  \n",
       "backbone                                    \n",
       "convnext_tiny      p4_convnext_tiny_100pct  \n",
       "efficientnet_b0  p4_efficientnet_b0_100pct  \n",
       "resnet50                p4_resnet50_100pct  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "def load_json(path: Path) -> dict:\n",
    "    return json.loads(path.read_text(encoding=\"utf-8\"))\n",
    "\n",
    "\n",
    "def resolve_config(path: Path) -> dict:\n",
    "    cfg = load_json(path)\n",
    "    parent = cfg.pop(\"extends\", None)\n",
    "    if parent:\n",
    "        base = resolve_config(path.parent / parent)\n",
    "        base.update(cfg)\n",
    "        cfg = base\n",
    "    return cfg\n",
    "\n",
    "\n",
    "phase4_configs = []\n",
    "for path in sorted((CONFIGS_DIR / \"phase4\").glob(\"p4_*.json\")):\n",
    "    cfg = resolve_config(path)\n",
    "    phase4_configs.append({\n",
    "        \"run\": cfg.get(\"run_name\", path.stem),\n",
    "        \"backbone\": cfg.get(\"backbone\"),\n",
    "        \"subsample\": cfg.get(\"subsample\"),\n",
    "        \"image_size\": cfg.get(\"image_size\"),\n",
    "        \"data_dir\": cfg.get(\"data_dir\"),\n",
    "        \"augment\": cfg.get(\"augment\", False),\n",
    "        \"pretrained\": cfg.get(\"pretrained\", True),\n",
    "        \"epochs\": cfg.get(\"epochs\"),\n",
    "    })\n",
    "\n",
    "config_df = pd.DataFrame(phase4_configs).sort_values([\"backbone\", \"subsample\"])\n",
    "display(config_df)\n",
    "\n",
    "matrix = config_df.pivot(index=\"backbone\", columns=\"subsample\", values=\"run\")\n",
    "display(matrix)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Result gate\n",
    "\n",
    "The cell below checks for `p4_*` logs. If none are present, Phase 4 remains a planned experiment and the notebook stops at the design/status interpretation. 
If logs are added later, the same notebook will load them and produce scaling curves, source diagnostics, and report-ready conclusions from those logs.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "No Phase 4 result logs found under classifier/outputs/logs.\n",
      "Missing planned runs:\n",
      "- p4_convnext_tiny_100pct\n",
      "- p4_convnext_tiny_20pct\n",
      "- p4_convnext_tiny_50pct\n",
      "- p4_efficientnet_b0_100pct\n",
      "- p4_efficientnet_b0_20pct\n",
      "- p4_efficientnet_b0_50pct\n",
      "- p4_resnet50_100pct\n",
      "- p4_resnet50_20pct\n",
      "- p4_resnet50_50pct\n"
     ]
    }
   ],
   "source": [
    "def log_path(run_name: str) -> Path:\n",
    "    return LOGS_DIR / f\"{run_name}.json\"\n",
    "\n",
    "\n",
    "def load_run_if_present(run_name: str) -> dict | None:\n",
    "    path = log_path(run_name)\n",
    "    return load_json(path) if path.exists() else None\n",
    "\n",
    "\n",
    "def agg_metric(results: dict, metric: str, field: str = \"mean\"):\n",
    "    return results.get(\"aggregated_metrics\", {}).get(metric, {}).get(field, np.nan)\n",
    "\n",
    "\n",
    "def checkpoint_mb(run_name: str) -> float:\n",
    "    path = MODELS_DIR / f\"{run_name}_fold0_best.pt\"\n",
    "    return path.stat().st_size / (1024 * 1024) if path.exists() else np.nan\n",
    "\n",
    "\n",
    "result_rows = []\n",
    "missing_runs = []\n",
    "for row in phase4_configs:\n",
    "    run_name = row[\"run\"]\n",
    "    results = load_run_if_present(run_name)\n",
    "    if results is None:\n",
    "        missing_runs.append(run_name)\n",
    "        continue\n",
    "    cfg = {**row, **results.get(\"config\", {})}\n",
    "    result_rows.append({\n",
    "        \"run\": run_name,\n",
    "        \"backbone\": cfg.get(\"backbone\"),\n",
    "        \"subsample\": cfg.get(\"subsample\", row.get(\"subsample\")),\n",
    "        \"auc\": agg_metric(results, \"auc_roc\"),\n",
    "        \"auc_std\": agg_metric(results, \"auc_roc\", \"std\"),\n",
    "        \"accuracy\": agg_metric(results, \"accuracy\"),\n",
    "        \"f1\": agg_metric(results, \"f1\"),\n",
    "        \"checkpoint_mb\": checkpoint_mb(run_name),\n",
    "    
})\n",
    "\n",
    "phase4_results_df = pd.DataFrame(result_rows)\n",
    "if phase4_results_df.empty:\n",
    "    print(\"No Phase 4 result logs found under classifier/outputs/logs.\")\n",
    "    print(\"Missing planned runs:\")\n",
    "    for run_name in missing_runs:\n",
    "        print(f\"- {run_name}\")\n",
    "else:\n",
    "    phase4_results_df = phase4_results_df.sort_values([\"backbone\", \"subsample\"])\n",
    "    display(\n",
    "        phase4_results_df.style.format({\n",
    "            \"subsample\": \"{:.1f}\",\n",
    "            \"auc\": \"{:.4f}\",\n",
    "            \"auc_std\": \"{:.4f}\",\n",
    "            \"accuracy\": \"{:.4f}\",\n",
    "            \"f1\": \"{:.4f}\",\n",
    "            \"checkpoint_mb\": \"{:.1f}\",\n",
    "        })\n",
    "    )\n",
    "    if missing_runs:\n",
    "        print(\"Some planned Phase 4 runs are still missing:\")\n",
    "        for run_name in missing_runs:\n",
    "            print(f\"- {run_name}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If the status cell reports no logs, the correct interpretation is: Phase 4 has been designed but not yet analyzed. The report can describe the intended purpose, but it should not include Phase 4 performance claims.\n",
    "\n",
    "When logs exist, the next sections answer three questions: does more data improve each backbone, which backbone benefits most, and does scaling reduce source-specific weakness?\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Scaling curves, once results exist\n",
    "\n",
    "These cells are guarded. 
They produce figures only when at least one `p4_*` log is available.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Skipping scaling plots because no p4 logs are available yet.\n"
     ]
    }
   ],
   "source": [
    "if phase4_results_df.empty:\n",
    "    print(\"Skipping scaling plots because no p4 logs are available yet.\")\n",
    "else:\n",
    "    fig, axes = plt.subplots(1, 3, figsize=(13, 4), sharex=True)\n",
    "    for ax, metric, title in zip(axes, [\"auc\", \"accuracy\", \"f1\"], [\"AUC\", \"Accuracy\", \"F1\"]):\n",
    "        for backbone, sub in phase4_results_df.groupby(\"backbone\"):\n",
    "            sub = sub.sort_values(\"subsample\")\n",
    "            ax.plot(sub[\"subsample\"] * 100, sub[metric], marker=\"o\", label=backbone)\n",
    "        ax.set_xlabel(\"Training data used (%)\")\n",
    "        ax.set_ylabel(title)\n",
    "        ax.set_title(f\"Phase 4 {title} scaling\")\n",
    "        ax.grid(alpha=0.25)\n",
    "    axes[0].legend(fontsize=8)\n",
    "    fig.tight_layout()\n",
    "    fig.savefig(FIGURES_DIR / \"07_phase4_scaling_curves.png\", dpi=180, bbox_inches=\"tight\")\n",
    "    plt.show()\n",
    "\n",
    "    gains = []\n",
    "    for backbone, sub in phase4_results_df.groupby(\"backbone\"):\n",
    "        sub = sub.set_index(\"subsample\")\n",
    "        if 0.2 in sub.index and 1.0 in sub.index:\n",
    "            gains.append({\n",
    "                \"backbone\": backbone,\n",
    "                \"auc_20pct\": sub.loc[0.2, \"auc\"],\n",
    "                \"auc_100pct\": sub.loc[1.0, \"auc\"],\n",
    "                \"auc_gain\": sub.loc[1.0, \"auc\"] - sub.loc[0.2, \"auc\"],\n",
    "                \"accuracy_gain\": sub.loc[1.0, \"accuracy\"] - sub.loc[0.2, \"accuracy\"],\n",
    "                \"f1_gain\": sub.loc[1.0, \"f1\"] - sub.loc[0.2, \"f1\"],\n",
    "            })\n",
    "    gains_df = pd.DataFrame(gains)\n",
    "    display(gains_df.style.format({\"auc_20pct\": \"{:.4f}\", \"auc_100pct\": \"{:.4f}\", \"auc_gain\": \"{:+.4f}\", \"accuracy_gain\": \"{:+.4f}\", \"f1_gain\": \"{:+.4f}\"}))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Source diagnostics, once results exist\n",
    "\n",
    "The key Phase 4 question goes beyond whether AUC rises: does extra data improve the weak source pairs found earlier, especially generalization around the `text2img` and `insight` sources?\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Skipping source diagnostics because no p4 logs are available yet.\n"
     ]
    }
   ],
   "source": [
    "def pairwise_rows(run_name: str, results: dict) -> list[dict]:\n",
    "    rows = []\n",
    "    for pair, metrics in results.get(\"aggregated_pairwise\", {}).items():\n",
    "        rows.append({\n",
    "            \"run\": run_name,\n",
    "            \"pair\": pair,\n",
    "            \"pairwise_auc\": metrics.get(\"auc_roc\", {}).get(\"mean\", np.nan),\n",
    "            \"pairwise_f1\": metrics.get(\"f1\", {}).get(\"mean\", np.nan),\n",
    "            \"pairwise_accuracy\": metrics.get(\"accuracy\", {}).get(\"mean\", np.nan),\n",
    "        })\n",
    "    return rows\n",
    "\n",
    "\n",
    "if phase4_results_df.empty:\n",
    "    print(\"Skipping source diagnostics because no p4 logs are available yet.\")\n",
    "else:\n",
    "    pair_rows = []\n",
    "    for _, row in phase4_results_df.iterrows():\n",
    "        results = load_run_if_present(row[\"run\"])\n",
    "        for pair_row in pairwise_rows(row[\"run\"], results):\n",
    "            pair_rows.append({**pair_row, \"backbone\": row[\"backbone\"], \"subsample\": row[\"subsample\"]})\n",
    "    # Explicit columns keep this cell working even if the logs carry no pairwise metrics.\n",
    "    pair_columns = [\"run\", \"pair\", \"pairwise_auc\", \"pairwise_f1\", \"pairwise_accuracy\", \"backbone\", \"subsample\"]\n",
    "    pair_df = pd.DataFrame(pair_rows, columns=pair_columns)\n",
    "    display(pair_df.sort_values([\"pair\", \"backbone\", \"subsample\"]).style.format({\"subsample\": \"{:.1f}\", \"pairwise_auc\": \"{:.4f}\", \"pairwise_f1\": \"{:.4f}\", \"pairwise_accuracy\": \"{:.4f}\"}))\n",
    "\n",
    "    for pair in sorted(pair_df[\"pair\"].unique()):\n",
    "        fig, ax = plt.subplots(figsize=(6, 3.6))\n",
    "        sub_pair = pair_df[pair_df[\"pair\"] == pair]\n",
    "        for backbone, sub in sub_pair.groupby(\"backbone\"):\n",
    "            sub = sub.sort_values(\"subsample\")\n",
    "            ax.plot(sub[\"subsample\"] * 100, sub[\"pairwise_auc\"], marker=\"o\", label=backbone)\n",
    "        ax.set_title(f\"Phase 4 source-pair scaling: {pair}\")\n",
    "        ax.set_xlabel(\"Training data used (%)\")\n",
    "        ax.set_ylabel(\"Pairwise AUC\")\n",
    "        ax.grid(alpha=0.25)\n",
    "        ax.legend(fontsize=8)\n",
    "        fig.tight_layout()\n",
    "        fig.savefig(FIGURES_DIR / f\"07_phase4_{pair}_scaling.png\", dpi=180, bbox_inches=\"tight\")\n",
    "        plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "\n",
    "At the time this notebook was added, Phase 4 was a planned data-scaling analysis rather than a completed results chapter. The configs define a clean experiment: take the strongest Phase 3 families and train them at 20%, 50%, and 100% of the available face-cropped data.\n",
    "\n",
    "The report should not make positive or negative Phase 4 performance claims until `p4_*` logs are present. Once those logs exist, the key result to look for is not just a higher global AUC. The stronger claim would be that more data also improves pairwise source behavior, especially for the sources that exposed generalization limits earlier.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}