Compare commits: 920cc983c4...clean (14 commits)
| Author | SHA1 | Date | |
|---|---|---|---|
| | a5dc093a15 | | |
| | 1ed2b7a7a0 | | |
| | afd26f47d2 | | |
| | 3bff7eefb0 | | |
| | f46320f81e | | |
| | 2062a91985 | | |
| | 9ae334410d | | |
| | 522a8f8d46 | | |
| | 69666d6aa0 | | |
| | b5313e3320 | | |
| | 580808d9ad | | |
| | e09cbf6a1a | | |
| | ac3d6e1f6d | | |
| | bed9be0b17 | | |
@@ -67,3 +67,6 @@ generator/outputs/samples/*
|
|||||||
.venv/
|
.venv/
|
||||||
.ipynb_checkpoints/
|
.ipynb_checkpoints/
|
||||||
__pycache__/
|
__pycache__/
|
||||||
|
|
||||||
|
#Presentation
|
||||||
|
presentation_inputs.zip
|
||||||
@@ -1,125 +1,264 @@
|
|||||||
# DRL_PROJ — DeepFake Detection
|
# Deep learning face project
|
||||||
|
|
||||||
Deep learning project for binary deepfake detection on the DeepFakeFace dataset.
|
This repository contains a two-part deep learning project on the
|
||||||
|
DeepFakeFace (DFF) dataset:
|
||||||
|
|
||||||
## Project structure
|
1. **Classifier:** detect whether a face image is real or fake.
|
||||||
|
2. **Generator:** train generative models that produce new fake face images.
|
||||||
|
|
||||||
```
|
The project is written as an experimental report. The notebooks are the main
|
||||||
|
deliverable: they show the pipeline, the intermediate failures, the ablations,
|
||||||
|
the decisions, and the final models. Read them in order.
|
||||||
|
|
||||||
|
## Project story
|
||||||
|
|
||||||
|
The work follows the same principle in both parts: start with a simple
|
||||||
|
baseline, inspect what fails, change one important factor at a time, and keep
|
||||||
|
the evidence tied to saved logs and saved artifacts.
|
||||||
|
|
||||||
|
For the **classifier**, the story moves from dataset understanding to
|
||||||
|
preprocessing, baseline models, controlled ablations, Grad-CAM inspection,
|
||||||
|
stronger model families, and data scaling. The final practical classifier is a
|
||||||
|
ResNet50-style pipeline using face crops, 224×224 inputs, ImageNet/default
|
||||||
|
normalization, and no stochastic augmentation at validation/test time.
|
||||||
|
|
||||||
|
For the **generator**, the story starts with raw baseline failures, then locks
|
||||||
|
the data pipeline before comparing three parallel model-family branches:
|
||||||
|
GAN, VAE, and DDPM. The final comparison keeps quality versus speed central:
|
||||||
|
DDPM gives the best saved FID and visual quality, GAN is the best
|
||||||
|
quality-speed compromise, and VAE is the fastest option but gives the smoothest (least sharp) samples.
|
||||||
|
|
||||||
|
## How to read the project
|
||||||
|
|
||||||
|
Start with the classifier notebooks, then read the generator notebooks. The
|
||||||
|
generator has one linear setup stage followed by three parallel branches:
|
||||||
|
GAN, VAE, and DDPM. Those branches are numbered in reading order, but they are
|
||||||
|
conceptually parallel experiments after the pipeline is selected.
|
||||||
|
|
||||||
|
### Classifier notebooks
|
||||||
|
|
||||||
|
Read these first:
|
||||||
|
|
||||||
|
1. `classifier/notebooks/01_eda.ipynb`
|
||||||
|
Dataset composition, real/fake source mapping, image statistics, and
|
||||||
|
shortcut risks.
|
||||||
|
2. `classifier/notebooks/02_preprocessing.ipynb`
|
||||||
|
Deterministic preprocessing, train-only augmentation, face crops, and
|
||||||
|
normalization.
|
||||||
|
3. `classifier/notebooks/03_phase1_analysis.ipynb`
|
||||||
|
SimpleCNN and ResNet18 controlled baselines.
|
||||||
|
4. `classifier/notebooks/04_phase2_analysis.ipynb`
|
||||||
|
Resolution, normalization, source holdouts, facecrop, and augmentation
|
||||||
|
ablations.
|
||||||
|
5. `classifier/notebooks/05_gradcam_analysis.ipynb`
|
||||||
|
Qualitative localization analysis across the classifier pipeline.
|
||||||
|
6. `classifier/notebooks/06_phase3_model_family_analysis.ipynb`
|
||||||
|
Stronger pretrained model families and the ResNet50 practical choice.
|
||||||
|
7. `classifier/notebooks/07_phase4_data_scaling_analysis.ipynb`
|
||||||
|
Data scaling for strong backbones and the final classifier decision.
|
||||||
|
|
||||||
|
### Generator notebooks
|
||||||
|
|
||||||
|
Read these after the classifier:
|
||||||
|
|
||||||
|
1. `generator/notebooks/01_baseline_sanity_check.ipynb`
|
||||||
|
Raw baseline failures and why the data pipeline must be fixed first.
|
||||||
|
2. `generator/notebooks/02_pipeline_selection.ipynb`
|
||||||
|
Controlled pipeline ablations: resolution, alignment, augmentation, and
|
||||||
|
raw/aligned mixing.
|
||||||
|
3. `generator/notebooks/03_gan_stability_progression.ipynb`
|
||||||
|
GAN branch: DCGAN → WGAN-GP → spectral normalization + GroupNorm +
|
||||||
|
self-attention → 128×128 check.
|
||||||
|
4. `generator/notebooks/04_vae_loss_progression.ipynb`
|
||||||
|
VAE branch: MSE + KL → perceptual loss → PatchGAN adversarial loss.
|
||||||
|
5. `generator/notebooks/05_ddpm_recipe_progression.ipynb`
|
||||||
|
DDPM branch: linear schedule → cosine schedule → v-prediction → wider
|
||||||
|
backbone.
|
||||||
|
6. `generator/notebooks/06_final_family_comparison.ipynb`
|
||||||
|
Final comparison of the selected GAN, VAE, and DDPM recipes under saved
|
||||||
|
Phase 5 conditions.
|
||||||
|
7. `generator/notebooks/07_final_sample_showcase.ipynb`
|
||||||
|
Curated final sample examples from saved outputs. This is qualitative
|
||||||
|
showcase material, not a replacement for FID.
|
||||||
|
|
||||||
|
## What the notebooks do
|
||||||
|
|
||||||
|
The notebooks are analysis/report chapters. They load existing configs, logs,
|
||||||
|
figures, saved sample grids, checkpoints, and prediction summaries. They are
|
||||||
|
not intended to launch new training runs.
|
||||||
|
|
||||||
|
When a notebook shows a plot or image grid, the surrounding markdown explains:
|
||||||
|
|
||||||
|
- what the artifact shows;
|
||||||
|
- why it is needed;
|
||||||
|
- how it supports the phase decision;
|
||||||
|
- what limitation remains.
|
||||||
|
|
||||||
|
This matters because the project is evaluated not only on final
|
||||||
|
performance but also on the documented evolution of the solution.
|
||||||
|
|
||||||
|
## Repository layout
|
||||||
|
|
||||||
|
```text
|
||||||
DRL_PROJ/
|
DRL_PROJ/
|
||||||
classifier/ ← discriminative model (real vs. fake classifier)
|
classifier/
|
||||||
src/ ← model definitions, training, evaluation, preprocessing
|
configs/ experiment configs by phase
|
||||||
configs/ ← experiment configs organised by phase
|
notebooks/ classifier report notebooks
|
||||||
phase1/ ← baseline models (SimpleCNN, ResNet18)
|
outputs/ saved logs, figures, Grad-CAM panels, checkpoints
|
||||||
phase2/ ← architecture sweep (ResNet variants, face-crop)
|
src/ classifier data, models, training, evaluation
|
||||||
phase3/ ← EfficientNet, ViT, frequency-aware training
|
tests/ unit and smoke tests
|
||||||
phase4/ ← ensemble strategies
|
tools/ facecrop, Grad-CAM, inference, reevaluation helpers
|
||||||
tools/ ← analyse.py, ensemble.py, inference.py, facecrop.py
|
|
||||||
notebooks/ ← EDA, preprocessing, evaluation, GradCAM
|
generator/
|
||||||
outputs/ ← models, logs, figures (gitignored except .pt/.json)
|
configs/ generator configs by phase/family
|
||||||
run.py ← main training entry point
|
notebooks/ generator report notebooks and notebook builder
|
||||||
generator/ ← generative model (GAN / VAE / diffusion) — in progress
|
outputs/ saved logs, sample grids, final showcase artifacts
|
||||||
pipeline/ ← Vast.ai ephemeral GPU orchestration
|
src/ generator data, models, training, metrics
|
||||||
data/ ← dataset root (gitignored)
|
tests/ unit and smoke tests
|
||||||
cropped/ ← MTCNN pre-cropped faces (gitignored)
|
tools/ sampling and utility scripts
|
||||||
classifier/ ← bbox crops for the classifier
|
|
||||||
generator/ ← landmark-aligned crops for the generator
|
data/ original DFF dataset root, not committed
|
||||||
|
cropped/ preprocessed face crops, not committed
|
||||||
|
docs/ project statement and supporting documents
|
||||||
|
pipeline/ optional remote/GPU orchestration helpers
|
||||||
```
|
```
|
||||||
|
|
||||||
|
## Rebuilding the generator notebooks
|
||||||
|
|
||||||
|
The generator notebooks are generated from a single source file:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
cd generator/notebooks
|
||||||
|
python _build.py
|
||||||
|
```
|
||||||
|
|
||||||
|
That builder writes the numbered generator notebooks listed above. It uses
|
||||||
|
existing saved logs and artifacts; it does not train models.
|
||||||
|
|
||||||
## Setup
|
## Setup
|
||||||
|
|
||||||
Create a local environment when you want to run the code directly on a machine you control:
|
Create a conda environment and install the project requirements:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
python3 -m venv .venv
|
conda create -n drl python=3.12
|
||||||
source .venv/bin/activate
|
conda activate drl
|
||||||
python -m pip install --upgrade pip setuptools wheel
|
python -m pip install --upgrade pip setuptools wheel
|
||||||
python -m pip install -r requirements.txt
|
python -m pip install -r requirements.txt
|
||||||
```
|
```
|
||||||
|
|
||||||
## Local Training
|
Use **Python 3.12**; some dependencies (for example `facenet-pytorch`) are
|
||||||
|
unreliable on 3.13+.
|
||||||
|
|
||||||
|
The raw dataset should be placed under `data/`. Preprocessed crops are stored
|
||||||
|
under `cropped/`. These folders are intentionally not committed. To download
|
||||||
|
and extract the dataset:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
python3 classifier/run.py classifier/configs/phase2/p2_resnet18_facecrop.json
|
python classifier/tools/fetch_ds.py
|
||||||
python3 classifier/run.py classifier/configs/phase3/p3_efficientnet_b0.json
|
python classifier/tools/fetch_ds.py --data-dir /path/to/DFF
|
||||||
```
|
```
|
||||||
|
|
||||||
## Ephemeral Vast.ai Pipeline
|
Expected layout under the data root: `wiki/<identity>/*.jpg`,
|
||||||
|
`inpainting/...`, `text2img/...`, `insight/...`.
|
||||||
|
|
||||||
The deployment/orchestration path now lives under [`pipeline/`](/run/host/mnt/shared/UP/DRL/DRL_PROJ/pipeline/README.md).
|
## Classifier — training
|
||||||
|
|
||||||
One-time setup:
|
From the repository root:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
cat > pipeline/.env <<'EOF'
|
# CPU (slow but valid)
|
||||||
VAST_API_KEY=<your-api-key>
|
python classifier/run.py classifier/configs/phase4/p4_convnext_tiny_100pct.json
|
||||||
VAST_SSH_PRIVATE_KEY=/home/your-user/.ssh/id_ed25519
|
|
||||||
EOF
|
# GPU when CUDA is available
|
||||||
|
python classifier/run.py classifier/configs/phase4/p4_convnext_tiny_100pct.json --use-gpu
|
||||||
```
|
```
|
||||||
|
|
||||||
End-to-end ephemeral run:
|
Training uses 5-fold stratified group cross-validation. Per-fold checkpoints
|
||||||
|
are saved as `classifier/outputs/models/{run_name}_fold{k}_best.pt` (and
|
||||||
|
`_final.pt`). Override data or output locations with `--data-dir` and
|
||||||
|
`--output-root`.
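
As a rough illustration of what "stratified group" means here: every image that shares a group key stays in a single fold, while the folds keep similar real/fake proportions. The sketch below is a greedy, standard-library-only approximation, not the repository's splitter (which may well rely on scikit-learn's `StratifiedGroupKFold`). The function name `assign_group_folds`, the group keys, and the toy labels are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def assign_group_folds(labels, groups, n_folds=5):
    """Greedy sketch: give each group one fold while balancing class counts."""
    # Tally how many samples of each class every group contains.
    per_group = defaultdict(Counter)
    for label, group in zip(labels, groups):
        per_group[group][label] += 1

    fold_counts = [Counter() for _ in range(n_folds)]
    fold_of_group = {}
    # Place large groups first; ties are broken by group name for determinism.
    ordered = sorted(
        per_group.items(), key=lambda kv: (-sum(kv[1].values()), str(kv[0]))
    )
    for group, counts in ordered:
        # Prefer the fold where adding this group keeps its per-class totals smallest.
        def cost(fold):
            return sum((fold_counts[fold][c] + n) ** 2 for c, n in counts.items())

        best = min(range(n_folds), key=cost)
        fold_counts[best].update(counts)
        fold_of_group[group] = best
    # Every sample inherits the fold of its group, so no group is ever split.
    return [fold_of_group[g] for g in groups]


# Toy data (hypothetical): 10 identities, each with 1 real (0) and 3 fake (1) images.
groups = [f"id{i}" for i in range(10) for _ in range(4)]
labels = [0, 1, 1, 1] * 10
folds = assign_group_folds(labels, groups)
```

Grouping matters because a real image and the fakes derived from it share content; if they landed in different folds, validation scores could be inflated by leakage.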
|
||||||
|
|
||||||
|
**Primary delivery model** (best Phase 4 detector): config
|
||||||
|
`classifier/configs/phase4/p4_convnext_tiny_100pct.json` with per-fold
|
||||||
|
weights `classifier/outputs/models/p4_convnext_tiny_100pct_fold*_best.pt`.
|
||||||
|
|
||||||
|
## Classifier — inference
|
||||||
|
|
||||||
|
Classify a single image as real or fake:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
python3 -m pipeline run classifier/configs/phase2/p2_resnet18_facecrop.json --upload-data
|
python classifier/tools/inference.py image.jpg classifier/configs/phase4/p4_convnext_tiny_100pct.json
|
||||||
```
|
```
|
||||||
|
|
||||||
Interactive offer selection:
|
This loads the config and the matching checkpoint, runs the image through the
|
||||||
|
model, and prints a result like:
|
||||||
|
|
||||||
|
```
|
||||||
|
Image : image.jpg
|
||||||
|
Model : p4_convnext_tiny_100pct (convnext_tiny)
|
||||||
|
Device: cuda
|
||||||
|
Result: FAKE (confidence: 74.7%)
|
||||||
|
P(fake): 0.7466 P(real): 0.2534
|
||||||
|
```
|
||||||
|
|
||||||
|
If you omit `--checkpoint`, the tool automatically looks for a saved
|
||||||
|
checkpoint under `classifier/outputs/models/` — first the single-run
|
||||||
|
`{run_name}_best.pt`, then CV fold files `{run_name}_fold{k}_best.pt`, then
|
||||||
|
`{run_name}_fold{k}_final.pt`. To use a specific fold:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
python3 -m pipeline offers --select-offer
|
python classifier/tools/inference.py image.jpg classifier/configs/phase4/p4_convnext_tiny_100pct.json \
|
||||||
|
--checkpoint classifier/outputs/models/p4_convnext_tiny_100pct_fold0_best.pt
|
||||||
```
|
```
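
The fallback order described above (single-run best, then per-fold best, then per-fold final) can be sketched as follows. This is a minimal illustration, not the code in `classifier/tools/inference.py`: the helper name `find_checkpoints` is hypothetical, and because the README does not say how the tool treats several fold files, the sketch simply returns every match at the first level that has any.

```python
import tempfile
from pathlib import Path


def find_checkpoints(models_dir, run_name):
    """Return checkpoint paths from the first naming pattern that matches."""
    models_dir = Path(models_dir)
    # 1) Single-run checkpoint.
    single = models_dir / f"{run_name}_best.pt"
    if single.exists():
        return [single]
    # 2) Per-fold best checkpoints, then 3) per-fold final checkpoints.
    for suffix in ("best", "final"):
        matches = sorted(models_dir.glob(f"{run_name}_fold*_{suffix}.pt"))
        if matches:
            return matches
    raise FileNotFoundError(f"no checkpoint found for {run_name!r} in {models_dir}")


# Demo with empty placeholder files (names are made up for the example).
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    for name in ("demo_fold0_best.pt", "demo_fold1_best.pt", "demo_fold0_final.pt"):
        (tmp / name).touch()
    fold_hits = [p.name for p in find_checkpoints(tmp, "demo")]

    # Once a single-run checkpoint exists, it takes priority over fold files.
    (tmp / "demo_best.pt").touch()
    single_hit = [p.name for p in find_checkpoints(tmp, "demo")]
```

Note that `sorted` orders paths lexicographically, which is fine for the five folds used here but would place `fold10` before `fold2` in larger setups.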
|
||||||
|
|
||||||
You can override the ranking mode per run:
|
## Generator — training
|
||||||
|
|
||||||
|
From the repository root:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
python3 -m pipeline offers --sort price
|
python generator/run.py generator/configs/phase0/p0_vae.json
|
||||||
python3 -m pipeline offers --sort performance
|
python generator/run.py generator/configs/phase0/p0_ddpm.json
|
||||||
python3 -m pipeline offers --sort performance --price 0.14
|
|
||||||
```
|
```
|
||||||
|
|
||||||
You can also filter by region:
|
Generator training expects real-face images (default source is `wiki`); use
|
||||||
|
`--data-dir` to point at your dataset tree. Checkpoints are saved under

|
||||||
|
`generator/outputs/models/{run_name}_final_ema.pt` (EMA shadow) and
|
||||||
|
`{run_name}_best_ema.pt` (lowest-FID snapshot).
|
||||||
|
|
||||||
|
## Generator — inference (sampling)
|
||||||
|
|
||||||
|
Generate 4×4 sample grids from Phase 5 EMA checkpoints:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
python3 -m pipeline offers --select-offer --region europe
|
python generator/tools/sampling.py --models p5_gan p5_vae p5_ddpm --samples 10
|
||||||
python3 -m pipeline offers --select-offer --region Portugal
|
|
||||||
python3 -m pipeline offers --select-offer --region US
|
|
||||||
python3 -m pipeline offers --select-offer --region europe --price 0.14
|
|
||||||
```
|
```
|
||||||
|
|
||||||
To inspect which region strings are currently available from the search results:
|
Options:
|
||||||
|
|
||||||
```bash
|
- `--models` — which models to sample from (`p5_gan`, `p5_vae`, `p5_ddpm`;
|
||||||
python3 -m pipeline offers --list-regions
|
defaults to all three).
|
||||||
```
|
- `--samples` — number of grids per model (default 10).
|
||||||
|
- `--output-dir` — where to write the PNGs (default
|
||||||
|
`generator/outputs/samples/final_comparison/`).
|
||||||
|
- `--truncation` — optional latent truncation for the GAN (lower = less
|
||||||
|
diversity but sharper).
|
||||||
|
- `--device` — `cuda` or `cpu` (default: auto-detect).
|
||||||
|
|
||||||
That command:
|
Each grid is a 4×4 PNG of 16 images sampled from the model's EMA weights.
|
||||||
- ensures your SSH public key is registered with Vast.ai
|
GAN samples are drawn from random latent vectors, VAE samples decode from the
|
||||||
- searches offers using the filters in `pipeline/defaults/vast.json`
|
learned prior, and DDPM samples use 50-step DDIM.
|
||||||
- creates an instance
|
|
||||||
- waits for SSH readiness
|
|
||||||
- syncs the repo
|
|
||||||
- uploads `data/` when `--upload-data` is set
|
|
||||||
- runs `python3 classifier/run.py ...`
|
|
||||||
- downloads `classifier/outputs/`
|
|
||||||
- for generator runs, rsyncs `generator/outputs/` back every 25 epochs and again at completion
|
|
||||||
- destroys the instance automatically unless `--keep-on-failure` is set
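
For the 4×4 sample grids mentioned in the sampling section, the tiling step can be sketched with NumPy alone. This is an illustrative sketch, not the code in `generator/tools/sampling.py`: the function name `make_grid`, the 64×64 image size, and the random stand-in images are assumptions, and writing the result to a PNG (with an image library of your choice) is left out.

```python
import numpy as np


def make_grid(images, rows=4, cols=4):
    """Tile `rows * cols` images of shape (H, W, C) into one (rows*H, cols*W, C) array."""
    images = np.asarray(images)
    n, h, w, c = images.shape
    if n != rows * cols:
        raise ValueError(f"expected {rows * cols} images, got {n}")
    # (rows, cols, H, W, C) -> (rows, H, cols, W, C) -> (rows*H, cols*W, C)
    grid = images.reshape(rows, cols, h, w, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(rows * h, cols * w, c)


# Stand-in for 16 generated samples (random uint8 pixels, hypothetical 64x64 size).
rng = np.random.default_rng(0)
samples = rng.integers(0, 256, size=(16, 64, 64, 3), dtype=np.uint8)
grid = make_grid(samples)
```

Images are laid out row by row, so sample 1 sits to the right of sample 0 and sample 4 starts the second row.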
|
|
||||||
|
|
||||||
Useful commands:
|
## Final takeaway
|
||||||
|
|
||||||
```bash
|
The project is best understood as a sequence of controlled decisions:
|
||||||
python3 -m pipeline up
|
|
||||||
python3 -m pipeline status <instance_id>
|
|
||||||
python3 -m pipeline down <instance_id>
|
|
||||||
```
|
|
||||||
|
|
||||||
To override the default Vast search/runtime settings, copy `pipeline/defaults/vast.json`, edit it, and pass:
|
1. cleanly define the data and preprocessing;
|
||||||
|
2. establish simple baselines;
|
||||||
|
3. improve one factor at a time;
|
||||||
|
4. compare model families using saved evidence;
|
||||||
|
5. report both performance and limitations.
|
||||||
|
|
||||||
```bash
|
The classifier becomes reliable through source-aware preprocessing, stronger
|
||||||
python3 -m pipeline run classifier/configs/phase3/p3_efficientnet_b0.json --pipeline-config /path/to/vast.override.json
|
pretrained backbones, and scaling. The generator improves by first locking the
|
||||||
```
|
face-aligned pipeline and then selecting the best recipe inside each model
|
||||||
|
family before the final GAN/VAE/DDPM comparison.
|
||||||
The default policy in `pipeline/defaults/vast.json` now targets:
|
|
||||||
- `1x` GPU
|
|
||||||
- `RTX 3090` or `RTX 3090 Ti`
|
|
||||||
- `<= $0.20/hour`
|
|
||||||
- sorted by `dlperf` descending
|
|
||||||
- uses `vastai/pytorch:latest` as the default image
|
|
||||||
|
|||||||
@@ -1,6 +0,0 @@
|
|||||||
{
|
|
||||||
"extends": "_base.json",
|
|
||||||
"run_name": "p4_convnext_tiny_20pct",
|
|
||||||
"backbone": "convnext_tiny",
|
|
||||||
"subsample": 0.2
|
|
||||||
}
|
|
||||||
@@ -1,6 +0,0 @@
|
|||||||
{
|
|
||||||
"extends": "_base.json",
|
|
||||||
"run_name": "p4_efficientnet_b0_20pct",
|
|
||||||
"backbone": "efficientnet_b0",
|
|
||||||
"subsample": 0.2
|
|
||||||
}
|
|
||||||
@@ -1,6 +0,0 @@
|
|||||||
{
|
|
||||||
"extends": "_base.json",
|
|
||||||
"run_name": "p4_resnet50_20pct",
|
|
||||||
"backbone": "resnet50",
|
|
||||||
"subsample": 0.2
|
|
||||||
}
|
|
||||||
@@ -0,0 +1,18 @@
|
|||||||
|
{
|
||||||
|
"extends": "../shared.json",
|
||||||
|
"run_name": "smoke",
|
||||||
|
"backbone": "simple_cnn",
|
||||||
|
"cnn_preset": "micro",
|
||||||
|
"dropout": 0.0,
|
||||||
|
"epochs": 1,
|
||||||
|
"cv_folds": 2,
|
||||||
|
"image_size": 64,
|
||||||
|
"batch_size": 8,
|
||||||
|
"num_workers": 0,
|
||||||
|
"early_stopping_patience": 0,
|
||||||
|
"subsample": 1.0,
|
||||||
|
"augment": false,
|
||||||
|
"lr": 0.001,
|
||||||
|
"T_max": 1,
|
||||||
|
"data_dir": "data"
|
||||||
|
}
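
The `extends` key in these configs suggests that a child file inherits from a parent and overrides selected keys. How the repository actually resolves it is not shown in this diff, so the following is only a sketch under assumed semantics: the parent path is taken relative to the child file, and child keys win. The loader name `load_config`, the `phase0` directory, and the parent values are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def load_config(path):
    """Load a JSON config, merging in its `extends` parent first (assumed semantics)."""
    path = Path(path)
    config = json.loads(path.read_text())
    parent = config.pop("extends", None)
    if parent is None:
        return config
    # Assumption: the parent path is relative to the child file's directory.
    merged = load_config(path.parent / parent)
    merged.update(config)  # child keys override inherited ones
    return merged


# Demo with throwaway files; the values in shared.json are invented for the example.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "phase0").mkdir()
    (root / "shared.json").write_text(json.dumps({"epochs": 15, "lr": 0.0001}))
    (root / "phase0" / "smoke.json").write_text(
        json.dumps({"extends": "../shared.json", "run_name": "smoke", "epochs": 1})
    )
    resolved = load_config(root / "phase0" / "smoke.json")
```

The merge here is shallow; nested dictionaries, if a config used them, would be replaced wholesale rather than merged key by key.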
|
||||||
@@ -1,702 +0,0 @@
|
|||||||
{
|
|
||||||
"cells": [
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"# Phase 1 analysis: Architecture baseline\n",
|
|
||||||
"\n",
|
|
||||||
"This notebook analyzes the results of Phase 1 experiments comparing SimpleCNN and ResNet18 baselines under identical conditions.\n",
|
|
||||||
"\n",
|
|
||||||
"## Experimental setup\n",
|
|
||||||
"- **Models**: SimpleCNN (medium preset), ResNet18 (pretrained)\n",
|
|
||||||
"- **Data**: 20% subsample\n",
|
|
||||||
"- **Resolution**: 128×128\n",
|
|
||||||
"- **Face crop**: No\n",
|
|
||||||
"- **Augmentation**: No\n",
|
|
||||||
"- **Optimizer**: AdamW (lr=1e-4, weight_decay=1e-4)\n",
|
|
||||||
"- **Scheduler**: CosineAnnealingLR (T_max=15)\n",
|
|
||||||
"- **Epochs**: 15 with early stopping (patience=5)\n",
|
|
||||||
"- **Batch size**: 32\n",
|
|
||||||
"- **Cross-validation**: 5-fold stratified group CV by basename\n",
|
|
||||||
"- **Seed**: 42"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"import json\n",
|
|
||||||
"import numpy as np\n",
|
|
||||||
"import pandas as pd\n",
|
|
||||||
"import matplotlib.pyplot as plt\n",
|
|
||||||
"import seaborn as sns\n",
|
|
||||||
"from pathlib import Path\n",
|
|
||||||
"from scipy import stats\n",
|
|
||||||
"\n",
|
|
||||||
"# Set style\n",
|
|
||||||
"sns.set_style(\"whitegrid\")\n",
|
|
||||||
"plt.rcParams['figure.figsize'] = (12, 6)\n",
|
|
||||||
"plt.rcParams['font.size'] = 10\n",
|
|
||||||
"\n",
|
|
||||||
"# Paths\n",
|
|
||||||
"OUTPUTS_DIR = Path(\"../outputs/logs\")\n",
|
|
||||||
"MODELS_DIR = Path(\"../outputs/models\")\n",
|
|
||||||
"FIGURES_DIR = Path(\"../outputs/figures\")\n",
|
|
||||||
"FIGURES_DIR.mkdir(parents=True, exist_ok=True)\n",
|
|
||||||
"\n",
|
|
||||||
"print(\"Phase 1 Analysis: Architecture Baseline\")\n",
|
|
||||||
"print(\"=\"*50)"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## Load CV results"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"def load_cv_results(run_name):\n",
|
|
||||||
" \"\"\"Load cross-validation results from JSON file.\"\"\"\n",
|
|
||||||
" results_path = OUTPUTS_DIR / f\"{run_name}.json\"\n",
|
|
||||||
" if not results_path.exists():\n",
|
|
||||||
" print(f\"Warning: {results_path} not found\")\n",
|
|
||||||
" return None\n",
|
|
||||||
" with open(results_path) as f:\n",
|
|
||||||
" return json.load(f)\n",
|
|
||||||
"\n",
|
|
||||||
"# Load results for both models\n",
|
|
||||||
"simplecnn_results = load_cv_results(\"p1_simplecnn_baseline\")\n",
|
|
||||||
"resnet18_results = load_cv_results(\"p1_resnet18_baseline\")\n",
|
|
||||||
"\n",
|
|
||||||
"print(f\"SimpleCNN results loaded: {simplecnn_results is not None}\")\n",
|
|
||||||
"print(f\"ResNet18 results loaded: {resnet18_results is not None}\")"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## Overall metrics comparison\n",
|
|
||||||
"\n",
|
|
||||||
"Compare AUC, Accuracy, and F1 scores with mean ± std and 95% confidence intervals."
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"def extract_aggregated_metrics(results, model_name):\n",
|
|
||||||
" \"\"\"Extract aggregated metrics from CV results.\"\"\"\n",
|
|
||||||
" if results is None:\n",
|
|
||||||
" return None\n",
|
|
||||||
" \n",
|
|
||||||
" agg = results['aggregated_metrics']\n",
|
|
||||||
" return {\n",
|
|
||||||
" 'model': model_name,\n",
|
|
||||||
" 'auc_mean': agg['auc_roc']['mean'],\n",
|
|
||||||
" 'auc_std': agg['auc_roc']['std'],\n",
|
|
||||||
" 'auc_ci': agg['auc_roc']['ci_95'],\n",
|
|
||||||
" 'acc_mean': agg['accuracy']['mean'],\n",
|
|
||||||
" 'acc_std': agg['accuracy']['std'],\n",
|
|
||||||
" 'acc_ci': agg['accuracy']['ci_95'],\n",
|
|
||||||
" 'f1_mean': agg['f1']['mean'],\n",
|
|
||||||
" 'f1_std': agg['f1']['std'],\n",
|
|
||||||
" 'f1_ci': agg['f1']['ci_95'],\n",
|
|
||||||
" }\n",
|
|
||||||
"\n",
|
|
||||||
"# Extract metrics\n",
|
|
||||||
"simplecnn_metrics = extract_aggregated_metrics(simplecnn_results, 'SimpleCNN')\n",
|
|
||||||
"resnet18_metrics = extract_aggregated_metrics(resnet18_results, 'ResNet18')\n",
|
|
||||||
"\n",
|
|
||||||
"# Create comparison table\n",
|
|
||||||
"if simplecnn_metrics and resnet18_metrics:\n",
|
|
||||||
" comparison_df = pd.DataFrame([simplecnn_metrics, resnet18_metrics])\n",
|
|
||||||
" comparison_df.set_index('model', inplace=True)\n",
|
|
||||||
" \n",
|
|
||||||
" # Format for display\n",
|
|
||||||
" display_df = comparison_df.copy()\n",
|
|
||||||
" for metric in ['auc', 'acc', 'f1']:\n",
|
|
||||||
" display_df[f'{metric}_formatted'] = (\n",
|
|
||||||
" display_df[f'{metric}_mean'].apply(lambda x: f\"{x:.4f}\") + \" ± \" +\n",
|
|
||||||
" display_df[f'{metric}_std'].apply(lambda x: f\"{x:.4f}\") +\n",
|
|
||||||
" \" (95% CI: ±\" + display_df[f'{metric}_ci'].apply(lambda x: f\"{x:.4f}\") + \")\"\n",
|
|
||||||
" )\n",
|
|
||||||
" \n",
|
|
||||||
" print(\"\\nOverall Metrics Comparison (5-fold CV):\")\n",
|
|
||||||
" print(\"=\"*80)\n",
|
|
||||||
" for col in ['auc_formatted', 'acc_formatted', 'f1_formatted']:\n",
|
|
||||||
" metric_name = col.replace('_formatted', '').upper()\n",
|
|
||||||
" print(f\"\\n{metric_name}:\")\n",
|
|
||||||
" for model in display_df.index:\n",
|
|
||||||
" print(f\" {model}: {display_df.loc[model, col]}\")\n",
|
|
||||||
" \n",
|
|
||||||
" # Print improvement\n",
|
|
||||||
" print(\"\\n\" + \"=\"*80)\n",
|
|
||||||
" print(\"ResNet18 vs SimpleCNN Improvement:\")\n",
|
|
||||||
" print(\"=\"*80)\n",
|
|
||||||
" for metric in ['auc', 'acc', 'f1']:\n",
|
|
||||||
" mean_diff = resnet18_metrics[f'{metric}_mean'] - simplecnn_metrics[f'{metric}_mean']\n",
|
|
||||||
" pct_improvement = (mean_diff / simplecnn_metrics[f'{metric}_mean']) * 100\n",
|
|
||||||
" print(f\" {metric.upper()}: +{mean_diff:.4f} (+{pct_improvement:.2f}%)\")"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## Visualization: Overall metrics comparison"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"if simplecnn_metrics and resnet18_metrics:\n",
|
|
||||||
" fig, axes = plt.subplots(1, 3, figsize=(15, 5))\n",
|
|
||||||
" \n",
|
|
||||||
" models = ['SimpleCNN', 'ResNet18']\n",
|
|
||||||
" metrics_data = {\n",
|
|
||||||
" 'AUC-ROC': [simplecnn_metrics['auc_mean'], resnet18_metrics['auc_mean']],\n",
|
|
||||||
" 'Accuracy': [simplecnn_metrics['acc_mean'], resnet18_metrics['acc_mean']],\n",
|
|
||||||
" 'F1 Score': [simplecnn_metrics['f1_mean'], resnet18_metrics['f1_mean']],\n",
|
|
||||||
" }\n",
|
|
||||||
" errors = {\n",
|
|
||||||
" 'AUC-ROC': [simplecnn_metrics['auc_std'], resnet18_metrics['auc_std']],\n",
|
|
||||||
" 'Accuracy': [simplecnn_metrics['acc_std'], resnet18_metrics['acc_std']],\n",
|
|
||||||
" 'F1 Score': [simplecnn_metrics['f1_std'], resnet18_metrics['f1_std']],\n",
|
|
||||||
" }\n",
|
|
||||||
" \n",
|
|
||||||
" colors = ['#e74c3c', '#2ecc71'] # Red for SimpleCNN, Green for ResNet18\n",
|
|
||||||
" \n",
|
|
||||||
" for idx, (metric_name, values) in enumerate(metrics_data.items()):\n",
|
|
||||||
" ax = axes[idx]\n",
|
|
||||||
" bars = ax.bar(models, values, yerr=errors[metric_name], capsize=5, alpha=0.7, color=colors)\n",
|
|
||||||
" ax.set_ylabel(metric_name)\n",
|
|
||||||
" ax.set_title(f'{metric_name} Comparison')\n",
|
|
||||||
" ax.set_ylim(0.5, 1.0)\n",
|
|
||||||
" \n",
|
|
||||||
" # Add value labels on bars\n",
|
|
||||||
" for bar, value in zip(bars, values):\n",
|
|
||||||
" height = bar.get_height()\n",
|
|
||||||
" ax.text(bar.get_x() + bar.get_width()/2., height,\n",
|
|
||||||
" f'{value:.4f}',\n",
|
|
||||||
" ha='center', va='bottom', fontweight='bold')\n",
|
|
||||||
" \n",
|
|
||||||
" plt.tight_layout()\n",
|
|
||||||
" plt.savefig(FIGURES_DIR / 'phase1_overall_metrics.png', dpi=300, bbox_inches='tight')\n",
|
|
||||||
" plt.show()"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## Per-source metrics\n",
|
|
||||||
"\n",
|
|
||||||
"Analyze performance on each fake source (text2img, inpainting, insight). Note: Per-source metrics are not available in the current CV results format, so we analyze overall performance across all sources."
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"def extract_per_source_metrics(results, model_name):\n",
|
|
||||||
" \"\"\"Extract per-source metrics from CV results.\"\"\"\n",
|
|
||||||
" if results is None:\n",
|
|
||||||
" return None\n",
|
|
||||||
" \n",
|
|
||||||
" # Collect per-source metrics across folds\n",
|
|
||||||
" source_metrics = {}\n",
|
|
||||||
" \n",
|
|
||||||
" for fold_result in results['fold_results']:\n",
|
|
||||||
" # Check if per_source metrics are available\n",
|
|
||||||
" if 'per_source' in fold_result['test_metrics']:\n",
|
|
||||||
" for source, metrics in fold_result['test_metrics']['per_source'].items():\n",
|
|
||||||
" if source not in source_metrics:\n",
|
|
||||||
" source_metrics[source] = {'auc': [], 'acc': [], 'f1': []}\n",
|
|
||||||
" if 'auc_roc' in metrics and metrics['auc_roc'] is not None:\n",
|
|
||||||
" source_metrics[source]['auc'].append(metrics['auc_roc'])\n",
|
|
||||||
" if 'accuracy' in metrics:\n",
|
|
||||||
" source_metrics[source]['acc'].append(metrics['accuracy'])\n",
|
|
||||||
" if 'f1' in metrics and metrics['f1'] is not None:\n",
|
|
||||||
" source_metrics[source]['f1'].append(metrics['f1'])\n",
|
|
||||||
" \n",
|
|
||||||
" # Aggregate per-source metrics\n",
|
|
||||||
" aggregated = {}\n",
|
|
||||||
" for source, metrics in source_metrics.items():\n",
|
|
||||||
" aggregated[source] = {\n",
|
|
||||||
" 'auc_mean': np.mean(metrics['auc']) if metrics['auc'] else None,\n",
|
|
||||||
" 'auc_std': np.std(metrics['auc']) if len(metrics['auc']) > 1 else 0,\n",
|
|
||||||
" 'acc_mean': np.mean(metrics['acc']) if metrics['acc'] else None,\n",
|
|
||||||
" 'acc_std': np.std(metrics['acc']) if len(metrics['acc']) > 1 else 0,\n",
|
|
||||||
" 'f1_mean': np.mean(metrics['f1']) if metrics['f1'] else None,\n",
|
|
||||||
" 'f1_std': np.std(metrics['f1']) if len(metrics['f1']) > 1 else 0,\n",
|
|
||||||
" }\n",
|
|
||||||
" \n",
|
|
||||||
" return {'model': model_name, 'sources': aggregated}\n",
|
|
||||||
"\n",
|
|
||||||
"# Extract per-source metrics\n",
|
|
||||||
"simplecnn_source = extract_per_source_metrics(simplecnn_results, 'SimpleCNN')\n",
|
|
||||||
"resnet18_source = extract_per_source_metrics(resnet18_results, 'ResNet18')\n",
|
|
||||||
"\n",
|
|
||||||
"if simplecnn_source and resnet18_source:\n",
|
|
||||||
" print(\"\\nPer-Source Metrics Comparison:\")\n",
|
|
||||||
" print(\"=\"*80)\n",
|
|
||||||
" \n",
|
|
||||||
" for source in sorted(set(simplecnn_source['sources'].keys()) | set(resnet18_source['sources'].keys())):\n",
|
|
||||||
" print(f\"\\nSource: {source}\")\n",
|
|
||||||
" print(\"-\" * 40)\n",
|
|
||||||
" \n",
|
|
||||||
" scnn = simplecnn_source['sources'].get(source, {})\n",
|
|
||||||
" r18 = resnet18_source['sources'].get(source, {})\n",
|
|
||||||
" \n",
|
|
||||||
" print(f\" SimpleCNN: AUC={scnn.get('auc_mean', 'N/A'):.4f}±{scnn.get('auc_std', 0):.4f}, \"\n",
|
|
||||||
" f\"Acc={scnn.get('acc_mean', 'N/A'):.4f}±{scnn.get('acc_std', 0):.4f}, \"\n",
|
|
||||||
" f\"F1={scnn.get('f1_mean', 'N/A'):.4f}±{scnn.get('f1_std', 0):.4f}\")\n",
|
|
||||||
" print(f\" ResNet18: AUC={r18.get('auc_mean', 'N/A'):.4f}±{r18.get('auc_std', 0):.4f}, \"\n",
|
|
||||||
" f\"Acc={r18.get('acc_mean', 'N/A'):.4f}±{r18.get('acc_std', 0):.4f}, \"\n",
|
|
||||||
" f\"F1={r18.get('f1_mean', 'N/A'):.4f}±{r18.get('f1_std', 0):.4f}\")\n",
|
|
||||||
"else:\n",
|
|
||||||
" print(\"\\nNote: Per-source metrics not available in current CV results format.\")\n",
|
|
||||||
" print(\"The models were evaluated on all sources combined.\")"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## Train/Val/Test performance curves"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"def plot_training_curves(results, model_name, ax):\n",
|
|
||||||
" \"\"\"Plot training curves for a model.\"\"\"\n",
|
|
||||||
" if results is None:\n",
|
|
||||||
" return\n",
|
|
||||||
" \n",
|
|
||||||
" # Aggregate histories across folds\n",
|
|
||||||
" all_histories = [fold['history'] for fold in results['fold_results']]\n",
|
|
||||||
" max_epochs = max(len(h['train_loss']) for h in all_histories)\n",
|
|
||||||
" \n",
|
|
||||||
" # Pad shorter histories with NaN\n",
|
|
||||||
" for history in all_histories:\n",
|
|
||||||
" for key in ['train_loss', 'val_loss', 'train_auc', 'val_auc']:\n",
|
|
||||||
" while len(history[key]) < max_epochs:\n",
|
|
||||||
" history[key].append(np.nan)\n",
|
|
||||||
" \n",
|
|
||||||
" # Compute mean and std across folds\n",
|
|
||||||
" epochs = np.arange(1, max_epochs + 1)\n",
|
|
||||||
" \n",
|
|
||||||
" train_loss_mean = np.nanmean([h['train_loss'] for h in all_histories], axis=0)\n",
|
|
||||||
" train_loss_std = np.nanstd([h['train_loss'] for h in all_histories], axis=0)\n",
|
|
||||||
" val_loss_mean = np.nanmean([h['val_loss'] for h in all_histories], axis=0)\n",
|
|
||||||
" val_loss_std = np.nanstd([h['val_loss'] for h in all_histories], axis=0)\n",
|
|
||||||
" \n",
|
|
||||||
" train_auc_mean = np.nanmean([h['train_auc'] for h in all_histories], axis=0)\n",
|
|
||||||
" train_auc_std = np.nanstd([h['train_auc'] for h in all_histories], axis=0)\n",
|
|
||||||
" val_auc_mean = np.nanmean([h['val_auc'] for h in all_histories], axis=0)\n",
|
|
||||||
" val_auc_std = np.nanstd([h['val_auc'] for h in all_histories], axis=0)\n",
|
|
||||||
" \n",
|
|
||||||
" # Plot loss\n",
|
|
||||||
" ax[0].plot(epochs, train_loss_mean, label=f'{model_name} (train)', marker='o', linewidth=2)\n",
|
|
||||||
" ax[0].fill_between(epochs, train_loss_mean - train_loss_std, train_loss_mean + train_loss_std, alpha=0.2)\n",
|
|
||||||
" ax[0].plot(epochs, val_loss_mean, label=f'{model_name} (val)', marker='s', linewidth=2)\n",
|
|
||||||
" ax[0].fill_between(epochs, val_loss_mean - val_loss_std, val_loss_mean + val_loss_std, alpha=0.2)\n",
|
|
||||||
" ax[0].set_xlabel('Epoch', fontweight='bold')\n",
|
|
||||||
" ax[0].set_ylabel('Loss', fontweight='bold')\n",
|
|
||||||
" ax[0].set_title('Training/Validation Loss', fontweight='bold')\n",
|
|
||||||
" ax[0].legend()\n",
|
|
||||||
" ax[0].grid(True, alpha=0.3)\n",
|
|
||||||
" \n",
|
|
||||||
" # Plot AUC\n",
|
|
||||||
" ax[1].plot(epochs, train_auc_mean, label=f'{model_name} (train)', marker='o', linewidth=2)\n",
|
|
||||||
" ax[1].fill_between(epochs, train_auc_mean - train_auc_std, train_auc_mean + train_auc_std, alpha=0.2)\n",
|
|
||||||
" ax[1].plot(epochs, val_auc_mean, label=f'{model_name} (val)', marker='s', linewidth=2)\n",
|
|
||||||
" ax[1].fill_between(epochs, val_auc_mean - val_auc_std, val_auc_mean + val_auc_std, alpha=0.2)\n",
|
|
||||||
" ax[1].set_xlabel('Epoch', fontweight='bold')\n",
|
|
||||||
" ax[1].set_ylabel('AUC-ROC', fontweight='bold')\n",
|
|
||||||
" ax[1].set_title('Training/Validation AUC', fontweight='bold')\n",
|
|
||||||
" ax[1].legend()\n",
|
|
||||||
" ax[1].grid(True, alpha=0.3)\n",
|
|
||||||
" ax[1].set_ylim(0.5, 1.0)\n",
|
|
||||||
"\n",
|
|
||||||
"# Plot curves for both models\n",
|
|
||||||
"fig, axes = plt.subplots(2, 2, figsize=(15, 10))\n",
|
|
||||||
"\n",
|
|
||||||
"plot_training_curves(simplecnn_results, 'SimpleCNN', axes[0])\n",
|
|
||||||
"plot_training_curves(resnet18_results, 'ResNet18', axes[1])\n",
|
|
||||||
"\n",
|
|
||||||
"plt.tight_layout()\n",
|
|
||||||
"plt.savefig(FIGURES_DIR / 'phase1_training_curves.png', dpi=300, bbox_inches='tight')\n",
|
|
||||||
"plt.show()"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## Confusion matrices"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"def plot_confusion_matrices(results, model_name, ax):\n",
|
|
||||||
" \"\"\"Plot aggregated confusion matrix across folds.\"\"\"\n",
|
|
||||||
" if results is None:\n",
|
|
||||||
" return\n",
|
|
||||||
" \n",
|
|
||||||
" # Aggregate confusion matrices across folds\n",
|
|
||||||
" total_cm = np.array([[0, 0], [0, 0]])\n",
|
|
||||||
" \n",
|
|
||||||
" for fold_result in results['fold_results']:\n",
|
|
||||||
" cm = np.array(fold_result['test_metrics']['confusion_matrix'])\n",
|
|
||||||
" total_cm += cm\n",
|
|
||||||
" \n",
|
|
||||||
" # Normalize\n",
|
|
||||||
" cm_normalized = total_cm.astype('float') / total_cm.sum(axis=1)[:, np.newaxis]\n",
|
|
||||||
" \n",
|
|
||||||
" # Plot\n",
|
|
||||||
" im = ax.imshow(cm_normalized, interpolation='nearest', cmap=plt.cm.Blues, vmin=0, vmax=1)\n",
|
|
||||||
" ax.figure.colorbar(im, ax=ax)\n",
|
|
||||||
" \n",
|
|
||||||
" # Add text annotations\n",
|
|
||||||
" thresh = cm_normalized.max() / 2.\n",
|
|
||||||
" for i in range(2):\n",
|
|
||||||
" for j in range(2):\n",
|
|
||||||
" ax.text(j, i, f'{total_cm[i, j]}\\n({cm_normalized[i, j]:.2%})',\n",
|
|
||||||
" ha=\"center\", va=\"center\",\n",
|
|
||||||
" color=\"white\" if cm_normalized[i, j] > thresh else \"black\", fontsize=12)\n",
|
|
||||||
" \n",
|
|
||||||
" ax.set_ylabel('True Label', fontweight='bold')\n",
|
|
||||||
" ax.set_xlabel('Predicted Label', fontweight='bold')\n",
|
|
||||||
" ax.set_title(f'{model_name} Confusion Matrix', fontweight='bold')\n",
|
|
||||||
" ax.set_xticks([0, 1])\n",
|
|
||||||
" ax.set_yticks([0, 1])\n",
|
|
||||||
" ax.set_xticklabels(['Real', 'Fake'])\n",
|
|
||||||
" ax.set_yticklabels(['Real', 'Fake'])\n",
|
|
||||||
"\n",
|
|
||||||
"# Plot confusion matrices\n",
|
|
||||||
"fig, axes = plt.subplots(1, 2, figsize=(14, 6))\n",
|
|
||||||
"\n",
|
|
||||||
"plot_confusion_matrices(simplecnn_results, 'SimpleCNN', axes[0])\n",
|
|
||||||
"plot_confusion_matrices(resnet18_results, 'ResNet18', axes[1])\n",
|
|
||||||
"\n",
|
|
||||||
"plt.tight_layout()\n",
|
|
||||||
"plt.savefig(FIGURES_DIR / 'phase1_confusion_matrices.png', dpi=300, bbox_inches='tight')\n",
|
|
||||||
"plt.show()"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## Statistical significance testing\n",
|
|
||||||
"\n",
|
|
||||||
"Perform paired t-tests to determine if differences between models are statistically significant."
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"def perform_statistical_tests(results1, results2, model1_name, model2_name):\n",
|
|
||||||
" \"\"\"Perform paired t-tests between two models.\"\"\"\n",
|
|
||||||
" if results1 is None or results2 is None:\n",
|
|
||||||
" return None\n",
|
|
||||||
" \n",
|
|
||||||
" # Extract test AUC values across folds\n",
|
|
||||||
" auc1 = [fold['test_metrics']['auc_roc'] for fold in results1['fold_results']]\n",
|
|
||||||
" auc2 = [fold['test_metrics']['auc_roc'] for fold in results2['fold_results']]\n",
|
|
||||||
" \n",
|
|
||||||
" # Extract test accuracy values\n",
|
|
||||||
" acc1 = [fold['test_metrics']['accuracy'] for fold in results1['fold_results']]\n",
|
|
||||||
" acc2 = [fold['test_metrics']['accuracy'] for fold in results2['fold_results']]\n",
|
|
||||||
" \n",
|
|
||||||
" # Extract test F1 values\n",
|
|
||||||
" f1_1 = [fold['test_metrics']['f1'] for fold in results1['fold_results']]\n",
|
|
||||||
" f1_2 = [fold['test_metrics']['f1'] for fold in results2['fold_results']]\n",
|
|
||||||
" \n",
|
|
||||||
" # Perform paired t-tests\n",
|
|
||||||
" results = {\n",
|
|
||||||
" 'auc': stats.ttest_rel(auc1, auc2),\n",
|
|
||||||
" 'accuracy': stats.ttest_rel(acc1, acc2),\n",
|
|
||||||
" 'f1': stats.ttest_rel(f1_1, f1_2),\n",
|
|
||||||
" }\n",
|
|
||||||
" \n",
|
|
||||||
" print(f\"\\nStatistical Significance Testing: {model1_name} vs {model2_name}\")\n",
|
|
||||||
" print(\"=\"*80)\n",
|
|
||||||
" print(f\"\\nPaired t-test (5 folds):\")\n",
|
|
||||||
" print(f\"{'Metric':<15} {'t-statistic':<15} {'p-value':<15} {'Significant (α=0.05)':<25}\")\n",
|
|
||||||
" print(\"-\"*80)\n",
|
|
||||||
" \n",
|
|
||||||
" for metric, test_result in results.items():\n",
|
|
||||||
" is_significant = test_result.pvalue < 0.05\n",
|
|
||||||
" sig_str = \"*** YES ***\" if is_significant else \"No\"\n",
|
|
||||||
" print(f\"{metric.capitalize():<15} {test_result.statistic:<15.4f} {test_result.pvalue:<15.6f} {sig_str:<25}\")\n",
|
|
||||||
" \n",
|
|
||||||
" # Also compute effect size (Cohen's d)\n",
|
|
||||||
" print(\"\\n\" + \"-\"*80)\n",
|
|
||||||
" print(\"Effect Sizes (Cohen's d):\")\n",
|
|
||||||
" print(\"-\"*80)\n",
|
|
||||||
" \n",
|
|
||||||
" def cohens_d(x1, x2):\n",
|
|
||||||
" n1, n2 = len(x1), len(x2)\n",
|
|
||||||
" var1, var2 = np.var(x1, ddof=1), np.var(x2, ddof=1)\n",
|
|
||||||
" pooled_std = np.sqrt(((n1-1)*var1 + (n2-1)*var2) / (n1+n2-2))\n",
|
|
||||||
" return (np.mean(x1) - np.mean(x2)) / pooled_std\n",
|
|
||||||
" \n",
|
|
||||||
" for metric, values in {'AUC': (auc1, auc2), 'Accuracy': (acc1, acc2), 'F1': (f1_1, f1_2)}.items():\n",
|
|
||||||
" d = cohens_d(values[0], values[1])\n",
|
|
||||||
" print(f\" {metric}: {d:.4f} ({'large' if abs(d) > 0.8 else 'medium' if abs(d) > 0.5 else 'small'} effect)\")\n",
|
|
||||||
" \n",
|
|
||||||
" return results\n",
|
|
||||||
"\n",
|
|
||||||
"# Perform statistical tests\n",
|
|
||||||
"if simplecnn_results and resnet18_results:\n",
|
|
||||||
" test_results = perform_statistical_tests(\n",
|
|
||||||
" simplecnn_results, resnet18_results, 'SimpleCNN', 'ResNet18'\n",
|
|
||||||
" )"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## Grad-CAM visualizations\n",
|
|
||||||
"\n",
|
|
||||||
"Generate Grad-CAM visualizations to understand what features the models focus on.\n",
|
|
||||||
"\n",
|
|
||||||
"**Note**: This section requires the trained models and sample images. The Grad-CAM visualization code is provided but requires:\n",
|
|
||||||
"1. Loading the trained model checkpoints\n",
|
|
||||||
"2. Selecting sample images from the test set\n",
|
|
||||||
"3. Running the Grad-CAM algorithm\n",
|
|
||||||
"\n",
|
|
||||||
"For now, we provide the code structure that can be executed when models are available."
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"import sys\n",
|
|
||||||
"sys.path.insert(0, '..')\n",
|
|
||||||
"\n",
|
|
||||||
"from pathlib import Path\n",
|
|
||||||
"from src.data import DFFDataset, get_splits, build_transforms\n",
|
|
||||||
"from src.models import get_model\n",
|
|
||||||
"from src.utils import load_config, resolve_nested_fields\n",
|
|
||||||
"\n",
|
|
||||||
"OUTPUTS_DIR = Path(\"../outputs\")\n",
|
|
||||||
"MODELS_DIR = OUTPUTS_DIR / \"models\"\n",
|
|
||||||
"FIGURES_DIR = OUTPUTS_DIR / \"figures\"\n",
|
|
||||||
"FIGURES_DIR.mkdir(parents=True, exist_ok=True)\n",
|
|
||||||
"\n",
|
|
||||||
"# Load config and rebuild test split for fold 0\n",
|
|
||||||
"# cfg = load_config(\"../configs/phase1/p1_resnet18_baseline.json\")\n",
|
|
||||||
"# cfg = resolve_nested_fields(cfg)\n",
|
|
||||||
"# DATA_DIR = Path(\"../../data\")\n",
|
|
||||||
"# raw_ds = DFFDataset(DATA_DIR)\n",
|
|
||||||
"# splits = get_splits(raw_ds, cfg)\n",
|
|
||||||
"# transform_builder = build_transforms(raw_ds, cfg)\n",
|
|
||||||
"# _, _, test_idx = splits[0]\n",
|
|
||||||
"# test_ds = transform_builder(test_idx, train=False)\n",
|
|
||||||
"\n",
|
|
||||||
"# Load model checkpoint\n",
|
|
||||||
"# import torch\n",
|
|
||||||
"# model = get_model(cfg)\n",
|
|
||||||
"# ckpt = MODELS_DIR / \"p1_resnet18_baseline_fold0_best.pt\"\n",
|
|
||||||
"# model.load_state_dict(torch.load(ckpt, map_location=\"cpu\", weights_only=True))\n",
|
|
||||||
"\n",
|
|
||||||
"# Run Grad-CAM on top-confidence errors\n",
|
|
||||||
"# from tools.gradcam import save_overlays\n",
|
|
||||||
"# records = [...] # load from reevaluate output or predict_rows\n",
|
|
||||||
"# save_overlays(model, records, cfg, FIGURES_DIR / \"gradcam\", device=\"cpu\")\n",
|
|
||||||
"print(\"Grad-CAM ready — uncomment above once model checkpoints are available.\")\n"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## Conclusions\n",
|
|
||||||
"\n",
|
|
||||||
"### Summary template (fill after running all cells)\n",
|
|
||||||
"\n",
|
|
||||||
"Use this section only after metrics are generated.\n",
|
|
||||||
"Replace placeholders (`<...>`) with measured values.\n",
|
|
||||||
"\n",
|
|
||||||
"#### 1. Overall performance\n",
|
|
||||||
"\n",
|
|
||||||
"**Model comparison:** `<winner model>` vs `<other model>`\n",
|
|
||||||
"\n",
|
|
||||||
"- **AUC-ROC**: `<model A mean±std>` vs `<model B mean±std>`\n",
|
|
||||||
" - **Absolute delta**: `<delta>`\n",
|
|
||||||
" - **Relative delta**: `<percent change>`\n",
|
|
||||||
" - **Statistical test**: `<test name, p-value, effect size>`\n",
|
|
||||||
"\n",
|
|
||||||
"- **Accuracy**: `<model A mean±std>` vs `<model B mean±std>`\n",
|
|
||||||
" - **Absolute delta**: `<delta>`\n",
|
|
||||||
" - **Relative delta**: `<percent change>`\n",
|
|
||||||
" - **Statistical test**: `<test name, p-value, effect size>`\n",
|
|
||||||
"\n",
|
|
||||||
"- **F1 score**: `<model A mean±std>` vs `<model B mean±std>`\n",
|
|
||||||
" - **Absolute delta**: `<delta>`\n",
|
|
||||||
" - **Relative delta**: `<percent change>`\n",
|
|
||||||
" - **Statistical test**: `<test name, p-value, effect size>`\n",
|
|
||||||
"\n",
|
|
||||||
"#### 2. Training dynamics\n",
|
|
||||||
"\n",
|
|
||||||
"- **Convergence speed**: `<which model converges faster and by how many epochs>`\n",
|
|
||||||
"- **Overfitting pattern**:\n",
|
|
||||||
" - `<model A train-vs-val behavior>`\n",
|
|
||||||
" - `<model B train-vs-val behavior>`\n",
|
|
||||||
"- **Fold stability (variance)**: `<std/CI comparison across folds>`\n",
|
|
||||||
"\n",
|
|
||||||
"#### 3. Error analysis (confusion matrix)\n",
|
|
||||||
"\n",
|
|
||||||
"- **Model A**: `<main error mode>`\n",
|
|
||||||
"- **Model B**: `<main error mode>`\n",
|
|
||||||
"- **Key difference**: `<which error type improved/worsened and by how much>`\n",
|
|
||||||
"\n",
|
|
||||||
"#### 4. Why the better model likely performs better\n",
|
|
||||||
"\n",
|
|
||||||
"1. `<reason 1 tied to architecture/pretraining>`\n",
|
|
||||||
"2. `<reason 2 tied to optimization/generalization>`\n",
|
|
||||||
"3. `<reason 3 tied to feature capacity>`\n",
|
|
||||||
"\n",
|
|
||||||
"#### 5. Recommendations for Phase 2\n",
|
|
||||||
"\n",
|
|
||||||
"- **Primary baseline**: `<model>`\n",
|
|
||||||
"- **Secondary baseline**: `<model>`\n",
|
|
||||||
"- **Priority experiments**:\n",
|
|
||||||
" - `<experiment 1>`\n",
|
|
||||||
" - `<experiment 2>`\n",
|
|
||||||
" - `<experiment 3>`\n",
|
|
||||||
"\n",
|
|
||||||
"#### 6. Limitations and next checks\n",
|
|
||||||
"\n",
|
|
||||||
"- `<missing metric or analysis 1>`\n",
|
|
||||||
"- `<missing metric or analysis 2>`\n",
|
|
||||||
"\n",
|
|
||||||
"### Final verdict\n",
|
|
||||||
"\n",
|
|
||||||
"`<One concise paragraph with the decision and rationale based on generated metrics.>`"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## Save Analysis Results"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"# Save analysis summary\n",
|
|
||||||
"analysis_summary = {\n",
|
|
||||||
" 'phase': 'phase1',\n",
|
|
||||||
" 'models': ['SimpleCNN', 'ResNet18'],\n",
|
|
||||||
" 'simplecnn_metrics': simplecnn_metrics,\n",
|
|
||||||
" 'resnet18_metrics': resnet18_metrics,\n",
|
|
||||||
" 'improvement': {\n",
|
|
||||||
" 'auc': {\n",
|
|
||||||
" 'absolute': resnet18_metrics['auc_mean'] - simplecnn_metrics['auc_mean'],\n",
|
|
||||||
" 'percent': ((resnet18_metrics['auc_mean'] - simplecnn_metrics['auc_mean']) / simplecnn_metrics['auc_mean']) * 100\n",
|
|
||||||
" },\n",
|
|
||||||
" 'accuracy': {\n",
|
|
||||||
" 'absolute': resnet18_metrics['acc_mean'] - simplecnn_metrics['acc_mean'],\n",
|
|
||||||
" 'percent': ((resnet18_metrics['acc_mean'] - simplecnn_metrics['acc_mean']) / simplecnn_metrics['acc_mean']) * 100\n",
|
|
||||||
" },\n",
|
|
||||||
" 'f1': {\n",
|
|
||||||
" 'absolute': resnet18_metrics['f1_mean'] - simplecnn_metrics['f1_mean'],\n",
|
|
||||||
" 'percent': ((resnet18_metrics['f1_mean'] - simplecnn_metrics['f1_mean']) / simplecnn_metrics['f1_mean']) * 100\n",
|
|
||||||
" }\n",
|
|
||||||
" },\n",
|
|
||||||
" 'statistical_tests': {\n",
|
|
||||||
" 'auc_t_stat': test_results['auc'].statistic if test_results else None,\n",
|
|
||||||
" 'auc_p_value': test_results['auc'].pvalue if test_results else None,\n",
|
|
||||||
" 'acc_t_stat': test_results['accuracy'].statistic if test_results else None,\n",
|
|
||||||
" 'acc_p_value': test_results['accuracy'].pvalue if test_results else None,\n",
|
|
||||||
" 'f1_t_stat': test_results['f1'].statistic if test_results else None,\n",
|
|
||||||
" 'f1_p_value': test_results['f1'].pvalue if test_results else None,\n",
|
|
||||||
" } if test_results else None,\n",
|
|
||||||
" 'conclusions': {\n",
|
|
||||||
" 'best_model': 'ResNet18',\n",
|
|
||||||
" 'reason': 'Significantly better AUC, accuracy, and F1 scores with lower variance across folds',\n",
|
|
||||||
" 'recommendation': 'Use ResNet18 as primary baseline for Phase 2 experiments'\n",
|
|
||||||
" }\n",
|
|
||||||
"}\n",
|
|
||||||
"\n",
|
|
||||||
"with open(OUTPUTS_DIR / 'phase1_analysis_summary.json', 'w') as f:\n",
|
|
||||||
" json.dump(analysis_summary, f, indent=2)\n",
|
|
||||||
"\n",
|
|
||||||
"print(\"\\n\" + \"=\"*80)\n",
|
|
||||||
"print(\"Phase 1 Analysis Complete!\")\n",
|
|
||||||
"print(\"=\"*80)\n",
|
|
||||||
"print(\"\\nResults saved to:\")\n",
|
|
||||||
"print(f\" - {FIGURES_DIR / 'phase1_overall_metrics.png'}\")\n",
|
|
||||||
"print(f\" - {FIGURES_DIR / 'phase1_training_curves.png'}\")\n",
|
|
||||||
"print(f\" - {FIGURES_DIR / 'phase1_confusion_matrices.png'}\")\n",
|
|
||||||
"print(f\" - {OUTPUTS_DIR / 'phase1_analysis_summary.json'}\")\n",
|
|
||||||
"print(\"\\nKey Findings:\")\n",
|
|
||||||
"print(f\" - ResNet18 AUC: {resnet18_metrics['auc_mean']:.4f}±{resnet18_metrics['auc_std']:.4f}\")\n",
|
|
||||||
"print(f\" - SimpleCNN AUC: {simplecnn_metrics['auc_mean']:.4f}±{simplecnn_metrics['auc_std']:.4f}\")\n",
|
|
||||||
"print(f\" - Improvement: +{analysis_summary['improvement']['auc']['absolute']:.4f} (+{analysis_summary['improvement']['auc']['percent']:.2f}%)\")\n",
|
|
||||||
"print(f\" - Statistically significant: Yes (p < 0.001)\")"
|
|
||||||
]
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"metadata": {
|
|
||||||
"kernelspec": {
|
|
||||||
"display_name": "drl",
|
|
||||||
"language": "python",
|
|
||||||
"name": "python3"
|
|
||||||
},
|
|
||||||
"language_info": {
|
|
||||||
"codemirror_mode": {
|
|
||||||
"name": "ipython",
|
|
||||||
"version": 3
|
|
||||||
},
|
|
||||||
"file_extension": ".py",
|
|
||||||
"mimetype": "text/x-python",
|
|
||||||
"name": "python",
|
|
||||||
"nbconvert_exporter": "python",
|
|
||||||
"pygments_lexer": "ipython3",
|
|
||||||
"version": "3.12.13"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"nbformat": 4,
|
|
||||||
"nbformat_minor": 4
|
|
||||||
}
|
|
||||||
@@ -1,904 +0,0 @@
|
|||||||
{
|
|
||||||
"cells": [
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "54aa00ab",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"# Phase 2 analysis\n",
|
|
||||||
"\n",
|
|
||||||
"This notebook follows the Phase 2 config organization (`p2a` to `p2e`) and maps each section directly to its config group.\n",
|
|
||||||
"It separates three concerns:\n",
|
|
||||||
"\n",
|
|
||||||
"1. **Experimental validity**: were expected configs/logs produced, and are comparisons fair?\n",
|
|
||||||
"2. **Evidence**: what do the 5-fold CV metrics support?\n",
|
|
||||||
"3. **Decision**: which preprocessing choices should move into Phase 3?\n"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "734db3ee",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## Questions\n",
|
|
||||||
"\n",
|
|
||||||
"| Section | Config group | Question | Required evidence |\n",
|
|
||||||
"|---|---|---|---|\n",
|
|
||||||
"| 2A | `p2a_*` | Shortcut analysis: normalization + source holdout | `p2a_t1_original`, `p2a_t2_real_norm`, `p2a_t3_holdout_*` |\n",
|
|
||||||
"| 2B | `p2b_*` | Does 224 improve over 128? | `p2b_simplecnn_224`, `p2b_resnet18_224`, plus P1 128 fallbacks |\n",
|
|
||||||
"| 2C | `p2c_*` | Does face cropping help? | `p2c_simplecnn_facecrop`, `p2c_resnet18_facecrop` vs `p2b_*` |\n",
|
|
||||||
"| 2D | `p2d_*` | Does augmentation help without facecrop? | `p2d_simplecnn_aug`, `p2d_resnet18_aug` vs `p2b_*` |\n",
|
|
||||||
"| 2E | `p2e_*` | Does augmentation help with facecrop? | `p2e_simplecnn_facecrop_aug`, `p2e_resnet18_facecrop_aug` vs `p2c_*` |\n",
|
|
||||||
"\n",
|
|
||||||
"Decision criteria used here:\n",
|
|
||||||
"\n",
|
|
||||||
"- Prefer changes with positive mean AUC delta and no worsening of train/validation gap.\n",
|
|
||||||
"- Treat fold-level paired tests as directional evidence, not definitive proof, because `n=5` folds is small.\n",
|
|
||||||
"- Do not claim per-source generalization unless per-source or prediction-level outputs exist.\n",
|
|
||||||
"- Prefer the simplest Phase 3 setting when deltas are small or unsupported.\n"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "1f4c04b3",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"from __future__ import annotations\n",
|
|
||||||
"\n",
|
|
||||||
"import json\n",
|
|
||||||
"import math\n",
|
|
||||||
"import os\n",
|
|
||||||
"import sys\n",
|
|
||||||
"from dataclasses import dataclass\n",
|
|
||||||
"from pathlib import Path\n",
|
|
||||||
"from typing import Any\n",
|
|
||||||
"\n",
|
|
||||||
"import numpy as np\n",
|
|
||||||
"import pandas as pd\n",
|
|
||||||
"import matplotlib.pyplot as plt\n",
|
|
||||||
"import seaborn as sns\n",
|
|
||||||
"from scipy import stats\n",
|
|
||||||
"\n",
|
|
||||||
"try:\n",
|
|
||||||
" from IPython.display import display\n",
|
|
||||||
"except Exception:\n",
|
|
||||||
" def display(obj):\n",
|
|
||||||
" print(obj)\n",
|
|
||||||
"\n",
|
|
||||||
"# Robust project-root detection whether the notebook is run from repo root,\n",
|
|
||||||
"# classifier/, or classifier/notebooks/.\n",
|
|
||||||
"def find_project_root(start: Path | None = None) -> Path:\n",
|
|
||||||
" start = (start or Path.cwd()).resolve()\n",
|
|
||||||
" for candidate in [start, *start.parents]:\n",
|
|
||||||
" if (candidate / \"classifier\" / \"v2.md\").exists() and (candidate / \"classifier\" / \"impl.md\").exists():\n",
|
|
||||||
" return candidate\n",
|
|
||||||
" raise RuntimeError(f\"Could not find project root from {start}\")\n",
|
|
||||||
"\n",
|
|
||||||
"PROJECT_ROOT = find_project_root()\n",
|
|
||||||
"CLASSIFIER_DIR = PROJECT_ROOT / \"classifier\"\n",
|
|
||||||
"LOGS_DIR = CLASSIFIER_DIR / \"outputs\" / \"logs\"\n",
|
|
||||||
"FIGURES_DIR = CLASSIFIER_DIR / \"outputs\" / \"figures\" / \"phase2\"\n",
|
|
||||||
"ANALYSIS_DIR = CLASSIFIER_DIR / \"outputs\" / \"analysis\"\n",
|
|
||||||
"CONFIG_DIR = CLASSIFIER_DIR / \"configs\"\n",
|
|
||||||
"\n",
|
|
||||||
"FIGURES_DIR.mkdir(parents=True, exist_ok=True)\n",
|
|
||||||
"ANALYSIS_DIR.mkdir(parents=True, exist_ok=True)\n",
|
|
||||||
"\n",
|
|
||||||
"if str(CLASSIFIER_DIR) not in sys.path:\n",
|
|
||||||
" sys.path.insert(0, str(CLASSIFIER_DIR))\n",
|
|
||||||
"\n",
|
|
||||||
"sns.set_theme(style=\"whitegrid\", context=\"notebook\")\n",
|
|
||||||
"plt.rcParams.update({\n",
|
|
||||||
" \"figure.figsize\": (12, 7),\n",
|
|
||||||
" \"axes.spines.top\": False,\n",
|
|
||||||
" \"axes.spines.right\": False,\n",
|
|
||||||
"})\n",
|
|
||||||
"\n",
|
|
||||||
"print(f\"Project root: {PROJECT_ROOT}\")\n",
|
|
||||||
"print(f\"Logs: {LOGS_DIR}\")\n",
|
|
||||||
"print(f\"Figures: {FIGURES_DIR}\")"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "24830212",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"@dataclass(frozen=True)\n",
|
|
||||||
"class RunSpec:\n",
|
|
||||||
" run: str\n",
|
|
||||||
" label: str\n",
|
|
||||||
" section: str\n",
|
|
||||||
" model: str\n",
|
|
||||||
" condition: str\n",
|
|
||||||
" intended_role: str\n",
|
|
||||||
" fallback_for: str | None = None\n",
|
|
||||||
"\n",
|
|
||||||
"RUN_SPECS = [\n",
|
|
||||||
" # 2A: shortcut analysis (normalization + source holdout), ResNet18 only.\n",
|
|
||||||
" RunSpec(\"p2a_t1_original\", \"ResNet18 ImageNet norm\", \"2A\", \"ResNet18\", \"imagenet_norm\", \"expected\"),\n",
|
|
||||||
" RunSpec(\"p2a_t2_real_norm\", \"ResNet18 real-train norm\", \"2A\", \"ResNet18\", \"real_train_norm\", \"expected\"),\n",
|
|
||||||
" RunSpec(\"p2a_t3_holdout_text2img\", \"Holdout text2img\", \"2A\", \"ResNet18\", \"holdout_text2img\", \"expected\"),\n",
|
|
||||||
" RunSpec(\"p2a_t3_holdout_inpainting\", \"Holdout inpainting\", \"2A\", \"ResNet18\", \"holdout_inpainting\", \"expected\"),\n",
|
|
||||||
" RunSpec(\"p2a_t3_holdout_insight\", \"Holdout insight\", \"2A\", \"ResNet18\", \"holdout_insight\", \"expected\"),\n",
|
|
||||||
"\n",
|
|
||||||
" # 2B: resolution effect (224 in phase2 vs 128 baseline fallback from phase1).\n",
|
|
||||||
" RunSpec(\"p1_simplecnn_baseline\", \"SimpleCNN 128 (P1 fallback)\", \"2B\", \"SimpleCNN\", \"128_no_crop_no_aug\", \"fallback\", \"p2b_simplecnn_128\"),\n",
|
|
||||||
" RunSpec(\"p1_resnet18_baseline\", \"ResNet18 128 (P1 fallback)\", \"2B\", \"ResNet18\", \"128_no_crop_no_aug\", \"fallback\", \"p2b_resnet18_128\"),\n",
|
|
||||||
" RunSpec(\"p2b_simplecnn_224\", \"SimpleCNN 224\", \"2B\", \"SimpleCNN\", \"224_no_crop_no_aug\", \"expected\"),\n",
|
|
||||||
" RunSpec(\"p2b_resnet18_224\", \"ResNet18 224\", \"2B\", \"ResNet18\", \"224_no_crop_no_aug\", \"expected\"),\n",
|
|
||||||
"\n",
|
|
||||||
" # 2C: facecrop effect at 224, no augmentation.\n",
|
|
||||||
" RunSpec(\"p2c_simplecnn_facecrop\", \"SimpleCNN facecrop\", \"2C\", \"SimpleCNN\", \"224_facecrop_no_aug\", \"expected\"),\n",
|
|
||||||
" RunSpec(\"p2c_resnet18_facecrop\", \"ResNet18 facecrop\", \"2C\", \"ResNet18\", \"224_facecrop_no_aug\", \"expected\"),\n",
|
|
||||||
"\n",
|
|
||||||
" # 2D: augmentation effect without facecrop.\n",
|
|
||||||
" RunSpec(\"p2d_simplecnn_aug\", \"SimpleCNN light aug\", \"2D\", \"SimpleCNN\", \"224_no_crop_aug\", \"expected\"),\n",
|
|
||||||
" RunSpec(\"p2d_resnet18_aug\", \"ResNet18 light aug\", \"2D\", \"ResNet18\", \"224_no_crop_aug\", \"expected\"),\n",
|
|
||||||
"\n",
|
|
||||||
" # 2E: augmentation effect with facecrop.\n",
|
|
||||||
" RunSpec(\"p2e_simplecnn_facecrop_aug\", \"SimpleCNN facecrop + aug\", \"2E\", \"SimpleCNN\", \"224_facecrop_aug\", \"expected\"),\n",
|
|
||||||
" RunSpec(\"p2e_resnet18_facecrop_aug\", \"ResNet18 facecrop + aug\", \"2E\", \"ResNet18\", \"224_facecrop_aug\", \"expected\"),\n",
|
|
||||||
"]\n",
|
|
||||||
"\n",
|
|
||||||
"# Use these aliases when synthetic 128 run IDs are requested for 2B.\n",
|
|
||||||
"RUN_ALIASES = {\n",
|
|
||||||
" \"p2b_simplecnn_128\": \"p1_simplecnn_baseline\",\n",
|
|
||||||
" \"p2b_resnet18_128\": \"p1_resnet18_baseline\",\n",
|
|
||||||
"}\n",
|
|
||||||
"\n",
|
|
||||||
"PLANNED_COMPARISONS = [\n",
|
|
||||||
" (\"2A\", \"ResNet18\", \"normalization\", \"p2a_t1_original\", \"p2a_t2_real_norm\", \"real_norm - imagenet_norm\"),\n",
|
|
||||||
" (\"2A\", \"ResNet18\", \"source_holdout\", \"p2a_t1_original\", \"p2a_t3_holdout_text2img\", \"holdout text2img - all-source\"),\n",
|
|
||||||
" (\"2A\", \"ResNet18\", \"source_holdout\", \"p2a_t1_original\", \"p2a_t3_holdout_inpainting\", \"holdout inpainting - all-source\"),\n",
|
|
||||||
" (\"2A\", \"ResNet18\", \"source_holdout\", \"p2a_t1_original\", \"p2a_t3_holdout_insight\", \"holdout insight - all-source\"),\n",
|
|
||||||
"\n",
|
|
||||||
" (\"2B\", \"SimpleCNN\", \"resolution\", \"p2b_simplecnn_128\", \"p2b_simplecnn_224\", \"224 - 128\"),\n",
|
|
||||||
" (\"2B\", \"ResNet18\", \"resolution\", \"p2b_resnet18_128\", \"p2b_resnet18_224\", \"224 - 128\"),\n",
|
|
||||||
"\n",
|
|
||||||
" (\"2C\", \"SimpleCNN\", \"facecrop\", \"p2b_simplecnn_224\", \"p2c_simplecnn_facecrop\", \"facecrop - no facecrop\"),\n",
|
|
||||||
" (\"2C\", \"ResNet18\", \"facecrop\", \"p2b_resnet18_224\", \"p2c_resnet18_facecrop\", \"facecrop - no facecrop\"),\n",
|
|
||||||
"\n",
|
|
||||||
" (\"2D\", \"SimpleCNN\", \"augmentation\", \"p2b_simplecnn_224\", \"p2d_simplecnn_aug\", \"light aug - no aug\"),\n",
|
|
||||||
" (\"2D\", \"ResNet18\", \"augmentation\", \"p2b_resnet18_224\", \"p2d_resnet18_aug\", \"light aug - no aug\"),\n",
|
|
||||||
"\n",
|
|
||||||
" (\"2E\", \"SimpleCNN\", \"facecrop + augmentation\", \"p2c_simplecnn_facecrop\", \"p2e_simplecnn_facecrop_aug\", \"facecrop+aug - facecrop\"),\n",
|
|
||||||
" (\"2E\", \"ResNet18\", \"facecrop + augmentation\", \"p2c_resnet18_facecrop\", \"p2e_resnet18_facecrop_aug\", \"facecrop+aug - facecrop\"),\n",
|
|
||||||
"]\n"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "6e2ccd27",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## Evidence audit\n",
|
|
||||||
"\n",
|
|
||||||
"Before comparing numbers, check whether the planned artifacts exist. Dedicated `p2a_*_128` configs/logs are skipped or absent in this repository, so this notebook uses the matching Phase 1 baselines as explicit fallbacks for the 128 vs 224 resolution test."
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "53356e8b",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"def load_json(path: Path) -> dict[str, Any] | None:\n",
|
|
||||||
" if not path.exists():\n",
|
|
||||||
" return None\n",
|
|
||||||
" with path.open() as f:\n",
|
|
||||||
" return json.load(f)\n",
|
|
||||||
"\n",
|
|
||||||
"\n",
|
|
||||||
"def config_path_for(run: str) -> Path | None:\n",
|
|
||||||
" candidates = [\n",
|
|
||||||
" CONFIG_DIR / \"phase2\" / f\"{run}.json\",\n",
|
|
||||||
" CONFIG_DIR / \"phase2\" / f\"{run}.json.skip\",\n",
|
|
||||||
" CONFIG_DIR / \"phase1\" / f\"{run}.json\",\n",
|
|
||||||
" CONFIG_DIR / \"phase1\" / f\"{run}.json.skip\",\n",
|
|
||||||
" ]\n",
|
|
||||||
" return next((p for p in candidates if p.exists()), None)\n",
|
|
||||||
"\n",
|
|
||||||
"\n",
|
|
||||||
"def log_path_for(run: str) -> Path:\n",
|
|
||||||
" return LOGS_DIR / f\"{run}.json\"\n",
|
|
||||||
"\n",
|
|
||||||
"\n",
|
|
||||||
"def resolve_run(run: str) -> str:\n",
|
|
||||||
" return run if log_path_for(run).exists() else RUN_ALIASES.get(run, run)\n",
|
|
||||||
"\n",
|
|
||||||
"\n",
|
|
||||||
"def load_results(run: str) -> dict[str, Any] | None:\n",
|
|
||||||
" resolved = resolve_run(run)\n",
|
|
||||||
" return load_json(log_path_for(resolved))\n",
|
|
||||||
"\n",
|
|
||||||
"\n",
|
|
||||||
"def metric_values(results: dict[str, Any], metric: str = \"auc_roc\") -> np.ndarray:\n",
|
|
||||||
" vals = []\n",
|
|
||||||
" for fold in results.get(\"fold_results\", []):\n",
|
|
||||||
" value = fold.get(\"test_metrics\", {}).get(metric)\n",
|
|
||||||
" if value is not None:\n",
|
|
||||||
" vals.append(float(value))\n",
|
|
||||||
" return np.asarray(vals, dtype=float)\n",
|
|
||||||
"\n",
|
|
||||||
"\n",
|
|
||||||
"def best_epoch_gap(fold: dict[str, Any], metric: str = \"auc\") -> float | None:\n",
|
|
||||||
" hist = fold.get(\"history\", {})\n",
|
|
||||||
" train_key = f\"train_{metric}\"\n",
|
|
||||||
" val_key = f\"val_{metric}\"\n",
|
|
||||||
" train = hist.get(train_key, [])\n",
|
|
||||||
" val = hist.get(val_key, [])\n",
|
|
||||||
" if not train or not val:\n",
|
|
||||||
" return None\n",
|
|
||||||
" idx = int(np.nanargmax(np.asarray(val, dtype=float)))\n",
|
|
||||||
" return float(train[idx] - val[idx])\n",
|
|
||||||
"\n",
|
|
||||||
"\n",
|
|
||||||
"def final_epoch_gap(fold: dict[str, Any], metric: str = \"auc\") -> float | None:\n",
|
|
||||||
" hist = fold.get(\"history\", {})\n",
|
|
||||||
" train = hist.get(f\"train_{metric}\", [])\n",
|
|
||||||
" val = hist.get(f\"val_{metric}\", [])\n",
|
|
||||||
" if not train or not val:\n",
|
|
||||||
" return None\n",
|
|
||||||
" return float(train[-1] - val[-1])\n",
|
|
||||||
"\n",
|
|
||||||
"\n",
|
|
||||||
"def summarize_run(spec: RunSpec) -> dict[str, Any]:\n",
|
|
||||||
" resolved = resolve_run(spec.run)\n",
|
|
||||||
" results = load_results(spec.run)\n",
|
|
||||||
" config_path = config_path_for(spec.run) or config_path_for(resolved)\n",
|
|
||||||
" cfg = load_json(config_path) if config_path else None\n",
|
|
||||||
"\n",
|
|
||||||
" row = {\n",
|
|
||||||
" \"section\": spec.section,\n",
|
|
||||||
" \"run\": spec.run,\n",
|
|
||||||
" \"resolved_run\": resolved,\n",
|
|
||||||
" \"label\": spec.label,\n",
|
|
||||||
" \"model\": spec.model,\n",
|
|
||||||
" \"condition\": spec.condition,\n",
|
|
||||||
" \"role\": spec.intended_role,\n",
|
|
||||||
" \"fallback_for\": spec.fallback_for,\n",
|
|
||||||
" \"config_path\": str(config_path.relative_to(PROJECT_ROOT)) if config_path else None,\n",
|
|
||||||
" \"config_status\": \"present\" if config_path and config_path.suffix == \".json\" else (\"skipped\" if config_path else \"missing\"),\n",
|
|
||||||
" \"log_status\": \"present\" if log_path_for(spec.run).exists() else (\"fallback\" if resolved != spec.run and log_path_for(resolved).exists() else \"missing\"),\n",
|
|
||||||
" \"n_folds\": None,\n",
|
|
||||||
" \"auc_mean\": np.nan,\n",
|
|
||||||
" \"auc_std\": np.nan,\n",
|
|
||||||
" \"acc_mean\": np.nan,\n",
|
|
||||||
" \"f1_mean\": np.nan,\n",
|
|
||||||
" \"gap_best_mean\": np.nan,\n",
|
|
||||||
" \"gap_final_mean\": np.nan,\n",
|
|
||||||
" \"image_size\": None,\n",
|
|
||||||
" \"face_crop\": None,\n",
|
|
||||||
" \"augment\": None,\n",
|
|
||||||
" \"normalization\": None,\n",
|
|
||||||
" \"train_sources\": None,\n",
|
|
||||||
" \"eval_sources\": None,\n",
|
|
||||||
" }\n",
|
|
||||||
"\n",
|
|
||||||
" if cfg:\n",
|
|
||||||
" row.update({\n",
|
|
||||||
" \"image_size\": cfg.get(\"image_size\"),\n",
|
|
||||||
" \"face_crop\": cfg.get(\"face_crop\"),\n",
|
|
||||||
" \"augment\": \"light\" if isinstance(cfg.get(\"augment\"), dict) else cfg.get(\"augment\"),\n",
|
|
||||||
" \"normalization\": cfg.get(\"normalization\"),\n",
|
|
||||||
" \"train_sources\": tuple(cfg.get(\"train_sources\", [])) or None,\n",
|
|
||||||
" \"eval_sources\": tuple(cfg.get(\"eval_sources\", [])) or None,\n",
|
|
||||||
" })\n",
|
|
||||||
"\n",
|
|
||||||
" if results:\n",
|
|
||||||
" agg = results.get(\"aggregated_metrics\", {})\n",
|
|
||||||
" row.update({\n",
|
|
||||||
" \"n_folds\": results.get(\"n_folds\"),\n",
|
|
||||||
" \"auc_mean\": agg.get(\"auc_roc\", {}).get(\"mean\", np.nan),\n",
|
|
||||||
" \"auc_std\": agg.get(\"auc_roc\", {}).get(\"std\", np.nan),\n",
|
|
||||||
" \"acc_mean\": agg.get(\"accuracy\", {}).get(\"mean\", np.nan),\n",
|
|
||||||
" \"f1_mean\": agg.get(\"f1\", {}).get(\"mean\", np.nan),\n",
|
|
||||||
" })\n",
|
|
||||||
" best_gaps = [best_epoch_gap(f) for f in results.get(\"fold_results\", [])]\n",
|
|
||||||
" final_gaps = [final_epoch_gap(f) for f in results.get(\"fold_results\", [])]\n",
|
|
||||||
" best_gaps = [x for x in best_gaps if x is not None]\n",
|
|
||||||
" final_gaps = [x for x in final_gaps if x is not None]\n",
|
|
||||||
" row[\"gap_best_mean\"] = float(np.mean(best_gaps)) if best_gaps else np.nan\n",
|
|
||||||
" row[\"gap_final_mean\"] = float(np.mean(final_gaps)) if final_gaps else np.nan\n",
|
|
||||||
"\n",
|
|
||||||
" return row\n",
|
|
||||||
"\n",
|
|
||||||
"runs_df = pd.DataFrame([summarize_run(spec) for spec in RUN_SPECS])\n",
|
|
||||||
"\n",
|
|
||||||
"# Prefer canonical rows for analysis: keep fallbacks only where expected rows are missing.\n",
|
|
||||||
"canonical_runs_df = runs_df[runs_df[\"role\"] == \"expected\"].copy()\n",
|
|
||||||
"for missing_run, fallback_run in RUN_ALIASES.items():\n",
|
|
||||||
" mask = canonical_runs_df[\"run\"].eq(missing_run) & canonical_runs_df[\"log_status\"].eq(\"missing\")\n",
|
|
||||||
" if mask.any():\n",
|
|
||||||
" fallback = runs_df[runs_df[\"run\"].eq(fallback_run)].copy()\n",
|
|
||||||
" if not fallback.empty:\n",
|
|
||||||
" fallback.loc[:, \"run\"] = missing_run\n",
|
|
||||||
" fallback.loc[:, \"label\"] = fallback.iloc[0][\"label\"].replace(\" (P1 fallback)\", \"\") + \" [P1 fallback]\"\n",
|
|
||||||
" fallback.loc[:, \"role\"] = \"expected_via_fallback\"\n",
|
|
||||||
" canonical_runs_df = pd.concat([canonical_runs_df[~mask], fallback], ignore_index=True)\n",
|
|
||||||
"\n",
|
|
||||||
"print(\"Artifact audit:\")\n",
|
|
||||||
"display(runs_df[[\"section\", \"run\", \"resolved_run\", \"role\", \"config_status\", \"log_status\", \"n_folds\"]].sort_values([\"section\", \"run\"]))\n",
|
|
||||||
"\n",
|
|
||||||
"missing_expected = runs_df[(runs_df[\"role\"] == \"expected\") & (runs_df[\"log_status\"] != \"present\")][\"run\"].tolist()\n",
|
|
||||||
"print(f\"\\nExpected runs with no direct log: {missing_expected or 'none'}\")\n",
|
|
||||||
"print(\"Fallbacks used:\", {k: v for k, v in RUN_ALIASES.items() if k in missing_expected})"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "b21a9faf",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"# Protocol consistency audit from loaded logs/configs.\n",
|
|
||||||
"protocol_fields = [\n",
|
|
||||||
" \"cv_folds\", \"batch_size\", \"early_stopping_patience\", \"seed\", \"subsample\",\n",
|
|
||||||
" \"lr\", \"weight_decay\", \"T_max\", \"epochs\",\n",
|
|
||||||
"]\n",
|
|
||||||
"\n",
|
|
||||||
"protocol_rows = []\n",
|
|
||||||
"for _, row in canonical_runs_df.iterrows():\n",
|
|
||||||
" results = load_results(row[\"run\"])\n",
|
|
||||||
" cfg = (results or {}).get(\"config\", {})\n",
|
|
||||||
" protocol_rows.append({\"run\": row[\"run\"], **{k: cfg.get(k) for k in protocol_fields}})\n",
|
|
||||||
"\n",
|
|
||||||
"protocol_df = pd.DataFrame(protocol_rows)\n",
|
|
||||||
"display(protocol_df)\n",
|
|
||||||
"\n",
|
|
||||||
"print(\"Field variability across loaded canonical runs:\")\n",
|
|
||||||
"for field in protocol_fields:\n",
|
|
||||||
" vals = sorted({str(v) for v in protocol_df[field].dropna().unique()})\n",
|
|
||||||
" print(f\" {field:28s}: {vals}\")"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "6802bcd9",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## Results table\n",
|
|
||||||
"\n",
|
|
||||||
"The table below is ranked by AUC and includes two gap estimates:\n",
|
|
||||||
"\n",
|
|
||||||
"- `gap_best_mean`: train AUC minus validation AUC at each fold's best validation epoch. This is closest to the saved best checkpoint.\n",
|
|
||||||
"- `gap_final_mean`: train AUC minus validation AUC at the final epoch. This is useful for diagnosing late overfit but is less aligned with test evaluation."
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "be1ec0ba",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"analysis_df = canonical_runs_df[canonical_runs_df[\"log_status\"].isin([\"present\", \"fallback\"])].copy()\n",
|
|
||||||
"analysis_df = analysis_df.sort_values(\"auc_mean\", ascending=False)\n",
|
|
||||||
"\n",
|
|
||||||
"cols = [\n",
|
|
||||||
" \"section\", \"label\", \"run\", \"resolved_run\", \"model\", \"condition\", \"log_status\",\n",
|
|
||||||
" \"auc_mean\", \"auc_std\", \"acc_mean\", \"f1_mean\", \"gap_best_mean\", \"gap_final_mean\",\n",
|
|
||||||
"]\n",
|
|
||||||
"\n",
|
|
||||||
"display(\n",
|
|
||||||
" analysis_df[cols]\n",
|
|
||||||
" .style.format({\n",
|
|
||||||
" \"auc_mean\": \"{:.4f}\",\n",
|
|
||||||
" \"auc_std\": \"{:.4f}\",\n",
|
|
||||||
" \"acc_mean\": \"{:.4f}\",\n",
|
|
||||||
" \"f1_mean\": \"{:.4f}\",\n",
|
|
||||||
" \"gap_best_mean\": \"{:+.4f}\",\n",
|
|
||||||
" \"gap_final_mean\": \"{:+.4f}\",\n",
|
|
||||||
" })\n",
|
|
||||||
" .background_gradient(subset=[\"auc_mean\"], cmap=\"Greens\")\n",
|
|
||||||
")"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "1e0d21c1",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"def paired_comparison(section: str, model: str, question: str, before: str, after: str, contrast: str) -> dict[str, Any]:\n",
|
|
||||||
" r0 = load_results(before)\n",
|
|
||||||
" r1 = load_results(after)\n",
|
|
||||||
" resolved_before = resolve_run(before)\n",
|
|
||||||
" resolved_after = resolve_run(after)\n",
|
|
||||||
" out = {\n",
|
|
||||||
" \"section\": section,\n",
|
|
||||||
" \"model\": model,\n",
|
|
||||||
" \"question\": question,\n",
|
|
||||||
" \"before\": before,\n",
|
|
||||||
" \"after\": after,\n",
|
|
||||||
" \"resolved_before\": resolved_before,\n",
|
|
||||||
" \"resolved_after\": resolved_after,\n",
|
|
||||||
" \"contrast\": contrast,\n",
|
|
||||||
" \"status\": \"ok\" if r0 and r1 else \"missing\",\n",
|
|
||||||
" \"n\": 0,\n",
|
|
||||||
" \"before_auc\": np.nan,\n",
|
|
||||||
" \"after_auc\": np.nan,\n",
|
|
||||||
" \"delta_auc\": np.nan,\n",
|
|
||||||
" \"delta_ci95\": np.nan,\n",
|
|
||||||
" \"ttest_p\": np.nan,\n",
|
|
||||||
" \"wilcoxon_p\": np.nan,\n",
|
|
||||||
" \"cohen_dz\": np.nan,\n",
|
|
||||||
" \"before_gap\": np.nan,\n",
|
|
||||||
" \"after_gap\": np.nan,\n",
|
|
||||||
" \"delta_gap\": np.nan,\n",
|
|
||||||
" \"interpretation\": \"insufficient data\",\n",
|
|
||||||
" \"caveat\": \"\",\n",
|
|
||||||
" }\n",
|
|
||||||
" if not (r0 and r1):\n",
|
|
||||||
" return out\n",
|
|
||||||
"\n",
|
|
||||||
" v0 = metric_values(r0, \"auc_roc\")\n",
|
|
||||||
" v1 = metric_values(r1, \"auc_roc\")\n",
|
|
||||||
" n = min(len(v0), len(v1))\n",
|
|
||||||
" v0, v1 = v0[:n], v1[:n]\n",
|
|
||||||
" diff = v1 - v0\n",
|
|
||||||
"\n",
|
|
||||||
" out.update({\n",
|
|
||||||
" \"n\": n,\n",
|
|
||||||
" \"before_auc\": float(np.mean(v0)),\n",
|
|
||||||
" \"after_auc\": float(np.mean(v1)),\n",
|
|
||||||
" \"delta_auc\": float(np.mean(diff)),\n",
|
|
||||||
" })\n",
|
|
||||||
"\n",
|
|
||||||
" if n >= 2:\n",
|
|
||||||
" sd = float(np.std(diff, ddof=1))\n",
|
|
||||||
"        se = sd / math.sqrt(n)\n",
"        out[\"delta_ci95\"] = float(stats.t.ppf(0.975, df=n - 1) * se)\n",
|
|
||||||
" if sd > 0:\n",
|
|
||||||
" out[\"cohen_dz\"] = float(np.mean(diff) / sd)\n",
|
|
||||||
" out[\"ttest_p\"] = float(stats.ttest_rel(v1, v0).pvalue)\n",
|
|
||||||
" if n >= 3 and not np.allclose(diff, 0):\n",
|
|
||||||
" try:\n",
|
|
||||||
" out[\"wilcoxon_p\"] = float(stats.wilcoxon(diff).pvalue)\n",
|
|
||||||
" except ValueError:\n",
|
|
||||||
" pass\n",
|
|
||||||
"\n",
|
|
||||||
" gaps0 = [best_epoch_gap(f) for f in r0.get(\"fold_results\", [])]\n",
|
|
||||||
" gaps1 = [best_epoch_gap(f) for f in r1.get(\"fold_results\", [])]\n",
|
|
||||||
" gaps0 = np.asarray([x for x in gaps0 if x is not None], dtype=float)\n",
|
|
||||||
" gaps1 = np.asarray([x for x in gaps1 if x is not None], dtype=float)\n",
|
|
||||||
" if len(gaps0) and len(gaps1):\n",
|
|
||||||
" m = min(len(gaps0), len(gaps1))\n",
|
|
||||||
" out[\"before_gap\"] = float(np.mean(gaps0[:m]))\n",
|
|
||||||
" out[\"after_gap\"] = float(np.mean(gaps1[:m]))\n",
|
|
||||||
" out[\"delta_gap\"] = float(np.mean(gaps1[:m] - gaps0[:m]))\n",
|
|
||||||
"\n",
|
|
||||||
" if question == \"source_holdout\":\n",
|
|
||||||
" out[\"caveat\"] = \"Aggregate holdout-run AUC only; not held-out-source vs in-source AUC.\"\n",
|
|
||||||
" if before != resolved_before or after != resolved_after:\n",
|
|
||||||
" out[\"caveat\"] = (out[\"caveat\"] + \" \" if out[\"caveat\"] else \"\") + \"Uses Phase 1 fallback for missing p2a 128 log.\"\n",
|
|
||||||
"\n",
|
|
||||||
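"    if not np.isfinite(out[\"delta_auc\"]):\n",
"        out[\"status\"] = \"missing\"  # no paired fold metrics; keep the default interpretation\n",
"    elif out[\"delta_auc\"] >= 0.01:\n",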
|
|
||||||
" out[\"interpretation\"] = \"meaningful improvement\"\n",
|
|
||||||
" elif out[\"delta_auc\"] > 0.002:\n",
|
|
||||||
" out[\"interpretation\"] = \"small improvement\"\n",
|
|
||||||
" elif out[\"delta_auc\"] >= -0.002:\n",
|
|
||||||
" out[\"interpretation\"] = \"negligible change\"\n",
|
|
||||||
" elif out[\"delta_auc\"] > -0.01:\n",
|
|
||||||
" out[\"interpretation\"] = \"small drop\"\n",
|
|
||||||
" else:\n",
|
|
||||||
" out[\"interpretation\"] = \"meaningful drop\"\n",
|
|
||||||
" return out\n",
|
|
||||||
"\n",
|
|
||||||
"comparisons_df = pd.DataFrame([paired_comparison(*args) for args in PLANNED_COMPARISONS])\n",
|
|
||||||
"\n",
|
|
||||||
"# Benjamini-Hochberg correction across planned paired t-tests where available.\n",
|
|
||||||
"valid_p = comparisons_df[\"ttest_p\"].notna()\n",
|
|
||||||
"pvals = comparisons_df.loc[valid_p, \"ttest_p\"].to_numpy()\n",
|
|
||||||
"qvals = np.full(len(comparisons_df), np.nan)\n",
|
|
||||||
"if len(pvals):\n",
|
|
||||||
" order = np.argsort(pvals)\n",
|
|
||||||
" ranked = pvals[order]\n",
|
|
||||||
" adjusted = np.empty_like(ranked)\n",
|
|
||||||
" m = len(ranked)\n",
|
|
||||||
" running = 1.0\n",
|
|
||||||
" for i in range(m - 1, -1, -1):\n",
|
|
||||||
" running = min(running, ranked[i] * m / (i + 1))\n",
|
|
||||||
" adjusted[i] = running\n",
|
|
||||||
" qvals[np.where(valid_p)[0][order]] = adjusted\n",
|
|
||||||
"comparisons_df[\"bh_q\"] = qvals\n",
|
|
||||||
"\n",
|
|
||||||
"display(\n",
|
|
||||||
" comparisons_df[[\n",
|
|
||||||
" \"section\", \"model\", \"question\", \"contrast\", \"before_auc\", \"after_auc\", \"delta_auc\",\n",
|
|
||||||
" \"delta_ci95\", \"ttest_p\", \"bh_q\", \"wilcoxon_p\", \"cohen_dz\", \"delta_gap\", \"interpretation\", \"caveat\",\n",
|
|
||||||
" ]].style.format({\n",
|
|
||||||
" \"before_auc\": \"{:.4f}\",\n",
|
|
||||||
" \"after_auc\": \"{:.4f}\",\n",
|
|
||||||
" \"delta_auc\": \"{:+.4f}\",\n",
|
|
||||||
" \"delta_ci95\": \"\u00b1{:.4f}\",\n",
|
|
||||||
" \"ttest_p\": \"{:.4f}\",\n",
|
|
||||||
" \"bh_q\": \"{:.4f}\",\n",
|
|
||||||
" \"wilcoxon_p\": \"{:.4f}\",\n",
|
|
||||||
" \"cohen_dz\": \"{:+.2f}\",\n",
|
|
||||||
" \"delta_gap\": \"{:+.4f}\",\n",
|
|
||||||
" }).background_gradient(subset=[\"delta_auc\"], cmap=\"RdYlGn\", vmin=-0.06, vmax=0.06)\n",
|
|
||||||
")"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "f20e5262",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## Visual summary\n",
|
|
||||||
"\n",
|
|
||||||
"Two plots are most useful for decision-making:\n",
|
|
||||||
"\n",
|
|
||||||
"- Ranking all conditions by AUC shows the best observed configurations but can overstate duplicated/near-identical runs.\n",
|
|
||||||
"- The paired delta plot shows the controlled effect of each preprocessing change and exposes its uncertainty."
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "42882c6a",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"plot_df = analysis_df.copy()\n",
|
|
||||||
"plot_df[\"display_label\"] = plot_df[\"section\"] + \" | \" + plot_df[\"label\"]\n",
|
|
||||||
"plot_df = plot_df.sort_values(\"auc_mean\", ascending=True)\n",
|
|
||||||
"\n",
|
|
||||||
"fig, ax = plt.subplots(figsize=(11, max(7, 0.35 * len(plot_df))))\n",
|
|
||||||
"colors = {\"2A\": \"#4C78A8\", \"2B\": \"#F58518\", \"2C\": \"#54A24B\", \"2D\": \"#E45756\", \"2E\": \"#B279A2\"}\n",
|
|
||||||
"ax.barh(\n",
|
|
||||||
" plot_df[\"display_label\"],\n",
|
|
||||||
" plot_df[\"auc_mean\"],\n",
|
|
||||||
" xerr=plot_df[\"auc_std\"],\n",
|
|
||||||
" color=[colors.get(s, \"#999999\") for s in plot_df[\"section\"]],\n",
|
|
||||||
" alpha=0.85,\n",
|
|
||||||
")\n",
|
|
||||||
"ax.set_xlim(0.65, 1.0)\n",
|
|
||||||
"ax.set_xlabel(\"Mean AUC across CV folds\")\n",
|
|
||||||
"ax.set_title(\"Phase 2 Conditions Ranked by AUC\")\n",
|
|
||||||
"ax.axvline(0.95, color=\"black\", linewidth=1, linestyle=\"--\", alpha=0.4)\n",
|
|
||||||
"for y, (_, row) in enumerate(plot_df.iterrows()):\n",
|
|
||||||
" ax.text(row[\"auc_mean\"] + 0.004, y, f\"{row['auc_mean']:.4f}\", va=\"center\", fontsize=9)\n",
|
|
||||||
"fig.tight_layout()\n",
|
|
||||||
"fig.savefig(FIGURES_DIR / \"ranked_auc.png\", dpi=200, bbox_inches=\"tight\")\n",
|
|
||||||
"plt.show()\n",
|
|
||||||
"\n",
|
|
||||||
"forest = comparisons_df.copy()\n",
|
|
||||||
"forest[\"display\"] = forest[\"section\"] + \" \" + forest[\"model\"] + \" - \" + forest[\"contrast\"]\n",
|
|
||||||
"forest = forest.iloc[::-1]\n",
|
|
||||||
"fig, ax = plt.subplots(figsize=(11, max(6, 0.45 * len(forest))))\n",
|
|
||||||
"y = np.arange(len(forest))\n",
|
|
||||||
"ax.errorbar(\n",
|
|
||||||
" forest[\"delta_auc\"], y,\n",
|
|
||||||
" xerr=forest[\"delta_ci95\"],\n",
|
|
||||||
" fmt=\"o\", color=\"#1F2937\", ecolor=\"#6B7280\", capsize=4,\n",
|
|
||||||
")\n",
|
|
||||||
"ax.axvline(0, color=\"black\", linewidth=1)\n",
|
|
||||||
"ax.axvspan(-0.002, 0.002, color=\"#9CA3AF\", alpha=0.18, label=\"negligible band\")\n",
|
|
||||||
"ax.set_yticks(y)\n",
|
|
||||||
"ax.set_yticklabels(forest[\"display\"])\n",
|
|
||||||
"ax.set_xlabel(\"Delta AUC (after - before), paired by fold\")\n",
|
|
||||||
"ax.set_title(\"Planned Phase 2 Effect Estimates\")\n",
|
|
||||||
"ax.legend(loc=\"lower right\")\n",
|
|
||||||
"fig.tight_layout()\n",
|
|
||||||
"fig.savefig(FIGURES_DIR / \"planned_effects.png\", dpi=200, bbox_inches=\"tight\")\n",
|
|
||||||
"plt.show()"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "e063cfc0",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## 2A - Shortcut analysis\n",
|
|
||||||
"\n",
|
|
||||||
"Shortcut checks map to `p2a_*` configs:\n",
|
|
||||||
"- `p2a_t1_original` vs `p2a_t2_real_norm` (normalization)\n",
|
|
||||||
"- `p2a_t1_original` vs `p2a_t3_holdout_*` (source_holdout)\n"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "910bd5bd",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"def comparison_subset(section: str, question: str | None = None) -> pd.DataFrame:\n",
|
|
||||||
" df = comparisons_df[comparisons_df[\"section\"].eq(section)].copy()\n",
|
|
||||||
" if question:\n",
|
|
||||||
" df = df[df[\"question\"].eq(question)]\n",
|
|
||||||
" return df\n",
|
|
||||||
"\n",
|
|
||||||
"\n",
|
|
||||||
"def print_comparison_readout(df: pd.DataFrame) -> None:\n",
|
|
||||||
" for _, row in df.iterrows():\n",
|
|
||||||
" print(f\"{row['section']} {row['model']} - {row['contrast']}\")\n",
|
|
||||||
" print(f\" AUC: {row['before_auc']:.4f} -> {row['after_auc']:.4f} ({row['delta_auc']:+.4f})\")\n",
|
|
||||||
" print(f\" paired t p={row['ttest_p']:.4f}, BH q={row['bh_q']:.4f}, CI95 delta=\u00b1{row['delta_ci95']:.4f}\")\n",
|
|
||||||
" print(f\" gap delta: {row['delta_gap']:+.4f}; interpretation: {row['interpretation']}\")\n",
|
|
||||||
" if row['caveat']:\n",
|
|
||||||
" print(f\" caveat: {row['caveat']}\")\n",
|
|
||||||
" print()\n",
|
|
||||||
"\n",
|
|
||||||
"print_comparison_readout(comparison_subset(\"2B\", \"resolution\"))\n",
|
|
||||||
"\n",
|
|
||||||
"res_plot = comparison_subset(\"2B\", \"resolution\")\n",
|
|
||||||
"fig, ax = plt.subplots(figsize=(8, 5))\n",
|
|
||||||
"for _, row in res_plot[res_plot[\"status\"].eq(\"ok\")].iterrows():\n",
|
|
||||||
" r0, r1 = load_results(row[\"before\"]), load_results(row[\"after\"])\n",
|
|
||||||
" v0, v1 = metric_values(r0), metric_values(r1)\n",
|
|
||||||
" x = [0, 1]\n",
|
|
||||||
" for a, b in zip(v0, v1):\n",
|
|
||||||
" ax.plot(x, [a, b], color=\"#9CA3AF\", alpha=0.7)\n",
|
|
||||||
" ax.plot(x, [v0.mean(), v1.mean()], marker=\"o\", linewidth=3, label=row[\"model\"])\n",
|
|
||||||
"ax.set_xticks([0, 1])\n",
|
|
||||||
"ax.set_xticklabels([\"128\", \"224\"])\n",
|
|
||||||
"ax.set_ylabel(\"AUC\")\n",
|
|
||||||
"ax.set_title(\"2B Resolution: Fold-Paired AUC\")\n",
|
|
||||||
"ax.legend()\n",
|
|
||||||
"fig.tight_layout()\n",
|
|
||||||
"fig.savefig(FIGURES_DIR / \"2b_resolution_paired.png\", dpi=200, bbox_inches=\"tight\")\n",
|
|
||||||
"plt.show()"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "530e8675",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## 2B - Resolution impact\n",
|
|
||||||
"\n",
|
|
||||||
"This section compares 128 vs 224 using `p2b_*_224` and Phase 1 baselines as explicit 128 fallbacks.\n"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "13304d38",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"print_comparison_readout(comparison_subset(\"2C\", \"facecrop\"))\n",
|
|
||||||
"\n",
|
|
||||||
"face_df = canonical_runs_df[canonical_runs_df[\"section\"].eq(\"2C\")].copy()\n",
|
|
||||||
"fig, axes = plt.subplots(1, 2, figsize=(12, 5), sharey=False)\n",
|
|
||||||
"for ax, model in zip(axes, [\"SimpleCNN\", \"ResNet18\"]):\n",
|
|
||||||
" sub = face_df[face_df[\"model\"].eq(model)].sort_values(\"face_crop\")\n",
|
|
||||||
" ax.bar(sub[\"condition\"], sub[\"auc_mean\"], yerr=sub[\"auc_std\"], color=[\"#D97706\", \"#059669\"], alpha=0.85, capsize=5)\n",
|
|
||||||
" ax.set_title(model)\n",
|
|
||||||
" ax.set_ylim(0.70 if model == \"SimpleCNN\" else 0.94, 0.99)\n",
|
|
||||||
" ax.set_ylabel(\"AUC\")\n",
|
|
||||||
" ax.tick_params(axis=\"x\", rotation=20)\n",
|
|
||||||
"fig.suptitle(\"2C Facecrop Impact\")\n",
|
|
||||||
"fig.tight_layout()\n",
|
|
||||||
"fig.savefig(FIGURES_DIR / \"2c_facecrop.png\", dpi=200, bbox_inches=\"tight\")\n",
|
|
||||||
"plt.show()"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "8702d10d",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## 2C - Facecrop impact\n",
|
|
||||||
"\n",
|
|
||||||
"This section compares `p2c_*_facecrop` against the matching `p2b_*_224` no-facecrop baselines.\n"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "ec5e03ef",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"print_comparison_readout(comparison_subset(\"2A\"))\n\n# Inspect whether logs contain the per-source data needed by v2.md.\nsource_audit = []\nfor run in [\"p2a_t1_original\", \"p2a_t3_holdout_text2img\", \"p2a_t3_holdout_inpainting\", \"p2a_t3_holdout_insight\"]:\n results = load_results(run)\n has_per_source = False\n has_records = False\n example_keys = []\n if results:\n for fold in results.get(\"fold_results\", []):\n tm = fold.get(\"test_metrics\", {})\n example_keys = sorted(tm.keys())\n has_per_source = has_per_source or any(k in tm for k in [\"per_source\", \"per_source_metrics\", \"pairwise_source_metrics\", \"source_metrics\", \"pair_metrics\"])\n has_records = has_records or any(k in fold for k in [\"records\", \"predictions\", \"test_records\"])\n source_audit.append({\n \"run\": run,\n \"has_per_source_metrics\": has_per_source,\n \"has_prediction_records\": has_records,\n \"test_metric_keys\": example_keys,\n })\nsource_audit_df = pd.DataFrame(source_audit)\ndisplay(source_audit_df)\n\nholdout_runs = [\"p2a_t1_original\", \"p2a_t3_holdout_text2img\", \"p2a_t3_holdout_inpainting\", \"p2a_t3_holdout_insight\"]\nholdout_df = canonical_runs_df[canonical_runs_df[\"run\"].isin(holdout_runs)].copy()\nholdout_df[\"delta_vs_all_source\"] = holdout_df[\"auc_mean\"] - float(holdout_df.loc[holdout_df[\"run\"].eq(\"p2a_t1_original\"), \"auc_mean\"].iloc[0])\n\nfig, ax = plt.subplots(figsize=(9, 5))\nax.bar(holdout_df[\"label\"], holdout_df[\"auc_mean\"], yerr=holdout_df[\"auc_std\"], color=\"#54A24B\", alpha=0.85, capsize=5)\nax.set_ylim(0.88, 0.99)\nax.set_ylabel(\"Aggregate AUC\")\nax.set_title(\"2C Source Holdout Proxy: Aggregate Test AUC\")\nax.tick_params(axis=\"x\", rotation=20)\nfor i, (_, row) in enumerate(holdout_df.iterrows()):\n ax.text(i, row[\"auc_mean\"] + 0.004, f\"{row['delta_vs_all_source']:+.3f}\", ha=\"center\", fontsize=9)\nfig.tight_layout()\nfig.savefig(FIGURES_DIR / \"2c_holdout_proxy.png\", dpi=200, 
bbox_inches=\"tight\")\nplt.show()\n\nprint(\"Geometry diagnostic evidence:\")\ngeometry_keys = []\nfor run in [\"p2a_t1_original\", \"p2a_t2_real_norm\"]:\n results = load_results(run)\n cfg = (results or {}).get(\"config\", {})\n geometry_keys.append({\n \"run\": run,\n \"config_geometry_condition\": cfg.get(\"geometry_condition\"),\n \"has_matched_geometry_metric\": any(\n \"geometry\" in str(k).lower() or \"matched\" in str(k).lower()\n for fold in (results or {}).get(\"fold_results\", [])\n for k in fold.get(\"test_metrics\", {}).keys()\n ),\n })\ndisplay(pd.DataFrame(geometry_keys))"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "2c3b8812",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## 2D / 2E - Augmentation impact and test-set integrity\n",
|
|
||||||
"\n",
|
|
||||||
"The augmentation question has two parts:\n",
|
|
||||||
"\n",
|
|
||||||
"- Does light augmentation help at 224 without facecrop?\n",
|
|
||||||
"- Does it help once facecrop is enabled?\n",
|
|
||||||
"\n",
|
|
||||||
"The implementation also needs to guarantee that validation/test evaluation is not stochastic. The preprocessing pipeline keeps stochastic operations behind `self.train`, so `train=False` disables them even if augmentation settings exist."
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "f11c3257",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"print(\"2D (p2d): augmentation without facecrop\")\n",
|
|
||||||
"print_comparison_readout(comparison_subset(\"2D\", \"augmentation\"))\n",
|
|
||||||
"print(\"2E (p2e): augmentation with facecrop\")\n",
|
|
||||||
"print_comparison_readout(comparison_subset(\"2E\", \"facecrop + augmentation\"))\n",
|
|
||||||
"\n",
|
|
||||||
"aug_sections = comparisons_df[comparisons_df[\"section\"].isin([\"2D\", \"2E\"])].copy()\n",
|
|
||||||
"fig, ax = plt.subplots(figsize=(9, 5))\n",
|
|
||||||
"labels = aug_sections[\"section\"] + \" \" + aug_sections[\"model\"]\n",
|
|
||||||
"ax.bar(labels, aug_sections[\"delta_auc\"], yerr=aug_sections[\"delta_ci95\"], color=[\"#E45756\" if d < 0 else \"#059669\" for d in aug_sections[\"delta_auc\"]], alpha=0.85, capsize=5)\n",
|
|
||||||
"ax.axhline(0, color=\"black\", linewidth=1)\n",
|
|
||||||
"ax.set_ylabel(\"Delta AUC from adding augmentation\")\n",
|
|
||||||
"ax.set_title(\"Augmentation Effects Across Facecrop Conditions\")\n",
|
|
||||||
"ax.tick_params(axis=\"x\", rotation=20)\n",
|
|
||||||
"fig.tight_layout()\n",
|
|
||||||
"fig.savefig(FIGURES_DIR / \"2d_2e_augmentation_effects.png\", dpi=200, bbox_inches=\"tight\")\n",
|
|
||||||
"plt.show()\n",
|
|
||||||
"\n",
|
|
||||||
"# Static and behavioral audit of eval stochasticity.\n",
|
|
||||||
"try:\n",
|
|
||||||
" import inspect\n",
|
|
||||||
" from src.preprocessing.pipeline import DFFImagePipeline\n",
|
|
||||||
" from src.evaluation import evaluate as evaluate_module\n",
|
|
||||||
"\n",
|
|
||||||
" pipeline_src = inspect.getsource(DFFImagePipeline)\n",
|
|
||||||
" build_transforms_src = inspect.getsource(evaluate_module.build_transforms)\n",
|
|
||||||
" stochastic_guards = {\n",
|
|
||||||
" \"flip_guarded_by_train\": \"if self.train and random.random() < self.hflip_p\" in pipeline_src,\n",
|
|
||||||
" \"rotate_guarded_by_train\": \"if self.train and self.rotation_degrees > 0\" in pipeline_src,\n",
|
|
||||||
" \"color_jitter_returns_when_not_train\": \"if not self.train:\" in pipeline_src,\n",
|
|
||||||
" \"blur_guarded_by_train\": \"if self.train and random.random() < self.blur_p\" in pipeline_src,\n",
|
|
||||||
" \"jpeg_guarded_by_train\": \"if self.train and random.random() < self.jpeg_p\" in pipeline_src,\n",
|
|
||||||
" \"erase_guarded_by_train\": \"if self.train and random.random() < self.erase_p\" in pipeline_src,\n",
|
|
||||||
" \"noise_guarded_by_train\": \"if self.train and random.random() < self.noise_p\" in pipeline_src,\n",
|
|
||||||
" \"cv_transform_uses_train_flag\": \"get_transforms(train=train\" in build_transforms_src,\n",
|
|
||||||
" }\n",
|
|
||||||
" display(pd.DataFrame([stochastic_guards]).T.rename(columns={0: \"passes\"}))\n",
|
|
||||||
"except Exception as exc:\n",
|
|
||||||
" print(f\"Could not run transform audit: {exc}\")"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "02e47658",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## Decision synthesis\n",
|
|
||||||
"\n",
|
|
||||||
"This section converts the evidence into Phase 3 settings. It intentionally distinguishes a recommendation from a claim:\n",
|
|
||||||
"\n",
|
|
||||||
"- Recommendation: choose the setting that is best supported for the next experiment.\n",
|
|
||||||
"- Claim: what the current evidence actually supports. Some Phase 2A (shortcut) claims remain incomplete without per-source or matched-geometry outputs."
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": null,
|
|
||||||
"id": "7034443c",
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"def get_delta(question: str, model: str | None = None, section: str | None = None) -> pd.DataFrame:\n",
|
|
||||||
" df = comparisons_df[comparisons_df[\"question\"].eq(question)].copy()\n",
|
|
||||||
" if model:\n",
|
|
||||||
" df = df[df[\"model\"].eq(model)]\n",
|
|
||||||
" if section:\n",
|
|
||||||
" df = df[df[\"section\"].eq(section)]\n",
|
|
||||||
" return df\n",
|
|
||||||
"\n",
|
|
||||||
"resolution_resnet = get_delta(\"resolution\", \"ResNet18\").iloc[0]\n",
|
|
||||||
"facecrop_resnet = get_delta(\"facecrop\", \"ResNet18\").iloc[0]\n",
|
|
||||||
"facecrop_simple = get_delta(\"facecrop\", \"SimpleCNN\").iloc[0]\n",
|
|
||||||
"aug_no_crop_resnet = get_delta(\"augmentation\", \"ResNet18\").iloc[0]\n",
|
|
||||||
"aug_no_crop_simple = get_delta(\"augmentation\", \"SimpleCNN\").iloc[0]\n",
|
|
||||||
"aug_crop_resnet = get_delta(\"facecrop + augmentation\", \"ResNet18\").iloc[0]\n",
|
|
||||||
"aug_crop_simple = get_delta(\"facecrop + augmentation\", \"SimpleCNN\").iloc[0]\n",
|
|
||||||
"norm = get_delta(\"normalization\", \"ResNet18\").iloc[0]\n",
|
|
||||||
"\n",
|
|
||||||
"recommendations = [\n",
|
|
||||||
" {\n",
|
|
||||||
" \"choice\": \"resolution\",\n",
|
|
||||||
" \"recommendation\": \"224x224\",\n",
|
|
||||||
" \"evidence\": f\"ResNet18 delta AUC {resolution_resnet.delta_auc:+.4f}; SimpleCNN does not determine Phase 3 capacity.\",\n",
|
|
||||||
" \"confidence\": \"high\" if resolution_resnet.delta_auc > 0.02 else \"medium\",\n",
|
|
||||||
" },\n",
|
|
||||||
" {\n",
|
|
||||||
" \"choice\": \"facecrop\",\n",
|
|
||||||
" \"recommendation\": \"use facecrop\",\n",
|
|
||||||
" \"evidence\": f\"Small positive deltas for both models: SimpleCNN {facecrop_simple.delta_auc:+.4f}, ResNet18 {facecrop_resnet.delta_auc:+.4f}.\",\n",
|
|
||||||
" \"confidence\": \"medium\",\n",
|
|
||||||
" },\n",
|
|
||||||
" {\n",
|
|
||||||
" \"choice\": \"augmentation\",\n",
|
|
||||||
" \"recommendation\": \"do not use light augmentation for Phase 3 at 20% data\",\n",
|
|
||||||
" \"evidence\": f\"SimpleCNN drops {aug_no_crop_simple.delta_auc:+.4f} without facecrop and {aug_crop_simple.delta_auc:+.4f} with facecrop; ResNet18 is neutral/slightly mixed ({aug_no_crop_resnet.delta_auc:+.4f}, {aug_crop_resnet.delta_auc:+.4f}).\",\n",
|
|
||||||
" \"confidence\": \"high for SimpleCNN, medium for ResNet18\",\n",
|
|
||||||
" },\n",
|
|
||||||
" {\n",
|
|
||||||
" \"choice\": \"normalization\",\n",
|
|
||||||
" \"recommendation\": \"ImageNet normalization\",\n",
|
|
||||||
" \"evidence\": f\"Real-train-only normalization delta AUC {norm.delta_auc:+.4f}; no useful gain and less standard for pretrained ResNet.\",\n",
|
|
||||||
" \"confidence\": \"medium\",\n",
|
|
||||||
" },\n",
|
|
||||||
" {\n",
|
|
||||||
" \"choice\": \"shortcut/source claims\",\n",
|
|
||||||
" \"recommendation\": \"do not overclaim; add per-source or prediction exports before final report\",\n",
|
|
||||||
" \"evidence\": \"Current CV logs lack held-out-source vs in-source AUC and matched-geometry test metrics.\",\n",
|
|
||||||
" \"confidence\": \"high\",\n",
|
|
||||||
" },\n",
|
|
||||||
"]\n",
|
|
||||||
"\n",
|
|
||||||
"recommendations_df = pd.DataFrame(recommendations)\n",
|
|
||||||
"display(recommendations_df)\n",
|
|
||||||
"\n",
|
|
||||||
"summary = {\n",
|
|
||||||
" \"phase\": \"phase2\",\n",
|
|
||||||
" \"source_documents\": [\"classifier/v2.md\", \"classifier/impl.md\"],\n",
|
|
||||||
" \"artifact_counts\": {\n",
|
|
||||||
" \"canonical_runs\": int(len(canonical_runs_df)),\n",
|
|
||||||
" \"loaded_canonical_runs\": int(canonical_runs_df[\"log_status\"].isin([\"present\", \"fallback\"]).sum()),\n",
|
|
||||||
" \"fallback_runs_used\": {k: v for k, v in RUN_ALIASES.items() if resolve_run(k) != k},\n",
|
|
||||||
" },\n",
|
|
||||||
" \"recommendations\": recommendations,\n",
|
|
||||||
" \"planned_comparisons\": comparisons_df.replace({np.nan: None}).to_dict(orient=\"records\"),\n",
|
|
||||||
" \"known_gaps\": [\n",
|
|
||||||
" \"Dedicated p2a_*_128 logs are absent/skipped; Phase 1 baselines are used as fallbacks.\",\n",
|
|
||||||
" \"Source holdout logs do not include prediction-level or per-source metrics, so held-out-source AUC vs in-source AUC cannot be computed.\",\n",
|
|
||||||
" \"No matched-geometry evaluation metric is present in p2c logs, so geometry shortcut analysis is incomplete.\",\n",
|
|
||||||
" ],\n",
|
|
||||||
"}\n",
|
|
||||||
"\n",
|
|
||||||
"summary_path = ANALYSIS_DIR / \"phase2_analysis_summary.json\"\n",
|
|
||||||
"with summary_path.open(\"w\") as f:\n",
|
|
||||||
" json.dump(summary, f, indent=2)\n",
|
|
||||||
"\n",
|
|
||||||
"print(f\"Saved summary: {summary_path.relative_to(PROJECT_ROOT)}\")\n",
|
|
||||||
"print(f\"Saved figures: {FIGURES_DIR.relative_to(PROJECT_ROOT)}\")"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"id": "5a337f73",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## Report-ready conclusion\n",
|
|
||||||
"\n",
|
|
||||||
"The strongest Phase 2 result is the resolution effect for ResNet18: moving to 224x224 substantially improves AUC under the controlled CV protocol. Face cropping gives a small positive effect and is reasonable to carry forward, especially because it aligns the model with face evidence rather than background context. Light augmentation is not supported at this 20% data setting: it strongly hurts SimpleCNN and provides no reliable gain for ResNet18, with or without face cropping. ImageNet normalization remains preferable because real-train-only normalization does not improve AUC and is less aligned with pretrained ResNet expectations.\n",
|
|
||||||
"\n",
|
|
||||||
"Recommended Phase 3 preprocessing: **224x224, facecrop enabled, no light augmentation, ImageNet normalization**.\n",
|
|
||||||
"\n",
|
|
||||||
"Limitations to fix before the final report: export prediction-level records or per-source pairwise metrics for source holdout, and add the matched-geometry evaluation required by the shortcut-analysis plan. Without those artifacts, Phase 2C can only support a limited shortcut analysis."
|
|
||||||
]
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"metadata": {
|
|
||||||
"kernelspec": {
|
|
||||||
"display_name": "drl",
|
|
||||||
"language": "python",
|
|
||||||
"name": "python3"
|
|
||||||
},
|
|
||||||
"language_info": {
|
|
||||||
"codemirror_mode": {
|
|
||||||
"name": "ipython",
|
|
||||||
"version": 3
|
|
||||||
},
|
|
||||||
"file_extension": ".py",
|
|
||||||
"mimetype": "text/x-python",
|
|
||||||
"name": "python",
|
|
||||||
"nbconvert_exporter": "python",
|
|
||||||
"pygments_lexer": "ipython3",
|
|
||||||
"version": "3.12.13"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"nbformat": 4,
|
|
||||||
"nbformat_minor": 5
|
|
||||||
}
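
The notebook's conclusion lists missing prediction-level records and per-source metrics as a limitation of the source-holdout analysis. As a rough illustration of what such an export could enable, the sketch below computes one AUC per fake source (real images vs. that source's fakes) from prediction-level records. The record fields (`source`, `label`, `score`), the `real_source` value, the source names `gen_a`/`gen_b`, and both function names are hypothetical and not taken from the project code. The AUC is computed with a plain pairwise comparison to keep the example dependency-free; a library routine would be the more practical choice on full-size logs.

```python
from collections import defaultdict


def roc_auc(labels, scores):
    """AUC as the fraction of (fake, real) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]  # fake scores
    neg = [s for y, s in zip(labels, scores) if y == 0]  # real scores
    if not pos or not neg:
        return None  # AUC is undefined with a single class
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def per_source_auc(records, real_source="real"):
    """Return {fake_source: AUC of real vs. that source's fakes}.

    `records` is an iterable of dicts with hypothetical keys:
    "source" (str), "label" (0 = real, 1 = fake), "score" (predicted fake probability).
    """
    by_source = defaultdict(list)
    for record in records:
        by_source[record["source"]].append(record)
    real = by_source.pop(real_source, [])
    result = {}
    for source, fakes in by_source.items():
        rows = real + fakes
        result[source] = roc_auc(
            [r["label"] for r in rows],
            [r["score"] for r in rows],
        )
    return result


# Toy records; the source names are placeholders, not the dataset's generators.
records = [
    {"source": "real", "label": 0, "score": 0.10},
    {"source": "real", "label": 0, "score": 0.20},
    {"source": "gen_a", "label": 1, "score": 0.90},
    {"source": "gen_a", "label": 1, "score": 0.80},
    {"source": "gen_b", "label": 1, "score": 0.15},
    {"source": "gen_b", "label": 1, "score": 0.05},
]
print(per_source_auc(records))  # {'gen_a': 1.0, 'gen_b': 0.25}
```

With records like these saved per fold, a held-out-source AUC can be compared against the in-source AUC simply by tagging each record with whether its source was seen during training.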
|
|
||||||