# Deep and Reinforcement Learning (2025/2026 — M.IA003), FEUP/FCUP
## Deep Learning Project
**Submission deadline:** May 15th, 2026
This work must be submitted via the Moodle platform. It will be developed during the practical classes, but students are expected to complement it with work outside class hours.
---
## 1. Objective
The objective of this work is to develop deep learning discriminative and generative models, applied to the context of “deep fakes”. The discriminative models will be designed to classify images as “real” vs. “fake”, whereas the generative models will be trained to produce new “fake” examples.
## 2. Dataset
The data that you will be using belongs to the DeepFakeFace (DFF) dataset. You can access the dataset files and description via the Hugging Face link. In addition, you can find a detailed description of the dataset in this paper.
The dataset was generated to assess the ability of deepfake detectors to distinguish AI-generated and authentic images. It contains 30,000 real images of celebrities taken from the IMDB-WIKI dataset. The dataset also contains 90,000 fake images generated with the three following models:
- Stable Diffusion v1.5
- Stable Diffusion Inpainting
- InsightFace
Each model generated 30,000 fake images.
## 3. Implementation
In order to complete this work, you will need to implement two different models:
1. One classifier, which is trained to distinguish between real and fake images
2. One generative model, which is trained to create new fake images
For the first model, you will be free to implement any of the discriminative approaches that will be considered during the theoretical classes (e.g., multilayer perceptrons, convolutional neural networks, visual transformers, etc.).
For the second model, you will be free to implement any of the generative models that will be considered during the theoretical classes (e.g., generative adversarial networks, variational autoencoders, diffusion models, etc.).
For both models, you will need to define a proper training strategy as well as the correct way and metrics to evaluate the performance.
## 4. Project evaluation
The project will be evaluated by taking into consideration the suitability of the proposed models for the specific tasks, their correctness, and their complexity.
**VERY IMPORTANT:** The main objective and core of this project is to iteratively improve the proposed solutions through continuous observation of intermediate results and corresponding adjustments to the algorithms. Projects that simply present a solution without showcasing the evolution of the proposed model will not be considered sufficient.
## 5. Submission of the solution
Your project must be delivered in Moodle by **May 15th, 2026, at 23:59:59**.
- Final code solution, as a notebook or series of files.
- Slides for the presentation (**pdf format**) focusing on the main issues of the assignment, for a 10-minute presentation; any additional information that cannot be presented in that time slot can be included in annexes to the presentation. The presentation should contain the following information:
- Brief description of the deep learning solutions considered for the problem, both for the discriminative and generative part.
- **MOST IMPORTANT:** Description of the different implementation steps taken to improve the proposed models: motivate your choices in terms of type of approach, model architecture, training strategy, etc. Show intermediate results, how you interpreted them, and what you decided to change in order to improve the results.
- Results:
- Classification performance obtained by the developed discriminative model. Description of the experimental setup, train/val/test splits, and performance metrics chosen.
- Data generation performance obtained by the developed generative model. Description of the experimental setup and performance metrics chosen.
- Discussion and conclusions: comments on the performance obtained and final remarks.
- Filled auto-evaluation file regarding the contribution of each member of the group.
Further information about the project submission and presentation:
- The code provided as the solution must allow training the considered models and reproducing the results that you reported. Please do not include dataset files. You can assume I have local access to the dataset.
- The work must be done in groups of 3 people. Groups of fewer than 3 people must be justified and approved before work starts.
- Late submissions will incur a grade penalty and may result in the work not being accepted.
- All works must be presented on May 22nd and 29th, during the practical classes. All group members must be present during the demonstration. If a group member is not present at the presentation, they will receive a zero grade for this work, and will therefore fail.
- Each member of the group must comment on their contribution to the work, and must know what the other members of the group have done. Failing to describe in detail what your solution does and why will result in a penalty in the overall evaluation of the project.
# Deepfake Detection Classifier - Implementation Plan
## Overview
This document provides a comprehensive implementation plan for refactoring the deepfake detection classifier project. Each task includes a checkbox to track completion.
---
## Phase 0: Pre-Implementation Setup
### Infrastructure and Configuration
- [x] Create `classifier/configs/shared.json` with shared parameters:
- seed: 42
- val_ratio: 0.1
- test_ratio: 0.1
- batch_size: 32
- optimizer: {type: "adamw", lr: 1e-4, weight_decay: 1e-4}
- scheduler: {type: "cosine_annealing", T_max: 15}
- early_stopping_patience: 5
- num_workers: 4
- cv_folds: 5
- data_dir: "data"
- face_crop_margin: 0.6
- [x] Implement config loading/merging so experiment configs inherit `shared.json` defaults and override only the variables under test
- [x] Resolve shared nested fields such as `optimizer.lr`, `optimizer.weight_decay`, and `scheduler.T_max` into the training arguments used by the runner
- [x] Update existing configs to reference `shared.json` or otherwise document which shared defaults they intentionally override
- [x] Define one CV protocol for all phases:
- outer fold: held-out test fold
- inner validation split: group-aware split from the remaining training folds for early stopping/model selection
- final reported metrics: aggregate held-out test-fold results across the 5 outer folds
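A minimal sketch of the `shared.json` inheritance and nested-field resolution described above (helper names are illustrative; the actual loader may differ):
```python
import json
from pathlib import Path

def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base so a nested field like
    optimizer.lr can be overridden without dropping optimizer.weight_decay."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

def load_config(path: str, shared: str = "classifier/configs/shared.json") -> dict:
    # Experiment configs inherit shared.json defaults and override
    # only the variables under test.
    base = json.loads(Path(shared).read_text())
    override = json.loads(Path(path).read_text())
    return deep_merge(base, override)
```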
### Data Preparation
- [x] Verify dataset structure and integrity
- [x] Check that real and fake images are properly organized by source
- [x] Verify no data leakage between train/val/test splits or CV folds (group-aware by basename)
### Cleanup
- [x] Remove `classifier/tools/ensemble.py` (not part of reorganization plan, conflicts with explainability goals)
- [x] Remove robustness evaluation from `classifier/tools/analyze.py` (lines 51-104, 82-104, 144) - not part of experimental plan
- [x] Remove any unused or obsolete config files from previous experiments (see detailed list below)
- [x] Clean up old output directories if needed (keep important results for reference)
#### Config Files to Remove (39 total)
**Root configs (6):**
- [x] `classifier/configs/resnet18_quick.json`
- [x] `classifier/configs/resnet18.json`
- [x] `classifier/configs/simple_cnn_large.json`
- [x] `classifier/configs/simple_cnn_micro.json`
- [x] `classifier/configs/simple_cnn_small.json`
- [x] `classifier/configs/simple_cnn.json`
**Phase 1 old configs (7):**
- [x] `classifier/configs/phase1/p1_cnn_base.json` (uses lr=1e-3, epochs=20 - should be 1e-4, 15)
- [x] `classifier/configs/phase1/p1_cnn_aug.json`
- [x] `classifier/configs/phase1/p1_resnet18_base.json` (duplicate of new baseline)
- [x] `classifier/configs/phase1/p1_resnet18_aug.json`
- [x] `classifier/configs/phase1/holdout/` (entire directory - 6 configs, source holdout not in new plan)
**Phase 2 old configs (7):**
- [x] `classifier/configs/phase2/p2_resnet18_224.json` (should be p2a_resnet18_224.json)
- [x] `classifier/configs/phase2/p2_resnet18_facecrop.json` (should be p2b_resnet18_facecrop.json)
- [x] `classifier/configs/phase2/p2_resnet18_frozen.json` (frozen backbone not in new plan)
- [x] `classifier/configs/phase2/p2_resnet34_224.json` (ResNet34 should be in Phase 3)
- [x] `classifier/configs/phase2/p2_resnet34.json` (ResNet34 should be in Phase 3)
- [x] `classifier/configs/phase2/p2_resnet50_frozen.json` (ResNet50 should be in Phase 3)
- [x] `classifier/configs/phase2/p2_resnet50.json` (ResNet50 should be in Phase 3)
**Phase 3 old configs (4):**
- [x] `classifier/configs/phase3/p3_efficientnet_b2.json` (EfficientNet-B2 not in new plan, only B0)
- [x] `classifier/configs/phase3/p3_resnet18_facecrop_full.json` (ResNet18 full dataset should be Phase 4)
- [x] `classifier/configs/phase3/p3_resnet18_freqaug.json` (frequency augmentation not in new plan)
- [x] `classifier/configs/phase3/p3_vit_b16.json` (ViT not in new plan, replaced with ConvNeXt/MobileNet)
- Note: `p3_efficientnet_b0.json` - REMOVED (will be recreated after Phase 2 with correct settings)
**Source holdout (6):**
- [x] `classifier/configs/source_holdout/` (entire directory - 6 configs, source holdout not in new plan)
**Ablation (3):**
- [x] `classifier/configs/ablation/` (entire directory - 3 configs, ablation studies not in new plan)
**Configs to KEEP (3):**
- `classifier/configs/shared.json`
- `classifier/configs/phase1/p1_simplecnn_baseline.json`
- `classifier/configs/phase1/p1_resnet18_baseline.json`
**Phase 2 alias configs removed (8):**
- [x] `classifier/configs/phase2/p2b_resnet18_128.json` (alias for p1_resnet18_baseline)
- [x] `classifier/configs/phase2/p2b_simplecnn_128.json` (alias for p1_simplecnn_baseline)
- [x] `classifier/configs/phase2/p2c_resnet18_nofacecrop.json` (alias for p2b_resnet18_224)
- [x] `classifier/configs/phase2/p2c_simplecnn_nofacecrop.json` (alias for p2b_simplecnn_224)
- [x] `classifier/configs/phase2/p2d_resnet18_noaug.json` (alias for p2b_resnet18_224)
- [x] `classifier/configs/phase2/p2d_simplecnn_noaug.json` (alias for p2b_simplecnn_224)
- [x] `classifier/configs/phase2/p2e_resnet18_facecrop_only.json` (alias for p2c_resnet18_facecrop)
- [x] `classifier/configs/phase2/p2e_simplecnn_facecrop_only.json` (alias for p2c_simplecnn_facecrop)
Note: Comparison pairs (baseline vs treatment) are defined in the analysis notebook as a mapping dict, not as separate config files.
---
## Phase 1: Architecture Baseline
### 1.1 Experiment Configs
- [x] Create `classifier/configs/phase1/p1_simplecnn_baseline.json`
- backbone: simple_cnn
- cnn_preset: medium
- dropout: 0.0
- epochs: 15
- batch_size: 32
- lr: 1e-4 (consistent with ResNet)
- weight_decay: 1e-4
- image_size: 128
- data_dir: data
- early_stopping_patience: 5
- subsample: 0.2
- face_crop: false
- augment: false
- seed: 42
- [x] Create `classifier/configs/phase1/p1_resnet18_baseline.json`
- backbone: resnet18
- pretrained: true
- epochs: 15
- batch_size: 32
- lr: 1e-4
- weight_decay: 1e-4
- image_size: 128
- data_dir: data
- early_stopping_patience: 5
- subsample: 0.2
- face_crop: false
- augment: false
- seed: 42
### 1.2 Code Updates
- [x] Implement 5-fold stratified group cross-validation by basename in training pipeline
- [x] Update `classifier/src/training/trainer.py` to support CV
- [x] Update `classifier/src/evaluation/evaluate.py` to support CV
- [x] Ensure all metrics report mean ± std and confidence intervals across folds
### 1.3 Training
- [x] Train SimpleCNN with 5-fold stratified group CV (via pipeline: `python -m pipeline run classifier/configs/phase1/p1_simplecnn_baseline.json`)
- [x] Train ResNet18 with 5-fold stratified group CV (via pipeline: `python -m pipeline run classifier/configs/phase1/p1_resnet18_baseline.json`)
- [x] Save all checkpoints and metrics (pipeline automatically fetches outputs to classifier/outputs/)
### 1.4 Analysis
- [x] Use `classifier/notebooks/03_phase1_analysis.ipynb` for Phase 1 analysis
- [x] Compare SimpleCNN vs ResNet18 performance
- [x] Overall metrics (AUC, Accuracy, F1) with mean ± std and confidence intervals
- [x] Per-source metrics (text2img, inpainting, insight)
- [x] Train/val/test performance curves
- [x] Confusion matrices
- [x] Statistical significance testing
- [x] Generate Grad-CAM visualizations (10-20 images per model)
- [x] Document conclusions: Which baseline is better and why
---
## Phase 2: Preprocessing Impact
### 2.1 Shortcut Analysis (2A)
- [x] Create `classifier/configs/phase2/p2a_t1_original.json`
- backbone: resnet18
- image_size: 224
- subsample: 0.2
- seed: 42
- augment: false
- normalization: imagenet
- data_dir: data
- [x] Create `classifier/configs/phase2/p2a_t2_real_norm.json`
- extends: p2a_t1_original.json
- normalization: real_norm
- **Normalization**: Calculate mean/std from real training images only within each fold
- [x] Geometry diagnostic was explored and then removed from the codebase (`src/evaluation/geometry.py` no longer exists):
- Current pipeline always square-crops before resize, reducing rectangle-vs-square shortcut risk.
- Shortcut analysis now relies on normalization and held-out-source evidence artifacts.
- [ ] Train the 2 shortcut configs with 5-fold stratified group CV
- [ ] Compare results:
- Standard vs matched-geometry eval for `p2a_t1_original` (letterboxing impact)
- `p2a_t1_original` vs `p2a_t2_real_norm` (color distribution shortcut)
- [x] Create `classifier/configs/phase2/p2a_t3_holdout_text2img.json`
- extends: p2a_t1_original.json
- train_sources: ["wiki", "inpainting", "insight"]
- eval_sources: ["wiki", "inpainting", "insight", "text2img"]
- [x] Create `classifier/configs/phase2/p2a_t3_holdout_inpainting.json`
- extends: p2a_t1_original.json
- train_sources: ["wiki", "text2img", "insight"]
- eval_sources: ["wiki", "text2img", "insight", "inpainting"]
- [x] Create `classifier/configs/phase2/p2a_t3_holdout_insight.json`
- extends: p2a_t1_original.json
- train_sources: ["wiki", "text2img", "inpainting"]
- eval_sources: ["wiki", "text2img", "inpainting", "insight"]
- [ ] Train the 3 source holdout configs with 5-fold stratified group CV
- [ ] Compare held-out source performance vs in-source performance:
- Calculate AUC for held-out source (text2img, inpainting, insight)
- Compute Δ (in-source AUC - held-out AUC)
- If Δ > 0.05-0.10, model is learning source-specific features
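A minimal sketch of the Δ computation, assuming a per-image results frame with `label`, `score`, and `source` columns (the column names are hypothetical):
```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def source_delta(df: pd.DataFrame, held_out: str) -> float:
    """Delta = in-source AUC minus held-out-source AUC.
    Real images (label 0) are shared; fakes are split by source."""
    real = df[df.label == 0]
    held = pd.concat([real, df[(df.label == 1) & (df.source == held_out)]])
    seen = pd.concat([real, df[(df.label == 1) & (df.source != held_out)]])
    held_auc = roc_auc_score(held.label, held.score)
    in_auc = roc_auc_score(seen.label, seen.score)
    return in_auc - held_auc  # delta > 0.05-0.10 suggests source-specific features
```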
### 2.2 Resolution Impact (2B)
- [x] Create `classifier/configs/phase2/p2b_simplecnn_224.json`
- backbone: simple_cnn
- image_size: 224
- subsample: 0.2
- augment: false
- seed: 42
- data_dir: data
- [x] Create `classifier/configs/phase2/p2b_resnet18_224.json`
- backbone: resnet18
- image_size: 224
- subsample: 0.2
- augment: false
- seed: 42
- data_dir: data
- [ ] Train both 224 configs with 5-fold stratified group CV
- [ ] Compare 128×128 vs 224×224 for each model
- 128 baseline is `p1_*_baseline` (comparison mapping in notebook)
### 2.3 Facecrop Impact (2C)
- [x] Create `classifier/configs/phase2/p2c_simplecnn_facecrop.json`
- backbone: simple_cnn
- image_size: 224
- subsample: 0.2
- augment: false
- seed: 42
- data_dir: cropped/classifier
- [x] Create `classifier/configs/phase2/p2c_resnet18_facecrop.json`
- backbone: resnet18
- image_size: 224
- subsample: 0.2
- augment: false
- seed: 42
- data_dir: cropped/classifier
- [ ] Train both facecrop configs with 5-fold stratified group CV
- [ ] Compare `p2b_*_224` (no facecrop) vs `p2c_*_facecrop` for each model
- No-facecrop baseline is `p2b_*_224` (comparison mapping in notebook)
### 2.4 Augmentation Impact (2D)
- [x] Create `classifier/configs/phase2/p2d_simplecnn_aug.json`
- backbone: simple_cnn
- image_size: 224
- subsample: 0.2
- seed: 42
- augment: {hflip_p: 0.5, rotation_degrees: 10, brightness: 0.2, contrast: 0.2, saturation: 0.1, hue: 0.02, grayscale_p: 0.1, blur_p: 0.1, erase_p: 0.2, noise_p: 0.3, noise_std: 0.04}
- data_dir: data
- [x] Create `classifier/configs/phase2/p2d_resnet18_aug.json`
- backbone: resnet18
- image_size: 224
- subsample: 0.2
- seed: 42
- augment: {hflip_p: 0.5, rotation_degrees: 10, brightness: 0.2, contrast: 0.2, saturation: 0.1, hue: 0.02, grayscale_p: 0.1, blur_p: 0.1, erase_p: 0.2, noise_p: 0.3, noise_std: 0.04}
- data_dir: data
- [ ] Train both augmentation configs with 5-fold stratified group CV
- [ ] Compare `p2b_*_224` (no aug) vs `p2d_*_aug` for each model
- No-aug baseline is `p2b_*_224` (comparison mapping in notebook)
### 2.5 Augmentation + Facecrop (2E)
- [x] Create `classifier/configs/phase2/p2e_simplecnn_facecrop_aug.json`
- backbone: simple_cnn
- image_size: 224
- subsample: 0.2
- seed: 42
- augment: {hflip_p: 0.5, rotation_degrees: 10, brightness: 0.2, contrast: 0.2, saturation: 0.1, hue: 0.02, grayscale_p: 0.1, blur_p: 0.1, erase_p: 0.2, noise_p: 0.3, noise_std: 0.04}
- data_dir: cropped/classifier
- [x] Create `classifier/configs/phase2/p2e_resnet18_facecrop_aug.json`
- backbone: resnet18
- image_size: 224
- subsample: 0.2
- seed: 42
- augment: {hflip_p: 0.5, rotation_degrees: 10, brightness: 0.2, contrast: 0.2, saturation: 0.1, hue: 0.02, grayscale_p: 0.1, blur_p: 0.1, erase_p: 0.2, noise_p: 0.3, noise_std: 0.04}
- data_dir: cropped/classifier
- [ ] Train both facecrop+aug configs with 5-fold stratified group CV
- [ ] Compare `p2c_*_facecrop` (facecrop only) vs `p2e_*_facecrop_aug` for each model
- Facecrop-only baseline is `p2c_*_facecrop` (comparison mapping in notebook)
### 2.6 Phase 2 Analysis
- [ ] Use `classifier/notebooks/04_phase2_analysis.ipynb` for Phase 2 analysis
- [ ] For each experiment (2A-2E):
- [ ] Load 5-fold stratified group CV results (mean ± std and confidence intervals)
- [ ] Generate overall metrics (AUC, Accuracy, F1)
- [ ] Generate per-source metrics (text2img, inpainting, insight)
- [ ] Calculate train/val gap
- [ ] Calculate pairwise source AUC variance (wiki-vs-source AUC variance)
- [ ] Statistical significance testing vs baseline
- [ ] Generate comparison visualizations (bar charts, heatmaps)
- [ ] For 2A (Shortcut Analysis):
- [ ] Compare original-test vs alternative-geometry evidence if the geometry diagnostic is reintroduced in a dedicated tool/notebook
- [ ] Compare ImageNet vs real-image-only normalization (color distribution shortcuts)
- [ ] Load source holdout results (3 configs)
- [ ] Calculate held-out source AUC vs in-source AUC for each holdout experiment
- [ ] Compute Δ (in-source AUC - held-out AUC)
- [ ] If Δ > 0.05-0.10, model is learning source-specific features
- [ ] Generate source holdout comparison table
- [ ] For each model/condition:
- [ ] Generate Grad-CAM visualizations (10-20 images per condition)
- [ ] Organize by experiment, prediction type, and source
- [ ] Answer key questions:
- [ ] Which preprocessing choices are statistically significant?
- [ ] Do certain sources benefit more from specific preprocessing?
- [ ] Is there an interaction between facecrop and augmentation?
- [ ] Are shortcuts being learned (resolution, color distribution)?
- [ ] Is the model learning source-specific features (source holdout)?
- [ ] Does augmentation remove shortcuts or over-regularize?
- [ ] What features do models focus on (based on Grad-CAM)?
- [ ] Generate comprehensive metrics comparison table
- [ ] Use paired fold-wise statistical tests for model comparisons, with bootstrap confidence intervals for key metrics where useful
- [ ] Provide evidence-based conclusions for each experiment
- [ ] Provide recommendations for Phase 3 (best preprocessing settings)
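A minimal sketch of the paired fold-wise test described above, assuming both configs were evaluated on the same five fold assignments (the per-fold AUC values are hypothetical):
```python
import numpy as np
from scipy.stats import ttest_rel

# Per-fold test AUCs for the same 5 outer folds (hypothetical values).
baseline_auc = np.array([0.91, 0.90, 0.92, 0.89, 0.91])
treatment_auc = np.array([0.93, 0.92, 0.93, 0.91, 0.94])

# A paired t-test is valid because fold assignments are shared across configs.
t_stat, p_value = ttest_rel(treatment_auc, baseline_auc)
print(f"mean delta = {np.mean(treatment_auc - baseline_auc):.3f}, p = {p_value:.3f}")
```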
---
## Phase 3: Extended Architecture Exploration
### 3.1 Experiment Configs
Use the best preprocessing choices from Phase 2. The placeholders below assume 224×224, face crop enabled, and no augmentation unless Phase 2 results justify different settings.
- [ ] Create `classifier/configs/phase3/p3_resnet34.json`
- backbone: resnet34
- pretrained: true
- epochs: 15
- batch_size: 32
- lr: 1e-4
- weight_decay: 1e-4
- image_size: 224
- face_crop: true (or best from Phase 2C/E)
- face_crop_margin: 0.6
- augment: false (or best from Phase 2D/E)
- subsample: 0.2
- seed: 42
- early_stopping_patience: 5
- [ ] Create `classifier/configs/phase3/p3_resnet50.json`
- backbone: resnet50
- pretrained: true
- epochs: 15
- batch_size: 32
- lr: 1e-4
- weight_decay: 1e-4
- image_size: 224
- face_crop: true (or best from Phase 2C/E)
- face_crop_margin: 0.6
- augment: false (or best from Phase 2D/E)
- subsample: 0.2
- seed: 42
- early_stopping_patience: 5
- [ ] Create `classifier/configs/phase3/p3_efficientnet_b0.json`
- backbone: efficientnet_b0
- pretrained: true
- epochs: 15
- batch_size: 32
- lr: 1e-4
- weight_decay: 1e-4
- image_size: 224
- face_crop: true (or best from Phase 2C/E)
- augment: false (or best from Phase 2D/E)
- subsample: 0.2
- seed: 42
- early_stopping_patience: 5
- [ ] Create `classifier/configs/phase3/p3_convnext_tiny.json`
- backbone: convnext_tiny
- pretrained: true
- epochs: 15
- batch_size: 32
- lr: 1e-4
- weight_decay: 1e-4
- image_size: 224
- face_crop: true (or best from Phase 2C/E)
- augment: false (or best from Phase 2D/E)
- subsample: 0.2
- seed: 42
- early_stopping_patience: 5
- [ ] Create `classifier/configs/phase3/p3_mobilenetv3_small.json`
- backbone: mobilenetv3_small
- pretrained: true
- epochs: 15
- batch_size: 32
- lr: 1e-4
- weight_decay: 1e-4
- image_size: 224
- face_crop: true (or best from Phase 2C/E)
- augment: false (or best from Phase 2D/E)
- subsample: 0.2
- seed: 42
- early_stopping_patience: 5
### 3.2 Model Implementation
- [ ] Implement ConvNeXt-Tiny in `classifier/src/models/convnext.py`
- [ ] Implement MobileNetV3-Small in `classifier/src/models/mobilenet.py`
- [ ] Register both models in `classifier/src/models/__init__.py`
### 3.3 Training
- [ ] Train ResNet34 with 5-fold stratified group CV
- [ ] Train ResNet50 with 5-fold stratified group CV
- [ ] Train EfficientNet-B0 with 5-fold stratified group CV
- [ ] Train ConvNeXt-Tiny with 5-fold stratified group CV
- [ ] Train MobileNetV3-Small with 5-fold stratified group CV
- [ ] Save all checkpoints and metrics
### 3.4 Analysis
- [ ] Use `classifier/notebooks/05_phase3_analysis.ipynb` for Phase 3 analysis
- [ ] Load 5-fold stratified group CV results for all models (mean ± std and confidence intervals)
- [ ] Generate overall metrics for each model
- [ ] Generate per-source metrics for each model
- [ ] Compare with Phase 1 baselines (ResNet18, SimpleCNN)
- [ ] Statistical significance testing vs baselines
- [ ] Generate Grad-CAM visualizations for top models (10-20 images each)
- [ ] Parameter count vs performance analysis
- [ ] Conclusions: Which architectures work best and why
---
## Phase 4: Final Analysis on Best Models
### 4.1 Select Top Models
- [ ] Based on Phases 1-3 results, select top 3-4 models
- [ ] Document selection criteria (e.g., top AUC, balanced performance, efficiency)
### 4.2 Data Quantity Scaling (4A)
- [ ] For each selected model, create configs for different data sizes:
- [ ] `classifier/configs/phase4/p4a_<model>_20pct.json` (subsample: 0.2)
- [ ] `classifier/configs/phase4/p4a_<model>_50pct.json` (subsample: 0.5)
- [ ] `classifier/configs/phase4/p4a_<model>_100pct.json` (subsample: 1.0)
- [ ] In every 4A config, explicitly set the best Phase 2 preprocessing choices:
- image_size: best from Phase 2B
- face_crop: best from Phase 2C/E
- augment: best from Phase 2D/E
- [ ] Train each model with 5-fold stratified group CV at all three data sizes
- [ ] Compare how each model scales with more data
### 4.3 Full Dataset Evaluation (4B)
- [ ] For each selected model, create config for full dataset:
- `classifier/configs/phase4/p4b_<model>_full.json` (subsample: 1.0)
- [ ] In every 4B config, explicitly set the same best Phase 2 preprocessing choices used in 4A
- [ ] Train each model on full dataset with 5-fold stratified group CV
- [ ] Generate detailed per-source metrics
- [ ] Generate Grad-CAM visualizations (10-20 images each)
- [ ] Perform hard example analysis (false positives/negatives) with visualizations
- [ ] Generate confidence distribution histograms
- [ ] Cross-validation results (mean ± std with confidence intervals)
### 4.4 Analysis
- [ ] Use `classifier/notebooks/06_phase4_analysis.ipynb` for Phase 4 analysis
- [ ] Load data quantity scaling results
- [ ] Load full dataset evaluation results
- [ ] Generate comprehensive metrics comparison table
- [ ] Generate per-source metrics for final models
- [ ] Generate Grad-CAM galleries for final models
- [ ] Perform hard example analysis with visualizations
- [ ] Generate confidence distribution histograms
- [ ] Final model comparison and selection
- [ ] Conclusions and recommendations
---
## Notebooks and Analysis
This section is the consolidated notebook checklist for the notebooks referenced in the phase sections above; do not create duplicate notebooks for the same phase.
### 5.1 Exploratory Data Analysis
- [x] Create `classifier/notebooks/01_eda.ipynb`
- [x] Dataset overview (real vs fake distribution, sources)
- [x] Image resolution/aspect ratio analysis (identify potential shortcuts)
- [x] Color distribution analysis (identify potential shortcuts)
- [x] Sample visualization from each source
- [x] Statistical summary of the dataset
- [x] Data quality checks
### 5.2 Preprocessing Pipeline
- [x] Create `classifier/notebooks/02_preprocessing.ipynb`
- [x] Square crop and resize implementation demonstration
- [x] Face crop (MTCNN) demonstration and effectiveness analysis
- [x] Augmentation pipeline visualization (before/after examples)
- [x] Z-score normalization comparison (ImageNet vs real-image-only)
- [x] Data split verification (group-aware by basename, no overlap)
- [x] Preprocessing impact visualization
### 5.3 Phase 1 Analysis
- [x] Create `classifier/notebooks/03_phase1_analysis.ipynb`
- [x] Load Phase 1 training results
- [x] Generate 5-fold stratified group CV results (mean ± std with confidence intervals)
- [x] Generate per-source metrics for each model
- [x] Generate train/val/test performance curves
- [x] Generate confusion matrices
- [x] Perform statistical significance testing between models
- [x] Generate Grad-CAM visualizations (10-20 images each)
- [x] Document conclusions: Which baseline is better and why
### 5.4 Phase 2 Analysis
- [x] Create `classifier/notebooks/04_phase2_analysis.ipynb`
- [ ] Load all Phase 2 experiment results
- [ ] For each experiment (2A-2E):
- [ ] Generate 5-fold stratified group CV results (mean ± std with confidence intervals)
- [ ] Generate overall metrics
- [ ] Generate per-source metrics
- [ ] Calculate train/val gap
- [ ] Calculate pairwise source AUC variance (wiki-vs-source AUC variance)
- [ ] Perform statistical significance testing
- [ ] Generate comparison tables across all Phase 2 experiments
- [ ] Generate comparison visualizations (bar charts, heatmaps)
- [ ] For each model/condition, generate Grad-CAM visualizations (10-20 images)
- [ ] Organize visualizations by experiment, model, prediction type, and source
- [ ] Answer key analysis questions
- [ ] Generate comprehensive metrics comparison table
- [ ] Provide evidence-based conclusions for each experiment
- [ ] Provide recommendations for Phase 3
### 5.5 Phase 3 Analysis
- [ ] Create `classifier/notebooks/05_phase3_analysis.ipynb`
- [ ] Load Phase 3 training results
- [ ] Generate 5-fold stratified group CV results for each model (mean ± std with confidence intervals)
- [ ] Generate per-source metrics for each model
- [ ] Compare with Phase 1 baselines (ResNet18, SimpleCNN)
- [ ] Perform statistical significance testing vs baselines
- [ ] Generate Grad-CAM visualizations for top models (10-20 images each)
- [ ] Parameter count vs performance analysis
- [ ] Conclusions: Which architectures work best and why
### 5.6 Phase 4 Analysis
- [ ] Create `classifier/notebooks/06_phase4_analysis.ipynb`
- [ ] Load data quantity scaling results
- [ ] Load full dataset evaluation results
- [ ] Generate comprehensive metrics comparison table
- [ ] Generate per-source metrics for final models
- [ ] Generate Grad-CAM galleries for final models
- [ ] Perform hard example analysis with visualizations
- [ ] Generate confidence distribution histograms
- [ ] Final model comparison and selection
- [ ] Conclusions and recommendations
### 5.7 Grad-CAM Deep Dive (Optional)
- [ ] Create `classifier/notebooks/07_gradcam_deep_dive.ipynb`
- [ ] Load Grad-CAM results from all phases
- [ ] Comprehensive Grad-CAM analysis across all phases and models
- [ ] Feature visualization for different model architectures
- [ ] CNN vs EfficientNet vs ConvNeXt comparison
- [ ] What regions do different architectures focus on?
- [ ] Are there systematic differences in attention patterns?
- [ ] Evidence of shortcut removal analysis across phases
- [ ] Temporal analysis: does model attention change with different preprocessing?
- [ ] Generate visual explanations suitable for presentation
---
## Code Implementation Tasks
### Cross-Validation Implementation
- [x] Update `classifier/src/training/trainer.py` to support 5-fold stratified group CV by basename
- [x] Update `classifier/src/evaluation/evaluate.py` to support grouped CV splits
- [x] Implement metric aggregation across folds (mean ± std)
- [x] Ensure all metrics report confidence intervals
- [x] Reuse the same fold assignments for comparable experiments so paired statistical tests are valid
- [x] Rename `classifier/run_cv.py` to `classifier/run.py` (pipeline expects classifier/run.py)
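A minimal sketch of the grouped splitting and fold aggregation described above (array names and helper signatures are illustrative):
```python
import numpy as np
from scipy import stats
from sklearn.model_selection import StratifiedGroupKFold

def cv_folds(labels, basenames, n_splits=5, seed=42):
    """Yield (train_idx, test_idx) with basename groups kept together
    and class balance preserved across folds."""
    sgkf = StratifiedGroupKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    yield from sgkf.split(np.zeros(len(labels)), labels, groups=basenames)

def aggregate(fold_metrics, confidence=0.95):
    """Mean, std, and a t-based confidence interval across folds."""
    m = np.asarray(fold_metrics, dtype=float)
    mean, std = m.mean(), m.std(ddof=1)
    ci = stats.t.interval(confidence, df=len(m) - 1, loc=mean, scale=stats.sem(m))
    return mean, std, ci
```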
### Model Implementations
- [ ] Implement ConvNeXt-Tiny in `classifier/src/models/convnext.py`
- [ ] Implement MobileNetV3-Small in `classifier/src/models/mobilenet.py`
- [ ] Register both models in `classifier/src/models/__init__.py`
### Normalization Implementation
- [ ] Implement function to calculate mean/std from real training images only
- [ ] Update `classifier/src/preprocessing/pipeline.py` to support custom normalization stats
- [ ] Test ImageNet normalization vs real-image-only normalization
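A minimal sketch of the real-image-only statistics, assuming a loader that yields only real training images from the current fold (loader construction is omitted):
```python
import torch

@torch.no_grad()
def real_image_stats(loader):
    """Per-channel mean/std over real training images only (current fold)."""
    total, sum_, sum_sq = 0, torch.zeros(3), torch.zeros(3)
    for images, _ in loader:                      # images: (B, 3, H, W) in [0, 1]
        total += images.size(0) * images.size(2) * images.size(3)
        sum_ += images.sum(dim=(0, 2, 3))
        sum_sq += (images ** 2).sum(dim=(0, 2, 3))
    mean = sum_ / total
    std = (sum_sq / total - mean ** 2).sqrt()
    return mean, std  # plug into transforms.Normalize(mean, std)
```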
### Evaluation Improvements
- [ ] Ensure test set uses `train=False` to disable augmentation
- [ ] Ensure diagnostic evaluation transforms never change the training data
- [ ] Verify CV fold assignments are identical across comparable experiments (same seed and basename grouping)
- [ ] Implement per-source metrics with detection rate and false alarm rate
- [ ] Implement pairwise AUC calculations
- [ ] Implement train/val gap calculations
- [ ] Implement pairwise source AUC variance calculations
### Grad-CAM Improvements
- [ ] Ensure Grad-CAM works for all model types (CNN-based)
- [ ] Implement Grad-CAM for ConvNeXt
- [ ] Implement Grad-CAM for MobileNetV3
- [ ] Organize Grad-CAM outputs by experiment, model, prediction type, source
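A minimal hook-based Grad-CAM sketch for the CNN backbones above (the target layer, typically the last conv block, is model-specific and assumed):
```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer):
    """Return an (H, W) heatmap in [0, 1] for the predicted class.
    image: (1, 3, H, W); model should be in eval mode."""
    feats, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    try:
        logits = model(image)
        logits[0, logits[0].argmax()].backward()   # gradient of the top class
    finally:
        h1.remove(); h2.remove()
    weights = grads[0].mean(dim=(2, 3), keepdim=True)     # GAP over gradients
    cam = F.relu((weights * feats[0]).sum(dim=1))         # weighted feature sum
    cam = F.interpolate(cam[None], size=image.shape[-2:], mode="bilinear")[0, 0]
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
```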
---
## Final Report Preparation
- [ ] Compile results from all phases
- [ ] Create presentation slides (PDF format)
- [ ] Brief description of deep learning solutions (discriminative + generative)
- [ ] Description of implementation steps and improvements
- [ ] Motivate choices for architecture, training strategy, etc.
- [ ] Show intermediate results
- [ ] Interpret results and what changed
- [ ] What was decided to improve results
- [ ] Classification performance results
- [ ] Experimental setup
- [ ] Train/val/test splits
- [ ] Performance metrics chosen
- [ ] Data generation performance results
- [ ] Experimental setup
- [ ] Performance metrics chosen
- [ ] Discussion and conclusions
- [ ] Comments on performance
- [ ] Final remarks
- [ ] Fill auto-evaluation file
---
## Summary
Total tasks: ~150
This implementation plan covers:
- ✅ All 4 phases with comprehensive experiments
- ✅ 5-fold stratified group cross-validation for all experiments
- ✅ 7 analysis notebooks for robust validation
- ✅ Shortcut analysis (resolution/ratio + color distribution + source holdout)
- ✅ Source holdout experiments to detect source-specific feature learning
- ✅ Grad-CAM visualizations for explainability
- ✅ Statistical analysis with confidence intervals
- ✅ Per-source metrics for all experiments
- ✅ Data quantity scaling analysis
- ✅ Full dataset evaluation on best models
- ✅ Comprehensive documentation and reporting
**Key Features:**
- Reproducible experiments with fixed seeds
- Stratified group CV keeps basename groups together while balancing class distribution
- Multiple shortcut analyses to prevent model cheating (resolution, color, source-specific)
- Source holdout experiments to test generalization to unseen sources
- Grad-CAM for explainability
- Statistical rigor with confidence intervals
- Per-source analysis to understand model behavior
- Clear progression from baselines -> preprocessing -> architectures -> final evaluation
# Classifier Reorganization Plan (v2)
## Analysis of Current Phasing Issues
Your current phasing has several problems that make it difficult to present a rigorous, explainable report:
### Current Problems
1. **Inconsistent comparison conditions**:
- SimpleCNN uses lr=1e-3, ResNet uses lr=1e-4
- SimpleCNN trains 20 epochs (no ES), ResNet18 trains 15 epochs (with ES)
- Makes direct comparisons invalid
2. **No cross-validation**:
- Only a single 80/10/10 split
- Results may be split-dependent
- No confidence intervals on metrics
3. **Augmentation testing is incomplete**:
- Only tested on ResNet18 (Phase 3), not across architectures
- Performance drop could mean: (a) removing shortcuts (good) or (b) over-regularization (bad)
- No way to distinguish these cases
4. **Facecrop impact not generalized**:
- Only ResNet18 tested with facecrop
- Don't know if EfficientNet or ViT benefit similarly
5. **Full dataset only on one model**:
- Only ResNet18 tested on full dataset
- Don't know if data quantity helps all models equally
6. **Test set integrity**:
- Need to verify the test set uses original images (no augmentation; no preprocessing, or only the minimum strictly necessary)
- Need to ensure same train/val/test splits across all model comparisons
- Need central config for shared parameters across phases
---
## Recommended Reorganization
I suggest reorganizing into **4 phases** with clear, isolated variables. All phases use **5-fold stratified cross-validation** as standard practice to ensure balanced class distribution across folds.
### Phase 1: Controlled Baseline Comparison
**Goal**: Compare simple architectures under identical conditions to establish baselines
**Fixed conditions for ALL models**:
- Data: 20% subsample
- Resolution: 128×128
- No face crop
- No augmentation
- Optimizer: AdamW (lr=1e-4, weight_decay=1e-4)
- Scheduler: CosineAnnealingLR (T_max=15)
- Epochs: 15 with early stopping (patience=5)
- Batch size: 32
- 5-fold stratified cross-validation (report mean ± std)
| Model | Params | Expected AUC (mean ± std) |
|-------|--------|---------------------------|
| SimpleCNN | ~400k | ? |
| ResNet18 | ~11.7M | ? |
**This gives you**: Clean, comparable baseline for simple architectures with confidence intervals
**These same 2 models will be used in Phase 2 for preprocessing experiments.**
---
### Phase 2: Preprocessing Impact (Same 2 Models from Phase 1)
**Goal**: Test each preprocessing change on the SAME 2 models from Phase 1
**Experimental questions**:
- Does higher resolution improve performance?
- Does face cropping improve performance?
- Does augmentation improve or hurt performance?
- Does augmentation interact with face cropping?
- Is the model learning any shortcuts (e.g., resolution differences, aspect ratios, etc.)?
#### 2A: Shortcut Analysis
**Goal**: Establish whether the baseline model exploits geometry, colour, or source-specific shortcuts before drawing any conclusions from preprocessing experiments.
**Test 1: Resolution/Ratio Shortcuts (Letterboxing)**
- Train on original images (real=rectangular, fake=square); evaluate the same checkpoint under a standard crop vs letterbox-padded real images to confirm whether geometry is a discriminative cue (see the letterbox sketch after the table below)
- Models: **ResNet18**
- Data: 20% subsample
- 5-fold stratified CV (balanced class distribution)
- Resolution: 224×224
- No facecrop, no augmentation
| Experiment | AUC | Train/Val Gap | Per-Source AUC Variance |
|------------|-----|---------------|-------------------------|
| Original images (standard eval) | ? | ? | ? |
| Matched geometry (letterboxed real images) | ? | ? | ? |
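A minimal sketch of the letterbox transform used for the matched-geometry evaluation (pad colour, interpolation, and the Pillow `Resampling` API are assumptions):
```python
from PIL import Image, ImageOps

def letterbox(img: Image.Image, size: int = 224) -> Image.Image:
    """Pad to a square canvas (preserving aspect ratio), then resize,
    so rectangular real images share the square geometry of the fakes."""
    return ImageOps.pad(img, (size, size),
                        method=Image.Resampling.BICUBIC, color=(0, 0, 0))
```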
**Test 2: Color Distribution Shortcuts**
- Compare: Train with ImageNet normalization stats vs real-image-only normalization stats
- Models: **ResNet18**
- Data: 20% subsample
- 5-fold stratified CV (balanced class distribution)
- Resolution: 224×224
- No facecrop, no augmentation
- ImageNet stats: mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)
- Real-image stats: Calculate mean/std from real training images only, apply to all
| Experiment | AUC | Train/Val Gap | Per-Source AUC Variance |
|------------|-----|---------------|-------------------------|
| ImageNet normalization | ? | ? | ? |
| Real-image-only normalization | ? | ? | ? |
**Test 3: Source-Specific Feature Learning (Source Holdout)**
- Compare: Train on all sources vs train with one source held out
- Models: **ResNet18**
- Data: 20% subsample
- 5-fold stratified CV (balanced class distribution)
- Resolution: 224×224
- No facecrop, no augmentation
- Hold out each fake source (text2img, inpainting, insight) separately
| Experiment | Held-out Source | Train Sources | Held-out AUC | In-Source AUC | Δ (In-Source - Held-out) |
|------------|-----------------|---------------|--------------|---------------|--------------------------|
| Baseline | None | All | - | ? | - |
| Holdout text2img | text2img | wiki, inpainting, insight | ? | ? | ? |
| Holdout inpainting | inpainting | wiki, text2img, insight | ? | ? | ? |
| Holdout insight | insight | wiki, text2img, inpainting | ? | ? | ? |
**Interpretation**: If held-out source AUC is significantly lower than in-source AUC (Δ > 0.05-0.10), the model is learning source-specific features. If the AUC drop under matched geometry is significant, the model exploits aspect ratio as a shortcut — this must be known before interpreting resolution or facecrop results.
#### 2B: Resolution Impact (no facecrop, no augmentation)
- Test: 128×128 vs 224×224
- Models: **SimpleCNN, ResNet18**
- Data: 20% subsample
- 5-fold stratified CV (balanced class distribution)
| Model | 128×128 AUC | 224×224 AUC | Δ |
|-------|-------------|-------------|---|
| SimpleCNN | ? | ? | ? |
| ResNet18 | ? | ? | ? |
#### 2C: Facecrop Impact (224×224, no augmentation)
- Test: No facecrop vs MTCNN facecrop
- Models: **SimpleCNN, ResNet18**
- Data: 20% subsample
- 5-fold stratified CV (balanced class distribution)
| Model | No Facecrop AUC | Facecrop AUC | Δ |
|-------|-----------------|--------------|---|
| SimpleCNN | ? | ? | ? |
| ResNet18 | ? | ? | ? |
#### 2D: Augmentation Impact (224×224, without facecrop)
- Test: No augmentation vs augmentation
- Models: **SimpleCNN, ResNet18**
- Data: 20% subsample
- 5-fold stratified CV (balanced class distribution)
- **Verify test set has no augmentation** (code inspection of `get_transforms(train=False, ...)`)
- **Analyze shortcut removal**: Compare train/val gaps and per-source AUC balance
| Model | No Aug AUC | With Aug AUC | Δ | Train/Val Gap (No Aug) | Train/Val Gap (With Aug) |
|-------|------------|--------------|---|------------------------|--------------------------|
| SimpleCNN | ? | ? | ? | ? | ? |
| ResNet18 | ? | ? | ? | ? | ? |
**Experimental question**: Does augmentation without facecrop improve or hurt performance?
#### 2E: Augmentation + Facecrop Combined (224×224)
- Test: Facecrop only vs Facecrop + augmentation
- Models: **SimpleCNN, ResNet18**
- Data: 20% subsample
- 5-fold stratified CV (balanced class distribution)
- **Analyze shortcut removal**: Compare train/val gaps and per-source AUC balance
| Model | Facecrop Only AUC | Facecrop + Aug AUC | Δ | Train/Val Gap (Only) | Train/Val Gap (With Aug) |
|-------|-------------------|--------------------|---|----------------------|--------------------------|
| SimpleCNN | ? | ? | ? | ? | ? |
| ResNet18 | ? | ? | ? | ? | ? |
**Experimental question**: Does augmentation with facecrop improve or hurt performance compared to facecrop alone?
**This gives you**:
- Isolated impact of each preprocessing choice on SimpleCNN and ResNet18
- Verification that the model is not learning shortcuts
- Understanding of how augmentation interacts with face cropping
- Shortcut removal analysis through train/val gap and per-source AUC metrics
---
### Phase 3: Extended Architecture Exploration
**Goal**: Test additional architectures to find the best performing models
**Fixed conditions** (based on best findings from Phase 2):
- Data: 20% subsample
- Resolution: Best from Phase 2B (likely 224×224)
- Facecrop: Best from Phase 2C/E (likely Yes)
- Augmentation: Best from Phase 2D/E (depends on experimental results)
- Optimizer: AdamW (lr=1e-4, weight_decay=1e-4)
- Scheduler: CosineAnnealingLR (T_max=15)
- Epochs: 15 with early stopping (patience=5)
- Batch size: 32
- 5-fold stratified cross-validation (balanced class distribution)
| Model | Params | Rationale |
|-------|--------|-----------|
| ResNet34 | ~21.8M | Deeper ResNet - test if more capacity helps |
| ResNet50 | ~25.6M | Even deeper with bottleneck blocks |
| EfficientNet-B0 | ~5.3M | Efficient compound scaling |
| ConvNeXt-Tiny | ~29M | Modern CNN, different architecture family |
| MobileNetV3-Small | ~2.5M | Lightweight efficiency comparison |
**This gives you**: Extended architecture exploration to identify top-performing models for Phase 4
- ResNet depth progression (18 -> 34 -> 50)
- Efficient architectures (EfficientNet-B0, MobileNetV3-Small)
- Modern CNN with different inductive bias (ConvNeXt-Tiny)
- Size range (2.5M to 29M parameters)
---
### Phase 4: Final Analysis on Best Models
**Goal**: Comprehensive evaluation of top-performing models from Phases 1-3
**Select top 3-4 models** based on Phase 1-3 results (e.g., ResNet18, ResNet34, EfficientNet-B0, ConvNeXt-Tiny)
#### 4A: Data Quantity Scaling
Test how each best model scales with more data:
| Model | 20% Data AUC | 50% Data AUC | 100% Data AUC | Δ (100% - 20%) |
|-------|--------------|--------------|---------------|----------------|
| Model 1 | ? | ? | ? | ? |
| Model 2 | ? | ? | ? | ? |
| Model 3 | ? | ? | ? | ? |
| Model 4 | ? | ? | ? | ? |
**Fixed conditions**:
- Resolution: Best from Phase 2B
- Facecrop: Best from Phase 2C/E
- Augmentation: Best from Phase 2D/E
- 5-fold stratified cross-validation (balanced class distribution)
#### 4B: Comprehensive Evaluation on Full Dataset
- Train best models on **full dataset** (100%)
- Detailed per-source metrics (text2img, inpainting, insight)
- Grad-CAM visualizations for explainability
- Hard example analysis (false positives/negatives)
- Confidence distribution analysis
- Cross-validation results (mean ± std)
**This gives you**: Final, comprehensive evaluation of the best models with full explainability
---
### Notebooks and Analysis
**Goal**: Use Jupyter notebooks for comprehensive analysis and validation of each phase
#### **01_eda.ipynb** - Exploratory Data Analysis
- Dataset overview (real vs fake distribution, sources)
- Image resolution/aspect ratio analysis (identify potential shortcuts)
- Color distribution analysis (identify potential shortcuts)
- Sample visualization from each source (text2img, inpainting, insight, wiki)
- Statistical summary of the dataset
- Data quality checks
#### **02_preprocessing.ipynb** - Preprocessing Pipeline
- Square crop and resize implementation demonstration
- Face crop (MTCNN) demonstration and effectiveness analysis
- Augmentation pipeline visualization (before/after examples)
- Z-score normalization comparison (ImageNet vs real-image-only)
- Data split verification (group-aware by basename, no overlap)
- Preprocessing impact visualization
#### **03_phase1_analysis.ipynb** - Phase 1: Architecture Baseline
- SimpleCNN vs ResNet18 comparison
- 5-fold stratified CV results (mean ± std with confidence intervals)
- Per-source metrics for each model (text2img, inpainting, insight)
- Train/val/test performance curves across epochs
- Confusion matrices for each model
- Statistical significance testing between models
- Grad-CAM visualizations for both models (10-20 images each)
- Conclusions: Which baseline is better and why
#### **04_phase2_analysis.ipynb** - Phase 2: Preprocessing Impact
- **2A**: Shortcut analysis (resolution/ratio + color distribution + source holdout)
- **2B**: Resolution impact (128×128 vs 224×224)
- **2C**: Facecrop impact
- **2D**: Augmentation impact (without facecrop)
- **2E**: Augmentation + facecrop combined
For each experiment:
- 5-fold CV results (mean ± std with confidence intervals)
- Per-source metrics (text2img, inpainting, insight)
- Statistical significance testing vs baseline
- Comparison tables across all Phase 2 experiments
- Grad-CAM visualizations (10-20 images per condition)
- Analysis of train/val gap changes
- Analysis of per-source AUC variance changes
**Overall Phase 2 conclusions**:
- Which preprocessing choices work best and why
- Are shortcuts being learned (resolution, color distribution)?
- Does augmentation remove shortcuts or over-regularize?
- Recommendations for Phase 3 (best preprocessing settings)
#### **05_phase3_analysis.ipynb** - Phase 3: Extended Architecture Exploration
- ResNet34, ResNet50, EfficientNet-B0, ConvNeXt-Tiny, MobileNetV3-Small
- 5-fold CV results (mean ± std) for each model
- Per-source metrics for each model
- Comparison with Phase 1 baselines (ResNet18, SimpleCNN)
- Statistical significance testing vs baselines
- Grad-CAM visualizations for top models (10-20 images each)
- Parameter count vs performance analysis
- Conclusions: Which architectures work best and why
#### **06_phase4_analysis.ipynb** - Phase 4: Final Analysis
- **4A**: Data quantity scaling (20%, 50%, 100%) on top 3-4 models
- **4B**: Comprehensive evaluation on full dataset
- Detailed per-source metrics for final models
- Grad-CAM visualizations for final models (10-20 images each)
- Hard example analysis (false positives/negatives) with visualizations
- Confidence distribution analysis (histograms)
- Cross-validation results (mean ± std with confidence intervals)
- Final model comparison and selection
- Conclusions and recommendations
#### **07_gradcam_deep_dive.ipynb** - Grad-CAM Deep Dive (optional)
- Comprehensive Grad-CAM analysis across all phases and models
- Feature visualization for different model architectures (CNN vs EfficientNet vs ConvNeXt)
- Comparison of what different models focus on (face regions, backgrounds, artifacts)
- Evidence of shortcut removal (or lack thereof) across phases
- Temporal analysis: does model attention change with different preprocessing?
- Visual explanations suitable for presentation
**Notebook requirements**:
- Each notebook should be self-contained and reproducible
- Include statistical analysis with confidence intervals
- Generate publication-ready visualizations
- Address all experimental questions and hypotheses
- Provide clear conclusions for each phase
- Use consistent formatting and style across all notebooks
- Save all results (metrics, figures, tables) for easy reference
---
## Key Improvements
### 1. Stratified Cross-Validation Implementation
```python
# Use sklearn's StratifiedKFold to ensure balanced class distribution across folds
# (for the group-aware variant used elsewhere in this plan, substitute
# StratifiedGroupKFold and pass groups=basenames to split())
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # Train on train_idx, validate on val_idx
    # Store metrics per fold
    ...
```
### 2. Augmentation Shortcut Removal Analysis (Phase 2D/2E)
Track these metrics with/without augmentation:
| Metric | Without Aug | With Aug | Interpretation |
|--------|-------------|----------|----------------|
| Train AUC | 0.99 | 0.95 | ↓ Expected |
| Val AUC | 0.90 | 0.89 | ↓ Slight |
| **Train/Val Gap** | **0.09** | **0.06** | **↓ Good!** |
| text2img AUC | 0.98 | 0.96 | ↓ Slight |
| InsightFace AUC | 0.82 | 0.85 | **↑ Good!** |
| **AUC Variance** | **0.08** | **0.06** | **↓ Good!** |
**Interpretation**: If train/val gap ↓ AND per-source AUC variance ↓, augmentation is removing shortcuts.
### 3. Consistent Hyperparameters
- Same lr for all models (1e-4 is safe for pretrained, may need adjustment for SimpleCNN)
- Same epochs, ES patience, batch size
- Only vary the architecture being tested
### 4. Test Set Integrity and Reproducibility
**Test set from original source**:
- Verify that test set uses original images with minimal preprocessing
- Test set should use `get_transforms(train=False, ...)` to disable augmentation
- Ensure test images are not preprocessed in a way that could affect model comparisons
**Reproducible splits across models**:
- The code already uses `cfg.get("seed", 42)` for reproducible splits
- All experiments should use the same seed (42) to ensure identical train/val/test splits
- This ensures fair comparison between models
**Central config for shared parameters**:
- Create a central config file (`classifier/configs/shared.json`) with parameters common across all phases
- This includes: seed, val_ratio, test_ratio, batch_size, optimizer settings, etc.
- Individual experiment configs can override these defaults
Example shared config:
```json
{
"seed": 42,
"val_ratio": 0.1,
"test_ratio": 0.1,
"batch_size": 32,
"optimizer": {
"type": "adamw",
"lr": 1e-4,
"weight_decay": 1e-4
},
"scheduler": {
"type": "cosine_annealing",
"T_max": 15
},
"early_stopping_patience": 5,
"num_workers": 4
}
```
---
## Summary Table for Report
| Phase | Variable Tested | Models | Data | Resolution | Facecrop | Augment | CV |
|-------|-----------------|--------|------|------------|----------|---------|----|
| 1 | Architecture Baseline | SimpleCNN, ResNet18 | 20% | 128 | No | No | 5-fold stratified |
| 2A | Shortcut Analysis | ResNet18 | 20% | 224 | No | No | 5-fold stratified |
| 2A-Holdout | Source Holdout | ResNet18 | 20% | 224 | No | No | 5-fold stratified |
| 2B | Resolution | SimpleCNN, ResNet18 | 20% | 128/224 | No | No | 5-fold stratified |
| 2C | Facecrop | SimpleCNN, ResNet18 | 20% | 224 | ± | No | 5-fold stratified |
| 2D | Augmentation (no facecrop) | SimpleCNN, ResNet18 | 20% | 224 | No | ± | 5-fold stratified |
| 2E | Augmentation + Facecrop | SimpleCNN, ResNet18 | 20% | 224 | Yes | ± | 5-fold stratified |
| 3 | Extended Architectures | ResNet34, ResNet50, EffNet-B0, ConvNeXt-Tiny, MobileNetV3-Small | 20% | Best | Best | Best | 5-fold stratified |
| 4A | Data Quantity | Top 3-4 models | 20/50/100% | Best | Best | Best | 5-fold stratified |
| 4B | Final Evaluation | Top 3-4 models | 100% | Best | Best | Best | 5-fold stratified |
This structure gives you:
- ✅ Identical comparison conditions across all phases
- ✅ 5-fold stratified cross-validation with confidence intervals (ensures balanced class distribution)
- ✅ Same 2 baseline models (SimpleCNN, ResNet18) tested across all preprocessing variations (Phase 2)
- ✅ Shortcut analysis to verify no bias (Phase 2A)
- ✅ Experimental questions about augmentation impact (Phase 2D/2E)
- ✅ Shortcut removal analysis via train/val gap and per-source AUC metrics
- ✅ Facecrop tested on baseline models (Phase 2C)
- ✅ Extended architecture exploration with proven models (Phase 3)
- ✅ Final comprehensive analysis on best models (Phase 4)
- ✅ Data quantity scaling on multiple best models (Phase 4A)
- ✅ Clear, isolated variables per phase
- ✅ Explainable progression for report
**Key Experimental Questions in Phase 2**:
- **2A (Shortcut Analysis)**: Is the model learning any shortcuts (e.g., resolution differences, aspect ratios, etc.)?
- **2D (Augmentation without facecrop)**: Does augmentation improve or hurt performance?
- **2E (Augmentation with facecrop)**: Does augmentation improve or hurt performance compared to facecrop alone?
# Generator Plan
The assignment rewards *iterative improvement with intermediate results*. This plan is structured around **model evolution as the spine**: each step has a *because* tied to an observed failure of the previous step. Pipeline ablations are honest but de-emphasized — they clear the table for the real story.
---
## Standard Settings (Applied Everywhere Unless Noted)
| Setting | Value | Reason |
|---------|-------|--------|
| Batch size | 64 | Consistent across experiments |
| Mixed precision | float16 + GradScaler | Speed |
| EMA decay | 0.9999 | Sample from EMA weights for GANs |
| FID evaluation | Every 25 epochs | Objective quality tracking |
| FID n_real | 5000 | Held-out real images |
| Default epochs | 100 | Best-of-each in Phase 5 retrains to 200 |
Per-model optimizer/hyperparameters are listed inside each phase.
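The FID tracking above can be implemented with `torchmetrics`; a minimal sketch (the generator interface, latent shape, tanh output range, and loaders are assumptions):
```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

@torch.no_grad()
def compute_fid(generator, real_loader, device, n_real=5000, n_fake=5000, batch=64):
    """FID between held-out real images and freshly generated samples."""
    fid = FrechetInceptionDistance(feature=2048, normalize=True).to(device)
    seen = 0
    for images, _ in real_loader:                  # float images in [0, 1]
        images = images[: n_real - seen].to(device)
        fid.update(images, real=True)
        seen += images.size(0)
        if seen >= n_real:
            break
    for _ in range(n_fake // batch):
        z = torch.randn(batch, 100, 1, 1, device=device)   # DCGAN-style latent
        fake = (generator(z).clamp(-1, 1) + 1) / 2          # tanh output -> [0, 1]
        fid.update(fake, real=False)
    return fid.compute().item()
```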
---
## Phase 1 — Pipeline Selection *(quick, one figure)*
**Goal**: Pick the data pipeline used for every downstream experiment. Don't dwell here — this is clearing the table, not the story.
Fixed model: **DCGAN at 64×64** (cheapest baseline, fast iteration). One variable per experiment.
| Experiment | Variable | Variants | Decision |
|---|---|---|---|
| 1A | Resolution | 64×64 vs 128×128 | Pick by FID — assumed transferable |
| 1B | Face crop + alignment | Full image vs MTCNN-aligned | Pick by FID — assumed transferable |
| 1C | Augmentation | H-flip only vs H-flip + rotation ±5° + mild color jitter | Per-family: validate inside Phase 2 for GAN, default to H-flip-only for VAE/DDPM |
| 1D | Combined dataset | Aligned only vs aligned + raw mixed | Pick by FID — expected to underperform aligned-only |
**Caveat on transferability**: Phase 1 uses DCGAN as a proxy to choose the pipeline cheaply, then assumes the choice transfers to VAE and DDPM. Resolution and alignment are largely architecture-invariant (more pixels help everyone; structural consistency helps any spatial prior). Augmentation is *not* — diffusion models benefit less from aug, and MSE-VAE may even be hurt by color jitter. So 1C is treated as an **indicative** result for GANs and re-checked per family rather than baked in globally.
**1D — combined dataset rationale**: Mixing aligned + raw doubles the variance the generator must model (face anywhere/any scale + face fixed) and dilutes the geometric prior. Hypothesis: combined < aligned-only. Cheap to test (one extra DCGAN run). Included for completeness so the report shows we considered it rather than asserting it.
**MTCNN alignment** (one-time preprocessing, cached to disk):
```python
from facenet_pytorch import MTCNN
from skimage.transform import SimilarityTransform, warp
import numpy as np
from PIL import Image
mtcnn = MTCNN(keep_all=False, device='cuda')
REF_LANDMARKS = np.array([ # reference positions in 128×128
[38.0, 51.0], # left eye
[90.0, 51.0], # right eye
[64.0, 71.0], # nose
[45.0, 95.0], # left mouth
[83.0, 95.0], # right mouth
], dtype=np.float32)
def align_face(img: Image.Image, out_size: int = 128):
boxes, _, landmarks = mtcnn.detect(img, landmarks=True)
if boxes is None:
return None
tform = SimilarityTransform()
tform.estimate(landmarks[0], REF_LANDMARKS)
aligned = warp(np.array(img), tform.inverse,
output_shape=(out_size, out_size),
order=3, preserve_range=True).astype(np.uint8)
return Image.fromarray(aligned)
```
**Augmentation philosophy** — only structure-preserving transforms (face-aligned crops are consistent by design):
| Transform | Apply? | Reason |
|---|---|---|
| Horizontal flip | Yes, p=0.5 | Faces are symmetric |
| Rotation | Yes, ±5° | Residual head tilt post-alignment |
| Color jitter | Yes, mild | brightness ±0.1, contrast ±0.1, saturation ±0.05 |
| Translation | No | Breaks alignment |
| Vertical flip | No | Meaningless for faces |
| Strong blur / noise | No | Teaches the model to generate blur |
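The table maps directly onto a torchvision pipeline; a sketch with the stated values (the final normalization to [-1, 1] is a GAN-convention assumption):
```python
from torchvision import transforms

# Structure-preserving augmentation for aligned faces (values from the table).
train_tf = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=5),            # residual head tilt
    transforms.ColorJitter(brightness=0.1, contrast=0.1, saturation=0.05),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),   # assumed [-1, 1] range
])
```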
**Output**: ~1 page in the report. Best pipeline carries forward to all phases.
---
## Phase 2 — GAN Evolution *(main spine)*
**Goal**: The richest narrative — each step has a clear *because* from observed failure. This is the strongest part of the storyline; keep it front and center.
Best pipeline from Phase 1 fixed throughout.
---
### 2.1 — DCGAN *(baseline)*
Simplest GAN baseline. BCE loss, no gradient penalty.
- Adam β1=0.5, β2=0.999, lr=2e-4
- ngf=ndf=64, latent_dim=100
- Resolution: 64×64
**Expected failure**: mode collapse, training instability, oscillating losses. Document these explicitly — they motivate 2.2.
---
### 2.2 — WGAN-GP
**Because**: DCGAN showed mode collapse and instability → Wasserstein loss + gradient penalty.
- Adam β1=0.0, β2=0.9, lr_g=lr_d=1e-4
- ngf=ndf=64, latent_dim=128, n_critic=2, gp_lambda=10
- Resolution: 64×64
**Expected**: more stable training, better diversity. Likely remaining issues: texture artifacts, limited global coherence at higher resolution.
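The gradient penalty referenced above, as a standard sketch (the critic is assumed to output one scalar per sample; names are illustrative):
```python
import torch

def gradient_penalty(critic, real, fake, gp_lambda=10.0):
    """Penalize the critic's gradient norm on real/fake interpolates."""
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(interp)
    grads = torch.autograd.grad(outputs=scores.sum(), inputs=interp,
                                create_graph=True)[0]
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return gp_lambda * ((grad_norm - 1) ** 2).mean()
```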
---
### 2.3 — WGAN-GP + Spectral Norm + GroupNorm + Self-Attention
**Because**: WGAN-GP showed texture artifacts / limited coherence → principled Lipschitz constraint and long-range dependencies.
- Generator: BatchNorm → GroupNorm (no batch-size coupling)
- Critic: InstanceNorm → Spectral Normalization (principled Lipschitz constraint)
- Self-attention at 16×16 in both generator and critic
```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        mid = max(in_ch // 8, 1)
        self.q = nn.Conv2d(in_ch, mid, 1, bias=False)
        self.k = nn.Conv2d(in_ch, mid, 1, bias=False)
        self.v = nn.Conv2d(in_ch, in_ch, 1, bias=False)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight, starts at 0
        self._mid = mid

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).view(b, self._mid, -1).transpose(-2, -1)   # (b, hw, mid)
        k = self.k(x).view(b, self._mid, -1)                     # (b, mid, hw)
        v = self.v(x).view(b, c, -1)                             # (b, c, hw)
        attn = torch.softmax(q @ k * self._mid ** -0.5, dim=-1)  # (b, hw, hw)
        return x + self.gamma * (v @ attn.transpose(-2, -1)).view(b, c, h, w)
```
---
### 2.4 — Scale to 128×128 *(if 2.3 looks coherent at 64×64)*
**Because**: 2.3 produces coherent samples at 64×64 → does the architecture hold up at higher resolution?
Same architecture as 2.3, retrained at 128×128. Add attention at 32×32 if memory permits.
---
### Phase 2 Results
| Step | Model | FID @ 100ep ↓ | Main observed failure | Motivates next step |
|---|---|---|---|---|
| 2.1 | DCGAN | ? | ? | ? |
| 2.2 | WGAN-GP | ? | ? | ? |
| 2.3 | WGAN-GP + SN + Attn | ? | ? | ? |
| 2.4 | + 128×128 | ? | ? | — |
For each step: FID curve, 16-sample grid, one paragraph on what failed and why the next change addresses it.
---
## Phase 3 — VAE Track
**Goal**: A self-contained evolution story for the likelihood-based family. Every step motivated by a known limitation of the previous.
| Step | Model | Because |
|---|---|---|
| 3.1 | Vanilla VAE (MSE) | Baseline — expect blur |
| 3.2 | + Perceptual loss (VGG) | MSE blur is fundamental to pixel-space reconstruction |
| 3.3 | + PatchGAN discriminator (VQGAN-lite) | Perceptual loss still lacks local texture realism |
**3.1 — Vanilla VAE**: Adam lr=1e-3, latent_dim=256, β=1.0. Plain convolutional encoder/decoder, MSE reconstruction.
**3.2 — Perceptual loss**: VGG-16 feature matching at relu1_2, relu2_2, relu3_3.
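A minimal sketch of the feature-matching term (the slice indices correspond to relu1_2/relu2_2/relu3_3 in torchvision's VGG-16; ImageNet-normalized inputs are assumed):
```python
import torch.nn as nn
from torchvision import models

class PerceptualLoss(nn.Module):
    """MSE between frozen VGG-16 features at relu1_2, relu2_2, relu3_3."""
    def __init__(self):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features.eval()
        # Slice boundaries: relu1_2 = [:4], relu2_2 = [4:9], relu3_3 = [9:16]
        self.slices = nn.ModuleList([vgg[:4], vgg[4:9], vgg[9:16]])
        for p in self.parameters():
            p.requires_grad_(False)

    def forward(self, x, y):
        loss = 0.0
        for s in self.slices:
            x, y = s(x), s(y)          # run both images through the next slice
            loss = loss + nn.functional.mse_loss(x, y)
        return loss
```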
**3.3 — Patch discriminator**: PatchGAN adversarial loss targeting local texture realism.
```
L = L_mse + λ_perc·L_vgg + λ_adv·L_adv + β·L_kl
λ_perc=0.1, λ_adv=0.1, β=0.0001
```
**Decoder fix** (applied from 3.1 onward): replace `ConvTranspose2d` with `Upsample(nearest) + Conv2d` — eliminates checkerboard artifacts.
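A sketch of the resulting upsampling block (kernel size, padding, and the normalization layer are assumptions):
```python
import torch.nn as nn

def up_block(in_ch: int, out_ch: int) -> nn.Sequential:
    """Nearest-neighbour upsample + conv instead of ConvTranspose2d,
    avoiding checkerboard artifacts from stride-2 deconvolutions."""
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="nearest"),
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.GroupNorm(8, out_ch),      # normalization choice is an assumption
        nn.ReLU(inplace=True),
    )
```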
| Step | Model | FID ↓ | Main observed failure |
|---|---|---|---|
| 3.1 | VAE MSE | ? | ? |
| 3.2 | + Perceptual | ? | ? |
| 3.3 | + PatchGAN | ? | ? |
---
## Phase 4 — DDPM Track
**Goal**: A self-contained evolution story for the diffusion family.
| Step | Model | Because |
|---|---|---|
| 4.1 | DDPM linear + ε-pred | Baseline |
| 4.2 | + cosine schedule | Linear schedule wastes capacity at low timesteps |
| 4.3 | + v-prediction | ε-prediction is unstable across the full trajectory |
| 4.4 | + wider U-Net / more attention | If 4.3 still underfits |
**4.1 — Baseline**: AdamW lr=2e-4, base_ch=128, T=1000, attention at 8×8 and 16×16. DDIM sampling, 100 steps.
**4.2 — Cosine schedule**:
```python
import math
import torch

def cosine_betas(T: int, s: float = 0.008):
    # Nichol & Dhariwal cosine schedule: betas from a squared-cosine alpha-bar curve
    t = torch.linspace(0, T, T + 1)
    f = torch.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    alpha_bar = f / f[0]
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]
    return betas.clamp(0, 0.999)
```
**4.3 — v-prediction**: replaces ε target with `v = √ᾱ·ε − √(1−ᾱ)·x₀`.
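A minimal sketch of the v-target and training loss (the `model(x_t, t)` signature and a precomputed `alpha_bar` buffer are assumptions):
```python
import torch
import torch.nn.functional as F

def v_prediction_loss(model, x0, t, alpha_bar):
    """v = sqrt(abar)*eps - sqrt(1 - abar)*x0 (Salimans & Ho)."""
    eps = torch.randn_like(x0)
    a = alpha_bar[t].sqrt().view(-1, 1, 1, 1)        # sqrt(abar_t) per sample
    s = (1.0 - alpha_bar[t]).sqrt().view(-1, 1, 1, 1)
    x_t = a * x0 + s * eps                           # forward diffusion sample
    v_target = a * eps - s * x0
    return F.mse_loss(model(x_t, t), v_target)
```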
**4.4 — Wider U-Net**: base_ch 128 → 192, attention at 8×8, 16×16, 32×32.
| Step | Model | FID ↓ | Main observed failure |
|---|---|---|---|
| 4.1 | DDPM linear + ε | ? | ? |
| 4.2 | + cosine | ? | ? |
| 4.3 | + v-pred | ? | ? |
| 4.4 | + wider | ? | ? |
---
## Phase 5 — Cross-Family Comparison
**Goal**: Side-by-side comparison of the best from each family (2.4, 3.3, 4.4) under identical conditions.
Best-of-each retrained for 200 epochs at the same resolution and pipeline.
### 5A — Quantitative
| Model | FID ↓ | IS ↑ | LPIPS diversity ↑ | Params | Train time |
|---|---|---|---|:---:|:---:|
| Best GAN (2.4) | ? | ? | ? | ? | ? |
| Best VAE (3.3) | ? | ? | ? | ? | ? |
| Best DDPM (4.4) | ? | ? | ? | ? | ? |
### 5B — Qualitative
- **Visual grids**: 16-image sample grids per finalist
- **Progression**: epoch 10 → 50 → 100 → 200 side by side
- **Latent interpolation**: smooth transitions between two latent codes (GAN, VAE)
- **Diversity**: average pairwise LPIPS distance across 100 generated images
- **Failure modes**: worst-generated images per model
---
## Compute Budget Notes
Three families × multiple steps is a lot of runs. If compute is tight:
- **Keep the GAN track complete** (2.1 → 2.4) — it carries the strongest narrative.
- **VAE and DDPM can drop the last step each** (stop at 3.2 and 4.3) without hurting the story.
- Phase 1 ablations can use 50 epochs instead of 100 — pipeline deltas show up early.
---
## Summary
| Phase | Purpose | Models | Output |
|---|---|---|---|
| 1 | Pipeline selection | DCGAN @ 64×64 across data variants | Best pipeline |
| 2 | GAN evolution (main spine) | DCGAN → WGAN-GP → +SN+Attn → 128×128 | GAN failure→fix narrative |
| 3 | VAE evolution | VAE → +Perceptual → +PatchGAN | VAE failure→fix narrative |
| 4 | DDPM evolution | DDPM → cosine → v-pred → wider | DDPM failure→fix narrative |
| 5 | Cross-family comparison | Best of each, retrained 200ep | Final FID + IS + qualitative |
**The narrative**: baseline fails in a specific way → fix targets that failure → new failure emerges → next fix targets that → repeat per family → compare families on equal footing.