Classifier Notebooks
@@ -1,642 +0,0 @@
# Deepfake Detection Classifier - Implementation Plan

## Overview

This document provides a comprehensive implementation plan for refactoring the deepfake detection classifier project. Each task includes a checkbox to track completion.

---

## Phase 0: Pre-Implementation Setup

### Infrastructure and Configuration
- [x] Create `classifier/configs/shared.json` with shared parameters:
  - seed: 42
  - val_ratio: 0.1
  - test_ratio: 0.1
  - batch_size: 32
  - optimizer: {type: "adamw", lr: 1e-4, weight_decay: 1e-4}
  - scheduler: {type: "cosine_annealing", T_max: 15}
  - early_stopping_patience: 5
  - num_workers: 4
  - cv_folds: 5
  - data_dir: "data"
  - face_crop_margin: 0.6

- [x] Implement config loading/merging so experiment configs inherit `shared.json` defaults and override only the variables under test
- [x] Resolve shared nested fields such as `optimizer.lr`, `optimizer.weight_decay`, and `scheduler.T_max` into the training arguments used by the runner
- [x] Update existing configs to reference `shared.json` or otherwise document which shared defaults they intentionally override
- [x] Define one CV protocol for all phases:
  - outer fold: held-out test fold
  - inner validation split: group-aware split from the remaining training folds for early stopping/model selection
  - final reported metrics: aggregate held-out test-fold results across the 5 outer folds

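The inheritance and field-resolution tasks above can be sketched as follows. This is a minimal illustration, not the project's actual loader: the function names `deep_merge`, `load_config`, and `flatten_training_args` are hypothetical, and it assumes nested blocks such as `optimizer` are merged key by key so that an experiment can override `optimizer.lr` alone.

```python
import json
from pathlib import Path


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where `override` wins; nested dicts merge key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def load_config(path: Path, shared_path: Path) -> dict:
    """Load an experiment config on top of the shared defaults (hypothetical helper)."""
    shared = json.loads(shared_path.read_text())
    experiment = json.loads(path.read_text())
    return deep_merge(shared, experiment)


def flatten_training_args(config: dict) -> dict:
    """Resolve optimizer.lr, optimizer.weight_decay and scheduler.T_max into flat arguments."""
    args = {k: v for k, v in config.items() if k not in ("optimizer", "scheduler")}
    args["lr"] = config["optimizer"]["lr"]
    args["weight_decay"] = config["optimizer"]["weight_decay"]
    args["t_max"] = config["scheduler"]["T_max"]  # flat name `t_max` is an assumption
    return args
```

With this shape, an experiment file only needs to contain the keys it changes, which keeps the variable under test visible at a glance.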
### Data Preparation
- [x] Verify dataset structure and integrity
- [x] Check that real and fake images are properly organized by source
- [x] Verify no data leakage between train/val/test splits or CV folds (group-aware by basename)

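The leakage check above can be sketched as a small helper. It assumes the grouping key is the file's basename without its extension; the function names and the example paths in the usage note are hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def group_key(path: str) -> str:
    """Group images by basename without extension (assumed grouping rule)."""
    return Path(path).stem


def find_leaks(splits: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map every basename that appears in more than one split to those splits."""
    seen = defaultdict(set)
    for split_name, paths in splits.items():
        for path in paths:
            seen[group_key(path)].add(split_name)
    return {key: names for key, names in seen.items() if len(names) > 1}
```

Calling `find_leaks({"train": [...], "val": [...], "test": [...]})` returns an empty dict when no basename crosses a split boundary, so it can back a simple `assert not find_leaks(splits)` in the data-preparation step.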
### Cleanup
- [x] Remove `classifier/tools/ensemble.py` (not part of reorganization plan, conflicts with explainability goals)
- [x] Remove robustness evaluation from `classifier/tools/analyze.py` (lines 51-104, 82-104, 144) - not part of experimental plan
- [x] Remove any unused or obsolete config files from previous experiments (see detailed list below)
- [x] Clean up old output directories if needed (keep important results for reference)

#### Config Files to Remove (39 total)

**Root configs (6):**
- [x] `classifier/configs/resnet18_quick.json`
- [x] `classifier/configs/resnet18.json`
- [x] `classifier/configs/simple_cnn_large.json`
- [x] `classifier/configs/simple_cnn_micro.json`
- [x] `classifier/configs/simple_cnn_small.json`
- [x] `classifier/configs/simple_cnn.json`

**Phase 1 old configs (7):**
- [x] `classifier/configs/phase1/p1_cnn_base.json` (uses lr=1e-3, epochs=20 - should be 1e-4, 15)
- [x] `classifier/configs/phase1/p1_cnn_aug.json`
- [x] `classifier/configs/phase1/p1_resnet18_base.json` (duplicate of new baseline)
- [x] `classifier/configs/phase1/p1_resnet18_aug.json`
- [x] `classifier/configs/phase1/holdout/` (entire directory - 6 configs, source holdout not in new plan)

**Phase 2 old configs (7):**
- [x] `classifier/configs/phase2/p2_resnet18_224.json` (should be p2a_resnet18_224.json)
- [x] `classifier/configs/phase2/p2_resnet18_facecrop.json` (should be p2b_resnet18_facecrop.json)
- [x] `classifier/configs/phase2/p2_resnet18_frozen.json` (frozen backbone not in new plan)
- [x] `classifier/configs/phase2/p2_resnet34_224.json` (ResNet34 should be in Phase 3)
- [x] `classifier/configs/phase2/p2_resnet34.json` (ResNet34 should be in Phase 3)
- [x] `classifier/configs/phase2/p2_resnet50_frozen.json` (ResNet50 should be in Phase 3)
- [x] `classifier/configs/phase2/p2_resnet50.json` (ResNet50 should be in Phase 3)

**Phase 3 old configs (4):**
- [x] `classifier/configs/phase3/p3_efficientnet_b2.json` (EfficientNet-B2 not in new plan, only B0)
- [x] `classifier/configs/phase3/p3_resnet18_facecrop_full.json` (ResNet18 full dataset should be Phase 4)
- [x] `classifier/configs/phase3/p3_resnet18_freqaug.json` (frequency augmentation not in new plan)
- [x] `classifier/configs/phase3/p3_vit_b16.json` (ViT not in new plan, replaced with ConvNeXt/MobileNet)
- Note: `p3_efficientnet_b0.json` - REMOVED (will be recreated after Phase 2 with correct settings)

**Source holdout (6):**
- [x] `classifier/configs/source_holdout/` (entire directory - 6 configs, source holdout not in new plan)

**Ablation (3):**
- [x] `classifier/configs/ablation/` (entire directory - 3 configs, ablation studies not in new plan)

**Configs to KEEP (3):**
- ✅ `classifier/configs/shared.json`
- ✅ `classifier/configs/phase1/p1_simplecnn_baseline.json`
- ✅ `classifier/configs/phase1/p1_resnet18_baseline.json`

**Phase 2 alias configs removed (8):**
- [x] `classifier/configs/phase2/p2b_resnet18_128.json` (alias for p1_resnet18_baseline)
- [x] `classifier/configs/phase2/p2b_simplecnn_128.json` (alias for p1_simplecnn_baseline)
- [x] `classifier/configs/phase2/p2c_resnet18_nofacecrop.json` (alias for p2b_resnet18_224)
- [x] `classifier/configs/phase2/p2c_simplecnn_nofacecrop.json` (alias for p2b_simplecnn_224)
- [x] `classifier/configs/phase2/p2d_resnet18_noaug.json` (alias for p2b_resnet18_224)
- [x] `classifier/configs/phase2/p2d_simplecnn_noaug.json` (alias for p2b_simplecnn_224)
- [x] `classifier/configs/phase2/p2e_resnet18_facecrop_only.json` (alias for p2c_resnet18_facecrop)
- [x] `classifier/configs/phase2/p2e_simplecnn_facecrop_only.json` (alias for p2c_simplecnn_facecrop)

Note: Comparison pairs (baseline vs treatment) are defined in the analysis notebook as a mapping dict, not as separate config files.

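One possible shape for that mapping dict is sketched below. The config names are the ones listed in this plan for experiments 2B-2E; the variable name `COMPARISON_PAIRS` and the helper `baseline_chain` are hypothetical, and the notebook's real structure may differ.

```python
# Treatment config -> baseline config, using the config names from this plan.
# Only the 2B-2E pairs are shown; the 2A shortcut configs are left out here.
COMPARISON_PAIRS = {
    "p2b_simplecnn_224": "p1_simplecnn_baseline",
    "p2b_resnet18_224": "p1_resnet18_baseline",
    "p2c_simplecnn_facecrop": "p2b_simplecnn_224",
    "p2c_resnet18_facecrop": "p2b_resnet18_224",
    "p2d_simplecnn_aug": "p2b_simplecnn_224",
    "p2d_resnet18_aug": "p2b_resnet18_224",
    "p2e_simplecnn_facecrop_aug": "p2c_simplecnn_facecrop",
    "p2e_resnet18_facecrop_aug": "p2c_resnet18_facecrop",
}


def baseline_chain(treatment: str) -> list[str]:
    """Follow baselines back to the Phase 1 config, e.g. for a lineage table."""
    chain = [treatment]
    while chain[-1] in COMPARISON_PAIRS:
        chain.append(COMPARISON_PAIRS[chain[-1]])
    return chain
```

Keeping the pairs in one dict avoids alias config files while still making every baseline-versus-treatment comparison explicit.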
---

## Phase 1: Architecture Baseline

### 1.1 Experiment Configs
- [x] Create `classifier/configs/phase1/p1_simplecnn_baseline.json`
  - backbone: simple_cnn
  - cnn_preset: medium
  - dropout: 0.0
  - epochs: 15
  - batch_size: 32
  - lr: 1e-4 (consistent with ResNet)
  - weight_decay: 1e-4
  - image_size: 128
  - data_dir: data
  - early_stopping_patience: 5
  - subsample: 0.2
  - face_crop: false
  - augment: false
  - seed: 42

- [x] Create `classifier/configs/phase1/p1_resnet18_baseline.json`
  - backbone: resnet18
  - pretrained: true
  - epochs: 15
  - batch_size: 32
  - lr: 1e-4
  - weight_decay: 1e-4
  - image_size: 128
  - data_dir: data
  - early_stopping_patience: 5
  - subsample: 0.2
  - face_crop: false
  - augment: false
  - seed: 42

### 1.2 Code Updates
- [x] Implement 5-fold stratified group cross-validation by basename in training pipeline
- [x] Update `classifier/src/training/trainer.py` to support CV
- [x] Update `classifier/src/evaluation/evaluate.py` to support CV
- [x] Ensure all metrics report mean ± std and confidence intervals across folds

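The grouped CV protocol (outer held-out test fold, group-aware inner validation split) can be illustrated with a dependency-free sketch. It is an assumption-laden simplification, not the code in `trainer.py`: `assign_folds` uses a greedy heuristic for stratification, the inner validation split is group-aware but not stratified, and `val_fraction` is a hypothetical parameter. A library implementation such as scikit-learn's `StratifiedGroupKFold` would be the more usual choice in practice.

```python
import random
from collections import Counter, defaultdict


def assign_folds(groups, labels, n_folds=5, seed=42):
    """Greedy stratified group assignment: each whole group goes to the fold
    that currently holds the fewest samples of the group's majority class."""
    members = defaultdict(list)
    for index, group in enumerate(groups):
        members[group].append(index)
    order = sorted(members)                      # deterministic base order
    random.Random(seed).shuffle(order)           # same seed -> same folds
    order.sort(key=lambda g: -len(members[g]))   # stable sort: large groups first
    class_counts = [Counter() for _ in range(n_folds)]
    fold_of = {}
    for group in order:
        counts = Counter(labels[i] for i in members[group])
        majority = counts.most_common(1)[0][0]
        target = min(
            range(n_folds),
            key=lambda f: (class_counts[f][majority], sum(class_counts[f].values())),
        )
        fold_of[group] = target
        class_counts[target].update(counts)
    return fold_of


def nested_splits(groups, labels, n_folds=5, val_fraction=0.1, seed=42):
    """Yield (train, val, test) index lists; groups never straddle two parts."""
    fold_of = assign_folds(groups, labels, n_folds, seed)
    for test_fold in range(n_folds):
        remaining = sorted(g for g, f in fold_of.items() if f != test_fold)
        random.Random(seed + test_fold).shuffle(remaining)
        n_val = max(1, round(len(remaining) * val_fraction))
        val_groups = set(remaining[:n_val])
        train, val, test = [], [], []
        for index, group in enumerate(groups):
            if fold_of[group] == test_fold:
                test.append(index)
            elif group in val_groups:
                val.append(index)
            else:
                train.append(index)
        yield train, val, test
```

Because the fold assignment depends only on the group names, the labels, and the seed, two experiments that share those inputs see identical folds, which is what makes paired fold-wise comparisons meaningful.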
### 1.3 Training
- [x] Train SimpleCNN with 5-fold stratified group CV (via pipeline: `python -m pipeline run classifier/configs/phase1/p1_simplecnn_baseline.json`)
- [x] Train ResNet18 with 5-fold stratified group CV (via pipeline: `python -m pipeline run classifier/configs/phase1/p1_resnet18_baseline.json`)
- [x] Save all checkpoints and metrics (pipeline automatically fetches outputs to classifier/outputs/)

### 1.4 Analysis
- [x] Use `classifier/notebooks/03_phase1_analysis.ipynb` for Phase 1 analysis
- [x] Compare SimpleCNN vs ResNet18 performance
  - [x] Overall metrics (AUC, Accuracy, F1) with mean ± std and confidence intervals
  - [x] Per-source metrics (text2img, inpainting, insight)
  - [x] Train/val/test performance curves
  - [x] Confusion matrices
  - [x] Statistical significance testing
- [x] Generate Grad-CAM visualizations (10-20 images per model)
- [x] Document conclusions: Which baseline is better and why

---

## Phase 2: Preprocessing Impact

### 2.1 Shortcut Analysis (2A)
- [x] Create `classifier/configs/phase2/p2a_t1_original.json`
  - backbone: resnet18
  - image_size: 224
  - subsample: 0.2
  - seed: 42
  - augment: false
  - normalization: imagenet
  - data_dir: data

- [x] Create `classifier/configs/phase2/p2a_t2_real_norm.json`
  - extends: p2a_t1_original.json
  - normalization: real_norm
  - **Normalization**: Calculate mean/std from real training images only within each fold

- [x] Geometry diagnostic was explored and then removed from the codebase (`src/evaluation/geometry.py` no longer exists):
  - Current pipeline always square-crops before resize, reducing rectangle-vs-square shortcut risk.
  - Shortcut analysis now relies on normalization and held-out-source evidence artifacts.

- [ ] Train the 2 shortcut configs with 5-fold stratified group CV
- [ ] Compare results:
  - Standard vs matched-geometry eval for `p2a_t1_original` (letterboxing impact); applicable only if the geometry diagnostic is reintroduced
  - `p2a_t1_original` vs `p2a_t2_real_norm` (color distribution shortcut)

- [x] Create `classifier/configs/phase2/p2a_t3_holdout_text2img.json`
  - extends: p2a_t1_original.json
  - train_sources: ["wiki", "inpainting", "insight"]
  - eval_sources: ["wiki", "inpainting", "insight", "text2img"]

- [x] Create `classifier/configs/phase2/p2a_t3_holdout_inpainting.json`
  - extends: p2a_t1_original.json
  - train_sources: ["wiki", "text2img", "insight"]
  - eval_sources: ["wiki", "text2img", "insight", "inpainting"]

- [x] Create `classifier/configs/phase2/p2a_t3_holdout_insight.json`
  - extends: p2a_t1_original.json
  - train_sources: ["wiki", "text2img", "inpainting"]
  - eval_sources: ["wiki", "text2img", "inpainting", "insight"]

- [ ] Train the 3 source holdout configs with 5-fold stratified group CV
- [ ] Compare held-out source performance vs in-source performance:
  - Calculate AUC for held-out source (text2img, inpainting, insight)
  - Compute Δ (in-source AUC - held-out AUC)
  - If Δ exceeds roughly 0.05-0.10, treat it as evidence that the model is learning source-specific features

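The Δ computation for the source holdout experiments can be sketched as below. It assumes binary labels with 1 = fake and 0 = real, and that each evaluation subset is available as a `(scores, labels)` pair; the function names are hypothetical. The AUC here uses the rank-sum (Mann-Whitney U) formulation so the example needs no external library.

```python
def auc(scores, labels):
    """ROC AUC via the rank-sum formula; tied scores receive their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1  # 1-based average rank of the tie block
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        start = end + 1
    positives = [i for i, label in enumerate(labels) if label == 1]
    n_pos = len(positives)
    n_neg = len(labels) - n_pos
    rank_sum = sum(ranks[i] for i in positives)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def holdout_delta(in_source, held_out):
    """Δ = in-source AUC - held-out AUC; each argument is a (scores, labels) pair."""
    return auc(*in_source) - auc(*held_out)
```

A positive Δ means the model separates real from fake better on the sources it was trained on than on the held-out source; the 0.05-0.10 band in the checklist is the plan's own rule of thumb, not something this sketch derives.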
### 2.2 Resolution Impact (2B)
- [x] Create `classifier/configs/phase2/p2b_simplecnn_224.json`
  - backbone: simple_cnn
  - image_size: 224
  - subsample: 0.2
  - augment: false
  - seed: 42
  - data_dir: data

- [x] Create `classifier/configs/phase2/p2b_resnet18_224.json`
  - backbone: resnet18
  - image_size: 224
  - subsample: 0.2
  - augment: false
  - seed: 42
  - data_dir: data

- [ ] Train both 224 configs with 5-fold stratified group CV
- [ ] Compare 128×128 vs 224×224 for each model
  - 128 baseline is `p1_*_baseline` (comparison mapping in notebook)

### 2.3 Facecrop Impact (2C)
- [x] Create `classifier/configs/phase2/p2c_simplecnn_facecrop.json`
  - backbone: simple_cnn
  - image_size: 224
  - subsample: 0.2
  - augment: false
  - seed: 42
  - data_dir: cropped/classifier

- [x] Create `classifier/configs/phase2/p2c_resnet18_facecrop.json`
  - backbone: resnet18
  - image_size: 224
  - subsample: 0.2
  - augment: false
  - seed: 42
  - data_dir: cropped/classifier

- [ ] Train both facecrop configs with 5-fold stratified group CV
- [ ] Compare `p2b_*_224` (no facecrop) vs `p2c_*_facecrop` for each model
  - No-facecrop baseline is `p2b_*_224` (comparison mapping in notebook)

### 2.4 Augmentation Impact (2D)
- [x] Create `classifier/configs/phase2/p2d_simplecnn_aug.json`
  - backbone: simple_cnn
  - image_size: 224
  - subsample: 0.2
  - seed: 42
  - augment: {hflip_p: 0.5, rotation_degrees: 10, brightness: 0.2, contrast: 0.2, saturation: 0.1, hue: 0.02, grayscale_p: 0.1, blur_p: 0.1, erase_p: 0.2, noise_p: 0.3, noise_std: 0.04}
  - data_dir: data

- [x] Create `classifier/configs/phase2/p2d_resnet18_aug.json`
  - backbone: resnet18
  - image_size: 224
  - subsample: 0.2
  - seed: 42
  - augment: {hflip_p: 0.5, rotation_degrees: 10, brightness: 0.2, contrast: 0.2, saturation: 0.1, hue: 0.02, grayscale_p: 0.1, blur_p: 0.1, erase_p: 0.2, noise_p: 0.3, noise_std: 0.04}
  - data_dir: data

- [ ] Train both augmentation configs with 5-fold stratified group CV
- [ ] Compare `p2b_*_224` (no aug) vs `p2d_*_aug` for each model
  - No-aug baseline is `p2b_*_224` (comparison mapping in notebook)

### 2.5 Augmentation + Facecrop (2E)
- [x] Create `classifier/configs/phase2/p2e_simplecnn_facecrop_aug.json`
  - backbone: simple_cnn
  - image_size: 224
  - subsample: 0.2
  - seed: 42
  - augment: {hflip_p: 0.5, rotation_degrees: 10, brightness: 0.2, contrast: 0.2, saturation: 0.1, hue: 0.02, grayscale_p: 0.1, blur_p: 0.1, erase_p: 0.2, noise_p: 0.3, noise_std: 0.04}
  - data_dir: cropped/classifier

- [x] Create `classifier/configs/phase2/p2e_resnet18_facecrop_aug.json`
  - backbone: resnet18
  - image_size: 224
  - subsample: 0.2
  - seed: 42
  - augment: {hflip_p: 0.5, rotation_degrees: 10, brightness: 0.2, contrast: 0.2, saturation: 0.1, hue: 0.02, grayscale_p: 0.1, blur_p: 0.1, erase_p: 0.2, noise_p: 0.3, noise_std: 0.04}
  - data_dir: cropped/classifier

- [ ] Train both facecrop+aug configs with 5-fold stratified group CV
- [ ] Compare `p2c_*_facecrop` (facecrop only) vs `p2e_*_facecrop_aug` for each model
  - Facecrop-only baseline is `p2c_*_facecrop` (comparison mapping in notebook)

### 2.6 Phase 2 Analysis
- [ ] Use `classifier/notebooks/04_phase2_analysis.ipynb` for Phase 2 analysis
- [ ] For each experiment (2A-2E):
  - [ ] Load 5-fold stratified group CV results (mean ± std and confidence intervals)
  - [ ] Generate overall metrics (AUC, Accuracy, F1)
  - [ ] Generate per-source metrics (text2img, inpainting, insight)
  - [ ] Calculate train/val gap
  - [ ] Calculate pairwise source AUC variance (wiki-vs-source AUC variance)
  - [ ] Statistical significance testing vs baseline
  - [ ] Generate comparison visualizations (bar charts, heatmaps)
- [ ] For 2A (Shortcut Analysis):
  - [ ] Compare original-test vs alternative geometry evidence if reintroduced in a dedicated tool/notebook
  - [ ] Compare ImageNet vs real-image-only normalization (color distribution shortcuts)
  - [ ] Load source holdout results (3 configs)
  - [ ] Calculate held-out source AUC vs in-source AUC for each holdout experiment
  - [ ] Compute Δ (in-source AUC - held-out AUC)
  - [ ] If Δ exceeds roughly 0.05-0.10, treat it as evidence that the model is learning source-specific features
  - [ ] Generate source holdout comparison table
- [ ] For each model/condition:
  - [ ] Generate Grad-CAM visualizations (10-20 images per condition)
  - [ ] Organize by experiment, prediction type, and source
- [ ] Answer key questions:
  - [ ] Which preprocessing choices are statistically significant?
  - [ ] Do certain sources benefit more from specific preprocessing?
  - [ ] Is there an interaction between facecrop and augmentation?
  - [ ] Are shortcuts being learned (resolution, color distribution)?
  - [ ] Is the model learning source-specific features (source holdout)?
  - [ ] Does augmentation remove shortcuts or over-regularize?
  - [ ] What features do models focus on (based on Grad-CAM)?
- [ ] Generate comprehensive metrics comparison table
- [ ] Use paired fold-wise statistical tests for model comparisons, with bootstrap confidence intervals for key metrics where useful
- [ ] Provide evidence-based conclusions for each experiment
- [ ] Provide recommendations for Phase 3 (best preprocessing settings)

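A paired fold-wise test can be sketched as a paired t-test on the per-fold metric differences. This is one reasonable choice rather than the test the notebook necessarily uses; the function names are hypothetical, and the hard-coded critical value applies only to 5 paired folds (4 degrees of freedom) at a two-sided 5% level.

```python
import math
from statistics import mean, stdev

# Two-sided 95% critical value of Student's t with 4 degrees of freedom (5 folds).
T_CRIT_DF4 = 2.776


def paired_t_statistic(baseline, treatment):
    """t statistic of the per-fold differences (treatment - baseline)."""
    diffs = [t - b for b, t in zip(baseline, treatment)]
    center = mean(diffs)
    spread = stdev(diffs)
    if spread == 0:
        # All folds moved by exactly the same amount: no variance to test against.
        return math.copysign(math.inf, center) if center else 0.0
    return center / (spread / math.sqrt(len(diffs)))


def is_significant(baseline, treatment, critical=T_CRIT_DF4):
    """True when |t| exceeds the critical value; assumes 5 paired folds."""
    return abs(paired_t_statistic(baseline, treatment)) > critical
```

The pairing is only valid when both configs were evaluated on identical folds, so `baseline[i]` and `treatment[i]` must refer to the same held-out fold. With only five folds the test has little power, which is one reason to report confidence intervals alongside it.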
---

## Phase 3: Extended Architecture Exploration

### 3.1 Experiment Configs
Use the best preprocessing choices from Phase 2. The placeholders below assume 224×224, face crop enabled, and no augmentation unless Phase 2 results justify different settings.

- [x] Create `classifier/configs/phase3/p3_resnet34.json`
  - backbone: resnet34
  - pretrained: true
  - epochs: 15
  - batch_size: 32
  - lr: 1e-4
  - weight_decay: 1e-4
  - image_size: 224
  - augment: false (placeholder until Phase 2 results confirm)
  - subsample: 0.2
  - seed: 42
  - early_stopping_patience: 5

- [x] Create `classifier/configs/phase3/p3_resnet50.json`
  - backbone: resnet50
  - pretrained: true
  - epochs: 15
  - batch_size: 32
  - lr: 1e-4
  - weight_decay: 1e-4
  - image_size: 224
  - augment: false (placeholder until Phase 2 results confirm)
  - subsample: 0.2
  - seed: 42
  - early_stopping_patience: 5

- [x] Create `classifier/configs/phase3/p3_efficientnet_b0.json`
  - backbone: efficientnet_b0
  - pretrained: true
  - epochs: 15
  - batch_size: 32
  - lr: 1e-4
  - weight_decay: 1e-4
  - image_size: 224
  - augment: false (placeholder until Phase 2 results confirm)
  - subsample: 0.2
  - seed: 42
  - early_stopping_patience: 5

- [x] Create `classifier/configs/phase3/p3_convnext_tiny.json`
  - backbone: convnext_tiny
  - pretrained: true
  - epochs: 15
  - batch_size: 32
  - lr: 5e-5 (reduced for ConvNeXt stability)
  - weight_decay: 1e-4
  - image_size: 224
  - augment: false (placeholder until Phase 2 results confirm)
  - subsample: 0.2
  - seed: 42
  - early_stopping_patience: 5

- [x] Create `classifier/configs/phase3/p3_mobilenetv3_small.json`
  - backbone: mobilenet_v3_small
  - pretrained: true
  - epochs: 15
  - batch_size: 32
  - lr: 1e-4
  - weight_decay: 1e-4
  - image_size: 224
  - augment: false (placeholder until Phase 2 results confirm)
  - subsample: 0.2
  - seed: 42
  - early_stopping_patience: 5

- [x] Remove `p3a_mobilenet_v3_large.json` (not in plan, MobileNet V3 Large fills no distinct niche)

### 3.2 Model Implementation
- [x] Implement ConvNeXt-Tiny in `classifier/src/models/convnext.py`
- [x] Implement MobileNetV3-Small in `classifier/src/models/mobilenet.py`
- [x] Register both models in `classifier/src/models/__init__.py`

### 3.3 Training
- [ ] Train ResNet34 with 5-fold stratified group CV
- [ ] Train ResNet50 with 5-fold stratified group CV
- [ ] Train EfficientNet-B0 with 5-fold stratified group CV
- [ ] Train ConvNeXt-Tiny with 5-fold stratified group CV
- [ ] Train MobileNetV3-Small with 5-fold stratified group CV
- [ ] Save all checkpoints and metrics

### 3.4 Analysis
- [ ] Use `classifier/notebooks/05_phase3_analysis.ipynb` for Phase 3 analysis
- [ ] Load 5-fold stratified group CV results for all models (mean ± std and confidence intervals)
- [ ] Generate overall metrics for each model
- [ ] Generate per-source metrics for each model
- [ ] Compare with Phase 1 baselines (ResNet18, SimpleCNN)
- [ ] Statistical significance testing vs baselines
- [ ] Generate Grad-CAM visualizations for top models (10-20 images each)
- [ ] Parameter count vs performance analysis
- [ ] Conclusions: Which architectures work best and why

---

## Phase 4: Final Analysis on Best Models

### 4.1 Select Top Models
- [ ] Based on Phases 1-3 results, select top 3-4 models
- [ ] Document selection criteria (e.g., top AUC, balanced performance, efficiency)

### 4.2 Data Quantity Scaling (4A)
- [ ] For each selected model, create configs for different data sizes:
  - [ ] `classifier/configs/phase4/p4a_<model>_20pct.json` (subsample: 0.2)
  - [ ] `classifier/configs/phase4/p4a_<model>_50pct.json` (subsample: 0.5)
  - [ ] `classifier/configs/phase4/p4a_<model>_100pct.json` (subsample: 1.0)
- [ ] In every 4A config, explicitly set the best Phase 2 preprocessing choices:
  - image_size: best from Phase 2B
  - face_crop: best from Phase 2C/E
  - augment: best from Phase 2D/E
- [ ] Train each model with 5-fold stratified group CV at all three data sizes
- [ ] Compare how each model scales with more data

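Generating the 4A configs programmatically keeps the preprocessing choices identical across data sizes. The sketch below follows the `p4a_<model>_<size>.json` naming from the checklist; the helper names and the exact set of keys written to each file are assumptions, and the preprocessing values must come from the Phase 2 conclusions rather than from this example.

```python
import json
from pathlib import Path

# Data-size tag -> subsample fraction, as listed in the 4A checklist.
DATA_FRACTIONS = {"20pct": 0.2, "50pct": 0.5, "100pct": 1.0}


def build_phase4a_configs(models, preprocessing):
    """Return {relative path: config dict} for every model and data size."""
    configs = {}
    for model in models:
        for tag, fraction in DATA_FRACTIONS.items():
            path = f"classifier/configs/phase4/p4a_{model}_{tag}.json"
            configs[path] = {"backbone": model, "subsample": fraction, **preprocessing}
    return configs


def write_configs(configs, root="."):
    """Write each config as JSON under `root`, creating directories as needed."""
    for relative_path, config in configs.items():
        target = Path(root) / relative_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(json.dumps(config, indent=2) + "\n")
```

Passing the same `preprocessing` dict to every model makes it hard for one config to drift from the others, which matters because the scaling comparison assumes data size is the only variable.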
### 4.3 Full Dataset Evaluation (4B)
- [ ] For each selected model, create config for full dataset:
  - `classifier/configs/phase4/p4b_<model>_full.json` (subsample: 1.0)
- [ ] In every 4B config, explicitly set the same best Phase 2 preprocessing choices used in 4A
- [ ] Train each model on full dataset with 5-fold stratified group CV
- [ ] Generate detailed per-source metrics
- [ ] Generate Grad-CAM visualizations (10-20 images each)
- [ ] Perform hard example analysis (false positives/negatives) with visualizations
- [ ] Generate confidence distribution histograms
- [ ] Cross-validation results (mean ± std with confidence intervals)

### 4.4 Analysis
- [ ] Use `classifier/notebooks/06_phase4_analysis.ipynb` for Phase 4 analysis
- [ ] Load data quantity scaling results
- [ ] Load full dataset evaluation results
- [ ] Generate comprehensive metrics comparison table
- [ ] Generate per-source metrics for final models
- [ ] Generate Grad-CAM galleries for final models
- [ ] Perform hard example analysis with visualizations
- [ ] Generate confidence distribution histograms
- [ ] Final model comparison and selection
- [ ] Conclusions and recommendations

---

## Notebooks and Analysis

This section is the consolidated notebook checklist for the notebooks referenced in the phase sections above; do not create duplicate notebooks for the same phase.

### 5.1 Exploratory Data Analysis
- [x] Create `classifier/notebooks/01_eda.ipynb`
- [x] Dataset overview (real vs fake distribution, sources)
- [x] Image resolution/aspect ratio analysis (identify potential shortcuts)
- [x] Color distribution analysis (identify potential shortcuts)
- [x] Sample visualization from each source
- [x] Statistical summary of the dataset
- [x] Data quality checks

### 5.2 Preprocessing Pipeline
- [x] Create `classifier/notebooks/02_preprocessing.ipynb`
- [x] Square crop and resize implementation demonstration
- [x] Face crop (MTCNN) demonstration and effectiveness analysis
- [x] Augmentation pipeline visualization (before/after examples)
- [x] Z-score normalization comparison (ImageNet vs real-image-only)
- [x] Data split verification (group-aware by basename, no overlap)
- [x] Preprocessing impact visualization

### 5.3 Phase 1 Analysis
- [x] Create `classifier/notebooks/03_phase1_analysis.ipynb`
- [x] Load Phase 1 training results
- [x] Generate 5-fold stratified group CV results (mean ± std with confidence intervals)
- [x] Generate per-source metrics for each model
- [x] Generate train/val/test performance curves
- [x] Generate confusion matrices
- [x] Perform statistical significance testing between models
- [x] Generate Grad-CAM visualizations (10-20 images each)
- [x] Document conclusions: Which baseline is better and why

### 5.4 Phase 2 Analysis
- [x] Create `classifier/notebooks/04_phase2_analysis.ipynb`
- [ ] Load all Phase 2 experiment results
- [ ] For each experiment (2A-2E):
  - [ ] Generate 5-fold stratified group CV results (mean ± std with confidence intervals)
  - [ ] Generate overall metrics
  - [ ] Generate per-source metrics
  - [ ] Calculate train/val gap
  - [ ] Calculate pairwise source AUC variance (wiki-vs-source AUC variance)
  - [ ] Perform statistical significance testing
- [ ] Generate comparison tables across all Phase 2 experiments
- [ ] Generate comparison visualizations (bar charts, heatmaps)
- [ ] For each model/condition, generate Grad-CAM visualizations (10-20 images)
- [ ] Organize visualizations by experiment, model, prediction type, and source
- [ ] Answer key analysis questions
- [ ] Generate comprehensive metrics comparison table
- [ ] Provide evidence-based conclusions for each experiment
- [ ] Provide recommendations for Phase 3

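The "pairwise source AUC variance" item can be read as: compute one AUC per fake source against the real (wiki) images, then take the variance of those AUCs. The sketch below follows that reading, which is an interpretation rather than a confirmed definition; it also assumes scores are predicted probabilities of "fake", uses the population variance, and the function names are hypothetical.

```python
from statistics import pvariance


def pairwise_auc(real_scores, fake_scores):
    """Probability that a fake image outscores a real one (ties count half)."""
    wins = sum((f > r) + 0.5 * (f == r) for f in fake_scores for r in real_scores)
    return wins / (len(fake_scores) * len(real_scores))


def source_auc_variance(scores_by_source, real_source="wiki"):
    """Return ({source: wiki-vs-source AUC}, population variance of those AUCs)."""
    real = scores_by_source[real_source]
    aucs = {
        source: pairwise_auc(real, scores)
        for source, scores in scores_by_source.items()
        if source != real_source
    }
    return aucs, pvariance(list(aucs.values()))
```

A low variance suggests the model treats the fake sources similarly; a high variance points to one source being much easier or harder than the others. The pairwise loop is quadratic in the number of images, so a rank-based AUC would be preferable on the full dataset.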
### 5.5 Phase 3 Analysis
- [ ] Create `classifier/notebooks/05_phase3_analysis.ipynb`
- [ ] Load Phase 3 training results
- [ ] Generate 5-fold stratified group CV results for each model (mean ± std with confidence intervals)
- [ ] Generate per-source metrics for each model
- [ ] Compare with Phase 1 baselines (ResNet18, SimpleCNN)
- [ ] Perform statistical significance testing vs baselines
- [ ] Generate Grad-CAM visualizations for top models (10-20 images each)
- [ ] Parameter count vs performance analysis
- [ ] Conclusions: Which architectures work best and why

### 5.6 Phase 4 Analysis
- [ ] Create `classifier/notebooks/06_phase4_analysis.ipynb`
- [ ] Load data quantity scaling results
- [ ] Load full dataset evaluation results
- [ ] Generate comprehensive metrics comparison table
- [ ] Generate per-source metrics for final models
- [ ] Generate Grad-CAM galleries for final models
- [ ] Perform hard example analysis with visualizations
- [ ] Generate confidence distribution histograms
- [ ] Final model comparison and selection
- [ ] Conclusions and recommendations

### 5.7 Grad-CAM Deep Dive (Optional)
- [ ] Create `classifier/notebooks/07_gradcam_deep_dive.ipynb`
- [ ] Load Grad-CAM results from all phases
- [ ] Comprehensive Grad-CAM analysis across all phases and models
- [ ] Feature visualization for different model architectures
- [ ] CNN vs EfficientNet vs ConvNeXt comparison
- [ ] What regions do different architectures focus on?
- [ ] Are there systematic differences in attention patterns?
- [ ] Evidence of shortcut removal analysis across phases
- [ ] Temporal analysis: does model attention change with different preprocessing?
- [ ] Generate visual explanations suitable for presentation

---

## Code Implementation Tasks

### Cross-Validation Implementation
- [x] Update `classifier/src/training/trainer.py` to support 5-fold stratified group CV by basename
- [x] Update `classifier/src/evaluation/evaluate.py` to support grouped CV splits
- [x] Implement metric aggregation across folds (mean ± std)
- [x] Ensure all metrics report confidence intervals
- [x] Reuse the same fold assignments for comparable experiments so paired statistical tests are valid
- [x] Rename `classifier/run_cv.py` to `classifier/run.py` (pipeline expects classifier/run.py)

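Aggregating a metric across folds into mean ± std with a confidence interval can be sketched as follows. The choice of a Student-t interval on the fold means is an assumption (a bootstrap interval would be another option), and `summarize_folds` is a hypothetical name. The critical values are standard two-sided 95% table values.

```python
import math
from statistics import mean, stdev

# Two-sided 95% Student-t critical values, keyed by degrees of freedom (n_folds - 1).
T_CRITICAL_95 = {2: 4.303, 3: 3.182, 4: 2.776, 9: 2.262}


def summarize_folds(values):
    """Return mean, sample std, and a 95% t-interval for one metric across folds."""
    n = len(values)
    center = mean(values)
    spread = stdev(values)
    half_width = T_CRITICAL_95[n - 1] * spread / math.sqrt(n)
    return {
        "mean": center,
        "std": spread,
        "ci95": (center - half_width, center + half_width),
    }
```

For the plan's 5-fold setup the lookup uses 4 degrees of freedom; fold counts that are missing from the table raise a `KeyError` rather than silently using a wrong critical value.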
### Model Implementations
- [x] Implement ConvNeXt-Tiny in `classifier/src/models/convnext.py`
- [x] Implement MobileNetV3-Small in `classifier/src/models/mobilenet.py`
- [x] Register both models in `classifier/src/models/__init__.py`

### Normalization Implementation
- [ ] Implement function to calculate mean/std from real training images only
- [ ] Update `classifier/src/preprocessing/pipeline.py` to support custom normalization stats
- [ ] Test ImageNet normalization vs real-image-only normalization

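The real-image-only statistics can be sketched with a streaming sum and sum of squares per channel. This is a dependency-free illustration, assuming each image is an iterable of `(r, g, b)` pixels scaled to [0, 1] and that real images carry label 0; the function names are hypothetical, and a real implementation would more likely operate on tensors or arrays.

```python
import math


def channel_stats(images):
    """Per-channel mean and std over all pixels of the given images.

    Each image is an iterable of (r, g, b) pixels scaled to [0, 1]."""
    total = [0.0, 0.0, 0.0]
    total_sq = [0.0, 0.0, 0.0]
    count = 0
    for image in images:
        for pixel in image:
            for channel, value in enumerate(pixel):
                total[channel] += value
                total_sq[channel] += value * value
            count += 1
    means = [s / count for s in total]
    # Clamp at zero to guard against tiny negative values from rounding.
    stds = [math.sqrt(max(sq / count - m * m, 0.0)) for sq, m in zip(total_sq, means)]
    return means, stds


def real_only_stats(samples, train_indices, real_label=0):
    """Stats from real training images only; `samples` is a list of (image, label)."""
    real_images = (samples[i][0] for i in train_indices if samples[i][1] == real_label)
    return channel_stats(real_images)
```

Restricting the computation to `train_indices` of the current fold keeps validation and test pixels out of the statistics, in line with the per-fold requirement stated for `p2a_t2_real_norm`.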
### Evaluation Improvements
- [ ] Ensure test set uses `train=False` to disable augmentation
- [ ] Ensure diagnostic evaluation transforms never change the training data
- [ ] Verify CV fold assignments are identical across comparable experiments (same seed and basename grouping)
- [ ] Implement per-source metrics with detection rate and false alarm rate
- [ ] Implement pairwise AUC calculations
- [ ] Implement train/val gap calculations
- [ ] Implement pairwise source AUC variance calculations

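Per-source detection and false alarm rates can be sketched as below. The sketch assumes that `wiki` is the only real source, that scores are predicted probabilities of "fake", and that 0.5 is the decision threshold; `per_source_rates` is a hypothetical name.

```python
from collections import defaultdict


def per_source_rates(records, threshold=0.5, real_source="wiki"):
    """Detection rate per fake source and false alarm rate on real images.

    `records` is an iterable of (source, score) pairs, where score is the
    predicted probability of "fake"."""
    flagged = defaultdict(int)
    totals = defaultdict(int)
    for source, score in records:
        totals[source] += 1
        flagged[source] += score >= threshold  # bool counts as 0 or 1
    rates = {source: flagged[source] / totals[source] for source in totals}
    false_alarm = rates.pop(real_source, None)
    return {"detection_rate": rates, "false_alarm_rate": false_alarm}
```

The detection rate is the fraction of a fake source's images flagged as fake, and the false alarm rate is the fraction of real images flagged as fake. Both depend on the threshold, so they complement rather than replace the threshold-free AUC.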
### Grad-CAM Improvements
- [x] Ensure Grad-CAM works for all model types (CNN-based)
- [x] Implement Grad-CAM for ConvNeXt (last Conv2d found automatically by `find_conv()`)
- [x] Implement Grad-CAM for MobileNetV3 (last Conv2d found automatically by `find_conv()`)
- [ ] Organize Grad-CAM outputs by experiment, model, prediction type, source

---

## Final Report Preparation
- [ ] Compile results from all phases
- [ ] Create presentation slides (PDF format)
  - [ ] Brief description of deep learning solutions (discriminative + generative)
  - [ ] Description of implementation steps and improvements
    - [ ] Motivate choices for architecture, training strategy, etc.
    - [ ] Show intermediate results
    - [ ] Interpret results and what changed
    - [ ] What was decided to improve results
  - [ ] Classification performance results
    - [ ] Experimental setup
    - [ ] Train/val/test splits
    - [ ] Performance metrics chosen
  - [ ] Data generation performance results
    - [ ] Experimental setup
    - [ ] Performance metrics chosen
  - [ ] Discussion and conclusions
    - [ ] Comments on performance
    - [ ] Final remarks
- [ ] Fill auto-evaluation file

---

## Summary

Total tasks: ~150+

This implementation plan covers:
- ✅ All 4 phases with comprehensive experiments
- ✅ 5-fold stratified group cross-validation for all experiments
- ✅ 7 analysis notebooks for robust validation
- ✅ Shortcut analysis (resolution/ratio + color distribution + source holdout)
- ✅ Source holdout experiments to detect source-specific feature learning
- ✅ Grad-CAM visualizations for explainability
- ✅ Statistical analysis with confidence intervals
- ✅ Per-source metrics for all experiments
- ✅ Data quantity scaling analysis
- ✅ Full dataset evaluation on best models
- ✅ Comprehensive documentation and reporting

**Key Features:**
- Reproducible experiments with fixed seeds
- Stratified group CV keeps basename groups together while balancing class distribution
- Multiple shortcut analyses to prevent model cheating (resolution, color, source-specific)
- Source holdout experiments to test generalization to unseen sources
- Grad-CAM for explainability
- Statistical rigor with confidence intervals
- Per-source analysis to understand model behavior
- Clear progression from baselines -> preprocessing -> architectures -> final evaluation

@@ -1,449 +0,0 @@
|
||||
# Classifier Reorganization Plan (v2)
|
||||
|
||||
## Analysis of Current Phasing Issues
|
||||
|
||||
Your current phasing has several problems that make it difficult to present a rigorous, explainable report:
|
||||
|
||||
### Current Problems
|
||||
|
||||
1. **Inconsistent comparison conditions**:
|
||||
- SimpleCNN uses lr=1e-3, ResNet uses lr=1e-4
|
||||
- SimpleCNN trains 20 epochs (no ES), ResNet18 trains 15 epochs (with ES)
|
||||
- Makes direct comparisons invalid
|
||||
|
||||
2. **No cross-validation**:
|
||||
- Only a single 80/10/10 split
|
||||
- Results may be split-dependent
|
||||
- No confidence intervals on metrics
|
||||
|
||||
3. **Augmentation testing is incomplete**:
|
||||
- Only tested on ResNet18 (Phase 3), not across architectures
|
||||
- Performance drop could mean: (a) removing shortcuts (good) or (b) over-regularization (bad)
|
||||
- No way to distinguish these cases
|
||||
|
||||
4. **Facecrop impact not generalized**:
|
||||
- Only ResNet18 tested with facecrop
|
||||
- Don't know if EfficientNet or ViT benefit similarly
|
||||
|
||||
5. **Full dataset only on one model**:
|
||||
- Only ResNet18 tested on full dataset
|
||||
- Don't know if data quantity helps all models equally
|
||||
|
||||
6. **Test set integrity**:
|
||||
- Need to verify that the test set uses original images: no augmentation, and no preprocessing beyond what is strictly necessary
|
||||
- Need to ensure same train/val/test splits across all model comparisons
|
||||
- Need central config for shared parameters across phases
|
||||
|
||||
---
|
||||
|
||||
## Recommended Reorganization
|
||||
|
||||
I suggest reorganizing into **4 phases** with clear, isolated variables. All phases use **5-fold stratified cross-validation** as standard practice to ensure balanced class distribution across folds.
|
||||
|
||||
### Phase 1: Controlled Baseline Comparison
|
||||
|
||||
**Goal**: Compare simple architectures under identical conditions to establish baselines
|
||||
|
||||
**Fixed conditions for ALL models**:
|
||||
- Data: 20% subsample
|
||||
- Resolution: 128×128
|
||||
- No face crop
|
||||
- No augmentation
|
||||
- Optimizer: AdamW (lr=1e-4, weight_decay=1e-4)
|
||||
- Scheduler: CosineAnnealingLR (T_max=15)
|
||||
- Epochs: 15 with early stopping (patience=5)
|
||||
- Batch size: 32
|
||||
- 5-fold stratified cross-validation (report mean ± std)
|
||||
|
||||
| Model | Params | Expected AUC (mean ± std) |
|
||||
|-------|--------|---------------------------|
|
||||
| SimpleCNN | ~400k | ? |
|
||||
| ResNet18 | ~11.7M | ? |
|
||||
|
||||
**This gives you**: Clean, comparable baseline for simple architectures with confidence intervals
|
||||
|
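A minimal sketch of how the per-fold AUCs could be reduced to "mean ± std" plus a 95% confidence interval. The function name `summarize_folds` and the small t-table are illustrative assumptions, not part of the existing code. Fold scores are not fully independent, so the interval should be read as approximate.

```python
import math
from statistics import mean, stdev

# Two-sided 95% critical values of Student's t, keyed by number of folds n (df = n - 1).
T_CRIT_95 = {2: 12.706, 3: 4.303, 4: 3.182, 5: 2.776, 10: 2.262}


def summarize_folds(fold_aucs):
    """Return (mean, sample std, (ci_low, ci_high)) for a list of per-fold AUCs."""
    n = len(fold_aucs)
    if n not in T_CRIT_95:
        raise ValueError(f"no t critical value tabulated for n={n}")
    m = mean(fold_aucs)
    s = stdev(fold_aucs)  # sample standard deviation (n - 1 denominator)
    half_width = T_CRIT_95[n] * s / math.sqrt(n)
    return m, s, (m - half_width, m + half_width)
```

With 5 folds the critical value is 2.776, so the interval is noticeably wider than mean ± std / √5 would suggest.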
||||
**These same 2 models will be used in Phase 2 for preprocessing experiments.**
|
||||
|
||||
---
|
||||
|
||||
### Phase 2: Preprocessing Impact (Same 2 Models from Phase 1)
|
||||
|
||||
**Goal**: Test each preprocessing change on the SAME 2 models from Phase 1
|
||||
|
||||
**Experimental questions**:
|
||||
- Does higher resolution improve performance?
|
||||
- Does face cropping improve performance?
|
||||
- Does augmentation improve or hurt performance?
|
||||
- Does augmentation interact with face cropping?
|
||||
- Is the model learning any shortcuts (e.g., resolution differences or aspect ratios)?
|
||||
|
||||
#### 2A: Shortcut Analysis
|
||||
**Goal**: Establish whether the baseline model exploits geometry, colour, or source-specific shortcuts before drawing any conclusions from preprocessing experiments.
|
||||
|
||||
**Test 1: Resolution/Ratio Shortcuts (Letterboxing)**
|
||||
- Train on original images (real=rectangular, fake=square); evaluate the same checkpoint under standard crop vs letterbox-padded real images to confirm geometry is or is not a discriminative cue
|
||||
- Models: **ResNet18**
|
||||
- Data: 20% subsample
|
||||
- 5-fold stratified CV (balanced class distribution)
|
||||
- Resolution: 224×224
|
||||
- No facecrop, no augmentation
|
||||
|
||||
| Experiment | AUC | Train/Val Gap | Per-Source AUC Variance |
|
||||
|------------|-----|---------------|-------------------------|
|
||||
| Original images (standard eval) | ? | ? | ? |
|
||||
| Matched geometry (letterboxed real images) | ? | ? | ? |
|
||||
|
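The matched-geometry condition in Test 1 depends on how the letterbox is computed. The sketch below shows one possible convention (scale the longer side to the target, pad symmetrically); the helper name `letterbox_geometry` and the symmetric-padding choice are assumptions, not the project's actual implementation.

```python
def letterbox_geometry(width: int, height: int, target: int = 224):
    """Return (new_w, new_h, pad_left, pad_top, pad_right, pad_bottom).

    The image is scaled so that its longer side equals `target`, then padded
    so that the final canvas is exactly target x target.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = target / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    pad_w, pad_h = target - new_w, target - new_h
    pad_left, pad_top = pad_w // 2, pad_h // 2
    # Any odd leftover pixel goes to the right / bottom edge.
    return new_w, new_h, pad_left, pad_top, pad_w - pad_left, pad_h - pad_top
```

A 400×300 real image would be resized to 224×168 and padded with 28 rows above and below, while a square fake image needs no padding at all.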
||||
**Test 2: Color Distribution Shortcuts**
|
||||
- Compare: Train with ImageNet normalization stats vs real-image-only normalization stats
|
||||
- Models: **ResNet18**
|
||||
- Data: 20% subsample
|
||||
- 5-fold stratified CV (balanced class distribution)
|
||||
- Resolution: 224×224
|
||||
- No facecrop, no augmentation
|
||||
- ImageNet stats: mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)
|
||||
- Real-image stats: Calculate mean/std from real training images only, apply to all
|
||||
|
||||
| Experiment | AUC | Train/Val Gap | Per-Source AUC Variance |
|
||||
|------------|-----|---------------|-------------------------|
|
||||
| ImageNet normalization | ? | ? | ? |
|
||||
| Real-image-only normalization | ? | ? | ? |
|
||||
|
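The real-image-only statistics in Test 2 can be accumulated in one pass over the real training images. The sketch below is library-free for clarity and assumes pixels are already scaled to [0, 1] and that label 0 means "real"; both conventions and both function names are assumptions for illustration.

```python
import math


def channel_stats(images):
    """Per-channel mean and std over all pixels of all images.

    `images` is an iterable of images; each image is an iterable of
    (r, g, b) tuples with values in [0, 1].
    """
    n = 0
    total = [0.0, 0.0, 0.0]
    total_sq = [0.0, 0.0, 0.0]
    for image in images:
        for pixel in image:
            n += 1
            for c in range(3):
                total[c] += pixel[c]
                total_sq[c] += pixel[c] * pixel[c]
    if n == 0:
        raise ValueError("no pixels seen")
    mean = [t / n for t in total]
    # Population variance E[x^2] - E[x]^2, clamped against rounding error.
    std = [math.sqrt(max(sq / n - m * m, 0.0)) for sq, m in zip(total_sq, mean)]
    return mean, std


def real_only_stats(samples):
    """`samples` yields (image, label) pairs; only label 0 (real) contributes."""
    return channel_stats(img for img, label in samples if label == 0)
```

To avoid leakage, the statistics would be computed on the training portion of each fold only and then applied unchanged to that fold's validation and test images.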
||||
**Test 3: Source-Specific Feature Learning (Source Holdout)**
|
||||
- Compare: Train on all sources vs train with one source held out
|
||||
- Models: **ResNet18**
|
||||
- Data: 20% subsample
|
||||
- 5-fold stratified CV (balanced class distribution)
|
||||
- Resolution: 224×224
|
||||
- No facecrop, no augmentation
|
||||
- Hold out each fake source (text2img, inpainting, insight) separately
|
||||
|
||||
| Experiment | Held-out Source | Train Sources | Held-out AUC | In-Source AUC | Δ (In-Source - Held-out) |
|
||||
|------------|-----------------|---------------|--------------|---------------|--------------------------|
|
||||
| Baseline | None | All | - | ? | - |
|
||||
| Holdout text2img | text2img | wiki, inpainting, insight | ? | ? | ? |
|
||||
| Holdout inpainting | inpainting | wiki, text2img, insight | ? | ? | ? |
|
||||
| Holdout insight | insight | wiki, text2img, inpainting | ? | ? | ? |
|
||||
|
||||
**Interpretation**: If the held-out source AUC is clearly lower than the in-source AUC (Δ above roughly 0.05 to 0.10), the model is learning source-specific features. Likewise, if AUC drops noticeably under matched geometry (Test 1), the model is exploiting aspect ratio as a shortcut. Both must be known before interpreting the resolution or facecrop results.
|
||||
|
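The Δ column of the source-holdout table can be computed as sketched below. The AUC here is the plain rank-based definition (quadratic in the number of samples, fine for a sanity check); the 0.05 default threshold is taken from the lower end of the range above, and both function names are hypothetical. The held-out AUC is assumed to be measured on real images versus fakes from the held-out source only.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for q in neg:
            # Ties count as half a win.
            wins += 1.0 if p > q else 0.5 if p == q else 0.0
    return wins / (len(pos) * len(neg))


def holdout_delta(in_source_auc, held_out_auc, threshold=0.05):
    """Return (delta, flagged); flagged means the drop exceeds the threshold."""
    delta = in_source_auc - held_out_auc
    return delta, delta > threshold
```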
||||
#### 2B: Resolution Impact (no facecrop, no augmentation)
|
||||
- Test: 128×128 vs 224×224
|
||||
- Models: **SimpleCNN, ResNet18**
|
||||
- Data: 20% subsample
|
||||
- 5-fold stratified CV (balanced class distribution)
|
||||
|
||||
| Model | 128×128 AUC | 224×224 AUC | Δ |
|
||||
|-------|-------------|-------------|---|
|
||||
| SimpleCNN | ? | ? | ? |
|
||||
| ResNet18 | ? | ? | ? |
|
||||
|
||||
#### 2C: Facecrop Impact (224×224, no augmentation)
|
||||
- Test: No facecrop vs MTCNN facecrop
|
||||
- Models: **SimpleCNN, ResNet18**
|
||||
- Data: 20% subsample
|
||||
- 5-fold stratified CV (balanced class distribution)
|
||||
|
||||
| Model | No Facecrop AUC | Facecrop AUC | Δ |
|
||||
|-------|-----------------|--------------|---|
|
||||
| SimpleCNN | ? | ? | ? |
|
||||
| ResNet18 | ? | ? | ? |
|
||||
|
||||
#### 2D: Augmentation Impact (224×224, without facecrop)
|
||||
- Test: No augmentation vs augmentation
|
||||
- Models: **SimpleCNN, ResNet18**
|
||||
- Data: 20% subsample
|
||||
- 5-fold stratified CV (balanced class distribution)
|
||||
- **Verify test set has no augmentation** (code inspection of `get_transforms(train=False, ...)`)
|
||||
- **Analyze shortcut removal**: Compare train/val gaps and per-source AUC balance
|
||||
|
||||
| Model | No Aug AUC | With Aug AUC | Δ | Train/Val Gap (No Aug) | Train/Val Gap (With Aug) |
|
||||
|-------|------------|--------------|---|------------------------|--------------------------|
|
||||
| SimpleCNN | ? | ? | ? | ? | ? |
|
||||
| ResNet18 | ? | ? | ? | ? | ? |
|
||||
|
||||
**Experimental question**: Does augmentation without facecrop improve or hurt performance?
|
||||
|
||||
#### 2E: Augmentation + Facecrop Combined (224×224)
|
||||
- Test: Facecrop only vs Facecrop + augmentation
|
||||
- Models: **SimpleCNN, ResNet18**
|
||||
- Data: 20% subsample
|
||||
- 5-fold stratified CV (balanced class distribution)
|
||||
- **Analyze shortcut removal**: Compare train/val gaps and per-source AUC balance
|
||||
|
||||
| Model | Facecrop Only AUC | Facecrop + Aug AUC | Δ | Train/Val Gap (Only) | Train/Val Gap (With Aug) |
|
||||
|-------|-------------------|--------------------|---|----------------------|--------------------------|
|
||||
| SimpleCNN | ? | ? | ? | ? | ? |
|
||||
| ResNet18 | ? | ? | ? | ? | ? |
|
||||
|
||||
**Experimental question**: Does augmentation with facecrop improve or hurt performance compared to facecrop alone?
|
||||
|
||||
**This gives you**:
|
||||
- Isolated impact of each preprocessing choice on SimpleCNN and ResNet18
|
||||
- Verification that the model is not learning shortcuts
|
||||
- Understanding of how augmentation interacts with face cropping
|
||||
- Shortcut removal analysis through train/val gap and per-source AUC metrics
|
||||
|
||||
---
|
||||
|
||||
### Phase 3: Extended Architecture Exploration
|
||||
|
||||
**Goal**: Test additional architectures to find the best performing models
|
||||
|
||||
**Fixed conditions** (based on best findings from Phase 2):
|
||||
- Data: 20% subsample
|
||||
- Resolution: Best from Phase 2B (likely 224×224)
|
||||
- Facecrop: Best from Phase 2C/E (likely Yes)
|
||||
- Augmentation: Best from Phase 2D/E (depends on experimental results)
|
||||
- Optimizer: AdamW (lr=1e-4, weight_decay=1e-4)
|
||||
- Scheduler: CosineAnnealingLR (T_max=15)
|
||||
- Epochs: 15 with early stopping (patience=5)
|
||||
- Batch size: 32
|
||||
- 5-fold stratified cross-validation (balanced class distribution)
|
||||
|
||||
| Model | Params | Rationale |
|-------|--------|-----------|
| ResNet34 | ~21.8M | Deeper ResNet - test if more capacity helps |
| ResNet50 | ~25.6M | Even deeper with bottleneck blocks |
| EfficientNet-B0 | ~5.3M | Efficient compound scaling |
| ConvNeXt-Tiny | ~29M | Modern CNN, different architecture family |
| MobileNetV3-Small | ~2.5M | Lightweight efficiency comparison |
|
||||
|
||||
**This gives you**: Extended architecture exploration to identify top-performing models for Phase 4
|
||||
- ResNet depth progression (18 -> 34 -> 50)
|
||||
- Efficient architectures (EfficientNet-B0, MobileNetV3-Small)
|
||||
- Modern CNN with different inductive bias (ConvNeXt-Tiny)
|
||||
- Size range (2.5M to 29M parameters)
|
||||
|
||||
---
|
||||
|
||||
### Phase 4: Final Analysis on Best Models
|
||||
|
||||
**Goal**: Comprehensive evaluation of top-performing models from Phases 1-3
|
||||
|
||||
**Select top 3-4 models** based on Phase 1-3 results (e.g., ResNet18, ResNet34, EfficientNet-B0, ConvNeXt-Tiny)
|
||||
|
||||
#### 4A: Data Quantity Scaling
|
||||
Test how each best model scales with more data:
|
||||
|
||||
| Model | 20% Data AUC | 50% Data AUC | 100% Data AUC | Δ (100% - 20%) |
|
||||
|-------|--------------|--------------|---------------|----------------|
|
||||
| Model 1 | ? | ? | ? | ? |
|
||||
| Model 2 | ? | ? | ? | ? |
|
||||
| Model 3 | ? | ? | ? | ? |
|
||||
| Model 4 | ? | ? | ? | ? |
|
||||
|
||||
**Fixed conditions**:
|
||||
- Resolution: Best from Phase 2B
|
||||
- Facecrop: Best from Phase 2C/E
|
||||
- Augmentation: Best from Phase 2D/E
|
||||
- 5-fold stratified cross-validation (balanced class distribution)
|
||||
|
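For the scaling comparison to be clean, the 20% subset should be contained in the 50% subset, which is in turn contained in the full set. A minimal sketch of such nested, class-stratified subsampling follows; the function name is hypothetical, and the sketch is not group-aware, so in practice it would be applied to basename groups rather than to individual images.

```python
import random
from collections import defaultdict


def nested_subsamples(labels, fractions=(0.2, 0.5, 1.0), seed=42):
    """Return {fraction: sorted indices}, stratified by class and nested.

    Each class is shuffled once with a fixed seed; every fraction takes a
    prefix of that order, so smaller subsets are contained in larger ones.
    """
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    rng = random.Random(seed)
    for y in sorted(by_class):
        rng.shuffle(by_class[y])
    subsets = {}
    for frac in fractions:
        chosen = []
        for y in sorted(by_class):
            k = max(1, round(frac * len(by_class[y])))
            chosen.extend(by_class[y][:k])
        subsets[frac] = sorted(chosen)
    return subsets
```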
||||
#### 4B: Comprehensive Evaluation on Full Dataset
|
||||
- Train best models on **full dataset** (100%)
|
||||
- Detailed per-source metrics (text2img, inpainting, insight)
|
||||
- Grad-CAM visualizations for explainability
|
||||
- Hard example analysis (false positives/negatives)
|
||||
- Confidence distribution analysis
|
||||
- Cross-validation results (mean ± std)
|
||||
|
||||
**This gives you**: Final, comprehensive evaluation of the best models with full explainability
|
||||
|
||||
---
|
||||
|
||||
### Notebooks and Analysis
|
||||
|
||||
**Goal**: Use Jupyter notebooks for comprehensive analysis and validation of each phase
|
||||
|
||||
#### **01_eda.ipynb** - Exploratory Data Analysis
|
||||
- Dataset overview (real vs fake distribution, sources)
|
||||
- Image resolution/aspect ratio analysis (identify potential shortcuts)
|
||||
- Color distribution analysis (identify potential shortcuts)
|
||||
- Sample visualization from each source (text2img, inpainting, insight, wiki)
|
||||
- Statistical summary of the dataset
|
||||
- Data quality checks
|
||||
|
||||
#### **02_preprocessing.ipynb** - Preprocessing Pipeline
|
||||
- Square crop and resize implementation demonstration
|
||||
- Face crop (MTCNN) demonstration and effectiveness analysis
|
||||
- Augmentation pipeline visualization (before/after examples)
|
||||
- Z-score normalization comparison (ImageNet vs real-image-only)
|
||||
- Data split verification (group-aware by basename, no overlap)
|
||||
- Preprocessing impact visualization
|
||||
|
||||
#### **03_phase1_analysis.ipynb** - Phase 1: Architecture Baseline
|
||||
- SimpleCNN vs ResNet18 comparison
|
||||
- 5-fold stratified CV results (mean ± std with confidence intervals)
|
||||
- Per-source metrics for each model (text2img, inpainting, insight)
|
||||
- Train/val/test performance curves across epochs
|
||||
- Confusion matrices for each model
|
||||
- Statistical significance testing between models
|
||||
- Grad-CAM visualizations for both models (10-20 images each)
|
||||
- Conclusions: Which baseline is better and why
|
||||
|
||||
#### **04_phase2_analysis.ipynb** - Phase 2: Preprocessing Impact
|
||||
- **2A**: Shortcut analysis (resolution/ratio + color distribution + source holdout)
- **2B**: Resolution impact (128×128 vs 224×224)
- **2C**: Facecrop impact
|
||||
- **2D**: Augmentation impact (without facecrop)
|
||||
- **2E**: Augmentation + facecrop combined
|
||||
|
||||
For each experiment:
|
||||
- 5-fold CV results (mean ± std with confidence intervals)
|
||||
- Per-source metrics (text2img, inpainting, insight)
|
||||
- Statistical significance testing vs baseline
|
||||
- Comparison tables across all Phase 2 experiments
|
||||
- Grad-CAM visualizations (10-20 images per condition)
|
||||
- Analysis of train/val gap changes
|
||||
- Analysis of per-source AUC variance changes
|
||||
|
||||
**Overall Phase 2 conclusions**:
|
||||
- Which preprocessing choices work best and why
|
||||
- Are shortcuts being learned (resolution, color distribution)?
|
||||
- Does augmentation remove shortcuts or over-regularize?
|
||||
- Recommendations for Phase 3 (best preprocessing settings)
|
||||
|
||||
#### **05_phase3_analysis.ipynb** - Phase 3: Extended Architecture Exploration
|
||||
- ResNet34, ResNet50, EfficientNet-B0, ConvNeXt-Tiny, MobileNetV3-Small
|
||||
- 5-fold CV results (mean ± std) for each model
|
||||
- Per-source metrics for each model
|
||||
- Comparison with Phase 1 baselines (ResNet18, SimpleCNN)
|
||||
- Statistical significance testing vs baselines
|
||||
- Grad-CAM visualizations for top models (10-20 images each)
|
||||
- Parameter count vs performance analysis
|
||||
- Conclusions: Which architectures work best and why
|
||||
|
||||
#### **06_phase4_analysis.ipynb** - Phase 4: Final Analysis
|
||||
- **4A**: Data quantity scaling (20%, 50%, 100%) on top 3-4 models
|
||||
- **4B**: Comprehensive evaluation on full dataset
|
||||
- Detailed per-source metrics for final models
|
||||
- Grad-CAM visualizations for final models (10-20 images each)
|
||||
- Hard example analysis (false positives/negatives) with visualizations
|
||||
- Confidence distribution analysis (histograms)
|
||||
- Cross-validation results (mean ± std with confidence intervals)
|
||||
- Final model comparison and selection
|
||||
- Conclusions and recommendations
|
||||
|
||||
#### **07_gradcam_deep_dive.ipynb** - Grad-CAM Deep Dive (optional)
|
||||
- Comprehensive Grad-CAM analysis across all phases and models
|
||||
- Feature visualization for different model architectures (ResNet vs EfficientNet vs ConvNeXt)
|
||||
- Comparison of what different models focus on (face regions, backgrounds, artifacts)
|
||||
- Evidence of shortcut removal (or lack thereof) across phases
|
||||
- Cross-condition analysis: does model attention change with different preprocessing?
|
||||
- Visual explanations suitable for presentation
|
||||
|
||||
**Notebook requirements**:
|
||||
- Each notebook should be self-contained and reproducible
|
||||
- Include statistical analysis with confidence intervals
|
||||
- Generate publication-ready visualizations
|
||||
- Address all experimental questions and hypotheses
|
||||
- Provide clear conclusions for each phase
|
||||
- Use consistent formatting and style across all notebooks
|
||||
- Save all results (metrics, figures, tables) for easy reference
|
||||
|
||||
---
|
||||
|
||||
## Key Improvements
|
||||
|
||||
### 1. Stratified Cross-Validation Implementation
|
||||
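```python
# Use sklearn's StratifiedKFold to ensure balanced class distribution across folds
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Placeholder data so the snippet runs; replace with the real samples and labels.
X = np.arange(20).reshape(-1, 1)
y = np.array([0, 1] * 10)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_metrics = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # Train on train_idx, validate on val_idx
    # Store metrics per fold
    fold_metrics.append({"fold": fold, "n_train": len(train_idx), "n_val": len(val_idx)})
```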
|
||||
|
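`StratifiedKFold` balances classes but does not by itself keep images that share a basename in the same fold, which the leakage checks elsewhere in this plan require. scikit-learn provides `StratifiedGroupKFold` for that purpose. The library-free sketch below only illustrates the idea with a simple greedy heuristic; the function name and the heuristic are assumptions, not the project's implementation.

```python
from collections import defaultdict


def stratified_group_folds(labels, groups, n_splits=5):
    """Assign each sample to a fold so that no group is split across folds.

    Greedy heuristic: place the largest groups first, each into the fold that
    currently holds the fewest samples of that group's majority class
    (ties broken by total fold size, then fold index).
    """
    members = defaultdict(list)
    for idx, g in enumerate(groups):
        members[g].append(idx)

    fold_class = [defaultdict(int) for _ in range(n_splits)]
    fold_size = [0] * n_splits
    fold_of = [None] * len(labels)

    ordered = sorted(members.items(), key=lambda kv: (-len(kv[1]), str(kv[0])))
    for _, indices in ordered:
        counts = defaultdict(int)
        for i in indices:
            counts[labels[i]] += 1
        major = max(counts, key=lambda c: (counts[c], str(c)))
        best = min(range(n_splits),
                   key=lambda f: (fold_class[f][major], fold_size[f], f))
        for i in indices:
            fold_of[i] = best
            fold_class[best][labels[i]] += 1
        fold_size[best] += len(indices)
    return fold_of
```

Here `groups` would hold the basename of each image, so a real image and the fakes derived from it always land in the same fold.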
||||
### 2. Augmentation Shortcut Removal Analysis (Phase 2D/2E)
|
||||
Track these metrics with and without augmentation (the values below are illustrative, not measured):
|
||||
|
||||
| Metric | Without Aug | With Aug | Interpretation |
|
||||
|--------|-------------|----------|----------------|
|
||||
| Train AUC | 0.99 | 0.95 | ↓ Expected |
|
||||
| Val AUC | 0.90 | 0.89 | ↓ Slight |
|
||||
| **Train/Val Gap** | **0.09** | **0.06** | **↓ Good!** |
|
||||
| text2img AUC | 0.98 | 0.96 | ↓ Slight |
|
||||
| InsightFace AUC | 0.82 | 0.85 | **↑ Good!** |
|
||||
| **AUC Variance** | **0.08** | **0.06** | **↓ Good!** |
|
||||
|
||||
**Interpretation**: If train/val gap ↓ AND per-source AUC variance ↓, augmentation is removing shortcuts.
|
||||
|
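The decision rule above can be written down directly. In the sketch below the per-source spread is the population standard deviation across sources, which reproduces the illustrative 0.08 in the table for the two listed sources; whether the final report uses standard deviation or variance is a choice left open here, and both function names are hypothetical.

```python
from statistics import pstdev


def shortcut_summary(train_auc, val_auc, per_source_auc):
    """Return (train/val gap, spread of per-source AUCs)."""
    gap = train_auc - val_auc
    values = list(per_source_auc.values())
    spread = pstdev(values) if len(values) > 1 else 0.0
    return gap, spread


def augmentation_removes_shortcuts(without_aug, with_aug):
    """True when both the gap and the per-source spread shrink with augmentation.

    Each argument is a (train_auc, val_auc, per_source_auc_dict) tuple.
    """
    gap_a, spread_a = shortcut_summary(*without_aug)
    gap_b, spread_b = shortcut_summary(*with_aug)
    return gap_b < gap_a and spread_b < spread_a
```

If only one of the two quantities shrinks, the result is ambiguous and the Grad-CAM maps are the next place to look.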
||||
### 3. Consistent Hyperparameters
|
||||
- Same learning rate for all models (1e-4 is safe for pretrained backbones; SimpleCNN, trained from scratch, may need adjustment)
|
||||
- Same epochs, ES patience, batch size
|
||||
- Only vary the architecture being tested
|
||||
|
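Early stopping only counts as a shared condition if every model uses the same rule. A minimal sketch of a patience-based rule on validation AUC follows; the class name, the `min_delta` parameter, and the choice of AUC as the monitored metric are assumptions for illustration.

```python
class EarlyStopping:
    """Stop when the validation metric has not improved for `patience` epochs."""

    def __init__(self, patience: int = 5, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = None
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch: int, val_auc: float) -> bool:
        """Record one epoch; return True if training should stop."""
        if self.best is None or val_auc > self.best + self.min_delta:
            self.best, self.best_epoch, self.bad_epochs = val_auc, epoch, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

The checkpoint from `best_epoch` is the one that would be evaluated on the held-out test fold.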
||||
### 4. Test Set Integrity and Reproducibility
|
||||
|
||||
**Test set from original source**:
|
||||
- Verify that test set uses original images with minimal preprocessing
|
||||
- Test set should use `get_transforms(train=False, ...)` to disable augmentation
|
||||
- Ensure test images are not preprocessed in a way that could affect model comparisons
|
||||
|
||||
**Reproducible splits across models**:
|
||||
- The code already uses `cfg.get("seed", 42)` for reproducible splits
|
||||
- All experiments should use the same seed (42) to ensure identical train/val/test splits
|
||||
- This ensures fair comparison between models
|
||||
|
||||
**Central config for shared parameters**:
|
||||
- Create a central config file (`classifier/configs/shared.json`) with parameters common across all phases
|
||||
- This includes: seed, val_ratio, test_ratio, batch_size, optimizer settings, etc.
|
||||
- Individual experiment configs can override these defaults
|
||||
|
||||
Example shared config:
|
||||
```json
{
  "seed": 42,
  "val_ratio": 0.1,
  "test_ratio": 0.1,
  "batch_size": 32,
  "optimizer": {
    "type": "adamw",
    "lr": 1e-4,
    "weight_decay": 1e-4
  },
  "scheduler": {
    "type": "cosine_annealing",
    "T_max": 15
  },
  "early_stopping_patience": 5,
  "num_workers": 4
}
```
|
||||
|
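One way experiment configs could inherit these defaults is a recursive merge in which the experiment file overrides only the keys it names. The sketch below assumes both files are plain JSON; `deep_merge` and `load_experiment_config` are hypothetical names, not existing project functions.

```python
import copy
import json


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict: `override` wins, nested dicts are merged key by key."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def load_experiment_config(shared_path: str, experiment_path: str) -> dict:
    """Load the shared defaults, then apply the experiment's overrides."""
    with open(shared_path) as f:
        shared = json.load(f)
    with open(experiment_path) as f:
        experiment = json.load(f)
    return deep_merge(shared, experiment)
```

An experiment file containing only `{"optimizer": {"lr": 1e-3}}` would then change the learning rate while keeping the shared optimizer type and weight decay.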
||||
---
|
||||
|
||||
## Summary Table for Report
|
||||
|
||||
| Phase | Variable Tested | Models | Data | Resolution | Facecrop | Augment | CV |
|-------|-----------------|--------|------|------------|----------|---------|----|
| 1 | Architecture Baseline | SimpleCNN, ResNet18 | 20% | 128 | No | No | 5-fold stratified |
| 2A | Shortcut Analysis | ResNet18 | 20% | 224 | No | No | 5-fold stratified |
| 2A-Holdout | Source Holdout | ResNet18 | 20% | 224 | No | No | 5-fold stratified |
| 2B | Resolution | SimpleCNN, ResNet18 | 20% | 128/224 | No | No | 5-fold stratified |
| 2C | Facecrop | SimpleCNN, ResNet18 | 20% | 224 | ± | No | 5-fold stratified |
| 2D | Augmentation (no facecrop) | SimpleCNN, ResNet18 | 20% | 224 | No | ± | 5-fold stratified |
| 2E | Augmentation + Facecrop | SimpleCNN, ResNet18 | 20% | 224 | Yes | ± | 5-fold stratified |
| 3 | Extended Architectures | ResNet34, ResNet50, EffNet-B0, ConvNeXt-Tiny, MobileNetV3-Small | 20% | Best | Best | Best | 5-fold stratified |
| 4A | Data Quantity | Top 3-4 models | 20/50/100% | Best | Best | Best | 5-fold stratified |
| 4B | Final Evaluation | Top 3-4 models | 100% | Best | Best | Best | 5-fold stratified |
|
||||
|
||||
This structure gives you:
|
||||
- ✅ Identical comparison conditions within each phase
|
||||
- ✅ 5-fold stratified cross-validation with confidence intervals (ensures balanced class distribution)
|
||||
- ✅ Same 2 baseline models (SimpleCNN, ResNet18) tested across all preprocessing variations (Phase 2)
|
||||
- ✅ Shortcut analysis to verify no bias (Phase 2A)
|
||||
- ✅ Experimental questions about augmentation impact (Phase 2D/2E)
|
||||
- ✅ Shortcut removal analysis via train/val gap and per-source AUC metrics
|
||||
- ✅ Facecrop tested on baseline models (Phase 2C)
|
||||
- ✅ Extended architecture exploration with proven models (Phase 3)
|
||||
- ✅ Final comprehensive analysis on best models (Phase 4)
|
||||
- ✅ Data quantity scaling on multiple best models (Phase 4A)
|
||||
- ✅ Clear, isolated variables per phase
|
||||
- ✅ Explainable progression for report
|
||||
|
||||
**Key Experimental Questions in Phase 2**:
|
||||
- **2A (Shortcut Analysis)**: Is the model learning any shortcuts (e.g., resolution differences or aspect ratios)?
|
||||
- **2D (Augmentation without facecrop)**: Does augmentation improve or hurt performance?
|
||||
- **2E (Augmentation with facecrop)**: Does augmentation improve or hurt performance compared to facecrop alone?
|
||||
@@ -1,279 +0,0 @@
|
||||
# Generator Plan
|
||||
|
||||
The assignment rewards *iterative improvement with intermediate results*. This plan is structured around **model evolution as the spine**: each step has a *because* tied to an observed failure of the previous step. Pipeline ablations are honest but de-emphasized — they clear the table for the real story.
|
||||
|
||||
---
|
||||
|
||||
## Standard Settings (Applied Everywhere Unless Noted)
|
||||
|
||||
| Setting | Value | Reason |
|
||||
|---------|-------|--------|
|
||||
| Batch size | 64 | Consistent across experiments |
|
||||
| Mixed precision | float16 + GradScaler | Speed |
|
||||
| EMA decay | 0.9999 | Sample from EMA weights for GANs |
|
||||
| FID evaluation | Every 25 epochs | Objective quality tracking |
|
||||
| FID n_real | 5000 | Held-out real images |
|
||||
| Default epochs | 100 | Best-of-each in Phase 5 retrains to 200 |
|
||||
|
||||
Per-model optimizer/hyperparameters are listed inside each phase.
|
||||
|
||||
---
|
||||
|
||||
## Phase 1 — Pipeline Selection *(quick, one figure)*
|
||||
|
||||
**Goal**: Pick the data pipeline used for every downstream experiment. Don't dwell here — this is clearing the table, not the story.
|
||||
|
||||
Fixed model: **DCGAN at 64×64** (cheapest baseline, fast iteration). One variable per experiment.
|
||||
|
||||
| Experiment | Variable | Variants | Decision |
|
||||
|---|---|---|---|
|
||||
| 1A | Resolution | 64×64 vs 128×128 | Pick by FID — assumed transferable |
|
||||
| 1B | Face crop + alignment | Full image vs MTCNN-aligned | Pick by FID — assumed transferable |
|
||||
| 1C | Augmentation | H-flip only vs H-flip + rotation ±5° + mild color jitter | Per-family: validate inside Phase 2 for GAN, default to H-flip-only for VAE/DDPM |
|
||||
| 1D | Combined dataset | Aligned only vs aligned + raw mixed | Pick by FID — expected to underperform aligned-only |
|
||||
|
||||
**Caveat on transferability**: Phase 1 uses DCGAN as a proxy to choose the pipeline cheaply, then assumes the choice transfers to VAE and DDPM. Resolution and alignment are largely architecture-invariant (more pixels help everyone; structural consistency helps any spatial prior). Augmentation is *not* — diffusion models benefit less from aug, and MSE-VAE may even be hurt by color jitter. So 1C is treated as an **indicative** result for GANs and re-checked per family rather than baked in globally.
|
||||
|
||||
**1D: combined dataset rationale**: Mixing aligned + raw increases the variation the generator must model (faces at any position and scale alongside faces in a fixed position) and dilutes the geometric prior. Hypothesis: combined < aligned-only. Cheap to test (one extra DCGAN run). Included for completeness so the report shows we considered it rather than asserting it.
|
||||
|
||||
**MTCNN alignment** (one-time preprocessing, cached to disk):
|
||||
|
||||
```python
from facenet_pytorch import MTCNN
from skimage.transform import SimilarityTransform, warp
import numpy as np
import torch
from PIL import Image

# Fall back to CPU when no GPU is available.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
mtcnn = MTCNN(keep_all=False, device=device)

REF_LANDMARKS = np.array([  # reference positions in 128×128
    [38.0, 51.0],  # left eye
    [90.0, 51.0],  # right eye
    [64.0, 71.0],  # nose
    [45.0, 95.0],  # left mouth
    [83.0, 95.0],  # right mouth
], dtype=np.float32)


def align_face(img: Image.Image, out_size: int = 128):
    boxes, _, landmarks = mtcnn.detect(img, landmarks=True)
    if boxes is None:
        return None  # no face detected
    # Scale the 128×128 reference landmarks to the requested output size.
    ref = REF_LANDMARKS * (out_size / 128.0)
    tform = SimilarityTransform()
    tform.estimate(landmarks[0], ref)
    # warp() expects the output-to-input mapping, hence tform.inverse.
    aligned = warp(np.array(img), tform.inverse,
                   output_shape=(out_size, out_size),
                   order=3, preserve_range=True)
    # Bicubic interpolation can overshoot [0, 255]; clip before casting.
    return Image.fromarray(np.clip(aligned, 0, 255).astype(np.uint8))
```
|
||||
|
||||
**Augmentation philosophy** — only structure-preserving transforms (face-aligned crops are consistent by design):
|
||||
|
||||
| Transform | Apply? | Reason |
|
||||
|---|---|---|
|
||||
| Horizontal flip | Yes, p=0.5 | Faces are symmetric |
|
||||
| Rotation | Yes, ±5° | Residual head tilt post-alignment |
|
||||
| Color jitter | Yes, mild | brightness ±0.1, contrast ±0.1, saturation ±0.05 |
|
||||
| Translation | No | Breaks alignment |
|
||||
| Vertical flip | No | Meaningless for faces |
|
||||
| Strong blur / noise | No | Teaches the model to generate blur |
|
||||
|
||||
**Output**: ~1 page in the report. Best pipeline carries forward to all phases.
|
||||
|
||||
---
|
||||
|
||||
## Phase 2 — GAN Evolution *(main spine)*
|
||||
|
||||
**Goal**: The richest narrative — each step has a clear *because* from observed failure. This is the strongest part of the storyline; keep it front and center.
|
||||
|
||||
Best pipeline from Phase 1 fixed throughout.
|
||||
|
||||
---
|
||||
|
||||
### 2.1 — DCGAN *(baseline)*
|
||||
|
||||
Simplest GAN baseline. BCE loss, no gradient penalty.
|
||||
|
||||
- Adam β1=0.5, β2=0.999, lr=2e-4
|
||||
- ngf=ndf=64, latent_dim=100
|
||||
- Resolution: 64×64
|
||||
|
||||
**Expected failure**: mode collapse, training instability, oscillating losses. Document these explicitly — they motivate 2.2.
|
||||
|
||||
---
|
||||
|
||||
### 2.2 — WGAN-GP
|
||||
|
||||
**Because**: DCGAN showed mode collapse and instability → Wasserstein loss + gradient penalty.
|
||||
|
||||
- Adam β1=0.0, β2=0.9, lr_g=lr_d=1e-4
|
||||
- ngf=ndf=64, latent_dim=128, n_critic=2, gp_lambda=10
|
||||
- Resolution: 64×64
|
||||
|
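For reference, the standard WGAN-GP objectives that these hyperparameters plug into are written out below, with λ = `gp_lambda` = 10 and the critic updated `n_critic` = 2 times per generator step. The symbols are notation introduced here, not taken from the project code: D is the critic, G the generator, x a real image, x̃ = G(z) a generated one, and x̂ a random interpolate between them.

```latex
\hat{x} = \epsilon\, x + (1 - \epsilon)\, \tilde{x}, \qquad \epsilon \sim U[0, 1]

L_D = \mathbb{E}_{\tilde{x}}\big[ D(\tilde{x}) \big] - \mathbb{E}_{x}\big[ D(x) \big]
      + \lambda\, \mathbb{E}_{\hat{x}}\Big[ \big( \lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1 \big)^2 \Big]

L_G = -\, \mathbb{E}_{\tilde{x}}\big[ D(\tilde{x}) \big]
```

The penalty term pushes the critic's gradient norm toward 1 at the interpolated points, which is the soft Lipschitz constraint that replaces weight clipping.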
||||
**Expected**: more stable training, better diversity. Likely remaining issues: texture artifacts, limited global coherence at higher resolution.
|
||||
|
||||
---
|
||||
|
||||
### 2.3 — WGAN-GP + Spectral Norm + GroupNorm + Self-Attention
|
||||
|
||||
**Because**: WGAN-GP showed texture artifacts / limited coherence → principled Lipschitz constraint and long-range dependencies.
|
||||
|
||||
- Generator: BatchNorm → GroupNorm (no batch-size coupling)
|
||||
- Critic: InstanceNorm → Spectral Normalization (principled Lipschitz constraint)
|
||||
- Self-attention at 16×16 in both generator and critic
|
||||
|
||||
```python
import torch
import torch.nn as nn


class SelfAttention(nn.Module):
    """Self-attention over spatial positions with a learned residual gate."""

    def __init__(self, in_ch):
        super().__init__()
        mid = max(in_ch // 8, 1)  # reduced channel width for queries and keys
        self.q = nn.Conv2d(in_ch, mid, 1, bias=False)
        self.k = nn.Conv2d(in_ch, mid, 1, bias=False)
        self.v = nn.Conv2d(in_ch, in_ch, 1, bias=False)
        self.gamma = nn.Parameter(torch.zeros(1))  # starts at 0: block is initially an identity
        self._mid = mid

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).view(b, self._mid, -1).transpose(-2, -1)  # (b, hw, mid)
        k = self.k(x).view(b, self._mid, -1)                    # (b, mid, hw)
        v = self.v(x).view(b, c, -1)                            # (b, c, hw)
        # attn[i, j]: weight that query position i places on key position j
        attn = torch.softmax(q @ k * self._mid ** -0.5, dim=-1)
        return x + self.gamma * (v @ attn.transpose(-2, -1)).view(b, c, h, w)
```
|
||||
|
||||
---
|
||||
|
||||
### 2.4 — Scale to 128×128 *(if 2.3 looks coherent at 64×64)*
|
||||
|
||||
**Because**: 2.3 produces coherent samples at 64×64 → does the architecture hold up at higher resolution?
|
||||
|
||||
Same architecture as 2.3, retrained at 128×128. Add attention at 32×32 if memory permits.
|
||||
|
||||
---
|
||||
|
||||
### Phase 2 Results
|
||||
|
||||
| Step | Model | FID @ 100ep ↓ | Main observed failure | Motivates next step |
|
||||
|---|---|---|---|---|
|
||||
| 2.1 | DCGAN | ? | ? | ? |
|
||||
| 2.2 | WGAN-GP | ? | ? | ? |
|
||||
| 2.3 | WGAN-GP + SN + Attn | ? | ? | ? |
|
||||
| 2.4 | + 128×128 | ? | ? | — |
|
||||
|
||||
For each step: FID curve, 16-sample grid, one paragraph on what failed and why the next change addresses it.
|
||||
|
||||
---
|
||||
|
||||
## Phase 3 — VAE Track
|
||||
|
||||
**Goal**: A self-contained evolution story for the likelihood-based family. Every step motivated by a known limitation of the previous.
|
||||
|
||||
| Step | Model | Because |
|
||||
|---|---|---|
|
||||
| 3.1 | Vanilla VAE (MSE) | Baseline — expect blur |
|
||||
| 3.2 | + Perceptual loss (VGG) | MSE blur is fundamental to pixel-space reconstruction |
|
||||
| 3.3 | + PatchGAN discriminator (VQGAN-lite) | Perceptual loss still lacks local texture realism |
|
||||
|
||||
**3.1 — Vanilla VAE**: Adam lr=1e-3, latent_dim=256, β=1.0. Plain convolutional encoder/decoder, MSE reconstruction.
|
||||
|
||||
**3.2 — Perceptual loss**: VGG-16 feature matching at relu1_2, relu2_2, relu3_3.
|
||||
|
||||
**3.3 — Patch discriminator**: PatchGAN adversarial loss targeting local texture realism.
|
||||
|
||||
```
|
||||
L = L_mse + λ_perc·L_vgg + λ_adv·L_adv + β·L_kl
|
||||
λ_perc=0.1, λ_adv=0.1, β=0.0001
|
||||
```
|
||||
|
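As a sanity check on the weighting, the KL term for a diagonal Gaussian posterior and the combined objective can be sketched in plain Python. This assumes the usual encoder outputs `mu` and `logvar` and a standard normal prior; the function names are illustrative and the other loss terms are passed in as precomputed scalars.

```python
import math


def kl_diag_gaussian(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))


def total_loss(l_mse, l_vgg, l_adv, l_kl, lam_perc=0.1, lam_adv=0.1, beta=1e-4):
    """Weighted sum of the four terms, using the weights listed above as defaults."""
    return l_mse + lam_perc * l_vgg + lam_adv * l_adv + beta * l_kl
```

With β = 0.0001 the KL term contributes very little to the total, so the latent space is only weakly regularized toward the prior; that is worth keeping in mind when sampling from N(0, I) at evaluation time.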
||||
**Decoder fix** (applied from 3.1 onward): replace `ConvTranspose2d` with `Upsample(nearest) + Conv2d` to avoid the checkerboard artifacts that transposed convolutions tend to produce.
|
||||
|
||||
| Step | Model | FID ↓ | Main observed failure |
|
||||
|---|---|---|---|
|
||||
| 3.1 | VAE MSE | ? | ? |
|
||||
| 3.2 | + Perceptual | ? | ? |
|
||||
| 3.3 | + PatchGAN | ? | ? |
|
||||
|
||||
---
|
||||
|
||||
## Phase 4 — DDPM Track
|
||||
|
||||
**Goal**: A self-contained evolution story for the diffusion family.
|
||||
|
||||
| Step | Model | Because |
|
||||
|---|---|---|
|
||||
| 4.1 | DDPM linear + ε-pred | Baseline |
|
||||
| 4.2 | + cosine schedule | Linear schedule destroys information too quickly, so the late (high-noise) timesteps contribute little |
|
||||
| 4.3 | + v-prediction | ε-prediction is unstable across the full trajectory |
|
||||
| 4.4 | + wider U-Net / more attention | If 4.3 still underfits |
|
||||
|
||||
**4.1 — Baseline**: AdamW lr=2e-4, base_ch=128, T=1000, attention at 8×8 and 16×16. DDIM sampling, 100 steps.
|
||||
|
||||
**4.2 — Cosine schedule**:
|
||||
|
||||
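```python
import math

import torch


def cosine_betas(T: int, s: float = 0.008) -> torch.Tensor:
    """Cosine noise schedule; returns a tensor of T betas."""
    t = torch.linspace(0, T, T + 1)
    # Cumulative signal level alpha_bar(t), normalized so that alpha_bar(0) = 1.
    f = torch.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    alpha_bar = f / f[0]
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]
    return betas.clamp(0, 0.999)  # cap betas to avoid a singularity near t = T
```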
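
To illustrate why the schedule matters, the sketch below compares the cumulative signal level ᾱ_t of the two schedules halfway through the trajectory; the cosine schedule retains far more signal at that point. It is written in plain Python so it runs without PyTorch. The linear endpoints β = 1e-4 and β = 0.02 are the common DDPM defaults and are an assumption here, since this plan does not state them; the function names are hypothetical.

```python
import math


def linear_alpha_bar(T: int, beta_start: float = 1e-4, beta_end: float = 0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed endpoints)."""
    alpha_bar, prod = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        prod *= 1.0 - beta
        alpha_bar.append(prod)
    return alpha_bar


def cosine_alpha_bar(T: int, s: float = 0.008):
    """alpha_bar_t for t = 1..T under the cosine schedule, normalized by f(0)."""
    def f(t: int) -> float:
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(1, T + 1)]


T = 1000
lin_ab, cos_ab = linear_alpha_bar(T), cosine_alpha_bar(T)
mid = T // 2 - 1  # list index of timestep t = T/2
print(f"alpha_bar at t=T/2: linear={lin_ab[mid]:.3f}, cosine={cos_ab[mid]:.3f}")
```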

**4.3 — v-prediction**: replaces the ε target with `v = √ᾱ·ε − √(1−ᾱ)·x₀`.

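
The v target and its inverse can be sketched as below, assuming image batches of shape `(B, C, H, W)` and one ᾱ value per sample. The helper names `v_target` and `x0_from_v` are hypothetical. The round trip at the end checks the identity x₀ = √ᾱ·x_t − √(1−ᾱ)·v, which follows from the forward process x_t = √ᾱ·x₀ + √(1−ᾱ)·ε.

```python
import torch


def _coeffs(alpha_bar_t: torch.Tensor):
    """Return sqrt(alpha_bar) and sqrt(1 - alpha_bar), broadcastable over (B, C, H, W)."""
    a = alpha_bar_t.sqrt().view(-1, 1, 1, 1)
    b = (1 - alpha_bar_t).sqrt().view(-1, 1, 1, 1)
    return a, b


def v_target(x0, eps, alpha_bar_t):
    """Training target: v = sqrt(alpha_bar) * eps - sqrt(1 - alpha_bar) * x0."""
    a, b = _coeffs(alpha_bar_t)
    return a * eps - b * x0


def x0_from_v(x_t, v, alpha_bar_t):
    """Invert the parameterization: x0 = sqrt(alpha_bar) * x_t - sqrt(1 - alpha_bar) * v."""
    a, b = _coeffs(alpha_bar_t)
    return a * x_t - b * v


x0 = torch.randn(4, 3, 8, 8)
eps = torch.randn_like(x0)
alpha_bar_t = torch.tensor([0.9, 0.5, 0.1, 0.01])
a, b = _coeffs(alpha_bar_t)
x_t = a * x0 + b * eps  # forward diffusion sample
x0_rec = x0_from_v(x_t, v_target(x0, eps, alpha_bar_t), alpha_bar_t)
```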
**4.4 — Wider U-Net**: base_ch 128 → 192, attention at 8×8, 16×16, and 32×32.

| Step | Model | FID ↓ | Main observed failure |
|---|---|---|---|
| 4.1 | DDPM linear + ε | ? | ? |
| 4.2 | + cosine | ? | ? |
| 4.3 | + v-pred | ? | ? |
| 4.4 | + wider | ? | ? |

---
## Phase 5 — Cross-Family Comparison

**Goal**: Side-by-side comparison of the best model from each family (2.4, 3.3, 4.4) under identical conditions.

The best model from each family is retrained for 200 epochs at the same resolution and with the same data pipeline.

### 5A — Quantitative

| Model | FID ↓ | IS ↑ | LPIPS diversity ↑ | Params | Train time |
|---|---|---|---|:---:|:---:|
| Best GAN (2.4) | ? | ? | ? | ? | ? |
| Best VAE (3.3) | ? | ? | ? | ? | ? |
| Best DDPM (4.4) | ? | ? | ? | ? | ? |

### 5B — Qualitative

- **Visual grids**: 16-image sample grids per finalist
- **Progression**: epoch 10 → 50 → 100 → 200 side by side
- **Latent interpolation**: smooth transitions between two latent codes (GAN, VAE)
- **Diversity**: average pairwise LPIPS distance across 100 generated images
- **Failure modes**: worst generated images per model

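
The diversity metric above averages a distance over every unordered pair of generated images, which is 4,950 pairs for 100 images. A minimal sketch, assuming the distance is supplied as a callable: in practice `dist` would wrap an LPIPS model, and the toy absolute-difference distance below is only a stand-in. The function name `mean_pairwise_distance` is hypothetical.

```python
from itertools import combinations
from typing import Callable, Sequence, TypeVar

S = TypeVar("S")


def mean_pairwise_distance(samples: Sequence[S], dist: Callable[[S, S], float]) -> float:
    """Average dist(a, b) over all unordered pairs of distinct samples."""
    pairs = list(combinations(samples, 2))
    if not pairs:
        raise ValueError("need at least two samples")
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


# Toy stand-in for LPIPS: absolute difference between scalar "images".
toy_samples = [0.0, 1.0, 3.0]
toy_score = mean_pairwise_distance(toy_samples, lambda a, b: abs(a - b))
```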
---

## Compute Budget Notes

Three families × multiple steps is a lot of runs. If compute is tight:

- **Keep the GAN track complete** (2.1 → 2.4): it carries the strongest narrative.
- **VAE and DDPM can drop the last step each** (stop at 3.2 and 4.3) without hurting the story; the Phase 5 finalists then become 3.2 and 4.3.
- Phase 1 ablations can use 50 epochs instead of 100, since pipeline deltas show up early.

---
## Summary

| Phase | Purpose | Models | Output |
|---|---|---|---|
| 1 | Pipeline selection | DCGAN @ 64×64 across data variants | Best pipeline |
| 2 | GAN evolution (main spine) | DCGAN → WGAN-GP → +SN+Attn → 128×128 | GAN failure→fix narrative |
| 3 | VAE evolution | VAE → +Perceptual → +PatchGAN | VAE failure→fix narrative |
| 4 | DDPM evolution | DDPM → cosine → v-pred → wider | DDPM failure→fix narrative |
| 5 | Cross-family comparison | Best of each, retrained 200ep | Final FID + IS + qualitative |

**The narrative**: baseline fails in a specific way → fix targets that failure → new failure emerges → next fix targets that → repeat per family → compare families on equal footing.