# Deepfake Detection Classifier - Implementation Plan

## Overview

This document provides a comprehensive implementation plan for refactoring the deepfake detection classifier project. Each task includes a checkbox to track completion.

---

## Phase 0: Pre-Implementation Setup

### Infrastructure and Configuration

- [x] Create `classifier/configs/shared.json` with shared parameters:
  - seed: 42
  - val_ratio: 0.1
  - test_ratio: 0.1
  - batch_size: 32
  - optimizer: {type: "adamw", lr: 1e-4, weight_decay: 1e-4}
  - scheduler: {type: "cosine_annealing", T_max: 15}
  - early_stopping_patience: 5
  - num_workers: 4
  - cv_folds: 5
  - data_dir: "data"
  - face_crop_margin: 0.6

- [x] Implement config loading/merging so experiment configs inherit `shared.json` defaults and override only the variables under test
- [x] Resolve shared nested fields such as `optimizer.lr`, `optimizer.weight_decay`, and `scheduler.T_max` into the training arguments used by the runner
- [x] Update existing configs to reference `shared.json` or otherwise document which shared defaults they intentionally override
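
The inheritance scheme above can be sketched as a recursive overlay, so nested blocks such as `optimizer` merge key by key rather than being replaced wholesale (function and variable names here are illustrative, not the project's actual API):

```python
from copy import deepcopy

def merge_config(shared: dict, experiment: dict) -> dict:
    """Overlay an experiment config on shared defaults.

    Nested dicts (e.g. "optimizer") are merged recursively, so an
    experiment that sets only optimizer.lr keeps the shared
    weight_decay. Names are illustrative.
    """
    merged = deepcopy(shared)
    for key, value in experiment.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged

shared = {"seed": 42, "optimizer": {"type": "adamw", "lr": 1e-4, "weight_decay": 1e-4}}
experiment = {"optimizer": {"lr": 3e-4}, "image_size": 224}
merged = merge_config(shared, experiment)
```

With this overlay, the experiment only spells out `optimizer.lr` and `image_size`; seed and weight decay still come from `shared.json`.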
- [x] Define one CV protocol for all phases:
  - outer fold: held-out test fold
  - inner validation split: group-aware split from the remaining training folds for early stopping/model selection
  - final reported metrics: aggregate held-out test-fold results across the 5 outer folds

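A minimal sketch of the group-aware part of this protocol, assigning whole basename groups to outer folds so no group straddles a fold boundary; the real pipeline's stratified group K-fold balances classes more carefully than this per-class round-robin approximation:

```python
from collections import defaultdict

def assign_group_folds(samples, n_folds=5):
    """Assign each sample to an outer fold so that all samples sharing a
    basename land in the same fold (group-aware, leakage-free).

    `samples` is a list of (basename, label) pairs. Groups are dealt out
    per class in round-robin order, which roughly balances the class
    distribution across folds. Names are illustrative.
    """
    groups_by_label = defaultdict(list)
    seen = {}
    for basename, label in samples:
        if basename not in seen:
            seen[basename] = label
            groups_by_label[label].append(basename)
    fold_of_group = {}
    for label, groups in groups_by_label.items():
        for i, basename in enumerate(sorted(groups)):
            fold_of_group[basename] = i % n_folds
    return [fold_of_group[basename] for basename, _ in samples]

samples = [("img_a", 0), ("img_a", 0), ("img_b", 1), ("img_c", 0), ("img_d", 1)]
folds = assign_group_folds(samples, n_folds=2)
```

Because fold membership is a deterministic function of the sorted group list, reusing the same seed and basename grouping yields identical folds across experiments, which is what makes the later paired statistical tests valid.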
### Data Preparation

- [x] Verify dataset structure and integrity
- [x] Check that real and fake images are properly organized by source
- [x] Verify no data leakage between train/val/test splits or CV folds (group-aware by basename)

### Cleanup

- [x] Remove `classifier/tools/ensemble.py` (not part of reorganization plan, conflicts with explainability goals)
- [x] Remove robustness evaluation from `classifier/tools/analyze.py` (lines 51-104, 82-104, 144) - not part of experimental plan
- [x] Remove any unused or obsolete config files from previous experiments (see detailed list below)
- [x] Clean up old output directories if needed (keep important results for reference)

#### Config Files to Remove (39 total)

**Root configs (6):**
- [x] `classifier/configs/resnet18_quick.json`
- [x] `classifier/configs/resnet18.json`
- [x] `classifier/configs/simple_cnn_large.json`
- [x] `classifier/configs/simple_cnn_micro.json`
- [x] `classifier/configs/simple_cnn_small.json`
- [x] `classifier/configs/simple_cnn.json`

**Phase 1 old configs (7):**
- [x] `classifier/configs/phase1/p1_cnn_base.json` (uses lr=1e-3, epochs=20 - should be 1e-4, 15)
- [x] `classifier/configs/phase1/p1_cnn_aug.json`
- [x] `classifier/configs/phase1/p1_resnet18_base.json` (duplicate of new baseline)
- [x] `classifier/configs/phase1/p1_resnet18_aug.json`
- [x] `classifier/configs/phase1/holdout/` (entire directory - 6 configs, source holdout not in new plan)

**Phase 2 old configs (7):**
- [x] `classifier/configs/phase2/p2_resnet18_224.json` (replaced by `p2b_resnet18_224.json`)
- [x] `classifier/configs/phase2/p2_resnet18_facecrop.json` (replaced by `p2c_resnet18_facecrop.json`)
- [x] `classifier/configs/phase2/p2_resnet18_frozen.json` (frozen backbone not in new plan)
- [x] `classifier/configs/phase2/p2_resnet34_224.json` (ResNet34 should be in Phase 3)
- [x] `classifier/configs/phase2/p2_resnet34.json` (ResNet34 should be in Phase 3)
- [x] `classifier/configs/phase2/p2_resnet50_frozen.json` (ResNet50 should be in Phase 3)
- [x] `classifier/configs/phase2/p2_resnet50.json` (ResNet50 should be in Phase 3)

**Phase 3 old configs (4):**
- [x] `classifier/configs/phase3/p3_efficientnet_b2.json` (EfficientNet-B2 not in new plan, only B0)
- [x] `classifier/configs/phase3/p3_resnet18_facecrop_full.json` (ResNet18 full dataset should be Phase 4)
- [x] `classifier/configs/phase3/p3_resnet18_freqaug.json` (frequency augmentation not in new plan)
- [x] `classifier/configs/phase3/p3_vit_b16.json` (ViT not in new plan, replaced with ConvNeXt/MobileNet)
- Note: `p3_efficientnet_b0.json` - REMOVED (will be recreated after Phase 2 with correct settings)

**Source holdout (6):**
- [x] `classifier/configs/source_holdout/` (entire directory - 6 configs, source holdout not in new plan)

**Ablation (3):**
- [x] `classifier/configs/ablation/` (entire directory - 3 configs, ablation studies not in new plan)

**Configs to KEEP (3):**
- ✅ `classifier/configs/shared.json`
- ✅ `classifier/configs/phase1/p1_simplecnn_baseline.json`
- ✅ `classifier/configs/phase1/p1_resnet18_baseline.json`

**Phase 2 alias configs removed (8):**
- [x] `classifier/configs/phase2/p2b_resnet18_128.json` (alias for p1_resnet18_baseline)
- [x] `classifier/configs/phase2/p2b_simplecnn_128.json` (alias for p1_simplecnn_baseline)
- [x] `classifier/configs/phase2/p2c_resnet18_nofacecrop.json` (alias for p2b_resnet18_224)
- [x] `classifier/configs/phase2/p2c_simplecnn_nofacecrop.json` (alias for p2b_simplecnn_224)
- [x] `classifier/configs/phase2/p2d_resnet18_noaug.json` (alias for p2b_resnet18_224)
- [x] `classifier/configs/phase2/p2d_simplecnn_noaug.json` (alias for p2b_simplecnn_224)
- [x] `classifier/configs/phase2/p2e_resnet18_facecrop_only.json` (alias for p2c_resnet18_facecrop)
- [x] `classifier/configs/phase2/p2e_simplecnn_facecrop_only.json` (alias for p2c_simplecnn_facecrop)

Note: Comparison pairs (baseline vs treatment) are defined in the analysis notebook as a mapping dict, not as separate config files.

---

## Phase 1: Architecture Baseline

### 1.1 Experiment Configs

- [x] Create `classifier/configs/phase1/p1_simplecnn_baseline.json`
  - backbone: simple_cnn
  - cnn_preset: medium
  - dropout: 0.0
  - epochs: 15
  - batch_size: 32
  - lr: 1e-4 (consistent with ResNet18)
  - weight_decay: 1e-4
  - image_size: 128
  - data_dir: data
  - early_stopping_patience: 5
  - subsample: 0.2
  - face_crop: false
  - augment: false
  - seed: 42

- [x] Create `classifier/configs/phase1/p1_resnet18_baseline.json`
  - backbone: resnet18
  - pretrained: true
  - epochs: 15
  - batch_size: 32
  - lr: 1e-4
  - weight_decay: 1e-4
  - image_size: 128
  - data_dir: data
  - early_stopping_patience: 5
  - subsample: 0.2
  - face_crop: false
  - augment: false
  - seed: 42

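Under the inheritance scheme from Phase 0, an experiment config only needs to spell out what it overrides; `shared.json` already supplies the seed, batch size, optimizer, scheduler, patience, and worker count. A sketch of what `p1_resnet18_baseline.json` might then contain (field names assumed for illustration, not the project's actual schema):

```json
{
  "extends": "shared.json",
  "backbone": "resnet18",
  "pretrained": true,
  "epochs": 15,
  "image_size": 128,
  "subsample": 0.2,
  "face_crop": false,
  "augment": false
}
```
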
### 1.2 Code Updates

- [x] Implement 5-fold stratified group cross-validation by basename in training pipeline
- [x] Update `classifier/src/training/trainer.py` to support CV
- [x] Update `classifier/src/evaluation/evaluate.py` to support CV
- [x] Ensure all metrics report mean ± std and confidence intervals across folds

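The "mean ± std and confidence intervals" requirement can be sketched with a small aggregator over per-fold metrics, using a Student-t 95% interval over the 5 fold values (numbers and names illustrative):

```python
import math
from statistics import mean, stdev

# Two-sided 97.5% Student-t quantiles for small fold counts (keyed by df = n - 1).
T_975 = {2: 12.706, 3: 4.303, 4: 3.182, 5: 2.776, 6: 2.571}

def aggregate_folds(values):
    """Summarize a per-fold metric as mean, sample std, and a t-based 95% CI."""
    n = len(values)
    m = mean(values)
    s = stdev(values)
    half = T_975[n - 1] * s / math.sqrt(n)
    return {"mean": m, "std": s, "ci95": (m - half, m + half)}

auc_per_fold = [0.91, 0.93, 0.90, 0.94, 0.92]
summary = aggregate_folds(auc_per_fold)
```

With only 5 folds the t-quantile (2.776) matters; a normal-approximation interval would be noticeably too narrow.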
### 1.3 Training

- [x] Train SimpleCNN with 5-fold stratified group CV (via pipeline: `python -m pipeline run classifier/configs/phase1/p1_simplecnn_baseline.json`)
- [x] Train ResNet18 with 5-fold stratified group CV (via pipeline: `python -m pipeline run classifier/configs/phase1/p1_resnet18_baseline.json`)
- [x] Save all checkpoints and metrics (pipeline automatically fetches outputs to classifier/outputs/)

### 1.4 Analysis

- [x] Use `classifier/notebooks/03_phase1_analysis.ipynb` for Phase 1 analysis
- [x] Compare SimpleCNN vs ResNet18 performance
  - [x] Overall metrics (AUC, Accuracy, F1) with mean ± std and confidence intervals
  - [x] Per-source metrics (text2img, inpainting, insight)
  - [x] Train/val/test performance curves
  - [x] Confusion matrices
  - [x] Statistical significance testing
- [x] Generate Grad-CAM visualizations (10-20 images per model)
- [x] Document conclusions: Which baseline is better and why

---

## Phase 2: Preprocessing Impact

### 2.1 Shortcut Analysis (2A)

- [x] Create `classifier/configs/phase2/p2a_t1_original.json`
  - backbone: resnet18
  - image_size: 224
  - subsample: 0.2
  - seed: 42
  - augment: false
  - normalization: imagenet
  - data_dir: data

- [x] Create `classifier/configs/phase2/p2a_t2_real_norm.json`
  - extends: p2a_t1_original.json
  - normalization: real_norm
  - **Normalization**: Calculate mean/std from real training images only within each fold

- [x] Geometry diagnostic was explored and then removed from the codebase (`src/evaluation/geometry.py` no longer exists):
  - Current pipeline always square-crops before resize, reducing rectangle-vs-square shortcut risk.
  - Shortcut analysis now relies on normalization and held-out-source evidence artifacts.

- [ ] Train the 2 shortcut configs with 5-fold stratified group CV
- [ ] Compare results:
  - Standard vs matched-geometry eval for `p2a_t1_original` (letterboxing impact)
  - `p2a_t1_original` vs `p2a_t2_real_norm` (color distribution shortcut)

- [x] Create `classifier/configs/phase2/p2a_t3_holdout_text2img.json`
  - extends: p2a_t1_original.json
  - train_sources: ["wiki", "inpainting", "insight"]
  - eval_sources: ["wiki", "inpainting", "insight", "text2img"]

- [x] Create `classifier/configs/phase2/p2a_t3_holdout_inpainting.json`
  - extends: p2a_t1_original.json
  - train_sources: ["wiki", "text2img", "insight"]
  - eval_sources: ["wiki", "text2img", "insight", "inpainting"]

- [x] Create `classifier/configs/phase2/p2a_t3_holdout_insight.json`
  - extends: p2a_t1_original.json
  - train_sources: ["wiki", "text2img", "inpainting"]
  - eval_sources: ["wiki", "text2img", "inpainting", "insight"]

- [ ] Train the 3 source holdout configs with 5-fold stratified group CV
- [ ] Compare held-out source performance vs in-source performance:
  - Calculate AUC for held-out source (text2img, inpainting, insight)
  - Compute Δ (in-source AUC - held-out AUC)
  - If Δ > 0.05-0.10, model is learning source-specific features

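The Δ criterion above is a simple gap check; a sketch with the threshold fixed at the lower 0.05 bound of the range above (function name and values illustrative):

```python
def source_shortcut_delta(in_source_auc: float, held_out_auc: float) -> dict:
    """Gap between in-source and held-out-source AUC.

    A large positive delta means the model generalizes poorly to the
    unseen source, suggesting it relies on source-specific artifacts.
    """
    delta = in_source_auc - held_out_auc
    return {"delta": delta, "suspect_shortcut": delta > 0.05}

result = source_shortcut_delta(0.95, 0.84)
```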
### 2.2 Resolution Impact (2B)

- [x] Create `classifier/configs/phase2/p2b_simplecnn_224.json`
  - backbone: simple_cnn
  - image_size: 224
  - subsample: 0.2
  - augment: false
  - seed: 42
  - data_dir: data

- [x] Create `classifier/configs/phase2/p2b_resnet18_224.json`
  - backbone: resnet18
  - image_size: 224
  - subsample: 0.2
  - augment: false
  - seed: 42
  - data_dir: data

- [ ] Train both 224 configs with 5-fold stratified group CV
- [ ] Compare 128×128 vs 224×224 for each model
  - 128 baseline is `p1_*_baseline` (comparison mapping in notebook)

### 2.3 Facecrop Impact (2C)

- [x] Create `classifier/configs/phase2/p2c_simplecnn_facecrop.json`
  - backbone: simple_cnn
  - image_size: 224
  - subsample: 0.2
  - augment: false
  - seed: 42
  - data_dir: cropped/classifier

- [x] Create `classifier/configs/phase2/p2c_resnet18_facecrop.json`
  - backbone: resnet18
  - image_size: 224
  - subsample: 0.2
  - augment: false
  - seed: 42
  - data_dir: cropped/classifier

- [ ] Train both facecrop configs with 5-fold stratified group CV
- [ ] Compare `p2b_*_224` (no facecrop) vs `p2c_*_facecrop` for each model
  - No-facecrop baseline is `p2b_*_224` (comparison mapping in notebook)

### 2.4 Augmentation Impact (2D)

- [x] Create `classifier/configs/phase2/p2d_simplecnn_aug.json`
  - backbone: simple_cnn
  - image_size: 224
  - subsample: 0.2
  - seed: 42
  - augment: {hflip_p: 0.5, rotation_degrees: 10, brightness: 0.2, contrast: 0.2, saturation: 0.1, hue: 0.02, grayscale_p: 0.1, blur_p: 0.1, erase_p: 0.2, noise_p: 0.3, noise_std: 0.04}
  - data_dir: data

- [x] Create `classifier/configs/phase2/p2d_resnet18_aug.json`
  - backbone: resnet18
  - image_size: 224
  - subsample: 0.2
  - seed: 42
  - augment: {hflip_p: 0.5, rotation_degrees: 10, brightness: 0.2, contrast: 0.2, saturation: 0.1, hue: 0.02, grayscale_p: 0.1, blur_p: 0.1, erase_p: 0.2, noise_p: 0.3, noise_std: 0.04}
  - data_dir: data

- [ ] Train both augmentation configs with 5-fold stratified group CV
- [ ] Compare `p2b_*_224` (no aug) vs `p2d_*_aug` for each model
  - No-aug baseline is `p2b_*_224` (comparison mapping in notebook)

### 2.5 Augmentation + Facecrop (2E)

- [x] Create `classifier/configs/phase2/p2e_simplecnn_facecrop_aug.json`
  - backbone: simple_cnn
  - image_size: 224
  - subsample: 0.2
  - seed: 42
  - augment: {hflip_p: 0.5, rotation_degrees: 10, brightness: 0.2, contrast: 0.2, saturation: 0.1, hue: 0.02, grayscale_p: 0.1, blur_p: 0.1, erase_p: 0.2, noise_p: 0.3, noise_std: 0.04}
  - data_dir: cropped/classifier

- [x] Create `classifier/configs/phase2/p2e_resnet18_facecrop_aug.json`
  - backbone: resnet18
  - image_size: 224
  - subsample: 0.2
  - seed: 42
  - augment: {hflip_p: 0.5, rotation_degrees: 10, brightness: 0.2, contrast: 0.2, saturation: 0.1, hue: 0.02, grayscale_p: 0.1, blur_p: 0.1, erase_p: 0.2, noise_p: 0.3, noise_std: 0.04}
  - data_dir: cropped/classifier

- [ ] Train both facecrop+aug configs with 5-fold stratified group CV
- [ ] Compare `p2c_*_facecrop` (facecrop only) vs `p2e_*_facecrop_aug` for each model
  - Facecrop-only baseline is `p2c_*_facecrop` (comparison mapping in notebook)

### 2.6 Phase 2 Analysis

- [ ] Use `classifier/notebooks/04_phase2_analysis.ipynb` for Phase 2 analysis
- [ ] For each experiment (2A-2E):
  - [ ] Load 5-fold stratified group CV results (mean ± std and confidence intervals)
  - [ ] Generate overall metrics (AUC, Accuracy, F1)
  - [ ] Generate per-source metrics (text2img, inpainting, insight)
  - [ ] Calculate train/val gap
  - [ ] Calculate pairwise source AUC variance (wiki-vs-source AUC variance)
  - [ ] Statistical significance testing vs baseline
  - [ ] Generate comparison visualizations (bar charts, heatmaps)
- [ ] For 2A (Shortcut Analysis):
  - [ ] Compare original-test vs alternative geometry evidence if reintroduced in a dedicated tool/notebook
  - [ ] Compare ImageNet vs real-image-only normalization (color distribution shortcuts)
  - [ ] Load source holdout results (3 configs)
  - [ ] Calculate held-out source AUC vs in-source AUC for each holdout experiment
  - [ ] Compute Δ (in-source AUC - held-out AUC)
  - [ ] If Δ > 0.05-0.10, model is learning source-specific features
  - [ ] Generate source holdout comparison table
- [ ] For each model/condition:
  - [ ] Generate Grad-CAM visualizations (10-20 images per condition)
  - [ ] Organize by experiment, prediction type, and source
- [ ] Answer key questions:
  - [ ] Which preprocessing choices are statistically significant?
  - [ ] Do certain sources benefit more from specific preprocessing?
  - [ ] Is there an interaction between facecrop and augmentation?
  - [ ] Are shortcuts being learned (resolution, color distribution)?
  - [ ] Is the model learning source-specific features (source holdout)?
  - [ ] Does augmentation remove shortcuts or over-regularize?
  - [ ] What features do models focus on (based on Grad-CAM)?
- [ ] Generate comprehensive metrics comparison table
- [ ] Use paired fold-wise statistical tests for model comparisons, with bootstrap confidence intervals for key metrics where useful
- [ ] Provide evidence-based conclusions for each experiment
- [ ] Provide recommendations for Phase 3 (best preprocessing settings)

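The paired fold-wise testing called for above can be sketched with a paired t statistic over per-fold differences, which is only valid because comparable experiments reuse the same fold assignments (values illustrative; the p-value would come from a t table or `scipy.stats`):

```python
import math
from statistics import mean, stdev

def paired_fold_t_test(metric_a, metric_b):
    """Paired t statistic over per-fold metric differences.

    Requires that both models were evaluated on identical fold
    assignments, so each difference compares the same test data.
    Returns the t value and degrees of freedom.
    """
    diffs = [a - b for a, b in zip(metric_a, metric_b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Illustrative per-fold AUCs for two models on the same 5 folds.
t, df = paired_fold_t_test([0.93, 0.94, 0.92, 0.95, 0.93],
                           [0.90, 0.91, 0.90, 0.92, 0.90])
```

With only 4 degrees of freedom, significance thresholds are strict, which is exactly why the plan also calls for bootstrap confidence intervals as a complementary view.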
---
## Phase 3: Extended Architecture Exploration

### 3.1 Experiment Configs

Use the best preprocessing choices from Phase 2. The placeholders below assume 224×224, face crop enabled, and no augmentation unless Phase 2 results justify different settings.

- [ ] Create `classifier/configs/phase3/p3_resnet34.json`
  - backbone: resnet34
  - pretrained: true
  - epochs: 15
  - batch_size: 32
  - lr: 1e-4
  - weight_decay: 1e-4
  - image_size: 224
  - face_crop: true (or best from Phase 2C/E)
  - face_crop_margin: 0.6
  - augment: false (or best from Phase 2D/E)
  - subsample: 0.2
  - seed: 42
  - early_stopping_patience: 5

- [ ] Create `classifier/configs/phase3/p3_resnet50.json`
  - backbone: resnet50
  - pretrained: true
  - epochs: 15
  - batch_size: 32
  - lr: 1e-4
  - weight_decay: 1e-4
  - image_size: 224
  - face_crop: true (or best from Phase 2C/E)
  - face_crop_margin: 0.6
  - augment: false (or best from Phase 2D/E)
  - subsample: 0.2
  - seed: 42
  - early_stopping_patience: 5

- [ ] Create `classifier/configs/phase3/p3_efficientnet_b0.json`
  - backbone: efficientnet_b0
  - pretrained: true
  - epochs: 15
  - batch_size: 32
  - lr: 1e-4
  - weight_decay: 1e-4
  - image_size: 224
  - face_crop: true (or best from Phase 2C/E)
  - augment: false (or best from Phase 2D/E)
  - subsample: 0.2
  - seed: 42
  - early_stopping_patience: 5

- [ ] Create `classifier/configs/phase3/p3_convnext_tiny.json`
  - backbone: convnext_tiny
  - pretrained: true
  - epochs: 15
  - batch_size: 32
  - lr: 1e-4
  - weight_decay: 1e-4
  - image_size: 224
  - face_crop: true (or best from Phase 2C/E)
  - augment: false (or best from Phase 2D/E)
  - subsample: 0.2
  - seed: 42
  - early_stopping_patience: 5

- [ ] Create `classifier/configs/phase3/p3_mobilenetv3_small.json`
  - backbone: mobilenetv3_small
  - pretrained: true
  - epochs: 15
  - batch_size: 32
  - lr: 1e-4
  - weight_decay: 1e-4
  - image_size: 224
  - face_crop: true (or best from Phase 2C/E)
  - augment: false (or best from Phase 2D/E)
  - subsample: 0.2
  - seed: 42
  - early_stopping_patience: 5

### 3.2 Model Implementation

- [ ] Implement ConvNeXt-Tiny in `classifier/src/models/convnext.py`
- [ ] Implement MobileNetV3-Small in `classifier/src/models/mobilenet.py`
- [ ] Register both models in `classifier/src/models/__init__.py`

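Registration in `classifier/src/models/__init__.py` could follow a simple registry pattern; the builders below are stand-in stubs, since the real ones would wrap torchvision backbones:

```python
# Hypothetical registry for classifier/src/models/__init__.py.
# Real builders would construct torchvision models and swap the head
# for a 2-class classifier; here they return descriptive strings.
MODEL_REGISTRY = {}

def register_model(name):
    """Decorator that maps a backbone name to its builder function."""
    def decorator(builder):
        MODEL_REGISTRY[name] = builder
        return builder
    return decorator

@register_model("convnext_tiny")
def build_convnext_tiny(num_classes=2, pretrained=True):
    return f"ConvNeXt-Tiny(num_classes={num_classes}, pretrained={pretrained})"

@register_model("mobilenetv3_small")
def build_mobilenetv3_small(num_classes=2, pretrained=True):
    return f"MobileNetV3-Small(num_classes={num_classes}, pretrained={pretrained})"

def create_model(name, **kwargs):
    """Look up a backbone by its config name (e.g. from `backbone:`)."""
    if name not in MODEL_REGISTRY:
        raise KeyError(f"Unknown backbone: {name}")
    return MODEL_REGISTRY[name](**kwargs)
```

The point of the registry is that Phase 3 configs can select any backbone via the `backbone:` field without the trainer knowing about individual model modules.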
### 3.3 Training

- [ ] Train ResNet34 with 5-fold stratified group CV
- [ ] Train ResNet50 with 5-fold stratified group CV
- [ ] Train EfficientNet-B0 with 5-fold stratified group CV
- [ ] Train ConvNeXt-Tiny with 5-fold stratified group CV
- [ ] Train MobileNetV3-Small with 5-fold stratified group CV
- [ ] Save all checkpoints and metrics

### 3.4 Analysis

- [ ] Use `classifier/notebooks/05_phase3_analysis.ipynb` for Phase 3 analysis
- [ ] Load 5-fold stratified group CV results for all models (mean ± std and confidence intervals)
- [ ] Generate overall metrics for each model
- [ ] Generate per-source metrics for each model
- [ ] Compare with Phase 1 baselines (ResNet18, SimpleCNN)
- [ ] Statistical significance testing vs baselines
- [ ] Generate Grad-CAM visualizations for top models (10-20 images each)
- [ ] Parameter count vs performance analysis
- [ ] Conclusions: Which architectures work best and why

---

## Phase 4: Final Analysis on Best Models

### 4.1 Select Top Models

- [ ] Based on Phases 1-3 results, select top 3-4 models
- [ ] Document selection criteria (e.g., top AUC, balanced performance, efficiency)

### 4.2 Data Quantity Scaling (4A)

- [ ] For each selected model, create configs for different data sizes:
  - [ ] `classifier/configs/phase4/p4a_<model>_20pct.json` (subsample: 0.2)
  - [ ] `classifier/configs/phase4/p4a_<model>_50pct.json` (subsample: 0.5)
  - [ ] `classifier/configs/phase4/p4a_<model>_100pct.json` (subsample: 1.0)
- [ ] In every 4A config, explicitly set the best Phase 2 preprocessing choices:
  - image_size: best from Phase 2B
  - face_crop: best from Phase 2C/E
  - augment: best from Phase 2D/E
- [ ] Train each model with 5-fold stratified group CV at all three data sizes
- [ ] Compare how each model scales with more data

### 4.3 Full Dataset Evaluation (4B)

- [ ] For each selected model, create config for full dataset:
  - `classifier/configs/phase4/p4b_<model>_full.json` (subsample: 1.0)
- [ ] In every 4B config, explicitly set the same best Phase 2 preprocessing choices used in 4A
- [ ] Train each model on full dataset with 5-fold stratified group CV
- [ ] Generate detailed per-source metrics
- [ ] Generate Grad-CAM visualizations (10-20 images each)
- [ ] Perform hard example analysis (false positives/negatives) with visualizations
- [ ] Generate confidence distribution histograms
- [ ] Cross-validation results (mean ± std with confidence intervals)

### 4.4 Analysis

- [ ] Use `classifier/notebooks/06_phase4_analysis.ipynb` for Phase 4 analysis
- [ ] Load data quantity scaling results
- [ ] Load full dataset evaluation results
- [ ] Generate comprehensive metrics comparison table
- [ ] Generate per-source metrics for final models
- [ ] Generate Grad-CAM galleries for final models
- [ ] Perform hard example analysis with visualizations
- [ ] Generate confidence distribution histograms
- [ ] Final model comparison and selection
- [ ] Conclusions and recommendations

---

## Notebooks and Analysis

This section is the consolidated notebook checklist for the notebooks referenced in the phase sections above; do not create duplicate notebooks for the same phase.

### 5.1 Exploratory Data Analysis

- [x] Create `classifier/notebooks/01_eda.ipynb`
- [x] Dataset overview (real vs fake distribution, sources)
- [x] Image resolution/aspect ratio analysis (identify potential shortcuts)
- [x] Color distribution analysis (identify potential shortcuts)
- [x] Sample visualization from each source
- [x] Statistical summary of the dataset
- [x] Data quality checks

### 5.2 Preprocessing Pipeline

- [x] Create `classifier/notebooks/02_preprocessing.ipynb`
- [x] Square crop and resize implementation demonstration
- [x] Face crop (MTCNN) demonstration and effectiveness analysis
- [x] Augmentation pipeline visualization (before/after examples)
- [x] Z-score normalization comparison (ImageNet vs real-image-only)
- [x] Data split verification (group-aware by basename, no overlap)
- [x] Preprocessing impact visualization

### 5.3 Phase 1 Analysis

- [x] Create `classifier/notebooks/03_phase1_analysis.ipynb`
- [x] Load Phase 1 training results
- [x] Generate 5-fold stratified group CV results (mean ± std with confidence intervals)
- [x] Generate per-source metrics for each model
- [x] Generate train/val/test performance curves
- [x] Generate confusion matrices
- [x] Perform statistical significance testing between models
- [x] Generate Grad-CAM visualizations (10-20 images each)
- [x] Document conclusions: Which baseline is better and why

### 5.4 Phase 2 Analysis

- [x] Create `classifier/notebooks/04_phase2_analysis.ipynb`
- [ ] Load all Phase 2 experiment results
- [ ] For each experiment (2A-2E):
  - [ ] Generate 5-fold stratified group CV results (mean ± std with confidence intervals)
  - [ ] Generate overall metrics
  - [ ] Generate per-source metrics
  - [ ] Calculate train/val gap
  - [ ] Calculate pairwise source AUC variance (wiki-vs-source AUC variance)
  - [ ] Perform statistical significance testing
- [ ] Generate comparison tables across all Phase 2 experiments
- [ ] Generate comparison visualizations (bar charts, heatmaps)
- [ ] For each model/condition, generate Grad-CAM visualizations (10-20 images)
- [ ] Organize visualizations by experiment, model, prediction type, and source
- [ ] Answer key analysis questions
- [ ] Generate comprehensive metrics comparison table
- [ ] Provide evidence-based conclusions for each experiment
- [ ] Provide recommendations for Phase 3

### 5.5 Phase 3 Analysis

- [ ] Create `classifier/notebooks/05_phase3_analysis.ipynb`
- [ ] Load Phase 3 training results
- [ ] Generate 5-fold stratified group CV results for each model (mean ± std with confidence intervals)
- [ ] Generate per-source metrics for each model
- [ ] Compare with Phase 1 baselines (ResNet18, SimpleCNN)
- [ ] Perform statistical significance testing vs baselines
- [ ] Generate Grad-CAM visualizations for top models (10-20 images each)
- [ ] Parameter count vs performance analysis
- [ ] Conclusions: Which architectures work best and why

### 5.6 Phase 4 Analysis

- [ ] Create `classifier/notebooks/06_phase4_analysis.ipynb`
- [ ] Load data quantity scaling results
- [ ] Load full dataset evaluation results
- [ ] Generate comprehensive metrics comparison table
- [ ] Generate per-source metrics for final models
- [ ] Generate Grad-CAM galleries for final models
- [ ] Perform hard example analysis with visualizations
- [ ] Generate confidence distribution histograms
- [ ] Final model comparison and selection
- [ ] Conclusions and recommendations

### 5.7 Grad-CAM Deep Dive (Optional)

- [ ] Create `classifier/notebooks/07_gradcam_deep_dive.ipynb`
- [ ] Load Grad-CAM results from all phases
- [ ] Comprehensive Grad-CAM analysis across all phases and models
- [ ] Feature visualization for different model architectures
  - [ ] CNN vs EfficientNet vs ConvNeXt comparison
  - [ ] What regions do different architectures focus on?
  - [ ] Are there systematic differences in attention patterns?
- [ ] Evidence of shortcut removal analysis across phases
- [ ] Cross-phase analysis: does model attention change with different preprocessing?
- [ ] Generate visual explanations suitable for presentation

---

## Code Implementation Tasks

### Cross-Validation Implementation

- [x] Update `classifier/src/training/trainer.py` to support 5-fold stratified group CV by basename
- [x] Update `classifier/src/evaluation/evaluate.py` to support grouped CV splits
- [x] Implement metric aggregation across folds (mean ± std)
- [x] Ensure all metrics report confidence intervals
- [x] Reuse the same fold assignments for comparable experiments so paired statistical tests are valid
- [x] Rename `classifier/run_cv.py` to `classifier/run.py` (pipeline expects classifier/run.py)

### Model Implementations

- [ ] Implement ConvNeXt-Tiny in `classifier/src/models/convnext.py`
- [ ] Implement MobileNetV3-Small in `classifier/src/models/mobilenet.py`
- [ ] Register both models in `classifier/src/models/__init__.py`

### Normalization Implementation

- [ ] Implement function to calculate mean/std from real training images only
- [ ] Update `classifier/src/preprocessing/pipeline.py` to support custom normalization stats
- [ ] Test ImageNet normalization vs real-image-only normalization

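The real-image-only statistics could be computed along these lines, with the loop restricted to real training images of the current fold so the normalization never sees fake-image statistics (pure-Python sketch over nested lists; a real implementation would use tensors):

```python
def channel_stats(images):
    """Per-channel mean/std over a list of images.

    Each image is a nested list shaped [H][W][3] with values in [0, 1].
    In the plan this runs over *real* training images of the current
    fold only; the resulting stats replace the ImageNet constants in
    the normalization transform.
    """
    sums = [0.0, 0.0, 0.0]
    sq_sums = [0.0, 0.0, 0.0]
    count = 0  # total pixels seen
    for img in images:
        for row in img:
            for px in row:
                for c in range(3):
                    sums[c] += px[c]
                    sq_sums[c] += px[c] ** 2
                count += 1
    means = [s / count for s in sums]
    stds = [(sq / count - m ** 2) ** 0.5 for sq, m in zip(sq_sums, means)]
    return means, stds

real_images = [[[[0.2, 0.4, 0.6], [0.4, 0.6, 0.8]]]]  # one 1x2 RGB image
mean, std = channel_stats(real_images)
```

Because the stats are fold-local, each CV fold gets its own normalization, which keeps the test fold from leaking into the preprocessing.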
### Evaluation Improvements

- [ ] Ensure test set uses `train=False` to disable augmentation
- [ ] Ensure diagnostic evaluation transforms never change the training data
- [ ] Verify CV fold assignments are identical across comparable experiments (same seed and basename grouping)
- [ ] Implement per-source metrics with detection rate and false alarm rate
- [ ] Implement pairwise AUC calculations
- [ ] Implement train/val gap calculations
- [ ] Implement pairwise source AUC variance calculations

### Grad-CAM Improvements

- [ ] Ensure Grad-CAM works for all model types (CNN-based)
- [ ] Implement Grad-CAM for ConvNeXt
- [ ] Implement Grad-CAM for MobileNetV3
- [ ] Organize Grad-CAM outputs by experiment, model, prediction type, source

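For the new backbones, the core Grad-CAM computation is architecture-agnostic once a target layer's activations and gradients have been captured by hooks; only the choice of target layer differs per model. A dependency-free sketch of that computation, with nested lists standing in for tensors:

```python
def grad_cam(activations, gradients):
    """Grad-CAM heatmap from one conv layer.

    `activations` and `gradients` are shaped [C][H][W]: the layer's
    forward activations and the gradients of the target logit w.r.t.
    them (a real run captures both with torch hooks). Channel weights
    are the spatially averaged gradients; the map is the ReLU of the
    weighted activation sum.
    """
    C = len(activations)
    H, W = len(activations[0]), len(activations[0][0])
    weights = [
        sum(gradients[c][i][j] for i in range(H) for j in range(W)) / (H * W)
        for c in range(C)
    ]
    return [
        [max(0.0, sum(weights[c] * activations[c][i][j] for c in range(C)))
         for j in range(W)]
        for i in range(H)
    ]

activations = [[[1.0, 0.0], [0.0, 1.0]]]  # one channel, 2x2
gradients = [[[1.0, 1.0], [1.0, 1.0]]]
cam = grad_cam(activations, gradients)
```

For ConvNeXt and MobileNetV3 the work is therefore mostly in picking a sensible last convolutional stage to hook, not in the heatmap math itself.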
---
## Final Report Preparation

- [ ] Compile results from all phases
- [ ] Create presentation slides (PDF format)
  - [ ] Brief description of deep learning solutions (discriminative + generative)
  - [ ] Description of implementation steps and improvements
    - [ ] Motivate choices for architecture, training strategy, etc.
    - [ ] Show intermediate results
    - [ ] Interpret results and what changed
    - [ ] What was decided to improve results
  - [ ] Classification performance results
    - [ ] Experimental setup
    - [ ] Train/val/test splits
    - [ ] Performance metrics chosen
  - [ ] Data generation performance results
    - [ ] Experimental setup
    - [ ] Performance metrics chosen
  - [ ] Discussion and conclusions
    - [ ] Comments on performance
    - [ ] Final remarks
- [ ] Fill auto-evaluation file

---

## Summary

Total tasks: ~150

This implementation plan covers:

- ✅ All 4 phases with comprehensive experiments
- ✅ 5-fold stratified group cross-validation for all experiments
- ✅ 7 analysis notebooks for robust validation
- ✅ Shortcut analysis (resolution/ratio + color distribution + source holdout)
- ✅ Source holdout experiments to detect source-specific feature learning
- ✅ Grad-CAM visualizations for explainability
- ✅ Statistical analysis with confidence intervals
- ✅ Per-source metrics for all experiments
- ✅ Data quantity scaling analysis
- ✅ Full dataset evaluation on best models
- ✅ Comprehensive documentation and reporting

**Key Features:**

- Reproducible experiments with fixed seeds
- Stratified group CV keeps basename groups together while balancing class distribution
- Multiple shortcut analyses to prevent model cheating (resolution, color, source-specific)
- Source holdout experiments to test generalization to unseen sources
- Grad-CAM for explainability
- Statistical rigor with confidence intervals
- Per-source analysis to understand model behavior
- Clear progression from baselines -> preprocessing -> architectures -> final evaluation