# Deepfake Detection Classifier - Implementation Plan

## Overview

This document provides a comprehensive implementation plan for refactoring the deepfake detection classifier project. Each task includes a checkbox to track completion.

## Phase 0: Pre-Implementation Setup

### Infrastructure and Configuration
- [ ] Create `classifier/configs/shared.json` with shared parameters:
  - seed: 42
  - val_ratio: 0.1
  - test_ratio: 0.1
  - batch_size: 32
  - optimizer: {type: "adamw", lr: 1e-4, weight_decay: 1e-4}
  - scheduler: {type: "cosine_annealing", T_max: 15}
  - early_stopping_patience: 5
  - num_workers: 4
  - cv_folds: 5
  - data_dir: "data"
  - face_crop_margin: 0.6
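As a sketch, `shared.json` could look like the following (field names mirror the bullets above; the exact schema is whatever the config loader defines):

```json
{
  "seed": 42,
  "val_ratio": 0.1,
  "test_ratio": 0.1,
  "batch_size": 32,
  "optimizer": {"type": "adamw", "lr": 1e-4, "weight_decay": 1e-4},
  "scheduler": {"type": "cosine_annealing", "T_max": 15},
  "early_stopping_patience": 5,
  "num_workers": 4,
  "cv_folds": 5,
  "data_dir": "data",
  "face_crop_margin": 0.6
}
```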
- [ ] Implement config loading/merging so experiment configs inherit `shared.json` defaults and override only the variables under test
- [ ] Resolve shared nested fields such as `optimizer.lr`, `optimizer.weight_decay`, and `scheduler.T_max` into the training arguments used by the runner
- [ ] Update existing configs to reference `shared.json` or otherwise document which shared defaults they intentionally override
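The inheritance scheme above can be sketched as a recursive dict merge (`merge_config` is a hypothetical helper name; the project's actual loader may differ):

```python
from copy import deepcopy

def merge_config(shared: dict, experiment: dict) -> dict:
    """Recursively merge an experiment config over shared defaults.

    Nested dicts (e.g. "optimizer") are merged key by key, so an
    experiment can override optimizer.lr without losing weight_decay.
    """
    merged = deepcopy(shared)
    for key, value in experiment.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged

shared = {"seed": 42, "optimizer": {"type": "adamw", "lr": 1e-4, "weight_decay": 1e-4}}
experiment = {"optimizer": {"lr": 3e-4}, "image_size": 224}
cfg = merge_config(shared, experiment)
# cfg["optimizer"] keeps type and weight_decay, overrides only lr
```

The `deepcopy` keeps the shared defaults immutable, so loading one experiment cannot silently change the defaults seen by the next.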
- [ ] Define one CV protocol for all phases:
  - outer fold: held-out test fold
  - inner validation split: group-aware split from the remaining training folds for early stopping/model selection
  - final reported metrics: aggregate held-out test-fold results across the 5 outer folds
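The group-consistency requirement behind this protocol can be illustrated with a deterministic hash-based fold assignment (a sketch only: real stratified group CV would also balance class labels per fold, e.g. via scikit-learn's `StratifiedGroupKFold`; the file naming below is hypothetical):

```python
import hashlib

def fold_of(basename: str, n_folds: int = 5, seed: int = 42) -> int:
    """Deterministically map a basename (group) to one CV fold.

    Hashing the basename together with the seed keeps every image
    derived from the same original in the same fold, independent of
    file order or restarts.
    """
    digest = hashlib.sha256(f"{seed}:{basename}".encode()).hexdigest()
    return int(digest, 16) % n_folds

# Hypothetical naming: both variants of img001 share the group "img001".
samples = ["img001_real.png", "img001_fake.png", "img002_real.png"]
folds = {s: fold_of(s.rsplit("_", 1)[0]) for s in samples}
```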
### Data Preparation

- [ ] Verify dataset structure and integrity
- [ ] Check that real and fake images are properly organized by source
- [ ] Verify no data leakage between train/val/test splits or CV folds (group-aware by basename)
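The leakage check over basename groups can be sketched as follows (`check_no_group_leakage` is a hypothetical helper name):

```python
def check_no_group_leakage(splits: dict[str, list[str]]) -> None:
    """Raise if any basename group appears in more than one split.

    `splits` maps split name -> list of basenames; a basename shared
    between two splits means the same underlying image leaked across
    the train/val/test boundary.
    """
    seen: dict[str, str] = {}
    for split_name, basenames in splits.items():
        for b in set(basenames):
            if b in seen and seen[b] != split_name:
                raise ValueError(f"group {b!r} leaks between {seen[b]} and {split_name}")
            seen[b] = split_name

# Disjoint groups pass silently; overlapping groups raise ValueError.
check_no_group_leakage({"train": ["a", "b"], "val": ["c"], "test": ["d"]})
```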
### Cleanup

- [ ] Remove `classifier/tools/ensemble.py` (not part of reorganization plan, conflicts with explainability goals)
- [ ] Remove robustness evaluation from `classifier/tools/analyze.py` (lines 51-104, 82-104, 144) - not part of experimental plan
- [ ] Remove any unused or obsolete config files from previous experiments (see detailed list below)
- [ ] Clean up old output directories if needed (keep important results for reference)
### Config Files to Remove (39 total)

Root configs (6):

- `classifier/configs/resnet18_quick.json`
- `classifier/configs/resnet18.json`
- `classifier/configs/simple_cnn_large.json`
- `classifier/configs/simple_cnn_micro.json`
- `classifier/configs/simple_cnn_small.json`
- `classifier/configs/simple_cnn.json`
Phase 1 old configs (7):

- `classifier/configs/phase1/p1_cnn_base.json` (uses lr=1e-3, epochs=20 - should be 1e-4, 15)
- `classifier/configs/phase1/p1_cnn_aug.json`
- `classifier/configs/phase1/p1_resnet18_base.json` (duplicate of new baseline)
- `classifier/configs/phase1/p1_resnet18_aug.json`
- `classifier/configs/phase1/holdout/` (entire directory - 6 configs, source holdout not in new plan)
Phase 2 old configs (7):

- `classifier/configs/phase2/p2_resnet18_224.json` (should be p2a_resnet18_224.json)
- `classifier/configs/phase2/p2_resnet18_facecrop.json` (should be p2b_resnet18_facecrop.json)
- `classifier/configs/phase2/p2_resnet18_frozen.json` (frozen backbone not in new plan)
- `classifier/configs/phase2/p2_resnet34_224.json` (ResNet34 should be in Phase 3)
- `classifier/configs/phase2/p2_resnet34.json` (ResNet34 should be in Phase 3)
- `classifier/configs/phase2/p2_resnet50_frozen.json` (ResNet50 should be in Phase 3)
- `classifier/configs/phase2/p2_resnet50.json` (ResNet50 should be in Phase 3)
Phase 3 old configs (4):

- `classifier/configs/phase3/p3_efficientnet_b2.json` (EfficientNet-B2 not in new plan, only B0)
- `classifier/configs/phase3/p3_resnet18_facecrop_full.json` (ResNet18 full dataset should be Phase 4)
- `classifier/configs/phase3/p3_resnet18_freqaug.json` (frequency augmentation not in new plan)
- `classifier/configs/phase3/p3_vit_b16.json` (ViT not in new plan, replaced with ConvNeXt/MobileNet)
- Note: `p3_efficientnet_b0.json` - REMOVED (will be recreated after Phase 2 with correct settings)
Source holdout (6):

- `classifier/configs/source_holdout/` (entire directory - 6 configs, source holdout not in new plan)

Ablation (3):

- `classifier/configs/ablation/` (entire directory - 3 configs, ablation studies not in new plan)
Configs to KEEP (3):

- ✅ `classifier/configs/shared.json`
- ✅ `classifier/configs/phase1/p1_simplecnn_baseline.json`
- ✅ `classifier/configs/phase1/p1_resnet18_baseline.json`
Phase 2 alias configs removed (8):

- `classifier/configs/phase2/p2b_resnet18_128.json` (alias for p1_resnet18_baseline)
- `classifier/configs/phase2/p2b_simplecnn_128.json` (alias for p1_simplecnn_baseline)
- `classifier/configs/phase2/p2c_resnet18_nofacecrop.json` (alias for p2b_resnet18_224)
- `classifier/configs/phase2/p2c_simplecnn_nofacecrop.json` (alias for p2b_simplecnn_224)
- `classifier/configs/phase2/p2d_resnet18_noaug.json` (alias for p2b_resnet18_224)
- `classifier/configs/phase2/p2d_simplecnn_noaug.json` (alias for p2b_simplecnn_224)
- `classifier/configs/phase2/p2e_resnet18_facecrop_only.json` (alias for p2c_resnet18_facecrop)
- `classifier/configs/phase2/p2e_simplecnn_facecrop_only.json` (alias for p2c_simplecnn_facecrop)
Note: Comparison pairs (baseline vs treatment) are defined in the analysis notebook as a mapping dict, not as separate config files.
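Such a mapping might look like the following sketch (experiment names are taken from the Phase 2 config list; the dict name and exact pairs are illustrative, not the notebook's actual code):

```python
# Hypothetical baseline-vs-treatment pairs for the Phase 2 analysis notebook.
# Keys name the effect under test; values are (baseline, treatment) config stems.
COMPARISONS = {
    "resolution": ("p1_resnet18_baseline", "p2b_resnet18_224"),
    "facecrop": ("p2b_resnet18_224", "p2c_resnet18_facecrop"),
    "augmentation": ("p2b_resnet18_224", "p2d_resnet18_aug"),
    "facecrop_plus_aug": ("p2c_resnet18_facecrop", "p2e_resnet18_facecrop_aug"),
}
```

Keeping the pairs in one dict means the notebook iterates a single structure for tables, plots, and paired tests, instead of duplicating config files.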
## Phase 1: Architecture Baseline

### 1.1 Experiment Configs
- [ ] Create `classifier/configs/phase1/p1_simplecnn_baseline.json`:
  - backbone: simple_cnn
  - cnn_preset: medium
  - dropout: 0.0
  - epochs: 15
  - batch_size: 32
  - lr: 1e-4 (consistent with ResNet)
  - weight_decay: 1e-4
  - image_size: 128
  - data_dir: data
  - early_stopping_patience: 5
  - subsample: 0.2
  - face_crop: false
  - augment: false
  - seed: 42
- [ ] Create `classifier/configs/phase1/p1_resnet18_baseline.json`:
  - backbone: resnet18
  - pretrained: true
  - epochs: 15
  - batch_size: 32
  - lr: 1e-4
  - weight_decay: 1e-4
  - image_size: 128
  - data_dir: data
  - early_stopping_patience: 5
  - subsample: 0.2
  - face_crop: false
  - augment: false
  - seed: 42
### 1.2 Code Updates

- [ ] Implement 5-fold stratified group cross-validation by basename in training pipeline
- [ ] Update `classifier/src/training/trainer.py` to support CV
- [ ] Update `classifier/src/evaluation/evaluate.py` to support CV
- [ ] Ensure all metrics report mean ± std and confidence intervals across folds
### 1.3 Training

- [ ] Train SimpleCNN with 5-fold stratified group CV (via pipeline: `python -m pipeline run classifier/configs/phase1/p1_simplecnn_baseline.json`)
- [ ] Train ResNet18 with 5-fold stratified group CV (via pipeline: `python -m pipeline run classifier/configs/phase1/p1_resnet18_baseline.json`)
- [ ] Save all checkpoints and metrics (pipeline automatically fetches outputs to `classifier/outputs/`)
### 1.4 Analysis

- [ ] Use `classifier/notebooks/03_phase1_analysis.ipynb` for Phase 1 analysis
- [ ] Compare SimpleCNN vs ResNet18 performance:
  - Overall metrics (AUC, Accuracy, F1) with mean ± std and confidence intervals
  - Per-source metrics (text2img, inpainting, insight)
  - Train/val/test performance curves
  - Confusion matrices
  - Statistical significance testing
- [ ] Generate Grad-CAM visualizations (10-20 images per model)
- [ ] Document conclusions: which baseline is better and why
## Phase 2: Preprocessing Impact

### 2.1 Shortcut Analysis (2A)

- [ ] Create `classifier/configs/phase2/p2a_t1_original.json`:
  - backbone: resnet18
  - image_size: 224
  - subsample: 0.2
  - seed: 42
  - augment: false
  - normalization: imagenet
  - data_dir: data
- [ ] Create `classifier/configs/phase2/p2a_t2_real_norm.json`:
  - extends: p2a_t1_original.json
  - normalization: real_norm
  - Normalization: calculate mean/std from real training images only within each fold
- [ ] Geometry diagnostic was explored and then removed from the codebase (`src/evaluation/geometry.py` no longer exists):
  - Current pipeline always square-crops before resize, reducing rectangle-vs-square shortcut risk.
  - Shortcut analysis now relies on normalization and held-out-source evidence artifacts.
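The square-crop-before-resize behavior mentioned above amounts to a center crop to the shorter side; a minimal sketch of the crop-box arithmetic (the real implementation lives in the preprocessing pipeline):

```python
def center_square_box(width: int, height: int) -> tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of the largest centered square.

    Cropping to this box before resizing removes the aspect-ratio cue
    that could otherwise act as a rectangle-vs-square shortcut.
    """
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side

box = center_square_box(640, 480)
# box == (80, 0, 560, 480): a 480x480 square, centered horizontally
```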
- [ ] Train the 2 shortcut configs with 5-fold stratified group CV
- [ ] Compare results:
  - Standard vs matched-geometry eval for `p2a_t1_original` (letterboxing impact)
  - `p2a_t1_original` vs `p2a_t2_real_norm` (color distribution shortcut)
- [ ] Create `classifier/configs/phase2/p2a_t3_holdout_text2img.json`:
  - extends: p2a_t1_original.json
  - train_sources: ["wiki", "inpainting", "insight"]
  - eval_sources: ["wiki", "inpainting", "insight", "text2img"]
- [ ] Create `classifier/configs/phase2/p2a_t3_holdout_inpainting.json`:
  - extends: p2a_t1_original.json
  - train_sources: ["wiki", "text2img", "insight"]
  - eval_sources: ["wiki", "text2img", "insight", "inpainting"]
- [ ] Create `classifier/configs/phase2/p2a_t3_holdout_insight.json`:
  - extends: p2a_t1_original.json
  - train_sources: ["wiki", "text2img", "inpainting"]
  - eval_sources: ["wiki", "text2img", "inpainting", "insight"]
- [ ] Train the 3 source holdout configs with 5-fold stratified group CV
- [ ] Compare held-out source performance vs in-source performance:
  - Calculate AUC for held-out source (text2img, inpainting, insight)
  - Compute Δ (in-source AUC - held-out AUC)
  - If Δ > 0.05-0.10, model is learning source-specific features
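The Δ computation reduces to a rank-based AUC per evaluation subset; a stdlib-only sketch with illustrative scores (a real run would presumably use e.g. `sklearn.metrics.roc_auc_score`):

```python
def auc(pos_scores, neg_scores):
    """Rank-based (Mann-Whitney) AUC: fraction of (fake, real) score
    pairs ranked correctly, with ties counted as half."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# Illustrative fake-vs-real scores for in-source and held-out-source data.
in_source_auc = auc([0.9, 0.8, 0.7], [0.2, 0.3])  # 1.0: perfect separation
held_out_auc = auc([0.6, 0.4], [0.5, 0.3])        # 0.75
delta = in_source_auc - held_out_auc
source_specific = delta > 0.05  # gap exceeds the 0.05-0.10 threshold
```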
### 2.2 Resolution Impact (2B)

- [ ] Create `classifier/configs/phase2/p2b_simplecnn_224.json`:
  - backbone: simple_cnn
  - image_size: 224
  - subsample: 0.2
  - augment: false
  - seed: 42
  - data_dir: data
- [ ] Create `classifier/configs/phase2/p2b_resnet18_224.json`:
  - backbone: resnet18
  - image_size: 224
  - subsample: 0.2
  - augment: false
  - seed: 42
  - data_dir: data
- [ ] Train both 224 configs with 5-fold stratified group CV
- [ ] Compare 128×128 vs 224×224 for each model
  - 128 baseline is `p1_*_baseline` (comparison mapping in notebook)
### 2.3 Facecrop Impact (2C)

- [ ] Create `classifier/configs/phase2/p2c_simplecnn_facecrop.json`:
  - backbone: simple_cnn
  - image_size: 224
  - subsample: 0.2
  - augment: false
  - seed: 42
  - data_dir: cropped/classifier
- [ ] Create `classifier/configs/phase2/p2c_resnet18_facecrop.json`:
  - backbone: resnet18
  - image_size: 224
  - subsample: 0.2
  - augment: false
  - seed: 42
  - data_dir: cropped/classifier
- [ ] Train both facecrop configs with 5-fold stratified group CV
- [ ] Compare `p2b_resnet18_224` (no facecrop) vs `p2c_resnet18_facecrop` for each model
  - No-facecrop baseline is `p2b_*_224` (comparison mapping in notebook)
### 2.4 Augmentation Impact (2D)

- [ ] Create `classifier/configs/phase2/p2d_simplecnn_aug.json`:
  - backbone: simple_cnn
  - image_size: 224
  - subsample: 0.2
  - seed: 42
  - augment: {hflip_p: 0.5, rotation_degrees: 10, brightness: 0.2, contrast: 0.2, saturation: 0.1, hue: 0.02, grayscale_p: 0.1, blur_p: 0.1, erase_p: 0.2, noise_p: 0.3, noise_std: 0.04}
  - data_dir: data
- [ ] Create `classifier/configs/phase2/p2d_resnet18_aug.json`:
  - backbone: resnet18
  - image_size: 224
  - subsample: 0.2
  - seed: 42
  - augment: {hflip_p: 0.5, rotation_degrees: 10, brightness: 0.2, contrast: 0.2, saturation: 0.1, hue: 0.02, grayscale_p: 0.1, blur_p: 0.1, erase_p: 0.2, noise_p: 0.3, noise_std: 0.04}
  - data_dir: data
- [ ] Train both augmentation configs with 5-fold stratified group CV
- [ ] Compare `p2b_resnet18_224` (no aug) vs `p2d_resnet18_aug` for each model
  - No-aug baseline is `p2b_*_224` (comparison mapping in notebook)
### 2.5 Augmentation + Facecrop (2E)

- [ ] Create `classifier/configs/phase2/p2e_simplecnn_facecrop_aug.json`:
  - backbone: simple_cnn
  - image_size: 224
  - subsample: 0.2
  - seed: 42
  - augment: {hflip_p: 0.5, rotation_degrees: 10, brightness: 0.2, contrast: 0.2, saturation: 0.1, hue: 0.02, grayscale_p: 0.1, blur_p: 0.1, erase_p: 0.2, noise_p: 0.3, noise_std: 0.04}
  - data_dir: cropped/classifier
- [ ] Create `classifier/configs/phase2/p2e_resnet18_facecrop_aug.json`:
  - backbone: resnet18
  - image_size: 224
  - subsample: 0.2
  - seed: 42
  - augment: {hflip_p: 0.5, rotation_degrees: 10, brightness: 0.2, contrast: 0.2, saturation: 0.1, hue: 0.02, grayscale_p: 0.1, blur_p: 0.1, erase_p: 0.2, noise_p: 0.3, noise_std: 0.04}
  - data_dir: cropped/classifier
- [ ] Train both facecrop+aug configs with 5-fold stratified group CV
- [ ] Compare `p2c_resnet18_facecrop` (facecrop only) vs `p2e_resnet18_facecrop_aug` for each model
  - Facecrop-only baseline is `p2c_*_facecrop` (comparison mapping in notebook)
### 2.6 Phase 2 Analysis

- [ ] Use `classifier/notebooks/04_phase2_analysis.ipynb` for Phase 2 analysis
- [ ] For each experiment (2A-2E):
  - Load 5-fold stratified group CV results (mean ± std and confidence intervals)
  - Generate overall metrics (AUC, Accuracy, F1)
  - Generate per-source metrics (text2img, inpainting, insight)
  - Calculate train/val gap
  - Calculate pairwise source AUC variance (wiki-vs-source AUC variance)
  - Statistical significance testing vs baseline
  - Generate comparison visualizations (bar charts, heatmaps)
- [ ] For 2A (Shortcut Analysis):
  - Compare original-test vs alternative geometry evidence if reintroduced in a dedicated tool/notebook
  - Compare ImageNet vs real-image-only normalization (color distribution shortcuts)
  - Load source holdout results (3 configs)
  - Calculate held-out source AUC vs in-source AUC for each holdout experiment
  - Compute Δ (in-source AUC - held-out AUC)
  - If Δ > 0.05-0.10, model is learning source-specific features
  - Generate source holdout comparison table
- [ ] For each model/condition:
  - Generate Grad-CAM visualizations (10-20 images per condition)
  - Organize by experiment, prediction type, and source
- [ ] Answer key questions:
  - Which preprocessing choices are statistically significant?
  - Do certain sources benefit more from specific preprocessing?
  - Is there an interaction between facecrop and augmentation?
  - Are shortcuts being learned (resolution, color distribution)?
  - Is the model learning source-specific features (source holdout)?
  - Does augmentation remove shortcuts or over-regularize?
  - What features do models focus on (based on Grad-CAM)?
- [ ] Generate comprehensive metrics comparison table
- [ ] Use paired fold-wise statistical tests for model comparisons, with bootstrap confidence intervals for key metrics where useful
- [ ] Provide evidence-based conclusions for each experiment
- [ ] Provide recommendations for Phase 3 (best preprocessing settings)
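Because comparable experiments share fold assignments, the paired fold-wise testing can be done exactly with a sign-flip permutation test; a stdlib-only sketch (the metric values are illustrative):

```python
from itertools import product
from statistics import mean

def paired_permutation_pvalue(metric_a, metric_b):
    """Exact two-sided sign-flip permutation test on paired fold metrics.

    Valid only when both models were evaluated on the same fold
    assignments; with 5 folds there are just 2**5 = 32 sign patterns,
    so enumerating them all is cheap.
    """
    diffs = [a - b for a, b in zip(metric_a, metric_b)]
    observed = abs(mean(diffs))
    hits = total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(mean(s * d for s, d in zip(signs, diffs))) >= observed:
            hits += 1
    return hits / total

# Illustrative fold-wise AUCs for two models sharing fold assignments.
p = paired_permutation_pvalue([0.91, 0.95, 0.90, 0.93, 0.92],
                              [0.88, 0.91, 0.89, 0.90, 0.90])
# all five differences are positive, so only the two all-same-sign
# patterns reach the observed mean: p = 2/32 = 0.0625
```

Note the floor this implies: with 5 folds the smallest attainable two-sided p-value is 2/32 ≈ 0.06, which is why the plan pairs these tests with bootstrap confidence intervals.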
## Phase 3: Extended Architecture Exploration

### 3.1 Experiment Configs

Use the best preprocessing choices from Phase 2. The placeholders below assume 224×224, face crop enabled, and no augmentation unless Phase 2 results justify different settings.
- [ ] Create `classifier/configs/phase3/p3_resnet34.json`:
  - backbone: resnet34
  - pretrained: true
  - epochs: 15
  - batch_size: 32
  - lr: 1e-4
  - weight_decay: 1e-4
  - image_size: 224
  - face_crop: true (or best from Phase 2C/E)
  - face_crop_margin: 0.6
  - augment: false (or best from Phase 2D/E)
  - subsample: 0.2
  - seed: 42
  - early_stopping_patience: 5
- [ ] Create `classifier/configs/phase3/p3_resnet50.json`:
  - backbone: resnet50
  - pretrained: true
  - epochs: 15
  - batch_size: 32
  - lr: 1e-4
  - weight_decay: 1e-4
  - image_size: 224
  - face_crop: true (or best from Phase 2C/E)
  - face_crop_margin: 0.6
  - augment: false (or best from Phase 2D/E)
  - subsample: 0.2
  - seed: 42
  - early_stopping_patience: 5
- [ ] Create `classifier/configs/phase3/p3_efficientnet_b0.json`:
  - backbone: efficientnet_b0
  - pretrained: true
  - epochs: 15
  - batch_size: 32
  - lr: 1e-4
  - weight_decay: 1e-4
  - image_size: 224
  - face_crop: true (or best from Phase 2C/E)
  - augment: false (or best from Phase 2D/E)
  - subsample: 0.2
  - seed: 42
  - early_stopping_patience: 5
- [ ] Create `classifier/configs/phase3/p3_convnext_tiny.json`:
  - backbone: convnext_tiny
  - pretrained: true
  - epochs: 15
  - batch_size: 32
  - lr: 1e-4
  - weight_decay: 1e-4
  - image_size: 224
  - face_crop: true (or best from Phase 2C/E)
  - augment: false (or best from Phase 2D/E)
  - subsample: 0.2
  - seed: 42
  - early_stopping_patience: 5
- [ ] Create `classifier/configs/phase3/p3_mobilenetv3_small.json`:
  - backbone: mobilenetv3_small
  - pretrained: true
  - epochs: 15
  - batch_size: 32
  - lr: 1e-4
  - weight_decay: 1e-4
  - image_size: 224
  - face_crop: true (or best from Phase 2C/E)
  - augment: false (or best from Phase 2D/E)
  - subsample: 0.2
  - seed: 42
  - early_stopping_patience: 5
### 3.2 Model Implementation

- [ ] Implement ConvNeXt-Tiny in `classifier/src/models/convnext.py`
- [ ] Implement MobileNetV3-Small in `classifier/src/models/mobilenet.py`
- [ ] Register both models in `classifier/src/models/__init__.py`

### 3.3 Training

- [ ] Train ResNet34 with 5-fold stratified group CV
- [ ] Train ResNet50 with 5-fold stratified group CV
- [ ] Train EfficientNet-B0 with 5-fold stratified group CV
- [ ] Train ConvNeXt-Tiny with 5-fold stratified group CV
- [ ] Train MobileNetV3-Small with 5-fold stratified group CV
- [ ] Save all checkpoints and metrics
### 3.4 Analysis

- [ ] Use `classifier/notebooks/05_phase3_analysis.ipynb` for Phase 3 analysis
- [ ] Load 5-fold stratified group CV results for all models (mean ± std and confidence intervals)
- [ ] Generate overall metrics for each model
- [ ] Generate per-source metrics for each model
- [ ] Compare with Phase 1 baselines (ResNet18, SimpleCNN)
- [ ] Statistical significance testing vs baselines
- [ ] Generate Grad-CAM visualizations for top models (10-20 images each)
- [ ] Parameter count vs performance analysis
- [ ] Conclusions: which architectures work best and why
## Phase 4: Final Analysis on Best Models

### 4.1 Select Top Models

- [ ] Based on Phases 1-3 results, select top 3-4 models
- [ ] Document selection criteria (e.g., top AUC, balanced performance, efficiency)

### 4.2 Data Quantity Scaling (4A)

- [ ] For each selected model, create configs for different data sizes:
  - `classifier/configs/phase4/p4a_<model>_20pct.json` (subsample: 0.2)
  - `classifier/configs/phase4/p4a_<model>_50pct.json` (subsample: 0.5)
  - `classifier/configs/phase4/p4a_<model>_100pct.json` (subsample: 1.0)
- [ ] In every 4A config, explicitly set the best Phase 2 preprocessing choices:
  - image_size: best from Phase 2B
  - face_crop: best from Phase 2C/E
  - augment: best from Phase 2D/E
- [ ] Train each model with 5-fold stratified group CV at all three data sizes
- [ ] Compare how each model scales with more data
### 4.3 Full Dataset Evaluation (4B)

- [ ] For each selected model, create config for full dataset:
  - `classifier/configs/phase4/p4b_<model>_full.json` (subsample: 1.0)
- [ ] In every 4B config, explicitly set the same best Phase 2 preprocessing choices used in 4A
- [ ] Train each model on full dataset with 5-fold stratified group CV
- [ ] Generate detailed per-source metrics
- [ ] Generate Grad-CAM visualizations (10-20 images each)
- [ ] Perform hard example analysis (false positives/negatives) with visualizations
- [ ] Generate confidence distribution histograms
- [ ] Report cross-validation results (mean ± std with confidence intervals)
### 4.4 Analysis

- [ ] Use `classifier/notebooks/06_phase4_analysis.ipynb` for Phase 4 analysis
- [ ] Load data quantity scaling results
- [ ] Load full dataset evaluation results
- [ ] Generate comprehensive metrics comparison table
- [ ] Generate per-source metrics for final models
- [ ] Generate Grad-CAM galleries for final models
- [ ] Perform hard example analysis with visualizations
- [ ] Generate confidence distribution histograms
- [ ] Final model comparison and selection
- [ ] Conclusions and recommendations

## Notebooks and Analysis

This section is the consolidated notebook checklist for the notebooks referenced in the phase sections above; do not create duplicate notebooks for the same phase.
### 5.1 Exploratory Data Analysis

- [ ] Create `classifier/notebooks/01_eda.ipynb`
- [ ] Dataset overview (real vs fake distribution, sources)
- [ ] Image resolution/aspect ratio analysis (identify potential shortcuts)
- [ ] Color distribution analysis (identify potential shortcuts)
- [ ] Sample visualization from each source
- [ ] Statistical summary of the dataset
- [ ] Data quality checks

### 5.2 Preprocessing Pipeline

- [ ] Create `classifier/notebooks/02_preprocessing.ipynb`
- [ ] Square crop and resize implementation demonstration
- [ ] Face crop (MTCNN) demonstration and effectiveness analysis
- [ ] Augmentation pipeline visualization (before/after examples)
- [ ] Z-score normalization comparison (ImageNet vs real-image-only)
- [ ] Data split verification (group-aware by basename, no overlap)
- [ ] Preprocessing impact visualization
### 5.3 Phase 1 Analysis

- [ ] Create `classifier/notebooks/03_phase1_analysis.ipynb`
- [ ] Load Phase 1 training results
- [ ] Generate 5-fold stratified group CV results (mean ± std with confidence intervals)
- [ ] Generate per-source metrics for each model
- [ ] Generate train/val/test performance curves
- [ ] Generate confusion matrices
- [ ] Perform statistical significance testing between models
- [ ] Generate Grad-CAM visualizations (10-20 images each)
- [ ] Document conclusions: which baseline is better and why
### 5.4 Phase 2 Analysis

- [ ] Create `classifier/notebooks/04_phase2_analysis.ipynb`
- [ ] Load all Phase 2 experiment results
- [ ] For each experiment (2A-2E):
  - Generate 5-fold stratified group CV results (mean ± std with confidence intervals)
  - Generate overall metrics
  - Generate per-source metrics
  - Calculate train/val gap
  - Calculate pairwise source AUC variance (wiki-vs-source AUC variance)
  - Perform statistical significance testing
- [ ] Generate comparison tables across all Phase 2 experiments
- [ ] Generate comparison visualizations (bar charts, heatmaps)
- [ ] For each model/condition, generate Grad-CAM visualizations (10-20 images)
- [ ] Organize visualizations by experiment, model, prediction type, and source
- [ ] Answer key analysis questions
- [ ] Generate comprehensive metrics comparison table
- [ ] Provide evidence-based conclusions for each experiment
- [ ] Provide recommendations for Phase 3
### 5.5 Phase 3 Analysis

- [ ] Create `classifier/notebooks/05_phase3_analysis.ipynb`
- [ ] Load Phase 3 training results
- [ ] Generate 5-fold stratified group CV results for each model (mean ± std with confidence intervals)
- [ ] Generate per-source metrics for each model
- [ ] Compare with Phase 1 baselines (ResNet18, SimpleCNN)
- [ ] Perform statistical significance testing vs baselines
- [ ] Generate Grad-CAM visualizations for top models (10-20 images each)
- [ ] Parameter count vs performance analysis
- [ ] Conclusions: which architectures work best and why

### 5.6 Phase 4 Analysis

- [ ] Create `classifier/notebooks/06_phase4_analysis.ipynb`
- [ ] Load data quantity scaling results
- [ ] Load full dataset evaluation results
- [ ] Generate comprehensive metrics comparison table
- [ ] Generate per-source metrics for final models
- [ ] Generate Grad-CAM galleries for final models
- [ ] Perform hard example analysis with visualizations
- [ ] Generate confidence distribution histograms
- [ ] Final model comparison and selection
- [ ] Conclusions and recommendations
### 5.7 Grad-CAM Deep Dive (Optional)

- [ ] Create `classifier/notebooks/07_gradcam_deep_dive.ipynb`
- [ ] Load Grad-CAM results from all phases
- [ ] Comprehensive Grad-CAM analysis across all phases and models
- [ ] Feature visualization for different model architectures
- [ ] CNN vs EfficientNet vs ConvNeXt comparison:
  - What regions do different architectures focus on?
  - Are there systematic differences in attention patterns?
- [ ] Evidence of shortcut removal analysis across phases
- [ ] Temporal analysis: does model attention change with different preprocessing?
- [ ] Generate visual explanations suitable for presentation
## Code Implementation Tasks

### Cross-Validation Implementation

- [ ] Update `classifier/src/training/trainer.py` to support 5-fold stratified group CV by basename
- [ ] Update `classifier/src/evaluation/evaluate.py` to support grouped CV splits
- [ ] Implement metric aggregation across folds (mean ± std)
- [ ] Ensure all metrics report confidence intervals
- [ ] Reuse the same fold assignments for comparable experiments so paired statistical tests are valid
- [ ] Rename `classifier/run_cv.py` to `classifier/run.py` (pipeline expects `classifier/run.py`)
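Fold aggregation with a bootstrap confidence interval might look like this sketch (`aggregate_folds` is a hypothetical helper name; the fold values are illustrative):

```python
import random
from statistics import mean, stdev

def aggregate_folds(values, n_boot=2000, alpha=0.05, seed=42):
    """Mean ± std across folds plus a percentile bootstrap CI.

    Resamples the fold values with replacement; with only 5 folds the
    CI is coarse, which is why it complements, rather than replaces,
    the paired fold-wise significance tests.
    """
    rng = random.Random(seed)
    boots = sorted(mean(rng.choices(values, k=len(values))) for _ in range(n_boot))
    lo = boots[int(alpha / 2 * n_boot)]
    hi = boots[int((1 - alpha / 2) * n_boot) - 1]
    return {"mean": mean(values), "std": stdev(values), "ci95": (lo, hi)}

stats = aggregate_folds([0.91, 0.93, 0.92, 0.94, 0.90])  # illustrative fold AUCs
```

Seeding the bootstrap keeps the reported interval reproducible across reruns, matching the plan's fixed-seed policy.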
### Model Implementations

- [ ] Implement ConvNeXt-Tiny in `classifier/src/models/convnext.py`
- [ ] Implement MobileNetV3-Small in `classifier/src/models/mobilenet.py`
- [ ] Register both models in `classifier/src/models/__init__.py`

### Normalization Implementation

- [ ] Implement function to calculate mean/std from real training images only
- [ ] Update `classifier/src/preprocessing/pipeline.py` to support custom normalization stats
- [ ] Test ImageNet normalization vs real-image-only normalization
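Computing normalization stats from real images only might be sketched as follows (assumes images as a NumPy array in [0, 1] with label 0 = real; the actual pipeline presumably streams batches instead of holding all images in memory):

```python
import numpy as np

def real_image_stats(images: np.ndarray, labels: np.ndarray):
    """Per-channel mean/std computed from real training images only.

    Using only real images (label 0) keeps the generators' color
    statistics out of the normalization constants, so a color-
    distribution shift in fakes remains visible to the model.
    """
    real = images[labels == 0]  # (N_real, H, W, 3)
    return real.mean(axis=(0, 1, 2)), real.std(axis=(0, 1, 2))

# Illustrative batch: 8 tiny RGB images in [0, 1], first 4 are real.
images = np.random.default_rng(0).random((8, 4, 4, 3))
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
mean, std = real_image_stats(images, labels)  # each has shape (3,)
```

Per the 2A protocol, these stats would be recomputed from the training portion of each fold so the test fold never influences normalization.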
### Evaluation Improvements

- [ ] Ensure test set uses `train=False` to disable augmentation
- [ ] Ensure diagnostic evaluation transforms never change the training data
- [ ] Verify CV fold assignments are identical across comparable experiments (same seed and basename grouping)
- [ ] Implement per-source metrics with detection rate and false alarm rate
- [ ] Implement pairwise AUC calculations
- [ ] Implement train/val gap calculations
- [ ] Implement pairwise source AUC variance calculations
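Per-source detection rate (TPR on fakes) and false alarm rate (FPR on reals) can be sketched as follows (`per_source_rates` is a hypothetical helper; label 1 = fake is an assumption):

```python
from collections import defaultdict

def per_source_rates(labels, preds, sources):
    """Detection rate and false alarm rate per source tag.

    Detection rate = TP / (TP + FN) over fake samples of that source;
    false alarm rate = FP / (FP + TN) over real samples. A source with
    no reals (or no fakes) gets None for the undefined rate.
    """
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
    for y, p, s in zip(labels, preds, sources):
        if y == 1:
            counts[s]["tp" if p == 1 else "fn"] += 1
        else:
            counts[s]["fp" if p == 1 else "tn"] += 1
    rates = {}
    for s, c in counts.items():
        fakes, reals = c["tp"] + c["fn"], c["fp"] + c["tn"]
        rates[s] = {
            "detection_rate": c["tp"] / fakes if fakes else None,
            "false_alarm_rate": c["fp"] / reals if reals else None,
        }
    return rates

rates = per_source_rates(
    labels=[1, 1, 0, 0], preds=[1, 0, 0, 1],
    sources=["text2img", "text2img", "wiki", "wiki"],
)
# text2img: detection_rate 0.5; wiki: false_alarm_rate 0.5
```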
### Grad-CAM Improvements

- [ ] Ensure Grad-CAM works for all model types (CNN-based)
- [ ] Implement Grad-CAM for ConvNeXt
- [ ] Implement Grad-CAM for MobileNetV3
- [ ] Organize Grad-CAM outputs by experiment, model, prediction type, source
### Final Report Preparation

- [ ] Compile results from all phases
- [ ] Create presentation slides (PDF format):
  - Brief description of deep learning solutions (discriminative + generative)
  - Description of implementation steps and improvements
    - Motivate choices for architecture, training strategy, etc.
    - Show intermediate results
    - Interpret results and what changed
    - What was decided to improve results
  - Classification performance results
    - Experimental setup
    - Train/val/test splits
    - Performance metrics chosen
  - Data generation performance results
    - Experimental setup
    - Performance metrics chosen
  - Discussion and conclusions
    - Comments on performance
    - Final remarks
- [ ] Fill auto-evaluation file
## Summary

Total tasks: ~150+

This implementation plan covers:

- ✅ All 4 phases with comprehensive experiments
- ✅ 5-fold stratified group cross-validation for all experiments
- ✅ 7 analysis notebooks for robust validation
- ✅ Shortcut analysis (resolution/ratio + color distribution + source holdout)
- ✅ Source holdout experiments to detect source-specific feature learning
- ✅ Grad-CAM visualizations for explainability
- ✅ Statistical analysis with confidence intervals
- ✅ Per-source metrics for all experiments
- ✅ Data quantity scaling analysis
- ✅ Full dataset evaluation on best models
- ✅ Comprehensive documentation and reporting

Key features:

- Reproducible experiments with fixed seeds
- Stratified group CV keeps basename groups together while balancing class distribution
- Multiple shortcut analyses to prevent model cheating (resolution, color, source-specific)
- Source holdout experiments to test generalization to unseen sources
- Grad-CAM for explainability
- Statistical rigor with confidence intervals
- Per-source analysis to understand model behavior
- Clear progression from baselines -> preprocessing -> architectures -> final evaluation