DRL_PROJ/docs/classifier_impl.md
Johnny Fernandes bb3dfb92d5 Clean state
2026-04-30 01:25:39 +01:00
# Deepfake Detection Classifier - Implementation Plan
## Overview
This document provides a comprehensive implementation plan for refactoring the deepfake detection classifier project. Each task includes a checkbox to track completion.
---
## Phase 0: Pre-Implementation Setup
### Infrastructure and Configuration
- [x] Create `classifier/configs/shared.json` with shared parameters:
- seed: 42
- val_ratio: 0.1
- test_ratio: 0.1
- batch_size: 32
- optimizer: {type: "adamw", lr: 1e-4, weight_decay: 1e-4}
- scheduler: {type: "cosine_annealing", T_max: 15}
- early_stopping_patience: 5
- num_workers: 4
- cv_folds: 5
- data_dir: "data"
- face_crop_margin: 0.6
- [x] Implement config loading/merging so experiment configs inherit `shared.json` defaults and override only the variables under test
- [x] Resolve shared nested fields such as `optimizer.lr`, `optimizer.weight_decay`, and `scheduler.T_max` into the training arguments used by the runner
- [x] Update existing configs to reference `shared.json` or otherwise document which shared defaults they intentionally override
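The inheritance scheme above can be sketched as a recursive dict merge; `deep_merge` and `load_config` are illustrative names, not the project's actual API:

```python
import json
from pathlib import Path


def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base, returning a new dict.

    Nested dicts (e.g. optimizer, scheduler) are merged key by key so an
    experiment config can override optimizer.lr without losing
    optimizer.weight_decay from shared.json.
    """
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def load_config(path: str, shared_path: str = "classifier/configs/shared.json") -> dict:
    """Load an experiment config on top of the shared defaults."""
    shared = json.loads(Path(shared_path).read_text())
    experiment = json.loads(Path(path).read_text())
    return deep_merge(shared, experiment)
```

An experiment config then only needs to state the variables under test; everything else falls through to `shared.json`.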
- [x] Define one CV protocol for all phases:
- outer fold: held-out test fold
- inner validation split: group-aware split from the remaining training folds for early stopping/model selection
- final reported metrics: aggregate held-out test-fold results across the 5 outer folds
### Data Preparation
- [x] Verify dataset structure and integrity
- [x] Check that real and fake images are properly organized by source
- [x] Verify no data leakage between train/val/test splits or CV folds (group-aware by basename)
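A minimal leakage check, assuming the group key is the file's basename without extension; `check_no_group_leakage` is an illustrative helper, not existing project code:

```python
from pathlib import Path


def check_no_group_leakage(split_files: dict) -> None:
    """Raise if any basename group appears in more than one split.

    split_files maps split names ('train', 'val', 'test') to lists of file
    paths; the group key is the path's stem, so real/fake pairs sharing a
    basename must land in the same split.
    """
    seen: dict = {}
    for split, paths in split_files.items():
        for p in paths:
            stem = Path(p).stem
            prev = seen.setdefault(stem, split)
            if prev != split:
                raise ValueError(f"Group {stem!r} leaks across {prev} and {split}")
```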
### Cleanup
- [x] Remove `classifier/tools/ensemble.py` (not part of reorganization plan, conflicts with explainability goals)
- [x] Remove robustness evaluation from `classifier/tools/analyze.py` (lines 51-104, 82-104, 144) - not part of experimental plan
- [x] Remove any unused or obsolete config files from previous experiments (see detailed list below)
- [x] Clean up old output directories if needed (keep important results for reference)
#### Config Files to Remove (39 total)
**Root configs (6):**
- [x] `classifier/configs/resnet18_quick.json`
- [x] `classifier/configs/resnet18.json`
- [x] `classifier/configs/simple_cnn_large.json`
- [x] `classifier/configs/simple_cnn_micro.json`
- [x] `classifier/configs/simple_cnn_small.json`
- [x] `classifier/configs/simple_cnn.json`
**Phase 1 old configs (7):**
- [x] `classifier/configs/phase1/p1_cnn_base.json` (uses lr=1e-3, epochs=20 - should be 1e-4, 15)
- [x] `classifier/configs/phase1/p1_cnn_aug.json`
- [x] `classifier/configs/phase1/p1_resnet18_base.json` (duplicate of new baseline)
- [x] `classifier/configs/phase1/p1_resnet18_aug.json`
- [x] `classifier/configs/phase1/holdout/` (entire directory - 6 configs, source holdout not in new plan)
**Phase 2 old configs (7):**
- [x] `classifier/configs/phase2/p2_resnet18_224.json` (should be p2a_resnet18_224.json)
- [x] `classifier/configs/phase2/p2_resnet18_facecrop.json` (should be p2b_resnet18_facecrop.json)
- [x] `classifier/configs/phase2/p2_resnet18_frozen.json` (frozen backbone not in new plan)
- [x] `classifier/configs/phase2/p2_resnet34_224.json` (ResNet34 should be in Phase 3)
- [x] `classifier/configs/phase2/p2_resnet34.json` (ResNet34 should be in Phase 3)
- [x] `classifier/configs/phase2/p2_resnet50_frozen.json` (ResNet50 should be in Phase 3)
- [x] `classifier/configs/phase2/p2_resnet50.json` (ResNet50 should be in Phase 3)
**Phase 3 old configs (4):**
- [x] `classifier/configs/phase3/p3_efficientnet_b2.json` (EfficientNet-B2 not in new plan, only B0)
- [x] `classifier/configs/phase3/p3_resnet18_facecrop_full.json` (ResNet18 full dataset should be Phase 4)
- [x] `classifier/configs/phase3/p3_resnet18_freqaug.json` (frequency augmentation not in new plan)
- [x] `classifier/configs/phase3/p3_vit_b16.json` (ViT not in new plan, replaced with ConvNeXt/MobileNet)
- Note: `p3_efficientnet_b0.json` - REMOVED (will be recreated after Phase 2 with correct settings)
**Source holdout (6):**
- [x] `classifier/configs/source_holdout/` (entire directory - 6 configs, source holdout not in new plan)
**Ablation (3):**
- [x] `classifier/configs/ablation/` (entire directory - 3 configs, ablation studies not in new plan)
**Configs to KEEP (3):**
- `classifier/configs/shared.json`
- `classifier/configs/phase1/p1_simplecnn_baseline.json`
- `classifier/configs/phase1/p1_resnet18_baseline.json`
**Phase 2 alias configs removed (8):**
- [x] `classifier/configs/phase2/p2b_resnet18_128.json` (alias for p1_resnet18_baseline)
- [x] `classifier/configs/phase2/p2b_simplecnn_128.json` (alias for p1_simplecnn_baseline)
- [x] `classifier/configs/phase2/p2c_resnet18_nofacecrop.json` (alias for p2b_resnet18_224)
- [x] `classifier/configs/phase2/p2c_simplecnn_nofacecrop.json` (alias for p2b_simplecnn_224)
- [x] `classifier/configs/phase2/p2d_resnet18_noaug.json` (alias for p2b_resnet18_224)
- [x] `classifier/configs/phase2/p2d_simplecnn_noaug.json` (alias for p2b_simplecnn_224)
- [x] `classifier/configs/phase2/p2e_resnet18_facecrop_only.json` (alias for p2c_resnet18_facecrop)
- [x] `classifier/configs/phase2/p2e_simplecnn_facecrop_only.json` (alias for p2c_simplecnn_facecrop)
Note: Comparison pairs (baseline vs treatment) are defined in the analysis notebook as a mapping dict, not as separate config files.
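For illustration, such a mapping dict might pair each treatment with its baseline; the keys below are hypothetical, but the config names and baselines follow the Phase 2 comparison notes in this plan:

```python
# Hypothetical comparison mapping as it might appear in the analysis notebook:
# experiment label -> (baseline config, treatment config).
COMPARISONS = {
    "2B_resolution":   ("p1_resnet18_baseline",  "p2b_resnet18_224"),
    "2C_facecrop":     ("p2b_resnet18_224",      "p2c_resnet18_facecrop"),
    "2D_augmentation": ("p2b_resnet18_224",      "p2d_resnet18_aug"),
    "2E_facecrop_aug": ("p2c_resnet18_facecrop", "p2e_resnet18_facecrop_aug"),
}
```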
---
## Phase 1: Architecture Baseline
### 1.1 Experiment Configs
- [x] Create `classifier/configs/phase1/p1_simplecnn_baseline.json`
- backbone: simple_cnn
- cnn_preset: medium
- dropout: 0.0
- epochs: 15
- batch_size: 32
- lr: 1e-4 (consistent with ResNet)
- weight_decay: 1e-4
- image_size: 128
- data_dir: data
- early_stopping_patience: 5
- subsample: 0.2
- face_crop: false
- augment: false
- seed: 42
- [x] Create `classifier/configs/phase1/p1_resnet18_baseline.json`
- backbone: resnet18
- pretrained: true
- epochs: 15
- batch_size: 32
- lr: 1e-4
- weight_decay: 1e-4
- image_size: 128
- data_dir: data
- early_stopping_patience: 5
- subsample: 0.2
- face_crop: false
- augment: false
- seed: 42
### 1.2 Code Updates
- [x] Implement 5-fold stratified group cross-validation by basename in training pipeline
- [x] Update `classifier/src/training/trainer.py` to support CV
- [x] Update `classifier/src/evaluation/evaluate.py` to support CV
- [x] Ensure all metrics report mean ± std and confidence intervals across folds
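One way to aggregate per-fold metrics into mean ± std with a t-based confidence interval; the hardcoded t-table is an assumption for small fold counts (`scipy.stats.t.ppf(0.975, df)` would be the robust choice in practice):

```python
import math
from statistics import mean, stdev

# Two-sided 95% t critical values for df = n_folds - 1 (assumed lookup table).
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571}


def summarize_folds(values):
    """Aggregate one metric's per-fold values into mean, std, and a 95% CI."""
    n = len(values)
    m = mean(values)
    s = stdev(values)
    half = T_CRIT_95[n - 1] * s / math.sqrt(n)
    return {"mean": m, "std": s, "ci95": (m - half, m + half)}
```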
### 1.3 Training
- [x] Train SimpleCNN with 5-fold stratified group CV (via pipeline: `python -m pipeline run classifier/configs/phase1/p1_simplecnn_baseline.json`)
- [x] Train ResNet18 with 5-fold stratified group CV (via pipeline: `python -m pipeline run classifier/configs/phase1/p1_resnet18_baseline.json`)
- [x] Save all checkpoints and metrics (pipeline automatically fetches outputs to classifier/outputs/)
### 1.4 Analysis
- [x] Use `classifier/notebooks/03_phase1_analysis.ipynb` for Phase 1 analysis
- [x] Compare SimpleCNN vs ResNet18 performance
- [x] Overall metrics (AUC, Accuracy, F1) with mean ± std and confidence intervals
- [x] Per-source metrics (text2img, inpainting, insight)
- [x] Train/val/test performance curves
- [x] Confusion matrices
- [x] Statistical significance testing
- [x] Generate Grad-CAM visualizations (10-20 images per model)
- [x] Document conclusions: Which baseline is better and why
---
## Phase 2: Preprocessing Impact
### 2.1 Shortcut Analysis (2A)
- [x] Create `classifier/configs/phase2/p2a_t1_original.json`
- backbone: resnet18
- image_size: 224
- subsample: 0.2
- seed: 42
- augment: false
- normalization: imagenet
- data_dir: data
- [x] Create `classifier/configs/phase2/p2a_t2_real_norm.json`
- extends: p2a_t1_original.json
- normalization: real_norm
- **Normalization**: Calculate mean/std from real training images only within each fold
- [x] Geometry diagnostic was explored and then removed from the codebase (`src/evaluation/geometry.py` no longer exists):
- Current pipeline always square-crops before resize, reducing rectangle-vs-square shortcut risk.
- Shortcut analysis now relies on normalization and held-out-source evidence artifacts.
- [ ] Train the 2 shortcut configs with 5-fold stratified group CV
- [ ] Compare results:
- Standard vs matched-geometry eval for `p2a_t1_original` (letterboxing impact; only applicable if the geometry diagnostic is reintroduced)
- `p2a_t1_original` vs `p2a_t2_real_norm` (color distribution shortcut)
- [x] Create `classifier/configs/phase2/p2a_t3_holdout_text2img.json`
- extends: p2a_t1_original.json
- train_sources: ["wiki", "inpainting", "insight"]
- eval_sources: ["wiki", "inpainting", "insight", "text2img"]
- [x] Create `classifier/configs/phase2/p2a_t3_holdout_inpainting.json`
- extends: p2a_t1_original.json
- train_sources: ["wiki", "text2img", "insight"]
- eval_sources: ["wiki", "text2img", "insight", "inpainting"]
- [x] Create `classifier/configs/phase2/p2a_t3_holdout_insight.json`
- extends: p2a_t1_original.json
- train_sources: ["wiki", "text2img", "inpainting"]
- eval_sources: ["wiki", "text2img", "inpainting", "insight"]
- [ ] Train the 3 source holdout configs with 5-fold stratified group CV
- [ ] Compare held-out source performance vs in-source performance:
- Calculate AUC for held-out source (text2img, inpainting, insight)
- Compute Δ (in-source AUC - held-out AUC)
- If Δ > 0.05-0.10, model is learning source-specific features
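The Δ rule above can be captured in a small helper (hypothetical name; the threshold is configurable within the 0.05-0.10 band):

```python
def source_holdout_delta(in_source_auc: float, held_out_auc: float,
                         threshold: float = 0.05) -> dict:
    """Compare in-source vs held-out-source AUC for a source-holdout run.

    A gap above the threshold suggests the model relies on source-specific
    features rather than generalizable ones.
    """
    delta = in_source_auc - held_out_auc
    return {"delta": delta, "source_specific": delta > threshold}
```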
### 2.2 Resolution Impact (2B)
- [x] Create `classifier/configs/phase2/p2b_simplecnn_224.json`
- backbone: simple_cnn
- image_size: 224
- subsample: 0.2
- augment: false
- seed: 42
- data_dir: data
- [x] Create `classifier/configs/phase2/p2b_resnet18_224.json`
- backbone: resnet18
- image_size: 224
- subsample: 0.2
- augment: false
- seed: 42
- data_dir: data
- [ ] Train both 224 configs with 5-fold stratified group CV
- [ ] Compare 128×128 vs 224×224 for each model
- 128 baseline is `p1_*_baseline` (comparison mapping in notebook)
### 2.3 Facecrop Impact (2C)
- [x] Create `classifier/configs/phase2/p2c_simplecnn_facecrop.json`
- backbone: simple_cnn
- image_size: 224
- subsample: 0.2
- augment: false
- seed: 42
- data_dir: cropped/classifier
- [x] Create `classifier/configs/phase2/p2c_resnet18_facecrop.json`
- backbone: resnet18
- image_size: 224
- subsample: 0.2
- augment: false
- seed: 42
- data_dir: cropped/classifier
- [ ] Train both facecrop configs with 5-fold stratified group CV
- [ ] Compare `p2b_*_224` (no facecrop) vs `p2c_*_facecrop` for each model
- No-facecrop baseline is `p2b_*_224` (comparison mapping in notebook)
### 2.4 Augmentation Impact (2D)
- [x] Create `classifier/configs/phase2/p2d_simplecnn_aug.json`
- backbone: simple_cnn
- image_size: 224
- subsample: 0.2
- seed: 42
- augment: {hflip_p: 0.5, rotation_degrees: 10, brightness: 0.2, contrast: 0.2, saturation: 0.1, hue: 0.02, grayscale_p: 0.1, blur_p: 0.1, erase_p: 0.2, noise_p: 0.3, noise_std: 0.04}
- data_dir: data
- [x] Create `classifier/configs/phase2/p2d_resnet18_aug.json`
- backbone: resnet18
- image_size: 224
- subsample: 0.2
- seed: 42
- augment: {hflip_p: 0.5, rotation_degrees: 10, brightness: 0.2, contrast: 0.2, saturation: 0.1, hue: 0.02, grayscale_p: 0.1, blur_p: 0.1, erase_p: 0.2, noise_p: 0.3, noise_std: 0.04}
- data_dir: data
- [ ] Train both augmentation configs with 5-fold stratified group CV
- [ ] Compare `p2b_*_224` (no aug) vs `p2d_*_aug` for each model
- No-aug baseline is `p2b_*_224` (comparison mapping in notebook)
### 2.5 Augmentation + Facecrop (2E)
- [x] Create `classifier/configs/phase2/p2e_simplecnn_facecrop_aug.json`
- backbone: simple_cnn
- image_size: 224
- subsample: 0.2
- seed: 42
- augment: {hflip_p: 0.5, rotation_degrees: 10, brightness: 0.2, contrast: 0.2, saturation: 0.1, hue: 0.02, grayscale_p: 0.1, blur_p: 0.1, erase_p: 0.2, noise_p: 0.3, noise_std: 0.04}
- data_dir: cropped/classifier
- [x] Create `classifier/configs/phase2/p2e_resnet18_facecrop_aug.json`
- backbone: resnet18
- image_size: 224
- subsample: 0.2
- seed: 42
- augment: {hflip_p: 0.5, rotation_degrees: 10, brightness: 0.2, contrast: 0.2, saturation: 0.1, hue: 0.02, grayscale_p: 0.1, blur_p: 0.1, erase_p: 0.2, noise_p: 0.3, noise_std: 0.04}
- data_dir: cropped/classifier
- [ ] Train both facecrop+aug configs with 5-fold stratified group CV
- [ ] Compare `p2c_*_facecrop` (facecrop only) vs `p2e_*_facecrop_aug` for each model
- Facecrop-only baseline is `p2c_*_facecrop` (comparison mapping in notebook)
### 2.6 Phase 2 Analysis
- [ ] Use `classifier/notebooks/04_phase2_analysis.ipynb` for Phase 2 analysis
- [ ] For each experiment (2A-2E):
- [ ] Load 5-fold stratified group CV results (mean ± std and confidence intervals)
- [ ] Generate overall metrics (AUC, Accuracy, F1)
- [ ] Generate per-source metrics (text2img, inpainting, insight)
- [ ] Calculate train/val gap
- [ ] Calculate pairwise source AUC variance (wiki-vs-source AUC variance)
- [ ] Statistical significance testing vs baseline
- [ ] Generate comparison visualizations (bar charts, heatmaps)
- [ ] For 2A (Shortcut Analysis):
- [ ] Compare original-test vs alternative geometry evidence if reintroduced in a dedicated tool/notebook
- [ ] Compare ImageNet vs real-image-only normalization (color distribution shortcuts)
- [ ] Load source holdout results (3 configs)
- [ ] Calculate held-out source AUC vs in-source AUC for each holdout experiment
- [ ] Compute Δ (in-source AUC - held-out AUC)
- [ ] If Δ > 0.05-0.10, model is learning source-specific features
- [ ] Generate source holdout comparison table
- [ ] For each model/condition:
- [ ] Generate Grad-CAM visualizations (10-20 images per condition)
- [ ] Organize by experiment, prediction type, and source
- [ ] Answer key questions:
- [ ] Which preprocessing choices are statistically significant?
- [ ] Do certain sources benefit more from specific preprocessing?
- [ ] Is there an interaction between facecrop and augmentation?
- [ ] Are shortcuts being learned (resolution, color distribution)?
- [ ] Is the model learning source-specific features (source holdout)?
- [ ] Does augmentation remove shortcuts or over-regularize?
- [ ] What features do models focus on (based on Grad-CAM)?
- [ ] Generate comprehensive metrics comparison table
- [ ] Use paired fold-wise statistical tests for model comparisons, with bootstrap confidence intervals for key metrics where useful
- [ ] Provide evidence-based conclusions for each experiment
- [ ] Provide recommendations for Phase 3 (best preprocessing settings)
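A sketch of the paired fold-wise test with a bootstrap confidence interval, assuming both runs share identical fold assignments (so fold-wise differences are paired) and SciPy is available:

```python
import numpy as np
from scipy import stats


def paired_fold_test(metric_a, metric_b, n_boot=10_000, seed=42):
    """Paired t-test on per-fold metric differences plus a bootstrap 95% CI.

    metric_a and metric_b are per-fold values (e.g. test AUC) from two runs
    trained on the same fold assignments.
    """
    a, b = np.asarray(metric_a), np.asarray(metric_b)
    t_stat, p_value = stats.ttest_rel(a, b)
    diffs = a - b
    # Percentile bootstrap over the fold-wise differences.
    rng = np.random.default_rng(seed)
    boots = rng.choice(diffs, size=(n_boot, len(diffs)), replace=True).mean(axis=1)
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return {"mean_diff": float(diffs.mean()), "p_value": float(p_value),
            "ci95": (float(lo), float(hi))}
```

With only five folds the t-test is low-powered, so the bootstrap CI is the more informative of the two outputs.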
---
## Phase 3: Extended Architecture Exploration
### 3.1 Experiment Configs
Use the best preprocessing choices from Phase 2. The placeholders below assume 224×224, face crop enabled, and no augmentation unless Phase 2 results justify different settings.
- [ ] Create `classifier/configs/phase3/p3_resnet34.json`
- backbone: resnet34
- pretrained: true
- epochs: 15
- batch_size: 32
- lr: 1e-4
- weight_decay: 1e-4
- image_size: 224
- face_crop: true (or best from Phase 2C/E)
- face_crop_margin: 0.6
- augment: false (or best from Phase 2D/E)
- subsample: 0.2
- seed: 42
- early_stopping_patience: 5
- [ ] Create `classifier/configs/phase3/p3_resnet50.json`
- backbone: resnet50
- pretrained: true
- epochs: 15
- batch_size: 32
- lr: 1e-4
- weight_decay: 1e-4
- image_size: 224
- face_crop: true (or best from Phase 2C/E)
- face_crop_margin: 0.6
- augment: false (or best from Phase 2D/E)
- subsample: 0.2
- seed: 42
- early_stopping_patience: 5
- [ ] Create `classifier/configs/phase3/p3_efficientnet_b0.json`
- backbone: efficientnet_b0
- pretrained: true
- epochs: 15
- batch_size: 32
- lr: 1e-4
- weight_decay: 1e-4
- image_size: 224
- face_crop: true (or best from Phase 2C/E)
- augment: false (or best from Phase 2D/E)
- subsample: 0.2
- seed: 42
- early_stopping_patience: 5
- [ ] Create `classifier/configs/phase3/p3_convnext_tiny.json`
- backbone: convnext_tiny
- pretrained: true
- epochs: 15
- batch_size: 32
- lr: 1e-4
- weight_decay: 1e-4
- image_size: 224
- face_crop: true (or best from Phase 2C/E)
- augment: false (or best from Phase 2D/E)
- subsample: 0.2
- seed: 42
- early_stopping_patience: 5
- [ ] Create `classifier/configs/phase3/p3_mobilenetv3_small.json`
- backbone: mobilenetv3_small
- pretrained: true
- epochs: 15
- batch_size: 32
- lr: 1e-4
- weight_decay: 1e-4
- image_size: 224
- face_crop: true (or best from Phase 2C/E)
- augment: false (or best from Phase 2D/E)
- subsample: 0.2
- seed: 42
- early_stopping_patience: 5
### 3.2 Model Implementation
- [ ] Implement ConvNeXt-Tiny in `classifier/src/models/convnext.py`
- [ ] Implement MobileNetV3-Small in `classifier/src/models/mobilenet.py`
- [ ] Register both models in `classifier/src/models/__init__.py`
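Registration could follow a decorator-based registry; the registry names are hypothetical, and the head-replacement indices assume torchvision's ConvNeXt-Tiny and MobileNetV3-Small definitions:

```python
# Hypothetical registry pattern for classifier/src/models/__init__.py.
MODEL_REGISTRY = {}


def register_model(name):
    """Decorator that adds a model constructor to the registry."""
    def wrap(fn):
        MODEL_REGISTRY[name] = fn
        return fn
    return wrap


@register_model("convnext_tiny")
def build_convnext_tiny(pretrained=True, num_classes=2):
    from torchvision.models import ConvNeXt_Tiny_Weights, convnext_tiny
    import torch.nn as nn
    weights = ConvNeXt_Tiny_Weights.DEFAULT if pretrained else None
    model = convnext_tiny(weights=weights)
    # torchvision's ConvNeXt head: classifier[2] is the final Linear layer.
    model.classifier[2] = nn.Linear(model.classifier[2].in_features, num_classes)
    return model


@register_model("mobilenetv3_small")
def build_mobilenetv3_small(pretrained=True, num_classes=2):
    from torchvision.models import MobileNet_V3_Small_Weights, mobilenet_v3_small
    import torch.nn as nn
    weights = MobileNet_V3_Small_Weights.DEFAULT if pretrained else None
    model = mobilenet_v3_small(weights=weights)
    # torchvision's MobileNetV3 head: classifier[3] is the final Linear layer.
    model.classifier[3] = nn.Linear(model.classifier[3].in_features, num_classes)
    return model


def build_model(name, **kwargs):
    return MODEL_REGISTRY[name](**kwargs)
```

Lazy imports inside the builders keep the registry importable even where torch is absent.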
### 3.3 Training
- [ ] Train ResNet34 with 5-fold stratified group CV
- [ ] Train ResNet50 with 5-fold stratified group CV
- [ ] Train EfficientNet-B0 with 5-fold stratified group CV
- [ ] Train ConvNeXt-Tiny with 5-fold stratified group CV
- [ ] Train MobileNetV3-Small with 5-fold stratified group CV
- [ ] Save all checkpoints and metrics
### 3.4 Analysis
- [ ] Use `classifier/notebooks/05_phase3_analysis.ipynb` for Phase 3 analysis
- [ ] Load 5-fold stratified group CV results for all models (mean ± std and confidence intervals)
- [ ] Generate overall metrics for each model
- [ ] Generate per-source metrics for each model
- [ ] Compare with Phase 1 baselines (ResNet18, SimpleCNN)
- [ ] Statistical significance testing vs baselines
- [ ] Generate Grad-CAM visualizations for top models (10-20 images each)
- [ ] Parameter count vs performance analysis
- [ ] Conclusions: Which architectures work best and why
---
## Phase 4: Final Analysis on Best Models
### 4.1 Select Top Models
- [ ] Based on Phases 1-3 results, select top 3-4 models
- [ ] Document selection criteria (e.g., top AUC, balanced performance, efficiency)
### 4.2 Data Quantity Scaling (4A)
- [ ] For each selected model, create configs for different data sizes:
- [ ] `classifier/configs/phase4/p4a_<model>_20pct.json` (subsample: 0.2)
- [ ] `classifier/configs/phase4/p4a_<model>_50pct.json` (subsample: 0.5)
- [ ] `classifier/configs/phase4/p4a_<model>_100pct.json` (subsample: 1.0)
- [ ] In every 4A config, explicitly set the best Phase 2 preprocessing choices:
- image_size: best from Phase 2B
- face_crop: best from Phase 2C/E
- augment: best from Phase 2D/E
- [ ] Train each model with 5-fold stratified group CV at all three data sizes
- [ ] Compare how each model scales with more data
### 4.3 Full Dataset Evaluation (4B)
- [ ] For each selected model, create config for full dataset:
- `classifier/configs/phase4/p4b_<model>_full.json` (subsample: 1.0)
- [ ] In every 4B config, explicitly set the same best Phase 2 preprocessing choices used in 4A
- [ ] Train each model on full dataset with 5-fold stratified group CV
- [ ] Generate detailed per-source metrics
- [ ] Generate Grad-CAM visualizations (10-20 images each)
- [ ] Perform hard example analysis (false positives/negatives) with visualizations
- [ ] Generate confidence distribution histograms
- [ ] Cross-validation results (mean ± std with confidence intervals)
### 4.4 Analysis
- [ ] Use `classifier/notebooks/06_phase4_analysis.ipynb` for Phase 4 analysis
- [ ] Load data quantity scaling results
- [ ] Load full dataset evaluation results
- [ ] Generate comprehensive metrics comparison table
- [ ] Generate per-source metrics for final models
- [ ] Generate Grad-CAM galleries for final models
- [ ] Perform hard example analysis with visualizations
- [ ] Generate confidence distribution histograms
- [ ] Final model comparison and selection
- [ ] Conclusions and recommendations
---
## Notebooks and Analysis
This section consolidates the notebook checklist for the notebooks referenced in the phase sections above; do not create duplicate notebooks for the same phase.
### 5.1 Exploratory Data Analysis
- [x] Create `classifier/notebooks/01_eda.ipynb`
- [x] Dataset overview (real vs fake distribution, sources)
- [x] Image resolution/aspect ratio analysis (identify potential shortcuts)
- [x] Color distribution analysis (identify potential shortcuts)
- [x] Sample visualization from each source
- [x] Statistical summary of the dataset
- [x] Data quality checks
### 5.2 Preprocessing Pipeline
- [x] Create `classifier/notebooks/02_preprocessing.ipynb`
- [x] Square crop and resize implementation demonstration
- [x] Face crop (MTCNN) demonstration and effectiveness analysis
- [x] Augmentation pipeline visualization (before/after examples)
- [x] Z-score normalization comparison (ImageNet vs real-image-only)
- [x] Data split verification (group-aware by basename, no overlap)
- [x] Preprocessing impact visualization
### 5.3 Phase 1 Analysis
- [x] Create `classifier/notebooks/03_phase1_analysis.ipynb`
- [x] Load Phase 1 training results
- [x] Generate 5-fold stratified group CV results (mean ± std with confidence intervals)
- [x] Generate per-source metrics for each model
- [x] Generate train/val/test performance curves
- [x] Generate confusion matrices
- [x] Perform statistical significance testing between models
- [x] Generate Grad-CAM visualizations (10-20 images each)
- [x] Document conclusions: Which baseline is better and why
### 5.4 Phase 2 Analysis
- [x] Create `classifier/notebooks/04_phase2_analysis.ipynb`
- [ ] Load all Phase 2 experiment results
- [ ] For each experiment (2A-2E):
- [ ] Generate 5-fold stratified group CV results (mean ± std with confidence intervals)
- [ ] Generate overall metrics
- [ ] Generate per-source metrics
- [ ] Calculate train/val gap
- [ ] Calculate pairwise source AUC variance (wiki-vs-source AUC variance)
- [ ] Perform statistical significance testing
- [ ] Generate comparison tables across all Phase 2 experiments
- [ ] Generate comparison visualizations (bar charts, heatmaps)
- [ ] For each model/condition, generate Grad-CAM visualizations (10-20 images)
- [ ] Organize visualizations by experiment, model, prediction type, and source
- [ ] Answer key analysis questions
- [ ] Generate comprehensive metrics comparison table
- [ ] Provide evidence-based conclusions for each experiment
- [ ] Provide recommendations for Phase 3
### 5.5 Phase 3 Analysis
- [ ] Create `classifier/notebooks/05_phase3_analysis.ipynb`
- [ ] Load Phase 3 training results
- [ ] Generate 5-fold stratified group CV results for each model (mean ± std with confidence intervals)
- [ ] Generate per-source metrics for each model
- [ ] Compare with Phase 1 baselines (ResNet18, SimpleCNN)
- [ ] Perform statistical significance testing vs baselines
- [ ] Generate Grad-CAM visualizations for top models (10-20 images each)
- [ ] Parameter count vs performance analysis
- [ ] Conclusions: Which architectures work best and why
### 5.6 Phase 4 Analysis
- [ ] Create `classifier/notebooks/06_phase4_analysis.ipynb`
- [ ] Load data quantity scaling results
- [ ] Load full dataset evaluation results
- [ ] Generate comprehensive metrics comparison table
- [ ] Generate per-source metrics for final models
- [ ] Generate Grad-CAM galleries for final models
- [ ] Perform hard example analysis with visualizations
- [ ] Generate confidence distribution histograms
- [ ] Final model comparison and selection
- [ ] Conclusions and recommendations
### 5.7 Grad-CAM Deep Dive (Optional)
- [ ] Create `classifier/notebooks/07_gradcam_deep_dive.ipynb`
- [ ] Load Grad-CAM results from all phases
- [ ] Comprehensive Grad-CAM analysis across all phases and models
- [ ] Feature visualization for different model architectures
- [ ] CNN vs EfficientNet vs ConvNeXt comparison
- [ ] What regions do different architectures focus on?
- [ ] Are there systematic differences in attention patterns?
- [ ] Evidence of shortcut removal analysis across phases
- [ ] Temporal analysis: does model attention change with different preprocessing?
- [ ] Generate visual explanations suitable for presentation
---
## Code Implementation Tasks
### Cross-Validation Implementation
- [x] Update `classifier/src/training/trainer.py` to support 5-fold stratified group CV by basename
- [x] Update `classifier/src/evaluation/evaluate.py` to support grouped CV splits
- [x] Implement metric aggregation across folds (mean ± std)
- [x] Ensure all metrics report confidence intervals
- [x] Reuse the same fold assignments for comparable experiments so paired statistical tests are valid
- [x] Rename `classifier/run_cv.py` to `classifier/run.py` (pipeline expects classifier/run.py)
### Model Implementations
- [ ] Implement ConvNeXt-Tiny in `classifier/src/models/convnext.py`
- [ ] Implement MobileNetV3-Small in `classifier/src/models/mobilenet.py`
- [ ] Register both models in `classifier/src/models/__init__.py`
### Normalization Implementation
- [ ] Implement function to calculate mean/std from real training images only
- [ ] Update `classifier/src/preprocessing/pipeline.py` to support custom normalization stats
- [ ] Test ImageNet normalization vs real-image-only normalization
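A minimal sketch of fold-local normalization stats computed from real images only, assuming images arrive as HxWx3 arrays scaled to [0, 1] (`real_image_stats` is a hypothetical helper):

```python
import numpy as np


def real_image_stats(images):
    """Per-channel mean/std over real training images of the current fold.

    images: iterable of HxWx3 float arrays scaled to [0, 1]. The returned
    stats replace the ImageNet constants in the normalization transform.
    """
    pixels = np.concatenate([img.reshape(-1, 3) for img in images], axis=0)
    return pixels.mean(axis=0), pixels.std(axis=0)
```

Because the stats are computed per fold from training reals only, no test-set or fake-image information leaks into the normalization.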
### Evaluation Improvements
- [ ] Ensure test set uses `train=False` to disable augmentation
- [ ] Ensure diagnostic evaluation transforms never change the training data
- [ ] Verify CV fold assignments are identical across comparable experiments (same seed and basename grouping)
- [ ] Implement per-source metrics with detection rate and false alarm rate
- [ ] Implement pairwise AUC calculations
- [ ] Implement train/val gap calculations
- [ ] Implement pairwise source AUC variance calculations
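Per-source detection and false-alarm rates might be computed as below, assuming the label convention 1 = fake, 0 = real (hypothetical helper):

```python
import numpy as np


def per_source_metrics(y_true, y_pred, sources):
    """Detection rate (TPR on fakes) and false alarm rate (FPR on reals), per source."""
    y_true, y_pred, sources = map(np.asarray, (y_true, y_pred, sources))
    out = {}
    for src in np.unique(sources):
        m = sources == src
        fakes = m & (y_true == 1)
        reals = m & (y_true == 0)
        out[src] = {
            "detection_rate": float((y_pred[fakes] == 1).mean()) if fakes.any() else float("nan"),
            "false_alarm_rate": float((y_pred[reals] == 1).mean()) if reals.any() else float("nan"),
        }
    return out
```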
### Grad-CAM Improvements
- [ ] Ensure Grad-CAM works for all model types (CNN-based)
- [ ] Implement Grad-CAM for ConvNeXt
- [ ] Implement Grad-CAM for MobileNetV3
- [ ] Organize Grad-CAM outputs by experiment, model, prediction type, source
---
## Final Report Preparation
- [ ] Compile results from all phases
- [ ] Create presentation slides (PDF format)
- [ ] Brief description of deep learning solutions (discriminative + generative)
- [ ] Description of implementation steps and improvements
- [ ] Motivate choices for architecture, training strategy, etc.
- [ ] Show intermediate results
- [ ] Interpret results and what changed
- [ ] What was decided to improve results
- [ ] Classification performance results
- [ ] Experimental setup
- [ ] Train/val/test splits
- [ ] Performance metrics chosen
- [ ] Data generation performance results
- [ ] Experimental setup
- [ ] Performance metrics chosen
- [ ] Discussion and conclusions
- [ ] Comments on performance
- [ ] Final remarks
- [ ] Fill auto-evaluation file
---
## Summary
Total tasks: ~150
This implementation plan covers:
- ✅ All 4 phases with comprehensive experiments
- ✅ 5-fold stratified group cross-validation for all experiments
- ✅ 7 analysis notebooks for robust validation
- ✅ Shortcut analysis (resolution/ratio + color distribution + source holdout)
- ✅ Source holdout experiments to detect source-specific feature learning
- ✅ Grad-CAM visualizations for explainability
- ✅ Statistical analysis with confidence intervals
- ✅ Per-source metrics for all experiments
- ✅ Data quantity scaling analysis
- ✅ Full dataset evaluation on best models
- ✅ Comprehensive documentation and reporting
**Key Features:**
- Reproducible experiments with fixed seeds
- Stratified group CV keeps basename groups together while balancing class distribution
- Multiple shortcut analyses to prevent model cheating (resolution, color, source-specific)
- Source holdout experiments to test generalization to unseen sources
- Grad-CAM for explainability
- Statistical rigor with confidence intervals
- Per-source analysis to understand model behavior
- Clear progression from baselines -> preprocessing -> architectures -> final evaluation