DRL_PROJ/docs/classifier_impl.md
Johnny Fernandes bb3dfb92d5 Clean state
2026-04-30 01:25:39 +01:00


Deepfake Detection Classifier - Implementation Plan

Overview

This document provides a comprehensive implementation plan for refactoring the deepfake detection classifier project. Each task includes a checkbox to track completion.


Phase 0: Pre-Implementation Setup

Infrastructure and Configuration

  • Create classifier/configs/shared.json with shared parameters:

    • seed: 42
    • val_ratio: 0.1
    • test_ratio: 0.1
    • batch_size: 32
    • optimizer: {type: "adamw", lr: 1e-4, weight_decay: 1e-4}
    • scheduler: {type: "cosine_annealing", T_max: 15}
    • early_stopping_patience: 5
    • num_workers: 4
    • cv_folds: 5
    • data_dir: "data"
    • face_crop_margin: 0.6
  • Implement config loading/merging so experiment configs inherit shared.json defaults and override only the variables under test

  • Resolve shared nested fields such as optimizer.lr, optimizer.weight_decay, and scheduler.T_max into the training arguments used by the runner

  • Update existing configs to reference shared.json or otherwise document which shared defaults they intentionally override
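The config inheritance above can be sketched as a recursive merge plus dotted-key resolution. This is a minimal illustration, not the project's actual API — `load_config`, `resolve`, and the file layout are assumptions:

```python
import json
from pathlib import Path

def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base, returning a new dict."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

def load_config(path: str, shared_path: str = "classifier/configs/shared.json") -> dict:
    """Load an experiment config on top of shared.json defaults."""
    shared = json.loads(Path(shared_path).read_text())
    experiment = json.loads(Path(path).read_text())
    return deep_merge(shared, experiment)

def resolve(config: dict, dotted_key: str, default=None):
    """Resolve nested fields such as 'optimizer.lr' for the runner."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node
```

With this shape, an experiment config that sets only `{"optimizer": {"lr": 1e-3}}` keeps the shared `optimizer.type` and `optimizer.weight_decay` while overriding just the learning rate.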

  • Define one CV protocol for all phases:

    • outer fold: held-out test fold
    • inner validation split: group-aware split from the remaining training folds for early stopping/model selection
    • final reported metrics: aggregate held-out test-fold results across the 5 outer folds
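The protocol above (group-aware outer folds, inner validation split from the remaining folds) can be sketched in plain Python. The helper names and the round-robin stratification are illustrative assumptions; the real pipeline may use e.g. scikit-learn's StratifiedGroupKFold instead:

```python
import random
from collections import defaultdict

def assign_group_folds(samples, n_folds=5, seed=42):
    """Assign each basename group to exactly one fold, roughly stratified.

    `samples` is a list of (basename, label) pairs; all images sharing a
    basename land in the same fold, so folds never leak groups.
    """
    by_label = defaultdict(set)
    for basename, label in samples:
        by_label[label].add(basename)
    rng = random.Random(seed)
    fold_of = {}
    for label in sorted(by_label):
        groups = sorted(by_label[label])
        rng.shuffle(groups)
        for i, g in enumerate(groups):
            fold_of[g] = i % n_folds  # round-robin keeps class balance per fold
    return fold_of

def outer_splits(samples, n_folds=5, seed=42, val_groups_per_fold=0.1):
    """Yield (train, val, test) index lists following the shared protocol:
    one held-out test fold, plus a group-aware val split from the rest."""
    fold_of = assign_group_folds(samples, n_folds, seed)
    for test_fold in range(n_folds):
        rest = [i for i, (b, _) in enumerate(samples) if fold_of[b] != test_fold]
        test = [i for i, (b, _) in enumerate(samples) if fold_of[b] == test_fold]
        rest_groups = sorted({samples[i][0] for i in rest})
        rng = random.Random(seed + test_fold)
        rng.shuffle(rest_groups)
        n_val = max(1, int(len(rest_groups) * val_groups_per_fold))
        val_groups = set(rest_groups[:n_val])
        train = [i for i in rest if samples[i][0] not in val_groups]
        val = [i for i in rest if samples[i][0] in val_groups]
        yield train, val, test
```

Because fold assignment is a pure function of (seed, basename grouping), comparable experiments automatically share identical folds, which is what makes the paired statistical tests later in the plan valid.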

Data Preparation

  • Verify dataset structure and integrity
  • Check that real and fake images are properly organized by source
  • Verify no data leakage between train/val/test splits or CV folds (group-aware by basename)

Cleanup

  • Remove classifier/tools/ensemble.py (not part of reorganization plan, conflicts with explainability goals)
  • Remove robustness evaluation from classifier/tools/analyze.py (lines 51-104, 82-104, 144) - not part of experimental plan
  • Remove any unused or obsolete config files from previous experiments (see detailed list below)
  • Clean up old output directories if needed (keep important results for reference)

Config Files to Remove (39 total)

Root configs (6):

  • classifier/configs/resnet18_quick.json
  • classifier/configs/resnet18.json
  • classifier/configs/simple_cnn_large.json
  • classifier/configs/simple_cnn_micro.json
  • classifier/configs/simple_cnn_small.json
  • classifier/configs/simple_cnn.json

Phase 1 old configs (7):

  • classifier/configs/phase1/p1_cnn_base.json (uses lr=1e-3, epochs=20 - should be 1e-4, 15)
  • classifier/configs/phase1/p1_cnn_aug.json
  • classifier/configs/phase1/p1_resnet18_base.json (duplicate of new baseline)
  • classifier/configs/phase1/p1_resnet18_aug.json
  • classifier/configs/phase1/holdout/ (entire directory - 6 configs, source holdout not in new plan)

Phase 2 old configs (7):

  • classifier/configs/phase2/p2_resnet18_224.json (should be p2a_resnet18_224.json)
  • classifier/configs/phase2/p2_resnet18_facecrop.json (should be p2b_resnet18_facecrop.json)
  • classifier/configs/phase2/p2_resnet18_frozen.json (frozen backbone not in new plan)
  • classifier/configs/phase2/p2_resnet34_224.json (ResNet34 should be in Phase 3)
  • classifier/configs/phase2/p2_resnet34.json (ResNet34 should be in Phase 3)
  • classifier/configs/phase2/p2_resnet50_frozen.json (ResNet50 should be in Phase 3)
  • classifier/configs/phase2/p2_resnet50.json (ResNet50 should be in Phase 3)

Phase 3 old configs (4):

  • classifier/configs/phase3/p3_efficientnet_b2.json (EfficientNet-B2 not in new plan, only B0)
  • classifier/configs/phase3/p3_resnet18_facecrop_full.json (ResNet18 full dataset should be Phase 4)
  • classifier/configs/phase3/p3_resnet18_freqaug.json (frequency augmentation not in new plan)
  • classifier/configs/phase3/p3_vit_b16.json (ViT not in new plan, replaced with ConvNeXt/MobileNet)
  • Note: p3_efficientnet_b0.json - REMOVED (will be recreated after Phase2 with correct settings)

Source holdout (6):

  • classifier/configs/source_holdout/ (entire directory - 6 configs, source holdout not in new plan)

Ablation (3):

  • classifier/configs/ablation/ (entire directory - 3 configs, ablation studies not in new plan)

Configs to KEEP (3):

  • classifier/configs/shared.json
  • classifier/configs/phase1/p1_simplecnn_baseline.json
  • classifier/configs/phase1/p1_resnet18_baseline.json

Phase 2 alias configs removed (8):

  • classifier/configs/phase2/p2b_resnet18_128.json (alias for p1_resnet18_baseline)
  • classifier/configs/phase2/p2b_simplecnn_128.json (alias for p1_simplecnn_baseline)
  • classifier/configs/phase2/p2c_resnet18_nofacecrop.json (alias for p2b_resnet18_224)
  • classifier/configs/phase2/p2c_simplecnn_nofacecrop.json (alias for p2b_simplecnn_224)
  • classifier/configs/phase2/p2d_resnet18_noaug.json (alias for p2b_resnet18_224)
  • classifier/configs/phase2/p2d_simplecnn_noaug.json (alias for p2b_simplecnn_224)
  • classifier/configs/phase2/p2e_resnet18_facecrop_only.json (alias for p2c_resnet18_facecrop)
  • classifier/configs/phase2/p2e_simplecnn_facecrop_only.json (alias for p2c_simplecnn_facecrop)

Note: Comparison pairs (baseline vs treatment) are defined in the analysis notebook as a mapping dict, not as separate config files.


Phase 1: Architecture Baseline

1.1 Experiment Configs

  • Create classifier/configs/phase1/p1_simplecnn_baseline.json

    • backbone: simple_cnn
    • cnn_preset: medium
    • dropout: 0.0
    • epochs: 15
    • batch_size: 32
    • lr: 1e-4 (consistent with ResNet)
    • weight_decay: 1e-4
    • image_size: 128
    • data_dir: data
    • early_stopping_patience: 5
    • subsample: 0.2
    • face_crop: false
    • augment: false
    • seed: 42
  • Create classifier/configs/phase1/p1_resnet18_baseline.json

    • backbone: resnet18
    • pretrained: true
    • epochs: 15
    • batch_size: 32
    • lr: 1e-4
    • weight_decay: 1e-4
    • image_size: 128
    • data_dir: data
    • early_stopping_patience: 5
    • subsample: 0.2
    • face_crop: false
    • augment: false
    • seed: 42

1.2 Code Updates

  • Implement 5-fold stratified group cross-validation by basename in training pipeline
  • Update classifier/src/training/trainer.py to support CV
  • Update classifier/src/evaluation/evaluate.py to support CV
  • Ensure all metrics report mean ± std and confidence intervals across folds
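A minimal sketch of the fold aggregation, using a t-based 95% confidence interval (appropriate for 5 folds; with df = 4 the critical value is 2.776). Function and table names are illustrative:

```python
import math
import statistics

# two-sided 95% t critical values for small df (df = n_folds - 1)
T95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571}

def aggregate_folds(values):
    """Aggregate a per-fold metric into mean, sample std, and a 95% CI."""
    n = len(values)
    mean = statistics.fmean(values)
    std = statistics.stdev(values)          # sample std (ddof=1)
    half = T95[n - 1] * std / math.sqrt(n)  # t-based half-width
    return {"mean": mean, "std": std, "ci95": (mean - half, mean + half)}
```

Bootstrap intervals over pooled predictions are an alternative where the fold count makes t-intervals too wide to be informative.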

1.3 Training

  • Train SimpleCNN with 5-fold stratified group CV (via pipeline: python -m pipeline run classifier/configs/phase1/p1_simplecnn_baseline.json)
  • Train ResNet18 with 5-fold stratified group CV (via pipeline: python -m pipeline run classifier/configs/phase1/p1_resnet18_baseline.json)
  • Save all checkpoints and metrics (pipeline automatically fetches outputs to classifier/outputs/)

1.4 Analysis

  • Use classifier/notebooks/03_phase1_analysis.ipynb for Phase 1 analysis
  • Compare SimpleCNN vs ResNet18 performance
  • Overall metrics (AUC, Accuracy, F1) with mean ± std and confidence intervals
  • Per-source metrics (text2img, inpainting, insight)
  • Train/val/test performance curves
  • Confusion matrices
  • Statistical significance testing
  • Generate Grad-CAM visualizations (10-20 images per model)
  • Document conclusions: Which baseline is better and why

Phase 2: Preprocessing Impact

2.1 Shortcut Analysis (2A)

  • Create classifier/configs/phase2/p2a_t1_original.json

    • backbone: resnet18
    • image_size: 224
    • subsample: 0.2
    • seed: 42
    • augment: false
    • normalization: imagenet
    • data_dir: data
  • Create classifier/configs/phase2/p2a_t2_real_norm.json

    • extends: p2a_t1_original.json
    • normalization: real_norm
    • Normalization: Calculate mean/std from real training images only within each fold
  • Geometry diagnostic was explored and then removed from the codebase (src/evaluation/geometry.py no longer exists):

    • Current pipeline always square-crops before resize, reducing rectangle-vs-square shortcut risk.
    • Shortcut analysis now relies on normalization and held-out-source evidence artifacts.
  • Train the 2 shortcut configs with 5-fold stratified group CV

  • Compare results:

    • Standard vs matched-geometry eval for p2a_t1_original (letterboxing impact; applies only if the geometry diagnostic is reintroduced)
    • p2a_t1_original vs p2a_t2_real_norm (color distribution shortcut)
  • Create classifier/configs/phase2/p2a_t3_holdout_text2img.json

    • extends: p2a_t1_original.json
    • train_sources: ["wiki", "inpainting", "insight"]
    • eval_sources: ["wiki", "inpainting", "insight", "text2img"]
  • Create classifier/configs/phase2/p2a_t3_holdout_inpainting.json

    • extends: p2a_t1_original.json
    • train_sources: ["wiki", "text2img", "insight"]
    • eval_sources: ["wiki", "text2img", "insight", "inpainting"]
  • Create classifier/configs/phase2/p2a_t3_holdout_insight.json

    • extends: p2a_t1_original.json
    • train_sources: ["wiki", "text2img", "inpainting"]
    • eval_sources: ["wiki", "text2img", "inpainting", "insight"]
  • Train the 3 source holdout configs with 5-fold stratified group CV

  • Compare held-out source performance vs in-source performance:

    • Calculate AUC for held-out source (text2img, inpainting, insight)
    • Compute Δ (in-source AUC - held-out AUC)
    • If Δ exceeds roughly 0.05-0.10, the model is likely learning source-specific features
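The real_norm statistics for p2a_t2_real_norm could be computed as below. This is a sketch under stated assumptions: images arrive as (N, H, W, 3) float arrays in [0, 1], and the function/argument names are not the project's actual API:

```python
import numpy as np

def real_norm_stats(images, labels, real_label=0):
    """Per-channel mean/std from real training images only.

    Stats are computed inside each fold from that fold's real *training*
    images, never from fakes or from val/test data, so any color-distribution
    shortcut between real and fake images is not baked into normalization.
    """
    real = images[np.asarray(labels) == real_label]
    mean = real.mean(axis=(0, 1, 2))
    std = real.std(axis=(0, 1, 2))
    return mean, std

def normalize(image, mean, std):
    """Z-score normalize one (H, W, 3) image with the fold's stats."""
    return (image - mean) / (std + 1e-8)
```

Comparing this against fixed ImageNet statistics is exactly the p2a_t1 vs p2a_t2 contrast: if performance drops under real-only normalization, part of the baseline signal was a color-distribution shortcut.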

2.2 Resolution Impact (2B)

  • Create classifier/configs/phase2/p2b_simplecnn_224.json

    • backbone: simple_cnn
    • image_size: 224
    • subsample: 0.2
    • augment: false
    • seed: 42
    • data_dir: data
  • Create classifier/configs/phase2/p2b_resnet18_224.json

    • backbone: resnet18
    • image_size: 224
    • subsample: 0.2
    • augment: false
    • seed: 42
    • data_dir: data
  • Train both 224 configs with 5-fold stratified group CV

  • Compare 128×128 vs 224×224 for each model

    • 128 baseline is p1_*_baseline (comparison mapping in notebook)

2.3 Facecrop Impact (2C)

  • Create classifier/configs/phase2/p2c_simplecnn_facecrop.json

    • backbone: simple_cnn
    • image_size: 224
    • subsample: 0.2
    • augment: false
    • seed: 42
    • data_dir: cropped/classifier
  • Create classifier/configs/phase2/p2c_resnet18_facecrop.json

    • backbone: resnet18
    • image_size: 224
    • subsample: 0.2
    • augment: false
    • seed: 42
    • data_dir: cropped/classifier
  • Train both facecrop configs with 5-fold stratified group CV

  • Compare p2b_*_224 (no facecrop) vs p2c_*_facecrop for each model

    • No-facecrop baseline is p2b_*_224 (comparison mapping in notebook)

2.4 Augmentation Impact (2D)

  • Create classifier/configs/phase2/p2d_simplecnn_aug.json

    • backbone: simple_cnn
    • image_size: 224
    • subsample: 0.2
    • seed: 42
    • augment: {hflip_p: 0.5, rotation_degrees: 10, brightness: 0.2, contrast: 0.2, saturation: 0.1, hue: 0.02, grayscale_p: 0.1, blur_p: 0.1, erase_p: 0.2, noise_p: 0.3, noise_std: 0.04}
    • data_dir: data
  • Create classifier/configs/phase2/p2d_resnet18_aug.json

    • backbone: resnet18
    • image_size: 224
    • subsample: 0.2
    • seed: 42
    • augment: {hflip_p: 0.5, rotation_degrees: 10, brightness: 0.2, contrast: 0.2, saturation: 0.1, hue: 0.02, grayscale_p: 0.1, blur_p: 0.1, erase_p: 0.2, noise_p: 0.3, noise_std: 0.04}
    • data_dir: data
  • Train both augmentation configs with 5-fold stratified group CV

  • Compare p2b_*_224 (no aug) vs p2d_*_aug for each model

    • No-aug baseline is p2b_*_224 (comparison mapping in notebook)
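To make the augment dict concrete, here is one component of it — the additive-noise step controlled by noise_p and noise_std — sketched in numpy. The actual pipeline presumably composes all the listed transforms (e.g. via torchvision); this standalone function is only an illustration of the parameter semantics:

```python
import numpy as np

def random_gaussian_noise(image, noise_p=0.3, noise_std=0.04, rng=None):
    """With probability `noise_p`, add zero-mean Gaussian noise of std
    `noise_std` to a [0, 1] image, then clip back into range.

    Applied only to training data; eval transforms stay deterministic.
    """
    rng = rng or np.random.default_rng()
    if rng.random() >= noise_p:
        return image
    noisy = image + rng.normal(0.0, noise_std, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)
```

Each probability key in the augment dict (hflip_p, blur_p, erase_p, noise_p, ...) follows this same gate-then-apply pattern, while the magnitude keys (rotation_degrees, brightness, noise_std, ...) parameterize the transform itself.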

2.5 Augmentation + Facecrop (2E)

  • Create classifier/configs/phase2/p2e_simplecnn_facecrop_aug.json

    • backbone: simple_cnn
    • image_size: 224
    • subsample: 0.2
    • seed: 42
    • augment: {hflip_p: 0.5, rotation_degrees: 10, brightness: 0.2, contrast: 0.2, saturation: 0.1, hue: 0.02, grayscale_p: 0.1, blur_p: 0.1, erase_p: 0.2, noise_p: 0.3, noise_std: 0.04}
    • data_dir: cropped/classifier
  • Create classifier/configs/phase2/p2e_resnet18_facecrop_aug.json

    • backbone: resnet18
    • image_size: 224
    • subsample: 0.2
    • seed: 42
    • augment: {hflip_p: 0.5, rotation_degrees: 10, brightness: 0.2, contrast: 0.2, saturation: 0.1, hue: 0.02, grayscale_p: 0.1, blur_p: 0.1, erase_p: 0.2, noise_p: 0.3, noise_std: 0.04}
    • data_dir: cropped/classifier
  • Train both facecrop+aug configs with 5-fold stratified group CV

  • Compare p2c_*_facecrop (facecrop only) vs p2e_*_facecrop_aug for each model

    • Facecrop-only baseline is p2c_*_facecrop (comparison mapping in notebook)

2.6 Phase 2 Analysis

  • Use classifier/notebooks/04_phase2_analysis.ipynb for Phase 2 analysis
  • For each experiment (2A-2E):
    • Load 5-fold stratified group CV results (mean ± std and confidence intervals)
    • Generate overall metrics (AUC, Accuracy, F1)
    • Generate per-source metrics (text2img, inpainting, insight)
    • Calculate train/val gap
    • Calculate pairwise source AUC variance (wiki-vs-source AUC variance)
    • Statistical significance testing vs baseline
    • Generate comparison visualizations (bar charts, heatmaps)
  • For 2A (Shortcut Analysis):
    • Compare original-test vs alternative geometry evidence if reintroduced in a dedicated tool/notebook
    • Compare ImageNet vs real-image-only normalization (color distribution shortcuts)
    • Load source holdout results (3 configs)
    • Calculate held-out source AUC vs in-source AUC for each holdout experiment
    • Compute Δ (in-source AUC - held-out AUC)
    • If Δ exceeds roughly 0.05-0.10, the model is likely learning source-specific features
    • Generate source holdout comparison table
  • For each model/condition:
    • Generate Grad-CAM visualizations (10-20 images per condition)
    • Organize by experiment, prediction type, and source
  • Answer key questions:
    • Which preprocessing choices are statistically significant?
    • Do certain sources benefit more from specific preprocessing?
    • Is there an interaction between facecrop and augmentation?
    • Are shortcuts being learned (resolution, color distribution)?
    • Is the model learning source-specific features (source holdout)?
    • Does augmentation remove shortcuts or over-regularize?
    • What features do models focus on (based on Grad-CAM)?
  • Generate comprehensive metrics comparison table
  • Use paired fold-wise statistical tests for model comparisons, with bootstrap confidence intervals for key metrics where useful
  • Provide evidence-based conclusions for each experiment
  • Provide recommendations for Phase 3 (best preprocessing settings)

Phase 3: Extended Architecture Exploration

3.1 Experiment Configs

Use the best preprocessing choices from Phase 2. The placeholders below assume 224×224, face crop enabled, and no augmentation unless Phase 2 results justify different settings.

  • Create classifier/configs/phase3/p3_resnet34.json

    • backbone: resnet34
    • pretrained: true
    • epochs: 15
    • batch_size: 32
    • lr: 1e-4
    • weight_decay: 1e-4
    • image_size: 224
    • face_crop: true (or best from Phase 2B/E)
    • face_crop_margin: 0.6
    • augment: false (or best from Phase 2D/E)
    • subsample: 0.2
    • seed: 42
    • early_stopping_patience: 5
  • Create classifier/configs/phase3/p3_resnet50.json

    • backbone: resnet50
    • pretrained: true
    • epochs: 15
    • batch_size: 32
    • lr: 1e-4
    • weight_decay: 1e-4
    • image_size: 224
    • face_crop: true (or best from Phase 2B/E)
    • face_crop_margin: 0.6
    • augment: false (or best from Phase 2D/E)
    • subsample: 0.2
    • seed: 42
    • early_stopping_patience: 5
  • Create classifier/configs/phase3/p3_efficientnet_b0.json

    • backbone: efficientnet_b0
    • pretrained: true
    • epochs: 15
    • batch_size: 32
    • lr: 1e-4
    • weight_decay: 1e-4
    • image_size: 224
    • face_crop: true (or best from Phase 2B/E)
    • augment: false (or best from Phase 2D/E)
    • subsample: 0.2
    • seed: 42
    • early_stopping_patience: 5
  • Create classifier/configs/phase3/p3_convnext_tiny.json

    • backbone: convnext_tiny
    • pretrained: true
    • epochs: 15
    • batch_size: 32
    • lr: 1e-4
    • weight_decay: 1e-4
    • image_size: 224
    • face_crop: true (or best from Phase 2B/E)
    • augment: false (or best from Phase 2D/E)
    • subsample: 0.2
    • seed: 42
    • early_stopping_patience: 5
  • Create classifier/configs/phase3/p3_mobilenetv3_small.json

    • backbone: mobilenetv3_small
    • pretrained: true
    • epochs: 15
    • batch_size: 32
    • lr: 1e-4
    • weight_decay: 1e-4
    • image_size: 224
    • face_crop: true (or best from Phase 2B/E)
    • augment: false (or best from Phase 2D/E)
    • subsample: 0.2
    • seed: 42
    • early_stopping_patience: 5

3.2 Model Implementation

  • Implement ConvNeXt-Tiny in classifier/src/models/convnext.py
  • Implement MobileNetV3-Small in classifier/src/models/mobilenet.py
  • Register both models in classifier/src/models/__init__.py

3.3 Training

  • Train ResNet34 with 5-fold stratified group CV
  • Train ResNet50 with 5-fold stratified group CV
  • Train EfficientNet-B0 with 5-fold stratified group CV
  • Train ConvNeXt-Tiny with 5-fold stratified group CV
  • Train MobileNetV3-Small with 5-fold stratified group CV
  • Save all checkpoints and metrics

3.4 Analysis

  • Use classifier/notebooks/05_phase3_analysis.ipynb for Phase 3 analysis
  • Load 5-fold stratified group CV results for all models (mean ± std and confidence intervals)
  • Generate overall metrics for each model
  • Generate per-source metrics for each model
  • Compare with Phase 1 baselines (ResNet18, SimpleCNN)
  • Statistical significance testing vs baselines
  • Generate Grad-CAM visualizations for top models (10-20 images each)
  • Parameter count vs performance analysis
  • Conclusions: Which architectures work best and why

Phase 4: Final Analysis on Best Models

4.1 Select Top Models

  • Based on Phases 1-3 results, select top 3-4 models
  • Document selection criteria (e.g., top AUC, balanced performance, efficiency)

4.2 Data Quantity Scaling (4A)

  • For each selected model, create configs for different data sizes:
    • classifier/configs/phase4/p4a_<model>_20pct.json (subsample: 0.2)
    • classifier/configs/phase4/p4a_<model>_50pct.json (subsample: 0.5)
    • classifier/configs/phase4/p4a_<model>_100pct.json (subsample: 1.0)
  • In every 4A config, explicitly set the best Phase 2 preprocessing choices:
    • image_size: best from Phase 2A
    • face_crop: best from Phase 2B/E
    • augment: best from Phase 2D/E
  • Train each model with 5-fold stratified group CV at all three data sizes
  • Compare how each model scales with more data

4.3 Full Dataset Evaluation (4B)

  • For each selected model, create config for full dataset:
    • classifier/configs/phase4/p4b_<model>_full.json (subsample: 1.0)
  • In every 4B config, explicitly set the same best Phase 2 preprocessing choices used in 4A
  • Train each model on full dataset with 5-fold stratified group CV
  • Generate detailed per-source metrics
  • Generate Grad-CAM visualizations (10-20 images each)
  • Perform hard example analysis (false positives/negatives) with visualizations
  • Generate confidence distribution histograms
  • Cross-validation results (mean ± std with confidence intervals)

4.4 Analysis

  • Use classifier/notebooks/06_phase4_analysis.ipynb for Phase 4 analysis
  • Load data quantity scaling results
  • Load full dataset evaluation results
  • Generate comprehensive metrics comparison table
  • Generate per-source metrics for final models
  • Generate Grad-CAM galleries for final models
  • Perform hard example analysis with visualizations
  • Generate confidence distribution histograms
  • Final model comparison and selection
  • Conclusions and recommendations

Notebooks and Analysis

This section is the consolidated notebook checklist for the notebooks referenced in the phase sections above; do not create duplicate notebooks for the same phase.

5.1 Exploratory Data Analysis

  • Create classifier/notebooks/01_eda.ipynb
  • Dataset overview (real vs fake distribution, sources)
  • Image resolution/aspect ratio analysis (identify potential shortcuts)
  • Color distribution analysis (identify potential shortcuts)
  • Sample visualization from each source
  • Statistical summary of the dataset
  • Data quality checks

5.2 Preprocessing Pipeline

  • Create classifier/notebooks/02_preprocessing.ipynb
  • Square crop and resize implementation demonstration
  • Face crop (MTCNN) demonstration and effectiveness analysis
  • Augmentation pipeline visualization (before/after examples)
  • Z-score normalization comparison (ImageNet vs real-image-only)
  • Data split verification (group-aware by basename, no overlap)
  • Preprocessing impact visualization

5.3 Phase 1 Analysis

  • Create classifier/notebooks/03_phase1_analysis.ipynb
  • Load Phase 1 training results
  • Generate 5-fold stratified group CV results (mean ± std with confidence intervals)
  • Generate per-source metrics for each model
  • Generate train/val/test performance curves
  • Generate confusion matrices
  • Perform statistical significance testing between models
  • Generate Grad-CAM visualizations (10-20 images each)
  • Document conclusions: Which baseline is better and why

5.4 Phase 2 Analysis

  • Create classifier/notebooks/04_phase2_analysis.ipynb
  • Load all Phase 2 experiment results
  • For each experiment (2A-2E):
    • Generate 5-fold stratified group CV results (mean ± std with confidence intervals)
    • Generate overall metrics
    • Generate per-source metrics
    • Calculate train/val gap
    • Calculate pairwise source AUC variance (wiki-vs-source AUC variance)
    • Perform statistical significance testing
  • Generate comparison tables across all Phase 2 experiments
  • Generate comparison visualizations (bar charts, heatmaps)
  • For each model/condition, generate Grad-CAM visualizations (10-20 images)
  • Organize visualizations by experiment, model, prediction type, and source
  • Answer key analysis questions
  • Generate comprehensive metrics comparison table
  • Provide evidence-based conclusions for each experiment
  • Provide recommendations for Phase 3

5.5 Phase 3 Analysis

  • Create classifier/notebooks/05_phase3_analysis.ipynb
  • Load Phase 3 training results
  • Generate 5-fold stratified group CV results for each model (mean ± std with confidence intervals)
  • Generate per-source metrics for each model
  • Compare with Phase 1 baselines (ResNet18, SimpleCNN)
  • Perform statistical significance testing vs baselines
  • Generate Grad-CAM visualizations for top models (10-20 images each)
  • Parameter count vs performance analysis
  • Conclusions: Which architectures work best and why

5.6 Phase 4 Analysis

  • Create classifier/notebooks/06_phase4_analysis.ipynb
  • Load data quantity scaling results
  • Load full dataset evaluation results
  • Generate comprehensive metrics comparison table
  • Generate per-source metrics for final models
  • Generate Grad-CAM galleries for final models
  • Perform hard example analysis with visualizations
  • Generate confidence distribution histograms
  • Final model comparison and selection
  • Conclusions and recommendations

5.7 Grad-CAM Deep Dive (Optional)

  • Create classifier/notebooks/07_gradcam_deep_dive.ipynb
  • Load Grad-CAM results from all phases
  • Comprehensive Grad-CAM analysis across all phases and models
  • Feature visualization for different model architectures
  • CNN vs EfficientNet vs ConvNeXt comparison
  • What regions do different architectures focus on?
  • Are there systematic differences in attention patterns?
  • Evidence of shortcut removal analysis across phases
  • Temporal analysis: does model attention change with different preprocessing?
  • Generate visual explanations suitable for presentation

Code Implementation Tasks

Cross-Validation Implementation

  • Update classifier/src/training/trainer.py to support 5-fold stratified group CV by basename
  • Update classifier/src/evaluation/evaluate.py to support grouped CV splits
  • Implement metric aggregation across folds (mean ± std)
  • Ensure all metrics report confidence intervals
  • Reuse the same fold assignments for comparable experiments so paired statistical tests are valid
  • Rename classifier/run_cv.py to classifier/run.py (pipeline expects classifier/run.py)
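The paired fold-wise test this relies on can be sketched without scipy. The helper name is illustrative; the key precondition is the bullet above — both models must have been evaluated on identical fold assignments:

```python
import math
import statistics

def paired_fold_ttest(metric_a, metric_b):
    """Paired t statistic over per-fold metrics of two models.

    Valid only because both models share the same fold assignments (same
    seed, same basename grouping). For 5 folds (df = 4), |t| > 2.776 is
    significant at p < 0.05 (two-sided).
    """
    diffs = [a - b for a, b in zip(metric_a, metric_b)]
    mean_d = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)
    if sd == 0:
        return float("inf") if mean_d else 0.0
    return mean_d / (sd / math.sqrt(len(diffs)))
```

Pairing on folds removes the fold-difficulty variance that would otherwise dominate an unpaired comparison at n = 5.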

Model Implementations

  • Implement ConvNeXt-Tiny in classifier/src/models/convnext.py
  • Implement MobileNetV3-Small in classifier/src/models/mobilenet.py
  • Register both models in classifier/src/models/__init__.py

Normalization Implementation

  • Implement function to calculate mean/std from real training images only
  • Update classifier/src/preprocessing/pipeline.py to support custom normalization stats
  • Test ImageNet normalization vs real-image-only normalization

Evaluation Improvements

  • Ensure test set uses train=False to disable augmentation
  • Ensure diagnostic evaluation transforms never change the training data
  • Verify CV fold assignments are identical across comparable experiments (same seed and basename grouping)
  • Implement per-source metrics with detection rate and false alarm rate
  • Implement pairwise AUC calculations
  • Implement train/val gap calculations
  • Implement pairwise source AUC variance calculations
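Per-source AUC can be computed by pairing each source's fakes against the real images, using the Mann-Whitney formulation of AUC (fine for eval-sized sets; ties count half). Function names here are illustrative, not the evaluate.py API:

```python
from collections import defaultdict

def auc(labels, scores):
    """AUC via the Mann-Whitney formulation: fraction of (pos, neg) pairs
    the classifier ranks correctly, with ties counted as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return float("nan")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def per_source_auc(labels, scores, sources):
    """AUC per generation source (e.g. text2img, inpainting, insight),
    pairing each source's fakes with all real images."""
    real = [(y, s) for y, s, src in zip(labels, scores, sources) if y == 0]
    by_source = defaultdict(list)
    for y, s, src in zip(labels, scores, sources):
        if y == 1:
            by_source[src].append((y, s))
    out = {}
    for src, fakes in by_source.items():
        ys, ss = zip(*(real + fakes))
        out[src] = auc(ys, ss)
    return out
```

The pairwise source AUC variance metric then reduces to the variance of these per-source values within a fold.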

Grad-CAM Improvements

  • Ensure Grad-CAM works for all model types (CNN-based)
  • Implement Grad-CAM for ConvNeXt
  • Implement Grad-CAM for MobileNetV3
  • Organize Grad-CAM outputs by experiment, model, prediction type, source
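The output organization could be as simple as a deterministic path builder; the directory layout and function name below are assumptions, not the project's existing convention:

```python
from pathlib import Path

def gradcam_output_path(root, experiment, model, label, pred, source, image_name):
    """Build the output location for one Grad-CAM overlay, organized by
    experiment, model, prediction type (tp/fp/tn/fn), and source."""
    pred_type = {(1, 1): "tp", (0, 1): "fp", (0, 0): "tn", (1, 0): "fn"}[(label, pred)]
    return Path(root) / experiment / model / pred_type / source / image_name
```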

Final Report Preparation

  • Compile results from all phases
  • Create presentation slides (PDF format)
  • Brief description of deep learning solutions (discriminative + generative)
  • Description of implementation steps and improvements
    • Motivate choices for architecture, training strategy, etc.
    • Show intermediate results
    • Interpret results and what changed
    • What was decided to improve results
  • Classification performance results
    • Experimental setup
    • Train/val/test splits
    • Performance metrics chosen
  • Data generation performance results
    • Experimental setup
    • Performance metrics chosen
  • Discussion and conclusions
    • Comments on performance
    • Final remarks
  • Fill auto-evaluation file

Summary

Total tasks: ~150

This implementation plan covers:

  • All 4 phases with comprehensive experiments
  • 5-fold stratified group cross-validation for all experiments
  • 7 analysis notebooks for robust validation
  • Shortcut analysis (resolution/ratio + color distribution + source holdout)
  • Source holdout experiments to detect source-specific feature learning
  • Grad-CAM visualizations for explainability
  • Statistical analysis with confidence intervals
  • Per-source metrics for all experiments
  • Data quantity scaling analysis
  • Full dataset evaluation on best models
  • Comprehensive documentation and reporting

Key Features:

  • Reproducible experiments with fixed seeds
  • Stratified group CV keeps basename groups together while balancing class distribution
  • Multiple shortcut analyses to prevent model cheating (resolution, color, source-specific)
  • Source holdout experiments to test generalization to unseen sources
  • Grad-CAM for explainability
  • Statistical rigor with confidence intervals
  • Per-source analysis to understand model behavior
  • Clear progression from baselines -> preprocessing -> architectures -> final evaluation