# Deepfake Detection Classifier - Implementation Plan

## Overview

This document provides a comprehensive implementation plan for refactoring the deepfake detection classifier project. Each task includes a checkbox to track completion.

## Phase 0: Pre-Implementation Setup

### Infrastructure and Configuration
- [ ] Create `classifier/configs/shared.json` with shared parameters:
  - seed: 42
  - val_ratio: 0.1
  - test_ratio: 0.1
  - batch_size: 32
  - optimizer: {type: "adamw", lr: 1e-4, weight_decay: 1e-4}
  - scheduler: {type: "cosine_annealing", T_max: 15}
  - early_stopping_patience: 5
  - num_workers: 4
  - cv_folds: 5
  - data_dir: "data"
  - face_crop_margin: 0.6
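As a sketch, `shared.json` could look like the following (field names mirror the bullets above; the exact schema is whatever the config loader defines):

```json
{
  "seed": 42,
  "val_ratio": 0.1,
  "test_ratio": 0.1,
  "batch_size": 32,
  "optimizer": {"type": "adamw", "lr": 1e-4, "weight_decay": 1e-4},
  "scheduler": {"type": "cosine_annealing", "T_max": 15},
  "early_stopping_patience": 5,
  "num_workers": 4,
  "cv_folds": 5,
  "data_dir": "data",
  "face_crop_margin": 0.6
}
```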
- [ ] Implement config loading/merging so experiment configs inherit `shared.json` defaults and override only the variables under test
- [ ] Resolve shared nested fields such as `optimizer.lr`, `optimizer.weight_decay`, and `scheduler.T_max` into the training arguments used by the runner
- [ ] Update existing configs to reference `shared.json` or otherwise document which shared defaults they intentionally override
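The inheritance scheme above can be sketched as a recursive dict merge (`merge_config` is a hypothetical helper name; the project's actual loader may differ):

```python
from copy import deepcopy

def merge_config(shared: dict, experiment: dict) -> dict:
    """Recursively merge an experiment config over shared defaults.

    Nested dicts (e.g. "optimizer") are merged key by key, so an
    experiment can override optimizer.lr without losing weight_decay.
    """
    merged = deepcopy(shared)
    for key, value in experiment.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged

shared = {"seed": 42, "optimizer": {"type": "adamw", "lr": 1e-4, "weight_decay": 1e-4}}
experiment = {"optimizer": {"lr": 3e-4}, "image_size": 224}
cfg = merge_config(shared, experiment)
# cfg["optimizer"] keeps type and weight_decay, overrides only lr
```

The `deepcopy` keeps the shared defaults immutable, so loading one experiment cannot silently change the defaults seen by the next.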
- [ ] Define one CV protocol for all phases:
  - outer fold: held-out test fold
  - inner validation split: group-aware split from the remaining training folds for early stopping/model selection
  - final reported metrics: aggregate held-out test-fold results across the 5 outer folds
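The group-consistency requirement behind this protocol can be illustrated with a deterministic hash-based fold assignment (a sketch only: real stratified group CV would also balance class labels per fold, e.g. via scikit-learn's `StratifiedGroupKFold`; the file naming below is hypothetical):

```python
import hashlib

def fold_of(basename: str, n_folds: int = 5, seed: int = 42) -> int:
    """Deterministically map a basename (group) to one CV fold.

    Hashing the basename together with the seed keeps every image
    derived from the same original in the same fold, independent of
    file order or restarts.
    """
    digest = hashlib.sha256(f"{seed}:{basename}".encode()).hexdigest()
    return int(digest, 16) % n_folds

# Hypothetical naming: both variants of img001 share the group "img001".
samples = ["img001_real.png", "img001_fake.png", "img002_real.png"]
folds = {s: fold_of(s.rsplit("_", 1)[0]) for s in samples}
```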
### Data Preparation

- [ ] Verify dataset structure and integrity
- [ ] Check that real and fake images are properly organized by source
- [ ] Verify no data leakage between train/val/test splits or CV folds (group-aware by basename)
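The leakage check over basename groups can be sketched as follows (`check_no_group_leakage` is a hypothetical helper name):

```python
def check_no_group_leakage(splits: dict[str, list[str]]) -> None:
    """Raise if any basename group appears in more than one split.

    `splits` maps split name -> list of basenames; a basename shared
    between two splits means the same underlying image leaked across
    the train/val/test boundary.
    """
    seen: dict[str, str] = {}
    for split_name, basenames in splits.items():
        for b in set(basenames):
            if b in seen and seen[b] != split_name:
                raise ValueError(f"group {b!r} leaks between {seen[b]} and {split_name}")
            seen[b] = split_name

# Disjoint groups pass silently; overlapping groups raise ValueError.
check_no_group_leakage({"train": ["a", "b"], "val": ["c"], "test": ["d"]})
```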
### Cleanup

- [ ] Remove `classifier/tools/ensemble.py` (not part of reorganization plan, conflicts with explainability goals)
- [ ] Remove robustness evaluation from `classifier/tools/analyze.py` (lines 51-104, 82-104, 144) - not part of experimental plan
- [ ] Remove any unused or obsolete config files from previous experiments (see detailed list below)
- [ ] Clean up old output directories if needed (keep important results for reference)
### Config Files to Remove (39 total)

Root configs (6):

- `classifier/configs/resnet18_quick.json`
- `classifier/configs/resnet18.json`
- `classifier/configs/simple_cnn_large.json`
- `classifier/configs/simple_cnn_micro.json`
- `classifier/configs/simple_cnn_small.json`
- `classifier/configs/simple_cnn.json`
Phase 1 old configs (7):

- `classifier/configs/phase1/p1_cnn_base.json` (uses lr=1e-3, epochs=20 - should be 1e-4, 15)
- `classifier/configs/phase1/p1_cnn_aug.json`
- `classifier/configs/phase1/p1_resnet18_base.json` (duplicate of new baseline)
- `classifier/configs/phase1/p1_resnet18_aug.json`
- `classifier/configs/phase1/holdout/` (entire directory - 6 configs, source holdout not in new plan)
Phase 2 old configs (7):

- `classifier/configs/phase2/p2_resnet18_224.json` (should be p2a_resnet18_224.json)
- `classifier/configs/phase2/p2_resnet18_facecrop.json` (should be p2b_resnet18_facecrop.json)
- `classifier/configs/phase2/p2_resnet18_frozen.json` (frozen backbone not in new plan)
- `classifier/configs/phase2/p2_resnet34_224.json` (ResNet34 should be in Phase 3)
- `classifier/configs/phase2/p2_resnet34.json` (ResNet34 should be in Phase 3)
- `classifier/configs/phase2/p2_resnet50_frozen.json` (ResNet50 should be in Phase 3)
- `classifier/configs/phase2/p2_resnet50.json` (ResNet50 should be in Phase 3)
Phase 3 old configs (4):

- `classifier/configs/phase3/p3_efficientnet_b2.json` (EfficientNet-B2 not in new plan, only B0)
- `classifier/configs/phase3/p3_resnet18_facecrop_full.json` (ResNet18 full dataset should be Phase 4)
- `classifier/configs/phase3/p3_resnet18_freqaug.json` (frequency augmentation not in new plan)
- `classifier/configs/phase3/p3_vit_b16.json` (ViT not in new plan, replaced with ConvNeXt/MobileNet)
- Note: `p3_efficientnet_b0.json` - REMOVED (will be recreated after Phase 2 with correct settings)
Source holdout (6):

- `classifier/configs/source_holdout/` (entire directory - 6 configs, source holdout not in new plan)

Ablation (3):

- `classifier/configs/ablation/` (entire directory - 3 configs, ablation studies not in new plan)
Configs to KEEP (3):

- ✅ `classifier/configs/shared.json`
- ✅ `classifier/configs/phase1/p1_simplecnn_baseline.json`
- ✅ `classifier/configs/phase1/p1_resnet18_baseline.json`
Phase 2 alias configs removed (8):

- `classifier/configs/phase2/p2b_resnet18_128.json` (alias for p1_resnet18_baseline)
- `classifier/configs/phase2/p2b_simplecnn_128.json` (alias for p1_simplecnn_baseline)
- `classifier/configs/phase2/p2c_resnet18_nofacecrop.json` (alias for p2b_resnet18_224)
- `classifier/configs/phase2/p2c_simplecnn_nofacecrop.json` (alias for p2b_simplecnn_224)
- `classifier/configs/phase2/p2d_resnet18_noaug.json` (alias for p2b_resnet18_224)
- `classifier/configs/phase2/p2d_simplecnn_noaug.json` (alias for p2b_simplecnn_224)
- `classifier/configs/phase2/p2e_resnet18_facecrop_only.json` (alias for p2c_resnet18_facecrop)
- `classifier/configs/phase2/p2e_simplecnn_facecrop_only.json` (alias for p2c_simplecnn_facecrop)
Note: Comparison pairs (baseline vs treatment) are defined in the analysis notebook as a mapping dict, not as separate config files.
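Such a mapping might look like the following sketch (experiment names are taken from the Phase 2 config list; the dict name and exact pairs are illustrative, not the notebook's actual code):

```python
# Hypothetical baseline-vs-treatment pairs for the Phase 2 analysis notebook.
# Keys name the effect under test; values are (baseline, treatment) config stems.
COMPARISONS = {
    "resolution": ("p1_resnet18_baseline", "p2b_resnet18_224"),
    "facecrop": ("p2b_resnet18_224", "p2c_resnet18_facecrop"),
    "augmentation": ("p2b_resnet18_224", "p2d_resnet18_aug"),
    "facecrop_plus_aug": ("p2c_resnet18_facecrop", "p2e_resnet18_facecrop_aug"),
}
```

Keeping the pairs in one dict means the notebook iterates a single structure for tables, plots, and paired tests, instead of duplicating config files.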
## Phase 1: Architecture Baseline

### 1.1 Experiment Configs
- [ ] Create `classifier/configs/phase1/p1_simplecnn_baseline.json`:
  - backbone: simple_cnn
  - cnn_preset: medium
  - dropout: 0.0
  - epochs: 15
  - batch_size: 32
  - lr: 1e-4 (consistent with ResNet)
  - weight_decay: 1e-4
  - image_size: 128
  - data_dir: data
  - early_stopping_patience: 5
  - subsample: 0.2
  - face_crop: false
  - augment: false
  - seed: 42
- [ ] Create `classifier/configs/phase1/p1_resnet18_baseline.json`:
  - backbone: resnet18
  - pretrained: true
  - epochs: 15
  - batch_size: 32
  - lr: 1e-4
  - weight_decay: 1e-4
  - image_size: 128
  - data_dir: data
  - early_stopping_patience: 5
  - subsample: 0.2
  - face_crop: false
  - augment: false
  - seed: 42
### 1.2 Code Updates

- [ ] Implement 5-fold stratified group cross-validation by basename in training pipeline
- [ ] Update `classifier/src/training/trainer.py` to support CV
- [ ] Update `classifier/src/evaluation/evaluate.py` to support CV
- [ ] Ensure all metrics report mean ± std and confidence intervals across folds
### 1.3 Training

- [ ] Train SimpleCNN with 5-fold stratified group CV (via pipeline: `python -m pipeline run classifier/configs/phase1/p1_simplecnn_baseline.json`)
- [ ] Train ResNet18 with 5-fold stratified group CV (via pipeline: `python -m pipeline run classifier/configs/phase1/p1_resnet18_baseline.json`)
- [ ] Save all checkpoints and metrics (pipeline automatically fetches outputs to `classifier/outputs/`)
### 1.4 Analysis

- [ ] Use `classifier/notebooks/03_phase1_analysis.ipynb` for Phase 1 analysis
- [ ] Compare SimpleCNN vs ResNet18 performance:
  - Overall metrics (AUC, Accuracy, F1) with mean ± std and confidence intervals
  - Per-source metrics (text2img, inpainting, insight)
  - Train/val/test performance curves
  - Confusion matrices
  - Statistical significance testing
- [ ] Generate Grad-CAM visualizations (10-20 images per model)
- [ ] Document conclusions: which baseline is better and why
## Phase 2: Preprocessing Impact

### 2.1 Shortcut Analysis (2A)

- [ ] Create `classifier/configs/phase2/p2a_t1_original.json`:
  - backbone: resnet18
  - image_size: 224
  - subsample: 0.2
  - seed: 42
  - augment: false
  - normalization: imagenet
  - data_dir: data
- [ ] Create `classifier/configs/phase2/p2a_t2_real_norm.json`:
  - extends: p2a_t1_original.json
  - normalization: real_norm
  - Normalization: calculate mean/std from real training images only within each fold
- [ ] Geometry diagnostic was explored and then removed from the codebase (`src/evaluation/geometry.py` no longer exists):
  - Current pipeline always square-crops before resize, reducing rectangle-vs-square shortcut risk.
  - Shortcut analysis now relies on normalization and held-out-source evidence artifacts.
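The square-crop-before-resize behavior mentioned above amounts to a center crop to the shorter side; a minimal sketch of the crop-box arithmetic (the real implementation lives in the preprocessing pipeline):

```python
def center_square_box(width: int, height: int) -> tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of the largest centered square.

    Cropping to this box before resizing removes the aspect-ratio cue
    that could otherwise act as a rectangle-vs-square shortcut.
    """
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side

box = center_square_box(640, 480)
# box == (80, 0, 560, 480): a 480x480 square, centered horizontally
```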
- [ ] Train the 2 shortcut configs with 5-fold stratified group CV
- [ ] Compare results:
  - Standard vs matched-geometry eval for `p2a_t1_original` (letterboxing impact)
  - `p2a_t1_original` vs `p2a_t2_real_norm` (color distribution shortcut)
- [ ] Create `classifier/configs/phase2/p2a_t3_holdout_text2img.json`:
  - extends: p2a_t1_original.json
  - train_sources: ["wiki", "inpainting", "insight"]
  - eval_sources: ["wiki", "inpainting", "insight", "text2img"]
- [ ] Create `classifier/configs/phase2/p2a_t3_holdout_inpainting.json`:
  - extends: p2a_t1_original.json
  - train_sources: ["wiki", "text2img", "insight"]
  - eval_sources: ["wiki", "text2img", "insight", "inpainting"]
- [ ] Create `classifier/configs/phase2/p2a_t3_holdout_insight.json`:
  - extends: p2a_t1_original.json
  - train_sources: ["wiki", "text2img", "inpainting"]
  - eval_sources: ["wiki", "text2img", "inpainting", "insight"]
- [ ] Train the 3 source holdout configs with 5-fold stratified group CV
- [ ] Compare held-out source performance vs in-source performance:
  - Calculate AUC for held-out source (text2img, inpainting, insight)
  - Compute Δ (in-source AUC - held-out AUC)
  - If Δ > 0.05-0.10, model is learning source-specific features
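The Δ computation reduces to a rank-based AUC per evaluation subset; a stdlib-only sketch with illustrative scores (a real run would presumably use e.g. `sklearn.metrics.roc_auc_score`):

```python
def auc(pos_scores, neg_scores):
    """Rank-based (Mann-Whitney) AUC: fraction of (fake, real) score
    pairs ranked correctly, with ties counted as half."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# Illustrative fake-vs-real scores for in-source and held-out-source data.
in_source_auc = auc([0.9, 0.8, 0.7], [0.2, 0.3])  # 1.0: perfect separation
held_out_auc = auc([0.6, 0.4], [0.5, 0.3])        # 0.75
delta = in_source_auc - held_out_auc
source_specific = delta > 0.05  # gap exceeds the 0.05-0.10 threshold
```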
### 2.2 Resolution Impact (2B)

- [ ] Create `classifier/configs/phase2/p2b_simplecnn_224.json`:
  - backbone: simple_cnn
  - image_size: 224
  - subsample: 0.2
  - augment: false
  - seed: 42
  - data_dir: data
- [ ] Create `classifier/configs/phase2/p2b_resnet18_224.json`:
  - backbone: resnet18
  - image_size: 224
  - subsample: 0.2
  - augment: false
  - seed: 42
  - data_dir: data
- [ ] Train both 224 configs with 5-fold stratified group CV
- [ ] Compare 128×128 vs 224×224 for each model
  - 128 baseline is `p1_*_baseline` (comparison mapping in notebook)
### 2.3 Facecrop Impact (2C)

- [ ] Create `classifier/configs/phase2/p2c_simplecnn_facecrop.json`:
  - backbone: simple_cnn
  - image_size: 224
  - subsample: 0.2
  - augment: false
  - seed: 42
  - data_dir: cropped/classifier
- [ ] Create `classifier/configs/phase2/p2c_resnet18_facecrop.json`:
  - backbone: resnet18
  - image_size: 224
  - subsample: 0.2
  - augment: false
  - seed: 42
  - data_dir: cropped/classifier
- [ ] Train both facecrop configs with 5-fold stratified group CV
- [ ] Compare `p2b_resnet18_224` (no facecrop) vs `p2c_resnet18_facecrop` for each model
  - No-facecrop baseline is `p2b_*_224` (comparison mapping in notebook)
### 2.4 Augmentation Impact (2D)

- [ ] Create `classifier/configs/phase2/p2d_simplecnn_aug.json`:
  - backbone: simple_cnn
  - image_size: 224
  - subsample: 0.2
  - seed: 42
  - augment: {hflip_p: 0.5, rotation_degrees: 10, brightness: 0.2, contrast: 0.2, saturation: 0.1, hue: 0.02, grayscale_p: 0.1, blur_p: 0.1, erase_p: 0.2, noise_p: 0.3, noise_std: 0.04}
  - data_dir: data
- [ ] Create `classifier/configs/phase2/p2d_resnet18_aug.json`:
  - backbone: resnet18
  - image_size: 224
  - subsample: 0.2
  - seed: 42
  - augment: {hflip_p: 0.5, rotation_degrees: 10, brightness: 0.2, contrast: 0.2, saturation: 0.1, hue: 0.02, grayscale_p: 0.1, blur_p: 0.1, erase_p: 0.2, noise_p: 0.3, noise_std: 0.04}
  - data_dir: data
- [ ] Train both augmentation configs with 5-fold stratified group CV
- [ ] Compare `p2b_resnet18_224` (no aug) vs `p2d_resnet18_aug` for each model
  - No-aug baseline is `p2b_*_224` (comparison mapping in notebook)
### 2.5 Augmentation + Facecrop (2E)

- [ ] Create `classifier/configs/phase2/p2e_simplecnn_facecrop_aug.json`:
  - backbone: simple_cnn
  - image_size: 224
  - subsample: 0.2
  - seed: 42
  - augment: {hflip_p: 0.5, rotation_degrees: 10, brightness: 0.2, contrast: 0.2, saturation: 0.1, hue: 0.02, grayscale_p: 0.1, blur_p: 0.1, erase_p: 0.2, noise_p: 0.3, noise_std: 0.04}
  - data_dir: cropped/classifier
- [ ] Create `classifier/configs/phase2/p2e_resnet18_facecrop_aug.json`:
  - backbone: resnet18
  - image_size: 224
  - subsample: 0.2
  - seed: 42
  - augment: {hflip_p: 0.5, rotation_degrees: 10, brightness: 0.2, contrast: 0.2, saturation: 0.1, hue: 0.02, grayscale_p: 0.1, blur_p: 0.1, erase_p: 0.2, noise_p: 0.3, noise_std: 0.04}
  - data_dir: cropped/classifier
- [ ] Train both facecrop+aug configs with 5-fold stratified group CV
- [ ] Compare `p2c_resnet18_facecrop` (facecrop only) vs `p2e_resnet18_facecrop_aug` for each model
  - Facecrop-only baseline is `p2c_*_facecrop` (comparison mapping in notebook)
### 2.6 Phase 2 Analysis

- [ ] Use `classifier/notebooks/04_phase2_analysis.ipynb` for Phase 2 analysis
- [ ] For each experiment (2A-2E):
  - Load 5-fold stratified group CV results (mean ± std and confidence intervals)
  - Generate overall metrics (AUC, Accuracy, F1)
  - Generate per-source metrics (text2img, inpainting, insight)
  - Calculate train/val gap
  - Calculate pairwise source AUC variance (wiki-vs-source AUC variance)
  - Statistical significance testing vs baseline
  - Generate comparison visualizations (bar charts, heatmaps)
- [ ] For 2A (Shortcut Analysis):
  - Compare original-test vs alternative geometry evidence if reintroduced in a dedicated tool/notebook
  - Compare ImageNet vs real-image-only normalization (color distribution shortcuts)
  - Load source holdout results (3 configs)
  - Calculate held-out source AUC vs in-source AUC for each holdout experiment
  - Compute Δ (in-source AUC - held-out AUC)
  - If Δ > 0.05-0.10, model is learning source-specific features
  - Generate source holdout comparison table
- [ ] For each model/condition:
  - Generate Grad-CAM visualizations (10-20 images per condition)
  - Organize by experiment, prediction type, and source
- [ ] Answer key questions:
  - Which preprocessing choices are statistically significant?
  - Do certain sources benefit more from specific preprocessing?
  - Is there an interaction between facecrop and augmentation?
  - Are shortcuts being learned (resolution, color distribution)?
  - Is the model learning source-specific features (source holdout)?
  - Does augmentation remove shortcuts or over-regularize?
  - What features do models focus on (based on Grad-CAM)?
- [ ] Generate comprehensive metrics comparison table
- [ ] Use paired fold-wise statistical tests for model comparisons, with bootstrap confidence intervals for key metrics where useful
- [ ] Provide evidence-based conclusions for each experiment
- [ ] Provide recommendations for Phase 3 (best preprocessing settings)
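Because comparable experiments share fold assignments, the paired fold-wise testing can be done exactly with a sign-flip permutation test; a stdlib-only sketch (the metric values are illustrative):

```python
from itertools import product
from statistics import mean

def paired_permutation_pvalue(metric_a, metric_b):
    """Exact two-sided sign-flip permutation test on paired fold metrics.

    Valid only when both models were evaluated on the same fold
    assignments; with 5 folds there are just 2**5 = 32 sign patterns,
    so enumerating them all is cheap.
    """
    diffs = [a - b for a, b in zip(metric_a, metric_b)]
    observed = abs(mean(diffs))
    hits = total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(mean(s * d for s, d in zip(signs, diffs))) >= observed:
            hits += 1
    return hits / total

# Illustrative fold-wise AUCs for two models sharing fold assignments.
p = paired_permutation_pvalue([0.91, 0.95, 0.90, 0.93, 0.92],
                              [0.88, 0.91, 0.89, 0.90, 0.90])
# all five differences are positive, so only the two all-same-sign
# patterns reach the observed mean: p = 2/32 = 0.0625
```

Note the floor this implies: with 5 folds the smallest attainable two-sided p-value is 2/32 ≈ 0.06, which is why the plan pairs these tests with bootstrap confidence intervals.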
## Phase 3: Extended Architecture Exploration

### 3.1 Experiment Configs

Use the best preprocessing choices from Phase 2. The placeholders below assume 224×224, face crop enabled, and no augmentation unless Phase 2 results justify different settings.
- [ ] Create `classifier/configs/phase3/p3_resnet34.json`:
  - backbone: resnet34
  - pretrained: true
  - epochs: 15
  - batch_size: 32
  - lr: 1e-4
  - weight_decay: 1e-4
  - image_size: 224
  - face_crop: true (or best from Phase 2C/E)
  - face_crop_margin: 0.6
  - augment: false (or best from Phase 2D/E)
  - subsample: 0.2
  - seed: 42
  - early_stopping_patience: 5
- [ ] Create `classifier/configs/phase3/p3_resnet50.json`:
  - backbone: resnet50
  - pretrained: true
  - epochs: 15
  - batch_size: 32
  - lr: 1e-4
  - weight_decay: 1e-4
  - image_size: 224
  - face_crop: true (or best from Phase 2C/E)
  - face_crop_margin: 0.6
  - augment: false (or best from Phase 2D/E)
  - subsample: 0.2
  - seed: 42
  - early_stopping_patience: 5
- [ ] Create `classifier/configs/phase3/p3_efficientnet_b0.json`:
  - backbone: efficientnet_b0
  - pretrained: true
  - epochs: 15
  - batch_size: 32
  - lr: 1e-4
  - weight_decay: 1e-4
  - image_size: 224
  - face_crop: true (or best from Phase 2C/E)
  - augment: false (or best from Phase 2D/E)
  - subsample: 0.2
  - seed: 42
  - early_stopping_patience: 5
- [ ] Create `classifier/configs/phase3/p3_convnext_tiny.json`:
  - backbone: convnext_tiny
  - pretrained: true
  - epochs: 15
  - batch_size: 32
  - lr: 1e-4
  - weight_decay: 1e-4
  - image_size: 224
  - face_crop: true (or best from Phase 2C/E)
  - augment: false (or best from Phase 2D/E)
  - subsample: 0.2
  - seed: 42
  - early_stopping_patience: 5
- [ ] Create `classifier/configs/phase3/p3_mobilenetv3_small.json`:
  - backbone: mobilenetv3_small
  - pretrained: true
  - epochs: 15
  - batch_size: 32
  - lr: 1e-4
  - weight_decay: 1e-4
  - image_size: 224
  - face_crop: true (or best from Phase 2C/E)
  - augment: false (or best from Phase 2D/E)
  - subsample: 0.2
  - seed: 42
  - early_stopping_patience: 5
### 3.2 Model Implementation

- [ ] Implement ConvNeXt-Tiny in `classifier/src/models/convnext.py`
- [ ] Implement MobileNetV3-Small in `classifier/src/models/mobilenet.py`
- [ ] Register both models in `classifier/src/models/__init__.py`

### 3.3 Training

- [ ] Train ResNet34 with 5-fold stratified group CV
- [ ] Train ResNet50 with 5-fold stratified group CV
- [ ] Train EfficientNet-B0 with 5-fold stratified group CV
- [ ] Train ConvNeXt-Tiny with 5-fold stratified group CV
- [ ] Train MobileNetV3-Small with 5-fold stratified group CV
- [ ] Save all checkpoints and metrics
### 3.4 Analysis

- [ ] Use `classifier/notebooks/05_phase3_analysis.ipynb` for Phase 3 analysis
- [ ] Load 5-fold stratified group CV results for all models (mean ± std and confidence intervals)
- [ ] Generate overall metrics for each model
- [ ] Generate per-source metrics for each model
- [ ] Compare with Phase 1 baselines (ResNet18, SimpleCNN)
- [ ] Statistical significance testing vs baselines
- [ ] Generate Grad-CAM visualizations for top models (10-20 images each)
- [ ] Parameter count vs performance analysis
- [ ] Conclusions: which architectures work best and why
## Phase 4: Final Analysis on Best Models

### 4.1 Select Top Models

- [ ] Based on Phases 1-3 results, select top 3-4 models
- [ ] Document selection criteria (e.g., top AUC, balanced performance, efficiency)

### 4.2 Data Quantity Scaling (4A)

- [ ] For each selected model, create configs for different data sizes:
  - `classifier/configs/phase4/p4a_<model>_20pct.json` (subsample: 0.2)
  - `classifier/configs/phase4/p4a_<model>_50pct.json` (subsample: 0.5)
  - `classifier/configs/phase4/p4a_<model>_100pct.json` (subsample: 1.0)
- [ ] In every 4A config, explicitly set the best Phase 2 preprocessing choices:
  - image_size: best from Phase 2B
  - face_crop: best from Phase 2C/E
  - augment: best from Phase 2D/E
- [ ] Train each model with 5-fold stratified group CV at all three data sizes
- [ ] Compare how each model scales with more data
### 4.3 Full Dataset Evaluation (4B)

- [ ] For each selected model, create config for full dataset:
  - `classifier/configs/phase4/p4b_<model>_full.json` (subsample: 1.0)
- [ ] In every 4B config, explicitly set the same best Phase 2 preprocessing choices used in 4A
- [ ] Train each model on full dataset with 5-fold stratified group CV
- [ ] Generate detailed per-source metrics
- [ ] Generate Grad-CAM visualizations (10-20 images each)
- [ ] Perform hard example analysis (false positives/negatives) with visualizations
- [ ] Generate confidence distribution histograms
- [ ] Report cross-validation results (mean ± std with confidence intervals)
### 4.4 Analysis

- [ ] Use `classifier/notebooks/06_phase4_analysis.ipynb` for Phase 4 analysis
- [ ] Load data quantity scaling results
- [ ] Load full dataset evaluation results
- [ ] Generate comprehensive metrics comparison table
- [ ] Generate per-source metrics for final models
- [ ] Generate Grad-CAM galleries for final models
- [ ] Perform hard example analysis with visualizations
- [ ] Generate confidence distribution histograms
- [ ] Final model comparison and selection
- [ ] Conclusions and recommendations

## Notebooks and Analysis

This section is the consolidated notebook checklist for the notebooks referenced in the phase sections above; do not create duplicate notebooks for the same phase.
### 5.1 Exploratory Data Analysis

- [ ] Create `classifier/notebooks/01_eda.ipynb`
- [ ] Dataset overview (real vs fake distribution, sources)
- [ ] Image resolution/aspect ratio analysis (identify potential shortcuts)
- [ ] Color distribution analysis (identify potential shortcuts)
- [ ] Sample visualization from each source
- [ ] Statistical summary of the dataset
- [ ] Data quality checks

### 5.2 Preprocessing Pipeline

- [ ] Create `classifier/notebooks/02_preprocessing.ipynb`
- [ ] Square crop and resize implementation demonstration
- [ ] Face crop (MTCNN) demonstration and effectiveness analysis
- [ ] Augmentation pipeline visualization (before/after examples)
- [ ] Z-score normalization comparison (ImageNet vs real-image-only)
- [ ] Data split verification (group-aware by basename, no overlap)
- [ ] Preprocessing impact visualization
### 5.3 Phase 1 Analysis

- [ ] Create `classifier/notebooks/03_phase1_analysis.ipynb`
- [ ] Load Phase 1 training results
- [ ] Generate 5-fold stratified group CV results (mean ± std with confidence intervals)
- [ ] Generate per-source metrics for each model
- [ ] Generate train/val/test performance curves
- [ ] Generate confusion matrices
- [ ] Perform statistical significance testing between models
- [ ] Generate Grad-CAM visualizations (10-20 images each)
- [ ] Document conclusions: which baseline is better and why
### 5.4 Phase 2 Analysis

- [ ] Create `classifier/notebooks/04_phase2_analysis.ipynb`
- [ ] Load all Phase 2 experiment results
- [ ] For each experiment (2A-2E):
  - Generate 5-fold stratified group CV results (mean ± std with confidence intervals)
  - Generate overall metrics
  - Generate per-source metrics
  - Calculate train/val gap
  - Calculate pairwise source AUC variance (wiki-vs-source AUC variance)
  - Perform statistical significance testing
- [ ] Generate comparison tables across all Phase 2 experiments
- [ ] Generate comparison visualizations (bar charts, heatmaps)
- [ ] For each model/condition, generate Grad-CAM visualizations (10-20 images)
- [ ] Organize visualizations by experiment, model, prediction type, and source
- [ ] Answer key analysis questions
- [ ] Generate comprehensive metrics comparison table
- [ ] Provide evidence-based conclusions for each experiment
- [ ] Provide recommendations for Phase 3
### 5.5 Phase 3 Analysis

- [ ] Create `classifier/notebooks/05_phase3_analysis.ipynb`
- [ ] Load Phase 3 training results
- [ ] Generate 5-fold stratified group CV results for each model (mean ± std with confidence intervals)
- [ ] Generate per-source metrics for each model
- [ ] Compare with Phase 1 baselines (ResNet18, SimpleCNN)
- [ ] Perform statistical significance testing vs baselines
- [ ] Generate Grad-CAM visualizations for top models (10-20 images each)
- [ ] Parameter count vs performance analysis
- [ ] Conclusions: which architectures work best and why

### 5.6 Phase 4 Analysis

- [ ] Create `classifier/notebooks/06_phase4_analysis.ipynb`
- [ ] Load data quantity scaling results
- [ ] Load full dataset evaluation results
- [ ] Generate comprehensive metrics comparison table
- [ ] Generate per-source metrics for final models
- [ ] Generate Grad-CAM galleries for final models
- [ ] Perform hard example analysis with visualizations
- [ ] Generate confidence distribution histograms
- [ ] Final model comparison and selection
- [ ] Conclusions and recommendations
### 5.7 Grad-CAM Deep Dive (Optional)

- [ ] Create `classifier/notebooks/07_gradcam_deep_dive.ipynb`
- [ ] Load Grad-CAM results from all phases
- [ ] Comprehensive Grad-CAM analysis across all phases and models
- [ ] Feature visualization for different model architectures
- [ ] CNN vs EfficientNet vs ConvNeXt comparison:
  - What regions do different architectures focus on?
  - Are there systematic differences in attention patterns?
- [ ] Evidence of shortcut removal analysis across phases
- [ ] Temporal analysis: does model attention change with different preprocessing?
- [ ] Generate visual explanations suitable for presentation
## Code Implementation Tasks

### Cross-Validation Implementation

- [ ] Update `classifier/src/training/trainer.py` to support 5-fold stratified group CV by basename
- [ ] Update `classifier/src/evaluation/evaluate.py` to support grouped CV splits
- [ ] Implement metric aggregation across folds (mean ± std)
- [ ] Ensure all metrics report confidence intervals
- [ ] Reuse the same fold assignments for comparable experiments so paired statistical tests are valid
- [ ] Rename `classifier/run_cv.py` to `classifier/run.py` (pipeline expects `classifier/run.py`)
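Fold aggregation with a bootstrap confidence interval might look like this sketch (`aggregate_folds` is a hypothetical helper name; the fold values are illustrative):

```python
import random
from statistics import mean, stdev

def aggregate_folds(values, n_boot=2000, alpha=0.05, seed=42):
    """Mean ± std across folds plus a percentile bootstrap CI.

    Resamples the fold values with replacement; with only 5 folds the
    CI is coarse, which is why it complements, rather than replaces,
    the paired fold-wise significance tests.
    """
    rng = random.Random(seed)
    boots = sorted(mean(rng.choices(values, k=len(values))) for _ in range(n_boot))
    lo = boots[int(alpha / 2 * n_boot)]
    hi = boots[int((1 - alpha / 2) * n_boot) - 1]
    return {"mean": mean(values), "std": stdev(values), "ci95": (lo, hi)}

stats = aggregate_folds([0.91, 0.93, 0.92, 0.94, 0.90])  # illustrative fold AUCs
```

Seeding the bootstrap keeps the reported interval reproducible across reruns, matching the plan's fixed-seed policy.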
### Model Implementations

- [ ] Implement ConvNeXt-Tiny in `classifier/src/models/convnext.py`
- [ ] Implement MobileNetV3-Small in `classifier/src/models/mobilenet.py`
- [ ] Register both models in `classifier/src/models/__init__.py`

### Normalization Implementation

- [ ] Implement function to calculate mean/std from real training images only
- [ ] Update `classifier/src/preprocessing/pipeline.py` to support custom normalization stats
- [ ] Test ImageNet normalization vs real-image-only normalization
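Computing normalization stats from real images only might be sketched as follows (assumes images as a NumPy array in [0, 1] with label 0 = real; the actual pipeline presumably streams batches instead of holding all images in memory):

```python
import numpy as np

def real_image_stats(images: np.ndarray, labels: np.ndarray):
    """Per-channel mean/std computed from real training images only.

    Using only real images (label 0) keeps the generators' color
    statistics out of the normalization constants, so a color-
    distribution shift in fakes remains visible to the model.
    """
    real = images[labels == 0]  # (N_real, H, W, 3)
    return real.mean(axis=(0, 1, 2)), real.std(axis=(0, 1, 2))

# Illustrative batch: 8 tiny RGB images in [0, 1], first 4 are real.
images = np.random.default_rng(0).random((8, 4, 4, 3))
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
mean, std = real_image_stats(images, labels)  # each has shape (3,)
```

Per the 2A protocol, these stats would be recomputed from the training portion of each fold so the test fold never influences normalization.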
### Evaluation Improvements

- [ ] Ensure test set uses `train=False` to disable augmentation
- [ ] Ensure diagnostic evaluation transforms never change the training data
- [ ] Verify CV fold assignments are identical across comparable experiments (same seed and basename grouping)
- [ ] Implement per-source metrics with detection rate and false alarm rate
- [ ] Implement pairwise AUC calculations
- [ ] Implement train/val gap calculations
- [ ] Implement pairwise source AUC variance calculations
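Per-source detection rate (TPR on fakes) and false alarm rate (FPR on reals) can be sketched as follows (`per_source_rates` is a hypothetical helper; label 1 = fake is an assumption):

```python
from collections import defaultdict

def per_source_rates(labels, preds, sources):
    """Detection rate and false alarm rate per source tag.

    Detection rate = TP / (TP + FN) over fake samples of that source;
    false alarm rate = FP / (FP + TN) over real samples. A source with
    no reals (or no fakes) gets None for the undefined rate.
    """
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
    for y, p, s in zip(labels, preds, sources):
        if y == 1:
            counts[s]["tp" if p == 1 else "fn"] += 1
        else:
            counts[s]["fp" if p == 1 else "tn"] += 1
    rates = {}
    for s, c in counts.items():
        fakes, reals = c["tp"] + c["fn"], c["fp"] + c["tn"]
        rates[s] = {
            "detection_rate": c["tp"] / fakes if fakes else None,
            "false_alarm_rate": c["fp"] / reals if reals else None,
        }
    return rates

rates = per_source_rates(
    labels=[1, 1, 0, 0], preds=[1, 0, 0, 1],
    sources=["text2img", "text2img", "wiki", "wiki"],
)
# text2img: detection_rate 0.5; wiki: false_alarm_rate 0.5
```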
### Grad-CAM Improvements

- [ ] Ensure Grad-CAM works for all model types (CNN-based)
- [ ] Implement Grad-CAM for ConvNeXt
- [ ] Implement Grad-CAM for MobileNetV3
- [ ] Organize Grad-CAM outputs by experiment, model, prediction type, source
### Final Report Preparation

- [ ] Compile results from all phases
- [ ] Create presentation slides (PDF format):
  - Brief description of deep learning solutions (discriminative + generative)
  - Description of implementation steps and improvements
    - Motivate choices for architecture, training strategy, etc.
    - Show intermediate results
    - Interpret results and what changed
    - What was decided to improve results
  - Classification performance results
    - Experimental setup
    - Train/val/test splits
    - Performance metrics chosen
  - Data generation performance results
    - Experimental setup
    - Performance metrics chosen
  - Discussion and conclusions
    - Comments on performance
    - Final remarks
- [ ] Fill auto-evaluation file
## Summary

Total tasks: ~150+

This implementation plan covers:

- ✅ All 4 phases with comprehensive experiments
- ✅ 5-fold stratified group cross-validation for all experiments
- ✅ 7 analysis notebooks for robust validation
- ✅ Shortcut analysis (resolution/ratio + color distribution + source holdout)
- ✅ Source holdout experiments to detect source-specific feature learning
- ✅ Grad-CAM visualizations for explainability
- ✅ Statistical analysis with confidence intervals
- ✅ Per-source metrics for all experiments
- ✅ Data quantity scaling analysis
- ✅ Full dataset evaluation on best models
- ✅ Comprehensive documentation and reporting

Key features:

- Reproducible experiments with fixed seeds
- Stratified group CV keeps basename groups together while balancing class distribution
- Multiple shortcut analyses to prevent model cheating (resolution, color, source-specific)
- Source holdout experiments to test generalization to unseen sources
- Grad-CAM for explainability
- Statistical rigor with confidence intervals
- Per-source analysis to understand model behavior
- Clear progression from baselines -> preprocessing -> architectures -> final evaluation