Classifier Reorganization Plan (v2)

Analysis of Current Phasing Issues

Your current phasing has several problems that make it difficult to present a rigorous, explainable report:

Current Problems

  1. Inconsistent comparison conditions:

    • SimpleCNN uses lr=1e-3, ResNet18 uses lr=1e-4
    • SimpleCNN trains 20 epochs (no early stopping), ResNet18 trains 15 epochs (with early stopping)
    • This makes direct comparisons invalid
  2. No cross-validation:

    • Only a single 80/10/10 split
    • Results may be split-dependent
    • No confidence intervals on metrics
  3. Augmentation testing is incomplete:

    • Only tested on ResNet18 (Phase 3), not across architectures
    • Performance drop could mean: (a) removing shortcuts (good) or (b) over-regularization (bad)
    • No way to distinguish these cases
  4. Facecrop impact not generalized:

    • Only ResNet18 tested with facecrop
    • Don't know if EfficientNet or ViT benefit similarly
  5. Full dataset only on one model:

    • Only ResNet18 tested on full dataset
    • Don't know if data quantity helps all models equally
  6. Test set integrity:

    • Need to verify the test set uses original images (no augmentation; no preprocessing, or only the minimum strictly necessary)
    • Need to ensure same train/val/test splits across all model comparisons
    • Need central config for shared parameters across phases

I suggest reorganizing into 4 phases with clear, isolated variables. All phases use 5-fold stratified cross-validation as standard practice to ensure balanced class distribution across folds.

Phase 1: Controlled Baseline Comparison

Goal: Compare simple architectures under identical conditions to establish baselines

Fixed conditions for ALL models:

  • Data: 20% subsample
  • Resolution: 128×128
  • No face crop
  • No augmentation
  • Optimizer: AdamW (lr=1e-4, weight_decay=1e-4)
  • Scheduler: CosineAnnealingLR (T_max=15)
  • Epochs: 15 with early stopping (patience=5)
  • Batch size: 32
  • 5-fold stratified cross-validation (report mean ± std)
| Model | Params | Expected AUC (mean ± std) |
|---|---|---|
| SimpleCNN | ~400k | ? |
| ResNet18 | ~11.7M | ? |

This gives you: Clean, comparable baseline for simple architectures with confidence intervals
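
These fixed conditions translate directly into one shared training setup for every model. A minimal PyTorch sketch (the helper names and early-stopping loop are illustrative, not the project's actual trainer):

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def make_training_setup(model: torch.nn.Module):
    # Identical optimizer/scheduler for every Phase 1 model
    optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
    scheduler = CosineAnnealingLR(optimizer, T_max=15)
    return optimizer, scheduler

def train_with_early_stopping(run_epoch, max_epochs=15, patience=5):
    # run_epoch() trains one epoch and returns the fold's validation AUC
    # (a placeholder callable; the real trainer lives in the project code)
    best_auc, bad_epochs = 0.0, 0
    for epoch in range(max_epochs):
        val_auc = run_epoch()
        if val_auc > best_auc:
            best_auc, bad_epochs = val_auc, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_auc
```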

These same 2 models will be used in Phase 2 for preprocessing experiments.


Phase 2: Preprocessing Impact (Same 2 Models from Phase 1)

Goal: Test each preprocessing change on the SAME 2 models from Phase 1

Experimental questions:

  • Does higher resolution improve performance?
  • Does face cropping improve performance?
  • Does augmentation improve or hurt performance?
  • Does augmentation interact with face cropping?
  • Is the model learning any shortcuts (e.g., resolution differences or aspect ratios)?

2A: Shortcut Analysis

Goal: Establish whether the baseline model exploits geometry, colour, or source-specific shortcuts before drawing any conclusions from preprocessing experiments.

Test 1: Resolution/Ratio Shortcuts (Letterboxing)

  • Train on original images (real = rectangular, fake = square); evaluate the same checkpoint under a standard crop vs letterbox-padded real images to confirm whether geometry is a discriminative cue
  • Models: ResNet18
  • Data: 20% subsample
  • 5-fold stratified CV (balanced class distribution)
  • Resolution: 224×224
  • No facecrop, no augmentation
| Experiment | AUC | Train/Val Gap | Per-Source AUC Variance |
|---|---|---|---|
| Original images (standard eval) | ? | ? | ? |
| Matched geometry (letterboxed real images) | ? | ? | ? |
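
Letterboxing pads real images to a square before the resize so their geometry matches the fakes. A minimal sketch using torchvision (the function name is illustrative):

```python
import torchvision.transforms.functional as TF
from PIL import Image

def letterbox(img: Image.Image, size: int = 224, fill: int = 0) -> Image.Image:
    # Pad the shorter side so rectangular (real) images become square,
    # removing aspect ratio as a cue, then resize to the target resolution.
    w, h = img.size
    side = max(w, h)
    left = (side - w) // 2
    top = (side - h) // 2
    padded = TF.pad(img, [left, top, side - w - left, side - h - top], fill=fill)
    return TF.resize(padded, [size, size])
```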

Test 2: Color Distribution Shortcuts

  • Compare: Train with ImageNet normalization stats vs real-image-only normalization stats
  • Models: ResNet18
  • Data: 20% subsample
  • 5-fold stratified CV (balanced class distribution)
  • Resolution: 224×224
  • No facecrop, no augmentation
  • ImageNet stats: mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)
  • Real-image stats: Calculate mean/std from real training images only, apply to all
| Experiment | AUC | Train/Val Gap | Per-Source AUC Variance |
|---|---|---|---|
| ImageNet normalization | ? | ? | ? |
| Real-image-only normalization | ? | ? | ? |
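
Computing the real-image-only statistics is a single pass over the real training images. A sketch, assuming the loader yields float tensors in [0, 1] that have already been resized to a common resolution:

```python
import torch
from torch.utils.data import DataLoader

def channel_stats(real_loader: DataLoader):
    # Per-channel mean/std over real training images only; the same stats
    # are then applied to normalize all images (real and fake).
    n_images = 0
    mean = torch.zeros(3)
    mean_sq = torch.zeros(3)
    for images, _ in real_loader:  # images: [B, 3, H, W]
        mean += images.mean(dim=(0, 2, 3)) * images.size(0)
        mean_sq += images.pow(2).mean(dim=(0, 2, 3)) * images.size(0)
        n_images += images.size(0)
    mean /= n_images
    std = (mean_sq / n_images - mean.pow(2)).sqrt()
    return mean, std
```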

Test 3: Source-Specific Feature Learning (Source Holdout)

  • Compare: Train on all sources vs train with one source held out
  • Models: ResNet18
  • Data: 20% subsample
  • 5-fold stratified CV (balanced class distribution)
  • Resolution: 224×224
  • No facecrop, no augmentation
  • Hold out each fake source (text2img, inpainting, insight) separately
| Experiment | Held-out Source | Train Sources | Held-out AUC | In-Source AUC | Δ (In-Source − Held-out) |
|---|---|---|---|---|---|
| Baseline | None | All | - | ? | - |
| Holdout text2img | text2img | wiki, inpainting, insight | ? | ? | ? |
| Holdout inpainting | inpainting | wiki, text2img, insight | ? | ? | ? |
| Holdout insight | insight | wiki, text2img, inpainting | ? | ? | ? |

Interpretation: If the held-out source AUC is significantly lower than the in-source AUC (Δ > 0.05–0.10), the model is learning source-specific features. Similarly, if AUC drops significantly under matched geometry (Test 1), the model is exploiting aspect ratio as a shortcut; both findings must be established before interpreting the resolution or facecrop results.
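
A sketch of how the holdout masks could be built (the array names are assumptions; the masks would be applied within each CV fold's train and evaluation partitions, respectively):

```python
import numpy as np

def source_holdout_masks(labels: np.ndarray, sources: np.ndarray, held_out: str):
    # labels: 1 = fake, 0 = real; sources: generator/source name per image.
    # Training excludes fakes from the held-out generator; real (wiki)
    # images stay in both pools so the task remains real-vs-fake.
    train_mask = ~((labels == 1) & (sources == held_out))
    heldout_eval_mask = (labels == 0) | ((labels == 1) & (sources == held_out))
    return train_mask, heldout_eval_mask
```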

2B: Resolution Impact (no facecrop, no augmentation)

  • Test: 128×128 vs 224×224
  • Models: SimpleCNN, ResNet18
  • Data: 20% subsample
  • 5-fold stratified CV (balanced class distribution)
| Model | 128×128 AUC | 224×224 AUC | Δ |
|---|---|---|---|
| SimpleCNN | ? | ? | ? |
| ResNet18 | ? | ? | ? |

2C: Facecrop Impact (224×224, no augmentation)

  • Test: No facecrop vs MTCNN facecrop
  • Models: SimpleCNN, ResNet18
  • Data: 20% subsample
  • 5-fold stratified CV (balanced class distribution)
| Model | No Facecrop AUC | Facecrop AUC | Δ |
|---|---|---|---|
| SimpleCNN | ? | ? | ? |
| ResNet18 | ? | ? | ? |
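
For the facecrop condition, a minimal sketch assuming the facenet-pytorch MTCNN implementation (the margin value and fallback behaviour are assumptions):

```python
from facenet_pytorch import MTCNN

# Detector configured once; margin adds context around the detected face box
mtcnn = MTCNN(image_size=224, margin=20, post_process=False)

def face_crop(img):
    # Returns a 224×224 face-crop tensor, or None when no face is found;
    # a sensible fallback (e.g., center crop) should handle the None case.
    return mtcnn(img)
```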

2D: Augmentation Impact (224×224, without facecrop)

  • Test: No augmentation vs augmentation
  • Models: SimpleCNN, ResNet18
  • Data: 20% subsample
  • 5-fold stratified CV (balanced class distribution)
  • Verify test set has no augmentation (code inspection of get_transforms(train=False, ...))
  • Analyze shortcut removal: Compare train/val gaps and per-source AUC balance
| Model | No Aug AUC | With Aug AUC | Δ | Train/Val Gap (No Aug) | Train/Val Gap (With Aug) |
|---|---|---|---|---|---|
| SimpleCNN | ? | ? | ? | ? | ? |
| ResNet18 | ? | ? | ? | ? | ? |

Experimental question: Does augmentation without facecrop improve or hurt performance?
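
The train/eval asymmetry the plan calls for would look roughly like this, assuming a torchvision pipeline (the exact augmentation list here is illustrative, not the project's):

```python
from torchvision import transforms

IMAGENET_STATS = dict(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225))

def get_transforms(train: bool, size: int = 224):
    # Augmentation is applied only when train=True; val/test always get a
    # deterministic resize + normalize, keeping the test set unaugmented.
    base = [transforms.Resize((size, size)), transforms.ToTensor(),
            transforms.Normalize(**IMAGENET_STATS)]
    if not train:
        return transforms.Compose(base)
    return transforms.Compose([
        transforms.RandomHorizontalFlip(),
        transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
        transforms.RandomResizedCrop(size, scale=(0.8, 1.0)),
        *base[1:],  # ToTensor + Normalize (resize handled by RandomResizedCrop)
    ])
```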

2E: Augmentation + Facecrop Combined (224×224)

  • Test: Facecrop only vs Facecrop + augmentation
  • Models: SimpleCNN, ResNet18
  • Data: 20% subsample
  • 5-fold stratified CV (balanced class distribution)
  • Analyze shortcut removal: Compare train/val gaps and per-source AUC balance
| Model | Facecrop Only AUC | Facecrop + Aug AUC | Δ | Train/Val Gap (Only) | Train/Val Gap (With Aug) |
|---|---|---|---|---|---|
| SimpleCNN | ? | ? | ? | ? | ? |
| ResNet18 | ? | ? | ? | ? | ? |

Experimental question: Does augmentation with facecrop improve or hurt performance compared to facecrop alone?

This gives you:

  • Isolated impact of each preprocessing choice on SimpleCNN and ResNet18
  • Verification that the model is not learning shortcuts
  • Understanding of how augmentation interacts with face cropping
  • Shortcut removal analysis through train/val gap and per-source AUC metrics

Phase 3: Extended Architecture Exploration

Goal: Test additional architectures to find the best performing models

Fixed conditions (based on best findings from Phase 2):

  • Data: 20% subsample
  • Resolution: Best from Phase 2B (likely 224×224)
  • Facecrop: Best from Phase 2C/2E (likely yes)
  • Augmentation: Best from Phase 2D/2E (depends on experimental results)
  • Optimizer: AdamW (lr=1e-4, weight_decay=1e-4)
  • Scheduler: CosineAnnealingLR (T_max=15)
  • Epochs: 15 with early stopping (patience=5)
  • Batch size: 32
  • 5-fold stratified cross-validation (balanced class distribution)
| Model | Params | Rationale |
|---|---|---|
| ResNet34 | ~21.8M | Deeper ResNet - test if more capacity helps |
| ResNet50 | ~25.6M | Even deeper with bottleneck blocks |
| EfficientNet-B0 | ~4.0M | Efficient compound scaling |
| ConvNeXt-Tiny | ~29M | Modern CNN, different architecture family |
| MobileNetV3-Small | ~2.5M | Lightweight efficiency comparison |

This gives you: Extended architecture exploration to identify top-performing models for Phase 4

  • ResNet depth progression (18 -> 34 -> 50)
  • Efficient architectures (EfficientNet-B0, MobileNetV3-Small)
  • Modern CNN with different inductive bias (ConvNeXt-Tiny)
  • Size range (2.5M to 29M parameters)
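
All Phase 3 backbones are available from a single factory if the timm library is used (an assumption; the torchvision model zoo covers the ResNets equally well):

```python
import timm

# timm model identifiers for the Phase 3 candidates
PHASE3_MODELS = [
    "resnet34",
    "resnet50",
    "efficientnet_b0",
    "convnext_tiny",
    "mobilenetv3_small_100",
]

def build_model(name: str, num_classes: int = 2):
    # Pretrained ImageNet weights, classifier head replaced for real-vs-fake
    return timm.create_model(name, pretrained=True, num_classes=num_classes)
```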

Phase 4: Final Analysis on Best Models

Goal: Comprehensive evaluation of top-performing models from Phases 1-3

Select top 3-4 models based on Phase 1-3 results (e.g., ResNet18, ResNet34, EfficientNet-B0, ConvNeXt-Tiny)

4A: Data Quantity Scaling

Test how each best model scales with more data:

| Model | 20% Data AUC | 50% Data AUC | 100% Data AUC | Δ (100% − 20%) |
|---|---|---|---|---|
| Model 1 | ? | ? | ? | ? |
| Model 2 | ? | ? | ? | ? |
| Model 3 | ? | ? | ? | ? |
| Model 4 | ? | ? | ? | ? |

Fixed conditions:

  • Resolution: Best from Phase 2B
  • Facecrop: Best from Phase 2C/2E
  • Augmentation: Best from Phase 2D/2E
  • 5-fold stratified cross-validation (balanced class distribution)
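
The 20%/50%/100% subsets can be drawn with a stratified subsample so class balance is preserved at every scale. A sketch (the helper name is an assumption):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def stratified_subsample(indices: np.ndarray, labels: np.ndarray,
                         frac: float, seed: int = 42) -> np.ndarray:
    # labels must align elementwise with indices; frac=1.0 returns everything
    if frac >= 1.0:
        return indices
    keep, _ = train_test_split(indices, train_size=frac,
                               stratify=labels, random_state=seed)
    return keep
```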

4B: Comprehensive Evaluation on Full Dataset

  • Train best models on full dataset (100%)
  • Detailed per-source metrics (text2img, inpainting, insight)
  • Grad-CAM visualizations for explainability
  • Hard example analysis (false positives/negatives)
  • Confidence distribution analysis
  • Cross-validation results (mean ± std)

This gives you: Final, comprehensive evaluation of the best models with full explainability
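
For the Grad-CAM visualizations, a minimal sketch assuming the pytorch-grad-cam package and a ResNet-style backbone (the target layer differs per architecture, and the fake-class index is an assumption):

```python
import torch
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

def gradcam_heatmaps(model: torch.nn.Module, images: torch.Tensor):
    # For ResNets the last conv block is the usual target; EfficientNet and
    # ConvNeXt need their own target layers (model.layer4 is ResNet-specific).
    cam = GradCAM(model=model, target_layers=[model.layer4[-1]])
    # Explain the "fake" class (assumed to be index 1)
    targets = [ClassifierOutputTarget(1)] * len(images)
    return cam(input_tensor=images, targets=targets)
```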


Notebooks and Analysis

Goal: Use Jupyter notebooks for comprehensive analysis and validation of each phase

01_eda.ipynb - Exploratory Data Analysis

  • Dataset overview (real vs fake distribution, sources)
  • Image resolution/aspect ratio analysis (identify potential shortcuts)
  • Color distribution analysis (identify potential shortcuts)
  • Sample visualization from each source (text2img, inpainting, insight, wiki)
  • Statistical summary of the dataset
  • Data quality checks

02_preprocessing.ipynb - Preprocessing Pipeline

  • Square crop and resize implementation demonstration
  • Face crop (MTCNN) demonstration and effectiveness analysis
  • Augmentation pipeline visualization (before/after examples)
  • Z-score normalization comparison (ImageNet vs real-image-only)
  • Data split verification (group-aware by basename, no overlap)
  • Preprocessing impact visualization

03_phase1_analysis.ipynb - Phase 1: Architecture Baseline

  • SimpleCNN vs ResNet18 comparison
  • 5-fold stratified CV results (mean ± std with confidence intervals)
  • Per-source metrics for each model (text2img, inpainting, insight)
  • Train/val/test performance curves across epochs
  • Confusion matrices for each model
  • Statistical significance testing between models (see the sketch after this list)
  • Grad-CAM visualizations for both models (10-20 images each)
  • Conclusions: Which baseline is better and why
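
Because both models are evaluated on identical folds, fold-level AUCs can be compared with a paired test. A minimal sketch (with only 5 folds, treat p-values as indicative rather than conclusive):

```python
import numpy as np
from scipy import stats

def compare_fold_aucs(aucs_a: np.ndarray, aucs_b: np.ndarray):
    # Paired t-test over per-fold AUCs (same 5 folds for both models)
    t_stat, p_value = stats.ttest_rel(aucs_a, aucs_b)
    return t_stat, p_value
```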

04_phase2_analysis.ipynb - Phase 2: Preprocessing Impact

  • 2A: Shortcut analysis (geometry/letterboxing, color distribution, source holdout)
  • 2B: Resolution impact (128×128 vs 224×224)
  • 2C: Facecrop impact
  • 2D: Augmentation impact (without facecrop)
  • 2E: Augmentation + facecrop combined

For each experiment:

  • 5-fold CV results (mean ± std with confidence intervals)
  • Per-source metrics (text2img, inpainting, insight)
  • Statistical significance testing vs baseline
  • Comparison tables across all Phase 2 experiments
  • Grad-CAM visualizations (10-20 images per condition)
  • Analysis of train/val gap changes
  • Analysis of per-source AUC variance changes

Overall Phase 2 conclusions:

  • Which preprocessing choices work best and why
  • Are shortcuts being learned (resolution, color distribution)?
  • Does augmentation remove shortcuts or over-regularize?
  • Recommendations for Phase 3 (best preprocessing settings)

05_phase3_analysis.ipynb - Phase 3: Extended Architecture Exploration

  • ResNet34, ResNet50, EfficientNet-B0, ConvNeXt-Tiny, MobileNetV3-Small
  • 5-fold CV results (mean ± std) for each model
  • Per-source metrics for each model
  • Comparison with Phase 1 baselines (ResNet18, SimpleCNN)
  • Statistical significance testing vs baselines
  • Grad-CAM visualizations for top models (10-20 images each)
  • Parameter count vs performance analysis
  • Conclusions: Which architectures work best and why

06_phase4_analysis.ipynb - Phase 4: Final Analysis

  • 4A: Data quantity scaling (20%, 50%, 100%) on top 3-4 models
  • 4B: Comprehensive evaluation on full dataset
  • Detailed per-source metrics for final models
  • Grad-CAM visualizations for final models (10-20 images each)
  • Hard example analysis (false positives/negatives) with visualizations
  • Confidence distribution analysis (histograms)
  • Cross-validation results (mean ± std with confidence intervals)
  • Final model comparison and selection
  • Conclusions and recommendations

07_gradcam_deep_dive.ipynb - Grad-CAM Deep Dive (optional)

  • Comprehensive Grad-CAM analysis across all phases and models
  • Feature visualization for different model architectures (CNN vs EfficientNet vs ConvNeXt)
  • Comparison of what different models focus on (face regions, backgrounds, artifacts)
  • Evidence of shortcut removal (or lack thereof) across phases
  • Cross-condition analysis: does model attention change with different preprocessing?
  • Visual explanations suitable for presentation

Notebook requirements:

  • Each notebook should be self-contained and reproducible
  • Include statistical analysis with confidence intervals
  • Generate publication-ready visualizations
  • Address all experimental questions and hypotheses
  • Provide clear conclusions for each phase
  • Use consistent formatting and style across all notebooks
  • Save all results (metrics, figures, tables) for easy reference

Key Improvements

1. Stratified Cross-Validation Implementation

```python
# Use sklearn's StratifiedKFold to ensure balanced class distribution across folds
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_metrics = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # Train on train_idx, validate on val_idx; train_one_fold/evaluate are
    # placeholder names for the project's training and evaluation entry points
    model = train_one_fold(X[train_idx], y[train_idx])
    fold_metrics.append(evaluate(model, X[val_idx], y[val_idx]))
# Report mean ± std of fold_metrics across the 5 folds
```
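
If the group-aware-by-basename splits (see 02_preprocessing.ipynb) must also hold inside the CV folds, sklearn's StratifiedGroupKFold is the drop-in variant (assuming a basenames array aligned with X and y):

```python
from sklearn.model_selection import StratifiedGroupKFold

# Images sharing a basename never straddle train and validation folds
sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(sgkf.split(X, y, groups=basenames)):
    pass  # same per-fold training/evaluation as above
```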

2. Augmentation Shortcut Removal Analysis (Phase 2D/2E)

Track these metrics with/without augmentation:

| Metric | Without Aug | With Aug | Interpretation |
|---|---|---|---|
| Train AUC | 0.99 | 0.95 | ↓ Expected |
| Val AUC | 0.90 | 0.89 | ↓ Slight |
| Train/Val Gap | 0.09 | 0.06 | ↓ Good! |
| text2img AUC | 0.98 | 0.96 | ↓ Slight |
| InsightFace AUC | 0.82 | 0.85 | ↑ Good! |
| AUC Variance | 0.08 | 0.06 | ↓ Good! |

Interpretation: If train/val gap ↓ AND per-source AUC variance ↓, augmentation is removing shortcuts.
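
Both diagnostics can be computed from validation predictions. A sketch (array names are assumptions; per-source AUC is computed as real images vs fakes from one source):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def shortcut_diagnostics(y_true, y_score, sources, train_auc):
    # y_true: 1 = fake, 0 = real; sources: generator name per fake image
    real = y_true == 0
    per_source = {}
    for s in np.unique(sources[y_true == 1]):
        mask = real | ((y_true == 1) & (sources == s))
        per_source[s] = roc_auc_score(y_true[mask], y_score[mask])
    return {
        "train_val_gap": train_auc - roc_auc_score(y_true, y_score),
        "per_source_auc": per_source,
        "per_source_spread": float(np.std(list(per_source.values()))),
    }
```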

3. Consistent Hyperparameters

  • Same lr for all models (1e-4 is safe for pretrained, may need adjustment for SimpleCNN)
  • Same epochs, ES patience, batch size
  • Only vary the architecture being tested

4. Test Set Integrity and Reproducibility

Test set from original source:

  • Verify that test set uses original images with minimal preprocessing
  • Test set should use get_transforms(train=False, ...) to disable augmentation
  • Ensure test images are not preprocessed in a way that could affect model comparisons

Reproducible splits across models:

  • The code already uses cfg.get("seed", 42) for reproducible splits
  • All experiments should use the same seed (42) to ensure identical train/val/test splits
  • This ensures fair comparison between models

Central config for shared parameters:

  • Create a central config file (classifier/configs/shared.json) with parameters common across all phases
  • This includes: seed, val_ratio, test_ratio, batch_size, optimizer settings, etc.
  • Individual experiment configs can override these defaults

Example shared config:

```json
{
  "seed": 42,
  "val_ratio": 0.1,
  "test_ratio": 0.1,
  "batch_size": 32,
  "optimizer": {
    "type": "adamw",
    "lr": 1e-4,
    "weight_decay": 1e-4
  },
  "scheduler": {
    "type": "cosine_annealing",
    "T_max": 15
  },
  "early_stopping_patience": 5,
  "num_workers": 4
}
```
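
Per-experiment configs can then be merged over these defaults. A minimal sketch (the loader name and shallow merge are assumptions; nested keys such as "optimizer" would need a recursive merge if partially overridden):

```python
import json
from pathlib import Path

def load_config(experiment_path: str,
                shared_path: str = "classifier/configs/shared.json") -> dict:
    # Shared defaults first, then experiment-specific overrides on top
    config = json.loads(Path(shared_path).read_text())
    config.update(json.loads(Path(experiment_path).read_text()))
    return config
```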

Summary Table for Report

| Phase | Variable Tested | Models | Data | Resolution | Facecrop | Augment | CV |
|---|---|---|---|---|---|---|---|
| 1 | Architecture Baseline | SimpleCNN, ResNet18 | 20% | 128 | No | No | 5-fold stratified |
| 2A | Shortcut Analysis | ResNet18 | 20% | 224 | No | No | 5-fold stratified |
| 2A-Holdout | Source Holdout | ResNet18 | 20% | 224 | No | No | 5-fold stratified |
| 2B | Resolution | SimpleCNN, ResNet18 | 20% | 128/224 | No | No | 5-fold stratified |
| 2C | Facecrop | SimpleCNN, ResNet18 | 20% | 224 | ± | No | 5-fold stratified |
| 2D | Augmentation (no facecrop) | SimpleCNN, ResNet18 | 20% | 224 | No | ± | 5-fold stratified |
| 2E | Augmentation + Facecrop | SimpleCNN, ResNet18 | 20% | 224 | Yes | ± | 5-fold stratified |
| 3 | Extended Architectures | ResNet34, ResNet50, EffNet-B0, ConvNeXt-Tiny, MobileNetV3-Small | 20% | Best | Best | Best | 5-fold stratified |
| 4A | Data Quantity | Top 3-4 models | 20/50/100% | Best | Best | Best | 5-fold stratified |
| 4B | Final Evaluation | Top 3-4 models | 100% | Best | Best | Best | 5-fold stratified |

This structure gives you:

  • Identical comparison conditions across all phases
  • 5-fold stratified cross-validation with confidence intervals (ensures balanced class distribution)
  • Same 2 baseline models (SimpleCNN, ResNet18) tested across all preprocessing variations (Phase 2)
  • Shortcut analysis to verify no bias (Phase 2A)
  • Experimental questions about augmentation impact (Phase 2D/2E)
  • Shortcut removal analysis via train/val gap and per-source AUC metrics
  • Facecrop tested on baseline models (Phase 2C)
  • Extended architecture exploration with proven models (Phase 3)
  • Final comprehensive analysis on best models (Phase 4)
  • Data quantity scaling on multiple best models (Phase 4A)
  • Clear, isolated variables per phase
  • Explainable progression for report

Key Experimental Questions in Phase 2:

  • 2A (Shortcut Analysis): Is the model learning any shortcuts (e.g., resolution differences or aspect ratios)?
  • 2D (Augmentation without facecrop): Does augmentation improve or hurt performance?
  • 2E (Augmentation with facecrop): Does augmentation improve or hurt performance compared to facecrop alone?