Classifier Reorganization Plan (v2)
Analysis of Current Phasing Issues
Your current phasing has several problems that make it difficult to present a rigorous, explainable report:
Current Problems
- Inconsistent comparison conditions:
  - SimpleCNN uses lr=1e-3, ResNet18 uses lr=1e-4
  - SimpleCNN trains 20 epochs (no early stopping), ResNet18 trains 15 epochs (with early stopping)
  - This makes direct comparisons invalid
- No cross-validation:
  - Only a single 80/10/10 split
  - Results may be split-dependent
  - No confidence intervals on metrics
- Augmentation testing is incomplete:
  - Only tested on ResNet18 (Phase 3), not across architectures
  - A performance drop could mean either (a) shortcut removal (good) or (b) over-regularization (bad)
  - No way to distinguish these cases
- Facecrop impact not generalized:
  - Only ResNet18 was tested with facecrop
  - Unknown whether EfficientNet or ViT would benefit similarly
- Full dataset only on one model:
  - Only ResNet18 was tested on the full dataset
  - Unknown whether additional data helps all models equally
- Test set integrity:
  - Need to verify the test set uses original images (no augmentation; only minimal preprocessing if strictly necessary)
  - Need to ensure the same train/val/test splits across all model comparisons
  - Need a central config for shared parameters across phases
Recommended Reorganization
I suggest reorganizing into 4 phases with clear, isolated variables. All phases use 5-fold stratified cross-validation to ensure a balanced class distribution across folds.
Phase 1: Controlled Baseline Comparison
Goal: Compare simple architectures under identical conditions to establish baselines
Fixed conditions for ALL models:
- Data: 20% subsample
- Resolution: 128×128
- No face crop
- No augmentation
- Optimizer: AdamW (lr=1e-4, weight_decay=1e-4)
- Scheduler: CosineAnnealingLR (T_max=15)
- Epochs: 15 with early stopping (patience=5)
- Batch size: 32
- 5-fold stratified cross-validation (report mean ± std)
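The fixed conditions above map onto a training loop roughly as follows. This is a minimal sketch, assuming PyTorch; `train_one_epoch` and `evaluate` are hypothetical helpers, not functions from the codebase.
```python
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# Shared Phase 1 setup; `model` is either SimpleCNN or ResNet18.
optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
scheduler = CosineAnnealingLR(optimizer, T_max=15)

best_val_auc, bad_epochs, patience = 0.0, 0, 5
for epoch in range(15):
    train_one_epoch(model, optimizer)   # hypothetical helper
    val_auc = evaluate(model)           # hypothetical helper returning validation AUC
    scheduler.step()
    if val_auc > best_val_auc:
        best_val_auc, bad_epochs = val_auc, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:      # early stopping with patience=5
            break
```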
| Model | Params | Expected AUC (mean ± std) |
|---|---|---|
| SimpleCNN | ~400k | ? |
| ResNet18 | ~11.7M | ? |
This gives you: Clean, comparable baseline for simple architectures with confidence intervals
These same 2 models will be used in Phase 2 for preprocessing experiments.
Phase 2: Preprocessing Impact (Same 2 Models from Phase 1)
Goal: Test each preprocessing change on the SAME 2 models from Phase 1
Experimental questions:
- Does higher resolution improve performance?
- Does face cropping improve performance?
- Does augmentation improve or hurt performance?
- Does augmentation interact with face cropping?
- Is the model learning any shortcuts (e.g., resolution differences, aspect ratios, etc.)?
2A: Shortcut Analysis
Goal: Establish whether the baseline model exploits geometry, colour, or source-specific shortcuts before drawing any conclusions from preprocessing experiments.
Test 1: Resolution/Ratio Shortcuts (Letterboxing)
- Train on original images (real = rectangular, fake = square); evaluate the same checkpoint on standard-cropped vs letterbox-padded real images to determine whether geometry is a discriminative cue (see the sketch after the table below)
- Models: ResNet18
- Data: 20% subsample
- 5-fold stratified CV (balanced class distribution)
- Resolution: 224×224
- No facecrop, no augmentation
| Experiment | AUC | Train/Val Gap | Per-Source AUC Variance |
|---|---|---|---|
| Original images (standard eval) | ? | ? | ? |
| Matched geometry (letterboxed real images) | ? | ? | ? |
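A minimal sketch of the letterbox step used in the matched-geometry evaluation, assuming PIL images; it pads a rectangular real image onto a square canvas so both classes share the same geometry at evaluation time.
```python
from PIL import Image

def letterbox_to_square(img: Image.Image, fill=(0, 0, 0)) -> Image.Image:
    """Pad a rectangular image to a centered square canvas, preserving content."""
    w, h = img.size
    side = max(w, h)
    canvas = Image.new("RGB", (side, side), fill)
    canvas.paste(img, ((side - w) // 2, (side - h) // 2))
    return canvas

# Evaluation idea: run the same checkpoint on standard-cropped vs letterboxed
# real images and compare AUC; a large gap suggests an aspect-ratio shortcut.
```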
Test 2: Color Distribution Shortcuts
- Compare: Train with ImageNet normalization stats vs real-image-only normalization stats
- Models: ResNet18
- Data: 20% subsample
- 5-fold stratified CV (balanced class distribution)
- Resolution: 224×224
- No facecrop, no augmentation
- ImageNet stats: mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)
- Real-image stats: Calculate mean/std from real training images only, apply to all
| Experiment | AUC | Train/Val Gap | Per-Source AUC Variance |
|---|---|---|---|
| ImageNet normalization | ? | ? | ? |
| Real-image-only normalization | ? | ? | ? |
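A sketch of how the real-image-only statistics could be computed, assuming a hypothetical DataLoader `real_loader` over real training images that yields batches of unnormalized (B, 3, H, W) tensors in [0, 1].
```python
import torch

def channel_stats(loader):
    """Per-channel mean/std over a dataset of (B, 3, H, W) tensors in [0, 1]."""
    total, sq_total, n_pixels = torch.zeros(3), torch.zeros(3), 0
    for images, _ in loader:                       # labels ignored
        total += images.sum(dim=(0, 2, 3))
        sq_total += (images ** 2).sum(dim=(0, 2, 3))
        n_pixels += images.numel() // 3            # pixels per channel
    mean = total / n_pixels
    std = (sq_total / n_pixels - mean ** 2).sqrt()
    return mean, std

# real_mean, real_std = channel_stats(real_loader)  # then use in transforms.Normalize
```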
Test 3: Source-Specific Feature Learning (Source Holdout)
- Compare: Train on all sources vs train with one source held out
- Models: ResNet18
- Data: 20% subsample
- 5-fold stratified CV (balanced class distribution)
- Resolution: 224×224
- No facecrop, no augmentation
- Hold out each fake source (text2img, inpainting, insight) separately
| Experiment | Held-out Source | Train Sources | Held-out AUC | In-Source AUC | Δ (In-Source - Held-out) |
|---|---|---|---|---|---|
| Baseline | None | All | - | ? | - |
| Holdout text2img | text2img | wiki, inpainting, insight | ? | ? | ? |
| Holdout inpainting | inpainting | wiki, text2img, insight | ? | ? | ? |
| Holdout insight | insight | wiki, text2img, inpainting | ? | ? | ? |
Interpretation: If the held-out source AUC is significantly lower than the in-source AUC (Δ > 0.05-0.10), the model is learning source-specific features. Likewise, if AUC drops significantly under matched geometry (Test 1), the model exploits aspect ratio as a shortcut; both findings must be established before interpreting the resolution or facecrop results.
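A sketch of the source-holdout split, assuming the dataset metadata is a pandas DataFrame with `label` (0 = real, 1 = fake) and `source` columns; the column names are assumptions, not the actual schema.
```python
import pandas as pd

def holdout_split(df: pd.DataFrame, held_out: str):
    """Train on real images plus all fake sources except `held_out`;
    evaluate on real images plus fakes from the held-out source only.
    In practice, evaluate only on validation-fold real images to avoid overlap."""
    train_df = df[(df.label == 0) | (df.source != held_out)]
    heldout_df = df[(df.label == 0) | (df.source == held_out)]
    return train_df, heldout_df

# for src in ["text2img", "inpainting", "insight"]:
#     train_df, heldout_df = holdout_split(meta, src)
```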
2B: Resolution Impact (no facecrop, no augmentation)
- Test: 128×128 vs 224×224
- Models: SimpleCNN, ResNet18
- Data: 20% subsample
- 5-fold stratified CV (balanced class distribution)
| Model | 128×128 AUC | 224×224 AUC | Δ |
|---|---|---|---|
| SimpleCNN | ? | ? | ? |
| ResNet18 | ? | ? | ? |
2C: Facecrop Impact (224×224, no augmentation)
- Test: No facecrop vs MTCNN facecrop
- Models: SimpleCNN, ResNet18
- Data: 20% subsample
- 5-fold stratified CV (balanced class distribution)
| Model | No Facecrop AUC | Facecrop AUC | Δ |
|---|---|---|---|
| SimpleCNN | ? | ? | ? |
| ResNet18 | ? | ? | ? |
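A sketch of the MTCNN face crop, assuming the facenet-pytorch implementation; the margin handling and fallback behavior here are illustrative choices, not necessarily what the codebase does.
```python
from facenet_pytorch import MTCNN
from PIL import Image

mtcnn = MTCNN(select_largest=True, post_process=False)

def crop_face(img: Image.Image, margin: float = 0.2) -> Image.Image:
    """Crop the largest detected face with a relative margin; fall back to the full image."""
    boxes, _ = mtcnn.detect(img)
    if boxes is None:
        return img                         # no face found: keep the original image
    x1, y1, x2, y2 = boxes[0]
    mw, mh = margin * (x2 - x1), margin * (y2 - y1)
    box = (int(max(0, x1 - mw)), int(max(0, y1 - mh)),
           int(min(img.width, x2 + mw)), int(min(img.height, y2 + mh)))
    return img.crop(box)
```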
2D: Augmentation Impact (224×224, without facecrop)
- Test: No augmentation vs augmentation
- Models: SimpleCNN, ResNet18
- Data: 20% subsample
- 5-fold stratified CV (balanced class distribution)
- Verify the test set has no augmentation (code inspection of `get_transforms(train=False, ...)`)
- Analyze shortcut removal: compare train/val gaps and per-source AUC balance
| Model | No Aug AUC | With Aug AUC | Δ | Train/Val Gap (No Aug) | Train/Val Gap (With Aug) |
|---|---|---|---|---|---|
| SimpleCNN | ? | ? | ? | ? | ? |
| ResNet18 | ? | ? | ? | ? | ? |
Experimental question: Does augmentation without facecrop improve or hurt performance?
2E: Augmentation + Facecrop Combined (224×224)
- Test: Facecrop only vs Facecrop + augmentation
- Models: SimpleCNN, ResNet18
- Data: 20% subsample
- 5-fold stratified CV (balanced class distribution)
- Analyze shortcut removal: Compare train/val gaps and per-source AUC balance
| Model | Facecrop Only AUC | Facecrop + Aug AUC | Δ | Train/Val Gap (Only) | Train/Val Gap (With Aug) |
|---|---|---|---|---|---|
| SimpleCNN | ? | ? | ? | ? | ? |
| ResNet18 | ? | ? | ? | ? | ? |
Experimental question: Does augmentation with facecrop improve or hurt performance compared to facecrop alone?
This gives you:
- Isolated impact of each preprocessing choice on SimpleCNN and ResNet18
- Evidence on whether the model is relying on shortcuts
- Understanding of how augmentation interacts with face cropping
- Shortcut removal analysis through train/val gap and per-source AUC metrics
Phase 3: Extended Architecture Exploration
Goal: Test additional architectures to find the best performing models
Fixed conditions (based on best findings from Phase 2):
- Data: 20% subsample
- Resolution: Best from Phase 2B (likely 224×224)
- Facecrop: Best from Phase 2C/2E (likely Yes)
- Augmentation: Best from Phase 2D/2E (depends on experimental results)
- Optimizer: AdamW (lr=1e-4, weight_decay=1e-4)
- Scheduler: CosineAnnealingLR (T_max=15)
- Epochs: 15 with early stopping (patience=5)
- Batch size: 32
- 5-fold stratified cross-validation (balanced class distribution)
| Model | Params | Rationale |
|---|---|---|
| ResNet34 | ~21.8M | Deeper ResNet - test if more capacity helps |
| ResNet50 | ~25.6M | Even deeper with bottleneck blocks |
| EfficientNet-B0 | ~4.0M | Efficient compound scaling |
| ConvNeXt-Tiny | ~29M | Modern CNN, different architecture family |
| MobileNetV3-Small | ~2.5M | Lightweight efficiency comparison |
This gives you: Extended architecture exploration to identify top-performing models for Phase 4
- ResNet depth progression (18 -> 34 -> 50)
- Efficient architectures (EfficientNet-B0, MobileNetV3-Small)
- Modern CNN with different inductive bias (ConvNeXt-Tiny)
- Size range (2.5M to 29M parameters)
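A sketch of how the Phase 3 candidates could be instantiated with a single-logit binary head, assuming torchvision's ImageNet-pretrained weights; the actual model wrapper in the codebase may differ.
```python
import torch.nn as nn
from torchvision import models

def build_model(name: str) -> nn.Module:
    """Return an ImageNet-pretrained backbone with its classifier replaced
    by a single-logit head for real-vs-fake classification."""
    if name in ("resnet18", "resnet34", "resnet50"):
        m = getattr(models, name)(weights="IMAGENET1K_V1")
        m.fc = nn.Linear(m.fc.in_features, 1)
    elif name == "efficientnet_b0":
        m = models.efficientnet_b0(weights="IMAGENET1K_V1")
        m.classifier[1] = nn.Linear(m.classifier[1].in_features, 1)
    elif name == "convnext_tiny":
        m = models.convnext_tiny(weights="IMAGENET1K_V1")
        m.classifier[2] = nn.Linear(m.classifier[2].in_features, 1)
    elif name == "mobilenet_v3_small":
        m = models.mobilenet_v3_small(weights="IMAGENET1K_V1")
        m.classifier[3] = nn.Linear(m.classifier[3].in_features, 1)
    else:
        raise ValueError(f"Unknown model: {name}")
    return m
```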
Phase 4: Final Analysis on Best Models
Goal: Comprehensive evaluation of top-performing models from Phases 1-3
Select top 3-4 models based on Phase 1-3 results (e.g., ResNet18, ResNet34, EfficientNet-B0, ConvNeXt-Tiny)
4A: Data Quantity Scaling
Test how each best model scales with more data:
| Model | 20% Data AUC | 50% Data AUC | 100% Data AUC | Δ (100% - 20%) |
|---|---|---|---|---|
| Model 1 | ? | ? | ? | ? |
| Model 2 | ? | ? | ? | ? |
| Model 3 | ? | ? | ? | ? |
| Model 4 | ? | ? | ? | ? |
Fixed conditions:
- Resolution: Best from Phase 2B
- Facecrop: Best from Phase 2C/2E
- Augmentation: Best from Phase 2D/2E
- 5-fold stratified cross-validation (balanced class distribution)
4B: Comprehensive Evaluation on Full Dataset
- Train best models on full dataset (100%)
- Detailed per-source metrics (text2img, inpainting, insight)
- Grad-CAM visualizations for explainability
- Hard example analysis (false positives/negatives)
- Confidence distribution analysis
- Cross-validation results (mean ± std)
This gives you: Final, comprehensive evaluation of the best models with full explainability
Notebooks and Analysis
Goal: Use Jupyter notebooks for comprehensive analysis and validation of each phase
01_eda.ipynb - Exploratory Data Analysis
- Dataset overview (real vs fake distribution, sources)
- Image resolution/aspect ratio analysis (identify potential shortcuts)
- Color distribution analysis (identify potential shortcuts)
- Sample visualization from each source (text2img, inpainting, insight, wiki)
- Statistical summary of the dataset
- Data quality checks
02_preprocessing.ipynb - Preprocessing Pipeline
- Square crop and resize implementation demonstration
- Face crop (MTCNN) demonstration and effectiveness analysis
- Augmentation pipeline visualization (before/after examples)
- Z-score normalization comparison (ImageNet vs real-image-only)
- Data split verification (group-aware by basename, no overlap)
- Preprocessing impact visualization
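A sketch of the group-aware split verification, assuming each split is available as a list of file paths and that images derived from the same original share a basename; the dictionary layout is an assumption.
```python
import os

def assert_no_group_leakage(paths_by_split: dict) -> None:
    """Fail if any basename (image group) appears in more than one split."""
    groups = {split: {os.path.splitext(os.path.basename(p))[0] for p in paths}
              for split, paths in paths_by_split.items()}
    assert not groups["train"] & groups["val"], "train/val share basenames"
    assert not groups["train"] & groups["test"], "train/test share basenames"
    assert not groups["val"] & groups["test"], "val/test share basenames"
```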
03_phase1_analysis.ipynb - Phase 1: Architecture Baseline
- SimpleCNN vs ResNet18 comparison
- 5-fold stratified CV results (mean ± std with confidence intervals)
- Per-source metrics for each model (text2img, inpainting, insight)
- Train/val/test performance curves across epochs
- Confusion matrices for each model
- Statistical significance testing between models
- Grad-CAM visualizations for both models (10-20 images each)
- Conclusions: Which baseline is better and why
04_phase2_analysis.ipynb - Phase 2: Preprocessing Impact
- 2A: Shortcut analysis (letterboxing, color normalization, source holdout)
- 2B: Resolution impact (128×128 vs 224×224)
- 2C: Facecrop impact
- 2D: Augmentation impact (without facecrop)
- 2E: Augmentation + facecrop combined
For each experiment:
- 5-fold CV results (mean ± std with confidence intervals)
- Per-source metrics (text2img, inpainting, insight)
- Statistical significance testing vs baseline
- Comparison tables across all Phase 2 experiments
- Grad-CAM visualizations (10-20 images per condition)
- Analysis of train/val gap changes
- Analysis of per-source AUC variance changes
Overall Phase 2 conclusions:
- Which preprocessing choices work best and why
- Are shortcuts being learned (resolution, color distribution)?
- Does augmentation remove shortcuts or over-regularize?
- Recommendations for Phase 3 (best preprocessing settings)
05_phase3_analysis.ipynb - Phase 3: Extended Architecture Exploration
- ResNet34, ResNet50, EfficientNet-B0, ConvNeXt-Tiny, MobileNetV3-Small
- 5-fold CV results (mean ± std) for each model
- Per-source metrics for each model
- Comparison with Phase 1 baselines (ResNet18, SimpleCNN)
- Statistical significance testing vs baselines
- Grad-CAM visualizations for top models (10-20 images each)
- Parameter count vs performance analysis
- Conclusions: Which architectures work best and why
06_phase4_analysis.ipynb - Phase 4: Final Analysis
- 4A: Data quantity scaling (20%, 50%, 100%) on top 3-4 models
- 4B: Comprehensive evaluation on full dataset
- Detailed per-source metrics for final models
- Grad-CAM visualizations for final models (10-20 images each)
- Hard example analysis (false positives/negatives) with visualizations
- Confidence distribution analysis (histograms)
- Cross-validation results (mean ± std with confidence intervals)
- Final model comparison and selection
- Conclusions and recommendations
07_gradcam_deep_dive.ipynb - Grad-CAM Deep Dive (optional)
- Comprehensive Grad-CAM analysis across all phases and models
- Feature visualization for different model architectures (CNN vs EfficientNet vs ConvNeXt)
- Comparison of what different models focus on (face regions, backgrounds, artifacts)
- Evidence of shortcut removal (or lack thereof) across phases
- Temporal analysis: does model attention change with different preprocessing?
- Visual explanations suitable for presentation
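A minimal Grad-CAM sketch using PyTorch hooks, not the project's actual implementation. It assumes a single-logit binary head; for ResNet18, `model.layer4[-1]` is a reasonable target layer.
```python
import torch.nn.functional as F

def grad_cam(model, image, target_layer):
    """Weight the target layer's activations by the spatially pooled gradients
    of the output logit, then ReLU and normalize to [0, 1]."""
    acts, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))

    model.eval()
    logit = model(image.unsqueeze(0))        # (1, 1) for a single-logit head
    model.zero_grad()
    logit.sum().backward()
    h1.remove(); h2.remove()

    a, g = acts[0], grads[0]                 # both (1, C, H, W)
    weights = g.mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * a).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam.squeeze().detach().cpu()

# usage sketch: heatmap = grad_cam(model, img_tensor, model.layer4[-1])  # ResNet18
```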
Notebook requirements:
- Each notebook should be self-contained and reproducible
- Include statistical analysis with confidence intervals
- Generate publication-ready visualizations
- Address all experimental questions and hypotheses
- Provide clear conclusions for each phase
- Use consistent formatting and style across all notebooks
- Save all results (metrics, figures, tables) for easy reference
Key Improvements
1. Stratified Cross-Validation Implementation
# Use sklearn's StratifiedKFold to ensure balanced class distribution across folds
from sklearn.model_selection import StratifiedKFold

# X: sample identifiers (e.g., image paths), y: binary labels (0 = real, 1 = fake)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_metrics = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # Train on train_idx, validate on val_idx, store metrics per fold
    fold_metrics.append(train_and_evaluate(train_idx, val_idx))  # hypothetical helper
2. Augmentation Shortcut Removal Analysis (Phase 2D/2E)
Track these metrics with and without augmentation (the values below are illustrative):
| Metric | Without Aug | With Aug | Interpretation |
|---|---|---|---|
| Train AUC | 0.99 | 0.95 | ↓ Expected |
| Val AUC | 0.90 | 0.89 | ↓ Slight |
| Train/Val Gap | 0.09 | 0.06 | ↓ Good! |
| text2img AUC | 0.98 | 0.96 | ↓ Slight |
| InsightFace AUC | 0.82 | 0.85 | ↑ Good! |
| AUC Variance | 0.08 | 0.06 | ↓ Good! |
Interpretation: If train/val gap ↓ AND per-source AUC variance ↓, augmentation is removing shortcuts.
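A sketch of the per-source AUC computation behind this table, assuming labels 0 = real and 1 = fake and a parallel array of source names; each fake source is scored against the real images, and the spread across sources stands in for the "AUC Variance" column.
```python
import numpy as np
from sklearn.metrics import roc_auc_score

def per_source_auc(y_true, y_score, sources,
                   fake_sources=("text2img", "inpainting", "insight")):
    """AUC of real images vs each fake source, plus the spread across sources."""
    y_true, y_score, sources = map(np.asarray, (y_true, y_score, sources))
    real_mask = y_true == 0
    aucs = {}
    for src in fake_sources:
        mask = real_mask | (sources == src)      # real images + one fake source
        aucs[src] = roc_auc_score(y_true[mask], y_score[mask])
    return aucs, float(np.std(list(aucs.values())))
```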
3. Consistent Hyperparameters
- Same lr for all models (1e-4 is safe for pretrained backbones but may need adjustment for SimpleCNN)
- Same epochs, ES patience, batch size
- Only vary the architecture being tested
4. Test Set Integrity and Reproducibility
Test set from original source:
- Verify that test set uses original images with minimal preprocessing
- Test set should use `get_transforms(train=False, ...)` to disable augmentation
- Ensure test images are not preprocessed in a way that could affect model comparisons
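A quick sanity check along these lines could back up the code inspection. This is a sketch that assumes `get_transforms` returns a torchvision `Compose`; the real signature and return type may differ.
```python
# assuming get_transforms is importable from the project's transforms module
eval_tf = get_transforms(train=False)  # remaining arguments omitted; actual signature may differ
random_ops = [t for t in eval_tf.transforms if type(t).__name__.startswith("Random")]
assert not random_ops, f"Augmentation leaked into eval transforms: {random_ops}"
```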
Reproducible splits across models:
- The code already uses `cfg.get("seed", 42)` for reproducible splits
- All experiments should use the same seed (42) to ensure identical train/val/test splits
- This ensures fair comparison between models
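Beyond the config seed, a small helper along these lines (a sketch, not the existing code) can make the whole pipeline deterministic:
```python
import random

import numpy as np
import torch

def seed_everything(seed: int = 42) -> None:
    """Seed Python, NumPy, and PyTorch so splits and training runs are reproducible."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
```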
Central config for shared parameters:
- Create a central config file (`classifier/configs/shared.json`) with parameters common across all phases
- This includes: seed, val_ratio, test_ratio, batch_size, optimizer settings, etc.
- Individual experiment configs can override these defaults
Example shared config:
{
  "seed": 42,
  "val_ratio": 0.1,
  "test_ratio": 0.1,
  "batch_size": 32,
  "optimizer": {
    "type": "adamw",
    "lr": 1e-4,
    "weight_decay": 1e-4
  },
  "scheduler": {
    "type": "cosine_annealing",
    "T_max": 15
  },
  "early_stopping_patience": 5,
  "num_workers": 4
}
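An individual experiment config would then override only what it changes. The keys below (`model`, `img_size`, `data_fraction`, `face_crop`, `augmentation`) are illustrative assumptions and may not match the actual config schema:
```json
{
  "model": "resnet18",
  "img_size": 224,
  "data_fraction": 0.2,
  "face_crop": false,
  "augmentation": false
}
```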
Summary Table for Report
| Phase | Variable Tested | Models | Data | Resolution | Facecrop | Augment | CV |
|---|---|---|---|---|---|---|---|
| 1 | Architecture Baseline | SimpleCNN, ResNet18 | 20% | 128 | No | No | 5-fold stratified |
| 2A | Shortcut Analysis | ResNet18 | 20% | 224 | No | No | 5-fold stratified |
| 2A-Holdout | Source Holdout | ResNet18 | 20% | 224 | No | No | 5-fold stratified |
| 2B | Resolution | SimpleCNN, ResNet18 | 20% | 128/224 | No | No | 5-fold stratified |
| 2C | Facecrop | SimpleCNN, ResNet18 | 20% | 224 | ± | No | 5-fold stratified |
| 2D | Augmentation (no facecrop) | SimpleCNN, ResNet18 | 20% | 224 | No | ± | 5-fold stratified |
| 2E | Augmentation + Facecrop | SimpleCNN, ResNet18 | 20% | 224 | Yes | ± | 5-fold stratified |
| 3 | Extended Architectures | ResNet34, ResNet50, EffNet-B0, ConvNeXt-Tiny, MobileNetV3-Small | 20% | Best | Best | Best | 5-fold stratified |
| 4A | Data Quantity | Top 3-4 models | 20/50/100% | Best | Best | Best | 5-fold stratified |
| 4B | Final Evaluation | Top 3-4 models | 100% | Best | Best | Best | 5-fold stratified |
This structure gives you:
- ✅ Identical comparison conditions across all phases
- ✅ 5-fold stratified cross-validation with confidence intervals (ensures balanced class distribution)
- ✅ Same 2 baseline models (SimpleCNN, ResNet18) tested across all preprocessing variations (Phase 2)
- ✅ Shortcut analysis to check for bias (Phase 2A)
- ✅ Experimental questions about augmentation impact (Phase 2D/2E)
- ✅ Shortcut removal analysis via train/val gap and per-source AUC metrics
- ✅ Facecrop tested on baseline models (Phase 2C)
- ✅ Extended architecture exploration with proven models (Phase 3)
- ✅ Final comprehensive analysis on best models (Phase 4)
- ✅ Data quantity scaling on multiple best models (Phase 4A)
- ✅ Clear, isolated variables per phase
- ✅ Explainable progression for report
Key Experimental Questions in Phase 2:
- 2A (Shortcut Analysis): Is the model learning any shortcuts (e.g., resolution differences, aspect ratios, etc.)?
- 2D (Augmentation without facecrop): Does augmentation improve or hurt performance?
- 2E (Augmentation with facecrop): Does augmentation improve or hurt performance compared to facecrop alone?