# Classifier Reorganization Plan (v2)

## Analysis of Current Phasing Issues

Your current phasing has several problems that make it difficult to present a rigorous, explainable report:

### Current Problems

1. **Inconsistent comparison conditions**:
   - SimpleCNN uses lr=1e-3, ResNet18 uses lr=1e-4
   - SimpleCNN trains 20 epochs (no early stopping), ResNet18 trains 15 epochs (with early stopping)
   - This makes direct comparisons invalid
2. **No cross-validation**:
   - Only a single 80/10/10 split
   - Results may be split-dependent
   - No confidence intervals on metrics
3. **Augmentation testing is incomplete**:
   - Only tested on ResNet18 (Phase 3), not across architectures
   - A performance drop could mean either (a) shortcut removal (good) or (b) over-regularization (bad)
   - No way to distinguish these cases
4. **Facecrop impact not generalized**:
   - Only ResNet18 was tested with facecrop
   - We don't know whether EfficientNet or ViT benefit similarly
5. **Full dataset only on one model**:
   - Only ResNet18 was tested on the full dataset
   - We don't know whether data quantity helps all models equally
6. **Test set integrity**:
   - Need to verify the test set uses original images (no augmentation, and no preprocessing beyond the minimum strictly necessary)
   - Need to ensure the same train/val/test splits across all model comparisons
   - Need a central config for the parameters shared across phases

---

## Recommended Reorganization

I suggest reorganizing into **4 phases** with clear, isolated variables. All phases use **5-fold stratified cross-validation** as standard practice to ensure a balanced class distribution across folds.

### Phase 1: Controlled Baseline Comparison

**Goal**: Compare simple architectures under identical conditions to establish baselines

**Fixed conditions for ALL models**:
- Data: 20% subsample
- Resolution: 128×128
- No face crop
- No augmentation
- Optimizer: AdamW (lr=1e-4, weight_decay=1e-4)
- Scheduler: CosineAnnealingLR (T_max=15)
- Epochs: 15 with early stopping (patience=5)
- Batch size: 32
- 5-fold stratified cross-validation (report mean ± std)

| Model | Params | Expected AUC (mean ± std) |
|-------|--------|---------------------------|
| SimpleCNN | ~400k | ? |
| ResNet18 | ~11.7M | ? |

**This gives you**: a clean, comparable baseline for simple architectures, with confidence intervals

**These same 2 models will be used in Phase 2 for the preprocessing experiments.**

---

### Phase 2: Preprocessing Impact (Same 2 Models from Phase 1)

**Goal**: Test each preprocessing change on the SAME 2 models from Phase 1

**Experimental questions**:
- Does higher resolution improve performance?
- Does face cropping improve performance?
- Does augmentation improve or hurt performance?
- Does augmentation interact with face cropping?
- Is the model learning any shortcuts (e.g., resolution differences, aspect ratios)?

#### 2A: Shortcut Analysis

**Goal**: Establish whether the baseline model exploits geometry, colour, or source-specific shortcuts before drawing any conclusions from the preprocessing experiments.

**Test 1: Resolution/Ratio Shortcuts (Letterboxing)**
- Train on original images (real = rectangular, fake = square); evaluate the same checkpoint under the standard crop vs letterbox-padded real images to confirm whether geometry is a discriminative cue (a sketch of the letterbox transform follows the table below)
- Models: **ResNet18**
- Data: 20% subsample
- 5-fold stratified CV (balanced class distribution)
- Resolution: 224×224
- No facecrop, no augmentation

| Experiment | AUC | Train/Val Gap | Per-Source AUC Variance |
|------------|-----|---------------|-------------------------|
| Original images (standard eval) | ? | ? | ? |
| Matched geometry (letterboxed real images) | ? | ? | ? |
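A minimal sketch of the letterbox transform, assuming PIL images in the loading pipeline (the function name and fill colour are illustrative, not existing code):

```python
# Hypothetical letterbox transform for Test 1: pad a rectangular image to a
# square canvas instead of cropping it, so real (rectangular) and fake
# (square) images share the same geometry before the usual resize.
from PIL import Image, ImageOps


def letterbox_to_square(img: Image.Image, fill=(0, 0, 0)) -> Image.Image:
    """Pad the shorter side with a constant border so the output is square."""
    w, h = img.size
    side = max(w, h)
    pad_w, pad_h = side - w, side - h
    # Split the padding evenly between left/right (or top/bottom).
    border = (pad_w // 2, pad_h // 2, pad_w - pad_w // 2, pad_h - pad_h // 2)
    return ImageOps.expand(img, border=border, fill=fill)
```

After padding, the image goes through the same 224×224 resize as the standard pipeline, so the only cue that changes between the two evaluation conditions is the aspect-ratio geometry.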
**Test 2: Color Distribution Shortcuts**
- Compare: train with ImageNet normalization stats vs real-image-only normalization stats
- Models: **ResNet18**
- Data: 20% subsample
- 5-fold stratified CV (balanced class distribution)
- Resolution: 224×224
- No facecrop, no augmentation
- ImageNet stats: mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)
- Real-image stats: calculate the mean/std from real training images only, then apply them to all images (a sketch follows the table below)

| Experiment | AUC | Train/Val Gap | Per-Source AUC Variance |
|------------|-----|---------------|-------------------------|
| ImageNet normalization | ? | ? | ? |
| Real-image-only normalization | ? | ? | ? |
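A minimal sketch of the real-image-only statistics computation, assuming a PyTorch dataset of real training images that yields `(C, H, W)` tensors in `[0, 1]` (the dataset and function names are placeholders, not existing code):

```python
# Compute per-channel mean/std from the real training images only; the
# resulting stats are then used to normalize all images (real and fake).
import torch
from torch.utils.data import DataLoader


def channel_stats(real_train_dataset, batch_size=64):
    """Per-channel mean/std over a dataset of (C, H, W) tensors in [0, 1]."""
    loader = DataLoader(real_train_dataset, batch_size=batch_size)
    n_pixels = 0
    channel_sum = torch.zeros(3)
    channel_sq_sum = torch.zeros(3)
    for images, _ in loader:  # images: (B, C, H, W)
        n_pixels += images.shape[0] * images.shape[2] * images.shape[3]
        channel_sum += images.sum(dim=(0, 2, 3))
        channel_sq_sum += (images ** 2).sum(dim=(0, 2, 3))
    mean = channel_sum / n_pixels
    std = (channel_sq_sum / n_pixels - mean ** 2).sqrt()
    return mean, std
```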
**Test 3: Source-Specific Feature Learning (Source Holdout)**
- Compare: train on all sources vs train with one fake source held out
- Models: **ResNet18**
- Data: 20% subsample
- 5-fold stratified CV (balanced class distribution)
- Resolution: 224×224
- No facecrop, no augmentation
- Hold out each fake source (text2img, inpainting, insight) separately

| Experiment | Held-out Source | Train Sources | Held-out AUC | In-Source AUC | Δ (In-Source - Held-out) |
|------------|-----------------|---------------|--------------|---------------|--------------------------|
| Baseline | None | All | - | ? | - |
| Holdout text2img | text2img | wiki, inpainting, insight | ? | ? | ? |
| Holdout inpainting | inpainting | wiki, text2img, insight | ? | ? | ? |
| Holdout insight | insight | wiki, text2img, inpainting | ? | ? | ? |

**Interpretation**: If the held-out source AUC is significantly lower than the in-source AUC (Δ > 0.05-0.10), the model is learning source-specific features. If the AUC drop under matched geometry (Test 1) is significant, the model exploits aspect ratio as a shortcut; this must be established before interpreting the resolution or facecrop results.

#### 2B: Resolution Impact (no facecrop, no augmentation)
- Test: 128×128 vs 224×224
- Models: **SimpleCNN, ResNet18**
- Data: 20% subsample
- 5-fold stratified CV (balanced class distribution)

| Model | 128×128 AUC | 224×224 AUC | Δ |
|-------|-------------|-------------|---|
| SimpleCNN | ? | ? | ? |
| ResNet18 | ? | ? | ? |

#### 2C: Facecrop Impact (224×224, no augmentation)
- Test: no facecrop vs MTCNN facecrop
- Models: **SimpleCNN, ResNet18**
- Data: 20% subsample
- 5-fold stratified CV (balanced class distribution)

| Model | No Facecrop AUC | Facecrop AUC | Δ |
|-------|-----------------|--------------|---|
| SimpleCNN | ? | ? | ? |
| ResNet18 | ? | ? | ? |

#### 2D: Augmentation Impact (224×224, no facecrop)
- Test: no augmentation vs augmentation
- Models: **SimpleCNN, ResNet18**
- Data: 20% subsample
- 5-fold stratified CV (balanced class distribution)
- **Verify the test set has no augmentation** (code inspection of `get_transforms(train=False, ...)`; an illustrative sketch of the intended behaviour closes this phase)
- **Analyze shortcut removal**: compare train/val gaps and per-source AUC balance

| Model | No Aug AUC | With Aug AUC | Δ | Train/Val Gap (No Aug) | Train/Val Gap (With Aug) |
|-------|------------|--------------|---|------------------------|--------------------------|
| SimpleCNN | ? | ? | ? | ? | ? |
| ResNet18 | ? | ? | ? | ? | ? |

**Experimental question**: Does augmentation without facecrop improve or hurt performance?

#### 2E: Augmentation + Facecrop Combined (224×224)
- Test: facecrop only vs facecrop + augmentation
- Models: **SimpleCNN, ResNet18**
- Data: 20% subsample
- 5-fold stratified CV (balanced class distribution)
- **Analyze shortcut removal**: compare train/val gaps and per-source AUC balance

| Model | Facecrop Only AUC | Facecrop + Aug AUC | Δ | Train/Val Gap (Only) | Train/Val Gap (With Aug) |
|-------|-------------------|--------------------|---|----------------------|--------------------------|
| SimpleCNN | ? | ? | ? | ? | ? |
| ResNet18 | ? | ? | ? | ? | ? |

**Experimental question**: Does augmentation with facecrop improve or hurt performance compared to facecrop alone?

**This gives you**:
- The isolated impact of each preprocessing choice on SimpleCNN and ResNet18
- Verification that the model is not learning shortcuts
- An understanding of how augmentation interacts with face cropping
- Shortcut removal analysis via train/val gap and per-source AUC metrics
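Since 2D and 2E hinge on augmentation being applied only at training time, here is a minimal sketch of the behaviour expected from the `get_transforms(train=...)` split, assuming torchvision. It illustrates the property under test (no stochastic ops when `train=False`); the project's actual implementation and its exact augmentations may differ:

```python
# Illustrative train/eval transform split: stochastic augmentations only on
# the training side, deterministic resize + normalize on the eval/test side.
from torchvision import transforms

IMAGENET_MEAN, IMAGENET_STD = (0.485, 0.456, 0.406), (0.229, 0.224, 0.225)


def get_transforms(train: bool, size: int = 224):
    if train:
        return transforms.Compose([
            transforms.RandomResizedCrop(size, scale=(0.8, 1.0)),
            transforms.RandomHorizontalFlip(),
            transforms.ColorJitter(brightness=0.2, contrast=0.2),
            transforms.ToTensor(),
            transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
        ])
    # Eval/test path: fully deterministic, so the test set stays unaugmented
    # (this is exactly what the 2D integrity check inspects).
    return transforms.Compose([
        transforms.Resize(size),
        transforms.CenterCrop(size),
        transforms.ToTensor(),
        transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
    ])
```

The 2D integrity check then amounts to confirming that the `train=False` branch of the real implementation contains no stochastic transforms.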
---

### Phase 3: Extended Architecture Exploration

**Goal**: Test additional architectures to find the best-performing models

**Fixed conditions** (based on the best findings from Phase 2):
- Data: 20% subsample
- Resolution: best from Phase 2B (likely 224×224)
- Facecrop: best from Phase 2C/2E (likely yes)
- Augmentation: best from Phase 2D/2E (depends on the experimental results)
- Optimizer: AdamW (lr=1e-4, weight_decay=1e-4)
- Scheduler: CosineAnnealingLR (T_max=15)
- Epochs: 15 with early stopping (patience=5)
- Batch size: 32
- 5-fold stratified cross-validation (balanced class distribution)

| Model | Params | Rationale |
|-------|--------|-----------|
| ResNet34 | ~21.8M | Deeper ResNet: test whether more capacity helps |
| ResNet50 | ~25.6M | Even deeper, with bottleneck blocks |
| EfficientNet-B0 | ~4.0M | Efficient compound scaling |
| ConvNeXt-Tiny | ~29M | Modern CNN, different architecture family |
| MobileNetV3-Small | ~2.5M | Lightweight efficiency comparison |

**This gives you**: an extended architecture exploration to identify the top-performing models for Phase 4
- ResNet depth progression (18 -> 34 -> 50)
- Efficient architectures (EfficientNet-B0, MobileNetV3-Small)
- A modern CNN with a different inductive bias (ConvNeXt-Tiny)
- A size range from 2.5M to 29M parameters

---

### Phase 4: Final Analysis on Best Models

**Goal**: Comprehensive evaluation of the top-performing models from Phases 1-3

**Select the top 3-4 models** based on the Phase 1-3 results (e.g., ResNet18, ResNet34, EfficientNet-B0, ConvNeXt-Tiny)

#### 4A: Data Quantity Scaling

Test how each of the best models scales with more data:

| Model | 20% Data AUC | 50% Data AUC | 100% Data AUC | Δ (100% - 20%) |
|-------|--------------|--------------|---------------|----------------|
| Model 1 | ? | ? | ? | ? |
| Model 2 | ? | ? | ? | ? |
| Model 3 | ? | ? | ? | ? |
| Model 4 | ? | ? | ? | ? |

**Fixed conditions**:
- Resolution: best from Phase 2B
- Facecrop: best from Phase 2C/2E
- Augmentation: best from Phase 2D/2E
- 5-fold stratified cross-validation (balanced class distribution)

#### 4B: Comprehensive Evaluation on Full Dataset
- Train the best models on the **full dataset** (100%)
- Detailed per-source metrics (text2img, inpainting, insight)
- Grad-CAM visualizations for explainability
- Hard-example analysis (false positives/negatives)
- Confidence distribution analysis
- Cross-validation results (mean ± std)

**This gives you**: a final, comprehensive evaluation of the best models with full explainability

---

### Notebooks and Analysis

**Goal**: Use Jupyter notebooks for comprehensive analysis and validation of each phase

#### **01_eda.ipynb** - Exploratory Data Analysis
- Dataset overview (real vs fake distribution, sources)
- Image resolution/aspect-ratio analysis (identify potential shortcuts)
- Color distribution analysis (identify potential shortcuts)
- Sample visualization from each source (text2img, inpainting, insight, wiki)
- Statistical summary of the dataset
- Data quality checks

#### **02_preprocessing.ipynb** - Preprocessing Pipeline
- Square crop and resize implementation demonstration
- Face crop (MTCNN) demonstration and effectiveness analysis
- Augmentation pipeline visualization (before/after examples)
- Z-score normalization comparison (ImageNet vs real-image-only)
- Data split verification (group-aware by basename, no overlap)
- Preprocessing impact visualization

#### **03_phase1_analysis.ipynb** - Phase 1: Architecture Baseline
- SimpleCNN vs ResNet18 comparison
- 5-fold stratified CV results (mean ± std with confidence intervals)
- Per-source metrics for each model (text2img, inpainting, insight)
- Train/val/test performance curves across epochs
- Confusion matrices for each model
- Statistical significance testing between the models
- Grad-CAM visualizations for both models (10-20 images each)
- Conclusions: which baseline is better, and why

#### **04_phase2_analysis.ipynb** - Phase 2: Preprocessing Impact
- **2A**: Shortcut analysis (resolution/ratio, color distribution, source holdout)
- **2B**: Resolution impact (128×128 vs 224×224)
- **2C**: Facecrop impact
- **2D**: Augmentation impact (without facecrop)
- **2E**: Augmentation + facecrop combined

For each experiment:
- 5-fold CV results (mean ± std with confidence intervals)
- Per-source metrics (text2img, inpainting, insight)
- Statistical significance testing vs the baseline
- Comparison tables across all Phase 2 experiments
- Grad-CAM visualizations (10-20 images per condition)
- Analysis of train/val gap changes
- Analysis of per-source AUC variance changes (see the metric sketch after this section)

**Overall Phase 2 conclusions**:
- Which preprocessing choices work best, and why
- Are shortcuts being learned (resolution, color distribution)?
- Does augmentation remove shortcuts or over-regularize?
- Recommendations for Phase 3 (best preprocessing settings)
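Several of the tables above report per-source AUC and its variance; a minimal sketch of how those could be computed with scikit-learn, assuming label/score/source arrays collected at evaluation time (all names are illustrative, not existing code):

```python
# Per-source AUC: score each fake source against all real images, then take
# the variance across sources; lower variance means a more balanced detector.
import numpy as np
from sklearn.metrics import roc_auc_score


def per_source_metrics(y_true, y_score, sources):
    """Return {fake_source: AUC} plus the variance across those AUCs."""
    y_true, y_score, sources = map(np.asarray, (y_true, y_score, sources))
    real_mask = y_true == 0  # assumes label 0 = real, 1 = fake
    aucs = {}
    for src in np.unique(sources[~real_mask]):
        mask = real_mask | (sources == src)
        aucs[src] = roc_auc_score(y_true[mask], y_score[mask])
    return aucs, float(np.var(list(aucs.values())))
```

The train/val gap reported alongside it is simply the training AUC minus the validation AUC for the same fold.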
#### **05_phase3_analysis.ipynb** - Phase 3: Extended Architecture Exploration
- ResNet34, ResNet50, EfficientNet-B0, ConvNeXt-Tiny, MobileNetV3-Small
- 5-fold CV results (mean ± std) for each model
- Per-source metrics for each model
- Comparison with the Phase 1 baselines (ResNet18, SimpleCNN)
- Statistical significance testing vs the baselines
- Grad-CAM visualizations for the top models (10-20 images each)
- Parameter count vs performance analysis
- Conclusions: which architectures work best, and why

#### **06_phase4_analysis.ipynb** - Phase 4: Final Analysis
- **4A**: Data quantity scaling (20%, 50%, 100%) on the top 3-4 models
- **4B**: Comprehensive evaluation on the full dataset
- Detailed per-source metrics for the final models
- Grad-CAM visualizations for the final models (10-20 images each)
- Hard-example analysis (false positives/negatives) with visualizations
- Confidence distribution analysis (histograms)
- Cross-validation results (mean ± std with confidence intervals)
- Final model comparison and selection
- Conclusions and recommendations

#### **07_gradcam_deep_dive.ipynb** - Grad-CAM Deep Dive (optional)
- Comprehensive Grad-CAM analysis across all phases and models
- Feature visualization for the different architectures (CNN vs EfficientNet vs ConvNeXt)
- Comparison of what the different models focus on (face regions, backgrounds, artifacts)
- Evidence of shortcut removal (or the lack thereof) across phases
- Temporal analysis: does model attention change with different preprocessing?
- Visual explanations suitable for presentation

**Notebook requirements**:
- Each notebook should be self-contained and reproducible
- Include statistical analysis with confidence intervals
- Generate publication-ready visualizations
- Address all experimental questions and hypotheses
- Provide clear conclusions for each phase
- Use consistent formatting and style across all notebooks
- Save all results (metrics, figures, tables) for easy reference

---

## Key Improvements

### 1. Stratified Cross-Validation Implementation

```python
# Use sklearn's StratifiedKFold to ensure a balanced class distribution
# across folds; X holds the samples (e.g., image paths), y the binary labels.
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_metrics = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # Train on train_idx, validate on val_idx, and store the metrics per
    # fold so mean ± std can be reported (run_fold is a placeholder for
    # the per-fold training loop).
    fold_metrics.append(run_fold(X, y, train_idx, val_idx))
```

If the splits must also stay group-aware by basename (verified in `02_preprocessing.ipynb`), sklearn's `StratifiedGroupKFold` is the drop-in replacement.

### 2. Augmentation Shortcut Removal Analysis (Phase 2D/2E)

Track these metrics with and without augmentation (the values below are illustrative):

| Metric | Without Aug | With Aug | Interpretation |
|--------|-------------|----------|----------------|
| Train AUC | 0.99 | 0.95 | ↓ Expected |
| Val AUC | 0.90 | 0.89 | ↓ Slight |
| **Train/Val Gap** | **0.09** | **0.06** | **↓ Good!** |
| text2img AUC | 0.98 | 0.96 | ↓ Slight |
| InsightFace AUC | 0.82 | 0.85 | **↑ Good!** |
| **AUC Variance** | **0.08** | **0.06** | **↓ Good!** |

**Interpretation**: If the train/val gap ↓ AND the per-source AUC variance ↓, augmentation is removing shortcuts.

### 3. Consistent Hyperparameters
- Same learning rate for all models (1e-4 is safe for pretrained models; it may need adjustment for SimpleCNN trained from scratch)
- Same epochs, early-stopping patience, and batch size
- Only the architecture under test varies
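To make these shared settings hard to get wrong, every experiment can obtain its optimizer and scheduler from one factory. A minimal sketch assuming PyTorch and the fixed values from Phases 1 and 3 (the factory function itself is hypothetical):

```python
# Hypothetical factory pinning the shared hyperparameters (AdamW with
# lr=1e-4 and weight_decay=1e-4, CosineAnnealingLR with T_max=15) so that
# experiments only vary the model architecture passed in.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR


def make_optimizer_and_scheduler(model: torch.nn.Module, lr: float = 1e-4,
                                 weight_decay: float = 1e-4, t_max: int = 15):
    optimizer = AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    scheduler = CosineAnnealingLR(optimizer, T_max=t_max)
    return optimizer, scheduler
```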
### 4. Test Set Integrity and Reproducibility

**Test set from the original source**:
- Verify that the test set uses original images with minimal preprocessing
- The test set should use `get_transforms(train=False, ...)` to disable augmentation
- Ensure test images are not preprocessed in a way that could bias the model comparisons

**Reproducible splits across models**:
- The code already uses `cfg.get("seed", 42)` for reproducible splits
- All experiments should use the same seed (42) to guarantee identical train/val/test splits
- This ensures a fair comparison between models

**Central config for shared parameters**:
- Create a central config file (`classifier/configs/shared.json`) with the parameters common to all phases
- This includes: seed, val_ratio, test_ratio, batch_size, optimizer settings, etc.
- Individual experiment configs can override these defaults (a loader sketch closes this document)

Example shared config:

```json
{
  "seed": 42,
  "val_ratio": 0.1,
  "test_ratio": 0.1,
  "batch_size": 32,
  "optimizer": {
    "type": "adamw",
    "lr": 1e-4,
    "weight_decay": 1e-4
  },
  "scheduler": {
    "type": "cosine_annealing",
    "T_max": 15
  },
  "early_stopping_patience": 5,
  "num_workers": 4
}
```

---

## Summary Table for Report

| Phase | Variable Tested | Models | Data | Resolution | Facecrop | Augment | CV |
|-------|-----------------|--------|------|------------|----------|---------|----|
| 1 | Architecture Baseline | SimpleCNN, ResNet18 | 20% | 128 | No | No | 5-fold stratified |
| 2A | Shortcut Analysis | ResNet18 | 20% | 224 | No | No | 5-fold stratified |
| 2A-Holdout | Source Holdout | ResNet18 | 20% | 224 | No | No | 5-fold stratified |
| 2B | Resolution | SimpleCNN, ResNet18 | 20% | 128/224 | No | No | 5-fold stratified |
| 2C | Facecrop | SimpleCNN, ResNet18 | 20% | 224 | ± | No | 5-fold stratified |
| 2D | Augmentation (no facecrop) | SimpleCNN, ResNet18 | 20% | 224 | No | ± | 5-fold stratified |
| 2E | Augmentation + Facecrop | SimpleCNN, ResNet18 | 20% | 224 | Yes | ± | 5-fold stratified |
| 3 | Extended Architectures | ResNet34, ResNet50, EffNet-B0, ConvNeXt-Tiny, MobileNetV3-Small | 20% | Best | Best | Best | 5-fold stratified |
| 4A | Data Quantity | Top 3-4 models | 20/50/100% | Best | Best | Best | 5-fold stratified |
| 4B | Final Evaluation | Top 3-4 models | 100% | Best | Best | Best | 5-fold stratified |

This structure gives you:
- ✅ Identical comparison conditions across all phases
- ✅ 5-fold stratified cross-validation with confidence intervals (balanced class distribution)
- ✅ The same 2 baseline models (SimpleCNN, ResNet18) tested across all preprocessing variations (Phase 2)
- ✅ Shortcut analysis to verify no bias (Phase 2A)
- ✅ Experimental questions about augmentation impact (Phases 2D/2E)
- ✅ Shortcut removal analysis via train/val gap and per-source AUC metrics
- ✅ Facecrop tested on the baseline models (Phase 2C)
- ✅ Extended architecture exploration with the proven preprocessing settings (Phase 3)
- ✅ Final comprehensive analysis on the best models (Phase 4)
- ✅ Data quantity scaling on multiple best models (Phase 4A)
- ✅ Clear, isolated variables per phase
- ✅ An explainable progression for the report

**Key Experimental Questions in Phase 2**:
- **2A (Shortcut Analysis)**: Is the model learning any shortcuts (e.g., resolution differences, aspect ratios)?
- **2D (Augmentation without facecrop)**: Does augmentation improve or hurt performance?
- **2E (Augmentation with facecrop)**: Does augmentation improve or hurt performance compared to facecrop alone?
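As a closing sketch, one way the individual experiment configs could override the shared defaults in `classifier/configs/shared.json`; the `deep_merge` and `load_config` helpers below are assumptions for illustration, not the existing loader:

```python
# Hypothetical config loader: read shared.json, then recursively merge an
# experiment-specific config on top so that only the overridden keys change.
import json
from pathlib import Path


def deep_merge(base: dict, override: dict) -> dict:
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def load_config(experiment_path: str,
                shared_path: str = "classifier/configs/shared.json") -> dict:
    shared = json.loads(Path(shared_path).read_text())
    experiment = json.loads(Path(experiment_path).read_text())
    return deep_merge(shared, experiment)

# Example: an experiment config containing only {"optimizer": {"lr": 1e-3}}
# keeps every other shared default (seed, batch_size, scheduler, ...).
```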