# Classifier Reorganization Plan (v2)
## Analysis of Current Phasing Issues
Your current phasing has several problems that make it difficult to present a rigorous, explainable report:
### Current Problems
1. **Inconsistent comparison conditions**:
- SimpleCNN uses lr=1e-3, ResNet18 uses lr=1e-4
- SimpleCNN trains 20 epochs (no early stopping), ResNet18 trains 15 epochs (with early stopping)
- Makes direct comparisons invalid
2. **No cross-validation**:
- Only a single 80/10/10 split
- Results may be split-dependent
- No confidence intervals on metrics
3. **Augmentation testing is incomplete**:
- Only tested on ResNet18 (Phase 3), not across architectures
- Performance drop could mean: (a) removing shortcuts (good) or (b) over-regularization (bad)
- No way to distinguish these cases
4. **Facecrop impact not generalized**:
- Only ResNet18 tested with facecrop
- Don't know if EfficientNet or ViT benefit similarly
5. **Full dataset only on one model**:
- Only ResNet18 tested on full dataset
- Don't know if data quantity helps all models equally
6. **Test set integrity**:
- Need to verify the test set uses original images (no augmentation; only minimal preprocessing where strictly necessary)
- Need to ensure same train/val/test splits across all model comparisons
- Need central config for shared parameters across phases
---
## Recommended Reorganization
I suggest reorganizing into **4 phases** with clear, isolated variables. All phases use **5-fold stratified cross-validation** as standard practice to ensure balanced class distribution across folds.
### Phase 1: Controlled Baseline Comparison
**Goal**: Compare simple architectures under identical conditions to establish baselines
**Fixed conditions for ALL models**:
- Data: 20% subsample
- Resolution: 128×128
- No face crop
- No augmentation
- Optimizer: AdamW (lr=1e-4, weight_decay=1e-4)
- Scheduler: CosineAnnealingLR (T_max=15)
- Epochs: 15 with early stopping (patience=5)
- Batch size: 32
- 5-fold stratified cross-validation (report mean ± std)
| Model | Params | Expected AUC (mean ± std) |
|-------|--------|---------------------------|
| SimpleCNN | ~400k | ? |
| ResNet18 | ~11.7M | ? |
**This gives you**: Clean, comparable baseline for simple architectures with confidence intervals
**These same 2 models will be used in Phase 2 for preprocessing experiments.**
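As a concrete reference, here is a minimal PyTorch sketch of the fixed optimizer/scheduler/early-stopping setup above. `model`, `train_one_epoch`, and `evaluate_auc` are assumed project helpers, not existing code:

```python
import torch

# Fixed Phase 1 conditions: AdamW(1e-4, wd=1e-4), cosine schedule (T_max=15),
# 15 epochs max, early stopping with patience=5 on validation AUC.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=15)

best_auc, patience, bad_epochs = 0.0, 5, 0
for epoch in range(15):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()
    val_auc = evaluate_auc(model, val_loader)
    if val_auc > best_auc:
        best_auc, bad_epochs = val_auc, 0
        torch.save(model.state_dict(), "best.pt")
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # early stopping
            break
```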
---
### Phase 2: Preprocessing Impact (Same 2 Models from Phase 1)
**Goal**: Test each preprocessing change on the SAME 2 models from Phase 1
**Experimental questions**:
- Does higher resolution improve performance?
- Does face cropping improve performance?
- Does augmentation improve or hurt performance?
- Does augmentation interact with face cropping?
- Is the model learning shortcuts (e.g., resolution differences, aspect ratios, source-specific features)?
#### 2A: Shortcut Analysis
**Goal**: Establish whether the baseline model exploits geometry, colour, or source-specific shortcuts before drawing any conclusions from preprocessing experiments.
**Test 1: Resolution/Ratio Shortcuts (Letterboxing)**
- Train on original images (real=rectangular, fake=square); evaluate the same checkpoint under a standard crop vs letterbox-padded real images to confirm whether geometry is a discriminative cue
- Models: **ResNet18**
- Data: 20% subsample
- 5-fold stratified CV (balanced class distribution)
- Resolution: 224×224
- No facecrop, no augmentation
| Experiment | AUC | Train/Val Gap | Per-Source AUC Variance |
|------------|-----|---------------|-------------------------|
| Original images (standard eval) | ? | ? | ? |
| Matched geometry (letterboxed real images) | ? | ? | ? |
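One way to produce the letterbox-padded variant for Test 1 is sketched below using torchvision's functional API; the padding color and exact target size are assumptions:

```python
import torchvision.transforms.functional as TF
from PIL import Image

def letterbox(img: Image.Image, size: int = 224, fill: int = 0) -> Image.Image:
    """Resize preserving aspect ratio, then pad to a square canvas so real
    (rectangular) images match the fake (square) geometry without cropping."""
    w, h = img.size
    scale = size / max(w, h)
    new_w, new_h = round(w * scale), round(h * scale)
    img = TF.resize(img, [new_h, new_w])
    pad_l = (size - new_w) // 2
    pad_t = (size - new_h) // 2
    # padding order for TF.pad is [left, top, right, bottom]
    return TF.pad(img, [pad_l, pad_t, size - new_w - pad_l, size - new_h - pad_t], fill=fill)
```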
**Test 2: Color Distribution Shortcuts**
- Compare: Train with ImageNet normalization stats vs real-image-only normalization stats
- Models: **ResNet18**
- Data: 20% subsample
- 5-fold stratified CV (balanced class distribution)
- Resolution: 224×224
- No facecrop, no augmentation
- ImageNet stats: mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)
- Real-image stats: Calculate mean/std from real training images only, apply to all
| Experiment | AUC | Train/Val Gap | Per-Source AUC Variance |
|------------|-----|---------------|-------------------------|
| ImageNet normalization | ? | ? | ? |
| Real-image-only normalization | ? | ? | ? |
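Computing the real-image-only stats for Test 2 could look like the following sketch, assuming a dataset of real training images that yields `(image, label)` pairs with images as [0, 1] CHW tensors:

```python
import torch
from torch.utils.data import DataLoader

def channel_stats(real_train_dataset, batch_size=64):
    """Per-channel mean/std over all pixels of the real training images."""
    loader = DataLoader(real_train_dataset, batch_size=batch_size)
    n_pixels, ch_sum, ch_sq = 0, torch.zeros(3), torch.zeros(3)
    for images, _ in loader:
        b, _, h, w = images.shape
        n_pixels += b * h * w
        ch_sum += images.sum(dim=(0, 2, 3))
        ch_sq += images.pow(2).sum(dim=(0, 2, 3))
    mean = ch_sum / n_pixels
    std = (ch_sq / n_pixels - mean.pow(2)).sqrt()
    return mean, std  # apply these to ALL images (real and fake) at train/eval
```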
**Test 3: Source-Specific Feature Learning (Source Holdout)**
- Compare: Train on all sources vs train with one source held out
- Models: **ResNet18**
- Data: 20% subsample
- 5-fold stratified CV (balanced class distribution)
- Resolution: 224×224
- No facecrop, no augmentation
- Hold out each fake source (text2img, inpainting, insight) separately
| Experiment | Held-out Source | Train Sources | Held-out AUC | In-Source AUC | Δ (In-Source - Held-out) |
|------------|-----------------|---------------|--------------|---------------|--------------------------|
| Baseline | None | All | - | ? | - |
| Holdout text2img | text2img | wiki, inpainting, insight | ? | ? | ? |
| Holdout inpainting | inpainting | wiki, text2img, insight | ? | ? | ? |
| Holdout insight | insight | wiki, text2img, inpainting | ? | ? | ? |
**Interpretation**: For Test 1, a significant AUC drop under matched geometry means the model exploits aspect ratio as a shortcut; this must be established before interpreting resolution or facecrop results. For Test 3, if held-out source AUC is significantly lower than in-source AUC (Δ > 0.05-0.10), the model is learning source-specific features.
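The held-out vs in-source comparison relies on per-source AUC; a sketch, assuming `wiki` is the real source and labels use 1 = fake:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def per_source_auc(y_true, y_score, sources, real_source="wiki"):
    """AUC of each fake source scored against all real images."""
    y_true, y_score, sources = map(np.asarray, (y_true, y_score, sources))
    real = sources == real_source
    aucs = {}
    for src in np.unique(sources[~real]):
        mask = real | (sources == src)
        aucs[src] = roc_auc_score(y_true[mask], y_score[mask])
    return aucs
```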
#### 2B: Resolution Impact (no facecrop, no augmentation)
- Test: 128×128 vs 224×224
- Models: **SimpleCNN, ResNet18**
- Data: 20% subsample
- 5-fold stratified CV (balanced class distribution)
| Model | 128×128 AUC | 224×224 AUC | Δ |
|-------|-------------|-------------|---|
| SimpleCNN | ? | ? | ? |
| ResNet18 | ? | ? | ? |
#### 2C: Facecrop Impact (224×224, no augmentation)
- Test: No facecrop vs MTCNN facecrop
- Models: **SimpleCNN, ResNet18**
- Data: 20% subsample
- 5-fold stratified CV (balanced class distribution)
| Model | No Facecrop AUC | Facecrop AUC | Δ |
|-------|-----------------|--------------|---|
| SimpleCNN | ? | ? | ? |
| ResNet18 | ? | ? | ? |
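A hedged sketch of the facecrop step for 2C, assuming the `facenet-pytorch` MTCNN implementation; the margin and the fallback behavior are assumptions, not the project's verified settings:

```python
from facenet_pytorch import MTCNN
from PIL import Image

mtcnn = MTCNN(image_size=224, margin=20, post_process=False)

def face_crop(img: Image.Image) -> Image.Image:
    face = mtcnn(img)  # 3x224x224 tensor in [0, 255], or None if no face found
    if face is None:
        # No face detected: fall back to a square center crop + resize so the
        # sample is not silently dropped from train/eval.
        side = min(img.size)
        box = ((img.width - side) // 2, (img.height - side) // 2,
               (img.width + side) // 2, (img.height + side) // 2)
        return img.crop(box).resize((224, 224))
    return Image.fromarray(face.permute(1, 2, 0).byte().numpy())
```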
#### 2D: Augmentation Impact (224×224, without facecrop)
- Test: No augmentation vs augmentation
- Models: **SimpleCNN, ResNet18**
- Data: 20% subsample
- 5-fold stratified CV (balanced class distribution)
- **Verify test set has no augmentation** (code inspection of `get_transforms(train=False, ...)`; see the sketch at the end of this subsection)
- **Analyze shortcut removal**: Compare train/val gaps and per-source AUC balance
| Model | No Aug AUC | With Aug AUC | Δ | Train/Val Gap (No Aug) | Train/Val Gap (With Aug) |
|-------|------------|--------------|---|------------------------|--------------------------|
| SimpleCNN | ? | ? | ? | ? | ? |
| ResNet18 | ? | ? | ? | ? | ? |
**Experimental question**: Does augmentation without facecrop improve or hurt performance?
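For the verification bullet above, the intent is that the evaluation path contains no randomness. A hypothetical shape for such a split is shown below; this is illustrative only, and the specific augmentations are assumptions, not the project's actual `get_transforms`:

```python
import torchvision.transforms as T

IMAGENET = dict(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225))

# Illustrative only; not the project's actual implementation.
def get_transforms(train: bool, size: int = 224):
    base = [T.Resize((size, size)), T.ToTensor(), T.Normalize(**IMAGENET)]
    if not train:
        return T.Compose(base)  # eval/test path: deterministic, no augmentation
    aug = [T.RandomHorizontalFlip(p=0.5),
           T.ColorJitter(brightness=0.1, contrast=0.1)]
    return T.Compose(aug + base)
```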
#### 2E: Augmentation + Facecrop Combined (224×224)
- Test: Facecrop only vs Facecrop + augmentation
- Models: **SimpleCNN, ResNet18**
- Data: 20% subsample
- 5-fold stratified CV (balanced class distribution)
- **Analyze shortcut removal**: Compare train/val gaps and per-source AUC balance
| Model | Facecrop Only AUC | Facecrop + Aug AUC | Δ | Train/Val Gap (Only) | Train/Val Gap (With Aug) |
|-------|-------------------|--------------------|---|----------------------|--------------------------|
| SimpleCNN | ? | ? | ? | ? | ? |
| ResNet18 | ? | ? | ? | ? | ? |
**Experimental question**: Does augmentation with facecrop improve or hurt performance compared to facecrop alone?
**This gives you**:
- Isolated impact of each preprocessing choice on SimpleCNN and ResNet18
- Verification that the model is not learning shortcuts
- Understanding of how augmentation interacts with face cropping
- Shortcut removal analysis through train/val gap and per-source AUC metrics
---
### Phase 3: Extended Architecture Exploration
**Goal**: Test additional architectures to find the best performing models
**Fixed conditions** (based on best findings from Phase 2):
- Data: 20% subsample
- Resolution: Best from Phase 2B (likely 224×224)
- Facecrop: Best from Phase 2C/2E (likely Yes)
- Augmentation: Best from Phase 2D/2E (depends on experimental results)
- Optimizer: AdamW (lr=1e-4, weight_decay=1e-4)
- Scheduler: CosineAnnealingLR (T_max=15)
- Epochs: 15 with early stopping (patience=5)
- Batch size: 32
- 5-fold stratified cross-validation (balanced class distribution)
| Model | Params | Rationale |
|-------|--------|-----------|
| ResNet34 | ~21.8M | Deeper ResNet - test if more capacity helps |
| ResNet50 | ~25.6M | Even deeper with bottleneck blocks |
| EfficientNet-B0 | ~4.0M | Efficient compound scaling |
| ConvNeXt-Tiny | ~29M | Modern CNN, different architecture family |
| MobileNetV3-Small | ~2.5M | Lightweight efficiency comparison |
**This gives you**: Extended architecture exploration to identify top-performing models for Phase 4
- ResNet depth progression (18 → 34 → 50)
- Efficient architectures (EfficientNet-B0, MobileNetV3-Small)
- Modern CNN with different inductive bias (ConvNeXt-Tiny)
- Size range (2.5M to 29M parameters)
---
### Phase 4: Final Analysis on Best Models
**Goal**: Comprehensive evaluation of top-performing models from Phases 1-3
**Select top 3-4 models** based on Phase 1-3 results (e.g., ResNet18, ResNet34, EfficientNet-B0, ConvNeXt-Tiny)
#### 4A: Data Quantity Scaling
Test how each best model scales with more data:
| Model | 20% Data AUC | 50% Data AUC | 100% Data AUC | Δ (100% - 20%) |
|-------|--------------|--------------|---------------|----------------|
| Model 1 | ? | ? | ? | ? |
| Model 2 | ? | ? | ? | ? |
| Model 3 | ? | ? | ? | ? |
| Model 4 | ? | ? | ? | ? |
**Fixed conditions**:
- Resolution: Best from Phase 2B
- Facecrop: Best from Phase 2C/2E
- Augmentation: Best from Phase 2D/2E
- 5-fold stratified cross-validation (balanced class distribution)
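Drawing the 20%/50% subsets reproducibly can be as simple as a stratified subsample; a sketch, where `labels` and the fixed seed follow the shared config:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def stratified_subsample(indices, labels, fraction, seed=42):
    """Return a class-balanced subset of `indices` of the given fraction."""
    if fraction >= 1.0:
        return np.asarray(indices)
    subset, _ = train_test_split(indices, train_size=fraction,
                                 stratify=labels, random_state=seed)
    return np.asarray(subset)
```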
#### 4B: Comprehensive Evaluation on Full Dataset
- Train best models on **full dataset** (100%)
- Detailed per-source metrics (text2img, inpainting, insight)
- Grad-CAM visualizations for explainability
- Hard example analysis (false positives/negatives)
- Confidence distribution analysis
- Cross-validation results (mean ± std)
**This gives you**: Final, comprehensive evaluation of the best models with full explainability
---
### Notebooks and Analysis
**Goal**: Use Jupyter notebooks for comprehensive analysis and validation of each phase
#### **01_eda.ipynb** - Exploratory Data Analysis
- Dataset overview (real vs fake distribution, sources)
- Image resolution/aspect ratio analysis (identify potential shortcuts)
- Color distribution analysis (identify potential shortcuts)
- Sample visualization from each source (text2img, inpainting, insight, wiki)
- Statistical summary of the dataset
- Data quality checks
#### **02_preprocessing.ipynb** - Preprocessing Pipeline
- Square crop and resize implementation demonstration
- Face crop (MTCNN) demonstration and effectiveness analysis
- Augmentation pipeline visualization (before/after examples)
- Z-score normalization comparison (ImageNet vs real-image-only)
- Data split verification (group-aware by basename, no overlap)
- Preprocessing impact visualization
#### **03_phase1_analysis.ipynb** - Phase 1: Architecture Baseline
- SimpleCNN vs ResNet18 comparison
- 5-fold stratified CV results (mean ± std with confidence intervals)
- Per-source metrics for each model (text2img, inpainting, insight)
- Train/val/test performance curves across epochs
- Confusion matrices for each model
- Statistical significance testing between models (see the sketch below)
- Grad-CAM visualizations for both models (10-20 images each)
- Conclusions: Which baseline is better and why
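For the significance testing listed above, one simple option, assuming both models were evaluated on identical folds, is a paired t-test over the fold-wise AUCs:

```python
from scipy import stats

# Fold-wise validation AUCs from the SAME five folds (placeholder values):
aucs_simplecnn = [0.88, 0.87, 0.89, 0.88, 0.86]
aucs_resnet18  = [0.91, 0.90, 0.92, 0.91, 0.89]

t_stat, p_value = stats.ttest_rel(aucs_simplecnn, aucs_resnet18)
print(f"paired t-test: t={t_stat:.2f}, p={p_value:.3f}")
```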
#### **04_phase2_analysis.ipynb** - Phase 2: Preprocessing Impact
- **2A**: Shortcut analysis (letterboxing, color normalization, source holdout)
- **2B**: Resolution impact (128×128 vs 224×224)
- **2C**: Facecrop impact
- **2D**: Augmentation impact (without facecrop)
- **2E**: Augmentation + facecrop combined
For each experiment:
- 5-fold CV results (mean ± std with confidence intervals)
- Per-source metrics (text2img, inpainting, insight)
- Statistical significance testing vs baseline
- Comparison tables across all Phase 2 experiments
- Grad-CAM visualizations (10-20 images per condition)
- Analysis of train/val gap changes
- Analysis of per-source AUC variance changes
**Overall Phase 2 conclusions**:
- Which preprocessing choices work best and why
- Are shortcuts being learned (resolution, color distribution)?
- Does augmentation remove shortcuts or over-regularize?
- Recommendations for Phase 3 (best preprocessing settings)
#### **05_phase3_analysis.ipynb** - Phase 3: Extended Architecture Exploration
- ResNet34, ResNet50, EfficientNet-B0, ConvNeXt-Tiny, MobileNetV3-Small
- 5-fold CV results (mean ± std) for each model
- Per-source metrics for each model
- Comparison with Phase 1 baselines (ResNet18, SimpleCNN)
- Statistical significance testing vs baselines
- Grad-CAM visualizations for top models (10-20 images each)
- Parameter count vs performance analysis
- Conclusions: Which architectures work best and why
#### **06_phase4_analysis.ipynb** - Phase 4: Final Analysis
- **4A**: Data quantity scaling (20%, 50%, 100%) on top 3-4 models
- **4B**: Comprehensive evaluation on full dataset
- Detailed per-source metrics for final models
- Grad-CAM visualizations for final models (10-20 images each)
- Hard example analysis (false positives/negatives) with visualizations
- Confidence distribution analysis (histograms)
- Cross-validation results (mean ± std with confidence intervals)
- Final model comparison and selection
- Conclusions and recommendations
#### **07_gradcam_deep_dive.ipynb** - Grad-CAM Deep Dive (optional)
- Comprehensive Grad-CAM analysis across all phases and models
- Feature visualization for different model architectures (CNN vs EfficientNet vs ConvNeXt)
- Comparison of what different models focus on (face regions, backgrounds, artifacts)
- Evidence of shortcut removal (or lack thereof) across phases
- Temporal analysis: does model attention change with different preprocessing?
- Visual explanations suitable for presentation
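A minimal hook-based Grad-CAM sketch the notebooks could share is given below. It assumes a single-logit binary classifier; for ResNet18 the natural `target_layer` would be `model.layer4[-1]`:

```python
import torch
import torch.nn.functional as F

def grad_cam(model, x, target_layer):
    """Grad-CAM heatmap for one image tensor x of shape (1, 3, H, W)."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))
    model.eval()
    logit = model(x)  # single fake-vs-real logit assumed
    model.zero_grad()
    logit.squeeze().backward()
    h1.remove(); h2.remove()
    w = grads["a"].mean(dim=(2, 3), keepdim=True)  # GAP over the gradients
    cam = F.relu((w * acts["a"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
    return (cam / cam.max().clamp(min=1e-8)).squeeze(0).squeeze(0).detach()
```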
**Notebook requirements**:
- Each notebook should be self-contained and reproducible
- Include statistical analysis with confidence intervals
- Generate publication-ready visualizations
- Address all experimental questions and hypotheses
- Provide clear conclusions for each phase
- Use consistent formatting and style across all notebooks
- Save all results (metrics, figures, tables) for easy reference
---
## Key Improvements
### 1. Stratified Cross-Validation Implementation
```python
# Use sklearn's StratifiedKFold to ensure balanced class distribution across folds
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_metrics = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # Train on train_idx, validate on val_idx
    # Store metrics per fold for mean ± std reporting
    ...
```
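Note that if several images share a basename (the group-aware split check in `02_preprocessing.ipynb`), plain stratification can leak near-duplicates across folds; sklearn's `StratifiedGroupKFold` keeps each group in a single fold:

```python
from sklearn.model_selection import StratifiedGroupKFold

# `basenames` is assumed to be the per-image group key used by the project
sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(sgkf.split(X, y, groups=basenames)):
    ...
```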
### 2. Augmentation Shortcut Removal Analysis (Phase 2D/2E)
Track these metrics with/without augmentation (the values below are illustrative):
| Metric | Without Aug | With Aug | Interpretation |
|--------|-------------|----------|----------------|
| Train AUC | 0.99 | 0.95 | ↓ Expected |
| Val AUC | 0.90 | 0.89 | ↓ Slight |
| **Train/Val Gap** | **0.09** | **0.06** | **↓ Good!** |
| text2img AUC | 0.98 | 0.96 | ↓ Slight |
| InsightFace AUC | 0.82 | 0.85 | **↑ Good!** |
| **AUC Variance** | **0.08** | **0.06** | **↓ Good!** |
**Interpretation**: If train/val gap ↓ AND per-source AUC variance ↓, augmentation is removing shortcuts.
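The two indicators reduce to a small helper; in this sketch the "AUC variance" column is computed as the standard deviation across per-source AUCs, though the raw variance works equally well:

```python
import numpy as np

def shortcut_indicators(train_auc, val_auc, per_source_aucs):
    """per_source_aucs: dict like {'text2img': 0.96, 'insight': 0.85, ...}.
    Lower values of both indicators suggest shortcut removal."""
    aucs = np.array(list(per_source_aucs.values()))
    return {
        "train_val_gap": train_auc - val_auc,
        "per_source_auc_spread": float(aucs.std()),  # std here; variance also works
    }
```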
### 3. Consistent Hyperparameters
- Same lr for all models (1e-4 is safe for pretrained, may need adjustment for SimpleCNN)
- Same epochs, ES patience, batch size
- Only vary the architecture being tested
### 4. Test Set Integrity and Reproducibility
**Test set from original source**:
- Verify that test set uses original images with minimal preprocessing
- Test set should use `get_transforms(train=False, ...)` to disable augmentation
- Ensure test images are not preprocessed in a way that could affect model comparisons
**Reproducible splits across models**:
- The code already uses `cfg.get("seed", 42)` for reproducible splits
- All experiments should use the same seed (42) to ensure identical train/val/test splits
- This ensures fair comparison between models
**Central config for shared parameters**:
- Create a central config file (`classifier/configs/shared.json`) with parameters common across all phases
- This includes: seed, val_ratio, test_ratio, batch_size, optimizer settings, etc.
- Individual experiment configs can override these defaults
Example shared config:
```json
{
  "seed": 42,
  "val_ratio": 0.1,
  "test_ratio": 0.1,
  "batch_size": 32,
  "optimizer": {
    "type": "adamw",
    "lr": 1e-4,
    "weight_decay": 1e-4
  },
  "scheduler": {
    "type": "cosine_annealing",
    "T_max": 15
  },
  "early_stopping_patience": 5,
  "num_workers": 4
}
```
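A hypothetical loader illustrating the override behavior described above; `load_config`, its path argument, and the merge strategy are assumptions, not existing project code:

```python
import json
from pathlib import Path

def load_config(experiment_path, shared_path="classifier/configs/shared.json"):
    """Shared defaults, shallowly overridden by experiment-specific keys."""
    cfg = json.loads(Path(shared_path).read_text())
    # Note: update() replaces nested dicts wholesale; deep-merge if an
    # experiment should override only part of e.g. the optimizer block.
    cfg.update(json.loads(Path(experiment_path).read_text()))
    return cfg

# e.g. a phase-specific config only needs the keys it changes:
# cfg = load_config("classifier/configs/phase2b_resolution.json")
```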
---
## Summary Table for Report
| Phase | Variable Tested | Models | Data | Resolution | Facecrop | Augment | CV |
|-------|-----------------|--------|------|------------|----------|---------|----|
| 1 | Architecture Baseline | SimpleCNN, ResNet18 | 20% | 128 | No | No | 5-fold stratified |
| 2A | Shortcut Analysis | ResNet18 | 20% | 224 | No | No | 5-fold stratified |
| 2A-Holdout | Source Holdout | ResNet18 | 20% | 224 | No | No | 5-fold stratified |
| 2B | Resolution | SimpleCNN, ResNet18 | 20% | 128/224 | No | No | 5-fold stratified |
| 2C | Facecrop | SimpleCNN, ResNet18 | 20% | 224 | ± | No | 5-fold stratified |
| 2D | Augmentation (no facecrop) | SimpleCNN, ResNet18 | 20% | 224 | No | ± | 5-fold stratified |
| 2E | Augmentation + Facecrop | SimpleCNN, ResNet18 | 20% | 224 | Yes | ± | 5-fold stratified |
| 3 | Extended Architectures | ResNet34, ResNet50, EffNet-B0, ConvNeXt-Tiny, MobileNetV3-Small | 20% | Best | Best | Best | 5-fold stratified |
| 4A | Data Quantity | Top 3-4 models | 20/50/100% | Best | Best | Best | 5-fold stratified |
| 4B | Final Evaluation | Top 3-4 models | 100% | Best | Best | Best | 5-fold stratified |
This structure gives you:
- ✅ Identical comparison conditions across all phases
- ✅ 5-fold stratified cross-validation with confidence intervals (ensures balanced class distribution)
- ✅ Same 2 baseline models (SimpleCNN, ResNet18) tested across all preprocessing variations (Phase 2)
- ✅ Shortcut analysis to verify no bias (Phase 2A)
- ✅ Experimental questions about augmentation impact (Phase 2D/2E)
- ✅ Shortcut removal analysis via train/val gap and per-source AUC metrics
- ✅ Facecrop tested on baseline models (Phase 2C)
- ✅ Extended architecture exploration with proven models (Phase 3)
- ✅ Final comprehensive analysis on best models (Phase 4)
- ✅ Data quantity scaling on multiple best models (Phase 4A)
- ✅ Clear, isolated variables per phase
- ✅ Explainable progression for report
**Key Experimental Questions in Phase 2**:
- **2A (Shortcut Analysis)**: Is the model learning shortcuts (e.g., resolution differences, aspect ratios, source-specific features)?
- **2D (Augmentation without facecrop)**: Does augmentation improve or hurt performance?
- **2E (Augmentation with facecrop)**: Does augmentation improve or hurt performance compared to facecrop alone?