# Classifier Reorganization Plan (v2)

## Analysis of Current Phasing Issues

Your current phasing has several problems that make it difficult to present a rigorous, explainable report:

### Current Problems

1. **Inconsistent comparison conditions**:
   - SimpleCNN uses lr=1e-3, ResNet18 uses lr=1e-4
   - SimpleCNN trains 20 epochs (no early stopping), ResNet18 trains 15 epochs (with early stopping)
   - This makes direct comparisons invalid

2. **No cross-validation**:
   - Only a single 80/10/10 split is used
   - Results may be split-dependent
   - No confidence intervals on metrics

3. **Augmentation testing is incomplete**:
   - Only tested on ResNet18 (Phase 3), not across architectures
   - A performance drop could mean either (a) shortcut removal (good) or (b) over-regularization (bad)
   - There is currently no way to distinguish these cases

4. **Facecrop impact not generalized**:
   - Only ResNet18 is tested with facecrop
   - Unknown whether EfficientNet or ViT benefit similarly

5. **Full dataset only on one model**:
   - Only ResNet18 is tested on the full dataset
   - Unknown whether data quantity helps all models equally

6. **Test set integrity**:
   - Need to verify the test set uses original images (no augmentation; minimal preprocessing only where strictly necessary)
   - Need to ensure the same train/val/test splits across all model comparisons
   - Need a central config for parameters shared across phases
---

## Recommended Reorganization

I suggest reorganizing into **4 phases** with clear, isolated variables. All phases use **5-fold stratified cross-validation** as standard practice to ensure a balanced class distribution across folds.
### Phase 1: Controlled Baseline Comparison

**Goal**: Compare simple architectures under identical conditions to establish baselines.

**Fixed conditions for ALL models**:

- Data: 20% subsample
- Resolution: 128×128
- No face crop
- No augmentation
- Optimizer: AdamW (lr=1e-4, weight_decay=1e-4)
- Scheduler: CosineAnnealingLR (T_max=15)
- Epochs: 15 with early stopping (patience=5)
- Batch size: 32
- 5-fold stratified cross-validation (report mean ± std)

| Model | Params | Expected AUC (mean ± std) |
|-------|--------|---------------------------|
| SimpleCNN | ~400k | ? |
| ResNet18 | ~11.7M | ? |

**This gives you**: a clean, comparable baseline for both simple architectures, with confidence intervals.

**These same 2 models will be used in Phase 2 for the preprocessing experiments.**
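For reference, a minimal sketch of the shared training setup (AdamW, cosine annealing, early stopping on validation AUC), assuming the project supplies `train_loader`, `val_loader`, and an `evaluate` helper that returns AUC:

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def train_one_config(model, train_loader, val_loader, evaluate,
                     device="cuda", epochs=15, patience=5):
    """Shared Phase 1 loop: AdamW + cosine schedule + early stopping."""
    model = model.to(device)
    optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
    scheduler = CosineAnnealingLR(optimizer, T_max=15)
    criterion = torch.nn.CrossEntropyLoss()

    best_auc, stale_epochs = 0.0, 0
    for epoch in range(epochs):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()

        val_auc = evaluate(model, val_loader, device)  # assumed helper returning AUC
        if val_auc > best_auc:
            best_auc, stale_epochs = val_auc, 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break  # early stopping (patience=5)
    return best_auc
```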
---

### Phase 2: Preprocessing Impact (Same 2 Models from Phase 1)

**Goal**: Test each preprocessing change on the SAME 2 models from Phase 1.

**Experimental questions**:

- Does higher resolution improve performance?
- Does face cropping improve performance?
- Does augmentation improve or hurt performance?
- Does augmentation interact with face cropping?
- Is the model learning any shortcuts (e.g., resolution differences, aspect ratios, etc.)?

#### 2A: Shortcut Analysis

**Goal**: Establish whether the baseline model exploits geometry, colour, or source-specific shortcuts before drawing any conclusions from the preprocessing experiments.

**Test 1: Resolution/Ratio Shortcuts (Letterboxing)**

- Train on original images (real=rectangular, fake=square); evaluate the same checkpoint under standard crop vs letterbox-padded real images to confirm whether geometry is a discriminative cue
- Models: **ResNet18**
- Data: 20% subsample
- 5-fold stratified CV (balanced class distribution)
- Resolution: 224×224
- No facecrop, no augmentation

| Experiment | AUC | Train/Val Gap | Per-Source AUC Variance |
|------------|-----|---------------|-------------------------|
| Original images (standard eval) | ? | ? | ? |
| Matched geometry (letterboxed real images) | ? | ? | ? |
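A minimal sketch of the letterbox operation (pad to square, then resize), assuming PIL inputs; the neutral fill colour is an illustrative choice:

```python
from PIL import Image, ImageOps

def letterbox(img: Image.Image, size: int = 224, fill=(114, 114, 114)) -> Image.Image:
    """Pad a rectangular image to a square canvas, then resize to size×size.

    This preserves the original aspect ratio, so real (rectangular) and fake
    (square) images end up with matched geometry at evaluation time.
    """
    w, h = img.size
    side = max(w, h)
    # Center the image on a square canvas filled with a neutral colour.
    pad_left = (side - w) // 2
    pad_top = (side - h) // 2
    padded = ImageOps.expand(
        img,
        border=(pad_left, pad_top, side - w - pad_left, side - h - pad_top),
        fill=fill,
    )
    return padded.resize((size, size), Image.BILINEAR)
```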
**Test 2: Color Distribution Shortcuts**

- Compare: Train with ImageNet normalization stats vs real-image-only normalization stats
- Models: **ResNet18**
- Data: 20% subsample
- 5-fold stratified CV (balanced class distribution)
- Resolution: 224×224
- No facecrop, no augmentation
- ImageNet stats: mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)
- Real-image stats: calculate mean/std from real training images only, then apply to all images

| Experiment | AUC | Train/Val Gap | Per-Source AUC Variance |
|------------|-----|---------------|-------------------------|
| ImageNet normalization | ? | ? | ? |
| Real-image-only normalization | ? | ? | ? |
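A minimal sketch for computing the real-image-only stats, assuming a list of file paths to real training images (`real_image_paths` is a placeholder name):

```python
import numpy as np
from PIL import Image

def channel_stats(image_paths, size=224):
    """Per-channel mean/std over a set of images, on the [0, 1] scale."""
    sums = np.zeros(3)
    sq_sums = np.zeros(3)
    n_pixels = 0
    for path in image_paths:
        arr = np.asarray(
            Image.open(path).convert("RGB").resize((size, size)),
            dtype=np.float64,
        ) / 255.0
        sums += arr.sum(axis=(0, 1))
        sq_sums += (arr ** 2).sum(axis=(0, 1))
        n_pixels += arr.shape[0] * arr.shape[1]
    mean = sums / n_pixels
    std = np.sqrt(sq_sums / n_pixels - mean ** 2)  # Var[x] = E[x^2] - E[x]^2
    return mean, std

# Example: stats from real training images only, applied to ALL images.
# real_mean, real_std = channel_stats(real_image_paths)
# transforms.Normalize(mean=real_mean.tolist(), std=real_std.tolist())
```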
**Test 3: Source-Specific Feature Learning (Source Holdout)**

- Compare: Train on all sources vs train with one source held out
- Models: **ResNet18**
- Data: 20% subsample
- 5-fold stratified CV (balanced class distribution)
- Resolution: 224×224
- No facecrop, no augmentation
- Hold out each fake source (text2img, inpainting, insight) separately

| Experiment | Held-out Source | Train Sources | Held-out AUC | In-Source AUC | Δ (In-Source - Held-out) |
|------------|-----------------|---------------|--------------|---------------|--------------------------|
| Baseline | None | All | - | ? | - |
| Holdout text2img | text2img | wiki, inpainting, insight | ? | ? | ? |
| Holdout inpainting | inpainting | wiki, text2img, insight | ? | ? | ? |
| Holdout insight | insight | wiki, text2img, inpainting | ? | ? | ? |

**Interpretation**: If the held-out-source AUC is significantly lower than the in-source AUC (Δ > 0.05-0.10), the model is learning source-specific features. Likewise, if the AUC drop under matched geometry (Test 1) is significant, the model exploits aspect ratio as a shortcut. Both results must be known before interpreting the resolution or facecrop experiments.
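A minimal sketch of the holdout split, assuming image metadata lives in a pandas DataFrame with a `source` column (an assumption about the project's metadata format):

```python
import pandas as pd

def source_holdout_split(df: pd.DataFrame, held_out_source: str):
    """Split metadata so one fake source is excluded from training.

    Real images (e.g., wiki) stay in the training pool; the held-out fake
    source is evaluated separately to measure generalization to an unseen
    generator.
    """
    held_out = df[df["source"] == held_out_source]
    train_pool = df[df["source"] != held_out_source]
    return train_pool, held_out

# Example: one holdout run per fake source.
# for source in ["text2img", "inpainting", "insight"]:
#     train_pool, held_out = source_holdout_split(meta_df, source)
```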
#### 2B: Resolution Impact (no facecrop, no augmentation)

- Test: 128×128 vs 224×224
- Models: **SimpleCNN, ResNet18**
- Data: 20% subsample
- 5-fold stratified CV (balanced class distribution)

| Model | 128×128 AUC | 224×224 AUC | Δ |
|-------|-------------|-------------|---|
| SimpleCNN | ? | ? | ? |
| ResNet18 | ? | ? | ? |

#### 2C: Facecrop Impact (224×224, no augmentation)

- Test: No facecrop vs MTCNN facecrop
- Models: **SimpleCNN, ResNet18**
- Data: 20% subsample
- 5-fold stratified CV (balanced class distribution)

| Model | No Facecrop AUC | Facecrop AUC | Δ |
|-------|-----------------|--------------|---|
| SimpleCNN | ? | ? | ? |
| ResNet18 | ? | ? | ? |
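A minimal sketch of the facecrop step using the `facenet-pytorch` MTCNN implementation (an assumption about which MTCNN package the pipeline uses); falling back to the full frame when no face is detected is an illustrative policy:

```python
from PIL import Image
from facenet_pytorch import MTCNN

mtcnn = MTCNN(keep_all=False)  # detect only the highest-probability face

def face_crop(img: Image.Image, size: int = 224) -> Image.Image:
    """Crop the most confident detected face; fall back to the full frame."""
    boxes, probs = mtcnn.detect(img)
    if boxes is None:
        return img.resize((size, size))  # no face found
    x1, y1, x2, y2 = boxes[0]
    # Clamp the box to the image bounds before cropping.
    left, top = max(0, int(x1)), max(0, int(y1))
    right, bottom = min(img.width, int(x2)), min(img.height, int(y2))
    return img.crop((left, top, right, bottom)).resize((size, size))
```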
#### 2D: Augmentation Impact (224×224, without facecrop)

- Test: No augmentation vs augmentation
- Models: **SimpleCNN, ResNet18**
- Data: 20% subsample
- 5-fold stratified CV (balanced class distribution)
- **Verify the test set has no augmentation** (code inspection of `get_transforms(train=False, ...)`; see the sketch below)
- **Analyze shortcut removal**: Compare train/val gaps and per-source AUC balance

| Model | No Aug AUC | With Aug AUC | Δ | Train/Val Gap (No Aug) | Train/Val Gap (With Aug) |
|-------|------------|--------------|---|------------------------|--------------------------|
| SimpleCNN | ? | ? | ? | ? | ? |
| ResNet18 | ? | ? | ? | ? | ? |

**Experimental question**: Does augmentation without facecrop improve or hurt performance?
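A sketch of the contract `get_transforms(train=False, ...)` must satisfy: the eval branch is deterministic, with augmentation confined to the train branch. The specific augmentations listed here are illustrative, not the project's actual pipeline:

```python
from torchvision import transforms

def get_transforms(train: bool, size: int = 224, mean=None, std=None):
    """Eval transforms must be deterministic: no augmentation at val/test time."""
    mean = mean or (0.485, 0.456, 0.406)
    std = std or (0.229, 0.224, 0.225)
    normalize = transforms.Normalize(mean=mean, std=std)
    if not train:
        # Deterministic path for val/test: resize + normalize only.
        return transforms.Compose([
            transforms.Resize((size, size)),
            transforms.ToTensor(),
            normalize,
        ])
    # Illustrative augmentation set; the project's actual list may differ.
    return transforms.Compose([
        transforms.RandomResizedCrop(size, scale=(0.8, 1.0)),
        transforms.RandomHorizontalFlip(),
        transforms.ColorJitter(brightness=0.2, contrast=0.2),
        transforms.ToTensor(),
        normalize,
    ])
```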
#### 2E: Augmentation + Facecrop Combined (224×224)

- Test: Facecrop only vs facecrop + augmentation
- Models: **SimpleCNN, ResNet18**
- Data: 20% subsample
- 5-fold stratified CV (balanced class distribution)
- **Analyze shortcut removal**: Compare train/val gaps and per-source AUC balance

| Model | Facecrop Only AUC | Facecrop + Aug AUC | Δ | Train/Val Gap (Only) | Train/Val Gap (With Aug) |
|-------|-------------------|--------------------|---|----------------------|--------------------------|
| SimpleCNN | ? | ? | ? | ? | ? |
| ResNet18 | ? | ? | ? | ? | ? |

**Experimental question**: Does augmentation with facecrop improve or hurt performance compared to facecrop alone?

**This gives you**:

- Isolated impact of each preprocessing choice on SimpleCNN and ResNet18
- Verification that the model is not learning shortcuts
- Understanding of how augmentation interacts with face cropping
- Shortcut removal analysis through train/val gap and per-source AUC metrics

---

### Phase 3: Extended Architecture Exploration

**Goal**: Test additional architectures to find the best-performing models.

**Fixed conditions** (based on the best findings from Phase 2):

- Data: 20% subsample
- Resolution: Best from Phase 2B (likely 224×224)
- Facecrop: Best from Phase 2C/2E (likely yes)
- Augmentation: Best from Phase 2D/2E (depends on experimental results)
- Optimizer: AdamW (lr=1e-4, weight_decay=1e-4)
- Scheduler: CosineAnnealingLR (T_max=15)
- Epochs: 15 with early stopping (patience=5)
- Batch size: 32
- 5-fold stratified cross-validation (balanced class distribution)

| Model | Params | Rationale |
|-------|--------|-----------|
| ResNet34 | ~21.8M | Deeper ResNet - test if more capacity helps |
| ResNet50 | ~25.6M | Even deeper, with bottleneck blocks |
| EfficientNet-B0 | ~4.0M | Efficient compound scaling |
| ConvNeXt-Tiny | ~29M | Modern CNN, different architecture family |
| MobileNetV3-Small | ~2.5M | Lightweight efficiency comparison |

**This gives you**: extended architecture exploration to identify the top-performing models for Phase 4:

- ResNet depth progression (18 → 34 → 50)
- Efficient architectures (EfficientNet-B0, MobileNetV3-Small)
- Modern CNN with a different inductive bias (ConvNeXt-Tiny)
- Size range (2.5M to 29M parameters)
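All five candidates are available as pretrained backbones in `timm`; a minimal instantiation sketch (the `mobilenetv3_small_100` identifier assumes the standard width variant):

```python
import timm

# timm identifiers for the Phase 3 candidates.
PHASE3_MODELS = {
    "ResNet34": "resnet34",
    "ResNet50": "resnet50",
    "EfficientNet-B0": "efficientnet_b0",
    "ConvNeXt-Tiny": "convnext_tiny",
    "MobileNetV3-Small": "mobilenetv3_small_100",
}

def build_model(display_name: str, num_classes: int = 2):
    """Instantiate a pretrained backbone with a fresh classification head."""
    return timm.create_model(
        PHASE3_MODELS[display_name], pretrained=True, num_classes=num_classes
    )
```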
---

### Phase 4: Final Analysis on Best Models

**Goal**: Comprehensive evaluation of the top-performing models from Phases 1-3.

**Select the top 3-4 models** based on Phase 1-3 results (e.g., ResNet18, ResNet34, EfficientNet-B0, ConvNeXt-Tiny).

#### 4A: Data Quantity Scaling

Test how each of the best models scales with more data:

| Model | 20% Data AUC | 50% Data AUC | 100% Data AUC | Δ (100% - 20%) |
|-------|--------------|--------------|---------------|----------------|
| Model 1 | ? | ? | ? | ? |
| Model 2 | ? | ? | ? | ? |
| Model 3 | ? | ? | ? | ? |
| Model 4 | ? | ? | ? | ? |

**Fixed conditions**:

- Resolution: Best from Phase 2B
- Facecrop: Best from Phase 2C/2E
- Augmentation: Best from Phase 2D/2E
- 5-fold stratified cross-validation (balanced class distribution)
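A minimal sketch of drawing the 20%/50% subsamples in a class-balanced way with scikit-learn, so that data quantity is the only variable that changes:

```python
from sklearn.model_selection import train_test_split

def stratified_subsample(indices, labels, fraction, seed=42):
    """Return a class-balanced subsample covering `fraction` of the data."""
    if fraction >= 1.0:
        return indices
    subsample, _ = train_test_split(
        indices,
        train_size=fraction,
        stratify=labels,   # preserve the real/fake class ratio
        random_state=seed,  # same seed across all experiments
    )
    return subsample
```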
#### 4B: Comprehensive Evaluation on Full Dataset

- Train the best models on the **full dataset** (100%)
- Detailed per-source metrics (text2img, inpainting, insight)
- Grad-CAM visualizations for explainability
- Hard example analysis (false positives/negatives)
- Confidence distribution analysis
- Cross-validation results (mean ± std)

**This gives you**: a final, comprehensive evaluation of the best models with full explainability.

---

### Notebooks and Analysis

**Goal**: Use Jupyter notebooks for comprehensive analysis and validation of each phase.

#### **01_eda.ipynb** - Exploratory Data Analysis

- Dataset overview (real vs fake distribution, sources)
- Image resolution/aspect ratio analysis (identify potential shortcuts)
- Color distribution analysis (identify potential shortcuts)
- Sample visualization from each source (text2img, inpainting, insight, wiki)
- Statistical summary of the dataset
- Data quality checks

#### **02_preprocessing.ipynb** - Preprocessing Pipeline

- Square crop and resize implementation demonstration
- Face crop (MTCNN) demonstration and effectiveness analysis
- Augmentation pipeline visualization (before/after examples)
- Z-score normalization comparison (ImageNet vs real-image-only)
- Data split verification (group-aware by basename, no overlap)
- Preprocessing impact visualization

#### **03_phase1_analysis.ipynb** - Phase 1: Architecture Baseline

- SimpleCNN vs ResNet18 comparison
- 5-fold stratified CV results (mean ± std with confidence intervals)
- Per-source metrics for each model (text2img, inpainting, insight)
- Train/val/test performance curves across epochs
- Confusion matrices for each model
- Statistical significance testing between models (see the sketch after this list)
- Grad-CAM visualizations for both models (10-20 images each)
- Conclusions: which baseline is better, and why
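For the significance testing, a minimal sketch using a paired t-test over the per-fold AUCs (paired because both models are evaluated on the same folds); the AUC values are placeholders, and with only 5 folds the p-value should be treated as indicative:

```python
from scipy import stats

# Per-fold AUCs for two models trained on identical folds (placeholder values).
auc_model_a = [0.91, 0.89, 0.92, 0.90, 0.91]  # e.g., ResNet18
auc_model_b = [0.86, 0.85, 0.88, 0.84, 0.87]  # e.g., SimpleCNN

# Paired t-test: each fold contributes one matched pair of scores.
t_stat, p_value = stats.ttest_rel(auc_model_a, auc_model_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```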
#### **04_phase2_analysis.ipynb** - Phase 2: Preprocessing Impact

- **2A**: Shortcut analysis (resolution/ratio, color distribution, source holdout)
- **2B**: Resolution impact (128×128 vs 224×224)
- **2C**: Facecrop impact
- **2D**: Augmentation impact (without facecrop)
- **2E**: Augmentation + facecrop combined

For each experiment:

- 5-fold CV results (mean ± std with confidence intervals)
- Per-source metrics (text2img, inpainting, insight)
- Statistical significance testing vs baseline
- Comparison tables across all Phase 2 experiments
- Grad-CAM visualizations (10-20 images per condition)
- Analysis of train/val gap changes
- Analysis of per-source AUC variance changes

**Overall Phase 2 conclusions**:

- Which preprocessing choices work best, and why
- Are shortcuts being learned (resolution, color distribution)?
- Does augmentation remove shortcuts or over-regularize?
- Recommendations for Phase 3 (best preprocessing settings)

#### **05_phase3_analysis.ipynb** - Phase 3: Extended Architecture Exploration

- ResNet34, ResNet50, EfficientNet-B0, ConvNeXt-Tiny, MobileNetV3-Small
- 5-fold CV results (mean ± std) for each model
- Per-source metrics for each model
- Comparison with Phase 1 baselines (ResNet18, SimpleCNN)
- Statistical significance testing vs baselines
- Grad-CAM visualizations for top models (10-20 images each)
- Parameter count vs performance analysis
- Conclusions: which architectures work best, and why

#### **06_phase4_analysis.ipynb** - Phase 4: Final Analysis

- **4A**: Data quantity scaling (20%, 50%, 100%) on the top 3-4 models
- **4B**: Comprehensive evaluation on the full dataset
- Detailed per-source metrics for the final models
- Grad-CAM visualizations for the final models (10-20 images each)
- Hard example analysis (false positives/negatives) with visualizations
- Confidence distribution analysis (histograms)
- Cross-validation results (mean ± std with confidence intervals)
- Final model comparison and selection
- Conclusions and recommendations

#### **07_gradcam_deep_dive.ipynb** - Grad-CAM Deep Dive (optional)

- Comprehensive Grad-CAM analysis across all phases and models
- Feature visualization for different model architectures (CNN vs EfficientNet vs ConvNeXt)
- Comparison of what different models focus on (face regions, backgrounds, artifacts)
- Evidence of shortcut removal (or lack thereof) across phases
- Attention-shift analysis: does model attention change with different preprocessing?
- Visual explanations suitable for presentation
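A minimal Grad-CAM sketch, assuming the `pytorch-grad-cam` package (`pip install grad-cam`) and a torchvision-style ResNet; other architectures need a different target layer:

```python
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget
from pytorch_grad_cam.utils.image import show_cam_on_image

def gradcam_overlay(model, input_tensor, rgb_img, fake_class_index=1):
    """Overlay a Grad-CAM heatmap for the 'fake' class on an image.

    input_tensor: (1, 3, H, W) normalized tensor; rgb_img: HxWx3 float in [0, 1].
    The last ResNet block is assumed as the target layer; e.g., for ConvNeXt
    the final conv stage would be used instead.
    """
    target_layers = [model.layer4[-1]]  # ResNet-specific assumption
    cam = GradCAM(model=model, target_layers=target_layers)
    grayscale_cam = cam(
        input_tensor=input_tensor,
        targets=[ClassifierOutputTarget(fake_class_index)],
    )[0]
    return show_cam_on_image(rgb_img, grayscale_cam, use_rgb=True)
```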
**Notebook requirements**:

- Each notebook should be self-contained and reproducible
- Include statistical analysis with confidence intervals
- Generate publication-ready visualizations
- Address all experimental questions and hypotheses
- Provide clear conclusions for each phase
- Use consistent formatting and style across all notebooks
- Save all results (metrics, figures, tables) for easy reference

---

## Key Improvements

### 1. Stratified Cross-Validation Implementation

```python
# Use sklearn's StratifiedKFold to ensure balanced class distribution across folds
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_metrics = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # Train on train_idx, validate on val_idx, then store the fold's metrics.
    fold_metrics.append(train_and_evaluate(train_idx, val_idx))  # assumed project helper
```
### 2. Augmentation Shortcut Removal Analysis (Phase 2D/2E)

Track these metrics with and without augmentation (the values below are illustrative):

| Metric | Without Aug | With Aug | Interpretation |
|--------|-------------|----------|----------------|
| Train AUC | 0.99 | 0.95 | ↓ Expected |
| Val AUC | 0.90 | 0.89 | ↓ Slight |
| **Train/Val Gap** | **0.09** | **0.06** | **↓ Good!** |
| text2img AUC | 0.98 | 0.96 | ↓ Slight |
| InsightFace AUC | 0.82 | 0.85 | **↑ Good!** |
| **AUC Variance** | **0.08** | **0.06** | **↓ Good!** |

**Interpretation**: If the train/val gap ↓ AND the per-source AUC variance ↓, augmentation is removing shortcuts.
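A minimal sketch of the two indicators, assuming per-source predictions are collected during evaluation; the per-source spread is measured here as a standard deviation (swap in `np.var` if literal variance is intended):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def shortcut_metrics(train_auc, val_auc, per_source_preds):
    """Compute the two shortcut-removal indicators.

    per_source_preds: dict mapping source name -> (y_true, y_score) arrays.
    """
    per_source_auc = {
        source: roc_auc_score(y_true, y_score)
        for source, (y_true, y_score) in per_source_preds.items()
    }
    return {
        "train_val_gap": train_auc - val_auc,
        "per_source_auc": per_source_auc,
        # Spread of AUC across sources (std; use np.var for literal variance).
        "auc_variance": float(np.std(list(per_source_auc.values()))),
    }
```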
### 3. Consistent Hyperparameters

- Same lr for all models (1e-4 is safe for pretrained models; may need adjustment for SimpleCNN)
- Same epochs, early-stopping patience, and batch size
- Only vary the architecture being tested

### 4. Test Set Integrity and Reproducibility

**Test set from original source**:

- Verify that the test set uses original images with minimal preprocessing
- The test set should use `get_transforms(train=False, ...)` to disable augmentation
- Ensure test images are not preprocessed in a way that could affect model comparisons

**Reproducible splits across models**:

- The code already uses `cfg.get("seed", 42)` for reproducible splits
- All experiments should use the same seed (42) to ensure identical train/val/test splits
- This ensures a fair comparison between models

**Central config for shared parameters**:

- Create a central config file (`classifier/configs/shared.json`) with parameters common across all phases
- This includes: seed, val_ratio, test_ratio, batch_size, optimizer settings, etc.
- Individual experiment configs can override these defaults

Example shared config:

```json
{
  "seed": 42,
  "val_ratio": 0.1,
  "test_ratio": 0.1,
  "batch_size": 32,
  "optimizer": {
    "type": "adamw",
    "lr": 1e-4,
    "weight_decay": 1e-4
  },
  "scheduler": {
    "type": "cosine_annealing",
    "T_max": 15
  },
  "early_stopping_patience": 5,
  "num_workers": 4
}
```
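A minimal sketch of how an experiment config could override these defaults, assuming a shallow, per-section merge (the helper name is hypothetical):

```python
import json

def load_config(experiment_path, shared_path="classifier/configs/shared.json"):
    """Load shared defaults, then let the experiment config override them."""
    with open(shared_path) as f:
        config = json.load(f)
    with open(experiment_path) as f:
        overrides = json.load(f)
    for key, value in overrides.items():
        # Merge nested sections (e.g., "optimizer") key by key; replace scalars.
        if isinstance(value, dict) and isinstance(config.get(key), dict):
            config[key].update(value)
        else:
            config[key] = value
    return config
```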
---

## Summary Table for Report

| Phase | Variable Tested | Models | Data | Resolution | Facecrop | Augment | CV |
|-------|-----------------|--------|------|------------|----------|---------|----|
| 1 | Architecture Baseline | SimpleCNN, ResNet18 | 20% | 128 | No | No | 5-fold stratified |
| 2A | Shortcut Analysis | ResNet18 | 20% | 224 | No | No | 5-fold stratified |
| 2A-Holdout | Source Holdout | ResNet18 | 20% | 224 | No | No | 5-fold stratified |
| 2B | Resolution | SimpleCNN, ResNet18 | 20% | 128/224 | No | No | 5-fold stratified |
| 2C | Facecrop | SimpleCNN, ResNet18 | 20% | 224 | ± | No | 5-fold stratified |
| 2D | Augmentation (no facecrop) | SimpleCNN, ResNet18 | 20% | 224 | No | ± | 5-fold stratified |
| 2E | Augmentation + Facecrop | SimpleCNN, ResNet18 | 20% | 224 | Yes | ± | 5-fold stratified |
| 3 | Extended Architectures | ResNet34, ResNet50, EffNet-B0, ConvNeXt-Tiny, MobileNetV3-Small | 20% | Best | Best | Best | 5-fold stratified |
| 4A | Data Quantity | Top 3-4 models | 20/50/100% | Best | Best | Best | 5-fold stratified |
| 4B | Final Evaluation | Top 3-4 models | 100% | Best | Best | Best | 5-fold stratified |

This structure gives you:

- ✅ Identical comparison conditions across all phases
- ✅ 5-fold stratified cross-validation with confidence intervals (ensures balanced class distribution)
- ✅ The same 2 baseline models (SimpleCNN, ResNet18) tested across all preprocessing variations (Phase 2)
- ✅ Shortcut analysis to verify no bias (Phase 2A)
- ✅ Experimental questions about augmentation impact (Phase 2D/2E)
- ✅ Shortcut removal analysis via train/val gap and per-source AUC metrics
- ✅ Facecrop tested on the baseline models (Phase 2C)
- ✅ Extended architecture exploration with proven preprocessing settings (Phase 3)
- ✅ Final comprehensive analysis on the best models (Phase 4)
- ✅ Data quantity scaling on multiple best models (Phase 4A)
- ✅ Clear, isolated variables per phase
- ✅ An explainable progression for the report

**Key Experimental Questions in Phase 2**:

- **2A (Shortcut Analysis)**: Is the model learning any shortcuts (e.g., resolution differences, aspect ratios, etc.)?
- **2D (Augmentation without facecrop)**: Does augmentation improve or hurt performance?
- **2E (Augmentation with facecrop)**: Does augmentation improve or hurt performance compared to facecrop alone?