DRL_PROJ/docs/generator_plan.md
Johnny Fernandes bb3dfb92d5 Clean state
2026-04-30 01:25:39 +01:00

# Generator Plan
The assignment rewards *iterative improvement with intermediate results*. This plan is structured around **model evolution as the spine**: each step has a *because* tied to an observed failure of the previous step. Pipeline ablations are honest but de-emphasized — they clear the table for the real story.
---
## Standard Settings (Applied Everywhere Unless Noted)
| Setting | Value | Reason |
|---------|-------|--------|
| Batch size | 64 | Consistent across experiments |
| Mixed precision | float16 + GradScaler | Speed |
| EMA decay | 0.9999 | Sample from EMA weights for GANs |
| FID evaluation | Every 25 epochs | Objective quality tracking |
| FID n_real | 5000 | Held-out real images |
| Default epochs | 100 | Best-of-each in Phase 4 retrains to 200 |
Per-model optimizer/hyperparameters are listed inside each phase.
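The EMA setting above amounts to a per-step shadow-weight update. A minimal sketch (`ema_update` is a hypothetical helper name, not from this plan):

```python
import copy

import torch

@torch.no_grad()
def ema_update(ema_model, model, decay=0.9999):
    # shadow weights: ema <- decay * ema + (1 - decay) * current
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1 - decay)

# initialise once with copy.deepcopy(model), call ema_update after every
# optimizer step, and sample from ema_model rather than model
```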
---
## Phase 1 — Pipeline Selection *(quick, one figure)*
**Goal**: Pick the data pipeline used for every downstream experiment. Don't dwell here — this is clearing the table, not the story.
Fixed model: **DCGAN at 64×64** (cheapest baseline, fast iteration). One variable per experiment.
| Experiment | Variable | Variants | Decision |
|---|---|---|---|
| 1A | Resolution | 64×64 vs 128×128 | Pick by FID — assumed transferable |
| 1B | Face crop + alignment | Full image vs MTCNN-aligned | Pick by FID — assumed transferable |
| 1C | Augmentation | H-flip only vs H-flip + rotation ±5° + mild color jitter | Per-family: validate inside Phase 2 for GAN, default to H-flip-only for VAE/DDPM |
| 1D | Combined dataset | Aligned only vs aligned + raw mixed | Pick by FID — expected to underperform aligned-only |
**Caveat on transferability**: Phase 1 uses DCGAN as a proxy to choose the pipeline cheaply, then assumes the choice transfers to VAE and DDPM. Resolution and alignment are largely architecture-invariant (more pixels help everyone; structural consistency helps any spatial prior). Augmentation is *not* — diffusion models benefit less from aug, and MSE-VAE may even be hurt by color jitter. So 1C is treated as an **indicative** result for GANs and re-checked per family rather than baked in globally.
**1D — combined dataset rationale**: Mixing aligned + raw doubles the variance the generator must model (face anywhere/any scale + face fixed) and dilutes the geometric prior. Hypothesis: combined < aligned-only. Cheap to test (one extra DCGAN run). Included for completeness so the report shows we considered it rather than asserting it.
**MTCNN alignment** (one-time preprocessing, cached to disk):
```python
from facenet_pytorch import MTCNN
from skimage.transform import SimilarityTransform, warp
import numpy as np
from PIL import Image

mtcnn = MTCNN(keep_all=False, device='cuda')

REF_LANDMARKS = np.array([  # reference positions in 128×128
    [38.0, 51.0],   # left eye
    [90.0, 51.0],   # right eye
    [64.0, 71.0],   # nose
    [45.0, 95.0],   # left mouth
    [83.0, 95.0],   # right mouth
], dtype=np.float32)

def align_face(img: Image.Image, out_size: int = 128):
    boxes, _, landmarks = mtcnn.detect(img, landmarks=True)
    if boxes is None:
        return None  # no face detected
    tform = SimilarityTransform()
    tform.estimate(landmarks[0], REF_LANDMARKS)
    aligned = warp(np.array(img), tform.inverse,
                   output_shape=(out_size, out_size),
                   order=3, preserve_range=True).astype(np.uint8)
    return Image.fromarray(aligned)
```
**Augmentation philosophy** — only structure-preserving transforms (face-aligned crops are consistent by design):
| Transform | Apply? | Reason |
|---|---|---|
| Horizontal flip | Yes, p=0.5 | Faces are symmetric |
| Rotation | Yes, ±5° | Residual head tilt post-alignment |
| Color jitter | Yes, mild (brightness ±0.1, contrast ±0.1, saturation ±0.05) | Lighting variation without altering identity |
| Translation | No | Breaks alignment |
| Vertical flip | No | Meaningless for faces |
| Strong blur / noise | No | Teaches the model to generate blur |
**Output**: ~1 page in the report. Best pipeline carries forward to all phases.
---
## Phase 2 — GAN Evolution *(main spine)*
**Goal**: The richest narrative — each step has a clear *because* from observed failure. This is the strongest part of the storyline; keep it front and center.
Best pipeline from Phase 1 fixed throughout.
---
### 2.1 — DCGAN *(baseline)*
Simplest GAN baseline. BCE loss, no gradient penalty.
- Adam β1=0.5, β2=0.999, lr=2e-4
- ngf=ndf=64, latent_dim=100
- Resolution: 64×64
**Expected failure**: mode collapse, training instability, oscillating losses. Document these explicitly — they motivate 2.2.
---
### 2.2 — WGAN-GP
**Because**: DCGAN showed mode collapse and instability → Wasserstein loss + gradient penalty.
- Adam β1=0.0, β2=0.9, lr_g=lr_d=1e-4
- ngf=ndf=64, latent_dim=128, n_critic=2, gp_lambda=10
- Resolution: 64×64
**Expected**: more stable training, better diversity. Likely remaining issues: texture artifacts, limited global coherence at higher resolution.
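The gradient penalty term (`gp_lambda=10`) penalizes deviation of the critic's gradient norm from 1 at random interpolates between real and fake batches. A minimal sketch (`gradient_penalty` is a hypothetical helper name):

```python
import torch

def gradient_penalty(critic, real, fake):
    # random per-sample interpolation weight in [0, 1)
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    # gradient of critic scores w.r.t. the interpolates
    grads = torch.autograd.grad(critic(x_hat).sum(), x_hat, create_graph=True)[0]
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
    return ((grad_norm - 1) ** 2).mean()
```

The critic loss then becomes `fake_score.mean() - real_score.mean() + gp_lambda * gradient_penalty(critic, real, fake)`.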
---
### 2.3 — WGAN-GP + Spectral Norm + GroupNorm + Self-Attention
**Because**: WGAN-GP showed texture artifacts / limited coherence → principled Lipschitz constraint and long-range dependencies.
- Generator: BatchNorm → GroupNorm (no batch-size coupling)
- Critic: InstanceNorm → Spectral Normalization (principled Lipschitz constraint)
- Self-attention at 16×16 in both generator and critic
```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        mid = max(in_ch // 8, 1)
        self.q = nn.Conv2d(in_ch, mid, 1, bias=False)
        self.k = nn.Conv2d(in_ch, mid, 1, bias=False)
        self.v = nn.Conv2d(in_ch, in_ch, 1, bias=False)
        self.gamma = nn.Parameter(torch.zeros(1))  # starts as identity mapping
        self._mid = mid

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).view(b, self._mid, -1).transpose(-2, -1)   # (b, hw, mid)
        k = self.k(x).view(b, self._mid, -1)                     # (b, mid, hw)
        v = self.v(x).view(b, c, -1)                             # (b, c, hw)
        attn = torch.softmax(q @ k * self._mid ** -0.5, dim=-1)  # (b, hw, hw)
        return x + self.gamma * (v @ attn.transpose(-2, -1)).view(b, c, h, w)
```
---
### 2.4 — Scale to 128×128 *(if 2.3 looks coherent at 64×64)*
**Because**: 2.3 produces coherent samples at 64×64 → does the architecture hold up at higher resolution?
Same architecture as 2.3, retrained at 128×128. Add attention at 32×32 if memory permits.
---
### Phase 2 Results
| Step | Model | FID @ 100ep ↓ | Main observed failure | Motivates next step |
|---|---|---|---|---|
| 2.1 | DCGAN | ? | ? | ? |
| 2.2 | WGAN-GP | ? | ? | ? |
| 2.3 | WGAN-GP + SN + Attn | ? | ? | ? |
| 2.4 | + 128×128 | ? | ? | — |
For each step: FID curve, 16-sample grid, one paragraph on what failed and why the next change addresses it.
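For context on the FID column: FID is the Fréchet distance between Gaussians fitted to Inception activations of the real and generated sets. The distance itself is simple; a sketch (in practice a library such as `pytorch-fid` or `torchmetrics` supplies the Inception statistics):

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    # d^2 = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2})
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard numerical imaginary parts
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2 * covmean))
```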
---
## Phase 3 — VAE Track
**Goal**: A self-contained evolution story for the likelihood-based family. Every step motivated by a known limitation of the previous.
| Step | Model | Because |
|---|---|---|
| 3.1 | Vanilla VAE (MSE) | Baseline — expect blur |
| 3.2 | + Perceptual loss (VGG) | MSE blur is fundamental to pixel-space reconstruction |
| 3.3 | + PatchGAN discriminator (VQGAN-lite) | Perceptual loss still lacks local texture realism |
**3.1 — Vanilla VAE**: Adam lr=1e-3, latent_dim=256, β=1.0. Plain convolutional encoder/decoder, MSE reconstruction.
**3.2 — Perceptual loss**: VGG-16 feature matching at relu1_2, relu2_2, relu3_3.
**3.3 — Patch discriminator**: PatchGAN adversarial loss targeting local texture realism.
```
L = L_mse + λ_perc·L_vgg + λ_adv·L_adv + β·L_kl
λ_perc=0.1, λ_adv=0.1, β=0.0001
```
**Decoder fix** (applied from 3.1 onward): replace `ConvTranspose2d` with `Upsample(nearest) + Conv2d` — eliminates checkerboard artifacts.
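A sketch of the fixed decoder block (`up_block` is a hypothetical helper; the activation choice is an assumption, not specified in the plan):

```python
import torch.nn as nn

def up_block(in_ch, out_ch):
    # nearest-neighbour upsample + 3x3 conv: every output pixel receives the
    # same number of kernel contributions, unlike strided ConvTranspose2d,
    # whose uneven kernel overlap produces checkerboard artifacts
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="nearest"),
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.LeakyReLU(0.2, inplace=True),
    )
```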
| Step | Model | FID ↓ | Main observed failure |
|---|---|---|---|
| 3.1 | VAE MSE | ? | ? |
| 3.2 | + Perceptual | ? | ? |
| 3.3 | + PatchGAN | ? | ? |
---
## Phase 4 — DDPM Track
**Goal**: A self-contained evolution story for the diffusion family.
| Step | Model | Because |
|---|---|---|
| 4.1 | DDPM linear + ε-pred | Baseline |
| 4.2 | + cosine schedule | Linear schedule wastes capacity at low timesteps |
| 4.3 | + v-prediction | ε-prediction is unstable across the full trajectory |
| 4.4 | + wider U-Net / more attention | If 4.3 still underfits |
**4.1 — Baseline**: AdamW lr=2e-4, base_ch=128, T=1000, attention at 8×8 and 16×16. DDIM sampling, 100 steps.
**4.2 — Cosine schedule**:
```python
import math

import torch

def cosine_betas(T: int, s: float = 0.008):
    t = torch.linspace(0, T, T + 1)
    f = torch.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    alpha_bar = f / f[0]
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]
    return betas.clamp(0, 0.999)
```
```
**4.3 — v-prediction**: replaces ε target with `v = √ᾱ·ε − √(1−ᾱ)·x₀`.
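A sketch of the target swap, using the notation above (`v_target` is a hypothetical helper name):

```python
import torch

def v_target(x0, eps, alpha_bar_t):
    # v = sqrt(alpha_bar) * eps - sqrt(1 - alpha_bar) * x0
    a = alpha_bar_t.sqrt().view(-1, 1, 1, 1)
    s = (1 - alpha_bar_t).sqrt().view(-1, 1, 1, 1)
    return a * eps - s * x0
```

At ᾱ→1 the target reduces to ε, and at ᾱ→0 it becomes −x₀, which keeps the target well-scaled across the whole trajectory.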
**4.4 — Wider U-Net**: base_ch 128 → 192, attention at 8×8, 16×16, 32×32.
| Step | Model | FID ↓ | Main observed failure |
|---|---|---|---|
| 4.1 | DDPM linear + ε | ? | ? |
| 4.2 | + cosine | ? | ? |
| 4.3 | + v-pred | ? | ? |
| 4.4 | + wider | ? | ? |
---
## Phase 5 — Cross-Family Comparison
**Goal**: Side-by-side comparison of the best from each family (2.4, 3.3, 4.4) under identical conditions.
Best-of-each retrained for 200 epochs at the same resolution and pipeline.
### 5A — Quantitative
| Model | FID ↓ | IS ↑ | LPIPS diversity ↑ | Params | Train time |
|---|---|---|---|:---:|:---:|
| Best GAN (2.4) | ? | ? | ? | ? | ? |
| Best VAE (3.3) | ? | ? | ? | ? | ? |
| Best DDPM (4.4) | ? | ? | ? | ? | ? |
### 5B — Qualitative
- **Visual grids**: 16-image sample grids per finalist
- **Progression**: epoch 10 → 50 → 100 → 200 side by side
- **Latent interpolation**: smooth transitions between two latent codes (GAN, VAE)
- **Diversity**: average pairwise LPIPS distance across 100 generated images
- **Failure modes**: worst-generated images per model
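The diversity metric can be sketched generically; in practice `dist_fn` would be `lpips.LPIPS(net='alex')` from the `lpips` package (swapped for a plain L1 distance in the example below):

```python
import itertools

def pairwise_diversity(images, dist_fn):
    # mean distance over all unordered pairs; higher = more diverse
    idx_pairs = list(itertools.combinations(range(len(images)), 2))
    dists = [float(dist_fn(images[i], images[j])) for i, j in idx_pairs]
    return sum(dists) / len(dists)
```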
---
## Compute Budget Notes
Three families × multiple steps is a lot of runs. If compute is tight:
- **Keep the GAN track complete** (2.1 → 2.4) — it carries the strongest narrative.
- **VAE and DDPM can drop the last step each** (stop at 3.2 and 4.3) without hurting the story.
- Phase 1 ablations can use 50 epochs instead of 100 — pipeline deltas show up early.
---
## Summary
| Phase | Purpose | Models | Output |
|---|---|---|---|
| 1 | Pipeline selection | DCGAN @ 64×64 across data variants | Best pipeline |
| 2 | GAN evolution (main spine) | DCGAN → WGAN-GP → +SN+Attn → 128×128 | GAN failure→fix narrative |
| 3 | VAE evolution | VAE → +Perceptual → +PatchGAN | VAE failure→fix narrative |
| 4 | DDPM evolution | DDPM → cosine → v-pred → wider | DDPM failure→fix narrative |
| 5 | Cross-family comparison | Best of each, retrained 200ep | Final FID + IS + qualitative |
**The narrative**: baseline fails in a specific way → fix targets that failure → new failure emerges → next fix targets that → repeat per family → compare families on equal footing.