# Generator Plan

The assignment rewards *iterative improvement with intermediate results*. This plan is structured around **model evolution as the spine**: each step has a *because* tied to an observed failure of the previous step. Pipeline ablations are honest but de-emphasized — they clear the table for the real story.

---
## Standard Settings (Applied Everywhere Unless Noted)

| Setting | Value | Reason |
|---------|-------|--------|
| Batch size | 64 | Consistent across experiments |
| Mixed precision | float16 + GradScaler | Speed |
| EMA decay | 0.9999 | Sample from EMA weights for GANs |
| FID evaluation | Every 25 epochs | Objective quality tracking |
| FID n_real | 5000 | Held-out real images |
| Default epochs | 100 | Best-of-each retrains to 200 in Phase 5 |

Per-model optimizer/hyperparameters are listed inside each phase.
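
The EMA row above can be sketched as a shadow copy of the generator that is updated after every optimizer step (helper names are ours; the 0.9999 decay is the table's):

```python
import copy

import torch
from torch import nn


@torch.no_grad()
def update_ema(model: nn.Module, ema_model: nn.Module, decay: float = 0.9999):
    # ema_param <- decay * ema_param + (1 - decay) * param
    for p, ema_p in zip(model.parameters(), ema_model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1 - decay)
    # Buffers (e.g. BatchNorm running stats) can simply be copied outright.

# Setup: ema_generator = copy.deepcopy(generator).eval()
# Call update_ema(generator, ema_generator) after each optimizer step;
# sample from ema_generator, not generator.
```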

---
## Phase 1 — Pipeline Selection *(quick, one figure)*

**Goal**: Pick the data pipeline used for every downstream experiment. Don't dwell here — this is clearing the table, not the story.

Fixed model: **DCGAN at 64×64** (cheapest baseline, fast iteration). One variable per experiment.

| Experiment | Variable | Variants | Decision |
|---|---|---|---|
| 1A | Resolution | 64×64 vs 128×128 | Pick by FID — assumed transferable |
| 1B | Face crop + alignment | Full image vs MTCNN-aligned | Pick by FID — assumed transferable |
| 1C | Augmentation | H-flip only vs H-flip + rotation ±5° + mild color jitter | Per-family: validate inside Phase 2 for GAN, default to H-flip-only for VAE/DDPM |
| 1D | Combined dataset | Aligned only vs aligned + raw mixed | Pick by FID — expected to underperform aligned-only |

**Caveat on transferability**: Phase 1 uses DCGAN as a proxy to choose the pipeline cheaply, then assumes the choice transfers to VAE and DDPM. Resolution and alignment are largely architecture-invariant (more pixels help everyone; structural consistency helps any spatial prior). Augmentation is *not* — diffusion models benefit less from aug, and MSE-VAE may even be hurt by color jitter. So 1C is treated as an **indicative** result for GANs and re-checked per family rather than baked in globally.

**1D — combined dataset rationale**: Mixing aligned + raw doubles the variance the generator must model (face anywhere/any scale + face fixed) and dilutes the geometric prior. Hypothesis: combined < aligned-only. Cheap to test (one extra DCGAN run). Included for completeness so the report shows we considered it rather than asserting it.

**MTCNN alignment** (one-time preprocessing, cached to disk):

```python
from facenet_pytorch import MTCNN
from skimage.transform import SimilarityTransform, warp
import numpy as np
from PIL import Image

mtcnn = MTCNN(keep_all=False, device='cuda')

REF_LANDMARKS = np.array([  # reference positions in 128×128
    [38.0, 51.0],  # left eye
    [90.0, 51.0],  # right eye
    [64.0, 71.0],  # nose
    [45.0, 95.0],  # left mouth
    [83.0, 95.0],  # right mouth
], dtype=np.float32)


def align_face(img: Image.Image, out_size: int = 128):
    boxes, _, landmarks = mtcnn.detect(img, landmarks=True)
    if boxes is None:
        return None
    tform = SimilarityTransform()
    tform.estimate(landmarks[0], REF_LANDMARKS)
    aligned = warp(np.array(img), tform.inverse,
                   output_shape=(out_size, out_size),
                   order=3, preserve_range=True).astype(np.uint8)
    return Image.fromarray(aligned)
```

**Augmentation philosophy** — only structure-preserving transforms (face-aligned crops are consistent by design):

| Transform | Apply? | Reason |
|---|---|---|
| Horizontal flip | Yes, p=0.5 | Faces are approximately symmetric |
| Rotation | Yes, ±5° | Residual head tilt post-alignment |
| Color jitter | Yes, mild | brightness ±0.1, contrast ±0.1, saturation ±0.05 |
| Translation | No | Breaks alignment |
| Vertical flip | No | Meaningless for faces |
| Strong blur / noise | No | Teaches the model to generate blur |

**Output**: ~1 page in the report. Best pipeline carries forward to all phases.

---
## Phase 2 — GAN Evolution *(main spine)*

**Goal**: The richest narrative — each step has a clear *because* from observed failure. This is the strongest part of the storyline; keep it front and center.

Best pipeline from Phase 1 fixed throughout.

---
### 2.1 — DCGAN *(baseline)*

Simplest GAN baseline. BCE loss, no gradient penalty.

- Adam β1=0.5, β2=0.999, lr=2e-4
- ngf=ndf=64, latent_dim=100
- Resolution: 64×64
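
A minimal sketch of the alternating BCE update with these hyperparameters (stub MLP networks for brevity; the real runs use the DCGAN conv stacks):

```python
import torch
from torch import nn

latent_dim = 100
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                  nn.Linear(64, 3 * 64 * 64), nn.Tanh())
D = nn.Sequential(nn.Linear(3 * 64 * 64, 64), nn.LeakyReLU(0.2),
                  nn.Linear(64, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
bce = nn.BCEWithLogitsLoss()


def train_step(real_flat):
    b = real_flat.size(0)
    fake = G(torch.randn(b, latent_dim))

    # Discriminator: real -> 1, fake -> 0 (fake detached so G is not updated here)
    d_loss = bce(D(real_flat), torch.ones(b, 1)) + bce(D(fake.detach()), torch.zeros(b, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator: non-saturating loss, push D(fake) -> 1
    g_loss = bce(D(fake), torch.ones(b, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```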

**Expected failure**: mode collapse, training instability, oscillating losses. Document these explicitly — they motivate 2.2.

---
### 2.2 — WGAN-GP

**Because**: DCGAN showed mode collapse and instability → Wasserstein loss + gradient penalty.

- Adam β1=0.0, β2=0.9, lr_g=lr_d=1e-4
- ngf=ndf=64, latent_dim=128, n_critic=2, gp_lambda=10
- Resolution: 64×64
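
The gp_lambda term can be sketched as the standard gradient penalty on real/fake interpolates (`critic` is any image-to-scalar network; the function name is ours):

```python
import torch


def gradient_penalty(critic, real, fake, gp_lambda=10.0):
    # Random interpolation points between real and fake samples
    eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(interp)
    grads = torch.autograd.grad(outputs=scores.sum(), inputs=interp,
                                create_graph=True)[0]
    # Penalize deviation of the per-sample gradient norm from 1 (1-Lipschitz target)
    grad_norm = grads.flatten(1).norm(2, dim=1)
    return gp_lambda * ((grad_norm - 1) ** 2).mean()
```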

**Expected**: more stable training, better diversity. Likely remaining issues: texture artifacts, limited global coherence at higher resolution.

---
### 2.3 — WGAN-GP + Spectral Norm + GroupNorm + Self-Attention

**Because**: WGAN-GP showed texture artifacts / limited coherence → principled Lipschitz constraint and long-range dependencies.

- Generator: BatchNorm → GroupNorm (no batch-size coupling)
- Critic: InstanceNorm → Spectral Normalization (principled Lipschitz constraint)
- Self-attention at 16×16 in both generator and critic

```python
import torch
from torch import nn


class SelfAttention(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        mid = max(in_ch // 8, 1)
        self.q = nn.Conv2d(in_ch, mid, 1, bias=False)
        self.k = nn.Conv2d(in_ch, mid, 1, bias=False)
        self.v = nn.Conv2d(in_ch, in_ch, 1, bias=False)
        self.gamma = nn.Parameter(torch.zeros(1))
        self._mid = mid

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).view(b, self._mid, -1).transpose(-2, -1)
        k = self.k(x).view(b, self._mid, -1)
        v = self.v(x).view(b, c, -1)
        attn = torch.softmax(q @ k * self._mid ** -0.5, dim=-1)
        return x + self.gamma * (v @ attn.transpose(-2, -1)).view(b, c, h, w)
```
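
Both normalization swaps in the bullets are one-liners in PyTorch; a sketch with illustrative channel counts:

```python
import torch
from torch import nn

# Critic side: wrap conv weights with spectral normalization (Lipschitz control)
critic_conv = nn.utils.spectral_norm(nn.Conv2d(64, 128, 4, stride=2, padding=1))

# Generator side: GroupNorm instead of BatchNorm, so normalization
# does not depend on batch statistics (no batch-size coupling)
gen_block = nn.Sequential(
    nn.Conv2d(64, 64, 3, padding=1),
    nn.GroupNorm(num_groups=8, num_channels=64),
    nn.ReLU(),
)
```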
---
### 2.4 — Scale to 128×128 *(if 2.3 looks coherent at 64×64)*

**Because**: 2.3 produces coherent samples at 64×64 → does the architecture hold up at higher resolution?

Same architecture as 2.3, retrained at 128×128. Add attention at 32×32 if memory permits.

---
### Phase 2 Results

| Step | Model | FID @ 100ep ↓ | Main observed failure | Motivates next step |
|---|---|---|---|---|
| 2.1 | DCGAN | ? | ? | ? |
| 2.2 | WGAN-GP | ? | ? | ? |
| 2.3 | WGAN-GP + SN + Attn | ? | ? | ? |
| 2.4 | + 128×128 | ? | ? | — |

For each step: FID curve, 16-sample grid, one paragraph on what failed and why the next change addresses it.

---
## Phase 3 — VAE Track

**Goal**: A self-contained evolution story for the likelihood-based family. Every step motivated by a known limitation of the previous.

| Step | Model | Because |
|---|---|---|
| 3.1 | Vanilla VAE (MSE) | Baseline — expect blur |
| 3.2 | + Perceptual loss (VGG) | MSE blur is fundamental to pixel-space reconstruction |
| 3.3 | + PatchGAN discriminator (VQGAN-lite) | Perceptual loss still lacks local texture realism |

**3.1 — Vanilla VAE**: Adam lr=1e-3, latent_dim=256, β=1.0. Plain convolutional encoder/decoder, MSE reconstruction.
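
A sketch of the two standard ELBO pieces behind 3.1, the reparameterization trick and the analytic KL term (function names are ours):

```python
import torch


def reparameterize(mu, logvar):
    # z = mu + sigma * eps, eps ~ N(0, I): keeps sampling differentiable in mu, logvar
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)


def vae_loss(x, x_rec, mu, logvar, beta=1.0):
    rec = torch.nn.functional.mse_loss(x_rec, x, reduction="mean")
    # Analytic KL(q(z|x) || N(0, I)), averaged over the batch
    kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu ** 2 - logvar.exp(), dim=1))
    return rec + beta * kl
```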

**3.2 — Perceptual loss**: VGG-16 feature matching at relu1_2, relu2_2, relu3_3.

**3.3 — Patch discriminator**: PatchGAN adversarial loss targeting local texture realism.
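
The 3.3 discriminator can be sketched as a PatchGAN-style conv stack that emits one real/fake logit per local patch (depths and widths here are illustrative, not tuned):

```python
import torch
from torch import nn


def patch_discriminator(in_ch=3, base=64):
    # Three stride-2 convs: each output logit sees only a local receptive
    # field (a "patch"), so the adversarial signal targets local texture.
    return nn.Sequential(
        nn.Conv2d(in_ch, base, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(base, base * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(base * 2, base * 4, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(base * 4, 1, 3, padding=1),  # 1 logit per spatial location
    )
```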

```
L = L_mse + λ_perc·L_vgg + λ_adv·L_adv + β·L_kl
λ_perc=0.1, λ_adv=0.1, β=0.0001
```

**Decoder fix** (applied from 3.1 onward): replace `ConvTranspose2d` with `Upsample(nearest) + Conv2d` — eliminates checkerboard artifacts.
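
A sketch of one decoder stage after the fix (norm and activation choices are illustrative):

```python
import torch
from torch import nn


def up_block(in_ch, out_ch):
    # Nearest-neighbor upsample + conv instead of ConvTranspose2d: every
    # output pixel receives the same number of kernel contributions, which
    # avoids the uneven-overlap checkerboard pattern of strided transpose convs.
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="nearest"),
        nn.Conv2d(in_ch, out_ch, 3, padding=1),
        nn.GroupNorm(8, out_ch),
        nn.ReLU(),
    )
```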

| Step | Model | FID ↓ | Main observed failure |
|---|---|---|---|
| 3.1 | VAE MSE | ? | ? |
| 3.2 | + Perceptual | ? | ? |
| 3.3 | + PatchGAN | ? | ? |

---
## Phase 4 — DDPM Track

**Goal**: A self-contained evolution story for the diffusion family.

| Step | Model | Because |
|---|---|---|
| 4.1 | DDPM linear + ε-pred | Baseline |
| 4.2 | + cosine schedule | Linear schedule wastes capacity at low timesteps |
| 4.3 | + v-prediction | ε-prediction is unstable across the full trajectory |
| 4.4 | + wider U-Net / more attention | If 4.3 still underfits |

**4.1 — Baseline**: AdamW lr=2e-4, base_ch=128, T=1000, attention at 8×8 and 16×16. DDIM sampling, 100 steps.

**4.2 — Cosine schedule**:

```python
import math

import torch


def cosine_betas(T: int, s: float = 0.008):
    t = torch.linspace(0, T, T + 1)
    f = torch.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    alpha_bar = f / f[0]
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]
    return betas.clamp(0, 0.999)
```

**4.3 — v-prediction**: replaces the ε target with `v = √ᾱ_t·ε − √(1−ᾱ_t)·x₀`.
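
The v target in code (a sketch; `alpha_bar_t` is the per-sample cumulative product ᾱ_t):

```python
import torch


def v_target(x0, eps, alpha_bar_t):
    # v = sqrt(abar_t) * eps - sqrt(1 - abar_t) * x0, with one scalar abar_t
    # per sample, broadcast over the image dimensions
    shape = (-1, *([1] * (x0.dim() - 1)))
    a = alpha_bar_t.sqrt().view(*shape)
    b = (1 - alpha_bar_t).sqrt().view(*shape)
    return a * eps - b * x0
```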

**4.4 — Wider U-Net**: base_ch 128 → 192, attention at 8×8, 16×16, 32×32.

| Step | Model | FID ↓ | Main observed failure |
|---|---|---|---|
| 4.1 | DDPM linear + ε | ? | ? |
| 4.2 | + cosine | ? | ? |
| 4.3 | + v-pred | ? | ? |
| 4.4 | + wider | ? | ? |

---
## Phase 5 — Cross-Family Comparison

**Goal**: Side-by-side comparison of the best from each family (2.4, 3.3, 4.4) under identical conditions.

Best-of-each retrained for 200 epochs at the same resolution and pipeline.

### 5A — Quantitative

| Model | FID ↓ | IS ↑ | LPIPS diversity ↑ | Params | Train time |
|---|---|---|---|:---:|:---:|
| Best GAN (2.4) | ? | ? | ? | ? | ? |
| Best VAE (3.3) | ? | ? | ? | ? | ? |
| Best DDPM (4.4) | ? | ? | ? | ? | ? |

### 5B — Qualitative

- **Visual grids**: 16-image sample grids per finalist
- **Progression**: epoch 10 → 50 → 100 → 200 side by side
- **Latent interpolation**: smooth transitions between two latent codes (GAN, VAE)
- **Diversity**: average pairwise LPIPS distance across 100 generated images
- **Failure modes**: worst-generated images per model
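
For the latent-interpolation item, spherical interpolation (slerp) tends to stay in the high-density shell of a Gaussian prior, where plain lerp passes through low-norm latents; a sketch:

```python
import torch


def slerp(z0, z1, t):
    # Spherical interpolation between two latent vectors (1-D tensors)
    z0n = z0 / z0.norm()
    z1n = z1 / z1.norm()
    omega = torch.acos(torch.clamp(torch.dot(z0n, z1n), -1 + 1e-7, 1 - 1e-7))
    so = torch.sin(omega)
    return (torch.sin((1 - t) * omega) / so) * z0 + (torch.sin(t * omega) / so) * z1

# frames = [G(slerp(z0, z1, t).unsqueeze(0)) for t in torch.linspace(0, 1, 8)]
```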
---
## Compute Budget Notes

Three families × multiple steps is a lot of runs. If compute is tight:

- **Keep the GAN track complete** (2.1 → 2.4) — it carries the strongest narrative.
- **VAE and DDPM can drop the last step each** (stop at 3.2 and 4.3) without hurting the story.
- Phase 1 ablations can use 50 epochs instead of 100 — pipeline deltas show up early.

---
## Summary

| Phase | Purpose | Models | Output |
|---|---|---|---|
| 1 | Pipeline selection | DCGAN @ 64×64 across data variants | Best pipeline |
| 2 | GAN evolution (main spine) | DCGAN → WGAN-GP → +SN+Attn → 128×128 | GAN failure→fix narrative |
| 3 | VAE evolution | VAE → +Perceptual → +PatchGAN | VAE failure→fix narrative |
| 4 | DDPM evolution | DDPM → cosine → v-pred → wider | DDPM failure→fix narrative |
| 5 | Cross-family comparison | Best of each, retrained 200ep | Final FID + IS + qualitative |

**The narrative**: baseline fails in a specific way → fix targets that failure → new failure emerges → next fix targets that → repeat per family → compare families on equal footing.
|