Generator Plan
The assignment rewards iterative improvement with intermediate results. This plan is structured around model evolution as the spine: each step has a because tied to an observed failure of the previous step. Pipeline ablations are honest but de-emphasized — they clear the table for the real story.
Standard Settings (Applied Everywhere Unless Noted)
| Setting | Value | Reason |
|---|---|---|
| Batch size | 64 | Consistent across experiments |
| Mixed precision | float16 + GradScaler | Speed |
| EMA decay | 0.9999 | Sample from EMA weights for GANs |
| FID evaluation | Every 25 epochs | Objective quality tracking |
| FID n_real | 5000 | Held-out real images |
| Default epochs | 100 | Best-of-each in Phase 5 retrains to 200 |
Per-model optimizer/hyperparameters are listed inside each phase.
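The EMA row above amounts to a few lines of bookkeeping. A minimal sketch (the `ema_update` name and the deepcopy-shadow layout are ours, not from a specific library):

```python
import copy
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay=0.9999):
    # Shadow weights track a slow exponential average of the live weights;
    # sampling from them smooths out late-training oscillation.
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1 - decay)

# usage: ema_model = copy.deepcopy(model), then ema_update(ema_model, model)
# after every optimizer step; sample from ema_model at eval time.
```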
Phase 1 — Pipeline Selection (quick, one figure)
Goal: Pick the data pipeline used for every downstream experiment. Don't dwell here — this is clearing the table, not the story.
Fixed model: DCGAN at 64×64 (cheapest baseline, fast iteration). One variable per experiment.
| Experiment | Variable | Variants | Decision |
|---|---|---|---|
| 1A | Resolution | 64×64 vs 128×128 | Pick by FID — assumed transferable |
| 1B | Face crop + alignment | Full image vs MTCNN-aligned | Pick by FID — assumed transferable |
| 1C | Augmentation | H-flip only vs H-flip + rotation ±5° + mild color jitter | Per-family: validate inside Phase 2 for GAN, default to H-flip-only for VAE/DDPM |
| 1D | Combined dataset | Aligned only vs aligned + raw mixed | Pick by FID — expected to underperform aligned-only |
Caveat on transferability: Phase 1 uses DCGAN as a proxy to choose the pipeline cheaply, then assumes the choice transfers to VAE and DDPM. Resolution and alignment are largely architecture-invariant (more pixels help everyone; structural consistency helps any spatial prior). Augmentation is not — diffusion models benefit less from aug, and MSE-VAE may even be hurt by color jitter. So 1C is treated as an indicative result for GANs and re-checked per family rather than baked in globally.
1D — combined dataset rationale: Mixing aligned + raw doubles the variance the generator must model (face anywhere/any scale + face fixed) and dilutes the geometric prior. Hypothesis: combined < aligned-only. Cheap to test (one extra DCGAN run). Included for completeness so the report shows we considered it rather than asserting it.
MTCNN alignment (one-time preprocessing, cached to disk):
```python
from facenet_pytorch import MTCNN
from skimage.transform import SimilarityTransform, warp
import numpy as np
from PIL import Image

mtcnn = MTCNN(keep_all=False, device='cuda')

REF_LANDMARKS = np.array([  # reference positions in 128×128
    [38.0, 51.0],  # left eye
    [90.0, 51.0],  # right eye
    [64.0, 71.0],  # nose
    [45.0, 95.0],  # left mouth
    [83.0, 95.0],  # right mouth
], dtype=np.float32)

def align_face(img: Image.Image, out_size: int = 128):
    boxes, _, landmarks = mtcnn.detect(img, landmarks=True)
    if boxes is None:
        return None
    tform = SimilarityTransform()
    # Scale the 128×128 reference landmarks to the requested output size.
    tform.estimate(landmarks[0], REF_LANDMARKS * (out_size / 128.0))
    # skimage's warp expects the inverse map (output coords → input coords).
    aligned = warp(np.array(img), tform.inverse,
                   output_shape=(out_size, out_size),
                   order=3, preserve_range=True).astype(np.uint8)
    return Image.fromarray(aligned)
```
Augmentation philosophy — only structure-preserving transforms (face-aligned crops are consistent by design):
| Transform | Apply? | Reason |
|---|---|---|
| Horizontal flip | Yes, p=0.5 | Faces are symmetric |
| Rotation | Yes, ±5° | Residual head tilt post-alignment |
| Color jitter | Yes, brightness ±0.1, contrast ±0.1, saturation ±0.05 | Lighting variation without changing identity |
| Translation | No | Breaks alignment |
| Vertical flip | No | Meaningless for faces |
| Strong blur / noise | No | Teaches the model to generate blur |
Output: ~1 page in the report. Best pipeline carries forward to all phases.
Phase 2 — GAN Evolution (main spine)
Goal: The richest narrative — each step has a clear because from observed failure. This is the strongest part of the storyline; keep it front and center.
Best pipeline from Phase 1 fixed throughout.
2.1 — DCGAN (baseline)
Simplest GAN baseline. BCE loss, no gradient penalty.
- Adam β1=0.5, β2=0.999, lr=2e-4
- ngf=ndf=64, latent_dim=100
- Resolution: 64×64
Expected failure: mode collapse, training instability, oscillating losses. Document these explicitly — they motivate 2.2.
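For reference, the per-batch update this baseline uses — BCE with the non-saturating generator loss, no gradient penalty. A minimal sketch (the `dcgan_step` name and signature are ours):

```python
import torch
import torch.nn.functional as F

def dcgan_step(G, D, opt_g, opt_d, real, latent_dim=100):
    b = real.size(0)
    z = torch.randn(b, latent_dim, 1, 1, device=real.device)
    fake = G(z)
    # Discriminator: push real logits toward 1, fake logits toward 0.
    real_logits, fake_logits = D(real), D(fake.detach())
    d_loss = (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
              + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator: non-saturating trick — label its own fakes as real.
    g_logits = D(fake)
    g_loss = F.binary_cross_entropy_with_logits(g_logits, torch.ones_like(g_logits))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```

Logging both losses per step is what surfaces the oscillation that motivates 2.2.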
2.2 — WGAN-GP
Because: DCGAN showed mode collapse and instability → Wasserstein loss + gradient penalty.
- Adam β1=0.0, β2=0.9, lr_g=lr_d=1e-4
- ngf=ndf=64, latent_dim=128, n_critic=2, gp_lambda=10
- Resolution: 64×64
Expected: more stable training, better diversity. Likely remaining issues: texture artifacts, limited global coherence at higher resolution.
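The gradient-penalty term itself is short. A sketch of the standard WGAN-GP penalty at random interpolates (the `gradient_penalty` name is ours; `gp_lambda=10` matches the hyperparameters above):

```python
import torch

def gradient_penalty(critic, real, fake, gp_lambda=10.0):
    # Penalize deviation of the critic's gradient norm from 1 at points
    # interpolated between real and fake samples (Gulrajani et al.).
    b = real.size(0)
    eps = torch.rand(b, *([1] * (real.dim() - 1)), device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(interp)
    grads, = torch.autograd.grad(scores.sum(), interp, create_graph=True)
    norms = grads.view(b, -1).norm(2, dim=1)
    return gp_lambda * ((norms - 1) ** 2).mean()
```

`create_graph=True` keeps the penalty differentiable so it trains the critic through a second backward pass.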
2.3 — WGAN-GP + Spectral Norm + GroupNorm + Self-Attention
Because: WGAN-GP showed texture artifacts / limited coherence → principled Lipschitz constraint and long-range dependencies.
- Generator: BatchNorm → GroupNorm (no batch-size coupling)
- Critic: InstanceNorm → Spectral Normalization (principled Lipschitz constraint)
- Self-attention at 16×16 in both generator and critic
```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        mid = max(in_ch // 8, 1)
        self.q = nn.Conv2d(in_ch, mid, 1, bias=False)
        self.k = nn.Conv2d(in_ch, mid, 1, bias=False)
        self.v = nn.Conv2d(in_ch, in_ch, 1, bias=False)
        self.gamma = nn.Parameter(torch.zeros(1))  # starts as identity map
        self._mid = mid

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).view(b, self._mid, -1).transpose(-2, -1)   # (b, hw, mid)
        k = self.k(x).view(b, self._mid, -1)                     # (b, mid, hw)
        v = self.v(x).view(b, c, -1)                             # (b, c, hw)
        attn = torch.softmax(q @ k * self._mid ** -0.5, dim=-1)  # (b, hw, hw)
        return x + self.gamma * (v @ attn.transpose(-2, -1)).view(b, c, h, w)
```
2.4 — Scale to 128×128 (if 2.3 looks coherent at 64×64)
Because: 2.3 produces coherent samples at 64×64 → does the architecture hold up at higher resolution?
Same architecture as 2.3, retrained at 128×128. Add attention at 32×32 if memory permits.
Phase 2 Results
| Step | Model | FID @ 100ep ↓ | Main observed failure | Motivates next step |
|---|---|---|---|---|
| 2.1 | DCGAN | ? | ? | ? |
| 2.2 | WGAN-GP | ? | ? | ? |
| 2.3 | WGAN-GP + SN + Attn | ? | ? | ? |
| 2.4 | + 128×128 | ? | ? | — |
For each step: FID curve, 16-sample grid, one paragraph on what failed and why the next change addresses it.
Phase 3 — VAE Track
Goal: A self-contained evolution story for the likelihood-based family. Every step motivated by a known limitation of the previous.
| Step | Model | Because |
|---|---|---|
| 3.1 | Vanilla VAE (MSE) | Baseline — expect blur |
| 3.2 | + Perceptual loss (VGG) | MSE blur is fundamental to pixel-space reconstruction |
| 3.3 | + PatchGAN discriminator (VQGAN-lite) | Perceptual loss still lacks local texture realism |
3.1 — Vanilla VAE: Adam lr=1e-3, latent_dim=256, β=1.0. Plain convolutional encoder/decoder, MSE reconstruction.
3.2 — Perceptual loss: VGG-16 feature matching at relu1_2, relu2_2, relu3_3.
3.3 — Patch discriminator: PatchGAN adversarial loss targeting local texture realism.
L = L_mse + λ_perc·L_vgg + λ_adv·L_adv + β·L_kl
λ_perc=0.1, λ_adv=0.1, β=0.0001
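The patch discriminator in 3.3 follows the pix2pix recipe: a shallow conv stack whose output is a grid of real/fake logits, one per receptive-field patch, rather than a single scalar. A sketch (the `patch_discriminator` helper and its layer widths are ours):

```python
import torch
import torch.nn as nn

def patch_discriminator(in_ch=3, ndf=64):
    def block(ci, co, stride):
        return [nn.Conv2d(ci, co, 4, stride, 1),
                nn.InstanceNorm2d(co),
                nn.LeakyReLU(0.2, inplace=True)]
    return nn.Sequential(
        nn.Conv2d(in_ch, ndf, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),
        *block(ndf, ndf * 2, 2),
        *block(ndf * 2, ndf * 4, 2),
        *block(ndf * 4, ndf * 8, 1),
        nn.Conv2d(ndf * 8, 1, 4, 1, 1),  # per-patch logit map, not a scalar
    )
```

Because each logit only sees a local patch, the adversarial gradient pressures local texture exactly where the perceptual loss falls short.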
Decoder fix (applied from 3.1 onward): replace ConvTranspose2d with Upsample(nearest) + Conv2d — eliminates checkerboard artifacts.
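The replacement block is mechanical. A sketch of one decoder stage (the `up_block` name and the post-conv activation are ours):

```python
import torch.nn as nn

def up_block(ci, co):
    # Nearest-neighbor upsample followed by a stride-1 conv; avoids the
    # uneven kernel overlap of ConvTranspose2d that causes checkerboards.
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode='nearest'),
        nn.Conv2d(ci, co, 3, padding=1),
        nn.ReLU(inplace=True),
    )
```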
| Step | Model | FID ↓ | Main observed failure |
|---|---|---|---|
| 3.1 | VAE MSE | ? | ? |
| 3.2 | + Perceptual | ? | ? |
| 3.3 | + PatchGAN | ? | ? |
Phase 4 — DDPM Track
Goal: A self-contained evolution story for the diffusion family.
| Step | Model | Because |
|---|---|---|
| 4.1 | DDPM linear + ε-pred | Baseline |
| 4.2 | + cosine schedule | Linear schedule wastes capacity at low timesteps |
| 4.3 | + v-prediction | ε-prediction is unstable across the full trajectory |
| 4.4 | + wider U-Net / more attention | If 4.3 still underfits |
4.1 — Baseline: AdamW lr=2e-4, base_ch=128, T=1000, attention at 8×8 and 16×16. DDIM sampling, 100 steps.
4.2 — Cosine schedule:
```python
import math
import torch

def cosine_betas(T: int, s: float = 0.008):
    # Nichol & Dhariwal cosine schedule: alpha_bar follows a squared cosine,
    # so betas stay small at early timesteps instead of ramping linearly.
    t = torch.linspace(0, T, T + 1)
    f = torch.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    alpha_bar = f / f[0]
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]
    return betas.clamp(0, 0.999)
```
4.3 — v-prediction: replaces ε target with v = √ᾱ·ε − √(1−ᾱ)·x₀.
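The target and its inversion are each one line; since x_t = √ᾱ·x₀ + √(1−ᾱ)·ε, the identity √ᾱ·x_t − √(1−ᾱ)·v = x₀ recovers the clean image from a v-prediction. A sketch (helper names are ours):

```python
import torch

def v_target(x0, eps, alpha_bar_t):
    # v = sqrt(alpha_bar)*eps - sqrt(1 - alpha_bar)*x0 (Salimans & Ho).
    return alpha_bar_t.sqrt() * eps - (1 - alpha_bar_t).sqrt() * x0

def x0_from_v(x_t, v, alpha_bar_t):
    # Exact inversion given x_t; used when converting v-preds back for sampling.
    return alpha_bar_t.sqrt() * x_t - (1 - alpha_bar_t).sqrt() * v
```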
4.4 — Wider U-Net: base_ch 128 → 192, attention at 8×8, 16×16, 32×32.
| Step | Model | FID ↓ | Main observed failure |
|---|---|---|---|
| 4.1 | DDPM linear + ε | ? | ? |
| 4.2 | + cosine | ? | ? |
| 4.3 | + v-pred | ? | ? |
| 4.4 | + wider | ? | ? |
Phase 5 — Cross-Family Comparison
Goal: Side-by-side comparison of the best model from each family (2.4, 3.3, 4.4) under identical conditions.
Best-of-each retrained for 200 epochs at the same resolution and pipeline.
5A — Quantitative
| Model | FID ↓ | IS ↑ | LPIPS diversity ↑ | Params | Train time |
|---|---|---|---|---|---|
| Best GAN (2.4) | ? | ? | ? | ? | ? |
| Best VAE (3.3) | ? | ? | ? | ? | ? |
| Best DDPM (4.4) | ? | ? | ? | ? | ? |
5B — Qualitative
- Visual grids: 16-image sample grids per finalist
- Progression: epoch 10 → 50 → 100 → 200 side by side
- Latent interpolation: smooth transitions between two latent codes (GAN, VAE)
- Diversity: average pairwise LPIPS distance across 100 generated images
- Failure modes: worst-generated images per model
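The diversity metric in the list above reduces to a mean over all image pairs. A sketch generic in the distance function — plugging in an LPIPS network (e.g. from the `lpips` package) gives the LPIPS-diversity score; the `pairwise_diversity` name is ours:

```python
import itertools
import torch

def pairwise_diversity(images, dist_fn):
    # Mean pairwise distance over all (i, j) pairs of generated images.
    # Higher = more diverse; near zero indicates mode collapse.
    dists = [dist_fn(images[i:i + 1], images[j:j + 1]).item()
             for i, j in itertools.combinations(range(len(images)), 2)]
    return sum(dists) / len(dists)
```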
Compute Budget Notes
Three families × multiple steps is a lot of runs. If compute is tight:
- Keep the GAN track complete (2.1 → 2.4) — it carries the strongest narrative.
- The VAE and DDPM tracks can each drop their final step (stop at 3.2 and 4.3) without hurting the story.
- Phase 1 ablations can use 50 epochs instead of 100 — pipeline deltas show up early.
Summary
| Phase | Purpose | Models | Output |
|---|---|---|---|
| 1 | Pipeline selection | DCGAN @ 64×64 across data variants | Best pipeline |
| 2 | GAN evolution (main spine) | DCGAN → WGAN-GP → +SN+Attn → 128×128 | GAN failure→fix narrative |
| 3 | VAE evolution | VAE → +Perceptual → +PatchGAN | VAE failure→fix narrative |
| 4 | DDPM evolution | DDPM → cosine → v-pred → wider | DDPM failure→fix narrative |
| 5 | Cross-family comparison | Best of each, retrained 200ep | Final FID + IS + qualitative |
The narrative: baseline fails in a specific way → fix targets that failure → new failure emerges → next fix targets that → repeat per family → compare families on equal footing.