Notebooks finished

2026-05-11 17:36:08 +01:00
parent 522a8f8d46
commit 9ae334410d
86 changed files with 3747 additions and 1093 deletions
@@ -0,0 +1,340 @@
+# Generator Pipeline Summary (Phases 0-5)
+
+## Scope
+This document summarizes the full story told across the generator notebooks:
+- `phase0_analysis.ipynb`
+- `phase1_analysis.ipynb`
+- `phase2_analysis.ipynb`
+- `phase3_analysis.ipynb`
+- `phase4_analysis.ipynb`
+- `phase5_analysis.ipynb`
+
+It covers pipeline design, experiment evolution, result analysis, and final sample outcomes. Section 4 provides a detailed list of what is still missing and what should be added.
+
+Important constraint for follow-up work:
+- No additional model training is assumed or recommended.
+- All suggested improvements below are limited to post-hoc analysis, evaluation, documentation, stress tests, or re-use of already trained checkpoints and generated samples.
+
+## 1) End-to-End Story of the Pipeline
+
+### Phase 0: Baseline sanity check
+Goal:
+- Verify that the training loops for GAN, VAE, and DDPM work end-to-end.
+- Create intentionally rough baseline outputs to compare later improvements.
+
+What was done:
+- Trained baseline WGAN-GP, VAE, DDPM, and a small DDPM variant.
+- Used raw/un-aligned images.
+- Focused on training curves and visual samples rather than strong quantitative quality.
+
+Findings:
+- WGAN produced coarse face-like blobs.
+- VAE produced blurry mean-like reconstructions/samples.
+- DDPM showed better local texture but was still noisy.
+- Main takeaway: data quality/preprocessing is a major bottleneck.
+
+Outputs:
+- Run logs in `generator/outputs/logs/`.
+- Sample grids/checkpoints in `generator/outputs/samples/`.
+
+---
+
+### Phase 1: Data pipeline ablation and lock-in
+Goal:
+- Identify the best data/preprocessing recipe using cheap proxy experiments.
+- Lock pipeline decisions before expensive model evolution.
+
+What was done:
+- Four ablation groups with short DCGAN runs:
+1. Resolution (64 vs 128)
+2. Alignment (raw vs MTCNN aligned)
+3. Augmentation (simple vs richer augmentation)
+4. Dataset mixing (aligned-only vs aligned+raw)
+
+Findings:
+- Alignment is the strongest lever.
+- 64x64 is better than 128x128 under the tested budget.
+- Richer augmentation helps in the proxy setup.
+- Mixing aligned and raw data hurts quality.
+
+Decision locked for future phases:
+- Use aligned faces, 64x64, no raw/aligned mixing.
+
+Outputs:
+- Comparative FID plots and ablation figures in `generator/outputs/figures/`.
+
+---
+
+### Phase 2: GAN evolution (architecture and stability)
+Goal:
+- Solve GAN collapse behavior and improve quality under the locked data pipeline.
+
+What was done:
+- Progressive GAN experiments:
+1. Baseline DCGAN-like setup
+2. WGAN-GP objective update
+3. Add spectral normalization + GroupNorm + self-attention
+4. Test 128x128 at similar budget
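
Step 3 relies on spectral normalization. As a rough illustration of what that constraint does (not the notebooks' actual code, which is not shown in this summary), the sketch below estimates a layer's largest singular value by power iteration and rescales the weights by it. The function name `spectral_normalize`, the iteration count, and the kernel shape are illustrative assumptions; PyTorch provides the same idea as `torch.nn.utils.spectral_norm`.

```python
import numpy as np

def spectral_normalize(weight, n_iters=100, seed=0):
    """Divide a weight tensor by its largest singular value (power iteration)."""
    w = weight.reshape(weight.shape[0], -1)   # flatten conv kernels to 2-D
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(w.shape[0])
    for _ in range(n_iters):
        v = w.T @ u
        v /= np.linalg.norm(v) + 1e-12
        u = w @ v
        u /= np.linalg.norm(u) + 1e-12
    sigma = u @ w @ v                         # estimate of the top singular value
    return weight / sigma, sigma

# Hypothetical conv kernel: 16 output channels, 8 input channels, 3x3.
rng = np.random.default_rng(1)
kernel = rng.standard_normal((16, 8, 3, 3))
kernel_sn, sigma = spectral_normalize(kernel)
```

After normalization the flattened kernel has a largest singular value close to 1, which bounds how much the layer can amplify its input.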
+
+Findings:
+- Objective change alone gave small gains.
+- Biggest jump came from stability/capacity design (SN + GroupNorm + attention).
+- 128x128 regressed under fixed compute budget.
+
+Decision:
+- Keep the best GAN recipe at 64x64 with the SN + GroupNorm + self-attention stack.
+
+Outputs:
+- Best checkpoints and phase comparison samples in `generator/outputs/models/`, `generator/outputs/samples/`, and `generator/outputs/figures/`.
+
+---
+
+### Phase 3: VAE evolution (composite objective)
+Goal:
+- Improve VAE from overly smooth outputs to better perceptual quality.
+
+What was done:
+- Step-wise loss composition:
+1. MSE + KL baseline
+2. Add perceptual (VGG) loss
+3. Add adversarial (PatchGAN) component
+
+Findings:
+- Perceptual loss provided major detail recovery.
+- Adding adversarial loss provided further gain.
+- Loss components were complementary.
+
+Decision:
+- Keep VAE with MSE + weighted KL + perceptual + PatchGAN terms.
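
One way the four terms named in this decision could be written as a single objective is sketched here. The symbols are assumptions for illustration: `\hat{x}` is the reconstruction, `q_\psi(z \mid x)` the encoder posterior, `\Phi_l` the VGG feature map at layer `l`, and `\mathcal{L}_{\text{adv}}` the PatchGAN generator loss, whose exact form is not stated in this summary. The coefficients `\beta`, `\lambda_{\text{perc}}`, and `\lambda_{\text{adv}}` are placeholders; their actual values are not given here.

```latex
\mathcal{L}_{\text{VAE}}
  = \lVert x - \hat{x} \rVert_2^2
  + \beta \, D_{\mathrm{KL}}\!\big( q_\psi(z \mid x) \,\|\, \mathcal{N}(0, I) \big)
  + \lambda_{\text{perc}} \sum_{l} \lVert \Phi_l(x) - \Phi_l(\hat{x}) \rVert_2^2
  + \lambda_{\text{adv}} \, \mathcal{L}_{\text{adv}}(\hat{x})
```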
+
+Outputs:
+- Prior samples, reconstructions, and loss/FID trends in `generator/outputs/samples/` and `generator/outputs/figures/`.
+
+---
+
+### Phase 4: DDPM evolution (schedule, target, width)
+Goal:
+- Improve diffusion quality via more modern design choices.
+
+What was done:
+- Sequential DDPM upgrades:
+1. Baseline linear schedule + epsilon prediction
+2. Cosine schedule
+3. Cosine + v-prediction
+4. Wider UNet/capacity increase
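
The first three steps differ only in the noise schedule and the regression target. A minimal sketch of both choices follows, assuming the standard formulations (linear betas, the cosine `alpha_bar` curve, and `v = sqrt(alpha_bar) * eps - sqrt(1 - alpha_bar) * x0`). The step count, beta range, and offset `s` are common defaults used as assumptions, not values taken from the notebooks.

```python
import numpy as np

def linear_alpha_bar(T, beta_start=1e-4, beta_end=2e-2):
    """Cumulative signal level under a linear beta schedule (range assumed)."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def cosine_alpha_bar(T, s=0.008):
    """Cumulative signal level under the cosine schedule (offset s assumed)."""
    t = np.arange(T + 1) / T
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return (f / f[0])[1:]

def make_targets(x0, eps, alpha_bar_t):
    """Return the noisy input x_t and the v-prediction target."""
    a = np.sqrt(alpha_bar_t)
    b = np.sqrt(1.0 - alpha_bar_t)
    x_t = a * x0 + b * eps
    v = a * eps - b * x0
    return x_t, v

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 3, 64, 64))      # stand-in for a batch of images
eps = rng.standard_normal(x0.shape)
ab = cosine_alpha_bar(1000)[500]
x_t, v_target = make_targets(x0, eps, ab)

# The clean image is recoverable from (x_t, v): x0 = a * x_t - b * v.
x0_rec = np.sqrt(ab) * x_t - np.sqrt(1.0 - ab) * v_target
```

With epsilon prediction the network regresses `eps` directly; with v-prediction it regresses `v_target`, from which both `x0` and `eps` can be recovered.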
+
+Findings:
+- Schedule alone gave small gains.
+- v-prediction produced the major improvement.
+- Wider network improved further, at higher training cost.
+
+Decision:
+- Best DDPM setup: cosine schedule + v-prediction + wider backbone.
+
+Outputs:
+- Noise schedule visuals, progression grids, and best samples in `generator/outputs/figures/` and `generator/outputs/samples/`.
+
+---
+
+### Phase 5: Best-of-family final comparison
+Goal:
+- Fair head-to-head across the best GAN, VAE, and DDPM recipes.
+- Conclude practical model choice using quality vs compute trade-offs.
+
+What was done:
+- Trained/evaluated the best recipes from phases 2-4 under the same pipeline constraints.
+- Compared FID curves, final samples, progress snapshots, and interpolation behavior.
+
+Main result:
+- DDPM achieved the best sample quality (lowest FID in this project).
+- GAN was close in quality but much faster in training/inference.
+- VAE was fastest to train but clearly behind in final sample quality.
+
+Practical interpretation:
+- If absolute sample quality is primary: DDPM.
+- If quality-speed balance is primary: GAN.
+- If quick prototyping/low compute is primary: VAE.
+
+Outputs:
+- Final family samples and comparisons in `generator/outputs/samples/` and `generator/outputs/figures/`.
+
+## 2) Evolution of Decisions Across Phases
+
+1. Phase 0 showed baseline failure patterns and established motivation for targeted improvements.
+2. Phase 1 proved data preprocessing (especially alignment) is the foundation.
+3. Phase 2 showed that the GAN quality breakthrough came from stability/capacity changes, not from the loss swap alone.
+4. Phase 3 showed VAE quality improves strongly via loss composition.
+5. Phase 4 showed diffusion gains were driven mostly by prediction target choice and then model width.
+6. Phase 5 demonstrated final family ranking and trade-offs under common conditions.
+
+## 3) What Is Already Well Covered
+
+- Clear multi-phase narrative from baseline to final comparison.
+- Systematic ablation mindset in each phase.
+- Good use of saved artifacts (logs, figures, samples).
+- Strong comparative storytelling in final phase (quality vs speed vs practicality).
+
+## 4) Detailed Gaps: What Is Missing and Should Be Included
+
+This section is intentionally exhaustive. Every item below is designed to work with the models, checkpoints, samples, and logs that already exist.
+
+### A. Evaluation and analysis gaps
+
+1. Missing multi-metric evaluation beyond FID.
+Should include:
+- KID, Precision/Recall (for generative coverage vs fidelity), and optionally IS computed on the already-trained outputs.
+- A short explanation of what each metric captures and where FID can be misleading.
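
As a hedged sketch of one of these metrics, KID can be computed as an unbiased MMD^2 estimate with a cubic polynomial kernel over feature vectors. In practice the features would come from an Inception network applied to the saved samples; here random vectors stand in for them, and the names `poly_kernel` and `kid` are illustrative.

```python
import numpy as np

def poly_kernel(a, b):
    """Cubic polynomial kernel k(x, y) = (x.y / d + 1)^3."""
    d = a.shape[1]
    return (a @ b.T / d + 1.0) ** 3

def kid(real_feats, fake_feats):
    """Unbiased MMD^2 estimate between two feature sets."""
    m, n = len(real_feats), len(fake_feats)
    k_rr = poly_kernel(real_feats, real_feats)
    k_ff = poly_kernel(fake_feats, fake_feats)
    k_rf = poly_kernel(real_feats, fake_feats)
    # Drop the diagonal so that same-sample pairs do not bias the estimate.
    term_rr = (k_rr.sum() - np.trace(k_rr)) / (m * (m - 1))
    term_ff = (k_ff.sum() - np.trace(k_ff)) / (n * (n - 1))
    return float(term_rr + term_ff - 2.0 * k_rf.mean())

# Stand-in features (assumption): 16-D Gaussian vectors instead of Inception features.
rng = np.random.default_rng(0)
real = rng.standard_normal((500, 16))
same = rng.standard_normal((500, 16))
shifted = same + 1.0

kid_same = kid(real, same)        # same distribution: close to zero
kid_shifted = kid(real, shifted)  # shifted distribution: clearly larger
```

Unlike FID, this estimator is unbiased for small sample counts, which is one reason to report it next to FID.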
+
+2. No uncertainty/statistical significance around reported FID.
+Should include:
+- Bootstrap confidence intervals over the already generated sample sets.
+- Mean ± std tables across repeated FID subsampling on the saved outputs.
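
A minimal sketch of the bootstrap idea, assuming feature vectors for real and generated images have already been extracted and cached. The Fréchet distance is computed from fitted Gaussians, and the generated set is resampled with replacement. Random 8-D vectors replace real Inception features, and `frechet_distance` and `bootstrap_fid` are hypothetical helper names.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # tr(sqrt(cov_a cov_b)) equals the sum of square roots of the eigenvalues
    # of cov_a @ cov_b, which are real and non-negative for PSD inputs.
    eigvals = np.linalg.eigvals(cov_a @ cov_b).real
    tr_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

def bootstrap_fid(real_feats, fake_feats, n_boot=200, seed=0):
    """Resample the generated set with replacement and summarize the scores."""
    rng = np.random.default_rng(seed)
    n = len(fake_feats)
    scores = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)
        scores[i] = frechet_distance(real_feats, fake_feats[idx])
    lo, hi = np.percentile(scores, [2.5, 97.5])
    return scores.mean(), scores.std(ddof=1), (lo, hi)

# Stand-in features (assumption): low-dimensional Gaussians, not Inception features.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(1000, 8))
fake = rng.normal(0.1, 1.0, size=(1000, 8))
mean, std, (lo, hi) = bootstrap_fid(real, fake)
```

The interval `(lo, hi)` is a percentile bootstrap interval. It reflects sampling variation in the generated set only, not variation across training seeds.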
+
+3. Missing mode coverage/diversity analysis.
+Should include:
+- Precision and recall reported separately for each model, so that fidelity and coverage are not conflated.
+- Cluster-level coverage checks using the generated samples already on disk.
+- Nearest-neighbor distance plots for generated vs. training data.
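
A sketch of k-nearest-neighbour precision and recall, assuming cached feature vectors. A generated sample counts toward precision if it lies inside the k-NN ball of some real sample; a real sample counts toward recall if it lies inside the k-NN ball of some generated sample. The choice `k=3`, the 2-D toy features, and the function names are illustrative assumptions.

```python
import numpy as np

def pairwise_dist(a, b):
    """Euclidean distances between all rows of a and all rows of b."""
    sq = (a**2).sum(1)[:, None] + (b**2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.sqrt(np.clip(sq, 0.0, None))

def knn_radii(feats, k):
    """Distance from each point to its k-th nearest neighbour in the same set."""
    d = pairwise_dist(feats, feats)
    # After sorting, column 0 is the self-distance, so column k is the k-th neighbour.
    return np.sort(d, axis=1)[:, k]

def coverage(queries, support, k=3):
    """Fraction of queries that fall inside any support point's k-NN ball."""
    radii = knn_radii(support, k)
    d = pairwise_dist(queries, support)
    return float((d <= radii[None, :]).any(axis=1).mean())

def precision_recall(real_feats, fake_feats, k=3):
    precision = coverage(fake_feats, real_feats, k)
    recall = coverage(real_feats, fake_feats, k)
    return precision, recall

# Toy example (assumption): a "mode-collapsed" generator that only covers the centre.
rng = np.random.default_rng(0)
real = rng.standard_normal((500, 2))
collapsed = 0.05 * rng.standard_normal((500, 2))
precision, recall = precision_recall(real, collapsed)
```

In the toy case precision stays high while recall drops sharply, which is the pattern a single FID number cannot separate.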
+
+4. Missing per-attribute quality analysis.
+Should include:
+- Analysis by pose, illumination, expression, and age bands using the existing samples.
+- Generated-vs-real attribute distribution matching.
+
+5. Missing metric protocol sensitivity analysis.
+Should include:
+- FID stability under different sample counts and bootstrap resampling.
+- A clear explanation of why phase-to-phase absolute FID comparability can fail.
+
+6. Missing human-perception validation.
+Should include:
+- A small blind ranking study using the already generated sample grids.
+- A comparison between human preference and metric preference.
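
One simple way to compare human preference with metric preference is a rank correlation. The sketch below implements Spearman's rho in plain Python. The two lists are placeholder numbers for illustration only, not results from this project.

```python
def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            out[idx] = mean_rank
        i = j + 1
    return out

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Placeholder inputs (not project data): a metric where lower is better,
# and the mean rank each model received from human raters (1 = preferred).
metric_scores = [1.0, 2.0, 3.0]
human_ranks = [1, 3, 2]
rho = spearman(metric_scores, human_ranks)
```

A rho near 1 would mean the metric and the raters order the models the same way; with only three model families the estimate is coarse and should be reported as such.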
+
+### B. Post-hoc experiment analysis gaps
+
+1. Loss-weight behavior is not interpreted deeply enough.
+Should include:
+- A post-hoc explanation of how the chosen perceptual/adversarial weights affected the saved VAE outputs.
+- A summary table of the observed trade-off across the completed runs, without proposing new training.
+
+2. Family-specific preprocessing effects are not fully separated.
+Should include:
+- A careful read of how the locked aligned-64 pipeline interacts with each family’s final samples.
+- Visual comparisons that isolate preprocessing benefits already visible in the saved figures.
+
+3. Hyperparameter conclusions are narrow.
+Should include:
+- A consolidated summary of which configurations already worked best and which were discarded.
+- No new sweeps; only interpretation of the existing trained runs.
+
+4. Generalization checks are missing.
+Should include:
+- Evaluation of the existing checkpoints on any available held-out or alternate data, if such data already exists.
+- If no extra data exists, explicitly state that generalization was not tested.
+
+5. Failure-case experiments are not explicitly catalogued.
+Should include:
+- A concise “negative results” subsection per phase with what failed and why, based only on the completed experiments.
+
+### C. Reproducibility gaps
+
+1. Seeds are not consistently documented.
+Should include:
+- A run-level seed log for the completed experiments.
+
+2. Environment and hardware specs are missing from the notebook narrative.
+Should include:
+- GPU, CUDA, PyTorch, Python, and key package versions.
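
A small environment block of this kind could be printed at the top of each notebook. The sketch below uses only the standard library and degrades gracefully when a package is missing. The package list is an assumption and should be adjusted to what the project actually imports.

```python
import json
import platform
import sys
from importlib import metadata

def environment_report(packages=("torch", "torchvision", "numpy")):
    """Collect interpreter, OS, and package versions into a dict."""
    report = {"python": sys.version.split()[0], "platform": platform.platform()}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    # GPU details are only available when PyTorch is importable.
    try:
        import torch
        report["cuda"] = torch.version.cuda
        report["gpu"] = (torch.cuda.get_device_name(0)
                         if torch.cuda.is_available() else "none")
    except ImportError:
        pass
    return report

print(json.dumps(environment_report(), indent=2))
```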
+
+3. Config traceability could be clearer inside notebooks.
+Should include:
+- Printed key config values in each phase notebook.
+- A direct link from each run name to its exact config JSON.
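
Seed logging and config traceability can share one artifact: a manifest that maps each run name to its config file and recorded seed. The sketch below assumes one JSON config per run, named after the run, with an optional `seed` key. That layout is hypothetical and may not match how the project stores its configs.

```python
import json
import tempfile
from pathlib import Path

def build_run_manifest(config_dir):
    """Map each run name to its config path and recorded seed (None if absent)."""
    manifest = {}
    for path in sorted(Path(config_dir).glob("*.json")):
        cfg = json.loads(path.read_text())
        manifest[path.stem] = {"config": str(path), "seed": cfg.get("seed")}
    return manifest

# Demo on a throwaway directory with two made-up configs.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "run_a.json").write_text(json.dumps({"seed": 0, "lr": 2e-4}))
    Path(tmp, "run_b.json").write_text(json.dumps({"lr": 1e-4}))
    manifest = build_run_manifest(tmp)

for run, info in manifest.items():
    print(run, info["seed"])
```

Runs whose seed comes back as `None` are exactly the ones the seed log needs to flag as undocumented.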
+
+4. Checkpoint selection policy should be formalized.
+Should include:
+- A clear rule for when final EMA or best EMA is used and why.
+
+5. A reproduction guide is missing from the notebooks folder.
+Should include:
+- Step-by-step commands to replay the notebooks and re-open the saved artifacts.
+
+### D. Practical deployment/evaluation gaps
+
+1. Inference speed and memory profiling is incomplete.
+Should include:
+- Throughput, latency, and VRAM table for the already trained GAN/VAE/DDPM checkpoints.
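
A minimal timing harness for such a table might look like the sketch below. `dummy_generate` is a stand-in for a real sampler call on a loaded checkpoint, and the warm-up and repeat counts are arbitrary assumptions.

```python
import statistics
import time

def benchmark(generate, batch_size, n_warmup=2, n_runs=10):
    """Median latency per batch and derived throughput for a sampler callable."""
    for _ in range(n_warmup):          # warm-up runs are not timed
        generate(batch_size)
    times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        generate(batch_size)
        times.append(time.perf_counter() - start)
    latency = statistics.median(times)
    return {"latency_s": latency, "images_per_s": batch_size / latency}

def dummy_generate(batch_size):
    # Placeholder workload; replace with the model's sampling function.
    return [sum(range(1000)) for _ in range(batch_size)]

stats = benchmark(dummy_generate, batch_size=16)
print(stats)
```

When timing GPU models with PyTorch, `torch.cuda.synchronize()` should be called before each clock read so that asynchronous kernels are included, and `torch.cuda.max_memory_allocated()` can supply the VRAM column.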
+
+2. Sample count vs. quality behavior is missing.
+Should include:
+- FID-vs-number-of-generated-samples curve using already saved samples or deterministic re-sampling from existing checkpoints.
+
+3. Robustness/distribution shift testing is missing.
+Should include:
+- Corruption robustness tests (blur, noise, compression) applied to the existing outputs.
+- Optional out-of-domain face evaluation if a suitable held-out dataset already exists.
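
Two of the listed corruptions can be applied with NumPy alone, as sketched below for image batches stored as float arrays in `[0, 1]` with shape `(N, H, W, C)`. That layout, the noise level, and the odd blur window size are assumptions. Compression artifacts would need an image library and are left out.

```python
import numpy as np

def add_gaussian_noise(images, sigma, rng):
    """Additive Gaussian noise, clipped back to the valid [0, 1] range."""
    return np.clip(images + rng.normal(0.0, sigma, images.shape), 0.0, 1.0)

def box_blur(images, k=3):
    """Mean filter over a k x k window (k odd) with edge padding."""
    pad = k // 2
    padded = np.pad(images, ((0, 0), (pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros_like(images, dtype=float)
    h, w = images.shape[1:3]
    for dy in range(k):
        for dx in range(k):
            out += padded[:, dy:dy + h, dx:dx + w, :]
    return out / (k * k)

# Random stand-in for a batch of saved 64x64 RGB samples.
rng = np.random.default_rng(0)
images = rng.random((8, 64, 64, 3))
noisy = add_gaussian_noise(images, sigma=0.1, rng=rng)
blurred = box_blur(images, k=3)
```

Feeding the corrupted and uncorrupted copies through the same metric pipeline shows how sensitive each reported score is to small perturbations of the outputs.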
+
+4. Model selection guide should be more operational.
+Should include:
+- A decision table by target constraints: best quality, best latency, lowest compute burden, easiest analysis, and most stable outputs.
+
+### E. Ethics and risk gaps
+
+1. Dataset bias assessment is not included.
+Should include:
+- Demographic/attribute distribution report if labels are available.
+- Generated distribution parity analysis against the real data.
+
+2. Misuse and deepfake risk section is missing.
+Should include:
+- Clear misuse statement and mitigation suggestions.
+
+3. Memorization/privacy leakage checks are missing.
+Should include:
+- A nearest-neighbor memorization audit and threshold-based discussion using the trained models' samples.
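
A sketch of such an audit, assuming feature vectors for generated, training, and held-out real images are available. Generated samples are flagged when they sit closer to a training image than almost any held-out real image does. The 1% quantile threshold and the use of a held-out set as the reference are assumptions; if no held-out data exists, the threshold has to be chosen another way.

```python
import numpy as np

def nearest_train_distance(query_feats, train_feats):
    """Distance to, and index of, the nearest training point for each query."""
    sq = ((query_feats**2).sum(1)[:, None] + (train_feats**2).sum(1)[None, :]
          - 2.0 * query_feats @ train_feats.T)
    dist = np.sqrt(np.clip(sq, 0.0, None))
    return dist.min(axis=1), dist.argmin(axis=1)

def flag_near_copies(gen_feats, train_feats, holdout_feats, quantile=0.01):
    """Flag generated samples unusually close to a training sample."""
    ref, _ = nearest_train_distance(holdout_feats, train_feats)
    threshold = np.quantile(ref, quantile)
    dist, idx = nearest_train_distance(gen_feats, train_feats)
    return dist < threshold, idx, threshold

# Toy data (assumption): 8-D Gaussian features, with five near-copies of
# training points appended to the "generated" set.
rng = np.random.default_rng(0)
train = rng.standard_normal((500, 8))
holdout = rng.standard_normal((200, 8))
copies = train[:5] + 1e-3 * rng.standard_normal((5, 8))
gen = np.vstack([rng.standard_normal((50, 8)), copies])

flags, idx, threshold = flag_near_copies(gen, train, holdout)
```

Flagged pairs (`gen[i]`, `train[idx[i]]`) are candidates for visual inspection, not proof of memorization on their own.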
+
+4. Responsible use guidance is absent.
+Should include:
+- Recommended and discouraged use cases in the summary/report.
+
+### F. Documentation quality gaps
+
+1. Mathematical objective definitions are incomplete in narrative form.
+Should include:
+- Formal equations for the VAE composite loss with explicit coefficients.
+
+2. Architectural diagrams are missing.
+Should include:
+- Compact diagrams for the GAN, VAE, and DDPM best variants.
+
+3. Troubleshooting guidance is missing.
+Should include:
+- Common failure patterns (loss explosion, collapse, OOM) and practical fixes that reflect what already happened in the project.
+
+4. Literature baseline context is limited.
+Should include:
+- Comparison table versus well-known references, with protocol caveats.
+
+## 5) Recommended Next-Step Priorities
+
+### Priority 1 (fast and high impact)
+1. Add bootstrap uncertainty bands and confidence intervals to the existing FID comparisons.
+2. Add precision/recall and KID alongside FID for the current sample sets.
+3. Add an explicit FID protocol box in all notebooks.
+4. Add a short model selection guide and reproducibility/environment block.
+
+### Priority 2 (medium effort, strong value)
+1. Add a negative-results appendix and troubleshooting notes based on the completed runs.
+2. Add inference throughput/VRAM benchmarking for the already trained checkpoints.
+3. Add per-attribute and nearest-neighbor analysis using existing outputs.
+
+### Priority 3 (larger effort, publication-level completeness)
+1. Human preference study on the saved sample grids.
+2. Fairness/bias and memorization audits.
+3. Cross-dataset generalization analysis if another dataset already exists in the project environment.
+
+## 6) Final Bottom-Line Conclusion
+The notebook set tells a coherent and strong experimental story: baseline failures -> pipeline correction -> family-specific improvements -> final cross-family comparison. The final evidence shows a clear quality-speed trade-off: DDPM gives the best sample quality, GAN gives near-best quality with far better speed, and VAE remains useful when compute and iteration speed dominate.
+
+Because no further training is planned, the most valuable remaining work is not new model fitting. It is post-hoc analysis of the models already trained: broader evaluation metrics, uncertainty estimates, robustness checks, memorization/privacy checks, and clearer documentation of protocol and limitations.