Notebooks finished

DiogoCosta18
2026-05-11 17:36:08 +01:00
parent 522a8f8d46
commit 9ae334410d
86 changed files with 3747 additions and 1093 deletions
@@ -0,0 +1,340 @@
# Generator Pipeline Summary (Phases 0-5)
## Scope
This document summarizes the full story told across the generator notebooks:
- `phase0_analysis.ipynb`
- `phase1_analysis.ipynb`
- `phase2_analysis.ipynb`
- `phase3_analysis.ipynb`
- `phase4_analysis.ipynb`
- `phase5_analysis.ipynb`
It covers pipeline design, experiment evolution, result analysis, and final sample outcomes. Section 4 lists in detail what is still missing and what should be added.
Important constraint for follow-up work:
- No additional model training is assumed or recommended.
- All suggested improvements below are limited to post-hoc analysis, evaluation, documentation, stress tests, or re-use of already trained checkpoints and generated samples.
## 1) End-to-End Story of the Pipeline
### Phase 0: Baseline sanity check
Goal:
- Verify training loops for GAN, VAE, and DDPM are working end-to-end.
- Create intentionally rough baseline outputs to compare later improvements.
What was done:
- Trained baseline WGAN-GP, VAE, DDPM, and a small DDPM variant.
- Used raw, unaligned images.
- Focused on training curves and visual samples rather than strong quantitative quality.
Findings:
- WGAN produced coarse face-like blobs.
- VAE produced blurry mean-like reconstructions/samples.
- DDPM showed better local texture but still noisy.
- Main takeaway: data quality/preprocessing is a major bottleneck.
Outputs:
- Run logs in `generator/outputs/logs/`.
- Sample grids/checkpoints in `generator/outputs/samples/`.
---
### Phase 1: Data pipeline ablation and lock-in
Goal:
- Identify the best data/preprocessing recipe using cheap proxy experiments.
- Lock pipeline decisions before expensive model evolution.
What was done:
- Four ablation groups with short DCGAN runs:
1. Resolution (64 vs 128)
2. Alignment (raw vs MTCNN aligned)
3. Augmentation (simple vs richer augmentation)
4. Dataset mixing (aligned-only vs aligned+raw)
Findings:
- Alignment is the strongest lever.
- 64x64 is better than 128x128 under the tested budget.
- Richer augmentation helps in the proxy setup.
- Mixing aligned and raw data hurts quality.
Decision locked for future phases:
- Use aligned faces, 64x64, no raw/aligned mixing.
Outputs:
- Comparative FID plots and ablation figures in `generator/outputs/figures/`.
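
Since FID drives the comparisons from this phase onward, a reminder of what it measures may help. The block below is a minimal sketch of the Fréchet distance between two feature sets, assuming the features have already been extracted (for FID these are Inception activations). The function name `frechet_distance` and the use of NumPy are assumptions for illustration; the project's own FID implementation is not shown in this summary and may differ.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two (n, d) feature arrays."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b; clip tiny negative values from round-off.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(size=(500, 4))   # toy stand-in for real-image features
    fake = real + 0.5                  # toy stand-in for generated features
    print(frechet_distance(real, fake))
```

The estimate depends on the number of samples and on the feature extractor, which is one reason absolute values from different protocols are hard to compare.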
---
### Phase 2: GAN evolution (architecture and stability)
Goal:
- Solve GAN collapse behavior and improve quality under the locked data pipeline.
What was done:
- Progressive GAN experiments:
1. Baseline DCGAN-like setup
2. WGAN-GP objective update
3. Add spectral normalization + GroupNorm + self-attention
4. Test 128x128 at similar budget
Findings:
- Objective change alone gave small gains.
- Biggest jump came from stability/capacity design (SN + GroupNorm + attention).
- 128x128 regressed under fixed compute budget.
Decision:
- Best GAN recipe kept at 64x64 with SN + attention stack.
Outputs:
- Best checkpoints and phase comparison samples in `generator/outputs/models/`, `generator/outputs/samples/`, and `generator/outputs/figures/`.
---
### Phase 3: VAE evolution (composite objective)
Goal:
- Improve VAE from overly smooth outputs to better perceptual quality.
What was done:
- Step-wise loss composition:
1. MSE + KL baseline
2. Add perceptual (VGG) loss
3. Add adversarial (PatchGAN) component
Findings:
- Perceptual loss provided major detail recovery.
- Adding adversarial loss provided further gain.
- Loss components were complementary.
Decision:
- Keep VAE with MSE + weighted KL + perceptual + PatchGAN terms.
Outputs:
- Prior samples, reconstructions, and loss/FID trends in `generator/outputs/samples/` and `generator/outputs/figures/`.
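
For reference, the composite objective described above can be written in the following general form. This is a sketch only: the symbols $\beta$, $\lambda_{\text{perc}}$, and $\lambda_{\text{adv}}$ are placeholder names for the KL, perceptual, and adversarial weights, and their actual values are not stated in this summary.

```latex
\mathcal{L}_{\text{VAE}}
  = \underbrace{\lVert x - \hat{x} \rVert_2^2}_{\text{MSE}}
  + \beta \, D_{\mathrm{KL}}\!\left( q_\phi(z \mid x) \,\Vert\, p(z) \right)
  + \lambda_{\text{perc}} \, \mathcal{L}_{\text{perc}}(x, \hat{x})
  + \lambda_{\text{adv}} \, \mathcal{L}_{\text{adv}}(\hat{x})
```

Here $\mathcal{L}_{\text{perc}}$ stands for a distance between VGG feature maps of $x$ and $\hat{x}$, and $\mathcal{L}_{\text{adv}}$ for the generator-side PatchGAN loss on the reconstruction.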
---
### Phase 4: DDPM evolution (schedule, target, width)
Goal:
- Improve diffusion quality via more modern design choices.
What was done:
- Sequential DDPM upgrades:
1. Baseline linear schedule + epsilon prediction
2. Cosine schedule
3. Cosine + v-prediction
4. Wider UNet/capacity increase
Findings:
- Schedule alone gave small gains.
- v-prediction produced the major improvement.
- Wider network improved further, at higher training cost.
Decision:
- Best DDPM setup: cosine schedule + v-prediction + wider backbone.
Outputs:
- Noise schedule visuals, progression grids, and best samples in `generator/outputs/figures/` and `generator/outputs/samples/`.
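
To make the schedule and target changes concrete, here is a minimal sketch of a cosine noise schedule and the v-prediction target, written for scalars so the algebra is easy to follow. It assumes the commonly used cosine form with a small offset `s = 0.008` and the definition `v = alpha * eps - sigma * x0`; the function names are hypothetical, and the exact formulation used in the notebooks may differ.

```python
import math


def cosine_alpha_bar(t, T, s=0.008):
    """Cumulative signal level alpha_bar(t) under a cosine schedule (assumed form)."""
    def f(u):
        return math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)


def _alpha_sigma(alpha_bar):
    # alpha scales the clean signal, sigma scales the noise; alpha^2 + sigma^2 = 1.
    return math.sqrt(alpha_bar), math.sqrt(max(0.0, 1.0 - alpha_bar))


def noisy_sample(x0, eps, alpha_bar):
    alpha, sigma = _alpha_sigma(alpha_bar)
    return alpha * x0 + sigma * eps


def v_target(x0, eps, alpha_bar):
    """Regression target under v-prediction."""
    alpha, sigma = _alpha_sigma(alpha_bar)
    return alpha * eps - sigma * x0


def x0_from_v(x_t, v, alpha_bar):
    """Recover the clean sample from x_t and a v estimate."""
    alpha, sigma = _alpha_sigma(alpha_bar)
    return alpha * x_t - sigma * v


def eps_from_v(x_t, v, alpha_bar):
    """Recover the noise from x_t and a v estimate."""
    alpha, sigma = _alpha_sigma(alpha_bar)
    return sigma * x_t + alpha * v


if __name__ == "__main__":
    ab = cosine_alpha_bar(400, 1000)
    x_t = noisy_sample(0.3, -1.2, ab)
    v = v_target(0.3, -1.2, ab)
    print(ab, x0_from_v(x_t, v, ab), eps_from_v(x_t, v, ab))
```

Because both the clean sample and the noise can be recovered from `v`, the same network output serves epsilon-style and x0-style sampling code.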
---
### Phase 5: Best-of-family final comparison
Goal:
- Fair head-to-head across the best GAN, VAE, and DDPM recipes.
- Conclude practical model choice using quality vs compute trade-offs.
What was done:
- Trained and evaluated the best recipes from phases 2-4 under the same pipeline constraints.
- Compared FID curves, final samples, progress snapshots, and interpolation behavior.
Main result:
- DDPM achieved the best quality (lowest FID in this project).
- GAN was close in quality but much faster in training/inference.
- VAE was fastest to train but clearly behind in final sample quality.
Practical interpretation:
- If absolute sample quality is primary: DDPM.
- If quality-speed balance is primary: GAN.
- If quick prototyping/low compute is primary: VAE.
Outputs:
- Final family samples and comparisons in `generator/outputs/samples/` and `generator/outputs/figures/`.
## 2) Evolution of Decisions Across Phases
1. Phase 0 showed baseline failure patterns and established motivation for targeted improvements.
2. Phase 1 showed that data preprocessing (especially alignment) is the foundation.
3. Phase 2 showed that the GAN quality breakthrough came from stability/capacity changes, not from the loss swap alone.
4. Phase 3 showed VAE quality improves strongly via loss composition.
5. Phase 4 showed diffusion gains were driven mostly by prediction target choice and then model width.
6. Phase 5 demonstrated final family ranking and trade-offs under common conditions.
## 3) What Is Already Well Covered
- Clear multi-phase narrative from baseline to final comparison.
- Systematic ablation mindset in each phase.
- Good use of saved artifacts (logs, figures, samples).
- Strong comparative storytelling in final phase (quality vs speed vs practicality).
## 4) Super-Detailed Missing / Should-Be-Included Section
This section is intentionally exhaustive. Every item below is designed to work with the models, checkpoints, samples, and logs that already exist.
### A. Evaluation and analysis gaps
1. Missing multi-metric evaluation beyond FID.
Should include:
- KID, Precision/Recall (for generative coverage vs fidelity), and optionally IS computed on the already-trained outputs.
- A short explanation of what each metric captures and where FID can be misleading.
2. No uncertainty/statistical significance around reported FID.
Should include:
- Bootstrap confidence intervals over the already generated sample sets.
- Mean ± std tables across repeated FID subsampling on the saved outputs.
3. Missing mode coverage/diversity analysis.
Should include:
- Precision-recall split for generative models.
- Cluster-level coverage checks using the generated samples already on disk.
- Nearest-neighbor distance plots for generated vs. training data.
4. Missing per-attribute quality analysis.
Should include:
- Analysis by pose, illumination, expression, and age bands using the existing samples.
- Generated-vs-real attribute distribution matching.
5. Missing metric protocol sensitivity analysis.
Should include:
- FID stability under different sample counts and bootstrap resampling.
- A clear explanation of why phase-to-phase absolute FID comparability can fail.
6. Missing human-perception validation.
Should include:
- A small blind ranking study using the already generated sample grids.
- A comparison between human preference and metric preference.
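
The bootstrap idea in item 2 can be sketched as follows. The helper resamples a saved set with replacement and recomputes a metric on each resample; it is deliberately metric-agnostic, so `metric_fn` would be the project's FID (or KID) routine applied to the resampled generated features. The name `bootstrap_ci` and the toy metric in the usage example are assumptions for illustration, not part of the existing code.

```python
import random
import statistics


def bootstrap_ci(items, metric_fn, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap: returns (mean, std, (low, high)) of metric_fn."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(items)
    scores = []
    for _ in range(n_boot):
        # Resample the saved items with replacement, keeping the set size.
        resample = [items[rng.randrange(n)] for _ in range(n)]
        scores.append(metric_fn(resample))
    scores.sort()
    low = scores[int((alpha / 2) * n_boot)]
    high = scores[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return statistics.fmean(scores), statistics.stdev(scores), (low, high)


if __name__ == "__main__":
    # Toy stand-in: per-sample scores and a mean as the "metric".
    toy_scores = [float(i % 17) for i in range(500)]
    mean, std, (low, high) = bootstrap_ci(toy_scores, statistics.fmean, n_boot=200)
    print(f"{mean:.3f} +/- {std:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

Note that resampling only the generated set captures sampling noise in the saved outputs; it does not capture variation across training runs or seeds.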
### B. Post-hoc experiment analysis gaps
1. Loss-weight behavior is not interpreted deeply enough.
Should include:
- A post-hoc explanation of how the chosen perceptual/adversarial weights affected the saved VAE outputs.
- A summary table of the observed trade-off across the completed runs, without proposing new training.
2. Family-specific preprocessing effects are not fully separated.
Should include:
- A careful read of how the locked aligned-64 pipeline interacts with each family's final samples.
- Visual comparisons that isolate preprocessing benefits already visible in the saved figures.
3. Hyperparameter conclusions are narrow.
Should include:
- A consolidated summary of which configurations already worked best and which were discarded.
- No new sweeps; only interpretation of the existing trained runs.
4. Generalization checks are missing.
Should include:
- Evaluation of the existing checkpoints on any available held-out or alternate data, if such data already exists.
- If no extra data exists, explicitly state that generalization was not tested.
5. Failure-case experiments are not explicitly catalogued.
Should include:
- A concise “negative results” subsection per phase with what failed and why, based only on the completed experiments.
### C. Reproducibility gaps
1. Seeds are not consistently documented.
Should include:
- A run-level seed log for the completed experiments.
2. Environment and hardware specs are missing from the notebook narrative.
Should include:
- GPU, CUDA, PyTorch, Python, and key package versions.
3. Config traceability could be clearer inside notebooks.
Should include:
- Printed key config values in each phase notebook.
- A direct link from each run name to its exact config JSON.
4. Checkpoint selection policy should be formalized.
Should include:
- A clear rule for when final EMA or best EMA is used and why.
5. A reproduction guide is missing from the notebooks folder.
Should include:
- Step-by-step commands to replay the notebooks and re-open the saved artifacts.
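
Items 1 to 3 could be covered by a small helper that each notebook calls once and prints. The sketch below collects Python, platform, and package versions with the standard library, and adds CUDA/GPU details only if PyTorch happens to be importable. The function names, the package list, and the record fields are assumptions for illustration; seeds and config paths must come from the completed runs themselves, since this summary does not list them.

```python
import json
import platform
from importlib import metadata


def collect_environment(packages=("torch", "torchvision", "numpy")):
    """Return a JSON-serializable snapshot of the software/hardware environment."""
    env = {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": {},
    }
    for name in packages:
        try:
            env["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            env["packages"][name] = None  # not installed in this environment
    try:
        import torch  # optional: only used to report CUDA/GPU details
        env["cuda"] = torch.version.cuda
        env["gpu"] = torch.cuda.get_device_name(0) if torch.cuda.is_available() else None
    except ImportError:
        env["cuda"] = None
        env["gpu"] = None
    return env


def run_record(run_name, seed, config_path):
    """Tie a run name to its seed, its config file, and the environment."""
    return {
        "run": run_name,
        "seed": seed,
        "config": config_path,
        "env": collect_environment(),
    }


if __name__ == "__main__":
    # Hypothetical run name and config path, shown only to illustrate the record shape.
    print(json.dumps(run_record("example_run", None, "path/to/config.json"), indent=2))
```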
### D. Practical deployment/evaluation gaps
1. Inference speed and memory profiling is incomplete.
Should include:
- Throughput, latency, and VRAM table for the already trained GAN/VAE/DDPM checkpoints.
2. Sample count vs. quality behavior is missing.
Should include:
- FID-vs-number-of-generated-samples curve using already saved samples or deterministic re-sampling from existing checkpoints.
3. Robustness/distribution shift testing is missing.
Should include:
- Corruption robustness tests (blur, noise, compression) applied to the existing outputs.
- Optional out-of-domain face evaluation if a suitable held-out dataset already exists.
4. Model selection guide should be more operational.
Should include:
- A decision table by target constraints: best quality, best latency, lowest compute burden, easiest analysis, and most stable outputs.
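
For the throughput and latency table in item 1, a simple timing harness is usually enough. The sketch below times any callable that generates one batch; `generate_fn` is a hypothetical wrapper around a loaded checkpoint's sampling routine. It assumes the callable blocks until generation is complete: for GPU models that means synchronizing the device inside the wrapper, otherwise the measured time is too short. VRAM measurement is framework-specific and is left out here.

```python
import statistics
import time


def benchmark(generate_fn, batch_size, warmup=2, repeats=5):
    """Median latency per batch and derived throughput for a sampling callable."""
    for _ in range(warmup):
        generate_fn(batch_size)  # warm-up calls are not timed
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        generate_fn(batch_size)
        times.append(time.perf_counter() - start)
    latency = statistics.median(times)
    throughput = batch_size / latency if latency > 0 else float("inf")
    return {"latency_s": latency, "images_per_s": throughput}


if __name__ == "__main__":
    # Dummy workload standing in for a real sampler.
    def fake_sampler(batch_size):
        return [sum(range(20000)) for _ in range(batch_size)]

    print(benchmark(fake_sampler, batch_size=16))
```

Using the median rather than the mean keeps a single slow call (for example, a cache miss) from dominating the reported latency.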
### E. Ethics and risk gaps
1. Dataset bias assessment is not included.
Should include:
- Demographic/attribute distribution report if labels are available.
- Generated distribution parity analysis against the real data.
2. Misuse and deepfake risk section is missing.
Should include:
- Clear misuse statement and mitigation suggestions.
3. Memorization/privacy leakage checks are missing.
Should include:
- A nearest-neighbor memorization audit and threshold-based discussion using the trained models' samples.
4. Responsible use guidance is absent.
Should include:
- Recommended and discouraged use cases in the summary/report.
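
The memorization audit in item 3 can be sketched as a nearest-neighbor search from each generated sample to the training set. The version below works on plain feature vectors with Euclidean distance; in practice the vectors would be embeddings (or downsampled pixels) of the saved samples and training images. The function name and the threshold are assumptions, and a sensible threshold has to be calibrated, for example against distances between held-out real images and the training set.

```python
import math


def nearest_neighbor_audit(generated, training, threshold):
    """For each generated vector, find its closest training vector and flag near-copies."""
    report = []
    for g_idx, g in enumerate(generated):
        # Brute-force search; fine for small audits, slow for large sets.
        distance, t_idx = min((math.dist(g, t), i) for i, t in enumerate(training))
        report.append({
            "generated": g_idx,
            "nearest_train": t_idx,
            "distance": distance,
            "flagged": distance <= threshold,
        })
    return report


if __name__ == "__main__":
    training = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
    generated = [(1.0, 1.0), (3.0, 0.0)]  # the first is an exact copy of a training point
    for row in nearest_neighbor_audit(generated, training, threshold=0.5):
        print(row)
```

Flagged pairs are best inspected visually side by side, since a small distance in feature space is a prompt for review rather than proof of memorization.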
### F. Documentation quality gaps
1. Mathematical objective definitions are incomplete in narrative form.
Should include:
- Formal equations for the VAE composite loss with explicit coefficients.
2. Architectural diagrams are missing.
Should include:
- Compact diagrams for the GAN, VAE, and DDPM best variants.
3. Troubleshooting guidance is missing.
Should include:
- Common failure patterns (loss explosion, collapse, OOM) and practical fixes that reflect what already happened in the project.
4. Literature baseline context is limited.
Should include:
- Comparison table versus well-known references, with protocol caveats.
## 5) Recommended Next-Step Priorities
### Priority 1 (fast and high impact)
1. Add bootstrap uncertainty bands and confidence intervals to the existing FID comparisons.
2. Add precision/recall and KID alongside FID for the current sample sets.
3. Add an explicit FID protocol box in all notebooks.
4. Add a short model selection guide and reproducibility/environment block.
### Priority 2 (medium effort, strong value)
1. Add a negative-results appendix and troubleshooting notes based on the completed runs.
2. Add inference throughput/VRAM benchmarking for the already trained checkpoints.
3. Add per-attribute and nearest-neighbor analysis using existing outputs.
### Priority 3 (larger effort, publication-level completeness)
1. Human preference study on the saved sample grids.
2. Fairness/bias and memorization audits.
3. Cross-dataset generalization analysis if another dataset already exists in the project environment.
## 6) Final Bottom-Line Conclusion
The notebook set tells a coherent and strong experimental story: baseline failures -> pipeline correction -> family-specific improvements -> final cross-family comparison. The final evidence shows a clear quality-speed trade-off: DDPM gives the best sample quality, GAN gives near-best quality with far better speed, and VAE remains useful when compute and iteration speed dominate.
Because no further training is planned, the most valuable remaining work is not new model fitting. It is post-hoc analysis of the models already trained: broader evaluation metrics, uncertainty estimates, robustness checks, memorization/privacy checks, and clearer documentation of protocol and limitations.