Classifier notebooks

2026-05-14 16:20:33 +01:00
parent 9ae334410d
commit 2062a91985
734 changed files with 75472 additions and 1730 deletions
@@ -19,9 +19,9 @@
    "Notebook roadmap:\n",
    "1. `01_eda` maps sources, labels, balance, and leakage risks.\n",
    "2. `02_preprocessing` turns those risks into deterministic input handling and controlled augmentation choices.\n",
-    "3. `03_phase1_analysis` compares baseline models under one shared protocol.\n",
-    "4. `04_phase2_analysis` tests preprocessing/model ablations.\n",
-    "5. `05_gradcam_analysis` checks where trained models look using existing checkpoints.\n"
+    "3. `03_baselines` compares baseline models under one shared protocol.\n",
+    "4. `04_ablation_questions` tests preprocessing/model ablations.\n",
+    "5. `05_gradcam` checks where trained models look using existing checkpoints.\n"
   ]
  },
  {
@@ -14,7 +14,7 @@
    "\n",
    "Face crops are generated offline with `classifier/tools/facecrop.py`, producing `cropped/classifier/`. Training configs then choose either raw `data/` or the cropped directory through `data_dir`; face cropping is not hidden inside the transform pipeline.\n",
    "\n",
-    "Roadmap link: this notebook implements the defenses motivated by `01_eda`; `03_phase1_analysis` then asks which baseline model benefits under that fixed protocol.\n"
+    "Roadmap link: this notebook implements the defenses motivated by `01_eda`; `03_baselines` then asks which baseline model benefits under that fixed protocol.\n"
   ]
  },
  {
@@ -716,7 +716,7 @@
    "- augmentation is train-only and can either improve robustness or over-regularize;\n",
    "- normalization is tested as a color-shortcut diagnostic, while ImageNet/default normalization remains the standard pretrained-model default.\n",
    "\n",
-    "Next: `03_phase1_analysis.ipynb` uses this fixed protocol to compare SimpleCNN and pretrained ResNet18 baselines.\n"
+    "Next: `03_baselines.ipynb` uses this fixed protocol to compare SimpleCNN and pretrained ResNet18 baselines.\n"
   ]
  }
 ],
@@ -12,7 +12,7 @@
    "\n",
    "This is a controlled baseline, not a search for the final model. Phase 1 fixes the protocol so that later Phase 2 changes can be interpreted as ablations rather than confounded improvements.\n",
    "\n",
-    "Roadmap link: `01_eda` identified leakage and shortcut risks; `02_preprocessing` defined the input path; this notebook establishes the model baseline before `04_phase2_analysis` changes one design choice at a time.\n"
+    "Roadmap link: `01_eda` identified leakage and shortcut risks; `02_preprocessing` defined the input path; this notebook establishes the model baseline before `04_ablation_questions` changes one design choice at a time.\n"
   ]
  },
  {
@@ -829,7 +829,7 @@
    "\n",
    "Per-source pairwise AUC also shows the value of the pretrained backbone. ResNet18 is much more stable across fake sources, while SimpleCNN struggles especially on `wiki_vs_insight`. This justifies using ResNet18 as the main diagnostic model for Phase 2 preprocessing ablations, while still tracking SimpleCNN to check whether preprocessing effects generalize to a smaller architecture.\n",
    "\n",
-    "Next: `04_phase2_analysis.ipynb` keeps the same evidence discipline and tests resolution, normalization, facecrop, augmentation, and source-holdout behavior.\n"
+    "Next: `04_ablation_questions.ipynb` keeps the same evidence discipline and tests resolution, normalization, facecrop, augmentation, and source-holdout behavior.\n"
   ]
  }
 ],
@@ -17,7 +17,7 @@
    "- Does face crop help?\n",
    "- Does augmentation help or over-regularize?\n",
    "\n",
-    "Roadmap link: this notebook selects the best supported classifier setting from existing logs. `05_gradcam_analysis` checks where the selected and comparison models focus, then `06_phase3_model_family_analysis` tests stronger pretrained families and `07_phase4_data_scaling_analysis` records the scaling plan/status.\n"
+    "Roadmap link: this notebook selects the best supported classifier setting from existing logs. `05_gradcam` checks where the selected and comparison models focus, then `06_model_families` tests stronger pretrained families and `07_data_scaling` records the scaling plan/status.\n"
   ]
  },
  {
@@ -1315,7 +1315,7 @@
    "\n",
    "Report-ready decision: use 224x224 input, facecrop enabled, augmentation disabled for the current 20% setting, ImageNet/default normalization, and discuss source generalization as the main limitation.\n",
    "\n",
-    "Next: `05_gradcam_analysis.ipynb` inspects model focus qualitatively. The story then continues with `06_phase3_model_family_analysis.ipynb` for stronger backbones and `07_phase4_data_scaling_analysis.ipynb` for the data-scaling plan/status.\n"
+    "Next: `05_gradcam.ipynb` inspects model focus qualitatively. The story then continues with `06_model_families.ipynb` for stronger backbones and `07_data_scaling.ipynb` for the data-scaling plan/status.\n"
   ]
  }
 ],
@@ -19,7 +19,7 @@
    "- Do augmentation and source-holdout runs reveal instability in attention?\n",
    "- Are errors visually plausible, or do they suggest shortcut behavior?\n",
    "\n",
-    "Roadmap link: after this qualitative check, `06_phase3_model_family_analysis.ipynb` compares stronger pretrained backbones and `07_phase4_data_scaling_analysis.ipynb` records the planned data-scaling analysis.\n"
+    "Roadmap link: after this qualitative check, `06_model_families.ipynb` compares stronger pretrained backbones and `07_data_scaling.ipynb` records the planned data-scaling analysis.\n"
   ]
  },
  {
@@ -1701,7 +1701,7 @@
    "\n",
    "The key limitation from Phase 2 still stands: high in-distribution AUC does not guarantee source-agnostic generalization. The Grad-CAM panels help make that limitation visible, but the source-holdout pairwise AUC values are the primary quantitative evidence.\n",
    "\n",
-    "Next: `06_phase3_model_family_analysis.ipynb` asks whether stronger pretrained model families improve on the selected Phase 2 pipeline.\n"
+    "Next: `06_model_families.ipynb` asks whether stronger pretrained model families improve on the selected Phase 2 pipeline.\n"
   ]
  }
 ],
@@ -12,7 +12,7 @@
    "\n",
    "This notebook is evidence-only. It reads existing configs and logs, uses the saved log schema as the source of truth, and does not train, reevaluate, or invent missing results.\n",
    "\n",
-    "Roadmap link: `01_eda` -> `02_preprocessing` -> `03_phase1_analysis` -> `04_phase2_analysis` -> `05_gradcam_analysis` -> this model-family comparison -> `07_phase4_data_scaling_analysis`.\n"
+    "Roadmap link: `01_eda` -> `02_preprocessing` -> `03_baselines` -> `04_ablation_questions` -> `05_gradcam` -> this model-family comparison -> `07_data_scaling`.\n"
   ]
  },
  {
@@ -1311,7 +1311,7 @@
    "\n",
    "The report decision should therefore be: Phase 2 found the right input pipeline, and Phase 3 selects ResNet50 as the best classifier backbone for this task. ConvNeXt-Tiny is the AUC winner, but ResNet50 is the better detector when the full metric set is considered. The remaining caution is unchanged: source-wise and pairwise behavior must stay in the analysis because high global AUC alone does not prove source-agnostic generalization.\n",
    "\n",
-    "Next: `07_phase4_data_scaling_analysis.ipynb` turns from architecture choice to data scaling. It asks whether the selected top model families, especially ResNet50, improve further when trained with more of the available data.\n"
+    "Next: `07_data_scaling.ipynb` turns from architecture choice to data scaling. It asks whether the selected top model families, especially ResNet50, improve further when trained with more of the available data.\n"
   ]
  }
 ],
@@ -1878,9 +1878,9 @@
   "id": "3a55e559",
   "metadata": {},
   "source": [
-    "The 100% confusion matrices show that all three models are very close overall once they are trained with the full dataset. The deciding factor is the real-image error pattern: if the goal is to avoid labeling genuine images as fake, ConvNeXt-Tiny is the best practical choice because it has the lowest false positive rate and the best balanced accuracy. ResNet50 still has the strongest fake recall and macro F1, so it remains a strong alternative when missed fakes are the bigger concern. EfficientNet-B0 sits between them: strong and efficient, but not the top choice on either error profile.\n",
+    "The 100% confusion matrices show that all three models are very close overall once they are trained with the full dataset. The deciding factor is the pattern of false negatives (missed fakes): if the goal is to avoid missing fake images, ResNet50 is the best practical choice because it has the highest fake recall and the strongest macro F1. ConvNeXt-Tiny still has the lowest false positive rate and the best balanced accuracy, so it remains the safer choice when avoiding false alarms on real images is the priority. EfficientNet-B0 sits between them: strong and efficient, but not the top choice on either error profile.\n",
    "\n",
-    "For this project, ConvNeXt-Tiny is the best practical classifier once the full-dataset confusion matrices are considered. It matches the top group in overall performance while making the fewest mistakes on real images, which is the most important operational priority here. ResNet50 remains the strongest detector for catching fakes, but ConvNeXt-Tiny is the safer final choice when false positives on real images matter most."
+    "For this project, ResNet50 is the best practical classifier when the main operational priority is to minimize missed fakes (false negatives). It matches the top group in overall performance while giving the strongest fake detection metrics. ConvNeXt-Tiny remains the preferred alternative when false positives on real images are the biggest concern."
   ]
  },
  {
@@ -1894,7 +1894,7 @@
    "|---|---|---|\n",
    "| Data scale | Use 100% for the final Phase 4 setting | Every backbone improves from 20% to 50% to 100% across AUC, accuracy, and F1. |\n",
    "| Best AUC model | ConvNeXt-Tiny at 100% | Highest mean AUC: `0.9954`, with very tight fold stability. |\n",
-    "| Best practical detector | ConvNeXt-Tiny at 100% | Lowest false positive rate on real images and best balanced accuracy in the full-dataset confusion matrices. |\n",
+    "| Best practical detector (minimize missed fakes) | ResNet50 at 100% | Highest fake recall and strongest macro F1 at 100%, making it the preferred choice when false negatives are the main concern. |\n",
    "| Efficient model alternative | EfficientNet-B0 at 100% | Very close AUC `0.9949` with much smaller checkpoint size, but slightly weaker operating metrics. |\n",
    "| Remaining limitation | Source behavior still matters | `insight` remains the weakest fake source, even though scaling raises it to around `0.990` pairwise AUC. |"
   ]
@@ -1942,7 +1942,7 @@
    "\n",
    "Phase 4 shows that data scale is a real improvement lever. At 20%, the top backbones were already strong, and with 50% and 100% data they converge to very similar overall performance. ConvNeXt-Tiny remains the best AUC model at `0.9954`, while ResNet50 remains strongest on fake-catching thresholded metrics (accuracy/F1 and fake recall).\n",
    "\n",
-    "For the final practical decision, the confusion matrices are decisive: because this project prioritizes avoiding false alarms on real images, `p4_convnext_tiny_100pct` is the recommended classifier. It achieves top-tier overall performance while producing the lowest false positive rate and best balanced accuracy at 100% data. ResNet50 stays the preferred alternative when the priority is maximizing fake detection, and source-wise diagnostics remain important because `insight` is still the weakest fake source."
+    "For the final practical decision, the confusion matrices are decisive. Because this project prioritizes avoiding missed fakes (false negatives), `p4_resnet50_100pct` is the recommended classifier. It achieves top-tier fake detection metrics while staying competitive on overall performance. ConvNeXt-Tiny remains the preferred alternative when minimizing false positives on real images is the operational priority, and source-wise diagnostics remain important because `insight` is still the weakest fake source."
   ]
  }
 ],