Problem: With a 94% negative class, random sampling sent about 94 non-defect images to annotation for every 6 defect images, the inverse of what the model needed. Rare defect sub-types had too few training examples, so the model was good at 'no defect' but poor at rare defects.
Solution: Uncertainty sampling flagged images whose predicted class probability fell between 0.4 and 0.6 for the next labelling batch. A diversity filter using k-means on image embeddings prevented near-duplicate frames from the same production run from dominating the batch. Argilla managed the labelling workflow, with labeller confidence captured as metadata. A calibration set of 500 gold-standard images monitored labeller quality: anyone below 85% agreement was flagged for retraining.
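The selection step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `select_batch`, its parameters, and the rule of keeping the image closest to each k-means centroid are all assumptions. The source only states that a 0.4–0.6 probability band and k-means on image embeddings were used.

```python
import numpy as np
from sklearn.cluster import KMeans


def select_batch(probs, embeddings, batch_size, low=0.4, high=0.6, seed=0):
    """Pick up to batch_size uncertain, mutually diverse images (hypothetical helper).

    probs:      (n,) predicted defect probability per unlabelled image
    embeddings: (n, d) image embeddings, e.g. from CLIP
    """
    probs = np.asarray(probs, dtype=float)
    embeddings = np.asarray(embeddings, dtype=float)

    # Uncertainty step: keep images whose probability lies in [low, high].
    candidates = np.flatnonzero((probs >= low) & (probs <= high))
    if len(candidates) <= batch_size:
        return candidates

    # Diversity step (assumed rule): cluster the candidates into batch_size
    # groups and keep the image nearest each centroid, so near-duplicate
    # frames compete for a single slot.
    km = KMeans(n_clusters=batch_size, n_init=10, random_state=seed)
    labels = km.fit_predict(embeddings[candidates])
    dists = np.linalg.norm(
        embeddings[candidates] - km.cluster_centers_[labels], axis=1
    )

    chosen = []
    for cluster in range(batch_size):
        members = np.flatnonzero(labels == cluster)
        if len(members):
            chosen.append(candidates[members[np.argmin(dists[members])]])
    return np.array(sorted(chosen))
```

Calling `select_batch(probs, embeddings, batch_size=500)` would return indices into the unlabelled pool. Setting the number of clusters equal to the batch size is one simple way to spread the batch across embedding space; other diversity rules would fit the same description.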
Technology: PyTorch · Argilla · CLIP embeddings · scikit-learn · Python
Optimisation pattern: random-sampling-to-uncertainty-plus-diversity-active-learning
Outcomes:
Labelling cost reduced 67%: the same accuracy was reached with 660K annotations instead of 2M. Rare defect class F1: 0.71 → 0.89. Time to production model: 14 weeks → 9 weeks. 3 labeller quality issues detected via calibration set monitoring.
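The calibration-set monitoring behind the last outcome could look something like the sketch below. It assumes agreement is the plain fraction of a labeller's calibration-image labels that match the gold label; the source does not say how agreement was computed. The function name `flag_labellers` and the tuple layout are hypothetical, and no Argilla API is used here.

```python
from collections import defaultdict


def flag_labellers(annotations, gold, threshold=0.85):
    """Return {labeller: agreement} for labellers below threshold (hypothetical helper).

    annotations: iterable of (labeller_id, image_id, label) tuples
    gold:        dict mapping calibration image_id -> gold-standard label
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for labeller, image_id, label in annotations:
        if image_id not in gold:
            continue  # only calibration images count towards agreement
        totals[labeller] += 1
        hits[labeller] += int(label == gold[image_id])

    # Assumed metric: simple match rate against the gold labels.
    agreement = {who: hits[who] / totals[who] for who in totals}
    return {who: rate for who, rate in agreement.items() if rate < threshold}
```

With the 85% threshold from the Solution, a labeller who matches 16 of 20 calibration images (80%) would be flagged, while one who matches 18 of 20 (90%) would not.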