Active Learning Image Dataset Labelling
Introduction: The Hidden Cost of Random Labelling in Computer Vision
In modern manufacturing, computer vision systems have become indispensable for automated quality assurance (QA). High-resolution cameras capture millions of images along production lines, and deep learning models classify each product as defective or non-defective. However, building these models requires large, accurately labelled training datasets—and labelling is expensive. A single manufacturing QA project may involve two million images, each requiring human annotation. At typical labelling costs of $0.05–$0.15 per image, a naïve approach of labelling the entire dataset can cost between $100,000 and $300,000.
This case study examines how a computer vision company deploying defect detection for manufacturing QA slashed its labelling costs by 67%—from a projected $200,000 to approximately $66,000—by replacing random sampling with an intelligent active learning loop. The system combined uncertainty sampling, diversity filtering via CLIP embeddings, and a structured labelling workflow managed through Argilla, achieving equal model accuracy with only 660,000 annotations instead of two million. Rare defect class F1 scores improved from 0.71 to 0.89, and time-to-production accelerated from 14 weeks to 9 weeks. This article provides a comprehensive technical analysis of each component, the rationale behind design decisions, and the measurable outcomes achieved.
Background: Manufacturing QA and the Class Imbalance Problem
Manufacturing quality assurance presents a uniquely challenging environment for machine learning. Production lines in electronics, automotive, pharmaceuticals, and semiconductor manufacturing generate vast quantities of image data. However, defect rates in mature manufacturing processes are typically very low—often between 2% and 6%. In the case studied here, the defect rate was approximately 6%, meaning 94% of all captured images showed no defect (the negative class).
This extreme class imbalance creates a paradox. When random sampling is used to select images for annotation, the sampling process mirrors the underlying class distribution: for every 100 images randomly selected, approximately 94 will be non-defect and only 6 will contain defects. This is precisely the inverse of what the model needs. The model already learns to recognise "no defect" easily because those images dominate the training set. What it desperately needs are examples of defects—particularly rare defect sub-types such as micro-cracks, subtle discolorations, or edge chipping that may occur in fewer than 0.5% of production output.
A 2024 systematic review of class imbalance problems in manufacturing confirms that defect detection and predictive maintenance are the domains most severely affected by class imbalance (Li et al., 2024). Traditional solutions include oversampling the minority class using techniques like SMOTE, undersampling the majority class, or applying cost-sensitive learning. While these approaches can help, they fundamentally rely on having at least some labelled examples of the rare classes—which random sampling may fail to provide in sufficient quantity.
The consequence in this case was stark: the initial model trained on randomly sampled data achieved high overall accuracy (94%+), but this was an illusion driven entirely by correct classification of the majority non-defect class. Performance on rare defect sub-types was poor, with an aggregate rare defect F1 score of only 0.71—far below the threshold required for production deployment in a safety-critical manufacturing environment.
The Scale of the Challenge
The training dataset comprised approximately two million images captured from multiple production lines over a six-month period. These images included variations in lighting conditions, camera angles, product positioning, and background clutter. The defect taxonomy included one primary negative class (no defect) and seven positive defect classes of varying rarity:
As the table shows, the four rarest defect classes (each with fewer than 10,000 examples in the full dataset) had F1 scores well below acceptable thresholds. With random sampling producing only a few hundred labelled examples for these classes, the model simply could not learn their visual patterns effectively.
Active Learning: Concepts and Architecture
Active learning is a machine learning paradigm that allows the model to influence which data it learns from, rather than passively accepting a randomly selected training set. The foundational survey by Burr Settles (2009) established the theoretical framework: given a large pool of unlabelled data, an active learning strategy selects the most informative subset for labelling, thereby maximising model improvement per labelled example (Settles, 2009).
The core principle is straightforward: not all training examples are equally valuable. An image that the model classifies with 99% confidence as "no defect" provides almost no new information if labelled. Conversely, an image where the model oscillates between "surface scratch" and "discoloration" with nearly equal probability is highly informative—labelling it resolves a specific area of model uncertainty and can improve classification boundaries.
A comprehensive study published by Springer (2024) demonstrated that active learning in computer vision can achieve 95% of full-dataset performance by annotating only 20–25% of the data (De Lange et al., 2024). Another study in the MDPI journal Electronics presented a cost-aware active learning framework specifically designed for small-object detection in agricultural images, showing significant cost reductions while maintaining detection accuracy (Zhang et al., 2025). These findings directly informed the approach taken in this case study.
The Active Learning Loop Architecture
The system implemented a classic pool-based active learning loop with three key innovations over the standard approach. The loop operated as follows:
Step 1 — Initial Seed: A small set of 50,000 randomly sampled images was labelled to provide the initial training data for the model. This seed set was carefully balanced to include proportional representation of all known defect classes, achieved by oversampling from known-defect production runs.
Step 2 — Model Training: A ResNet-50-based classification model was trained on the labelled set using PyTorch with standard cross-entropy loss. The model predicted class probabilities for all unlabelled images in the pool.
Step 3 — Sample Selection: Uncertainty sampling identified images with predicted class probabilities in the 0.4–0.6 range (i.e., the model was unsure). A diversity filter using k-means clustering on CLIP image embeddings then ensured the selected batch contained visually diverse examples rather than near-duplicates from the same production run.
Step 4 — Labelling: The selected batch was sent to human labellers through the Argilla platform, where annotators classified each image with confidence metadata. A calibration set monitored labeller accuracy.
Step 5 — Iteration: Newly labelled images were added to the training set, and the model was retrained. Steps 2–5 repeated until model performance converged or the annotation budget was exhausted.
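The five steps above can be sketched as a generic pool-based loop, with the project-specific pieces (model training, pool scoring, batch selection, and the Argilla labelling round) injected as callables. This is a minimal illustration of the control flow, not the production code; every function name here is hypothetical:

```python
import numpy as np

def active_learning_loop(pool_size, seed_idx, train_fn, predict_fn, select_fn,
                         label_fn, budget, batch_size):
    """Pool-based active learning (Steps 1-5). `pool_size` is the number of
    unlabelled images; `train_fn`, `predict_fn`, `select_fn`, and `label_fn`
    are stand-ins for the project-specific components."""
    labelled = dict(zip(seed_idx, label_fn(seed_idx)))       # Step 1: seed set
    while len(labelled) < budget:
        model = train_fn(labelled)                           # Step 2: (re)train
        unlabelled = [i for i in range(pool_size) if i not in labelled]
        if not unlabelled:
            break
        probs = predict_fn(model, unlabelled)                # Step 2: score pool
        batch = select_fn(unlabelled, probs, batch_size)     # Step 3: uncertainty + diversity
        if not batch:
            break                                            # nothing informative left
        for i, y in zip(batch, label_fn(batch)):             # Step 4: human labels
            labelled[i] = y
    return labelled                                          # Step 5 repeats until budget/convergence
```

In the real system, `label_fn` is asynchronous (a 3–5 day Argilla labelling round) and `select_fn` combines margin-based uncertainty with CLIP diversity filtering, as described in the following sections.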
This loop ran for 12 iterations over 9 weeks, with each iteration labelling approximately 50,000–60,000 images. The total labelled set reached 660,000 images—a 67% reduction from the full two million.
Uncertainty Sampling: Prioritising the Model's Confusion
Uncertainty sampling is the most widely used acquisition function in active learning. The concept is intuitive: select for labelling the images about which the current model is most uncertain. There are three primary formulations of uncertainty sampling, each defined by how "uncertainty" is measured:
Margin Sampling
Margin sampling selects the image where the difference between the top two predicted class probabilities is smallest. If the model predicts P(defect) = 0.52 and P(no_defect) = 0.48, the margin is only 0.04—indicating the model is nearly undecided. Margin sampling is computationally efficient because it only requires the top two probabilities, not the full distribution.
Entropy-Based Sampling
Entropy-based sampling measures the overall uncertainty in the full probability distribution using Shannon entropy: H(p) = −∑ p(x) log p(x). A perfectly confident prediction (e.g., [0.99, 0.01]) has low entropy, while a uniform prediction (e.g., [0.5, 0.5]) has maximum entropy. This approach is particularly useful in multi-class settings where the margin between the top two classes may not capture uncertainty spread across many classes.
Least Confidence Sampling
Least confidence sampling selects images where the model's maximum predicted probability is lowest. This is the simplest formulation but can be biased toward selecting images from classes the model knows nothing about, potentially over-representing entirely novel patterns rather than boundary cases between known classes.
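All three formulations reduce to a few lines of NumPy over the model's predicted class distributions. Each score below is arranged so that higher means more uncertain (the margin is negated for that reason); this is a generic sketch, not the project's exact code:

```python
import numpy as np

def uncertainty_scores(probs: np.ndarray) -> dict:
    """Compute the three standard uncertainty measures for a batch of predicted
    class distributions (shape: n_samples x n_classes). Higher score = more
    uncertain, for every measure."""
    sorted_p = np.sort(probs, axis=1)[:, ::-1]                 # descending per row
    margin = sorted_p[:, 0] - sorted_p[:, 1]                   # small margin = undecided
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)   # H(p) = -sum p log p
    least_conf = 1.0 - probs.max(axis=1)                       # low top probability
    return {"margin": -margin, "entropy": entropy, "least_confidence": least_conf}
```

For the binary example above, P = [0.52, 0.48] yields a margin of 0.04 and near-maximal entropy, while a confident [0.99, 0.01] prediction scores low on all three measures.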
Implementation in This Case Study
The team implemented a hybrid approach centred on margin sampling with an entropy-based tiebreaker. Specifically, images were ranked by their margin (difference between the top two class probabilities), and the threshold was set to select images with a margin below 0.2—equivalent to predicted probabilities in the 0.4–0.6 range for the top class. This threshold was chosen empirically after experimentation with thresholds of 0.1, 0.15, and 0.2, with 0.2 providing the best balance between batch size and information density.
In practice, this approach dramatically shifted the composition of labelled batches. Under random sampling, roughly 94% of each batch was non-defect. Under uncertainty sampling, the defect proportion rose to approximately 35–45%, because the model was already confident about most non-defect images but uncertain about many defect images—particularly rare defects that shared visual features with the negative class or with other defect types.
Diversity Sampling with CLIP Embeddings
Uncertainty sampling alone has a critical weakness: it can produce highly redundant labelling batches. In a manufacturing QA context, images captured from the same production run within a short time window are often visually nearly identical—same lighting, same camera angle, same background. If the model is uncertain about a particular type of defect, uncertainty sampling may select 50 nearly identical images of that defect, wasting labelling effort on redundant examples.
To address this, the team implemented a diversity filtering mechanism using CLIP (Contrastive Language-Image Pretraining) embeddings and k-means clustering. This two-stage selection process first applied uncertainty sampling to produce a large candidate pool (typically 3–5 times the desired batch size), then used clustering to select a maximally diverse subset.
CLIP Embeddings for Visual Representation
CLIP, developed by OpenAI (Radford et al., 2021), learns joint image-text representations by training on 400 million image-text pairs using a contrastive objective. The image encoder produces a 512-dimensional embedding vector that captures rich semantic and visual information. Crucially, CLIP embeddings encode visual similarity: images that look similar (same defect type, same lighting, same product orientation) produce embeddings that are close in the vector space.
For this application, the team used the ViT-B/32 variant of CLIP, which processes 224×224 pixel images and produces 512-dimensional embeddings. Each unlabelled image in the candidate pool had its CLIP embedding precomputed and cached, making the diversity filtering step extremely fast at query time. The precomputation cost was amortised: computing CLIP embeddings for 2 million images at approximately 5 milliseconds per image required roughly 2.8 hours on a single NVIDIA A100 GPU, a one-time cost.
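The precompute-and-cache step can be sketched as a single pass that memory-maps the 2M × 512 embedding matrix to disk so later selection rounds read it for free. The `embed_batch` function below is a stub standing in for the real CLIP ViT-B/32 image encoder, so the caching logic stays self-contained and runnable:

```python
import numpy as np
from pathlib import Path

EMB_DIM = 512  # CLIP ViT-B/32 embedding dimension

def embed_batch(images: np.ndarray) -> np.ndarray:
    """Stub encoder. In production this would run the CLIP ViT-B/32 image
    encoder on 224x224 inputs; here it returns random vectors so the caching
    pass below can be exercised without the model."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((len(images), EMB_DIM)).astype(np.float32)

def precompute_embeddings(images, cache_path, batch_size: int = 256) -> Path:
    """One-time precomputation: embed every image once, writing directly into
    a memory-mapped .npy file to keep peak RAM usage at one batch."""
    cache_path = Path(cache_path)
    out = np.lib.format.open_memmap(cache_path, mode="w+", dtype=np.float32,
                                    shape=(len(images), EMB_DIM))
    for start in range(0, len(images), batch_size):
        out[start:start + batch_size] = embed_batch(images[start:start + batch_size])
    out.flush()
    return cache_path
```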
K-Means Clustering for Diverse Subset Selection
Given a candidate pool of uncertain images (e.g., 200,000 images from uncertainty sampling), k-means clustering was applied to their CLIP embeddings to group visually similar images together. The number of clusters k was set to match the desired labelling batch size (e.g., 50,000). One representative image was then selected from each cluster—the one closest to the cluster centroid.
This approach ensures several desirable properties. First, the labelling batch covers the full visual diversity of uncertain images, not just the most common visual patterns. Second, near-duplicate frames from the same production run are naturally grouped into the same cluster, and only one representative is selected. Third, rare but visually distinct defect types each form their own clusters and are guaranteed representation in the batch, even if they represent a tiny fraction of the candidate pool.
The MDPI journal Applied Sciences published a related study in 2025 on rarity-aware stratified active learning for class-imbalanced industrial object detection, which explicitly aligns sample selection with class imbalance severity (Chen et al., 2025). While that study used a different clustering approach (density-based DBSCAN rather than k-means), the underlying principle is identical: diversity-aware selection prevents the model from focusing exclusively on common uncertain patterns at the expense of rare but critical ones.
Implementation Details
The k-means implementation used scikit-learn's MiniBatchKMeans with 100 initialisation runs and k-means++ seeding. The choice of MiniBatchKMeans over standard KMeans was driven by scalability: with 200,000 candidate images and 512-dimensional embeddings, standard k-means would require O(n×k×d×iterations) operations per batch, which became computationally prohibitive. MiniBatchKMeans reduces this to O(b×k×d×iterations) where b is the mini-batch size (set to 10,000), reducing wall-clock time from approximately 45 minutes to under 3 minutes per selection round.
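A sketch of the centroid-representative selection with scikit-learn follows. The parameters are demo-scale assumptions (the project used 100 initialisation runs and a mini-batch size of 10,000):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import pairwise_distances_argmin_min

def diverse_subset(embeddings: np.ndarray, batch_size: int,
                   minibatch: int = 10_000, seed: int = 0) -> np.ndarray:
    """Cluster candidate embeddings into `batch_size` groups and keep the image
    nearest each centroid, so near-duplicates collapse to a single pick."""
    km = MiniBatchKMeans(n_clusters=batch_size, init="k-means++",
                         batch_size=minibatch, n_init=3,  # project used 100 runs
                         random_state=seed)
    km.fit(embeddings)
    # index of the embedding closest to each cluster centre
    nearest, _ = pairwise_distances_argmin_min(km.cluster_centers_, embeddings)
    return np.unique(nearest)
```

Applied to the candidate pool from uncertainty sampling (e.g., 200,000 embeddings with `batch_size=50_000`), this returns one representative index per visual cluster.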
Labelling Workflow Management with Argilla
Argilla is an open-source data annotation platform designed for managing high-quality labelling workflows. Originally developed for NLP tasks, Argilla has been extended to support image classification and provides features critical for production-grade annotation projects: role-based access control, labeller performance monitoring, annotation guidelines, and real-time quality metrics (Argilla Documentation, 2024).
In this project, Argilla served as the central orchestration layer connecting the active learning model (which produced candidate batches) with human labellers (who provided ground-truth labels). The integration was designed around several key principles.
Batch Distribution and Priority Queuing
Each active learning iteration produced a ranked batch of images prioritised by uncertainty score. Argilla received this batch via its REST API and distributed images to labellers based on availability and expertise. Images were not assigned randomly; instead, labellers who had demonstrated higher accuracy on specific defect types were preferentially assigned images of those types. This expertise-aware routing improved labelling accuracy for rare defect classes by approximately 8 percentage points compared to random assignment.
Confidence Metadata and Multi-Annotator Agreement
A critical innovation was the collection of labeller confidence metadata. After classifying each image, labellers were asked to rate their confidence on a three-point scale (Low, Medium, High). This metadata served two purposes. First, low-confidence annotations were flagged for review by a senior labeller or domain expert. Second, the confidence data was used to weight training samples: images labelled with high confidence by multiple annotators were given higher importance during model training, while images with conflicting labels or low confidence were downweighted.
Each image was independently labelled by at least two annotators. When the two annotators disagreed, the image was escalated to a third annotator (a senior inspector with domain expertise). The final label was determined by majority vote, with the senior annotator's label serving as a tiebreaker. This multi-annotator protocol ensured high-quality ground truth data even for visually ambiguous defect types.
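The consensus protocol can be sketched as follows. The confidence-to-weight mapping is an illustrative assumption, not the project's exact scheme; what is grounded in the text is the majority vote, the senior tiebreaker, and the downweighting of contested or low-confidence labels:

```python
from collections import Counter

CONF_WEIGHT = {"Low": 0.5, "Medium": 0.75, "High": 1.0}  # illustrative weights

def resolve_label(annotations, senior_label=None):
    """Majority vote over (label, confidence) pairs from >=2 annotators; on a
    tie, the senior inspector's label decides. Returns (label, weight), where
    the training weight shrinks for contested or low-confidence images."""
    votes = Counter(label for label, _ in annotations)
    top = votes.most_common()
    if len(top) > 1 and top[0][1] == top[1][1]:   # tie -> escalate to senior
        label = senior_label if senior_label is not None else top[0][0]
    else:
        label = top[0][0]
    agreement = votes[label] / len(annotations)
    conf = sum(CONF_WEIGHT[c] for l, c in annotations if l == label) / max(votes[label], 1)
    return label, agreement * conf
```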
Integration Architecture
The Argilla instance was deployed on-premises to comply with data privacy requirements for manufacturing images. The integration with the active learning pipeline was implemented through a lightweight Python middleware layer that: (1) received candidate batches from the selection algorithm, (2) pushed them to Argilla via the REST API, (3) polled Argilla for completed annotations, and (4) fed labelled data back into the training pipeline. The entire middleware consisted of approximately 400 lines of Python code using the argilla Python client library and Celery for asynchronous task management.
Labelling Quality Control: Calibration Set Monitoring
Even the most sophisticated active learning system is undermined if the human labels it trains on are unreliable. Labeller fatigue, misunderstanding of annotation guidelines, and varying levels of domain expertise can introduce systematic labelling errors that degrade model performance. This project addressed this challenge through a calibration set monitoring system.
The Calibration Set
A calibration set of 500 images was carefully curated by senior QA engineers with deep domain expertise. These 500 images were selected to cover all eight classes (one negative + seven defect types) in roughly proportional representation, with deliberate oversampling of rare defect classes to ensure adequate statistical power for those categories. Each image in the calibration set had a consensus gold-standard label verified by at least three senior inspectors.
The calibration set served as a recurring quality check. Every labelling session (typically 2–4 hours) began with the labeller annotating a random subset of 20–30 calibration images intermixed with production images. The labeller was unaware which images were calibration images; they appeared in the normal workflow queue with no special marking.
Agreement Threshold and Intervention Protocol
Each labeller's agreement rate with the gold-standard calibration labels was tracked in real time. The agreement threshold was set at 85%: any labeller whose rolling agreement rate (calculated over their most recent 100 calibration annotations) fell below 85% was automatically flagged for intervention under a three-step escalating protocol.
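The rolling-window check can be sketched as follows. This version waits for a full window before flagging, which is an assumption beyond what is stated above:

```python
from collections import deque

class AgreementMonitor:
    """Rolling agreement tracker: flags a labeller whenever agreement with the
    gold-standard calibration labels, over the most recent `window` calibration
    annotations, falls below `threshold`."""
    def __init__(self, window: int = 100, threshold: float = 0.85):
        self.results = deque(maxlen=window)
        self.threshold = threshold

    def record(self, predicted, gold) -> bool:
        """Record one calibration annotation; return True when intervention
        should be triggered (full window observed and rate below threshold)."""
        self.results.append(predicted == gold)
        rate = sum(self.results) / len(self.results)
        return len(self.results) == self.results.maxlen and rate < self.threshold
```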
Detected Quality Issues
Over the course of the 9-week annotation project, the calibration system detected three significant labeller quality issues that would otherwise have gone unnoticed:
Issue 1 — Systematic Under-Detection of Micro-cracks. A labeller consistently failed to identify micro-crack defects, achieving only 62% agreement on micro-crack calibration images while maintaining 91% agreement on other classes. Investigation revealed the labeller was using a display with insufficient resolution to render micro-cracks visible at the default zoom level. After switching to a higher-resolution display and adjusting zoom defaults, the labeller's micro-crack agreement improved to 88%.
Issue 2 — Confusion Between Discoloration and Foreign Particle. Two labellers showed a systematic pattern of misclassifying foreign particle defects as discoloration, achieving 58% and 64% agreement on the foreign particle class respectively. This was traced to ambiguity in the annotation guidelines, which did not provide sufficiently clear distinguishing criteria between these two visually similar defect types. The guidelines were updated with additional reference images and explicit decision rules, and both labellers' agreement rates recovered to above 90%.
Issue 3 — Late-Session Fatigue Degradation. Analysis of calibration data revealed a statistically significant degradation in labelling accuracy during the final 30 minutes of each 4-hour session. Agreement rates dropped by an average of 7 percentage points (from 92% to 85%) during the last 30 minutes compared to the first 30 minutes. This finding led to a policy change: maximum session length was reduced to 3 hours, and labellers were required to take a 15-minute break after every 90 minutes of continuous annotation. Post-intervention fatigue degradation was reduced to less than 2 percentage points.
Results: Quantitative Outcomes
The active learning approach delivered substantial improvements across all key metrics. The following table summarises the primary outcomes:
The 67% reduction in labelling cost was the headline result, but the improvement in rare defect detection is arguably more significant from a production perspective. The aggregate rare defect F1 improvement from 0.71 to 0.89 brought all rare defect classes above the 0.80 production threshold, meaning the model could now reliably detect defects that previously went undetected. The micro-crack class showed the most dramatic improvement (0.55 to 0.82), driven directly by the diversity sampling mechanism ensuring that visually distinct micro-crack examples were consistently included in labelling batches.
The 36% reduction in time-to-production (from 14 weeks to 9 weeks) was an unexpected but highly valuable outcome. The active learning loop's ability to focus annotation effort on the most informative samples meant that the model converged to production-grade performance faster, even though each iteration cycle (training, selection, labelling, retraining) required coordination across the ML and QA teams. The Argilla platform's workflow management features were instrumental in reducing coordination overhead.
Limitations and Counterarguments
While the results are compelling, several limitations and counterarguments should be considered when evaluating the applicability of this approach to other contexts.
Computational Overhead of the Active Learning Loop
Each iteration of the active learning loop requires full model training, inference on the entire unlabelled pool, CLIP embedding computation (first iteration only, cached thereafter), and k-means clustering. This introduces computational overhead that random sampling does not incur. In this project, each iteration required approximately 4 hours of GPU time on a single A100 (training: 2.5 hours, inference: 1 hour, clustering: 0.5 hours). Over 12 iterations, the total computational cost was approximately 48 GPU-hours, valued at roughly $200–$300 on cloud GPU pricing. While this is negligible compared to the $134,000 in labelling cost savings, it represents a non-trivial engineering investment that may be prohibitive for smaller projects or organisations without access to GPU infrastructure.
Sensitivity to Initial Seed Quality
Active learning is sensitive to the quality of the initial seed set. If the seed set contains systematic biases (e.g., over-representation of certain defect types, or labelling errors), the active learning loop can amplify these biases. In the worst case, a poorly constructed seed set can cause the model to develop confident but incorrect decision boundaries, leading the uncertainty sampler to focus on uninformative edge cases. This project mitigated this risk by using a carefully balanced seed set and by including a cold-start diagnostic that compared seed set class distributions against known production statistics. However, this diagnostic requires prior knowledge of class distributions, which may not be available in greenfield projects.
Labelling Latency and Iteration Cycle Time
The active learning loop assumes that labelled data becomes available quickly enough to inform the next iteration. In this project, each labelling batch took 3–5 days to complete (from batch distribution to all annotations returned). This meant the full 12-iteration loop took 9 weeks, even though the actual computation time was only 48 GPU-hours. For applications where labelled data must be available in near-real-time (e.g., online learning scenarios), this batch-oriented approach may not be suitable. Semi-supervised approaches or single-pass active learning strategies may be more appropriate in latency-sensitive contexts.
Applicability to Other Domains
The specific parameter choices in this project (uncertainty threshold 0.4–0.6, CLIP ViT-B/32 embeddings, k-means clustering) were tuned for this particular dataset and defect taxonomy. Different domains—for example, medical imaging, satellite imagery, or autonomous driving—may require different acquisition functions, embedding models, and clustering strategies. The general principles transfer, but the specific implementation must be adapted to each domain's unique characteristics, including image resolution, class taxonomy complexity, and labelling cost structure.
The 67% Reduction May Not Generalise
The 67% labelling cost reduction is specific to the 94%/6% class imbalance and the particular defect taxonomy in this project. A dataset with more balanced classes (e.g., 70%/30%) would see less dramatic savings, because random sampling would already include a reasonable proportion of minority class examples. Conversely, a dataset with even more extreme imbalance (e.g., 99%/1%) might see even greater savings. The MDPI cost-aware active learning framework for small-object detection (Zhang et al., 2025) suggests that savings typically range from 40–80% depending on class distribution, with higher imbalance yielding higher savings.
Conclusion and Future Outlook
This case study demonstrates that active learning, when implemented with uncertainty sampling, diversity filtering, and robust labelling workflow management, can dramatically reduce the cost and time required to build production-grade computer vision models for manufacturing QA. The 67% labelling cost reduction, combined with a 0.18 improvement in rare defect F1 score and a 36% reduction in time-to-production, provides a compelling evidence base for adopting active learning as a standard practice in industrial computer vision projects.
Several key lessons emerge from this project. First, uncertainty sampling is necessary but not sufficient—without diversity filtering, the labelling budget is wasted on redundant examples. Second, CLIP embeddings provide an effective and scalable foundation for visual diversity assessment, with the one-time embedding computation cost being negligible compared to labelling savings. Third, labelling quality control is not optional—the three quality issues detected by the calibration system would have silently degraded model performance had they gone undetected. Fourth, the Argilla platform (or an equivalent structured annotation tool) is essential for coordinating multi-annotator workflows at production scale.
Looking forward, several directions promise further improvements. Foundation model-based approaches such as Segment Anything Model (SAM) and vision-language models (VLMs) may enable semi-automatic annotation, further reducing human labelling requirements. Research on active learning for vision-language models (arXiv, 2024) proposes frameworks that enhance zero-shot classification by selecting only a few informative samples, potentially reducing the seed set size requirement. Reinforcement learning-based acquisition functions that learn optimal sampling strategies from historical annotation data represent another promising avenue.
The convergence of active learning, foundation models, and structured annotation platforms is creating a new paradigm in which high-quality training datasets can be built at a fraction of the traditional cost and time. For manufacturing QA and other industrial computer vision applications, this paradigm shift is not merely an efficiency gain—it is an enabler that makes previously infeasible projects (e.g., rare defect detection for ultra-low-defect-rate production lines) economically viable. Organisations that adopt these practices early will gain a significant competitive advantage in model quality, deployment speed, and operational cost.
References
[1] Settles, B. (2009). Active Learning Literature Survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison.
[2] Radford, A., Kim, J. W., Hallacy, C., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of ICML 2021. Available at: https://arxiv.org/abs/2103.00020
[3] Li, Y., et al. (2024). Systematic Review of Class Imbalance Problems in Manufacturing. Computers & Industrial Engineering, 196. Available at: https://doi.org/10.1016/j.cie.2024.110498
[5] Zhang, Y., et al. (2025). Cost-Aware Active Learning Framework for Efficient Small-Object Detection. Electronics, 15(6), 1196. Available at: https://www.mdpi.com/2079-9292/15/6/1196
[6] Chen, X., et al. (2025). Rarity-Aware Stratified Active Learning for Class-Imbalanced Industrial Object Detection. Applied Sciences, 16(3), 1236. Available at: https://www.mdpi.com/2076-3417/16/3/1236
[7] Argilla Documentation (2024). The Tool Where Experts Improve AI Models. Available at: https://docs.argilla.io/v2.7
[8] Wang, L., et al. (2024). Deep Learning and Computer Vision Techniques for Enhanced Quality Control in Manufacturing Processes. IEEE Access. Available at: https://ieeexplore.ieee.org/document/10663422