Building a Unified Benchmarking Pipeline for Computer Vision — Without Rewriting Code for Every Task
In the AgDetection project, we faced a common but painful challenge in Computer Vision engineering: every task — Classification, Detection, and Segmentation — comes with a different dataset format, different tooling, and its own set of metrics.
Running different models across different benchmarks becomes inconsistent, hard to reproduce, and almost impossible to compare fairly.
My goal was to transform this chaos into a single, structured, configurable pipeline that can execute any benchmark, on any model, in a predictable and scalable way.
Here’s how I designed and built the system.
1. The Problem: Three Tasks, Three Formats, Three Evaluation Flows
| Task | Data Format | Output | Metrics |
|---|---|---|---|
| Classification | Folder structure | Label index | Accuracy, F1 |
| Detection | COCO JSON / YOLO TXT | Bounding boxes | mAP |
| Segmentation | PNG masks | Pixel-level mask | IoU |
The result?
Non-uniform pipelines
Model-specific scripts
Benchmark-specific code paths
No consistent evaluation flow
To build a real benchmarking platform — not a set of scripts — we needed a unified execution model.
2. A Declarative Approach: One YAML Defines the Entire Benchmark
The first architectural decision was to move from hard-coded logic to a declarative configuration model.
Each benchmark is defined by a single YAML file specifying:
task type
dataset format
dataset paths
splits (train/val/test)
evaluation metrics
runtime parameters (device, batch size, etc.)
Example:
```yaml
task: detection
dataset:
  kind: coco
  root: datasets/fruit
  splits:
    val: val
eval:
  metrics: ["map50", "map"]
  device: auto
```
Why this matters
The YAML becomes the Single Source of Truth for the entire system.
And the biggest advantage:
⭐ Adding a new benchmark requires only creating a new YAML file.
No code changes. No new scripts. No duplicated logic.
This design choice directly enabled system extensibility.
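For instance, adding a hypothetical segmentation benchmark means writing one more YAML against the same schema; the dataset root and the "masks" kind below are invented purely for illustration:

```yaml
task: segmentation
dataset:
  kind: masks              # assumed adapter kind for PNG-mask datasets
  root: datasets/leaf_masks
  splits:
    val: val
eval:
  metrics: ["iou"]
  device: auto
```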
3. YAML Alone Isn’t Enough — Enter Pydantic AppConfig
YAML is flexible, but also fragile. A typo, a missing field, or an invalid value can break an evaluation in unpredictable ways.
To solve this, I built a strongly typed AppConfig layer based on Pydantic models.
The AppConfig performs:
✔ Deep validation
Types, allowed values, required fields, structural consistency.
✔ Normalization
Path resolution, default values, device handling, metric validation.
✔ Deterministic interpretation
Converting YAML → stable Python object.
✔ A clear contract between system components
DatasetAdapters, Runners, Metrics, and the UI all rely on the same structured config.
Example:
```python
from pathlib import Path
from typing import Dict, List

from pydantic import BaseModel


class DatasetConfig(BaseModel):
    kind: str
    root: Path
    splits: Dict[str, str]


class EvalConfig(BaseModel):
    metrics: List[str]
    device: str = "auto"
    batch_size: int = 16


class AppConfig(BaseModel):
    task: str
    dataset: DatasetConfig
    eval: EvalConfig
```
Why this is crucial
A correct AppConfig means the entire pipeline will behave predictably. An incorrect YAML is caught immediately — before any runner starts executing.
This layer is what makes the system stable at scale.
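In practice the flow looks roughly like the sketch below, building on the AppConfig models above. It assumes PyYAML for parsing, and the loader helper and the YAML file name are mine, not the project's:

```python
from pathlib import Path

import yaml  # PyYAML
from pydantic import ValidationError


def load_app_config(path: Path) -> AppConfig:
    """Parse a benchmark YAML and validate it into a typed AppConfig."""
    raw = yaml.safe_load(path.read_text())
    return AppConfig(**raw)  # raises ValidationError on missing or invalid fields


try:
    config = load_app_config(Path("benchmarks/fruit_detection.yaml"))
except ValidationError as err:
    # The run is rejected up front: no runner ever starts on a broken config.
    print(err)
```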
4. Unifying Inconsistent Formats: DatasetAdapters
Once the benchmark is defined and validated, the next challenge is handling incompatible dataset formats.
I built a modular DatasetAdapter layer to convert any dataset into a uniform iteration interface:
```python
for image, target in adapter:
    ...
```
Adapters include:
ClassificationFolderAdapter
CocoDetectionAdapter
YoloDetectionAdapter
MaskSegmentationAdapter
Each adapter:
reads the dataset
converts annotations to a normalized structure
exposes consistent outputs across tasks
This eliminates dozens of conditional branches and format-specific parsing.
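A minimal sketch of what this contract can look like: only the adapter names and the (image, target) iteration style come from the system itself; the Protocol definition, the folder-walking logic, and yielding image paths instead of decoded images are simplifying assumptions.

```python
from pathlib import Path
from typing import Any, Dict, Iterator, Protocol, Tuple


class DatasetAdapter(Protocol):
    """Uniform iteration contract: every adapter yields (image, target) pairs."""

    def __iter__(self) -> Iterator[Tuple[Path, Dict[str, Any]]]: ...
    def __len__(self) -> int: ...


class ClassificationFolderAdapter:
    """Adapter for an ImageNet-style layout: root/<class_name>/<image>."""

    def __init__(self, root: Path) -> None:
        self.classes = sorted(p.name for p in root.iterdir() if p.is_dir())
        self.samples = [
            (img, {"label": idx})                 # normalized target structure
            for idx, cls in enumerate(self.classes)
            for img in sorted((root / cls).iterdir())
            if img.is_file()
        ]

    def __iter__(self) -> Iterator[Tuple[Path, Dict[str, Any]]]:
        return iter(self.samples)

    def __len__(self) -> int:
        return len(self.samples)
```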
5. Task Runners: Executing Models Consistently Across Benchmarks
With datasets unified, the next layer is running models.
I implemented three modular Runners:
ClassifierRunner
DetectorRunner
SegmenterRunner
All runners share the same API:
```python
result = runner.run(dataset, model, config)
```
Each runner handles:
forward passes
output normalization
prediction logging
metric computation
artifact generation
real-time UI reporting
This design allows any model to run on any benchmark, provided the configuration matches.
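A rough sketch of that shared contract is below. The runner names and the run(dataset, model, config) signature are the system's; the base class, the RunResult shape, and the model.predict call are assumptions made for the example.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class RunResult:
    """Illustrative result object: metrics plus raw predictions for artifacts."""
    metrics: Dict[str, float]
    predictions: List[Any] = field(default_factory=list)


class TaskRunner(ABC):
    """Shared API: every runner executes a model over a dataset the same way."""

    @abstractmethod
    def run(self, dataset, model, config) -> RunResult: ...


class ClassifierRunner(TaskRunner):
    def run(self, dataset, model, config) -> RunResult:
        correct, total, predictions = 0, 0, []
        for image, target in dataset:      # uniform adapter interface
            pred = model.predict(image)    # assumed model API for this sketch
            predictions.append(pred)
            correct += int(pred == target["label"])
            total += 1
        accuracy = correct / max(total, 1)
        return RunResult(metrics={"accuracy": accuracy}, predictions=predictions)
```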
6. From Script to System: Client–Server Architecture
To support multiple users and parallel evaluations, the project evolved into a full Client–Server system.
Server responsibilities:
job scheduling
queue management
load balancing
artifact storage (MinIO)
model/version tracking
failure isolation
Client (PyQt) responsibilities:
uploading models
selecting benchmarks
configuring runs
real-time logs
comparing metrics across runs
downloading prediction artifacts
This architecture transformed the pipeline into a usable, scalable research tool.
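The sections above describe responsibilities rather than implementation, so the snippet below is only one possible shape for a job record and a queue-driven worker loop that isolates failures per job; every name in it is illustrative.

```python
import queue
import traceback
from dataclasses import dataclass


@dataclass
class BenchmarkJob:
    """What the client submits: which model to run against which benchmark YAML."""
    job_id: str
    model_uri: str       # e.g. an object key for the uploaded model in MinIO
    config_path: str     # the benchmark YAML selected in the client
    status: str = "queued"


def worker_loop(jobs: "queue.Queue[BenchmarkJob]", execute) -> None:
    """Pulls jobs off the queue and confines any failure to a single job."""
    while True:
        job = jobs.get()
        try:
            job.status = "running"
            execute(job)             # load config, build adapter, invoke runner
            job.status = "done"
        except Exception:
            job.status = "failed"    # one bad run never takes the server down
            traceback.print_exc()
        finally:
            jobs.task_done()
```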
7. Key Engineering Lessons Learned
Configuration should drive execution, not the other way around.
Strong validation (Pydantic) saves hours of debugging.
Adapters normalize complexity and prevent format-specific logic explosion.
Modular runners make task logic replaceable and easy to extend.
Incremental evaluation is essential for real-world datasets.
Client–Server separation turns a pipeline into a production-grade system.
Conclusion
By combining:
declarative YAML configuration
a strongly typed AppConfig layer
dataset normalization through adapters
modular runners
incremental computation
and a client–server architecture
I built a unified benchmarking pipeline capable of running any CV model on any benchmark — without writing new code for each task.
This approach provides stability, extensibility, and reproducibility — all essential qualities in a real-world evaluation system.