Building a Unified Benchmarking Pipeline for Computer Vision — Without Rewriting Code for Every Task
In the AgDetection project, we faced a common but painful challenge in Computer Vision engineering: every task — Classification, Detection, and Segmentation — comes with a different dataset format, different tooling, and its own set of metrics.
Running different models across different benchmarks becomes inconsistent, hard to reproduce, and almost impossible to compare fairly.
My goal was to transform this chaos into a single, structured, configurable pipeline that can execute any benchmark, on any model, in a predictable and scalable way.
Here’s how I designed and built the system.
1. The Problem: Three Tasks, Three Formats, Three Evaluation Flows
| Task | Data Format | Output | Metrics |
|---|---|---|---|
| Classification | Folder structure | Label index | Accuracy, F1 |
| Detection | COCO JSON / YOLO TXT | Bounding boxes | mAP |
| Segmentation | PNG masks | Pixel-level mask | IoU |
The result?
Non-uniform pipelines
Model-specific scripts
Benchmark-specific code paths
No consistent evaluation flow
To build a real benchmarking platform — not a set of scripts — we needed a unified execution model.
2. A Declarative Approach: One YAML Defines the Entire Benchmark
The first architectural decision was to move from hard-coded logic to a declarative configuration model.
Each benchmark is defined by a single YAML file specifying:
task type
dataset format
dataset paths
splits (train/val/test)
evaluation metrics
runtime parameters (device, batch size, etc.)
Example:
```yaml
task: detection
dataset:
  kind: coco
  root: datasets/fruit
  splits:
    val: val
eval:
  metrics: ["map50", "map"]
  device: auto
```
Why this matters
The YAML becomes the Single Source of Truth for the entire system.
And the biggest advantage:
⭐ Adding a new benchmark requires only creating a new YAML file.
No code changes. No new scripts. No duplicated logic.
This design choice directly enabled system extensibility.
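For instance, adding a hypothetical segmentation benchmark means writing one more YAML against the same schema; the dataset root and the "masks" kind below are invented purely for illustration:

```yaml
task: segmentation
dataset:
  kind: masks              # assumed adapter kind for PNG-mask datasets
  root: datasets/leaf_masks
  splits:
    val: val
eval:
  metrics: ["iou"]
  device: auto
```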
3. YAML Alone Isn’t Enough — Enter Pydantic AppConfig
YAML is flexible, but also fragile. A typo, a missing field, or an invalid value can break an evaluation in unpredictable ways.
To solve this, I built a strongly typed AppConfig layer based on Pydantic models.
The AppConfig performs:
✔ Deep validation
Types, allowed values, required fields, structural consistency.
✔ Normalization
Path resolution, default values, device handling, metric validation.
✔ Deterministic interpretation
Converting YAML → stable Python object.
✔ A clear contract between system components
DatasetAdapters, Runners, Metrics, and the UI all rely on the same structured config.
Example:
```python
from pathlib import Path
from typing import Dict, List

from pydantic import BaseModel


class DatasetConfig(BaseModel):
    kind: str
    root: Path
    splits: Dict[str, str]


class EvalConfig(BaseModel):
    metrics: List[str]
    device: str = "auto"
    batch_size: int = 16


class AppConfig(BaseModel):
    task: str
    dataset: DatasetConfig
    eval: EvalConfig
```
Why this is crucial
A correct AppConfig means the entire pipeline will behave predictably. An incorrect YAML is caught immediately — before any runner starts executing.
This layer is what makes the system stable at scale.
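In practice the flow looks roughly like the sketch below, building on the AppConfig models above. It assumes PyYAML for parsing, and the loader helper and the YAML file name are mine, not the project's:

```python
from pathlib import Path

import yaml  # PyYAML
from pydantic import ValidationError


def load_app_config(path: Path) -> AppConfig:
    """Parse a benchmark YAML and validate it into a typed AppConfig."""
    raw = yaml.safe_load(path.read_text())
    return AppConfig(**raw)  # raises ValidationError on missing or invalid fields


try:
    config = load_app_config(Path("benchmarks/fruit_detection.yaml"))
except ValidationError as err:
    # The run is rejected up front: no runner ever starts on a broken config.
    print(err)
```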
4. Unifying Inconsistent Formats: DatasetAdapters
Once the benchmark is defined and validated, the next challenge is handling incompatible dataset formats.
I built a modular DatasetAdapter layer to convert any dataset into a uniform iteration interface:
```python
for image, target in adapter:
    ...
```
Adapters include:
ClassificationFolderAdapter
CocoDetectionAdapter
YoloDetectionAdapter
MaskSegmentationAdapter
Each adapter:
reads the dataset
converts annotations to a normalized structure
exposes consistent outputs across tasks
This eliminates dozens of conditional branches and format-specific parsing.
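A minimal sketch of what this contract can look like: only the adapter names and the (image, target) iteration style come from the system itself; the Protocol definition, the folder-walking logic, and yielding image paths instead of decoded images are simplifying assumptions.

```python
from pathlib import Path
from typing import Any, Dict, Iterator, Protocol, Tuple


class DatasetAdapter(Protocol):
    """Uniform iteration contract: every adapter yields (image, target) pairs."""

    def __iter__(self) -> Iterator[Tuple[Path, Dict[str, Any]]]: ...
    def __len__(self) -> int: ...


class ClassificationFolderAdapter:
    """Adapter for an ImageNet-style layout: root/<class_name>/<image>."""

    def __init__(self, root: Path) -> None:
        self.classes = sorted(p.name for p in root.iterdir() if p.is_dir())
        self.samples = [
            (img, {"label": idx})                 # normalized target structure
            for idx, cls in enumerate(self.classes)
            for img in sorted((root / cls).iterdir())
            if img.is_file()
        ]

    def __iter__(self) -> Iterator[Tuple[Path, Dict[str, Any]]]:
        return iter(self.samples)

    def __len__(self) -> int:
        return len(self.samples)
```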
5. Task Runners: Executing Models Consistently Across Benchmarks
With datasets unified, the next layer is running models.
I implemented three modular Runners:
ClassifierRunner
DetectorRunner
SegmenterRunner
All runners share the same API:
```python
result = runner.run(dataset, model, config)
```
Each runner handles:
forward passes
output normalization
prediction logging
metric computation
artifact generation
real-time UI reporting
This design allows any model to run on any benchmark, provided the configuration matches.
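A rough sketch of that shared contract is below. The runner names and the run(dataset, model, config) signature are the system's; the base class, the RunResult shape, and the model.predict call are assumptions made for the example.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class RunResult:
    """Illustrative result object: metrics plus raw predictions for artifacts."""
    metrics: Dict[str, float]
    predictions: List[Any] = field(default_factory=list)


class TaskRunner(ABC):
    """Shared API: every runner executes a model over a dataset the same way."""

    @abstractmethod
    def run(self, dataset, model, config) -> RunResult: ...


class ClassifierRunner(TaskRunner):
    def run(self, dataset, model, config) -> RunResult:
        correct, total, predictions = 0, 0, []
        for image, target in dataset:      # uniform adapter interface
            pred = model.predict(image)    # assumed model API for this sketch
            predictions.append(pred)
            correct += int(pred == target["label"])
            total += 1
        accuracy = correct / max(total, 1)
        return RunResult(metrics={"accuracy": accuracy}, predictions=predictions)
```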
6. From Script to System: Client–Server Architecture
To support multiple users and parallel evaluations, the project evolved into a full Client–Server system.
Server responsibilities:
job scheduling
queue management
load balancing
artifact storage (MinIO)
model/version tracking
failure isolation
Client (PyQt) responsibilities:
uploading models
selecting benchmarks
configuring runs
real-time logs
comparing metrics across runs
downloading prediction artifacts
This architecture transformed the pipeline into a usable, scalable research tool.
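The sections above describe responsibilities rather than implementation, so the snippet below is only one possible shape for a job record and a queue-driven worker loop that isolates failures per job; every name in it is illustrative.

```python
import queue
import traceback
from dataclasses import dataclass


@dataclass
class BenchmarkJob:
    """What the client submits: which model to run against which benchmark YAML."""
    job_id: str
    model_uri: str       # e.g. an object key for the uploaded model in MinIO
    config_path: str     # the benchmark YAML selected in the client
    status: str = "queued"


def worker_loop(jobs: "queue.Queue[BenchmarkJob]", execute) -> None:
    """Pulls jobs off the queue and confines any failure to a single job."""
    while True:
        job = jobs.get()
        try:
            job.status = "running"
            execute(job)             # load config, build adapter, invoke runner
            job.status = "done"
        except Exception:
            job.status = "failed"    # one bad run never takes the server down
            traceback.print_exc()
        finally:
            jobs.task_done()
```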
7. Key Engineering Lessons Learned
Configuration should drive execution, not the other way around.
Strong validation (Pydantic) saves hours of debugging.
Adapters normalize complexity and prevent format-specific logic explosion.
Modular runners make task logic replaceable and easy to extend.
Incremental evaluation is essential for real-world datasets.
Client–Server separation turns a pipeline into a production-grade system.
Conclusion
By combining:
declarative YAML configuration
a strongly typed AppConfig layer
dataset normalization through adapters
modular runners
incremental computation
and a client–server architecture
I built a unified benchmarking pipeline capable of running any CV model on any benchmark — without writing new code for each task.
This approach provides stability, extensibility, and reproducibility — all essential qualities in a real-world evaluation system.