Last week I ended with a dramatic cliffhanger: “We need metrics!”
And then immediately regretted it, because “metrics” is one of those words that sounds simple until you realize you now have to measure things consistently and over time. And then look at the numbers even when you don’t like them.
But here we are, and Wykra officially has observability. We went with the classic open-source trio - Prometheus, Alertmanager, Grafana - the monitoring starter pack everyone eventually ends up with when the fun part is over and your project starts behaving like something people might actually rely on.
Before we get into that, a quick reminder for anyone who already lost the plot: Wykra is our AI agent that discovers and analyses influencers using Bright Data scraping and a couple of LLMs stitched together into one workflow. That’s the thing we’ve been building week by week, sometimes actually making progress, sometimes just banging our heads against the wall, but still moving forward.
What We Actually Added This Week
If you want the full setup - how to run everything, where the dashboards live, how to scrape the metrics - it’s all in the README.
Here I just want to show the main things we can finally measure and explain what’s actually doing the measuring. We use three tools, each with a very clear job:
Prometheus - our metrics store and query engine.
It fetches data from our API every few seconds and keeps track of all our counters and timings, so we can see how things change over time. This is essentially where all our HTTP, task and system metrics end up, and where we read them from when we want to understand what’s going on.
Alertmanager - the routing and notification layer.
Prometheus checks the alert rules, and when something crosses a threshold, Alertmanager sends the notification - Slack, email, webhooks, whatever we set up. It also groups and filters alerts so we don’t get spammed every time the system twitches.
Grafana - the visualization layer.
It sits on top of Prometheus and turns raw time-series data into dashboards we can monitor in real time. It’s where we track request rates, latency, task behaviour and system load without reading query output directly.
Together they cover everything we need for basic observability: Prometheus collects the data, Alertmanager sends the alerts and Grafana shows what’s happening.
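If it helps to picture how the pieces connect: all the app has to do is expose a plain-text /metrics endpoint, and Prometheus pulls from it on a schedule. Here’s a minimal sketch of that exposure side, assuming a Python service with the prometheus_client library - the port and metric name are made up for the example, not the ones in our README.

```python
# Minimal sketch of the "Prometheus collects the data" half.
# Assumptions: a Python service and the prometheus_client library;
# the port and metric name are illustrative, not Wykra's real ones.
import time

from prometheus_client import Counter, start_http_server

REQUESTS = Counter("demo_requests_total", "Requests handled by the API")

if __name__ == "__main__":
    # Serves /metrics on :9100; Prometheus scrapes it every few seconds.
    start_http_server(9100)
    while True:
        REQUESTS.inc()   # the real app bumps counters as it does actual work
        time.sleep(1)
```

Alertmanager and Grafana never talk to the app directly; they only ever see what Prometheus has already scraped and stored.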
The Core Metrics We Focus On
Even though Wykra isn’t handling real traffic yet and everything still runs inside Docker on our machines, having metrics already makes a huge difference. It lets us see how the system behaves under our own tests, load simulations and all the strange edge cases we manage to generate while building this thing.
There are plenty of metrics in Prometheus (the README has the full list), but the ones that actually help us understand what’s going on right now fall into 4 groups.
1. HTTP Metrics
These show how the API responds under our local runs: request rates, error rates and response times across all routes.
It’s an easy way to catch regressions - for example, when one change suddenly turns a fast endpoint into something that looks like it’s running through a VPN in Antarctica.
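For the curious, the instrumentation behind those numbers is roughly this shape. It’s a sketch only, assuming a Python API with prometheus_client; the wrapper, metric names and buckets are invented for the example.

```python
# Sketch: counting requests and timing responses per method, route and status.
# Assumptions: prometheus_client; names, buckets and the wrapper are illustrative.
import time

from prometheus_client import Counter, Histogram

HTTP_REQUESTS = Counter(
    "http_requests_total", "HTTP requests", ["method", "route", "status"]
)
HTTP_LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency",
    ["method", "route"],
    buckets=(0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10),
)

def instrumented(handler, method, route):
    """Wrap any request handler so every call is counted and timed."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        status = "500"  # assume the worst until the handler returns
        try:
            response = handler(*args, **kwargs)
            status = str(getattr(response, "status_code", 200))
            return response
        finally:
            HTTP_LATENCY.labels(method, route).observe(time.perf_counter() - start)
            HTTP_REQUESTS.labels(method, route, status).inc()
    return wrapper
```

Request rate and error rate then fall out of simple PromQL like rate(http_requests_total[5m]), split by the status label.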
2. System Metrics
The basics: CPU, memory, process usage.
Even in Docker these tell useful stories: sudden memory spikes, noisy CPU neighbours, inefficient code paths. When latency jumps, this is often where the explanation starts.
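If the service uses prometheus_client, the basic process metrics (CPU time, resident memory, open file descriptors) come for free from its default collectors on Linux; anything beyond that can be a gauge that reads a fresh value at scrape time. A small sketch, assuming psutil as an extra dependency and invented metric names:

```python
# Sketch: extra system gauges on top of prometheus_client's default
# process metrics. Assumptions: psutil is available; names are illustrative.
import psutil
from prometheus_client import Gauge

RSS_BYTES = Gauge("app_resident_memory_bytes", "Resident memory of this process")
CPU_PERCENT = Gauge("app_cpu_percent", "Process CPU usage percent")

_proc = psutil.Process()

# set_function() makes the gauge call these at every scrape, so there is
# nothing to update in the hot path. cpu_percent() reports usage since the
# previous call, i.e. roughly per scrape interval (and 0.0 on the first one).
RSS_BYTES.set_function(lambda: _proc.memory_info().rss)
CPU_PERCENT.set_function(lambda: _proc.cpu_percent(interval=None))
```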
3. Task Pipeline Metrics
This is the part of Wykra that actually moves work through the system.
We track how many tasks we create during testing, how many complete or fail, how long they take and how the queue grows or drains over time. These metrics show whether the pipeline is behaving normally or slowly drifting into a backlog spiral.
We also collect latency distributions for specific task types (like Instagram search) to catch tail slowdowns that averages tend to hide.
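The wiring for that is again pretty small. A hedged sketch, assuming Python and prometheus_client, with made-up metric names and a made-up run_task helper:

```python
# Sketch: pipeline counters, a queue gauge and a per-task-type latency histogram.
# Assumptions: prometheus_client; metric names, buckets and run_task are illustrative.
from prometheus_client import Counter, Gauge, Histogram

TASKS_CREATED = Counter("tasks_created_total", "Tasks created", ["task_type"])
TASKS_FINISHED = Counter("tasks_finished_total", "Tasks finished", ["task_type", "outcome"])
QUEUE_DEPTH = Gauge("task_queue_depth", "Tasks currently in flight", ["task_type"])
TASK_SECONDS = Histogram(
    "task_duration_seconds", "Task duration", ["task_type"],
    buckets=(1, 5, 15, 30, 60, 120, 300, 600),
)

def run_task(task_type, fn, *args, **kwargs):
    """Run one unit of work and record created/completed/failed plus duration."""
    TASKS_CREATED.labels(task_type).inc()
    QUEUE_DEPTH.labels(task_type).inc()
    try:
        with TASK_SECONDS.labels(task_type).time():
            result = fn(*args, **kwargs)
        TASKS_FINISHED.labels(task_type, "completed").inc()
        return result
    except Exception:
        TASKS_FINISHED.labels(task_type, "failed").inc()
        raise
    finally:
        QUEUE_DEPTH.labels(task_type).dec()
```

The created-versus-completed split is exactly what makes a backlog spiral visible, and the histogram buckets are what let Grafana show p95/p99 for something like Instagram search instead of a flattering average.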
4. External Service Metrics
Since the system relies heavily on external APIs, we monitor them separately. They degrade differently from our own code and cause issues that look similar on the surface but require a different fix.
Bright Data metrics
Success rates, response times and error spikes for every Bright Data call.
This helps us see whether an issue comes from our code or from a day when the scraper ecosystem simply isn’t cooperating.
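In practice that’s just a thin wrapper around the client call. Another hedged sketch - Python, prometheus_client, and fetch_profile standing in for whatever function actually talks to Bright Data:

```python
# Sketch: recording outcome and latency for every Bright Data call.
# Assumptions: prometheus_client; fetch_profile() is a placeholder for the
# real client call, and the metric names are illustrative.
from prometheus_client import Counter, Histogram

BRIGHTDATA_CALLS = Counter(
    "brightdata_calls_total", "Bright Data calls by outcome", ["endpoint", "outcome"]
)
BRIGHTDATA_SECONDS = Histogram(
    "brightdata_call_seconds", "Bright Data call latency", ["endpoint"]
)

def scrape_profile(username, fetch_profile):
    endpoint = "instagram_profile"
    with BRIGHTDATA_SECONDS.labels(endpoint).time():
        try:
            data = fetch_profile(username)
        except Exception:
            BRIGHTDATA_CALLS.labels(endpoint, "error").inc()
            raise
    BRIGHTDATA_CALLS.labels(endpoint, "success").inc()
    return data
```

The per-endpoint success/error split is what lets us tell “our bug” apart from “the scraping ecosystem is having a day”.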
LLM call and token metrics
We also track how the LLMs behave under our test runs. The metrics cover call frequency, latency, token usage and error patterns, basically everything that tends to drift over time.
We record how many LLM calls we make, how long each one takes, how many prompt and completion tokens the model consumes and how that translates into total token usage per request. Errors are tracked separately so we can see when the model slows down, times out or starts returning bad responses.
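The shape of that tracking, hedged as usual: Python with prometheus_client, where call_model and the usage fields are placeholders, because every LLM SDK exposes token counts a little differently.

```python
# Sketch: per-model LLM call, latency and token accounting.
# Assumptions: prometheus_client; call_model() and the prompt_tokens /
# completion_tokens fields are placeholders, not a specific SDK's API.
from prometheus_client import Counter, Histogram

LLM_CALLS = Counter("llm_calls_total", "LLM calls by outcome", ["model", "outcome"])
LLM_SECONDS = Histogram("llm_call_seconds", "LLM call latency", ["model"])
LLM_TOKENS = Counter("llm_tokens_total", "Tokens used", ["model", "kind"])  # kind: prompt | completion

def tracked_llm_call(model, call_model, prompt):
    with LLM_SECONDS.labels(model).time():
        try:
            response = call_model(prompt)
        except Exception:
            LLM_CALLS.labels(model, "error").inc()
            raise
    LLM_CALLS.labels(model, "success").inc()
    LLM_TOKENS.labels(model, "prompt").inc(response.prompt_tokens)
    LLM_TOKENS.labels(model, "completion").inc(response.completion_tokens)
    return response
```

Total tokens per request is then just the sum of the two kinds - which is the red line you’ll see in the token chart below.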
Dashboards
I’m not going to paste the full Grafana board here (nobody needs 19 screenshots in a blog post) but here are a few core panels that demonstrate how the system behaves during our test runs.
The following panel shows the call rate for the two LLMs during our local test runs. Claude (green) peaks higher because it handles the heavier analysis steps, while Perplexity (yellow) stays lower and more steady. The small drops simply reflect pauses between test batches.
The chart below shows how token usage changes during our test runs. Claude’s prompt and completion tokens (green and blue) spike during the heavier analysis steps, which is why the total line (red) climbs sharply. Perplexity stays much lower: its queries are simpler and produce shorter responses. When the test batch ends, all token rates drop back to near zero until the next run.
You can also look at the rate of Bright Data calls during our test runs. The spikes correspond to batches where we’re pulling Instagram profile data, and the flat sections reflect pauses between those batches.
The panel below lists all alert rules we’ve configured - errors, slow responses, resource spikes, LLM issues, database problems, and queue backlogs. Everything is green here because we’re only running test loads.
And then we have a dashboard that shows the basics: CPU usage rising during a test run, Instagram search tasks being created and completed at a steady rate and no failures during this window. This simple view is enough to confirm that the pipeline behaves as expected under local load.
Conclusion
The most useful thing we built this week is the ability to see what the system is actually doing. Instead of assuming everything works because a test passed once on my laptop, we now have real visibility: metrics, alerts, dashboards.
And now we can start expanding again: adding more social platforms, trying different search strategies, breaking things on purpose, because at least we’ll know how the system behaved before the change and whether the new idea made anything better or worse.