A few months ago, I found myself facing one of those architectural decisions you can’t take lightly: Which database should power our application?
On paper, several options looked great. Azure SQL had strong consistency and a familiar relational model. Google Spanner promised global scale and near-infinite horizontal growth. But the same problem kept bothering me:
None of the available benchmarks reflected my application’s workload.
Every benchmark I found was synthetic. Every article was based on someone else’s traffic patterns. Every vendor claim assumed a workload that looked nothing like ours.
And that’s when it clicked.
If I wanted real answers, I needed to simulate our actual read/write behavior, our transaction mix, our concurrency pattern, our connection pooling.
I needed a test setup that behaved like our Java application, not a generic stress tester. This article is the story of how I modeled the workloads before I even touched Azure SQL or Google Spanner.
In Part 2, I’ll share what happened when I finally put those databases under load.
Why I Couldn’t Trust Generic Database Benchmarks
At first, I thought I’d simply compare p95 latencies or look up TPS numbers for each database. But it didn’t take long to realize the flaws:
- These numbers came from engineered environments.
- They assumed ideal network conditions.
- And they definitely didn’t match our schema or access patterns.
Most importantly?
Our system was not a synthetic benchmark.
It had:
- specific read/write ratios
- certain frequently accessed tables
- particular update patterns
- a real connection pool
- and lock-sensitive flows
So I stopped searching for benchmarks and started designing tests that mirrored our workload.
Step 1: Defining What “Good Performance” Actually Meant
Before modeling anything in JMeter, I forced myself to answer a simple question:
What does good database performance look like for this application?
This gave me a baseline to measure against, and below are the performance criteria I defined:
- p95 and p99 response time targets
- Expected throughput (operations per second)
- Resource utilization boundaries (CPU, memory, IOPS)
- Acceptable error rate
- Scalability expectations under increased concurrency
These weren’t arbitrary numbers—they were tied to real SLAs and user expectations. With this foundation, I could work backwards and ensure the workloads I modeled aligned with the actual business needs.
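To keep targets like these from living only in a document, I find it helps to express them somewhere the test harness can check automatically. Here is a minimal Java sketch of that idea; every number below is a placeholder, not our actual SLA.

```java
// A tiny holder for the targets so the load-test harness can assert against
// them instead of eyeballing reports. Every number here is a placeholder,
// not our real SLA.
public final class PerformanceCriteria {
    static final long P95_LATENCY_MS = 120;          // p95 response time target
    static final long P99_LATENCY_MS = 250;          // p99 response time target
    static final int TARGET_THROUGHPUT_OPS = 1_500;  // expected operations per second
    static final double MAX_ERROR_RATE = 0.001;      // 0.1% acceptable error rate
    static final int MAX_CPU_PERCENT = 70;           // resource utilization boundary

    static boolean meetsLatencyTargets(long p95Ms, long p99Ms) {
        return p95Ms <= P95_LATENCY_MS && p99Ms <= P99_LATENCY_MS;
    }

    private PerformanceCriteria() {}
}
```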
Step 2: Understanding the Application’s Actual Database Behaviour
Next, I analyzed our application’s schema and data-access patterns.
I listed every operation the application performed frequently:
- Reads (SELECT)
- Writes (INSERT)
- Updates (UPDATE)
- Deletes (DELETE)
But I didn’t stop there. I also identified:
- which tables were read-heavy
- which operations were latency-sensitive
- which queries created row-level locks
- which flows needed strict consistency
This step was crucial because performance testing is not just about “running queries fast.” It’s about understanding how those queries behave under real concurrency.
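One practical way to get the read/write ratios is to tally operation types from whatever query logging you already have. The sketch below is hypothetical: it assumes a plain-text log with one statement per line, starting with the SQL verb, and an invented file name. Adapt it to your own logging setup.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: count CRUD operation types from an exported query log,
// assuming one statement per line that starts with the SQL verb. The file
// name and log format are assumptions, not a real export format.
public class QueryLogTally {
    public static void main(String[] args) throws Exception {
        Map<String, Long> counts;
        try (Stream<String> lines = Files.lines(Path.of("query-log.txt"))) {
            counts = lines
                .map(l -> l.trim().split("\\s+")[0].toUpperCase()) // first keyword: SELECT, INSERT, ...
                .filter(k -> Set.of("SELECT", "INSERT", "UPDATE", "DELETE").contains(k))
                .collect(Collectors.groupingBy(k -> k, Collectors.counting()));
        }
        counts.forEach((op, n) -> System.out.printf("%s -> %d%n", op, n));
    }
}
```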
Step 3: Classifying Operations by Frequency and Importance
One of the biggest mistakes in performance testing is treating all operations equally. Real applications never have a perfectly even CRUD distribution.
Some flows run thousands of times per minute and others just a few. So, I categorized operations based on:
- frequency
- business criticality
- concurrency sensitivity
This gave me clarity on how to weight each operation in the workload.
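To make the weighting concrete, I captured the classification as data rather than prose, so the test plan could be derived from it. A rough sketch follows; the operation names, weights, and criticality levels are illustrative, not our real catalogue.

```java
import java.util.List;

// Minimal sketch of the classification as data. Operation names, weights and
// criticality levels are illustrative, not the real catalogue.
public class OperationCatalog {

    enum Criticality { LOW, MEDIUM, HIGH }

    // One entry per frequent operation: how often it runs relative to the rest,
    // how important it is to the business, and whether it is sensitive to
    // concurrent access (locking, contention).
    record OperationProfile(String name, double weight,
                            Criticality criticality, boolean concurrencySensitive) {}

    static final List<OperationProfile> PROFILES = List.of(
        new OperationProfile("readOrderById",    0.45, Criticality.HIGH,   false),
        new OperationProfile("listOpenOrders",   0.25, Criticality.HIGH,   false),
        new OperationProfile("insertOrder",      0.20, Criticality.HIGH,   true),
        new OperationProfile("updateOrderState", 0.10, Criticality.MEDIUM, true)
    );

    public static void main(String[] args) {
        PROFILES.forEach(System.out::println);
    }
}
```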
Step 4: Designing Realistic Workload Scenarios
Now came the core of the modeling process.
I created workload mixes that reflected how our application behaves in production.
Scenario 1: Read-Heavy Workload
Most user-facing apps are read-dominant. For ours, this looked like:
70% reads
20% writes
10% updates
Scenario 2: Balanced Workload
For internal workflows and batch processes:
50% reads
30% writes
10% updates
10% deletes
Each scenario represented a realistic slice of our system’s behaviour, not a synthetic stress test.
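For readers who want to see a mix as code, here is the read-heavy split expressed as a weighted draw per iteration. In JMeter itself this is usually achieved with a Throughput Controller or by sizing the thread groups; the plain-Java version below only illustrates the idea.

```java
import java.util.concurrent.ThreadLocalRandom;

// The 70/20/10 read-heavy mix expressed as a weighted draw per iteration.
// JMeter normally does this with a Throughput Controller or thread-group
// sizing; this plain-Java version only illustrates the idea.
public class WorkloadMix {

    enum Operation { READ, WRITE, UPDATE }

    static Operation nextOperation() {
        int roll = ThreadLocalRandom.current().nextInt(100);
        if (roll < 70) return Operation.READ;   // 70% reads
        if (roll < 90) return Operation.WRITE;  // 20% writes
        return Operation.UPDATE;                // 10% updates
    }

    public static void main(String[] args) {
        int[] counts = new int[Operation.values().length];
        for (int i = 0; i < 10_000; i++) {
            counts[nextOperation().ordinal()]++;
        }
        for (Operation op : Operation.values()) {
            System.out.printf("%s: %d%n", op, counts[op.ordinal()]);
        }
    }
}
```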
Step 5: Designing the JMeter Setup to Behave Like a Java Application
This part was non-negotiable for me. I didn’t just want JMeter to hit the database; I wanted it to behave like our Java service.
So I built the test plan with care:
Separate thread groups for each CRUD operation
This allowed me to configure:
- individual concurrency control
- precise latency measurement
- different ramp-up patterns
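Outside JMeter, the same idea can be pictured as one small executor per operation type, each with its own user count and a staggered start for ramp-up. This is only a conceptual sketch; the user counts, ramp-up times, and printed statements are placeholders.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Conceptual sketch of "one thread group per operation type" outside JMeter:
// each group gets its own number of virtual users and its own ramp-up, so
// reads, writes and updates can be controlled and measured independently.
// User counts, ramp-up times and the printed statements are placeholders.
public class ThreadGroups {

    static void startGroup(String name, int users, long rampUpMillis, Runnable task) {
        System.out.printf("starting %s: %d users over %d ms%n", name, users, rampUpMillis);
        ExecutorService pool = Executors.newFixedThreadPool(users);
        long delayPerUser = rampUpMillis / users;
        for (int i = 0; i < users; i++) {
            final long delay = i * delayPerUser;   // staggered start = ramp-up
            pool.submit(() -> {
                try {
                    Thread.sleep(delay);
                    task.run();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
    }

    public static void main(String[] args) {
        startGroup("reads",   50, 30_000, () -> System.out.println("SELECT ..."));
        startGroup("writes",  15, 30_000, () -> System.out.println("INSERT ..."));
        startGroup("updates",  5, 30_000, () -> System.out.println("UPDATE ..."));
    }
}
```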
Parameterized SQL queries
Using a CSV Data Set Config, every iteration pulled dynamic values, which avoided caching effects and mimicked real traffic variation.
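At the JDBC level, this boils down to a prepared statement whose bind value comes from the CSV feed on every iteration. The sketch below is hypothetical: the table, column, and file names are invented, and it assumes a suitable JDBC driver is on the classpath with JDBC_URL pointing at the target database. In the actual test, JMeter’s CSV Data Set Config and JDBC Request sampler do this for you.

```java
import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Hypothetical sketch: a parameterized read whose bind value changes on every
// iteration, fed from a one-column CSV of ids. Table, column and file names
// are invented; a suitable JDBC driver must be on the classpath and JDBC_URL
// must point at the target database.
public class ParameterizedReads {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(System.getenv("JDBC_URL"));
             BufferedReader csv = Files.newBufferedReader(Path.of("customer-ids.csv"));
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT id, status, total FROM orders WHERE customer_id = ?")) {

            String line;
            while ((line = csv.readLine()) != null) {
                ps.setLong(1, Long.parseLong(line.trim())); // fresh value each iteration, no cache-friendly repeats
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) { /* consume the row as the app would */ }
                }
            }
        }
    }
}
```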
Shared JDBC connection pool (1:1 ratio)
All thread groups used the same JDBC pool, just like production. This created:
- connection contention
- realistic queueing
- lock waits
Together, these produced true concurrency behaviour.
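For anyone reproducing this outside JMeter, the equivalent is a single shared connection pool sized like production. The sketch below uses HikariCP purely as an example pool; the size and timeout are placeholders, not our real settings. In JMeter, the same effect comes from one shared JDBC Connection Configuration element used by every thread group.

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

// Example of a single shared pool sized like production. HikariCP is used
// here only as an illustration; the pool size and timeout are placeholders.
public final class SharedPool {
    public static HikariDataSource create() {
        HikariConfig cfg = new HikariConfig();
        cfg.setJdbcUrl(System.getenv("JDBC_URL"));
        cfg.setUsername(System.getenv("DB_USER"));
        cfg.setPassword(System.getenv("DB_PASSWORD"));
        cfg.setMaximumPoolSize(20);       // match the production pool, not "as many as possible"
        cfg.setConnectionTimeout(3_000);  // makes queueing visible when the pool saturates
        return new HikariDataSource(cfg);
    }

    private SharedPool() {}
}
```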
At this point, JMeter wasn’t just “sending traffic.” It was simulating the exact way our application interacted with the database.
The Workload Modeling Was Complete — Now the Real Testing Could Begin
After several iterations, reviews, and dry runs, I finally had a performance test suite that felt authentic. It matched:
- our query patterns
- our connection pool settings
- our concurrency behaviour
- our read/write ratios
And most importantly, it reflected the pressure our application would put on any database—whether Azure SQL, Google Spanner, or something else entirely.
Now I was ready for the real showdown.
Coming Up Next: Azure SQL vs Google Spanner — The Actual Results
In Part 2, I’ll share how both databases performed under the exact workloads modeled in this article. And trust me, the results were nothing like the vendor benchmarks.
I’ll cover:
- p95/p99 latencies
- read/write throughput
- locking and contention behaviour
- resource consumption
- and which database ultimately won for our use case
Stay tuned for Part 2.