Honestly, every time I check performance benchmarks, my eyes instinctively dart to see where Apache Doris ranks. Opening JSONBench’s leaderboard this time, I felt that familiar mix of anticipation and nervousness. Fortunately, the result brought me a sigh of relief: Apache Doris snagged third place with just its default configuration, trailing only two versions of ClickHouse (the maintainer of JSONBench itself).
Not bad. But can Apache Doris go even further? I wanted to see how much more we could cut query latency through optimization, and find out the true performance gap between Apache Doris and ClickHouse. Long story short, here’s a before-and-after comparison chart of our optimizations. For the details behind the improvements, read on!
I. What is JSONBench?
JSONBench is a benchmark tool specifically designed for JSON data analytics, with the following core features:
Test Data: 1 billion JSON-format user behavior logs from real production environments;
Test Cases: 5 SQL queries specifically designed for JSON structures, accurately evaluating the database’s ability to process semi-structured data;
Participants: Covers mainstream databases such as ClickHouse, SingleStore, MongoDB, Elasticsearch, DuckDB, and PostgreSQL.
At the time of testing, Apache Doris had already delivered an impressive performance: twice as fast as Elasticsearch and a staggering 80 times faster than PostgreSQL!
JSONBench Official Website: jsonbench.com
In addition to its performance advantages, Apache Doris is also highly competitive on storage footprint: on the same dataset, its storage volume is only half that of Elasticsearch and one third that of PostgreSQL.
1.1 JSONBench Testing Process
1. Create a table named bluesky in the database and import 1 billion real user behavior logs;
2. Execute each of the 5 benchmark queries 3 times, clearing the operating system's Page Cache before the first execution so that the runs cover both cold and warm query scenarios;
3. Determine the database performance ranking based on the total query execution time.
1.2 Apache Doris Test Basic Configuration
In this test, Apache Doris used the VARIANT data type to store JSON data (introduced in Doris version 2.1, specifically designed for semi-structured JSON data), with the default table structure as follows:
CREATE TABLE bluesky (
`id` BIGINT NOT NULL AUTO_INCREMENT,
`data` variant NOT NULL
)
DISTRIBUTED BY HASH(id) BUCKETS 32
PROPERTIES ("replication_num"="1");
Core Advantages of VARIANT Data Type:
No need to predefine column structures; can directly store complex data containing integers, strings, booleans, and other types;
Adapts to frequently changing nested structures; can automatically infer column information based on data structure and type during writing, and dynamically merge write schemas;
Stores JSON key-value pairs as columns and dynamic sub-columns, balancing the flexibility of semi-structured data and query efficiency.
More information about VARIANT data type: Apache Doris Official Documentation
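As a minimal sketch of this flexibility (the table and field names here are illustrative, not part of the benchmark): rows with different JSON shapes can be written into the same VARIANT column, and sub-paths are then queried with bracket syntax plus an explicit cast, just like the benchmark queries shown later in this article.

```sql
-- Hypothetical table: no JSON schema declared up front
CREATE TABLE events (
    `id`   BIGINT NOT NULL,
    `data` VARIANT NOT NULL
)
DUPLICATE KEY (id)
DISTRIBUTED BY HASH(id) BUCKETS 4
PROPERTIES ("replication_num" = "1");

-- Rows with different shapes land in the same VARIANT column;
-- Doris infers sub-columns (user.name as string, clicks as int, ...) on write
INSERT INTO events VALUES
    (1, '{"user": {"name": "alice"}, "clicks": 3}'),
    (2, '{"user": {"name": "bob"},   "active": true}');

-- Sub-paths are queried with bracket syntax and an explicit cast
SELECT CAST(data['user']['name'] AS TEXT) AS name
FROM events
WHERE CAST(data['clicks'] AS INT) > 1;
```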
II. Apache Doris Performance Optimization Practice
The JSONBench leaderboard is based on the performance data of each database system under its default configuration. However, in actual production environments, can we further unlock the potential of Apache Doris through tuning? The following is the complete optimization process.
2.1 Basic Environment Configuration
Server: AWS M6i.8xlarge (32 cores, 128GB memory);
Operating System: Ubuntu 24.04;
Apache Doris Version: v3.0.5.
2.2 Core Optimization: Schema Structuring Transformation
All queries in JSONBench target fixed JSON extraction paths, which means the actual schema of the semi-structured data is fixed. Based on this, we used Generated Columns to extract frequently accessed fields, combining the advantages of semi-structured and structured data. For frequently accessed JSON paths or calculation expressions, adding generated columns can significantly improve query speed!
2.2.1 Optimized Table Structure
CREATE TABLE bluesky (
kind VARCHAR(100) GENERATED ALWAYS AS (get_json_string(data, '$.kind')) NOT NULL,
operation VARCHAR(100) GENERATED ALWAYS AS (get_json_string(data, '$.commit.operation')) NULL,
collection VARCHAR(100) GENERATED ALWAYS AS (get_json_string(data, '$.commit.collection')) NULL,
did VARCHAR(100) GENERATED ALWAYS AS (get_json_string(data,'$.did')) NOT NULL,
time DATETIME GENERATED ALWAYS AS (from_microsecond(get_json_bigint(data, '$.time_us'))) NOT NULL,
`data` variant NOT NULL
)
DUPLICATE KEY (kind, operation, collection)
DISTRIBUTED BY HASH(collection, did) BUCKETS 32
PROPERTIES ("replication_num"="1");
This transformation not only eliminates JSON extraction overhead during queries; the flattened columns also serve as the sort key and distribution columns (see the DUPLICATE KEY and DISTRIBUTED BY clauses above), achieving more balanced data distribution.
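A convenient property of this design (a sketch; the JSON literal below is illustrative, not a real benchmark record): the generated columns are computed by Doris at write time from the GENERATED ALWAYS AS expressions, so the ingestion path still only needs to supply the raw JSON.

```sql
-- Only the raw JSON is supplied; kind, operation, collection, did, and time
-- are derived automatically by the GENERATED ALWAYS AS expressions on write
INSERT INTO bluesky (data) VALUES
    ('{"kind": "commit", "did": "did:plc:abc123", "time_us": 1700000000000000,
       "commit": {"operation": "create", "collection": "app.bsky.feed.post"}}');

-- The flattened columns can then be filtered directly, with no JSON parsing
SELECT kind, collection, time FROM bluesky WHERE operation = 'create';
```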
2.2.2 Supporting Query Statement Optimization
Query statements need to be modified synchronously to use flattened columns. The following is a comparison before and after optimization:
Before Optimization (Native JSON Query):
-- Query 1
SELECT cast(data['commit']['collection'] AS TEXT ) AS event, COUNT(*) AS count FROM bluesky GROUP BY event ORDER BY count DESC;
-- Query 2
SELECT cast(data['commit']['collection'] AS TEXT ) AS event, COUNT(*) AS count, COUNT(DISTINCT cast(data['did'] AS TEXT )) AS users FROM bluesky WHERE cast(data['kind'] AS TEXT ) = 'commit' AND cast(data['commit']['operation'] AS TEXT ) = 'create' GROUP BY event ORDER BY count DESC;
-- Query 3
SELECT cast(data['commit']['collection'] AS TEXT ) AS event, HOUR(from_microsecond(CAST(data['time_us'] AS BIGINT))) AS hour_of_day, COUNT(*) AS count FROM bluesky WHERE cast(data['kind'] AS TEXT ) = 'commit' AND cast(data['commit']['operation'] AS TEXT ) = 'create' AND cast(data['commit']['collection'] AS TEXT ) IN ('app.bsky.feed.post', 'app.bsky.feed.repost', 'app.bsky.feed.like') GROUP BY event, hour_of_day ORDER BY hour_of_day, event;
-- Query 4
SELECT cast(data['did'] AS TEXT ) AS user_id, MIN(from_microsecond(CAST(data['time_us'] AS BIGINT))) AS first_post_ts FROM bluesky WHERE cast(data['kind'] AS TEXT ) = 'commit' AND cast(data['commit']['operation'] AS TEXT ) = 'create' AND cast(data['commit']['collection'] AS TEXT ) = 'app.bsky.feed.post' GROUP BY user_id ORDER BY first_post_ts ASC LIMIT 3;
-- Query 5
SELECT cast(data['did'] AS TEXT ) AS user_id, MILLISECONDS_DIFF(MAX(from_microsecond(CAST(data['time_us'] AS BIGINT))),MIN(from_microsecond(CAST(data['time_us'] AS BIGINT)))) AS activity_span FROM bluesky WHERE cast(data['kind'] AS TEXT ) = 'commit' AND cast(data['commit']['operation'] AS TEXT ) = 'create' AND cast(data['commit']['collection'] AS TEXT ) = 'app.bsky.feed.post' GROUP BY user_id ORDER BY activity_span DESC LIMIT 3;
After Optimization (Flattened Column Query):
-- Query 1
SELECT collection AS event, COUNT(*) AS count FROM bluesky GROUP BY event ORDER BY count DESC;
-- Query 2
SELECT collection AS event, COUNT(*) AS count, COUNT(DISTINCT did) AS users FROM bluesky WHERE kind = 'commit' AND operation = 'create' GROUP BY event ORDER BY count DESC;
-- Query 3
SELECT collection AS event, HOUR(time) AS hour_of_day, COUNT(*) AS count FROM bluesky WHERE kind = 'commit' AND operation = 'create' AND collection IN ('app.bsky.feed.post', 'app.bsky.feed.repost', 'app.bsky.feed.like') GROUP BY event, hour_of_day ORDER BY hour_of_day, event;
-- Query 4
SELECT did AS user_id, MIN(time) AS first_post_ts FROM bluesky WHERE kind = 'commit' AND operation = 'create' AND collection = 'app.bsky.feed.post' GROUP BY user_id ORDER BY first_post_ts ASC LIMIT 3;
-- Query 5
SELECT did AS user_id, MILLISECONDS_DIFF(MAX(time),MIN(time)) AS activity_span FROM bluesky WHERE kind = 'commit' AND operation = 'create' AND collection = 'app.bsky.feed.post' GROUP BY user_id ORDER BY activity_span DESC LIMIT 3;
2.3 Page Cache Tuning
After modifying the query statements, we enabled performance profiling and executed the complete test:
set enable_profile=true;
Viewing the profile through the FE Web UI (port 8030), we found that the Page Cache hit rate of the SCAN operator was extremely low. In other words, cold reads were still occurring even during the warm query runs (like opening the fridge for a snack, finding it empty, and having to make a trip to the supermarket). The key data was as follows:
Cached Pages Number (CachedPagesNum): 1.258K (1258);
Total Pages Number (TotalPagesNum): 7.422K (7422).
The root cause: the default Page Cache size was not large enough to hold all of the bluesky table's data. The fix was to add a setting in be.conf raising the Page Cache's share of total memory from the default 20% to 60%:
storage_page_cache_limit=60%
After re-running the test, the cold read issue was completely resolved, with a cache hit rate of 100%:
Cached Pages Number (CachedPagesNum): 7.316K (7316);
Total Pages Number (TotalPagesNum): 7.316K (7316).
2.4 Maximizing Parallelism Configuration
To further unleash performance, we set the session variable parallel_pipeline_task_num to 32 — since the test server has 32 CPU cores, matching the parallelism to the number of CPU cores can maximize CPU utilization:
-- Parallelism configuration for a single Fragment
set global parallel_pipeline_task_num=32;
III. Optimization Result: Surpassing ClickHouse by 39%
After the above-mentioned adjustments to schema, queries, memory limits, and CPU parameters, we compared the performance of Apache Doris before and after optimization, as well as with other database systems:
Core improvement data:
Compared with pre-optimization, Apache Doris reduced the total query time by 74%;
Compared with ClickHouse, which previously ranked first on the leaderboard, performance improved by 39%.
IV. Summary and Future Outlook
Through schema structuring, query rewriting, cache configuration, and parallelism tuning, Apache Doris achieved a significant reduction in semi-structured query latency. Even under the default configuration it trailed ClickHouse by only a few seconds when querying 1 billion JSON records; after optimization, backed by its strong JSON processing capabilities, the VARIANT data type, and the Generated Columns feature, it clearly surpassed comparable databases in this scenario.
In the future, Apache Doris will continue to deepen its semi-structured data processing capabilities and achieve more powerful and efficient analytics through the following directions:
Optimize sparse VARIANT column storage to support more than 10,000 sub-columns;
Reduce memory usage of wide tables with 10,000-level columns;
Support custom types and indexes for VARIANT sub-columns based on column name patterns.