A couple times within the past month, I’ve had people send me a message asking if I have any suggestions about where to learn postgres. I like to share the collection of links that I’ve accumulated (and please send me more, if you have good ones!) but another thing I always say is that the public postgres slack is a nice place to see people asking questions (Discord, Telegram and IRC also have thriving Postgres user communities). Trying to answer questions and help people out can be a great way to learn!
Last month there was a brief thread on the public postgres slack about the idea of sanitizing SQL, and it has been stuck in my head for a while.
The topic of sensitive data and SQL is actually pretty nuanced.
First, I think it’s important to directly address the question of how to treat database schemas – table and column names, function names, etc. We can take our cue from the large number of industry vendors with data catalog, data lineage and data masking products. Schemas should be internal and confidential to a company – but they are not sensitive in the same way that PII or PCI data is. Within a company, it’s desirable for most schemas to be discoverable by engineers across multiple development teams – the benefits of better collaboration and better architecture of internal software are worth it.
Unfortunately, the versatile SQL language does not cleanly separate things. A SQL statement is a string that can mix keywords and schema and data all together. As Benoit points out in the slack thread – there are prepared (parameterized) statements, but you can easily miss a spot and end up with literal strings in queries. And I would add that most enterprises will also have occasional needs for manual “data fixes”, which may involve basic scripts where literal values are common.
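To illustrate the “missed spot” problem, here’s a small Python sketch (the users table and ssn value are hypothetical): string interpolation bakes the sensitive literal into the query text the server sees, while a parameterized statement keeps only a placeholder in the text.

```python
# Sketch: how a literal leaks into query text vs. a parameterized statement.
# The table/column names and the ssn value here are made up.
ssn = "123-45-6789"

# String interpolation: the sensitive value becomes part of the SQL text,
# so it will show up verbatim in pg_stat_activity and server logs.
interpolated = f"SELECT * FROM users WHERE ssn = '{ssn}'"

# Parameterized form (e.g. psycopg style): the SQL text only ever
# contains the placeholder; the value travels separately.
parameterized = "SELECT * FROM users WHERE ssn = %s"

assert ssn in interpolated       # the literal is embedded in the text
assert ssn not in parameterized  # only the placeholder is in the text
```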
Benoit’s suggestion was to run a full parse of the query text. This is a good idea – in fact pganalyze already maintains a standalone open-source library which can be used to directly leverage Postgres’ query parser from many languages. This is really the best solution. It’s worth noting, though, that I’m interested in cases of post-processing query texts from pg_stat_activity and pg_stat_statements, both of which have maximum lengths and will truncate query texts that are longer than this. So query parsing would still need to work with truncated texts that throw syntax errors.
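As a quick illustration of why truncation matters (a Python sketch with made-up table and values): pg_stat_activity cuts query text at track_activity_query_size, which defaults to 1024 bytes, and a fixed-size cut can land in the middle of a string literal – leaving text that any strict parser rejects with a syntax error.

```python
# Sketch: a long INSERT whose text gets truncated at a fixed size, the way
# pg_stat_activity truncates at track_activity_query_size (default 1024
# bytes). The table name and values are made up for illustration.
values = ", ".join(f"('user{i}', 'payload-{i:04d}')" for i in range(200))
sql = f"INSERT INTO events (name, data) VALUES {values}"

truncated = sql[:1024]

# The cut lands mid-literal: the truncated text contains an odd number of
# single quotes, i.e. an unterminated string literal, so a strict SQL
# parser would reject it with a syntax error.
assert len(sql) > 1024
assert truncated.count("'") % 2 == 1
```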
The pganalyze library approach is interesting, but I think a simplistic regex-based approach actually has a lot of merit. It can give very useful sanitized SQL for developers to debug, it can still make strong guarantees that we won’t include sensitive data, and the code is incredibly simple… especially compared with importing the entire postgres parser and trying to link to compiled C libraries in other languages!
Tonight I finally got around to writing the code for this.
My design choices here were very intentional:
- I’m stripping out comments because libraries like sqlcommenter add unique values via comments, which breaks any ability to aggregate, summarize and report top queries or problem queries.
- I would always include the query_id alongside the sanitized SQL text. A user can always go back to the database later and look directly at pg_stat_statements to get the full query text as long as they have the Query ID.
- My decision to include the first three words (excluding comments) and the two words following every occurrence of FROM is very deliberate. In most cases (CTEs being the exception), the first three words tell us what kind of command is being executed – SELECT or DML or DDL or some utility/misc statement. By including two words after the command, we will generally see the table name for inserts and updates. By including words after FROM, we’ll know at least two of the tables being operated on for queries and deletes. This means we always know at a glance “it’s updating table X” or “it’s querying table Y and table Z”.
- The likelihood of this algorithm including a string literal with sensitive data is next to zero. We’re guaranteed never to get literals from INSERTs or UPDATEs, since we only keep the leading words and the words after FROM. Function and procedure calls must always include parentheses, so that risk is easily mitigated with a simple regex to nuke everything after an open-parenthesis.
- There may be a few cases where this algorithm’s sanitized SQL isn’t as useful as it could be. But that’s why we include the Query ID for retrieving the full query text if needed – and my main goal here is just to have something that’s helpful most of the time and that we could ensure is safe to give developers without requiring PII/PCI data controls.
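For anyone who wants to apply the same idea to query texts pulled out of the database – say, a dump of pg_stat_activity being post-processed in a script – here’s a rough Python port of the algorithm described above. This is a sketch following the same design choices, not the canonical implementation (which is the PL/pgSQL function that follows).

```python
import re

def sanitize_sql(sql_text: str) -> str:
    """Rough Python port of the regex-based sanitizer: keep the first
    three words (stopping before any FROM) plus up to two words after
    every FROM, and cut everything after an open-parenthesis."""
    # Strip /* ... */ block comments and -- line comments
    cleaned = re.sub(r"/\*.*?\*/", "", sql_text, flags=re.DOTALL)
    cleaned = re.sub(r"--.*?(\n|$)", "", cleaned)

    # Take up to three leading words, but stop before any FROM
    first = []
    for word in cleaned.split()[:3]:
        if word.upper() == "FROM":
            break
        first.append(word)
    # Nuke everything after an open-parenthesis so literals in
    # function arguments never leak
    first_part = re.sub(r"\(.*", "(...)", " ".join(first))

    # Every occurrence of FROM plus up to two following words
    from_parts = ""
    for m in re.finditer(r"(FROM)\s+(\S+)\s*(\S*)", cleaned, re.IGNORECASE):
        chunk = " ".join(g for g in m.groups() if g)
        from_parts += "..." + re.sub(r"\(.*", "(...)", chunk)

    return first_part + from_parts
```

On the test cases below, this sketch produces the same results as the PL/pgSQL version – for example `sanitize_sql("DELETE FROM sessions WHERE x < 1;")` gives `DELETE...FROM sessions WHERE`.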
Sanitize SQL PL/pgSQL Function
https://gist.github.com/ardentperf/44e94ac484e53ff8353f6c1dc0b8f272
Here’s what the code looks like:
CREATE OR REPLACE FUNCTION sanatize_sql(sql_text text)
RETURNS text AS $$
DECLARE
  cleaned_text text;
  first_part_regex_3words text := '([^[:space:]]+)[[:space:]]+([^[:space:]]+)[[:space:]]+([^[:space:]]+)';
  first_part_regex_2words text := '([^[:space:]]+)[[:space:]]+([^[:space:]]+)';
  first_part_regex_1words text := '([^[:space:]]+)';
  first_part text;
  match_array text;
  from_parts_regex_3words text := '(FROM)[[:space:]]+([^[:space:]]+)[[:space:]]*([^[:space:]]*)';
  from_parts text := '';
BEGIN
  -- Remove multi-line comments (/* ... */)
  cleaned_text := regexp_replace(sql_text, '/\*.*?\*/', '', 'g');
  -- Remove single-line comments (-- to end of line)
  cleaned_text := regexp_replace(cleaned_text, '--.*?(\n|$)', '', 'g');
  -- Extract the first keyword and up to two words after it,
  -- backing off so FROM never ends up inside the first part
  first_part := array_to_string(regexp_match(cleaned_text, first_part_regex_3words), ' ');
  IF first_part IS NULL OR first_part ILIKE '% FROM %' OR first_part ILIKE '% FROM' THEN
    first_part := array_to_string(regexp_match(cleaned_text, first_part_regex_2words), ' ');
    IF first_part IS NULL OR first_part ILIKE '% FROM' THEN
      first_part := array_to_string(regexp_match(cleaned_text, first_part_regex_1words), ' ');
    END IF;
  END IF;
  -- Nuke everything after an open-parenthesis
  first_part := regexp_replace(first_part, '\(.*', '(...)');
  -- Find all occurrences of FROM and up to two words after each
  FOR match_array IN
    SELECT array_to_string(regexp_matches(cleaned_text, from_parts_regex_3words, 'gi'), ' ')
  LOOP
    match_array := regexp_replace(match_array, '\(.*', '(...)');
    from_parts := from_parts || '...' || match_array;
  END LOOP;
  -- Return combined result
  RETURN first_part || from_parts;
END;
$$ LANGUAGE plpgsql;
Test 1: Sensitive Data in a Function Call
postgres=# SELECT sanatize_sql($test$
SELECT pgp_sym_encrypt('123-45-6789', 'my_secret_key') AS encrypted_ssn;
$test$);
sanatize_sql
-----------------------------
SELECT pgp_sym_encrypt(...)
(1 row)
Test 2: Simple SELECT with Inline and Block Comments
postgres=# SELECT sanatize_sql($test$
-- Fetch active users only
SELECT id, name -- user info
FROM users /* main table */
WHERE active = TRUE; /* status flag */
$test$);
sanatize_sql
------------------------------------
SELECT id, name...FROM users WHERE
(1 row)
Test 3: SELECT with Subquery and Mixed Comment Styles
postgres=# SELECT sanatize_sql($test$
SELECT id, name
FROM users
WHERE id IN (
/* subquery for high-value customers */
SELECT user_id -- link to users.id
FROM orders
WHERE total > 100 -- filter expensive orders
);
-- end of query
$test$);
sanatize_sql
--------------------------------------------------------
SELECT id, name...FROM users WHERE...FROM orders WHERE
(1 row)
Test 4: SELECT + CTE with Comments Inside and Outside
postgres=# SELECT sanatize_sql($test$
-- recent orders per user
WITH recent_orders AS (
SELECT user_id, MAX(created_at) AS last_order
FROM orders
GROUP BY user_id /* aggregation */
)
SELECT u.name, r.last_order
FROM users u
JOIN recent_orders r ON u.id = r.user_id; -- join results
$test$);
sanatize_sql
----------------------------------------------------------
WITH recent_orders AS...FROM orders GROUP...FROM users u
(1 row)
Test 5: INSERT with Comments in Values
postgres=# SELECT sanatize_sql($test$
INSERT INTO users (name, email, created_at)
VALUES (
'Alice', -- first name
'alice@example.com', /* email */
NOW() /* timestamp */
);
-- new user inserted
$test$);
sanatize_sql
-------------------
INSERT INTO users
(1 row)
Test 6: UPDATE with Trailing and Embedded Comments
postgres=# SELECT sanatize_sql($test$
UPDATE users
SET last_login = NOW() -- set current time
WHERE id = 42 /* target specific user */; -- done
$test$);
sanatize_sql
------------------
UPDATE users SET
(1 row)
Test 7: DELETE with Multi-line Comment Block
postgres=# SELECT sanatize_sql($test$
/*
* Delete old sessions.
* Keep data from the last 30 days.
* Be careful: irreversible.
*/
DELETE FROM sessions
WHERE last_access < NOW() - INTERVAL '30 days';
$test$);
sanatize_sql
------------------------------
DELETE...FROM sessions WHERE
(1 row)
Test 8: UPSERT (Insert … On Conflict) with Inline + Header Comments
postgres=# SELECT sanatize_sql($test$
-- Upsert settings
INSERT INTO user_settings (user_id, theme, notifications)
VALUES (
1, /* user id */
'dark', -- theme
TRUE -- notifications on
)
ON CONFLICT (user_id)
DO UPDATE
SET theme = EXCLUDED.theme, -- overwrite
notifications = EXCLUDED.notifications;
$test$);
sanatize_sql
---------------------------
INSERT INTO user_settings
(1 row)
Test 9: CTE-Based UPDATE with Nested Comments
postgres=# SELECT sanatize_sql($test$
-- mark inactive users
WITH inactive_users AS (
SELECT id
FROM users
WHERE last_login < NOW() - INTERVAL '1 year' /* cutoff */
)
UPDATE users
SET active = FALSE
WHERE id IN (
SELECT id FROM inactive_users -- CTE reference
);
$test$);
sanatize_sql
--------------------------------------------------------------------
WITH inactive_users AS...FROM users WHERE...FROM inactive_users );
(1 row)
Test 10: DDL with Comments Everywhere
postgres=# SELECT sanatize_sql($test$
-- create table if missing
CREATE TABLE IF NOT EXISTS audit_log ( /* audit records */
id SERIAL PRIMARY KEY, -- identity column
user_id INT REFERENCES users(id), /* FK */
action TEXT NOT NULL, -- what happened
created_at TIMESTAMP DEFAULT NOW() /* timestamp */
);
$test$);
sanatize_sql
-----------------
CREATE TABLE IF
(1 row)
Test 11: Complex Query with Multi-CTE, Inline + Block Comments
postgres=# SELECT sanatize_sql($test$
/*
This query finds top customers.
It uses multiple CTEs and subqueries.
*/
WITH order_totals AS (
SELECT user_id, SUM(total) AS lifetime_value
FROM orders
GROUP BY user_id -- one row per user
),
top_customers AS (
SELECT user_id
FROM order_totals
WHERE lifetime_value > 10000 /* threshold */
)
SELECT u.id, u.name, o.lifetime_value -- main output
FROM users u
JOIN order_totals o ON u.id = o.user_id
WHERE u.id IN (SELECT user_id FROM top_customers)
ORDER BY o.lifetime_value DESC /* high to low */
LIMIT 10; -- top 10
$test$);
sanatize_sql
---------------------------------------------------------------------------------------------------------------
WITH order_totals AS...FROM orders GROUP...FROM order_totals WHERE...FROM users u...FROM top_customers) ORDER
(1 row)
Test 12: Function Call in the FROM Clause
postgres=# SELECT sanatize_sql($test$
SELECT * FROM generate_series(1,10);
$test$);
sanatize_sql
--------------------------------------
SELECT *...FROM generate_series(...)
(1 row)
Test 13: Anonymous Code Block
postgres=# SELECT sanatize_sql($test$
DO $$
DECLARE
tbl RECORD;
BEGIN
OPEN table_cursor;
LOOP
FETCH table_cursor INTO tbl;
EXIT WHEN NOT FOUND;
EXECUTE 'VACUUM ' || tbl.tablename;
END LOOP;
CLOSE table_cursor;
END $$;
$test$);
sanatize_sql
---------------
DO $$ DECLARE
(1 row)
Test 14: Declaring a Cursor
postgres=# SELECT sanatize_sql($test$
-- Declare a cursor for employees in Engineering
DECLARE emp_cursor CURSOR FOR
SELECT id, name, salary
FROM employees
WHERE department = 'Engineering';
$test$);
sanatize_sql
--------------------------------------------------
DECLARE emp_cursor CURSOR...FROM employees WHERE
(1 row)