Increasing access to affordable fertility treatment involves handling a large amount of sensitive data. This ranges from financial and income information to detailed medical data recorded during fertility care. That data is essential to how Gaia operates and to improving outcomes for our members, but it also introduces compliance risk. Personally identifiable information (PII) and protected health information (PHI) can easily end up being exposed, particularly as data platforms grow and more teams rely on analytics. In this article, I’ll explain how we built an automated system at Gaia to detect PII and PHI in our data warehouse.
The problem
As Gaia scales, the number of systems and data sources feeding the data warehouse increases. Some of these quite legitimately contain PII and PHI essential to our business. Other fields can contain PII or PHI unintentionally, particularly free-text fields, notes, and end-user communication. Manually reviewing models is incredibly time-consuming and prone to human error. At the same time, most off-the-shelf PII detection tools rely on hosted machine-learning APIs, and under HIPAA constraints, exposing raw data outside our environment was not an option. We needed a way to scan the entire data warehouse, flag potential PII and PHI, and surface it in a way that the data team could realistically act on.
What we built
We built a local-first detection pipeline that runs entirely inside our infrastructure.
The system combines a few simple ideas:
- Pattern-based column checks for obvious identifiers such as email addresses or phone numbers
- Local NLP models (https://huggingface.co/dslim/bert-base-NER) to catch less obvious PII in free-text columns
- Metadata-driven scanning so we only inspect content when needed
- A review workflow so humans can confirm or reject detections
All detection results are written back into the warehouse so they can be tracked over time.
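As a sketch, each result written back to the warehouse can be a small metadata-only record. The field names below are illustrative, not our actual schema:

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class Detection:
    """One flagged column; stores metadata only, never raw values."""
    table_name: str
    column_name: str
    category: str   # e.g. "email", "phone", "medical"
    method: str     # "column_name", "pattern", or "ner"
    detected_at: str

def make_detection(table: str, column: str, category: str, method: str) -> dict:
    # Timestamp each detection so findings can be tracked over time
    return asdict(Detection(
        table_name=table,
        column_name=column,
        category=category,
        method=method,
        detected_at=datetime.now(timezone.utc).isoformat(),
    ))
```

Rows like this are cheap to append to a detections table and can later be joined against review decisions.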
How detection works
The process starts with table metadata. Schemas, table names and column names across the warehouse are scanned and anything suspicious is flagged. This alone catches more than you might expect and is cheap enough to run continuously.
For content detection, we take small, controlled samples rather than scanning full tables. High-confidence cases are handled with simple pattern matching. For free-text columns, we run an open-source named-entity recognition model locally to detect potential PII or PHI.
No raw values are persisted, and no data ever leaves our environment.
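The high-confidence pattern-matching step can be sketched like this. The regexes here are deliberately simplified illustrations; our production patterns are stricter:

```python
import re

# Simplified illustrative patterns; production patterns are stricter.
VALUE_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "uk_phone": re.compile(r"\b(?:\+44|0)\d{9,10}\b"),
}

def scan_values(values) -> set:
    """Return the set of categories matched in a sample of cell values.

    Only category names are returned, never the matching values themselves.
    """
    found = set()
    for value in values:
        for category, pattern in VALUE_PATTERNS.items():
            if pattern.search(str(value)):
                found.add(category)
    return found
```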
Column-level PII detection
Column names alone catch a large percentage of potential issues. Fields like email, dob, patient_id, or insurance_number are often enough to warrant a closer look.
We scan column names against a compiled set of patterns and record any matches.
def check_column_names_for_pii(columns):
    suspicious = {}
    for column in columns:
        name = column["name"].lower()
        for category, patterns in COMPILED_PATTERNS.items():
            for pattern in patterns:
                if pattern.search(name):
                    suspicious.setdefault(category, []).append(column["name"])
                    break
    return suspicious
This runs across the entire warehouse using metadata only, which keeps daily execution cheap and makes it safe enough to be the default.
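For context, the COMPILED_PATTERNS referenced above is just a dictionary of compiled regexes keyed by category. A minimal illustrative version (not our full pattern set) might look like:

```python
import re

# Illustrative subset; the real pattern set is much broader.
COMPILED_PATTERNS = {
    "email": [re.compile(r"e[-_]?mail")],
    "date_of_birth": [re.compile(r"\bdob\b"), re.compile(r"birth")],
    "identifier": [re.compile(r"patient_id"), re.compile(r"insurance")],
}
```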
Detecting PII in free-text with a local AI model
Column names only get you so far. Free-text fields are where most of the subtle risk lives.
For tables with suspicious columns, we take a small sample of rows and run a recognition model locally against selected text fields. This helps catch names, dates, or medical terms embedded in longer text.
def detect_pii_in_text(text: str) -> set:
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=512,
    ).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=2)
    detected = set()
    for label_id in predictions[0]:
        # Map the predicted class index back to its entity label;
        # id2label expects a plain int, hence .item()
        label = model.config.id2label[label_id.item()]
        if label != "O":
            detected.add(label)
    return detected
We deliberately limit how much data we analyse:
- Only text columns
- A random sample of rows representative of the population
This keeps scans fast and reduces unnecessary data exposure.
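Sampling can happen in the warehouse itself, so only a handful of rows ever reach the scanner. A sketch, assuming Snowflake's SAMPLE clause (table and column names are illustrative):

```python
def build_sample_query(table: str, text_columns: list, rows: int = 50) -> str:
    """Build a query pulling a small random sample of text columns only.

    Snowflake's SAMPLE (n ROWS) clause returns a random subset of rows,
    so full tables are never read by the scanner.
    """
    cols = ", ".join(text_columns)
    return f"SELECT {cols} FROM {table} SAMPLE ({rows} ROWS)"
```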
Reviewing detections
Automated detection is only useful if someone can review the results and make a decision. False positives are inevitable, especially when scanning text. To deal with this, we built a simple internal review application using Streamlit.
Review PII detections by table (Streamlit UI showing grouped tables and column-level detections)
Detections are grouped by table and shown at column level. For each detection, reviewers can see what was flagged, how it was detected, and when it occurred.
Reviewers are required to either mark detections as false positives or remove the field from the data warehouse. Any false-positive decision must include a short explanation of why the field is considered safe.
False-positive selection with reason and reviewer attribution
These decisions are stored and automatically applied to future scans. Over time, this significantly reduces noise and avoids repeatedly flagging the same benign fields.
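Applying the stored decisions is then a simple filter: any detection matching a recorded false positive for the same table and column is suppressed before alerting. A minimal sketch (the dictionary keys are illustrative):

```python
def suppress_false_positives(detections, false_positives):
    """Drop detections previously marked as false positives.

    detections: list of dicts with "table" and "column" keys
    false_positives: set of (table, column) pairs from past reviews
    """
    return [
        d for d in detections
        if (d["table"], d["column"]) not in false_positives
    ]
```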
Notifications and visibility
On every run, the system sends a short summary notification to internal tooling.
PII/PHI detection alert summary
The alert shows how many tables were scanned, how many still contain unresolved findings, and which tables need review. It links directly to the review application. Only metadata and counts are included.
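Building that summary is deliberately boring: counts and table names only, never values. Something along these lines (the message format is illustrative):

```python
def build_alert(scanned: int, unresolved_tables: list) -> str:
    """Metadata-only summary; safe to post to internal chat tooling."""
    lines = [
        f"PII/PHI scan: {scanned} tables scanned, "
        f"{len(unresolved_tables)} with unresolved findings."
    ]
    # List only table names needing review, never flagged values
    for table in unresolved_tables:
        lines.append(f"  - {table} needs review")
    return "\n".join(lines)
```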
How we resolve flagged fields
Detection is only the first step. Once a field is flagged and reviewed, it needs to be resolved properly. We were deliberate about keeping remediation simple and explicit rather than trying to automate destructive changes. In practice, we resolve flagged fields in three main ways:
1. Removing PII columns via pull requests
If a column clearly contains PII or PHI and has no business being in the warehouse, we remove it via a standard pull request.
This keeps the change visible, reviewable, and tied to version control. We deliberately avoided auto-dropping columns directly from the detection pipeline. Anything destructive still goes through normal engineering workflows.
2. Dropping unused tables
In some cases, the scanner surfaced tables that were no longer being used at all. These were often legacy models or abandoned experiments.
Once confirmed with the relevant stakeholder, those tables were dropped. This reduced the surface area of the warehouse and removed unnecessary risk.
3. Redacting free-text fields
Free-text fields are harder. They are often useful for analytics, but also the most likely place for sensitive information to appear.
Rather than removing these fields entirely, we chose to redact them at query time using Snowflake’s built-in AI redaction function. This preserves the structure of the data while masking sensitive content. We wrapped this in a small dbt macro so it could be applied consistently.
{% macro safe_ai_redact(column) -%}
    CASE
        WHEN {{ column }} IS NULL OR TRIM({{ column }}) = '' THEN NULL
        ELSE AI_REDACT({{ column }})
    END
{%- endmacro %}
This worked particularly well for notes, descriptions, and other text fields where deleting the column entirely would have removed useful analytical context.
Review volume and cadence
The system only works if it runs continuously and if the review process stays manageable.
The column-level scan runs daily. It is cheap, fast, and catches obvious issues early. The AI-based content scan runs weekly, as it is more expensive and only useful once metadata suggests a real risk.
What matters more than cadence, though, is how the output changes over time.
On the very first run, the signal-to-noise ratio was poor, as expected. Out of 212 tables scanned, 170 were flagged as potentially containing PII or PHI. The alert was long, noisy, and difficult to prioritise. After reviewers started marking false positives and those decisions were fed back into the system, the output changed significantly.
For example, one table was flagged as potentially containing PII because it included information about the Sales team, such as employee email addresses and activity dates. Upon review, these were determined to be false positives: the data was collected and stored with the explicit consent of Sales employees and was intentionally exposed for operational purposes. Once these decisions were recorded, the system learned to suppress similar alerts for sales-activity–related data, reducing noise without changing the underlying detection logic.
After applying historical false-positive decisions and the resolution steps above, no flagged columns or tables remained: all confirmed PII and PHI had been removed from the data warehouse. The tables that did require action had been flagged because actual PII was detected in text fields, not just because of column naming. The detection logic did not change; what changed was the accumulated review history. The system did not get smarter in an abstract sense; it got quieter. That is what made the alerts actionable.
Why this works
The key part of this system is the feedback loop. Detection alone would quickly become noisy and ignored. By recording false positives and feeding those decisions back into the pipeline, the system improves over time without constant tuning. Running everything locally keeps us within clear compliance boundaries, while still allowing us to use machine-learning techniques where they actually add value.
Impact
This system significantly reduced the risk of accidental PII and PHI exposure in our analytics layer and removed a large amount of manual auditing work from the data team. By continuously scanning the warehouse and surfacing issues early, we were able to move from reactive clean-ups to proactive prevention. As a result, all teams gained confidence that the warehouse is a safe place to work with data as the company grows.
An important and somewhat unexpected effect was a broader shift in awareness and behaviour. As teams became more familiar with what constitutes PII and PHI, and why it matters, the number of requests to expose sensitive data directly in BI models noticeably decreased. Conversations about reporting started earlier and with more context, leading to better design decisions.
In practice, this meant that reporting needs involving sensitive data were more often handled through appropriate channels: either via the warehouse with deliberately limited access, or through purpose-built operational views rather than direct exposure in the BI layer. The detection pipeline did not just protect the data; it helped establish new best practices around governance and data ownership.
Next steps
The current system has proven effective, but there are two obvious directions for improvement.
The first is better sampling. Today we rely on small, representative samples for content-level detection to keep scans fast and minimise data exposure. Over time, we plan to refine this by increasing sample sizes for higher-risk tables and using more targeted sampling for skewed or free-text–heavy columns. This should improve coverage without turning the scanner into a full table sweep.
The second is prevention. Right now, detection happens after data has already landed in the warehouse. A natural next step is to run a lightweight version of the same checks in the CI pipeline for dbt models. This would allow us to flag or block new PII and PHI before it is merged, reducing the need for downstream cleanup and keeping unsafe data out of the warehouse altogether.
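A CI version of the column-name check could run against dbt model files before merge and fail the build when a suspicious identifier appears. A rough sketch (the file scanning here is naive; a real check would work from dbt's manifest):

```python
import re
from pathlib import Path

# Illustrative subset of suspicious identifier patterns
SUSPICIOUS = [re.compile(p) for p in (r"e[-_]?mail", r"\bdob\b", r"patient_id")]

def find_suspicious_columns(sql_text: str) -> list:
    """Very rough: flag suspicious identifiers appearing in a model's SQL."""
    words = set(re.findall(r"[a-z_]+", sql_text.lower()))
    return sorted(w for w in words if any(p.search(w) for p in SUSPICIOUS))

def main(model_dir: str) -> int:
    """Return a non-zero exit code if any model contains a flagged identifier."""
    failures = {}
    for path in Path(model_dir).glob("**/*.sql"):
        hits = find_suspicious_columns(path.read_text())
        if hits:
            failures[str(path)] = hits
    for path, hits in failures.items():
        print(f"{path}: suspicious columns {hits}")
    return 1 if failures else 0
```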
Together, these changes shift the system from primarily reactive to increasingly preventive, while keeping the same local-first and review-driven approach.
Closing thoughts
This was not about adding another compliance tool. It was about building data safety into the warehouse in a way that scales.
By combining simple rules, local models and human review, we ended up with something practical, maintainable and appropriate for a healthcare domain. As Gaia continues to grow, this pipeline helps ensure that sensitive data stays where it should.