Scaling Data Safety at Gaia: Detecting PII and PHI in the Data Warehouse

Increasing access to affordable fertility treatment involves handling a large amount of sensitive data. This ranges from financial and income information to detailed medical data recorded during fertility care. That data is essential to how Gaia operates and to improving outcomes for our members, but it also introduces compliance risk. Personally identifiable information (PII) and protected health information (PHI) can easily end up being exposed, particularly as data platforms grow and more teams rely on analytics. In this article, I’ll explain how we built an automated system at Gaia to detect PII and PHI in our data warehouse.

The problem

As Gaia scales, the number of systems feeding the data warehouse grows, and so does the volume of data. Some of these sources quite legitimately contain PII and PHI that is essential for our business. Other fields can contain PII or PHI unintentionally, particularly free-text fields, notes, and end-user communication. Manually reviewing models is time-consuming and prone to human error. At the same time, most off-the-shelf PII detection tools rely on hosted machine-learning APIs. Under HIPAA constraints, exposing raw data outside our environment was not an option. We needed a way to scan the entire data warehouse, flag potential PII and PHI, and surface it in a way that the data team could realistically act on.

What we built

We built a local-first detection pipeline that runs entirely inside our infrastructure.

The system combines a few simple ideas:

  • Pattern-based column checks for obvious cases such as email addresses, phone numbers, or other identifiers
  • Local NLP models (https://huggingface.co/dslim/bert-base-NER) to catch less obvious PII in free-text columns
  • Metadata-driven scanning so we only inspect content when needed
  • A review workflow so humans can confirm or reject detections

All detection results are written back into the warehouse so they can be tracked over time.
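
To make the review workflow and the write-back step more concrete, here is a minimal sketch of what a detection record could look like. It is an illustration, not Gaia's actual schema: the `Detection` class, its field names, the `ReviewStatus` values, and the `to_row` helper are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class ReviewStatus(str, Enum):
    # Hypothetical states for the human review step.
    PENDING = "pending"
    CONFIRMED = "confirmed"
    REJECTED = "rejected"


@dataclass
class Detection:
    # Hypothetical record shape; only metadata is stored, never sampled values.
    table_name: str
    column_name: str
    detector: str  # e.g. "column_name", "pattern", or "ner"
    label: str     # e.g. "email" or "PER"
    status: ReviewStatus = ReviewStatus.PENDING
    detected_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    reviewed_by: Optional[str] = None

    def review(self, reviewer: str, confirmed: bool) -> None:
        """Record a human decision on this detection."""
        self.status = (
            ReviewStatus.CONFIRMED if confirmed else ReviewStatus.REJECTED
        )
        self.reviewed_by = reviewer

    def to_row(self) -> dict:
        """Flatten the record into a dict suitable for a warehouse insert."""
        return {
            "table_name": self.table_name,
            "column_name": self.column_name,
            "detector": self.detector,
            "label": self.label,
            "status": self.status.value,
            "detected_at": self.detected_at.isoformat(),
            "reviewed_by": self.reviewed_by,
        }
```

Keeping a timestamp and a review status on each row is one way to track detections over time: a rejected detection can stay in the table so the same column is not flagged for review again on every run.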

How detection works

The process starts with table metadata. Schemas, table names and column names across the warehouse are scanned and anything suspicious is flagged. This alone catches more than you might expect and is cheap enough to run continuously.

For content detection, we take small, controlled samples rather than scanning full tables. High-confidence cases are handled with simple pattern matching. For free-text columns, we run an open-source named-entity recognition model locally to detect potential PII or PHI.
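
As a rough illustration of the pattern-matching step, the sketch below takes a small sample of column values and reports the share of sampled values that match each pattern. The regular expressions, the sample size, and the function names are assumptions for this example; they are deliberately simple and would need tuning against real data.

```python
import random
import re

# Illustrative patterns only; real-world email and phone formats vary widely.
CONTENT_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(
        r"(?<!\d)\+?\d{1,3}[\s.-]?\(?\d{2,4}\)?[\s.-]?\d{3,4}[\s.-]?\d{3,4}(?!\d)"
    ),
}


def take_sample(values, k=100, seed=0):
    """Return at most k values, sampled reproducibly for a given seed."""
    values = list(values)
    if len(values) <= k:
        return values
    return random.Random(seed).sample(values, k)


def pattern_hit_rates(sampled_values):
    """Return, per pattern, the fraction of non-null values that match it."""
    non_null = [str(v) for v in sampled_values if v is not None]
    if not non_null:
        return {label: 0.0 for label in CONTENT_PATTERNS}
    return {
        label: sum(1 for v in non_null if pattern.search(v)) / len(non_null)
        for label, pattern in CONTENT_PATTERNS.items()
    }
```

Reporting a hit rate rather than the matching values means a column can be flagged, for example when more than a chosen fraction of the sample matches, without the matched text being stored anywhere.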

No raw values are persisted, and no data ever leaves our environment.
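
The free-text step could look something like the sketch below, which runs a named-entity recognition pipeline over sampled text and keeps only aggregate counts per entity type. It assumes the Hugging Face `transformers` package and the `dslim/bert-base-NER` model linked earlier; the function names, the `min_score` threshold, and the shape of the summary are assumptions for this example. Note that this particular model labels people, organisations, locations, and miscellaneous entities (`PER`, `ORG`, `LOC`, `MISC`), so it covers names and places rather than medical concepts.

```python
from collections import Counter


def load_local_ner():
    """Build the NER pipeline.

    Requires the transformers package. By default the model weights are
    fetched from the Hugging Face Hub on first use; passing a path to a
    pre-downloaded copy as `model` avoids that network call. Inference
    itself runs locally either way.
    """
    from transformers import pipeline

    return pipeline(
        "ner",
        model="dslim/bert-base-NER",
        aggregation_strategy="simple",
    )


def summarize_entities(texts, ner, min_score=0.8):
    """Count detected entity types without keeping any of the matched text.

    `ner` is any callable that returns a list of dicts with at least
    "entity_group" and "score" keys, which is what the pipeline above
    produces. `min_score` is an assumed confidence cut-off.
    """
    counts = Counter()
    rows_with_entities = 0
    for text in texts:
        entities = [e for e in ner(text) if e["score"] >= min_score]
        if entities:
            rows_with_entities += 1
        counts.update(e["entity_group"] for e in entities)
    return {
        "rows_scanned": len(texts),
        "rows_with_entities": rows_with_entities,
        "entity_counts": dict(counts),
    }
```

A call such as `summarize_entities(sampled_notes, load_local_ner())`, where `sampled_notes` is a hypothetical list of sampled strings, returns only counts, so the summary can be written to the warehouse while the sampled text is discarded.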

Column-level PII detection

Column names alone catch a large percentage of potential issues. Fields like email, dob, patient_id, or insurance_number are often enough to warrant a closer look.
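
A column-name check of this kind can be sketched in a few lines. The patterns and labels below are hypothetical and only cover the field names mentioned above plus phone numbers; a real list would be tuned to the naming conventions of the warehouse.

```python
import re

# Hypothetical name patterns; each matches a whole underscore-delimited token
# so that, for example, "mailing_list" is not flagged as an email column.
SUSPICIOUS_NAME_PATTERNS = {
    "email": re.compile(r"(^|_)e?mail($|_)"),
    "date_of_birth": re.compile(r"(^|_)(dob|date_of_birth|birth_?date)($|_)"),
    "patient_identifier": re.compile(r"(^|_)patient_id($|_)"),
    "insurance": re.compile(r"(^|_)insurance($|_)"),
    "phone": re.compile(r"(^|_)(phone|mobile)($|_)"),
}


def flag_column_name(column_name):
    """Return the labels whose pattern matches the normalised column name."""
    normalized = column_name.strip().lower()
    return [
        label
        for label, pattern in SUSPICIOUS_NAME_PATTERNS.items()
        if pattern.search(normalized)
    ]
```

Because this only reads column names, it can run against warehouse metadata without touching any row data, which is what makes it cheap enough to run often.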
