Increasing access to affordable fertility treatment involves handling a large amount of sensitive data. This ranges from financial and income information to detailed medical data recorded during fertility care. That data is essential to how Gaia operates and to improving outcomes for our members, but it also introduces compliance risk. Personally identifiable information (PII) and protected health information (PHI) can easily end up being exposed, particularly as data platforms grow and more teams rely on analytics. In this article, I’ll explain how we built an automated system at Gaia to detect PII and PHI in our data warehouse.
The problem
As Gaia scales, the number of systems feeding the data warehouse grows, and so does the volume of data. Some of these sources legitimately contain PII and PHI that are essential to our business. Other fields can contain PII or PHI unintentionally, particularly free-text fields, notes, and end-user communication. Manually reviewing models is time-consuming and prone to human error. At the same time, most off-the-shelf PII detection tools rely on hosted machine-learning APIs. Under HIPAA constraints, exposing raw data outside our environment was not an option. We needed a way to scan the entire data warehouse, flag potential PII and PHI, and surface it in a way that the data team could realistically act on.
What we built
We built a local-first detection pipeline that runs entirely inside our infrastructure.
The system combines a few simple ideas:
- Pattern-based column checks for obvious cases such as email addresses or phone numbers
- Local NLP models (https://huggingface.co/dslim/bert-base-NER) to catch less obvious PII in free-text columns
- Metadata-driven scanning so we only inspect content when needed
- A review workflow so humans can confirm or reject detections
All detection results are written back into the warehouse so they can be tracked over time.
How detection works
The process starts with table metadata. Schemas, table names and column names across the warehouse are scanned and anything suspicious is flagged. This alone catches more than you might expect and is cheap enough to run continuously.
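To make the metadata pass concrete, here is a minimal sketch of a name-based check. It is illustrative only: the `ColumnRef` type, the `SUSPICIOUS_TOKENS` list and the function names are assumptions made for this example, and the input is assumed to be rows already pulled from the warehouse's catalog.

```python
import re
from dataclasses import dataclass

# Hypothetical name fragments that hint at PII or PHI; a real list would be longer.
SUSPICIOUS_TOKENS = {"email", "dob", "phone", "ssn", "patient", "insurance", "address"}


@dataclass(frozen=True)
class ColumnRef:
    schema: str
    table: str
    column: str


def name_tokens(identifier: str) -> set[str]:
    """Split an identifier such as 'patient_id' or 'insuranceNumber' into lowercase tokens."""
    # Add a separator at camelCase boundaries, then split on anything non-alphanumeric.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", identifier)
    return {t for t in re.split(r"[^A-Za-z0-9]+", spaced.lower()) if t}


def flag_columns(columns):
    """Return (column, matched tokens) for every column whose names look suspicious."""
    flagged = []
    for col in columns:
        tokens = name_tokens(col.schema) | name_tokens(col.table) | name_tokens(col.column)
        hits = tokens & SUSPICIOUS_TOKENS
        if hits:
            flagged.append((col, sorted(hits)))
    return flagged


if __name__ == "__main__":
    catalog = [
        ColumnRef("analytics", "members", "email"),
        ColumnRef("analytics", "orders", "total_amount"),
        ColumnRef("clinic", "visits", "patientId"),
    ]
    for col, hits in flag_columns(catalog):
        print(f"{col.schema}.{col.table}.{col.column}: {hits}")
```

Because this only reads names and never touches row values, a check of this shape is cheap to rerun whenever the catalog changes.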
For content detection, we take small, controlled samples rather than scanning full tables. High-confidence cases are handled with simple pattern matching. For free-text columns, we run an open-source named-entity recognition model locally to detect potential PII or PHI.
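The sampling and matching steps can be sketched roughly as follows. The regular expressions, the sample size and the function names are assumptions chosen for illustration, not the exact rules used in production. The NER step uses the Hugging Face `transformers` pipeline with the `dslim/bert-base-NER` model, which is assumed to be already available on local disk so that nothing is fetched or sent at scan time.

```python
import random
import re

# Deliberately simple, illustrative patterns; real rules would be stricter.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def sample_values(values, k=100, seed=0):
    """Take a small random sample (at most k) of the non-empty values."""
    non_empty = [v for v in values if v]
    if len(non_empty) <= k:
        return non_empty
    return random.Random(seed).sample(non_empty, k)


def pattern_hit_rates(sample):
    """Fraction of sampled values that contain each pattern."""
    if not sample:
        return {name: 0.0 for name in PATTERNS}
    return {
        name: sum(1 for v in sample if rx.search(v)) / len(sample)
        for name, rx in PATTERNS.items()
    }


def ner_entity_types(sample, ner=None):
    """Return the set of entity groups (e.g. 'PER', 'LOC') found in free text.

    `ner` is any callable mapping text to a list of dicts with an
    'entity_group' key. If omitted, a local transformers pipeline is built.
    """
    if ner is None:
        # Imported lazily so the regex path works without transformers installed.
        from transformers import pipeline

        ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
    found = set()
    for text in sample:
        for entity in ner(text):
            found.add(entity["entity_group"])
    return found


if __name__ == "__main__":
    column_values = ["alice@example.com", "", "call me on +44 20 7946 0958", "no pii here"]
    sample = sample_values(column_values, k=10)
    print(pattern_hit_rates(sample))
```

Only aggregate results (hit rates and entity types) leave these functions, which keeps the matched text itself out of anything downstream.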
No raw values are persisted, and no data ever leaves our environment.
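One way to honour that constraint is to record only where something was found and how often, never what was found. The sketch below assumes a hypothetical `Finding` record and `build_finding` helper; the field names and the `pending` review status are illustrative, not the actual warehouse schema.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Callable, Iterable


@dataclass(frozen=True)
class Finding:
    """A detection result that carries counts and labels, but no sampled values."""
    schema: str
    table: str
    column: str
    detector: str       # e.g. "column_name", "regex", "ner"
    entity_type: str    # e.g. "email", "PER"
    sample_size: int
    match_count: int
    scanned_at: str     # ISO 8601 timestamp, UTC
    review_status: str = "pending"  # later confirmed or rejected by a reviewer


def build_finding(
    schema: str,
    table: str,
    column: str,
    detector: str,
    entity_type: str,
    sample: Iterable[str],
    matcher: Callable[[str], bool],
) -> Finding:
    """Summarise a sample into counts; the sampled values are discarded on return."""
    values = list(sample)
    return Finding(
        schema=schema,
        table=table,
        column=column,
        detector=detector,
        entity_type=entity_type,
        sample_size=len(values),
        match_count=sum(1 for v in values if matcher(v)),
        scanned_at=datetime.now(timezone.utc).isoformat(),
    )


if __name__ == "__main__":
    finding = build_finding(
        "analytics", "support_tickets", "body", "regex", "email",
        ["mail bob@example.com", "hello"],
        lambda v: "@" in v,
    )
    print(asdict(finding))  # a row ready to be written to a findings table
```

A row shaped like this can be written back to the warehouse and tracked over time, and the `review_status` field gives a human reviewer somewhere to confirm or reject the detection.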
Column-level PII detection
Column names alone catch a large percentage of potential issues. Fields like email, dob, patient_id, or insurance_number are often enough to warrant a closer look.