Synth Studio 🧪
Privacy-first synthetic data generation for healthcare and fintech
⚡ Quick Install
# Clone
git clone https://github.com/Urz1/synthetic-data-studio.git && cd synthetic-data-studio
# Backend
cd backend && cp .env.example .env
pip install -r requirements.txt && alembic upgrade head
uvicorn app.main:app --reload
# Frontend (new terminal)
cd frontend && cp .env.local.example .env.local
pnpm install && pnpm dev
Frontend: http://localhost:3000 | API Docs: http://localhost:8000/docs
Full setup guide: LOCAL_DEVELOPMENT.md
🎯 What It Does
Generate high-quality synthetic data with differential privacy guarantees. Built for regulated industries:
| Industry | Use Case |
|---|---|
| 🏥 Healthcare (HIPAA) | Synthetic EHR, FHIR, patient records |
| 🏦 Fintech (SOC-2/GDPR) | Transaction data, fraud testing |
| 🤖 ML Teams | Privacy-safe training datasets |
| 🏢 Enterprise | Cross-department data sharing |
✨ Key Features
Generation Methods
| Method | Description | Best For |
|---|---|---|
| Schema-Based | Define columns → generate data (no source dataset needed) | Testing, prototyping |
| Dataset-Based ML | Train on real data → generate synthetic | Production quality |
| LLM-Powered Seed | AI generates realistic seed data → statistical expansion | Domain-specific realism |
ML Generators
- CTGAN - Conditional Tabular GAN (mixed numeric + categorical)
- TVAE - Tabular Variational Autoencoder (high-cardinality categorical)
- GaussianCopula - Statistical copulas (fast, correlation-preserving)
Privacy & Compliance
- Differential Privacy - Configurable ε/δ with RDP accounting
- PII/PHI Detection - Automatic sensitive column identification
- Compliance Reports - HIPAA, GDPR, SOC-2 ready documentation
- Audit Logs - Immutable activity tracking
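As a rough illustration of what RDP accounting does with the ε/δ settings, the sketch below composes plain Gaussian mechanisms and converts the accumulated Rényi budget to an (ε, δ) guarantee. It is a minimal sketch, not Synth Studio's accountant: it assumes sensitivity 1 and no subsampling, so it reports a looser ε than a subsampled accountant would, and the function names are hypothetical.

```python
import math


def gaussian_rdp(noise_multiplier: float, steps: int, alpha: float) -> float:
    """RDP of order alpha after `steps` Gaussian mechanisms (sensitivity 1)."""
    # One Gaussian mechanism with std = noise_multiplier satisfies
    # (alpha, alpha / (2 * sigma^2))-RDP, and RDP composes additively.
    return steps * alpha / (2.0 * noise_multiplier ** 2)


def rdp_to_epsilon(noise_multiplier: float, steps: int, delta: float,
                   alphas=None) -> float:
    """Convert the accumulated RDP budget to an (epsilon, delta) guarantee."""
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must be in (0, 1)")
    if alphas is None:
        # Candidate Renyi orders 1.1, 1.2, ..., 100.9 (an arbitrary grid).
        alphas = [1.0 + x / 10.0 for x in range(1, 1000)]
    # Standard conversion: eps = rdp(alpha) + log(1/delta) / (alpha - 1),
    # minimised over the candidate orders.
    return min(
        gaussian_rdp(noise_multiplier, steps, a) + math.log(1.0 / delta) / (a - 1.0)
        for a in alphas
    )


if __name__ == "__main__":
    eps = rdp_to_epsilon(noise_multiplier=4.0, steps=10, delta=1e-5)
    print(f"epsilon spent after 10 steps: {eps:.3f}")
```

More steps or a smaller noise multiplier both increase the reported ε, which is the trade-off a privacy budget tracks.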
AI-Powered Features
- Chat Assistant - Natural language data generation guidance
- Enhanced PII Detection - LLM-powered sensitivity analysis
- Compliance Writer - Auto-generate compliance documentation
Quality Evaluation
- Statistical Similarity - Distribution matching, K-S tests
- ML Utility - Train/test accuracy preservation
- Privacy Risk - Membership inference, re-identification risk
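One way to read the statistical similarity check: the two-sample K-S statistic is the largest gap between the empirical CDFs of a real column and its synthetic counterpart (0 means identical distributions, 1 means no overlap). The pure-Python sketch below is illustrative only; the function name and sample values are made up, and the project's evaluator may compute this differently.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    if not real or not synthetic:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(real), sorted(synthetic)
    # Both empirical CDFs are step functions that only change at observed
    # values, so the largest gap is reached at one of those values.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


if __name__ == "__main__":
    real_ages = [23, 35, 41, 52, 67, 29, 38]
    synthetic_ages = [25, 33, 44, 50, 70, 31, 36]
    d = ks_statistic(real_ages, synthetic_ages)
    # A common convention reports 1 - D so that higher means more similar.
    print(f"K-S statistic: {d:.3f}, similarity: {1 - d:.3f}")
```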
Prerequisites
| Requirement | Version |
|---|---|
| Python | 3.9+ |
| Node.js | 18+ |
| PostgreSQL | 13+ |
| Redis | 7+ (local Docker by default; set REDIS_URL for managed) |
Environment Variables:
# Backend (.env)
DATABASE_URL=postgresql://user:pass@localhost/synthstudio
SECRET_KEY=your-jwt-secret
AWS_S3_BUCKET=your-bucket # optional
REDIS_URL=redis://localhost:6379/0 # default local container; use rediss:// for hosted
# Frontend (.env.local)
NEXT_PUBLIC_API_URL=http://localhost:8000
BETTER_AUTH_SECRET=your-auth-secret
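A backend process might read these settings roughly as sketched below. This is not the project's config loader: `load_backend_settings` is a hypothetical helper, and treating DATABASE_URL and SECRET_KEY as required is an assumption based on the listing above.

```python
import os


def load_backend_settings(env=None):
    """Read backend settings from the environment, failing fast on missing keys."""
    env = os.environ if env is None else env
    # Assumption: these two keys are mandatory for the API to start.
    missing = [key for key in ("DATABASE_URL", "SECRET_KEY") if not env.get(key)]
    if missing:
        raise RuntimeError(f"missing required settings: {', '.join(missing)}")
    return {
        "database_url": env["DATABASE_URL"],
        "secret_key": env["SECRET_KEY"],
        # Marked optional in the listing above; None when unset.
        "s3_bucket": env.get("AWS_S3_BUCKET") or None,
        # Falls back to the local default from the listing above.
        "redis_url": env.get("REDIS_URL", "redis://localhost:6379/0"),
    }


if __name__ == "__main__":
    example_env = {
        "DATABASE_URL": "postgresql://user:pass@localhost/synthstudio",
        "SECRET_KEY": "your-jwt-secret",
    }
    print(load_backend_settings(example_env))
```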
🔧 Usage
Generate from Schema (No Dataset Needed)
curl -X POST "http://localhost:8000/generators/schema" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"columns": {
"name": {"type": "string", "faker": "name"},
"age": {"type": "integer", "min": 18, "max": 80},
"email": {"type": "string", "faker": "email"},
"balance": {"type": "number", "min": 0, "max": 50000}
}
}'
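To make the column spec concrete, here is a small local sketch of how such a schema could be turned into rows. It is not the server's implementation: `generate_value` and `generate_rows` are hypothetical names, only the integer, number, category, and date column types used in this README's examples are handled, and Faker-backed string columns are left out to keep it dependency-free.

```python
import random
from datetime import date, timedelta


def generate_value(spec, rng):
    """Draw one value for a column spec such as {"type": "integer", "min": 18, "max": 80}."""
    kind = spec["type"]
    if kind == "integer":
        return rng.randint(spec["min"], spec["max"])  # inclusive on both ends
    if kind == "number":
        return round(rng.uniform(spec["min"], spec["max"]), 2)
    if kind == "category":
        return rng.choice(spec["values"])
    if kind == "date":
        start = date.fromisoformat(spec["min"])
        end = date.fromisoformat(spec["max"])
        offset = rng.randint(0, (end - start).days)
        return (start + timedelta(days=offset)).isoformat()
    raise ValueError(f"unsupported column type: {kind}")


def generate_rows(columns, num_rows, seed=None):
    """Generate `num_rows` dicts with one key per column in the schema."""
    rng = random.Random(seed)  # a fixed seed makes the output reproducible
    return [
        {name: generate_value(spec, rng) for name, spec in columns.items()}
        for _ in range(num_rows)
    ]


if __name__ == "__main__":
    schema = {
        "age": {"type": "integer", "min": 18, "max": 80},
        "balance": {"type": "number", "min": 0, "max": 50000},
    }
    for row in generate_rows(schema, num_rows=3, seed=42):
        print(row)
```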
Generate from Dataset (ML-Based)
# Upload dataset
curl -X POST "http://localhost:8000/datasets/upload" \
-H "Authorization: Bearer $TOKEN" \
-F "file=@data.csv"
# Generate synthetic data with DP (replace {dataset_id} with your dataset's ID)
curl -X POST "http://localhost:8000/generators/dataset/{dataset_id}/generate" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "generator_type": "ctgan",
    "num_rows": 10000,
    "epochs": 300,
    "differential_privacy": {"enabled": true, "epsilon": 1.0, "delta": 1e-5}
  }'
Python SDK Example
import requests

BASE_URL = "http://localhost:8000"

# Log in; the Session carries over any cookies set by the server
session = requests.Session()
session.post(f"{BASE_URL}/auth/login", json={
    "email": "user@example.com", "password": "secret"
}).raise_for_status()

# Schema-based generation (requests needs an absolute URL)
response = session.post(f"{BASE_URL}/generators/schema?num_rows=1000", json={
    "columns": {
        "patient_id": {"type": "string", "pattern": "PAT-[0-9]{6}"},
        "diagnosis": {"type": "category", "values": ["A01", "B12", "C34"]},
        "visit_date": {"type": "date", "min": "2024-01-01", "max": "2024-12-31"}
    }
})
response.raise_for_status()
synth_data = response.json()
🧪 Testing
# Backend tests with coverage
cd backend && pytest tests/ -v --cov=app
# Frontend tests
cd frontend && pnpm test
# E2E tests
cd frontend && pnpm test:e2e
Project Structure
synth-studio/
├── backend/                      # FastAPI API server
│   ├── app/
│   │   ├── auth/                 # Authentication (JWT, OAuth, 2FA)
│   │   ├── datasets/             # Dataset upload, profiling
│   │   ├── generators/           # Schema + ML generation
│   │   ├── evaluations/          # Quality metrics
│   │   ├── services/
│   │   │   ├── synthesis/        # CTGAN, TVAE, Copula
│   │   │   ├── llm/              # AI chat, PII detection
│   │   │   └── privacy/          # DP accounting
│   │   ├── compliance/           # HIPAA/GDPR reports
│   │   └── audit/                # Activity logging
│   └── tests/
├── frontend/                     # Next.js 16 web app
│   ├── app/
│   │   ├── dashboard/            # Overview & metrics
│   │   ├── datasets/             # Upload & profile
│   │   ├── generators/           # Create & manage
│   │   ├── evaluations/          # Quality reports
│   │   ├── synthetic-datasets/   # Generated data
│   │   ├── compliance/           # Compliance center
│   │   └── assistant/            # AI chat
│   └── components/
└── docs/                         # Docusaurus docs
Documentation
| Resource | Description |
|---|---|
| Docs Site | Full documentation |
| Getting Started | Installation & quickstart |
| User Guide | Feature walkthroughs |
| API Reference | OpenAPI/Swagger |
| Examples | Code samples & Postman |
🤝 Contributing
- Fork & clone
- Create feature branch (`git checkout -b feature/amazing`)
- Add tests & make changes
- Run tests (`pytest` / `pnpm test`)
- Submit PR
See CONTRIBUTING.md for guidelines.
Security
Report vulnerabilities privately: halisadam391@gmail.com or see SECURITY.md.
License
MIT © 2025 Sadam Husen
Contact
Sadam Husen | @Urz1 | halisadam391@gmail.com
🏗️ Architecture
┌─────────────────────────────────────────────────────┐
│                Frontend (Next.js 16)                │
│   Dashboard • Datasets • Generators • Evaluations   │
│     Compliance • Audit • Billing • AI Assistant     │
└────────────────────────┬────────────────────────────┘
                         │ REST API (JWT + OAuth)
┌────────────────────────▼────────────────────────────┐
│                  Backend (FastAPI)                  │
│     ┌──────────┐    ┌──────────┐    ┌──────────┐    │
│     │   Auth   │    │ Datasets │    │Generators│    │
│     │JWT/OAuth │    │Profiling │    │Schema/ML │    │
│     └──────────┘    └──────────┘    └──────────┘    │
│     ┌──────────┐    ┌──────────┐    ┌──────────┐    │
│     │   LLM    │    │Evaluation│    │Compliance│    │
│     │ Chat/PII │    │ Quality  │    │ Reports  │    │
│     └──────────┘    └──────────┘    └──────────┘    │
└───────┬───────────────┬────────────────┬────────────┘
        ▼               ▼                ▼
   PostgreSQL         Redis           AWS S3
   (metadata)     (queue/cache)       (files)
                        │
                        ▼
                 Celery Workers
        (generation, evaluation, exports)
Tech Stack:
- Frontend: Next.js 16, React 19, TypeScript 5, Tailwind, shadcn/ui
- Backend: FastAPI, SQLAlchemy 2, Celery, SDV
- ML/Privacy: CTGAN, TVAE, Opacus (DP), RDP accounting
- LLM: OpenAI/Anthropic (chat, PII detection, compliance)
- Infra: Vercel, Railway/AWS, Neon/Supabase
Complete Feature List
Data Generation
- Schema-based generation (no training data required)
- Dataset-based ML generation (CTGAN, TVAE, GaussianCopula)
- LLM-powered seed data generation
- Differential privacy with configurable ε/δ
- DP parameter validation & recommendations
- Model download & export
Data Management
- CSV upload with auto-profiling
- Schema detection & type inference
- PII/PHI column detection
- Distribution analysis & statistics
- Correlation matrices
- Missing value analysis
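A rule-based pass over column values is one simple way to flag likely PII columns before any LLM-based analysis. The sketch below is hypothetical: the two patterns, the 0.8 threshold, and the `detect_pii_columns` name are assumptions chosen for illustration, and real detectors cover far more identifier types.

```python
import re

# Hypothetical patterns for illustration only.
PII_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}


def detect_pii_columns(rows, threshold=0.8):
    """Map column name -> PII type when at least `threshold` of non-empty values match."""
    if not rows:
        return {}
    flagged = {}
    for column in rows[0]:
        values = [str(r[column]) for r in rows if r.get(column) not in (None, "")]
        if not values:
            continue
        for pii_type, pattern in PII_PATTERNS.items():
            hits = sum(1 for value in values if pattern.match(value))
            if hits / len(values) >= threshold:
                flagged[column] = pii_type
                break  # the first matching type wins
    return flagged


if __name__ == "__main__":
    sample = [
        {"contact": "ana@example.com", "ssn": "123-45-6789", "age": 34},
        {"contact": "bo@example.org", "ssn": "987-65-4321", "age": 51},
    ]
    print(detect_pii_columns(sample))
```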
Quality & Privacy
- Statistical similarity scoring
- ML utility evaluation (classification/regression)
- Privacy risk assessment
- Membership inference testing
- k-anonymity checks
- Privacy budget tracking
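For reference, a k-anonymity check can be sketched in a few lines: group rows by their quasi-identifier values and report the size of the smallest group. The function name and sample data are illustrative assumptions, not the project's evaluator.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return k, the size of the smallest group sharing the same quasi-identifier values."""
    if not rows:
        raise ValueError("rows must be non-empty")
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())


if __name__ == "__main__":
    records = [
        {"zip": "94110", "age_band": "30-39", "diagnosis": "A01"},
        {"zip": "94110", "age_band": "30-39", "diagnosis": "B12"},
        {"zip": "10001", "age_band": "40-49", "diagnosis": "C34"},
    ]
    k = k_anonymity(records, ["zip", "age_band"])
    print(f"dataset is {k}-anonymous over zip + age_band")
```

A result of 1 means at least one record is unique on the chosen quasi-identifiers and is therefore easier to re-identify.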
AI Assistant
- Natural language queries
- Context-aware recommendations
- Code generation for API usage
- Error debugging
- Compliance guidance
Enterprise
- HIPAA/GDPR/SOC-2 compliance reports
- Immutable audit logs
- Usage & billing dashboards
- Role-based access control
- OAuth (Google, GitHub)
- Two-factor authentication
🗺️ Roadmap
- FHIR/HL7 medical data formats
- Time-series synthetic data
- Enterprise SSO (SAML 2.0)
- Python & JavaScript SDKs
- Self-hosted Docker templates
- Real-time streaming generation
See CHANGELOG.md for version history.