I work in healthcare, and one thing that has always slowed us down is getting data into lower environments. You can’t just copy production data: there are privacy issues and compliance approvals, and most of it is protected under HIPAA. Usually, we end up creating CSV files by hand just to test pipelines or dashboards. But that data never feels real: the relationships don’t make sense, and nothing connects properly. That’s where I got the idea for Syda, a small project to generate realistic, connected data without ever touching production.
Syda is simple. You define your schema (basically, what your tables and columns look like) and it generates fake data automatically. But it doesn’t just throw out random values. It maintains relationships between tables, respects foreign keys, and keeps everything consistent. It’s like having your own little mock database with believable data, ready for testing or demos.
Here’s a small example. Let’s say I want to test an app that handles members and claims. With just a few lines of code, I can generate the data I need.
Create a .env file with the API key for the AI model provider you want to use (you only need one of the keys below)
.env
`ANTHROPIC_API_KEY=your_anthropic_api_key_here
OR
OPENAI_API_KEY=your_openai_api_key_here
OR
GEMINI_API_KEY=your_gemini_api_key_here`
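Since only one of these keys needs to be set, it can help to check which one is actually present before configuring the model. Here is a minimal sketch using only the standard library. `detect_provider` and `KEY_TO_PROVIDER` are hypothetical names of mine, not part of Syda, and the provider labels other than "anthropic" are assumptions.

```python
import os

# Hypothetical mapping from environment variable to a provider label.
# Only "anthropic" appears as a provider string in this post; the other
# two labels are assumptions for illustration.
KEY_TO_PROVIDER = {
    "ANTHROPIC_API_KEY": "anthropic",
    "OPENAI_API_KEY": "openai",
    "GEMINI_API_KEY": "gemini",
}


def detect_provider(env=None):
    """Return the label of the first provider whose API key is set, else None."""
    env = os.environ if env is None else env
    for key, provider in KEY_TO_PROVIDER.items():
        # Treat an empty value the same as a missing key.
        if env.get(key):
            return provider
    return None
```

Calling `detect_provider()` after `load_dotenv()` checks the real environment. Passing a plain dict makes the helper easy to test.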
Define your schemas
schemas = {
    "Member": {
        "__table_description__": "Member details",
        "id": {"type": "int", "primary_key": True},
        "name": {"type": "string"},
        "age": {"type": "int"},
        "gender": {"type": "string"},
    },
    "Claim": {
        "__table_description__": "Claim details",
        # member_id in Claim points at the id column of Member
        "__foreign_keys__": {"member_id": ["Member", "id"]},
        "id": {"type": "int", "primary_key": True},
        "member_id": {"type": "foreign_key"},
        "diagnosis_code": {"type": "string"},
        "billed_amount": {"type": "float"},
        "status": {"type": "string"},
        "claim_notes": {"type": "string"},
    },
}
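A typo in a `__foreign_keys__` entry is easy to make in a hand-written schema dict, so a quick check before generating can save a run. The sketch below assumes the schema layout shown in this post, where each foreign key maps a column to a `[table, column]` pair. `check_foreign_keys` is a hypothetical helper, not a Syda function.

```python
def check_foreign_keys(schemas):
    """Return a list of problems found in the __foreign_keys__ entries."""
    problems = []
    for table, spec in schemas.items():
        for column, target in spec.get("__foreign_keys__", {}).items():
            target_table, target_column = target
            # The referencing column must be declared in its own table.
            if column not in spec:
                problems.append(f"{table}.{column} is not defined as a column")
            # The referenced table and column must exist as well.
            if target_table not in schemas:
                problems.append(
                    f"{table}.{column} points at unknown table {target_table}"
                )
            elif target_column not in schemas[target_table]:
                problems.append(
                    f"{table}.{column} points at unknown column "
                    f"{target_table}.{target_column}"
                )
    return problems


# Small example with one deliberate mistake: "Membr" instead of "Member".
example = {
    "Member": {"id": {"type": "int", "primary_key": True}},
    "Claim": {
        "__foreign_keys__": {"member_id": ["Membr", "id"]},
        "member_id": {"type": "foreign_key"},
    },
}
issues = check_foreign_keys(example)
```

An empty list means every foreign key resolves to a declared table and column.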
Configure the AI model. Syda currently supports OpenAI, Anthropic (Claude), and Google Gemini models
import os

from dotenv import load_dotenv
from syda.generate import SyntheticDataGenerator
from syda.schemas import ModelConfig

# Load the variables defined in .env into the environment
load_dotenv()

model_config = ModelConfig(
    provider="anthropic",
    model_name="claude-3-5-haiku-20241022",
)
gen = SyntheticDataGenerator(model_config=model_config)
Define your prompts, sample sizes, and output directory, then generate the data
results = gen.generate_for_schemas(
    schemas=schemas,
    sample_sizes={"Member": 5, "Claim": 10},
    prompts={
        "Member": "Generate realistic member data for health insurance industry",
        "Claim": "Generate realistic claims data for health insurance industry",
    },
    output_dir="output",
)
Once you run it, Syda creates two CSVs, one for members and one for claims. The best part is that every claim automatically links to a valid member and even includes realistic claim notes that look like something an adjuster might write.
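If you want to confirm the links yourself, a few lines of standard-library Python are enough. This is a rough sketch: `orphan_rows` is a hypothetical helper, and the exact CSV file names Syda writes are not stated in this post, so the paths are left as parameters.

```python
import csv


def orphan_rows(parent_csv, child_csv, parent_key="id", child_key="member_id"):
    """Return child rows whose foreign key value is missing from the parent CSV.

    Values are compared as the raw strings found in the files, so "1" and
    "1.0" would count as different ids.
    """
    with open(parent_csv, newline="") as f:
        parent_ids = {row[parent_key] for row in csv.DictReader(f)}
    with open(child_csv, newline="") as f:
        return [row for row in csv.DictReader(f) if row[child_key] not in parent_ids]
```

For the example above, you would pass the member file as `parent_csv` and the claim file as `child_csv`. An empty list means every claim points at an existing member.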
Now I can load this data directly into a database or a test environment, with no waiting for masked data and no compliance headaches.
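As one way to do that loading step, here is a minimal sketch that puts a generated CSV into SQLite using only the standard library. `load_csv` is a hypothetical helper, not part of Syda. It assumes the CSV has a header row, and it stores every value as text. Table names and CSV headers are interpolated into the SQL, so only use it with names you trust.

```python
import csv
import sqlite3


def load_csv(conn, table, csv_path):
    """Create `table` from the CSV header, insert every row, return the row count."""
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        columns = ", ".join(f'"{name}"' for name in header)
        placeholders = ", ".join("?" for _ in header)
        # Columns are created without types, so values are stored as text.
        conn.execute(f'CREATE TABLE "{table}" ({columns})')
        conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    conn.commit()
    return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
```

Opening a connection with `sqlite3.connect(":memory:")` gives you a throwaway database for tests. Pass a file path instead to keep the data around.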
For me, this small automation has saved a lot of time. And it’s not just for healthcare. Syda works for any project that needs connected, meaningful, and safe data. Finance, retail, logistics: anywhere you have multiple tables that need to talk to each other, Syda can help generate realistic test data that actually makes sense.
If you’ve ever struggled to find proper test data in lower environments, I hope Syda makes your day a little easier. It started as a small weekend idea, but it’s growing into something I use every week to test, demo, and prototype faster without touching production data. If this kind of tool sounds useful, try it out, give it a star, or suggest improvements. Every bit of feedback helps make it better for everyone.
Syda Resources
GitHub: https://github.com/syda-ai/syda/tree/main
PyPI: https://pypi.org/project/syda/
Documentation: https://python.syda.ai/