Abstract:We introduce SynGP500, a clinician-curated collection of 500 synthetic Australian general practice medical notes. The dataset integrates curriculum-based clinical breadth (RACGP 2022 Curriculum), epidemiologically-calibrated prevalence (BEACH study), and diverse consultation contexts. This approach systematically includes both common presentations and less-common curriculum-specified conditions that GPs must recognize but appear infrequently in single practice populations, potentially supporting more generalizable model training than datasets constrained by naturally occurring case distributions. SynGP500 is messy by design, reflecting the authentic complexity of hea…
Abstract:We introduce SynGP500, a clinician-curated collection of 500 synthetic Australian general practice medical notes. The dataset integrates curriculum-based clinical breadth (RACGP 2022 Curriculum), epidemiologically-calibrated prevalence (BEACH study), and diverse consultation contexts. This approach systematically includes both common presentations and less-common curriculum-specified conditions that GPs must recognize but appear infrequently in single practice populations, potentially supporting more generalizable model training than datasets constrained by naturally occurring case distributions. SynGP500 is messy by design, reflecting the authentic complexity of healthcare delivery: telegraphic documentation, typos, patient non-adherence, socioeconomic barriers, and clinician-patient disagreements, unlike sanitized synthetic datasets that obscure clinical realities. Multi-faceted validation demonstrates dataset quality through epidemiological alignment with real Australian GP consultation patterns (BEACH study), stylometric analysis confirming high linguistic variation, semantic diversity analysis demonstrating broad coverage, and exploratory downstream evaluation using self-supervised medical concept extraction, showing F1 improvements. SynGP500 addresses a critical national gap, providing researchers and educators with a resource for developing and evaluating clinical NLP methods for Australian general practice while inherently protecting patient privacy.
| Comments: | 16 pages, 2 figures |
| Subjects: | Computation and Language (cs.CL) |
| Cite as: | arXiv:2512.15259 [cs.CL] |
| (or arXiv:2512.15259v1 [cs.CL] for this version) | |
| https://doi.org/10.48550/arXiv.2512.15259 arXiv-issued DOI via DataCite (pending registration) |
Submission history
From: Piyawoot Songsiritat [view email] [v1] Wed, 17 Dec 2025 10:04:36 UTC (1,243 KB)