Bin Yu writes:
Veridical (truthful) Data Science (VDS) is a new paradigm for data science, developed over the last decade by me and my team through a creative and grounded synthesis and expansion of best practices and ideas in machine learning and statistics. It is based on three fundamental principles of data science: predictability, computability, and stability (PCS). These principles integrate ML and statistics and significantly expand the traditional statistical notion of uncertainty, from sample-to-sample variability alone to uncertainties arising from data cleaning and algorithm choices, among other human judgment calls. VDS aims to meet the challenges of the reproducibility crisis and to arrive at responsible and reliable data analysis and decision-making by fully accounting for sources of uncertainty and insisting on reality checks throughout the data science life cycle.
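To make the stability principle concrete, here is a minimal sketch (my illustration, not code from the book or the PCS papers): refit the same regression under a few plausible data-cleaning and algorithm choices, then look at how much the conclusion moves across those judgment calls.

```python
# A minimal PCS-style stability sketch (illustration only, assuming a
# simple linear-regression setting): vary the data-cleaning rule and
# the fitting algorithm, and report the spread of slope estimates.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=1.0, size=200)
y[:5] += 15  # a few gross outliers, as messy raw data might contain

def clean_drop_outliers(x, y, z=3.0):
    # cleaning choice 1: drop points far from the median response
    keep = np.abs(y - np.median(y)) < z * np.std(y)
    return x[keep], y[keep]

def clean_winsorize(x, y, q=0.99):
    # cleaning choice 2: clip extreme responses instead of dropping them
    hi, lo = np.quantile(y, q), np.quantile(y, 1 - q)
    return x, np.clip(y, lo, hi)

def fit_ols(x, y):
    # algorithm choice 1: ordinary least squares (slope of degree-1 fit)
    return np.polyfit(x, y, 1)[0]

def fit_robust(x, y, iters=50):
    # algorithm choice 2: crude iteratively reweighted (L1-like) slope,
    # no intercept, downweighting large residuals
    w = np.ones_like(y)
    slope = 0.0
    for _ in range(iters):
        slope = np.sum(w * x * y) / np.sum(w * x * x)
        w = 1.0 / (np.abs(y - slope * x) + 1e-6)
    return slope

slopes = []
for clean in (clean_drop_outliers, clean_winsorize):
    for fit in (fit_ols, fit_robust):
        xc, yc = clean(x, y)
        slopes.append(fit(xc, yc))

print("slope estimates across judgment calls:", np.round(slopes, 2))
print("spread:", round(max(slopes) - min(slopes), 2))
```

If the spread across these perturbations is small relative to the effect size, the conclusion is stable to these particular human judgment calls; a large spread is itself a finding worth reporting.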
My Veridical Data Science (VDS) book with my former student Rebecca Barter was published by the MIT Press in 2024 in their machine learning series, and we have a free online version at vdsbook.com. It is a very accessible book aimed at training critical thinking, mainly through narratives (not much math) and case studies, and is designed for upper-division and beginning graduate students and domain experts alike.
A very positive book review, written by Yuval and Yoav Benjamini, has been published in the Harvard Data Science Review. I am attaching some slides from my talk about the book. The book took us nine years from start to publication, with many versions written as the PCS framework for VDS evolved from the early 2010s onward, through collaborative research projects in genomics, neuroscience, and precision medicine, and through teaching Stat 215A at Berkeley (a first-year core PhD statistics course called Applied Statistics and ML).
On the research front, here are seven recent PCS papers:
1. HCM paper, for cost-effective generation of hypotheses in finding genetic drivers of the heart disease Hypertrophic Cardiomyopathy (HCM), in collaboration with the Ashley Group at Stanford Medical School (4 out of 5, or 80%, of recommendations confirmed by experiments).
2. Prostate Cancer Detection paper, for stress-testing data-cleaning stability and halving the number of genes used for prostate cancer detection, with a huge AUC improvement relative to the current clinical test, PSA (from 60% to 80%), in collaboration with the Chinnaiyan Group at the University of Michigan Medical School.
3. PCS-UQ paper: an expansion of Ch. 13 of the VDS book, evaluated on 23 datasets, showing a 20% average interval-size reduction over the best conformal method considered. One step of PCS-UQ is itself a new form of conformal prediction.
4. PCS workflow paper: an updated and concise introduction to PCS.
5. NESS paper: a PCS-guided enhancement of t-SNE and UMAP.
6. Veridical data science and medical foundation models, on arXiv, by Alaa and Yu.
7. MERITS paper: a PCS-guided primer for data-driven simulation design (to appear, JCGS, 2025).
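For readers unfamiliar with the baseline that PCS-UQ improves on, here is a sketch of standard split conformal prediction (my illustration of the generic method, not the PCS-UQ procedure itself): calibrate a residual quantile on held-out data to get prediction intervals with finite-sample coverage guarantees.

```python
# A minimal split-conformal sketch (standard method for background,
# not PCS-UQ): fit on one split, calibrate an interval width on a
# second split, and check empirical coverage on a third.
import numpy as np

rng = np.random.default_rng(1)
n = 600
x = rng.uniform(-3, 3, size=n)
y = np.sin(x) + rng.normal(scale=0.3, size=n)

# three disjoint splits: train / calibration / test
x_tr, y_tr = x[:300], y[:300]
x_cal, y_cal = x[300:500], y[300:500]
x_te, y_te = x[500:], y[500:]

# any point predictor works; here, a cubic polynomial fit
coef = np.polyfit(x_tr, y_tr, 3)
predict = lambda xs: np.polyval(coef, xs)

# conformity scores: absolute residuals on the calibration split
scores = np.abs(y_cal - predict(x_cal))
alpha = 0.1  # target 90% coverage
n_cal = len(scores)
# conformal quantile with the finite-sample correction (n_cal + 1)
q = np.quantile(scores, np.ceil((n_cal + 1) * (1 - alpha)) / n_cal)

lower, upper = predict(x_te) - q, predict(x_te) + q
coverage = np.mean((y_te >= lower) & (y_te <= upper))
print(f"target coverage {1 - alpha:.0%}, empirical {coverage:.1%}")
```

The width 2q is constant over the input space in this basic version; methods such as the one in the PCS-UQ paper aim to shrink average interval size while keeping coverage.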
Three VDS workshops have been held, and all the talks from last year's workshop at Berkeley are on YouTube; links are below, along with a link to a paper on VDS for foundation models that might be of interest. The VDS-in-biology workshop this July, supported by QB3, was initiated by postdocs and students in computational biology, who asked me to be a faculty sponsor and who invited all the speakers. It was a sold-out event, like the one in 2024, and all three were very successful according to feedback from attendees.
· Veridical Data Science in Biology at UC Berkeley, July 11, 2025 (submission deadline July 4)
· Rome Workshop on Veridical Data Science, June 20, 2025
· Inaugural Berkeley-Stanford Workshop on Veridical Data Science at UC Berkeley, May 31, 2024 (talk videos available)
Also here are some slides, some more slides, and a recent journal article.
Interesting. This reminds me a lot of the approach described in our Bayesian Workflow article and forthcoming book. The idea is that performing a data analysis is itself a sort of scientific investigation, involving conjectures, refutations, and gathering of additional information provided by simulations. There’s an integration of computing with statistical analysis and a willingness to make strong but tentative assumptions: the assumptions must be strong enough to provide a recipe for generating latent and observed data, and they must be tentative enough that we are continually willing to revise them. These ideas also integrate good statistical practice with good scientific practice. So it’s interesting to see these ideas coming in from a completely different perspective.
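The "recipe for generating latent and observed data" can be illustrated with the simplest possible fake-data check (a generic workflow step, not code from either book): simulate data from the assumed model with known parameters, refit, and verify that the procedure recovers what was put in.

```python
# A generic fake-data simulation check (illustration only): if the
# fitting procedure cannot recover known parameters from data generated
# by the assumed model, it cannot be trusted on real data either.
import numpy as np

rng = np.random.default_rng(2)
true_a, true_b, true_sigma = 1.0, 2.5, 0.8  # assumed "truth"

recovered = []
for _ in range(100):
    # generate observed data from the assumed generative model
    x = rng.normal(size=100)
    y = true_a + true_b * x + rng.normal(scale=true_sigma, size=100)
    # refit; np.polyfit returns coefficients highest degree first
    b_hat, a_hat = np.polyfit(x, y, 1)
    recovered.append((a_hat, b_hat))

a_hats, b_hats = np.array(recovered).T
print("intercept: mean estimate", round(a_hats.mean(), 2), "truth", true_a)
print("slope:     mean estimate", round(b_hats.mean(), 2), "truth", true_b)
```

In a full workflow this check is repeated as the model grows, so that each new assumption (a latent variable, a prior, a measurement-error term) is validated on data where the answer is known before it is trusted on data where it is not.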
I also see connections to the approach to data analysis in our book, Regression and Other Stories: for example, compare chapter 5 of Bin's book, on Exploratory Data Analysis, to our chapter 2, on Data and Measurement. The details are different but the attitude is the same: exploratory methods are fundamentally not so different from analytical methods; the better your analytical models, the more you can get out of exploratory analysis, and the more important exploratory methods for understanding data become. It makes sense that these ideas would be developed in parallel by different research groups: data analysis as an open-ended Cantorian process rather than a fixed set of algorithms or models.