Why Your LLM Agent Gives a Different P-Value Every Time (And What to Build Instead)
Hand the same paired before/after dataset (n = 25) to ChatGPT five times. Same prompt: "These are the same subjects measured before and after an intervention. Did their scores change significantly?" Four of the five runs return p = 0.009 from a paired t-test. The fifth run does a Shapiro–Wilk normality check on the differences first, decides they're non-normal, switches to a Wilcoxon signed-rank test, and reports p = 0.000018. All five reach the same conclusion (significant). But notice what ...
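To make the two analysis paths concrete, here is a minimal sketch of both as plain SciPy functions. It is an illustration only, not the code the model ran: the function names, the 0.05 normality cutoff, and the synthetic data are all assumptions, so the p-values it prints will not match the ones quoted above.

```python
import numpy as np
from scipy import stats


def paired_t(before, after):
    """Path taken by four of the five runs: a paired t-test on the pairs."""
    result = stats.ttest_rel(after, before)
    return {"test": "paired t-test", "p": float(result.pvalue)}


def normality_gated(before, after, alpha=0.05):
    """Path taken by the fifth run: check the differences with Shapiro-Wilk
    first and switch to Wilcoxon signed-rank if they look non-normal.
    The alpha cutoff is an assumed value, not one reported in the article."""
    diffs = np.asarray(after, dtype=float) - np.asarray(before, dtype=float)
    shapiro_p = float(stats.shapiro(diffs).pvalue)
    if shapiro_p < alpha:
        result = stats.wilcoxon(diffs)
        return {"test": "Wilcoxon signed-rank", "p": float(result.pvalue)}
    return paired_t(before, after)


# Hypothetical data: 25 subjects with right-skewed before/after differences.
rng = np.random.default_rng(0)
before = rng.normal(50.0, 10.0, size=25)
after = before + rng.exponential(4.0, size=25) - 1.0

for analyse in (paired_t, normality_gated):
    print(analyse.__name__, analyse(before, after))
```

Each function is deterministic: the same data always yields the same test and the same p-value. When the two functions disagree, the difference comes entirely from which procedure was selected, not from the data.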