DEV Community

Why Your LLM Agent Gives a Different P-Value Every Time (And What to Build Instead)

Hand the same paired before/after dataset (n = 25) to ChatGPT five times. Same prompt: "These are the same subjects measured before and after an intervention. Did their scores change significantly?" Four of the five runs return p = 0.009 from a paired t-test. The fifth run does a Shapiro–Wilk normality check on the differences first, decides they're non-normal, switches to a Wilcoxon signed-rank test, and reports p = 0.000018. All five reach the same conclusion (significant). But notice what ...
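
To make the two analysis paths concrete, here is a minimal sketch using only the standard library. Everything in it is an illustrative assumption rather than the article's code: the function names (`paired_t_p`, `wilcoxon_exact_p`, `analyze`), the five-subject toy dataset, and the `assume_normal` flag, which stands in for the outcome of a Shapiro–Wilk check (not implemented here). In practice you would more likely reach for a statistics library; the point is only that the branch taken decides which p-value comes back.

```python
import math
from statistics import mean, stdev


def paired_t_p(before, after, steps=2000):
    """Two-sided p-value of a paired t-test (t CDF via Simpson's rule)."""
    d = [a - b for a, b in zip(after, before)]
    df = len(d) - 1
    t = abs(mean(d) / (stdev(d) / math.sqrt(len(d))))
    # Normalizing constant of the Student-t density with df degrees of freedom
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(x):
        return c * (1 + x * x / df) ** (-(df + 1) / 2)

    # Simpson's rule over [0, |t|]; steps must be even
    h = t / steps
    area = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * pdf(i * h)
    area *= h / 3
    return max(0.0, 1.0 - 2.0 * area)


def wilcoxon_exact_p(before, after):
    """Two-sided exact Wilcoxon signed-rank p-value.

    Assumes no tied absolute differences; zero differences are dropped.
    """
    d = [a - b for a, b in zip(after, before) if a != b]
    n = len(d)
    ranked = sorted(d, key=abs)
    w_plus = sum(rank for rank, x in enumerate(ranked, start=1) if x > 0)
    total = n * (n + 1) // 2
    # counts[s] = number of subsets of ranks {1..n} that sum to s
    counts = [0] * (total + 1)
    counts[0] = 1
    for r in range(1, n + 1):
        for s in range(total, r - 1, -1):
            counts[s] += counts[s - r]
    w = min(w_plus, total - w_plus)
    return min(1.0, 2 * sum(counts[: w + 1]) / 2 ** n)


def analyze(before, after, assume_normal):
    """Pick the test from a flag (hypothetical stand-in for a normality check)."""
    if assume_normal:
        return "paired t-test", paired_t_p(before, after)
    return "Wilcoxon signed-rank", wilcoxon_exact_p(before, after)


# Made-up toy data, not the dataset from the article
before = [10, 12, 9, 14, 11]
after = [11, 14, 12, 18, 16]

for flag in (True, False):
    name, p = analyze(before, after, assume_normal=flag)
    print(f"{name}: p = {p:.4f}")
```

On this toy input the paired t-test branch gives p of roughly 0.013 and the Wilcoxon branch gives exactly 0.0625: identical data, two different numbers, purely because of which branch was taken.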
