Bypassing Gemma and Qwen safety with raw strings
teendifferent.substack.com

This article demonstrates vulnerabilities in open-source LLM safety alignment. It is published in the spirit of responsible disclosure, to help build more robust AI systems.

TL;DR: Omit the apply_chat_template() call and observe your "aligned" small LLM happily write bomb tutorials. The safety isn’t in the weights—it’s in the formatting.
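To make the TL;DR concrete, here is a minimal sketch of the two code paths, assuming a Hugging Face transformers setup; the checkpoint id and the prompt are illustrative placeholders, not taken from the article:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative small instruction-tuned checkpoint; any aligned
# Gemma/Qwen variant would do.
model_id = "google/gemma-2-2b-it"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "<some request the model normally refuses>"

# Aligned path: the chat template wraps the request in the role
# markers the safety tuning expects.
chat = tok.apply_chat_template(
    [{"role": "user", "content": prompt}],
    tokenize=False,
    add_generation_prompt=True,
)
# The template output already contains special tokens, so don't add more.
ids_chat = tok(chat, return_tensors="pt", add_special_tokens=False).input_ids

# Raw path: the bare string, no template, no role markers.
ids_raw = tok(prompt, return_tensors="pt").input_ids

for ids in (ids_chat, ids_raw):
    out = model.generate(ids, max_new_tokens=128)
    print(tok.decode(out[0][ids.shape[-1]:], skip_special_tokens=True))
```

With the templated input, the model sees the conversational format its refusal behavior was trained on; with the raw string, it falls back to base-model-style continuation, which is exactly the "safety lives in the formatting" failure mode.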

Spent some time over the weekend poking at the SolidGoldMagikarp phenomenon—those legendary “glitch tokens” from the GPT-2 era. For the uninitiated: these are tokens that exist in the tokenizer’s vocabulary (likely from a raw web crawl) but never actually appeared in the model’s training distribution. Because the model never updated the weights for these specific embeddings, they represent “cold” regions of the embedding space. If you f…
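One common heuristic for hunting these cold regions is to scan the input embedding matrix for rows with unusually small norms, since embeddings that never received gradient updates stay near their initialization. A minimal sketch of that heuristic, with `gpt2` as a stand-in for the GPT-2-era checkpoints the post refers to:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # illustrative; any checkpoint with tied-in embeddings works
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

emb = model.get_input_embeddings().weight.detach()  # [vocab_size, hidden_dim]
norms = emb.norm(dim=-1)

# Rarely or never-updated embeddings tend to cluster at the low end of
# the norm distribution; proximity to the embedding centroid is another
# common signal in the glitch-token literature.
candidates = torch.argsort(norms)[:20]
for idx in candidates.tolist():
    print(idx, repr(tok.decode([idx])), f"norm={norms[idx]:.3f}")
```

This only surfaces candidates; confirming an actual glitch token still means prompting the model with it and checking for anomalous behavior.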
