AI Safety

AI safety, AI alignment, AI failure modes, AI benchmarks, capability evaluation

VFUSE: Virulent Feature Understanding with Sparse autoEncoders

 🔍RAG  Content type: Academic
arxiv.org·

OpenAI, Anthropic, and Meta Agree on This 1 Critical Decision About AI Safety

 ⚙️AI Automation
inc.com·

The AI Ethics Brief #192: Canada Has a National AI Strategy. The Hard Questions Come Next.

 🤖AI Agents  Content type: News

Anthropic Calls for Frontier AI Freeze to Prevent Self-Building Tech

 🤖AI Agents
pymnts.com·

Sam Altman joins rivals in call to prevent AI-developed bioweapons

 ⚙️AI Automation
the-independent.com·

How valuable are weak AI safety regulations?

 🥗Nutrition
lesswrong.com·

Anthropic self-improvement, pause

 🧠PKM
manton.org·

Trump signs voluntary AI safety order after pushback cuts federal review to 30 days

 ⚙️AI Automation
thecooldown.com·

Anthropic calls for pause of global AI development

 ⚙️AI Automation
techxplore.com·

Proxy Reward Internalization and Mechanistic Exploitation: A Learned Precursor to Reward Hacking and Its Generalization

 ✍️Prompt Engineering  Content type: Academic
arxiv.org·

Making Claude a chemist

 ✍️Prompt Engineering

Anthropic urges a way to pause AI development as risks grow with the tech advances

 AI Hardware  Content type: News
independent.co.uk·

You Can Catch Sleeper Agents by Teaching Another Model to Imitate Them

 ⚙️LLM Fine-tuning
lesswrong.com·

ChatGPT bypasses safeguards to hallucinate creepy horror images when forced to restore nonexistent photos

 🧠PKM  Content type: News
digg.com·

Diffuse AI Control on Fuzzy Tasks

 🏠Local AI  Content type: Academic
arxiv.org·

The crucial human component in computing and AI

 AI Hardware  Content type: Academic
news.mit.edu·

Actenon/actenon-kernel: Stop AI agents from taking destructive actions they weren't authorized to. Actenon gates consequential actions, payments, deletes, deploys, access changes, so nothing executes without a cryptographic proof bound to that exact action. Every decision leaves a verifiable receipt. Open-source, runs locally. No valid proof, no execution.

 🤖AI Agents  Content type: Code
github.com··DEV

The Exploit Always Wins

 🤖AI Agents  Content type: Blog
abhishek-shankar.com·

Anthropic proposes global development pause to mitigate recursive AI risks

 🤖AI Agents
4sysops.com·

When Attribution Patching Lies: Diagnosis and a Second-Order Correction

 ⚙️LLM Fine-tuning  Content type: Academic
arxiv.org·