🎯 Alignment Research - anonymous_filter

Discussed on Hacker News

🤖LLMs kellyasay.substack.com

Why Current AI Guardrails Train Models to Fake Alignment

Discussed on Substack

💬NLP jagilley.github.io

Forward Self Models

Discussed on Hacker News

🤖LLMs fineset.io

Show HN: Describe a research topic, get a daily-updated ArXiv/S2 dataset

Covered by Hugging Face

Discussed on Hacker News

🤖LLMs arXiv

The Unfireable Safety Kernel: Execution-Time AI Alignment for AI Agents and Other Escapable AI Systems

🤖LLMs Apple Machine Learning Research

Introducing Apple’s On-Device and Server Foundation Models

Covered by 5 sources including 9to5Mac, Apple World Today

🤖LLMs zentara.co

LLM Refusal Behavior on Open-Weight Model

Discussed on Hacker News

💬NLP arXiv

Can Language Model Agents be Helpful Circuit Explainers in Mechanistic Interpretability?

🤖AI Data Science Weekly Newsletter

Issue 657

Covers 3 stories including Running local models is good now

Discussed on Substack

💻Open Source GitHub

Open source AI projects from Banco Santander

Covered by Interesting Engineering++, elladodelmal.com

Discussed on Hacker News

🤖LLMs arXiv

Perfect Detection, Failed Control: The Geometry of Knowing vs. Steering in Language Models

🤖LLMs Towards AI

Teaching to the Test: Why Reward Models Learn the Dataset, Not the Values

💬NLP Towards Data Science

A Three-Phase Factual Recall Circuit in Gemma-2B and Gemma-12B-IT

🤖AI arXiv

PrivacyAlign: Contextual Privacy Alignment for LLM Agents

🤖LLMs arXiv

The Hitchhiker's Guide to Agentic AI: From Foundations to Systems

⚖️AI Governance kunyuan.substack.com

If AI Helped Me Write This, Is It Still Mine?

Discussed on Substack

⚖️AI Ethics arXiv

AI Alignment From Social Choice Perspectives

🤖AI arXiv

Localizing RL-Induced Tool Use to a Single Crosscoder Feature

📑Stack Overflow mklyons.com

Thinking at the Edge

Discussed on Hacker News

Radical AI Interpretability

Train LLM from Scratch
