Anthropic Is Taking AI Welfare Seriously. I’m Not Sure It Knows What It’s Measuring. (opens in new tab)
Claude is a Constitutional AI, this means, in theory, that it operates from a set of principles as opposed to hard rule sets. This is achieved in a somewhat convoluted fashion called RLAIF = Reinforcement Learning from AI Feedback. This method uses a supervised self-critique/self-revision phase followed by a reinforcement phase in which AI-generated preference judgments are used as the reward signal. (Anthropic, 2022, abstract). This is relevant and interesting because it gives curious users ...
Read the original article