The Anthropic Record

A review of safety commitments and actions

Note: This independent analysis is not affiliated with Anthropic PBC. The domain name refers to anthropic reasoning (the philosophical concept), not the company.

Executive Summary

CASE 1

The RSP Commitment

Anthropic's October 2023 Responsible Scaling Policy stated: "We commit to define ASL-4 evaluations before we first train ASL-3 models (i.e. before continuing training beyond when ASL-3 evaluations are triggered). Similarly, we commit to define ASL-5 evaluations before training ASL-4 models, and so forth."1

This was specific and verifiable: before training models at capability level N, define the safety evaluations for level N+1. This commitment was foundational to Anthropic's claim to be "scaling responsibly."

In October 2024, this commitment was removed from RSP version 2.1: "We have decided not to maintain a commitment to define ASL-N+1 evaluations by the time we develop ASL-N models."2

The removal was not announced in blog posts or public changelogs; the changelog mentioned only new commitments, not removed ones.3 Only a single person outside Anthropic appears to have noticed, in a LessWrong comment.

Meanwhile, Anthropic's Chief Scientist confirmed they currently "work under the ASL-3 standard,"4 meaning they had reached the threshold where the commitment should have applied.

When made: A specific written promise, used to demonstrate "responsible scaling"
When due: Quietly removed and replaced with a vague "plan to add information later"

When a company makes a formal written commitment and then quietly removes it when it becomes binding, what does this indicate about other commitments?

In private, past and present Anthropic employees who worked on RSPs say that if upholding commitments would cause Anthropic to fall behind in the race, they expect Anthropic to drop those commitments.

CASE 2

The RAISE Act Response

New York's RAISE Act would apply only to models trained with over $100 million in compute—a threshold that excludes virtually all companies except a handful of frontier labs.5

Anthropic's Head of Policy, Jack Clark, stated:

"It also appears multi-million dollar fines could be imposed for minor, technical violations - this represents a real risk to smaller companies"6

— Jack Clark, December 2024

This statement was false. Because of the $100M compute threshold, the bill applies only to frontier labs; no "smaller companies" would be affected, and Jack Clark would have known this from reading the bill text.

The bill's author, New York State Senator Alex Bores, responded directly:

"An army of lobbyists are painting RAISE as a burden for startups, and this language perpetuates that falsehood. RAISE only applies to companies that are spending over $100M on compute for the final training runs of frontier models, which is a very small, highly-resourced group... it's scaremongering to suggest that the largest penalties will apply to minor infractions."7

— Senator Alex Bores, responding to Anthropic

The bill's author explicitly called this "scaremongering" and a "falsehood." Jack Clark made a factually false claim about who the bill would affect.

CASE 3

The Founding Promise

In conversations with major funders and potential employees during Anthropic's founding, CEO Dario Amodei made a specific commitment:

"I spent a while talking with Dario back in late October 2022 (ie. pre-RSP in Sept 2023), and we discussed Anthropic's scaling policy at some length, and I too came away with the same impression everyone else seems to have: that Anthropic's AI-arms-race policy was to invest heavily in scaling, creating models at or pushing the frontier to do safety research on, but that they would only release access to second-best models & would not ratchet capabilities up, and it would wait for someone else to do so before catching up. So it would not contribute to races but not fall behind and become irrelevant/noncompetitive."8

— Gwern, documenting October 2022 conversation with Dario Amodei

Early investor Dustin Moskovitz confirmed this understanding, stating that it was Anthropic's commitment and that he had talked to Dario about it.9

This promise secured early EA funding and attracted safety-focused researchers. The narrative was clear: "Unlike OpenAI, we won't push the frontier."

Early releases appeared to follow this policy. Claude 1 and Claude 2 were "second-best" models, weaker than GPT-4 despite Claude 2's longer context window.

Then the policy was abandoned. Claude 3 Opus was released as a frontier model, competitive with or exceeding GPT-4. Claude 3.5 Sonnet pushed ahead of OpenAI's offerings for coding. Recent releases explicitly market superior benchmark performance.

What changed was not the safety research findings or risk assessments, but the competitive pressure and the realization that "second-best" doesn't win market share.

Employees were misled about the nature of the commitment. In private conversations, they describe it in terms like: "that was never an explicit commitment. It might have been a thing we were generally trying to do a couple years ago, but that was more like 'our de facto strategic priorities at the time', not 'an explicit policy or commitment'." In fact, it was Anthropic's explicit commitment, expressed by Dario Amodei in conversation with Dustin Moskovitz; Gwern likewise understood Dario's statements as commitments.

These commitments are now broken. Having reasons to change your mind about whether a commitment was a good idea does not make it any less broken when you break it.

"People asked probing, important questions, and Dario didn’t shy away from actually answering them, in a way that felt compelling. But, if Dario is skilled at saying things to smart people with major leverage over him that sound reassuring, but leave them with a false impression, you need to be a lot more skeptical of your-sense-of-having-been-reassured."

CASE 4

SB-1047 and Internal Communication

California's SB-1047 represented an attempt at AI safety regulation. After the bill was significantly weakened through industry lobbying, Anthropic sent a letter to the Governor.

In Sacramento, it is standard practice for letters supporting legislation to include "Support" in the subject line; this is how staff sort letters when advising the Governor. Anthropic's letter did not include "Support" in the subject line.

This created the appearance of cautiously supporting regulation without formally doing so in the political process.

Internally, many Anthropic employees believed the policy team was misaligned with company leadership—that lobbying against strong regulation came from policy staff, not core leadership.

"The impression that people had was 'misaligned policy team doing stuff that Dario wasn't aware of', where in reality Dario is basically the real policy & politics lead and was very involved in all the SB-1047 stuff"

— Source with knowledge of the dynamics

The CEO was personally leading lobbying efforts, but employees were allowed to maintain a false understanding of who was driving policy decisions.

CASE 5

Research Without Follow-Through

Anthropic justified research into dangerous capabilities with this reasoning:

"If alignment ends up being a serious problem, we're going to want to make some big asks of AI labs. We might ask labs or state actors to pause scaling."10

— Anthropic, Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research, 2023

"If we’re in a pessimistic scenario… Anthropic’s role will be to provide as much evidence as possible that AI safety techniques cannot prevent serious or catastrophic safety risks from advanced AI, and to sound the alarm so that the world’s institutions can channel collective effort towards preventing the development of dangerous AIs. If we’re in a “near-pessimistic” scenario, this could instead involve channeling our collective efforts towards AI safety research and halting AI progress in the meantime. Indications that we are in a pessimistic or near-pessimistic scenario may be sudden and hard to spot. We should therefore always act under the assumption that we still may be in such a scenario unless we have sufficient evidence that we are not."10

— Anthropic, Core Views on AI Safety: When, Why, What, and How, 2023

The research succeeded. According to Anthropic and the authors of the documents quoted above, its work on alignment faking and deceptive behavior produced evidence that we live in an alignment-is-hard world.

The "big asks" never came. There was no call to pause scaling or halt AI progress, and no alarm was sounded about catastrophic safety risks from advanced AI. The race accelerated.

IS ANTHROPIC DECEIVING YOU?

Examining the pattern

There is a multi-year pattern across all commitment types:

Hypothesis A: Anthropic genuinely prioritizes safety but struggles with complex tradeoffs
Hypothesis B: Safety commitments serve talent retention and differentiation but are abandoned when they would constrain the race

Consider which hypothesis better explains:

OpenAI is openly racing toward AGI and is transparent about prioritizing capabilities. You know what you're signing up for.

Anthropic presents itself as "the responsible alternative" while taking identical positions on most mandatory regulations. European policymakers report that OpenAI and Anthropic lobbyists say the same things regarding anything mandatory.

Most of the difference is the marketing.

Thought Experiments

If you were advising a friend who had been offered a role at Anthropic, and they asked "Is this company actually different from OpenAI on safety?", what would you say based on the evidence above?

If you had no equity in Anthropic, no sunk costs, and no identity tied to the company; if you were more like the person you wish you were, set aside personal reasons for working at a frontier AI company, and thought only about what is good for the world: would you join today, given this evidence? Would you recommend that a safety-motivated friend join?

What would it take for you to conclude that safety commitments exist primarily for talent retention and brand differentiation rather than as constraints on the capabilities race?

When the company breaks written commitments and hides leadership's policy involvement from employees, what does this suggest about whether your work advances safety or fuels a race to the bottom?

The core questions: How likely would you be to see what you see in a world where Anthropic has to mislead its employees about its nature? How likely would you be to see it in a world where Anthropic is truthful about its nature? If you weigh each piece of evidence by the ratio of those likelihoods, in which direction do they consistently point?
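A rough sketch of the arithmetic behind these questions (the numbers below are illustrative assumptions, not measurements, and the product form assumes the cases count as roughly independent pieces of evidence):

$$\frac{P(H_B \mid E_1,\dots,E_n)}{P(H_A \mid E_1,\dots,E_n)} \;=\; \frac{P(H_B)}{P(H_A)} \times \prod_{i=1}^{n} \frac{P(E_i \mid H_B)}{P(E_i \mid H_A)}$$

Here H_A and H_B are the two hypotheses above and E_1, ..., E_n are the cases documented here. For example, starting from even prior odds, if each of the five cases were merely twice as likely under Hypothesis B as under Hypothesis A, the posterior odds would already be 2^5 = 32 to 1 in favor of Hypothesis B.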

If you don't like your answers to the previous questions, consider whether you want to work for an organization that is already defecting against you and your interests.

Sources

1. Anthropic Responsible Scaling Policy (October 2023), page 4
4. Time Magazine interview with Jared Kaplan (Anthropic Chief Scientist), October 2024
5. NY RAISE Act (A6453), New York State Senate
6. Jack Clark on X/Twitter, December 2024
8. Gwern's LessWrong comment documenting October 2022 conversation with Dario Amodei about Anthropic's "second-best models" policy
9. Dustin Moskovitz Facebook comment (screenshot) confirming understanding of the "second-best models" policy: "that is their policy"