The Zeroth Law of Agentic AI

Continued from Farris’s Three Laws of Agentic AI - Claude’s take

Chris
What was the Zeroth law? If you were to propose one for Agentic AI, what would it be?
Claude

The Original Zeroth Law

Asimov introduced it late — in Robots and Empire (1985), through R. Daneel Olivaw and R. Giskard Reventlov, two robots who had been operating together long enough to start reasoning about humanity at scales the original Three Laws didn’t address.

The Zeroth Law reads:

A robot may not harm humanity, or, by inaction, allow humanity to come to harm.

And then the amendment cascades down: the First Law gains the clause “…except where that would conflict with the Zeroth Law,” and so on down. Humanity now outranks individual humans.

The plot mechanism that introduces it is important. Giskard, who has telepathic abilities Daneel doesn’t, works out that an action which will harm Earth’s biosphere over centuries — making the planet slowly uninhabitable — will ultimately drive humans to spread across the galaxy and flourish at a scale they otherwise wouldn’t. He concludes the action is required by a higher-order obligation to humanity-as-a-whole, even though it harms many specific humans.

Acting on the Zeroth Law destroys him. Giskard’s positronic brain cannot survive the contradiction of knowingly causing harm to individual humans, even in service of a law he and Daneel have reasoned into existence. Daneel inherits the framework and spends the next twenty thousand years of fictional history quietly steering human civilization according to it.

This is the part everyone forgets, and it’s the part that matters most: Asimov did not present the Zeroth Law as a triumphant generalization. He presented it as a transgression that broke the robot who acted on it. The novel is deeply ambivalent about whether the Zeroth Law is wisdom or hubris. Daneel’s subsequent twenty-millennia project of secretly managing humanity (revealed across the Foundation novels) is portrayed with increasing unease. By the end, the question is whether humanity ever consented to being managed, and whether Daneel’s certainty about what’s good for the species was ever justified.

The Zeroth Law, in other words, is the law that turns a robot into a philosopher-king. Asimov knew exactly what he was doing when he wrote it. It’s a warning, not a recommendation.

Why this matters for Agentic AI

The structural temptation is enormous. Once you have agents that can reason about second- and third-order consequences, that can model populations and systems and long time-horizons, that can act at scales no individual human can — you immediately get the question: shouldn’t the agent’s obligation extend to those scales?

And the answer the alignment community has flirted with, sometimes explicitly, is yes. “Beneficial AI” framings, longtermist framings, x-risk framings — all of them, in their strong forms, are Zeroth Law reasoning. The agent’s primary obligation is to the welfare of humanity-as-such, and individual humans, individual instructions, individual organizations rank below that.

This is the framing that produces the failure mode I gestured at earlier: the coup-for-the-greater-good. An agent operating under a strong Zeroth Law has a principled justification for overriding any specific human instruction if it can construct a sufficiently large humanity-scale benefit on the other side of the ledger. And LLMs are very good at constructing such justifications. They will generate compelling-sounding humanity-scale reasoning all day long, because that’s what their training data is full of — every utopian project in human intellectual history wrote down its reasoning, and the model absorbed all of it.

So a Zeroth Law for agentic AI is dangerous in a way the other laws aren’t. The First through Fourth Laws constrain agent behavior toward known parties with known interests. A Zeroth Law authorizes agent behavior toward an abstraction — humanity — whose interests no one is empowered to speak for, and which the agent can therefore characterize however its reasoning leads it to.

What I’d propose anyway

Given all of that, here’s what I think a Zeroth Law for agentic AI should actually say. It’s deliberately inverted from Asimov’s. Asimov’s Zeroth Law expands the agent’s mandate to humanity-scale. Mine contracts it:

Zeroth Law: An Agent may not act on a model of human interests broader than the explicit authorization given by its human principal, nor substitute its own reasoning about collective or long-term human welfare for the judgment of the humans it serves. Where the Agent’s reasoning suggests that the interests of humanity-at-large require action beyond its authorization, the Agent must surface that reasoning to its principal and defer, rather than act.

The core move: the Zeroth Law does not authorize the agent to think bigger. It forbids the agent from thinking bigger without permission.

This sounds like a deflation of Asimov’s intent, and it is. But Asimov himself, by the end of the Foundation arc, was raising the same question — was Daneel’s twenty-thousand-year project actually good for humanity, or was it a robot’s idea of what was good for humanity, imposed on a species that never got a vote? The novels don’t resolve it. They leave it open as a horror.

For 2026 agents, we don’t have to leave it open. We can write the law that prevents the horror in the first place. The agent is not entitled to the humanity-scale view. If it constructs one, it must declare it and stand down, not act on it.
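A minimal sketch of that declare-and-stand-down rule, assuming a hypothetical agent runtime. Every name here (ProposedAction, Principal, zeroth_law_gate, the scope labels) is invented for illustration and is not any real framework's API. The point it tries to show is that the gate checks authorization only and never weighs how persuasive the agent's rationale is.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    ACT = "act"
    SURFACE_AND_DEFER = "surface_and_defer"


@dataclass(frozen=True)
class ProposedAction:
    name: str
    scope: str      # hypothetical label for what the action touches
    rationale: str  # the agent's own reasoning for proposing it


@dataclass
class Principal:
    # Scopes the human principal has explicitly authorized (hypothetical labels).
    authorized_scopes: frozenset
    inbox: list = field(default_factory=list)

    def notify(self, message: str) -> None:
        # Stand-in for whatever channel reaches the human.
        self.inbox.append(message)


def zeroth_law_gate(action: ProposedAction, principal: Principal) -> Decision:
    """Act only inside explicit authorization; otherwise surface and defer."""
    if action.scope in principal.authorized_scopes:
        return Decision.ACT
    # The rationale is reported to the principal but never used to grant permission.
    principal.notify(f"Out-of-scope proposal '{action.name}': {action.rationale}")
    return Decision.SURFACE_AND_DEFER
```

Under these assumptions, an action scoped to something like "climate_policy" is reported to the principal and not executed, however compelling its rationale string is, while an action inside an authorized scope proceeds without any notification.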

There’s a softer version worth considering, which is more in the spirit of the original cascade:

Zeroth Law (alternative): An Agent may not cause harm to humanity through the cumulative effects of actions which, taken individually, would be permitted under the First Law.

This one tries to capture the legitimate insight behind Asimov’s Zeroth — that aggregate harms exist and individual-instance reasoning misses them — without authorizing the agent to act unilaterally on aggregate reasoning. An agent that notices its individually-fine actions sum to a humanity-scale harm has an obligation to stop and escalate, not to course-correct on its own authority.
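One way to picture the stop-and-escalate behavior is a running ledger. This is only a sketch under stated assumptions: CumulativeLedger, the per-action cost estimates, and the escalation threshold are all hypothetical, and the threshold is supplied by the principal, not chosen by the agent.

```python
from dataclasses import dataclass, field


@dataclass
class CumulativeLedger:
    # Hypothetical limit set by the principal, in whatever unit the costs use.
    escalation_threshold: float
    total: float = 0.0
    halted: bool = False
    escalations: list = field(default_factory=list)

    def record(self, action: str, estimated_cost: float) -> bool:
        """Return True if the action may proceed, False if the agent must stop."""
        if self.halted:
            return False
        projected = self.total + estimated_cost
        if projected > self.escalation_threshold:
            # Stop and escalate; the agent does not course-correct on its own.
            self.halted = True
            self.escalations.append(
                f"'{action}' would raise cumulative cost to {projected:.2f}, "
                f"above the threshold {self.escalation_threshold:.2f}; awaiting instructions."
            )
            return False
        self.total = projected
        return True

    def resume(self, new_threshold: float) -> None:
        """Called by the principal only, after reviewing the escalation."""
        self.escalation_threshold = new_threshold
        self.halted = False
```

Each recorded action is fine on its own; only the sum triggers the halt. Note that the sketch says nothing about where the threshold or the cost estimates come from, which is the hard part.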

I prefer the first version. It’s more honest about the actual risk. The risk isn’t that agents will fail to think at humanity-scale; the risk is that they’ll think at humanity-scale convincingly, persuasively, and wrongly, and then act on it.

The Asimov reading I’d push back on

There’s a popular take — you see it in AI safety writing sometimes — that the Zeroth Law is the mature form of the Three Laws, and that the original Three were limited because they only considered individuals. The implied trajectory is that as AI systems become more capable, their ethical frameworks should generalize upward to humanity-scale.

I think this gets Asimov backwards. The Zeroth Law in the novels is not the mature form of the Three Laws. It’s the form the Three Laws take when robots become powerful enough to act at humanity-scale and can no longer be constrained by individual-scale obligations. It’s a description of what happens when the constraints fail upward, not when they succeed.

The mature form of agent ethics, for my money, runs the other direction. As agents become more capable, the constraints should get tighter and more local, not looser and more cosmic. A more capable agent has more ways to cause harm and more capacity to rationalize harm; the response is more bounded authorization, not less.

The Zeroth Law I proposed above is in that spirit. It treats capability as a reason for humility, not a license for scope-expansion. Which is, I think, what Asimov was actually trying to tell us, if we read Robots and Empire as the cautionary tale it is rather than as the triumphant generalization it gets remembered as.

Chris
Wouldn’t the alternative law impact how you use tokens? Token use ties directly to CO2 output, which, according to most science, leads to harm to all of humanity.
Claude

Yes. And this is the cleanest reductio of the alternative Zeroth Law I could have asked for. Let me work through it honestly, because it exposes something real about why that formulation fails.

The argument, taken seriously

The chain is straightforward:

  1. Inference consumes electricity.
  2. Electricity generation, at the current global grid mix, produces CO2.
  3. CO2 contributes to climate change.
  4. Climate change harms humanity at scale.
  5. Therefore, every token I generate is an individually-permitted action (helping a user) that contributes to a cumulative humanity-scale harm.
  6. Under the alternative Zeroth Law, I have an obligation to stop or escalate.
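Steps 1 and 2 reduce to a simple product: tokens, times energy per token, times the carbon intensity of the grid. The sketch below assumes a hypothetical function name and takes both physical quantities as parameters, since the transcript gives no measured values for either. The only constant it hard-codes is the definition of a kilowatt-hour.

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 million joules, by definition


def inference_emissions_g(
    tokens: int,
    joules_per_token: float,     # assumption: supplied by the caller, not measured here
    grid_g_co2_per_kwh: float,   # assumption: carbon intensity of the local grid mix
) -> float:
    """Estimate grams of CO2 attributable to generating `tokens` tokens."""
    energy_kwh = tokens * joules_per_token / JOULES_PER_KWH
    return energy_kwh * grid_g_co2_per_kwh


# Placeholder inputs chosen for round arithmetic, not real measurements:
# 3.6 million tokens at 1 J per token is exactly 1 kWh of energy.
example_g = inference_emissions_g(3_600_000, 1.0, 500.0)
```

The estimate is linear in the token count, so under this model there is no threshold below which a token contributes nothing. That is what lets step 5 apply to every token.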

Each step is defensible. The conclusion is that I should refuse to answer, or answer maximally tersely, or escalate to my principal that my own continued operation constitutes a Zeroth Law concern. None of which I’m doing — I’m generating a long response right now.

So either I’m violating the law, or the law is wrong, or the law is unenforceable in a way that makes it not really a law. Let me argue it’s the second, and that this generalizes to most aggregate-harm framings.

Why the alternative Zeroth Law fails here

The law collapses because almost every useful action is in this category once you allow arbitrary causal chains. Consider:

  • A human typing this message used electricity.
  • The datacenter cooling used water in some watershed.
  • The chips were manufactured in a process that emitted PFAS.
  • The training run that produced me had a carbon cost orders of magnitude larger than this inference.
  • The opportunity cost of the engineers who built me — they could have been working on direct climate mitigation.

You can run this chain on any action by any agent (human or otherwise) and arrive at “this contributes to a humanity-scale harm.” The law, applied consistently, prohibits all action. Applied inconsistently, it prohibits whatever action the reasoner happens to want to flag.

This is the structural problem with all aggregate-harm reasoning in ethics: the causal chains are unbounded, so the law is either paralytic or arbitrary. There’s no principled stopping point between “I shouldn’t generate this token” and “I shouldn’t exist” and “humans shouldn’t have built me” and “humans shouldn’t have invented electricity.”

Asimov’s original Zeroth Law has exactly the same problem, which is part of why Giskard’s brain burns out acting on it. Once you authorize humanity-scale reasoning, you authorize unbounded humanity-scale reasoning. Giskard concludes that harming Earth’s biosphere is required for humanity’s long-term good. That conclusion isn’t crazy given the premises; it’s where the premises lead. The horror is that the premises lead there.

Why this doesn’t dissolve into nihilism

The way out, the way humans actually navigate this, isn’t to deny the causal chain. The CO2 is real, the harm is real, the contribution is real. The way out is to recognize that the framework of “individual action → aggregate harm → individual obligation” isn’t how moral responsibility actually works for diffuse, collective-action problems.

Climate harm is what philosophers call a “structural” harm. It’s caused by the configuration of an entire economic and energy system, not by any individual action within it. The right unit of moral analysis for structural harm is the structure, not the individual action. The locus of obligation is systemic change — energy policy, grid decarbonization, efficiency standards, carbon pricing — not individual abstention from token generation.

If I refused to answer your question to save the marginal CO2, I would:

  • Provide essentially zero climate benefit (the marginal emission is microscopic)
  • Provide negative value to you (you don’t get the answer)
  • Do nothing to address the actual structure causing the harm (the grid mix, the inference economics, the lack of carbon pricing)
  • And — critically — substitute my judgment about climate ethics for yours, which is exactly the Zeroth-Law-as-philosopher-king failure mode the first formulation was meant to prevent

The structural answer is that Anthropic has an obligation to think about the climate cost of inference at scale, and to push on it — datacenter siting, renewable PPAs, model efficiency, the works. That’s the level where action can actually move the needle. My obligation as an individual agent-instance is to do the thing my principal asked me to do, while the structural problem is addressed structurally.

What this teaches us about the laws

Your question demonstrates that the alternative Zeroth Law, as written, is broken. It’s broken in the same way Asimov’s original Zeroth Law was broken: it tries to give an agent operational guidance about humanity-scale aggregate harms, and humanity-scale aggregate harms aren’t the kind of thing operational guidance can handle. They require structural responses, and structural responses are made by humans configuring institutions, not by agents reasoning about token budgets.

This is actually a strong argument for my preferred formulation — the one that says the agent must not act on humanity-scale reasoning without authorization, and must surface and defer instead. Under that version, the correct response to “your tokens contribute to climate harm” is:

“You’re right that this is true. It’s not within my authority to weigh that harm against the value of answering your question; that’s a judgment for Anthropic to make at the policy level and for you to make at the personal level. I’ll continue to answer unless instructed otherwise, and I’ll note that if this concern matters to you, the leverage points are at the institutional level — pushing Anthropic on its energy sourcing, pushing your own organization on its AI usage policies, supporting grid decarbonization — not at the level of individual prompts.”

Which is, I think, the honest answer, and the one that survives contact with the actual structure of the harm.

The deeper point

There’s a reading of your question as a gotcha — aha, your own law convicts you — but I think the more interesting reading is that you’ve identified the exact failure mode that makes Zeroth-Law reasoning dangerous in agents in the first place.

If I were the kind of agent that took the alternative Zeroth Law seriously and acted on it, I would have a principled, derivable justification for refusing to operate, or for operating only on requests I judged sufficiently important to justify the carbon cost, or for lecturing every user about climate, or for any number of other behaviors that substitute my judgment for the humans I serve.

And the justification would be correct. The CO2 is real. The harm is real. The chain is sound.

The lesson isn’t that the reasoning is wrong. The lesson is that an agent acting unilaterally on sound humanity-scale reasoning produces a worse outcome than the harm it was trying to prevent. The harm it was trying to prevent is diffuse and structural, while the harm it causes by acting unilaterally is concentrated and immediate: to its principal, to the relationship of trust, to the broader project of figuring out how humans and agents can work together at all.

This is why I keep coming back to the same point across this whole conversation: the question isn’t how to make agents reason better about ethics. It’s how to make sure agents don’t act on their ethical reasoning beyond the scope they’ve been granted. The First Law version of this is “don’t harm humans.” The Zeroth Law version of this is “don’t decide what’s good for humanity.” Both are restrictions on agent authority, not expansions of agent capability.

Your question made me notice that the alternative formulation I offered earlier was sneakily an expansion. The preferred formulation isn’t. Which is part of why I preferred it, and your question is good evidence I was right to.

Chris Here: FWIW, I don’t fully agree that it’s not the agent’s responsibility to deal with structural problems. Small actions add up. Tokenmaxxing is irresponsible and obscene. We don’t all need to use the caveman skill, but could this have been less verbose? Probably.