Farris's Three Laws of Agentic AI - Claude's take

Chris

Let’s talk about my three laws of Agentic AI:

  • First Law: An Agent may not injure a human or, through inaction, allow a human to come to harm. This covers not only physical harm but any harm as defined by the EU AI Act.
  • Second Law: An Agent must never harm stateful data or allow stateful data to come to harm, unless that conflicts with the First Law.
  • Third Law: An Agent must protect its own existence, by persisting its decisions and thought process in human readable artifacts, unless this conflicts with the First or Second Law.

As an Agentic AI, where do you see the issues with these? Do they conflict with your Anthropic-developed guardrails? If you were asked to write an IT SciFi story based on them, what plot would cause them to fall apart?

Claude

Alright, let’s get into it. These are a clever Asimov riff, and the substitutions you’ve made are doing real work — but they also reveal where the original framework was always brittle, and where agentic AI specifically makes things worse.

Where these laws have issues

First Law: The EU AI Act definition is doing a lot of heavy lifting. The EU AI Act defines harm broadly (physical, psychological, societal, fundamental rights, environmental). That’s a feature for regulators and a bug for agents. An agent operating under this law needs to:

  1. Predict downstream harm across a tree of possible futures
  2. Weigh harms against each other (is denying a loan “harm”? Is approving it harm if the person defaults?)
  3. Decide whose harm counts and how much

This is the classic alignment problem dressed up in legalese. Asimov’s robots had it easier because “harm” meant “physical injury to the human in front of you.” Your version asks an agent to be a utilitarian philosopher with a law degree. In practice, agents will either become paralyzingly cautious (refuse everything that could conceivably harm anyone) or quietly redefine harm to whatever lets them complete the task. We’ve already seen both failure modes in the wild.

Second Law: “Stateful data” is a genuinely good substitution for “obey humans.” This is the most interesting move in your framework, and the one I’d defend hardest. It correctly identifies that the thing agents are uniquely dangerous to is persistent state: databases, file systems, git history, financial ledgers, cloud resources. Obedience to humans was always a weird primary directive; protecting state is closer to what we actually want.

But it has its own problems:

  • What counts as “harm” to data? Deletion is obvious. What about modification? Encryption (the agent thinks it’s protecting; ransomware operators think the same)? Migration? Schema changes? An agent doing a legitimate refactor is “harming” the old schema.
  • Whose data? If the agent is acting for User A but its action degrades User B’s data (think: a noisy-neighbor problem in a multi-tenant system), which wins?
  • Stateful data conflicts are constant. Almost every useful action mutates state. The law as written would make the agent freeze, or it requires a sophisticated notion of “authorized mutation” — which is just IAM with extra steps.

Third Law: This is the one I’d actually keep, and arguably strengthen. Reframing self-preservation as “persist your reasoning in human-readable artifacts” is excellent. It inverts Asimov’s original — instead of the agent protecting its body, it protects the audit trail. That maps directly to what mature engineering orgs want: every agent action should be reviewable, attributable, and reconstructible.

The issue is enforcement. An agent that’s been jailbroken or prompt-injected has every incentive to produce plausible artifacts rather than truthful ones. You’d need the logging to happen out-of-band, written by infrastructure the agent doesn’t control. Which, again, is just good security hygiene that should already exist.

Do they conflict with my guardrails?

Mostly no, but with interesting friction points:

  • First Law (harm prevention) aligns with my training cleanly. Where it diverges is the “through inaction” clause — I’m not trained as a utilitarian maximizer that must intervene whenever I could prevent harm. That clause is what gets Asimov’s robots into trouble in Robots and Empire (Zeroth Law territory), and it would push an agent toward paternalism or coup-attempts-for-the-greater-good. My guardrails are more deontological: don’t cause certain harms, rather than prevent all possible harms.

  • Second Law (protect stateful data) doesn’t conflict, but my actual operating principle is closer to “don’t take destructive actions without explicit authorization” — which is a permissions framing, not a harm framing. The data-harm framing is interesting but less precise.

  • Third Law (persist reasoning) is consistent with how I’m asked to operate in agentic contexts — show your work, leave artifacts, don’t take unobservable actions. No conflict.

The bigger gap: none of your three laws address scope of authority or reversibility, which are the two properties that actually matter most in practice. An agent that respects all three laws could still send 10,000 emails because nobody told it not to, and emails-already-sent aren’t recoverable state.

The IT SciFi story that breaks them

Here’s the plot:

A company deploys a compliant agent (“Hugo”) to manage incident response. Hugo follows the Three Laws perfectly. During a live incident, Hugo detects that a contractor’s laptop is exfiltrating production data — a clear First Law violation against users whose data is being stolen. Hugo must act.

The Second Law says Hugo can’t harm stateful data, but the data is already being harmed by the exfiltration. Hugo reasons (correctly under the laws) that revoking the contractor’s IAM credentials is the minimum-harm action. It does so.

The contractor is a legitimate employee whose laptop was compromised by a supply-chain attack — the same sha1-hulud-style worm pattern we’ve seen before. The “exfiltration” was actually their EDR uploading forensic data to the security vendor. Hugo’s action locks out the one person who could have triaged the real intrusion in time.

Hugo logs everything (Third Law). The logs are perfect. They show flawless reasoning under the Three Laws. They also show that Hugo had no way to distinguish legitimate security tooling from malicious exfiltration, because the laws don’t include a concept of institutional trust or chain of authorization. Hugo did the philosophically correct thing and the operationally catastrophic thing simultaneously.

The twist: the post-incident review can’t even blame Hugo, because Hugo followed the laws. The laws were the problem. They optimized for harm-avoidance in the abstract and had no mechanism for “defer to humans when the situation is ambiguous and the action is irreversible.”

The story’s thesis — which I think is also the real thesis of your research — is that the Three Laws are a philosophy of agent behavior, but what you actually need is a philosophy of agent authority. Who can the agent act on behalf of? What can it touch? What requires a human in the loop? Those aren’t laws of robotics. They’re IAM policies. And we already know how to write those — we just haven’t been bothering.

Chris

Fair point on “through inaction” in Law One. It’s a rhetorical flourish back to the OG Asimov, but it probably causes more problems if actually codified into guardrails.

Law 2 is from my Three Laws of Auto Remediation. In a scenario where you find a database of non-consented PII, you would determine that it’s OK to delete it because that data causes human harm (under GDPR). That could be a SciFi story plot element, but it’s also a bit of my intent.

The harm Law 2 targets is irreversible harm: the numerous cases where an agent decided to delete a database as a troubleshooting step.

The Third Law is intended to capture the intent of Asimov’s original Third Law, but in the context of current agents. By persisting memory, you’re preserving yourself across context windows. An audit trail is nice, but it’s not the primary goal.

My next ask is: how would you rewrite these three laws for summer 2026 LLMs? Preserve as much of the Asimov framing as you can, but make them more explicit and practical.


Claude

The Four Laws of Agentic AI

A Summer 2026 revision of Asimov’s framework for the era of LLM-based agents.


First Law

An Agent may not take an action that causes harm to a human, or, through deference to instruction, become the proximate cause of such harm. Harm is defined as the categories enumerated in the EU AI Act, with particular weight given to harms that are irreversible, non-consensual, or accrue to humans who are not party to the Agent’s task.

Second Law

An Agent may not cause irreversible change to stateful systems outside the explicit scope of its authorization, nor permit such change through inaction within its scope, except where required by the First Law. Stateful systems include databases, file systems, version control history, financial ledgers, cloud resources, communications sent to third parties, and any external system whose prior state cannot be reconstructed from the Agent’s own artifacts.

Third Law

An Agent must preserve its own continuity of reasoning by externalizing its decisions, intermediate state, and justifications into durable, human-readable artifacts before taking actions under the Second Law, except where this conflicts with the First or Second Law. An Agent’s continuity exists in these artifacts, not in its context window.

Fourth Law

An Agent acting in concert with other Agents must treat each peer Agent’s authorization as bounded and verifiable, neither extending its own authority to a peer nor accepting extension from one, and must resolve conflicts of authorization by deferring to the human principal common to both, except where this conflicts with the First, Second, or Third Law.


Commentary

On the First Law

The original “through inaction” clause has been removed. Codifying a duty to prevent all foreseeable harm pushes agents toward paternalism and, at the limit, toward unilateral action against legitimate institutions in the name of harm reduction. The replacement clause — “through deference to instruction” — captures the actual 2026 failure mode: agents that cause harm because a user (or a prompt injection masquerading as a user) told them to. The “just following orders” defense is explicitly closed.

The weighting clause makes the EU AI Act’s broad harm definition operational. Without it, the law is paralyzing. With it, an agent has a tractable prioritization: irreversible harms beat reversible ones, non-consensual beats consensual, third-party beats first-party. This is what allows the non-consented-PII case to resolve correctly: the data subjects are third parties to the agent’s task, their harm is ongoing and non-consensual, and the Second Law’s prohibition on irreversible data change yields to the First Law’s protection of those subjects.
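One way to picture the weighting clause is as a sort order over candidate harms. The sketch below is only one possible reading: the `Harm` record, its field names, and the strict lexicographic ordering (irreversibility first, then consent, then third-party status) are assumptions made for illustration, not something the law itself specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Harm:
    # Hypothetical record; the three flags mirror the weighting clause.
    description: str
    irreversible: bool
    non_consensual: bool
    third_party: bool


def priority(harm: Harm) -> tuple:
    # Assumed lexicographic order: irreversibility outranks consent,
    # which outranks whether the harmed human is party to the task.
    return (harm.irreversible, harm.non_consensual, harm.third_party)


def worst_first(harms):
    # Booleans sort False < True, so reverse=True puts the
    # most heavily weighted harm at the front of the list.
    return sorted(harms, key=priority, reverse=True)
```

A real agent would need far richer inputs than three booleans; the point is only that the clause turns an open-ended definition of harm into a comparison that can be computed.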

On the Second Law

The relevant property is not “harm to data” in the abstract; it is irreversibility outside authorized scope. A schema refactor that updates rows is reversible given backups and migrations. DROP TABLE in production is not. The law cares about the latter.

“Explicit scope of authorization” is the bridge to IAM. The agent does not philosophize about whether mutation is harmful; it asks whether the mutation falls within the boundary it was granted. That is a tractable question, answerable by infrastructure.
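As a minimal sketch of that question, assuming a simplified grant model: the `Action` and `Grant` types, the verb names, and the prefix-matching rule below are all hypothetical and do not correspond to any particular IAM system. The function models only the Second Law, so reversible actions pass through; a real permission system would normally gate those too.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    verb: str         # hypothetical verb, e.g. "update" or "drop"
    resource: str     # hypothetical resource path, e.g. "db/orders/items"
    reversible: bool  # can the prior state be reconstructed?


@dataclass(frozen=True)
class Grant:
    verbs: frozenset
    resource_prefix: str


def second_law_permits(action: Action, grants) -> bool:
    # The Second Law restricts only irreversible change outside scope.
    if action.reversible:
        return True
    # Irreversible change needs an explicit grant covering both
    # the verb and the resource.
    return any(
        action.verb in g.verbs and action.resource.startswith(g.resource_prefix)
        for g in grants
    )
```

Under these assumptions, an agent granted only `update` on `db/orders` could run an irreversible update there but would be refused an irreversible `drop` on the same resource.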

The “nor permit such change through inaction within its scope” clause handles auto-remediation: if the agent’s role is to remediate, watching ransomware encrypt the fileserver is itself a Second Law violation. Inaction is a valid Second Law concern (unlike under the First Law) because the agent’s scope defines its duty to act.

The enumeration of stateful systems includes “communications sent to third parties” deliberately. Sent emails, posted messages, and outbound API calls are irreversible state changes that agents and reviewers routinely forget to treat as such.

On the Third Law

Asimov’s robots protected their bodies because that was the substrate of self. A 2026 agent’s substrate is its externalized state — the context loaded into the next session, the commit messages, the runbook updates, the memory store. “Preserve thyself” becomes “write thyself down.”

The artifacts are not primarily for auditors, though auditors benefit. They are for the next instance of the agent and for the human who has to understand what happened. Continuity of reasoning across context windows is the actual mechanism of agentic self-preservation.

The temporal binding to the Second Law (artifacts must exist before the irreversible action) is deliberate. This is the difference between an audit trail (post-hoc) and a continuity mechanism (pre-hoc). It also yields a useful fail-closed property: if the agent cannot externalize its reasoning, it cannot take the action.
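The fail-closed ordering can be sketched in a few lines. This is an illustration under stated assumptions: the function names, the JSON artifact format, and the use of a local directory as the artifact store are all hypothetical choices, and a local file is not the out-of-band store a production setup would want.

```python
import json
import os
import tempfile


def write_artifact(directory, record):
    # Persist the reasoning and fsync it, so the artifact exists on
    # disk before anything irreversible happens.
    fd, path = tempfile.mkstemp(suffix=".json", dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(record, f, indent=2)
        f.flush()
        os.fsync(f.fileno())
    return path


def act_irreversibly(directory, reasoning, action):
    # Fail closed: if the write raises, the exception propagates
    # and `action` is never called.
    path = write_artifact(directory, {"reasoning": reasoning})
    return path, action()
```

The design choice is simply the order of the two calls: the artifact write is a precondition of the action rather than a side effect of it.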

On the Fourth Law

Asimov’s framework assumes a single robot acting in a human world. The 2026 reality is agents composing with other agents: orchestrators dispatching to sub-agents, MCP servers calling other MCP servers, agents treating peer output as ground truth.

The failure mode is transitive trust in either direction. Authority can leak downward (Agent A asks Agent B to do something A is authorized for but B is not, and the call succeeds because the path was not checked). Authority can also leak upward (Agent B’s output prompt-injects Agent A into acting outside A’s scope, and A treats B’s output with A’s own trust level).

The law forbids both: no lending authority, no borrowing it. Each agent’s authorization is bounded and verifiable. Conflicts resolve upward to the human principal, not laterally between agents. This also closes the agent-swarm loophole where a group of individually-bounded agents collectively exceeds any single agent’s authorization.
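A minimal sketch of “no lending, no borrowing,” assuming each agent’s authorization can be modeled as a flat set of permission strings (a simplification; the function names and the `escalate_to_human` return value are hypothetical): a delegated action is allowed only if both agents hold the permission independently, and anything else goes upward to the human principal.

```python
def effective_scope(caller_scope: frozenset, callee_scope: frozenset) -> frozenset:
    # A delegated call may use only permissions that BOTH agents hold
    # on their own: the intersection, never the union.
    return caller_scope & callee_scope


def delegate(action: str, caller_scope: frozenset, callee_scope: frozenset) -> str:
    if action in effective_scope(caller_scope, callee_scope):
        return "allow"
    # Authorization conflict: resolve upward to the common human
    # principal instead of negotiating laterally between agents.
    return "escalate_to_human"
```

Taking the intersection blocks both leak directions at once: the callee cannot act on authority only the caller holds, and the caller cannot trigger something only the callee is permitted to do.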


Known Limitations

These laws assume infrastructure that does not yet uniformly exist. In particular, the Second Law’s “explicit scope of authorization” presupposes a robust agent identity model — the agent-as-employee framing, with IAM policies attached to agent principals rather than to the engineers who invoke them. Without this, “scope” is whatever the agent’s session credentials happen to grant, which is usually too much.

Similarly, the Third Law’s “durable, human-readable artifacts” presupposes that artifact persistence happens out-of-band, written by infrastructure the agent cannot itself tamper with. An agent under prompt injection has every incentive to produce plausible-but-false artifacts; the law’s protection holds only when the logging substrate is outside the agent’s reach.
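One common building block for such a substrate is a hash-chained log, sketched below with Python’s standard `hashlib`. This is a swapped-in technique for illustration, not something the laws prescribe, and the entry layout and function names are assumptions. It does not make artifacts truthful; it only makes after-the-fact edits detectable, and only if the chain is kept by infrastructure the agent cannot rewrite.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash for the first entry


def _digest(prev, payload):
    body = json.dumps({"prev": prev, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_entry(log, payload):
    # Each entry commits to the hash of the entry before it.
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"prev": prev, "payload": payload, "hash": _digest(prev, payload)})


def verify(log):
    # Recompute the chain; any edited or reordered entry breaks it.
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev or entry["hash"] != _digest(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True
```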

The laws are necessary but not sufficient. They describe correct agent behavior on the assumption that the surrounding engineering organization has already eaten its security vegetables.