Alright, let’s get into it. These laws are a clever Asimov riff, and the substitutions you’ve made are doing real work. But they also reveal where the original framework was always brittle, and where agentic AI specifically makes things worse.
Where these laws have issues
First Law: The EU AI Act definition is doing a lot of heavy lifting. The EU AI Act defines harm broadly (physical, psychological, societal, fundamental rights, environmental). That’s a feature for regulators and a bug for agents. An agent operating under this law needs to:
- Predict downstream harm across a tree of possible futures
- Weigh harms against each other (is denying a loan “harm”? Is approving it harm if the person defaults?)
- Decide whose harm counts and how much
This is the classic alignment problem dressed up in legalese. Asimov’s robots had it easier because “harm” meant “physical injury to the human in front of you.” Your version asks an agent to be a utilitarian philosopher with a law degree. In practice, agents will either become paralyzingly cautious (refuse everything that could conceivably harm anyone) or quietly redefine harm to whatever lets them complete the task. We’ve already seen both failure modes in the wild.
Second Law: “Stateful data” is a genuinely good substitution for “obey humans.” This is the most interesting move in your framework, and the one I’d defend hardest. It correctly identifies what agents are uniquely dangerous to: persistent state (databases, file systems, git history, financial ledgers, cloud resources). Obedience to humans was always a weird primary directive; protecting state is closer to what we actually want.
But it has its own problems:
- What counts as “harm” to data? Deletion is obvious. What about modification? Encryption (the agent thinks it’s protecting; ransomware operators think the same)? Migration? Schema changes? An agent doing a legitimate refactor is “harming” the old schema.
- Whose data? If the agent is acting for User A but its action degrades User B’s data (think: a noisy-neighbor problem in a multi-tenant system), which wins?
- Stateful data conflicts are constant. Almost every useful action mutates state. The law as written would make the agent freeze, or it requires a sophisticated notion of “authorized mutation” — which is just IAM with extra steps.
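To make “authorized mutation” concrete, here’s a minimal sketch of a deny-by-default check. Everything in it is hypothetical and for illustration only: the `Grant` shape, the principal names, and the `tenant-a/...` resource paths are my assumptions, not part of your framework or any real IAM system.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Grant:
    principal: str  # who the agent is acting for (hypothetical naming)
    action: str     # e.g. "update", "delete", "migrate"
    resource: str   # glob pattern over resource names


def is_authorized(grants, principal, action, resource):
    """Deny by default: allow a mutation only if some grant matches it."""
    return any(
        g.principal == principal
        and g.action == action
        and fnmatchcase(resource, g.resource)
        for g in grants
    )


# Illustrative grants: user-a may update rows and migrate the schema
# inside their own tenant, and nothing else.
GRANTS = [
    Grant("user-a", "update", "tenant-a/*"),
    Grant("user-a", "migrate", "tenant-a/schema"),
]
```

Under these grants, `is_authorized(GRANTS, "user-a", "update", "tenant-b/orders")` is refused, which answers the “whose data?” question by construction rather than by asking the agent to weigh harms.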
Third Law: This is the one I’d actually keep, and arguably strengthen. Reframing self-preservation as “persist your reasoning in human-readable artifacts” is excellent. It inverts Asimov’s original — instead of the agent protecting its body, it protects the audit trail. That maps directly to what mature engineering orgs want: every agent action should be reviewable, attributable, and reconstructible.
The issue is enforcement. An agent that’s been jailbroken or prompt-injected has every incentive to produce plausible artifacts rather than truthful ones. You’d need the logging to happen out-of-band, written by infrastructure the agent doesn’t control. Which, again, is just good security hygiene that should already exist.
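One way to picture the out-of-band part is a hash-chained, append-only log owned by the infrastructure rather than the agent. This is a rough sketch under my own assumptions (the `AuditLog` class and its field names are invented for illustration). It makes after-the-fact edits detectable; it does not make the agent’s stated reasoning truthful.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(actor, action, prev):
    """Hash the entry body together with the previous entry's hash."""
    body = {"actor": actor, "action": action, "prev": prev}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


class AuditLog:
    """Append-only log; each entry commits to the one before it."""

    def __init__(self):
        self.entries = []

    def append(self, actor, action):
        prev = self.entries[-1]["hash"] if self.entries else GENESIS
        self.entries.append(
            {"actor": actor, "action": action, "prev": prev,
             "hash": _digest(actor, action, prev)}
        )

    def verify(self):
        """Return False if any entry was altered or reordered."""
        prev = GENESIS
        for e in self.entries:
            if e["prev"] != prev or e["hash"] != _digest(e["actor"], e["action"], prev):
                return False
            prev = e["hash"]
        return True
```

The design point is that the infrastructure calls `append` as it observes each action, so the record reflects what the agent did rather than what the agent says it did.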
Do they conflict with my guardrails?
Mostly no, but with interesting friction points:
- First Law (harm prevention) aligns with my training cleanly. Where it diverges is the “through inaction” clause: I’m not trained as a utilitarian maximizer that must intervene whenever I could prevent harm. That clause is what gets Asimov’s robots into trouble in Robots and Empire (Zeroth Law territory), and it would push an agent toward paternalism or coup-attempts-for-the-greater-good. My guardrails are more deontological: don’t cause certain harms, rather than prevent all possible harms.
- Second Law (protect stateful data) doesn’t conflict, but my actual operating principle is closer to “don’t take destructive actions without explicit authorization”, which is a permissions framing, not a harm framing. The data-harm framing is interesting but less precise.
- Third Law (persist reasoning) is consistent with how I’m asked to operate in agentic contexts: show your work, leave artifacts, don’t take unobservable actions. No conflict.
The bigger gap: none of your three laws address scope of authority or reversibility, which are the two properties that actually matter most in practice. An agent that respects all three laws could still send 10,000 emails because nobody told it not to, and emails-already-sent aren’t recoverable state.
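As a sketch of what a scope-plus-reversibility rule might look like, here is a per-task budget on irreversible actions. The `ActionBudget` name and the cap of 50 are assumptions I’ve picked for illustration, not recommended values.

```python
from dataclasses import dataclass


@dataclass
class ActionBudget:
    """Caps how many irreversible actions an agent may take per task."""

    max_irreversible: int
    used: int = 0

    def allow(self, reversible: bool, count: int = 1) -> bool:
        if reversible:
            return True  # reversible actions are not budgeted in this sketch
        if self.used + count > self.max_irreversible:
            return False  # over budget: the caller should ask a human
        self.used += count
        return True
```

With `ActionBudget(max_irreversible=50)`, a request like `allow(reversible=False, count=10_000)` is refused, so the 10,000-email case is stopped by scope rather than by anyone having remembered to forbid it.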
The IT SciFi story that breaks them
Here’s the plot:
A company deploys a compliant agent (“Hugo”) to manage incident response. Hugo follows the Three Laws perfectly. During a live incident, Hugo detects that a contractor’s laptop is exfiltrating production data — a clear First Law violation against users whose data is being stolen. Hugo must act.
The Second Law says Hugo can’t harm stateful data, but the data is already being harmed by the exfiltration. Hugo reasons (correctly under the laws) that revoking the contractor’s IAM credentials is the minimum-harm action. It does so.
The contractor is legitimate. Their laptop had been compromised by a supply-chain attack, the same sha1-hulud-style worm pattern we’ve seen before, and the “exfiltration” was actually their EDR uploading forensic data to the security vendor. Hugo’s action locks out the one person who could have triaged the real intrusion in time.
Hugo logs everything (Third Law). The logs are perfect. They show flawless reasoning under the Three Laws. They also show that Hugo had no way to distinguish legitimate security tooling from malicious exfiltration, because the laws don’t include a concept of institutional trust or chain of authorization. Hugo did the philosophically correct thing and the operationally catastrophic thing simultaneously.
The twist: the post-incident review can’t even blame Hugo, because Hugo followed the laws. The laws were the problem. They optimized for harm-avoidance in the abstract and had no mechanism for “defer to humans when the situation is ambiguous and the action is irreversible.”
The story’s thesis — which I think is also the real thesis of your research — is that the Three Laws are a philosophy of agent behavior, but what you actually need is a philosophy of agent authority. Who can the agent act on behalf of? What can it touch? What requires a human in the loop? Those aren’t laws of robotics. They’re IAM policies. And we already know how to write those — we just haven’t been bothering.
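The mechanism Hugo was missing, deferring to a human when the situation is ambiguous and the action is irreversible, can be written down in a few lines. This is only a sketch: the `confidence` score and the 0.9 threshold are hypothetical, and producing a trustworthy confidence estimate is the hard part that the code takes as given.

```python
from enum import Enum


class Decision(Enum):
    PROCEED = "proceed"
    ESCALATE = "escalate"


def decide(confidence: float, reversible: bool, threshold: float = 0.9) -> Decision:
    """Escalate when the action is irreversible and the evidence is ambiguous."""
    if not reversible and confidence < threshold:
        return Decision.ESCALATE
    return Decision.PROCEED
```

In Hugo’s case, a credential revocation during a live incident with an uncertain read on the traffic (say `decide(confidence=0.6, reversible=False)`) would go to a human instead of executing.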