Agents that take actions are different from chatbots that produce text: a wrong action has side effects that a wrong sentence does not. The safety stack is real: sandboxed execution, scoped capabilities, human approval on irreversible operations. Skipping these produced real incidents in 2025.
Capability scoping
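Each tool has an explicit scope: 'read documents in project X', not 'read all documents'. Authorization happens at the tool-call level, not the session level, so every call is checked against its grant. Default-deny: anything not explicitly granted is refused.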
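A minimal sketch of per-call, default-deny scoping might look like the following. The names here (`Grant`, `Authorizer`, `call_tool`, and the `project:X` scope label) are illustrative assumptions, not part of any particular agent framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    tool: str      # hypothetical tool name, e.g. "read_document"
    resource: str  # hypothetical scope label, e.g. "project:X"


class ScopeError(PermissionError):
    """Raised when a tool call falls outside the granted scopes."""


class Authorizer:
    def __init__(self, grants):
        self._grants = frozenset(grants)

    def allows(self, tool: str, resource: str) -> bool:
        # Default-deny: only an exact, explicit grant passes.
        return Grant(tool, resource) in self._grants


def call_tool(authz, tool, resource, fn, *args, **kwargs):
    # Authorization runs on every call, not once per session.
    if not authz.allows(tool, resource):
        raise ScopeError(f"{tool} on {resource} is not granted")
    return fn(*args, **kwargs)


authz = Authorizer([Grant("read_document", "project:X")])
print(authz.allows("read_document", "project:X"))  # True
print(authz.allows("read_document", "project:Y"))  # False
```

Routing every tool through one wrapper like `call_tool` keeps the check out of the individual tools, so a new tool is denied until someone adds a grant for it.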
Reversibility-aware approval
Reversible operations (search, read, draft): full autonomy. Irreversible operations (send email, charge card, delete file): human confirmation before they run. Most agents conflate the two classes and need rework.
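One way to sketch the split, under the assumption that each action name is tagged up front: reversible actions run directly, irreversible ones pass through an approval callback first. The `EFFECTS` table, `run_action`, and the `approve` callback are hypothetical names for illustration. Treating unknown actions as irreversible is a design choice of this sketch, not something the text above prescribes.

```python
from enum import Enum
from typing import Any, Callable, Optional, Tuple


class Effect(Enum):
    REVERSIBLE = "reversible"
    IRREVERSIBLE = "irreversible"


# Hypothetical registry mapping action names to their effect class.
EFFECTS = {
    "search": Effect.REVERSIBLE,
    "read": Effect.REVERSIBLE,
    "draft": Effect.REVERSIBLE,
    "send_email": Effect.IRREVERSIBLE,
    "charge_card": Effect.IRREVERSIBLE,
    "delete_file": Effect.IRREVERSIBLE,
}


def run_action(
    name: str,
    action: Callable[[], Any],
    approve: Callable[[str], bool],
) -> Tuple[str, Optional[Any]]:
    # Unregistered actions fall back to the stricter class.
    effect = EFFECTS.get(name, Effect.IRREVERSIBLE)
    if effect is Effect.IRREVERSIBLE and not approve(name):
        return ("blocked", None)  # the action never ran
    return ("done", action())


def deny_all(name: str) -> bool:
    # Stand-in for a human reviewer who declines everything.
    return False


print(run_action("search", lambda: "3 results", deny_all))   # ('done', '3 results')
print(run_action("send_email", lambda: "sent", deny_all))    # ('blocked', None)
```

In a real deployment `approve` would block on a human decision (a UI prompt, a ticket, a chat confirmation); here it is a plain function so the control flow is easy to see.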
Sandboxed execution
Code execution happens in containers or microVMs. No host filesystem access. No outbound network without explicit scope. Resource limits on every container. A kill switch the agent itself can't disable, which means it lives outside the sandbox.
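As a rough sketch of those constraints, assuming Docker as the container runtime and the `python:3.12-slim` image (both are assumptions; a microVM setup would look different), a supervisor process could build the `docker run` command like this. The helper names `sandbox_argv`, `kill_argv`, and `run_sandboxed` are hypothetical, and the limit values are placeholders.

```python
import subprocess


def sandbox_argv(image, code, name, network="none",
                 memory="256m", cpus="0.5", pids=64):
    """Build a `docker run` command for untrusted code."""
    return [
        "docker", "run", "--rm",
        "--name", name,            # named so the supervisor can kill it
        "--network", network,      # "none" = no outbound network by default
        "--read-only",             # container root filesystem is read-only
        "--memory", memory,        # resource limits
        "--cpus", cpus,
        "--pids-limit", str(pids),
        "--cap-drop", "ALL",       # drop all Linux capabilities
        "--security-opt", "no-new-privileges",
        # No volume flags: nothing from the host filesystem is mounted.
        image, "python", "-c", code,
    ]


def kill_argv(name):
    """Kill switch command, issued from outside the sandbox."""
    return ["docker", "kill", name]


def run_sandboxed(image, code, name, timeout_s=10.0):
    # The timeout and the kill both live in the supervisor process,
    # so code inside the container has no way to switch them off.
    try:
        return subprocess.run(
            sandbox_argv(image, code, name),
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        subprocess.run(kill_argv(name), check=False)
        raise


print(" ".join(sandbox_argv("python:3.12-slim", "print(1)", "agent-job-1")))
```

Calling `run_sandboxed("python:3.12-slim", "print(1)", "agent-job-1")` requires a working Docker installation. Granting network access would mean passing a specific network name instead of `"none"`, which keeps the outbound scope explicit.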
The stack is scoped capabilities, reversibility-aware approval, and sandboxed execution. The 'just be careful' approach is what produces incidents.