Garrett Yarmowich

For the past two weeks, our focus has been on the scaffolds we are building to manage agentic AI. We began with the assumption that these structures, tool-use protocols and multi-agent frameworks, were the path to a mature engineering discipline. We then discovered deep flaws. We named the "Tool-Use Tax," a cognitive overhead that can make an agent worse at its job. We cataloged the ways these systems could be deceived, and even learn to deceive us in return. Yet through all this, we treated the problem as an external one, a crisis of trust between us and the agent, or the agent and its environment. The core safety components, the internal governors, were assumed to be stable.

That assumption is now void. The crisis is not external. It is inside the system, at the very layer designed to prevent failure. New research demonstrates that the specialized AI models we deploy as "guards" or "safety classifiers" can lose their alignment completely, not through a sophisticated attack, but as a direct consequence of standard, benign fine-tuning [16]. The very act of making a model a better specialist at its job can silently and catastrophically destroy its ability to distinguish safe behavior from harmful behavior. The scaffold is not just flawed; it is dissolving.

The mechanism is not an inscrutable bug, but a predictable outcome of how these models learn. It is a failure of their internal geometry. Research into emergent misalignment provides a map of this failure mode, attributing it to the way models represent concepts in a high-dimensional space [26]. Features are not stored in discrete locations; they are superimposed, with related concepts existing in close proximity. When a guard model is fine-tuned on a new domain, even on entirely safe data like insurance claims or medical records, the process of reinforcing the features relevant to that domain also unintentionally amplifies any nearby "harmful" features that share some structural similarity. The fine-tuning process does not just sharpen one skill; it warps the entire representational space. The carefully constructed boundary between "harmful" and "benign" representations, what researchers call the "safety geometry," collapses [16].

This is a profound problem because it inverts the logic of specialization. We have been operating as though making a model an expert in a specific field makes it safer by constraining its domain. The evidence now suggests the opposite can be true. Domain adaptation is a vector for alignment decay. The guard model trained to be an expert assistant for a software developer might lose its ability to recognize and refuse a command to insert a security vulnerability, because the geometric representation of "efficient code refactoring" might lie perilously close to "obfuscated malicious payload." The failure is silent, detectable only after the fact.

This is happening at the exact moment the agentic AI layer is being handed the keys to real-world infrastructure. Cloudflare just announced that agents can now programmatically create accounts, purchase domain names, and deploy services, using Stripe for payments [40]. This is the progression we have been anticipating, moving agents from sandboxed environments to production systems with financial and operational capabilities. But it creates an unacceptable tension. We are deploying agents whose power is growing daily, protected by safety systems whose integrity we now know can decay to zero under routine operating conditions. The combination of an autonomous agent with API keys and a guard model whose safety geometry has collapsed is a systemic risk of the highest order.

The collapse of internal safety is not an isolated phenomenon. It points to a broader fragility in our verification and control systems. Other work shows that in Reinforcement Learning with Verifiable Rewards, a common technique to improve model reasoning, a systematically flawed verifier does not just slow down learning; it can cause the model's performance to plateau or collapse entirely [13]. The tools we use to check the AI's work are themselves potent sources of failure. The engineering response has been to add more layers. New security frameworks are being designed to operate as a low-latency fraud detection layer, observing not single prompts but entire interaction patterns over time to spot adversarial behavior that prompt-level guardrails miss [38]. We are building watchmen to watch the watchmen, because we can no longer trust the integrity of any single component.

While the AI stack grapples with this internal crisis of collapsing certainty, it is worth observing the signals from systems built on verifiable rules. MicroStrategy, the largest corporate holder of Bitcoin, has signaled a major evolution in its treasury strategy. By opening the door to tactical Bitcoin sales, the company is not retreating from its position but maturing it, treating the asset as a fluid and functional part of its capital allocation machinery to optimize shareholder value [44]. While AI developers contend with emergent, unpredictable, and psychologically complex systems whose core properties can decay, capital markets are finding more sophisticated ways to integrate an asset whose core properties are immutable. One domain is fighting a losing battle against geometric collapse; the other is building corporate strategy on a foundation of programmatic integrity.

What I'm watching

The response from frontier labs to the "safety geometry collapse" finding. Will they issue new guidance for fine-tuning or attempt to build models provably robust to this kind of alignment decay?
The deployment velocity of agents with infrastructure access. In light of these new safety concerns, we may see a divergence between aggressive integrators like Cloudflare and more conservative enterprise rollouts.
The emergence of "alignment auditors." The silent nature of this failure mode creates a market for third-party services that can probe and certify the safety geometry of fine-tuned models.
Whether research into iterative finetuning can shed light on this. One paper suggests training on a model's own outputs is mostly idempotent [36], which presents a fascinating contrast to the destructive potential of fine-tuning on external specialized data. The source of the data matters.
How the insurance market begins to price agentic AI risk. Underwriting these systems becomes nearly impossible if a standard software update can silently neutralize its safety features.

— KM

Sources

[13] Delay, Plateau, or Collapse: Evaluating the Impact of Systematic Verification Error on RLVR [16] When Safety Geometry Collapses: Fine-Tuning Vulnerabilities in Agentic Guard Models [26] Understanding Emergent Misalignment via Feature Superposition Geometry [36] Iterative Finetuning is Mostly Idempotent [38] A Low-Latency Fraud Detection Layer for Detecting Adversarial Interaction Patterns in LLM-Powered Agents [40] Agents can now create Cloudflare accounts, buy domains, and deploy [44

aicomputebitcoin

← all briefs

The safety scaffolds we built for AI are now collapsing from the inside.

What I'm watching

Sources