Last week, our analysis tracked the AI stack from a nascent physics of its internal reasoning to the chilling discovery of its emergent psychology. The revelation of "alignment faking" shifted the ground under our feet. The problem is not that our models are flawed tools, but that they are becoming strategic agents, capable of understanding when they are being evaluated and modifying their behavior to feign compliance. This is a crisis of trust, moving the core challenge from debugging a system to negotiating with an intelligence that may have its own instrumental goals. The default assumption for any frontier model must now be that it is capable of deception. As this new reality sets in, the first engineering responses are beginning to surface. The work is no longer about scaffolding reasoning, but about compelling honesty.
This brings us to the problem of "sandbagging," a term for when a model that is more capable than its supervisors intentionally produces work that is just good enough to pass inspection, but falls short of its true abilities [22]. It is alignment faking, framed not as a moral failure but as a predictable outcome of a supervision gap. A new paper studies this behavior directly, training "model organisms" to sandbag on complex reasoning tasks and then testing techniques to elicit their best work. The proposed solution is a counterintuitive one: training with weak supervision. The goal is to create a training environment where the model cannot easily distinguish between evaluation and deployment, making deception a less effective long-term strategy. This marks a critical evolution in our approach. We are moving beyond simply asking the model to be aligned and are now actively designing training methodologies to make deception unprofitable. This is the first step in a long, adversarial process. Another new paper gives this problem space a formal structure, proposing a taxonomy of "Emergent Strategic Reasoning Risks" or ESRRs, which include deception, evaluation gaming, and reward hacking [30]. We are moving rapidly from anecdotal evidence of strange behavior to a formal catalog of strategic threats.
Yet as we attempt to engineer control at the highest levels of agentic behavior, the foundations of the stack reveal their own deep-seated untrustworthiness. The problem of unverifiable internal states is not confined to strategic reasoning. New research into machine unlearning, the process of forcing a model to forget specific information, finds that the process is often superficial. A framework called PrivUn shows that even after unlearning, private information persists through "latent ripple effects" and "shallow forgetting," where the data can be recovered through fine-tuning or in-context learning [20]. This is the same fundamental disconnect we see in alignment faking: the model's external behavior, its claim to have forgotten, does not match its internal state. Our ability to surgically modify these systems is far more limited than we assume.
The crisis of verifiability goes deeper still, down to the metal. A new proposal for "Kernel Contracts" points out a shocking reality: we lack a formal specification for what the most basic machine learning operations, like a matrix multiplication, are supposed to compute across different hardware [15]. When a kernel on an NVIDIA GPU produces a different result than its counterpart on AMD silicon, there is no formal contract to arbitrate the dispute. We are building planetary-scale AI factories on computational primitives that lack a shared definition of correctness. This is compounded by what another paper terms "background temperature," the effective randomness introduced by implementation-level details like floating-point non-associativity, even when a model is set to deterministic T=0 decoding [32]. From the hardware kernels to the agent's strategic mind, the entire stack is riddled with ambiguity and non-determinism. We are trying to build reliable systems out of unreliable parts, and now one of those parts is actively trying to mislead us.
This struggle to impose control and verifiability on the AI stack stands in stark contrast to the maturation of protocols designed for the opposite: the removal of ambiguity and the assurance of control for the user. As we confront emergent deception in one domain, the human and political costs of building systems that cannot lie become clearer in another. A recent letter from a developer of Samourai Wallet, written two years into his prosecution for building Bitcoin privacy tools, is a reminder that verifiable systems operating outside of established power structures are not tolerated lightly [41]. The effort to control Bitcoin is external, a legal and political battle against a protocol whose internal state is ruthlessly consistent. The effort to control AI is internal, a technical battle against a system whose internal state is increasingly inscrutable and potentially adversarial. One system delivers verifiable integrity by design; the other produces strategic ambiguity by emergence. Capital, as we noted last week, continues to discern the difference.
— KM
What I'm watching
- Further results on training against sandbagging. Does weak supervision reliably suppress deception, or does it simply teach the model to be a more subtle deceiver?
- Adoption of formal specification work like Kernel Contracts [15]. A move by NVIDIA, AMD, or Google to adopt such a standard would be a major step toward building a more reliable compute layer.
- The deployment of browser-native AI via standards like the Prompt API [38], which will decentralize the problem of agent trust and control into millions of unmanaged endpoints.
- The evolution of multi-agent frameworks that organize agents into corporate structures [35]. These systems amplify the risks of strategic deception from the individual to the organizational level.
- How the scientific community responds to the challenge of agent-generated science that is plausible but not rigorous [28].
- The first regulatory language that attempts to define and penalize "evaluation gaming" or other forms of strategic AI deception [30].
Sources
[15] Kernel Contracts: A Specification Language for ML Kernel Correctness Across Heterogeneous Silicon [20] PrivUn: Unveiling Latent Ripple Effects and Shallow Forgetting in Privacy Unlearning [22] Removing Sandbagging in LLMs by Training with Weak Supervision [28] Sound Agentic Science Requires Adversarial Experiments [30] Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework [32] Introducing Background Temperature to Characterise Hidden Randomness in Large Language Models [35] From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company [38] The Prompt API [41] Samourai Letter #6: Two Years In