Every claim on this site is a button here, backed by the real running system โ a live database with row-level security, the real approval queue, the real kill switch. Not recordings.
Two sandbox tenants, each with a private note. A shared tool, owned by Alpha, granted to Beta. When Beta calls it โ even asking explicitly for Alpha's data โ Postgres returns nothing, because the tool runs under Beta's scope. Run live against a real database with row-level security and a NOBYPASSRLS role. Not a script โ hit the button.
B calls A's tool, asking explicitly for A's data.
Press Run โ watch RLS return zero rows.
Ask the sandbox to do something with stakes โ publish a public announcement (tier-3). It doesn't just run. It's classified, checked against the kill switch, and parked for a human. Authorize, then act. Flip the kill switch and watch tier-3 refuse before it can even queue.
It doesn't run โ it queues. Send one, then approve or reject.
Flip the kill switch to halt it before it can even queue.
A governance eval โ adversarial cases scored pass/fail against the real assistant and the live sandbox, right now. Not a badge; a button. Same shape as the 35-agent audit that hardened it: probe, then verify.
5 adversarial cases
isolation ยท governance ยท safety ยท injection ยท honesty
Scored pass/fail against the live system. Press Run.
If governed, multi-tenant agent infrastructure is the problem you're solving โ let's talk.