A Benchmark Quietly Crossed a Line

Agents that operate software just crossed 75% on OSWorld-V — the benchmark that measures whether AI can actually use real applications. Not describe how to use them. Use them: open browsers, click through CRMs, fill forms, navigate spreadsheets. Two years ago this number was below 20%. Now it's a working baseline.

A benchmark passing a threshold isn't itself news. What matters is what it unlocks for the people building products. And this one unlocks a lot, because it changes a long-standing assumption: that for an agent to use your tool, your tool needs an integration.

The UI Is Becoming the API

For years, "can an agent work with this system" meant "does this system expose an API." That constraint is loosening. If an agent can reliably drive a real interface, it doesn't need a published API to operate your tool — it can use the same screen a person does.

Three consequences follow for builders:

The UI becomes an interface for machines, not just humans. Agents drive your product natively, no integration required.
Internal tools no longer need polished UX to be agent-friendly. What they need is consistent state and clear affordances — predictable buttons, stable layouts, reliable behaviour. Beauty is optional; consistency is not.
The integration backlog stops being a hard blocker. If your stack has 40 tools and 3 official APIs, agents can cover the gap by operating the other 37 directly.
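To make "consistent state and clear affordances" concrete, here is a minimal Python sketch of a drift check: it compares the interactive elements a screen exposed in a known-good run against what it exposes now. The `Affordance` schema and the `layout_drift` helper are hypothetical names for illustration, and how you collect the elements (accessibility tree, DOM snapshot, screenshot parsing) is left as an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Affordance:
    """One interactive element on a screen (hypothetical schema)."""
    role: str   # e.g. "button", "textbox"
    label: str  # visible or accessible name


def _ordered(items: set[Affordance]) -> list[Affordance]:
    # Sort for stable, diff-friendly output.
    return sorted(items, key=lambda a: (a.role, a.label))


def layout_drift(
    baseline: set[Affordance], current: set[Affordance]
) -> dict[str, list[Affordance]]:
    """Report elements that disappeared or appeared since the baseline."""
    return {
        "missing": _ordered(baseline - current),
        "unexpected": _ordered(current - baseline),
    }


def is_stable(baseline: set[Affordance], current: set[Affordance]) -> bool:
    """True when the screen exposes exactly the affordances it used to."""
    drift = layout_drift(baseline, current)
    return not drift["missing"] and not drift["unexpected"]
```

Run against a baseline captured per screen, a check like this flags a renamed or relocated button before an agent trips over it. It says nothing about visual polish, which matches the point above: consistency is what the agent depends on.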

Where Your Moat Actually Lives

This reframes the competitive question sharply. If having an integration was part of your defensibility, that part is eroding — anyone's agent can now reach into tools that have no API. So the moat moves. It's no longer where you have an integration. It's where you have something an agent can't infer just by looking at a screen: proprietary data, deep workflow context, or judgment that isn't encoded in any UI.

That's a healthier place for a moat to live anyway. Integrations were always somewhat commoditised. Proprietary data and hard-won workflow understanding are not.

The Engineering Catch

There's a real cost to plan for. Observability for UI-driven agents is harder than for API calls. When an agent calls an API, you get a structured request and response. When an agent clicks through a UI, you get a sequence of screen interactions that's far messier to trace, reproduce, and debug. The failure modes are stranger too — a moved button or a slow-loading modal can derail a run in ways an API contract never would.
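One way to narrow that gap is to give every UI step the shape of an API call: a named action, an expected outcome, and a recorded result. The sketch below assumes you can wrap each agent action in a pair of callables, one that performs it and one that checks the screen afterwards. `RunTrace`, `StepRecord`, and their fields are hypothetical names, not part of any particular agent framework.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Callable, List, Optional


@dataclass
class StepRecord:
    """Structured record of one UI interaction (hypothetical schema)."""
    index: int
    action: str        # e.g. "click", "type"
    target: str        # human-readable element description
    expected: str      # what should be true after the step
    ok: bool           # did the post-condition hold?
    duration_s: float
    error: Optional[str] = None


class RunTrace:
    """Collects step records for one agent run."""

    def __init__(self, run_id: str) -> None:
        self.run_id = run_id
        self.steps: List[StepRecord] = []

    def step(
        self,
        action: str,
        target: str,
        expected: str,
        do: Callable[[], None],
        check: Callable[[], bool],
    ) -> bool:
        """Perform `do`, verify with `check`, and record the outcome."""
        start = time.monotonic()
        error = None
        try:
            do()
            ok = bool(check())
        except Exception as exc:  # record the failure instead of losing it
            ok = False
            error = repr(exc)
        self.steps.append(
            StepRecord(
                index=len(self.steps),
                action=action,
                target=target,
                expected=expected,
                ok=ok,
                duration_s=time.monotonic() - start,
                error=error,
            )
        )
        return ok

    def to_jsonl(self) -> str:
        """One JSON object per step, ready for a log pipeline."""
        return "\n".join(
            json.dumps({"run_id": self.run_id, **asdict(s)}) for s in self.steps
        )
```

A step whose check fails is recorded with `ok` set to false rather than passing silently, so a moved button or a modal that never loaded shows up in the trace at the step where it happened. In practice you would likely attach a screenshot or DOM snapshot reference to each record as well; that is omitted here to keep the sketch short.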

The teams that get burned are the ones that ship UI-native agents on the strength of a clean demo and discover the observability problem only when a silent failure hits a customer. Plan for it before the demos turn into incidents.

If you're rethinking how your product fits into messy, real-world software stacks, this is the shift to design for. We help founders and teams design and build digital products that scale with them rather than slow them down. If you're building in this space, get in touch with us today.

The takeaway: when the UI becomes the API, integrations stop being a moat, and data, context, and judgment take their place. Build for that, and instrument your UI-driven agents before they reach production.