Which Windsurf alternative is best for large codebases?

Cursor and Claude Code handle large codebases most effectively. Cursor supports up to 8 parallel agents on Pro+. Claude Code offers 200K token context on Pro (1M tokens via API). For the largest context window at no cost, Gemini Code Assist's 1M-token individual free tier is the standout option.

Builder.io Blog

Developer experience is dead. Long live agent experience.

Mon, 08 Jun 2026 18:00:00 GMT

For the past ten years, we've obsessed over the developer feedback loop.

We didn’t do this just to make local setups feel fast (though that was always a nice side effect). We did it because friction is the enemy of momentum. Developers don't write great software through sheer willpower; they write it when the system around them makes the right path the easy path.

Good developer experience (DX) assumes people are tired, that they skipped the migration guide, and that they're one cat-on-the-keyboard away from taking down prod. It wraps you in fast, deterministic safety nets—compilers, linters, hot reloading, and branch previews—so you can focus on building.

Lately, we've been inviting a new kind of contributor to our repos: AI agents. But for some reason, we've been treating them like wizards instead of the completely stateless tools they are.

An agent has no lived memory of your product. It doesn't know about the legacy bugs your team learned to fear, and it has no tribal knowledge of your codebase. Left to its own devices, a stateless agent will walk directly into the same architectural wall five times in a row unless the system around it provides a better feedback loop.

If coding agents are going to do meaningful work in real-world codebases, we have to stop only optimizing our prompts and start actively engineering their environments. We need to transition from developer experience to agent experience (AX).

What is agent experience (AX)?

Agent experience is the discipline of designing the layer between a model and a real codebase: the context, tools, permissions, tests, and review loops that tell the agent what matters, what it can touch, and how it knows it worked.

It’s about a simple, first-principles question: How do we build a fast, secure, and deterministic feedback loop for agents?

From there, we have seven core tenets.

1. Context is onboarding

Developer onboarding used to be a seasonal event: how quickly can a new human understand this repo and make a safe first contribution?

With agents, onboarding happens at the start of every single task. Repository instructions, setup commands, component APIs, schemas, and known failure modes shape what the agent sees, tries, ignores, and verifies.

The temptation, then, is to over-document. To make sure the agent has all the context it could ever need.

But left unchecked, teams accumulate a graveyard of skills, AGENTS.md rules, stale definitions, prompt snippets, and hidden tool instructions. In a world of too much context, when an agent makes a bad choice, the human reviewer has to debug a massive, non-deterministic input history to figure out which stale instruction or conflicting rule led the model astray.

Good agent context instead behaves like good code. It should be minimal, transparent, and tested.

Minimal: Global context stays thin and, anywhere possible, points back to the code itself.
Transparent: A reviewer can easily audit which rule, skill, or setup note shaped the work.
Tested: Team skills are self-explanatory enough that the agent can invoke them at the right time, and the reviewer can understand why.

Agent rules, skills, and other context must be a team discipline rather than a pile of experiments.
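One way to keep context "tested" is to lint it the way you lint code. The sketch below is a minimal illustration, not an established tool: it assumes a hypothetical convention where context files mention repo paths in backticks, and it flags any mentioned path that no longer exists. The function name and the convention are assumptions made for this example.

```python
import re

# Hypothetical convention: context files (AGENTS.md, skills, setup notes)
# mention repo files in backticks, e.g. `src/ui/button.py`.
PATH_PATTERN = re.compile(r"`([\w./-]+\.[A-Za-z0-9]+)`")


def find_stale_references(context_text: str, repo_files: set[str]) -> list[str]:
    """Return backticked file paths that the context mentions but the repo lacks."""
    referenced = set(PATH_PATTERN.findall(context_text))
    return sorted(referenced - repo_files)
```

In practice, repo_files could be built from the output of git ls-files, and a non-empty result could fail CI so that stale instructions get fixed before an agent ever reads them.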

2. The environment is part of the prompt

We accept that large language models (LLMs) are inherently non-deterministic. Because their cognitive engine is probabilistic, the rest of their execution environment must be aggressively deterministic.

The environment is literally part of the prompt. Dependency versions, local scripts, environment-variable shapes, seed data, auth setups, browser access, and local services determine what the agent can observe and correct.

"I couldn't run the tests locally, but this should work" is a massive DX red flag when a human says it. When an agent says it, we tend to just hope for the best.

A human developer who hits a missing environment variable, a broken database seed step, or a cryptic Docker error will stop and investigate. An agent will route around the failure, change the wrong file, and ship a guess with a polite commit message. If the agent can't compile the code, run the dev server, seed the database, or hit a local API, its output is not trustworthy software.

An agent must have a reliable, consistent, observable workspace before its output can be trusted.
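A preflight check is one simple way to make the workspace observable before any task starts. The sketch below assumes the team keeps a list of required environment variables and command-line tools; the names PreflightReport and preflight are hypothetical, chosen for this example. The idea is to fail loudly up front instead of letting the agent route around a broken setup.

```python
import os
import shutil
from dataclasses import dataclass, field


@dataclass
class PreflightReport:
    missing_env: list[str] = field(default_factory=list)
    missing_tools: list[str] = field(default_factory=list)

    @property
    def ok(self) -> bool:
        # The workspace is usable only when nothing is missing.
        return not self.missing_env and not self.missing_tools


def preflight(required_env, required_tools, env=None) -> PreflightReport:
    """Report missing environment variables and tools before an agent starts work."""
    env = os.environ if env is None else env
    return PreflightReport(
        # Unset and empty variables are both treated as missing.
        missing_env=[name for name in required_env if not env.get(name)],
        # shutil.which returns None when a command is not on PATH.
        missing_tools=[tool for tool in required_tools if shutil.which(tool) is None],
    )
```

A runner could call preflight first and refuse to hand the task to the agent unless report.ok is true, surfacing the missing pieces to a human instead.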

3. No handoffs without verification

Agents shouldn't stop at generating code. They need to prove their work.

We're currently trading the friction of writing code for the exhausting cognitive load of reviewing it. If you have to spend thirty minutes manually QA-ing an agentic PR, checking edge cases, and cleaning up generic layout styles, the agent didn't save you time. It just shifted the labor.

And when devs have too much review labor, quality slips drastically.

Good AX means the agent does more of that work before handoff. When its task is complete, the agent should present evidence: tests run, screenshots captured, browser flows checked, logs inspected, accessibility trees reviewed, and edge cases explored. Developers shouldn't have to rediscover all of that from scratch.

Typechecks, unit tests, and linting are still the bones of a serious workflow. But product work also fails where compilers are blind: responsive layouts that collapse, loading states that trap users, or broken flows that technically compile. Giving agents access to tools like Chrome DevTools MCP, Playwright, and automatic branch previews allows them to gather empirical evidence before handing off the work.

Spend tokens before spending reviewer attention. Tokens are cheap and pretty much unlimited, and agents can run 24/7. Senior developer focus is precious, expensive, and burns out. If an agent can check its own work in a closed loop, it should.

And when the agent presents its work, the handoff should be as easy as possible to act on. A reviewer should be able to leave a visual note on a preview branch or point to a failed check and send that context back into the agent's execution loop, letting it self-correct in the workspace it already understands, rather than forcing a developer to copy-paste feedback across tools to restart the workflow.
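The "no handoff without verification" rule can be expressed as a small gate. The sketch below is illustrative only: it assumes each check is a plain callable that returns a pass flag and a detail string, and every name here (CheckResult, run_checks, ready_for_handoff, summarize) is hypothetical. Real checks would wrap a test runner, typechecker, or browser flow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    name: str
    passed: bool
    detail: str = ""


def run_checks(checks) -> list[CheckResult]:
    """Run each named check; a crashing check counts as a failure, never a pass."""
    results = []
    for name, check in checks.items():
        try:
            passed, detail = check()
        except Exception as exc:
            passed, detail = False, f"check crashed: {exc}"
        results.append(CheckResult(name, passed, detail))
    return results


def ready_for_handoff(results) -> bool:
    # No evidence at all is treated the same as failing evidence.
    return bool(results) and all(r.passed for r in results)


def summarize(results) -> str:
    """Render the evidence a reviewer sees attached to the handoff."""
    return "\n".join(
        f"[{'PASS' if r.passed else 'FAIL'}] {r.name}: {r.detail}" for r in results
    )
```

The design choice worth noting is that an empty result list blocks the handoff: "I couldn't run the checks" should never read as "the checks passed."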

4. Safety needs to be deterministic

Good DX made dangerous actions hard. Good AX needs to make dangerous actions impossible. Agents act quickly, literally, and at scale.

A prompt like, "Make sure not to mess with the database!" doesn't really help. Prompts are easily bypassed. Safety must be structural: sandboxing, scoped credentials, file and network limits, separate development and production data, environment-variable approval gates, and human-in-the-loop validation for high-risk actions.

Any time I see drama on Twitter where an agent dropped a database, I think, "Why did your system allow for that? Why did the agent have that access?"

The same bounded-environment principles apply as agents move beyond isolated developer terminals. If a PM, designer, or marketer uses an agent to iterate on a product surface or ship a real change, they shouldn't accidentally inherit root access to a local machine or production credentials. Safety can't depend on whether a non-technical teammate understands what rm -rf, OAuth scopes, or production environment variables can do.

The sandbox must be the absolute security boundary.
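To make "structural, not prompted" concrete, here is a minimal default-deny policy check. It is a sketch under stated assumptions: the action names (write_file, http_request), the SandboxPolicy type, and the decide function are all hypothetical, and a check like this only complements a real OS-level or container sandbox. It does not resolve symlinks, for example.

```python
import posixpath
from dataclasses import dataclass


@dataclass(frozen=True)
class SandboxPolicy:
    workspace_root: str
    allowed_hosts: frozenset
    approval_required: frozenset  # action names that always need a human


def decide(policy: SandboxPolicy, action: str, target: str) -> str:
    """Return "allow", "deny", or "ask_human" for a requested agent action."""
    if action in policy.approval_required:
        return "ask_human"
    if action == "write_file":
        root = posixpath.normpath(policy.workspace_root)
        # normpath collapses "..", so escapes like "../secrets" are caught.
        resolved = posixpath.normpath(posixpath.join(root, target))
        inside = resolved == root or resolved.startswith(root + "/")
        return "allow" if inside else "deny"
    if action == "http_request":
        return "allow" if target in policy.allowed_hosts else "deny"
    # Anything the policy does not explicitly know about is denied.
    return "deny"
```

The agent never sees this policy as an instruction it could ignore; the runner consults it before executing anything, which is what makes the outcome deterministic.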

5. Model routing as boring infrastructure

We spend too much time focusing on the weekly frontier model horse race, but the real questions for teams are:

Who can access agents?
Against what systems?
With what context?
Under what review process?
Through which model route?
And with what audit trail?

Model routing should be boring in the best possible way. Teams need provider flexibility and task-appropriate models, but developers shouldn't have to become model-selection experts just to get work done. Cheap, fast models can help with low-risk summarization, triage, classification, scaffolding, or routine review support. Deterministic syntax validation belongs to linters, typecheckers, and test runners. Expensive reasoning models should be reserved for harder multi-file judgment work.

Good governance belongs directly inside the agent path. Admins should see usage and costs, reviewers should see execution evidence, and teams should have the flexibility to switch providers without rewriting their entire application logic.
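Boring routing can be as plain as a lookup table with an audit trail. The sketch below is an assumption-heavy illustration: the task-type names, the three tiers, and the choice to send unknown tasks to the reasoning tier are all hypothetical defaults, not a recommendation for any specific provider or model.

```python
from enum import Enum


class Tier(Enum):
    DETERMINISTIC = "deterministic-tooling"  # linters, typecheckers, test runners
    FAST = "fast-model"
    REASONING = "reasoning-model"


# Hypothetical routing table; a real team would own and review this like code.
ROUTES = {
    "syntax_validation": Tier.DETERMINISTIC,
    "summarization": Tier.FAST,
    "triage": Tier.FAST,
    "classification": Tier.FAST,
    "scaffolding": Tier.FAST,
    "multi_file_refactor": Tier.REASONING,
}


def route(task_type: str, user: str, audit_log: list) -> Tier:
    """Pick a tier for the task and record who asked for what."""
    # Assumption: unknown task types fall back to the most capable tier.
    tier = ROUTES.get(task_type, Tier.REASONING)
    audit_log.append({"user": user, "task_type": task_type, "tier": tier.value})
    return tier
```

Because callers ask for a tier rather than a named model, swapping providers means changing what each tier maps to, not rewriting the code that calls route.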

6. Design systems and code architecture are the agent's best source of truth

Your codebase isn't just for human maintainability anymore. For an agent to execute tasks reliably, the codebase must be the most accurate record of how the product actually works.

If your codebase is messy—if the docs say one thing, the components do another, and Storybook is three versions behind—the agent will synthesize that confusion into elegant-looking garbage.

To make your codebase ready for AI, you have to design it using classic software engineering principles: deep modules with thin public interfaces, typed APIs, predictable routing, and clean directory structures.

This is progressive disclosure for machines. By keeping implementation details hidden behind clean interfaces, we lower the cognitive load on the agent, which translates directly to lower token usage and fewer logic errors.
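As a small illustration of a deep module with a thin, typed public interface, consider the sketch below. The domain (order totals), the flat 8 percent tax rate, and all names are hypothetical. The point is only the shape: one documented entry point, with the details kept private behind it.

```python
__all__ = ["order_total_cents"]

_TAX_PERCENT = 8  # hypothetical flat rate, for illustration only


def order_total_cents(items: list[tuple[int, int]], coupon_percent: int = 0) -> int:
    """Return the order total in cents.

    items is a list of (unit_price_cents, quantity) pairs. This is the only
    function a caller, human or agent, needs to read.
    """
    discounted = _apply_discount(_subtotal(items), coupon_percent)
    return discounted + _tax(discounted)


def _subtotal(items: list[tuple[int, int]]) -> int:
    return sum(price * quantity for price, quantity in items)


def _apply_discount(amount: int, percent: int) -> int:
    if not 0 <= percent <= 100:
        raise ValueError("coupon_percent must be between 0 and 100")
    return amount - amount * percent // 100


def _tax(amount: int) -> int:
    return amount * _TAX_PERCENT // 100
```

An agent that needs a total can work from the signature and docstring alone, without loading the discount or tax rules into its context.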

The same applies to design systems. The more your codebase forces the agent to reuse human-made components, tokens, and accessibility patterns, the less likely the output is to become generic sludge. The agent shouldn't be inventing a custom button when your team's button is right there.

7. Agents become cross-functional glue

Agent experience starts with developers, but it's actually an organizational coordination engine.

When agents make Version 1 of a feature cheap to generate, they don't solve Version 2, 3, or 4. Without a shared workspace, the developer becomes a high-priced human router who copies visual feedback from Slack, translates it back to the agent's prompt interface, runs the workspace locally, and manually manages the branch.

But the best AX moves away from pre-code abstractions toward shared iteration around live product surfaces.

With interactive preview deployments and role-aware controls, a designer, marketer, or PM can talk to the agent directly inside a safe, bounded preview. They can test responsive states, iterate on copy, or tweak layouts on a branch preview, while developers retain ownership over architecture, safety, and system integration. The developer is no longer the copy-paste bottleneck; they are the platform engineer who owns the system design.

Agent experience deserves a platform

You can certainly cobble these pieces together yourself: a coding agent extension, an isolated sandbox service, a rules file in one repo, a browser testing tool, a manual review workflow over Slack, etc.

But when you do that, the seams often become the work. You will likely find yourself spending more time maintaining your internal agentic infrastructure than shipping features.

At Builder, our bet is that agentic work becomes truly valuable when the whole team can collaborate on real code with shared context, deterministic environments, visual previews, governance, and a clear path back to human review.

By treating the codebase, the execution environment, and the team as a single collaborative workspace, we make it possible for agents to run tests, compile code, and generate visual branch previews automatically. It changes the agent from a disconnected code-generation tool into a reliable, structural team contributor.

You can learn more about this philosophy by seeing how Fusion works or reaching out to one of our AX experts.

The golden rule of agent experience

LLMs should do the glue work. People should do the interesting work.

If humans are copying feedback between tools, re-explaining repo context, manually checking whether the agent broke the obvious thing, policing stale docs, and cleaning up generic output while the model makes the creative decisions, the system is upside down.

Good agent experience gives creative people more room to use judgment, taste, architecture, strategy, care, and craft.

Developer experience became a discipline because we realized that software quality is a function of the systems we build. Agent experience will become a discipline for the exact same reason.

The point isn't to replace the people who understand the system. The point is to give them back the time to do the work only humans can do.

Read the full post on the Builder.io blog

How POGR Cut $30K and a Year of UI Work with Builder

Thu, 04 Jun 2026 18:00:00 GMT

POGR sponsors Blazium Games, a gaming community that is managed by ten developers, each of whom ships their own title. Two years ago, they built Blazium, a lag-tolerant engine that holds up at 300ms ping under network conditions that break most multiplayer games. They publish across Steam, iOS, Google Play, Epic, and itch.io, with a GOG pipeline in progress. Ten developers carrying multiple titles on a startup budget means every dollar and every week has to count.

UI was POGR's most expensive surface

Game UI is the most expensive surface in game development. It touches the interface a player interacts with every second, the environment they move through, the character art on screen, and the marketing pages that sell the game in the first place. Every one of those surfaces requires design, animation, and engine integration before anything can ship.

POGR's process compounded that cost at every step. The team designed UIs in Figma and ran them through custom conversion tools to produce structured HTML and React components. From there, engineers manually ripped components apart, wrote CSS by hand, extracted every SVG, and rebuilt the whole thing inside the game engine. Animation came at the end of that chain, scripted from scratch on top of static screens.

The numbers on their flagship project, Depths, tell the story:

$5K spent to reach 25% UI completion
$30K+ projected to finish the same project
2.5 to 3 months for a quarter of the UI alone
A full year of development for the complete UI

The gaming industry offered no tooling to compress the cycle, and the AI tools that existed produced images that looked like UI but generated nothing usable in an engine. For a small studio shipping multiple titles a year, the math forced a constant tradeoff between shipping late, shipping over budget, or shipping with a UI that fell short of the game's ambition.

One workflow from design to engine

POGR adopted Builder, starting with the Figma plugin to automate Figma-to-React conversion on web projects. The team quickly recognized that Builder's output (clean CSS, SVGs, and built-in animation) was the same raw material a game engine needs.

That insight reshaped the workflow. The Blazium team now prototypes its game UIs directly in Builder, exports the output as React or HTML, and converts the assets into native game engine formats, including C# for the Blazium engine. Animation logic carries through and is replicated in the engine's scripting language, like GDScript, meaning motion ships with the design from the start rather than being bolted on at the end.

Builder now runs the entire UI operation across ten developers and multiple games in production. The output is real CSS, real components, and real animation logic that survives the trip into a game engine, which is the gap no other AI tool fills for game studios.

From cost center to shipping advantage

The savings compound across Blazium's roadmap. With multiple titles shipping a year, $30K saved per project, and a year of recovered dev time per project, this adds up to a fundamentally different operating model. Money that used to disappear into UI conversion now flows into gameplay, networking, and the systems that set a Blazium game apart.

Quality moved up alongside the cost and timeline gains. Blazium's team describes the result as making interfaces "feel alive," with motion players notice, polish reviewers reward, and presentation that holds up across storefronts.

"We were looking at a year and $30K to finish the UI on one project. Now it takes weeks. There's nothing else doing this for game studios."

— Randolph (Randy) Aarseth II, CTO & Co-Founder, POGR.io

The market opportunity POGR sees

POGR sees a wider opening here. UI is the highest-cost surface in game development, and tooling that compresses it produces compounding returns across every game a studio ships. Indie developers in particular have been entirely priced out of high-quality animated UIs, and POGR considers Builder the closest thing the industry has to a fix.

They're advocating for a dedicated Builder gaming module with native GitHub deployment, integrations across Unity, Unreal, and Godot, and multi-user editing with centralized source control. A module like that would automate the asset conversion and deployment steps POGR still handles manually, turning Builder into an integrated game UI pipeline. Tooling at that level would let a studio of 10 ship like a studio of 50, and would unlock the same workflow for thousands of indie developers still stuck in the Figma-to-engine grind POGR escaped.

POGR is now promoting Builder across the Blazium engine community, game jam circuits, and indie developer networks on social media. Their pitch is direct: no other tool is doing this for game studios, and the math on the other side of the switch speaks for itself.

POGR helps game developers and communities connect through player profiles, stats, and shared gaming experiences. Visit the community website, explore the platform, and join the Discord.

Get Builder’s new engineering guide on AI-native development, and start building for free.


The AI Product Ladder (and why most apps are stuck on Rung 1)

Wed, 27 May 2026 18:00:00 GMT

There's an AI feature in many SaaS products right now that looked incredible in the all-hands demo.

It's got a button. Maybe even a loading spinner. And, probably, it spits out a summary of something: an email, a ticket, a spreadsheet. The demo kills, so it ships as a feature. What follows is six months of near-zero usage.

It wasn't a bad idea. It wasn't bad UX. And it wasn't bad timing. It was a Rung 1 product, a single LLM call dressed up as a feature, and users found its ceiling in about thirty seconds. The fix is not a better prompt. The fix is an agent-native architecture.

There's a ladder that every AI product climbs. Three rungs. But most teams stop one rung too early, if that.

Here's the ladder:

Rung 1: the single LLM call

This is the anti-pattern.

A text box sends a prompt. The AI returns a string. You display it. Maybe with a loading spinner, maybe with a "copy" button if you're feeling fancy. There's no way for the user to course-correct. No way for the AI to take action. No way to see what happened or why.

You see this everywhere. The "Summarize" button bolted onto a CRM. The "Generate description" field in an e-commerce admin. The "Draft reply" widget in a support tool. They look impressive in a demo, but they break the moment reality gets messy. An edge case the model wasn't expecting. An output that's close but wrong. Users who need to iterate and are left with no options.

And they often break invisibly. The user doesn't get an error. They get a bad string, shrug, and go back to doing it manually. Then they stop clicking the button entirely.

Three months later, the team schedules a meeting to "improve the prompts." They adjust temperature settings. They rewrite the system message. They get the output from 65% good to 75% good, but that's still not reliable enough to replace the manual workflow. Another sprint, another tweak, another flat usage graph. Eventually, it gets cut in a cleanup ticket. Nobody files a bug. Nobody notices it's gone.

Teams ship products like this because it's fast. It clears the "AI feature" box on the roadmap. It demos clean to a non-technical stakeholder who's never actually used it for real work. It's the path of least resistance from "we should add AI" to "we shipped AI."

That's not a product. That's a toy.

The tell: if removing the AI feature would barely change how users do their job, it's Rung 1.

Rung 2: a chat with tools

Rung 2 is a real improvement.

Now the AI has tools (draft email, search contacts, run query, create ticket) and a chat interface where it works in front of you, showing tool calls and results as it goes. You can watch it reason. You can push back. You can see why it did what it did. Under the hood, this is what Claude, ChatGPT, and Cursor are: a chat interface with tools.

For general-purpose assistants, Rung 2 is the product. Claude is a chat interface with tools. That's not a limitation; that's the point.

But for a domain-specific app (a project management tool, a customer support platform, a dev workflow product), Rung 2 is a ceiling.

Here's why. There's still no real UI. No dashboards. No lists. No forms. No keyboard shortcuts. No team collaboration features. If the AI gets confused, the user's only recourse is to retype the request differently and hope for a better result. Non-developers especially struggle here; when the interface is a blank text box, and the AI is your only affordance, you're one ambiguous output away from being stuck.

There's also a subtler problem. The AI has no real context. It sees what's in the conversation thread, but it doesn't know what you're looking at in the app. It doesn't know what you've selected or what you just did. It's reasoning about your work from the outside.

Rung 2 is a great chatbot. It's a mediocre app.

The tell: if your "AI feature" is a chat panel that floats over the rest of your product and never touches the same state the rest of your product reads from, you're on Rung 2.

Rung 3: agent-native (agent and UI as equal partners)

Rung 3 is what agent-native means. You build a full-featured app around the agent, and crucially, every action the agent can take is also a button in the UI, and every button the user clicks runs the same logic the agent uses. One implementation. Two ways in.

Here's what that looks like in practice. Imagine you're building a customer support tool. A ticket comes in. A human agent clicks "Suggest reply" and gets a draft: one button, one action. The AI handling the overnight queue calls the same action to draft and send replies autonomously. The logic is identical. The difference is who invoked it.

That's the agent-native architecture. Not a chat panel bolted onto an app. Not an app with AI sprinkled in. One system where humans and agents are both first-class operators.
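The support-tool example can be sketched in a few lines. This is a toy illustration under loud assumptions: the in-memory STORE stands in for the app's database, the draft text is a placeholder where a real app would call a model, and every name (Ticket, suggest_reply, the two entry-point functions) is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Ticket:
    id: int
    customer: str
    body: str
    drafts: list = field(default_factory=list)


# Stand-in for the shared database that both the UI and the agent use.
STORE: dict[int, Ticket] = {}


def suggest_reply(ticket_id: int, invoked_by: str) -> str:
    """The single implementation of the domain action."""
    ticket = STORE[ticket_id]
    # Placeholder draft; a real app would generate this with a model.
    draft = f"Hi {ticket.customer}, thanks for reaching out about: {ticket.body}"
    ticket.drafts.append({"text": draft, "invoked_by": invoked_by})
    return draft


ACTIONS = {"suggest_reply": suggest_reply}


def on_suggest_reply_clicked(ticket_id: int) -> str:
    # Entry point 1: the button in the UI.
    return ACTIONS["suggest_reply"](ticket_id=ticket_id, invoked_by="human")


def handle_agent_tool_call(name: str, arguments: dict) -> str:
    # Entry point 2: the agent calling the same action by name.
    return ACTIONS[name](invoked_by="agent", **arguments)
```

Both entry points write to the same record, so a draft created by the overnight agent shows up in the same place a human's draft would. The only difference recorded is who invoked it.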

That single design decision changes three things:

You stopped adding buttons to a chatbot. You added an agent to an app.

The quality bar on both sides goes up. The UI is a real UI, full-featured, fast, familiar to users who don't want to type. The agent is a real agent. It can take every action in the product, not just the ones you wired up to a chat panel. Neither side is a watered-down version of the other.

The agent has real context.

It sees what you're looking at. It knows what you've selected, what you've recently done. It writes to the same database the UI reads from, so when it creates a record or updates a status, that change shows up immediately in the interface, not in a separate "AI output" box, but in the actual app. The agent isn't advising you from the outside anymore. It's working inside the same product you are.

External agents can use it too.

This is the one most teams don't anticipate. Because the app's actions are first-class objects, not prompt hacks, and not one-off API endpoints, they can be projected into any surface a host understands. Claude Code, Cursor, ChatGPT custom apps, and other MCP hosts can drive your app as an MCP server. Other agent-native apps can call yours over A2A. You build the domain operation once; the framework handles MCP tools, A2A endpoints, HTTP actions, deep links, and CLI entry points from the same definition.

You don't become a protocol expert. You just build the action.

This is also why agent-native apps handle so many protocols without making developers' lives more complicated: the architecture is a single-action model with multiple entry points, not a separate integration per surface. One implementation, many ways in. For users, for your own agent, and for the agents of every other app in the ecosystem.
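Here is one way to picture "a single-action model with multiple entry points." The sketch is not any real framework's API: the Action type and the helper functions are hypothetical, and the descriptor shape (name, description, inputSchema) is only loosely modeled on how MCP hosts list tools. Treat it as an illustration of deriving several surfaces from one definition.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Action:
    name: str
    description: str
    params: dict  # parameter name -> JSON Schema type name, e.g. "integer"
    handler: Callable[..., Any]


def as_tool_descriptor(action: Action) -> dict:
    """Project the action into a tool listing an agent host could read."""
    return {
        "name": action.name,
        "description": action.description,
        "inputSchema": {
            "type": "object",
            "properties": {p: {"type": t} for p, t in action.params.items()},
            "required": list(action.params),
        },
    }


def as_cli_usage(action: Action) -> str:
    """Project the same action into a command-line usage string."""
    flags = " ".join(
        f"--{p.replace('_', '-')} <{t}>" for p, t in action.params.items()
    )
    return f"{action.name.replace('_', '-')} {flags}".strip()


def invoke(action: Action, **kwargs) -> Any:
    """Every surface funnels into this one call path."""
    if set(kwargs) != set(action.params):
        raise ValueError(f"{action.name} expects exactly {sorted(action.params)}")
    return action.handler(**kwargs)


close_ticket = Action(
    name="close_ticket",
    description="Mark a support ticket as resolved.",
    params={"ticket_id": "integer"},
    handler=lambda ticket_id: {"ticket_id": ticket_id, "status": "closed"},
)
```

Adding an HTTP route or a deep link would mean writing one more projection function, not a second implementation of close_ticket.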

That's Rung 3.

Where is your product?

Most developers know the honest answer. The question is whether staying on Rung 1 or Rung 2 is a deliberate choice or just the path of least resistance.

What does agent-native mean?

Agent-native means an app where every domain action is a first-class object that humans (via UI), the in-app agent, and external agents (via MCP or A2A) can all invoke through the same single implementation. It's Rung 3 of the ladder: not a chat panel bolted onto a product, but one system where humans and agents are equal operators.

Rung 1 is fast to ship and easy to kill. Rung 2 is a real product for the right use case. Rung 3 takes more architecture up front. But it's the only rung where the AI feature is indistinguishable from the product itself, where users who don't want to chat can still benefit from everything the agent can do, and where the rest of the agent ecosystem can find you and use you.

One rung at a time is fine. Just don't stop climbing because the demo killed in the all-hands.


5 Questions to Ask Before Implementing an Agentic Development Platform

Thu, 28 May 2026 18:00:00 GMT

Before you bring on an agentic development platform, run the vendor through these five questions that reveal how the tool holds up in real workflows.

Most platform purchases look fine on the demo and fall apart six months in. The tool worked. The integration shipped. Adoption stalled because the platform solved a problem the team did not actually have, or solved it in a way that created two new ones nobody saw coming.

Better evaluation questions can prevent most of this. Vendors will steer you toward feature questions, which are convenient for them and useless for you. The questions that matter are the ones that surface how a platform behaves once it touches your real workflow, your real codebase, and the people who have to live with it for the next three years.

Here are five questions to run every vendor through before you sign anything.

1. Does the platform reduce your handoffs, or just rename them?

Every agentic development platform claims to cover the full lifecycle, but none really do. There is always a seam, usually more than one, where work has to leave the platform and go somewhere else. That seam is where projects die.

Ask the vendor to walk you through a real customer's workflow from idea to production. Skip the slide and ask for the actual sequence:

Where does design happen, and how does it get into code?
Where does code get written, and by whom?
Where does QA loop back when something breaks?
Where does a copy change made by marketing end up in the repo?

If the answer involves three different tools talking to each other through a Zapier-style connector, what you are buying is a coordination problem with a logo on it. The handoffs you have today will still exist under different names, and the team that owned them before will still own them after. This is the backlog problem AI didn't solve, and most platforms make it worse by adding another tool to the chain.

The platforms to take seriously have a clear answer about what they own end-to-end and an equally clear answer about where their tool stops. A vendor who claims to own everything is bluffing. A vendor who can show you exactly where the handoff happens and how it works deserves more of your time.

2. Does the user reaction match the buyer enthusiasm?

Buyers and users are different people. The buyer is usually a VP or director who sat through the polished demo with the sales engineer. The user is a frontend developer, a designer, or a PM who will open this tool 15 times a day for the next 2 years, and their reaction is what determines whether this purchase works.

Get the user in front of the platform before you buy. Skip the guided walkthrough and sit them down with a real task from their backlog. Watch what happens when they try to do it. A few things to pay attention to:

Where do they get stuck on the third or fourth thing they try, rather than the polished happy path the vendor showed you?
How does it feel when they have to fix a mistake, because most platforms are great at the create flow and clumsy at the edit flow?
Do they want to keep using it after thirty minutes, or are they quietly reaching for the tool they already know?

If your frontend engineers shrug and say it is fine, treat that as a failure signal. Engineers get opinionated about the tools they want to use. The absence of an opinion usually means the answer is no. The platforms that survive are the ones built so that agents work for the whole team, not just the developer who runs the demo.

3. Will the platform work with the codebase we actually have?

Vendor demos run on clean, simple codebases that the platform was designed around. Your codebase has six years of accumulated decisions, two competing component libraries, a half-finished design system migration, and a directory called legacy_DO_NOT_TOUCH that has been touched many times.

Ask for a proof of concept on your code. Pick a real project that includes the messy parts:

A page with conditional rendering
A form with custom validation
A component that pulls from three different state sources

What you are looking for is whether the platform respects what you already have or tries to replace it. Some platforms are happy to read your existing components, work within your conventions, and produce output that looks like the rest of your code. Others want to generate everything from scratch, which means everything they produce will feel like a foreign object that someone has to manually rewrite to match the house style. The first kind compounds in value over time. The second kind generates work you have to undo.

A platform built on agent-native architecture handles this better than one with AI bolted onto an older foundation.

Pay attention to what happens when the platform gets something wrong. Can a developer open the generated code, fix it directly, and have those fixes persist? Or does the next round-trip overwrite their work? A tool that cannot be corrected by its users will be abandoned within a quarter, no matter how good the initial output looks.

4. How does the platform fail, and who catches it?

Every AI-powered platform is wrong sometimes. The interesting question is what happens when it is.

Good failure modes are loud, easy to spot, and easy to fix. The tool produces something obviously broken; a developer notices it in 5 seconds, and the fix takes a minute. Bad failure modes are quiet. The tool produces something that looks correct but is subtly wrong, ships to staging, and gets caught by QA two days later. Worst case, it gets caught by a customer. This is how agent productivity creates a quality debt that compounds faster than teams realize.

Ask the vendor how they think about quality and what guardrails the platform has:

How does the platform handle ambiguous instructions?
What does it do when it does not know the answer?
How does a team review and approve output before it ships?

You also want to know what happens at scale. A platform that produces decent output for one component might produce inconsistent output across fifty. Ask to see what a real customer's repo looks like after six months of using the tool. Is it cohesive, or does it look like seventeen different developers with different styles all worked on it?

The boring answer here is governance, and boring is what you want. A platform that has thought hard about how teams maintain quality over time will hold up at team scale, where a platform that demos beautifully on a single screen will not.

5. What does pricing look like at the scale we actually want to reach?

Most platforms are priced to make the entry point feel reasonable. Per-seat pricing for the first ten users looks fine. Then you try to roll it out to forty people, and the math changes.

Run the numbers for the scale you actually want to reach, not the pilot:

How many seats do you need in year two if this platform works?
What does that cost?
Are there usage-based components that scale with the volume of your team's builds, such as per-generation, per-build, or per-deployment?

Those usage-based costs are easy to ignore during a pilot with three users, but painful once a team of forty is using the tool every day. Ask what happens when you exceed limits. Does the tool degrade gracefully or stop working entirely? Is there a meaningful conversation to have with the vendor about pricing at your scale, or are you stuck with whatever they put on the website?

Then ask about the second-order costs:

How much engineering time is required to integrate?
How much ongoing maintenance?
How much training for new hires?

A platform that costs $20 per seat but requires a dedicated platform engineer to maintain ends up costing more than a $100-per-seat tool that runs itself. The cheapest tool on paper often turns out to be the most expensive one in practice. The number that matters is the total cost of getting real value out of this, sustained over three years.
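
To make that arithmetic concrete, here is a rough sketch of the comparison. The seat count (40), the three-year horizon, the assumption that prices are per seat per month, and the $180,000 fully loaded annual cost of a platform engineer are all illustrative guesses, not figures from any vendor. Swap in your own numbers.

```typescript
// Hypothetical inputs for a three-year total-cost comparison.
interface PlatformCost {
  name: string;
  pricePerSeatPerMonth: number; // list price in USD (assumed to be monthly)
  seats: number;
  dedicatedEngineers: number; // full-time staff needed to keep the tool running
}

// Assumption: fully loaded annual cost of one platform engineer, in USD.
const ENGINEER_COST_PER_YEAR = 180_000;

function totalCost(p: PlatformCost, years: number): number {
  const licenses = p.pricePerSeatPerMonth * p.seats * 12 * years;
  const staffing = p.dedicatedEngineers * ENGINEER_COST_PER_YEAR * years;
  return licenses + staffing;
}

const cheapOnPaper: PlatformCost = {
  name: "$20 per seat, needs a maintainer",
  pricePerSeatPerMonth: 20,
  seats: 40,
  dedicatedEngineers: 1,
};

const runsItself: PlatformCost = {
  name: "$100 per seat, runs itself",
  pricePerSeatPerMonth: 100,
  seats: 40,
  dedicatedEngineers: 0,
};

for (const platform of [cheapOnPaper, runsItself]) {
  console.log(`${platform.name}: $${totalCost(platform, 3)} over 3 years`);
}
```

Under these assumptions, the staffing line dwarfs the license line, which is why the per-seat price alone tells you very little.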

The platforms most teams actually need

Run these five questions against the current market, and most agentic development platforms will fail at least two of them. They own a narrow slice of the workflow while claiming to own more. They demo well and feel clunky in daily use. They work on greenfield projects and struggle with real codebases. They fail quietly. They get expensive at scale.

The platforms that survive this kind of evaluation share a few traits. They are honest about where they fit in the workflow. They produce code that respects what teams already have. They give developers a way to correct mistakes that stick. They have thought about governance before you asked. Their pricing makes sense at the scale where the platform actually has to work.

Builder.io was built for teams that hit these questions hard. It owns the path from design and prototype to production code that lives in your repo, your component library, and your conventions. Developers can edit the output directly and have their changes persist. Marketing and design can make changes to live pages without filing a ticket. The platform was built to work inside existing engineering systems, so the code it produces feels like code your team would have written.

If you are evaluating agentic development platforms now, run these five questions against every vendor on your list, including us.

For a deeper look at how this works end-to-end, see the idea-to-production workflow.

Try Builder on your codebase with the work you actually have to ship. Get started for free, or speak with a Builder expert.

Read the full post on the Builder.io blog

Why the Best Agent-Native Apps Use Less AI

Tue, 26 May 2026 18:00:00 GMT

The mark of a great agent-native application is what it doesn't send to the model.

It was 2 AM, and I was still grinding away on my 'free design.md' hackathon project, so I barely noticed it at first. But my own agent had just routed a string into a frontier model so it could parse a few dozen lines of JSON.

The response, to be fair, was perfectly formed. The schema was understood. But here's the thing that slowly dawned on me, with some horror. That whole string I'd passed in was 12 fields and about 400 bytes. But the way my system was currently wired, the agent was the only execution layer for anything that the underlying code couldn't directly support.

With only one option, the agent passed my request to an expensive frontier model, which chewed on it for over 50 seconds, burned through about 50,000 tokens (because it was carrying a lot of context it didn't need), and then returned the solution. I was stunned. That same answer could have come from a JSON.parse call that would have returned in under one millisecond, with zero risk of hallucinating, and for zero cost.
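
For comparison, the deterministic path is only a few lines. This is a minimal sketch: the payload, the Order shape, and the parseOrder helper are invented for illustration and are not taken from the hackathon project.

```typescript
// Hypothetical payload standing in for the small JSON string described above.
const raw = '{"id": 42, "status": "shipped", "items": 3}';

interface Order {
  id: number;
  status: string;
  items: number;
}

// Parse and validate without any model call. Returns null on bad input.
function parseOrder(text: string): Order | null {
  try {
    const value: unknown = JSON.parse(text);
    if (typeof value !== "object" || value === null) return null;
    const candidate = value as Partial<Order>;
    if (
      typeof candidate.id !== "number" ||
      typeof candidate.status !== "string" ||
      typeof candidate.items !== "number"
    ) {
      return null;
    }
    return { id: candidate.id, status: candidate.status, items: candidate.items };
  } catch {
    return null; // malformed JSON
  }
}

console.log(parseOrder(raw));
console.log(parseOrder("{not json"));
```

The function either returns a validated object or null. There is nothing for it to hallucinate.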

That was the moment I understood I needed to write this piece. You see, the dominant agent-native discourse right now has the quality signal backward. Agent-native applications are the future. There's no denying it. There's no going back. But we've been measuring agent-native applications by their agentic surface area, by how much the agent can do, by how many tools it can reach, by how autonomous its loops are.

Instead, we should be measuring success by the inverse. By how much work an agent-native system can route back to production code, or to actions, which are newly written, reusable snippets of code that run on the backend. My app shouldn't need Claude to parse 20 lines of JSON.

So here's my argument: AI restraint will become the true quality signal for all future software.

The two-surface trap: the biggest problem with agent-native apps

It's worth asking why well-engineered AI products keep routing string parsing, arithmetic, and field lookups through the most expensive component in their stack.

The answer is architectural, not behavioral. Most agent-native applications, as the term is used today, give their users exactly two execution surfaces.

The first surface is the UI. Buttons, forms, and flows are the things the developers thought to build. Fast, predictable, testable, deterministic. But fixed. If a workflow isn't in the UI, it's not available to the user.

The second surface is the agent. It can access an LLM along with whatever tools the developers chose to wire up at build time. This surface is infinitely flexible, in the sense that I can describe anything I want, and the model will attempt it. But it's also slow, expensive, non-deterministic, and prone to confident hallucinations.

The UI is finite. The user's intent is infinite. And the agent is the only available bridge. So everything that falls into the gap, every parse, every filter, every date calculation, every status lookup, every sort, gets routed through inference by default. Not because anyone decided it should be. Because there's no other place for it to go.

This is the architectural defect at the heart of most agent-native applications. They've made the model the universal solvent for anything the UI can't do. And the universal solvent is, predictably, overkill for most of the things it dissolves.

The third execution surface

Agent-native applications, if the term is to mean anything precise, introduce a third execution surface.

This third execution surface is where users can define deterministic actions that the UI or agent can call. These actions are authored at runtime, not at build time. They are defined by people who are not the original engineers. They become available immediately to both the human UI and the agent loop. They are cheap, fast, testable, and correct by construction.

The crucial detail is the unification. In an agent-native architecture, the 'actions' surface is the same surface the agent calls and the same surface the human invokes through the UI. The agent doesn't know, and doesn't need to know, whether a given action was shipped by the original developer six months ago or written by a power user last Tuesday afternoon. The agent just sees an action it can call. The user just sees one thing the product can do.

This is the architectural move that makes restraint possible. Without it, restraint is a discipline that one team practices and another team forgets. With it, restraint becomes a property of the system itself. Every time someone notices that the agent is being asked to do the same deterministic thing repeatedly, they can crystallize that work into an action, and from that moment forward, the agent calls a fast, free function instead of running a slow, expensive inference.

My JSON.parse debacle stops being a hackathon embarrassment the moment somebody adds a parseResponse(endpoint, schema) action. That doesn't require a code deploy. It doesn't require the original engineers to be in the room. The agent learns about the new action through the same registry that exposes everything else, and from then on, the work happens at the speed of a function call. This is agent-native at its best.
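
Here is one way a shared action surface could look in miniature. Treat it as a sketch under stated assumptions: ActionRegistry, register, list, and call are hypothetical names, not the API of the agent-native framework or any other library. The parseResponse action here also takes a response body rather than an endpoint so the example runs without a network.

```typescript
type ActionArgs = Record<string, unknown>;

interface Action {
  description: string; // what the agent reads when deciding which action to call
  run: (args: ActionArgs) => unknown;
}

// Hypothetical registry: one surface for both the UI and the agent.
class ActionRegistry {
  private actions = new Map<string, Action>();

  // Actions can be added at runtime, with no deploy.
  register(name: string, action: Action): void {
    this.actions.set(name, action);
  }

  // The agent discovers available actions through this list.
  list(): string[] {
    return Array.from(this.actions.keys());
  }

  call(name: string, args: ActionArgs): unknown {
    const action = this.actions.get(name);
    if (!action) throw new Error(`Unknown action: ${name}`);
    return action.run(args);
  }
}

const registry = new ActionRegistry();

registry.register("parseResponse", {
  description: "Parse a JSON body and check that the required fields exist.",
  run: ({ body, requiredFields }) => {
    const parsed = JSON.parse(String(body)) as Record<string, unknown>;
    const required = Array.isArray(requiredFields) ? requiredFields.map(String) : [];
    const missing = required.filter((field) => !(field in parsed));
    if (missing.length > 0) throw new Error(`Missing fields: ${missing.join(", ")}`);
    return parsed;
  },
});

// A UI click handler and an agent tool call go through the same entry point.
const args = { body: '{"id": 1, "status": "ok"}', requiredFields: ["id", "status"] };
const fromUi = registry.call("parseResponse", args);
const fromAgent = registry.call("parseResponse", args);
console.log(registry.list(), fromUi, fromAgent);
```

The design choice worth noticing is that neither caller knows who authored the action or when. It only sees a name it can call.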

Agents are the prototype, actions are the product

There's a useful way to think about how the agent and the action surface relate to each other over the lifetime of an application.

The agent is the prototype. It's where novelty gets handled. It's where unfamiliar requests get reasoned through. It's the surface that absorbs the long tail of user intent that nobody anticipated and nobody coded for. It's, in a sense, the runtime version of an engineer thinking through a problem for the first time, which is exactly what makes it valuable and exactly what makes it expensive.

The actions are the resulting production code. They are what the prototype crystallizes into once the work becomes repeatable. The pattern is roughly this: when a user, a team, a metric, or an on-call log notices that the agent is being asked to do the same shape of thing repeatedly, that shape becomes a candidate for promotion to an action. Once promoted, the agent stops re-deriving the answer from first principles and starts calling the function. The reasoning moves from runtime to design time. The cost per invocation drops by 5×, 10×, or 100×. The variance collapses to zero.

This isn't a new idea in computing. It's how every other layer of the stack already works. Hot paths get optimized. Interpreters give way to compilers. Manual queries become stored procedures. Repeated business logic gets extracted from one-off scripts into shared libraries. What's new is that the AI era compresses this entire process. The "prototype" can be authored by typing a sentence in natural language. The promotion to "production code" can be done by a non-engineer at runtime. The crystallization happens at the speed of usage, not at the speed of sprint planning.

A great agent-native application is, structurally, one that makes this crystallization easy and continuous. A mediocre one keeps everything routed through the model forever, because there's no third surface to crystallize into.

Why AI restraint compounds into a moat

It's tempting to read my argument so far as a developer-experience point. It's more than that. The economics of restraint compound in a way that becomes a massive advantage at scale.

In any market where the AI cost structure is a meaningful fraction of the unit economics, which is to say all of them, the company that aggressively cultivates AI restraint will end up with structurally better margins, faster products, and higher trust. They'll be able to undercut the price, ship faster, and operate at a scale that the maximalist competitor can't match without setting their gross margins on fire.

Restraint compounds. The companies that figure this out first will look, from the outside, like they have a moat that competitors cannot cross. But the moat is just the cumulative effect of years of crystallization on the third surface.

What this means for how I build

The practical implication for anyone building in this space is that the architecture of the third execution surface (Actions) is not a feature you add later. It's what determines whether your product can ever become great.

The Builder team has been making this argument under the name "agent-native architecture," and the framework at agent-native.com is one concrete instantiation of the third-surface pattern. There will be others. The pattern is more important than any single implementation. What matters is recognizing that the third surface exists as a distinct architectural choice, and that the choice to build it (or not) determines whether your product can practice restraint in any serious way.

Build the action surface early. Make actions a first-class primitive, not a power-user escape hatch. Expose the same surface to humans and agents, so that anything the agent can call, a human can also invoke, and vice versa. Make it cheap and obvious for non-engineers to author new actions. Build the registry that lets the agent discover newly authored actions without a deploy. Track which actions get called most often, because that telemetry is your roadmap. Track which agent invocations could have been actions but weren't, because that telemetry is your debt.
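
The debt-tracking half of that advice can start very small. The sketch below counts how often the agent handles the same shape of request and flags anything that crosses a threshold as a promotion candidate. PromotionTracker, the shape labels, and the threshold of 3 are all hypothetical. How you decide that two requests have the same shape is the hard part, and it is left out here.

```typescript
// Hypothetical tracker for agent invocations that could have been actions.
class PromotionTracker {
  private counts = new Map<string, number>();
  private threshold: number;

  constructor(threshold: number) {
    this.threshold = threshold;
  }

  // Call this each time the agent handles a request via inference.
  record(shape: string): void {
    this.counts.set(shape, (this.counts.get(shape) ?? 0) + 1);
  }

  // Shapes seen at least `threshold` times, sorted for stable output.
  candidates(): string[] {
    return Array.from(this.counts.entries())
      .filter(([, count]) => count >= this.threshold)
      .map(([shape]) => shape)
      .sort();
  }
}

const tracker = new PromotionTracker(3);
for (const shape of ["parse-json", "sum-column", "parse-json", "parse-json"]) {
  tracker.record(shape);
}
console.log(tracker.candidates());
```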

The signal

We're in a moment where the agent-native discourse rewards the wrong things. Demos optimize for what the agent can do on its own. Investor decks measure agentic surface area. Engineering blog posts brag about loop depth and tool count. None of these is the quality signal.

The quality signal is, and will be, whether the system uses inference judiciously. Whether the architecture admits a third execution surface where deterministic work can crystallize. Whether the team treats every recurring agent call as a candidate for promotion to a function. Whether, over time, the ratio of work the agent could do to work the agent actually does is moving in the right direction.

In a few years, the gap between agent-native applications that practice restraint and the ones that don't will be one of the most visible quality signals in software. The restrained ones will be cheaper, faster, more correct, and more trustworthy. They'll feel, to users, like products that respect their time and money. The maximalist ones will feel like products that are constantly thinking out loud at the user's expense.

This is the case for restraint. The best agent-native applications use less AI. Not because AI is bad, but because the architecture that earns the agent-native label in the first place is the architecture that makes restraint possible.

Designing Generative UI in an Agent-Native World

Tue, 26 May 2026 18:00:00 GMT

If you caught any of the buzz around Google’s recent announcements about "improved AI Mode" for search, it’s official: Generative UI has entered the mainstream.

The promise sounds incredible: an application that completely throws out rigid, boring sitemaps and instead morphs, shifts, and builds custom workspaces on the fly based on whatever you type. Imagine asking an app to compare your quarterly sales alongside a flight itinerary, and watching it magically assemble the perfect dashboard for that exact moment.

But what does it all mean for designers? Especially in a world where, let's be real, AI sucks at making good design.

First, let's get on the same page about the terminology.

What is Generative UI?

At its core, Generative UI (often called GenUI) is a design pattern where parts of a user interface are dynamically generated, selected, or rendered by an AI agent at runtime.

Unlike traditional graphical interfaces that rely on hardcoded sitemaps and static templates, a generative interface adapts instantly to the user's specific context and intent.

Instead of just returning text or markdown inside a chat bubble, the AI uses structured data to assemble live, functional application surfaces like forms, interactive charts, and custom dashboards.

By mapping an AI's tool calls directly to functional UI components, the software shifts from a static wrapper into a fluid workspace built on demand.

The big problem with Gen UI

But if you talk to anyone actually trying to build this dream right now using pure text-to-code generation, they’ll tell you the reality is a bit of a nightmare. It’s incredibly slow and painfully fragile.

When an app tries to generate brand-new code, custom CSS, and data architecture from a completely blank canvas every single time you hit "enter," the user experience breaks. You're stuck staring at a loading spinner for 30 seconds, then end up with a clunky, half-broken layout that may or may not work on mobile.

It turns out that inventing software from scratch on the fly is a really great way to break UX.

Think in primitives, not pages

To be fair, the engineering world isn’t blind to this. Dev teams are already using tools like Vercel’s AI SDK to basically tell the AI, "Hey, don't write raw HTML from scratch; just pick from this list of hardcoded React components we already built, and then fill them (hydrate them) with data."

But the design world is lagging way behind. We’re still spending our days in Figma wireframing hundreds of static, beautifully polished, edge-case page templates meant for human coders, completely missing the fact that the primary user of our design systems is about to be an AI.

Because the actual future of Generative UI (and UI in general) is text-to-hydration, where, instead of an AI agent trying to invent a UI on a blank canvas, its primary job will be to instantly arrange, toggle, and pipe real-time data into a flexible, hyper-modular "kit of parts" that I like to call elastic primitives. (You might just call it a fancy design system.)
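
If you want to picture text-to-hydration in code, here is a minimal sketch. The agent returns a small structured spec, and the app looks up a pre-built primitive and fills it with data. The UiSpec shape, the renderers catalog, and the hydrate function are hypothetical, and the renderers return strings instead of real React elements so the example stays self-contained. This is not the Vercel AI SDK's API.

```typescript
type Primitive = "MetricCard" | "DataTable";
type Props = Record<string, unknown>;

// What the agent is allowed to emit: a component name plus data, never raw markup.
interface UiSpec {
  component: string;
  props: Props;
}

// The "kit of parts": primitives that designers and engineers built ahead of time.
const renderers: Record<Primitive, (props: Props) => string> = {
  MetricCard: (p) => `<MetricCard label="${String(p.label)}" value="${String(p.value)}" />`,
  DataTable: (p) => `<DataTable rows=${Array.isArray(p.rows) ? p.rows.length : 0} />`,
};

function isPrimitive(name: string): name is Primitive {
  return Object.prototype.hasOwnProperty.call(renderers, name);
}

// Hydrate a known primitive, or fall back instead of inventing UI.
function hydrate(spec: UiSpec): string {
  const { component, props } = spec;
  if (!isPrimitive(component)) {
    return `<Fallback reason="unknown component ${component}" />`;
  }
  return renderers[component](props);
}

console.log(hydrate({ component: "MetricCard", props: { label: "Revenue", value: "$1.2M" } }));
console.log(hydrate({ component: "HoloDeck", props: {} }));
```

The agent's creativity is limited to choosing and filling. Anything outside the catalog gets a fallback rather than freshly generated code.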

This completely upends our day-to-day workflow; we have to stop thinking about fixed 1200-pixel desktop grids and start designing the literal rules of elasticity. Your job shifts to building perfect components, like an isolated metric card, a data table, or an audio player, while defining the exact auto-layout constraints, responsive behaviors, and spatial guardrails so that the component looks stunning no matter how a chaotic AI decides to stack them together.

Write out your design “taste” so AI can read it

And AI really is a chaotic user of your design system.

When you hand off your designs to developers, they have a lot of unspoken shorthand and innate "taste." When you give a dev a design system, you don't have to write a 10-page essay telling them not to overlap a card component onto a header, or to avoid cramming five dense line graphs into a tiny sidebar. They just know better. (Well, usually.)

But when you hand that exact same component library to an AI agent? It has literally zero intuition. If you don't give it hyper-explicit rules, it will gladly grab your gorgeous, pixel-perfect primitives and stitch them together into a cluttered, unusable mess that violates every basic rule of visual hierarchy.

This means we have to overhaul how we document our design systems. We need to stop writing casual, fluffy prose meant for people and start translating our design philosophy into highly structured, machine-legible metadata. (Which AI can definitely help us do.)

In practice, you’ll be spending less time tweaking individual layouts and more time explicitly teaching the machine why, when, and how to deploy specific visual patterns. You’ll find yourself writing programmatic guardrails into your component properties, explicitly defining things like exactly how much visual compression a container can take before it must collapse, or dictating that a complex analytics chart should only appear if the user is comparing more than three specific data streams over time.

By baking your taste directly into the component's API schema, you ensure the AI plays by your rules. This keeps the user experience polished even when you aren't there to oversee it.
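
As a rough illustration of what machine-legible taste might look like, the sketch below encodes two of the guardrails mentioned above: a minimum width before a component must collapse, and a minimum number of data streams before a chart is allowed to appear. Every name and number here (PlacementRule, decide, the 480-pixel threshold) is a made-up example, not a standard schema.

```typescript
// Hypothetical guardrails attached to each primitive.
interface PlacementRule {
  component: string;
  minWidthPx: number; // below this width the component must collapse
  minDataStreams: number; // fewest data streams that justify showing it
}

interface PlacementRequest {
  component: string;
  availableWidthPx: number;
  dataStreams: number;
}

type Decision = "render" | "collapse" | "reject";

const rules: PlacementRule[] = [
  // "Only when comparing more than three data streams" becomes a minimum of 4.
  { component: "AnalyticsChart", minWidthPx: 480, minDataStreams: 4 },
  { component: "MetricCard", minWidthPx: 160, minDataStreams: 1 },
];

// Check the agent's layout request against the designer's rules.
function decide(request: PlacementRequest): Decision {
  const rule = rules.find((r) => r.component === request.component);
  if (!rule) return "reject"; // not part of the design system
  if (request.dataStreams < rule.minDataStreams) return "reject";
  if (request.availableWidthPx < rule.minWidthPx) return "collapse";
  return "render";
}

console.log(decide({ component: "AnalyticsChart", availableWidthPx: 800, dataStreams: 2 }));
console.log(decide({ component: "AnalyticsChart", availableWidthPx: 320, dataStreams: 5 }));
console.log(decide({ component: "AnalyticsChart", availableWidthPx: 800, dataStreams: 5 }));
```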

Globally cascading design

Stepping into this new world means we have to stop thinking in isolated frames or linear user flows and start thinking like frontend developers. We need to look at our products through the lens of global variables, dynamic states, and relational inheritance.

In the old days, if you changed a button style or a card layout, it just updated a static template or a few specific screens. In an agentic, Generative UI world, when you modify an elastic primitive, you are dynamically altering the structural DNA of what the AI can build across the entire application ecosystem instantly.

Every rule you tweak cascades globally. This changes how the machine responds to thousands of different user prompts simultaneously.

Thankfully, we aren't just shouting these rules into a void. Modern design tooling is making this level of deep, code-level control possible. Tools like Figma’s MCP server or Builder’s Figma-to-code features bridge the gap, turning design tokens and components into active, site-wide context, syncing your visual edits directly into production code.

It shifts design from a passive style guide to a living conversation with the codebase. By mastering this app-wide system loop, we make sure that the software of tomorrow doesn't start from an unpredictable, chaotic text box. Instead, it starts with predictable, beautifully synchronized templates that give both human users and AI agents a safe, gorgeous playground to interact inside.

It’s time to design in code

If all of this systemic, global synchronization logic sounds a little intimidating from the comfort of a Figma canvas, here’s the biggest secret: you don't have to build all of this in Figma. It’s time to step out of the static vector sandbox and start designing directly in code using today’s AI tools.

While Figma is still the undisputed king for sketching out an initial concept from scratch, trying to force it to mimic dynamic, live data and machine-legible constraints is fragile, to say the least. But luckily, the painful "designer-to-developer handoff" is basically dead these days, if you let it be, because AI allows you to bridge that gap yourself.

Designing in code forces you to practice coarse-grained design. Instead of obsessing over changing one isolated edge or pixel at a time, you're interacting with the entire system at scale.

When you work with an AI visual design tool like Builder, which is built to give designers a visual handle on real code primitives, or use other AI agents like Claude Code, you can spin up dynamic environments that act like a supercharged, living Storybook.

You can instantly see all your elastic primitives sitting side by side and watch how they dynamically resize, break, and interact with each other in real time as you flex the screen.

Basically, you get to see the actual finished product immediately. And from there, you can adjust its rules on the fly by talking to the AI, moving pieces around, and observing how your brand logic holds up in the wild.

The bet we’re making: Open-source good design taste

So, where can you put all this into practice? Hopefully, first, your own team’s codebases. These are conversations to start having with your team sooner rather than later.

But if you wanna flex your generative UI design muscles a little bit sooner, we’ve been heads-down building the Agent-Native Framework here at Builder. All open-source, all free, and very much wanting your contributions as a designer.

Basically, we got tired of seeing AI shoved into a clunky, disconnected sidebar chatbot that can't actually touch the app it lives in. We wanted to see highly polished, predictable visual primitives and UX that already works in SaaS married to powerful AI orchestration to supercharge daily work.

In an agent-native architecture, the human interface and the AI agent are completely unified under one single, shared database state and action model. Because every single component a human can click in the UI is bound to the exact same underlying tool call that the agent can execute, the AI never has to generate slow, fragile code from scratch; it just hydrates your beautifully designed, existing primitives in milliseconds.

While our team has spent months building out the heavy-duty backend pipes, state engines, and data coordination loops for cloneable templates like Mail, Content, Calendar, Analytics, and Slides, we’re the first to admit that the backend is only half the battle.

The ultimate success of an agentic world doesn't depend on how smart the LLM is; it depends on exceptional, human-centered product design. We’re officially inviting the design and product community to jump into our open-source repositories, clone these templates, ruthlessly critique our user flows, and help us contribute back to an ecosystem where AI makes software infinitely more adaptive without ever making it ugly.

Feel free to join our Discord, try out some templates, and read more about Agent-Native.

Let's stop designing static screens and start building the open, fluid future of the web together.

You Know Your AI Adoption Rate. Do You Know Your Governance Rate?

Thu, 21 May 2026 18:00:00 GMT

Ask any engineering leader for their AI adoption rate, and the answer comes back fast. Seat counts, license tiers, daily active usage, the percentage of devs running Cursor or Copilot, and the latest productivity scores from the analytics dashboard. The data is clean and ready for the next board update.

Now ask for their governance rate. How much AI-generated code is sitting in production right now, who reviewed it, whether it followed component standards, and what changed between the prompt and the merge. The answer usually trails off into something polite about good intentions and an evolving review process.

That asymmetry is the thing worth paying attention to. Adoption is easy to measure because it has clean numbers attached to it. Control is harder to measure because the infrastructure needed to produce those measurements doesn't yet exist in enterprise AI tooling. So adoption metrics get reported up the chain as evidence that the AI strategy is working, and the harder question gets pushed to next quarter.
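
There is no standard formula for a governance rate, so the sketch below is only one possible definition: the share of merged AI-generated changes that have both a recorded human review and a passing component-standards check. The MergedChange fields are hypothetical and assume you already record this metadata somewhere, which is exactly the infrastructure most teams are missing.

```typescript
// Hypothetical record for one change merged to production.
interface MergedChange {
  id: string;
  aiGenerated: boolean;
  humanReviewed: boolean;
  followsComponentStandards: boolean;
}

// One possible definition: governed AI changes divided by all AI changes.
// Returns null when there are no AI-generated changes to measure.
function governanceRate(changes: MergedChange[]): number | null {
  const aiChanges = changes.filter((c) => c.aiGenerated);
  if (aiChanges.length === 0) return null;
  const governed = aiChanges.filter((c) => c.humanReviewed && c.followsComponentStandards);
  return governed.length / aiChanges.length;
}

const sample: MergedChange[] = [
  { id: "a", aiGenerated: true, humanReviewed: true, followsComponentStandards: true },
  { id: "b", aiGenerated: true, humanReviewed: true, followsComponentStandards: false },
  { id: "c", aiGenerated: true, humanReviewed: false, followsComponentStandards: true },
  { id: "d", aiGenerated: true, humanReviewed: true, followsComponentStandards: true },
  { id: "e", aiGenerated: false, humanReviewed: true, followsComponentStandards: true },
];

console.log(governanceRate(sample));
```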

That deferral has a name. It's the governance gap, and it's getting wider as AI tooling expands beyond a few early-adopter developers into broader product teams. The cost of leaving it open isn't just risk in the traditional compliance sense, though that's part of it. The higher cost is that teams without governance infrastructure end up restricting AI use to protect themselves, thereby capping the productivity gains that justified buying the tools in the first place. It's the same dynamic behind the quality debt agent productivity creates when generation outpaces the systems that govern it.

The teams getting the most out of AI right now are the ones that built the infrastructure to trust what AI produces, which lets them run AI with less friction across more of the organization. The teams that skipped governance are quietly capping their own upside without realizing it, and it shows up in the backlog problem AI didn't solve at the org-wide delivery level.

Get our guide on the governance gap. It covers how the gap opened, the four questions every framework needs to answer, what distributed review looks like in practice, and the compliance dimension for regulated industries.

How the gap actually opened

The adoption story at most enterprises followed the same arc. A handful of engineers started using AI coding assistants in their local environments. Output quality was uneven early on, but the productivity gains were real enough that usage spread by word of mouth, mostly without formal approval. By the time IT or engineering leadership noticed, the tools were already embedded in how a meaningful chunk of the team worked.

Leadership response usually broke one of two ways. Some organizations endorsed retroactively, which meant buying enterprise licenses, adding the tools to the approved list, and calling the question resolved. Others restricted retroactively, banning unapproved tools and issuing a usage policy. Neither response touched the underlying question of what was actually being produced and merged into production.

The enterprise license response is the more common path, and it creates a sense of resolution that feels useful. The organization now has visibility into seat usage, a real contract with a vendor, and an AI tool on the official approved list. The harder question of whether the code these tools generate meets organizational standards, follows the architecture, uses approved components, and receives meaningful review before it ships remains open.

The restriction response fares no better in practice. Developers route around it, use tools on personal machines, and the code still gets merged with even less paper trail than the licensed alternative would have produced.

Both responses manage perception while the actual problem compounds beneath the surface.

Where it shows up

The governance gap isn't one problem. It surfaces in a few distinct ways across an enterprise, and most organizations are dealing with all of them at once without recognizing them as related.

Design system drift

The most visible symptom is design system drift. AI tools that generate code without access to your actual component library produce output that looks correct on the surface but quietly diverges from the system your design team maintains. Generic implementations replace approved components. Hard-coded values show up where design tokens should be. New variants are created when an existing one would have worked. The code passes review because it works, renders correctly, and passes linting, so nobody flags it.
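
One of those symptoms, hard-coded values where tokens should be, is cheap to check mechanically. The sketch below flags hex color literals in a chunk of styles. It is a deliberately naive heuristic: the regular expression and the findHardCodedColors name are illustrative, and a real check would also need to handle spacing, typography, and legitimate exceptions.

```typescript
// Matches 3- or 6-digit hex colors such as #ccc or #FF5733.
const HEX_COLOR = /#(?:[0-9a-fA-F]{3}){1,2}\b/g;

// Return every hex color literal found in the given source text.
function findHardCodedColors(source: string): string[] {
  return source.match(HEX_COLOR) ?? [];
}

const generatedStyles = `
  .card {
    color: #FF5733;
    background: var(--color-surface);
    border: 1px solid #ccc;
  }
`;

console.log(findHardCodedColors(generatedStyles));
```

Run as part of review, a check like this turns "nobody flags it" into a visible warning on the pull request.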

Each merged component that bypasses the design system sets a precedent. The reviewer who approved the first one set a bar, and the next engineer reviewing similar output implicitly has permission to merge it the same way. Over enough cycles, the design system becomes less authoritative, and nobody made an explicit decision to abandon it. The tooling simply stopped enforcing it.

Review processes built for a different era of authorship

The second place the gap shows up is in review processes that were built for a different era of authorship. Traditional code review assumes a human author with stakes in the outcome, institutional memory of why certain decisions were made, and accountability for the change. AI-generated code disrupts each of those assumptions. The author is an agent with no stake in the outcome; the context lives in a prompt that nobody else saw; and the scope can be large enough to generate a full-page layout or a set of API integrations in seconds.

The volume problem compounds the reasoning problem. Single-agent AI development is manageable with existing infrastructure because a single developer running a single session on a single branch lands in the review queue alongside everything else. Multi-agent parallel development breaks this entirely. When a team is running ten agents simultaneously, one per ticket and each on its own branch, the PR volume runs an order of magnitude higher than the review capacity. Engineering becomes the bottleneck because generation throughput outpaces review throughput, regardless of how quickly the reviewers work.
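
A back-of-the-envelope model shows how quickly that gap turns into a queue. The numbers are assumptions chosen only to illustrate the order-of-magnitude mismatch: each agent session opens 2 PRs a day, and the team can review 2 PRs a day.

```typescript
// Simulate the unreviewed PR backlog over a number of working days.
function reviewBacklog(prsOpenedPerDay: number, prsReviewedPerDay: number, days: number): number {
  let backlog = 0;
  for (let day = 0; day < days; day++) {
    // The backlog cannot go below zero: reviewers cannot review PRs that do not exist.
    backlog = Math.max(0, backlog + prsOpenedPerDay - prsReviewedPerDay);
  }
  return backlog;
}

const PRS_PER_AGENT_PER_DAY = 2; // assumption
const REVIEW_CAPACITY_PER_DAY = 2; // assumption

const singleAgent = reviewBacklog(1 * PRS_PER_AGENT_PER_DAY, REVIEW_CAPACITY_PER_DAY, 10);
const tenAgents = reviewBacklog(10 * PRS_PER_AGENT_PER_DAY, REVIEW_CAPACITY_PER_DAY, 10);

console.log({ singleAgent, tenAgents });
```

With one agent, the queue stays empty. With ten, the same reviewers are 180 PRs behind after two working weeks, and no amount of faster reviewing closes a gap that grows by 18 PRs a day.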

Expanded authorship, unchanged governance

The third place the gap opens up is the one most organizations haven't fully reckoned with yet. For the first few years of AI coding tool adoption, governance was primarily an engineering question because developers were the ones generating code. That framing is becoming less accurate every quarter. Product managers are building working prototypes in production codebases. Designers are submitting PRs from visual editors with AI handling the code translation. QA teams are generating fixes for the bugs they find. Marketing teams are publishing pages through systems that access the same component libraries engineers maintain. The whole code-as-canvas shift is real, and it's expanding the governance surface faster than most enterprises have planned for.

This expansion is broadly a good thing and turns AI development into the company-wide workflow change it was always supposed to be. It also creates a governance surface that's much larger and more varied than what enterprise security and engineering teams originally designed for. The developers using Cursor went through onboarding, know the codebase, and understand when to follow conventions and when to escalate. The PM who generated a prototype in a production branch last week may not know that the component they used has a deprecated variant or that the API they called has a rate limit nobody documented.

The cost most organizations underestimate

When organizations do address the governance gap, they usually frame it as a risk problem. That framing captures part of the picture. The downside costs are real: security vulnerabilities that survive review because the reviewer assumed AI-generated code had been checked, design system fragmentation that makes future UI work harder, technical debt that accumulates in AI-generated output that nobody owns, and compliance exposure in regulated industries where AI-generated code may not meet documentation requirements for production systems.

These costs matter, and for most organizations, they haven't yet materialized catastrophically, which is part of why the gap persists. The debt is accumulating quietly.

The opportunity cost is less visible and probably larger. Teams that don't trust AI-generated output restrict its use, require extra review cycles, and limit which roles can generate code and what it can touch. These are rational responses to an absence of control infrastructure, and they cap the productivity gains that motivated AI adoption in the first place. The organizations getting the largest gains from enterprise AI development are running the playbook in reverse: build the governance infrastructure first so AI can run with less friction across more of the team and more of the codebase.

What closing the gap actually requires

Closing the governance gap is infrastructure work, not policy work. Documenting that designers should review AI-generated UI before it merges is a policy. Building a workflow that requires designer review before a PR can even be opened is governance. The full picture across context, review, traceability, and volume is what we walked through in the new guide.

A few things worth previewing:

Design system enforcement can't happen at the review stage if the AI generating the code never had access to the current design system in the first place. The governance work starts upstream, with accurate context as an input to generation.
Single-queue PR review breaks down as a governance mechanism when you run multiple agents in parallel. The teams scaling AI development well are distributing reviews across roles: designers validating visual output, QA validating correctness, and product validating requirements before a PR reaches engineering. Engineers receive work that has already passed domain-specific checks, allowing the engineering review to focus on code quality rather than functional correctness from scratch.
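One way to picture role-distributed review is as a gate that checks domain sign-offs before a PR reaches engineering. The sketch below is only an illustration: the change categories, role names, and the `REQUIRED_SIGNOFFS` mapping are all assumptions, not part of any particular tool.

```python
from dataclasses import dataclass, field

# Hypothetical mapping from the kind of change to the roles that must sign off
# before engineering sees the PR. Categories and role names are assumptions.
REQUIRED_SIGNOFFS = {
    "ui": {"design", "qa"},
    "behavior": {"qa", "product"},
    "copy": {"product"},
}


@dataclass
class PullRequest:
    title: str
    change_kinds: set[str]
    signoffs: set[str] = field(default_factory=set)


def missing_signoffs(pr: PullRequest) -> set[str]:
    """Return the roles that still need to review this PR."""
    required: set[str] = set()
    for kind in pr.change_kinds:
        required |= REQUIRED_SIGNOFFS.get(kind, set())
    return required - pr.signoffs


def ready_for_engineering(pr: PullRequest) -> bool:
    """A PR reaches the engineering queue only once every domain check has passed."""
    return not missing_signoffs(pr)


pr = PullRequest("Update pricing card", {"ui", "copy"}, signoffs={"design"})
print(sorted(missing_signoffs(pr)))  # ['product', 'qa']
print(ready_for_engineering(pr))     # False
```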

For organizations in regulated industries, the audit trail problem is sharper than most legal and compliance teams have recognized. Current AI tooling does not produce the generation records that may eventually be required by change control. The organizations not thinking about this now will be retrofitting it later under worse conditions.

Close the governance gap with Builder

Builder is the AI product development platform built for teams that need to govern what AI produces. It connects to your real codebase and design system, enforces standards before generation happens, and gives every role on your team the access they need to review and contribute without creating a new governance gap in the process.

Design system context is a first-class input to every generation. Review workflows are multi-role by default. Every agent runs in an isolated environment with a shareable preview, and agent work stays visible at the team level, so the people accountable for what ships can see what's in flight.

Get the guide on the governance gap.

Schedule a demo or connect with a Builder expert.

Read the full post on the Builder.io blog

I Didn't Become a Developer to Review AI Slop

Thu, 21 May 2026 18:00:00 GMT

But lately, that's exactly what the job feels like.

My PR queue fills with work that, yes, technically compiles. The summary sounds plausible. It might even have some tests. Then when I open the diff, the real work starts.

What was the change supposed to do? Did anyone actually run the flow? Why is this helper duplicated six times? Is this actually fixing a bug, or did the AI just run around in circles and call it done?

AI made it effortless for anyone on my team (and yours) to create code, but it didn't make that code trustworthy.

Stack Overflow's 2025 Developer Survey found the most common frustration with AI tools is output that's "almost right, but not quite." Sonar's 2026 State of Code report found that 96% of developers don't fully trust AI-generated code, and 38% say reviewing it takes more effort than reviewing human-written code.

That's because AI code looks fine on the surface, so you have to really dig in to see whether it's actually doing the job well. Straight-up bad code is much easier to reject.

I'm annoyed. Maybe you are, too. Let's dig into this and solve it together.

A PR is... almost too cheap now

AI agents can spin up branches from Jira tickets, patches from Slack threads, or even full PRs from a bug report before anyone even agrees that the bug is real. It's honestly a pretty awesome world.

But the thing is, developers aren't the only ones using these tools. PMs will prototype the feature they've been trying to explain for three sprints, mostly with vague, unhelpful hand waves. Designers will tweak UX flows and fix layouts that keep getting deprioritized. Marketers will update landing pages and forms. (Constantly.) Support will patch the customer pain points they know best.

And all that is a win. Small fixes shouldn't sit in backlog hell waiting for an engineer who happens to know that part of the code. Product knowledge should be turning into working software faster.

But the easier it gets to open a PR, the more PRs developers are obligated to review. And PRs aren't valuable just because they exist. They're only valuable when they can be trusted.

And trust is still really expensive

AI is really good at writing code. For a recent hackathon, I had GPT 5.5 spin up 10,000 lines of working code in about 45 minutes. The app mostly worked. Sure, the UI was a nightmare, but the core functionality was there.

But writing code and writing trustworthy, scalable code are two different things. A model can generate a diff, explain it, and even run some happy-path tests. But someone's still accountable for the stuff that actually matters:

Did this code actually fix the stated problem?
Did the author really understand the system, or is this creating tech debt for later?
Is the diff bigger than it needs to be? (Almost definitely.)
Does this fix silently break some other flow in the code, in a way that would be obvious if a single user just tried it out?
Does the UI actually work for real users in real browsers?
Will this fix survive past a demo?
Is this actually a fix to the root problem, or just a bandaid?
Is this security tradeoff acceptable?

These aren't syntax questions. They're trust questions. And right now, they all land on you and me, the developers. @richiemcilroy put it well in a viral tweeted video the other day.

The numbers tell the same story. LinearB's 2026 benchmarks found AI PRs sit waiting 4.6x longer for review and get rejected way more than human-written ones. METR's study of experienced open-source developers found early-2025 AI tools actually made devs 19% slower, partly because real work includes style, tests, docs, and review—not just typing.

That's not saying AI is useless. And the tools really do keep getting better every day. But the real work of software was never just typing code into files. It's knowing what should change, what shouldn't, and when a surface-level patch that technically fixes the problem is actually going to haunt your team for the next six months.

That's where your attention needs to go. You should be weighing the stuff that needs taste and context, not manually rediscovering the basics after the PR is already in your queue.

Developing feels bad right now

Even though AI tools are making everyone more productive, being the bottleneck feels terrible. Everyone else gets to accomplish more than they've ever done before, because suddenly code is open to them.

You as a developer just experience the hype as incoming review debt. You aren't building. You're reviewing. You aren't designing the system; you're policing its edges. You aren't solving the hard problem directly; you're reverse-engineering what an agent or teammate was trying to do, then betting your afternoon on whether the diff is safe to keep.

The AI gets to do the fun part. You get to be a robot.

That doesn't mean you're useless. If anything, your judgment matters way more now.

But the workflow is spending your judgment terribly. It's taking the scarcest resource in the system—experienced engineering attention—and aiming it at mystery diffs, bloated patches, missing context, and generated code that only looks correct.

So yeah, it's boring. Yeah, it's frustrating. When someone says "now everyone can ship code," what you and I hear is "now everyone can create work for us."

Thus, the burnout.

Locking down the repo solves the wrong problem

So, what do we do? Well, the obvious reaction would be to lock up the repo. Devs only.

And I get that. You're the one who gets paged at 2am when prod goes down. Being protective of the code isn't elitism. You just have a memory.

But limiting access solves the wrong problem.

Cross-functional PRs aren't automatically bad. In fact, in many ways, they're exactly what we've wanted for years: product knowledge turning into small fixes without waiting on an engineer's calendar.

But the problem is that, even though everyone can now open PRs, PR intake itself hasn't evolved. Teams still treat a PR like a dev-to-dev handoff: here's the diff, here's the description, good luck. That worked great when the author was another engineer with the same local context, the same testing habits, and the same gut sense of what reviewers needed.

But that assumption falls apart now. Not because non-devs are careless. In fact, designers, PMs, marketers, support teams—they all have the best user context since they're closer to the problem. But they probably don't know what you need as a dev to evaluate risk. And when AI generated the actual implementation, even the person opening the PR might not know the full scope of what changed.

Mystery diffs aren't a reasonable way to collaborate. So, how do you change the way you work with PRs?

Raise the bar for evidence on PRs

No dev should open a generated or cross-functional PR and have to reverse-engineer it from scratch. Every PR needs to show up with receipts:

Clear intent.
A small, scoped diff.
A summary of meaningful changes.
Relevant tests and results.
Browser-based QA on the affected flow.
Screenshots, replay, or other behavioral proof.
Console and network logs when something is failing.
Known risks, skipped cases, and open questions.
A path to fix issues on the same branch.
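If you wanted to check that list automatically, a minimal sketch might look like the following. Everything here is hypothetical: the `PrEvidence` fields, the `MAX_CHANGED_LINES` threshold, and the gap labels are assumptions made for illustration, and a real check would read this data from your PR tooling.

```python
from dataclasses import dataclass, field

# Hypothetical cap on diff size; pick whatever "small and scoped" means for your team.
MAX_CHANGED_LINES = 400


@dataclass
class PrEvidence:
    intent: str = ""
    summary: str = ""
    changed_lines: int = 0
    test_results: list[str] = field(default_factory=list)
    qa_artifacts: list[str] = field(default_factory=list)  # screenshots, replays, logs
    known_risks: list[str] = field(default_factory=list)   # "none known" is a valid entry


def evidence_gaps(evidence: PrEvidence) -> list[str]:
    """Return the checklist items this PR has not yet provided."""
    gaps = []
    if not evidence.intent.strip():
        gaps.append("clear intent")
    if evidence.changed_lines > MAX_CHANGED_LINES:
        gaps.append("small, scoped diff")
    if not evidence.summary.strip():
        gaps.append("summary of meaningful changes")
    if not evidence.test_results:
        gaps.append("tests and results")
    if not evidence.qa_artifacts:
        gaps.append("browser QA evidence")
    if not evidence.known_risks:
        gaps.append("known risks and open questions")
    return gaps


pr = PrEvidence(
    intent="Fix the broken signup button on mobile",
    summary="Adjusts one click handler",
    changed_lines=12,
    test_results=["signup.spec: passed"],
)
print(evidence_gaps(pr))  # ['browser QA evidence', 'known risks and open questions']
```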

But that’s the problem. We say we want PMs, designers, marketers, and support to directly contribute, but then we expect them to act like senior engineers before we'll even review it.

A PM shouldn't need to know how to scope a tight diff. A designer shouldn't read network traces. A marketer shouldn't be QA. Support shouldn't write a perfect test plan just to propose a fix.

The entry bar needs to stay low. The review bar needs to go up.

Those aren't in conflict if there's an interpretation layer to bridge the gap. We already have amazing AI, so why aren't we using it, per PR, to review the quality and interpret intent before engineers waste their time?

The contributor can bring the product context: what hurts, why it matters, what good looks like. And they can be the ones who work with an AI agent to send the PR in the first place. Then, a review toolchain should translate that implementation into something a dev can trust.

The toolchain should keep diffs scoped, summarize real changes, run checks, open the product in a browser, click the flow, capture screenshots, surface console errors, and flag what it didn't test. It should let the contributor fix issues on the same branch without turning them into a release engineer.

And it should spare the developer from being the first person to discover the button doesn't work.

So, what does review automation look like in practice?

Everyone's starting to wake up to this problem. And PR review automation seems to be the best answer. That said, I've found that a lot of the existing PR review tools are pretty surface-level in what they do, mostly just acting as another AI agent to see if the code makes sense in context.

What you actually want is an agent that runs the code in the browser and tests real edge cases to spot failure modes. You can definitely piece it together yourself with enough CI glue. Or, you can get it off the shelf.

That's the point of our (Builder's) Quality Review Agent. It opens your app in real browsers, walks the affected flow, and returns evidence of what it clicked, what happened, and what failed, complete with replay links, console errors, network traces, and specific findings tied back to the change.

So now, instead of reviewing hundreds of PRs that start as mystery diffs, you get a product-specific review packet:

The affected flow, replay, and screenshots.
The console and network signals from the run.
The specific failures tied back to the change.
The risks, skipped cases, and remaining judgment calls.

After all, the goal isn't to remove developers from review. We still need to be there to raise the quality of the code in ways only we know how. But the goal is to stop burning developer attention on prep work that machines can handle without complaint.

Build the trust layer

Look. AI made it dead simple for anyone to ship code. What it didn't do was magically make that code trustworthy. And that means devs are feeling the burden the most right now, having to review all that slop.

Locking everyone out of the repo isn't the answer. We just need every PR to show up with enough context and proof that we can actually use our brains for judgment instead of wasting afternoons playing detective.

With today's agentic tools, that's a trust layer you can either try to assemble yourself, or you can get it from another company. Our take on it is the Builder QR Agent.

Regardless, it might be best to prioritize that pain before you turn into your company's human merge queue.

Read the full post on the Builder.io blog

AI Agent vs Chatbot: Key Differences and Examples

Tue, 19 May 2026 18:00:00 GMT

What is the difference between an AI agent and a chatbot? A chatbot responds to prompts. An AI agent can pursue a goal, choose steps, use tools, and complete work.

That short answer is useful, but it misses the thing you usually care about when you are evaluating software: can the AI actually do the job, or can it only talk about the job?

In most software, the progression looks like this: a chatbot answers questions, and an AI agent can operate parts of the workflow.

This article starts with the practical AI agent vs chatbot comparison. Then it shows what the next level of AI agents looks like: agent-native architecture, where the product is built so humans and agents can operate the same underlying capabilities.

What is the difference between an AI agent and a chatbot?

A chatbot is a conversational system. It receives a message and returns a response. It may answer questions, summarize information, draft text, retrieve documentation, or guide a user through a simple support flow.

An AI agent is a goal-directed system. It can break a request into steps, inspect context, use tools, take actions, and adapt based on the result. In a product context, the difference is whether the AI can move the workflow forward instead of only talking about it.

For example, in an email product, a chatbot can summarize a thread or draft a reply. An AI agent can find the relevant messages, classify them, apply labels, archive low-value items, draft responses, and ask for approval before sending.

In analytics software, a chatbot can explain what a chart means. An agent can change the query, apply filters, generate a new chart, save the dashboard, and share it with the right team.

That is why "chatbots respond, agents act" is directionally right but incomplete. A weak agent can still be little more than a chatbot with tool calls. A strong agent needs meaningful product actions, relevant state, and safety rules.

So if the difference is that clear, why do so many SaaS products call something an agent when it still behaves like a chatbot?

Why many SaaS AI agents still feel like chatbots

Many SaaS AI agents still feel like chatbots because they are added beside an existing application rather than designed into the workflow. The product was built for humans clicking through screens, then an AI sidebar was added later.

That pattern creates a ceiling. The AI can answer questions about what is visible. It can summarize a page, draft a message, explain a setting, or suggest a next step. But when the user asks it to complete the actual workflow, it often stalls.

The limitation is usually architectural, not just model quality. The common reasons are easy to spot:

Action access: The AI has a small set of helper tools, while the real product actions are buried in UI-specific code, internal endpoints, admin-only flows, or services that were never designed for delegated execution.
State: The AI may know the conversation, but not the object the user is working on, the selected record, the active workflow step, or the changes that already happened elsewhere.
Safety: The product relies too much on prompts. Real agents need product-level constraints: permissions, previews, approval gates, audit logs, rollback paths, and policy checks.
Workflow continuity: The AI can handle one request, but it cannot keep working across a multi-step workflow until the goal is complete or blocked.
Drift: The UI can do one set of things. The public API exposes another. The AI tool layer exposes a third. Each surface works until the product changes, then they drift apart.
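The drift problem in particular is easy to see once you write down what each surface can do. Here is a minimal sketch that compares them; the surface names and action names are invented for the example.

```python
def surface_drift(surfaces: dict[str, set[str]]) -> dict[str, set[str]]:
    """For each surface, return the actions available elsewhere but missing here."""
    all_actions: set[str] = set().union(*surfaces.values())
    return {name: all_actions - actions for name, actions in surfaces.items()}


# Hypothetical capability lists for one product.
surfaces = {
    "ui": {"archive", "label", "send", "refund"},
    "public_api": {"archive", "label", "send"},
    "agent_tools": {"label"},
}

for name, missing in surface_drift(surfaces).items():
    print(name, sorted(missing))
# ui []
# public_api ['refund']
# agent_tools ['archive', 'refund', 'send']
```

In this made-up product, the agent can reach one action out of four. That gap, not the model, is why it feels like a chatbot.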

This is why a product can market an "AI agent" while users experience it as a chatbot. The name changed, but the software did not give the AI enough product capability to act.

Chatbots still have a clear role. They are good for support triage, documentation lookup, onboarding, lightweight drafting, and conversational discovery. They are especially useful when the correct output is information rather than a durable product change.

The problem is category confusion. If a user asks, "What does this setting do?", a chatbot is enough. If a user asks, "Update these accounts, notify the owners, and prepare the renewal plan," they expect an agent.

There is one more term that often gets mixed into this conversation: copilot. It is worth separating because a copilot can be much more useful than a chatbot without becoming a full agent.

AI Copilot vs. AI Agent: Where copilots fit

An AI copilot assists a person inside a workflow. An AI agent can execute a workflow, or prepare it for approval, through tools and product actions.

You can think of a copilot as the middle stage between chatbot and agent. A chatbot is mostly conversational. A copilot is contextual and assistive. An agent is operational. A writing copilot may suggest edits. A coding copilot may autocomplete a function. A sales copilot may draft a follow-up email. These are useful because they keep you directly in the loop.

An agent goes further. It can research the topic, change the code, update the CRM, run checks, prepare the next step, and ask for approval when judgment is needed. The point is not that agents are always better. The point is that some jobs need chat, some need assistance, and some need delegation.

Once you start building for delegation, the hard question changes. It is no longer just "Is this a chatbot, copilot, or agent?" It is "Is the product actually built for an agent to operate it?"

The next level for AI agents: Agent-native architecture

Agent-native architecture is the next level because it stops treating the agent as a feature bolted onto the side of the product.

An agent-native application is built so humans and AI agents can operate the same product through shared actions, data, permissions, and context. You may use screens, buttons, forms, and review flows. The agent may use natural language and tool calls. But both paths work inside the same application model.

That is a different bar from adding a chatbot or exposing a few tools. If the user can archive, approve, publish, reschedule, assign, merge, refund, or invite, the agent should be able to reach the same operation with the same permissions and safeguards.

That usually comes down to five architectural principles:

Agent UI parity: Anything meaningful the UI can do, the agent can also do through the same product capability. The agent is not screen-scraping buttons or using a weaker side-channel.
One shared action model: The UI, agent, API, and automation layer call the same actions instead of four slightly different implementations that drift over time.
Shared state, data, and context: If you are looking at a customer, thread, chart, or task, the agent can understand that context and change the same underlying state.
Protocol-ready by design: The app is built so agents and other tools can reach its capabilities through standard interfaces, not just one custom chat integration.
Governed execution: The agent acts inside the product's permission, approval, logging, and review model. It can be powerful without bypassing the rules that make the product trustworthy.
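As a rough illustration of a shared action model with governed execution, the sketch below routes both a human and an agent through the same registry, the same permission check, and the same audit log. The class names, the `threads:write` permission string, and the `archive_thread` action are all hypothetical; this is a sketch of the idea, not how any specific product implements it.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Actor:
    name: str
    kind: str  # "human" or "agent"
    permissions: frozenset[str]


class ActionRegistry:
    """One place where product actions live, whoever calls them."""

    def __init__(self) -> None:
        self._actions: dict[str, tuple[str, Callable[..., object]]] = {}
        self.audit_log: list[dict[str, object]] = []

    def register(self, name: str, permission: str, handler: Callable[..., object]) -> None:
        self._actions[name] = (permission, handler)

    def run(self, actor: Actor, name: str, **kwargs: object) -> object:
        permission, handler = self._actions[name]
        allowed = permission in actor.permissions
        # Every attempt is logged, whether it came from the UI or from an agent.
        self.audit_log.append(
            {"actor": actor.name, "kind": actor.kind, "action": name, "allowed": allowed}
        )
        if not allowed:
            raise PermissionError(f"{actor.name} may not run {name}")
        return handler(**kwargs)


archived: set[str] = set()


def archive_thread(thread_id: str) -> str:
    archived.add(thread_id)
    return f"archived {thread_id}"


registry = ActionRegistry()
registry.register("archive_thread", "threads:write", archive_thread)

human = Actor("dana", "human", frozenset({"threads:write"}))
agent = Actor("inbox-agent", "agent", frozenset({"threads:write"}))

# The UI button and the agent tool call end up in the same code path.
registry.run(human, "archive_thread", thread_id="t1")
registry.run(agent, "archive_thread", thread_id="t2")
print(sorted(archived))  # ['t1', 't2']
```

Because there is only one implementation of each action, there is nothing for the UI, the API, and the agent tool layer to drift away from.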

That is why agent-native is the next step after basic AI agents. A normal agent can use tools. An agent-native app gives the agent a real product to operate, while still giving humans a real product to use.

Bad AI apps put the agent in a sidebar. Good AI apps make the sidebar the agent.

Over time, applications an agent can fully operate will have a clear advantage over applications where AI can only talk about the work.

How to get started with agent-native architecture

You do not need to rebuild the whole product at once. Start with one workflow where delegation would obviously help, then make that workflow agent-native from end to end.

Pick a workflow users already repeat. Good candidates are triage, scheduling, reporting, routing, drafting, approvals, or cleanup work.
List the product actions inside that workflow. In email, that might be search, label, archive, draft, and send. In analytics, it might be query, filter, visualize, save, and share.
Turn those actions into shared capabilities. The UI should call them. The agent should call them. Any API or automation surface should call them too. This keeps behavior consistent and reduces drift.
Give the agent the state it needs. If the user is looking at a record, thread, dashboard, or task, the agent should not need the user to restate everything the app already knows.
Build the review path into the product. Some actions can run automatically. Some should show a preview. Some should require approval. The important part is that the product owns those rules, not just the prompt.
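For the last step, "the product owns those rules" can be as simple as a policy table that lives in application code instead of in the prompt. The following is a minimal sketch using the email actions mentioned above; the mode names and the specific assignments are assumptions for illustration.

```python
from enum import Enum


class ReviewMode(Enum):
    AUTO = "auto"          # run without asking
    PREVIEW = "preview"    # show the user what would change first
    APPROVAL = "approval"  # wait for an explicit yes


# Hypothetical policy for an email workflow. The assignments are assumptions.
POLICY = {
    "search": ReviewMode.AUTO,
    "label": ReviewMode.AUTO,
    "archive": ReviewMode.AUTO,
    "draft": ReviewMode.PREVIEW,
    "send": ReviewMode.APPROVAL,
}


def next_step(action: str, approved: bool = False) -> str:
    """Decide what the product does when a human or agent requests an action."""
    # Unknown actions fall back to the strictest mode.
    mode = POLICY.get(action, ReviewMode.APPROVAL)
    if mode is ReviewMode.AUTO:
        return "execute"
    if mode is ReviewMode.PREVIEW:
        return "show_preview"
    return "execute" if approved else "request_approval"


print(next_step("label"))                # execute
print(next_step("draft"))                # show_preview
print(next_step("send"))                 # request_approval
print(next_step("send", approved=True))  # execute
```

Defaulting unknown actions to approval is a deliberate choice in this sketch: a newly added action stays gated until someone decides otherwise.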

This is the shift from "AI feature" to agent-native application. Instead of adding a smarter text box, you are designing the product so the human and the agent can work through the same system.

Start with open source agent-native templates

You can build those primitives yourself: shared actions, shared state, shared permissions, review flows, and agent tools. Or you can start from an open source template where the architecture is already in place.

That is what agent-native templates give you. The templates are cloneable agent-native apps with a real human interface, real actions for agents, and one application model underneath both.

If you want to build an agent-native app today, you do not have to start from a blank text box or bolt AI onto an old workflow. You can clone a template, inspect how the UI and agent share capabilities, and adapt it to the workflow you care about.

Build for agents, not just conversations

The AI agent vs chatbot distinction starts with behavior. Chatbots respond. Copilots assist. Agents act.

But for software teams, the distinction eventually becomes architectural. If the AI cannot reach the product's real actions, state, permissions, and approval paths, it will keep behaving like a chatbot no matter what the feature is called.

Agent-native architecture is the next level. It gives the agent a real product to operate and gives humans a real interface for supervision, review, and collaboration.

Read the full post on the Builder.io blog

Code is the Canvas: Bring the Whole Team to It

Wed, 13 May 2026 18:00:00 GMT

The cost of writing code dropped while the cost of handoffs stayed the same. See how teams are closing the gap by bringing every role into the code.

AI is making the cost of generating code trend toward zero. Features that used to take a sprint can happen in an afternoon, and bugs and updates that sat in the backlog for months cost cents to fix. The economics of writing software have shifted dramatically over the last couple of years.

The way most teams work has not changed with them. Look at how a typical product organization is structured: weeks of planning to decide what's worth building, a sprint to build it, another to test it, another to ship it. That whole rhythm was designed around the assumption that code is expensive, so you spend most of your time deciding what's worth coding before you actually code it. The assumption stopped being true a while ago, and the rhythm built around it is still running.

You can see the results by taking an inventory of your backlog. There are features customers have asked for a dozen times, bugs everyone knows about, and polish work that never makes the sprint. The gap between what your team should be shipping and what it actually ships keeps widening, even as coding speeds up.

The workflow is older than most of the tools your team uses

The standard software development workflow is over 25 years old, and it predates AI by decades. You know the shape of it: idea, spec, design, prototype, code, review, ship. Each step hands off to the next, and each handoff is a translation between tools and between teams. The translation is where the work loses its shape, and it is also where most of the time goes.

The translation problem

Spec gets translated into design, design into prototype, prototype into code, and every translation along the way loses some of the original intent. The friction is so familiar that the language teams use to describe it has become a script. You have probably said one of these in a review:

"That's not what I meant."

"That's not how it was designed."

"That doesn't work in our application."

So the team loops back. They update the spec, rebuild the prototype, rework the code, and start over. Week after week.

What most teams have done with AI is layer it on top of this same workflow. Designers reach for Figma Make, PMs spin up something in v0, and engineers run Cursor or Claude Code. Each function gets faster in isolation, and the choreography between them stays exactly as it was. The translation problem persists, with agents now handling some steps in between, and the handoffs remain where things go wrong.

Why MCP isn't the answer

A technical objection comes up here, especially from architects: doesn't MCP solve this? Connect the tools, share the context, and you're done. The answer is that MCP accelerates the handoff itself, which is helpful, and the translation problem is something different.

A developer, with their own context and intent, is still trying to verify whether what was built matches what was originally intended, and they're doing that work in a separate environment from where the original work was done. Stitching better connectors between isolated tools has limits because the tools remain isolated.

What changes when everyone starts in the same place

Imagine a workflow where the team starts together. A PM has an idea and tells an agent what the experience should be, and the agent builds it using your real codebase, components, design system, and coding standards. A designer opens the same project and refines it directly. When an engineer steps in, they pick up a project that already carries context from every role that has touched it. The work has been moving forward in one place, in code, the whole time.

Code is the canvas. When the whole team builds on it together, the product meant to be built is the one that gets built.

The shift sounds abstract until you look at what it changes for each role. The work each person does, and the artifacts they hand off, look meaningfully different when code becomes the starting point for the whole team.

Builder gives your team a workflow built around that idea

Code as the canvas is the principle. Building it into a workflow your team can actually run is a different problem, and the rest of this post walks through how Builder approaches each piece, from where work starts, to how it moves through the team, to how it gets validated and shipped.

Ideas start where they live

The work begins wherever the idea lives, whether that's a Jira ticket, a Slack thread, a customer support escalation, or a problem someone spotted on the live site. The Builder agent picks it up, takes a first pass, and because it is connected to your codebase and design system, what it builds uses your real components, patterns, and tech stack.

That matters because most ideas die in the gap between where they show up and where the work happens. By the time a Slack thread becomes a ticket, becomes a sprint item, becomes a design, the original spark is buried. Builder connects to the tools your team already works in, so that gap closes:

A marketer who notices something off on the live site clicks the Chrome extension and starts a fix on the spot.
A designer who gets a request in Slack tags Builder in the thread, and it reads the context and builds a first pass.
A PM who files a ticket in Jira assigns it to Builder, and it picks up the work directly.

First drafts that are actually usable

The real measure of an agent is how far along the work is when your team picks it up. The closer the first draft is to something your team can actually ship from, the more time the agent has saved you. That depends on whether the draft is built from your real components and patterns, because that's what determines whether the team can refine it or has to rebuild it.

Builder builds with the components your team already uses, in the patterns they already follow, and it pulls in the context that shapes the work: Figma designs, PRDs, product specs. The agent knows your code, what you are trying to build, and why, so the team spends its time refining the concept while the agent handles the mechanical work of putting it together.

One branch, one team

From there, the team takes over. Product, design, and engineering iterate on the same branch, each in their preferred environment.

The most expensive part of building software these days is the feedback loop around the code, where a designer reviews a screenshot and files a comment, a PM reads a spec and flags a misunderstanding, a developer gets a PR and rewrites half of it, and each round costs the team days.

Builder collapses that loop by giving every collaborator a link to the live implementation. A designer adjusts spacing in the style tab, a PM tests a user flow and has the agent resolve a logic gap, and an engineer reviews the diff in the code tab and makes changes directly. Everyone works on the same thing, at the same time, in the way they prefer, and feedback happens on what is actually going to ship.

Developers stay in their IDE

Collaboration has to work for every role, and developers are the role most likely to push back on a new tool. They live in their IDE for good reason, and asking them to leave it for review or refinement work is how adoption stalls. Builder is built so they can stay there.

The branch the team has been working on syncs directly into Cursor or VS Code. A developer pulls it, reviews the diff, makes changes, and pushes them back to the same branch the team is using. Builder's MCP connects the platform to Claude Code and Cursor, so whatever the team is building in Builder (a prototype, a page, a component), the developer can pull it into their IDE and work with it as code. When they push changes back, the team picks up where they left off. The two environments remain linked, and no one has to switch contexts to participate.

Real validation, before anyone ships

Code-as-a-shared-canvas changes how the team builds together. It also changes what's possible at the validation step, which is the part of the process that tends to get the least attention and produces the most expensive mistakes when it goes wrong.

Why teams skip it

The step most teams skip is the one that matters most: getting real feedback before shipping. Teams skip it because every way to get it adds friction. Screenshots get marked up with notes that no one can act on. Prototypes built outside the codebase lack the fidelity to surface real issues. So validation gets dropped, and the team hopes for the best.

Validation by preview link

Builder changes the cost of validation. You send a preview URL: no account required, no staging environment to spin up, just a link. What the recipient sees is the real implementation, built from your design system, and they can interact with it, leave feedback, and have the team resolve it on the spot. Put it in front of a customer, and the feedback you get is specific: what broke, what confused them, what they would change. When validation is this easy, teams stop skipping it.

Speed without sacrificing the quality bar

The faster teams move, the more leadership worries about what slips through. New tools and new contributors generate AI code at volumes nobody is quite sure how to police, and the question is always the same: how do we know this is safe to ship? The answer cannot be "trust the team" alone. It has to be built into the workflow.

Approvals and review agents

Builder's approval workflows require the right people to sign off before a PR is submitted. The QA Agent validates the implementation in a real browser, writes test cases, and posts a video walkthrough. The Code Review Agent checks every PR and flags issues by severity. Your team has already defined what good code looks like through linting rules, formatting standards, test suites, and accessibility checks, and Builder follows all of them. The code that comes out is held to the same quality bar your team set before Builder was introduced.

Trust and compliance

On the trust side, Builder is SOC 2 Type 2 compliant, we don't train on your data, and you own your inputs and outputs. We work with Fortune 500 companies that hold us to their standards.

Where teams usually start

Change like this doesn't happen overnight, and it doesn't have to. Teams usually find their way in through one of three starting points: prototyping, internal tools, or software development. Where you begin depends on what your team needs most, and each of these is a low-risk way to test the workflow on real work without committing the whole organization on day one.

Most teams pick one of these and grow from there. Prototyping is a common entry point because every team prototypes anyway, and getting prototypes built on the real codebase means the feedback you collect is feedback on what will actually ship. Internal tools work well because the team gets value immediately, with low risk to production systems. Software development is where the long-term payoff lives, and once the team is comfortable with the workflow, this is where most of the throughput gains compound.

Closing the gap

Code is cheap to write now, and the teams that have adjusted to that, the ones that treat code as the place where the whole team works together from the start, are the ones closing the gap between what they want to ship and what they actually ship.

Connect your repo and run a prototype, an internal tool, or a backlog item through the workflow yourself. Try Builder for free.

Read the full post on the Builder.io blog

Agent-Native: The Next Architecture for Software

Fri, 08 May 2026 18:00:00 GMT

Most software today gives you one of two compromises: a polished interface an agent cannot fully use, or a powerful agent with no real interface for humans. Agent-native architecture removes that trade-off.

Agent-native applications are software built so humans and AI agents can operate the same product through shared actions, data, permissions, and context. You may use a visual interface, while the agent may use natural language and tool calls, but both paths work inside the same application model.

This is the architectural line between an AI feature and an agent-native product. The agent is not bolted onto the app after the fact. It is part of how the app is built.

Why SaaS and raw agents solve different halves
AI-enabled to AI-native to agent-native
What makes an application agent-native?
What agent-native apps need as they grow
Why agent-native apps should be cloneable
Where agent-native fits in the software stack
What agent-native looks like in practice
How to get started with agent-native

The problem: SaaS and raw agents solve different halves

SaaS gave developers and teams a clean bargain: stop maintaining software, rent a polished product, and accept whatever shape the vendor gives you. That bargain worked for a long time, especially when software mostly needed to give people a workflow, a database, and a UI.

AI agents changed the bargain.

The problem is not that SaaS products lack AI features. Almost every software company is adding them. The problem is that most products were not designed for an agent to operate them completely. A chatbot in the corner can summarize a document or draft a response, but it usually cannot do everything you can do in the product. It cannot reliably see the same state, use the same workflows, or change the product through the same primitives as the interface.

That is why bolt-on AI eventually hits a ceiling.

Raw agents have the opposite problem. Tools like Claude Projects and general-purpose coding agents can be extremely powerful, but they often start as a blank text box. That blank canvas problem is intimidating for teams. There are no buttons, no durable workflows, no obvious starting points, and no domain-specific interface that makes the right action feel natural.

The result is a split:

Raw agents give you power without enough product shape. SaaS gives you product shape without full agent access, ownership, or customization. Agent-native apps combine the structure of SaaS with the flexibility of agents.

The evolution: AI-enabled to AI-native to agent-native

"AI-native" is already used across the industry, but the term is too broad to describe the next architecture of software. Some teams use it to mean infrastructure optimized for AI. Some use it to mean products where AI is central. Some use it to mean any product with an AI workflow.

Agent-native is more specific.

It is the architectural discipline of building applications so agents and humans can operate the same product from the start.

Adding AI to your app does not make it AI-native. AI-native means the product does not exist without the AI. Agent-native goes one step further: AI is central, and the product still has a real interface for humans.

The distinction matters because you should not have to choose between software you can use and software an agent can use. The same product should work both ways.

Mobile-native apps are the closest historical analogy. A mobile-native app was not a desktop website squeezed onto a small screen. It was designed around the constraints and strengths of mobile from the beginning: touch, camera, location, limited screen space, intermittent attention.

Agent-native apps are the same kind of shift. They are not SaaS products with AI squeezed into the corner. They are designed around the constraints and strengths of agents from the beginning: natural language, tool use, context, background work, and human supervision.

What makes an application agent-native?

An application is agent-native when the human interface and the agent are two ways of operating the same product. You may use screens, forms, buttons, keyboard shortcuts, and visual review flows. The agent may use natural language, tools, protocols, and background execution. But both are grounded in the same actions, data, permissions, and context.

The distinction comes down to five architectural principles.

1. Agent UI parity

Agent UI parity means anything the UI can do, the agent can do. And anything the agent can do should be visible, inspectable, or controllable through the product's interface, logs, permissions, or state.

The core test is simple. If you can archive an email, create a dashboard, schedule a meeting, update a record, or render a video, the agent should be able to perform the same action through the same application capability. The agent should not be screen-scraping the UI or using a fragile side-channel. It should call the same underlying capability that powers the product.

A chat panel on the side can be useful, but it cannot be the architecture. The agent needs access to the product's actual capabilities.

Consider an email app. A normal AI feature might draft a reply. That is useful, but shallow. An agent-native email app lets the agent draft the reply, inspect the thread, apply labels, archive notifications, route customer messages, pull context from a CRM, and leave the final send decision to you when needed. The agent is operating the email product, not merely commenting on it.

2. One shared action model

Agent UI parity only works when the same capability is not rebuilt for every surface.

In traditional software, a team might implement a UI action, then an API endpoint, then an automation hook, then an LLM tool definition, then a CLI command, then documentation explaining how all of those relate. Every copy creates drift. The UI can do one thing. The agent can do a narrower thing. The API exposes something slightly different. Eventually, nobody trusts the abstraction.

Agent-native architecture needs one action model.

Define the action once: archive an email, create a dashboard, render a video, schedule a meeting, invite someone, update a record. From that single definition, the UI can call it, the agent can see it as a tool, external clients can reach it, and other agents can route to it through the supported protocols.

In code, the pattern looks like an action definition rather than a pile of one-off integrations:

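// actions/reply-to-email.ts
import { defineAction } from "@agent-native/core";
import { z } from "zod";
// `db` is the app's database client. Its import is not shown in this snippet
// and depends on the template you start from.

export default defineAction({
  description: "Reply to an email thread",
  // Input schema: declared once alongside the action.
  schema: z.object({
    emailId: z.string(),
    body: z.string(),
  }),
  // The single implementation that every surface ends up calling.
  run: async ({ emailId, body }) => {
    await db.replies.insert({ emailId, body });
  },
});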

That single action can become a UI mutation, an agent tool, an HTTP endpoint, a CLI command, an MCP tool, and an A2A tool. The product capability is defined once, then exposed through every surface that needs it.
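To make "defined once, exposed through every surface" concrete, here is a minimal, dependency-free sketch. It does not use or reproduce the `@agent-native/core` API: the local `defineAction`, `toAgentTool`, and `toHttpHandler` helpers are hypothetical, the `parse` function stands in for a schema library, and an in-memory array stands in for the database.

```typescript
// Hypothetical sketch. None of these helpers are the @agent-native/core API;
// they only illustrate deriving several surfaces from one action definition.

interface Action<I, O> {
  name: string;
  description: string;
  parse: (input: unknown) => I; // stands in for a schema library such as zod
  run: (input: I) => O;
}

function defineAction<I, O>(action: Action<I, O>): Action<I, O> {
  return action;
}

interface ReplyInput {
  emailId: string;
  body: string;
}

// In-memory stand-in for the database table used in the post's example.
const replies: ReplyInput[] = [];

const replyToEmail = defineAction({
  name: "reply-to-email",
  description: "Reply to an email thread",
  parse: (input: unknown): ReplyInput => {
    const value = (input ?? {}) as { emailId?: unknown; body?: unknown };
    if (typeof value.emailId !== "string" || typeof value.body !== "string") {
      throw new Error("emailId and body must be strings");
    }
    return { emailId: value.emailId, body: value.body };
  },
  run: (input: ReplyInput) => {
    replies.push(input);
    return { ok: true, replyCount: replies.length };
  },
});

// Surface 1: what an agent runtime could list and call as a tool.
function toAgentTool<I, O>(action: Action<I, O>) {
  return {
    name: action.name,
    description: action.description,
    call: (args: unknown): O => action.run(action.parse(args)),
  };
}

// Surface 2: an HTTP-style handler that takes a JSON request body.
function toHttpHandler<I, O>(action: Action<I, O>) {
  return (jsonBody: string): { status: number; body: string } => {
    try {
      const result = action.run(action.parse(JSON.parse(jsonBody)));
      return { status: 200, body: JSON.stringify(result) };
    } catch (err) {
      return { status: 400, body: JSON.stringify({ error: String(err) }) };
    }
  };
}

const tool = toAgentTool(replyToEmail);
const handler = toHttpHandler(replyToEmail);
```

Both `tool.call(...)` and `handler(...)` go through the same `parse` and `run`, so validation and behavior cannot drift between the agent surface and the HTTP surface.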

3. Shared state, data, and context

An agent-native app cannot treat the agent as a separate background process. The agent needs to know what you are looking at, what is selected, which filters are active, and what changed while it was working.

In practice, context awareness means the UI writes navigation state as you move through the app; a view-screen action gives the agent a fresh snapshot of the current view; and a navigate action lets the agent move the UI when you ask it to open a record, thread, chart, document, or task.

This is why agent-native apps feel different from chatbots attached to products. If you highlight a paragraph and ask for a rewrite, the agent should know which paragraph. If you are looking at a customer account, the agent should operate on that account. If the agent creates a draft, updates a dashboard, or marks a task complete, the UI should refresh because both sides read and write the same database-backed state.
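The navigation-state idea can be sketched in a few lines. Everything here is illustrative: the `NavigationState` shape and the function names `writeNavigationState`, `viewScreen`, and `navigate` are assumptions modeled on the description above, and a module-level variable stands in for database-backed state.

```typescript
// Hypothetical sketch of shared navigation state. A module-level variable
// stands in for a database row; the names are illustrative only.

interface NavigationState {
  view: string;
  recordId: string | null;
  selectedText: string | null;
  filters: Record<string, string>;
}

let navigationState: NavigationState = {
  view: "inbox",
  recordId: null,
  selectedText: null,
  filters: {},
};

// UI side: called whenever the person moves, selects, or filters.
function writeNavigationState(patch: Partial<NavigationState>): void {
  navigationState = { ...navigationState, ...patch };
}

// Agent side: a "view-screen" style action. It returns a copy, so the agent
// cannot change what the UI shows by mutating the snapshot.
function viewScreen(): NavigationState {
  return { ...navigationState, filters: { ...navigationState.filters } };
}

// Agent side: a "navigate" style action that moves the UI to a record and
// clears any selection that belonged to the previous view.
function navigate(view: string, recordId: string): void {
  navigationState = { ...navigationState, view, recordId, selectedText: null };
}
```

If the person highlights a paragraph in thread `t-42`, the UI calls `writeNavigationState`, and the agent's next `viewScreen()` call sees that selection without being told about it in the prompt.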

Live sync does not need to mean fragile browser automation or long-lived infrastructure. The framework pattern can stay intentionally simple: actions write to SQL, a version changes, and the UI polls for updates and invalidates the right data. The important principle is not the polling interval. It is that the database is the coordination layer between the human interface and the agent.
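A rough sketch of that write, bump, poll, and invalidate loop follows. It is not the framework's implementation: `SharedStore` is a hypothetical in-memory stand-in for SQL tables plus a version counter, and `UiCache.poll()` would be driven by a timer in a real UI rather than called by hand.

```typescript
// Hypothetical sketch: an in-memory class stands in for a SQL table plus a
// version row. All names are illustrative assumptions.

interface Task {
  id: number;
  title: string;
  done: boolean;
}

class SharedStore {
  private tasks: Task[] = [];
  private version = 0;

  // Every write goes through here, so the version always moves with the data.
  write(task: Task): void {
    const others = this.tasks.filter((t) => t.id !== task.id);
    this.tasks = others.concat([{ ...task }]);
    this.version += 1;
  }

  currentVersion(): number {
    return this.version;
  }

  readAll(): Task[] {
    return this.tasks.map((t) => ({ ...t }));
  }
}

class UiCache {
  private store: SharedStore;
  private seenVersion = -1;
  private tasks: Task[] = [];
  refetchCount = 0;

  constructor(store: SharedStore) {
    this.store = store;
  }

  // Cheap check first; refetch only when the version has moved.
  poll(): boolean {
    const latest = this.store.currentVersion();
    if (latest === this.seenVersion) return false;
    this.tasks = this.store.readAll(); // invalidate and refetch
    this.seenVersion = latest;
    this.refetchCount += 1;
    return true;
  }

  view(): Task[] {
    return this.tasks;
  }
}
```

Whether the write comes from a button click or an agent action makes no difference to `UiCache`: it only watches the version, which is what makes the database the coordination layer.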

4. Protocol-ready by design

Agent-native applications are not isolated chatbots. They are software nodes that agents and other apps can use.

That means protocols matter. An agent-native app should be reachable through standard agent interfaces such as MCP, so tools like Claude Code, Codex, Cursor, Builder.io, or other MCP-compatible clients can understand and operate it. It should also support agent-to-agent communication, so one app can ask another app to do work.

The important part is that protocol support is not a one-off integration project. It is a property of the app architecture. If actions are already the shared unit of product behavior, exposing those actions to MCP, A2A, a CLI, or an internal API becomes a routing problem rather than a second product.

An analytics app should be able to ask a slide app to turn a dashboard into a deck. A calendar app should be able to coordinate with an email app to propose meeting times. A dispatch agent should be able to route work across an ecosystem of apps without each team writing bespoke glue code every time.
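The "routing problem" framing can be illustrated with a small dispatcher. The request shapes below are loosely modeled on MCP's `tools/list` and `tools/call` methods, but this is a simplified sketch rather than a spec-complete server: there is no JSON-RPC envelope, transport, or capability negotiation, and the `ActionDef` type and `route` function are hypothetical.

```typescript
// Hypothetical sketch: routing protocol-style requests onto an existing list
// of actions. Not a complete MCP or A2A implementation.

interface ActionDef {
  name: string;
  description: string;
  // JSON-Schema-like description of the arguments.
  inputSchema: {
    type: "object";
    properties: Record<string, { type: string }>;
    required: string[];
  };
  run: (args: Record<string, unknown>) => unknown;
}

const actions: ActionDef[] = [
  {
    name: "create-deck-from-dashboard",
    description: "Turn a dashboard into a slide deck",
    inputSchema: {
      type: "object",
      properties: { dashboardId: { type: "string" } },
      required: ["dashboardId"],
    },
    run: (args) => ({ deckId: "deck-for-" + String(args["dashboardId"]) }),
  },
];

type ProtocolRequest =
  | { method: "tools/list" }
  | {
      method: "tools/call";
      params: { name: string; arguments: Record<string, unknown> };
    };

function route(request: ProtocolRequest): unknown {
  if (request.method === "tools/list") {
    // Listing is a projection of the action definitions, nothing more.
    return {
      tools: actions.map((a) => ({
        name: a.name,
        description: a.description,
        inputSchema: a.inputSchema,
      })),
    };
  }
  const { name, arguments: args } = request.params;
  const action = actions.filter((a) => a.name === name)[0];
  if (!action) throw new Error("Unknown tool: " + name);
  for (const field of action.inputSchema.required) {
    if (!(field in args)) throw new Error("Missing argument: " + field);
  }
  return action.run(args);
}
```

Adding another protocol surface would mean adding another branch or adapter around the same `actions` list, not reimplementing the behavior.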

5. Governed execution

The final test is whether the agent can act inside the same permission model as the product.

If you cannot access a customer record, the agent should not be able to access it on your behalf. If sending an email, deleting a file, publishing a page, or changing a billing setting requires confirmation, the agent should respect that same boundary. If a team needs to know what happened, the product should expose logs, audit trails, and state changes in a way humans can inspect.

This is where agent-native stops being a clever interface and becomes an application architecture. The agent is powerful because it can act. The product is trustworthy because those actions are scoped, reviewable, and reversible where needed.
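One way to picture governed execution is a single `execute` function that both the UI and the agent must go through. The sketch below is an assumption-laden illustration, not the framework's permission model: the role names, the `requiresConfirmation` flag, and the audit entry shape are all hypothetical.

```typescript
// Hypothetical sketch of a shared permission, confirmation, and audit path.

type Role = "viewer" | "editor" | "admin";
type Outcome = "done" | "denied" | "needs-confirmation";

interface User {
  id: string;
  role: Role;
}

interface GovernedAction<I> {
  name: string;
  allowedRoles: Role[];
  requiresConfirmation: boolean;
  run: (input: I) => void;
}

interface AuditEntry {
  actor: string;
  via: "ui" | "agent";
  action: string;
  outcome: Outcome;
}

const auditLog: AuditEntry[] = [];

// The agent acts with the permissions of the person it works for, never more.
function execute<I>(
  action: GovernedAction<I>,
  onBehalfOf: User,
  via: "ui" | "agent",
  input: I,
  confirmed: boolean = false
): Outcome {
  let outcome: Outcome;
  if (action.allowedRoles.indexOf(onBehalfOf.role) === -1) {
    outcome = "denied";
  } else if (action.requiresConfirmation && !confirmed) {
    outcome = "needs-confirmation";
  } else {
    action.run(input);
    outcome = "done";
  }
  // Every attempt is logged, including the ones that did not run.
  auditLog.push({ actor: onBehalfOf.id, via, action: action.name, outcome });
  return outcome;
}

const publishedPages: string[] = [];

const publishPage: GovernedAction<{ pageId: string }> = {
  name: "publish-page",
  allowedRoles: ["editor", "admin"],
  requiresConfirmation: true,
  run: ({ pageId }) => {
    publishedPages.push(pageId);
  },
};
```

With this shape, an agent working for a viewer is denied, an agent working for an editor gets `"needs-confirmation"` until a human approves, and all three attempts appear in `auditLog`.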

What agent-native apps need as they grow

The principles above define the minimum. But the real promise of agent-native software is not just that an agent can click the same buttons you can. It is that the product can become more personal, more programmable, and more collaborative as you use it.

These layers are not all required for the first version of an agent-native app. A personal prototype can be agent-native before it has team governance, runtime tools, or an observability dashboard. But as soon as agent-native apps move from demos into repeated work, these capabilities start to matter.

Workspace customization

Agent-native is not only about exposing product actions to an LLM. It also gives each person and team a customization layer normally reserved for developer tools.

A mature app should ship with a workspace: AGENTS.md for shared instructions, LEARNINGS.md for durable team memory, personal memory, skills, custom sub-agents, scheduled jobs, and connected MCP servers. The important architectural detail is that these resources live in SQL rather than on a local filesystem.

That changes the economics of customization. Claude Code and Codex already show how powerful agent workspaces can be when instructions, skills, memory, and tools travel with a project. But that model is usually organized around developer workflows: repos, local environments, source control, and project files. An agent-native app brings the same pattern into the product itself, where workspaces are database-backed, scoped by person or organization, and editable inside the app.
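As a sketch of what "database-backed, scoped by person or organization" might look like, the example below loads the workspace resources visible to one person. The row shape, the file name `memory.md`, and the rule that personal resources override organization resources with the same path are assumptions made for illustration; the post does not specify them. An array stands in for the SQL table.

```typescript
// Hypothetical sketch of scoped workspace resources stored as rows.

type Scope = "organization" | "person";

interface WorkspaceResource {
  scope: Scope;
  ownerId: string; // organization id or person id, depending on scope
  path: string; // for example "AGENTS.md"
  content: string;
}

// Stand-in for a SQL table of workspace resources.
const resourceTable: WorkspaceResource[] = [
  { scope: "organization", ownerId: "org-1", path: "AGENTS.md", content: "Use a formal tone." },
  { scope: "organization", ownerId: "org-1", path: "LEARNINGS.md", content: "Customers prefer short replies." },
  { scope: "person", ownerId: "user-a", path: "memory.md", content: "I batch email at 4 p.m." },
  { scope: "person", ownerId: "user-b", path: "memory.md", content: "Private to user-b." },
];

function loadWorkspace(personId: string, organizationId: string): Record<string, string> {
  const visible: Record<string, string> = {};
  // Organization rows are applied first, then personal rows, so a personal
  // file with the same path wins. That precedence is an assumption here.
  const order: Scope[] = ["organization", "person"];
  for (const scope of order) {
    const ownerId = scope === "organization" ? organizationId : personId;
    for (const row of resourceTable) {
      if (row.scope === scope && row.ownerId === ownerId) {
        visible[row.path] = row.content;
      }
    }
  }
  return visible;
}
```

Because the resources are rows rather than files on one machine, the same lookup works for any person in the organization and can be edited from inside the app.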

This matters for adoption. You do not just want a smarter default app. You want an app that learns your workflow, remembers team conventions, supports reusable instructions, and lets specialists shape the agent without waiting on a product roadmap.

Runtime tools and automations

As agent-native apps mature, people start wanting a layer for work that is smaller than a permanent product feature but more durable than a one-off chat response.

Runtime tools fill that gap. A tool can be a private dashboard, calculator, monitor, data lookup, or small interactive utility the agent creates inside the app without a code change, build, deploy, or migration. If it becomes core to the product, it can later graduate into a template feature. Until then, it gives you a way to customize the app immediately.

Automations do something similar for background work. You should be able to say, "When an enterprise lead books a meeting, post the details to Slack," or "Every Monday, summarize last week's support threads," and have that become a scheduled or event-triggered workflow with the same actions, permissions, secrets, and audit surfaces as the rest of the product.
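An event-triggered automation of that kind could be represented as data that points at an existing action. The sketch below is illustrative only: the `Automation` shape, the event name `meeting.booked`, and the `post-to-slack` action are hypothetical, and the permission, secret, and audit handling the paragraph mentions is left out for brevity.

```typescript
// Hypothetical sketch: automations as data that reuse registered actions.

interface AppEvent {
  type: string;
  payload: Record<string, string>;
}

interface Automation {
  description: string;
  on: string; // event type to listen for
  when: (event: AppEvent) => boolean;
  action: string; // name of an action that already exists in the app
  buildInput: (event: AppEvent) => Record<string, string>;
}

// Stand-in for a real Slack integration.
const slackMessages: string[] = [];

const actionRegistry: Record<string, (input: Record<string, string>) => void> = {
  "post-to-slack": (input) => {
    slackMessages.push(String(input["text"]));
  },
};

const automations: Automation[] = [
  {
    description: "When an enterprise lead books a meeting, post the details to Slack",
    on: "meeting.booked",
    when: (event) => event.payload["plan"] === "enterprise",
    action: "post-to-slack",
    buildInput: (event) => ({
      text: "Enterprise meeting booked by " + event.payload["lead"],
    }),
  },
];

// Returns how many automations fired for this event.
function dispatch(event: AppEvent): number {
  let fired = 0;
  for (const automation of automations) {
    if (automation.on !== event.type || !automation.when(event)) continue;
    const run = actionRegistry[automation.action];
    if (typeof run !== "function") {
      throw new Error("Unknown action: " + automation.action);
    }
    run(automation.buildInput(event));
    fired += 1;
  }
  return fired;
}
```

The automation never contains its own Slack logic. It only names an action, which is why it can inherit whatever governance the rest of the product applies to that action.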

Progress and observability

Long-running agent work also needs product-grade visibility. Progress state, notifications, traces, feedback, evals, and cost/latency metrics are not enterprise extras; they are how humans supervise autonomous software. If the agent is triaging 128 emails, importing a dataset, or rendering a video, you should see what is happening, where it is stuck, what it cost, and what it changed.

Observability is especially important because agent-native products do not fail like normal SaaS products. A button either worked or it did not. An agent may choose the wrong tool, skip a step, spend too much, get confused by stale context, or do the right thing for the wrong reason. Traces, evals, feedback, and audit surfaces give teams a way to improve the agent instead of guessing.
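A minimal trace recorder shows the kind of data that makes this supervision possible. The `RunTrace` class and its field names are hypothetical, and costs are stored as integer micro-dollars purely to keep the arithmetic exact in the example.

```typescript
// Hypothetical sketch of per-run progress and trace data.

interface TraceStep {
  tool: string;
  ok: boolean;
  latencyMs: number;
  costMicroUsd: number; // integer micro-dollars, to avoid float rounding
  note: string;
}

class RunTrace {
  readonly steps: TraceStep[] = [];
  readonly label: string;
  readonly totalItems: number;
  itemsDone = 0;

  constructor(label: string, totalItems: number) {
    this.label = label;
    this.totalItems = totalItems;
  }

  // Record one tool call and how many work items it completed.
  record(step: TraceStep, itemsCompleted: number): void {
    this.steps.push(step);
    this.itemsDone += itemsCompleted;
  }

  // Human-readable progress, suitable for a status line or notification.
  progress(): string {
    return this.itemsDone + "/" + this.totalItems + " " + this.label;
  }

  // Totals plus the failed steps, which is where a reviewer would look first.
  summary(): { costMicroUsd: number; latencyMs: number; failures: TraceStep[] } {
    let costMicroUsd = 0;
    let latencyMs = 0;
    const failures: TraceStep[] = [];
    for (const step of this.steps) {
      costMicroUsd += step.costMicroUsd;
      latencyMs += step.latencyMs;
      if (!step.ok) failures.push(step);
    }
    return { costMicroUsd, latencyMs, failures };
  }
}
```

A failed step with a note such as "stale context" is exactly the kind of agent-specific failure that a pass/fail button never had to report.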

Team readiness

The individual developer path matters because bottom-up adoption is how many developer tools spread. Someone clones an app on a weekend, uses it for a real personal workflow, then brings it to work because it is already useful.

But the team layer still cannot be an afterthought.

Once a company has several agent-native apps, unmanaged autonomy becomes chaos. Who has access to which app? Which LLM key is being used? Which data can the agent read? Which actions require approval? How do you audit what happened? How do you share a workflow without making everything public?

Teams running agent-native apps eventually need team primitives: users, organizations, roles, permissions, shared workspaces, private-by-default data, auditability, and governance.

Open source templates make it possible to clone and build agent-native apps. As those apps spread inside a company, the architecture needs an operational layer for hosting, auth, database management, branching, provisioning, and controls.

Why agent-native apps should be cloneable

Cloneability is where the agent-native idea becomes practical for developers and teams.

SaaS products often make your own data feel rented. Your calendar, analytics, email history, support tickets, calorie logs, and customer records sit behind a vendor's product assumptions. You can export some of it, query some of it, automate some of it, and customize very little of it.

Agent-native apps push in the other direction: clone the software, own the code, own the database, and change the product when the default shape no longer fits.

At some point, many SaaS interfaces become walls around your own data.

That sounds abstract until you hit the first question the product did not anticipate. A calorie tracker might show you weekly trends, but not answer, "Which foods correlate with my worst sleep when I eat them after 8 p.m.?" A dashboard tool might show revenue by segment, but not run the exact exploratory analysis your team needs today. A SaaS email client might help you move faster, but it will not let you rebuild the inbox around your company's internal routing logic.

When you own the database and have an agent, you can ask questions the original developers never thought to answer.

That is the economic and practical argument for cloneable SaaS: full products you can clone, own, and reshape instead of endlessly subscribing to generic software. Cloneability is how agent-native software breaks out of one-size-fits-all SaaS.

Where agent-native fits in the software stack

Once agent-native is defined, the comparison becomes more practical. SaaS, raw agents, and internal tools each give you something useful, but each leaves a gap. This table shows where those gaps appear: control, UI quality, agent access, customization, ownership, team readiness, observability, and cost.

Over time, applications an agent can fully operate will replace applications where AI can only talk about the work.

That does not mean every app becomes a text box. It means every serious app needs to expose its real capabilities to an agent while preserving the interface humans need to inspect, supervise, correct, and collaborate.

What agent-native looks like in practice

The easiest way to understand agent-native is to look at the kinds of products it makes possible.

Email

Traditional email clients optimize for faster human triage. Agent-native email changes the unit of work. You can still use the inbox manually, but the agent can also summarize threads, draft replies, apply labels, archive low-value notifications, route customer messages, and pull relevant context from connected systems.

You can try this pattern in the agent-native mail template: a familiar inbox you can clone, customize, and operate with an agent.

Analytics

Traditional analytics tools make teams define dashboards, queries, charts, and permissions through a UI. Agent-native analytics lets you ask for a dashboard in natural language, inspect the result visually, then ask follow-up questions against the same underlying data.

The point is not just "chat with your data." The point is that the chart, the query, the dashboard, and the agent's analysis all belong to the same application. The agent can build the dashboard, and you can edit it.

You can try this pattern in the agent-native analytics template.

Calendar

Calendar software is already full of repetitive agent-shaped work: find a time, reschedule this meeting, protect focus blocks, propose slots to a customer, coordinate across teammates, and follow up when nobody responds.

An agent-native calendar gives you a real calendar UI while giving the agent the ability to manage scheduling through the same actions. It can also talk to other apps, such as email or contacts, because scheduling rarely lives inside the calendar alone.

You can try this pattern in the agent-native calendar template.

Clips

Video and screen-recording tools are inherently shareable. When someone sends a clip, the recipient experiences the product simply by watching it. That is why this category spreads naturally.

An agent-native clips app can preserve that viral loop while making the workflow programmable. The agent can help cut, summarize, title, organize, and route videos, while you still have a normal interface for recording and sharing.

You can try this pattern in the agent-native clips template.

Video

Agent-native video turns motion graphics into software the agent can operate. Instead of opening a heavy editor for every change, you can describe the change you want: add a title card, retime this section, change the easing curve, render an MP4, or create a new composition.

The UI still matters. You still need a timeline, preview, controls, and export state. But the agent can manipulate the same composition model directly. That is the agent-native pattern: natural language control without losing the product surface.

You can try this pattern in the agent-native video template.

How to get started with agent-native

There are two practical paths.

The individual path is to clone a template at www.agent-native.com/templates, use your own LLM key, and try it on a real workflow. Email, calendar, analytics, clips, video, content, forms, and other templates are useful because they start from software you already understand. The blank canvas problem disappears when the agent lives inside a working app.

The team path starts when a template becomes valuable enough to share. The question shifts from "Can I use this myself?" to "Can my team trust this?" That requires hosting, database management, auth, permissions, branching, governance, and shared key provisioning. Builder.io is the team layer for that step: a way to host, share, govern, and manage agent-native apps once they move beyond a personal clone.

Clone a template this weekend. Use your own API key. The agent is already there.

Frequently asked questions

What is an agent-native application?

An agent-native application is one where the UI and the agent are two surfaces over the same product. If you can create, edit, approve, delete, search, schedule, or publish something in the interface, the agent should be able to perform that same action through the application's own action model, with the same permissions and auditability.

How is agent-native different from AI-native?

AI-native means AI is central to the product. Agent-native is more specific: the product is built so an AI agent and a human-facing UI share full access to the application's capabilities. An AI-native product may be only a chat interface. An agent-native product combines agent power with a durable app interface.

What is agent UI parity?

Agent UI parity is the principle that anything the UI can do, the agent can do too. If you can archive an email, create a dashboard, schedule a meeting, or render a video, the agent should be able to perform the same action through the same underlying application capability.

What programming stack do agent-native apps use?

The stack depends on the framework and template, but the important pattern is that application actions are defined once and exposed to both the UI and the agent. In practice, an agent-native app commonly combines a modern web UI, a database the owner controls, typed actions, and agent protocols such as MCP.

Is agent-native only for enterprises?

No. Agent-native adoption should start with individual developers because personal workflows are the fastest way to prove value. Enterprise needs appear later: team permissions, shared keys, governance, auditability, hosting, and compliance. The same app model should support both paths.

TL;DR

Agent-native applications are built so humans and AI agents operate the same app.
The core principles are agent UI parity, one shared action model, shared state and context, protocol readiness, and governed execution.
Agent-native is more specific than AI-native because it requires both agent capability and a real human-facing interface.
Cloneable apps change the economics: own the code, own the data, and use one LLM key across many apps.
SQL-backed workspace resources, automations, runtime tools, progress, and observability make agent-native apps stronger as they grow.
The open source framework and templates are the starting point. Team governance, hosting, auth, and sharing can come later without changing the core model.

The next version of the apps you use every day will be agent-native. The question is whether you clone them, build them, or wait for someone else to define the category.

Explore templates at www.agent-native.com/templates. They are free and open source.


Agent Productivity Is Creating a Quality Debt

Fri, 08 May 2026 18:00:00 GMT

AI writes more code than your review process was designed to handle. That is why every PR now needs an agent that opens your product and uses it.

For twenty-five years, software teams optimized around one assumption: code was expensive to write. That single fact shaped every process we built. Teams planned carefully, spec'd thoroughly, and designed before anyone touched a keyboard. By the time a developer started writing, every decision had been pressure-tested through meetings, documents, and design reviews. The cost of writing the wrong code was high enough to justify all that overhead.

That assumption no longer holds. AI agents now turn ideas into working code in minutes. A bug report can be fixed before the standup ends. A PM's idea can become a working prototype before the spec gets written. In the most aggressive shops, agents open as many PRs in a week as the whole engineering team.

So the constraint moved. Writing code used to be the slow, expensive step in the lifecycle, and the cost of doing it has fallen close to zero. The slow step now is something that always took time and always will: knowing whether the code actually works.

The shape of the new gap

When a human engineer wrote a feature, they tested it as they built. They clicked through the flow. They tried the edge cases. They knew, in their hands, that the thing worked before they pushed. Review was a second pair of eyes on a piece of code that had already had a careful set of eyes on it.

Agent-generated code arrives at review without that step. The agent writes the diff and runs the unit tests. The product itself stays unopened. Nobody fills out the form. Nobody watches what happens when the network call fails. Nobody notices the empty state is broken. The code looks right. It compiles. The tests pass. Then a customer hits a bug nobody walked through.

This is the failure mode of the agent-native shift. The industry's response so far has been to add more code review at the diff layer. AI code reviewers read the diff. Coding assistants reason about the changes. These tools are real improvements, and most teams should use them. They cover the layer the human engineer used to cover by reading their own code, which means they catch what a careful read of the diff would catch.

What stays uncovered is everything that only shows up once the product runs. Buttons that don't fire, forms that submit empty, redirects that 404, error states that nobody designed. None of that shows up in a diff. It shows up when something uses the product in a real browser with real interactions, which is exactly the step that gets skipped when agents are doing the writing. These are the failures nobody scripted a test for, because nobody thought that part would break.

The review job that doesn't scale

Every team has someone who clicks through every PR before it merges. Sometimes it's a QA engineer. Sometimes it's the developer who opened the PR. Sometimes it's whoever's free that day. The job is always the same: open the change, walk the flows, look for what broke.

That job worked when ten PRs landed a day. It struggles when a hundred do, and it falls apart when half of those PRs come from agents acting on tickets, Slack threads, and customer feedback the human reviewer never saw. Volume is one part of the problem. Context is another. The reviewer doesn't have what the agent had: the customer message in Slack, the ticket the PM filed, the conversation that led to the change. They have a diff, a description, and forty-five seconds before the next PR lands.

So most teams quietly let it slide. They write end-to-end test suites for the critical paths and hope the rest holds. They rely on customer bug reports as the real QA layer. They ship faster than they ever have and absorb a steady trickle of regressions as the cost of doing business in the AI era.

This is the bill for agent productivity coming due. Faster generation paired with the same review capacity means more behavioral coverage gets skipped on every PR, and the compounding effect is already visible: shipping anxiety on every frontend change, brittle test suites that nobody trusts, customer-found bugs creeping up, and engineering leaders who can't tell their CFO whether the productivity gains from AI are real or whether they're being paid back in support load and churn.

Why the existing toolbox leaves a gap

The natural response to this is to reach for what's already in the toolbox. Most teams have unit tests, integration tests, and end-to-end suites for critical flows. Many added an AI code reviewer in the last twelve months. A few have tried wiring up browser automation themselves, giving an agent the primitives to drive a browser and see what happens.

Every one of these layers carries weight in a modern codebase, and each has a defined edge for what it covers. The hole opens in the same place each time. Here's what each layer covers, and where it stops:

The pattern is consistent. Each layer covers a slice of the problem, leaving a widening gap in behavioral coverage every time an agent opens a PR. Scripted suites only catch what someone took the time to script. Code review agents work at the wrong layer for the failure mode that matters now. DIY browser automation gives an agent the primitives to drive a browser, and turning that into coordinated coverage across a team's PRs means building flow inference, a severity policy, a replay viewer, and PR integration on top. Underneath all of it sits a question about engineering time: building QA infrastructure pulls that time away from shipping product, and most teams that go down that road find their bespoke version drifts out of sync with the rest of their stack within a quarter.

Agents need to use the product

Something has to walk through the product on every PR. The volume of changes has exceeded what humans can absorb, which means that something has to be an agent.

What that agent does sits at the behavior layer of the stack, one rung up from where diff-reading agents work today. It opens the product in a real browser. It clicks. It types. It walks the flows the change touches, and reports what broke, with enough evidence that a human reviewer can verify the finding in seconds. This is the layer that's been missing from agent-native development, and it's the layer we built Quality Review Agent to fill.

On every PR your team opens, Quality Review Agent spins up a real browser, loads your product, and uses it. It reads the PR title, description, and diff to figure out what changed, then walks the change end-to-end across three layers of coverage:

Critical flows. The happy path for whatever the change touches. If a PR modifies a checkout step, the agent walks through the checkout process.
Edge cases. Empty states, invalid input, rate limits, error paths. The boring failure modes that humans skip when they're tired, and the suite skips when nobody scripted them.
Regressions. Whether this change broke anything nearby. A tweak to a dashboard filter re-tests the charts that depend on it.
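
As a rough mental model only, the three layers can be pictured as a plan derived from the flows a change touches. This is a toy sketch, not how Quality Review Agent is implemented: the flow names, the `DEPENDENTS` map, and `plan_coverage` are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical map from a flow to the flows that depend on it.
DEPENDENTS = {
    "checkout": ["order-confirmation"],
    "dashboard-filter": ["revenue-chart", "usage-chart"],
}

# Hypothetical edge cases worth probing on any touched flow.
EDGE_CASES = ["empty state", "invalid input", "rate limit", "error path"]


@dataclass
class CoveragePlan:
    critical_flows: list[str] = field(default_factory=list)
    edge_cases: list[str] = field(default_factory=list)
    regressions: list[str] = field(default_factory=list)


def plan_coverage(touched_flows: list[str]) -> CoveragePlan:
    """Build a three-layer plan for the flows a PR touches."""
    plan = CoveragePlan()
    for flow in touched_flows:
        # Layer 1: the happy path for whatever the change touches.
        plan.critical_flows.append(f"{flow}: happy path")
        # Layer 2: the boring failure modes, applied to the same flow.
        plan.edge_cases.extend(f"{flow}: {case}" for case in EDGE_CASES)
        # Layer 3: nearby flows that depend on the touched one.
        plan.regressions.extend(DEPENDENTS.get(flow, []))
    return plan


plan = plan_coverage(["dashboard-filter"])
print(plan.regressions)
```

In this model, a tweak to the dashboard filter pulls the dependent charts into the regression layer, which mirrors the example in the list above.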

When the run is complete, the agent posts what it found back to the PR, including a video replay, network calls, and console logs synced to the timeline. Every flagged bug comes with the receipts: a frame-by-frame replay showing exactly what broke, with the failed network request and the console exception at the right second. The reviewer scrubs to the moment it fired, sees the agent's reasoning at that step, and decides whether to merge or fix.

Code review and quality review cover different layers of the same PR, which is why teams need both to run. Both run in parallel, so the total latency tracks the slower of the two, and full behavioral and code-level coverage on every change comes without slowing the team down.
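
The latency claim is easy to check with a small simulation. The sketch below uses sleeps as stand-ins for the two reviews; the function names and durations are made up for illustration and say nothing about how long either review takes in practice.

```python
import asyncio
import time


async def run_review(name: str, seconds: float) -> str:
    """Stand-in for a review job; sleeps instead of doing real work."""
    await asyncio.sleep(seconds)
    return f"{name} done"


async def review_pr(code_secs: float, quality_secs: float) -> tuple[list[str], float]:
    """Run both reviews concurrently and report the wall-clock time."""
    start = time.perf_counter()
    results = await asyncio.gather(
        run_review("code review", code_secs),
        run_review("quality review", quality_secs),
    )
    return list(results), time.perf_counter() - start


results, elapsed = asyncio.run(review_pr(0.1, 0.2))
print(results, f"{elapsed:.2f}s")
```

With simulated durations of 0.1s and 0.2s, the elapsed time lands near 0.2s (the slower job) and well under the 0.3s a sequential run would take.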

When a flagged bug comes back, anyone on the team can resolve it. Every finding has a "Fix in Builder" button that lets the person closest to the problem describe the fix in plain English, hand it to the agent, push the update to the same branch, and re-run. The PM who opened the ticket can fix the copy themselves. The designer who refined the layout can fix the spacing. The engineer reviewing the diff can fix the architecture. The work doesn't bottleneck on whoever happens to own the code path.

Why the trust layer must scale with generation

Agent-native development is still early enough that most of the conversation is about the generation side. Faster code, more PRs, more roles contributing. The trust side moves more slowly through the news cycle because the breakdown is slow, too. Regressions trickle in. Customer-found bugs creep up. Engineering leaders start hedging when the CFO asks about productivity gains. The compounding cost is real, showing up in support load and churn over months, in small recurring damage that rarely makes the headlines.

The teams that get ahead of this build the trust layer alongside the generation layer. A code review agent on every PR. A quality review agent on every PR. Both run in parallel, giving the human reviewer the receipts they need to merge or fix in seconds. The work that used to take a human reviewer an hour now takes them two minutes, and the work that used to be skipped entirely now happens by default.

This is the bet behind Quality Review Agent, and the bet behind the rest of Builder. The first wave of the platform was collaboration: putting designers, PMs, and engineers on the same branch, working from real code, with real components. Quality Review Agent is the next layer of the same thesis. Once a team is shipping code from across roles and across agents, trust becomes the live question on every PR. Did this work, in a real browser, the way a customer is going to use it? An agent doing the work humans used to do, at the volume the platform now produces, is how we answer that question on every PR.

The teams that win the AI-native era will be the ones shipping the most working software. With code generation as abundant as it is now, the deciding factor is whether all that generated work reaches users in working order. The trust layer is how we get there.

Check out the announcement on Builder’s Quality Review Agent, and sign up for free to try it on your own codebase.

Read the full post on the Builder.io blog

How to Create Free, On-Brand LinkedIn Carousels

Wed, 06 May 2026 18:00:00 GMT

LinkedIn carousels (PDF "documents") are one of the highest-performing native formats on the platform — and one of the most annoying things to actually produce. The constraints break every normal slide tool: they have to be 1:1 or 4:5, every frame has to read like a billboard on a phone screen at arm's length, and the whole set has to stay on-brand across 8–10 slides. Your options today are Canva (manual and slow), generic AI deck tools (off-brand, wrong aspect ratio, very "slide-deck-brained"), or a designer (slow and expensive).

Here's a faster path: open the Agent Native Slides template, point it at your design system, drop in the source material you want the carousel to be about, and prompt it. Free.

This is a concrete example of our take on cloneable, agent-native software — instead of bending a generic SaaS app to your brand, you clone a small, focused template and wire it into your own context.

The 4 steps

1. Open the Agent Native Slides template

Easiest path: open the hosted version at slides.agent-native.com and start in the browser.

Prefer to run it locally? One command:

npx @agent-native/core create my-slides-app --template slides

More on the template and how it's wired together in the Agent Native Slides template page and the getting started docs.

2. Sign in to Builder for free credits (or bring your own Anthropic key)

In the agent panel, click Connect Builder.io. You'll get free credits to use Claude (Opus/Sonnet) and free hosting — no API key juggling. This is the recommended path for most people.

If you'd rather use your own quota, paste an Anthropic API key instead. Both paths are covered in the onboarding & API keys docs.

3. Point it at your design system (or design inspiration)

Drop a link or file into the chat that defines what "on-brand" means for you. Anything works:

A link to your design system docs
A Figma URL
A brand PDF or one-pager
A GitHub repo
Screenshots of carousels you like

The agent uses these as style reference only — colors, typography, spacing, rendering technique. It will not copy scenes or content from them.

Then tell it the rules LinkedIn actually cares about, in plain English:

Use a 1:1 aspect ratio for every slide. Use billboard typography — one big idea per frame, large type, minimal supporting text. Stay on-brand using the design system above.

This is the part every generic AI deck tool gets wrong. Lock it in once.

4. Drop in the source material and prompt the carousel

Paste the link or document the carousel is about — a blog post, an internal memo, a podcast transcript, a PDF, anything. Then prompt it. A copy-pasteable starting point:

Turn [this article] into a 7-slide LinkedIn carousel. 1:1 aspect ratio. Billboard typography — one big idea per slide, max 12 words on screen. On-brand using the design system above. Slide 1 is a hook. The last slide is a CTA.

From there, iterate conversationally:

"Make slide 3 bolder."
"Swap the color scheme to the secondary palette."
"Tighten slide 5 to 6 words."
"Add a slide between 4 and 5 with the key stat."

The agent and the canvas update live. Export to PDF, upload to LinkedIn as a document, done.

Why this works

This isn't "a slide app with AI bolted on the side." The agent and the UI are one system, both reading and writing the same deck against your real brand context. Because the whole thing is an open template, you own the tool: fork it, hard-wire LinkedIn-specific defaults (aspect ratio, typography rules, your brand tokens, your footer), and you've solved this problem permanently for your team.

That's the agent-native shape — small, cloneable, opinionated apps wired directly to an agent and to your context. The carousel tool is one example. You can build the next one yourself in an afternoon. More on the why in The future of SaaS is cloneable.

Try it

Make a carousel now: slides.agent-native.com
Read the deeper why: The future of SaaS is cloneable
Build your own template: Agent Native docs

Read the full post on the Builder.io blog

The Future of SaaS Is Cloneable

Wed, 06 May 2026 18:00:00 GMT

For a long time, AI-coded software has had an obvious ceiling.

You could vibe code a demo polished enough to post to Twitter, like a dashboard, content editor, CRM, or half-working internal tool.

Then, you tried to use it. Daily.

Things would break. You'd try to fix them with the AI, but the second pass made it worse than before. The data model was wrong. The UI got fragile. The agent couldn't really tell which parts mattered, and what looked like a product turned out to be mostly a screenshot.

Thing is, that's starting to change.

Better coding agents, app patterns, and ways to connect agents to real interfaces and data are making some internal apps useful enough to keep improving and use day to day.

It's definitely not for every enterprise workflow, and not for compliance-heavy systems where reliability, support, and procurement matter more than customization. But for personal software, technical teams, and internal workflows, the line is moving.

AI is making it practical to own, customize, and connect the everyday software your team actually uses. SaaS isn't dead yet, but the moat is shrinking.

Internal tools suddenly got real

Let's look at a real-life example of this.

Teams are buried in data: product events, sales records, support tickets, call notes, billing systems, or a warehouse only three people really understand.

The usual SaaS answer is analytics software, which definitely helps, but the moment your exact dashboard doesn't exist, you're kinda screwed. You file a request, export a CSV, ask a data person, or settle for the closest chart.

So, what if you had a tool that worked out-of-the-box, but that you could own and customize the second you needed it to be different?

That's what we've been doing at my company with our own internal analytics app. Multiple teams are already using it for real work: sales can look for stalled accounts, support can inspect ticket patterns, product can ask where users drop off, and GTM can build event prep views from several systems.

The analytics team still owns the core tool. They define data sources, shared concepts, metrics, and dashboard patterns. And the app still has a UI, saved dashboards, and structure people can understand.

But when an exact view doesn't exist, anyone can ask the app's agent to just... make it. The agent can query any of our data sources, figure out which ones it needs to create the charts, and save a new dashboard that everyone can now use. It can actively adapt the app around how our team works.

The app stays stable where stability matters. The agent flexes where the workflow needs flexibility.

That's different than your traditional chatbot bolted onto SaaS. It's more like having a portable Claude Code with a fully dynamic UI. It's a different relationship between the user, the app, the data, and the agent.

We've been calling it agent-native apps.

The old SaaS way of doing things is dying

SaaS usually wins because building polished software is really, really hard.

The bargain goes something like this. SaaS companies absorb the work most teams don't want to own: hosting, auth, databases, permissions, integrations, support, reliability, design, and a thousand other product decisions that actually make software usable.

In return, teams pay for the tool. Buying beats building because the alternative is becoming a software company for every little workflow.

SaaS also encodes product learning. Calendar apps, mail clients, and docs apps aren't just tables, simple wrappers, or naked text areas. SaaS companies spent years discovering the primitives that actually make these categories of software feel good to use.

But in order to figure all that out, SaaS has always had to operate at scale. And so the bargain has a cost: the software is generalized.

It has features you don't need, lacks the exact feature you want, integrates (but not deeply enough), and offers AI only in that vendor's chosen shape. This isn't a failure of SaaS so much as just the nature of the generalized software beast.

A mail client has to work for sales, support, founders, recruiters, executives, writers, and even folks who just learned that a double click doesn't mean pressing both mouse buttons. Analytics has to satisfy teams asking vastly different questions.

And that's a tradeoff we all accepted, because building your own version was obviously too expensive. But SaaS has already taught us the useful primitives of each category. If AI makes it easy to clone those primitives without inheriting the vendor's one-size-fits-all assumptions, then the math changes.

It's no longer a question of only, "Which SaaS tool should we buy?" It's also, "Which workflows are specific enough to us that owning and customizing the tool would give us significant gains?"

Cloning SaaS workflows doesn't mean making slop forks

There is an obvious bad version of this future: the slop fork.

In the slop fork version, AI copies the surface area of a product without understanding the constraints that make it work. It copies the UI, routes, database shape, and demo interactions, but when you actually try to use it, it feels... overloaded. It's hard to parse what anything does or means. It's not built from first principles.

That's because software isn't a visible feature list. Mature products are packed with invisible judgment: defaults, permissions, sync behavior, recovery flows, keyboard behavior, data models, and edge cases that only look obvious after years of use.

Asking AI to implement a SaaS clone shouldn't be, "Hey, reverse engineer this for me." Instead, you need to understand the stable primitives of the workflow you need and then customize from there.

Templates can really help with this, and lately my team has been open sourcing a bunch of the most helpful bits of our internal tools, to help folks get started. They're free to use. You just need an AI subscription to something like Cl

A good template should be generic. It's just there to capture the basic shape of the category that SaaS has been defining for decades. For instance:

Mail needs inboxes, threads, accounts, and fast triage.
Content needs documents, editors, comments, and sharing.
Analytics needs queries, dashboards, charts, saved views, and metric context.
Forms need questions, responses, branching, and exports.

Once those primitives are captured in a cloneable template, you can focus on what you want to customize about that stable base instead of reinventing the wheel. Let the agent add custom actions, data integrations, dashboards, automations, shortcuts, and workflow-specific UI.

That's the difference between slop forks and useful clones. Slop copies everything badly. Useful clones start with the basics and only add what you need. They become yours over time.

Agents + UI is way more powerful

A basic cloned app is useful. An agent-native cloned app is where things get way more exciting.

The old AI pattern was to bolt a chatbot onto an existing product. It could answer questions, summarize, draft, or trigger a few narrow actions, but it didn't really share the UI's context or operate the same workflows a human could.

Agent-native apps start from a different premise: the agent and the UI are one system.

The UI gives us humans (👋) a structured surface to work in: dashboards, timelines, tables, documents, canvases, inboxes, and forms. The agent gets access to the app's actions, state, and context, so it can see what changed, understand what object is selected, and choose the operation that would actually help.

The UI gives the workflow a repeatable shape. The agent gives that shape flexibility.
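
One way to picture "the agent and the UI are one system" is a single action registry that both sides call into. The sketch below is a minimal illustration under that assumption; it is not the Agent Native framework's API, and every name in it (`ACTIONS`, `ui_click`, `agent_call`, the chart and dashboard fields) is hypothetical.

```python
from typing import Callable

# App state that both the UI and the agent read and write.
state = {"selected_chart": None, "dashboards": []}

# Registry of named actions: the single entry point for every change.
ACTIONS: dict[str, Callable[..., None]] = {}


def action(fn: Callable[..., None]) -> Callable[..., None]:
    """Register a function as an app action under its own name."""
    ACTIONS[fn.__name__] = fn
    return fn


@action
def select_chart(chart_id: str) -> None:
    state["selected_chart"] = chart_id


@action
def save_dashboard(title: str) -> None:
    # Uses the current selection, so the agent acts on what the user sees.
    state["dashboards"].append({"title": title, "chart": state["selected_chart"]})


def ui_click(name: str, **kwargs) -> None:
    """A button handler in the UI dispatches an action by name."""
    ACTIONS[name](**kwargs)


def agent_call(name: str, **kwargs) -> None:
    """A tool call from the agent goes through the same registry."""
    ACTIONS[name](**kwargs)


ui_click("select_chart", chart_id="signups-by-week")
agent_call("save_dashboard", title="Event prep")
print(state["dashboards"])
```

Because the human's selection and the agent's save go through the same state and the same actions, the agent's dashboard picks up the chart the user just selected without any extra plumbing.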

A cloneable mail client is useful. A cloneable mail client with an agent that understands your inbox, CRM context, calendar, team, and preferred workflows is a different kind of tool.

The same pattern applies to content, analytics, clips, slides, forms, calendars, video editors—whatever. Once the agent is native to the app, even a basic interface becomes a way to expand what the model can do.

For instance, I'm writing this article in an agent-native content tool (Notion clone), where I can press one button to publish to our blog when I'm done, and the app will report back to me on the article's success or failure. Wildly helpful stuff.

Owned data makes the app smarter

Customization isn't even the biggest advantage of owned, team-specific software. The biggest advantage is all the context you can give the app and agent.

Generic SaaS knows only the data it is allowed to know. Your own app can start from the data the workflow actually needs.

An email client can see the relationship, deal status, and likely need behind a message. Our analytics app can build an event-prep view by pulling together account data, call context, and internal notes that normally live in separate systems.

The content workspace I'm writing in can understand the publishing pipeline, brand constraints, SEO research, articles, and performance data. Internal linking stops being a scavenger hunt; the tool shows me, in one place, every article linked from a draft, then suggests related pages by conversion signal.

This is where generalized SaaS starts to feel strange. If your team has the data, and AI can turn it into interfaces and automations, why trap every workflow inside a vendor's product assumptions?

A2A makes apps a network

Traditional SaaS integrations depend on vendor incentives. Companies aren't always motivated to make products work deeply with competitors, and even good integrations reflect the vendor's flows.

A2A gives agent-native apps another path: apps can talk directly across team workflows. If I have an analytics question while writing content, I can just ask from the content workspace. Slides can turn Gong calls into a pitch deck. Mail can ask Calendar for availability.

Suddenly, the simple primitives feel more like legos. Your individual cloned apps become a software graph.

If you own the apps, protocols, and data flow, you don't have to negotiate between vendors for the integrations you need. It's just part of the architecture.

Where enterprise SaaS still makes sense

This isn't an argument that every company should replace every SaaS product with AI-coded internal tools. Stability, compliance, support, procurement, security, permissioning, trust, and long-term maintenance all still matter.

But agent-native apps don't need to beat enterprise SaaS everywhere. They just need to be specific, owned, and stable enough for personal workflows, internal tools, and technical-team workflows.

That stability gap isn't gone yet, but it's closing fast with this latest generation of AI coding tools. And the customization gap is already enormous. When the rented tool is stable but misfit, and the cloned tool is stable enough but shaped around the work, more teams will choose the tool that fits.

Make software that fits you better

The future of SaaS isn't that every product disappears. It's that more software becomes something you can own, reshape, and connect—especially as more folks make better and better open source templates.

We've been working on that here with our agent-native cloneable templates for the categories teams already rent: mail, calendar, content, slides, video, analytics, clips, design, forms, a Dispatch control plane, and a minimal Starter scaffold. They map to familiar Superhuman, Gmail, Calendar, Notion, Docs, Pitch, Looker, Loom, Figma, Canva, and Typeform-style workflows.

So, pick a workflow that's been annoying you lately, or a tool you're tired of paying for. Clone the closest template, try it with real work, connect the data that matters, and ask the agent to shape the app around how you actually operate.

If it helps, keep shaping it. If it breaks, open an issue. If you improve it, send a PR.

And if the idea clicks, feel free to star the repo, join the Discord, and help define what agent-native software should become.

With any luck, the next generation of software won't be a thousand subscriptions we juggle. It'll be tools we clone, customize, and make fit.

Read the full post on the Builder.io blog

6 Best GitHub Copilot Alternatives in 2026

Tue, 05 May 2026 18:00:00 GMT

If you're on annual Copilot Pro+, your Sonnet 4.6 multiplier just went from 1x to 9x. Opus 4.6 went from 3x to 27x. All of it bills starting June 1, 2026, when GitHub's usage-based billing replaces the flat-fee model. Base prices stay the same and code completions stay free, but everything agentic now meters by the token.
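
To see what those multipliers mean in practice, here is a back-of-the-envelope sketch. It assumes a multiplier simply scales how much of a fixed allowance each request consumes; the 900-unit allowance is a made-up number chosen for easy arithmetic, not a GitHub figure.

```python
def requests_per_allowance(allowance: int, multiplier: int) -> int:
    """Whole requests that fit in an allowance at a given multiplier."""
    return allowance // multiplier


ALLOWANCE = 900  # hypothetical monthly allowance, in units

# (model, old multiplier, new multiplier) as quoted in the paragraph above.
for model, old, new in [("Sonnet 4.6", 1, 9), ("Opus 4.6", 3, 27)]:
    before = requests_per_allowance(ALLOWANCE, old)
    after = requests_per_allowance(ALLOWANCE, new)
    print(f"{model}: {before} -> {after} requests per {ALLOWANCE} units")
```

Both changes are the same ratio: 9 / 1 and 27 / 3 each mean a request costs nine times what it did before.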

By the end of this post you'll have a confident pick from six AI coding tools that make sense at the new prices.

Cursor

Cursor is the AI-native IDE most developers leaving Copilot land on first, a VS Code fork with first-party agent mode (Composer 2), cloud-run background agents, and Tab autocomplete the community has called best-in-class since 2024.

Cursor is a VS Code fork. Tab autocomplete runs on the Supermaven model Cursor acquired in 2024 (Supermaven itself is sunsetting; the tech now powers Cursor's predictive editing). Composer 2 handles multi-file agent runs, and Cloud Agents spawn parallel runs in remote sandboxes so you don't tie up your laptop.

Key features:

Tab autocomplete
Agent mode for multi-file changes
Cloud Agents for parallel runs in remote sandboxes
Codebase indexing with @-mentions for context
All major models supported (Claude, GPT-5.5, Gemini, Grok)
51.7% SWE-Bench solve rate versus Copilot's 56.0%

Best for:

Developers wanting a familiar VS Code feel plus deeper agent mode

Pricing:

Hobby (Free): limited Tab completions and limited Agent requests
Pro $20/month: usage-based credit pool drawn against your model spend (the older fast-vs-slow request tiering retired in June 2025)
Pro+ $60/month
Ultra $200/month

Bottom line. The default community pick. For a one-for-one Copilot replacement that goes further on agentic work, start here. For the head-to-head against Claude Code, read our Claude Code vs Cursor comparison.

Builder.io

Builder.io is the only entry on this list built for whole teams. It's a collaborative workspace where designers, product managers, and engineers work alongside parallel AI agents, and where engineers approve every change before it ships. Where the rest of this list is a single-developer coding assistant, Builder is a multiplayer layer on top of whichever one you pick.

Solo coding has never been faster. The handoff from designer to engineer to QA has never moved more slowly in comparison. Cursor and Claude Code make individual developers faster, but they don't let a designer push a real component change or a PM update copy without filing a ticket. Builder.io collapses that handoff. Designers, PMs, and engineers contribute inside the same workspace; agents run in parallel; engineers gate the merge.

Key features:

Multiplayer workspace: designers, PMs, engineers, and agents in one branch
Massively parallel agents: spawn dozens of agent runs in cloud containers without contending for your laptop
Design-to-code workflow: turns Figma into production React, Vue, Svelte, Angular, Qwik, Solid, HTML, React Native, Kotlin, or Flutter
Engineer-gated merge: every agent change ships only after a developer approves it

Best for:

Teams whose bottleneck is team-wide shipping and not individual developer speed

Pricing:

Free: 5 users, 15 daily / 60 monthly Agent Credits
Pro: pay-as-you-go ($25 per 500 Agent Credits)
Team: $40/user/month
Enterprise: custom seats and Agent Credits

Bottom line. Pick Builder.io when you want a team-wide workspace where designers, PMs, and engineers work alongside parallel AI agents, and where engineers approve every change before it ships.

Claude Code

Claude Code is Anthropic's agentic coding tool. It runs in your terminal, in VS Code or JetBrains, in the desktop app, or on the web, bundled with Claude.ai Pro or billed per token.

Claude Code started as a terminal-first agent and has grown into a multi-surface tool. It still runs as a CLI command (the spiritual home for terminal-native engineers), and ships first-party VS Code and JetBrains extensions, a desktop app, and a web client. The differentiator is shape: Claude Code stays out of your editor. It runs as an agent that reads files, runs commands, and opens PRs, then steps aside.

Key features:

Terminal-first agentic CLI with VS Code, JetBrains, Desktop, and Web surfaces
MCP support, sub-agents, routines, scheduled tasks
GitHub Actions integration for CI-driven runs
Sonnet 4.6 and Opus 4.7 (the models powering most agent benchmarks today)
Direct access to Anthropic models without provider markup

Best for:

Engineers who treat AI as a junior collaborator on multi-step tasks
Anyone already paying for Claude.ai Pro or Max (Claude Code is included)

Pricing:

Bundled with Claude.ai Pro ($20/month) or Max (from $100/month, with a higher tier above)
Pay-per-token: Sonnet 4.6 input $3/MTok, output $15/MTok; Opus 4.7 input $5/MTok, output $25/MTok
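
For a feel of what pay-per-token means for a single agent run, here is a small estimate using the list prices quoted above. The token counts are a hypothetical run, the dictionary keys are labels for this sketch (not API model IDs), and the math ignores any caching or batch discounts.

```python
# USD per million tokens, taken from the list prices quoted above.
PRICES = {
    "sonnet-4.6": {"input": 3.00, "output": 15.00},
    "opus-4.7": {"input": 5.00, "output": 25.00},
}


def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one run at list price."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


# Hypothetical agent run: 400K tokens read, 20K tokens written.
for model in PRICES:
    print(model, f"${run_cost(model, 400_000, 20_000):.2f}")
```

For that hypothetical run, the estimate comes to $1.50 on Sonnet 4.6 and $2.50 on Opus 4.7, so a handful of runs like it each working day would pass the $20 Pro price within a month.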

Bottom line. The most powerful single-developer agent on this list. Pair it with Builder.io when the work crosses from "coding alone" to "shipping as a team."

Codex

Codex is OpenAI's desktop command center for agentic coding. The desktop app runs on macOS and Windows, and is built for supervising multiple long-running coding agents across local folders, isolated worktrees, and cloud environments.

Codex is OpenAI's coding agent, but the desktop app is the part worth separating from the CLI and IDE extension. The Codex app gives you a dedicated workspace for project threads, diff review, Git operations, automations, and skills. It is closer to an agent operations console than a code editor: you choose a project, start a Local, Worktree, or Cloud thread, review changes in the app, comment on diffs, and commit, push, or open a PR without switching tools.

Key features:

Native desktop app for macOS and Windows
Parallel project threads for running multiple Codex tasks side by side
Local, Worktree, and Cloud modes for choosing where the agent works
Built-in Git diff review, inline comments, staging, commit, push, and PR creation
Integrated terminal scoped to each thread or worktree
Automations for scheduled recurring tasks
Skills support shared across app workflows
Windows-native sandboxing; Computer Use for macOS GUI tasks

Best for:

Developers who want to supervise multiple agent runs from a dedicated desktop app
Teams already using ChatGPT plans and wanting OpenAI's first-party coding agent without living in an editor fork

Pricing:

Free $0/month and Go $8/month: Codex included for a limited time
Plus $20/month: focused Codex usage with access to the latest models and higher limits than Free/Go
Pro from $100/month: 5x or 20x higher Codex usage than Plus
Business: pay as you go, with larger virtual machines, admin controls, and credits-based extension
Enterprise / Edu: custom access with priority processing, audit logs, analytics, data controls, and enterprise security

Bottom line. Pick Codex if you want a desktop app for orchestrating many coding agents, not another editor. It is strongest when your workflow is "start parallel agent threads, review diffs, and ship from one command center."

Windsurf

Windsurf is Cognition's AI-native IDE, a Cursor-shape product with a more generous free tier. Teams pick Windsurf over Cursor mostly for budget; agent mode and model selection are broadly comparable.

Windsurf is a VS Code-style fork with Cascade (Windsurf's in-editor agent), Tab autocomplete, Supercomplete intent prediction, and SWE-1.5, Windsurf's proprietary model that ships alongside the major closed-source ones. (Codeium rebranded to Windsurf in April 2025; Cognition acquired the company in July 2025.)

Key features:

Cascade in-editor agent, Tab autocomplete, Supercomplete intent prediction
SWE-1.5 (Windsurf's proprietary model)
All major premium models supported (Claude, GPT-5.5, Gemini)
Generous free tier refreshing daily and weekly

Best for:

Developers wanting a VS Code-style editor with a more generous free tier
Mixed-model shops (Windsurf supports all premium models)

Pricing:

Free: refreshing quota daily and weekly (no expiration)
Pro $20/month
Max $200/month
Teams $40/user/month

Bottom line. Pick Windsurf if Cursor's $20 Pro plan feels expensive and you want a similar shape.

Zed

Zed is the fastest AI editor on this list, a Rust-native, GPU-accelerated editor from the ex-Atom and Tree-sitter team that runs agentic edits via an open Agent Client Protocol (ACP) and ships Edit Prediction powered by Zeta2, an open-weights model. Zed hit 1.0 on April 29, 2026 after five years of cross-platform development. Philosophy: "out-of-your-face" AI that hides until you ask for it.

Zed is built in Rust by Nathan Sobo, Antonio Scandurra, and Max Brunsfeld, the team behind Atom, Electron, and Tree-sitter. No Electron under the hood; GPU shaders drive sub-millisecond input latency at 120 FPS. ACP makes Gemini CLI a first-class native agent and lets Zed host Claude Agent, Codex, and Cursor through official adapters. Edit Prediction runs on Zeta2, an open-weights next-edit model. Zed is actively developing DeltaDB, a pre-GA CRDT sync engine that tracks every change at character granularity.

Key features:

Edit Prediction (Zeta2): open-weights next-edit model, hidden by default, surfaces only when a modifier is held.
Agent Client Protocol (ACP): open standard for hosting any agent. Gemini CLI runs natively; Claude Agent, Codex, and Cursor run through official adapters; community ACP agents install from a registry without plugins.
Parallel Agents and Agent Metrics dashboard: run multiple AI threads across projects, track adoption and turn times for engineering ROI.
Native development tooling: Git, DAP debugger, LSP, multibuffer editing, remote development, Jupyter REPL, Dev Containers, vim/Helix bindings.
Disable-AI one-click: turn off all AI features for regulated environments or pure-editor workflows.

Best for:

Developers who want the lowest-latency editor on the list (native Rust, no Electron)
Teams that prefer "out-of-your-face" AI hidden until asked, the opposite of Cursor's always-on overlay
Open-source-leaning developers who want an open-weights edit-prediction model (Zeta2) and an open agent protocol

Pricing:

Free / Personal $0 forever: 2,000 accepted edit predictions/month, plus unlimited use with your own API keys or external agents
Pro $10/month: unlimited edit predictions, $5 monthly token credits, usage-based billing at API list price + 10%, with a configurable spending cap. Two-week trial includes $20 token credits.
Student: Free Pro for university enrollees for 12 months ($10/month token credits, unlimited predictions)
Zed for Business: centralized billing, role-based access controls, team management; SSO, usage analytics, and data-privacy guarantees on the Enterprise tier
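To see how the Pro numbers above might add up in a given month, here's a minimal sketch. It assumes the 10% markup applies to token usage at list price, that the $5 monthly credit is consumed before any overage is billed, and that the spending cap limits only the overage on top of the subscription. Zed's actual billing rules may differ, and the function and parameter names are hypothetical.

```python
from decimal import Decimal

# Figures taken from the pricing list above.
PRO_BASE = Decimal("10.00")         # Pro subscription per month
INCLUDED_CREDITS = Decimal("5.00")  # monthly token credits on Pro
MARKUP = Decimal("0.10")            # "API list price + 10%"


def estimate_zed_pro_bill(list_price_usage, spending_cap):
    """Rough monthly estimate (hypothetical helper, not Zed's billing logic).

    list_price_usage: token usage for the month, priced at API list price.
    spending_cap: the most you allow to be billed beyond the subscription.
    """
    # Assumption: the markup is applied to usage before credits are deducted.
    billed_usage = Decimal(list_price_usage) * (1 + MARKUP)
    # Assumption: included credits absorb usage first; only the rest is billed.
    overage = max(billed_usage - INCLUDED_CREDITS, Decimal("0"))
    # Assumption: the cap limits overage only, not the base subscription.
    overage = min(overage, Decimal(spending_cap))
    return (PRO_BASE + overage).quantize(Decimal("0.01"))
```

Under those assumptions, $20 of list-price usage with a $50 cap comes to $27.00 for the month: $22.00 after markup, minus $5.00 in credits, plus the $10.00 subscription.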

Bottom line. Pick Zed if input latency, native performance, and open AI architecture matter more than the absolute-best autocomplete (Cursor still leads there). Zed is the editor for developers who want AI on tap, only when asked.

How to pick the right GitHub Copilot alternative?

If your team's bottleneck is collaboration, pick Builder.io. For the closest one-for-one Copilot replacement, pick Cursor. For the most powerful single-developer agent on hard problems, pick Claude Code. For a desktop command center that supervises parallel coding agents, pick Codex. For the strongest free tier from a Cursor-shape product, pick Windsurf. For the fastest editor with native Rust performance and open AI architecture, pick Zed. The right tool maps to the shape of your work.

Announcing Quality Review Agent: Agentic QA on Every PR

Thu, 30 Apr 2026 18:00:00 GMT

Your team is shipping more code than ever. Code review agents can read the diff and flag what they find right in the PR, but they don't open your product, click a button, or fill out a form.

Today we're launching Quality Review Agent.

For every PR your team opens, an agent loads your product in a real browser and uses it the way one of your customers would. It clicks. It types. It walks every flow the change touches. A code review runs alongside the same diff. You've got full coverage on every PR.

An agent that uses your product on every PR

Quality Review Agent spins up a browser on every PR. It clicks, types, and navigates through the change and every surface it affects, and checks that it works.

Coverage runs at three layers:

Critical flows: The happy path for the change itself.
Edge cases: Empty states, invalid input, rate limits, error paths.
Regression: Whether this change broke anything around it.

The agent reads the PR title, description, and diff to decide what to test. A tweak to a dashboard filter, for example, re-tests the charts that depend on it.
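To make the dashboard-filter example concrete, here's a minimal sketch of one way to decide what to re-test: start at the changed surface and walk outward through a map of what depends on what. This illustrates the idea only and is not how Quality Review Agent works internally. The `affected_surfaces` function, the `DEPENDENTS` map, and the surface names are all hypothetical.

```python
from collections import deque

# Hypothetical map: each surface points to the surfaces that depend on it.
DEPENDENTS = {
    "dashboard_filter": ["revenue_chart", "signup_chart"],
    "revenue_chart": ["export_button"],
}


def affected_surfaces(changed, dependents):
    """Return the changed surfaces plus everything downstream of them."""
    seen = set(changed)
    queue = deque(changed)
    while queue:
        surface = queue.popleft()
        # Visit each dependent once, so cycles in the map cannot loop forever.
        for dependent in dependents.get(surface, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen
```

With this map, a change to `dashboard_filter` pulls in both charts and the export button behind the revenue chart, while a change to an unrelated surface only re-tests itself.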

Every run posts a list of issues straight to the PR. Each flagged bug comes with a "Fix in Builder" button. Click it, describe the fix in plain English (or let the agent resolve it), and the update pushes back to the same PR for a re-run.

Replays with reasoning, network calls, and console output

Every flagged bug comes with the full run that produced it. The replay is a video of the agent walking through your product, with three panels synced to the timeline:

Agent reasoning. What it did and why, at each step.
Network calls. Every request the agent triggered.
Console output. Every log and error on the page.

Scrub frame by frame to see exactly what the agent saw. Play it at 8x to skim the boring parts. Jump straight to the second a bug fired. At that exact frame, the network panel has the failed request and the console has the exception.

Paired with code review for full coverage

Pair Quality Review Agent with a code review on the same PR. High-severity blocks a merge. Medium gets a reviewer's eyes before approval. Every flagged issue has a "Fix in Builder" button.
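The severity rules above can be sketched as a simple merge gate. This is a minimal illustration, not Builder's API: the `Severity` enum and `merge_gate` function are hypothetical, and treating low-severity issues as non-blocking is an assumption, since the post only describes high and medium.

```python
from enum import Enum


class Severity(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def merge_gate(issue_severities):
    """Map the severities flagged on a PR to a merge decision."""
    severities = list(issue_severities)
    # High-severity issues block the merge outright.
    if any(s is Severity.HIGH for s in severities):
        return "blocked"
    # Medium-severity issues need a reviewer's eyes before approval.
    if any(s is Severity.MEDIUM for s in severities):
        return "needs_reviewer"
    # Assumption: low-severity issues (or none at all) do not hold up the merge.
    return "mergeable"
```

The checks run from most to least severe, so a PR with one high-severity issue stays blocked no matter how many lesser issues sit beside it.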

Between Functional Testing and Code Review, every change gets a full first pass before a human reviewer looks at it.

Quality Review Agent runs on GitHub PRs. Support for GitLab, Bitbucket, and Azure DevOps is on the way.

Catch bugs before the PR even opens (coming soon)

Today, Quality Review Agent runs the moment a PR opens. Soon you can trigger it on a local branch from wherever you code, whether that's Claude Code, Codex or Cursor.

The agent runs against your code and sends back a report with the video, bugs, network calls, console output, and a single command to fix all issues found.

Full coverage on every PR

Every PR your team opens gets a real-browser run and a code review on the diff. The person who caused an issue can fix it in Builder, whether that's the PM, designer, or agent who opened the branch.

Try Quality Review Agent

Scaling across a large team? Request an enterprise trial

When Agents Work for the Whole Team

Fri, 01 May 2026 18:00:00 GMT

When every role can prompt agents, validate in real time, and move work forward directly, the handoffs stop piling up. Here's what that looks like in practice.

When companies adopt AI coding tools, the workflow usually looks like this: developers gain access, individual contributor productivity increases, and delivery timelines remain flat. AI made developers faster at the one step they already owned, and left everything around that step exactly as it was.

The teams closing the gap between AI promise and actual delivery throughput are taking a different approach. They're putting agents in the hands of the whole product team, not just the engineers.

What the standard workflow is actually costing you

The standard product workflow is sequential by design. A PM defines the work, a designer shapes it, an engineer builds it, and QA validates it. Each step waits for the previous one to finish, and each handoff carries a queue. This structure made sense when it was built because code was genuinely expensive to produce. Every change had to flow through the one function that could produce it. Everything before coding was prep work, meant to ensure engineers didn't have to recode anything once they got started.

That assumption is now outdated. Agents can produce working code from a prompt, and the cost of generating a first implementation has dropped close to zero. The question is no longer whether your team can afford to code something; it's who gets to write the prompt.

When only developers interact with agents, the sequential structure stays intact. Designers file redlines and wait for engineers to interpret them. PMs write specs that sit in sprint backlogs. QA waits until something is nearly finished before testing it. Engineers field clarification questions that interrupt their focus. Making the coding step faster doesn't change any of that. The workflow moves quickly in one narrow lane and at the same pace everywhere else.

This is why delivery metrics stay flat for most organizations after AI adoption. Individual velocity improved. The handoffs didn't.

What changes when the whole team has access

When every role can interact directly with agents, the sequential structure begins to collapse. A designer can refine spacing and interactions directly in code without having to file a redline. A PM can turn a ticket into a working prototype without opening a Jira comment thread. QA can reproduce a bug, prompt a fix, and verify it in the same session. None of that work needs to touch an engineer until it's already been reviewed and validated by the people who would have generated rework cycles anyway.

The mechanics of this shift are worth walking through concretely, because the abstract version undersells how much it changes the actual experience of building software.

A product idea comes in. A PM kicks off an agent on the real codebase, gets a working implementation, and shares a preview link with the team. There's no spec document. There's no ticket waiting in a sprint backlog. There's a live branch with running code that anyone on the team can open in a browser. From there, the work moves in parallel:

A designer opens the branch in a visual editor, fixes the layout, adjusts component spacing, and confirms the interaction behavior matches what was intended.
QA steps through the flows, finds an edge case, and prompts a fix.
The PM shares the preview with a customer, collects feedback, and makes a copy change on the spot.

By the time the branch reaches an engineer for final review, it has already been through design QA, functional testing, and a real-user feedback loop. The engineer reviews code, approves what ships, and moves on. They never opened a redline document. They never responded to a Slack message asking them to clarify a spec. They never fixed a spacing issue that a designer could have handled in thirty seconds with the right tool.

This is what multiplayer AI development actually means in practice. Every role moves work forward in the medium they understand. A designer who spots a spacing problem fixes it in the visual editor. A PM who has a copy change makes it directly in the branch. QA who finds an edge case prompts the fix and verifies it on the spot. None of that work routes through engineering.

Why context is the precondition

None of this works if agents are generating generic code. A PM who prompts a change and gets back output that ignores your component library or overrides your design tokens hasn't saved anyone time. The work still lands on engineers, just in a worse form than if the engineer had built it from scratch.

The precondition for everything described above is context. Agents need to know your real system: your components, your tokens, your architectural patterns, and the reasoning behind decisions your team has already made. Builder indexes your codebase directly, reads your Figma component maps, and builds a model of how your design system actually works, not an interpretation of what it looks like, but a full understanding of the relationships between components, tokens, and patterns. When that context is in place, AI output matches your codebase from the first generation. Designers can refine it without having to deal with foreign component names. QA can test it against real behavior. Engineers can approve it without rewriting it first.

Context also shapes the feedback loop. When a PM builds a working prototype using your actual design system, stakeholder and customer feedback focus on something that looks and behaves like your real product. When a designer makes a refinement in a live branch, the refinement that goes to review is the actual change, not an approximation that an engineer would need to interpret. Every step that uses real context produces outputs that don't require translation before the next step can begin.

This is the mechanism that collapses the handoff cost. Every role can participate without creating downstream cleanup work for the people who come after them. It's why teams that try to stitch together disconnected tools — a coding agent here, a design handoff tool there — still end up with the same queues they started with. Integrated context across the full workflow is what makes the difference.

What this means for engineering teams specifically

There's a version of this that sounds threatening to engineering teams, and it's worth addressing directly. Giving non-engineers the ability to write to a codebase raises legitimate questions about code quality, adherence to the design system, and what happens to standards when people who don't fully understand the system start making changes.

Engineers don't lose control in this model; they gain a better-defined scope of what that control actually means. Engineers retain merge authority. Nothing ships without their review. What changes is what the review contains by the time it reaches them.

When every role contributes through a workflow with structured approval stages, engineers receive pull requests that have already been reviewed by the people with the most context on what the change was supposed to do. The designer confirmed it looks right. The PM confirmed it behaves correctly. QA confirmed it doesn't break anything obvious. The engineer reviews the code itself, not its intent. That's a significantly smaller and more valuable scope of work than reviewing everything from scratch while also fielding questions about what the spec actually meant.

Senior engineers didn't become senior engineers because they're good at moving buttons. They became senior engineers because they're good at making hard technical decisions, maintaining system integrity under pressure, and spotting the kinds of problems that only become visible at scale. A workflow that keeps their attention on those problems and routes everything else to the people better positioned to handle it is a better use of their time. Teams that have made this shift describe engineers finally focusing on architecture and hard problems rather than translating specs into pixels.

The org-level delivery change

Organizations that adopt this model tend to describe the experience the same way: they stop feeling like engineering is the bottleneck and start feeling like the whole team is building together.

Features move from idea to production faster because the feedback loop starts earlier and runs in parallel. Fewer changes require rework at the end because each step is validated in context by the people with the most relevant expertise. Engineers spend more of their time on work that's genuinely hard and genuinely interesting, which matters for retention and for the quality of what they build.

The gap between AI's promise and delivery narrows as the workflow finally matches its capabilities. AI made code generation fast. Taking advantage of that requires redesigning the workflow around it. When every role can drive agents, prompt changes, and move work forward without waiting for someone else's queue to clear, the full pipeline becomes fast, not just one step in it. The path from prototype to production shortens because validation occurs continuously rather than at the end.

Every enterprise has the same graveyard of failed AI POCs. They promised speed. They delivered rework. The difference between those projects and the ones that actually change delivery throughput is almost always the same: whether AI was given to one function or built into how the whole team works together.

The handoff era isn't fading; it's over.

If your team has adopted AI tools and delivery timelines haven't moved, the workflow is the problem. Builder puts agents in the hands of your entire product team, connected to your real codebase, design system, and existing review process.

Start building for free, or talk to our team.

Claude Design Review: An Innovative Way to Brainstorm with AI

Wed, 29 Apr 2026 18:00:00 GMT

So, is Figma dead yet?

I spent a few days with Anthropic’s new tool, Claude Design, to figure out that answer.

And the thing is, Claude Design isn’t really trying to be Figma. Or Lovable, for that matter. It's more just… a design workspace, where iteration feels more like whiteboarding than prompt engineering.

I’m pleasantly surprised. The good stuff is genuinely good: a canvas that asks you questions, generates multiple design options, gives you sliders to tweak instead of re-prompting, lets you annotate directly, and bundles everything up for your coding agent of choice when you're ready to build.

But it’s also an early product, with a lot of cruft. Claude Design is best at helping you figure out what a product could look like. It's much less useful once your designs need to live inside real, changing applications. (But no worries, I'll cover how to work through that at the end.)

What Claude Design is

Claude Design is a browser-based tool from Anthropic Labs for generating and iterating on visual work with Claude. They position it for prototypes, product concepts, decks, one-pagers, and marketing assets rather than app generation alone.

You can start from a prompt, upload assets, bring in documents, capture an existing website, or feed it codebase context. Claude then generates a visual project you can refine through chat, annotations, comments, direct edits, and even generated controls. Then, you export it or hand it off to your actual coding agent.

Claude Design is completely different from Claude Code.

Claude Code is the agentic coding system that reads your local codebases, makes cross-file changes, runs tests, and delivers committed code.
Claude Design is a separate visual surface that produces a handoff bundle when you're ready to implement. (Design is actually a separate app that’s not bundled with the Claude desktop app, which I found a bit disappointing for workflow.)

Put simply, Claude Code is focused on devs, and Claude Design is focused on designers.

Prompting with designers in mind

My favorite thing about Claude Design is that it doesn't make every interaction feel like prompt engineering.

In my testing, I told it things as simple as “make me a new homepage” or “redesign the blog hero” and that was enough for it to get going.

Because after your simple prompt, it asks you questions. A lot of questions. This is more than just a quick question tool in the prompt window. Claude Design uses the entire canvas to give you a taste exam.

It’s a little thing, but I find the expanded interface very freeing. I know there will always be lots of clarifying questions, and I have space to doodle my thoughts or select multiple answers to the same question.

And I found that the questions were very relevant to what we were designing, too.

This fits my design workflow. I don’t always know what I want when I’m staring at a text-heavy chat thread and a blank canvas. Instead of trying to describe the perfect result and know about all my context beforehand, I get to lean on the AI to do what it does best: go figure out the mundane things for me and just ask me about questions of taste.

And sometimes you just need the tool to show you a few directions, ask what feels right, and help you narrow the space. It’s less “type prompt, receive UI” and more like a guided creative process.

Tweaks instead of re-prompting

Another really interesting idea in Claude Design is the Tweaks panel.

When Claude generates a design, it also generates custom controls for that design: sliders, toggles, and adjustable parameters that let you change parts of the result without asking the model to regenerate everything. (If you’ve ever worked on video games, it’s a lot like building your own custom tooling to make assets.)

I found this to be a genuinely cool interaction model. Most AI design tools make every refinement conversational. Want more spacing? Prompt the model. Want a darker background? Prompt the model. Want the hero to feel more editorial and less SaaS-y?

Design, and especially design with AI, is about taste and restraint. Usually AI models go overboard on (or don’t do enough of) whatever it is you tell them to do. So, being able to play with a slider that tweaks spacing or intensity in real time is super helpful.

That said, the feature as it is today is still pretty buggy. Nobody has figured out how to do great generative UI yet, and Anthropic is no exception. Sometimes tweaks are mutually exclusive, and not all permutations actually react to your intent. And I often wanted more tuning parameters and fewer drastically different options.

But the product idea is strong: AI design tools should give designers tweakable ranges instead of hard answers.

Visual feedback makes Claude Design feel designer-native

Another thing AI design tools should let designers do is communicate visually with the model.

Claude Design’s scratchpad feature is a nice example here. It's a lightweight drawing surface that stays available even while the agent is working (thanks to Claude Design’s tabbed workflow), and you can then send your doodles to the agent as context.

Sometimes you don't want to describe a layout in paragraphs. You want to draw an arrow, circle an area, sketch a rough shape, and say, "more like this." For that, Design also lets you annotate and comment directly on the project.

In these ways, Claude Design feels genuinely designer-oriented. Instead of focusing on ways to type, it gives you ways to point.

Edit mode doesn't quite give designers enough control

Now, for the big one. As designers, we want to get our hands dirty and move things around ourselves on the page.

Here, I have to say, I’m both pleased and pretty disappointed.

On the one hand, Claude Design’s edit mode feels way more designer-oriented than most AI coding tools, using language like “tracking” instead of arbitrary CSS values that only devs know.

Plus, it limits what you can do by the element selected. If you select a simple sentence, you get font-related options. If you select a grid, you get layout options. Just because all options are available in code all the time doesn’t mean Claude Design shows you a massive panel of unrelated controls, like other tools. It shows the ones that you actually need.

All that said, what you can actually edit on the page is pretty limited. You can’t just grab elements and freely move them around like you would in Figma or Builder. For anything bigger than borders, colors, font options, and margins, you’re back to talking to the AI.

There’s a tension here. Claude Design has this great visual surface, but it’s not a flexible design canvas in the way you’d expect. You get *some* direct control, but not enough to get real design work done without resorting to a game of AI telephone.

Multiple design options are great, but the canvas gets in the way

The last really cool thing I appreciated about Claude Design is that it can generate multiple options per prompt turn and then lay them out on the same canvas for you to compare.

Sure, it takes the AI much longer to do (and uses way more tokens), but often, designing is about thinking outside the box, and I found it super helpful to tell the model to generate 4 vastly different-looking things for me to bounce ideas around.

That said, the feature is currently buggy. When you get multiple design options generated, the canvas mode changes to a more traditional “pan around the page” design mode, and you can’t scroll or easily interact with the static-ish mocks it presents. (The tweak option in the bottom right of my screenshot was not working.)

Hopefully, this gets fixed soon, because as it is, the feature isn’t super usable.

Claude Design feels early when you need production-level fidelity

There are several other rough edges in Claude Design, especially given it’s such a new product.

Figma support is one of the biggest examples. In my testing, I couldn’t work with the Figma MCP or individual Figma frames. I had to upload an entire .fig file, which is really awkward when I just wanted to bring over one flow, frame, or component state and then had to create a whole new file instead.

History and branching is another gap. There’s no way to go back in the AI chat history, which means that if the AI messes up your design, you’re stuck with it. Plus, I found that switching between commenting, annotating, editing, and other modes sometimes auto-sent things to the model before I was ready.

Overall, I felt a big lack of fine-grained control. It’s a frustrating tool once you’re ready to refine the design. **And, crucially, there’s no export to Figma.** This means you’re stuck with whatever you make.

Claude Design breaks for team workflows

Claude Design is a fast tool for design exploration and a genuinely innovative way for designers to work with AI. Other than the actual AI generations, which take minutes, clicking around the interface lets you hop between ideas quickly and narrow down the sometimes overwhelming space of design.

But it starts to fall apart when you use it as part of a team working on an ongoing application.

Let me explain what I mean.

Fast, but not persistent

Claude Design feels so lightweight because no part of it is your real app. Even though it can reference your code, it can’t run it. It doesn’t sync your latest production code. It doesn’t preserve any relationship with your app as it changes.

If you’re doing a one-off concept, that’s not a problem. But it breaks over time.

Imagine you use Claude Design to explore a new marketing page, hand it off to your engineers, and then the site changes a bit over the next few months.

When you next come back to that same part of the product, you can’t just open up your old Claude Design project, because it won’t be a real representation of your app. You have to start fresh: new project, re-add context, generate from scratch again.

The problem here is that most product design isn’t net-new ideation. It’s returning to a surface that already exists, understanding what changed, making another iteration, and getting that change reviewed.

Context isn't continuity

The “start from scratch” wouldn’t be such a big problem if there were reliable ways to get your current application state into Claude Design in the first place.

For this, Claude Design has several options, in addition to screenshots and your own prompt engineering:

Figma import.
Custom design system.
Code import.

We already talked about some of the frustrations of the Figma import not being fine-grained enough and not having roundtrip back to Figma. But a bigger point is just… you’re adding another tool to a workflow that ideally wouldn’t need it.

Because if Claude Design *knew* your codebase and design system perfectly, you could use its AI to start from scratch without messing around in Figma.

So, let’s look at the design system features, which a lot of designers are really hyped about.

The setup flow is really straightforward, and Claude generates a basic design system that references your brand’s typography, colors, and even spacing values. This part was easy to do.

And in my testing, its generations produced something that looked directionally related to the product I gave it as context.

But there's a big difference between approximating a design system and working from the real one. The output doesn’t use our real code components. It has no 1:1 relationship with the existing implementation our team already worked hard on.

The fidelity was maybe 50%-75% at best: close enough to feel relevant at a glance, but not close enough to trust as a production representation.

And unfortunately, it’s the same with importing your code repository. Claude uses the code as context, but it doesn’t actually work inside your repo or run your app to look at the rendered state. It’s just glancing through the files to get the vibe.

Handoffs are ugly

Because the code Claude Design generates doesn’t use your team’s stack, design tokens, or real components, it’s just a prototype.

And that’s fine. Anthropic has said that Claude Design is meant to export to Claude Code, which can read a codebase, work in a real repo, make changes across files, run tests, and deliver code.

But realistically, Claude Code is hard for designers to use. It’s powerful, but it’s a developer-oriented tool. It asks you to deal with local repos, dependency installation, environment variables, app preview servers, Git branches, commits, pull requests, syncing, and merge conflicts.

And if all that doesn’t scare you, then it makes way more sense just to start there anyway. Claude Design is an extra step that, frankly, sits in a really awkward middle spot between Figma and Claude Code.

You might as well stick to Figma.

Of course, most designers on a bigger team would want to use Claude Design and then hand off to engineers, much like you would from Figma.

This is fine, in theory, but no one really saved any time. Your engineers now have to recreate your implementation from scratch, which may diverge from the existing design system in subtle ways. You’re back to all the existing problems with design handoffs.

A better design handoff exists

We’ve thought about designers, AI, and handoffs an awful lot, and we’ve designed common workflows right into our product.

Claude Design is strong when you want to explore what should exist. Builder is stronger when you need to safely change what already exists.

Builder connects to your team’s repo one time (devs can set this up), and then anyone on your team can access their own branches with a live preview, a visual editor, and AI (Claude, GPT, whatever you like) that can see and manipulate your real app inside a safe sandbox.

Any changes you make use your real design tokens and real code components. And you can test them out against real data in your application.

When you’re done, you can tag in the proper team for review, and they can continue work on the same branch, making sure that your prototyping effort (the tokens, time, and taste you put in) doesn’t go to waste. No more design fidelity loss.

This also makes it far easier to make small design tweaks that would otherwise be backlogged for eternity. You might be browsing your team’s marketing site and notice a button that needs to move 2px to the left. Great. Just open the editor, move the button, and hit submit.

Your engineers always get final say, so you’ll never break production. You’re free to play around without worrying about jumping between a bunch of different tools. And if you do still want to roundtrip with Figma, that’s completely supported.

You can still use Claude Design with Builder

If you still want to use Claude Design, but you’re interested in Builder, we also support importing your Claude Design projects at any phase of your workflow.

Claude Design is a hint at the future of design workflows

Claude Design is genuinely innovative. It makes AI feel less like a blank chat box and more like a creative partner: one that can ask questions, generate options, expose sliders, accept annotations, and turn rough intent into something visual.

But the closer you get to production, the more the workflow depends on handoff: to a coding agent, to the repo, and to the team's review process.

Builder takes a different path. Instead of making a better artifact to hand off, it gives designers a visual way to work in the existing app, with the repo, preview, branch, and PR flow already handled.

Claude Design is for exploring what should exist. Builder is for safely changing what already exists.

v0 Alternatives for 2026

Tue, 28 Apr 2026 18:00:00 GMT

v0 is fast. You describe a UI, you get React and Tailwind, and you're iterating in seconds.

The catch is that the output lives inside v0. The shadcn-style components are nice, but the moment your prototype turns into a real product, you want that same speed inside your own repo — with a backend, a build pipeline, and a review workflow your team already trusts.

Usually, if you're looking for a v0 alternative, it's because you've outgrown component-by-component generation. You need a tool that can keep the prompt-driven speed, but stop pretending the UI is the whole app.

So, here are the options when you grow out of v0, or want a tool that does the same thing with a different workflow. I've grouped them by where you want to land after the prototype.

Skip the comparison if you already know what you want.
If you're outgrowing v0 because the UI is great but the rest of your product lives in a real repo, see what changed in Builder 2.0 — it's the v0-style prompt loop, but every change lands as a PR on a real branch.

What is v0?

v0 by Vercel is a prompt-to-UI generator. You describe a screen, and it returns a React component built on shadcn/ui and Tailwind, with an iframe preview you can keep refining in chat.

The reason it took off is the same reason designers love Figma autolayout: the loop is tight. You're not configuring a project — you're describing a UI, watching it render, and adjusting one prompt at a time. For a hero section, a settings page, or a dashboard sketch, it's hard to beat.

The reason people start looking for an alternative is also pretty consistent. v0 generates components, not applications. There's no real backend story, no native PR flow on top of an existing repo, and the project effectively lives on Vercel until you copy the code out and finish the wiring yourself. If you're past the "first screen" stage, that gap gets expensive.

Hosted, v0-style prompt-to-app loops

These are the closest functional substitutes — keep the prompt loop, swap who hosts it.

Bolt.new

Bolt.new is StackBlitz's hosted AI IDE. You prompt, Bolt scaffolds a full-stack app in the browser (frontend, backend, sometimes a database), and you watch it run in a WebContainer next to the chat. It's the most v0-adjacent tool that also gives you a real file tree.

Bolt leans heavier on code than v0 does — you'll see and edit files, not just a preview — but the loop is the same: describe, run, refine.

What to like

Real full-stack output (React + Node, with deps installed in-browser)
One-click push to GitHub when you want to leave
Works on top of an existing repo, not just greenfield projects

Tradeoffs to expect

Token-based pricing burns fast on bigger apps
The WebContainer abstraction occasionally fights real-world deps
Less polished visual editing than v0's component canvas

Works well for

Prototyping a full-stack idea end-to-end before committing to local dev
Engineers who want v0's speed but with code visibility

Lovable

Lovable is the prompt-to-app tool that everyone benchmarks against right now. You describe an app, you get a working app (frontend, auth, and database wired up) running on a shareable URL in minutes.

Compared to v0, Lovable is much more "ship the whole thing" and less "scaffold the UI." If v0 feels like a UI generator, Lovable feels like a hosted product team in a chat box.

What to like

Genuinely impressive zero-to-demo speed
Visual editing on top of generated code
Two-way GitHub sync once you connect a repo

Tradeoffs to expect

The project is born inside Lovable; pulling it cleanly into your own stack takes work
Credits run out faster than you'd think on iteration-heavy projects
Heavier opinions on stack choices than v0

Works well for

Founders who want a working app, not just a UI
Anyone validating an idea before writing a line of code

Replit Agent

Replit Agent bundles an agent, an IDE, and a hosted runtime into one tab. You describe what you want, and the agent writes, runs, and debugs the app in a Replit workspace you can share.

Where v0 is component-shaped and Bolt is project-shaped, Replit is workspace-shaped. The agent is doing the work, but you're sitting in a normal-looking IDE while it does.

What to like

Multi-language support out of the box (not just React/Next)
Live collaboration on the same workspace
Hosting and a database are right there — no separate setup

Tradeoffs to expect

Git exists, but isn't the native workflow
The agent can wander on bigger tasks; you'll babysit
Pricing is usage-based and a little hard to predict

Works well for

Backends, scripts, and data apps where v0's React-first frame doesn't fit
Teaching, demos, and anything that benefits from "click the URL, see the app"

A repo-first workflow that lets you ship more than UI

This is the slot where v0 actually leaks customers — when the UI is great, but the rest of your product is in a real repo, and you want the same prompt-driven speed there.

Builder

Builder is the v0 prompt loop, but the output lands as a PR in your real repo. You connect a GitHub repo, Builder spins up a containerized dev environment from it, and you can prompt or visually edit your real app — components, pages, even backend code — with every change shipped as a pull request on a branch.

The framing matters: v0 generates UI in a sandbox, then asks you to copy it into your repo. Builder skips that step. The repo is the project. Prompts and visual edits are just two more ways to write code, alongside your normal editor and your normal CI.

What to like

PR-first by design, so your existing review workflow keeps working
A deep visual editor on top of your real components, not a v0-only sandbox
The same project can be edited via prompt, visual canvas, or your local editor — the agentic IDE sits on top of the real codebase

Tradeoffs to expect

The "real repo" framing means a tiny bit more setup than a hosted sandbox
Best fit when you actually want a long-lived codebase, not a throwaway demo
Credit-based pricing (predictable, but worth modeling for heavy AI use)

Works well for

Teams who want v0-style iteration speed without giving up code ownership
Designers and PMs editing the same repo engineers ship from
Marketing pages, product surfaces, and internal tools that need to live next to your real code

Done copying components out of v0 into your repo?
Builder runs the same prompt loop directly on your codebase — every change is a PR, every preview is your real app. Try Builder Fusion free →

For when you're ready to learn engineering

These aren't really v0 alternatives — they're what v0 graduates use. If you've already shipped one prototype and you're tired of being naked without a code editor, this is the next step.

Claude Cowork and Claude Code

Claude Code is Anthropic's file-and-repo-aware agent. Claude Cowork wraps it in a friendlier collaboration surface. Either way, the model is reading your real files, running commands, and editing code on disk — not generating from scratch in a sandbox.

This is a step up in concept-cost. You're not just describing a screen; you're describing changes to a real codebase. But once you make that jump, the leverage is wildly higher than what v0 can give you.

What to like

Operates on your actual repo, not a sandbox copy
Excellent at multi-file refactors and "follow this pattern" tasks
Works alongside your normal Git workflow

Tradeoffs to expect

Comfort with reading diffs and managing branches is required
Usage-based pricing rewards thoughtful prompting
Less hand-holding than v0 for someone brand-new to web dev

Works well for

Engineers (or designers-becoming-engineers) who want an agent in their existing flow
Anything that touches more than just the UI layer

Cursor

Cursor is the agentic IDE that ate a lot of VS Code's mindshare. It looks like a code editor because it is one; there's just an extremely capable model living inside it that can read your repo, edit files, and explain itself in chat.

Where v0 is "I have a screen in mind, give me code," Cursor is "I have a codebase, help me move through it faster." Different surface area, different ceiling.

What to like

Real IDE ergonomics, with all the muscle memory you already have
Strong multi-file edits and codebase-aware chat
Works on any local repo — no platform lock-in

Tradeoffs to expect

No visual canvas; you're reading code
Subscription pricing on top of your existing tooling
Steepest ramp of anything in this list for someone who's never written code

Works well for

Engineers who want AI in their existing IDE
Anyone tired of context-switching between v0 and their real editor

Keep the v0 speed, but ship more than UI

If you're chasing that specific rapid UI iteration loop, Bolt.new and Lovable are the closest functional equivalents. They keep the flow, just on a different platform.

But if you're thinking about code ownership, real backends, and review workflows that survive past the demo, you probably want to start with your architecture in mind.

My recommendation, for most of you reading this: Builder. It's the only tool on this list that gives you v0's prompt-driven iteration speed and lands every change as a PR in the repo your team already ships from. If you've outgrown v0 because the output doesn't live anywhere your real product does, that's the gap Builder closes.

If you want something else:

Use Bolt.new, Lovable, or Replit Agent when you're okay with the project living its life mostly inside a hosted prototype.
Use Claude Code or Cursor when you're ready to level up and become a software developer in earnest.

Speed matters. But a pretty UI is just the first 10% of shipping a real app.

Try Builder free →

Read the full post on the Builder.io blog

Setting Up a New Claude Code Project: The Complete Guide

Fri, 24 Apr 2026 18:00:00 GMT

Most Claude Code setup guides still tell you to run npm install. That method is deprecated. The native installer requires no Node.js at all, and the real setup work starts after installation. Configuring CLAUDE.md, skills, and MCP servers turns a generic AI assistant into one that knows your codebase.

I've been using Claude Code daily for months. The difference between a default setup and a properly configured one is night and day. With a good CLAUDE.md and a couple of MCP servers, Claude stops guessing about your project and starts giving answers that fit your codebase. That setup takes 15-20 minutes. This Claude Code tutorial covers installation through your first productive session.

If you're getting started with Claude Code for the first time, start with our complete guide to Claude Code for the big picture.

How do you install Claude Code?

Install Claude Code using the native installer. Run curl -fsSL | bash on macOS or Linux, or irm | iex on Windows PowerShell. The npm method is deprecated and no longer recommended. No Node.js is required. After installing, run claude --version to verify, then launch claude in any project directory to authenticate. The official setup docs cover platform-specific details.

The install process takes about two minutes on any platform:

Run the native installer for your OS, using the curl or irm command above.
Verify the installation by running claude --version.
Authenticate on first launch by running claude in any directory. A browser window opens for OAuth login with your Claude Pro, Max, Team, or Enterprise subscription.

Alternative install methods include Homebrew (brew install --cask claude-code) and WinGet (winget install Anthropic.ClaudeCode), though these don't auto-update.

The old npm installation (npm install -g @anthropic-ai/claude-code) still works but is officially deprecated. If you installed via npm previously, run claude install to switch to the native version. I put off this migration for weeks and it took about 30 seconds.

One note for Windows users: both WSL and native PowerShell work for Claude Code. If you're already using Linux-based toolchains, WSL is the better choice since it supports full Bash tool sandboxing. Native Windows requires Git Bash.

What should your first Claude Code session look like?

Start your first Claude Code session by navigating to your project directory and running claude. Use /init to generate a starting CLAUDE.md, then explore your codebase before writing any code. Follow the Explore, Plan, Implement, Commit workflow. Always ask Claude to describe its approach and wait for your approval before it begins implementing changes.

A productive first session follows this pattern:

Navigate to your project directory and start a session by running claude.
Run /init to bootstrap a CLAUDE.md. Claude scans your project structure, identifies file types, and generates a starting configuration.
Explore before coding. Ask Claude to explain your codebase, summarize the architecture, or identify patterns. This builds its understanding of your project.
Use plan mode (Shift+Tab) to preview proposed changes before Claude implements them. This is your safety net for reviewing what Claude intends to do.

Anthropic's best practices documentation recommends an Explore, Plan, Implement, Commit workflow. Resist the urge to jump straight to "build me a feature." Let Claude understand your project first, propose a plan, and only then start implementing.

Commands worth knowing from day one:

/help -- shows all available commands
/clear -- fresh conversation without losing CLAUDE.md context
/compact -- compresses the conversation to free up context window space
/config -- opens your settings

What should your CLAUDE.md file include?

A CLAUDE.md file should include a one-paragraph project summary, your tech stack, build/test/lint commands, and key coding conventions, all within 60 to 80 lines. Use the WHY/WHAT/HOW framework: explain why the project exists, what it does, and how to work with it. Keep code style rules out of CLAUDE.md entirely and rely on linters and formatters instead.

CLAUDE.md is persistent memory. Claude reads it at the start of every session, so everything in this file shapes how Claude interacts with your code. Getting it right matters more than any other part of your Claude Code setup.

The WHY/WHAT/HOW framework (popularized by a HumanLayer blog post that earned 748 points on Hacker News) gives you a clear structure:

WHY: What the project does and what problem it solves
WHAT: Tech stack, dependencies, project structure
HOW: Build, test, lint commands and verification steps

A template that covers the essentials:

# Project: MyApp

A task management API built with Node.js and Express.

## Tech stack
- Node.js 20, Express 4, TypeScript
- PostgreSQL with Prisma ORM
- Jest for testing

## Commands
- `npm run dev` - Start dev server (port 3000)
- `npm test` - Run all tests
- `npm run test:watch` - Watch mode
- `npm run lint` - ESLint check
- `npm run build` - Production build

## Conventions
- Use async/await over raw promises
- Named exports preferred over defaults
- Error responses follow RFC 7807 format

Community consensus from developers who use Claude Code daily is to keep it under 60-80 lines. Beyond that, Claude starts deprioritizing instructions. Use deterministic tools (ESLint, Prettier) for code style enforcement rather than hoping Claude follows prose style rules consistently.

For larger projects, use @imports to reference external docs. A CLAUDE.md line like @docs/api-conventions.md tells Claude to load that file on demand. Imports chain up to 5 levels deep, so you get progressive disclosure without bloating your root file.
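
To make the depth limit concrete, here is a minimal Python sketch of how chained imports could be expanded. It is not Claude Code's implementation: the resolve_imports function, the in-memory files mapping, and the rule that an import is a whole line starting with @ are assumptions made for illustration. Only the 5-level cap comes from the text above.

```python
MAX_IMPORT_DEPTH = 5  # limit quoted in the text above


def resolve_imports(path, files, depth=0, seen=None):
    """Return the text of `path` with @imports expanded inline.

    `files` is a hypothetical dict mapping path -> file text, standing in
    for the file system. Imports past MAX_IMPORT_DEPTH are left untouched.
    """
    seen = set() if seen is None else seen
    if path in seen:  # guard against import cycles
        return ""
    seen = seen | {path}
    out = []
    for line in files[path].splitlines():
        stripped = line.strip()
        target = stripped[1:] if stripped.startswith("@") else None
        if target in files and depth < MAX_IMPORT_DEPTH:
            out.append(resolve_imports(target, files, depth + 1, seen))
        else:
            out.append(line)
    return "\n".join(out)


files = {
    "CLAUDE.md": "# Project\n@docs/api-conventions.md",
    "docs/api-conventions.md": "Use RFC 7807 errors\n@docs/errors.md",
    "docs/errors.md": "Error codes live in src/errors.ts",
}
expanded = resolve_imports("CLAUDE.md", files)
print(expanded)
```

The root file stays two lines long, yet the expanded context contains all three documents. That is the progressive disclosure the paragraph describes.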

Two features that most setup guides skip completely.

The .claude/rules/ directory lets you create modular, topic-specific rules. Each .md file in this directory loads as project memory. You can scope rules to specific file paths using YAML frontmatter:

---
paths:
  - "src/api/**/*.ts"
---

# API rules
- All endpoints must include input validation
- Use the standard error response format
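
As a rough illustration of what path scoping means, the Python sketch below checks whether a file falls under a rule's paths list. The glob semantics are an assumption (** spans any number of directories, * stays inside one path segment), and glob_to_regex and rule_applies are hypothetical helpers, not part of Claude Code.

```python
import re


def glob_to_regex(pattern):
    """Translate a glob with `**` and `*` into an anchored regex.

    Assumed semantics: `**/` spans zero or more directories, a bare `**`
    matches anything, and `*` stays within a single path segment.
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append(r"(?:.*/)?")
            i += 3
        elif pattern.startswith("**", i):
            parts.append(r".*")
            i += 2
        elif pattern[i] == "*":
            parts.append(r"[^/]*")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts) + r"\Z")


def rule_applies(rule_paths, file_path):
    """True if any glob in the rule's `paths` list matches file_path."""
    return any(glob_to_regex(p).match(file_path) for p in rule_paths)


api_rule_paths = ["src/api/**/*.ts"]  # from the frontmatter above
nested = rule_applies(api_rule_paths, "src/api/users/create.ts")
top_level = rule_applies(api_rule_paths, "src/api/index.ts")
unrelated = rule_applies(api_rule_paths, "src/web/App.tsx")
print(nested, top_level, unrelated)
```

Under these assumptions the API rules apply to any TypeScript file below src/api/ and stay out of context when Claude is editing a file elsewhere.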

CLAUDE.local.md is for personal project overrides. It's automatically added to .gitignore, so your team shares a common CLAUDE.md while each developer keeps their own preferences (sandbox URLs, preferred test data, personal shortcuts) in CLAUDE.local.md.

Claude also maintains auto memory at ~/.claude/projects//memory/, saving project patterns, debugging insights, and your preferences across sessions. You don't need to configure this. It happens in the background.

For a deeper dive into CLAUDE.md configuration, see our CLAUDE.md guide.

How do you configure skills and slash commands?

Create custom skills by adding SKILL.md files to a .claude/skills/ directory in your project. Each file uses YAML frontmatter to declare its description, allowed tools, and execution behavior. The skills system replaces the older .claude/commands/ approach, though both still work. Beyond custom skills, the Claude Code plugin ecosystem offers over 1,300 community-built skills you can install.

Skills are how you teach Claude Code repeatable workflows. A skill is a directory containing a SKILL.md file with YAML frontmatter and markdown instructions. I use them for everything from running test suites to generating changelog entries.

A test-fixing skill looks like this:

---
name: fix-tests
description: Analyze and fix failing tests in the project
allowed-tools: Read, Grep, Bash(npm test *)
---

When fixing tests:
1. Run the test suite to identify failures
2. Read failing test files and the code they test
3. Determine if the test or implementation is wrong
4. Make minimal changes to fix the issue
5. Re-run tests to verify the fix

Focus on: $ARGUMENTS

Invoke it with /fix-tests src/features/auth/ and Claude runs the full workflow.

Claude loads skills progressively. It reads only the name and description at session start, then loads the full skill content when you invoke it or when Claude determines it's relevant. This keeps your context window lean.
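
The idea can be sketched in a few lines of Python. This is an illustration of progressive loading, not Claude Code's actual loader: SkillIndex and split_frontmatter are hypothetical names, and the frontmatter parser handles only the flat key: value form used in the example skill above.

```python
def split_frontmatter(text):
    """Split a SKILL.md string into (metadata dict, body).

    Handles only flat `key: value` frontmatter between `---` lines;
    a real parser would use a YAML library.
    """
    _, header, body = text.split("---", 2)
    meta = {}
    for line in header.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta, body.strip()


class SkillIndex:
    """Keeps only name and description up front; bodies load on invoke."""

    def __init__(self, sources):
        self._sources = sources  # hypothetical: skill name -> SKILL.md text
        self.summaries = {}
        for text in sources.values():
            meta, _ = split_frontmatter(text)
            self.summaries[meta["name"]] = meta["description"]

    def invoke(self, name, arguments):
        """Load the full skill body and fill in the $ARGUMENTS placeholder."""
        _, body = split_frontmatter(self._sources[name])
        return body.replace("$ARGUMENTS", arguments)


SKILL_MD = """---
name: fix-tests
description: Analyze and fix failing tests in the project
---

When fixing tests:
1. Run the test suite to identify failures

Focus on: $ARGUMENTS
"""

index = SkillIndex({"fix-tests": SKILL_MD})
print(index.summaries)
prompt = index.invoke("fix-tests", "src/features/auth/")
print(prompt)
```

Only the short summaries sit in memory at startup. The longer instructions enter the prompt when the skill is invoked.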

The plugin ecosystem already has over 1,300 community-built skills across registries like Claude Code Plugins and Claude Plugins Directory. You'll find plugins for documentation lookup, browser automation, code intelligence, and linting. Browse what's available and install the ones that match your stack.

What MCP servers should you set up for a new project?

Configure MCP servers by running claude mcp add or by creating a .mcp.json file at your project root. Start with 2-3 servers that match your workflow. Keep project-scope servers in .mcp.json so teammates share the same tool configuration through version control.

Model Context Protocol (MCP) connects Claude Code to external tools and data sources. MCP servers let Claude query databases, pull up-to-date API docs, run browser tests, and call external APIs. They've changed my workflow more than any other Claude Code feature.

Three scopes control where MCP servers are available:

Local (default): Private to you in the current project, stored in your user config
Project (.mcp.json): Shared with your team via git
User (~/.claude/): Available across all your projects

Adding a server is one command:

claude mcp add context7 -- npx -y @upstash/context7-mcp@latest

For a new project, start with 2-3 servers that match your stack. Context7 gives Claude access to up-to-date API docs. Playwright handles browser automation. A database connector lets Claude query your schema directly.

Keep your total to 5-6 MCP servers per project. Each server consumes context window space, and loading too many dilutes Claude's attention.

Reference credentials in .mcp.json through environment variables rather than raw strings. The file goes into version control, so API keys should never appear directly in it.
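
One way to enforce this is a small pre-commit check, sketched below in Python. It assumes the mcpServers layout and the ${VAR} reference syntax for .mcp.json, so confirm both against the MCP docs for your version. The server names, package names, and the find_raw_secrets helper are made up for the example.

```python
import json
import re

# Matches ${NAME} and ${NAME:-default} style references (assumed syntax).
ENV_REF = re.compile(r"\$\{[A-Za-z_][A-Za-z0-9_]*(:-[^}]*)?\}")


def find_raw_secrets(mcp_config):
    """Return (server, key) pairs whose env value is not a ${VAR} reference."""
    problems = []
    for server, spec in mcp_config.get("mcpServers", {}).items():
        for key, value in spec.get("env", {}).items():
            if not ENV_REF.fullmatch(value):
                problems.append((server, key))
    return problems


config = json.loads("""
{
  "mcpServers": {
    "docs": {
      "command": "npx",
      "args": ["-y", "example-docs-mcp"],
      "env": {"DOCS_API_KEY": "${DOCS_API_KEY}"}
    },
    "db": {
      "command": "npx",
      "args": ["-y", "example-db-mcp"],
      "env": {"DATABASE_URL": "postgres://user:hunter2@localhost/app"}
    }
  }
}
""")

problems = find_raw_secrets(config)
print(problems)
```

The check is deliberately strict: it flags every literal env value, including harmless ones, so you may want an allowlist for non-secret settings.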

For a full walkthrough of MCP configuration, see our guide to Claude Code and MCP.

What are the best practices for a new Claude Code project?

Start every Claude Code project with git initialized and permissions at their defaults. Use plan mode to review proposed changes before Claude implements them. Keep your CLAUDE.md concise and up to date after each major architectural decision. For production work, choose the stable release channel, which runs about a week behind the latest channel and skips builds with known regressions.

These are the habits I've seen make the biggest difference across dozens of projects:

Initialize git first. Always have version control before your first Claude session. Git is your safety net for rolling back any change Claude makes.
Keep permission defaults. Claude asks before editing files or running commands. Start with these prompts enabled and learn which operations you're comfortable approving.
Use feature branches for AI-driven changes. Create a branch before asking Claude to refactor or generate significant code. Review the diff before merging.
Treat CLAUDE.md as a living document. Update it after architectural decisions, dependency changes, or workflow shifts. A stale CLAUDE.md leads to stale suggestions.
Pick a release channel. The latest channel (default) gives you new features immediately. The stable channel runs about one week behind and skips builds with known regressions. For production projects, stable is the safer choice. Configure it via /config or in your settings.json.
Start simple, then verify. Boris Cherny (engineering lead on Claude Code at Anthropic) uses a "surprisingly vanilla" setup. The tool works well out of the box. Add complexity only when you hit a real need, and review Claude's output the same way you'd review a pull request from a capable but fallible colleague.
Choose your interface. The terminal gives you full power and all features. The VS Code extension adds inline suggestions and diff views for developers who prefer staying in their editor. JetBrains support is also available. Pick what fits your workflow.

See our full list of 50 Claude Code tips and best practices.

How do you extend this setup to the rest of your team?

Share what git tracks, then bring in the people git can't reach. CLAUDE.md, .claude/skills/, and .mcp.json all live in the repo, so every developer who clones it inherits the same setup on their first claude session. That covers your engineering team. It doesn't cover the designers, PMs, and QA folks who never run Claude Code at all, and the setup work you just did doesn't extend to the parts of the workflow they own.

A designer who wants to propose a layout change still has to hand it off to engineering. A PM who wants to draft a feature update still has to spec it in a doc and wait. Your Claude Code project is ready for you to ship faster. The rest of your team still ships at pre-AI speed.

Builder 2.0 runs Claude Code in cloud containers that pick up your repo's CLAUDE.md, so the context you built (tech stack, commands, conventions) applies in every Builder session. Custom subagents and MCP servers live at the workspace level, so the team's connected tools are available to every teammate who joins a branch. A designer opens a branch in Builder's visual canvas and gets agent output generated against your real components. A PM prompts a copy change from Slack and the agent handling it has the same project context. A QA agent runs browser tests on every branch. Everyone works through a Claude Code agent that knows your project, regardless of whether they install the CLI.

Your setup is for your repo. Builder 2.0 is how every teammate inherits it.

See how Builder 2.0 picks up your Claude Code project →

FAQ

Q: What is the difference between Claude and Claude Code?

A: Claude is Anthropic's conversational AI model, accessible through claude.ai and the API. Claude Code is a terminal-based coding agent that runs locally, reads your project files, and makes real changes to your codebase. Claude Code uses Claude as its underlying model but adds file access, command execution, and project context that the Claude Desktop chat interface doesn't have.

Q: Do I need Node.js to install Claude Code?

A: No. The native installer (curl on macOS/Linux, PowerShell on Windows) doesn't require Node.js. The older npm installation method required Node.js 18+ but is now deprecated. Use the native installer for a cleaner setup with fewer dependencies.

Q: How do I start a new project in Claude Code?

A: Create a project directory, initialize git, navigate into the folder, and run claude to start a session. On first launch, run /init to let Claude scan your project structure and generate a starting CLAUDE.md. Then refine that file with your project summary, tech stack, and key commands. Claude reads this file at the start of every session.

Q: Can I use Claude Code in VS Code instead of the terminal?

A: Yes. Install the Claude Code extension from the VS Code marketplace. It provides inline suggestions, diff views, and an integrated terminal panel. The terminal version remains more full-featured, but the VS Code extension works well for developers who prefer staying in their editor.

Q: How do I troubleshoot Claude Code connection issues?

A: Confirm the CLI is installed by running claude --version, and check that your subscription is active on your account at claude.ai. If you're behind a corporate proxy, confirm outbound connections to Anthropic endpoints are allowed. Restarting the terminal and re-authenticating resolves most connection issues.

Wrapping up

Installation takes two minutes. The real value of your Claude Code setup comes from what you configure after: a focused CLAUDE.md, the right MCP servers for your stack, and skills that encode your team's workflows. Auto memory and path-specific rules keep improving things in the background once you've done the initial work.

Start with the native installer and /init. Then write a CLAUDE.md of 60 to 80 lines at most using the WHY/WHAT/HOW framework. That single file will improve Claude's output more than anything else you configure.

Once your project is set up, explore how to use Claude Code for daily development workflows, or dig into customizing Claude Code for advanced configuration.

And when you want this setup to reach the people on your team who don't run Claude Code, start free on Builder.io →

Read the full post on the Builder.io blog

Why is AI Agent Authentication So Hard?

Mon, 20 Apr 2026 18:00:00 GMT

Maybe you've run into this.

Cursor can read your Notion workspace just fine, but then it immediately hits a 403 when it tries to update the page it just summarized.

Claude kicks off a sub-agent to triage a Linear issue, and suddenly that sub-agent has all the same access the parent did, including Slack, GitHub, and everything else.

Copilot works through a multi-step refactor across three repos, and when you check the GitHub audit log, it all looks like one human user did the whole thing, with no way to tell which agent handled each step.

Agent auth is hard because OAuth gets you through login, but it doesn’t really solve delegation, runtime authorization, or auditability, even though agents need all three at once.

Let's take a look at where things actually break, why OAuth on its own isn't enough, and what you can do about it today.

Why AI agents need their own identity

If your team’s agents mostly work but keep breaking in weird places like Linear, Notion, GitHub, and your internal APIs, this is usually why. Agents sit between people and the systems they use. They need an identity of their own, but that identity still has to stay tied to the user or system that authorized it.

On top of that, their permissions can change from task to task, they may need to stop and ask for approval, and after a few handoffs, you still need a clear audit trail showing who actually did what.

A dead giveaway that you're stuck in this messy middle is when your audit log says the user did everything, even though it was actually an agent, or even a sub-agent three steps removed, that pushed the commit.

Three ways AI agent authentication fails

These three failure modes show up again and again, and once you recognize them, you’ll start spotting them behind almost every agent integration bug your team files.

The agent can read the doc, but it still can’t edit the page. It has a perfectly valid OAuth token with wiki:write, opens the postmortem in your internal wiki, and then immediately gets a 403 when it tries to make a change. The token isn’t really the problem. The page has its own ACL, a separate access list outside OAuth permissions, and while the user is on that list, delegated agents usually aren’t.

The OAuth scope is basically saying, “this app can write wiki pages.” But the page-level ACL is saying something much narrower: “this specific person can edit this specific page.” That kind of resource-level rule lives completely outside the things scopes were designed to express.
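
The two layers can be sketched as two separate checks. The Python below is a toy model, not any real wiki's API: Token, Page, and authorize_edit are hypothetical, and the status codes stand in for whatever the service would return.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Token:
    user: str
    agent: Optional[str]  # None when the human acts directly
    scopes: frozenset


@dataclass
class Page:
    title: str
    editors: set = field(default_factory=set)  # page-level ACL


def authorize_edit(token, page):
    """Return an HTTP-style status for an edit attempt."""
    if "wiki:write" not in token.scopes:
        return 403  # app-level scope missing
    principal = token.agent or token.user
    if principal not in page.editors:
        return 403  # scope is fine, but the page ACL says no
    return 200


postmortem = Page("Postmortem", editors={"alice"})
human = Token("alice", None, frozenset({"wiki:write"}))
delegated = Token("alice", "summarizer-agent", frozenset({"wiki:write"}))

human_status = authorize_edit(human, postmortem)
agent_status = authorize_edit(delegated, postmortem)
print(human_status, agent_status)
```

Both tokens carry the same scope. The second request fails only because the page's own list names the person and not the agent acting for her.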

A sub-agent inherits the parent’s full scope. Say a parent agent has repo:write and wiki:write, then spins up a child agent just to summarize a doc. In practice, that child often ends up with both permissions anyway. Suddenly a harmless summarization step has the same blast radius as the whole workflow. And OAuth doesn’t really give you a clean way to say, “this child only gets wiki:read for the next ten minutes, and only on this one page.”
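
For contrast, here is a minimal Python sketch of what narrowed delegation could look like if a platform supported it. Grant, delegate, and allows are hypothetical names, and the parent is assumed to also hold wiki:read so that the child's request is a strict subset of it.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    scopes: frozenset
    resource: str      # "*" means any resource
    expires_at: float  # Unix timestamp


def delegate(parent, scopes, resource, ttl_seconds, now=None):
    """Derive a child grant that can never exceed the parent's."""
    now = time.time() if now is None else now
    requested = frozenset(scopes)
    if not requested <= parent.scopes:
        extra = sorted(requested - parent.scopes)
        raise PermissionError(f"not held by parent: {extra}")
    if parent.resource not in ("*", resource):
        raise PermissionError("parent cannot delegate this resource")
    expires_at = min(parent.expires_at, now + ttl_seconds)
    return Grant(requested, resource, expires_at)


def allows(grant, scope, resource, now=None):
    """Check scope, resource, and expiry together."""
    now = time.time() if now is None else now
    return (scope in grant.scopes
            and grant.resource in ("*", resource)
            and now < grant.expires_at)


parent = Grant(frozenset({"repo:write", "wiki:write", "wiki:read"}), "*", 10_000.0)
child = delegate(parent, {"wiki:read"}, "wiki/postmortem", ttl_seconds=600, now=1_000.0)

can_read = allows(child, "wiki:read", "wiki/postmortem", now=1_300.0)
can_push = allows(child, "repo:write", "wiki/postmortem", now=1_300.0)
after_expiry = allows(child, "wiki:read", "wiki/postmortem", now=1_700.0)
print(can_read, can_push, after_expiry)
```

The child can read the one page for ten minutes and nothing else, so the summarization step no longer shares the blast radius of the whole workflow.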

After three handoffs, the audit log can’t tell who actually did the work. A user kicks off a workflow, that workflow calls a planner agent, the planner calls a code-writing agent, and eventually a commit gets pushed. But downstream, the system still just sees the same user token, so everything gets attributed to the user. When someone has to untangle it on Monday, there’s no clear way to tell which agent handled step three.

These aren't authentication problems. They're runtime authorization problems.

Once the user is logged in, the real questions are more concrete: can this agent access this specific resource, does a child agent automatically inherit all of the parent’s permissions, and after a few handoffs, can the audit log still tell you who actually did what? Oso calls this the runtime authorization problem. Once the token is issued, the authorization server can’t really see how the agent is using it.

All three failures come from the same basic mismatch. OAuth scopes get set when the token is minted, and after that, the authorization server is basically out of the loop. So the answer isn't some cleverer OAuth flow. It's about adding a few things OAuth was never really built to handle: delegation that names both the user and the agent, runtime checks that are more precise than scopes, and a real identity for the agent itself.

How token exchange fixes agent delegation

What actually carries that user-and-agent relationship over the wire is RFC 8693 token exchange. The key idea is that the agent's identity travels with the user's identity, not instead of it. When a tool handles this properly, the audit log shows both who the human was and which agent took the action, rather than the usual mess where it all looks like one user did everything.
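
RFC 8693 represents this with a nested act (actor) claim: sub names the user, the outermost act names the current actor, and any act nested inside it names earlier actors in the chain. The Python sketch below turns already-decoded claims into an audit line. The agent names and the audit_line format are invented for the example, and signature verification is out of scope here.

```python
def delegation_chain(claims):
    """Return [subject, current actor, earlier actors...] from nested `act` claims."""
    chain = [claims["sub"]]
    actor = claims.get("act")
    while actor is not None:
        chain.append(actor["sub"])
        actor = actor.get("act")
    return chain


def audit_line(claims, action):
    """Format a log entry that names both the human and the acting agent."""
    user, *actors = delegation_chain(claims)
    if not actors:
        return f"{user} performed {action}"
    # The outermost `act` is the current actor; nested ones acted earlier.
    current, *earlier = actors
    via = f" (via {' <- '.join(earlier)})" if earlier else ""
    return f"{current} performed {action} on behalf of {user}{via}"


claims = {
    "sub": "alice@example.com",
    "act": {
        "sub": "code-writer-agent",
        "act": {"sub": "planner-agent"},
    },
}
line = audit_line(claims, "git push")
print(line)
```

With a token shaped like this, the log can say which agent pushed the commit and for whom, without attributing everything to the user.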

The IETF’s OAuth on-behalf-of-user draft, currently at -02, starts to make this more official by spelling out what that token exchange between the agent and user should look like.

Ask any MCP server this before you let your team use it: when it calls a downstream API, does it just forward your token, or does it exchange it? If the answer is "we pass it through," that's a red flag. As Aembit points out in its MCP guidance, every extra system that sees a forwarded token is another place that token can leak, and the downstream API still has no idea the MCP server was involved.

The safer pattern is token exchange: the server swaps the user's token for a new one scoped to that specific downstream API and carrying the agent's identity alongside the user's.
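
A sketch of the request body for that swap, in Python. The parameter names and URN values come from RFC 8693. The token strings, audience URL, scope name, and the build_exchange_request helper are placeholders, and actually sending the request to an authorization server is left out.

```python
from urllib.parse import parse_qs, urlencode

TOKEN_EXCHANGE = "urn:ietf:params:oauth:grant-type:token-exchange"
ACCESS_TOKEN = "urn:ietf:params:oauth:token-type:access_token"


def build_exchange_request(user_token, agent_token, audience, scope):
    """Form-encoded body for an RFC 8693 token exchange request."""
    return urlencode({
        "grant_type": TOKEN_EXCHANGE,
        "subject_token": user_token,        # who the action is for
        "subject_token_type": ACCESS_TOKEN,
        "actor_token": agent_token,         # who is acting
        "actor_token_type": ACCESS_TOKEN,
        "audience": audience,               # the one downstream API
        "scope": scope,                     # narrowed for this call
    })


body = build_exchange_request(
    user_token="user-token-placeholder",
    agent_token="agent-token-placeholder",
    audience="https://api.tracker.example",
    scope="issues:read",
)
fields = parse_qs(body)
print(sorted(fields))
```

The server would POST this body to its authorization server's token endpoint and use the returned token for the downstream call, so the user's original token never leaves the server.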

In practice, the quick gut-check is pretty simple: vendors doing this the right way will explicitly talk about token exchange (RFC 8693) or on-behalf-of flows in their auth docs, and they’ll explain that the server has its own separate credential too. Vendors doing it the wrong way will say something like, “we just forward your OAuth token to the downstream API,” or worse, they won’t explain how that downstream call is authenticated at all.

How to evaluate AI agent tools and MCP servers

This is usually more of a tool selection problem than an implementation problem. Before your team adopts any agent tool, MCP server, or platform, it’s worth asking a few basic questions:

Does the agent have its own identity, or is it basically operating as the user?
What happens when a sub-agent kicks in? Does it automatically get all the same permissions as the parent, or is it limited to a smaller scope?
Will the audit log show which agent actually did what, or does everything just get attributed back to the human who kicked it off?
Does the tool exchange your token for one of its own before calling downstream APIs, or does it just pass yours straight through?
Is the agent using credentials that expire when they should, or is it sitting on a long-lived API key?

The red flags in vendor docs are usually the opposite of those questions:

“We use OAuth,” but they never explain what they actually mean by that.
A setup guide that tells you to paste in a long-lived API key.
No explanation of what sub-agents can access or whether they just inherit everything.
Passing your token directly from the MCP server to the downstream API.

If you see any of those, assume all three failure modes are still very much in play for your team.

A few practices are worth standardizing no matter which vendor you use. Don’t give agents tokens that are broader than the job in front of them. Be extra careful with tools that can spawn sub-agents until they clearly explain how delegation works. And when you can, favor tools that keep the agent contained, so if one step ends up with too many permissions, the blast radius is limited by more than just the OAuth scope.
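One way to picture "no broader than the job" for sub-agents is a simple intersection rule: a sub-agent gets only the scopes it asks for and only if the parent already holds them. The sketch below assumes scopes are plain strings; the function and scope names are hypothetical.

```typescript
// Grant a sub-agent only the scopes it requests AND the parent already holds.
function delegateScopes(parentScopes: string[], requested: string[]): string[] {
  return requested.filter((scope) => parentScopes.indexOf(scope) !== -1);
}

const leadAgentScopes = ["repo:read", "repo:write", "issues:read"];
const subAgentScopes = delegateScopes(leadAgentScopes, ["repo:read", "secrets:read"]);
// subAgentScopes is ["repo:read"]: "secrets:read" is dropped because the
// parent never had it, and "repo:write" is not passed on because the task
// did not ask for it.
```

If a vendor can't tell you which rule their sub-agents follow, assume it is "inherit everything."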

If you're picking an authorization layer for your team's internal tools, OpenFGA is a solid option for relationship-based permissions and gives you a real audit trail. SPIFFE can handle the workload identity side.

This is starting to move beyond standards docs and into real identity products. Okta’s Cross App Access is one of the first signs that this shift is actually happening.

The easiest place to start is with the MCP servers your team already uses. Pull up each vendor’s auth docs, ask the five questions above, and sort them into two buckets: “exchanges tokens” and “forwards tokens.” That quick pass usually tells you which integrations are recreating the same three failure modes and which ones aren’t.

Until tools start treating delegated software actors as their own category, instead of just a proxy for the user or some generic service account, agent auth problems are going to keep showing up as architecture problems.

Why agent containment matters as much as the spec stack

At the platform level, the big question is: who on your team can safely run agents in the first place?

Containment is what makes this possible. If each agent gets its own scoped environment, a designer or PM can hand it a task and keep moving without needing an engineer to double-check whether it can access production secrets, the wrong repo, or a teammate’s branch, because the environment already says no.

That’s hard to pull off in Claude Code or Cursor, where the agent works against your local filesystem with all the reach of your shell. Whether it’s “safe for anyone to use” then depends heavily on how carefully the machine was set up by its user, who may or may not have the technical know-how.

The ideal is per-agent containment. Each agent gets its own scoped environment, with its own filesystem, network access, and credentials. That means even if a sub-agent ends up with more permission than it should, the container still limits the blast radius before OAuth scopes even come into play. It’s basically the runtime version of the same idea the spec stack is aiming for: giving the agent a real, narrow identity instead of having it inherit whatever access the parent already has.
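To illustrate the shape of a per-agent policy, here is a small sketch. It is only a model of the idea: real containment is enforced by the container runtime and network layer, not by application code like this. The `SandboxPolicy` type, its field names, and the example paths and hosts are all hypothetical.

```typescript
interface SandboxPolicy {
  workspaceRoot: string;  // the only directory tree the agent may touch
  allowedHosts: string[]; // outbound network allowlist
}

// True only for paths inside the agent's own workspace.
function canAccessPath(policy: SandboxPolicy, path: string): boolean {
  if (path.split("/").indexOf("..") !== -1) return false; // no traversal
  const root = policy.workspaceRoot.replace(/\/+$/, "");
  return path === root || path.indexOf(root + "/") === 0;
}

// True only for hosts on the allowlist.
function canReachHost(policy: SandboxPolicy, host: string): boolean {
  return policy.allowedHosts.indexOf(host) !== -1;
}

const policy: SandboxPolicy = {
  workspaceRoot: "/workspace/agent-42",
  allowedHosts: ["registry.npmjs.org"],
};

const ownFile = canAccessPath(policy, "/workspace/agent-42/src/app.ts");  // true
const otherAgent = canAccessPath(policy, "/workspace/agent-43/.env");     // false
const prodDb = canReachHost(policy, "prod-db.internal");                  // false
```

The point is that these answers hold no matter what OAuth scopes the agent's token happens to carry.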

Builder does this by default, and engineers can use it in tandem with Claude Code or other tools they already love. Every agent runs in its own scoped cloud container, with its own service account, network policy, and credential rotation already set up, so the platform team doesn’t have to stitch all of that together themselves.

Read the full post on the Builder.io blog

How 270 Developers, Designers, and PMs Went from PRD to Working App in 60 Minutes

Wed, 22 Apr 2026 18:00:00 GMT

At a recent live Builder workshop, 270 developers, designers, and PMs built a working calendar app from a one-page PRD. They started with a blank project, attached a PDF, typed a prompt, and watched Builder generate a functional calendar using Google's Material Design component library. Sixty minutes later, most had added features they hadn't planned for:

Natural language event creation
An agenda view
Dark mode with a toggle
Drag-and-drop rescheduling

None of them wrote a line of CSS.

The workshop demonstrated what the workflow looks like when the whole team builds on the same codebase, using the same design system, with review and approval built into every step before a PR reaches engineering. Each attendee connected to a shared Builder project pre-configured with Material Design 3 components and tokens, attached a product requirements doc, entered a single prompt, and Builder created a feature branch with a working app.

From there, they iterated in interact mode, asked the agent to suggest high-impact additions from the design system, picked one, and had it implemented. A design-minded attendee adjusted header colors in the visual editor and told the agent to use design tokens rather than the specific hex she'd selected in the style panel. Someone else added a Google Meet integration. Several people built out responsive calendar views.

When they were ready to hand off, they clicked "Send PR." Builder generated a pull request with a full summary of every change, organized by commit. Engineers on the call could pull the branch name into their local environment and keep working, or leave a comment in the PR for the Builder bot to action, which pushed a new commit in seconds.

That last part matters for engineering leaders. The PR your team receives has already been through product review, design QA, and functional testing by the time it lands. Developers read the diff, check code quality, and merge. They spend their time on decisions that require engineering judgment, not on translating a Figma file into components that already exist in the design system.

If you're evaluating this workflow for your team:

Design system completeness determines output quality. Builder's code generation depends directly on how well your component library is indexed. Teams with complete, well-documented design systems get production-ready output. Teams starting from scratch get something closer to a prototype.
Prototypes built this way tend to stay in the codebase. 80% of the prototypes Builder's internal team creates are merged into production PRs. The calendar app attendees built could have been merged straight into a pull request on a real repo.
The review burden shifts earlier. Because QA, design, and product all validate against a live preview URL before engineering touches anything, changes get caught when they're cheap to fix.

The real takeaway from 270 people building in parallel for an hour is that the workflow scales. Product managers, designers, and QA all contributed directly to real code, engineers stayed in review mode, and every change moved through a structured approval process before it touched a PR. That's not a demo condition. It's the same workflow your team would run in production, just compressed into a single session.

Watch the full workshop recording to see the build from prompt to PR.

Download the idea-to-production guide to map the workflow to your team's specific roles and tools.


How to build agent-native applications (and what not to do)

Tue, 21 Apr 2026 07:00:00 GMT

This is the biggest mistake I see people constantly make when building AI applications.

// Don't do this
const output = await llm(prompt)

Let me show you how to make this substantially better.

Step 1: tools and a loop

First, we need two new things. We need tools and a loop. LLMs can't do anything on their own, but you can provide them tools — for an email app it could be draftEmail, searchEmails, etc.

You send a call to an LLM, it sends back what tools it wants to run, those tools execute, and then the results are sent back to the LLM on a loop until things are complete.

You can introspect each step. You can output each piece to the UI, like progress.
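Here is a minimal sketch of that tools-plus-loop shape. It does not use any specific vendor SDK: the `Model` type, the message shapes, and `runAgent` are assumptions for illustration, and `fakeModel` is a scripted stand-in for a real LLM call so the example runs on its own.

```typescript
type ToolCall = { name: string; args: Record<string, string> };
type ModelReply = { toolCalls: ToolCall[]; text: string };
type Message = { role: "user" | "assistant" | "tool"; content: string };
type Model = (history: Message[]) => Promise<ModelReply>;
type Tool = (args: Record<string, string>) => Promise<string>;

async function runAgent(
  model: Model,
  tools: Record<string, Tool | undefined>,
  prompt: string,
  onStep: (message: Message) => void, // hook for showing progress in the UI
  maxSteps = 10,
): Promise<string> {
  const history: Message[] = [{ role: "user", content: prompt }];
  for (let step = 0; step < maxSteps; step++) {
    const reply = await model(history);
    const assistant: Message = { role: "assistant", content: reply.text };
    history.push(assistant);
    onStep(assistant);
    if (reply.toolCalls.length === 0) return reply.text; // no tools requested: done
    for (const call of reply.toolCalls) {
      const tool = tools[call.name];
      const result = tool ? await tool(call.args) : `Unknown tool: ${call.name}`;
      const toolMessage: Message = { role: "tool", content: `${call.name}: ${result}` };
      history.push(toolMessage); // results go back to the model on the next turn
      onStep(toolMessage);
    }
  }
  throw new Error("Agent did not finish within maxSteps");
}

// Scripted stand-in for an LLM: asks for one search, then answers.
const fakeModel: Model = async (history) =>
  history.some((m) => m.role === "tool")
    ? { toolCalls: [], text: "Found 2 emails from Sam." }
    : { toolCalls: [{ name: "searchEmails", args: { from: "sam" } }], text: "Searching..." };

const tools = {
  searchEmails: async (args: Record<string, string>) => `2 results for ${args.from}`,
};
```

Calling `runAgent(fakeModel, tools, "Any mail from Sam?", render)` invokes `render` once per step, which is the introspection point described above. The `maxSteps` cap is a design choice in this sketch so a confused model can't loop forever.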

Step 2: stop assuming the AI is correct

But this still has one massive issue. We're still assuming the AI is correct.

In this case, we're just running through a loop and then doing something with the results without giving users a way to give the feedback that we know is so critical for non-deterministic systems.

So the better thing you can do is this: build a UI that shows the streaming result as the agent is outputting things. Give users a way to stop it, give feedback, queue the next message.
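A rough sketch of those three controls (show output as it arrives, stop, queue the next message) might look like the following. To stay short it uses a synchronous generator in place of real network streaming, which is a simplification. The `AgentSession` class and `fakeStream` are hypothetical names.

```typescript
class AgentSession {
  private stopped = false;
  private queue: string[] = [];

  stop(): void {
    this.stopped = true;
  }

  queueMessage(text: string): void {
    this.queue.push(text);
  }

  // Yields output chunks one at a time. Between chunks it checks whether the
  // user pressed stop; after each message it moves on to the next queued one.
  *run(
    firstMessage: string,
    produce: (message: string) => Iterable<string>,
  ): Generator<string, void, undefined> {
    let next: string | undefined = firstMessage;
    while (next !== undefined && !this.stopped) {
      for (const chunk of produce(next)) {
        if (this.stopped) return;
        yield chunk;
      }
      next = this.queue.shift();
    }
  }
}

// Stand-in for a streaming model: two chunks per message.
const fakeStream = (message: string) => [`[${message}] part 1`, `[${message}] part 2`];

const session = new AgentSession();
const shown: string[] = [];
session.queueMessage("make it shorter"); // user queues feedback while it runs
for (const chunk of session.run("draft a reply", fakeStream)) {
  shown.push(chunk);                      // a UI would render the chunk here
  if (shown.length === 3) session.stop(); // user hits stop mid-answer
}
// shown holds both parts of the first answer and only part 1 of the second
```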

This is sort of the state of the art today. But I actually think we can do one solid step better.

Step 3: customization (instructions, skills, memory)

The reason things like Claude Code, Codex, and OpenCode are so powerful is that they let you customize your agents a lot more.

You can give them all kinds of custom instructions unique to you and your use case and your project. You can give them additional files as context that they can reference right from a file system. You can give them skills. They can keep track of their own memory as they learn and improve.

These things can make a crazy difference and are a big reason why Claude Code is exploding so much right now.

But then you're probably wondering: how do I provide all of that in my application? That's a lot to build.

I've personally come to the opinion that this is the better pattern that pretty much every application should adopt if possible. But it's true — it's complex to build a Claude Code fitted for your application that is user-friendly, has the right permissions and guardrails, and just generally makes sense.

Agent-Native

I've been working on an open source project called Agent-Native. It's very early, but it does a couple of interesting things.

The first one is that your application is defined as a set of actions. These actions are exposed over APIs, so your frontend will use these same actions — for email, searchMail, draftMail, etc.

The agent can use these same core actions as tools. It also has a bunch of built-in stuff that I'll show you.
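To show the general idea of "define the action once, use it from both sides," here is a small sketch. It is not Agent-Native's actual API: the `Action` shape, `callAction`, and the sample data are all made up for illustration.

```typescript
// Hypothetical shapes for illustration only.
interface Action {
  name: string;
  description: string;
  run: (input: Record<string, string>) => string;
}

const emails = [
  { from: "sam@example.com", subject: "Q3 revenue" },
  { from: "lee@example.com", subject: "Lunch?" },
];

const actions: Action[] = [
  {
    name: "searchMail",
    description: "Find emails whose subject contains a query",
    run: (input) =>
      emails
        .filter((e) => e.subject.toLowerCase().indexOf(input.query.toLowerCase()) !== -1)
        .map((e) => `${e.from}: ${e.subject}`)
        .join("\n"),
  },
];

// One lookup path serves both callers.
function callAction(name: string, input: Record<string, string>): string {
  const action = actions.filter((a) => a.name === name)[0];
  if (!action) throw new Error(`Unknown action: ${name}`);
  return action.run(input);
}

// What the agent is told it can call, derived from the same list.
const toolList = actions.map((a) => ({ name: a.name, description: a.description }));

const fromUi = callAction("searchMail", { query: "revenue" });    // e.g. from an API route
const fromAgent = callAction("searchMail", { query: "revenue" }); // e.g. from a tool call
```

Because both paths go through `callAction`, the UI and the agent can't drift apart in what "search mail" means.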

You render the agent chat + workspace anywhere you want in your application, and then users can chat with that — or you can send messages to the agents from other parts of your app.

A big advantage of applications over pure agents is that you can have workflows and buttons that give users guidance. But again, you don't want those buttons to just make an LLM call and dump a result somewhere. You want them to go through an agent so you can look, modify, and give feedback. The agent is influenced by those instructions, skills, and other customizations. And then if the output's not right, you can go back to the chat, tell it what it did wrong, and get it right the next time.

Of course you need some basics. The agent and the UI always need to be in sync — when the agent makes updates, the UI updates, and vice versa. That's what the framework provides.
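One common way to keep two writers in sync is to route every change through a single store that notifies subscribers. The sketch below shows that pattern in its simplest form. It is an assumption about the general approach, not a description of how the framework implements it, and `SharedStore` is a hypothetical name.

```typescript
type Listener<T> = (state: T) => void;

class SharedStore<T> {
  private listeners: Listener<T>[] = [];

  constructor(private state: T) {}

  get(): T {
    return this.state;
  }

  // Returns an unsubscribe function.
  subscribe(listener: Listener<T>): () => void {
    this.listeners.push(listener);
    return () => {
      this.listeners = this.listeners.filter((l) => l !== listener);
    };
  }

  // Both the UI and the agent write through this one method.
  update(change: (state: T) => T): void {
    this.state = change(this.state);
    for (const listener of this.listeners) listener(this.state);
  }
}

const store = new SharedStore({ draft: "" });
const rendered: string[] = [];
store.subscribe((s) => { rendered.push(s.draft); });    // stands in for a UI re-render

store.update(() => ({ draft: "Hi Sam," }));             // the user types
store.update((s) => ({ draft: s.draft + " thanks!" })); // the agent edits
// rendered is ["Hi Sam,", "Hi Sam, thanks!"]
```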

So we have our standard chat, but we also have our workspace. The workspace is like a Claude Code or Codex workspace where you can have your instructions and skills and memories. You can add files, you can add subagents. You can really customize it. This is stored per user, so each person can decide how they want their agent to behave in your app, and at the organization level you can set standards too. Then when you chat with the agent, it respects all of those things.
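As a sketch of how organization standards and per-user customizations might be layered into what the agent sees, consider the following. The `Workspace` shape, the section headings, and the "organization first, user second" ordering are all assumptions made for this example.

```typescript
interface Workspace {
  instructions: string[];
  skills: string[];
  memory: string[];
}

// Organization standards come first; user customizations are appended.
function buildSystemPrompt(org: Workspace, user: Workspace): string {
  const section = (title: string, items: string[]): string =>
    items.length > 0 ? `## ${title}\n${items.map((item) => `- ${item}`).join("\n")}` : "";
  return [
    section("Instructions", org.instructions.concat(user.instructions)),
    section("Skills", org.skills.concat(user.skills)),
    section("Memory", org.memory.concat(user.memory)),
  ]
    .filter((text) => text !== "")
    .join("\n\n");
}

const systemPrompt = buildSystemPrompt(
  { instructions: ["Never send email without confirmation"], skills: [], memory: [] },
  {
    instructions: ["Keep replies under 100 words"],
    skills: ["weekly-report"],
    memory: ["Prefers bullet points"],
  },
);
```

Empty sections are dropped, so a user who has customized nothing simply gets the organization's defaults.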

So I can jump in and say "add a new revenue dashboard," and based on how I set it up, it might know what all those things mean and start doing the right queries and calling the right tools to do that.

Agent + app, not either/or

And that's cool and all. We can go into full agent mode — where the agent fills the screen, and it's kind of like using a chat app entirely.

But I mentioned applications have a lot of value too, and I see people treating these things as way too either/or. I generally find most applications are better with a built-in agent, and most agents are better if they have UI capabilities. As you've seen recently in things like Cursor, Codex, and Claude Code, these tools can all generate UIs, kind of, but again, they don't work like an application.

If I'm using an agent for analytics, I'll want it to save certain views as a dashboard. I want to choose who has access to the dashboard. I want it to work like an app, but I don't want to lose any of the agent affordances.

I also want buttons. And what's cool is the buttons can take prompts. When you say something like "make me a traffic and signups dashboard," when you submit, that gets delegated to the agent. You can see the agent work. It works identically to all those Claude Code and other products you're used to, but it's native to your application.

In this framework, the agent can also see what's on your screen, update what's on your screen, navigate you to other pages, and generally speaking: if the UIs can do it, the agent can do it. And if the agent can do it, the UIs can do it.

And this doesn't require any super complicated container setup or dev boxes running machines on them. You can deploy this anywhere. You can use any LLM that you want. You can use any Drizzle-compatible database or any SQL database effectively. And it's pretty easy to use — at least I think so.

Good, better, best

Whether you want something all-encompassing like this or you just want to integrate into an existing product, I hope this ladder of sort of good, better, best is helpful.

If you want to try out Agent-Native, it's over on GitHub, totally MIT licensed. There are a bunch of example apps you can just try out and get a feel for it. And while it's super duper early, I'd love your feedback — both on the general concept and, if you try it, on the implementation.

But what do you think? Are applications better off not integrating agents? Or are agents better than applications, and nobody will have a UI in the future — it's just gonna be agents and text in Telegram, and that's the only way you ever use products?

I'd love to know your thoughts in the comments. Let me know.


10 Best Windsurf Alternatives in 2026

Fri, 17 Apr 2026 18:00:00 GMT

OpenAI offered $3 billion for Windsurf, but the deal ultimately fell apart. The big sticking point was Microsoft's IP-sharing agreement with OpenAI, which would have effectively put Windsurf's technology in the hands of the GitHub Copilot team. CEO Varun Mohan wasn't willing to let that happen, so the acquisition died.

Within hours, Google DeepMind came knocking. By July 2025, Windsurf’s founding team had signed a $2.4 billion acquihire deal to work on Gemini coding agents. Cognition AI, the company behind Devin, picked up what remained: the IDE, the IP, 350+ enterprise customers, about $82M in ARR, and a 250-person team for roughly $250 million.

Then, in March 2026, Windsurf raised its Pro plan from $15 to $20/month and switched from a credit-based model to a quota system. Developers noticed right away, and the forums lit up.

The founding team is now off building something new at Google, and Windsurf itself lives inside a company whose main bet is a different AI coding agent. So if you're wondering what that means for Windsurf going forward, here are the alternatives worth knowing about — grouped by the job you actually need done.

Note: If you want the one tool on this list built around your entire team — not just developers — Builder 2.0 combines parallel AI agents, visual editing, and PR-first collaboration on your existing repo, so designers and marketers can ship alongside engineers without touching code. It's the only AI development environment that treats code review and cross-team contribution as first-class features, not afterthoughts.

Quick comparison table

If you're just skimming, this is the best place to start. The rest of the post unpacks what each of these columns actually means.

What is Windsurf?

Windsurf (formerly Codeium, rebranded in April 2025) is an AI IDE built around Cascade, its agentic system for multi-file coding work. In 2025, Cognition AI acquired Windsurf after the founding team took a $2.4B acquihire to join Google DeepMind and work on Gemini coding agents.

What to like:

Cascade is designed for multi-file agentic work: you describe the feature, and it can go find the right files, write the code, and iterate with you as you refine it.
Familiar IDE experience that feels a lot like VS Code, so if your team is coming from a conventional editor, the switching cost is pretty minimal
350+ enterprise clients are now under Cognition ownership, with established SLAs already in place.

Tradeoffs to keep in mind:

The founding team is now at Google working on Gemini, so they’re no longer the ones directly steering Windsurf’s roadmap
The March 2026 quota change can interrupt autonomous task runs halfway through, especially for heavier users.
Cascade can get a bit shaky on more complex multi-file work, with documented issues like terminal execution loops, internal errors, and Language Server crashes on mid-size projects.

Works well for:

Teams already covered by Cognition enterprise SLAs and not in any rush to switch yet
Developers whose day-to-day work stays mostly in lighter agentic tasks, where Cascade is generally more reliable

Hosted AI coding environments

These tools keep the "describe it, run it" loop nice and tight. Like Windsurf’s hosted experience, they give you a live environment without any local setup to worry about. The tradeoff is that your project usually starts life inside their platform. If you later want to move it into a repo you fully own, that typically takes a more deliberate handoff.

Replit

Replit gives you a cloud-based development setup with basically no local configuration to worry about. The whole environment — runtime, database, and deployment — runs right in your browser. Agent 4 can also take on autonomous full-stack development with built-in browser testing, so going from "describe a feature" to "see a running app" feels fast and surprisingly smooth.

What to like:

Built-in PostgreSQL, one-click deploy, real-time multiplayer, and support for 50+ languages — all available right in the browser
No local setup to wrestle with, and the mobile app means you can build and ship from pretty much anywhere
Agent 4 can handle full-stack autonomous builds and browser testing in the same loop, so it feels much faster to go from an idea to a working app

Tradeoffs to expect:

Works best for cloud-native projects, but if your setup relies on complex local tooling or hard-to-replicate dependencies, it may start to feel restrictive
Your project lives inside Replit’s platform rather than in a standard Git repo you fully control, which can be a problem if ownership and portability matter a lot to you

Works well for:

Rapid prototyping and non-technical builders who want to test ideas fast
Distributed teams that would rather skip local infrastructure management altogether

Bolt.new

Bolt.new is StackBlitz's browser-based app generator. You tell it what you want to build, and it spins up a deployed full-stack app using WebContainer technology that runs Node.js entirely in the browser, so there’s no server setup to worry about.

What to like:

From a plain-language prompt, you can go straight to a deployed app without any local setup.
StackBlitz WebContainers run Node.js right in the browser, so you can skip backend provisioning altogether.
A great way to quickly find out whether an idea is actually worth building

Tradeoffs to keep in mind:

What you get is a solid starting point, not production-ready code.
Platform-first: your project begins on Bolt.new rather than in a repo you already control

Works well for:

Non-technical founders who want to validate an idea before investing in a full build
Demos and quick MVPs where the main goal is to prove the idea works, not to polish the codebase

Full AI IDE replacements

These are the closest things to direct Windsurf replacements: full editors with strong AI agent modes, where your code lives in your own repository from day one. Unlike hosted environments, you still control the repo, the review workflow, and the deployment stack. For a deeper look at how these compare as agentic development environments, see our agentic IDE comparison.

Cursor

Cursor is a VS Code fork built from the ground up for AI-first development. In real use, that means its agent mode can handle multi-file tasks with surprisingly little hand-holding: you describe what you want done, and it tracks down the relevant files, writes the code, runs the tests, and keeps iterating. Its tab autocomplete, powered by the Babble model from Supermaven (which Cursor acquired in November 2024), is also one of the fastest in the category.

What to like:

Most of your existing VS Code extensions work right away, with little to no reconfiguration
On Pro+ in 2026, you can run up to eight agents in parallel, and if you need tighter control, self-hosted cloud agents let you keep your code inside your own network.
You can choose from Claude, Gemini, GPT-4o, and Cursor’s own proprietary models — or just use Auto mode and let Cursor pick the best fit for the task.

Tradeoffs to keep in mind:

Cursor runs as its own editor, so if your team depends on specific VS Code extensions, it’s a good idea to test those dependencies before rolling it out more widely.
If you lean too hard on frontier models, credits can disappear fast, so in most cases it's smarter to let Auto mode decide when they're actually worth using.

Works well for:

Experienced developers working in large codebases who want cutting-edge models without giving up most of the VS Code experience
Teams that are okay switching editors in exchange for one of the strongest agent modes in the category

If you want a more detailed side-by-side look, check out the Windsurf vs. Cursor comparison. If you’ve already ruled Cursor out, we’ve also rounded up the full field of Cursor alternatives.

Zed

Zed is a code editor built in Rust, with GPU-accelerated rendering and real-time multiplayer collaboration built right in. It feels fast and stays smooth even on large files and codebases, without the memory bloat you often run into with Electron-based editors. It’s also open source, with the editor itself released under the GPL.

What to like:

Uses noticeably less memory than VS Code on large TypeScript monorepos
Real-time collaboration runs on CRDTs, so live pair-programming sessions stay smooth and don’t turn into merge-conflict chaos
MCP support means Zed belongs in the same conversation as Cursor, Cline, and Claude Code when it comes to connecting external tools, and the Pro plan also adds AI edits with a 200K context window

Tradeoffs to expect:

A smaller extension ecosystem than VS Code, so you may miss a few niche plugins or workflows you rely on
If you're on Windows, you'll need either Windows 10 version 1903+ or Windows 11 22H2+.

Works well for:

Performance-focused developers who are tired of Electron-heavy editors eating up memory on large codebases
Distributed teams that pair program often and want collaboration built into the editor itself, instead of bolted on afterward

A repo-first workflow that gives your whole team visual editing

At a certain point, a pure code editor can start to feel a little limiting. You can keep routing every change through code and PR review, or you can shift to a workflow where visual editing, AI-generated code, and team collaboration all land in the same PR — and non-developers can contribute without ever touching code. Builder 2.0 is built for exactly that moment.

Builder 2.0

Builder 2.0 is a repo-native visual IDE built for multiplayer coding, with real-time collaboration, parallel agents, and visual editing that shows up as proper Git diffs. Unlike Cursor or Windsurf, it's not just about generating code. It's built to close the gap between your design team and your development workflow, using the repo you already have.

What to like:

PR-first by design: every AI or visual edit shows up as a reviewable diff, so you can actually see what changed instead of untangling a giant wall of edits or mystery code
Parallel agents run in containerized environments, so you can have multiple AI tasks going at once without constantly worrying about merge conflicts
Visual editing means design and marketing teams can make changes in a browser editor, without ever having to touch the code.
A "Git for anyone" approach that makes branching and reviews feel much more approachable for non-developers
Works with your existing repository from day one, so you don’t have to migrate to a new platform or worry about lock-in

Tradeoffs to expect:

There’s a little more setup here than with a pure code editor, since Builder 2.0 connects to your existing repo and needs some initial configuration.
The visual editing layer really shines when design or marketing teams are working alongside developers; if you're mostly coding solo, it may not feel as compelling.

Works well for:

Frontend-heavy product teams where developers, designers, and marketers all need to ship changes together
Organizations that want the speed of AI-assisted development without giving up the accountability of a PR-first code review workflow

Want your designers and PMs shipping in the same repo as engineering?
Builder Fusion is the only tool on this list where a non-developer can make a real change and open a PR your team already knows how to review. Try Builder Fusion free →

Add AI to your existing editor

If switching editors is the main thing holding you back — whether that’s team friction, extension dependencies, or budget — these tools bring AI into the editor you already use. So there’s no new IDE to evaluate and no workflow change you have to convince your team to adopt.

GitHub Copilot

GitHub Copilot slips into the IDE you already use, so you don’t have to change editors to get started: VS Code, JetBrains, Visual Studio, and Neovim are all supported. At $10/month for Pro, it’s also one of the more affordable premium AI coding assistants, with enterprise-grade controls available on higher tiers.

What to like:

Deep GitHub integration: you get PR summaries, code review suggestions, and repo-aware context from your own repositories right out of the box.
Copilot added a proper agent mode in March 2026, letting you assign issues and multi-file tasks with less hand-holding than earlier versions.
You also get access to GPT-4o, Claude Sonnet, and Gemini under a single subscription.
The free tier includes 2,000 completions and 50 chat messages per month, which is enough to use it on real work and see whether it’s worth paying for.

Tradeoffs to expect:

The agent mode still trails Cursor and Claude Code on complex multi-file tasks, but the gap has narrowed considerably since March 2026.

Works well for:

Teams that already live in GitHub and want the easiest way to add AI coding help without changing how they work
Developers who want to try AI coding help for $10/month before committing to a more heavyweight tool

Cline

Cline is an open-source VS Code extension for full agentic coding, with MCP support and bring-your-own-API-key flexibility. There’s no subscription fee, so you only pay your actual API costs, and it shows every file operation before it runs.

What to like:

Every file read, write, and API call is shown before it happens, so you can see exactly what Cline is doing without any extra setup
BYOK gives you the flexibility to choose the provider that best fits your setup — OpenAI, Anthropic, Gemini, AWS Bedrock, or even local models through Ollama
Full MCP support also makes it easy to connect Cline to databases, GitHub, and other external tools.

Tradeoffs to keep in mind:

It takes a bit more setup than plug-and-play tools since you’ll need to bring and manage your own API keys.
If you use it a lot in autonomous mode with long-context models, your API bill can climb past what you'd pay for a flat $20/month subscription
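To get a feel for where that crossover sits, here is a back-of-the-envelope sketch. The $5 per million tokens figure is a made-up blended price for illustration, not any provider's actual rate, so plug in your own numbers.

```typescript
// Monthly token volume at which pay-per-token spend equals a flat fee.
function breakEvenTokens(flatFeeUsd: number, usdPerMillionTokens: number): number {
  return (flatFeeUsd / usdPerMillionTokens) * 1e6;
}

// Hypothetical blended price of $5 per million tokens vs. a $20/month plan.
const tokensPerMonth = breakEvenTokens(20, 5);
// tokensPerMonth is 4000000: past about 4M tokens a month, BYOK costs more
// than the flat subscription under these assumed prices.
```

Long autonomous runs that resend large contexts on every step can reach that kind of volume faster than you might expect.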

Works well for:

Cost-conscious developers and teams that want clear visibility into what the tool is doing, or need stronger control over data sovereignty
Organizations that want fine-grained control over which model handles each request

Gemini Code Assist

Gemini Code Assist stands out for having the biggest context window in this category: one million tokens. That means you can work with a huge amount of code at once without constantly trimming things down. The individual tier is free with a Google account. It’s also worth noting that some of the team behind Windsurf’s strongest ideas, including former CEO Varun Mohan and co-founder Douglas Chen, now work at Google DeepMind on Gemini coding agents.

What to like:

One-million-token context window — you can work with entire codebases at once instead of splitting everything into chunks or constantly babysitting context
Free for individual developers with a Google account, and Gemini CLI also gives you a free terminal tool with up to 1,000 requests per day
The team behind some of Windsurf’s best ideas is now helping shape Gemini Code Assist, which gives the product a familiar feel if that’s what you liked about Windsurf

Tradeoffs to keep in mind:

If hands-off, agentic task execution is what you care about most, Cursor still has the edge
Enterprise pricing is on the pricey side at $75/user/month, and MCP support still relies on third-party extensions.

Works well for:

Google Cloud and Workspace teams that want a more native, tightly integrated setup
Developers who liked Windsurf’s context handling and want a free option with an even bigger context window

For a broader look at how the underlying models compare on coding tasks, see our best LLMs for coding breakdown.

Terminal-native agents for deep autonomy

These tools skip the editor completely. You give them a task from the command line, and they dig through your codebase, write code, run tests, and keep iterating on their own. If you already spend most of your day in the terminal and like the idea of handing off whole features end to end, this category is worth trying before you settle on a GUI-based tool.

Claude Code

Claude Code is Anthropic’s terminal-native CLI agent built for handing off entire feature implementations right from your command line. It has also grown remarkably fast for a tool in this category: as of early 2026, it was estimated at $2.5 billion in annualized revenue, and “claude code” hit one million monthly searches in March 2026 — up 20x year over year.

What to like:

The CLAUDE.md file keeps track of project-specific instructions across sessions, so you don’t have to repeat your tech stack and preferences every time you start a new run
Full MCP support means you can connect Claude Code to your databases, GitHub, external APIs, and custom tools without much setup or friction
Up to a 200K-token context window on the Pro plan, and up to 1M tokens if you're using the API

Tradeoffs to expect:

There’s no free tier, so you’ll need at least the $20/month Pro plan to get started, and if you rely heavily on Opus, you may bump into 5-hour session limits.
Purely terminal-based with no GUI — so if you prefer working in a visual editor, it probably won’t feel like the right fit

Works well for:

Terminal-first developers who want to hand off entire features from start to finish without leaving the command line
Teams already using Claude, especially since CLAUDE.md helps carry project knowledge from one session to the next

Aider

Aider is a terminal-based AI coding tool that turns every AI edit into a proper Git commit. It’s free, open-source, and works over SSH. If you care more about keeping AI-assisted changes clean, reviewable, and easy to roll back than you do about having a polished editor UI, Aider is built for exactly that kind of workflow.

What to like:

Every AI edit lands in your repository as a proper Git commit, which makes it easy to review, compare, or roll back just like any change made by a teammate.
SSH support also means you can point Aider at code on a remote server and work there directly, without first pulling the entire codebase down to your local machine
It connects to OpenAI, Anthropic, Gemini, and local models through Ollama, and it even supports voice input.

Tradeoffs to expect:

It’s terminal-only, so there’s a bit of a learning curve if you’re not already comfortable with Git-heavy workflows

Works well for:

Senior engineers who already live in Git and want AI edits to fit naturally into the same review process as the rest of their code
Debugging or refactoring code on remote servers and in production-like environments

Pick a tool built to last, not one in transition

Windsurf is still around, and it’s still shipping updates. But the founding team is gone, the product changed hands, and the price went up. That’s three big signals in a single year, all pointing in the same direction.

The tools whose founding teams are still actively building them — Cursor, Claude Code, Cline, Zed, Builder 2.0 — are just moving faster in 2026 than Windsurf likely will under Cognition’s priorities. That’s not really a knock on Cognition; it’s simply the predictable result of how the company is structured now.

So what would I recommend?

Use Builder 2.0 if your team wants visual editing, multiplayer coding, and AI-generated PRs on an existing repo — especially if design or marketing needs to work alongside engineering.
Use Cursor if you want something that feels closest to a drop-in replacement for Windsurf’s IDE experience, while still giving you frontier model access and full VS Code compatibility — the Windsurf vs. Cursor comparison has the full breakdown.
Use Claude Code or Aider if you do your best work in the terminal and want to hand off full feature implementations end-to-end, with a clear review trail you can actually follow afterward.
Use Cline if you want BYOK flexibility, open-source transparency, or on-prem model deployment without having to give up the editor you already like.
Use GitHub Copilot if switching editors just isn’t realistic and you’re already deeply embedded in the GitHub ecosystem.
Use Replit or Bolt.new if you want to prototype fast in a hosted environment and you're starting from scratch rather than plugging into an existing codebase.

Switching between most of these tools is usually easier than people expect. Most install in minutes, work with your existing editor or API keys, and show you what they're good at within an afternoon. If you're deciding, pick your top two and spend a few hours with each.

If you want the version of this that's already wired up for your whole team — not just developers — try Builder Fusion free →

Frequently asked questions

Q: Is Cursor better than Windsurf right now?

In 2026, generally yes. Cursor’s agent mode tends to be more stable, its VS Code compatibility is stronger, and its Tab autocomplete — powered by Supermaven’s Babble model after the November 2024 acquisition — is still one of the fastest in the category. Its $20/month Pro plan also matches Windsurf’s current pricing. If you want the full breakdown, check out the Windsurf vs. Cursor comparison.

Q: What happened to Windsurf?

Windsurf, formerly Codeium and rebranded in April 2025, went through a dramatic shake-up. In July 2025, OpenAI’s roughly $3 billion acquisition attempt collapsed, reportedly over tensions with Microsoft about access to Windsurf’s IP. Google then brought the founding team into Google DeepMind in a $2.4 billion licensing-and-talent deal. After that, Cognition AI (the team behind Devin) bought the remaining product, IP, enterprise customers, and team for roughly $250 million. Today, Windsurf operates under Cognition ownership, with Jeff Wang serving as interim CEO.

Q: What is the best open-source Windsurf alternative?

Cline is usually the best open-source place to start. It’s the most full-featured option overall: a VS Code extension with agentic coding, MCP support, and BYOK flexibility across the major AI providers. If you’d rather work in the terminal and want cleaner Git history, Aider is probably a better fit. Zed is open source too and worth a look if raw editor performance matters most to you.

Q: Does Claude Code have a free tier?

No — Claude Code doesn’t offer a free tier. To use it, you’ll need at least a Claude Pro subscription, which starts at $20/month (or $17/month billed annually). If you’re looking for a free terminal-based alternative, Gemini CLI gives you 1,000 requests per day with no credit card required, using Gemini models.

Q: Is Supermaven still a standalone product?

Not really. Anysphere, the company behind Cursor, acquired Supermaven in November 2024. The Supermaven plugins are still maintained, but the team’s attention is now mostly on Cursor. Supermaven’s Babble model powers Cursor’s Tab autocomplete, so at this point it’s basically been folded into the Cursor product.

Q: Which Windsurf alternative works best for large codebases?

If you're working in a large codebase, Cursor and Claude Code are usually the best places to start. Cursor is especially good at keeping project-wide context straight across multi-file work, and its Pro+ plan supports up to eight parallel agents. Claude Code is also a strong pick if you prefer working in the terminal, with a 200K-token context window on Pro and up to 1M tokens through the API for deeper codebase analysis. And if your main goal is getting the biggest context window for free, Gemini Code Assist stands out with a one-million-token free tier for individual users. For teams that want parallel agents working on the same large repo without merge conflicts, Builder Fusion runs each agent in its own containerized environment and lands every change as a separate PR.

Q: Which Windsurf alternative is best for teams that aren't all developers?

Builder Fusion is the clearest pick. It's the only tool in this roundup designed for designers, PMs, and marketers to ship real changes alongside engineers — visual edits land as reviewable PRs in the same repo, so non-developers can contribute without code review breaking down. Cursor, Windsurf, Claude Code, and the rest are all developer-only by design.

Read the full post on the Builder.io blog

Claude Code Routines Tutorial: Schedule, API, and GitHub Triggers Explained

Wed, 15 Apr 2026 18:00:00 GMT

Monday morning. You open your laptop and find 40 issues filed over the weekend, all unlabeled and unassigned. Three pull requests have been waiting since Thursday. The Friday deploy went out but nobody verified it held. You'll spend the entire day on triage instead of work.

Now imagine the same Monday. Every weekend issue is already labeled and assigned. The three PRs each have a review summary with inline comments covering security, performance, and style. A Slack message in #releases confirms the Friday deploy is clean. Claude did all of this while your laptop was closed, running on Anthropic's cloud infrastructure.

That's Claude Code Routines. This tutorial walks through all three trigger types (schedule, API, and GitHub) with real copy-paste prompt templates, a complete /fire endpoint example, and the gotchas that didn't make it into the news coverage.

If you want to see what happens when you scale this further, Builder 2.0 runs more than 20 Claude agents in parallel across content and engineering workflows. Routines keep working when your laptop is closed; Builder 2.0 goes further by keeping entire teams of agents running around the clock.

What Are Claude Code Routines?

Claude Code Routines are saved configurations (a prompt, repositories, and connectors) that run automatically on Anthropic-managed cloud infrastructure. They activate via a recurring schedule, an HTTP API call, or a GitHub event. Unlike /loop (session-bound) and Desktop scheduled tasks (machine-bound), routines keep running when your laptop is off.

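It helps to think of Claude Code's scheduling options as three separate layers:

/loop: session-bound; it only runs while the Claude Code session is open
Desktop scheduled tasks: machine-bound; they run on your own machine
Routines: cloud-based; they run on Anthropic-managed infrastructure and keep running when your laptop is off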

A routine is made of three parts: (1) a prompt (the most important piece, since the routine runs without human approval at each step), (2) one or more repositories that Claude clones and works in, and (3) optional connectors (MCP integrations like Slack, Sentry, Linear, or GitHub) that give Claude access to external services.

One Claude Code routine can combine all three trigger types. A PR review routine can run on a schedule, fire via API call, and react to GitHub events, all from the same saved configuration.

Routines are in research preview. Behavior, limits, and the API surface may change before the feature reaches general availability.

How Do You Set Up Your First Claude Code Routine?

Create a routine at claude.ai/code/routines by clicking New Routine, writing a name and prompt, connecting repositories, and selecting a trigger. The prompt is the most critical piece: routines run without approval prompts, so be explicit about what to do, which connectors to use, and what success looks like.

Three creation paths exist:

Web UI at claude.ai/code/routines — supports all three trigger types; the canonical path
CLI with /schedule — creates schedule-only routines from within an active Claude Code session; add API or GitHub triggers afterward from the web
Desktop app (New Task > New Remote Task) — distinct from local Desktop scheduled tasks, which run on your machine

For web UI creation:

Go to claude.ai/code/routines, click New Routine
Give it a name (your reference only; Claude doesn't use it during runs)
Write the prompt
Connect one or more GitHub repositories
Select a trigger type or combine multiple
Remove any connectors the routine doesn't need; all connected MCP connectors are included by default

What separates a working autonomous prompt from a broken one is specificity. Routines run without human approval at each step, so the prompt carries the full cognitive load. Specify what "done" looks like: a Slack message, a draft PR, a labeled issue. Name which specific connectors to use; don't assume Claude knows your Slack workspace or Sentry project. Describe what to do when something unexpected happens.

A bad prompt: "Check for issues." A good one: "Read all GitHub issues opened today in {repo}, apply a label from [bug, feature, docs, question, needs-triage] to each, assign it based on which files it references, and post a summary to #dev-standup with the count and breakdown."

How Does the Schedule Trigger Automate Recurring Dev Work?

The schedule trigger runs a routine on a recurring cadence: hourly, daily, weekdays-only, weekly, or a custom cron expression with a minimum interval of one hour. Schedules are timezone-aware; enter the time in your local zone and it converts automatically. Runs may start a few minutes after the scheduled target; that stagger is small and consistent per routine.

Choose from four preset cadences: hourly, daily, weekdays (Monday through Friday), or weekly. For anything more precise, like every Tuesday at 9am or the first of each month, use a custom cron expression. Set it from the web UI or via /schedule update in the CLI. The minimum interval is one hour; sub-hourly expressions are rejected.
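As a rough illustration, the two custom schedules mentioned above can be written as standard five-field cron expressions. The sketch below assumes the routine scheduler accepts that common five-field format, and the 9am time on the monthly example is an arbitrary choice. The helper fires_at_most_hourly is hypothetical: a deliberately conservative local pre-check, not how Anthropic validates expressions.

```python
# Standard 5-field cron: minute hour day-of-month month day-of-week
EVERY_TUESDAY_9AM = "0 9 * * 2"
FIRST_OF_MONTH_9AM = "0 9 1 * *"  # the time of day is an arbitrary example


def fires_at_most_hourly(expr: str) -> bool:
    """Conservative pre-check: True only if the minute field is one fixed value.

    A single fixed minute means two runs can never be less than an hour
    apart. Lists, ranges, steps, and "*" in the minute field are rejected,
    even though some of them could still be valid schedules.
    """
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 cron fields, got {len(fields)}")
    minute = fields[0]
    return minute.isascii() and minute.isdigit() and 0 <= int(minute) <= 59


print(fires_at_most_hourly(EVERY_TUESDAY_9AM))  # True
print(fires_at_most_hourly("*/15 * * * *"))     # False
```

Running a check like this before saving a schedule catches the obvious sub-hourly cases locally; the scheduler itself remains the source of truth.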

Design for "sometime overnight," not exact timing. If a routine needs to fire at precisely 23:00:00, the schedule trigger is the wrong tool. If a window works, it's the right one.

Here's a prompt template for nightly backlog grooming. Copy it, replace {repo} with your repository name, and adjust the Slack channel:

# Nightly backlog grooming

It's end of day. Read all GitHub issues opened today in {repo}.

For each issue:
- Apply the appropriate label from: bug, feature, docs, question, needs-triage
- Assign it to the relevant owner based on which files or directories it
  references (check CODEOWNERS if one exists)
- If the issue is unclear or missing reproduction steps, leave a comment
  requesting more information — don't label it until the reporter responds

After processing all issues, post a summary to #dev-standup in Slack:
- Total issues processed today
- Breakdown by label
- Any issues flagged as needing human attention

Keep the Slack message concise. Use bullet points. If zero issues were
filed today, post a single line: "No new issues today."

With this running on weekdays, the team starts each morning with a labeled, assigned queue. The API trigger works differently: instead of a clock, an HTTP call starts the run.

How Do You Fire a Claude Code Routine from the API?

The API trigger gives each routine a dedicated HTTP endpoint. POST to it with a bearer token and an optional freeform text field to pass runtime context: alert bodies, deploy metadata, or any string you want Claude to work with. The bearer token is shown exactly once when you generate it; store it immediately, since it cannot be retrieved after that.

The endpoint follows this pattern:

POST https://api.anthropic.com/v1/claude_code/routines/{trigger_id}/fire

Full curl example with all required headers (none of the four are optional):

curl -X POST https://api.anthropic.com/v1/claude_code/routines/{trigger_id}/fire \
  -H "Authorization: Bearer {your_token}" \
  -H "anthropic-beta: experimental-cc-routine-2026-04-01" \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d '{"text": "Production alert: error rate on /api/checkout exceeded 5% threshold. Alert ID: ALR-4821. Environment: prod-us-east-1."}'

On success, you get back a session ID and a URL:

{
  "type": "routine_fire",
  "claude_code_session_id": "session_01HJKLMNOPQRSTUVWXYZ",
  "claude_code_session_url": "https://claude.ai/code/session_01..."
}

Log the claude_code_session_url value. It links to the live run so you can watch what Claude is doing, review changes, or continue the conversation manually.
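If your caller is a script rather than a shell, the same request can be sketched in Python using only the standard library. The endpoint, headers, and response field come from the curl example above; the function names build_fire_request and fire_routine are illustrative, and the 30-second timeout is an assumption.

```python
import json
import urllib.request

API_BASE = "https://api.anthropic.com/v1/claude_code/routines"
BETA_HEADER = "experimental-cc-routine-2026-04-01"  # may rotate; check the docs


def build_fire_request(trigger_id: str, token: str, text: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for a routine's /fire endpoint."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/{trigger_id}/fire",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "anthropic-beta": BETA_HEADER,
            "anthropic-version": "2023-06-01",
            "Content-Type": "application/json",
        },
    )


def fire_routine(trigger_id: str, token: str, text: str) -> str:
    """Send the request and return the session URL so the caller can log it."""
    request = build_fire_request(trigger_id, token, text)
    # Assumed timeout; urlopen raises HTTPError on non-2xx responses.
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    return payload["claude_code_session_url"]
```

Splitting request construction from sending keeps the header set easy to unit-test without touching the network.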

Three things to know before you wire this into production:

The text field is a literal string. Whatever you put in it reaches Claude as plain text. If you send {"alert_id": "123"} in the text field, Claude receives the JSON notation as an ordinary string, not as parsed data. Write it as human-readable prose ("Alert ID 123 fired in prod") rather than structured data.

Each token is scoped to one routine. Rotating one token doesn't affect others. Revoke it from the API trigger modal in the routine's edit form.

The beta header may rotate. experimental-cc-routine-2026-04-01 is currently required. Two of the most recent previous header versions continue to work temporarily; migrate when Anthropic ships a new dated header. Verify the current header in the Claude Code Routines documentation before shipping any integration.
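For the first point above, one option is to flatten structured alert data into prose before sending it. This is a minimal sketch: the alert dictionary shape and the alert_to_text helper are hypothetical, so adapt the keys to whatever your monitoring tool actually emits.

```python
def alert_to_text(alert: dict) -> str:
    """Render a structured alert as the plain-prose string the text field expects."""
    return (
        f"Production alert: {alert['summary']}. "
        f"Alert ID: {alert['id']}. "
        f"Environment: {alert['environment']}."
    )


# Hypothetical alert payload from a monitoring tool.
alert = {
    "id": "ALR-4821",
    "summary": "error rate on /api/checkout exceeded 5% threshold",
    "environment": "prod-us-east-1",
}

print(alert_to_text(alert))
```

The printed sentence is the same text used in the curl example's payload.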

A practical use case: wire your monitoring tool to call /fire when an error rate threshold is crossed, passing the alert body as text. The routine pulls the stack trace, correlates it with recent commits, and opens a draft PR with a proposed fix. Your on-call engineer reviews a PR instead of starting from a blank terminal at 2am.

The GitHub trigger flips the activation model: instead of your toolchain calling Claude, GitHub calls it automatically on repository events.

How Does the GitHub Trigger Work for Automated PR Reviews?

The GitHub trigger fires a new routine session on matching pull request or release events. It requires the Claude GitHub App installed on the target repository, separate from running /web-setup. Filter rules let you scope exactly which events activate the routine. Events beyond the per-routine hourly cap are dropped, not queued.

Supported events: pull_request (opened, closed, assigned, labeled, synchronized, or otherwise updated) and release (created, published, edited, or deleted).

Setup requires two separate steps; many people stop after the first one:

Run /web-setup in Claude Code to grant repository access for cloning (already done if you've used Claude Code with this repo)
Install the Claude GitHub App on the target repository to enable webhook delivery. Running /web-setup does not install the GitHub App. Both are required. The UI prompts you, but it's easy to stop after step 1 and wonder why triggers aren't firing.

Filtering narrows which events activate the routine. Filter on: Author, Title, Body, Base branch, Head branch, Labels, Is draft, Is merged, From fork. All filter conditions must match for the routine to fire.

The regex operator gotcha: matches regex tests the entire field value, not a substring. To match any PR title containing "hotfix", write .*hotfix.*. Without the surrounding .*, the filter only matches a title that is exactly the word "hotfix" with nothing before or after it. For simple substring matching, use contains instead.
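The difference is easy to reproduce locally. Python's re.fullmatch is used here only as an analogy for whole-field matching; the exact regex dialect the routine filter uses is not specified in this post, and the PR title is made up.

```python
import re

title = "urgent hotfix for login timeout"  # hypothetical PR title

# Whole-field matching: the pattern must cover the entire title.
print(re.fullmatch(r"hotfix", title) is not None)      # False
print(re.fullmatch(r".*hotfix.*", title) is not None)  # True

# The "contains" operator behaves like a plain substring test.
print("hotfix" in title)                               # True
```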

Branch permissions: By default, Claude can only push to claude/-prefixed branches. To push elsewhere, enable "Allow unrestricted branch pushes" in the routine settings. Commits and PRs appear under your personal GitHub identity, not a bot account.

Session model: Each matching GitHub event starts a fresh Claude Code session with no state carryover from previous runs. Write prompts that are self-contained per event.

Events at the cap are dropped. If your repo is high-volume, keep filters narrow. Events that arrive after the per-routine hourly cap is hit are gone until the window resets; they are not retried.
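To see what "dropped, not queued" means in practice, here is a toy model of a per-routine hourly cap. It assumes a fixed clock-hour window and a made-up limit of 2; the real window semantics and cap values are set by Anthropic and are not documented in this post.

```python
from dataclasses import dataclass, field


@dataclass
class HourlyCap:
    """Fixed-window counter that drops, rather than queues, events over the limit."""

    limit: int  # hypothetical value; the real cap is set per routine
    used_per_window: dict[int, int] = field(default_factory=dict)

    def accept(self, timestamp_seconds: int) -> bool:
        window = timestamp_seconds // 3600  # which clock hour the event falls in
        used = self.used_per_window.get(window, 0)
        if used >= self.limit:
            return False  # dropped: nothing is stored for a later retry
        self.used_per_window[window] = used + 1
        return True


cap = HourlyCap(limit=2)
# Three events in the first hour, then one after the window resets.
print([cap.accept(t) for t in (0, 600, 1200, 3600)])  # [True, True, False, True]
```

The third event never runs, even after the window resets, which is why narrow filters matter more than retry logic here.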

Here's a PR code review prompt template. Adapt the checklist to your team's actual standards:

# PR code review checklist

A new pull request has been opened. Review it against our team checklist.

## Security
- Any hardcoded secrets, API keys, or credentials in the diff?
- Any unvalidated user inputs that could enable injection attacks?
- Any new dependencies? If so, check for known CVEs.

## Performance
- N+1 query patterns in ORM calls?
- Missing database indexes for new query patterns in this PR?
- Large synchronous operations that should be async?

## Code style
- Follows our eslint configuration?
- Consistent naming with the rest of the codebase?
- Functions and variables named clearly enough to be self-documenting?

Leave an inline comment on each specific issue found. Give the line
number and the specific fix, not vague observations like "there might
be a performance issue here."

Post a top-level summary comment with a pass/fail for each category.
Human reviewers should focus on design decisions, not mechanical checks.

Run this on pull_request.opened and reviewers stop spending attention on SQL injection checks and naming conventions. That frees up code review for the work that actually requires human judgment.

How Do Claude Code Routines Compare to GitHub Actions, cron, and n8n?

Claude Code Routines are purpose-built for AI-powered dev automation: they run Claude natively on Anthropic's cloud with no YAML required. GitHub Actions wins for language-agnostic CI/CD pipelines. n8n and Zapier win for connecting non-coding tools across hundreds of app integrations. Cron is best for simple local scripts that don't need AI reasoning.

Routines and GitHub Actions complement each other. Use Actions for build, test, and deploy pipelines. Use Routines for the AI reasoning work around those pipelines: reviewing what got merged, triaging what failed, verifying what deployed.

n8n and Zapier win when you're connecting 10+ SaaS tools without writing code. Routines win when the job requires Claude to reason about developer artifacts: code diffs, issue descriptions, error logs, stack traces. These are different use cases, and the answer for most teams is both.

Cron still has a place. A 20-line bash script that runs nightly and produces clean output is a cron job. When the job needs judgment, reach for Routines.

What Are the Limits and Daily Caps for Claude Code Routines?

Each plan has a daily run cap visible at claude.ai/code/routines and claude.ai/settings/usage. GitHub trigger events beyond the per-routine hourly cap are dropped, not queued, until the window resets. Organizations with metered usage enabled can continue on overage; for everyone else, further runs are rejected until the daily window resets.

Daily run cap: Every plan has one. Check claude.ai/settings/usage to see your current remaining runs. Anthropic hasn't published official per-plan numbers in the docs; don't build critical workflows around figures circulating on social media until they're confirmed.

GitHub hourly cap: Separate from the daily cap. Events that arrive after the hourly limit is hit are dropped. They're gone until the next window opens. Keep filter rules narrow so only genuinely relevant events consume your budget.

Metered overage: Team and Enterprise plans with extra usage enabled can continue running on overage billing when the daily cap is hit. On individual and non-metered plans, further runs are rejected until the window resets. Enable extra usage from Settings > Billing on claude.ai.

Routine ownership is individual. Routines belong to your personal claude.ai account, not your team or organization. Commits and PRs appear under your personal GitHub identity. There's no team-sharing or co-ownership during the research preview. If teammates need the same routine, each one sets up their own copy.

All of the above applies to the current research preview and may change as the feature matures.

FAQ

Do Claude Code Routines run when my laptop is off?

Yes, and that's the core differentiator. Routines execute on Anthropic-managed cloud infrastructure, not your local machine. Unlike Desktop scheduled tasks (machine-bound) and /loop (session-bound), routines keep running when your laptop is closed. Set a schedule or a GitHub trigger and close the lid.

What's the difference between Claude Code Routines and /schedule?

/schedule is a CLI shortcut for creating schedule-triggered routines from within a Claude Code session. It creates the same underlying routine object, but only supports the schedule trigger type. To add an API or GitHub trigger, edit the routine at claude.ai/code/routines afterward.

How many times can I run a Claude Code Routine per day?

Each plan has a daily run cap, but Anthropic hasn't published official per-plan numbers in the documentation at time of writing. Check claude.ai/settings/usage to see your current limit and remaining runs. Don't plan around unconfirmed figures.

What happens when a routine hits its event cap?

GitHub trigger events that arrive after the per-routine hourly cap is exceeded are dropped, not queued for the next window. Keep filter rules narrow so only the events that matter consume your hourly budget. Schedule-triggered and API-triggered runs follow the daily cap, not the hourly one.

Can I share Claude Code Routines with teammates?

Not currently. Routines belong to your individual claude.ai account. Pull requests and commits from a routine appear under your personal GitHub identity. There's no team-sharing, transfer, or co-ownership mechanism in the research preview.

Conclusion

Claude Code Routines shift Claude from a tool you invoke to one that works alongside you, running on a schedule, responding to API calls, and reacting to GitHub events on Anthropic-managed infrastructure. The three trigger types handle nearly any recurring dev workflow without CI/CD infrastructure or YAML.

Start with the schedule trigger and the backlog grooming template above. It's the lowest-friction way to see a routine complete a full end-to-end run. After the first nightly run finishes, you'll have enough intuition to write the prompt for your actual use case.

Head to your routines dashboard, click New Routine, and paste one of the templates. The first run teaches you more than any documentation. Check the Claude Code Routines documentation for the full API reference and limit updates as the research preview matures.

Read the full post on the Builder.io blog

Claude Code Subagents: How to Create, Use, and Debug Them

Thu, 16 Apr 2026 18:00:00 GMT

Claude Code feels great—right up until your main thread turns into a pile of logs, grep output, and dead-end research, and you see the dreaded "compacting" start.

Claude Code subagents help by offloading that noisy side work to specialized workers with their own prompt, tool access, permissions, and optional memory, then returning a clean summary to the main session.

This guide covers what subagents are, how to create them, where they help most, and when the handoff overhead just isn’t worth it.

What are Claude Code subagents?

Claude Code subagents are specialized workers that run in separate context windows, each with their own prompt, tool access, permissions, and optional memory. They’re useful for side work you want to keep out of the main session, like repo exploration, docs lookup, test runs, and result validation.

Anthropic describes them as custom assistants for specific kinds of tasks. Claude uses a subagent’s description as a routing hint when deciding whether to hand work off to it, so a good subagent is more than a persona. It’s a clearly scoped workflow with the right tools and instructions for a recurring job.

Claude Code already includes versions of this pattern. Explore and Plan are read-only helpers for reconnaissance, while the general-purpose agent handles broader multi-step work. Custom subagents become useful when you want your own repeatable version of that workflow for tasks you do regularly.

A simple mental model is: CLAUDE.md holds ongoing project context, skills store reusable playbooks, and subagents handle isolated tasks where the main session only needs the result.

Why do Claude Code subagents matter when context is the real bottleneck?

In long Claude Code sessions, the real limit usually isn’t capability. It’s context. Even a simple task can turn into file reads, tests, doc searches, log checks, and plenty of dead ends. Before long, the conversation gets crowded with raw artifacts instead of the decisions that actually matter.

That’s where subagents help. They take on bounded, noisy work in separate threads and return condensed results to the parent session. Instead of carrying every search result or debug note forward, the main conversation keeps the conclusions, tradeoffs, and next actions.

They also make parallel work practical. One subagent can inspect the data layer while another traces UI entry points or gathers documentation, then each reports back with a summary. The benefit isn’t just speed—it’s preserving space in the main context window for higher-level reasoning.

How do you create and configure Claude Code subagents?

You can create Claude Code subagents either from the /agents UI or by writing Markdown files with YAML frontmatter. The main things to get right are scope, description, tool access, permission mode, and prompt design.

Anthropic supports project-level subagents in .claude/agents/ and user-level subagents in ~/.claude/agents/. Project-level definitions are usually the better default when a workflow depends on a codebase’s conventions. User-level agents are a better fit for portable habits, like repo exploration or docs lookup.

The Markdown file does two jobs: the frontmatter configures the agent, and the body becomes its system prompt. One easy-to-miss detail from Anthropic’s docs is that subagents do not inherit the full default Claude Code system prompt. They get their own prompt plus basic environment details, which makes them easier to shape deliberately.

A good starting pattern for a read-only repo explorer looks like this:

---
name: repo-explorer
description: Search unfamiliar codebases, map entry points, and summarize the architecture. Do not edit files.
tools: [Read, Grep, Glob]
disallowedTools: [Edit, Write, Bash]
model: haiku
permissionMode: plan
memory: project
---

Find the main app entry points, core data flow, and likely risk areas.
Return a short summary with file paths, key abstractions, and open questions.

That definition works because each field reinforces the same job: the description helps Claude route to it, the tool access matches the task, and the prompt defines the output.
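To make the file's two roles concrete, here is a small sketch that splits a subagent definition into its frontmatter (the configuration) and its body (the system prompt). The split_agent_file helper is hypothetical and only separates the two parts; it does not parse the YAML or mirror how Claude Code loads the file.

```python
def split_agent_file(markdown: str) -> tuple[str, str]:
    """Split a subagent Markdown file into (frontmatter, body)."""
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("file must start with a '---' frontmatter fence")
    # Find the closing fence, skipping the opening one on line 0.
    end = next(
        (i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---"),
        None,
    )
    if end is None:
        raise ValueError("missing closing '---' frontmatter fence")
    frontmatter = "\n".join(lines[1:end])
    body = "\n".join(lines[end + 1:]).strip()
    return frontmatter, body


sample = """---
name: repo-explorer
model: haiku
---

Find the main app entry points, core data flow, and likely risk areas.
"""

config, system_prompt = split_agent_file(sample)
print(config)
print(system_prompt)
```

A check like this can run in CI to catch a definition that is missing its closing fence before it reaches .claude/agents/.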

You can also configure fields like model, maxTurns, mcpServers, hooks, background, and isolation. In practice, though, the most useful fields are usually the simple ones. Start with a sharp description, narrow tools, and the smallest permission surface that still gets the job done. Turn on background: true when a worker can run concurrently without needing clarifying questions. Use isolation: worktree when parallel edits might collide and you want file-system separation.

Claude may delegate automatically, or you can force a specific worker with an @ mention. You can also run an entire session as a single agent by passing that agent's name to claude --agent.

What makes a good Claude Code subagent?

Once you’ve created a few subagents, the next challenge is making them specific enough to be genuinely useful. The best ones are narrow, shaped around a repeatable job, and easy for Claude to route correctly.

Start with job-shaped names like repo-explorer, test-runner, pr-reviewer, and docs-researcher. Claude tends to route to those more reliably than generic names like frontend-engineer, which sound flexible but give weaker signals and often lead to bloated instructions.

Descriptions matter just as much as names. If the real task is “inspect auth changes and look for unsafe patterns,” say that plainly. Action-oriented language works better than vague capability language. If a subagent keeps misfiring, check the description before you start tweaking the prompt body.

Keep tool access tight. If an agent only needs Read, Glob, and Grep, there’s no reason to give it write access just for convenience. Tightly scoped agents are easier to trust, cheaper to run, and generally much easier to debug.

When a task benefits from persistence, have the agent produce a durable artifact like research.md, plan.md, or review-notes.md. That gives your main session something concrete to verify, edit, and reuse.

Reviewer agents are an especially good fit because they benefit from a clear checklist, a limited toolset, and a crisp output format.

What are the best real-world use cases for Claude Code subagents?

The best Claude Code subagent use cases are noisy, self-contained tasks where the main session only needs a summary or recommendation back.

Repo exploration is one of the clearest wins. In an unfamiliar codebase, a repo explorer can inspect entry points, trace data flow, and spot conventions, then return a short brief instead of filling the main thread with search output.

Docs lookup is another strong fit. Official docs, changelogs, and example repos can generate a lot of raw material quickly. A docs-focused agent can gather the relevant sources, summarize the differences, and point to the source you should actually trust.

Test runners and log investigators also pay off quickly. Instead of carrying every stack trace forward, the parent session gets the failing files, likely root cause, and the next thing to try.

Reviewer and checker agents are especially reusable. A TypeScript strictness checker, accessibility reviewer, or security reviewer can run near the end of a task and return a compact pass/fail-style summary.

A few concrete examples:

repo-explorer: maps entry points, data flow, and likely risk areas in an unfamiliar repo.
docs-researcher: pulls in official docs and release notes, then summarizes what matters for the task.
test-runner: runs targeted tests, groups failures, and suggests the most sensible next debugging step.
pr-reviewer: reviews changed files and gives feedback on code quality, testing, and maintainability.
security-reviewer: reviews authentication, secret handling, and input boundaries without changing the implementation.

For more advice and helpful patterns, check out my article on when and how to use subagents. Subagents work best when the task is separable enough to hand off cleanly.

When should you not use Claude Code subagents?

Claude Code subagents are not a good fit for every task.

They come with setup, handoff, and context overhead. For small edits, tightly coupled work, or tasks that need constant back-and-forth, it usually makes more sense to stay in the main conversation.

You’ll see the failure mode quickly in feature work that spans multiple layers. Say you’re changing a schema, updating a server route, wiring up a React screen, and fixing the test suite in one pass. That kind of job depends on shared intent across every step. If you split it into too many isolated workers, the mental model can get fragmented, and you end up with awkward summaries and handoffs between phases.

Anthropic's docs are pretty clear about the limits. Subagents start fresh, so they need time to gather context. They also can't spawn other subagents. If you need fast collaboration across multiple phases, the handoff itself can end up being the problem.

A simple rule of thumb works well here: use a subagent when the work is noisy, bounded, and easy to summarize. Stay in the main conversation when the work is small, tightly coupled, or depends on a shared mental model that would get weaker after a summary pass.
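
If it helps to see that rule of thumb written down, here is a small sketch. The TaskProfile shape and routeTask function are hypothetical names for illustration, not anything Claude Code exposes.

```typescript
// Hypothetical description of a task, using the traits named in the rule of thumb.
interface TaskProfile {
  noisy: boolean; // produces lots of search output, logs, or stack traces
  bounded: boolean; // has a clear stopping point
  easyToSummarize: boolean; // the parent only needs a brief back
  tightlyCoupled: boolean; // depends on shared intent across steps
  smallEdit: boolean;
}

type Route = "subagent" | "main-conversation";

function routeTask(task: TaskProfile): Route {
  // Small or tightly coupled work stays in the main conversation.
  if (task.smallEdit || task.tightlyCoupled) return "main-conversation";
  // Hand off only when all three subagent-friendly traits hold.
  return task.noisy && task.bounded && task.easyToSummarize
    ? "subagent"
    : "main-conversation";
}

const repoExploration: TaskProfile = {
  noisy: true,
  bounded: true,
  easyToSummarize: true,
  tightlyCoupled: false,
  smallEdit: false,
};

console.log(routeTask(repoExploration));
```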

How do subagents compare with skills, hooks, MCP, and agent teams?

Subagents are specialized workers inside a single Claude Code session. Skills store reusable instructions, hooks handle deterministic automation, MCP connects external systems, and agent teams coordinate separate collaborating sessions.

Here’s a simple decision table:

The most subtle distinction is between subagents and agent teams. Subagents stay inside one session and report back to the parent. Agent teams add peer coordination across separate sessions. As Anthropic's agent teams documentation explains, that coordination can use about 7x more tokens in plan-heavy workflows and comes with more operational overhead.

So the choice mostly comes down to communication. If the parent session just needs a clean result back, use a subagent. If multiple workers need to collaborate as peers, agent teams are the better fit.

If you're already using MCP heavily, it's worth reading our guide to Claude Code MCP servers alongside this one. MCP expands what an agent can access, while subagents put clearer boundaries around how that work gets done.

How should developers think about Claude Code subagents going forward?

The long-term value of Claude Code subagents is workflow standardization. They let you turn repeated instructions into reusable, scoped building blocks with their own prompt, permissions, and operating rules.

That’s why the feature feels like more than a convenience setting. If you keep repeating the same review loop, repo exploration prompt, or validation pass, that’s usually a sign the workflow wants a more durable shape.

Public adoption still seems early. One recent exploratory study of agentic coding tool configuration found that advanced artifacts like skills and subagents were often used pretty shallowly. That matches how the ecosystem feels right now: the ideas are solid, but the patterns are still settling.

So start small. Build one or two focused workers for exploration, review, or testing, then watch where the summaries help and where they hide too much context.

If you're already spending a lot of time in Claude Code, this roundup of Claude Code tips and best practices is a good next read. Look at the prompts you keep repeating and pick the noisiest one. That's usually the first subagent worth keeping.

The Backlog Problem AI Didn't Solve

Thu, 16 Apr 2026 18:00:00 GMT

The backlog isn't a prioritization problem. It's a unit cost problem. Here's why AI tools made your team faster but didn't shorten the backlog.

Most enterprise product teams have adopted AI tools by now. Developers are using Claude Code and Cursor, product managers are drafting PRDs with Claude, and designers are prototyping in Figma AI. Everyone is faster at their individual job. And somehow, the backlog is the same size it was a year ago.

The explanation holds up on inspection. When you speed up each step in a sequential process without changing the process itself, you reach the handoff faster. Writing PRDs and generating prototypes are faster now. But the gap between them is exactly where it was.

What most teams want to test is whether lots of small incremental improvements can deliver more business value than a few large strategic projects. The hypothesis is almost always yes. The problem is that even the small things, the backlog items that never get addressed, require moving through the same heavyweight process as everything else. Without a way to connect AI directly to the codebase and let reviews happen in context, nothing gets faster except the individual steps. The handoffs stay exactly where they were.

The four-bucket problem

Most product teams are managing work across roughly four categories at any given time:

Large strategic initiatives that span multiple quarters
Sprint-level enhancements
A backlog of smaller improvements that compound into meaningful product quality over time
Ongoing bug fixes and support

The first two get prioritized. Bug fixes get attention when something breaks. Small backlog items rarely get attention at all.

The problem is structural. The cost of a backlog item is almost the same as that of a full enhancement. Both require a designer to create something, a developer to build it, a round of back-and-forth between them, a review cycle, and a deployment process. When the unit cost is that high, the small things never pencil out. They stay on the board indefinitely. Adding headcount doesn't solve it either.

When AI has direct access to your codebase, design system, and existing context, the unit cost of small changes drops significantly. A designer or PM can describe a change; the agent builds it in an isolated branch; a developer reviews the diff; and it ships. The infrastructure exists to handle it without consuming a full sprint. The backlog becomes addressable.

Why the tool-per-step approach falls short

The obvious response to slow handoffs is to make each tool faster. Better Figma prototyping. Faster design-to-code conversion. AI-assisted code review. These are real gains, and teams that have adopted them are measurably faster at each step. The steps are still separate, though, and the people doing them are still in separate tools.

What you end up with is four faster silos handing off to each other. Designers work visually. Developers work in code. Neither can easily contribute to the other's environment. The iteration cycle at the end, when you're trying to reconcile what was designed with what was built, stays expensive. You've just arrived at it sooner.

The gap is in connecting the generated UI to the existing infrastructure: legacy integrations, API contracts, and backend services with documented but sprawling context. When AI doesn't have access to that context, it generates something that looks right and works in isolation but doesn't fit the actual system. A human has to catch that, and the review cycle stays as heavy as ever.

What the collaboration layer actually changes

The shift isn't about replacing the tools people already use. Most developers will keep using the IDE they prefer. Designers will keep working in Figma. What changes is that there's a shared environment where the work comes together. A PM assigns a backlog ticket to an agent; the agent builds against the real codebase and surfaces a preview link; the designer refines it visually; QA validates it. The branch carries the context the whole way, and the engineer reviews a diff rather than rebuilding someone else's feedback from scratch.

AI is only as useful as the context you give it, and different types of work require different contexts. A conceptual prototype needs very little. A high-fidelity prototype heading toward production needs to meet accessibility standards and compliance requirements, address geo-specific constraints, and be informed by the service architecture behind it. The workflow has to reflect those differences, not flatten them.

The agent opens a PR. A developer reviews it. A designer confirms the visual output. The work that used to sit in the backlog for quarters is shipped in days. The strategic work doesn't get crowded out because the small things are no longer competing for the same resources.

The real leverage is in the process, and that's what actually moves the backlog.

Builder connects to your existing codebase, design system, and git workflows so every role on your team can build, review, and ship together. Sign up for a free trial.

The New Path from Prototype to Production

Mon, 13 Apr 2026 18:00:00 GMT

I recently spoke at a multi-day event for product leaders across SaaS and enterprise software. The audience included PMs, product execs, and teams responsible for shipping and adoption. Across conversations, one question kept coming up: How do I get engineering leadership on board with this way of working?

The answer starts with how work moves.

Teams are starting to split the work differently. Developers use tools like Claude Code for the parts that require deep engineering judgment: core logic, architecture, anything that touches systems or performance. That work stays with them.

Once that foundation is in place, the rest of the work can move forward. The branch gets pushed into Builder. From there, other roles take over directly in the code:

QA finds bugs and fixes them without routing everything back through engineering
Designers refine spacing, interactions, and visual details themselves
Product updates copy, tracking, and small requirements without opening new tickets

The last mile of development, which tends to be the most fragmented and time-consuming, stops flowing through engineers.

This shift resonates with engineering leaders for a specific reason. When product and design generate and refine working code themselves, engineers spend more time on architecture, performance, and system design. Work moves in parallel across the team. Throughput increases without adding headcount.

That shift in ownership also changes when teams learn. Engineering teams spend significant time rebuilding work after launch. Requirements evolve once real users interact with a feature, which drives rework and slows teams down.

Teams at the event connected quickly with a different approach: validate earlier, while the work is still easy to change. With Builder, product and design teams generate working code, push it to a branch, and share a live preview with customers. Feedback comes from real usage through those previews.

By the time something reaches engineering, the direction is clear. The team has already seen how users respond. Engineers review and finalize work that reflects real usage. Iteration cycles shrink. Quality improves.

One idea kept coming up in follow-up conversations: iterate before production. That principle carries through to the rest of the workflow. Engineers still start work as they normally would. They open branches and push changes. From there, review and refinement move to the right roles:

Designers adjust directly in code
Product refines scope based on feedback
QA tests and resolves issues before a pull request reaches engineering

Engineers stay focused on code quality and system-level decisions. The rest of the team contributes directly to building and validating what ships.

The experience feels familiar. It mirrors how teams already collaborate in tools like Google Docs or Figma. Work is shared, visible, and easy to evolve. AI agents handle repetitive tasks. People focus on judgment and decisions.

Underneath this is a broader shift toward agent-native development, where the agent and the interface operate as a single system. Work moves fluidly between people and automation across the entire workflow.

The takeaway from the event was consistent. Teams want faster delivery. They want less rework. They want to involve more of the team in building and get closer to real customer feedback earlier in the process. This model supports that shift.

If you want to put this workflow into practice, you can start using Builder today.

Claude Advisor API: Use Opus for 80% Less Money

Sun, 12 Apr 2026 18:00:00 GMT

If you're building with Claude, you'll hit this wall.

You pick Opus. The reasoning is brilliant. The invoice arrives. You switch to Sonnet. The price drops. So does the quality on anything hard. You pick Opus again for the difficult calls and Sonnet for everything else, and now you're managing two models, two contexts, and two sets of prompts.

Anthropic just shipped a third option that makes that whole dance unnecessary.

The Claude advisor API, in beta since April 9, 2026, lets you pair a fast executor model (Sonnet or Haiku) with Opus as an on-demand advisor. When the executor hits a decision it's not confident about, it calls the advisor mid-task. Opus weighs in. The executor continues. The whole thing happens inside a single API call. No second request. No context synchronization. No orchestration layer.

In Anthropic's benchmarks, Sonnet with an Opus advisor scored 74.8% on SWE-bench Multilingual versus 72.1% for Sonnet alone, and cost 11.9% less than running Opus solo. And the Haiku numbers are even more striking, but we'll get to those.

This post covers what the Claude advisor tool is, how it works server-side, and exactly how to add it to an existing agent.

Builder 2.0 is the harness built for exactly this kind of agent work — run 20+ Claude-powered agents in parallel, each in its own cloud container with browser preview, with Slack and Jira wired in so your whole team ships via auto-generated PRs. No orchestration glue, just working features in production.

What is the Claude advisor API?

TL;DR: The Claude advisor API is a beta feature that lets you designate Claude Opus as an on-demand advisor for a faster executor model (Sonnet or Haiku). The executor calls the advisor mid-task when it needs strategic input. Everything happens in one API call — no extra network round-trips, no orchestration code, no context syncing.

The advisor pattern itself isn't new. In 2023 (which is like saying the AI Bronze Age), researchers at UC Berkeley published a paper titled "How to Train Your Advisor: Steering Black-Box LLMs with ADVISOR MODELS." They found that small models trained to generate per-instance natural language advice could noticeably improve larger models' output. Anthropic built that same pattern directly into the Claude API.

The Anthropic advisor API adds a new tool type to your existing tools array. You enable it with a single beta header. Your executor model (the one doing the actual work) knows when to call the advisor. When it does, the call happens server-side. No round-trip. No client-side logic.

This is available today on the Claude API. It doesn't need a waitlist or special application for API access. Enterprise customers with Zero Data Retention (ZDR) agreements can use it without changing their data handling setup. The advisor is explicitly ZDR-eligible.

How does the Claude advisor tool work?

TL;DR: The executor model generates normally until it decides it needs help. It emits a signal the server intercepts, which pauses the executor and runs Opus on the full conversation history. Opus sends back ~400-700 tokens of advice — never shown to the user — and the executor resumes informed. One API call, transparent to the client.

Here's the server-side flow step by step:

You send a POST /v1/messages request with the advisor tool defined in the tools array
The executor model (Sonnet or Haiku) runs and generates output as normal
When it hits a decision it wants help with, it emits a structured token block ({"type": "server_tool_use", "name": "advisor"}). That's the trigger.
The server pauses the executor and runs Opus with the entire conversation history: the original prompt, every tool call made so far, and every result the executor has seen
Opus generates an advisory message — a plan, a correction, a strategic next step — in approximately 400-700 tokens
That advice is injected back into the assistant message stream as an advisor_tool_result block. The user never sees this.
The executor resumes, now informed by Opus's guidance, and continues generating

Nothing changes on the client side. One request in, one response out.
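
If you want to see whether the advisor was actually consulted, one option is to walk the content blocks after the fact. This is a sketch, assuming the server_tool_use and advisor_tool_result blocks described above show up in the response's content array. The ContentBlock shape, the summarizeAdvisorActivity helper, and the sample data are hypothetical.

```typescript
// Loose, hypothetical block shape: only the fields this sketch reads.
interface ContentBlock {
  type: string;
  name?: string;
  text?: string;
}

function summarizeAdvisorActivity(content: ContentBlock[]) {
  // Executor-side trigger blocks that name the advisor.
  const advisorCalls = content.filter(
    (block) => block.type === "server_tool_use" && block.name === "advisor",
  ).length;
  // Advice blocks injected back into the stream.
  const advisorResults = content.filter(
    (block) => block.type === "advisor_tool_result",
  ).length;
  // Text the end user would actually see.
  const visibleText = content
    .filter((block) => block.type === "text")
    .map((block) => block.text ?? "")
    .join("");
  return { advisorCalls, advisorResults, visibleText };
}

// Made-up sample content, for illustration only.
const sampleContent: ContentBlock[] = [
  { type: "text", text: "Planning the worker pool. " },
  { type: "server_tool_use", name: "advisor" },
  { type: "advisor_tool_result" },
  { type: "text", text: "Using a bounded channel with context cancellation." },
];

console.log(summarizeAdvisorActivity(sampleContent));
```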

Two things to note. The advisor reads full conversation context but can only return text advice. Its tokens bill at the Opus rate but don't count against the executor's max_tokens cap. Both appear in the usage object, so cost attribution is clean.
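
Here is a rough sketch of what that cost attribution could look like once you have per-model token counts. The TokenUsage and Rate shapes are hypothetical (the real usage object's field names may differ), and the rates below are placeholder numbers, not Anthropic's prices.

```typescript
// Hypothetical shapes: map the real usage object's fields onto these yourself.
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

interface Rate {
  inputPerMTok: number; // USD per million input tokens
  outputPerMTok: number; // USD per million output tokens
}

function costUsd(usage: TokenUsage, rate: Rate): number {
  return (
    (usage.inputTokens * rate.inputPerMTok +
      usage.outputTokens * rate.outputPerMTok) /
    1e6
  );
}

// Executor and advisor tokens are priced separately, then summed.
function attributeCost(
  executor: TokenUsage,
  advisor: TokenUsage,
  executorRate: Rate,
  advisorRate: Rate,
) {
  const executorUsd = costUsd(executor, executorRate);
  const advisorUsd = costUsd(advisor, advisorRate);
  return { executorUsd, advisorUsd, totalUsd: executorUsd + advisorUsd };
}

// Placeholder rates and token counts, for illustration only.
const breakdown = attributeCost(
  { inputTokens: 20000, outputTokens: 3000 },
  { inputTokens: 40000, outputTokens: 1000 },
  { inputPerMTok: 1, outputPerMTok: 5 },
  { inputPerMTok: 5, outputPerMTok: 25 },
);

console.log(breakdown);
```

Note that the advisor's input count can dwarf its output: each call reads the full conversation history, even though it only writes a few hundred tokens of advice.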

How do you add the Claude advisor tool to an existing agent?

TL;DR: Add the anthropic-beta: advisor-tool-2026-03-01 header to your request and include {"type": "advisor_20260301", "name": "advisor", "model": "claude-opus-4-6"} in your tools array. Set max_uses to cap advisor calls — the primary cost control lever. That's the full integration: same endpoint, same SDK version, no orchestration changes required. Your existing agent code stays unchanged.

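import Anthropic from "@anthropic-ai/sdk";

// Reads ANTHROPIC_API_KEY from the environment by default.
const client = new Anthropic();

const response = await client.messages.create(
  {
    model: "claude-sonnet-4-6", // executor: does the actual work
    max_tokens: 4096,
    tools: [
      {
        type: "advisor_20260301",
        name: "advisor",
        model: "claude-opus-4-6", // advisor: consulted on demand
        max_uses: 3, // cap on advisor calls for this request
      },
    ],
    messages: [
      {
        role: "user",
        content:
          "Refactor this Go service to use a worker pool with graceful shutdown.",
      },
    ],
  },
  {
    // The advisor tool is in beta, so the request needs the beta header.
    headers: { "anthropic-beta": "advisor-tool-2026-03-01" },
  },
);

// Print the content blocks returned for this single request.
console.log(response.content);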

See the Anthropic advisor tool docs for curl and Python examples.

Configuring the advisor

max_uses caps how many times the advisor can be called per request. When that limit is hit, further advisor requests return a max_uses_exceeded block and the executor continues without more advice. This is your primary cost control lever. Set it based on task complexity.

caching enables advisor-side prompt caching. Add "caching": {"type": "ephemeral", "ttl": "5m"} if you're expecting three or more advisor calls in a single session. It lets Opus skip re-processing unchanged context on repeat calls, which saves tokens.
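
Putting those two settings together, here is a minimal sketch of a helper that builds the tool definition. The field names come from the examples above. The buildAdvisorTool function and its expectedCalls parameter are hypothetical, and the "three or more calls" threshold is just the guidance from this section turned into a condition.

```typescript
// Shape of the advisor tool entry, using the field names shown in this post.
interface AdvisorTool {
  type: "advisor_20260301";
  name: "advisor";
  model: string;
  max_uses: number;
  caching?: { type: "ephemeral"; ttl: "5m" };
}

// Hypothetical helper: expectedCalls is your own estimate, not an API field.
function buildAdvisorTool(maxUses: number, expectedCalls: number): AdvisorTool {
  const tool: AdvisorTool = {
    type: "advisor_20260301",
    name: "advisor",
    model: "claude-opus-4-6",
    max_uses: maxUses,
  };
  // Only turn on advisor-side caching when repeat calls are likely.
  if (expectedCalls >= 3) {
    tool.caching = { type: "ephemeral", ttl: "5m" };
  }
  return tool;
}

console.log(buildAdvisorTool(2, 1)); // short task: no caching
console.log(buildAdvisorTool(5, 4)); // long session: caching enabled
```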

System prompt guidance. Anthropic's recommended approach is to tell the executor when to call the advisor. Their suggested template:

"You have access to an advisor tool backed by a stronger model. Call the advisor before substantive work — before writing, before committing to an interpretation, before building on an assumption. Also call advisor when you believe the task is complete, before delivering output. On tasks longer than a few steps, call advisor at least once before finalizing."

In practice, most of the value comes from one or two advisor calls per task: once early for orientation, once before finalizing output.

Production note: Priority Tier on the executor model doesn't extend to the advisor. If you're running production workloads, track advisor token usage separately in the usage object. It's broken out by model tier, so cost attribution is clean.

Which model pair should you use: Haiku+Opus or Sonnet+Opus?

TL;DR: For quality-sensitive tasks — coding agents, architecture decisions, complex research — use Sonnet as executor with Opus as advisor. You get near-Opus accuracy for less than Opus alone. For high-volume, cost-sensitive workloads, the Haiku+Opus pair is worth serious consideration: 85% cheaper than Sonnet, and dramatically better quality than Haiku alone.

The announcement focuses on Sonnet+Opus, and for good reason. Sonnet with an Opus advisor scored 74.8% on SWE-bench Multilingual, up from 72.1% for Sonnet alone. That's a 2.7 percentage point gain on a hard coding benchmark. And it cost 11.9% less than running Opus solo for the same tasks.

But the Haiku numbers are more dramatic.

Haiku alone scored 19.7% on BrowseComp, a research-heavy browsing benchmark. Haiku with an Opus advisor scored 41.2%. That's more than double. And this Haiku+Opus pair costs 85% less than running Sonnet for the same task.
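
For anyone who wants to check the arithmetic, here it is spelled out. The scores are the ones quoted above; nothing else is assumed.

```typescript
// Benchmark scores quoted in this post (percent).
const sonnetAlone = 72.1;
const sonnetWithAdvisor = 74.8;
const haikuAlone = 19.7;
const haikuWithAdvisor = 41.2;

// Absolute gain in percentage points for Sonnet on SWE-bench Multilingual.
const sonnetPointGain = sonnetWithAdvisor - sonnetAlone;

// How many times better Haiku does on BrowseComp with the advisor.
const haikuMultiple = haikuWithAdvisor / haikuAlone;

console.log(`Sonnet gain: ${sonnetPointGain.toFixed(1)} points`);
console.log(`Haiku multiple: ${haikuMultiple.toFixed(2)}x`);
```

The Haiku multiple comes out just above 2, which is where "more than double" comes from.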

That 85% number changes budget conversations. If you're running Claude at scale for classification, extraction, or pattern-matching that occasionally needs complex reasoning, the Haiku+Opus pair is worth testing.

Here's a practical decision matrix:

One honest caveat. All of these benchmarks are Anthropic's own. No independent third-party results exist yet. This is a three-day-old beta. And Haiku+Opus scores approximately 29% below Sonnet on general tasks. If your bar is raw Sonnet-level quality, use Sonnet+Opus. If you're currently running Haiku and want a cost-effective upgrade, Haiku+Opus is the move.

For broader context on how Opus and Sonnet compare across real agent sessions, our practical guide to Claude Code covers the model selection tradeoffs in daily development.

When should you NOT use the Claude advisor tool?

TL;DR: Skip the advisor tool for single-turn queries, trivial tasks, and latency-critical paths. It adds the most value in multi-step agentic workflows with real decision points. On simple tasks, the executor won't invoke the advisor anyway — but adding it adds overhead and complexity for no gain.

A few specific patterns where the advisor tool adds no value:

Single-turn queries. If the user asks "Summarize this document" and there's only one step to take, the executor won't invoke the advisor. The tool sits idle. You've added a beta header and a tool definition for nothing.

Trivial mechanical tasks. Data formatting, lookups, regex transformations. These don't have decision points that trigger the advisor. Same result, more complexity.

Already-optimized Opus-only workflows. If you're already running Opus and quality is your only concern, the advisor adds nothing. You're effectively advising Opus with Opus.

Latency-critical paths. There's no extra network round-trip, but Opus generation still takes time. On paths where every 100ms counts, the advisor's internal invocation adds latency you haven't accounted for.

When you need deterministic behavior. The advisor introduces non-determinism. Opus may give different guidance on reruns. If your pipeline requires reproducible outputs, test carefully before relying on advisor calls.

Anthropic's Building Effective Agents guide makes the same point broadly: add complexity only when it demonstrably improves outcomes.
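
One way to apply that list is a small gate in front of your request builder. This is a sketch only: the RequestProfile shape and shouldAttachAdvisor function are hypothetical, and treating reproducibility as a hard "no" is a conservative simplification of the "test carefully" advice above.

```typescript
// Hypothetical description of a request, based on the patterns listed above.
interface RequestProfile {
  multiStep: boolean; // false for single-turn or trivial mechanical tasks
  latencyCritical: boolean;
  needsReproducibleOutput: boolean;
  executorIsOpus: boolean; // already running Opus as the executor
}

function shouldAttachAdvisor(profile: RequestProfile): boolean {
  if (!profile.multiStep) return false; // no decision points to trigger the advisor
  if (profile.executorIsOpus) return false; // advising Opus with Opus
  if (profile.latencyCritical) return false; // Opus generation still takes time
  if (profile.needsReproducibleOutput) return false; // conservative default
  return true;
}

const agentRefactor: RequestProfile = {
  multiStep: true,
  latencyCritical: false,
  needsReproducibleOutput: false,
  executorIsOpus: false,
};

console.log(shouldAttachAdvisor(agentRefactor));
```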

FAQ

What model pairs work with the Claude advisor tool?

Three pairs are currently supported: Claude Haiku 4.5 as executor with Claude Opus 4.6 as advisor; Claude Sonnet 4.6 as executor with Claude Opus 4.6 as advisor; and Claude Opus 4.6 running as both executor and advisor. Any other combination returns an HTTP 400 error. The advisor must always be at least as capable as the executor.
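
Since unsupported pairs come back as a 400, it can be worth checking the pair before you send anything. A minimal sketch: the Sonnet and Opus model IDs are the ones used earlier in this post, the Haiku ID string is an assumption, and isSupportedPair is a hypothetical helper.

```typescript
// [executor, advisor] pairs listed above. The Haiku ID string is an assumption.
const SUPPORTED_PAIRS: ReadonlyArray<readonly [string, string]> = [
  ["claude-haiku-4-5", "claude-opus-4-6"],
  ["claude-sonnet-4-6", "claude-opus-4-6"],
  ["claude-opus-4-6", "claude-opus-4-6"],
];

// Hypothetical client-side check to fail fast before the API rejects the request.
function isSupportedPair(executor: string, advisor: string): boolean {
  return SUPPORTED_PAIRS.some(
    ([supportedExecutor, supportedAdvisor]) =>
      supportedExecutor === executor && supportedAdvisor === advisor,
  );
}

console.log(isSupportedPair("claude-sonnet-4-6", "claude-opus-4-6"));
console.log(isSupportedPair("claude-opus-4-6", "claude-sonnet-4-6"));
```

The second call is the case the last rule above forbids: a weaker model advising a stronger one.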

Does the Claude advisor tool work with Claude Haiku?

Yes. Claude Haiku 4.5 can be the executor with Claude Opus 4.6 as the advisor. In Anthropic's BrowseComp benchmarks, this pair improved Haiku's performance from 19.7% to 41.2% (more than double) while costing 85% less than Sonnet. For high-volume tasks that need occasional complex reasoning, this pair delivers better quality at a fraction of Sonnet's cost.

How much does the Claude advisor tool cost?

You're billed at each model's standard per-token rate. The executor (Sonnet or Haiku) generates at its lower rate. Opus generates the advisory response (~400-700 tokens) at the Opus rate. Total cost typically runs lower than running Opus alone for the same task. Advisor tokens are broken out separately in the usage object for clean cost attribution.

Is the Claude advisor API in beta?

Yes. As of April 2026, the advisor tool requires the anthropic-beta: advisor-tool-2026-03-01 header. It's accessible through the standard Claude API with no special waitlist or application required. Enterprise customers with Zero Data Retention (ZDR) agreements can use it without changing their data handling setup. Contact your Anthropic account team for enterprise-specific arrangements.

The third option

The Opus-or-Sonnet decision used to be a binary tradeoff. You picked quality or you picked cost.

The Claude advisor API gives you a dial. Use Sonnet as your workhorse, bring in Opus on the hard calls, and pay less than you would running Opus full-time. Or go further with Haiku and let Opus double your quality at 85% less than Sonnet's cost.

One header and one tool definition to wire it into an existing agent. Anthropic's advisor tool documentation covers the full specification, including caching options and Anthropic's complete system prompt template.

If you're building visual workflows on top of Claude agents, Builder.io integrates with Claude for AI-powered content and development workflows.

AI Development Environments Fixed What Docker Couldn't

Fri, 03 Apr 2026 18:00:00 GMT

"It works on my machine."

Four words that have haunted software engineering since the dawn of personal computers. You'd think that by 2026, with Docker and Kubernetes and Nix and Dev Containers and an entire platform engineering movement, we'd have put this ghost to rest. We haven't.

A survey of over 650 engineering leaders found that 67% of software teams still can't build and test their dev environment within 15 minutes. And more fuel: 72% of engineers say demands on their time make it hard to build new features, and they only spend 16% of their week actually writing code. A big chunk of the rest? Fighting tooling.

Here's the twist. The fix didn't come from DevOps. It came, almost by accident, from AI.

The cloud-first AI development tools that have exploded over the past year didn't set out to solve environment drift. Agents that spin up their own environments, do the work, and hand you a PR just solved it anyway, as a side effect of their architecture. And that accident might matter more than the code they write.

We never actually escaped the setup tax

You know the drill. New project, new repo, new pain:

Clone. Install dependencies. Discover the README is three versions out of date. Manually configure environment variables. Realize someone's .env.example is missing half the keys. Fix a port conflict with the other project you forgot was running. Wait twelve minutes for npm install to finish. Pray.

And that's the happy path. The one where nothing fundamentally incompatible lurks in your system Python or your Node version or your shell configuration. The one where you don't spend a full afternoon learning that the project secretly requires a specific version of Postgres that conflicts with the one you already have.

This is the core tension of AI orchestration. When your environment is a bespoke snowflake, everything built on top of it inherits that fragility. And now there's something new built on top of it: AI agents.

AI agents turned an annoyance into an emergency

The Anthropic 2026 Agentic Coding Trends Report frames a shift that most of us are already living: development is moving from writing code to orchestrating agents that write code. Developers now use AI in roughly 60% of their work.

But here's the thing about agents. They're less forgiving than you are.

You, a human, can look at a failing npm install and think, "Oh right, I need to switch to Node 20 for this repo." You adapt. You context-switch. You work around it. An AI agent? It either hallucinates a fix that makes things worse, or it just stops. As Coder's VP of Product put it: "Asking an agent to operate in a janky local setup is like asking someone to learn to drive in a car where the steering wheel only sort of works sometimes."

The problem compounds with scale. When you're running multiple agents in parallel, which is increasingly how real work gets done, each agent inherits your local environment's quirks. Different tool installs across machines cause agents to produce different outputs for the same prompt. Parallel runs compete for ports, filesystem state, and memory. Reproducibility doesn't just drift. It evaporates.

If your environment is lying to you, you'll probably notice. If it's lying to your agents, you'll get confidently wrong code in a PR you might approve.

I see this all the time. The other day, my agent rewrote an import path to use a package alias that only existed in the local tsconfig, one that had drifted from the repo's canonical version months ago. The code looked perfectly fine. It passed the agent's own checks. It broke in CI. That's an hour of debugging for something that never should have been possible in the first place.
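
That kind of drift is cheap to detect if you look for it. Here is a minimal sketch that compares the path aliases (the compilerOptions.paths map) from a local tsconfig against the repo's canonical one. The findAliasDrift helper and the sample alias names are hypothetical, and loading the two configs from disk is left out.

```typescript
// Shape of compilerOptions.paths in a tsconfig: alias pattern -> target globs.
type PathMap = Record<string, string[]>;

// Hypothetical helper: reports aliases that differ between the two configs.
function findAliasDrift(canonical: PathMap, local: PathMap): string[] {
  const aliases = Object.keys(canonical).concat(
    Object.keys(local).filter((alias) => !(alias in canonical)),
  );
  const drift: string[] = [];
  for (const alias of aliases) {
    if (!(alias in canonical)) {
      drift.push(`${alias}: only in local config`);
    } else if (!(alias in local)) {
      drift.push(`${alias}: missing from local config`);
    } else if (
      JSON.stringify(canonical[alias]) !== JSON.stringify(local[alias])
    ) {
      drift.push(`${alias}: targets differ`);
    }
  }
  return drift;
}

// Made-up example: the local config carries an alias the repo dropped.
const canonicalPaths: PathMap = { "@app/*": ["src/*"] };
const localPaths: PathMap = {
  "@app/*": ["src/*"],
  "@legacy/*": ["old/src/*"],
};

console.log(findAliasDrift(canonicalPaths, localPaths));
```

An alias that exists only locally is exactly the kind an agent will happily import from, pass its own checks with, and then break CI.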

Docker and Nix were the right idea, with the wrong tradeoff

Containerization was the correct impulse. Docker, Nix, Dev Containers: these tools all recognized that environment consistency is a prerequisite for reliable software, not a nice-to-have.

The problem is that every one of these solutions adds something to your workflow:

Docker: Dockerfiles to write and maintain, image sizes to manage, and on Mac, the perennial filesystem mount performance tax. It works. It also asks a lot of you.
Nix: Technically beautiful for reproducibility. But the learning curve is steep enough to have its own subreddit support group.
Dev Containers: Standardizes nicely, but requires VS Code (or compatible editors) and adds container startup time to every session.
Platform engineering: The enterprise answer. Dedicated teams building internal developer platforms. Effective, if you can afford to staff it.

These are all bolt-on solutions. They layer consistency on top of local development. You still start with a local machine, and you add tooling to make that machine behave consistently. That's better than nothing. But it's not the same as making the problem disappear.

More importantly, every bolt-on solution requires someone to maintain it. Dockerfiles go stale. Nix flakes need updating. Dev Container configs drift. You're trading one maintenance burden for another, and now you have two things that can break instead of one.

Cloud AI agent tools solved this without trying

Cloud-first AI coding tools like Builder, Claude Code, Cursor's background agents, or Devin didn't set out to fix "it works on my machine." They set out to make AI-powered development fast and accessible. But their architecture makes environment inconsistency structurally impossible.

Think about what happens when you use a cloud-based AI agent:

You describe a task or assign an issue
The agent spins up a fresh cloud environment: clean OS, correct dependencies, consistent tooling
It does the work in that isolated container
It opens a PR with the changes
You review the diff

At no point does the agent touch your local machine. At no point does it inherit your .zshrc aliases, your stale Homebrew packages, or the rogue Python 2.7 that's been lurking in your PATH since 2019. You didn't configure anything. You didn't debug anything. You just described what you wanted and reviewed what came back.

This isn't a feature in and of itself. It's just a structural byproduct of moving execution to the cloud. Every run starts from the same clean slate. No leftover state, no version mismatches, no port conflicts. The "it works on my machine" problem doesn't get solved so much as eliminated, because there's no longer a "my machine" in the equation.

The best AI coding tools of 2026 all share this pattern to varying degrees. Cloud execution isn't just a deployment convenience. It's an environment consistency guarantee that you get for free.

What this means for your team

The implications go beyond just "no more env bugs." When there's no setup ritual, the barrier to contributing code drops to nearly zero. This is the shift I described in The AI Software Engineer in 2026: the developer as orchestrator, not the sole gatekeeper of a local environment that only they understand.

Onboarding gets faster. New hires don't spend their first three days fighting tooling. Open source contributors don't bounce off your project because your README assumes a specific OS. And the async bonus is real: fire off a task, close your laptop, get a notification when the PR is ready. Review it from your phone if you want. The cloud environment doesn't care what device you're on.

How we use AI development environments at Builder

I work in developer experience, which mostly means marketing, community, and docs. My relationship to our codebase isn't "I'm assigned sprint tickets." It's "I notice things that are broken because I talk to developers all day."

Previously, "noticing something broken" meant filing a Jira ticket, describing the issue, and waiting for eng to prioritize it. For small paper cuts, that often meant it never got fixed. The overhead of filing, triaging, and assigning a minor annoyance was bigger than the annoyance itself. So you just lived with it. You mentioned it in a Slack thread, someone agreed it was annoying, and the thread died.

Now that same Slack thread is the fix. I can tag the @Builder.io bot directly in the conversation where I'm already discussing the problem. A few minutes later I get a notification with a link to a live preview. I click through to a full dev environment where I can test the fix, poke around, and make further changes myself if I need to.

Sometimes I just confirm the paper cut is gone and approve. Sometimes I dig in and adjust things at the code level. The point is I can operate at whatever level of granularity the situation calls for, without anyone setting anything up for me.

The shift here isn't just speed. It's that the people closest to a problem can now fix it. I see community pain points every day that engineering will never prioritize, because they're small and there's always something bigger on the roadmap. Now those paper cuts actually get addressed, by the person who noticed them, in the conversation where they came up.

This pattern is everywhere at Builder. Designers submit PRs with clean one-line diffs for layout tweaks. PMs fully prototype their own feature ideas, and those prototypes become our real code. Engineers end up reviewing already-working implementations instead of translating Figma specs into code. The tedious handoff work is gone, and developers focus on the parts that actually need engineering judgment.

All of it runs on the same consistent cloud environment. No "which branch are you on." No "did you run npm install." That consistency makes everything else work.

The best tools make problems invisible

The best developer tools don't ask you to fix your environment. They make the problem invisible.

Docker asked you to learn a new tool. Nix asked you to learn a new language. Platform engineering asked your company to hire a new team. AI coding agents didn't ask you anything. They just moved the environment to the cloud, and the problem went away.

That might be AI's most underrated contribution to developer experience. Not the code it writes. Not the PRs it opens. The fact that "it works on my machine" is finally, little by little, becoming irrelevant.

The industry spent two decades trying to solve environment consistency with more tooling. Turns out, the answer was to remove the environment from the equation entirely. Sometimes the best fix is just better architecture.