Why AI-Authored Apps Need a Spec-Driven Substrate

On one side, we have apps built the way we've been building them for years, on a no-code, template-based platform. On the other side, apps drafted in minutes by an AI agent. When we put both in front of users, many picked the AI-authored ones.

The easy conclusion is that users prefer AI-authored apps, so AI should build all of them. The better conclusion is that the preference test measured beauty on first contact, not which approach produces better software once people have to live inside it.

A disclosure. I work at a company building an Agentic Development AI solution that sits on top of an established no-code, template-based application platform. I am not a neutral party. I would rather name that up front than pretend otherwise. The argument I am about to make is the bet we have already placed.

The distinction matters

Vibe coding is AI authorship without spec discipline: an agent, a prompt, and whatever the model happens to emit, with no shared design system, no interaction primitives, no accessibility baseline underneath.

Agentic development is AI authorship that sits on top of a well-engineered substrate—a platform that already carries the design system, the shortcuts, the popover behavior, the responsive rules—and uses the agent to author the task-specific parts on top of that floor.

These are poles on a spectrum; real tools land somewhere between them. What matters is which pole a given tool tilts toward. Both are "AI building apps." They age very differently, and most of what follows is about why.

What the preference really measured

The users were rating beauty. They looked at screens. They noticed hierarchy, color, density, the overall first impression. They were not living inside a real workflow. They were not flipping between this app and ten others across a working day. They were not leaning on keyboard shortcuts that had to behave the same way across a suite of tools. They were rating the cover, because the cover was what we put in front of them. The test rewarded whichever approach produced the prettiest first screen. We should not be surprised by what it measured.

The AI-authored apps winning on looks is a real finding and worth taking seriously. It tells us something specific about what those apps are doing well. But it is not a verdict on which approach produces better software over time. What follows is about what the beauty test did not measure.

What no-code, template-based apps quietly provide

The apps users walked past in the evaluation were not thin. They were carrying a great deal of invisible craft, the kind users only notice when it goes missing.

Design tokens carry a shared theme across the suite. Change the palette, the contrast level, or the density setting in one place, and every app moves with it. Dark mode arrives everywhere at once. Accessibility adjustments reach every screen. Learn the visual grammar of one app and you already know the grammar of every other.
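To make the token idea concrete, here is a minimal sketch in Python. The token names, the values, and the three app names are all invented for illustration; no real platform's token format is implied. The only point is that a theme is a small set of overrides, and every app reads the resolved result instead of hard-coding its own colors.

```python
# Hypothetical token names and values; not taken from any real platform.
BASE_TOKENS = {
    "color.background": "#ffffff",
    "color.text": "#1a1a1a",
    "space.unit": 8,
}

# A theme is only a set of overrides on top of the base tokens.
DARK_THEME = {
    "color.background": "#121212",
    "color.text": "#f0f0f0",
}


def resolve_tokens(base, theme):
    """Return the effective tokens: base values with theme overrides applied."""
    return {**base, **theme}


def render_app(name, tokens):
    """Stand-in for an app: it reads tokens and never hard-codes a color."""
    return f"{name}: bg={tokens['color.background']} text={tokens['color.text']}"


# Switching the whole suite to dark mode is one resolution step, not one per app.
apps = ["invoices", "shipments", "approvals"]
dark = resolve_tokens(BASE_TOKENS, DARK_THEME)
rendered = [render_app(app, dark) for app in apps]
```

Tokens the theme does not mention, such as the spacing unit here, fall through from the base set, which is why a theme change cannot accidentally break layout.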

Keyboard shortcuts travel across apps. Ctrl-Shift-A means the same thing in every tool. Tab order behaves the way it should. Shortcut muscle memory, one of the most quietly load-bearing features in enterprise software, survives the jump between tools.
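One way a suite could enforce that is a single shared registry that apps consult instead of defining their own bindings. The sketch below is an assumption about how such a registry might look, not a description of any real platform; the class, the function, and the action name bound to Ctrl-Shift-A are all hypothetical.

```python
class ShortcutRegistry:
    """Suite-wide map from key chord to a named action (hypothetical design)."""

    def __init__(self):
        self._bindings = {}

    def bind(self, chord, action):
        """Register a chord; refuse to give an existing chord a second meaning."""
        normalized = self._normalize(chord)
        existing = self._bindings.get(normalized)
        if existing is not None and existing != action:
            raise ValueError(f"{chord} already means {existing!r}")
        self._bindings[normalized] = action

    def action_for(self, chord):
        """Return the action bound to a chord, or None if it is unbound."""
        return self._bindings.get(self._normalize(chord))

    @staticmethod
    def _normalize(chord):
        # Treat "Shift-Ctrl-A" and "ctrl-shift-a" as the same chord.
        return frozenset(part.strip().lower() for part in chord.split("-"))


def handle_key(app_name, chord, registry):
    """Every app resolves keys through the same registry."""
    action = registry.action_for(chord)
    return f"{app_name} -> {action}" if action else f"{app_name} -> ignored"


registry = ShortcutRegistry()
registry.bind("Ctrl-Shift-A", "open_assignee_picker")  # invented action name
```

The conflict check in bind is the part that matters: a second app cannot quietly give the same chord a different meaning.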

Popovers land where they should, stay on screen, survive a scroll, and dismiss correctly. This sounds trivial. It is not. Considerable engineering sits behind a popover that just works on a crowded page, near the edge of a viewport, under a sticky header, inside an iframe, on an underpowered device. Users never see any of it. They also never have to think about any of it.
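Even the simplest slice of that problem, deciding where the box goes, takes some care. The sketch below covers only placement: prefer the space below the anchor, flip above when there is no room, and clamp to the viewport while respecting a sticky header. The Rect type, the function name, and the default gap are my own assumptions, and scrolling, iframes, and dismissal are left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    left: float
    top: float
    width: float
    height: float

    @property
    def right(self):
        return self.left + self.width

    @property
    def bottom(self):
        return self.top + self.height


def place_popover(anchor, width, height, viewport, header_height=0.0, gap=4.0):
    """Return the popover Rect: below the anchor if it fits, otherwise above.

    The result is clamped so it stays inside the viewport and below a
    sticky header of the given height.
    """
    min_top = viewport.top + header_height
    max_top = viewport.bottom - height

    below = anchor.bottom + gap
    above = anchor.top - gap - height
    if below <= max_top:
        top = below
    elif above >= min_top:
        top = above
    else:
        # Neither side fits: clamp into the visible band. If the band is
        # shorter than the popover, this pins it just under the header.
        top = max(min_top, min(below, max_top))

    # Align with the anchor's left edge, then clamp horizontally.
    max_left = viewport.right - width
    left = max(viewport.left, min(anchor.left, max_left))
    return Rect(left, top, width, height)


viewport = Rect(0, 0, 800, 600)
# An anchor in open space: the popover opens directly below it.
placed = place_popover(Rect(100, 100, 50, 20), 200, 150, viewport)
# An anchor in the bottom-right corner: the popover flips above and shifts left.
flipped = place_popover(Rect(750, 560, 40, 20), 200, 150, viewport, header_height=48)
```

A vibe-coded app has to rediscover each of these branches on its own, usually after a user reports the one it missed.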

Hardware-accelerated transitions that stay smooth when the page is busy doing something else. Internationalization that holds up when a label gets longer in French. Responsive layouts that adapt rather than demanding a separate mobile build. Accessibility as a baseline rather than a checkbox to remember.

These are the features users assume are there. They notice which ones are missing by Wednesday. They feel the cumulative absence by Friday. None of them reveal themselves on first click.

What templates cost

Templates are limiting. Authors have less freedom than an agentic tool would give them. A template nudges every app toward a common shape, and when the task at hand does not fit that shape, the app can feel generic in a way users can sense. That cost is real. The point is what templates quietly buy in exchange for it.

Where agentic apps genuinely win

The AI-authored apps won on first impression because they earned it. Their advantage is not an illusion.

Free of a template, an AI-authored app can tune its information hierarchy to the specific task. The most important field takes the most important slot. Secondary actions recede instead of crowding the primary one. The chrome (the toolbar, the breadcrumbs, the frame) only appears if the app really needs it. Nothing is defaulted in because every other app has one.

The result feels specific. It feels like the app was designed for this task, not assembled from a catalog. Imagine a shipment-triage screen. The agent puts the exception count, the customer, and the one button the user needs in the top third of the viewport. The 40 metadata fields get pushed into a collapsible drawer underneath. A template would expose them all because the record has them all. The agentic version reads the task and hides the ones the user is not going to touch in the next thirty seconds. That is a real gain, and it is the kind of gain templates struggle to produce.

Some portion of what users liked is novelty. Some portion is genuine fit. Both are real, and we should not squint too hard to separate them. Either way, the agentic apps are doing something our template-based apps are not, and that is not a lesson we get to dismiss.

The dashboard problem, or why cohesion is the scarce resource

Here is where the beauty test starts to mislead.

The test showed users one app at a time. Real users do not work with one app at a time. They work inside a dashboard, a shell, a workspace, surrounded by the other apps they need to do their job. The unit of judgment in daily use is not this app. It is the suite, all at once.

At that level, cohesion becomes the loudest signal in the room. Louder than any individual app's content. Louder than any one app's hierarchy. Inconsistency across a suite is jarring in a way that inconsistency inside a single app is not, and the eye finds it before anything else.

Consider three regimes, ordered by how livable they are across a working week.

Template plus template. Shared grammar. The apps differ in content but agree on theme, spacing, shortcuts, motion, and popover behavior. Any single screen might be less beautiful than an agentic counterpart. Across the suite, the pieces fit. The user stops noticing the software and starts noticing their job.

Template plus vibed. One template-based app, one vibe-coded app, side by side. A jarring seam. One app follows the shared grammar. The other visibly does not. The user finds the seam before they find the content on either side of it. If the vibed app is beautiful, the seam is the first thing the eye lands on when the dashboard loads.

Vibed plus vibed. Cacophony. Every app picks its own dialect, its own spacing, its own keyboard shortcuts, its own idea of what a button looks like, its own opinion on where popovers should land. Each app might be lovely on its own. Assembled into a dashboard, they become exhausting to use every day. Users spend cognitive budget on translating between dialects when they should be spending it on their actual work.

This reframes what the no-code platform's rigor is really providing. It is not just craft inside a single app; it is cohesion across apps, which only surfaces in multi-app environments and compounds over time. Companies build internal tools to run their processes together as a connected whole. At that scale, cohesion is not a decoration. It is the job.

The agentic app that won the preference test was not rated on cohesion. It was rated on its own merits, one screen at a time.

How the two approaches age

Cohesion across apps is the spatial half of the problem. There is a temporal half too, and it points in the same direction.

What users do not anticipate, because they have not yet lived with the app, is which features they will reach for and find missing. A shortcut that does not do the same thing here. A dark mode that never arrived. A form that does not tab the way every other form does. None of these complaints surface in a head-to-head of screenshots. They surface at 3:30 on a Thursday, when a user is trying to finish something before a meeting.

Over longer horizons, the gap widens. No-code, template-based investments compound. The platform's rigor improves over time. When the platform added support for freezing and reordering table columns with per-user preferences, every table in every app in the suite got the feature automatically—no user or app team had to lift a finger. A new accessibility requirement lands, a better interaction primitive ships, a smarter responsive system arrives, and every app in the suite inherits it at once. Work done once pays out across every app that relies on the platform.
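The mechanism behind that inheritance can be sketched in a few lines. In this hypothetical model, apps only declare which columns a table has, and the platform owns rendering, so a feature such as per-user column order and freezing lives in one function. The class, the spec format, the app names, and the user name are all assumptions made for illustration.

```python
# Each app declares what it needs; none of them implements a table.
APP_SPECS = {
    "invoices": {"table": ["number", "customer", "amount"]},
    "shipments": {"table": ["tracking_id", "carrier", "status"]},
}


class Platform:
    """Hypothetical substrate that owns table rendering for every app."""

    def __init__(self):
        # (user, app) -> {"order": [...], "frozen": [...]}
        self.user_prefs = {}

    def set_prefs(self, user, app, order=None, frozen=None):
        self.user_prefs[(user, app)] = {"order": order or [], "frozen": frozen or []}

    def render_table(self, user, app):
        columns = list(APP_SPECS[app]["table"])
        prefs = self.user_prefs.get((user, app), {"order": [], "frozen": []})
        # Per-user order first, then any columns the preference did not mention.
        ordered = [c for c in prefs["order"] if c in columns]
        ordered += [c for c in columns if c not in ordered]
        frozen = [c for c in ordered if c in prefs["frozen"]]
        scrolling = [c for c in ordered if c not in frozen]
        return {"frozen": frozen, "scrolling": scrolling}


platform = Platform()
platform.set_prefs(
    "dana", "shipments",
    order=["status", "tracking_id", "carrier"],
    frozen=["status"],
)
shipments_view = platform.render_table("dana", "shipments")
invoices_view = platform.render_table("dana", "invoices")
```

Neither entry in APP_SPECS changed to gain the feature. The invoices table, which has no saved preferences, simply renders in its declared order and can be customized the moment a user asks.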

Vibed investments decay. Each app is its own island. Six months later, when a new theme rolls out, a new regulation lands, or a new browser quirk appears, every vibed app is its own migration project. In principle you can re-prompt the agent to regenerate each one, but now you are betting that the regenerated version preserves the muscle memory users have built up—the shortcut that was there last week, the layout they had learned. Regeneration is cheap. Regeneration without drift is not.
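Drift of this kind is at least detectable, even if it is not free to prevent. As a rough sketch, assume each generated version of an app can export its shortcut bindings as a plain mapping from chord to action. That export format, the function name, and the bindings below are all hypothetical. Comparing last week's map against the regenerated one lists what users would trip over.

```python
def find_drift(previous, regenerated):
    """Compare two {chord: action} maps and list what users would trip over."""
    drift = []
    for chord, action in sorted(previous.items()):
        new_action = regenerated.get(chord)
        if new_action is None:
            drift.append(f"{chord}: removed (was {action})")
        elif new_action != action:
            drift.append(f"{chord}: now {new_action} (was {action})")
    return drift


# Invented bindings for two generations of the same app.
last_week = {"Ctrl-Shift-A": "assign", "Ctrl-K": "search"}
this_week = {"Ctrl-Shift-A": "archive", "Ctrl-J": "search"}
report = find_drift(last_week, this_week)
```

A check like this has to be run and acted on for every vibed app separately, which is the migration cost in miniature.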

A fair objection here is that the substrate ages too. Decisions baked into a platform five years ago can calcify into constraints that no longer fit. That is true, and it is the real risk on the template side. The asymmetry is that when the substrate is fixed, every app inherits the fix; when a vibed app is fixed, the fix travels nowhere. A substrate can be wrong, but it can be wrong in one place and corrected in one place. A thousand vibed apps can be wrong in a thousand different ways.

The short-term question ("which do users prefer?") and the long-term question ("which approach holds up?") have different answers today. Over a year, the answers converge.

The route forward

The real choice isn't no-code template apps versus AI-authored apps. It's vibe coding versus spec-driven agentic development. Spec-driven agentic development draws on exactly the craft we built into no-code template apps over years: the theme, the shortcuts, the popover behavior, the accessibility baselines, the responsive rules. Those hard-won lessons become the substrate. The agent's freedom moves up the stack, where it belongs.
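What "the agent's freedom moves up the stack" could look like in practice is sketched below. It assumes the agent emits a declarative layout spec and the substrate checks it against a fixed vocabulary of components and tokens. The spec shape, the component names, and the token names are all invented for this example.

```python
# Hypothetical substrate contract: the primitives and tokens an agent may use.
ALLOWED_COMPONENTS = {"field", "button", "drawer", "table"}
ALLOWED_TOKENS = {"color.primary", "color.text", "space.unit"}


def validate_spec(spec):
    """Return a list of violations; an empty list means the spec is accepted."""
    problems = []
    for index, node in enumerate(spec.get("layout", [])):
        component = node.get("component")
        if component not in ALLOWED_COMPONENTS:
            problems.append(f"layout[{index}]: unknown component {component!r}")
        for prop, value in node.get("style", {}).items():
            if value not in ALLOWED_TOKENS:
                problems.append(
                    f"layout[{index}].{prop}: raw value {value!r}, expected a token"
                )
    return problems


# The agent still decides what appears and in what order.
agent_spec = {
    "layout": [
        {"component": "field", "name": "exception_count"},
        {"component": "button", "name": "resolve", "style": {"background": "color.primary"}},
        {"component": "drawer", "name": "metadata"},
    ]
}

# A vibed layout invents its own control and its own color.
vibed_spec = {
    "layout": [
        {"component": "radial_menu", "style": {"background": "#ff00aa"}},
    ]
}

accepted = validate_spec(agent_spec)
rejected = validate_spec(vibed_spec)
```

The first spec keeps the task-specific hierarchy from the shipment-triage example and passes. The second is rejected twice over, once for the bespoke control and once for the raw color, which is exactly the freedom the substrate is designed to withhold.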

Total freedom is the very thing that makes vibed apps feel specific, and the same thing that prevents them from working well with the apps around them. Screenshots reward freedom. Working weeks reward constraints. A spec-driven agent trades some of the first for a lot of the second.

I want to be honest about the cost. A spec-driven agent is less free than an unconstrained one. The substrate rules out some of the moves a vibed app would make—an unusual control, a bespoke motion, a layout that breaks the grid because the task really would be better off without it. Some of those moves are exactly the ones that won the first-impression test. Spec-driven agentic development gives up some of the wildest moves on purpose, and a reader who values those moves over cohesion is not being irrational. I think the trade is worth it for software people use every day.

This is the bet my company has placed. The users who preferred the agentic apps were not wrong. The apps they were comparing against were not wrong either. The synthesis is what's worth building: agentic authorship on a spec-driven substrate that inherits what the template era quietly figured out.

Beauty on arrival is free. Cohesion over time is earned.

Spec-driven agentic development is how AI-authored apps earn it.
