Most teams that run Playwright at scale eventually hit the same wall. The first hundred tests are easy. The next thousand are a maintenance treadmill. UIs ship every week, selectors drift, brittle assertions fail at three in the morning, and the team's response calcifies into "just re-run it". Tests get quarantined. Coverage shrinks. The pipeline starts lying about whether the build is healthy.
We have been working on a different shape for that workflow. Drive the test once with an AI agent, snapshot the path it took into a real Playwright script, run that script forever after at deterministic-test cost, and let AI step back in only at the two moments where it actually adds value: when the test breaks, and when you want to change what it does.
This post walks through what we shipped, how it changes the cost curve, and where it sits relative to the recently announced Playwright AI agents.
The cost of running AI tests on every build
Running an AI agent against your app is a great way to author the first version of a test. It is a terrible way to run that test five hundred times a week in CI.
A single AI-driven browser run lands somewhere between a few cents and around a dollar, depending on how long the run takes, how chatty the model is, and how many screenshots the agent inspects on the way through. That cost is fine for exploration. It is painful for regression. A team running 200 tests on every merge to main, ten times a day, on three branches, is paying for 6,000 AI runs a day; even at a few cents a run, that is four-figures-per-month territory on inference alone, before infrastructure.
Deterministic Playwright runs cost fractions of a cent. They finish faster. They report exactly the same outcome on every run, which is the actual point of regression testing. The trick is getting from a flexible AI agent to a deterministic Playwright script without losing the part that made the AI agent useful in the first place. The ability to find a path through an app you did not pre-script.
The flow we built: one AI run, one click, one Playwright script
The starting point is the AI agent. You write a plain-English test plan ("log in, create a booking, confirm the booking shows in the dashboard"), the agent runs it against your app, and you watch it pass or fail. So far this is a plain AI test.
When the run passes, you click Generate Script. We turn the AI run into a real Playwright test file, with concrete selectors, structured assertions, and a clean spec layout. The output is a regular Playwright spec, not some opaque blob. You can read it, you can fork it, you can check it into your repo and run it from your own CI exactly as you would any hand-written test.
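To make "a regular Playwright spec" concrete, here is a minimal sketch of the shape a generated file takes, using the booking flow from the test plan above. The file name, URLs, locators, and credentials are illustrative assumptions, not the generator's literal output; the real file depends on your app and your plan.

```ts
// booking-flow.spec.ts — illustrative sketch only; naming, selectors, and
// annotations in a real generated file depend on your app and test plan.
import { test, expect } from '@playwright/test';

test('user can create a booking and see it in the dashboard', async ({ page }) => {
  // Acceptance criterion: the user can log in
  await page.goto('https://app.example.com/login');
  await page.getByLabel('Email').fill('qa@example.com');
  await page.getByLabel('Password').fill(process.env.TEST_PASSWORD ?? '');
  await page.getByRole('button', { name: 'Log in' }).click();
  await expect(page.getByRole('heading', { name: 'Dashboard' })).toBeVisible();

  // Acceptance criterion: the user can create a booking
  await page.getByRole('link', { name: 'New booking' }).click();
  await page.getByLabel('Date').fill('2025-07-01');
  await page.getByRole('button', { name: 'Confirm booking' }).click();
  await expect(page.getByText('Booking confirmed')).toBeVisible();

  // Acceptance criterion: the booking shows in the dashboard
  await page.goto('https://app.example.com/dashboard');
  await expect(page.getByRole('row', { name: /2025-07-01/ })).toBeVisible();
});
```

Because it is plain @playwright/test code, `npx playwright test booking-flow.spec.ts` runs it like any other spec in the repo.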
A few useful properties fall out of this:
- Re-runs skip the AI entirely. Once the script exists, every subsequent run is a deterministic Playwright run. No LLM in the hot path. No model drift between commits.
- The script comes with the original acceptance criteria attached. When the test fails, the report shows you which user-facing acceptance criterion broke, not just which selector blew up. The same view you got from the AI run, applied to a deterministic run.
- It is real Playwright. Anyone on the team who knows Playwright can read it, change it, or extend it without learning a new DSL.
That is the happy path. The interesting questions start when something breaks. The script generation docs walk through the same flow with the actual UI affordances if you want to follow along inside the product.
When the UI changes: the healing agent
The thing that kills hand-written E2E suites is not the initial write. It is week 32, when marketing renames a button from "Continue" to "Get started", or the dev team replaces an inline modal with a slide-over panel, and forty-seven tests light up red overnight.
We hand that case to a healing agent. When a generated script fails, the healer takes a look at the failure, works out what changed, and produces a patched version of the script for review. You see the proposed diff right in the run report. You accept it, the script is updated, the next CI run goes green. You reject it, the patch is dropped and you go fix the test by hand.
We are deliberately leaving the internals of that loop out of this post. From the outside, the workflow has three steps:
- The script fails on a CI build.
- A patched version is offered alongside the failure.
- You decide whether to apply it.
You stay in control of what lands in your repo. The healer is a suggestion engine, not an autopilot. Your test suite does not silently rewrite itself.
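To give a feel for the scale of change you are usually asked to approve, here is a hypothetical heal for the "Continue" to "Get started" rename above. The real proposal shows up as a diff in the run report; the URL, flow, and selectors in this sketch are assumed for illustration.

```ts
import { test, expect } from '@playwright/test';

test('checkout continues past the plan step', async ({ page }) => {
  await page.goto('https://app.example.com/checkout');

  // Before (now failing): the button no longer exists under its old name.
  // await page.getByRole('button', { name: 'Continue' }).click();

  // Proposed heal: same action, updated accessible name.
  await page.getByRole('button', { name: 'Get started' }).click();

  await expect(page).toHaveURL(/payment/);
});
```

Accepting the patch lands exactly that kind of change in the script; rejecting it leaves the spec untouched.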
When you want to change behavior: AI refinement
Healing is for breakages. Refinement is for intentional change.
If you want a generated script to assert on a different field, follow a new flow, or branch on a configuration flag, you can chat with the script. Describe the change in plain English, the script is updated, and you review the diff before saving. Each round of refinement is its own conversation turn, so you can iterate without restarting from scratch every time.
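As a hypothetical refinement turn, suppose you ask for "also assert that the confirmation email shown on the success page matches the account email." The names below (the `confirmation-email` test id, the address) are assumptions for illustration; the actual diff depends on your app and your wording.

```ts
import { test, expect } from '@playwright/test';

test('booking confirmation shows the right details', async ({ page }) => {
  await page.goto('https://app.example.com/bookings/latest');

  // Existing assertion, unchanged by the refinement:
  await expect(page.getByRole('heading', { name: 'Booking confirmed' })).toBeVisible();

  // New assertion added by the refinement turn:
  await expect(page.getByTestId('confirmation-email')).toHaveText('qa@example.com');
});
```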
In practice this turns "go modify the test file in your editor" into a quick chat exchange. Useful in three places:
- Adapting a script to a new feature in the same area of the product without re-recording the whole flow.
- Splitting a long script into focused tests that each cover one acceptance criterion.
- Tightening assertions after a real-world bug exposes that the original assertion was too loose.
Each refinement burns a small amount of credit. Healing is invoked only on failure. Most weeks, a stable test runs for free.
How this compares to Playwright AI agents
Microsoft has been building official Playwright AI agents that run inside Playwright itself. They are interesting tools, and they are worth watching. They sit in a different spot in the workflow than what we built.
The short summary, with the long version saved for a future post: Playwright's agents focus on having the AI in the loop at run time. The goal in that direction is letting the AI explore your app and decide what to do as it goes. That is genuinely useful, especially for exploratory testing. The cost shape is closer to "run AI on every test, every time" than "run AI once, run deterministic forever".
We chose the second shape. The reasoning, the tradeoffs, and the parts of Playwright's approach we still pull from will get a post of their own. For this one, the key point is just that there is more than one workflow that makes sense for AI assisted Playwright testing, and the right one for your team depends on whether you optimize for exploration or for regression.
What this looks like at the cost level
A typical 200-test suite running on every main-branch merge, with twenty merges a day, gives a useful baseline:
- Pure AI runs, every time: ten cents per test on the low end. 200 × 20 × 10 cents lands around $400 per day, before retries.
- Generated-script runs: a fraction of a cent per test, plus your CI minutes. Effectively zero AI spend on a passing day.
- Healing: invoked only when a test fails. A few cents per heal when it does. A healthy suite might see a handful of heal events in a week.
- Refinement: a small per-turn cost when you intentionally change a script. Bounded by you, not by the suite size.
The break-even moment is fast. Most teams that adopt the workflow start seeing the savings inside the first week, because the running cost of a passing suite drops to roughly the cost of your CI minutes.
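A back-of-the-envelope version of that break-even, using the figures above. The deterministic per-run cost and the one-time authoring range are assumptions; plug in your own numbers.

```ts
// Rough comparison using the figures from the list above; adjust to your setup.
const tests = 200;
const mergesPerDay = 20;
const runsPerDay = tests * mergesPerDay;                 // 4,000 runs/day

const aiCostPerRun = 0.10;                               // low-end AI-driven run, USD
const scriptCostPerRun = 0.002;                          // assumed "fraction of a cent"

const aiEveryTime = runsPerDay * aiCostPerRun;           // ≈ $400/day
const generatedScripts = runsPerDay * scriptCostPerRun;  // ≈ $8/day

// One-time cost to author the 200-test suite with the AI agent,
// at a few cents to around a dollar per run:
const authoringOnce = [tests * 0.05, tests * 1.0];       // ≈ $10–$200, recouped within a day

console.log({ aiEveryTime, generatedScripts, authoringOnce });
```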
Why this changes the relationship with flaky tests
Flaky tests are usually not "the test is broken on Tuesdays". They are "the test was written against last quarter's UI, and the UI has shifted three times since". A team that does not address that drift ends up with a quarantine list, a policy of re-running anything that goes red, and a slow erosion of trust in the suite. We wrote a field guide on flaky tests earlier this year that goes into the upstream causes in more detail.
A generated script that comes with a healing path attached changes the math. The first time the test breaks, you do not write off the failure as flaky. You look at the proposed patch, decide whether the underlying behavior is still right, and apply or reject in one click. The maintenance cost of a thousand-test suite stops scaling with the rate of UI change.
This is not magic. The healer can be wrong. Some failures are real bugs that you want to see, not patch over. The point is that the default response to a red CI run stops being "rerun and hope" and starts being "look at the proposed diff and make a decision". That single change reclaims the engineering time that flakiness used to consume.
The end-to-end loop
Pulling it together, the workflow looks like this:
- Write a plain-English test plan.
- Run it once with the AI agent until it passes.
- Click Generate Script. Get a real Playwright spec.
- Check the spec into your repo. Run it in your CI from then on.
- When the script breaks, review the proposed heal. Accept or reject.
- When you want different behavior, chat with the script. Review the diff. Save.
Authoring is fast because the AI does the discovery. Running is cheap because deterministic Playwright is cheap. Maintenance is bounded because healing handles UI drift and refinement handles intentional change.
Try it
The feature is live in the test plan reports today. On every paid plan that includes script generation, healing comes bundled with it, and the refinement chat is available alongside the diff view. There is more on the roadmap, including smarter healing, parallel generation across devices, and deeper acceptance-criteria reporting, but the core loop is shipping today.
If you want to put it in front of your own app, start a free trial or browse the pricing tiers. The script generation docs cover the full workflow with screenshots, including how healing and AI refinement plug into the run report.
