Automated QA: let an AI agent test your iOS app

Before you start

This guide assumes:

FlowDeck is installed, your trial is active, and the FlowDeck skill pack is installed for at least one agent (Claude Code, Codex, OpenCode, Cursor, or Gemini). If not, follow the getting-started guide first.
An iOS project that builds and runs on the simulator. The agent needs something to drive.
15 to 30 minutes you can spend away from the keyboard. The agent runs unattended.

By the end you'll have a workflow for handing an iOS app to an AI agent and getting back a structured QA report covering layout review, feature testing, edge-case probing, and any fixes the agent committed along the way.

Step 01

Pick the right scope

Autonomous QA works best on a defined surface: one app, a few flows, one session. It struggles with whole-product test plans in one shot. Decide what's in scope before you write the prompt.

Good targets: a single feature you just shipped, a screen that gets edited often, an app's full happy path end to end.
Marginal: every screen of the entire app. Doable for small apps (5-10 screens), but sessions run long and the agent's attention drifts.
Bad targets: multi-user flows, hardware-dependent features (camera, ARKit), anything requiring real backend writes against shared infrastructure.

The agent will follow your prompt literally. A prompt that says "test the app" gets a generic sweep. A prompt that names the flows ("add a city, remove a city, switch between cities") gets a focused, useful run.

Step 02

Write the QA prompt

The shape of a good QA prompt is consistent. Six sections, each pulling its weight:

Act as a QA engineer and test this application using FlowDeck UI automation.

1. Review the design and layout of every screen.
2. Test the core interactions:
   - Get forecast for the current city
   - Find cities
   - Add cities
   - Check forecast for multiple added cities
   - Remove a city
3. Test UI stability, accuracy, and reliability.
4. Identify edge cases (malformed input, rapid input, empty states, misuse).
5. Fix every bug you find. Verify each fix by re-running the relevant flow.
6. Write a summary report: screens reviewed, features tested, bugs fixed,
   open observations.

What's load-bearing:

"Act as a QA engineer." Sets the frame. The agent role-plays a different mindset than "implement this feature." It thinks about what could go wrong instead of what to add.
Explicit flow list. The agent's most common failure mode is doing too little. Naming the flows forces coverage.
"Fix every bug" + "verify." Without this, the agent flags bugs but stops. With it, the agent patches the code, re-runs the flow, and confirms the fix in the same session.
"Summary report." The output you actually care about. Without it, the agent's logs are scattered across the conversation; with it, you get one structured digest at the end.

Tune the prompt to the app. For an e-commerce app, the flow list is cart + checkout + order history. For a chat app, message send + thread navigation + offline reconnect. The structure stays the same; the specifics change.
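One way to keep the structure fixed while swapping the specifics is to generate the prompt from a flow list. The sketch below is a hypothetical helper, not part of FlowDeck: the function name build_qa_prompt is an assumption, and the wording is copied from the example prompt above.

```shell
# Hypothetical helper (not part of FlowDeck): print the six-section QA
# prompt with one bullet per flow name passed as an argument.
build_qa_prompt() {
  printf '%s\n' 'Act as a QA engineer and test this application using FlowDeck UI automation.'
  printf '\n'
  printf '%s\n' '1. Review the design and layout of every screen.'
  printf '%s\n' '2. Test the core interactions:'
  for flow in "$@"; do
    printf '   - %s\n' "$flow"          # one indented bullet per named flow
  done
  printf '%s\n' '3. Test UI stability, accuracy, and reliability.'
  printf '%s\n' '4. Identify edge cases (malformed input, rapid input, empty states, misuse).'
  printf '%s\n' '5. Fix every bug you find. Verify each fix by re-running the relevant flow.'
  printf '%s\n' '6. Write a summary report: screens reviewed, features tested, bugs fixed,'
  printf '%s\n' '   open observations.'
}
```

For the e-commerce case, a call such as build_qa_prompt "Cart" "Checkout" "Order history" prints a prompt with those three flows under section 2; only the arguments change from app to app.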

Step 03

Hand it off and walk away

Paste the prompt into your agent session and let it run. A typical session takes 15 to 30 minutes for an app with 5 to 10 screens. Time scales with app complexity, not with test count.

What the agent runs under the hood (you don't need to type any of this; it's what your agent will issue):

flowdeck run -S "iPhone 16"                            # launch the app
flowdeck ui simulator session start -S "iPhone 16"     # start a background capture loop
flowdeck ui simulator screen --json                    # screenshot + accessibility tree
flowdeck ui simulator tap "Add City"                   # tap by label, not coordinates
flowdeck ui simulator type "London"                    # type into the focused field
flowdeck ui simulator swipe up                         # scroll
flowdeck ui simulator assert visible "London"          # verify the element exists
flowdeck logs <app-id>                                 # read runtime output

The loop is mechanical: see (screen + tree), act (tap / type / swipe), verify (assert or screen again), repeat. When something fails, the agent reads the logs, traces back to source, edits, rebuilds, re-runs the flow.
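For a single flow, that loop can be sketched as a shell function built only from the commands listed above. The function name add_city_flow and the FLOWDECK override variable are assumptions for illustration, and the sketch assumes each command exits non-zero on failure, which this guide does not confirm. The agent issues these calls itself, so treat this as a picture of the sequence rather than something you need to run; a real flow will likely need more steps.

```shell
# Sketch of one see/act/verify pass for the "add a city" flow.
# FLOWDECK is a hypothetical override so the sequence can be dry-run
# (for example FLOWDECK=echo); it defaults to the real CLI.
FLOWDECK="${FLOWDECK:-flowdeck}"

add_city_flow() {
  city="$1"
  "$FLOWDECK" ui simulator screen --json || return 1            # see
  "$FLOWDECK" ui simulator tap "Add City" || return 1           # act
  "$FLOWDECK" ui simulator type "$city" || return 1             # act
  "$FLOWDECK" ui simulator assert visible "$city" || return 1   # verify
}
```

Stopping at the first failing step mirrors what the agent does next: it reads the logs and traces the failure back to source before retrying the flow.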

Because the agent matches by accessibility label rather than coordinates, the same prompt works across screen sizes and survives UI changes. A renamed button is still the same button if the label updates with it.

For long sessions, the background capture session (started above) writes a fresh screenshot + tree to ./.flowdeck/automation/sessions/<id>/ every ~500ms. The agent reads those files instead of calling screen on every turn, which keeps token usage and latency down across long runs.
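If you want to look at what the capture loop has written, a small helper can pick out the newest file in a session directory. This is a hedged sketch: the function name latest_capture is hypothetical, and because this guide does not specify the file names FlowDeck writes, the helper relies only on modification time.

```shell
# Hypothetical helper: print the most recently modified file in a capture
# session directory. The file names FlowDeck writes are not specified in
# this guide, so nothing here depends on them.
latest_capture() {
  dir="$1"
  newest=$(ls -t "$dir" 2>/dev/null | head -n 1)   # newest entry first
  [ -n "$newest" ] || return 1                     # empty or missing directory
  printf '%s/%s\n' "$dir" "$newest"
}
```

Pass it the session directory, for example the ./.flowdeck/automation/sessions/<id>/ path with your own session id in place of <id>.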

Step 04

Read the report

The agent's output follows the structure of the prompt. A typical report has five parts:

Screens reviewed. Each screen with a pass/fail and any layout notes: crowded labels, low-contrast text, missing accessibility labels, off-by-one alignment.
Features tested. Each named flow with the result. "Add city → London added successfully. Forecast loaded in 1.2s. Visible in the city list."
Edge cases. The probing pass. "Tapped Add City three times in 200ms; the second and third taps produced duplicate entries." "Empty search field showed 'No results' but the search button was still enabled."
Bugs fixed. The patches the agent committed. File + line + description of the change + a verification note ("re-ran the remove-city flow; the index bug no longer reproduces").
Observations. Things the agent noticed but didn't change. UX choices that aren't clearly bugs ("the Celsius display shows '20°' but Celsius users typically expect '20° C' as the suffix"). These are for you to decide on.

The fixed-bugs and observations sections are the most valuable. Fixed bugs need a human review before you ship them; observations need a product decision.
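If you save the report to a file, a quick completeness check can confirm that every part is present before you read it in detail. The sketch below is hypothetical: check_report_sections is not a FlowDeck command, and it assumes the agent uses the section names above verbatim, which depends on how your prompt words the report request.

```shell
# Hypothetical check: print any expected section name that does not appear
# in a saved report, and return non-zero if one is missing. Assumes the
# agent uses these section names verbatim, which depends on your prompt.
check_report_sections() {
  report="$1"
  missing=0
  for section in "Screens reviewed" "Features tested" "Edge cases" "Bugs fixed" "Observations"; do
    if ! grep -qi "$section" "$report"; then   # case-insensitive match anywhere in the file
      echo "missing section: $section"
      missing=1
    fi
  done
  return "$missing"
}
```

A missing section is usually a sign that the run was cut short or that the prompt did not ask for that part of the report.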

Step 05

Decide which fixes to keep

The agent's autonomous fixes are the most efficient part of the loop and the riskiest output. Treat each one as a PR-grade change:

Review the diff. The agent's patches are usually small and local, but always check for unrelated edits. Run git diff on the affected files before approving anything.
Re-run the flow yourself. The agent claims it verified the fix. Trust but verify: run the flow manually once, especially if the bug was in state management or a non-obvious edge case.
Add a regression test. If the bug was worth fixing autonomously, it's worth pinning with an XCUITest case or a unit test so the next refactor doesn't reintroduce it. Ask the agent to write the test alongside the fix.

If a fix is wrong, revert it. The session log stays intact in your terminal for reference, and the next agent run starts from a clean state.
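A minimal sketch of the review and revert steps, assuming the agent's edits are still uncommitted in your working tree (if the agent created commits, diff against the commit you started from instead). The function names are hypothetical; the git commands themselves are standard.

```shell
# Sketch, assuming the agent's edits are still uncommitted in the working tree.
review_agent_changes() {
  git status --short   # which files were touched, including unexpected ones
  git diff --stat      # how large each change is
}

# Discard the working-tree edits to a single file.
revert_agent_fix() {
  git checkout -- "$1"
}
```

Run review_agent_changes first to spot unrelated edits, read the full git diff for each file you intend to keep, and call revert_agent_fix with the path of any file whose patch you reject.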

Patterns that make this useful in practice

Run automated QA on every PR: Wire the prompt into a script your CI can run on the simulator after a successful build. The agent runs unattended, attaches the report as a PR comment, and flags any bugs it found or fixed. Catches what XCUITest misses, on every change, before review.
Use a stable seed dataset: Autonomous testing is sensitive to starting state. If the app's first-run state is empty, the agent has to seed data before it can test flows. Provide a deterministic seed (a fixture, a mock backend, or a launch argument) so the agent's runs are reproducible.
Run on the smallest device too: Run the same prompt against iPhone SE in addition to your primary target. Small-screen layouts hide bugs the agent will find. Same prompt, swap the simulator name.
Pair with snapshot tests: Autonomous QA catches functional regressions; snapshot tests catch visual ones. The two stacks are complementary. Use snapshot testing for pixel-level UI stability and autonomous QA for "does this app actually work."
Keep the prompt in the repo: The QA prompt is part of your test surface. Check it into tests/qa-prompt.md alongside your XCUITest suite. Iterate on it the same way you iterate on test code, and every team member starts the agent from the same baseline.
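Several of these patterns combine naturally: a script can read the checked-in prompt and run it once per simulator. The sketch below is hypothetical throughout. run_qa_matrix is an invented name, and QA_AGENT_CMD stands for whatever command runs your agent headlessly and reads a prompt on stdin; this guide does not specify one, so check your agent's documentation. The appended device instruction is also an assumption about how you would tell the agent which simulator to use.

```shell
# Hypothetical wrapper: run the checked-in prompt once per simulator name.
# QA_AGENT_CMD is an assumption: any command that reads a prompt on stdin
# and runs your agent without human approval.
run_qa_matrix() {
  prompt_file="$1"
  shift
  [ -n "${QA_AGENT_CMD:-}" ] || { echo "set QA_AGENT_CMD first" >&2; return 2; }
  for device in "$@"; do
    {
      cat "$prompt_file"
      printf '\nRun every step on the "%s" simulator.\n' "$device"
    } | $QA_AGENT_CMD || return 1   # unquoted on purpose so the command can carry flags
  done
}
```

With the prompt stored in tests/qa-prompt.md, a CI step could call run_qa_matrix tests/qa-prompt.md "iPhone 16" "iPhone SE" after a successful build. Setting QA_AGENT_CMD=cat gives a dry run that prints the prompts instead of starting an agent.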

When things go wrong

The agent says "I cannot test the UI" or asks for screenshots: The skill pack isn't loaded. Quit and restart the agent (skills load on session startup, not mid-session). Check ~/.claude/skills/ (or your agent's equivalent) for a flowdeck directory. Reinstall via flowdeck -i → A if missing.
The agent gets stuck on the same screen: Usually a label ambiguity: two elements match the tap target. Tell the agent explicitly in the prompt: "tap by accessibility identifier, not label, when there's ambiguity." Or hand it flowdeck ui simulator find "..." to inspect the tree before tapping.
The agent fixes a bug and breaks something else: The agent didn't verify the broader app state after its fix. Add an explicit verification step to your prompt: "after each fix, re-run the smoke flow (open app → primary screen → quick interaction) to confirm nothing regressed." Catches collateral damage.
The agent reports too many false positives: Usually a too-broad prompt ("find all UX issues"). Narrow it: "flag UX issues only when they would block a typical user from completing the listed flows." Calibration takes 2-3 runs to dial in.
The session times out before finishing: The run took longer than your agent's session limit. Either split the prompt into two runs (flows first, edge cases second) or run the agent with a longer session window. Large apps with 15+ screens benefit from the split.
The agent's report is fine but no bugs were fixed despite obvious bugs: The agent's tool permissions don't include code-editing. For Claude Code, ensure the project allows file edits. For headless agent runs (no human approval), allow autonomous edits in the session config. Without write permissions, the agent observes but can't fix.
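The skill-pack check from the first item above is easy to script. This is a small sketch with a hypothetical function name; the default path is the Claude Code location mentioned in that item, and other agents will keep their skills elsewhere.

```shell
# Check whether a flowdeck directory exists under the agent's skills folder.
# The default path is the Claude Code location named above; pass another
# path as the first argument for a different agent.
check_flowdeck_skill() {
  skills_dir="${1:-$HOME/.claude/skills}"
  if [ -d "$skills_dir/flowdeck" ]; then
    echo "flowdeck skill found in $skills_dir"
  else
    echo "flowdeck skill missing from $skills_dir" >&2
    return 1
  fi
}
```

If the check fails, reinstall the skill pack and restart the agent before retrying, since skills load only at session startup.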

Annex

How this compares to XCUITest and snapshot testing

For reference, here are XCUITest and autonomous QA side by side, with snapshot testing covered in the summary after the table. Each one is useful; none replaces the others.

Aspect	XCUITest	Autonomous QA
Setup cost	High; write a test target that mirrors production	Low; one prompt per session
Maintenance cost	Tests break on UI changes; ongoing rework	Prompt survives most UI changes; matches by label
Catches regressions	Yes; codified, deterministic	Yes for functional bugs; less reliable for subtle visual issues
Catches new bugs	Only what tests cover	Yes; exploratory pass surfaces edges
Bug fix in the same loop	Manual; engineer reads report, patches separately	Possible; agent edits, rebuilds, re-verifies in-session
Best fit	Critical path regression coverage	"Does this actually work end to end" sweep
Run time	Seconds to minutes per test target	15-30 minutes per session

The right testing stack is usually all three: XCUITest for critical-path regressions, snapshot tests for visual stability, autonomous QA for exploratory passes. Each one catches what the others miss.
