Packt

US · packtpub.com

MobilePro #217: The Blind Spots of AI-Assisted TDD

Latest Mobile Dev Insights: iOS, Android, Cross-Platform

This email was sent

May 13, 2026 10:07am EDT

Runcil Rebello

May 13

SwiftCraft 2026 is two weeks away — and your ticket includes a full workshop day

A single-track Swift conference where the Monday workshops are bundled in — including:

Daniel Steinberg’s full-day Foundation Models / Apple Intelligence workshop,
Phil Zakharchenko (ex-Apple, now OpenAI) on building great Mac apps in 2026, and
Joannis Orlandos on Swift for embedded Linux.

Conference days bring keynotes from Laura Savino (Adobe), Janina Kutyn (Apple Music), and Daniel Steinberg, plus 14 talks on performance, AI-assisted dev, StoreKit 2, SwiftUI animation, and more.

Packt readers: £50 off an Indie ticket with code PACKT26 at swiftcraft.uk.

Reserve your spot

Today we are covering a piece from Dave Poirier, co-author of AI-Driven Swift Architecture. This article immediately caught our attention because it cuts through a lot of the hype around AI-assisted development workflows.

In this article, originally published on LinkedIn, he compares two AI-assisted implementations of the same iOS subsystem: one driven by rigorous spec-first TDD, the other by a messier human-in-the-loop iterative process, and the results are surprisingly revealing.

Rather than arguing against AI or TDD, Dave explores where highly disciplined AI workflows can become structurally blind: missing operational concerns, instrumentation, integration seams, and the “what’s missing?” questions that experienced engineers naturally notice while running real systems.

TL;DR

AI-assisted TDD can produce clean, disciplined code — but still miss real-world engineering concerns
Specs + tests often reinforce the same assumptions instead of challenging them
The strongest production safeguards usually emerge after running and refactoring working code
Human-in-the-loop workflows catch operational gaps AI workflows often overlook
Correctness inside the spec is not the same as production readiness outside the spec
The missing step in many AI coding workflows is asking: “What did we fail to imagine?”

Hack Before You Launch

Hack Before You Launch is a hands-on workshop for developers and indie hackers building fast with AI tools like ChatGPT and Copilot. Ethical hacker Dr. Katie Paxton-Fear will reveal how AI-built apps can expose hidden security flaws—and how to fix them before launch. Attendees will leave with practical security insights and a pre-launch checklist to help ship safer products.

Saturday, 30 May | 10:00 AM – 11:30 AM | 1 hour 30 minutes

Join us to get a step ahead

This week’s news corner

iOS 26.5 RC: iOS and iPadOS 26.5 RC (with Xcode 26.5) introduces enhanced StoreKit support for subscription pricing models, along with new APIs and metadata access. It also fixes issues in receipts, subscription entitlements, StoreKit testing, and wallpaper installation/removal.
Android Auto app widgets mirror mobile setup: Android Auto is preparing to introduce widget support that mirrors the mobile widget selection experience. The feature has reappeared in recent leaks and is likely to be announced or rolled out soon.
Figma Builds In-House Redis Proxy to Hit Six Nines Uptime: Figma built an in-house Redis proxy (FigCache) to unify and stabilize its caching layer, achieving “six nines” uptime while reducing outages and operational complexity at scale.
Swift Package Manager will soon be the default in Flutter: Flutter 3.44 will make Swift Package Manager the default for iOS/macOS, replacing CocoaPods, which moves to read-only status by 2026. The shift simplifies setup and pushes developers and plugins to adopt Apple’s modern dependency system.

Know the author

Dave Poirier is a veteran software developer with over 25 years of experience writing mobile, desktop, and server applications. His specialties include data privacy, data security, and app robustness. Dave developed his skills mostly through self-education and contributing to the open-source community. To this day, Dave continues to contribute to the iOS and macOS communities by sharing his knowledge with his peers, and currently works for iVerify.io, building software solutions to detect compromised mobile devices.

AI: When the More Rigorous Process Shipped the Lower-Quality Code

This article is a field note on two AI-assisted implementations of the same iOS feature, and why the workflow that looked more disciplined produced a weaker result.

I recently reviewed two parallel implementations of the same telemetry subsystem on a mobile codebase: a background sampler that captures device metrics, buffers them in memory under power and capacity constraints, encodes them to a binary wire format, and writes time-gated artefacts on an 8-hour cadence.

Same ticket, same base branch, two teams, two workflows.

Branch A — PRD-driven AI TDD

Product requirements written upfront, decomposed into Given/When/Then specs, then strict test-first development with an adversarial “convince me this is correct” review skill. Commit history is textbook:

test(red) → feat(green) → test(red) → feat(green), dozens of micro-commits

Branch B — Human-in-the-loop iterative

Less ceremonious. The engineer wrote code, ran it, hit edge cases, refactored. Specs were updated after the implementation stabilized. Commit history is messier, with visible commits such as “Replace NSLock with DispatchQueue”, “Strengthen sampler test coverage”, “Per-target buffer sizes”, and “Add rationale to specs”.

I went in expecting Branch A to be tighter and more correct. The opposite was true on three of four quality dimensions.

What I measured

After excluding tangential changes on both sides (a sibling component that landed in the same PR, an unrelated flaky-test fix), the scopes are comparable.

Test coverage

Branch A: roughly 43 test cases (XCTest)

Branch B: roughly 62 test cases (Quick/Nimble), with categorically stronger edge-case coverage

The count gap is moderate, but the kinds of tests differ sharply:

Branch A’s strongest tests verify happy-path field storage and single-method behavior.
Branch B’s strongest tests verify integration invariants the team couldn’t easily have predicted upfront:
- A gate-closed condition must short-circuit before a destructive flush;
- Ring-buffer FIFO must hold independently across CPU and power buffers under capacity pressure; and
- Proto3 zero-field elision must be locked against binary golden fixtures so the wire format can’t silently drift.
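
To make the ring-buffer invariant concrete, here is a minimal sketch of the kind of drop-oldest FIFO behavior such a test would pin down. DropOldestBuffer, its capacities, and the sample values are hypothetical and not taken from either branch; the array-backed removeFirst() is used for brevity, where a production ring buffer would use index arithmetic.

```swift
/// Hypothetical fixed-capacity FIFO buffer that drops the oldest element when full.
struct DropOldestBuffer<Element> {
    private var storage: [Element] = []
    let capacity: Int

    init(capacity: Int) {
        precondition(capacity > 0, "capacity must be positive")
        self.capacity = capacity
    }

    /// Appends an element and returns the evicted oldest element, if any.
    @discardableResult
    mutating func append(_ element: Element) -> Element? {
        var evicted: Element? = nil
        if storage.count == capacity {
            evicted = storage.removeFirst()   // O(n); acceptable for a sketch
        }
        storage.append(element)
        return evicted
    }

    /// Elements in oldest-to-newest order.
    var elements: [Element] { storage }
}

// Two independent buffers with different (made-up) capacities.
var cpuBuffer = DropOldestBuffer<Int>(capacity: 3)
var powerBuffer = DropOldestBuffer<Int>(capacity: 2)

for sample in 1...5 {
    cpuBuffer.append(sample)
    powerBuffer.append(sample * 10)
}

// Under capacity pressure each buffer keeps only its newest samples, in arrival order.
print(cpuBuffer.elements)    // [3, 4, 5]
print(powerBuffer.elements)  // [40, 50]
```

Keeping the two buffers as separate values is what lets a test assert that pressure on one never reorders or evicts samples in the other.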

Threading correctness was measured against the project’s documented rule that synchronous waits on shared state must assert(!Thread.isMainThread):

Branch A—partially compliant: Legacy NSLock patterns retained in helper providers, no off-main asserts on those paths.
Branch B—compliant: Every sync-wait entry point asserts off-main; NSLock paths replaced with serial DispatchQueue
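
As an illustration of that rule, the sketch below guards shared state with a serial DispatchQueue and asserts off-main on the synchronous path. SampleStore, its queue label, and the values are assumptions made for this example; neither branch's actual code is shown in the article.

```swift
import Foundation

/// Hypothetical sample store guarded by a serial queue instead of NSLock.
/// Marked @unchecked Sendable because the queue serializes every access.
final class SampleStore: @unchecked Sendable {
    private let queue = DispatchQueue(label: "example.sample-store")
    private var samples: [Double] = []

    /// Asynchronous write: never blocks the caller.
    func append(_ value: Double) {
        queue.async { self.samples.append(value) }
    }

    /// Synchronous read: blocks the caller, so it must not run on the main thread.
    func snapshot() -> [Double] {
        assert(!Thread.isMainThread, "snapshot() must be called off the main thread")
        return queue.sync { samples }
    }
}

let store = SampleStore()
let finished = DispatchSemaphore(value: 0)

DispatchQueue.global().async {
    store.append(0.25)
    store.append(0.5)
    // The queue is serial and FIFO, so both appends run before this sync read.
    print(store.snapshot())   // [0.25, 0.5]
    finished.signal()
}
finished.wait()
```

The assert turns an accidental main-thread wait into an immediate debug-build failure rather than an intermittent UI stall.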

Production observability

Branch A—minimal: Silent skip on gate-closed, no performance-tracking hooks, and no log dedup.
Branch B: Integrated performance tracker, dedup’d failure logging via a once-only emitter, explicit decision-point logs at every short-circuit.
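
A rough sketch of what a once-only emitter and a logged short-circuit might look like follows. OnceOnlyEmitter, Reporter, and the log message are hypothetical names chosen for illustration, and routing the decision-point log through the deduplicating emitter is an assumption of this sketch, not a description of Branch B.

```swift
/// Hypothetical once-only emitter: records each distinct failure key a single time.
final class OnceOnlyEmitter {
    private var seen = Set<String>()
    private(set) var lines: [String] = []   // stand-in for a real logging backend

    func emit(key: String, _ message: String) {
        // insert(_:) reports whether the key was newly added.
        guard seen.insert(key).inserted else { return }
        lines.append(message)
    }
}

/// Hypothetical reporter that logs its decision instead of skipping silently.
struct Reporter {
    let emitter: OnceOnlyEmitter
    var buffer: [Int]

    /// Returns the flushed samples, or nil when the time gate is closed.
    mutating func flushIfGateOpen(gateOpen: Bool) -> [Int]? {
        guard gateOpen else {
            emitter.emit(key: "gate-closed", "flush skipped: time gate closed")
            return nil            // short-circuit before the destructive flush
        }
        let flushed = buffer
        buffer.removeAll()        // destructive: the buffer is emptied
        return flushed
    }
}

let emitter = OnceOnlyEmitter()
var reporter = Reporter(emitter: emitter, buffer: [1, 2, 3])
_ = reporter.flushIfGateOpen(gateOpen: false)
_ = reporter.flushIfGateOpen(gateOpen: false)   // same condition again: deduplicated

print(emitter.lines.count)   // 1
print(reporter.buffer)       // [1, 2, 3]
```

The same structure also makes the gate-closed ordering invariant testable: the buffer must be untouched whenever the method returns nil.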

Architecture

Branch A—cleaner: Direct service registration in each app entry point; minimal public surface.
Branch B: Introduced a configuration-aggregating struct that bundles too many unrelated concerns into one place, a small God-object regression.

Branch A: A 168-line behavioral spec, decoder co-located with the component.
Branch B: A 336-line spec with embedded rationale per decision, decoders relocated to a cross-team folder with a CI snapshot-test runbook documenting the encoder/decoder lock-step contract.

Three of four axes go to Branch B. The one win for Branch A is real but smaller in absolute terms than the wins for B.

Why the disciplined process underperformed

Five mechanisms, ordered by how much I think they mattered.

TDD does not generate the highest-leverage tests; refactor-and-lock-down does

The most valuable tests on Branch B (binary golden fixtures, ordering invariants across destructive operations, zero-field elision regression guards) share a property: they require first having working code, then capturing its output as ground truth, then writing a test that fails if the output drifts.

This is a post-implementation move. A strict red → green workflow steers away from it because there is no failing-test moment to anchor on.

Branch A has no golden fixtures. Branch B does, and they will catch wire-format drift years from now without anyone remembering why they exist.
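
The lock-down move can be sketched in a few lines. The encoder below is a hand-rolled stand-in that follows the proto3 convention of leaving zero-valued scalar fields off the wire; the Sample type, its field numbers, and the inline golden bytes are all assumptions for illustration. In a real project the golden bytes would live in a committed fixture file and the encoder would be generated from the .proto definition.

```swift
/// Hypothetical two-field sample; field numbers 1 and 2 are made up for this sketch.
struct Sample {
    var cpuPercent: UInt32      // field 1, varint
    var batteryPercent: UInt32  // field 2, varint
}

/// Appends a base-128 varint, least-significant group first.
func appendVarint(_ value: UInt32, to bytes: inout [UInt8]) {
    var remaining = value
    while remaining >= 0x80 {
        bytes.append(UInt8(remaining & 0x7F) | 0x80)   // set the continuation bit
        remaining >>= 7
    }
    bytes.append(UInt8(remaining))
}

/// Hand-rolled proto3-style encoder: fields holding zero are omitted entirely.
func encode(_ sample: Sample) -> [UInt8] {
    var bytes: [UInt8] = []
    let fields: [(UInt32, UInt32)] = [(1, sample.cpuPercent), (2, sample.batteryPercent)]
    for (number, value) in fields where value != 0 {
        appendVarint(number << 3, to: &bytes)   // tag: field number, wire type 0 (varint)
        appendVarint(value, to: &bytes)
    }
    return bytes
}

// Golden bytes captured once from working code, then treated as ground truth.
let golden: [UInt8] = [0x08, 0x96, 0x01]
let actual = encode(Sample(cpuPercent: 150, batteryPercent: 0))

// Fails loudly if a refactor changes the wire format or starts emitting the zero field.
precondition(actual == golden, "wire format drifted from the golden fixture")
```

Note that the expected bytes were not derived from a spec first; they only exist because the encoder already ran, which is why this test has no natural red step.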

Spec-first plus test-first creates a closed echo chamber

When the spec says X, the test asserts X, and the code does X, all three rhyme. None of them asks “what about the interactions?” What happens when drop-oldest fires while the reporter is flushing, when the power throttle activates mid-tick, or when the time-gate opens while the buffer is already at 80% capacity?

Branch B’s spec is twice as long as Branch A’s because the author kept discovering interactions post-implementation and updating the spec to match. Branch A’s tighter spec reads cleaner but encodes a smaller world.

TDD is hostile to instrumentation

Performance tracking, log dedup, debug logs at decision points, named factory methods, doc comments explaining queue ownership: none of these has a failing test demanding it. They are production hygiene.

A workflow whose rule is “don’t add code that isn’t driven by a failing test” will systematically under-instrument the system, because instrumentation never goes red.

Branch A is impressively minimal precisely because the author refused to add what wasn’t test-demanded. Branch B looks like an operator built it.

Adversarial review hardens claims; it does not reframe scope

A skill that asks “convince me your code is correct” makes you a better defender of your design. It does not make you ask “should I have made a different design?” If the author’s frame is “minimal additive change, no surrounding refactors”, adversarial questioning reinforces that frame.

The author of Branch A came out very confidently minimal—and that confidence was the structural weakness, not the code itself.

A complementary kind of adversary, one that challenges scope rather than correctness, would have asked:

Why isn’t this proto file in the cross-team decoder folder?
Why isn’t there a golden binary fixture?
Why does this helper still use NSLock when the codebase’s threading documentation flags that pattern?

Branch B’s process exposed those questions naturally because the author was running the code, reading surrounding modules, and noticing gaps. Branch A’s process didn’t, because the adversary was looking inward at correctness rather than outward at fit.

Micro-commits make the refactor step invisible

The refactor step of red/green/refactor is where dedup, factory methods, comment-as-design-rationale, and instrumentation typically appear.

Branch A’s commit log is dense with test(red) → feat(green) pairs and almost no visible refactor passes. Branch B’s commit log explicitly shows refactor passes: “Replace NSLock with DispatchQueue”, “Strengthen sampler test coverage”, “Per-target buffer sizes”, “Drop swiftlint:disable directives”. Same red/green machinery, but only one workflow paused to refactor.

When the next red → green always feels more productive than “stop and improve what just shipped green”, the refactor pass gets squeezed out. The duplicated buffer-append code, the inline power-throttle decisions, and the silent gate-skip in Branch A are all refactor passes that never happened because the workflow never paused.

The meta-pattern: legible process is not the same as better outcome

Branch A has the more legible process. A reviewer reading its commit log feels reassured. Branch B has a messier, iterative trace; clearly the author was rediscovering invariants and updating things mid-flight.

But quality lives in the output, not the trace. The legible process produced a tight, correct, narrowly-scoped implementation that misses operational concerns the messier process happened to cover.

The lesson is not that TDD is bad, or that AI-driven workflows are bad. The lesson is more specific: TDD plus spec-first plus correctness-grilling is a high-confidence, self-consistent loop that rewards completeness within scope and is structurally blind to scope itself.

To escape its own frame, it needs an outside-the-loop perturbation: a skeptical second reviewer, an explicit “what’s missing?” pass, a forced refactor pause, or an adversary tuned to scope rather than correctness.

Practical recommendations for AI-assisted mobile engineering workflows

If you are running AI-driven TDD on a real production system, here are four interventions that would close most of the gap I observed:

Add a post-TDD golden-file lock-down step: Once all units pass, capture binary outputs as .golden fixtures and commit them. The wire format is now locked. No future refactor can silently change it.
This is anti-TDD by construction: you can only write the test after the code works, and that is precisely why it must be a separate phase rather than something the loop is asked to discover.
Force at least one integration-seam test per pair of components: Not “does the sampler buffer correctly?” but “does the sampler buffer correctly while the reporter is flushing?”
These tests do not fall out of unit-level TDD. They require explicit prompting: “name every cross-component interaction in this diff and write a test for each.”
Pair correctness-grilling with scope-grilling: Run a second adversary whose only job is to ask “what is not in your diff that should be?” Point it at proto locations, missing instrumentation, missing comments, threading-rule compliance, and opportunities to dedup against existing patterns. This is the question the disciplined inner loop will not ask itself.
Make the refactor step a first-class commit: If your trace shows N test(red) → feat(green) pairs and zero refactor commits, something is being skipped. Refactor should be visible in the history, not implicit in the file-state-at-merge.
A workflow rule as simple as “after every three green commits, you must produce one refactor commit or explicitly state none is needed” would have closed much of the observed gap.
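
The cadence rule is simple enough to check mechanically. Below is a minimal sketch that scans commit subjects for a fourth consecutive green commit with no refactor in between. The feat(green) prefix comes from the commit pattern described earlier; the refactor prefix, the function name, and the sample history are assumptions, and an explicit “none needed” note is modelled here as a commit whose subject also starts with refactor.

```swift
/// Hypothetical cadence check: returns the index of the first green commit that
/// exceeds `limit` consecutive greens without an intervening refactor commit.
func firstCadenceViolation(in subjects: [String], limit: Int = 3) -> Int? {
    var greensSinceRefactor = 0
    for (index, subject) in subjects.enumerated() {
        if subject.hasPrefix("refactor") {
            greensSinceRefactor = 0                     // the pause happened; reset
        } else if subject.hasPrefix("feat(green)") {
            greensSinceRefactor += 1
            if greensSinceRefactor > limit { return index }
        }
        // test(red) and any other commits do not affect the count.
    }
    return nil
}

// Made-up history, oldest commit first.
let history = [
    "test(red): buffer drops oldest", "feat(green): buffer drops oldest",
    "test(red): gate closed",         "feat(green): gate closed",
    "test(red): power throttle",      "feat(green): power throttle",
    "test(red): encode sample",       "feat(green): encode sample",
]

if let index = firstCadenceViolation(in: history) {
    print("commit \(index) is a green commit past the refactor cadence")   // index is 7
}
```

A check like this could run in CI or as a prompt-side guard; either way it only verifies that a refactor commit exists, not that the refactor was worthwhile.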

Closing thought

The most generalizable observation from this comparison: when AI does the test-first dance and a human grills it on correctness, both participants stay inside the spec’s frame. Neither is positioned to notice what the frame leaves out.

The human-in-the-loop iterative process (slower, messier, less impressive on commit-log review) has one structural advantage the disciplined process lacks: the human is running the code and noticing the gaps.

That feedback loop, which includes operational instincts, surrounding-module awareness, and the discomfort of seeing your own freshly-written NSLock next to a CLAUDE.md rule that flags it, is not yet replaceable by spec-driven test generation. It is the part of engineering that lives outside the spec.

If you are designing AI-assisted workflows for production-grade mobile work, treat “AI does TDD, human reviews tests” as a strong default for forward progress, and pair it with a separate pass whose only job is to interrogate scope, instrumentation, threading compliance, and what the spec failed to imagine.

The disciplined loop will not catch its own blind spots. That is what the second pass is for.

📚 Go Deeper

If you’re exploring how AI-assisted workflows are reshaping real-world software engineering, AI-Driven Swift Architecture by Walid SASSI and Dave Poirier offers a practical look at building modern iOS systems with AI-assisted development, architecture patterns, concurrency, and production-ready engineering practices.

AI Driven Swift Architecture

🧑‍💻 Master Clean Architecture, TDD, and modernization with Claude Code support

🛠️ Build SwiftUI apps powered by Apple Foundation Models and on-device intelligence

📱 Apply MCP workflows to create AI-driven feature agents in real projects

Buy now at $44.99

💬 Let’s Talk

How often do you actually update your paywall after shipping it?
Reply and let me know — curious how teams approach iteration.

Advertise with us

Interested in sponsoring this newsletter and reaching a highly engaged audience of tech professionals? Simply reply to this email or leave a comment and our team will get in touch with next steps.
