MobilePro #217: The Blind Spots of AI-Assisted TDD
Latest Mobile Dev Insights: iOS, Android, Cross-Platform
Is this your brand on Milled? Claim it.
MobilePro #217: The Blind Spots of AI-Assisted TDDLatest Mobile Dev Insights: iOS, Android, Cross-Platform
SwiftCraft 2026 is two weeks away — and your ticket includes a full workshop dayA single-track Swift conference where the Monday workshops are bundled in — including:
Conference days bring keynotes from Laura Savino (Adobe), Janina Kutyn (Apple Music), and Daniel Steinberg, plus 14 talks on performance, AI-assisted dev, StoreKit 2, SwiftUI animation, and more. Packt readers: £50 off an Indie ticket with code PACKT26 at swiftcraft.uk. Today we are covering a piece from Dave Poirier, co-author of AI-Driven Swift Architecture. This article immediately caught our attention because it cuts through a lot of the hype around AI-assisted development workflows. In this article, originally published on LinkedIn, he compares two AI-assisted implementations of the same iOS subsystem: one driven by rigorous spec-first TDD, the other by a messier human-in-the-loop iterative process, and the results are surprisingly revealing. Rather than arguing against AI or TDD, Dave explores where highly disciplined AI workflows can become structurally blind: missing operational concerns, instrumentation, integration seams, and the “what’s missing?” questions that experienced engineers naturally notice while running real systems. TL;DR
Hack Before You LaunchHack Before You Launch is a hands-on workshop for developers and indie hackers building fast with AI tools like ChatGPT and Copilot. Ethical hacker Dr. Katie Paxton-Fear will reveal how AI-built apps can expose hidden security flaws—and how to fix them before launch. Attendees will leave with practical security insights and a pre-launch checklist to help ship safer products. Saturday, 30 May | 10:00 AM – 11:30 AM | 1 hour 30 minutes This week’s news corner
Dave Poirier is a software developer veteran with over 25 years of experience writing mobile, desktop, and server applications. His specialty includes data privacy, data security, and app robustness. Dave developed his skills mostly through self-education and contributing to the open-source community. To this day, Dave continues to contribute to the iOS and macOS communities by sharing his knowledge with his peers, and currently works for iVerify.io, building software solutions to detect compromised mobile devices. AI: When the More Rigorous Process Shipped the Lower-Quality CodeThis article is a field note on two AI-assisted implementations of the same iOS feature, and why the workflow that looked more disciplined produced a weaker result. I recently reviewed two parallel implementations of the same telemetry subsystem on a mobile codebase: a background sampler that captures device metrics, buffers them in memory under power and capacity constraints, encodes them to a binary wire format, and writes time-gated artefacts on an 8-hour cadence. Same ticket, same base branch, two teams, two workflows. Branch A — PRD-driven AI TDDProduct requirements written upfront, decomposed into Given/When/Then specs, then strict test-first development with an adversarial convince me this is correct review skill. Commit history is textbook: test(red) → feat(green) → test(red) → feat(green), dozens of micro-commits Branch B — Human-in-the-loop iterativeLess ceremonious. The engineer wrote code, ran it, hit edge cases, refactored. Specs were updated after the implementation stabilized. Commit history is messier, visible Replace NSLock with DispatchQueue, Strengthen sampler test coverage, Per-target buffer sizes, and Add rationale to specs. I went in expecting Branch A to be tighter and more correct. The opposite was true on three of four quality dimensions. What I measuredAfter excluding tangential changes on both sides (a sibling component that landed in the same PR, an unrelated flaky-test fix), the scopes are comparable. Test coverageBranch A: roughly 43 test cases (XCTest) Branch B: roughly 62 test cases (Quick/Nimble), with categorically stronger edge-case coverage The count gap is moderate, but the kind of tests differ sharply:
The Threading correctness was measured against the project’s documented rule that synchronous waits on shared state must assert(!Thread.isMainThread):
Production observability
Architecture
Three of four axes go to Branch B. The one win for Branch A is real but smaller in absolute terms than the wins for B. Why the disciplined process underperformedFive mechanisms, ordered by how much I think they mattered. TDD does not generate the highest-leverage tests, Refactor-and-lock-down doesThe most valuable tests on Branch B, binary golden fixtures, ordering invariants across destructive operations, zero-field elision regression guards, share a property: they require first having working code, then capturing its output as ground truth, then writing a test that fails if the output drifts. This is a post-implementation move. A strict red → green workflow steers away from it because there is no failing-test moment to anchor on. Branch A has no golden fixtures. Branch B does, and they will catch wire-format drift years from now without anyone remembering why they exist. Spec-first plus test-first creates a closed echo chamberWhen the spec says X, the test asserts X, and the code does X, all three rhyme. None of them ask what about the interactions?—what happens when drop-oldest fires while the reporter is flushing, when the power throttle activates mid-tick, when the time-gate opens while the buffer is already at 80% capacity. Branch B’s spec is twice as long as Branch A’s because the author kept discovering interactions post-implementation and updating the spec to match. Branch A’s tighter spec reads cleaner but encodes a smaller world. TDD is hostile to instrumentationPerformance tracking, log dedup, debug logs at decision points, named factory methods, doc comments explaining queue ownership, none of these have failing tests demanding them. They are production hygiene. A workflow whose rule is don’t add code that isn’t driven by a failing test will systematically under-instrument the system, because instrumentation never goes red. Branch A is impressively minimal precisely because the author refused to add what wasn’t test-demanded. Branch B looks like an operator built it. Adversarial review hardens claims; it does not reframe scopeA skill that asks convince me your code is correct makes you a better defender of your design. It does not make you ask should I have made a different design? If the author’s frame is minimal additive change, no surrounding refactors, adversarial questioning reinforces that frame. The author of Branch A came out very confidently minimal—and that confidence was the structural weakness, not the code itself. A complementary kind of adversary, one that challenges scope rather than correctness, would have asked:
Branch B’s process exposed those questions naturally because the author was running the code, reading surrounding modules, and noticing gaps. Branch A’s process didn’t, because the adversary was looking inward at correctness rather than outward at fit. Micro-commits make the refactor step invisibleThe refactor step of red/green/refactor is where dedup, factory methods, comment-as-design-rationale, and instrumentation typically appear. Branch A’s commit log is dense with test(red) → feat(green) pairs and almost no visible refactor passes. Branch B’s commit log explicitly shows refactor passes, Replace NSLock with DispatchQueue, Strengthen sampler test coverage, Per-target buffer sizes, Drop swiftlint:disable directives. Same red/green machinery, but only one workflow paused to refactor. When the next red → green always feels more productive than stop and improve what just shipped green, the refactor pass gets squeezed out. The duplicated buffer-append code, the inline power-throttle decisions, and the silent gate-skip in Branch A are all unfinished refactor passes that didn’t happen because the workflow never paused. The meta-pattern: legible process is not the same as better outcomeBranch A has the more legible process. A reviewer reading its commit log feels reassured. Branch B has a messier, iterative trace, clearly the author was rediscovering invariants and updating things mid-flight. But quality lives in the output, not the trace. The legible process produced a tight, correct, narrowly-scoped implementation that misses operational concerns the messier process happened to cover. The lesson is not that TDD is bad, or that AI-driven workflows are bad. The lesson is more specific: TDD plus spec-first plus correctness-grilling is a high-confidence, self-consistent loop that rewards completeness within scope and is structurally blind to scope itself. To escape its own frame, it needs an outside-the-loop perturbation, a skeptical second reviewer, an explicit what’s missing? pass, a forced refactor pause, or an adversary tuned to scope rather than correctness. Practical recommendations for AI-assisted mobile engineering workflowsIf you are running AI-driven TDD on a real production system, four interventions that would close most of the gap I observed:
Closing thoughtThe most generalizable observation from this comparison: when AI does the test-first dance and a human grills it on correctness, both participants stay inside the spec’s frame. Neither is positioned to notice what the frame leaves out. The human-in-the-loop iterative process, slower, messier, less impressive on commit-log review, has one structural advantage the disciplined process lacks: the human is running the code and noticing the gaps. That feedback loop, which includes operational instincts, surrounding-module awareness, and the discomfort of seeing your own freshly-written NSLock next to a CLAUDE.md rule that flags it, is not yet replaceable by spec-driven test generation. It is the part of engineering that lives outside the spec. If you are designing AI-assisted workflows for production-grade mobile work, treat AI does TDD, human reviews tests as a strong default for forward progress, and pair it with a separate pass whose only job is to interrogate scope, instrumentation, threading compliance, and what the spec failed to imagine. The disciplined loop will not catch its own blind spots. That is what the second pass is for. 📚 Go DeeperIf you’re exploring how AI-assisted workflows are reshaping real-world software engineering, AI-Driven Swift Architecture by Walid SASSI and Dave Poirier offers a practical look at building modern iOS systems with AI-assisted development, architecture patterns, concurrency, and production-ready engineering practices. AI Driven Swift Architecture🧑💻 Master Clean Architecture, TDD, and modernization with Claude Code support 🛠️ Build SwiftUI apps powered by Apple Foundation Models and on-device intelligence 📱 Apply MCP workflows to create AI-driven feature agents in real projects 💬 Let’s TalkHow often do you actually update your paywall after shipping it? Advertise with usInterested in sponsoring this newsletter and reaching a highly engaged audience of tech professionals? Simply reply to this email or leave a comment and our team will get in touch with next steps. You're currently a free subscriber to Mobile & App DevPro Newsletter by Packt. For the full experience, upgrade your subscription. |






