Part 2 of the Edge AI on iOS series. Start with Part 1: The State of Edge AI on Mobile Devices.
What it actually costs in battery, heat, and accuracy.
The single most-misunderstood truth in mobile AI is that the bottleneck isn't compute. It's memory bandwidth. Flagship phones in 2026 ship neural processors in the ~100 TOPS class, comparable to a 2017 data-centre GPU. What they don't have is the 2-3 TB/s memory bandwidth a real GPU enjoys. Your phone, whether it's an iPhone 17 Pro, a Galaxy S26 Ultra, or a Pixel 10, has 50-90 GB/s. A 30-50× gap. This single constraint shapes everything else about edge AI on mobile.
The hardware reality
As Meta's Vikas Chandra put it in his 2026 State of the Union on on-device LLMs: "decode is memory-bound; you load the entire model weights for each token generated, so the compute units sit idle waiting for memory."
This is why 4-bit quantisation gives you roughly 4× the decode throughput of 16-bit weights, not just a 4× storage saving. It's why 2-bit and 1.58-bit are now competitive. And it's why small, specialised, quantised models often beat bigger, more capable ones on the same phone. The smaller one finishes.
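The arithmetic behind that claim can be sketched in a few lines. This is a rough ceiling, not a measurement: it assumes decode streams every weight exactly once per token and ignores KV-cache traffic, activations, and compute time. The 70 GB/s and 2,500 GB/s figures are assumed midpoints of the ranges quoted above, and the 3B parameter count is an illustrative choice.

```python
def decode_tokens_per_second(bandwidth_gb_s: float,
                             params_billions: float,
                             bits_per_weight: float) -> float:
    """Upper bound on decode speed when each token must read all weights once."""
    # 1e9 parameters * (bits / 8) bytes each = model size in GB.
    model_gb = params_billions * bits_per_weight / 8
    return bandwidth_gb_s / model_gb


PHONE_GB_S = 70.0    # assumption: midpoint of the 50-90 GB/s phone range
GPU_GB_S = 2500.0    # assumption: midpoint of the 2-3 TB/s data-centre range
PARAMS_B = 3.0       # assumption: a 3B-parameter on-device model

for bits in (16, 4, 2):
    ceiling = decode_tokens_per_second(PHONE_GB_S, PARAMS_B, bits)
    print(f"{bits:>2}-bit weights: ~{ceiling:.1f} tok/s ceiling on the phone")

print(f"bandwidth gap: ~{GPU_GB_S / PHONE_GB_S:.0f}x")
```

Because the model size sits in the denominator, quartering the bits per weight quadruples the ceiling. Real throughput lands below these numbers, but the ratio between precisions is the part that carries over.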
Flagship mobile silicon, mid-2026
| Chip | Process | NPU peak | On-device claim |
|---|---|---|---|
| Apple A19 Pro (iPhone 17 Pro) | TSMC N3P | 16-core Neural Engine + GPU Neural Accelerators | MacBook-level AI performance; native mxfp4 |
| Snapdragon 8 Elite Gen 5 | TSMC N3P | Hexagon NPU, ~80-100 TOPS, +37% gen-on-gen | First agentic NPU; up to 220 tok/s referenced |
| MediaTek Dimensity 9500 | TSMC N3P | NPU 990, BitNet 1.58-bit, compute-in-memory | 128K context; on-device 4K image generation |
| Google Tensor G5 (Pixel 10) | TSMC 3nm | 4th-gen TPU, +60% vs G4 | Hosts Matryoshka-nested Gemini Nano (2B in 4B) |
| Samsung Exynos 2600 | Samsung 2nm GAA | 32,768 MACs, +113% NPU vs Exynos 2500 | First commercial 2nm mobile SoC; native ExecuTorch |
The three costs
On-device AI trades three currencies that don't exist in the cloud version of the same question: watts, degrees, and benchmark points. Skipping these costs is the giveaway that someone hasn't actually shipped with the tech.
Battery
Apple's own numbers say enabling Apple Intelligence on the iPhone 16 series adds 0.4-0.9% to daily drain. Almost imperceptible, because the Foundation Model only spins up reactively. But measure sustained inference and the picture changes. Greenspector's independent September 2025 benchmarks found that continuous local LLM inference drains a phone battery 12-15× faster than idle. Gaming-level power draw, not browsing-level. The bright spot is the new generation of tiny specialised models: Google's Gemma 3 270M, INT4-quantised, handled 25 conversations on 0.75% of battery on a Pixel 9 Pro. Three orders of magnitude cheaper than running a 7B model.
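To turn those figures into a budget you can reason about, a small calculator helps. This is a sketch under stated assumptions: the 48-hour idle runtime is a hypothetical placeholder, not a figure from Greenspector or Apple, and should be replaced with a measurement from your own target device.

```python
def battery_per_conversation(percent_used: float, conversations: int) -> float:
    """Average battery percentage consumed by one conversation."""
    return percent_used / conversations


def hours_to_empty(idle_hours_to_empty: float, drain_multiplier: float) -> float:
    """Runtime under sustained inference, given how much faster than idle it drains."""
    return idle_hours_to_empty / drain_multiplier


# Gemma 3 270M figure quoted above: 25 conversations on 0.75% of battery.
per_chat = battery_per_conversation(0.75, 25)
print(f"per conversation: {per_chat:.3f}% of battery")

# Assumption: a phone that would idle for 48 hours on a full charge (hypothetical).
IDLE_HOURS = 48.0
for multiplier in (12, 15):
    print(f"{multiplier}x drain: ~{hours_to_empty(IDLE_HOURS, multiplier):.1f} h of sustained inference")
```

The first number works out to about 0.03% per conversation, which is why bursty use of a tiny model is close to free while a continuously running one is not.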
Heat
A peer-reviewed 2026 study measuring sustained-load LLM inference on flagship phones reported that the iPhone 16 Pro loses nearly half its throughput within two iterations of back-to-back generation, and the Galaxy S24 Ultra hits a hard OS-enforced GPU frequency floor that terminates inference entirely. Published tokens-per-second benchmarks are almost always cold-start figures. The number you actually experience after the third or fourth long generation can be 40-60% lower on flagships and an order of magnitude lower on mid-range phones. Phones dissipate heat poorly; that is physics, not a tuning problem.
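One way to avoid quoting cold-start figures is to benchmark back-to-back runs and report the last one. The harness below is a minimal sketch: `generate` is a hypothetical callable standing in for whatever runtime you use and must return the number of tokens it produced. The demo drives it with a simulated clock and made-up durations so the throttling pattern is visible without a device.

```python
import time
from typing import Callable, List


def benchmark_sustained(generate: Callable[[], int],
                        iterations: int = 5,
                        clock: Callable[[], float] = time.perf_counter) -> List[float]:
    """Run generations back to back and return tokens/s for each iteration."""
    rates = []
    for _ in range(iterations):
        start = clock()
        tokens = generate()          # hypothetical: returns tokens produced
        elapsed = clock() - start
        if elapsed <= 0:
            raise ValueError("generation finished too fast to time")
        rates.append(tokens / elapsed)
    return rates


def throttle_drop(rates: List[float]) -> float:
    """Fractional throughput lost between the first (cold) and last (warm) run."""
    return 1.0 - rates[-1] / rates[0]


# Simulated device: 200 tokens per run, each run slower as the phone heats up.
durations = iter([4.0, 7.5, 8.0])    # hypothetical seconds per generation
now = [0.0]

def fake_clock() -> float:
    return now[0]

def fake_generate() -> int:
    now[0] += next(durations)        # advance simulated time
    return 200

rates = benchmark_sustained(fake_generate, iterations=3, clock=fake_clock)
print([round(r, 1) for r in rates])
print(f"cold-to-warm drop: {throttle_drop(rates):.0%}")
```

On real hardware you would drop the fake clock and pass your own `generate`. The warm number is the one to put in a spec.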
Accuracy
A top-tier on-device model in May 2026 (Apple's Foundation Models 3B, Gemma 3 4B, Qwen3 4B, Phi-4-mini) performs roughly at the level of GPT-3.5 Turbo from late 2023 on standard benchmarks. Useful for bounded tasks. Not GPT-5.5. Failure modes that show up in practice: weak world knowledge, brittle multi-step reasoning, noisy long-context retrieval, and tool-use that maxes out at single-tool calls. Apple's own developer guidance is the most honest line in the category: don't use Foundation Models for code, math, or factual Q&A. Match the model to the task. Route to the cloud when you can't.
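The routing rule implied here can be written down as a small function. This is a sketch, not Apple's or anyone's shipped logic. The cloud-only categories mirror the guidance quoted above (code, math, factual Q&A), and the multi-tool rule follows the single-tool limit. The on-device task names and the `"decline"` outcome when offline are assumptions made for illustration.

```python
from enum import Enum


class Task(Enum):
    SUMMARISE = "summarise"      # assumed on-device-friendly
    EXTRACT = "extract"          # assumed on-device-friendly
    CLASSIFY = "classify"        # assumed on-device-friendly
    CODE = "code"
    MATH = "math"
    FACTUAL_QA = "factual_qa"


# Categories the text says small on-device models should not handle.
CLOUD_ONLY = {Task.CODE, Task.MATH, Task.FACTUAL_QA}


def route(task: Task, tools_needed: int = 0, online: bool = True) -> str:
    """Pick where a request runs: 'on_device', 'cloud', or 'decline'."""
    needs_cloud = task in CLOUD_ONLY or tools_needed > 1
    if not needs_cloud:
        return "on_device"
    # Assumption: when the cloud is unreachable, refuse rather than answer badly.
    return "cloud" if online else "decline"


print(route(Task.EXTRACT))                    # bounded task stays local
print(route(Task.MATH))                       # outside the small model's range
print(route(Task.SUMMARISE, tools_needed=2))  # multi-tool call goes to the cloud
print(route(Task.CODE, online=False))         # no network, no good answer
```

Keeping the decision in one pure function makes it easy to test and to tighten as on-device models improve.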
Cost summary
| Cost | Magnitude in real terms | Mitigation |
|---|---|---|
| Battery | 12-15× faster drain under sustained inference; ~0.5% daily drain in triggered-only use | Smaller models, bursty inference, caching |
| Heat | iPhone 16 Pro loses ~50% throughput by 2nd sustained iteration; S24 Ultra can throttle to zero | Benchmark warm, prefer NPU, pause-and-resume |
| Accuracy | GPT-3.5-era benchmark quality; fails on world knowledge, multi-step reasoning, complex tool use | Match model to task, constrain output, hybrid fallback |
What the next silicon generation is rumoured to change
WWDC 2026 lands in two weeks; Apple's September event is still four months out. But the chip leaks have been remarkably consistent across multiple analysts, and the direction they point is exactly the one this post predicts. If the rumours hold, the next generation of mobile silicon is being engineered for the bandwidth problem, not the compute problem.
The A20 Pro, expected in the iPhone 18 Pro and the rumoured foldable iPhone in September 2026, is reported to move to TSMC's 2nm (N2) process, with roughly 15% better performance and 30% better efficiency over the A19 Pro. The detail that matters most for edge AI is packaging. According to GF Securities analyst Jeff Pu, the A20 Pro will use TSMC's Wafer-Level Multi-Chip Module (WMCM) technology, which integrates RAM directly onto the same wafer as the CPU, GPU, and Neural Engine, before the chips are even cut. The processor and memory sit in closer physical proximity than ever before. Memory bandwidth goes up. Latency goes down. The Neural Engine waits less for memory. This is the single most important rumoured change for on-device AI in 2026, because it directly attacks the bottleneck this post is built around.
On the Android side, Qualcomm's Snapdragon 8 Elite Gen 6 (expected at Snapdragon Summit, September 2026) is reportedly splitting into a standard and Pro variant. Both are expected on TSMC's 2nm process with a new 2+3+3 CPU layout. The Pro adds LPDDR6 memory support (vs LPDDR5X on the standard) and an Adreno 850 GPU. The current Elite Gen 5 already claims 220 tokens per second of on-device LLM throughput; Gen 6 is expected to push that further with more efficient NPU silicon, though Qualcomm has been quieter than Apple on the AI-specific gains.
Three patterns are worth noticing across these leaks. First, every flagship vendor is on 2nm in 2026; the process node race has converged. Second, the actual differentiator is now packaging and memory architecture (WMCM for Apple, LPDDR6 for Qualcomm's Pro), not transistor count. Third, none of the leaked specs mention dramatic increases in raw TOPS. The industry has internalised what Chandra wrote in January: TOPS isn't the bottleneck. The race is now for memory bandwidth, thermal headroom, and efficient packaging. Which means if you're designing apps for late-2026 and 2027 phones, you should assume the bandwidth gap to data-centre GPUs narrows meaningfully (though it doesn't close), and the sustainable token-per-second number on a flagship phone goes up by maybe 1.5-2× rather than 10×.
Worth flagging: all of this is leak-grade, not announcement-grade. The next two months will clarify a lot, starting with WWDC.
What to build on Monday
Three patterns are worth building today. First, on-device structured extraction: turn freeform user text into typed records using Apple's @Generable on iOS or Gemini Nano's structured output on Android. Second, hybrid routing: use the cloud only when the task genuinely needs it. Third, treat LoRA adapters as your long-term personalisation play; that's where the next two years of differentiation lives.
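The first two patterns combine naturally: ask the local model for a typed record, validate it, and only fall back to the cloud if validation fails. The sketch below shows that shape in Python. It is not the `@Generable` API or Gemini Nano's structured output, both of which constrain generation itself. The `Expense` record and the `local_model` and `cloud_model` callables are hypothetical stand-ins that take text and return a JSON string.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Expense:
    """Hypothetical typed record extracted from freeform user text."""
    merchant: str
    amount: float
    currency: str


def parse_expense(raw: str) -> Optional[Expense]:
    """Validate model output; return None instead of a half-filled record."""
    try:
        data = json.loads(raw)
        return Expense(
            merchant=str(data["merchant"]),
            amount=float(data["amount"]),
            currency=str(data["currency"]).upper(),
        )
    except (ValueError, KeyError, TypeError):
        # Covers malformed JSON, missing fields, and wrong shapes or types.
        return None


def extract_expense(text: str,
                    local_model: Callable[[str], str],
                    cloud_model: Optional[Callable[[str], str]] = None) -> Optional[Expense]:
    """Try on-device first; use the cloud only if local output fails validation."""
    record = parse_expense(local_model(text))
    if record is None and cloud_model is not None:
        record = parse_expense(cloud_model(text))
    return record


# Stand-in models for illustration only.
good = lambda text: '{"merchant": "Cafe", "amount": "4.50", "currency": "eur"}'
bad = lambda text: "Sure! Here is the record you asked for."

print(extract_expense("coffee 4.50 euros at Cafe", good))
print(extract_expense("coffee 4.50 euros at Cafe", bad))
print(extract_expense("coffee 4.50 euros at Cafe", bad, cloud_model=good))
```

With constrained decoding on the device the validation step rarely fails, but keeping it means the cloud call happens only when the task genuinely needs it.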
If you take one practical action from this post, make it this: install PocketPal AI, download Gemma 3 4B Q4 from Hugging Face, and try it offline for a day. You'll feel the shift this series is about.
WWDC 2026 lands in just under two weeks. Apple will almost certainly announce something material. But the thesis underneath these two posts is older and stronger than any keynote: memory bandwidth, small models, hybrid architectures, privacy as construction rather than promise. Those don't shift in a year. Next post in the series: hands-on with Apple's Foundation Models framework, end to end.
Further reading
If this post is your starting point, these are the six pieces that go deeper. Each one repays the time.
- Vikas Chandra & Raghuraman Krishnamoorthi – On-Device LLMs: State of the Union, 2026 (Meta AI Research, January 2026). The definitive technical reference. Memory bandwidth as the real bottleneck, the full quantisation Pareto, why MoE is hard on phones.
- Martin Alderson – No, it doesn't cost Anthropic $5k per Claude Code user (March 2026). The clearest dissection of frontier API economics on the public internet. Where the 10× markup comes from and why builders are walking away.
- Apple Machine Learning Research – Apple Intelligence Foundation Language Models Tech Report 2025 (July 2025, arXiv:2507.13575). The architecture paper for the model now shipping on hundreds of millions of devices. KV-cache sharing, 2-bit QAT, the two-block design.
- Simon Willison's local-llms archive (simonwillison.net/tags/local-llms). The most consistent practitioner voice on running models locally. Run-it-yourself energy, year after year.
- Greenspector – What is the environmental impact of local AI on our smartphones? (September 2025). The independent battery benchmarks that ground the 12-15× drain figure. Real measurements on real phones.
- Anna Tong – Cursor Goes to War for AI Coding Dominance (Forbes, March 2026). The original article that triggered the $5,000 debate. Context for why Cursor is building its own model family.
And if you want the single shortest case for edge AI, the closing line of Chandra's piece does it in seventeen words: "Phones didn't become GPUs. The field learned to treat memory bandwidth, not compute, as the binding constraint."