Edge AI on Mobile: The Honest Trade-offs

Part 2 of the Edge AI on iOS series. Start with Part 1: The State of Edge AI on Mobile Devices.

What it actually costs in battery, heat, and accuracy.

The most misunderstood truth in mobile AI is that the bottleneck isn't compute. It's memory bandwidth. Flagship phones in 2026 ship neural processors in the ~100 TOPS class, comparable to a 2017 data-centre GPU. What they don't have is the 2-3 TB/s memory bandwidth a real GPU enjoys. Your phone, whether it's an iPhone 17 Pro, a Galaxy S26 Ultra, or a Pixel 10, has 50-90 GB/s. That is a gap of roughly 20-60×, depending on which ends of those ranges you compare. This single constraint shapes everything else about edge AI on mobile.

The hardware reality

As Meta's Vikas Chandra put it in his 2026 State of the Union on on-device LLMs: "decode is memory-bound; you load the entire model weights for each token generated, so the compute units sit idle waiting for memory."

This is why 4-bit quantisation gives you up to 4× the decode throughput of 16-bit weights, not just a 4× smaller footprint: a quarter of the bytes cross the memory bus for every token. It's why 2-bit and 1.58-bit are now competitive. And it's why small, specialised, quantised models often beat bigger, more capable ones on the same phone. The smaller one finishes.
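To make the memory-bound argument concrete, here is a back-of-envelope sketch in Python. It assumes decode reads every weight exactly once per token and ignores KV-cache traffic, activations, and dequantisation overhead, so treat the result as a ceiling, not a prediction. The 3B model size and the 60 GB/s figure are illustrative picks (the latter from inside the 50-90 GB/s range above), and the function name is made up for this example.

```python
def decode_ceiling_tok_s(params_billion: float, bits_per_weight: float,
                         bandwidth_gb_s: float) -> float:
    """Upper bound on decode tokens/s if each token streams all weights once."""
    # 1e9 params * (bits / 8) bytes each = size in GB
    weight_gb = params_billion * bits_per_weight / 8
    return bandwidth_gb_s / weight_gb


# Illustrative assumption: a 3B-parameter model on a phone with 60 GB/s.
for bits in (16, 4, 2):
    ceiling = decode_ceiling_tok_s(3, bits, 60)
    print(f"{bits:>2}-bit weights: at most ~{ceiling:.0f} tok/s")
```

The ceiling scales inversely with bits per weight, which is the whole point: under this model, halving the bytes doubles the best-case token rate without touching the NPU.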

Flagship mobile silicon, mid-2026

Chip	Process	NPU / AI engine	Vendor's on-device claim
Apple A19 Pro (iPhone 17 Pro)	TSMC N3P	16-core Neural Engine + GPU Neural Accelerators	MacBook-level AI performance; native mxfp4
Snapdragon 8 Elite Gen 5	TSMC N3P	Hexagon NPU, ~80-100 TOPS, +37% gen-on-gen	First agentic NPU; up to 220 tok/s referenced
MediaTek Dimensity 9500	TSMC N3P	NPU 990, BitNet 1.58-bit, compute-in-memory	128K context; on-device 4K image generation
Google Tensor G5 (Pixel 10)	TSMC 3nm	4th-gen TPU, +60% vs G4	Hosts Matryoshka-nested Gemini Nano (2B in 4B)
Samsung Exynos 2600	Samsung 2nm GAA	32,768 MACs, +113% NPU vs Exynos 2500	First commercial 2nm mobile SoC; native ExecuTorch

The three costs

On-device AI trades three currencies that don't exist in the cloud version of the same question: watts, degrees, and benchmark points. Skipping these costs is the giveaway that someone hasn't actually shipped with the tech.

Battery

Apple's own numbers say enabling Apple Intelligence on the iPhone 16 series adds 0.4-0.9% to daily drain. Almost imperceptible, because the Foundation Model only spins up reactively. But measure sustained inference and the picture changes. Greenspector's independent September 2025 benchmarks found that continuous local LLM inference drains a phone battery 12-15× faster than idle. Gaming-level power draw, not browsing-level. The bright spot is the new generation of tiny specialised models: Google's Gemma 3 270M, INT4-quantised, handled 25 conversations on 0.75% of battery on a Pixel 9 Pro. Three orders of magnitude cheaper than running a 7B model.
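The battery figures above are easier to reason about as per-unit costs. The sketch below is plain arithmetic on the numbers quoted in this section; the only thing it adds is an idle drain rate, which the source does not give, so the 1% per hour used here is a placeholder assumption you should replace with a measurement from your own device. Function names are hypothetical.

```python
def pct_per_conversation(total_pct: float, conversations: int) -> float:
    """Battery percentage consumed by one conversation, on average."""
    return total_pct / conversations


def hours_to_empty(idle_pct_per_hour: float, multiplier: float) -> float:
    """Hours from 100% to 0% if drain is `multiplier` times the idle rate."""
    return 100 / (idle_pct_per_hour * multiplier)


# Gemma 3 270M figure quoted above: 25 conversations on 0.75% of battery.
per_chat = pct_per_conversation(0.75, 25)
print(f"per conversation: {per_chat:.2f}% of battery")

# Sustained inference at 12-15x idle. ASSUMPTION: idle drain of 1%/hour.
assumed_idle = 1.0
for mult in (12, 15):
    print(f"{mult}x idle: ~{hours_to_empty(assumed_idle, mult):.1f} h to empty")
```

The useful habit is budgeting in percent per task rather than percent per day: it tells you how many invocations a feature can afford before users notice.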

Heat

A peer-reviewed 2026 study measuring sustained-load LLM inference on flagship phones reported that the iPhone 16 Pro loses nearly half its throughput within two iterations of back-to-back generation, and the Galaxy S24 Ultra hits a hard OS-enforced GPU frequency floor that terminates inference entirely. Published tokens-per-second benchmarks are almost always cold-start figures. The number you actually experience after the third or fourth long generation can be 40-60% lower on flagships and an order of magnitude lower on mid-range phones. Physics is against the phone here: a sealed, fanless slab has very little surface area to shed heat.
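If cold-start numbers mislead, the fix is to benchmark warm. Here is a minimal harness sketch in Python, not the methodology of the study cited above: it runs back-to-back generations and reports tokens per second for each one, so the drop from first to last run is visible. `generate` is a hypothetical callable that runs one generation on your runtime and returns the number of tokens produced; the clock is injectable so the demo can use scripted timings instead of a real device.

```python
import time
from typing import Callable, List


def sustained_throughput(generate: Callable[[], int], iterations: int = 5,
                         clock: Callable[[], float] = time.perf_counter
                         ) -> List[float]:
    """Run generations back to back; return tokens/s for each iteration."""
    rates = []
    for _ in range(iterations):
        start = clock()
        tokens = generate()          # hypothetical: one full generation
        elapsed = clock() - start
        if elapsed <= 0:
            raise ValueError("clock did not advance during generation")
        rates.append(tokens / elapsed)
    return rates


def throttle_drop(rates: List[float]) -> float:
    """Fractional throughput loss from the first (cold) to the last run."""
    return 1 - rates[-1] / rates[0]


if __name__ == "__main__":
    # Scripted clock standing in for a device that slows as it heats up.
    ticks = iter([0.0, 2.0, 2.0, 4.5, 4.5, 8.5])
    rates = sustained_throughput(lambda: 200, iterations=3,
                                 clock=lambda: next(ticks))
    print([round(r, 1) for r in rates])
    print(f"cold-to-warm drop: {throttle_drop(rates):.0%}")
```

On a real device you would pass your runtime's generation call as `generate`, keep the default clock, and report the last iteration rather than the first.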

Accuracy

A top-tier on-device model in May 2026 (Apple's Foundation Models 3B, Gemma 3 4B, Qwen3 4B, Phi-4-mini) performs roughly at the level of GPT-3.5 Turbo from late 2023 on standard benchmarks. Useful for bounded tasks. Not GPT-5.5. Failure modes that show up in practice: weak world knowledge, brittle multi-step reasoning, noisy long-context retrieval, and tool use that tops out at single-tool calls. Apple's own developer guidance is the most honest line in the category: don't use Foundation Models for code, math, or factual Q&A. Match the model to the task. Route to the cloud when no on-device model fits.

Cost summary

Cost	Magnitude in real terms	Mitigation
Battery	12-15× faster drain under sustained inference; 0.4-0.9% added daily drain in triggered-only use	Smaller models, bursty inference, caching
Heat	iPhone 16 Pro loses ~50% throughput by 2nd sustained iteration; S24 Ultra can throttle to zero	Benchmark warm, prefer NPU, pause-and-resume
Accuracy	GPT-3.5-era benchmark quality; fails on world knowledge, multi-step reasoning, complex tool use	Match model to task, constrain output, hybrid fallback

What the next silicon generation is rumoured to change

WWDC 2026 lands in just under two weeks; Apple's September event is still four months out. But the chip leaks have been remarkably consistent across multiple analysts, and they point in exactly the direction this post argues for. If the rumours hold, the next generation of mobile silicon is being engineered for the bandwidth problem, not the compute problem.

The A20 Pro, expected in the iPhone 18 Pro and the rumoured foldable iPhone in September 2026, is reported to move to TSMC's 2nm (N2) process, with roughly 15% better performance and 30% better efficiency over the A19 Pro. The detail that matters most for edge AI is packaging. According to GF Securities analyst Jeff Pu, the A20 Pro will use TSMC's Wafer-Level Multi-Chip Module (WMCM) technology, which integrates RAM directly onto the same wafer as the CPU, GPU, and Neural Engine, before the chips are even cut. The processor and memory sit in closer physical proximity than ever before. Memory bandwidth goes up. Latency goes down. The Neural Engine waits less for memory. This is the single most important rumoured change for on-device AI in 2026, because it directly attacks the bottleneck this post is built around.

On the Android side, Qualcomm's Snapdragon 8 Elite Gen 6 (expected at Snapdragon Summit, September 2026) is reportedly splitting into a standard and Pro variant. Both are expected on TSMC's 2nm process with a new 2+3+3 CPU layout. The Pro adds LPDDR6 memory support (vs LPDDR5X on the standard) and an Adreno 850 GPU. The current Elite Gen 5 already claims 220 tokens per second of on-device LLM throughput; Gen 6 is expected to push that further with more efficient NPU silicon, though Qualcomm has been quieter than Apple on the AI-specific gains.

Three patterns are worth noticing across these leaks. First, Apple and Qualcomm are both reported to land on 2nm in 2026, where Samsung's Exynos 2600 already is; the process node race is converging. Second, the actual differentiator is now packaging and memory architecture (WMCM for Apple, LPDDR6 for Qualcomm's Pro), not transistor count. Third, none of the leaked specs mention dramatic increases in raw TOPS. The industry has internalised what Chandra wrote in January: TOPS isn't the bottleneck. The race is now for memory bandwidth, thermal headroom, and efficient packaging. So if you're designing apps for late-2026 and 2027 phones, assume the bandwidth gap to data-centre GPUs narrows meaningfully (though it doesn't close), and that the sustainable tokens-per-second number on a flagship phone goes up by maybe 1.5-2× rather than 10×.

Worth flagging: all of this is leak-grade, not announcement-grade. The next two months will clarify a lot, starting with WWDC.

What to build on Monday

Three patterns are worth building today. First, on-device structured extraction: turn freeform user text into typed records using Apple's @Generable on iOS or Gemini Nano's structured output on Android. Second, hybrid routing: use the cloud only when the task genuinely needs it. Third, treat LoRA adapters as your long-term personalisation play; that's where the next two years of differentiation lives.
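Hybrid routing is the easiest of the three to sketch. The Python below is an illustration of the idea under stated assumptions, not any vendor's API: the task labels, the `private` flag, and the 4096-token local limit are all invented for this example. The cloud-bound task list simply encodes the weak spots named in the Accuracy section (code, math, factual Q&A, multi-tool calls).

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    ON_DEVICE = "on_device"
    CLOUD = "cloud"


# Hypothetical labels for tasks that small on-device models handle poorly.
CLOUD_TASKS = {"code", "math", "factual_qa", "multi_tool"}


@dataclass
class Request:
    task: str               # hypothetical task label set by the app
    prompt_tokens: int
    online: bool = True
    private: bool = False   # data that must not leave the device


def choose_route(req: Request, max_local_tokens: int = 4096) -> Route:
    """Prefer on-device; go to the cloud only when the task needs it."""
    # Privacy and connectivity are hard constraints, so check them first.
    if req.private or not req.online:
        return Route.ON_DEVICE
    # ASSUMPTION: 4096 tokens is a placeholder, not a measured limit.
    if req.task in CLOUD_TASKS or req.prompt_tokens > max_local_tokens:
        return Route.CLOUD
    return Route.ON_DEVICE


print(choose_route(Request(task="extraction", prompt_tokens=200)))
print(choose_route(Request(task="math", prompt_tokens=50)))
```

The design choice worth copying is the ordering: hard constraints first, capability checks second, on-device as the default. A private or offline request that lands on a weak task still stays local, so the app should degrade gracefully there instead of failing.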

If you take one practical action from this post, make it this: install PocketPal AI, download Gemma 3 4B Q4 from Hugging Face, and try it offline for a day. You'll feel the shift this series is about.

Apple will almost certainly announce something material at WWDC. But the thesis underneath these two posts is older and stronger than any keynote: memory bandwidth, small models, hybrid architectures, privacy as construction rather than promise. Those don't shift in a year. Next post in the series: hands-on with Apple's Foundation Models framework, end to end.