The State of Edge AI on Mobile Devices

"I have always seen mobile devices as the closest man has come to truly integrating technology into his way of life."

That conviction has shaped everything I think about technology. We sleep with our phones, navigate by them, photograph our lives through them, pay with them, find each other through them. No other piece of computing has burrowed this deep into human routine. And now, quietly, that same device is becoming intelligent on its own terms. Edge AI on mobile is no longer a research demo. In May 2026, with Apple about to take the WWDC stage and Google just back from I/O, roughly two billion phones around the world ship with a usable on-device language model. Apple Foundation Models on iOS 26. Gemini Nano on Android. Open-weights models from Hugging Face running through PocketPal and Off Grid for anyone who wants to bring their own. This is the introductory post in a series: a working overview of where mobile edge AI actually is, what people are doing with it, and where it is heading.

What's actually shipping

Apple's Foundation Models framework shipped with iOS 26 in September 2025: a roughly 3-billion-parameter language model, built with 2-bit quantisation-aware training and KV-cache sharing across blocks, and exposed to developers through Swift. It is the same model that powers Apple Intelligence features: Writing Tools, notification summaries, live translation across 13 languages, Genmoji.
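
To put the 3-billion-parameter, 2-bit figure in perspective, here is a back-of-envelope sketch of weight storage alone. It assumes every weight is stored at the stated bit width and ignores anything kept at higher precision, the KV cache, and runtime overhead, so treat the result as a floor rather than a measurement. The function name is mine, not part of any framework.

```python
def weight_footprint_gb(params: float, bits_per_weight: float) -> float:
    """Approximate storage for model weights in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9

# Roughly 3B parameters at 2 bits each, versus the same count at 16 bits.
two_bit = weight_footprint_gb(3e9, 2)
sixteen_bit = weight_footprint_gb(3e9, 16)

print(f"2-bit weights:  {two_bit:.2f} GB")
print(f"16-bit weights: {sixteen_bit:.2f} GB")
```

Under these assumptions the 2-bit weights take an eighth of the space of a 16-bit copy, which is the difference between fitting comfortably in a phone's memory and not.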

On Android, Google's Gemini Nano v3 is the equivalent, currently shipping on Pixel 10 and Galaxy S26. Nano 4, built on the new open-weights Gemma 4 family, entered AICore Developer Preview in April 2026 with up to 4× faster performance and 60% better battery efficiency than its predecessor. Samsung Galaxy AI layers proactive features on top. Qualcomm calls the Snapdragon 8 Elite Gen 5 the first "agentic NPU." MediaTek's Dimensity 9500 became the first mobile silicon to support BitNet 1.58-bit quantisation in hardware.

Underneath all of this sits a quietly powerful third layer: open-weights models from Hugging Face running locally through free apps like PocketPal AI, Off Grid, and Google AI Edge Gallery. Gemma 3 4B, Qwen3 4B, Phi-4-mini, SmolLM3, Llama 3.2. Any of them can be downloaded in GGUF format and run at 15-25 tokens per second on flagship hardware. The gap between Apple's polished first-party framework and a free open-source app loading a Hugging Face model is smaller than the marketing suggests.

Edge vs cloud: where the line is now

Cloud AI made the last decade. It still wins for anything that needs frontier reasoning, broad world knowledge, or long multi-turn conversations. GPT-5.5, Claude Opus 4.7, Gemini 3 live in data centres for a reason. But on a defined set of tasks, the trade-off has flipped. Edge AI now wins on latency (under 20ms to first token vs 200-500ms cloud round-trip), on privacy (data never leaves the device), on cost (zero per-query inference for the developer, zero rate limits for the user), and on availability (works on aeroplanes, in tunnels, in countries where your cloud provider is blocked).

The honest framing isn't "cloud vs edge." It's a routing decision per request. Apple's Private Cloud Compute and Google's Gemini Pro fallback both exist because the right architecture is hybrid: keep on-device by default, escalate to the cloud only when the local model genuinely can't do the job. The apps that win the next two years will be the ones that make that routing decision invisibly well.
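
A per-request routing decision can be sketched in a few lines. This is a minimal illustration, not how Apple or Google implement their fallbacks: the task labels, the `Request` fields, and the 4096-token local limit are all assumptions made up for the example.

```python
from dataclasses import dataclass

# Hypothetical task labels; a real app would define its own.
ON_DEVICE_TASKS = {"summarise", "translate", "rewrite", "classify", "transcribe"}

@dataclass
class Request:
    task: str                # e.g. "summarise" or "open_ended_reasoning"
    prompt_tokens: int       # size of the input
    online: bool             # whether a network connection is available
    sensitive: bool = False  # user data that should stay on the device

def route(req: Request, local_context_limit: int = 4096) -> str:
    """Return "device" or "cloud" for one request. On-device is the default."""
    fits_locally = (
        req.task in ON_DEVICE_TASKS and req.prompt_tokens <= local_context_limit
    )
    # Stay local when the task fits, when there is no network,
    # or when the data should not leave the device at all.
    if fits_locally or not req.online or req.sensitive:
        return "device"
    # Escalate only when the local model cannot reasonably do the job.
    return "cloud"

print(route(Request("summarise", 800, online=True)))
print(route(Request("open_ended_reasoning", 800, online=True)))
```

The design choice worth noticing is the order of the checks: the cloud is the exception that has to be justified, not the default that the device occasionally replaces.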

What we're actually using it for

Strip away the announcements and shipped use cases fall into seven patterns:

Summarise. Notifications, emails, voice memos, screenshots, articles.
Translate. Text, phone calls, in-person conversations, live captions.
Rewrite and proofread. Writing Tools, Magic Compose, smart reply.
Extract and classify. Mail categorisation, screenshot indexing, photo tagging.
Transcribe. Voice memos, calls, meetings.
Suggest. Smart replies, todo parsing, journal prompts.
Generate images. Genmoji, Magic Editor, on-device Stable Diffusion.

What's not on the list: no platform vendor ships a general-purpose chatbot on-device. That's the negative space that defines the category. On-device models are engines for focused tasks. Apple is explicit in its developer documentation: don't use Foundation Models for code generation, math, or factual Q&A. The shipping pattern matches the model size.

Why this is happening now

Five forces are pushing the industry edgeward simultaneously.

Latency. Cloud round-trips add 200-500ms before the first token; on-device generation starts in under 20ms. For AR, real-time translation, and voice assistants, that's the difference between magical and broken.
Privacy. Data that never leaves the device cannot be breached, logged, or subpoenaed from someone else's servers. For health, finance, and personal content, keeping data local is increasingly the simplest route to regulatory compliance in much of Europe.
Cost. Cloud inference at consumer scale destroys margins on free-tier features. On-device shifts that cost to hardware the user already owns.
Availability. Offline-by-default is a feature, not a degradation. Aeroplane mode no longer means dumb mode.
True personalisation. A cloud model is the same instance for every user on Earth. The personalisation it offers is a workaround: a long prompt, a context window, a RAG index over your data, all reconstructed every session. An on-device model can actually become yours. It can absorb your preferences, writing style, and domain vocabulary into its weights, through LoRA adapters today and full test-time training tomorrow, with none of that data ever leaving the device. Personalisation in the cloud is a feature. Personalisation on the edge is a property of the model itself.
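
To see why LoRA adapters are a plausible vehicle for per-user personalisation, it helps to count parameters. LoRA leaves a weight matrix W frozen and learns a low-rank update B @ A alongside it. The sketch below compares the two sizes for one square matrix; the hidden size of 3072 and the rank of 16 are illustrative assumptions, not figures from any shipping model.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in a rank-r LoRA update B @ A for one d_out x d_in weight matrix."""
    # A has shape (rank, d_in); B has shape (d_out, rank). The base weight stays frozen.
    return rank * d_in + d_out * rank

d = 3072  # hypothetical hidden size
full = d * d
adapter = lora_trainable_params(d, d, rank=16)

print(f"full matrix:     {full:,} parameters")
print(f"rank-16 adapter: {adapter:,} parameters ({adapter / full:.2%} of the matrix)")
```

An adapter this small is cheap to train, store, and swap per user, which is what makes on-device personalisation practical without touching the base weights.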

It's not just phones. There's a run on Mac minis

If you want a single concrete signal of where edge AI is heading, watch what's happening to the Mac mini. In February 2026, a surge in interest in the local AI assistant OpenClaw triggered an actual supply shortage of M4 Mac minis. Tech outlets called the device "the hardware of choice for the decentralised AI movement." A $600 box, drawing 30-40 watts, sitting silently on a shelf, running a 24/7 personal AI agent. Always on. No subscription. No cloud round-trip.

The numbers behind the run make sense. A four-node M4 Pro cluster costs $6,000 in hardware and $15 a month in electricity. It runs Llama 3.1 70B and Qwen2.5 locally, replacing roughly $500 a month of cloud inference for a small agency. A single M4 Pro with 48GB unified memory at around $2,000 is what the community has converged on as the "sweet spot" for serious local LLM work. Apple's unified memory architecture turns out to be unusually well-suited to language models, because LLM inference is memory-bandwidth bound and unified memory eliminates the VRAM bottleneck that cripples consumer GPUs at the same price point.
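
The payback arithmetic is easy to check. The sketch below uses the figures quoted above ($6,000 of hardware, $15 a month of electricity, $500 a month of cloud inference replaced) and adds two assumptions of my own for the electricity cross-check: 35 watts per node, the midpoint of the 30-40 watt range quoted for the base Mac mini, and a price of $0.15 per kWh.

```python
def monthly_energy_cost(nodes: int, watts_per_node: float, price_per_kwh: float) -> float:
    """Electricity cost for running the nodes around the clock for a 30-day month."""
    hours = 24 * 30
    kwh = nodes * watts_per_node * hours / 1000
    return kwh * price_per_kwh

def breakeven_months(hardware_cost: float, cloud_monthly: float, running_monthly: float) -> float:
    """Months until the hardware pays for itself through avoided cloud spend."""
    saving = cloud_monthly - running_monthly
    if saving <= 0:
        raise ValueError("local running cost must be lower than the cloud bill")
    return hardware_cost / saving

# Assumed: 35 W per node and $0.15 per kWh (neither figure is from a measurement).
electricity = monthly_energy_cost(nodes=4, watts_per_node=35, price_per_kwh=0.15)
months = breakeven_months(hardware_cost=6000, cloud_monthly=500, running_monthly=15)

print(f"electricity: ${electricity:.2f} per month")
print(f"break-even:  {months:.1f} months")
```

With those inputs the electricity estimate lands close to the $15 quoted, and the cluster pays for itself in a little over a year.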

Phones are the most intimate edge device. Macs are the most powerful edge device individuals own outright. The same shift is showing up on both.

And the cloud bill is real

Cloud AI feels free at the $20 tier, until you tally what most active users are actually paying. ChatGPT Plus: $20/month. Claude Pro: $20/month. Google AI Pro: $19.99/month. The standard tier has converged at around $20 across providers. Power users are running two or three of these in parallel, which is $480 to $720 a year before any add-ons. The $1,320-a-year stack is a real and common pattern in 2026. Above that, OpenAI's Pro tier is $200/month, Claude Max sits at $100-$200/month, Google AI Ultra at $99.99-$200/month. These are not enterprise prices. These are individual humans paying them.

And the prices you see are heavily subsidised. Frontier cloud AI in 2026 is a money-losing business across the industry. OpenAI, Anthropic, and Google are all running at substantial losses, funded by venture capital and strategic investment rather than unit economics. Reporting in early 2026 around Anthropic's Claude Code Max plan made the dynamic concrete: a viral Forbes piece claimed each heavy user could consume up to $5,000 of compute against a $200 subscription. The debunk, led by Martin Alderson and widely circulated, sharpened rather than softened the picture: that $5,000 figure conflated retail API pricing with actual inference cost, and comparable open-weight models on OpenRouter run at roughly one-tenth of frontier API rates while remaining profitable. So the true cost was closer to $500 per heavy user, a loss of roughly $300 a month on the extreme tail. The honest version of the story isn't that any single provider is uniquely losing money. It's that the entire frontier API category prices at roughly 10× actual inference cost, while burning cash on training and research above that.
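
The chain of numbers in that paragraph is short enough to write out. This only restates the figures quoted above; the single modelling assumption is that the "one-tenth of frontier API rates" ratio applies uniformly to the whole $5,000.

```python
retail_api_value = 5000  # monthly compute for a heavy user, priced at retail API rates
markup = 10              # retail price is roughly 10x the actual inference cost
subscription = 200       # what the heavy user pays per month

inference_cost = retail_api_value / markup
monthly_loss = inference_cost - subscription

print(f"estimated inference cost: ${inference_cost:,.0f} per month")
print(f"loss on that user:        ${monthly_loss:,.0f} per month")
```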

What the markup means for anyone building on top: you pay it. Cursor, the AI coding tool, pays retail API rates for every token its users burn. For heavy users, $5,000/month is a plausible real number. Which is why Cursor is now openly building its own Composer model family on open-source DeepSeek, Kimi, and Qwen. The economics of building consumer AI features on top of someone else's API are punishing once you cross a certain scale, and the response, for serious builders, is to move off the API and onto models you control. That is, in part, what edge AI is.

A consumer-facing AI feature with 50,000 active users sending a few queries a day is a five-figure monthly cloud bill before you've earned a single dollar of revenue. Those bills exist against subsidised provider prices. When the subsidies stop, the bills get worse. On-device inference doesn't get worse; its per-query cost to the developer stays at zero. That's not a small structural advantage. It's the difference between a feature you can give everyone and a feature you have to gate.
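
Here is one way the five-figure bill can arise. The user count comes from the paragraph above; "a few queries a day" is read as three, and the blended price of $0.003 per query is a placeholder I picked for illustration, not a quoted rate from any provider.

```python
def monthly_cloud_bill(
    active_users: int,
    queries_per_user_per_day: float,
    cost_per_query: float,
    days: int = 30,
) -> float:
    """Monthly inference spend for a feature that calls a cloud API on every query."""
    return active_users * queries_per_user_per_day * days * cost_per_query

# Assumed: 3 queries per user per day at a hypothetical $0.003 per query.
cloud = monthly_cloud_bill(50_000, 3, 0.003)
# On-device inference has no per-query charge for the developer.
on_device = monthly_cloud_bill(50_000, 3, 0.0)

print(f"cloud:     ${cloud:,.0f} per month")
print(f"on-device: ${on_device:,.0f} per month")
```

The bill scales linearly with every input, so doubling either usage or price doubles the spend, while the on-device line stays flat.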

Qualcomm CEO Cristiano Amon, at Davos this year: "whoever has presence on the edge is going to win. The edge is where the humans are."

What this could become

Look at what cloud AI did to work in the last three years. Drafting, summarising, translating, coding, research, customer support. Every knowledge-work category has been quietly restructured. Whole job descriptions look different now. Whole tools have been replaced by a chat window and a model behind it.

Now imagine that same disruption happening on a device that's already on your person sixteen hours a day. Models powerful enough to draft, plan, summarise, and reason without ever leaving your phone. No round-trip. No bill. No privacy negotiation. And then the part most people miss: an open layer between apps that lets them all talk to the same local model with shared context.

Today every app on your phone is an island. Your messages app doesn't know what your calendar knows. Your calendar doesn't know what your notes app knows. Your fitness app doesn't know about the meeting you just scheduled that conflicts with your run. The model on the device has every reason to know all of it. It's on the same hardware, with the user's permission, behind the same privacy boundary. What it needs is an interoperability layer: a standard way for apps to expose context and tools to the local model, the way MCP exposes tools to cloud models today. When that layer arrives, the phone stops being a launcher for thirty disconnected apps and starts behaving like a single intelligent surface that knows you, holds your context, and acts across applications on your behalf. Privately, locally, instantly.
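
No such layer exists yet, so the following is purely a thought experiment in code. It sketches the smallest version of the idea: apps register a context provider, the user grants access per app, and the local model only ever sees context from apps that were granted. Every name here, including `LocalContextHub`, is hypothetical and not part of any OS or of MCP.

```python
from typing import Callable, Dict, List, Set

class LocalContextHub:
    """Hypothetical on-device registry of app context; not a real OS or MCP API."""

    def __init__(self) -> None:
        self._providers: Dict[str, Callable[[], List[str]]] = {}
        self._granted: Set[str] = set()

    def register(self, app: str, provider: Callable[[], List[str]]) -> None:
        """An app offers a function that returns its current context."""
        self._providers[app] = provider

    def grant(self, app: str) -> None:
        """The user allows the local model to read this app's context."""
        self._granted.add(app)

    def gather(self) -> Dict[str, List[str]]:
        """Collect context only from apps the user has granted."""
        return {
            app: provider()
            for app, provider in self._providers.items()
            if app in self._granted
        }

hub = LocalContextHub()
hub.register("calendar", lambda: ["Meeting 18:00-19:00"])
hub.register("fitness", lambda: ["Planned run 18:30"])
hub.register("notes", lambda: ["Private note"])
hub.grant("calendar")
hub.grant("fitness")

context = hub.gather()
print(context)
```

With the calendar and fitness entries side by side, a local model has what it needs to notice the clash between the meeting and the run, while the notes app stays out of view because the user never granted it.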

That's the trajectory worth paying attention to. Not bigger models in bigger data centres. Smaller, faster, more personal intelligence on the device you already trust with the rest of your life.

The next wave of intelligence will be unlocked by truly edge-driven smartphones, when AI becomes a layer that apps sit on and share data with on-device, and that personalises and trains for you, by you.

What's coming next in this series

This post is the orientation. The next piece goes into the engineering reality: why memory bandwidth (not compute) is the real bottleneck on phones, what edge AI actually costs in battery, heat, and accuracy, and what to build first as a developer. By the end of the series we'll be designing apps that assume the cloud is optional.

Next in the series: Edge AI on Mobile: The Honest Trade-offs — what on-device AI really costs in battery, heat, and accuracy.