Part 3 of the Edge AI on iOS series. Start with Part 1: The State of Edge AI on Mobile and Part 2: The Honest Trade-offs.
No cloud, no dependencies, no vector database. iOS 26, on-device.
Edge AI means inference that runs entirely on the device, with no server round-trip. It usually gets sold on privacy and latency. The under-told part is that as of iOS 26 you can build a complete retrieval-and-generation feature on the edge with two first-party Swift frameworks and zero dependencies. Retrieval is NLContextualEmbedding (NaturalLanguage, available since iOS 17), which turns text into vectors so you can rank your own data by meaning. Generation is LanguageModelSession (Foundation Models, new in iOS 26). The whole thing runs on the phone, offline, with no API key and no Package.swift.
You'll see this pattern called "on-device RAG," and the name matters because it predicts the behaviour. Classical RAG pre-retrieves context and prepends it to the prompt; retrieval runs unconditionally, before the model does. What Foundation Models actually gives you is different: retrieval composed as a tool the model chooses to call. The model decides whether to retrieve, generates the search query itself, and your retrieval runs inside that tool call. The most surprising thing I hit while building this, covered below, only happens because retrieval is tool-gated rather than pre-computed. So I'll call it what it is: retrieval as a tool.
The demo is a restaurant menu you can ask questions in plain English, such as "anything vegan with mushrooms?", and it answers from your own data rather than the model's training. The real code is below, layer by layer. The interesting part is what surprised me when I ran it, and I get into that at the end.
The shape of the thing
Four components. Two of them are the two halves of the pattern, retrieve then generate:
- The data. 50 hand-written menu items, in memory. No SQLite for v1.
- The embedder. Turns each item into a vector once, and the query into a vector at search time. This is the R.
- Retrieval. Cosine similarity, ranked, top-3.
- The session. An LLM with a registered tool that runs retrieval and writes the answer. This is the G.
The seam between retrieval and generation is a tool call: the model decides it needs menu data, calls your Swift function, gets matches back, and generates from them.
The data
struct MenuItem: Identifiable, Sendable {
    let id: Int
    let name: String
    let description: String
    let price: Double
    let allergens: [String]
    let dietary: [String]

    var embeddingText: String {
        let allergenText = allergens.isEmpty ? "no listed allergens" : allergens.joined(separator: ", ")
        return "\(name). \(description) Contains: \(allergenText)."
    }
}
Nothing clever. The only design decision is embeddingText: we embed a composed sentence rather than just the name, so a query about allergens or ingredients has something to match against.
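To make the composed sentence concrete, here is a minimal sketch of what ends up in the embedder for a dish. The two dishes and the trimmed-down SampleItem type are made up for illustration; only the embeddingText rule mirrors the struct above.

```swift
// Trimmed copy of the article's MenuItem; the dishes themselves are hypothetical.
struct SampleItem {
    let name: String
    let description: String
    let allergens: [String]

    // Same composition rule as MenuItem.embeddingText.
    var embeddingText: String {
        let allergenText = allergens.isEmpty ? "no listed allergens" : allergens.joined(separator: ", ")
        return "\(name). \(description) Contains: \(allergenText)."
    }
}

let risotto = SampleItem(
    name: "Wild Mushroom Risotto",
    description: "Arborio rice with porcini and thyme.",
    allergens: ["dairy"]
)
let salad = SampleItem(
    name: "Citrus Salad",
    description: "Orange, fennel, and mint.",
    allergens: []
)

print(risotto.embeddingText)
// Wild Mushroom Risotto. Arborio rice with porcini and thyme. Contains: dairy.
print(salad.embeddingText)
// Citrus Salad. Orange, fennel, and mint. Contains: no listed allergens.
```

A query that mentions dairy or porcini now has text to land on, which a name-only embedding would not provide.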
The embedder
NLContextualEmbedding is a BERT-family encoder. "Contextual" is the important word. Unlike older static embeddings (word2vec, GloVe, or Apple's own NLEmbedding) that give one fixed vector per word, a contextual model runs the whole sentence through a transformer, so a word's vector shifts with its neighbours. That's the difference between "bank" in river bank and in savings bank.
import NaturalLanguage

// Defined here so the snippet compiles on its own.
enum EmbeddingError: Error {
    case modelUnavailable
}

final class MenuEmbeddingStore: @unchecked Sendable {
    private(set) var embeddedItems: [(menuItem: MenuItem, vector: [Float])] = []
    private let embeddingModel: NLContextualEmbedding

    init() throws {
        guard let model = NLContextualEmbedding(language: .english) else {
            throw EmbeddingError.modelUnavailable
        }
        self.embeddingModel = model
    }

    func prepare(with items: [MenuItem]) async throws {
        // Fetch the shared system assets on first use, then load the model.
        if !embeddingModel.hasAvailableAssets {
            _ = try await embeddingModel.requestAssets()
        }
        try embeddingModel.load()
        // Embed every item once, up front.
        embeddedItems = try items.map { ($0, try embed($0.embeddingText)) }
    }
The model ships as a system asset, not inside your binary. requestAssets() triggers a one-time OS-managed download that's shared across apps, so you never bundle weights yourself.
This is the part most walkthroughs skip. NLContextualEmbedding returns one vector per token, not one per sentence. To get a single vector for a menu item you have to combine them, and the standard, simplest choice is mean pooling: average the token vectors.
    func embed(_ text: String) throws -> [Float] {
        let result = try embeddingModel.embeddingResult(for: text, language: .english)
        let dimension = embeddingModel.dimension
        var sum = [Float](repeating: 0, count: dimension)
        var tokenCount = 0
        // Token vectors arrive as [Double]; accumulate them as Float.
        result.enumerateTokenVectors(in: text.startIndex..<text.endIndex) { vector, _ in
            for i in 0..<dimension { sum[i] += Float(vector[i]) }
            tokenCount += 1
            return true  // keep enumerating
        }
        // Empty input: return the zero vector rather than dividing by zero.
        guard tokenCount > 0 else { return sum }
        return sum.map { $0 / Float(tokenCount) }
    }
}
What's in that vector? A few hundred floats. Read dimension rather than hardcoding it. The individual numbers aren't human-interpretable; there's no "dimension 12 = spiciness." Meaning is spread across all of them, which is why you compare whole vectors rather than inspect components. enumerateTokenVectors also hands you a Range<String.Index> into the original string per token, not the internal token string. You get sub-word granularity without ever touching sub-word spelling, and your pooling loop divides by tokenCount, so it stays correct whether a word splits into one piece or four.
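The pooling step is easier to check in isolation. The sketch below pulls it out of NaturalLanguage so it runs anywhere; meanPool is a hypothetical helper, and the 3-dimensional token vectors are made-up values, not real model output. It assumes every token vector has at least `dimension` entries.

```swift
// Mean pooling over per-token vectors, mirroring the loop in embed().
// Assumes each token vector has at least `dimension` entries.
func meanPool(_ tokenVectors: [[Double]], dimension: Int) -> [Float] {
    var sum = [Float](repeating: 0, count: dimension)
    for vector in tokenVectors {
        for i in 0..<dimension { sum[i] += Float(vector[i]) }
    }
    // No tokens: return the zero vector, as embed() does.
    guard !tokenVectors.isEmpty else { return sum }
    return sum.map { $0 / Float(tokenVectors.count) }
}

// One input that produced a single token vector, one that produced three.
let onePiece: [[Double]] = [[2, 4, 6]]
let threePieces: [[Double]] = [[1, 2, 3], [2, 4, 6], [3, 6, 9]]

print(meanPool(onePiece, dimension: 3))     // [2.0, 4.0, 6.0]
print(meanPool(threePieces, dimension: 3))  // [2.0, 4.0, 6.0]
print(meanPool([], dimension: 3))           // [0.0, 0.0, 0.0]
```

Because the sum is divided by the number of token vectors, the pooled result keeps the same scale however many pieces the tokenizer produced.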
Retrieval
Cosine similarity, written out rather than pulled from Accelerate, because the point is to read it:
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    var dot: Float = 0, magA: Float = 0, magB: Float = 0
    for i in 0..<min(a.count, b.count) {
        dot += a[i] * b[i]
        magA += a[i] * a[i]
        magB += b[i] * b[i]
    }
    let denom = sqrt(magA) * sqrt(magB)
    return denom > 0 ? dot / denom : 0
}
Cosine measures the angle between two vectors, ignoring magnitude: 1.0 means identical direction, 0 means orthogonal (unrelated), and -1 means opposite. Because it reads angle and not distance, a three-word dish and a twenty-word dish stay comparable. Retrieval is then a brute-force scan: score all 50, sort, take three.
func topMatches(for query: String, count: Int = 3) throws -> [(item: MenuItem, score: Float)] {
    let queryVector = try embed(query)
    return embeddedItems
        .map { (item: $0.menuItem, score: cosineSimilarity(queryVector, $0.vector)) }
        .sorted { $0.score > $1.score }
        .prefix(count)
        .map { $0 }
}
At 50 items, brute force is instant. This scan is the part you'd swap for sqlite-vec at 50,000.
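A toy run makes the ranking and the magnitude claim easy to verify by hand. This is a sketch under stated assumptions: the dish names and 2-D vectors are invented stand-ins for real embeddings, rank is a hypothetical helper, and cosineSimilarity is the same math as above written with squareRoot() so the block needs no imports.

```swift
// Same math as the article's cosineSimilarity, using squareRoot() to avoid imports.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    var dot: Float = 0, magA: Float = 0, magB: Float = 0
    for i in 0..<min(a.count, b.count) {
        dot += a[i] * b[i]
        magA += a[i] * a[i]
        magB += b[i] * b[i]
    }
    let denom = magA.squareRoot() * magB.squareRoot()
    return denom > 0 ? dot / denom : 0
}

// Hypothetical index: made-up names and 2-D vectors, not real embeddings.
let index: [(name: String, vector: [Float])] = [
    (name: "Mushroom Risotto", vector: [4, 3]),
    (name: "Beef Burger", vector: [0, 2]),
    (name: "Mushroom Toast", vector: [3, 4]),
    (name: "Truffle Pasta", vector: [10, 0]),
]

// Brute-force scan: score everything, sort descending, keep the top `count`.
func rank(_ query: [Float], top count: Int) -> [(name: String, score: Float)] {
    Array(
        index
            .map { (name: $0.name, score: cosineSimilarity(query, $0.vector)) }
            .sorted { $0.score > $1.score }
            .prefix(count)
    )
}

let ranked = rank([1, 0], top: 3)
print(ranked.map { $0.name })
// ["Truffle Pasta", "Mushroom Risotto", "Mushroom Toast"]

// Doubling a vector's length leaves its cosine score unchanged.
print(cosineSimilarity([4, 3], [8, 6]))  // 1.0
```

Truffle Pasta wins with a vector ten times longer than the query, which is the point: only direction counts.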
The session
The tool is where retrieval plugs into generation. The model reads name and description to decide when to call it, so the description is assertive on purpose:
import Foundation
import FoundationModels

struct LookupMenuTool: Tool {
    let name = "lookupMenu"
    let description = """
        Look up real dishes from this restaurant's menu by ingredient, dietary need, \
        or allergen. Always call this before recommending anything. Never invent dishes.
        """
    let store: MenuEmbeddingStore

    @Generable
    struct Arguments {
        @Guide(description: "What the guest wants, e.g. 'vegan dish with mushrooms'.")
        let query: String
    }

    func call(arguments: Arguments) async throws -> String {
        let matches = try store.topMatches(for: arguments.query, count: 3)
        guard !matches.isEmpty else { return "No matching menu items were found." }
        // One line per match: name, description, price, dietary tags.
        return matches.map { m in
            "\(m.item.name) - \(m.item.description) $\(String(format: "%.2f", m.item.price)). "
                + "Dietary: \(m.item.dietary.joined(separator: ", "))."
        }.joined(separator: "\n")
    }
}
@Generable is guided generation. The framework constrains the model's output so Arguments is always a valid Swift value, and you never parse JSON. The tool returns a plain String, which is all the model needs to ground its reply. Wire it into a session once and reuse it so the conversation has memory:
let tool = LookupMenuTool(store: store)
let session = LanguageModelSession(tools: [tool]) {
    "You are a concise waiter. Always call lookupMenu to find real dishes, then "
        + "recommend only from what it returns. Never invent dishes or prices."
}
Then session.streamResponse(to:) gives you tokens as they generate. Each streamed partial.content is the full response so far, a cumulative snapshot rather than a delta. Replace your text buffer instead of appending to it, or you'll duplicate everything.
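The replace-versus-append difference is easy to see without the framework. In this sketch a plain array stands in for the stream; each element plays the role of one cumulative partial.content value, and the response text is made up.

```swift
// Simulated stream: every element is the full response so far.
let snapshots = ["Try", "Try the", "Try the mushroom risotto."]

var replaced = ""  // correct: overwrite the buffer with each snapshot
var appended = ""  // bug: treats each snapshot as a delta

for snapshot in snapshots {
    replaced = snapshot
    appended += snapshot
}

print(replaced)  // Try the mushroom risotto.
print(appended)  // TryTry theTry the mushroom risotto.
```

In the app, the loop body would run once per element received from session.streamResponse(to:), assigning to whatever state backs the text view.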
The thing that surprised me
A happy-path demo teaches you nothing you couldn't get from the docs. The one finding worth the whole build came from routing retrieval through the model's judgment, which classical pre-retrieval RAG never does. Here's what happened in testing.
The query that reaches the embedder is not what the user typed. I went looking for a garbage-in-garbage-out failure on purpose. I typed ghghhhhhh vegan meal into the field, expecting the nonsense prefix to drag the query vector somewhere useless. To watch it happen, I dropped a print at the top of the retrieval function:
func topMatches(for query: String, count: Int = 3) throws -> [(item: MenuItem, score: Float)] {
    print("topMatches received: \(query)")
    let queryVector = try embed(query)
    // ...
}
The console printed:
topMatches received: vegan meal
Not ghghhhhhh vegan meal. Just vegan meal. The garbage was gone before retrieval ever ran, and that reframed the whole pipeline for me. The flow isn't user text → embedder. It's user text → LanguageModelSession → model generates a tool query → embedder. In my run, the LLM read the nonsense, decided the ghghhhhhh was noise, and generated a clean arguments.query of vegan meal to hand the tool. The embedder only ever saw the model's paraphrase, two hops from my keystrokes. (It's model behaviour, so a different run, or a different OS build, might reformulate it differently. That variability is itself the point of this section.)
This is the single most important thing to understand about retrieval-as-a-tool, and most tutorials skip it: there's a model in the middle rewriting the input. It cuts both ways. On the upside, it sanitizes messy real-world input for free, so typos and filler get cleaned up before they hit cosine similarity. On the downside, it's non-deterministic, so the same user sentence can produce a slightly different tool query, and therefore different retrieval scores, from one run to the next. The first time I saw scores shift between runs I blamed the embedder. The embedder was innocent every time, because embed() and cosineSimilarity() are fully deterministic. The variability lives in the LLM upstream of them. (Pull that print before you ship, but keep it in during development; it's the fastest way to see what your model is actually asking for.)
The same tool-gating explains a quieter risk: the model decides whether to call lookupMenu at all, and bare fragments trigger it less reliably than full questions. An assertive tool description and explicit session instructions ("if the guest mentions any food term, call lookupMenu") cut the miss rate sharply, but they don't drive it to zero, because tool invocation is a model judgment rather than a guarantee. When you can't tolerate that, the escape hatch is to stop asking: call topMatches yourself on the raw input and inject the results into the prompt. That gives up the elegance of letting the model decide, and buys back determinism. Which tradeoff is right depends on whether an occasional missed call is a minor annoyance or a demo-stopping silence.
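The escape hatch can be sketched as plain string assembly. This is a minimal illustration, not the article's code: groundedPrompt is a hypothetical helper, the prompt wording is an assumption, and the match string is made up. In the app, the matches would come from store.topMatches on the raw input, formatted the way the tool formats them.

```swift
// Hypothetical helper: builds a prompt that already contains the retrieved
// dishes, so retrieval no longer depends on the model choosing to call a tool.
func groundedPrompt(question: String, matches: [String]) -> String {
    let context = matches.isEmpty
        ? "No matching menu items were found."
        : matches.map { "- \($0)" }.joined(separator: "\n")
    return """
    Menu items retrieved for this question:
    \(context)

    Guest question: \(question)
    Recommend only from the items listed above.
    """
}

// Made-up match; real ones would be formatted from topMatches results.
let prompt = groundedPrompt(
    question: "vegan meal",
    matches: ["Mushroom Toast. Sourdough with roasted mushrooms. $9.00."]
)
print(prompt)
```

The resulting string would then go to a session created without the tool, for example through session.respond(to:). Note what changes: the raw text, noise included, now reaches embed() directly, so the free input cleanup described above is part of what you trade away.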
The tool doesn't have to be local
Everything above keeps retrieval on the device. The seam, though, is narrower than it looks. lookupMenu is just a Swift function the model calls; nothing requires its body to run locally. Swap the topMatches call for a network request and the same tool now retrieves from a REST API, a remote vector database, or a server you control. The generation half stays on-device; only retrieval goes out.
The moment retrieval hits the network you've given up the offline guarantee and the privacy story that justified going to the edge in the first place: the query, or whatever you derive from it, leaves the device. Treat remote retrieval as a deliberate choice, not a default, for cases where your data genuinely can't live on the phone: it's too large, too sensitive to embed client-side, or changes too often. For a menu, keep it local. For a 10-million-row product catalog that updates hourly, the tool-call boundary is exactly where you'd reach out.
The economics of moving inference to the edge
The cost argument is easy to state badly. With a cloud LLM you pay per token, for every user's every query, indefinitely; usage growth is cost growth. Running generation on-device moves that compute onto hardware the user already owns, so your marginal inference cost per query is zero. A thousand users asking a thousand questions each costs you the same as one: nothing, on the inference line.
The honest caveat is that "zero" is only the developer's cloud bill. The compute didn't vanish, it relocated. The user pays in battery drain, thermal headroom, and the requirement to own a recent, Apple-Intelligence-capable device. So the edge doesn't make inference free in absolute terms; it shifts the cost from your metered cloud account to the user's hardware, and it caps your exposure no matter how heavily the feature gets used.