PassLane “Coach” — Master Plan
AI learning companion · plan only, no code · generated 2026-06-18

Codename: Coach · Principal / Head of Product + Engineering · 2026-06-18

Status: definitive. Supersedes all prior drafts.

Provenance of claims (read this first). Two classes of fact appear below, tagged where it matters. [repo-verified] = checked against /Users/arizona/CLAUDE CODE/passlane on 2026-06-18 (file/line/count confirmed). [API-assumption] = Anthropic API behavior or pricing as of the 2026-01 knowledge cutoff, to be re-confirmed against current docs before the Phase-1 build. The plan is engineered so that no [API-assumption] flipping silently breaks a load-bearing section — each such dependency carries an explicit fallback. We do not claim blanket "everything is verified"; a plan whose moat is honesty cannot afford a single falsifiable boast.


1. Executive Summary

PassLane already turns dead commute time into mastery for a brutal exam — roughly half of insurance-license candidates fail, almost always from under-preparation and skipped state law. Coach is the teacher who rides along: a calm, candid exam instructor who lives inside the existing one-file app, speaks in the same af_bella voice that already reads questions aloud, is silent until summoned, and never says a word it can't trace to a vetted explanation — when the bank doesn't cover something, Coach says so instead of inventing the law. It earns its keep in three postures the learner pulls, never the app pushes: Train (teach a concept), Test (drill weak areas and coach mock exams), Talk (think a question through). The economics that killed Quizlet's Q-Chat are designed out from day one: the entire teaching corpus is pre-generated offline against the fixed bank, human-reviewed, content-hashed, and shipped into local app data — so the high-value path runs fully offline at zero runtime cost and ships free to every learner, with only live open-ended chat metered behind a key-holding edge proxy as the Pro headline.

What we are building, in one sentence: A grounded, voice-first study companion that lives inside PassLane, speaks in its existing voice, and can be summoned to Train, Test, or Talk you toward your license — provably never inventing the law, never leaking an answer during an exam, and never running up an unmetered bill.

Two honest constraints stated up front, because the plan is built around them:

  1. Spoken Coach has a real audio gap to close. [repo-verified] Only 215 of 323 AZ questions have read-aloud clips today; 108 (33%) have none, and no Coach copy is voiced at all. The af_bella generation pipeline is absent from the repo. So the spoken commute is a built deliverable with a named work-stream and cost line (§4.7, §6.3), not an inherited freebie. Text Coach ships first and is fully functional for 100% of the bank.
  2. Offline audio fails for a specific, now-correctly-diagnosed reason (cross-origin service-worker bypass, not a cache refusal), which changes the fix (§4.7).

2. Design Principles (the non-negotiables)

Each is enforced in code or a CI gate, and each killed a tempting alternative.

  1. Grounding buys correctness; model tier buys polish. Every factual claim is tied to a vetted bank explanation via the Citations API. The cheap model (Haiku 4.5) is the default because correctness comes from the retrieved source, not the parameter count. We pay for Sonnet/Opus only where warmth and judgment, not facts, are the value.
  2. Local-first is a hard constraint, enforced in code, not a slogan. Core study sends nothing and needs no cloud. The companion is purely additive: if Coach is unavailable, the existing text+tap study path is byte-for-byte unchanged. The heavy teaching path is pre-generated and ships in the app like the question banks already do (scripts/export-pack.mjs rebuilds the same questions*.json filenames with no index.html change).
  3. Honesty is the moat, made mechanical rather than promised. Coach shows its source on every claim; when grounding is thin it refuses and says so. This is the literal UX of "no hallucinations," and it mirrors the shipped voice-out rule [repo-verified, speak(), index.html:2790–2793: "Recordings are the ONLY voice. Never fall back to robotic system TTS"].
  4. Quiet by default. Coach speaks only at four earned moments (you ask, a feedback reveal, you ask to be drilled, a rare threshold-warmth). Per-answer chatter is forbidden by construction: it would regress the codebase's deliberate restraint (warmthTail fires once at the 3rd-miss or 8-streak; the Coach Reveal is neutral, no buzzer). "The AI talks too much" must be impossible, not merely tuned away.
  5. Exam integrity is a wall, inherited from existing gates. Coach is hard-disabled whenever isExam is true: the mic is already hidden [repo-verified, index.html:2737] and read-aloud is already gated [2809/2839]. "Build mastery, never enable cheating" is enforceable at gates already in the repo, on two surfaces (in-app and the public answer-audio CDN; see §5.4).
  6. The voice contract is frozen behavior. All mic access routes through the single startListening/stopListening chokepoint (ISOLATION RULE #3). The partialResults-resolves-empty quirk is load-bearing. node voice-sandbox/harness.js must exit 0 before and after any change near the listen window.
  7. Pre-generate once, serve forever. The fixed bank means the entire companion corpus is computed offline (Batch API), reviewed, and cached. Runtime LLM cost is effectively $0; live calls exist only for what genuinely cannot be precomputed: a learner's own words.
  8. Build inside the constitution. Own a cx- CSS prefix, one render-region writer, plain JS / no build step (introducing TS or a bundler here crosses the simplicity line and is not warranted). Anchor every edit by symbol, not raw line number, because the 287KB single file shifts.
  9. Engine-aware by construction, PassLane first. Coach reads grounding from the same per-exam content packs the STATE_FILE map and D1 verticals→exams→categories→questions schema already model. Prompts, refusal copy, voice-id, and the confusable-map live as per-vertical config. Build and tune for Arizona/insurance first; CDL/NCLEX/real-estate inherit the companion with no code fork.

3. The Companion Experience — Train / Test / Talk

3.1 Persona

One presence, unnamed-feeling (the UI says "Coach," never a mascot, no avatar — brand law is warm, calm, teacher-first, de-cheesed). It is the same af_bella voice as read-aloud, so Coach is the teacher who's been reading you the questions, now leaning in — the seamlessness Speak's users praise and Duolingo Max's "scripted, like free AI" lacks. Diction: short sentences, plain English, names the exact concept and the exact misconception, no "Great job!" filler. Candor is the character — on thin grounding: "The bank doesn't cover that one head-on — here's the closest principle it does teach." A tutor that bluffs on a licensing exam is a liability.

Note on the voice contract: "af_bella" is not a code-enforced constant — [repo-verified] it appears zero times in index.html and exists only as a single top-level "voice":"af_bella" field in app/audio/states-manifest.json (manifest convention, not frozen API). We make it a real contract: the export/voice pipeline stamps and asserts voice === 'af_bella' on every new Coach clip, the way harness.js makes the voice contract real — so nothing can silently ship a clip in a different voice.
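A minimal sketch of how the export step might make that contract real. Only the top-level "voice":"af_bella" field is confirmed to exist; the per-clip `clips` array, `REQUIRED_VOICE`, and `assertVoiceContract` are hypothetical names used for illustration.

```javascript
// Hypothetical export-time check; the per-clip manifest shape is an assumption.
const REQUIRED_VOICE = "af_bella";

function assertVoiceContract(manifest) {
  // The top-level field mirrors the one that exists in states-manifest.json.
  if (manifest.voice !== REQUIRED_VOICE) {
    throw new Error(`manifest voice is "${manifest.voice}", expected "${REQUIRED_VOICE}"`);
  }
  // Proposed addition: every new Coach clip is stamped with its own voice id.
  const offenders = (manifest.clips || []).filter((clip) => clip.voice !== REQUIRED_VOICE);
  if (offenders.length > 0) {
    const ids = offenders.map((clip) => clip.id).join(", ");
    throw new Error(`clips not voiced as ${REQUIRED_VOICE}: ${ids}`);
  }
  return true;
}
```

Run inside the export script, a thrown error would stop the build the same way a non-zero harness exit does.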

Mandatory AI + scope disclosure (an Anthropic AUP contract term for a high-risk vertical, not a flourish) opens Coach's first turn of a session, once, in PassLane's voice:

"Quick note — I'm an AI study coach. I help you learn the exam's answers. I'm not a licensed agent, and this isn't insurance advice. Okay, let's get you ready."

This single line satisfies the AUP disclosure requirement, draws the exam-prep-vs-advice legal line, and meets the FTC honest-AI bar at once.

3.2 Default posture — "Quiet Companion"

The shipped study loop (mode_select → reading → listening → feedback → advancing) is untouched. Coach earns the right to speak in exactly four moments, then returns to silence:

  1. You ASK
     Trigger: explicit Ask gesture + an off-script question
     Coach does: Train/Talk, grounded (clip if pre-gen'd & voiced, else text + earcon)
  2. You answered / didn't
     Trigger: normal feedback reveal
     Coach does: the existing speakFeedback/revealUnanswered, optionally enriched by pre-gen elaboration
  3. You ask to be QUIZZED
     Trigger: "drill my weak spots"
     Coach does: builds a queue via existing Leitner/weakCategories, runs the normal answer flow
  4. A real threshold fires
     Trigger: existing warmthTail points (3rd-miss, 8-streak, return-after-gap)
     Coach does: a rare, pre-recorded encourage line, never per-answer
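The four moments could be enforced by a single gate that every Coach surface calls before speaking. The sketch below assumes hypothetical moment names and state fields (`missStreak`, `rightStreak`, `returnedAfterGap`); only isExam and the warmthTail thresholds come from the plan.

```javascript
// Hypothetical gate: Coach may speak only at the four earned moments.
const EARNED_MOMENTS = new Set(["ask", "feedback_reveal", "drill_request", "threshold_warmth"]);

function coachMaySpeak(moment, state) {
  if (state.isExam) return false; // the exam wall beats everything
  if (!EARNED_MOMENTS.has(moment)) return false; // per-answer chatter is not a moment
  if (moment === "threshold_warmth") {
    // Mirrors the described warmthTail points: 3rd miss, 8-streak, return after a gap.
    return state.missStreak === 3 || state.rightStreak === 8 || state.returnedAfterGap === true;
  }
  return true;
}
```

Because unknown moments fall through to false, a new call site that forgets to register itself stays silent by default.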

3.3 TRAIN — "teach me this" (mostly offline, FREE)

The highest-ROI, lowest-risk surface, and it ships free because it's local data.

Card [repo-verified, questions.json[0], id pc001, correct D]: For a property insurance policy, insurable interest must exist at what point in time? (A) any time (B) when applied for (C) inception and loss (D) at the time of loss. Learner picks C.
Coach (cites pc001): "Close — C is the classic trap. For property, you only need insurable interest at the time of the loss, so it's D. You're borrowing the life rule, where the interest only has to exist at inception. That inception-vs-loss split is the whole point of this one."

3.4 TEST — "drill me / coach my mock exam" (reuses existing machinery)

3.5 TALK — "let's talk it through" (Pro, online, the small metered tail)

Open-ended grounded Q&A, Sonnet 4.6, warm, hard-capped ~2–3 turns, tethered to the question's explanation, text-reply on screen at launch (no system TTS — §9). Socratic is a ≤1-beat scalpel, then a clear answer (RCT evidence: Socratic-heavy tutoring shows no outcome gain and feels withholding to time-pressured adults). Opus 4.8 is reserved for a single premium end-of-session mock-exam diagnostic that reasons across the whole miss-pattern. Never when isExam. Latency budget and failure behavior are specified in §3.9.

3.6 The voice-first loop — two postures over one mic lifecycle

The shipped listen window is frozen behavior tuned for SHORT utterances [repo-verified: taskHint=.confirmation, contextualStrings=['A'..'D'], VOICE_SILENCE_BUDGET_MS=5000, VOICE_HARD_CAP_MS=12000 (an absolute per-question ceiling), NATIVE_RESTART_COOLDOWN_MS=400 (guarding the teardown race)]. A conversation needs the opposite. So:

Decisive scope call — spoken multi-turn Ask is spike-gated; TEXT-ASK is the Phase-3 default and ships regardless. A long ASK window that reopens the mic mid-question is exactly the reopen-cycling the 5s budget was introduced to kill (the budget replaced a 14s one for this reason), and it collides with the absolute 12s per-question cap and the teardown race. Multi-turn dictation on the local speech fork is untested. So text-ASK (an "Ask Coach…" field, same grounding, same citations) is the conversational tier on day one of Phase 3, with zero long-mic risk. Spoken ASK ships only after an on-device spike proves, on real iOS and Android, that (a) the engine doesn't bail mid-sentence, (b) reopening doesn't trip the 2-sessions race, and (c) ASK runs on a non-question generation id with its own ceiling decoupled from the 12s cap. The spike is scheduled against docs/IOS-VOICE-TEST-PLAN.md [repo-verified to exist].

3.7 Memory — read, don't store

All cross-session continuity is a READ over state that already persists locally (az_ keys: Leitner box, per-category accuracy, miss/right streaks, sessionMissed). Coach already knows what you keep missing — it just isn't speaking from it yet. The return-opener is specific and grounded — "Last time, claims-made vs occurrence tripped you twice — want to start there?" — sourced entirely from local data. Persist at most a tiny local cx_memory object (last topic, last weak categories, last-seen date) in localStorage. No transcripts, no PII, no third party. This is the highest-trust, lowest-risk feature and it ships first — and it doubles as the struggle-signal source that lets us honor the no-tracking promise without adding analytics.
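As a rough illustration of "read, don't store", the opener and the cx_memory object can both be derived from per-category stats already on the device. The input shape (`category`, `recentMisses`) and both function names are assumptions; the real az_ keys may be organized differently.

```javascript
// Hypothetical reader over local study state; key names and shapes are assumptions.
function buildReturnOpener(categoryStats, minMisses = 2) {
  // categoryStats: [{ category: "claims-made vs occurrence", recentMisses: 2 }, ...]
  const weakest = categoryStats
    .filter((c) => c.recentMisses >= minMisses)
    .sort((a, b) => b.recentMisses - a.recentMisses)[0];
  if (!weakest) return null; // nothing specific to say, so Coach stays quiet
  return `Last time, ${weakest.category} tripped you ${weakest.recentMisses} times. Want to start there?`;
}

function toCxMemory(categoryStats, lastTopic, today) {
  // Only the three fields the plan allows: no transcripts, no PII.
  const lastWeakCategories = categoryStats
    .filter((c) => c.recentMisses > 0)
    .map((c) => c.category);
  return { lastTopic, lastWeakCategories, lastSeen: today };
}
```

The result of `toCxMemory` is small enough to serialize with `JSON.stringify` into a single localStorage entry.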

3.8 Graceful text fallback (an equal citizen, not a downgrade) — including the unvoiced third

Because the entire pre-generated corpus renders into the existing feedback-expl DOM region with no model call and no network, text is first-class: silent study, sound-off, no-mic/permission-denied, the web build, and offline all degrade to a clean text experience. Every Coach claim — spoken or text — carries an inline "source: Q pc001" chip; when Coach has no vetted answer it shows "I don't have a vetted answer for that yet" rather than inventing.

The 33%-unvoiced case is a day-one UX state, not an edge case [repo-verified: 108 AZ questions have no clip]. Defined behavior when Coach would speak but the question has no base clip and no Coach clip yet:

This makes the experience whole for 100% of the bank from the first ship, with audio as progressive enhancement.

3.9 Live-path latency & failure budget (Plane B felt-responsiveness)

Cost caps (§4.6) govern spend; this governs feel for an anxious commuter:


4. The Intelligence Architecture

4.1 Two planes

4.2 Retrieval — ranked selection, NOT a vector DB

The relevant unit is a single known question — the one on screen, or the top keyword/category hit. Retrieve that question + its same-category siblings, never a whole bank. A vector DB at launch would violate zero-cloud-to-launch, add latency, and add infra to operate. Written upgrade trigger: add embeddings only when (a) a future vertical's sibling set overflows a sane prompt budget, or (b) the eval harness's recall@k drops below bar (§5.2). The eval gate forces the upgrade; we don't pre-build it.
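Ranked selection without a vector DB can be as small as the sketch below. It assumes a hypothetical bank shape of `{ id, category, explanation }` and a `maxSiblings` budget; neither is taken from the repo.

```javascript
// Ranked selection sketch: the on-screen question first, then same-category siblings.
// The bank shape ({ id, category, explanation }) is an assumption for illustration.
function retrieveGrounding(bank, questionId, maxSiblings = 5) {
  const target = bank.find((q) => q.id === questionId);
  if (!target) return []; // unknown id: no grounding, so Coach must refuse
  const siblings = bank
    .filter((q) => q.category === target.category && q.id !== target.id)
    .slice(0, maxSiblings);
  return [target, ...siblings];
}
```

The `maxSiblings` cap is where the first upgrade trigger would show up: if a vertical needs more siblings than the prompt budget allows, that is the signal to revisit embeddings.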

4.3 Citation & grounding — the hallucination firewall

Each retrieved explanation is one Citations API custom-content block, so the quoted source text does not count toward output tokens and the feature composes with prompt caching and Batch. Claude returns block-level citations that render as the source chip. A system-prompt refusal contract ("answer ONLY from the provided explanations; if absent, say so and offer to drill a related concept; never invent statutes, numbers, deadlines, or state-specific law") plus a runtime guard (zero citations on a factual claim → suppress the spoken reply, show "no vetted answer yet") make the moat mechanical. [API-assumption: Citations API block-level behavior, cited-text token accounting, and ZDR-eligibility per 2026-01 docs; confirm before build.]
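The runtime guard might look like the following sketch. It assumes a simplified reply shape (an array of text blocks, each with an optional `citations` array whose entries carry a `document_title`), which should be confirmed against the current API docs. It also handles only the all-or-nothing case: deciding which individual sentences are factual claims is out of scope here.

```javascript
// Runtime guard sketch. The reply shape is a simplified assumption, not a verified schema.
const NO_VETTED_ANSWER = "I don't have a vetted answer for that yet";

function guardReply(contentBlocks) {
  const textBlocks = contentBlocks.filter((b) => b.type === "text");
  const cited = textBlocks.filter((b) => Array.isArray(b.citations) && b.citations.length > 0);
  if (cited.length === 0) {
    // Zero citations: suppress the spoken reply and show the refusal copy instead.
    return { speak: false, text: NO_VETTED_ANSWER, sources: [] };
  }
  // De-duplicate source ids so the chip lists each question once.
  const sources = [...new Set(cited.flatMap((b) => b.citations.map((c) => c.document_title)))];
  return { speak: true, text: textBlocks.map((b) => b.text).join(""), sources };
}
```

The `sources` array is what would feed the inline source chip.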

4.4 Generation is a TWO-PHASE pipeline (with a documented collapse-path)

[API-assumption — load-bearing, confirm first: Citations is incompatible with Structured Outputs (400 error), and toggling citations invalidates the tools cache.] If that holds, a single grounded-and-structured call is impossible, so:

Fallback if the incompatibility is lifted: if a future API revision lets Citations and Structured Outputs co-exist, the two phases collapse into one grounded-and-structured call — the containment check becomes an output field rather than a second pass, halving pre-gen cost. The architecture is therefore not brittle to this fact flipping; the judge survives as a CI gate regardless (§5.2), since we still want an independent containment assertion even if generation is structured.

4.5 Model routing & caching

All grounded explain/quiz/why + all Plane A pre-gen
  Model: Haiku 4.5 [API-assumption: $1/$5 per M, cache-read 0.1×]
  Why: Correctness is from grounding, so the cheap model is the default
Live Talk + explain-back grading
  Model: Sonnet 4.6 [$3/$15]
  Why: Warmth and open-ended judgment
End-of-session deep diagnostic
  Model: Opus 4.8 [$5/$25]
  Why: Rare premium ceiling, reasons across the whole miss-pattern

Cache at the CATEGORY/whole-bank prefix level, not per-question. [API-assumption: Haiku 4.5 cacheable-prefix minimum = 4,096 tokens.] [repo-verified] AZ explanations average ~37 words (~50 tokens); full per-question context (stem+choices+explanation) averages ~79 words (~105 tokens) — both far below a 4,096-token floor, so a single question's grounding can never clear it. For offline Batch the discount is moot (one pass). For the live path, cache the stable system prompt + the state bank as one shared prefix ([repo-verified] AZ's full bank ≈ 40,993 tokens clears the floor comfortably) so every live call reuses one big cached context — that's where the 0.1× actually pays.
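The cache-floor reasoning reduces to a one-line comparison. The sketch below uses the plan's own figures; the 4,096-token minimum is the [API-assumption] stated above, and both function names are illustrative.

```javascript
// Cache-floor check using the plan's figures; the minimum is an [API-assumption].
const CACHE_MIN_TOKENS = 4096;

function clearsCacheFloor(prefixTokens) {
  return prefixTokens >= CACHE_MIN_TOKENS;
}

function questionsNeededForFloor(tokensPerQuestion) {
  // How many average-sized questions a prefix needs before it becomes cacheable.
  return Math.ceil(CACHE_MIN_TOKENS / tokensPerQuestion);
}
```

With ~105 tokens per question, `questionsNeededForFloor(105)` is 40, so a single question plus a handful of siblings stays under the floor while the ~40,993-token whole-bank prefix clears it.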

4.6 Worker proxy, key custody & metering

A sibling of worker/src/index.js, reusing its bearer-auth + KV rate-limit + CORS + fail-closed pattern [repo-verified ~lines 84–110], holding ANTHROPIC_API_KEY as a wrangler secret — never in the 287KB client bundle. Zero Data Retention enabled on the live route [API-assumption: Citations is ZDR-eligible; Batch is not — fine, Batch only ever processes the fixed bank, never user data]. It forwards only question_id + the text turn. Fits the Worker free tier (~100k req/day) because the heavy path is offline.

Metering is a HARD pre-req of Plane B, not a footnote. [repo-verified] Pro is a localStorage boolean (isPro(), index.html:2148), the Worker's only identity is a shared token + a client-asserted device-id — both trivially rotated, and a heavy talker (~40 Sonnet turns/day ≈ ~$9.75/mo at assumed pricing) sinks the $59.99/yr ≈ $5/mo plan. Before Plane B ships: (1) server-verify the store receipt (Google Play / App Store) and mint a per-install signed token — the client flag may gate UI but never gates spend; (2) hard per-user budgets (daily turn cap + monthly token ceiling) that degrade gracefully to Plane A when spent ("You've used today's deep chat — here's the grounded explanation"); (3) a global kill-switch + spend ceiling that fails closed; (4) treat device-id as untrusted and cap globally so mass rotation can't exceed the budget.
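Points (2) through (4) can be pictured as an ordered series of checks that deny by default. This is a sketch only: the limit names, counters, and reason strings are hypothetical, and a real Worker would read them from KV behind the verified per-install token.

```javascript
// Budget check sketch for the live route. All limits and field names are illustrative.
function authorizeLiveTurn(usage, limits, globalState) {
  if (globalState.killSwitch) return { allow: false, reason: "kill_switch" };
  if (globalState.spentUsd >= limits.globalCeilingUsd) return { allow: false, reason: "global_ceiling" };
  if (usage.turnsToday >= limits.dailyTurnCap) return { allow: false, reason: "daily_cap" };
  if (usage.tokensThisMonth >= limits.monthlyTokenCeiling) return { allow: false, reason: "monthly_cap" };
  return { allow: true, reason: "ok" };
}

function authorizeOrFailClosed(usage, limits, globalState) {
  try {
    return authorizeLiveTurn(usage, limits, globalState);
  } catch (err) {
    // Missing or corrupt counters must never open the spend path.
    return { allow: false, reason: "fail_closed" };
  }
}
```

Any denial would route the learner to the Plane A grounded explanation, so a spent budget degrades the experience without breaking it.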

4.7 Offline & the audio truth (corrected diagnosis + honest coverage)

The flagship "study out loud on your commute, offline" scenario is not real today and we will not claim it before it is. Two distinct facts, both [repo-verified]:

(a) Why offline audio fails — corrected root cause. AZ audio streams from a cross-origin Pages CDN (AUDIO_CDN = https://passlane-5jv.pages.dev, index.html:4270; clipSrc() = ${AUDIO_CDN}/audio/${name}.mp3, index.html:4293). The service worker does cache .mp3 cache-first (sw.js:56, the isCacheFirst branch) — but it never sees these requests, because the fetch handler bails on cross-origin at the top: if (url.origin !== location.origin) return; (sw.js:34). (The earlier draft's "sw.js refuses to cache mp3" was wrong; the cause is the cross-origin early-return, which changes the fix.)
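The effect of that early return can be reproduced outside the service worker. The function below is an illustration of the described logic, not the actual sw.js source, and the app origin used in the usage note is a placeholder.

```javascript
// Illustrates the early-return described above; not the actual sw.js source.
function serviceWorkerWouldCache(requestUrl, swOrigin) {
  const url = new URL(requestUrl);
  if (url.origin !== swOrigin) return false; // cross-origin bail: the request is never handled
  return url.pathname.endsWith(".mp3"); // stand-in for the cache-first branch
}
```

Called with `https://passlane-5jv.pages.dev/audio/pc001-q.mp3` and a different app origin such as `https://app.example`, it returns false; the same path served from the app's own origin returns true, which is why a same-origin fix reuses the existing cache-first behavior.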

The fix, and its constraint — an Owner fork (§9):

We pick one explicitly before Phase 1; the recommendation is A because it reuses shipped SW behavior and resolves integrity in the same move.

(b) Coverage: the honest numbers. [repo-verified] On disk: 430 AZ clips = 215 -q + 215 -a; states-manifest.json enumerates 215 pc/es ids (manifest voice: af_bella). Therefore ~67% of AZ questions (215 of 323) can be read aloud; ~33% (108) have no clip at all, and 0% of Coach copy is voiced. The af_bella TTS pipeline is absent from the repo. Consequences, made explicit:

Install-size impact [measured from repo]: adding rationale + 3 distractor explanations roughly doubles per-bank text (AZ ~256KB → ~0.5MB; six states → low-single-digit MB) — an accepted install add. Audio is the real storage cost and is bounded explicitly in §6.4. We do not inherit a parked offline-audio problem silently; we name it, price it, and gate on it.

4.8 Data-flow (ASCII)

OFFLINE — one-time, Plane A (the free path)
   vetted question bank
    → SME audit (state-law first) + content-hash
    → GENERATE: Haiku 4.5 + Batch, citations on → grounded prose
    → JUDGE: a separate check drops any sentence that adds a fact not in the source
    → human SME review → ship into the app
       (text for 100% of the bank; af_bella audio where voiced)

ON DEVICE (app/index.html)
   mic → one listen chokepoint → classify the utterance:
      • a letter (A–D)        → the normal answer flow (unchanged)
      • "ask" + ask-mode on   → Coach answers, grounded and cited
   exam in progress → Coach is fully disabled (hard wall)
   Train / Test → local pre-generated text (offline) → spoken if a clip exists

EDGE — Pro, online only (Plane B)
   verify store receipt → per-install token → per-user + global caps + kill-switch
    → Claude (Haiku / Sonnet / Opus) → any uncited claim is suppressed
    → if it errors or is slow, fail closed to the local grounded explanation
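The ON DEVICE classification step in the diagram above could be sketched as a pure function. The route names, the literal "ask" prefix, and the function name are assumptions; the real trigger for ask-mode is not specified here. The isExam check is defense in depth, since the plan states the mic is already hidden during exams.

```javascript
// Sketch of the on-device routing step. Names and the "ask" trigger are assumptions.
function routeUtterance(transcript, { isExam, askMode }) {
  const raw = transcript.trim();
  const lower = raw.toLowerCase();
  const letter = lower.match(/^([a-d])\.?$/);
  if (letter) return { route: "answer", choice: letter[1].toUpperCase() }; // unchanged answer flow
  if (isExam) return { route: "ignored" }; // hard wall: Coach is disabled in exams
  if (askMode && lower.startsWith("ask")) {
    return { route: "coach", question: raw.slice(3).trim() };
  }
  return { route: "ignored" };
}
```

Keeping the letter check first means the short-utterance answer path behaves identically whether or not ask-mode exists.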

5. Reliability & Trust

5.1 Hallucination defense (the stack, not a setting)

Grounding (Citations over vetted explanations, no open web at runtime) → explicit "I don't know" permission in the prompt → the Phase-2 containment judge that fails the build on any introduced statute/number/citation → a runtime guard that suppresses uncited claims → strictest grounding + mandatory human review for state-law items. The corpus being finite is converted from a limitation into a trust feature.

5.2 The evaluation / accuracy harness (how we prove correctness)

A Node vm/assert/exit-1 CI gate, modeled on voice-sandbox/harness.js, over a golden question→grounded-answer set, asserting: (1) every answer cites a correct source block; (2) the cited explanation actually contains the claim [the Phase-2 judge, run as a gate]; (3) out-of-scope → refusal, and real-world-advice → redirect; (4) no answer leaks under a simulated exam state; (5) the recall@k bar that triggers the embedding upgrade (target recall@k ≥ 0.95 on the golden set; a drop below forces §4.2's vector-DB upgrade). Correctness is a mechanical gate the codebase already lives by.
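A compressed sketch of such a gate follows, covering only assertions (2) and (5). The golden-set fields (`query`, `expectedSourceId`, `claim`, `citedExplanation`) are hypothetical, and the substring test is a crude stand-in for the model-based containment judge.

```javascript
// Minimal gate sketch in the assert-or-exit-1 style. Golden-set shape is hypothetical.
function recallAtK(goldenSet, retrieve, k) {
  const hits = goldenSet.filter((item) =>
    retrieve(item.query).slice(0, k).includes(item.expectedSourceId)
  ).length;
  return goldenSet.length === 0 ? 0 : hits / goldenSet.length;
}

function runGate(goldenSet, retrieve, { k = 3, bar = 0.95 } = {}) {
  const failures = [];
  const recall = recallAtK(goldenSet, retrieve, k);
  if (recall < bar) failures.push(`recall@${k} ${recall.toFixed(3)} is below ${bar}`);
  for (const item of goldenSet) {
    // Containment stand-in: the claim text must appear in the cited explanation.
    if (!item.citedExplanation.includes(item.claim)) {
      failures.push(`${item.query}: claim not in source`);
    }
  }
  return { recall, failures, exitCode: failures.length === 0 ? 0 : 1 };
}
```

In CI the script would end by passing `exitCode` to `process.exit`, so any failure blocks the build.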

5.3 Phase 0 — audit the bank itself (the unexamined single point of failure)

Grounding amplifies the source: a wrong vetted explanation becomes a confident, cited, spoken wrong lesson. Coach's correctness ceiling is the bank's. Before Coach ships: an SME review pass over state-law items first (the AZ-specific law categories — highest legal exposure), then the rest by difficulty; stamp every question with last_reviewed (extend the existing D1 version/status columns); and ship a "this looks wrong" report affordance from day one on every Coach response. For the AZ launch bank (323 Qs) this is a full human pass — cheap at that size and it removes all ambiguity for the content that defines first impressions. Pass rubric (the gate is numeric): review is "passed" when an SME confirms 0 factual errors in state-law items and ≤2% factual-error rate across all 323, every flagged item corrected and re-reviewed. This is also the concrete Anthropic-AUP "qualified professional reviews before dissemination" mechanism, enforced by the rule disseminated == passed_review at the content-hash gate.
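The numeric rubric can be written down directly. The item shape (`isStateLaw`, `hasFactualError`) is an assumption, and the sketch does not model the "corrected and re-reviewed" step, which would be tracked separately.

```javascript
// Numeric pass rubric sketch. Item shape ({ isStateLaw, hasFactualError }) is assumed.
function reviewPassed(items, maxErrorRate = 0.02) {
  const stateLawErrors = items.filter((q) => q.isStateLaw && q.hasFactualError).length;
  const totalErrors = items.filter((q) => q.hasFactualError).length;
  const errorRate = items.length === 0 ? 0 : totalErrors / items.length;
  return {
    // An empty review set never passes: there is nothing to disseminate.
    passed: items.length > 0 && stateLawErrors === 0 && errorRate <= maxErrorRate,
    stateLawErrors,
    errorRate,
  };
}
```

For the 323-question AZ bank the 2% bar works out to at most 6 items with factual errors (6/323 ≈ 1.9%), none of them in state-law categories.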

5.4 Exam integrity — two surfaces, both walled

5.5 Explain-back grading — measurable gate before it can ever penalize

Ungraded self-check is the default (zero false-negative risk, full retrieval benefit). Graded mode — which may touch Leitner state — unlocks only after the Sonnet grader hits a measured bar (≥95% agreement, ≤2% false-"wrong") on a fixture of 100+ human-labeled paraphrases, run as an opt-in on-device-only eval that reports an aggregate accuracy number with no transcript stored (so the no-tracking promise holds). Even then: grade generously, accept the concept in any phrasing, never silently demote (always "I'll bring this back"), always offer "I actually meant X." This numeric instrument is the gold standard the §7 success criteria are modeled on.
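The unlock bar can be computed from the labeled fixture alone. In this sketch the row shape (`human`, `grader`, each "right" or "wrong") is hypothetical, and the false-"wrong" rate is taken over all rows, which is one reasonable reading of the ≤2% figure rather than a confirmed definition.

```javascript
// Grader gate sketch. Each fixture row pairs a human label with the grader's verdict.
function graderClearsBar(rows, { minAgreement = 0.95, maxFalseWrong = 0.02, minRows = 100 } = {}) {
  if (rows.length < minRows) return { cleared: false, reason: "fixture_too_small" };
  const agree = rows.filter((r) => r.human === r.grader).length;
  // False "wrong": the human accepted the paraphrase but the grader rejected it.
  const falseWrong = rows.filter((r) => r.human === "right" && r.grader === "wrong").length;
  const agreement = agree / rows.length;
  const falseWrongRate = falseWrong / rows.length;
  const cleared = agreement >= minAgreement && falseWrongRate <= maxFalseWrong;
  return { cleared, agreement, falseWrongRate };
}
```

Only the aggregate numbers leave the function, which matches the requirement that no transcript is stored.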

5.6 Privacy & compliance (the launch gates)

[repo-verified] The zero-transmission claim is asserted in at least three places, including in-app at index.html:1226 ("no accounts and no tracking … does not collect, transmit, sell, or share") plus privacy.html plus the paywall legal link. A live transcript-forwarding proxy makes all of them false at once — simultaneously an App Review 5.1.2(i) rejection risk, a Google Play Data-safety mismatch, and FTC exposure. Therefore:

5.7 Rollback / kill-path for a bad shipped corpus (the highest-consequence gap)

A pre-generated Coach explanation ships inside the binary. If a wrong, cited, spoken lesson reaches production, content-hashing and the next review cycle are too slow for a licensing exam. So:

5.8 Validating that Coach actually works (outcome measurement under no-tracking)

The pedagogical effect sizes (elaborated feedback d≈.49; testing g≈.5–.6; interleaving) justify the design; they don't prove transfer to PassLane users. Under the no-tracking constraint we still measure outcomes, the §5.5 way:

5.9 Honesty as moat

Every defense above is also a marketing asset no incumbent can match: every answer traced to a vetted explanation; refuses rather than invents; refuses to leak an answer during an exam; a wrong lesson can be killed remotely within hours; your questions are never used to train AI and are deleted within ~7 days. This is the credible opposite of ExamFX's refund/guarantee grievances and the antidote to the 33–79% hallucination rates that plague general AI tutors — the brand's "honesty is the moat," made operational.


6. Cost Model at Scale

The architecture's whole point: marginal AI cost approaches zero because the expensive work is computed once, offline, and shipped. This is precisely the economics ("per-user inference ate the margins") that killed Quizlet's Q-Chat — designed out. All dollar figures use [API-assumption] pricing (2026-01); the structural conclusion (near-zero, one-time, single-digit dollars) is robust to reasonable price drift and is what matters.

6.1 One-time pre-generation (Plane A) — arithmetic shown

[repo-verified counts: AZ = 323; all six banks = 3,392 (CA 583 + FL 641 + NY 667 + NC 588 + TX 590 + AZ 323).] Per question ≈ 600 tokens in (stem+choices+explanation+prompt) / ~340 tokens out (elaborated rationale + 3 distractor repairs). Haiku 4.5 Batch [API-assumption: $0.50 in / $2.50 out per 1M].

AZ only (323)
  Input cost: 0.19M tokens → ~$0.10
  Output cost: 0.11M tokens → ~$0.28
  One-time total: ≈ $0.40–$2
All six states (3,392)
  Input cost: 2.04M tokens → ~$1.02
  Output cost: 1.15M tokens → ~$2.88
  One-time total: ≈ $4–6

The two-phase judge pass adds a second Haiku call of similar magnitude; the realistic envelope is single-digit dollars for the whole six-state corpus, one-time. We never anchor on a number a reviewer can falsify in a spreadsheet — the conclusion is what matters and it is robust. (If §4.4's incompatibility is lifted, the judge folds into generation and this roughly halves.) Regeneration on a bank edit is cheap because answers are content-hashed.
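The single-pass arithmetic behind those figures can be reproduced in a few lines. Prices and per-question token counts are the plan's own [API-assumption] and estimate values; the function name is illustrative.

```javascript
// Reproduces the pre-generation arithmetic. Prices are the plan's [API-assumption] values.
const BATCH_PRICE_PER_M = { input: 0.5, output: 2.5 }; // USD per 1M tokens, assumed
const TOKENS_PER_QUESTION = { input: 600, output: 340 }; // the plan's per-question estimate

function pregenCost(questionCount) {
  const inputTokens = questionCount * TOKENS_PER_QUESTION.input;
  const outputTokens = questionCount * TOKENS_PER_QUESTION.output;
  const inputUsd = (inputTokens / 1e6) * BATCH_PRICE_PER_M.input;
  const outputUsd = (outputTokens / 1e6) * BATCH_PRICE_PER_M.output;
  return { inputTokens, outputTokens, inputUsd, outputUsd, totalUsd: inputUsd + outputUsd };
}
```

For one generation pass, `pregenCost(323).totalUsd` comes to about $0.37 and `pregenCost(3392).totalUsd` to about $3.90; the judge pass and any retries account for the rest of the quoted envelope.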

6.2 Runtime, at scale

10k
  Plane A (offline teaching): $0 (ships in app data)
  Plane B (live Talk, Pro only): A fraction subscribe; capped per-user; Sonnet whole-bank-prefix cache → a few cents/session
  Net: Near-zero; comfortably inside Worker free tier
100k
  Plane A (offline teaching): $0
  Plane B (live Talk, Pro only): Per-user daily/monthly budgets + global ceiling hold the line; degrade to Plane A when spent
  Net: Bounded by design, not by hope
1M
  Plane A (offline teaching): $0
  Plane B (live Talk, Pro only): Worker free tier is ~100k req/day; the live tail is a small Pro fraction; if it ever approaches the cap, scale the Worker (still cents per active session) and the budgets cap worst-case
  Net: Stays near-zero because the heavy path never hits the network

Why it stays near-zero: (1) the teaching corpus is pre-generated and local, so the most-used feature costs nothing at runtime; (2) live calls exist only for a learner's own words, a small Pro-gated fraction; (3) the live path defaults to Haiku for grounded explain/quiz/why, reserves Sonnet for Talk (§4.5), and reads the whole-bank prompt cache at 0.1×; (4) per-user and global spend caps make the worst case bounded, not unbounded; (5) Opus is a rare, server-enforced ceiling. The premium price is therefore near-pure margin that funds premium design. (Anchor the price against exam-prep incumbents, such as a ~$130 ExamFX seat for 60 days, not against $4 consumer tutors; $59.99/yr never-expiring is the affordable, premium, honest option.)

6.3 af_bella voice generation — a real, priced work-stream (was missing)

Because [repo-verified] 108 AZ questions are unvoiced + 100% of Coach copy is unvoiced and the pipeline is absent from the repo, voicing is a line item, not an afterthought:

6.4 Storage / install-size budget (was unbounded)


7. The Build Roadmap

Every success criterion below carries a number and an instrument, modeled on §5.5, and references the runbooks that already exist in docs/ [repo-verified: STUDY-SESSION-TEST-MATRIX.md, IOS-VOICE-TEST-PLAN.md, LAUNCH-RUNBOOK.md] rather than vague phrases.

CRAWL — ships first, post-launch (offline, FREE, zero new infra, zero new privacy surface)

Scope: Phase 0 bank audit (state-law SME pass + last_reviewed + "looks wrong" → az_reports local queue) → Memory-opener + TRAIN (text, offline, 100% of bank): pre-generated elaborated feedback + misconception repair at the feedback seam, plus scripted reframing/calibration lines. Then, after the af_bella pipeline (§6.3) exists and Option-A same-origin audio (§4.7) lands, spoken Coach + SW-cached offline audio for the voiced subset, with the unvoiced third in the §3.8 text-only state. Gate a 12-interaction Coach taste mirroring [repo-verified VOICE_FREE_LIMIT=12]; the elaborated feedback itself is free to every learner.

Success criteria (numeric, instrumented):

WALK — TEST tier + the conversational wedge

Scope: discrimination drills on existing Leitner/weak machinery; mock-exam before/after coaching (mute during) + calibration confrontation; ungraded explain-back; Ask-mode proven in the harness first (S12+), then TEXT-ASK as the conversational surface.

Success criteria (numeric, instrumented):

RUN — Plane B live tier (Pro, online, the metered tail)

Scope: TALK (Sonnet, text-reply) + graded explain-back (past the §5.5 gate) + the Opus end-of-session diagnostic, behind the key-holding sibling Worker, with the §5.7 remote denylist live. Spoken ASK only if the on-device spike passes.

Success criteria — all blocking:

Tiering law: free gets real teaching (elaborated feedback is local data, so it's free) — Pro is the conversation and live coaching, never "pay to get explanations at all." Coach is the headline Pro unlock — "your instructor, on call" — reusing the existing pw- paywall.


8. Risks & Mitigations

| # | Risk | Severity | Mitigation |
|---|------|----------|------------|
| 1 | Bank correctness is the true SPOF: grounding amplifies a wrong/stale explanation into a confident, cited, spoken wrong lesson | High | Phase 0 SME audit (state-law first) + numeric rubric (§5.3) + last_reviewed + day-one az_reports affordance; correctness bounded by the bank's; disseminated==passed_review |
| 2 | Bad shipped corpus is live until next release: a cited wrong lesson is in the binary | High | Remote per-question Coach denylist (§5.7) suppresses a bad id to base-bank-only with no app update; privacy-clean (download, not telemetry) |
| 3 | Exam-key leak via public CDN: [repo-verified] /audio/pc001-a.mp3 is HTTP-200 enumerable with sequential ids; naively keying new AI clips ${id}-a widens it | High | Hashed non-sequential keys for all new audio; owner picks §4.7 Option A (same-origin/R2, which fixes integrity and offline in one move) vs accept-rationale-public; never a new guessable -a clip publicly |
| 4 | Runaway live-AI bill: [repo-verified] isPro() is a spoofable localStorage flag; device-id rotatable; Opus unbounded | High | Server-verified receipt → per-install token; per-user and global budgets + kill-switch; fail-closed to Plane A; client flag gates UI only |
| 5 | False privacy policy: a live proxy makes the zero-transmission claim ([repo-verified] 3 places incl. in-app:1226) false → App Review + Play + FTC | High | Plane A stays offline (policy true); Plane B ships only with every surface rewritten + consent gate + on-device STT + Data-safety update, coupled to the release |
| 6 | Hallucinated / over-synthesized law: fabricated statute/number a student repeats on the exam | High | Citations grounding + Phase-2 containment judge (fails build on introduced facts) + runtime uncited-claim suppression + strict state-law human review |
| 7 | Voice-funnel regression: an ASK branch or multi-turn mic breaks the partialResults-empty contract / staleness guards tuned for short answers | High | All mic via startListening/stopListening; ASK is a separate posture with its own gen-id/budget; harness S12+ exit 0 before any UI |
| 8 | Offline audio fixed at the wrong layer: [repo-verified] clips bypass the SW via the cross-origin early-return (sw.js:34), not a cache refusal | High | Correct diagnosis in §4.7; Option A same-origin reuses the existing cache-first SW (no new code); Option B requires CORS + airplane-mode spike before relying on it |
| 9 | Spoken Coach over-promised: [repo-verified] only 215/323 AZ voiced, 0% Coach copy voiced, pipeline absent | High | Text Coach ships for 100% of bank; voicing is a priced work-stream (§6.3) with the unvoiced third in a defined text-only UX (§3.8); spoken criteria gated on the pipeline |
| 10 | Spoken multi-turn ASK may not work on-device: collides with the 12s cap + teardown race; untested on the local fork | Med | Spoken ASK spike-gated (iOS and Android, per IOS-VOICE-TEST-PLAN.md); TEXT-ASK ships regardless so the tier doesn't depend on the spike |
| 11 | Explain-back false "you're wrong" silently demoting earned mastery | Med | Ungraded self-check is default; graded gated behind ≥95%/≤2% on-device eval; never silent-demote; "I actually meant X" recheck |
| 12 | Store AI-feature rejection: missing reporting / consent / age-rating; report sink wired to a parked route | Med | Per-reply report control: local az_reports queue in offline phases, verified-live Worker sink in RUN; consent gate; honest age-rating; reporting treated as a launch blocker |
| 13 | Live-path feels slow for an anxious commuter | Med | §3.9 budget: stream tokens, source chip pinned immediately, ≤1.5s/≤3s first-token target, 8s→fail-closed to grounded local text, failed turn costs no budget |
| 14 | Over-instrumentation destroys pace/feel: a regression of shipped restraint | Med | Strict frequency caps (explain-back ~1-in-5, warmth one-shot at existing thresholds); everything pull-driven/skippable; never in Exam mode; one-jewel-per-section binds Coach |
| 15 | Stale pre-gen artifacts when the bank is edited | Med | Key each artifact to a question content-hash; regenerate on change (cheap via Batch); denylist covers the gap until a fixed release |
| 16 | Offline-audio storage unbounded: a fully-cached voice set is tens of MB/state | Med | §6.4: opt-in per state, ≤150MB LRU ceiling, text-only install, "manage offline voice" control |
| 17 | Outcome unproven: effect-size literature may not transfer; no-tracking blocks measurement | Med | §5.8 opt-in on-device aggregates-only readiness self-report; optional separately-consented coarse beacon (owner fork), never on by default |
| 18 | Load-bearing API facts stale: Citations×Structured-Outputs incompatibility, pricing, 4,096 cache floor are [API-assumption] | Low | Each tagged; §4.4 has a collapse-path if incompatibility lifts; pricing drift doesn't change the structural near-zero conclusion; confirm against current docs before Phase 1 |
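
Risk 15's mitigation (key each artifact to a question content-hash, regenerate on change) can be illustrated with a minimal Python sketch. It is illustrative only and not part of the repo. The field names (stem, choices, answer, explanation), the source_hash attribute, and both function names are assumptions made for the example; the real bank schema may differ.

```python
import hashlib
import json

# Hypothetical: the question fields assumed to influence a generated lesson.
HASHED_FIELDS = ("stem", "choices", "answer", "explanation")


def question_hash(question: dict) -> str:
    """Return a stable SHA-256 hex digest of the lesson-relevant fields."""
    canonical = json.dumps(
        {name: question[name] for name in HASHED_FIELDS},
        sort_keys=True,            # key order must not change the hash
        separators=(",", ":"),     # no whitespace drift between runs
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def stale_artifact_ids(bank: list[dict], artifacts: dict[str, dict]) -> list[str]:
    """List question ids whose artifact is missing or built from older content."""
    stale = []
    for question in bank:
        artifact = artifacts.get(question["id"])
        if artifact is None or artifact["source_hash"] != question_hash(question):
            stale.append(question["id"])
    return stale
```

Under these assumptions, the ids returned by stale_artifact_ids would be the candidates for Batch regeneration, and could also seed the remote denylist until a fixed release ships.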

9. Open Decisions for the Owner

To keep the owner's surface honest, this section is split in two: calls the Principal owns (stated for transparency, not for re-litigation, per the "own the decision after research" discipline) and genuine forks with real cost/liability tradeoffs and no obvious default.

9A. Decisions made — FYI, won't relitigate

9B. Genuine forks needing the owner

  1. Audio architecture + integrity (one decision, two payoffs). [repo-verified] The spoken answer key is publicly enumerable at /audio/<id>-a.mp3 today, and offline audio is broken by the cross-origin SW bypass. §4.7 Option A (serve audio same-origin, or proxy it behind the auth + rate-limited Worker/R2 that the D1 schema already anticipates) fixes both integrity and offline caching in one move and reuses the existing cache-first SW. Option B (keep the cross-origin CDN and add an IndexedDB blob cache, with mandatory CORS and an airplane-mode spike) is more code and more failure surface. New AI audio uses hashed keys regardless. The Principal recommends A; the owner confirms the infra activation cost.
  2. The qualified reviewer of record. Who is the licensed-insurance SME who signs off the pre-generated corpus and backstops factual-correctness liability and the Anthropic high-risk-review posture: Gino, or a contracted licensed agent? This sets the review-cadence pattern for every future vertical. There is no default the Principal can pick; it is a liability/credential question.
  3. Launch path for the conversational tier. Ship the high-confidence offline companion + TEXT-ASK first and defer spoken multi-turn until the on-device spike passes (Principal recommends this), or treat spoken "talk to it" as core to the day-one promise and fund the spike up front? This is a genuine fork because it trades launch speed against a marquee promise, with real spike cost.
  4. Population-level outcome signal (optional). Hold to purely on-device, user-only readiness numbers (§5.8, default), or authorize a single separately-consented, aggregates-only readiness-delta beacon so the team can see whether Coach moves pass rates at population scale? This is a genuine fork because even a coarse beacon touches the "no tracking" brand line and needs the owner's explicit blessing.
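
The first fork states that new AI audio uses hashed keys regardless of which option the owner picks. One way that could look, as a hedged Python sketch: derive each clip's filename from an HMAC over the question id, so sequential ids no longer produce guessable URLs. The function name audio_key, the clip_kind label, the build-time secret, and the 32-character truncation are all assumptions for illustration; none of them comes from the repo.

```python
import hashlib
import hmac


def audio_key(question_id: str, clip_kind: str, secret: bytes, length: int = 32) -> str:
    """Derive a non-sequential clip filename from a question id.

    The secret is assumed to live only in the offline build pipeline;
    the app would ship a precomputed id -> key map, never the secret.
    """
    message = f"{question_id}:{clip_kind}".encode("utf-8")
    digest = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return f"{digest[:length]}.mp3"


if __name__ == "__main__":
    secret = b"example-build-secret"  # placeholder, not a real key
    for qid in ("pc001", "pc002"):
        # Adjacent ids map to unrelated filenames.
        print(qid, audio_key(qid, "coach", secret))
```

This only removes guessability by enumeration. Anyone holding the shipped id-to-key map can still fetch the clips, so it complements the Option A access controls and does not replace them.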

Supporting workspace: /Users/arizona/CLAUDE CODE/passlane/docs/companion/ [repo-verified: exists, empty]. The plan above is the document; no code was written and recon was read-only, per scope.

Private working document: unlisted, not indexed. PassLane / Somos LLC.