The standard architecture for meeting AI looks like this: audio goes up to a cloud service, a transcription model processes it, the transcript gets fed to a language model, and a response comes back down. The whole round trip takes somewhere between two and eight seconds depending on the vendor, the model, and your network conditions.
For post-call summarization, that latency is irrelevant. The meeting is already over. But for real-time assistance — the case where you need a suggestion in the next five seconds, while someone is still talking — the cloud pipeline is fundamentally the wrong architecture.
This post is about why, and what local audio processing actually involves at a technical level.
How cloud audio pipelines work
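To understand the alternative, it helps to be precise about what "cloud audio pipeline" actually means. In a typical meeting recording bot, audio is uploaded to a cloud service, a transcription model processes it, the transcript is passed to a language model, and the response is sent back down to the user.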
Every step introduces latency. Network round-trip to the transcription service: 200–600ms. Model inference time: 500ms–2s. LLM processing of the transcript: 1–3s. By the time you see a suggestion, the conversational window it was relevant to has usually closed.
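As a rough check on those numbers, the stage ranges can simply be added up. The sketch below assumes the three stages run strictly one after another with no overlap or streaming, which is a simplification; the dictionary keys and the function name are illustrative only.

```python
# Stage latency ranges in milliseconds, taken from the figures quoted above.
STAGES_MS = {
    "network round trip to transcription service": (200, 600),
    "transcription model inference": (500, 2000),
    "LLM processing of the transcript": (1000, 3000),
}


def pipeline_budget_ms(stages):
    """Return (best_case, worst_case) total latency, assuming strictly sequential stages."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


best_ms, worst_ms = pipeline_budget_ms(STAGES_MS)
print(f"best case: {best_ms / 1000:.1f}s, worst case: {worst_ms / 1000:.1f}s")
```

Under that assumption the total comes to about 1.7 to 5.6 seconds, broadly consistent with the two-to-eight-second round trip quoted at the top of this post.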
What data actually leaves your device
The privacy question isn't just about whether a company claims to delete your recordings. It's about the attack surface: how many systems your audio passes through, how many third-party processors handle it, and how many storage layers it touches before it's purged.
With a cloud recording bot, your audio typically passes through:
- The meeting platform's media servers (Zoom, Google, Microsoft)
- The bot vendor's ingestion infrastructure
- A third-party transcription provider (if the vendor outsources ASR)
- A large language model API (OpenAI, Anthropic, or the vendor's own)
- The vendor's storage layer (for transcript archival)
- Any CRM or integration the vendor is syncing to
Each of these is governed by a separate privacy policy, a separate data retention schedule, and a separate security posture. The vendor's own privacy policy may be excellent. The sub-processor they use for transcription may have a 90-day retention policy buried in their terms.
How local audio processing works instead
The critical difference: audio never leaves the device. The transcription step happens locally. The only thing that leaves the device is the text of a prompt — and only when the user triggers a suggestion. No continuous audio stream is sent anywhere.
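One way to picture that boundary is a component that accepts raw audio but can only ever emit text. The sketch below is hypothetical: the class name, method names, and payload fields are assumptions, and the "transcription" is a placeholder that decodes bytes as text so the example runs without a speech model.

```python
from dataclasses import dataclass, field


@dataclass
class LocalAssistant:
    """Hypothetical sketch: audio stays in-process; only text is packaged for the network."""

    transcript: list = field(default_factory=list)

    def on_audio_chunk(self, pcm: bytes) -> None:
        # Stand-in for on-device speech recognition; a real app would run a local model here.
        text = self._transcribe_locally(pcm)
        if text:
            self.transcript.append(text)
        # The raw bytes are not stored on the object or queued for upload.

    def _transcribe_locally(self, pcm: bytes) -> str:
        # Placeholder: treat the bytes as UTF-8 text so the sketch is runnable.
        return pcm.decode("utf-8", errors="ignore").strip()

    def on_user_trigger(self, question: str) -> dict:
        """Build the only payload that would leave the device: plain text, on demand."""
        return {"context": " ".join(self.transcript), "question": question}


assistant = LocalAssistant()
assistant.on_audio_chunk(b"We need the report by Friday.")
payload = assistant.on_user_trigger("What deadline was mentioned?")
```

Nothing is built for the network until `on_user_trigger` is called, and the payload it returns has no field that could carry audio.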
The latency advantage
A Whisper-small model running on an M-series Apple Silicon chip via the Neural Engine processes a 3-second audio chunk in approximately 15–40ms of wall-clock time. Sending the same chunk to a cloud API and waiting for the result typically takes 300–800ms, plus queuing and routing overhead.
For post-call summarization, neither of these matters. For real-time coaching, the difference between 40ms and 600ms is the difference between being ahead of the conversation and being behind it.
There's also a reliability dimension. Cloud pipelines fail. Network timeouts, rate limits, provider outages — any of these can interrupt a real-time assistant at exactly the moment you need it. A local model has no dependency on network conditions.
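A minimal sketch of what that independence can buy in practice: if the one remaining network call is given a hard deadline, a stall or outage costs a suggestion but not the locally produced transcript. The function names and the deadline values here are assumptions for illustration, not a description of any particular product.

```python
import concurrent.futures
import time


def suggest_with_deadline(call_llm, transcript_text, deadline_s=1.0):
    """Return the remote suggestion, or None if the call fails or misses the deadline."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(call_llm, transcript_text)
    try:
        return future.result(timeout=deadline_s)
    except Exception:
        # Timeouts, rate limits, and outages all end up here.
        return None
    finally:
        # Do not block the caller waiting for a stalled request to finish.
        pool.shutdown(wait=False)


def stalled_llm(_text):
    # Hypothetical stand-in for a remote call that responds too slowly.
    time.sleep(0.3)
    return "too late to be useful"


transcript = "Can you send the contract today?"
suggestion = suggest_with_deadline(stalled_llm, transcript, deadline_s=0.05)
# The local transcript is still available even though the remote call missed its deadline.
fallback = suggestion if suggestion is not None else transcript
```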
The tradeoffs
Local processing isn't a free lunch. The architectural choice comes with genuine costs worth being honest about:
| Dimension | Local processing | Cloud pipeline |
|---|---|---|
| Transcription latency | ~15–40ms | 300–800ms+ |
| Network dependency | None (ASR step) | Required throughout |
| Audio privacy | Stays on device | Multiple processors |
| CPU / battery impact | Higher device load | Minimal |
| Cross-device sync | Not possible | Supported |
| Transcript searchability | Local only | Cloud-searchable |
| Model update cycle | App update required | Transparent updates |
The CPU cost is real. Running a Whisper-small model continuously during a call uses roughly 15–25% of one CPU core on an M2 MacBook Pro, and Neural Engine offloading reduces this significantly on M-series chips. On older Intel hardware, the overhead is more pronounced: around 20–30 minutes less battery life on a long meeting day.
The cross-device sync limitation is also real. If you want to review what was said on your phone after a call, a local-first architecture doesn't support that without explicit sync — and enabling sync reintroduces the privacy tradeoffs you were trying to avoid.
The use case determines the tradeoff. For post-call documentation and team collaboration, cloud pipelines are the better architecture — centralized storage and search are features, not liabilities. For real-time in-the-moment assistance where latency and privacy are the primary concerns, local processing wins on every dimension that matters.
Why this is the right architecture for real-time use
Post-call summarization requires: access to the full transcript, structured notes, CRM integration, and team sharing. None of these require low latency. All of them benefit from centralized storage. Cloud is the right answer for this use case.
Real-time assistance requires: sub-second response time, continuity through network interruptions, no visible presence in the meeting infrastructure, and strong privacy guarantees. Local processing is the right answer for this use case.
The mistake the current generation of meeting AI tools made was trying to build both with the same pipeline. You can't build a low-latency real-time assistant on a cloud audio pipeline without enormous engineering effort — and even then, you still have a bot in the participant list, which is a separate and unfixable problem.
Local audio processing isn't a compromise forced by privacy concerns. It's the architecturally correct choice for the real-time case — and the privacy properties fall out of it for free.
What this looks like in practice on macOS
On Apple Silicon Macs, the stack that makes this practical looks roughly like:
- ScreenCaptureKit (macOS 12.3+) for system audio capture, the same API screen recording software uses, respecting setContentProtection flags from other apps.
- CoreML + a distilled Whisper model for on-device speech recognition, running on the Neural Engine for efficient inference without major CPU or GPU impact.
- A local context accumulator that maintains a rolling transcript window and detects segments worth acting on — topic shifts, questions, named entities.
- A remote LLM API call triggered on demand, passing only the text context and the user's query. Audio is never in this request.
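The context accumulator in the list above could be sketched roughly as follows. This is plain Python rather than the macOS stack itself, and everything in it is an assumption made for illustration: the class and field names, the 60-second default window, and especially the question heuristic, which is deliberately crude and does not attempt topic-shift or named-entity detection.

```python
from collections import deque
from dataclasses import dataclass

QUESTION_STARTERS = {"what", "why", "how", "when", "who", "can", "could", "does", "should"}


@dataclass
class Segment:
    start_s: float  # seconds since the call started
    text: str


class ContextAccumulator:
    """Keeps only the last `window_s` seconds of transcript and flags likely questions."""

    def __init__(self, window_s: float = 60.0):
        self.window_s = window_s
        self.segments = deque()

    def add(self, segment: Segment) -> bool:
        """Append a segment, evict stale ones, and report whether it looks like a question."""
        self.segments.append(segment)
        cutoff = segment.start_s - self.window_s
        while self.segments and self.segments[0].start_s < cutoff:
            self.segments.popleft()
        return self._looks_like_question(segment.text)

    @staticmethod
    def _looks_like_question(text: str) -> bool:
        # Crude heuristic: trailing question mark or a common question word up front.
        words = text.strip().lower().split()
        return bool(words) and (words[-1].endswith("?") or words[0] in QUESTION_STARTERS)

    def context_text(self) -> str:
        """Text of the current window, suitable as context for an on-demand LLM request."""
        return " ".join(s.text for s in self.segments)


acc = ContextAccumulator(window_s=10.0)
flags = [
    acc.add(Segment(0.0, "Welcome everyone.")),
    acc.add(Segment(5.0, "What is the budget?")),
    acc.add(Segment(14.0, "Let me check.")),
]
```

With a 10-second window, the opening segment has been evicted by the time the third one arrives, so only recent text would be passed along when the user asks for a suggestion.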
The result is an assistant that feels faster than cloud-based alternatives because the slow step — transcription — happens locally. The only network hop is for the LLM response, and that's bounded by the model's output speed rather than the transcription pipeline's round-trip latency.
This is what Sway is built on. The architecture isn't a privacy claim — it's a consequence of optimizing for the latency requirements of real-time use, which happens to also mean your audio never goes anywhere.
