How I Built Sway — The Technical Decisions Behind a Private, Local-First AI

The original idea for Sway was not "build a private AI assistant." It was: I keep blanking on things during important calls. I know the answer but I can't access it under pressure. There should be a tool for that.

That's a normal enough product insight. The interesting part was what happened when I tried to figure out how to build it. Every technical decision I made in the first three months turned out to be load-bearing — not just for performance or privacy, but for whether the product was even possible at all.

This is a post about those decisions: why I made them, what I got wrong, and what the architecture looks like now.

Why Electron

The first decision was platform. A meeting assistant needs access to system audio — the audio that's going through your Mac's output. A browser extension can't do this. A web app can't do this. Only a native application has the permission surface to capture system audio on macOS without routing through the meeting platform's API.

The options were: native Swift, Flutter, Tauri, or Electron. I've shipped Swift apps before, but Swift's async ecosystem in 2025 was still maturing and I wanted to move fast. Flutter was overkill for a single-window utility. Tauri was tempting — smaller binary, Rust backend — but the ecosystem for the specific APIs I needed (Core Audio bindings, Accessibility APIs for window behavior) was thin.

Electron won for one reason: the combination of Node.js in the main process and a full Chromium renderer means I can use the entire npm ecosystem for AI API calls, audio processing, and IPC — while still getting native macOS APIs through the main process when I need them. The binary is large. The memory footprint is larger than I'd like. Those are real tradeoffs. But the iteration velocity was worth it at this stage.

What I'd do differently: I'd structure the IPC layer more carefully from day one. The channel proliferation that happens as you add features — "analyze-audio", "analyze-image", "gemini-chat", "chat-with-history", "commentator-analyze" — creates a maintenance surface that compounds. If I were starting over, I'd abstract earlier into a single typed IPC dispatcher with a message schema.
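As a rough sketch of what a single typed dispatcher could look like: the request kinds below loosely mirror the channel names above, but the payload fields, the response shape, and the "sway:request" channel name are all hypothetical, and the handlers are stubs.

```typescript
// Hypothetical message schema: one discriminated union replaces one channel per feature.
type IpcRequest =
  | { kind: "analyze-audio"; audioBase64: string }
  | { kind: "analyze-image"; imageBase64: string }
  | { kind: "chat"; prompt: string; history: string[] };

type IpcResponse = { ok: true; text: string } | { ok: false; error: string };

// Stub handlers; real ones would call the audio, vision, or chat backends.
function dispatch(req: IpcRequest): IpcResponse {
  switch (req.kind) {
    case "analyze-audio":
      return { ok: true, text: `audio payload chars: ${req.audioBase64.length}` };
    case "analyze-image":
      return { ok: true, text: `image payload chars: ${req.imageBase64.length}` };
    case "chat":
      return { ok: true, text: `chat: ${req.prompt} (${req.history.length} prior turns)` };
    default: {
      // Compile-time exhaustiveness check: adding a kind without a case is a type error.
      // At runtime this branch also catches malformed messages from the renderer.
      const unreachable: never = req;
      return { ok: false, error: `unknown request: ${JSON.stringify(unreachable)}` };
    }
  }
}

// In Electron this would sit behind one channel (name is an assumption), for example:
// ipcMain.handle("sway:request", (_event, req: IpcRequest) => dispatch(req));
```

The point of the union is that a new feature becomes a new `kind` plus a new `case`, and the compiler flags any place that forgot to handle it.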

The audio pipeline

Getting system audio on macOS without a bot joining the call required using the ScreenCaptureKit framework — Apple's modern replacement for the older CGDisplayStream APIs. ScreenCaptureKit can capture the audio output from specific applications (Zoom, Meet, Teams) or the entire system output, at configurable sample rates, in real time.

The challenge is that ScreenCaptureKit requires an explicit user permission grant: the Screen Recording permission in System Settings (System Preferences on macOS 12). This is non-negotiable. Apple requires it. The onboarding flow for Sway has to walk users through granting this permission before audio capture works, which adds friction to first-run but is the right tradeoff for an app that explicitly markets itself as privacy-respecting. You should have to grant permission for something like this.

The audio arrives as a PCM stream that needs to be converted before it can be sent to a transcription or AI model. I process it through a pipeline:

Capture from ScreenCaptureKit at 48kHz stereo
Downsample to 16kHz mono (what most speech models expect)
Buffer into chunks appropriate for the target API
Either transcribe locally (for low-latency feedback) or send the buffer as base64 to an AI provider
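The middle steps of that pipeline can be sketched in a few lines. This assumes the capture arrives as interleaved 32-bit float stereo, which is an assumption (the real buffer layout depends on how the capture stream is configured), and it uses plain averaging as the low-pass step, where a production resampler would use a proper filter to limit aliasing. The function names are illustrative.

```typescript
// Mix interleaved stereo (L, R, L, R, ...) down to mono by averaging each pair.
function stereoToMono(interleaved: Float32Array): Float32Array {
  const mono = new Float32Array(Math.floor(interleaved.length / 2));
  for (let i = 0; i < mono.length; i++) {
    mono[i] = (interleaved[2 * i] + interleaved[2 * i + 1]) / 2;
  }
  return mono;
}

// 48 kHz -> 16 kHz is an exact factor of 3, so average each group of 3 samples.
// Box averaging is a crude low-pass filter; trailing samples that do not fill a group are dropped.
function decimateBy3(samples: Float32Array): Float32Array {
  const out = new Float32Array(Math.floor(samples.length / 3));
  for (let i = 0; i < out.length; i++) {
    out[i] = (samples[3 * i] + samples[3 * i + 1] + samples[3 * i + 2]) / 3;
  }
  return out;
}

// Split a sample buffer into fixed-size chunks; the last chunk may be shorter.
function chunkSamples(samples: Float32Array, chunkSize: number): Float32Array[] {
  if (chunkSize <= 0) throw new Error("chunkSize must be positive");
  const chunks: Float32Array[] = [];
  for (let start = 0; start < samples.length; start += chunkSize) {
    chunks.push(samples.subarray(start, start + chunkSize));
  }
  return chunks;
}
```

At 16 kHz mono, one second of audio is 16,000 samples, so a chunk size of 16000 corresponds to one-second buffers.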

The local transcription path uses Whisper, specifically a small quantized model that runs on Apple Silicon fast enough to be useful in real time without burning through battery or noticeably spinning up the fans. The larger models are more accurate but too slow for live conversation use.

"The tradeoff between model quality and response latency is the central engineering problem of real-time AI. You need the answer before the moment passes."

The screen-privacy problem

The hardest part of building Sway was not audio. It was the window.

The product requirement was clear: the Sway window must be private to screen sharing. When you're in a Zoom call sharing your screen, the other participants should not be able to see Sway. This is what makes the product usable in professional contexts — without it, you'd have to toggle the app off every time you shared your screen, which defeats the purpose entirely.

macOS has a mechanism for this, which Electron exposes as setContentProtection(true) on a BrowserWindow. On macOS this sets the underlying NSWindow's sharingType to NSWindowSharingNone, which excludes the window from the OS screen capture layer. Zoom, Meet, Teams, and every other screen recording tool on macOS capture from that layer. Setting this flag makes your window private to all of them.

The problem is that setContentProtection also makes the window private to everything that uses the OS capture layer — including AppleScript, the screencapture command-line tool, Playwright, and any automated testing framework that takes screenshots at the OS level. For testing purposes, the window simply does not exist.

The workaround is Chrome DevTools Protocol. CDP's Page.captureScreenshot renders from the Chromium compositor directly — it bypasses the OS capture layer entirely. So all visual testing for Sway has to be done via CDP scripts that connect to the Electron app's remote debugging port. There is no other way to take a screenshot of the app while content protection is on.

This constraint shapes the entire testing infrastructure. You can't use Playwright's standard screenshot API. You can't use macOS automation tools. Everything that needs visual verification goes through CDP.
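For reference, the CDP exchange itself is small. The sketch below only builds the request and unpacks the reply; it assumes the app was launched with a remote debugging port (for example --remote-debugging-port=9222, where the port number is an assumption) and that the messages travel over the page's WebSocket debugger URL. The function names are illustrative, and the WebSocket transport is left out.

```typescript
interface CdpRequest {
  id: number;
  method: string;
  params: Record<string, unknown>;
}

// Build the JSON message for Page.captureScreenshot, which renders from the
// Chromium compositor rather than the OS capture layer.
function buildScreenshotRequest(id: number): string {
  const request: CdpRequest = {
    id,
    method: "Page.captureScreenshot",
    params: { format: "png" },
  };
  return JSON.stringify(request);
}

// Extract the base64-encoded PNG from the matching CDP response.
function parseScreenshotResponse(raw: string, expectedId: number): string {
  const msg = JSON.parse(raw) as {
    id?: number;
    result?: { data?: string };
    error?: { message?: string };
  };
  if (msg.id !== expectedId) {
    throw new Error(`unexpected CDP message id: ${String(msg.id)}`);
  }
  if (msg.error) {
    throw new Error(`CDP error: ${msg.error.message ?? "unknown"}`);
  }
  const data = msg.result?.data;
  if (typeof data !== "string") {
    throw new Error("CDP response has no screenshot data");
  }
  return data; // base64 PNG; decode and write to disk in the test harness
}
```

A test script would send the request string over the WebSocket, wait for the message with the same id, and pass it to parseScreenshotResponse.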

The two-mode layout system

Sway has two fundamentally different operating modes. In private mode — the default — the window is small, non-focusable, and sized to its content. It auto-resizes as text appears and disappears. In active mode, the window behaves like a normal interactive application: the user can resize it, it responds to keyboard input, and it occupies a fixed viewport.

These two modes require completely different CSS architectures. Private mode needs height: auto on the root so the window can shrink to content. Active mode needs height: 100% propagated through every ancestor element so content fills the window rather than overflowing it.

The mechanism I ended up with is a single class on body: active-mode. When this class is present, a CSS rule overrides the root height to 100% and propagates down the tree. When it's absent, everything defaults to auto. A single boolean in the main process sends an IPC event to the renderer, which applies or removes the class.

This sounds simple. The devil is in the details. Every flex child defaults to min-height: auto, which prevents shrinking below content size even when flex: 1 is set. The chat container needs min-height: 0 paired with every flex: 1. Get one wrong and the layout breaks in one of the two modes. I've fixed this bug in the wrong direction twice.

The rule I've internalized: height: 100% only works if every single ancestor also has an explicit height. One height: auto or height: fit-content anywhere in the chain breaks propagation completely. In a two-mode layout system, this chain must be verified for both modes independently.
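One way to make that verification mechanical is to write down the declared height of each ancestor per mode and check the chain. This is a minimal sketch: the selectors other than body.active-mode are hypothetical, and the heights are the declared stylesheet values rather than anything read from the live DOM.

```typescript
type Mode = "private" | "active";

interface HeightDecl {
  selector: string;
  height: string; // declared value, e.g. "100%", "auto", "480px"
}

// Height keywords that leave a percentage height on the child unresolved.
const INDEFINITE = ["auto", "fit-content", "min-content", "max-content"];

// Walk the ancestor chain (outermost first) and return the first declaration
// that breaks the mode's rule, or null if the chain is consistent.
function findHeightViolation(mode: Mode, chain: HeightDecl[]): HeightDecl | null {
  for (const decl of chain) {
    const value = decl.height.trim();
    const indefinite = INDEFINITE.indexOf(value) !== -1;
    // Active mode: every ancestor needs a definite height so 100% can resolve.
    if (mode === "active" && indefinite) return decl;
    // Private mode: everything stays auto so the window can shrink to content.
    if (mode === "private" && value !== "auto") return decl;
  }
  return null;
}
```

Running the check once per mode catches the case where a fix for one mode silently breaks the other.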

The IPC architecture

Electron's main process and renderer communicate through IPC channels. The main process handles system-level operations — audio capture, screenshot, window management, AI API calls. The renderer handles UI state and user interaction.

The architecture I landed on exposes everything the renderer needs through a contextBridge in the preload script. The renderer never calls Electron APIs directly. It calls window.electronAPI.invoke("channel-name", args) and waits for a response. This provides a clean boundary and makes the renderer essentially a normal React app that happens to have access to a custom API surface.

The problem with this approach is that ipcMain.handle() returns a value to the caller but doesn't notify other subscribers. When something in the main process changes app state (a keyboard shortcut toggling private mode, say), the code that makes the change also needs to push the update to the renderer: event.sender.send("event-name", payload) inside an IPC handler, or win.webContents.send("event-name", payload) from a shortcut callback, where there is no IPC event to reply to. Every code path that changes observable state needs this second call. Forgetting it produces bugs where the UI is out of sync with the main process state, which are genuinely disorienting to debug.
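One way to make the second call hard to forget is to route every state change through a single store that always pushes. The sketch below is an illustration under assumptions: the AppState shape and the "state-changed" channel name are hypothetical, and the send function is injected so the store does not depend on Electron.

```typescript
// Injected push function; in Electron this might wrap win.webContents.send.
type Send = (channel: string, payload: unknown) => void;

interface AppState {
  privateMode: boolean;
}

// Single place that mutates state, so the push to the renderer cannot be skipped.
function createStateStore(initial: AppState, send: Send) {
  let state: AppState = { ...initial };
  return {
    get(): AppState {
      return { ...state };
    },
    update(patch: Partial<AppState>): AppState {
      state = { ...state, ...patch };
      // Hypothetical channel name; the renderer subscribes to it once.
      send("state-changed", { ...state });
      return { ...state };
    },
  };
}
```

Both ipcMain.handle() handlers and shortcut callbacks would then call store.update(...) and return its result, instead of mutating state and sending events separately.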

The AI provider architecture

Sway doesn't call AI providers directly from the renderer. All AI calls go through the Electron main process, which either calls a Firebase Cloud Function proxy or — for local-only operations — processes directly with a local model.

The proxy layer exists for rate limiting and usage metering. The trial gate needs to be enforced server-side — client-side enforcement is trivially bypassed. Every AI call increments a Firestore counter via a Cloud Function before proxying to Gemini. If the counter is over the limit, the function returns 429 and the app shows a friendly upgrade prompt.
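The gate logic reduces to a check-and-increment per user. Here is a minimal in-memory sketch of that logic: the Map stands in for the Firestore counter, the function and field names are hypothetical, and a real Cloud Function would need to do the read and the increment inside a transaction so concurrent calls cannot both slip under the limit.

```typescript
interface GateResult {
  status: 200 | 429;
  remaining: number;
}

// Returns a function that consumes one call for a user, or refuses with 429
// once the user has reached the limit.
function createTrialGate(limit: number): (userId: string) => GateResult {
  // In-memory stand-in for the server-side usage counter.
  const used = new Map<string, number>();
  return (userId) => {
    const count = used.get(userId) ?? 0;
    if (count >= limit) {
      return { status: 429, remaining: 0 };
    }
    used.set(userId, count + 1);
    return { status: 200, remaining: limit - (count + 1) };
  };
}
```

Only a 200 result would go on to proxy the request to the model; a 429 is what the app turns into the upgrade prompt.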

The failure mode I spent the most time on was cold starts. Cloud Functions can take two to four seconds to respond on a cold start. In a real-time AI context, four seconds is an eternity — the conversational moment has moved on. The mitigations are: minInstances: 1 on the most latency-sensitive functions (keeps at least one warm), a 30-second fetch timeout (prevents indefinite hangs), and a user-facing loading state that acknowledges the delay without blocking the UI.

"Cold starts are the enemy of real-time AI. A four-second delay when a prospect asks you a question is not a delay — it's a loss."

What I got wrong

The honest list:

Underestimating the testing surface. The content protection constraint means I can't use standard testing tools. I spent too long trying to make Playwright work before accepting that CDP is the only path and building the tooling around it.
Overbuilding the feature set early. The app shipped with screenshot analysis, voice dictation, a real-time commentator, and a chat interface. Each is useful. Each added surface area that needed testing and maintenance. A tighter initial scope would have been faster to ship and easier to iterate on.
The window auto-resize interaction with content protection. When content protection is on, the window is private to external tools, which means any auto-resize driven by a ResizeObserver sends IPC calls that can't be verified without CDP. I had at least three bugs in this interaction that only showed up when I tested via CDP.

What surprised me

The thing that surprised me most was how much of the hard work is in the UX surface, not the AI. Getting Gemini to generate a useful objection response from an audio transcript is not hard. Getting the window to be the right size, in the right position, with the right opacity, appearing and disappearing at the right moments without stealing focus from the meeting — that's the hard part. The AI is a commodity. The private, non-intrusive, perfectly-timed UI is the product.

The second thing that surprised me was the market's appetite for the "no bot" positioning. I expected the privacy angle to be a nice-to-have. It turned out to be the primary reason people chose Sway over alternatives. Enterprise buyers especially — people who can't have a recording bot in their calls for compliance reasons, or who work with counterparties who would object to being recorded — found this more compelling than any AI capability.

You can't add "no bot" to a product that was built around a bot. It requires a different architecture from the first line of code. That constraint, which felt like a limitation when I was building it, turned out to be the moat.

Built to be private. Available now.

Sway is a native macOS app. No bot. No recording. Local audio. Free 7-day trial — download and run your first session in two minutes.
