A macOS menu bar app that automatically detects Microsoft Teams meetings, records dual-track audio (app + mic), and produces on-device transcripts with speaker diarization. Also includes a real-time dictation mode that types transcribed speech into any focused text field.
No cloud, no LLM, no external APIs — everything runs on-device on Apple Silicon.
- Teams auto-detection — Polls `IOPMCopyAssertionsByProcess()` every 3 seconds for Teams power assertions (`PreventUserIdleDisplaySleep` / `NoDisplaySleepAssertion` / "Call in progress"). Requires 2 consecutive hits before triggering; applies a 5-second cooldown on end. Recognises Teams by bundle ID (`com.microsoft.teams` / `com.microsoft.teams2`) with localized-name fallbacks for Teams classic, New Teams, and "work or school" variants. A sketch of the polling loop follows this list.
- Meeting title extraction — Reads the Teams window title via the Accessibility API and strips the ` | Microsoft Teams` suffix for the transcript filename.
- Dual-track recording — The mic track is captured via `AVAudioEngine`; the app track is captured via `CATapDescription` + a private aggregate device + a raw AUHAL output unit. The tap collects all Teams-related CoreAudio process object IDs (main + renderer children) so audio from Electron/Chromium sub-processes isn't lost. Both tracks land on disk as 48 kHz WAVs and are never mixed during recording.
- Recording reliability — The app-audio tap self-tests after 2 seconds. If it is silent, Heard rebuilds the tap/aggregate/IOProc chain once with a fresh helper-process scan; persistent silence marks the job as mic-only and the menu bar shows "Recording (mic only)". Turning watching off during an active meeting synchronously finalizes that recording and preserves the processing state.
- Roster scraping — Reads Teams participant names via Accessibility APIs every 15 s while a meeting is active. Three strategies, tried in order: known roster-panel identifiers → `AXList`/`AXTable` containers → window-title parsing. Filters out UI control strings (mute, raise hand, etc.).
- On-device pipeline — Sequential stages per job: preprocessing (48 kHz → 16 kHz mono via `AVAudioConverter`, in-memory Silero VAD trimming + `VadSegmentMap`) → Parakeet TDT V2/V3 transcription (per-track, parallel 4-chunk processing, with CTC vocabulary boosting) → Inverse Text Normalization (punctuation formatting) → mic-bleed deduplication → LS-EEND + WeSpeaker diarization → speaker assignment and Markdown output.
- Speaker identification — Cosine-distance matching (threshold 0.40, confidence margin 0.10) against a persistent speaker database, with embedding diversity management and auto-update on confident matches; see the matching sketch after this list. Uses UUID-based placeholders (e.g. `Speaker_1AB23C`) so placeholder names never collide across meetings.
- Cumulative transcription stats — Every transcribed segment increments the matched speaker's total transcribed hours and word count, visible in the Settings tab.
- Roster-aware auto-naming — When one speaker is unmatched and exactly one roster name is unclaimed, the roster name is assigned automatically without prompting.
- Speaker naming window — When unmatched speakers remain, a dedicated "Name Speakers" window opens with a playable audio clip (~10 s of each speaker's clearest speech, extracted from the 48 kHz recording), a text field pre-populated with any roster suggestion, and a 120 s auto-dismiss countdown.
- Custom vocabulary boosting — CTC-based keyword boosting via `Parakeet CTC 110M`, applied to both meeting transcription and dictation. Terms are added from Settings → General (min 3 chars, max 50 terms). Falls back gracefully if the CTC model can't be loaded.
- Persistent job queue — `pipeline_queue.json` survives app restarts; failed jobs are re-queued on relaunch up to a lifetime cap of 6 retries. Non-retryable errors (no audio, too short) fail immediately; transient errors retry 3× per session with exponential backoff (5 s, 30 s, 5 min). User-initiated retry resets the count.
- Long-meeting handling — 4 h hard cap; on hit, the current recording is finalized and a fresh one starts if the meeting is still active.
- Markdown output — Timestamped, speaker-labeled transcripts written to a configurable output folder. Filename dates are configurable (`YYMMDD` or `YYYY-MM-DD`). Consecutive segments from the same speaker are merged into continuous blocks. When laptop mic + speakers cause remote audio to bleed into the mic track, overlapping mic segments are dropped when the app track already contains the same words.
- In-meeting notes — Press a global hotkey during an active recording (default ⌃⇧N, configurable in Settings → General) to open a floating composer panel. Type a note, ⌘↩ to save, Esc to cancel. The note's timestamp is captured at panel-open time, so a slow typist's note still anchors to the moment they reacted. Notes are interleaved chronologically with spoken segments and rendered as `[mm:ss] _**Note from <Your Name>:** …_` (italicized, distinct from speaker blocks) so a reader can tell what was said versus what was added as supplemental context. Notes survive crashes via `pipeline_queue.json`; if the meeting ends while the composer is still open, the note is attached to the just-finished job.
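A minimal sketch of the assertion-polling loop from the first bullet, with illustrative names and the raw IOKit dictionary key (`"AssertType"`); the real `MeetingDetector` additionally matches the owning process against the Teams bundle IDs:

```swift
import Foundation
import IOKit.pwr_mgt

/// Polls power assertions and debounces: two consecutive hits start a meeting.
/// Class and callback names are illustrative, not Heard's actual API.
final class AssertionPoller {
    private var consecutiveHits = 0
    var onMeetingStart: (() -> Void)?

    func poll() {    // called every 3 s by a timer
        var dict: Unmanaged<CFDictionary>?
        guard IOPMCopyAssertionsByProcess(&dict) == kIOReturnSuccess,
              let byPID = dict?.takeRetainedValue() as? [Int: [[String: Any]]] else { return }

        // Does any process hold a display-sleep assertion typical of a call?
        let hit = byPID.values.joined().contains { assertion in
            let type = assertion["AssertType"] as? String
            return type == "PreventUserIdleDisplaySleep" || type == "NoDisplaySleepAssertion"
        }
        consecutiveHits = hit ? consecutiveHits + 1 : 0
        if consecutiveHits == 2 { onMeetingStart?() }   // debounced trigger
    }
}
```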
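And the speaker-matching rule, sketched under the stated thresholds with hypothetical types (the real matcher lives in `SpeakerAssignment.swift`):

```swift
import Foundation

struct Profile { let id: UUID; var name: String; var embedding: [Float] }

/// Cosine distance = 1 − cosine similarity; 0 is identical, 2 is opposite.
func cosineDistance(_ a: [Float], _ b: [Float]) -> Float {
    let dot = zip(a, b).reduce(0) { $0 + $1.0 * $1.1 }
    let na = sqrt(a.reduce(0) { $0 + $1 * $1 })
    let nb = sqrt(b.reduce(0) { $0 + $1 * $1 })
    return 1 - dot / (na * nb)
}

/// Accept the best match only if it is close enough (< 0.40) AND clearly
/// better than the runner-up (margin ≥ 0.10); otherwise treat as unknown.
func match(_ embedding: [Float], against profiles: [Profile]) -> Profile? {
    let ranked = profiles
        .map { ($0, cosineDistance(embedding, $0.embedding)) }
        .sorted { $0.1 < $1.1 }
    guard let best = ranked.first, best.1 < 0.40 else { return nil }
    if ranked.count > 1, ranked[1].1 - best.1 < 0.10 { return nil }
    return best.0
}
```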
- Real-time speech-to-text — Uses FluidAudio's `SlidingWindowAsrManager` with overlapping windows and an internal stable/volatile text split. Audio is fed as `AVAudioPCMBuffer`s straight from the mic tap; the manager handles resampling, chunking, and context accumulation internally.
- Incremental injection — Confirmed text flows word by word in real time as the sliding window confirms it; remaining volatile text is flushed on stop. No batching: text is typed as you speak.
- Punctuation normalization — FluidAudio's Inverse Text Normalization (ITN) runs natively, converting spoken forms like "comma", "period", and "new line" to their written equivalents before text injection.
- Filler word stripping — Standalone instances of "uh", "um", "er", "ah", "hmm", "hm", "uhh", "umm", "mhm" are removed before injection; matching is case-insensitive and respects word boundaries (see the sketch after this list). Keeps dictated text clean without user intervention.
- Vocabulary boosting — When CTC models are downloaded, `configureVocabularyBoosting()` is applied so custom terms take effect in real time.
- Text injection — `CGEvent.keyboardSetUnicodeString` + `postToPid` targets the frontmost app; falls back to HID-level events, then clipboard paste (the primary path is sketched after this list). All paths require Accessibility permission.
- Global hotkey — Default ⌃⇧D, registered via Carbon `RegisterEventHotKey`. No Accessibility permission is needed for the hotkey itself, so it keeps working across ad-hoc rebuilds. Rebindable via a Record sheet in the Dictation tab. Hotkey input is validated to block forbidden system shortcuts and warn on weak combos.
- Hotkey reuse fix — Proper cleanup after stopping: `SlidingWindowAsrManager.cleanup()` closes the internal `AsyncStream` before starting a new session, preventing state corruption from rapid presses. State validation throws `DictationError.notIdle` to surface conflicts.
- Accessibility revocation detection — If the user revokes Accessibility permission mid-dictation, `AppModel.startAXPolling()` detects it within 2 seconds and gracefully stops, showing an orange banner with a "Re-grant Access…" button.
- Floating dictation indicator (HUD) — An optional on-screen pill at the bottom-center shows "Dictating" with a waveform icon while active (opt-in via Settings → Dictation). Appears at full opacity, dims to 35% after 2.5 seconds, and fades out on stop.
- Toggle and push-to-talk modes — Tap to toggle, or hold the hotkey to dictate and release to stop.
- Model keep-alive — ASR models stay resident for a configurable period (0 – 10 min, default 120 s) after dictation stops to avoid reload latency between utterances.
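A sketch of the filler-stripping pass referenced above, assuming a plain regex implementation (the actual code may differ):

```swift
import Foundation

/// Removes standalone filler words, case-insensitively, at word boundaries,
/// then collapses any doubled spaces left behind.
func stripFillers(_ text: String) -> String {
    let fillers = ["uh", "um", "er", "ah", "hmm", "hm", "uhh", "umm", "mhm"]
    let pattern = "\\b(?:" + fillers.joined(separator: "|") + ")\\b"
    let stripped = text.replacingOccurrences(
        of: pattern, with: "", options: [.regularExpression, .caseInsensitive])
    return stripped
        .replacingOccurrences(of: "\\s{2,}", with: " ", options: .regularExpression)
        .trimmingCharacters(in: .whitespaces)
}

// stripFillers("uh so um let's start")  →  "so let's start"
```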
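And a sketch of the primary injection path, using only the Unicode-string route (HID and clipboard fallbacks omitted); `inject` is an illustrative name, not Heard's API:

```swift
import Foundation
import CoreGraphics

/// Types `text` into the app owning `pid` by posting a key event whose
/// Unicode payload carries the string. Requires Accessibility permission.
/// Long strings may need chunking; word-by-word injection stays well under
/// the event payload limit.
func inject(_ text: String, intoPid pid: pid_t) {
    let chars = Array(text.utf16)
    guard let down = CGEvent(keyboardEventSource: nil, virtualKey: 0, keyDown: true),
          let up = CGEvent(keyboardEventSource: nil, virtualKey: 0, keyDown: false)
    else { return }
    down.keyboardSetUnicodeString(stringLength: chars.count, unicodeString: chars)
    up.keyboardSetUnicodeString(stringLength: chars.count, unicodeString: chars)
    down.postToPid(pid)
    up.postToPid(pid)
}
```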
- Visual design — "Paper" warm palette throughout: off-white background (`#F5EFE4`), card surfaces (`#FBF7EF`), warm sidebar (`#EBE2CE`), accent blue (`#3F5C8C`). Forced light-only appearance. Custom 188 px sidebar with a `HeardMark` squircle glyph header; settings panels use `SettingsCard`/`CardRow`/`ToggleRow` card primitives rather than system `Form` groups.
- App icon — Ships a full 16–512 px AppIcon asset set. `bundle.sh` compiles it into `AppIcon.icns`, `Info.plist` references it via `CFBundleIconFile`, and the About tab renders the real bundle icon.
- Menu bar icon — SF Symbols with symbol effects: static `recordingtape` when idle, pulsing/breathing `record.circle` while recording or dictating, iterative `waveform` while processing, `exclamationmark.circle.fill` on error, `person.crop.circle.badge.exclamationmark` when waiting for speaker naming. The icon uses the accent tint (blue) when actively capturing audio (recording or dictating) and the primary tint (white/dark gray) otherwise. Opacity is 100% when the app is watching or recording, and 50% when paused (phase is dormant, not dictating, and meeting detection is off).
- Menu bar dropdown — 268 px wide Paper-palette panel. Current status card (dark `#2E3338` background when recording/dictating, Paper background otherwise), Recent Meetings list (up to 3 completed/failed jobs with click-to-open and right-click to reveal/retry/dismiss), quick actions (start dictation, open transcripts, name speakers, settings, quit), and developer-mode simulate buttons.
- Settings window — 880×600, five tabs: General (launch at login, auto-watch, developer mode, custom vocabulary, output folder, permissions), Dictation (enable, push-to-talk, hotkey recorder with validation feedback, model keep-alive slider, live status, show dictation HUD checkbox), Models (download cards with progress, pipeline keep-alive, "Unload All Models"), Speakers (your name, inline rename, merge, delete, search, sort, with prompts to retroactively update all past transcripts), About.
- Name Speakers window — 560×520 standalone scene with audio-clip playback per candidate, roster-suggestion hints, save/skip, and a 120 s countdown.
- macOS 15.0+
- Apple Silicon (FluidAudio CoreML/ANE models are ARM-only)
- FluidAudio 0.14.3+ (resolved automatically via SPM)
```sh
# Compile only
swift build

# Run from the terminal (⚠ mic permission is attributed to the terminal, not Heard)
swift run Heard

# Build a .app bundle and launch it (recommended for day-to-day use)
./scripts/bundle.sh
open build/Heard.app

# Stable-signed build — required for dictation so the Accessibility grant
# persists across rebuilds. Replace "Heard Dev" with any local identity.
./scripts/bundle.sh --sign "Heard Dev"
open build/Heard.app

# Release build
./scripts/bundle.sh --release
```

Tip: If dictation's Accessibility grant gets stuck after rebuilds, reset it with `tccutil reset Accessibility com.execsumo.heard` and re-grant.
Single-process SwiftUI menu bar app. Three scenes: a `MenuBarExtra(.window)` dropdown, a Settings window, and a Speaker Naming window. `WindowActivationCoordinator` reference-counts the `.regular` activation policy across the two windows so keyboard focus survives closing one while the other stays open. All persistence is JSON files under `~/Library/Application Support/Heard/`.
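The coordinator's job reduces to a few lines; a sketch assuming a plain retain count (the real `WindowActivationCoordinator` lives in `Services.swift`):

```swift
import AppKit

/// Keeps the app in .regular activation policy while at least one real
/// window is open, and drops back to .accessory (menu-bar-only) at zero.
@MainActor final class ActivationRefCount {
    private var count = 0

    func windowDidOpen() {
        count += 1
        if count == 1 { _ = NSApp.setActivationPolicy(.regular) }
    }

    func windowDidClose() {
        count = max(0, count - 1)
        if count == 0 { _ = NSApp.setActivationPolicy(.accessory) }
    }
}
```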
```
Heard/
├── Sources/
│   ├── Heard/
│   │   └── MTApp.swift               # @main, MenuBarExtra + two Window scenes
│   └── HeardCore/
│       ├── AppModel.swift            # Central state, action handlers, dictation wiring
│       ├── CoreModels.swift          # AppPhase, PipelineJob, SpeakerProfile,
│       │                             #   AppSettings, HotkeyCombo, ModelKind, …
│       ├── Services.swift            # MeetingDetector, RecordingManager (AUHAL + tap),
│       │                             #   PipelineProcessor, PermissionCenter,
│       │                             #   TranscriptWriter, TempFileCleanup,
│       │                             #   AudioDeviceCleanup, LaunchAtLogin,
│       │                             #   WindowActivationCoordinator
│       ├── Stores.swift              # SettingsStore, SpeakerStore, PipelineQueueStore
│       ├── Views.swift               # MenuBarView, SettingsView, SpeakerNamingView
│       ├── AudioProcessing.swift     # AudioPreprocessor, VadSegmentMap
│       ├── AudioClipExtractor.swift  # Extract speaker audio clips for the naming prompt
│       ├── SpeakerAssignment.swift   # SpeakerMatcher, SegmentMerger, cosine distance
│       ├── ModelDownloadManager.swift # Pre-download & status for FluidAudio models
│       ├── DictationManager.swift    # SlidingWindowAsrManager wrapper + incremental injection
│       ├── TextInjector.swift        # CGEvent unicode / HID / clipboard paths
│       ├── HotkeyManager.swift       # Carbon RegisterEventHotKey wrapper (multi-hotkey registry)
│       ├── MeetingNoteComposer.swift # Floating panel for in-meeting notes
│       └── RosterReader.swift        # Teams roster via AXUIElement
├── Tests/HeardTests/
│   └── TestRunner.swift              # Lightweight harness — no XCTest / no Xcode
├── scripts/
│   ├── bundle.sh                     # Build + bundle + (optional) sign
│   └── diagnose.swift                # Print what Heard sees from Teams & power assertions
├── Info.plist                        # LSUIElement, NSMicrophoneUsageDescription
├── Heard.entitlements                # Audio input only (no sandbox)
├── Package.swift                     # SPM config, FluidAudio 0.14.3+
├── spec.md                           # Product source of truth
├── handoff.md                        # Current implementation status
├── ROADMAP.md                        # Planned improvements (this doc)
└── CLAUDE.md                         # Working rules for AI assistants
```
| Target | Kind | Purpose |
|---|---|---|
| `HeardCore` | library | All models, services, views, stores |
| `Heard` | executable | `@main` app entry, depends on `HeardCore` |
| `HeardTests` | executable | Test runner with a lightweight in-house harness |
- No sandbox. Required for `CATapDescription` app-audio tapping and `CGEvent` text injection. The entitlements file grants `audio-input` and nothing else.
- AUHAL + private aggregate device for the tap. `AVAudioEngine.inputNode` silently re-binds to the system default input when `prepare()` runs after a `kAudioOutputUnitProperty_CurrentDevice` change, so a standalone `kAudioUnitSubType_HALOutput` unit configured before `AudioUnitInitialize` is the only reliable way to capture from the tap's aggregate device.
- Multi-process Teams tap. New Teams renders audio in renderer/GPU child processes, not in the process holding the power assertion. The recorder enumerates every Teams-related CoreAudio process object and taps them all; a sketch of the enumeration follows this list.
- SlidingWindowAsrManager for dictation. Dictation uses FluidAudio's `SlidingWindowAsrManager` with the `.streaming` preset (11 s chunks, 2 s left/right context). The manager handles overlapping windows, resampling, and stable/volatile text promotion internally. Confirmed text is injected incrementally; remaining volatile text is flushed on stop. This also restores vocabulary boosting, which was removed from the batch `AsrManager` API in FluidAudio 0.13.6.
- Carbon for global hotkeys. `RegisterEventHotKey` doesn't need Accessibility permission and survives ad-hoc rebuilds, unlike `CGEvent.tapCreate` or `NSEvent` monitors (which can't suppress events). `HotkeyManager` keeps a `[UInt32: HotkeyManager]` registry sharing one Carbon event handler so multiple shortcuts (dictation, meeting notes) coexist with no per-instance handler churn. A minimal registration sketch follows this list.
- Per-track transcription. The app and mic tracks are transcribed separately, then merged by timestamp. This avoids crosstalk artifacts from a pre-mixed source.
- Local user is the mic track. No diarization is needed to identify the local user: the mic track is always "Me" (or the name set in Settings → Speakers). The mic embedding is stored and updated silently each meeting.
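A hedged sketch of that process-object scan, assuming the macOS 14 HAL process selectors; the resulting IDs would then seed `CATapDescription`. The real scan in `RecordingManager` may differ:

```swift
import CoreAudio

/// Returns the CoreAudio process objects whose bundle ID looks like Teams,
/// covering the main app and its Electron/Chromium helpers.
func teamsProcessObjects() -> [AudioObjectID] {
    var addr = AudioObjectPropertyAddress(
        mSelector: kAudioHardwarePropertyProcessObjectList,
        mScope: kAudioObjectPropertyScopeGlobal,
        mElement: kAudioObjectPropertyElementMain)
    var size: UInt32 = 0
    guard AudioObjectGetPropertyDataSize(AudioObjectID(kAudioObjectSystemObject),
                                         &addr, 0, nil, &size) == noErr else { return [] }
    var objects = [AudioObjectID](repeating: 0,
                                  count: Int(size) / MemoryLayout<AudioObjectID>.size)
    guard AudioObjectGetPropertyData(AudioObjectID(kAudioObjectSystemObject),
                                     &addr, 0, nil, &size, &objects) == noErr else { return [] }

    return objects.filter { obj in
        // Ask each process object for its bundle ID and prefix-match Teams.
        var bidAddr = AudioObjectPropertyAddress(
            mSelector: kAudioProcessPropertyBundleID,
            mScope: kAudioObjectPropertyScopeGlobal,
            mElement: kAudioObjectPropertyElementMain)
        var bid: Unmanaged<CFString>?
        var bidSize = UInt32(MemoryLayout<Unmanaged<CFString>?>.size)
        guard AudioObjectGetPropertyData(obj, &bidAddr, 0, nil, &bidSize, &bid) == noErr,
              let bundleID = bid?.takeRetainedValue() else { return false }
        return (bundleID as String).hasPrefix("com.microsoft.teams")
    }
}
```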
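And a minimal Carbon registration sketch for the default ⌃⇧D binding, without Heard's multi-hotkey registry; the top-level closure is illustrative plumbing (C function pointers can't capture Swift state):

```swift
import Carbon.HIToolbox

/// Global action invoked by the Carbon callback; illustrative only.
var hotkeyAction: () -> Void = { print("hotkey pressed") }

func registerDictationHotkey() -> EventHotKeyRef? {
    // One app-wide handler for "hot key pressed" events.
    var eventType = EventTypeSpec(eventClass: OSType(kEventClassKeyboard),
                                  eventKind: UInt32(kEventHotKeyPressed))
    _ = InstallEventHandler(GetApplicationEventTarget(), { _, _, _ in
        hotkeyAction()
        return noErr
    }, 1, &eventType, nil, nil)

    // ⌃⇧D: kVK_ANSI_D with the Carbon control+shift modifier bits.
    var ref: EventHotKeyRef?
    let id = EventHotKeyID(signature: OSType(0x4845_4152), id: 1)   // 'HEAR', arbitrary
    _ = RegisterEventHotKey(UInt32(kVK_ANSI_D), UInt32(controlKey | shiftKey),
                            id, GetApplicationEventTarget(), 0, &ref)
    return ref
}
```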
All models are CoreML, loaded via FluidAudio into `~/Library/Application Support/FluidAudio/Models/`. They auto-download on first use or can be pre-downloaded from Settings → Models.
| Kind | Model | Purpose |
|---|---|---|
| `batchVad` | Silero VAD v6 | In-memory silence trimming during preprocessing |
| `batchParakeet` | Parakeet TDT V2 0.6B (default) or V3 | English speech-to-text for meetings and dictation. V2 is recommended for English; V3 adds a Cyrillic-script guard. Selectable in Settings → Models. |
| `diarization` | LS-EEND + WeSpeaker | Speaker segmentation + 256-d embedding extraction |
| `ctcVocabulary` | Parakeet CTC 110M | Optional CTC vocabulary boosting for custom terms |
Pipeline models stay unloaded during recording. The Models tab exposes pipeline keep-alive (how long meeting models stay resident after a job completes, 0 – 10 min) and a force-unload button. The Dictation tab has its own keep-alive slider for the dictation ASR.
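Keep-alive reduces to a cancellable delayed unload; a sketch with a hypothetical `unload` closure standing in for the real model teardown:

```swift
import Foundation

/// Schedules model unload `keepAlive` seconds after the last job; a new job
/// cancels the pending unload so back-to-back meetings reuse warm models.
final class ModelKeepAlive {
    private var pendingUnload: DispatchWorkItem?
    var keepAlive: TimeInterval = 120   // user-configurable 0–10 min

    func jobStarted() { pendingUnload?.cancel() }

    func jobFinished(unload: @escaping () -> Void) {
        pendingUnload?.cancel()
        let item = DispatchWorkItem(block: unload)
        pendingUnload = item
        DispatchQueue.main.asyncAfter(deadline: .now() + keepAlive, execute: item)
    }
}
```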
| Permission | Purpose | When required |
|---|---|---|
| Microphone | Record the local user | Always |
| Screen Recording | Surface the Teams window title for nicer filenames | Recommended |
| Accessibility | Read Teams window title & roster, inject dictation text | Required for dictation; recommended for roster-based auto-naming |
Only Microphone is strictly required; everything else degrades gracefully. Permission status is shown inside the General tab with deep-links to the right System Settings pane.
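Both the Accessibility status check and the deep-link use public APIs; a minimal sketch:

```swift
import AppKit
import ApplicationServices

/// True when the user has granted Accessibility to this (signed) binary.
func hasAccessibility() -> Bool { AXIsProcessTrusted() }

/// Deep-links straight to the Accessibility pane in System Settings.
func openAccessibilityPane() {
    let url = URL(string:
        "x-apple.systempreferences:com.apple.preference.security?Privacy_Accessibility")!
    NSWorkspace.shared.open(url)
}
```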
```
~/Library/Application Support/Heard/
├── pipeline_queue.json               # Persistent job queue
├── speakers.json                     # Speaker embeddings + metadata
└── recordings/                       # 48 kHz WAVs, auto-cleaned after 48 h

~/Library/Application Support/FluidAudio/Models/
├── silero-vad-coreml/
├── parakeet-tdt-0.6b-v2-coreml/      # or parakeet-tdt-0.6b-v3-coreml/ if V3 selected
├── speaker-diarization-coreml/
└── parakeet-tdt-0.6b-v2-ctc-110m-coreml/

~/Documents/Heard/                    # Default transcript output (configurable)
└── 260324_Sprint_Planning.md         # Or 2026-03-24_Sprint_Planning.md based on settings
```
Orphan WAVs from previous crashes are cleaned on launch, and any private aggregate devices left behind by a crashed recording (`com.execsumo.heard.tap.*`) are destroyed at the same time. Files referenced by an in-flight pipeline job are always preserved.
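A sketch of the aggregate-device sweep, assuming the UID-prefix match described above (the real cleanup lives in `AudioDeviceCleanup`):

```swift
import CoreAudio

/// Destroys any private aggregate devices a crashed recording left behind,
/// identified by their UID prefix. Error paths simply skip a device.
func destroyOrphanTapAggregates(prefix: String = "com.execsumo.heard.tap.") {
    var addr = AudioObjectPropertyAddress(
        mSelector: kAudioHardwarePropertyDevices,
        mScope: kAudioObjectPropertyScopeGlobal,
        mElement: kAudioObjectPropertyElementMain)
    var size: UInt32 = 0
    guard AudioObjectGetPropertyDataSize(AudioObjectID(kAudioObjectSystemObject),
                                         &addr, 0, nil, &size) == noErr else { return }
    var devices = [AudioDeviceID](repeating: 0,
                                  count: Int(size) / MemoryLayout<AudioDeviceID>.size)
    guard AudioObjectGetPropertyData(AudioObjectID(kAudioObjectSystemObject),
                                     &addr, 0, nil, &size, &devices) == noErr else { return }

    for device in devices {
        var uidAddr = AudioObjectPropertyAddress(
            mSelector: kAudioDevicePropertyDeviceUID,
            mScope: kAudioObjectPropertyScopeGlobal,
            mElement: kAudioObjectPropertyElementMain)
        var uid: Unmanaged<CFString>?
        var uidSize = UInt32(MemoryLayout<Unmanaged<CFString>?>.size)
        guard AudioObjectGetPropertyData(device, &uidAddr, 0, nil, &uidSize, &uid) == noErr,
              let cf = uid?.takeRetainedValue(), (cf as String).hasPrefix(prefix)
        else { continue }
        _ = AudioHardwareDestroyAggregateDevice(device)   // one HAL call per orphan
    }
}
```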
`swift run HeardTests` runs 114 tests covering `VadSegmentMap`, cosine distance, `SpeakerMatcher`, `SegmentMerger`, `AudioPreprocessor`, `TranscriptWriter` (incl. note interleaving and rename safety), `SpeakerStore`, `PipelineQueueStore` (incl. `MeetingNote` round-trip and legacy queue decode), and `RosterReader` via a lightweight in-house harness — no XCTest or Xcode required.
Manual smoke test: `./scripts/bundle.sh && open build/Heard.app`, enable Developer Mode in Settings → General, then use Simulate Meeting from the menu bar to exercise the full flow without a real Teams call.
Diagnostics: `swift scripts/diagnose.swift` prints what Heard sees from Teams processes and power assertions — useful for debugging detection on a specific machine.
- `spec.md` — Full product specification (source of truth)
- `handoff.md` — Current implementation status and known issues
- `ROADMAP.md` — Planned improvements and stretch ideas
- `CLAUDE.md` — Working rules for AI assistants