
Heard

A macOS menu bar app that automatically detects Microsoft Teams meetings, records dual-track audio (app + mic), and produces on-device transcripts with speaker diarization. Also includes a real-time dictation mode that types transcribed speech into any focused text field.

No cloud, no LLM, no external APIs — everything runs on-device on Apple Silicon.

Features

Meeting Transcription

  • Teams auto-detection — Polls IOPMCopyAssertionsByProcess() every 3 seconds for Teams power assertions (PreventUserIdleDisplaySleep / NoDisplaySleepAssertion / "Call in progress"). Requires 2 consecutive hits before triggering; applies a 5-second cooldown on end. Recognizes Teams by bundle ID (com.microsoft.teams / com.microsoft.teams2) with localized-name fallbacks for Teams classic, New Teams, and "work or school" variants. (The polling is sketched after this list.)
  • Meeting title extraction — Reads the Teams window title via the Accessibility API and strips the | Microsoft Teams suffix for the transcript filename.
  • Dual-track recording — The mic track is captured via AVAudioEngine; the app track is captured via CATapDescription + a private aggregate device + a raw AUHAL output unit. The tap collects all Teams-related CoreAudio process object IDs (main + renderer children) so audio from Electron/Chromium sub-processes isn't lost. Both tracks land on disk as 48 kHz WAVs and are never mixed during recording.
  • Recording reliability — The app-audio tap self-tests after 2 seconds. If it is silent, Heard rebuilds the tap/aggregate/IOProc chain once with a fresh helper-process scan; persistent silence marks the job as mic-only and the menu bar shows "Recording (mic only)". Turning watching off during an active meeting synchronously finalizes that recording and preserves the processing state.
  • Roster scraping — Reads Teams participant names via Accessibility APIs every 15 s while a meeting is active. Three strategies: known roster-panel identifiers → AXList/AXTable containers → window-title parsing. Filters out UI control strings (mute, raise hand, etc.).
  • On-device pipeline — Sequential stages per job: preprocessing (48 kHz → 16 kHz mono via AVAudioConverter, in-memory Silero VAD trimming + VadSegmentMap) → Parakeet TDT V2/V3 transcription (per-track, parallel 4-chunk processing, with CTC vocabulary boosting) → Inverse Text Normalization (punctuation formatting) → mic-bleed deduplication → LS-EEND + WeSpeaker diarization → speaker assignment and Markdown output.
  • Speaker identification — Cosine-distance matching (threshold 0.40, confidence margin 0.10) against a persistent speaker database, with embedding diversity management and auto-update on confident matches. Uses UUID-based placeholders (e.g. Speaker_1AB23C) to guarantee universally unique speaker profiles. (The threshold-and-margin matching is sketched after this list.)
  • Cumulative transcription stats — Every transcribed segment increments the matched speaker's total transcribed hours and word count, both visible in the Settings tab.
  • Roster-aware auto-naming — When one speaker is unmatched and exactly one roster name is unclaimed, the roster name is assigned automatically without prompting.
  • Speaker naming window — When unmatched speakers remain, a dedicated "Name Speakers" window opens with a playable audio clip (~10 s of each speaker's clearest speech, extracted from the 48 kHz recording), a text field pre-populated with any roster suggestion, and a 120 s auto-dismiss countdown.
  • Custom vocabulary boosting — CTC-based keyword boosting via Parakeet CTC 110M applied to both meeting transcription and dictation. Terms are added from Settings → General (min 3 chars, max 50 terms). Falls back gracefully if the CTC model can't be loaded.
  • Persistent job queue — pipeline_queue.json survives app restarts; failed jobs are re-queued on relaunch up to a lifetime cap of 6 retries. Non-retryable errors (no audio, too short) fail immediately; transient errors retry 3× per session with exponential backoff (5 s, 30 s, 5 min; sketched after this list). User-initiated retry resets the count.
  • Long-meeting handling — 4 h hard cap; on hit, the current recording is finalized and a fresh one starts if the meeting is still active.
  • Markdown output — Timestamped, speaker-labeled transcripts written to a configurable output folder. Filename dates are configurable (YYMMDD or YYYY-MM-DD). Consecutive segments from the same speaker are merged into continuous blocks. When laptop mic + speakers cause remote audio to bleed into the mic track, overlapping mic segments are dropped when the app track already contains the same words.
  • In-meeting notes — Press a global hotkey during an active recording (default ⌃⇧N, configurable in Settings → General) to open a floating composer panel. Type a note, ⌘↩ to save, Esc to cancel. The note's timestamp is captured at panel-open time, so a slow typer's note still anchors to the moment they reacted. Notes are interleaved chronologically with spoken segments and rendered as [mm:ss] _**Note from <Your Name>:** …_ (italicized, distinct from speaker blocks) so a reader can tell what was said versus what was added as supplemental context. Survives crashes via pipeline_queue.json; if the meeting ends while the composer is still open, the note is attached to the just-finished job.
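
A minimal sketch of the assertion polling behind Teams auto-detection, assuming the raw "AssertType"/"AssertName" dictionary keys returned by IOPMCopyAssertionsByProcess; the actual MeetingDetector adds the hit/cooldown debouncing described above:

```swift
import Foundation
import IOKit.pwr_mgt

/// Sketch: does any known Teams process hold a call-style power assertion?
func teamsHoldsCallAssertion(teamsPIDs: Set<pid_t>) -> Bool {
    var assertionsByPID: Unmanaged<CFDictionary>?
    guard IOPMCopyAssertionsByProcess(&assertionsByPID) == kIOReturnSuccess,
          let byPID = assertionsByPID?.takeRetainedValue() as? [Int: [[String: Any]]]
    else { return false }

    for (pid, assertions) in byPID where teamsPIDs.contains(pid_t(pid)) {
        for assertion in assertions {
            let type = assertion["AssertType"] as? String ?? ""
            let name = assertion["AssertName"] as? String ?? ""
            if type == "PreventUserIdleDisplaySleep"
                || type == "NoDisplaySleepAssertion"
                || name.contains("Call in progress") {
                return true
            }
        }
    }
    return false
}
```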
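
Next, the threshold-plus-margin matching used for speaker identification; the 0.40/0.10 constants come from the list above, while the profile tuple type is illustrative:

```swift
/// Sketch: cosine distance between two embeddings (lower = more similar).
func cosineDistance(_ a: [Float], _ b: [Float]) -> Float {
    var dot: Float = 0, normA: Float = 0, normB: Float = 0
    for i in 0..<min(a.count, b.count) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return 1 - dot / (normA.squareRoot() * normB.squareRoot())
}

/// Sketch: accept the closest profile only if it clears the distance
/// threshold AND beats the runner-up by the confidence margin.
func bestMatch(_ embedding: [Float],
               against profiles: [(name: String, embedding: [Float])]) -> String? {
    let ranked = profiles
        .map { (name: $0.name, distance: cosineDistance(embedding, $0.embedding)) }
        .sorted { $0.distance < $1.distance }
    guard let best = ranked.first, best.distance <= 0.40 else { return nil }
    if ranked.count > 1, ranked[1].distance - best.distance < 0.10 { return nil }
    return best.name
}
```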
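
Finally, the retry schedule from the persistent-queue item, in sketch form (function names are illustrative):

```swift
import Foundation

/// Sketch: per-session backoff (5 s, 30 s, 5 min) under a lifetime cap of 6.
let backoffSchedule: [TimeInterval] = [5, 30, 300]

func nextRetryDelay(sessionAttempt: Int, lifetimeRetries: Int) -> TimeInterval? {
    guard lifetimeRetries < 6, sessionAttempt < backoffSchedule.count else {
        return nil  // give up; a user-initiated retry resets the count
    }
    return backoffSchedule[sessionAttempt]
}
```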

Dictation

  • Real-time speech-to-text — Uses FluidAudio's SlidingWindowAsrManager with overlapping windows and an internal stable/volatile text split. Audio is fed as AVAudioPCMBuffer straight from the mic tap; the manager handles resampling, chunking, and context accumulation internally.
  • Incremental injection — Stable text is injected word by word in real time as the sliding window confirms it; remaining volatile text is flushed on stop. No batching — text appears as you speak.
  • Punctuation normalization — FluidAudio's Inverse Text Normalization (ITN) converts spoken forms like "comma", "period", and "new line" to their written equivalents before injection.
  • Filler word stripping — Standalone instances of "uh", "um", "er", "ah", "hmm", "hm", "uhh", "umm", "mhm" are removed before injection; matching is case-insensitive and respects word boundaries. Keeps transcripts clean without user intervention. (A regex sketch follows this list.)
  • Vocabulary boosting — When CTC models are downloaded, configureVocabularyBoosting() is applied so custom terms take effect in real time.
  • Text injection — CGEvent.keyboardSetUnicodeString + postToPid targets the frontmost app; falls back to HID-level events, then clipboard paste. All paths require Accessibility permission. (The primary path is sketched after this list.)
  • Global hotkey — Default ⌃⇧D, registered via Carbon RegisterEventHotKey. No Accessibility permission needed for the hotkey itself, so it keeps working across ad-hoc rebuilds. Rebindable via a Record sheet in the Dictation tab. Hotkey input is validated to block forbidden system shortcuts and warn on weak combos.
  • Hotkey reuse fix — SlidingWindowAsrManager.cleanup() closes the internal AsyncStream before a new session starts, preventing state corruption from rapid presses; state validation throws DictationError.notIdle to surface conflicts.
  • Accessibility revocation detection — If the user revokes Accessibility permission mid-dictation, AppModel.startAXPolling() detects it every 2 seconds and gracefully stops, showing an orange banner with a "Re-grant Access…" button.
  • Floating dictation indicator (HUD) — An optional on-screen pill at the bottom-center shows "Dictating" with a waveform icon while active (opt-in via Settings → Dictation). Appears at full opacity, dims to 35% after 2.5 seconds, and fades out on stop.
  • Toggle and push-to-talk modes — Tap to toggle, or hold the hotkey to dictate and release to stop.
  • Model keep-alive — ASR models stay resident for a configurable period (0 – 10 min, default 120 s) after dictation stops to avoid reload latency between utterances.
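
A minimal sketch of the filler-word stripping, assuming a plain word-boundary regex (the real implementation may differ):

```swift
import Foundation

let fillerWords = ["uh", "um", "er", "ah", "hmm", "hm", "uhh", "umm", "mhm"]

/// Sketch: remove standalone fillers (and a trailing comma, if any) before injection.
func stripFillers(_ text: String) -> String {
    let pattern = "\\b(?:" + fillerWords.joined(separator: "|") + ")\\b,?\\s*"
    return text
        .replacingOccurrences(of: pattern, with: "",
                              options: [.regularExpression, .caseInsensitive])
        .trimmingCharacters(in: .whitespaces)
}

// stripFillers("Um, I think, uh, we should ship it") == "I think, we should ship it"
```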
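
And a sketch of the primary injection path; production injectors chunk long strings into small events, which is omitted here for brevity:

```swift
import CoreGraphics

/// Sketch: type `text` into the app owning `pid` via a synthetic key event
/// carrying a unicode payload. Requires Accessibility permission.
func injectText(_ text: String, into pid: pid_t) {
    let utf16 = Array(text.utf16)
    guard let keyDown = CGEvent(keyboardEventSource: nil, virtualKey: 0, keyDown: true),
          let keyUp = CGEvent(keyboardEventSource: nil, virtualKey: 0, keyDown: false)
    else { return }
    keyDown.keyboardSetUnicodeString(stringLength: utf16.count, unicodeString: utf16)
    keyUp.keyboardSetUnicodeString(stringLength: utf16.count, unicodeString: utf16)
    keyDown.postToPid(pid)
    keyUp.postToPid(pid)
}
```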

UI

  • Visual design — "Paper" warm palette throughout: off-white background (#F5EFE4), card surfaces (#FBF7EF), warm sidebar (#EBE2CE), accent blue (#3F5C8C). Forced light-only appearance. Custom 188 px sidebar with a HeardMark squircle glyph header; settings panels use SettingsCard/CardRow/ToggleRow card primitives rather than system Form groups. (The palette is collected as SwiftUI colors after this list.)
  • App icon — Ships a full 16-512 px AppIcon asset set. bundle.sh compiles it into AppIcon.icns, Info.plist references it via CFBundleIconFile, and the About tab renders the real bundle icon.
  • Menu bar icon — SF Symbols with symbol effects: static recordingtape when idle, pulsing/breathing record.circle while recording or dictating, iterative waveform while processing, exclamationmark.circle.fill on error, person.crop.circle.badge.exclamationmark when waiting for speaker naming. Icon uses accent tint (blue) when actively capturing audio (recording or dictating) and primary tint (white/dark gray) otherwise. Opacity is 100% when the app is watching or recording, and 50% when paused (phase is dormant, not dictating, and meeting detection is off).
  • Menu bar dropdown — 268 px wide Paper-palette panel. Current status card (dark #2E3338 background when recording/dictating, Paper background otherwise), Recent Meetings list (up to 3 completed/failed jobs with click-to-open and right-click to reveal/retry/dismiss), quick actions (start dictation, open transcripts, name speakers, settings, quit), and developer-mode simulate buttons.
  • Settings window — 880×600, five tabs: General (launch at login, auto-watch, developer mode, custom vocabulary, output folder, permissions), Dictation (enable, push-to-talk, hotkey recorder with validation feedback, model keep-alive slider, live status, show dictation HUD checkbox), Models (download cards with progress, pipeline keep-alive, "Unload All Models"), Speakers (your name, inline rename, merge, delete, search, sort, with prompts to retroactively update all past transcripts), About.
  • Name Speakers window — 560×520 standalone scene with audio-clip playback per candidate, roster-suggestion hints, save/skip, and a 120 s countdown.
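
For reference, a sketch collecting the Paper palette above as SwiftUI colors (the extension and names are illustrative, not the app's API):

```swift
import SwiftUI

extension Color {
    static let paperBackground = Color(red: 0xF5 / 255.0, green: 0xEF / 255.0, blue: 0xE4 / 255.0)
    static let paperCard       = Color(red: 0xFB / 255.0, green: 0xF7 / 255.0, blue: 0xEF / 255.0)
    static let paperSidebar    = Color(red: 0xEB / 255.0, green: 0xE2 / 255.0, blue: 0xCE / 255.0)
    static let paperAccent     = Color(red: 0x3F / 255.0, green: 0x5C / 255.0, blue: 0x8C / 255.0)
}
```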

Requirements

  • macOS 15.0+
  • Apple Silicon (FluidAudio CoreML/ANE models are ARM-only)
  • FluidAudio 0.14.3+ (resolved automatically via SPM)

Build & Run

# Compile only
swift build

# Run from the terminal (⚠ mic permission is attributed to the terminal, not Heard)
swift run Heard

# Build a .app bundle and launch it (recommended for day-to-day use)
./scripts/bundle.sh
open build/Heard.app

# Stable-signed build — required for dictation so the Accessibility grant
# persists across rebuilds. Replace "Heard Dev" with any local identity.
./scripts/bundle.sh --sign "Heard Dev"
open build/Heard.app

# Release build
./scripts/bundle.sh --release

Tip: If dictation's Accessibility grant gets stuck after rebuilds, reset it with tccutil reset Accessibility com.execsumo.heard and re-grant.

Architecture

Single-process SwiftUI menu bar app. Three scenes: a MenuBarExtra(.window) dropdown, a Settings Window, and a Speaker Naming Window. WindowActivationCoordinator reference-counts .regular activation policy across the two windows so keyboard focus survives closing one while the other stays open. All persistence is JSON files under ~/Library/Application Support/Heard/.
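
A minimal sketch of the reference-counting idea behind WindowActivationCoordinator (method names are illustrative):

```swift
import AppKit

final class ActivationCoordinator {
    private var openWindows = 0

    /// Call when a regular window appears: the first one promotes the app
    /// from menu-bar-only (.accessory) to .regular so it can take key focus.
    func acquire() {
        openWindows += 1
        if openWindows == 1 { NSApp.setActivationPolicy(.regular) }
    }

    /// Call when a window closes: only the last release demotes the app,
    /// so closing one window doesn't steal focus from the other.
    func release() {
        openWindows = max(0, openWindows - 1)
        if openWindows == 0 { NSApp.setActivationPolicy(.accessory) }
    }
}
```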

Heard/
├── Sources/
│   ├── Heard/
│   │   └── MTApp.swift               # @main, MenuBarExtra + two Window scenes
│   └── HeardCore/
│       ├── AppModel.swift            # Central state, action handlers, dictation wiring
│       ├── CoreModels.swift          # AppPhase, PipelineJob, SpeakerProfile,
│       │                             # AppSettings, HotkeyCombo, ModelKind, …
│       ├── Services.swift            # MeetingDetector, RecordingManager (AUHAL + tap),
│       │                             # PipelineProcessor, PermissionCenter,
│       │                             # TranscriptWriter, TempFileCleanup,
│       │                             # AudioDeviceCleanup, LaunchAtLogin,
│       │                             # WindowActivationCoordinator
│       ├── Stores.swift              # SettingsStore, SpeakerStore, PipelineQueueStore
│       ├── Views.swift               # MenuBarView, SettingsView, SpeakerNamingView
│       ├── AudioProcessing.swift     # AudioPreprocessor, VadSegmentMap
│       ├── AudioClipExtractor.swift  # Extract speaker audio clips for the naming prompt
│       ├── SpeakerAssignment.swift   # SpeakerMatcher, SegmentMerger, cosine distance
│       ├── ModelDownloadManager.swift # Pre-download & status for FluidAudio models
│       ├── DictationManager.swift    # SlidingWindowAsrManager wrapper + incremental injection
│       ├── TextInjector.swift        # CGEvent unicode / HID / clipboard paths
│       ├── HotkeyManager.swift       # Carbon RegisterEventHotKey wrapper (multi-hotkey registry)
│       ├── MeetingNoteComposer.swift # Floating panel for in-meeting notes
│       └── RosterReader.swift        # Teams roster via AXUIElement
├── Tests/HeardTests/
│   └── TestRunner.swift              # Lightweight harness — no XCTest / no Xcode
├── scripts/
│   ├── bundle.sh                     # Build + bundle + (optional) sign
│   └── diagnose.swift                # Print what Heard sees from Teams & power assertions
├── Info.plist                        # LSUIElement, NSMicrophoneUsageDescription
├── Heard.entitlements                # Audio input only (no sandbox)
├── Package.swift                     # SPM config, FluidAudio 0.14.3+
├── spec.md                           # Product source of truth
├── handoff.md                        # Current implementation status
├── ROADMAP.md                        # Planned improvements (this doc)
└── CLAUDE.md                         # Working rules for AI assistants

Targets

| Target | Kind | Purpose |
| --- | --- | --- |
| HeardCore | library | All models, services, views, stores |
| Heard | executable | @main app entry, depends on HeardCore |
| HeardTests | executable | Test runner with a lightweight in-house harness |

Key Design Decisions

  • No sandbox. Required for CATapDescription app-audio tapping and CGEvent text injection. The entitlements file grants audio-input and nothing else.
  • AUHAL + private aggregate device for the tap. AVAudioEngine.inputNode silently re-binds to the system default input when prepare() runs after a kAudioOutputUnitProperty_CurrentDevice change, so a standalone kAudioUnitSubType_HALOutput configured before AudioUnitInitialize is the only reliable path to capture from the tap's aggregate device.
  • Multi-process Teams tap. New Teams renders audio in renderer/GPU child processes, not in the process holding the power assertion. The recorder enumerates every Teams-related CoreAudio process object and taps them all.
  • SlidingWindowAsrManager for dictation. Dictation uses FluidAudio's SlidingWindowAsrManager with the .streaming preset (11 s chunks, 2 s left/right context). The manager handles overlapping windows, resampling, and stable/volatile text promotion internally. Confirmed text is injected incrementally; remaining volatile text is flushed on stop. This also restores vocabulary boosting, which was removed from the batch AsrManager API in FluidAudio 0.13.6.
  • Carbon for global hotkeys. RegisterEventHotKey doesn't need Accessibility permission and survives ad-hoc rebuilds, unlike CGEvent.tapCreate or NSEvent monitors (which can't suppress events). HotkeyManager keeps a [UInt32: HotkeyManager] registry sharing one Carbon event handler so multiple shortcuts (dictation, meeting notes) coexist with no per-instance handler churn. (The registration path is sketched after this list.)
  • Per-track transcription. The app and mic tracks are transcribed separately, then merged by timestamp, avoiding crosstalk artifacts from a pre-mixed source. (See the merge sketch after this list.)
  • Local user is the mic track. No diarization needed to identify the local user — the mic track is always "Me" (or the name set in Settings → Speakers). The mic embedding is stored and updated silently each meeting.
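
A minimal sketch of the Carbon registration path described above; the signature, ID, and dispatch are illustrative:

```swift
import Carbon.HIToolbox

/// Sketch: one shared app-wide handler dispatches every registered hotkey.
func installSharedHotkeyHandler() {
    var eventType = EventTypeSpec(eventClass: OSType(kEventClassKeyboard),
                                  eventKind: UInt32(kEventHotKeyPressed))
    InstallEventHandler(GetApplicationEventTarget(), { _, event, _ in
        var hotKeyID = EventHotKeyID()
        GetEventParameter(event, EventParamName(kEventParamDirectObject),
                          EventParamType(typeEventHotKeyID), nil,
                          MemoryLayout<EventHotKeyID>.size, nil, &hotKeyID)
        // Look up hotKeyID.id in a registry and fire the matching action.
        return noErr
    }, 1, &eventType, nil, nil)
}

/// Sketch: register one hotkey; no Accessibility permission required.
func registerHotkey(keyCode: UInt32, modifiers: UInt32, id: UInt32) -> EventHotKeyRef? {
    var hotKeyRef: EventHotKeyRef?
    let hotKeyID = EventHotKeyID(signature: OSType(0x4852_4431), id: id)  // "HRD1"
    RegisterEventHotKey(keyCode, modifiers, hotKeyID,
                        GetApplicationEventTarget(), 0, &hotKeyRef)
    return hotKeyRef
}
```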
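
And the per-track merge in sketch form; the Segment type is illustrative:

```swift
import Foundation

struct Segment {
    let start: TimeInterval   // offset from recording start
    let speaker: String
    let text: String
}

/// Sketch: each track is transcribed on its own, then interleaved by time.
func mergeTracks(app: [Segment], mic: [Segment]) -> [Segment] {
    (app + mic).sorted { $0.start < $1.start }
}
```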

Models

All models are CoreML, loaded via FluidAudio into ~/Library/Application Support/FluidAudio/Models/. They auto-download on first use or can be pre-downloaded from Settings → Models.

| Kind | Model | Purpose |
| --- | --- | --- |
| batchVad | Silero VAD v6 | In-memory silence trimming during preprocessing |
| batchParakeet | Parakeet TDT V2 0.6B (default) or V3 | English speech-to-text for meetings and dictation; V2 is recommended for English, V3 adds a Cyrillic-script guard. Selectable in Settings → Models. |
| diarization | LS-EEND + WeSpeaker | Speaker segmentation + 256-d embedding extraction |
| ctcVocabulary | Parakeet CTC 110M | Optional CTC vocabulary boosting for custom terms |

Pipeline models stay unloaded during recording. The Models tab exposes pipeline keep-alive (how long meeting models stay resident after a job completes, 0 – 10 min) and a force-unload button. The Dictation tab has its own keep-alive slider for the dictation ASR.
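
A minimal sketch of the keep-alive behavior, assuming an illustrative unloadModels() helper; each new job cancels the pending unload:

```swift
import Foundation

/// Illustrative stand-in for releasing FluidAudio model references.
func unloadModels() async { /* drop strong references to loaded models */ }

var pendingUnload: Task<Void, Never>?

func scheduleUnload(after seconds: TimeInterval) {
    pendingUnload?.cancel()  // a new job within the window keeps models warm
    pendingUnload = Task {
        try? await Task.sleep(nanoseconds: UInt64(seconds * 1_000_000_000))
        guard !Task.isCancelled else { return }
        await unloadModels()
    }
}
```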

Permissions

| Permission | Purpose | When required |
| --- | --- | --- |
| Microphone | Record the local user | Always |
| Screen Recording | Surface the Teams window title for nicer filenames | Recommended |
| Accessibility | Read the Teams window title & roster, inject dictation text | Required for dictation; recommended for roster-based auto-naming |

Only Microphone is strictly required; everything else degrades gracefully. Permission status is shown inside the General tab with deep-links to the right System Settings pane.
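
The two checks that gate optional features, sketched with the standard APIs:

```swift
import AVFoundation
import ApplicationServices

// Microphone: the only hard requirement.
let micGranted = AVCaptureDevice.authorizationStatus(for: .audio) == .authorized

// Accessibility: gates dictation injection and roster reading.
let axGranted = AXIsProcessTrusted()
```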

Data

~/Library/Application Support/Heard/
├── pipeline_queue.json   # Persistent job queue
├── speakers.json         # Speaker embeddings + metadata
└── recordings/           # 48 kHz WAVs, auto-cleaned after 48 h

~/Library/Application Support/FluidAudio/Models/
├── silero-vad-coreml/
├── parakeet-tdt-0.6b-v2-coreml/      # or parakeet-tdt-0.6b-v3-coreml/ if V3 selected
├── speaker-diarization-coreml/
└── parakeet-tdt-0.6b-v2-ctc-110m-coreml/

~/Documents/Heard/        # Default transcript output (configurable)
└── 260324_Sprint_Planning.md # Or 2026-03-24_Sprint_Planning.md based on settings

Orphan WAVs from previous crashes are cleaned on launch, and any private aggregate devices left behind by a crashed recording (com.execsumo.heard.tap.*) are destroyed at the same time. Files referenced by an in-flight pipeline job are always preserved.
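
A minimal sketch of the launch-time sweep, assuming `protected` holds the paths referenced by in-flight pipeline jobs:

```swift
import Foundation

/// Sketch: delete WAVs older than 48 h unless a pipeline job still needs them.
func cleanOrphanRecordings(in directory: URL, protected: Set<String>) throws {
    let cutoff = Date().addingTimeInterval(-48 * 3600)
    let files = try FileManager.default.contentsOfDirectory(
        at: directory, includingPropertiesForKeys: [.contentModificationDateKey])
    for file in files where file.pathExtension == "wav" {
        guard !protected.contains(file.path) else { continue }
        let modified = try file.resourceValues(forKeys: [.contentModificationDateKey])
            .contentModificationDate ?? .distantPast
        if modified < cutoff {
            try FileManager.default.removeItem(at: file)
        }
    }
}
```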

Testing

swift run HeardTests

114 tests covering VadSegmentMap, cosine distance, SpeakerMatcher, SegmentMerger, AudioPreprocessor, TranscriptWriter (incl. note interleaving and rename safety), SpeakerStore, PipelineQueueStore (incl. MeetingNote round-trip and legacy queue decode), and RosterReader via a lightweight in-house harness — no XCTest or Xcode required.

Manual smoke test: ./scripts/bundle.sh && open build/Heard.app, enable Developer Mode in Settings → General, then use Simulate Meeting from the menu bar to exercise the full flow without a real Teams call.

Diagnostics: swift scripts/diagnose.swift prints what Heard sees from Teams processes and power assertions — useful for debugging detection on a specific machine.

References

  • spec.md — Full product specification (source of truth)
  • handoff.md — Current implementation status and known issues
  • ROADMAP.md — Planned improvements and stretch ideas
  • CLAUDE.md — Working rules for AI assistants
