events.jsonl mutex timeout and stale inuse.*.lock files after long-lived session grows large #2609

@d0d1

Summary

On Copilot CLI 1.0.21 (Linux), a long-lived or frequently resumed session can accumulate a very large ~/.copilot/session-state/[session-id]/events.jsonl file. Once the file is large enough (about 123 MiB and roughly 37k lines in my case), subsequent event writes begin failing permanently with:

Failed to flush events for [session-id]:
  Error: Failed to append to JSONL file
    ~/.copilot/session-state/[session-id]/events.jsonl:
  Error: timeout while waiting for mutex to become available

Multiple stale inuse.[pid].lock files also remain in the session directory after the owning processes have exited, suggesting the cleanup path does not run on all exit scenarios.

Environment

| Detail | Value |
| --- | --- |
| Copilot CLI version | 1.0.21 |
| OS | Linux (Ubuntu) |
| Shell | Bash inside tmux |

Steps to Reproduce

  1. Start a Copilot CLI session and keep resuming / reusing the same session over an extended period (days to weeks).
  2. Allow ~/.copilot/session-state/[session-id]/events.jsonl to grow large (observed at ~123 MiB).
  3. Open or resume the same session from additional Copilot CLI processes.
  4. Eventually, event flush calls begin failing with the mutex-timeout error above.

Note: I have not isolated a minimal deterministic repro yet; the issue emerged organically during normal extended use.
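
To check whether a session has reached the size range described above, a small script along these lines may help. It is only a sketch: the helper names (`events_file_stats`, `report`) are mine, and the only layout it assumes is the `~/.copilot/session-state/[session-id]/events.jsonl` path already mentioned in this report.

```python
from pathlib import Path
from typing import Optional


def events_file_stats(session_dir: Path) -> dict:
    """Return size and record count for one session's events.jsonl (hypothetical helper)."""
    events = session_dir / "events.jsonl"
    size_bytes = events.stat().st_size
    lines = 0
    # Read in 1 MiB chunks so a file of 100+ MiB is never loaded into memory at once.
    with events.open("rb") as fh:
        while chunk := fh.read(1024 * 1024):
            lines += chunk.count(b"\n")  # counts newline-terminated records only
    return {
        "session": session_dir.name,
        "bytes": size_bytes,
        "mib": size_bytes / (1024 * 1024),
        "lines": lines,
    }


def report(root: Optional[Path] = None) -> list:
    """List stats for every session directory under root, largest file first."""
    if root is None:
        # Assumed default location, taken from the paths quoted in this issue.
        root = Path.home() / ".copilot" / "session-state"
    if not root.is_dir():
        return []
    stats = [
        events_file_stats(d)
        for d in sorted(root.iterdir())
        if (d / "events.jsonl").is_file()
    ]
    return sorted(stats, key=lambda s: s["bytes"], reverse=True)


if __name__ == "__main__":
    for s in report():
        print(f'{s["session"]}: {s["mib"]:.1f} MiB, {s["lines"]} lines')
```

Running it before and after a resume shows how quickly a given session is growing.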

Observed Behavior

  • Repeated mutex-timeout errors in multiple ~/.copilot/logs/process-*.log files, from at least two different PIDs, so this is not a single transient failure.
  • Stale lock files remain after the processes exit:
    ~/.copilot/session-state/[session-id]/inuse.[pid-a].lock
    ~/.copilot/session-state/[session-id]/inuse.[pid-b].lock
    ~/.copilot/session-state/[session-id]/inuse.[pid-c].lock
    
    Each file contains only its PID. ps -p [pid] confirms none of these processes are still running.
  • The session's first events.jsonl record shows "alreadyInUse": false; a later session.resume record shows "alreadyInUse": true, confirming overlapping access.
  • No matching errors appeared in ~/.copilot/logs/copilot.log; they show up only in the per-process logs.
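
The manual `ps -p [pid]` check above can be automated. The following is a minimal sketch, assuming only the `inuse.[pid].lock` naming observed in the session directory; `pid_is_alive` and `find_stale_locks` are hypothetical helper names, and the signal-0 probe is POSIX-only. Because PIDs can be reused, a stale lock may occasionally look live, so treat the result as a hint rather than proof.

```python
import os
import re
from pathlib import Path

# File-name pattern as observed in the session directory: inuse.[pid].lock
LOCK_NAME = re.compile(r"^inuse\.(\d+)\.lock$")


def pid_is_alive(pid: int) -> bool:
    """Probe a PID with signal 0 (POSIX): existence is checked, no signal is delivered."""
    if pid <= 0:
        return False  # never probe 0 or negatives; those address process groups
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def find_stale_locks(session_dir: Path) -> list:
    """Return lock files whose embedded PID no longer maps to a running process."""
    stale = []
    for entry in sorted(session_dir.iterdir()):
        match = LOCK_NAME.match(entry.name)
        if match and not pid_is_alive(int(match.group(1))):
            stale.append(entry)
    return stale
```

Calling `find_stale_locks(Path("~/.copilot/session-state/[session-id]").expanduser())` with a real session ID substituted lists the candidates without deleting anything.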

Expected Behavior

  • Event appends should not permanently fail due to mutex contention on a large file.
  • Stale inuse.*.lock files should be cleaned up when their owning process exits (gracefully or via signal).
  • Ideally, events.jsonl should be rotated, truncated, or otherwise bounded to prevent unbounded growth.
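
As one illustration of the bounding idea in the last bullet, here is a sketch of size-based rotation. It is not how Copilot CLI works today as far as I know; the function name `rotate_if_large`, the 32 MiB threshold, and the archive naming are all assumptions for the example. A real implementation would have to hold the write lock while rotating and teach the resume path to read archived segments.

```python
import os
import time
from pathlib import Path
from typing import Optional


def rotate_if_large(events: Path, max_bytes: int = 32 * 1024 * 1024) -> Optional[Path]:
    """Move events.jsonl aside once it exceeds max_bytes and start a fresh file.

    Returns the archive path, or None when no rotation was needed.
    The threshold and the archive name are illustrative assumptions.
    """
    try:
        if events.stat().st_size <= max_bytes:
            return None
    except FileNotFoundError:
        return None
    # A nanosecond timestamp keeps archive names unique and sortable.
    archive = events.with_name(f"{events.stem}.{time.time_ns()}.jsonl")
    os.replace(events, archive)  # atomic rename when both paths share a filesystem
    events.touch()  # recreate an empty file for subsequent appends
    return archive
```

Renaming rather than truncating keeps the full history on disk, so nothing is lost if the resume path still needs older events.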

Diagnostic Clues for Maintainers

  • The failure is on the write path (appending events), not the read/resume path.
  • The mutex implementation appears to use a file-based lock with a fixed timeout; on a large file the append may exceed the timeout window, or a previously-crashed process's lock may never be released.
  • Checking whether the lock acquisition timeout scales with file size, and whether stale-lock detection runs before acquisition, would likely pinpoint the root cause.
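
To make the last point concrete, this is roughly what "stale-lock detection before acquisition" could look like for a PID-stamped lock file. It is a sketch under my own assumptions, not the CLI's actual mutex: `acquire`, `release`, the timeout values, and the lock-file format (a file containing only the holder's PID) are hypothetical. It also has known races: a waiter can see the file before the holder has written its PID, and two waiters can reclaim at the same moment, so a production version would need an atomic reclaim step.

```python
import os
import time
from pathlib import Path


def _holder_alive(lock: Path) -> bool:
    """True if the PID recorded in the lock file maps to a running process (POSIX)."""
    try:
        pid = int(lock.read_text().strip())
    except (FileNotFoundError, ValueError):
        return False  # missing or unparseable lock: treat as abandoned
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True
    return True


def acquire(lock: Path, timeout: float = 5.0, poll: float = 0.05) -> bool:
    """Try to take the lock; reclaim it first if the recorded holder is dead."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation fail if the lock file already exists.
            fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
        except FileExistsError:
            if not _holder_alive(lock):
                lock.unlink(missing_ok=True)  # reclaim a crashed holder's lock
                continue
            if time.monotonic() >= deadline:
                return False  # a live holder kept the lock past the timeout
            time.sleep(poll)
            continue
        with os.fdopen(fd, "w") as fh:
            fh.write(str(os.getpid()))
        return True


def release(lock: Path) -> None:
    lock.unlink(missing_ok=True)
```

With a check like this, a lock left behind by a crashed process delays the next writer by one loop iteration instead of forcing every later append to time out.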

Related Issues

| Issue | Relevance |
| --- | --- |
| #2209 | Large events.jsonl / long-lived session corruption (read-path focus) |
| #2490 | Session corruption with large event files (read-path focus) |
| #1790 | Feature request: clean up stale inuse.*.lock files |
| #2323 | Long-lived / sub-agent session-state corruption |
| #2543 | Session-state corruption in sub-agent scenarios |
| #2217 | Crash resilience / events corruption |

This issue differs from all of the above because the primary symptom is write-path mutex contention (not read-path corruption or resume failures), combined with stale lock files that are never reclaimed.

Suggested Labels

bug, session-state, events


Filed from sanitized local evidence. No private repository names, absolute home paths beyond ~/.copilot/..., or credentials are included.
