Disconnect during job start causes the agent to hang #1089

@Fox32

Describe the bug

If I disconnect after the job request has been received but before the session is started, the job hangs and is eventually killed with a "job is unresponsive" warning.

Relevant log output

In the end, the job is hard-killed after a "job is unresponsive" log entry.

[15:56:05.303] DEBUG (88816): job exiting
[15:57:04.963] WARN (88816): job is unresponsive
[15:57:04.963] DEBUG (89120): SIGTERM received in job proc
    jobID: "AJ_67dQtGe5Ck2E"

Describe your environment

1.0.48

Minimal reproducible example

I have a very simple agent with a greeting inside onEnter:

class CustomAgent extends voice.Agent {
    constructor() {
        super({
            instructions:
                "You are a helpful assistant, you can hear the user's message and respond to it.",
        })
    }

    async onEnter(): Promise<void> {
        // Note that I have a greeting on enter
        this.session.say('Hello, how can I help you today?', { allowInterruptions: false })
    }
}

// …

const agent = new CustomAgent()

const session = new voice.AgentSession({
    stt: 'assemblyai/universal-streaming:en',
    llm: 'openai/gpt-4.1-mini',
    tts: 'cartesia/sonic-2:9626c31c-bec5-4cca-baa8-f8ba9e84c8bc',
    vad: jobContext.proc.userData.vad! as silero.VAD,
    turnDetection: new turnDetector.MultilingualModel(),
    voiceOptions: {
        preemptiveGeneration: true,
    },
    connOptions: {
        llmConnOptions: {
            maxRetry: 1,
            retryIntervalMs: 2000,
            timeoutMs: 60000,
        },
    },
})

// Inside the entrypoint I first sleep to make it easier to get to this error:
await new Promise((resolve) => setTimeout(resolve, 2000))

await session.start({
    agent,
    room: jobContext.room,
    inputOptions: {
        noiseCancellation: BackgroundVoiceCancellation(),
    },
})

I added the timeout to make it easier to reproduce the issue.

Additional information

I already tried to debug the issue:

If I connect and disconnect during the 2 s timeout, the agent hangs, ultimately resulting in "job is unresponsive".

My hypothesis from initial debugging:

  1. Inside the shutdown code of the agent session, closeImplInner is at some point called with drain = false (the default value).
  2. activity.interrupt is therefore called and awaited (this await is where the shutdown code hangs).
  3. Inside AgentActivity.interrupt, all speech handles are interrupted.
  4. As there is an active speech handle (see my greeting in onEnter), the lower part with the done callback is executed.
  5. The done callback is supposed to resolve the returned future, but it never does.

As far as I can see, the doneCallbacks are executed by _markDone (via the doneFut). My guess is that _markDone is not called properly if the speech handle is interrupted. I haven't investigated further yet… Maybe you have an idea?
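To make the suspected chain easier to discuss, here is a minimal standalone model of it. This is only a sketch and not the library's code: FakeSpeechHandle, markDoneOnInterrupt, and interruptAndWait are hypothetical names invented for illustration, and the model assumes my guess above is right (that interrupting a handle does not always lead to the done callbacks being run).

```typescript
// Hypothetical stand-in for a speech handle. NOT the real @livekit/agents class.
type DoneCallback = () => void;

class FakeSpeechHandle {
  private callbacks: DoneCallback[] = [];
  private done = false;
  private markDoneOnInterrupt: boolean;
  interrupted = false;

  constructor(markDoneOnInterrupt: boolean) {
    // Toggle that models the suspected bug (false) versus the expected behavior (true).
    this.markDoneOnInterrupt = markDoneOnInterrupt;
  }

  addDoneCallback(cb: DoneCallback): void {
    if (this.done) {
      cb();
      return;
    }
    this.callbacks.push(cb);
  }

  // Runs every registered done callback exactly once.
  markDone(): void {
    if (this.done) return;
    this.done = true;
    for (const cb of this.callbacks) cb();
  }

  interrupt(): void {
    this.interrupted = true;
    if (this.markDoneOnInterrupt) this.markDone();
  }
}

// Models steps 2 to 5: the returned promise resolves only from the done callback.
function interruptAndWait(handle: FakeSpeechHandle): Promise<void> {
  return new Promise<void>((resolve) => {
    handle.addDoneCallback(resolve);
    handle.interrupt();
  });
}

// With the suspected bug, the callback never fires, so any await on it would hang.
const stuck = new FakeSpeechHandle(false);
let stuckFired = false;
stuck.addDoneCallback(() => {
  stuckFired = true;
});
stuck.interrupt();
console.log(stuck.interrupted, stuckFired); // true false
```

In this model, awaiting interruptAndWait(new FakeSpeechHandle(false)) never returns, which matches the hang I see on the awaited interrupt during shutdown.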

Normally the session waits for the speech to complete before it starts shutting down. For testing, I added a setImmediate(() => session.shutdown({ reason: 'test', drain: false })) after the session start to reach this state more quickly. The behavior is the same.
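While debugging, it can help to bound a suspicious await so the hang shows up as an explicit error instead of a "job is unresponsive" warning about a minute later. The helper below is a generic promise utility, not part of the agents library; withTimeout and its label parameter are names made up for this sketch, and it assumes the event loop is still running while the await is stuck.

```typescript
// Generic helper (hypothetical, not a library API): rejects if `promise`
// has not settled within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number, label: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      reject(new Error(`${label} did not settle within ${ms} ms`));
    }, ms);
    promise.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}

// A promise that never settles stands in for the awaited interrupt.
const neverSettles = new Promise<void>(() => {});
withTimeout(neverSettles, 50, 'interrupt').catch((err: Error) => {
  console.error(err.message); // interrupt did not settle within 50 ms
});
```

This does not fix the underlying problem; it only narrows down which await is stuck.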
