More than turn-taking.
A model that predicts conversational dynamics.
A fine-tuned small model that decides not just what an agent says, but when — whether to continue, yield the floor, take a turn, or backchannel — from partial two-party conversation context.
Agents wait for silence.
People don't.
Real conversation is a constant negotiation of the floor — overlaps, backchannels, trailing-off pauses. A fixed silence threshold misses all of it. This model classifies the turn-taking decision directly.
What it can decide.
[CONTINUE]Keep speaking.
[YIELD]Stop and hand the floor back.
A snapshot of the moment.
role: LISTENING history: [User] i have not had to make that decision for current user stream: regarding care for any of my family members however i've had decide → [CONTINUE]
Watch it decide, in real time.
Simulated tick by tick.
Switchboard and SBCSAE transcripts feed a dual-track interval tree. A scanner advances one tick at a time and, at each step, assembles the labeled snapshot shown in a snapshot of the moment.
At each tick
Look-ahead logic reads the future interval-tree state to assign one of four ground-truth actions: continue, yield, take turn, or backchannel.
Curation
- Deduplicate samples by content hash
- Validate and filter against curation rules
- Balance label ratios for training
- Export as ChatML / JSONL rows
Curated 20K mix
Switchboard + SBCSAE, stratified by action.
- CONTINUE
- listen 19%
- CONTINUE
- speak 22%
- YIELD
- 15%
- TAKETURN
- 24%
- BACKCHANNEL
- 19%
Small on purpose.
base Qwen2.5-0.5B method LoRA r=16 α=32 quant 4-bit · bf16 added action tokens (embed + lm_head trainable)
The accuracy lied.
On clean rule-matched samples it looked solved. On unconstrained conversation it collapsed — the model learned the curation rules' shortcuts, not semantic turn-taking. Action labels alone are underdetermined.
Teach it the cues, not the call.
Role-conditioned auxiliary tasks decompose the decision into an interpretable rubric of conversation-state dimensions.
| Listening dimension | Levels |
|---|---|
| Utterance completion | complete / incomplete / abandoned |
| Turn projection | more coming / finished / unclear |
| Floor license | licensed / weak / unlicensed |
| Backchannel opportunity | yes / no |
Side by side.
| Checkpoint | Accuracy | Latency |
|---|---|---|
| 0.5B · 40K noisy | 77.1% | ~110 ms/sample |
| 0.5B · 20K curated | 90.0% | ~134 ms/sample |
| 1.5B · 20K curated | 85.5% | ~162 ms/sample |
Curated accuracy does not equal in-the-wild performance. Bigger did not win.
Frequently asked questions
- VAD detects speech vs silence. This model classifies turn-taking intent — whether to hold, yield, jump in, or backchannel — from conversational context, not just endpoint detection.