● Conversation Behavior Model · research preview

More than turn-taking.
A model that predicts conversational dynamics.

A fine-tuned small model that decides not just what an agent says, but when — whether to continue, yield the floor, take a turn, or backchannel — from partial two-party conversation context.

The problem

Agents wait for silence.
People don't.

Real conversation is a constant negotiation of the floor — overlaps, backchannels, trailing-off pauses. A fixed silence threshold misses all of it. This model classifies the turn-taking decision directly.

Four actions, two roles

What it can decide.

  • [CONTINUE]

    Keep speaking.

  • [YIELD]

    Stop and hand the floor back.

One decision

A snapshot of the moment.

role: LISTENING
history:
  [User] i have not had to make that decision for
current user stream: regarding care for any of my
                      family members however i've had
decide → [CONTINUE]
Live demo · more-than-turn

Watch it decide, in real time.

Labels

Simulated tick by tick.

Switchboard and SBCSAE transcripts feed a dual-track interval tree. A scanner advances one tick at a time and, at each step, assembles the labeled snapshot shown in a snapshot of the moment.

SWB + SBCSAEinterval treetick scannerlook-ahead labelcurationChatML sample
Dual-track interval tree with user and bot speech intervals and a tick cursor stepping through time.

At each tick

Look-ahead logic reads the future interval-tree state to assign one of four ground-truth actions: continue, yield, take turn, or backchannel.

Curation

  • Deduplicate samples by content hash
  • Validate and filter against curation rules
  • Balance label ratios for training
  • Export as ChatML / JSONL rows

Curated 20K mix

Switchboard + SBCSAE, stratified by action.

Model

Small on purpose.

base    Qwen2.5-0.5B
method  LoRA  r=16  α=32
quant   4-bit · bf16
added   action tokens (embed + lm_head trainable)
What we found

The accuracy lied.

89.97%curated test
~23%natural data

On clean rule-matched samples it looked solved. On unconstrained conversation it collapsed — the model learned the curation rules' shortcuts, not semantic turn-taking. Action labels alone are underdetermined.

Where it goes next

Teach it the cues, not the call.

Role-conditioned auxiliary tasks decompose the decision into an interpretable rubric of conversation-state dimensions.

Listening dimensionLevels
Utterance completioncomplete / incomplete / abandoned
Turn projectionmore coming / finished / unclear
Floor licenselicensed / weak / unlicensed
Backchannel opportunityyes / no
Checkpoints

Side by side.

CheckpointAccuracyLatency
0.5B · 40K noisy77.1%~110 ms/sample
0.5B · 20K curated90.0%~134 ms/sample
1.5B · 20K curated85.5%~162 ms/sample

Curated accuracy does not equal in-the-wild performance. Bigger did not win.

FAQ

Frequently asked questions

VAD detects speech vs silence. This model classifies turn-taking intent — whether to hold, yield, jump in, or backchannel — from conversational context, not just endpoint detection.