● Conversation Behavior Model · research preview

More than turn-taking.
A model that predicts conversational dynamics.

A fine-tuned small model that decides not just what an agent says, but when — whether to continue, yield the floor, take a turn, or backchannel — from partial two-party conversation context.

Try the live demo HuggingFace ↗

The problem

Agents wait for silence.
People don't.

Real conversation is a constant negotiation of the floor — overlaps, backchannels, trailing-off pauses. A fixed silence threshold misses all of it. This model classifies the turn-taking decision directly.

Four actions, two roles

What it can decide.

[CONTINUE]
Keep speaking.
[YIELD]
Stop and hand the floor back.

One decision

A snapshot of the moment.

role: LISTENING
history:
  [User] i have not had to make that decision for
current user stream: regarding care for any of my
                      family members however i've had
decide → [CONTINUE]

Live demo · more-than-turn

Watch it decide, in real time.

Labels

Simulated tick by tick.

Switchboard and SBCSAE transcripts feed a dual-track interval tree. A scanner advances one tick at a time and, at each step, assembles the labeled snapshot shown in a snapshot of the moment.

SWB + SBCSAEinterval treetick scannerlook-ahead labelcurationChatML sample

Dual-track interval tree with user and bot speech intervals and a tick cursor stepping through time.

At each tick

Look-ahead logic reads the future interval-tree state to assign one of four ground-truth actions: continue, yield, take turn, or backchannel.

Curation

Deduplicate samples by content hash
Validate and filter against curation rules
Balance label ratios for training
Export as ChatML / JSONL rows

Curated 20K mix

Switchboard + SBCSAE, stratified by action.

CONTINUE: listen 19%
CONTINUE: speak 22%
YIELD: 15%
TAKETURN: 24%
BACKCHANNEL: 19%

Model

Small on purpose.

base    Qwen2.5-0.5B
method  LoRA  r=16  α=32
quant   4-bit · bf16
added   action tokens (embed + lm_head trainable)

What we found

The accuracy lied.

89.97%curated test

~23%natural data

On clean rule-matched samples it looked solved. On unconstrained conversation it collapsed — the model learned the curation rules' shortcuts, not semantic turn-taking. Action labels alone are underdetermined.

Where it goes next

Teach it the cues, not the call.

Role-conditioned auxiliary tasks decompose the decision into an interpretable rubric of conversation-state dimensions.

Listening dimension	Levels
Utterance completion	complete / incomplete / abandoned
Turn projection	more coming / finished / unclear
Floor license	licensed / weak / unlicensed
Backchannel opportunity	yes / no

Checkpoints

Side by side.

Checkpoint	Accuracy	Latency
0.5B · 40K noisy	77.1%	~110 ms/sample
0.5B · 20K curated	90.0%	~134 ms/sample
1.5B · 20K curated	85.5%	~162 ms/sample

Curated accuracy does not equal in-the-wild performance. Bigger did not win.

FAQ

Frequently asked questions

: VAD detects speech vs silence. This model classifies turn-taking intent — whether to hold, yield, jump in, or backchannel — from conversational context, not just endpoint detection.

Agents wait for silence.People don't.