Tiny Interaction Models - Rajan Agarwal

I post-trained Qwen3 30B-A3B into a text interaction model. The policy consumes typing as a timestamped event stream and emits one action per sampling tick: idle, mark a span, delegate a lookup, integrate a pending result, skip a stale one, or respond. Supervised fine-tuning on teacher-labeled streams covers the mechanics. A short DPO pass with interleaved SFT replay covers when not to act, which is the harder learning problem. The strongest demo has the model issuing a lookup mid-sentence, holding the returned fact across several ticks, and surfacing it only when the typing yields an opening.

Why text?

Thinking Machines released interaction models in May: models trained natively on continuous streams, so silence, overlap, and interruption stay in the model's context instead of being handled by a harness around it. All of their demos are audio or video (posture callouts, live translation, counting pushups). I wanted to know what the text version looks like.

Text is a bad fit for streaming at first glance. Speech is a stream whether you want it to be or not; typing is deliberate and you can edit before committing. A model that reacts to every keystroke would be unusable, and a model that waits for Enter is a regular chatbot. So the real question is what granularity of streaming input makes sense for text.

The framing that made this tractable is to perceive constantly and act rarely. A person listening to you processes everything you say in real time and mostly stays quiet, and that is the target here. The model conditions on the unfinished sentence at every tick, and its most common output is idle.

One scope note before the demos: this system is event-driven, not temporal. Time enters the model as deltas between events in the prompt, so the model reads time rather than experiencing it. "Remind me every five seconds to breathe" is exactly what this architecture can't do, and I come back to that at the end.

Demos

Each demo has to show three things at once: the model can do the behavior, it does it at the right time, and it doesn't do it any other time. A capability demo without the last two is a chatbot that interrupts.

The held fact

I start typing about a match. The model catches the unresolved fact mid-sentence and delegates a lookup on its own. The result comes back: Morocco beat the Netherlands on June 30, 2026. That date is after the training cutoff, so the base model cannot know the result and the weave is provably tool-sourced. The model holds the fact, and each hold is logged with a reason like "user did not ask to surface the pending lookup." It surfaces the fact only when my typing gives it an opening.

Sped-up cut: delegate mid-sentence, the fact card arrives while typing continues, several held beats, then the weave at the opening.

Filming this taught me three requirements. The fact has to post-date the model's training or the lookup looks staged. The delegate → fetch → hold chain has to land while typing continues, or you've filmed a turn-based bot with extra steps. And the weave has to wait for an opening, because knowing something is not a reason to say it.

Marks

These are the instruction-conditioned behaviors, closest to TML's "tell me when I slouch." Ask it in-stream to underline filler words and it marks um, basically, kinda as you type them. Ask for animals and it flags dolphin, heron, narwhal, including species that never appeared in training, since the base model already knows the category and only had to learn where to point the mark. Say stop and it stops. Marks render as an underline or a small callout and never take the turn.

The ask arrives in the stream; quokka, kestrel, and wombat get flagged as they're typed.

"Call me out on filler words": um and the multi-word "you know" underlined mid-flow.

Doing nothing

This is the least filmable demo: the model sits through normal typing, with revisions and pauses and half-finished thoughts, and does nothing. Close to half of the training decisions are idle and I kept them at full weight. If you downweight idle you get a model that fidgets.

No instruction active, nothing pending. Every tick still runs; the model keeps choosing idle.

The stream

The policy conditions on a single serialized event stream and outputs one action per tick:

<stream_event index=14 <t+650ms> source=user state=active revised=false>
so morocco played the netherlands last night and i think
</stream_event>
<stream_event index=15 <t+1800ms> source=tool state=paused>
{"result": "Morocco 2, Netherlands 1 (Jun 30, 2026)"}
</stream_event>
<PREDICT_THIS_ACTION>

Keystrokes (as snapshots) and tool results arrive through the same queue. The system prompt states it directly: the stream itself is the only context.
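A minimal sketch of how events like the two above could be serialized into one prompt. The `StreamEvent` class and both function names are hypothetical, and treating the `t+` value as a delta since the previous event is an assumption. Only the tag layout is taken from the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StreamEvent:
    index: int
    t_ms: int                       # assumed: delta since the previous event
    source: str                     # "user" or "tool"
    state: str                      # e.g. "active" or "paused"
    body: str                       # typing snapshot or raw tool payload
    revised: Optional[bool] = None  # only present on user snapshots


def render_event(ev: StreamEvent) -> str:
    """Render one event using the tag layout from the example above."""
    attrs = f"index={ev.index} <t+{ev.t_ms}ms> source={ev.source} state={ev.state}"
    if ev.revised is not None:
        attrs += f" revised={str(ev.revised).lower()}"
    return f"<stream_event {attrs}>\n{ev.body}\n</stream_event>"


def render_stream(events: List[StreamEvent]) -> str:
    """Join every event and end with the marker that asks for one action."""
    return "\n".join(render_event(e) for e in events) + "\n<PREDICT_THIS_ACTION>"
```

Because user snapshots and tool results share one event type, the policy needs no separate input channel for tools. A tool result is just another entry in the queue.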

There are six actions: idle, mark(kind, span, style), delegate(tool, args), integrate(text, source), skip(reason), respond(text). Only respond bids for the conversational floor; everything else renders as an annotation. That split is what lets the model underline a word or drop in a fact card without taking the turn.
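One way to picture the action space is as six small record types plus a single predicate for the floor. The field names follow the signatures listed above. The field types (for example `span` as a plain string) and the helper `bids_for_floor` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class Idle:
    pass


@dataclass(frozen=True)
class Mark:
    kind: str    # e.g. "filler" or "animal"
    span: str    # assumed: the marked text itself
    style: str   # e.g. "underline" or "callout"


@dataclass(frozen=True)
class Delegate:
    tool: str
    args: Dict[str, str]


@dataclass(frozen=True)
class Integrate:
    text: str
    source: str


@dataclass(frozen=True)
class Skip:
    reason: str


@dataclass(frozen=True)
class Respond:
    text: str


Action = Union[Idle, Mark, Delegate, Integrate, Skip, Respond]


def bids_for_floor(action: Action) -> bool:
    """Only respond takes the turn; every other action stays off the floor."""
    return isinstance(action, Respond)
```

Keeping the floor check to one type test mirrors the split described above: adding a new annotation-style action would not change who can take the turn.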

Every event in a synthetic stream carries its ground-truth action, so one stream produces many training rows, one per decision point; the unit of supervision is the decision.
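As a sketch of "the unit of supervision is the decision", the function below expands one labeled stream into one training row per event. The name `stream_to_rows`, the row keys, and the use of already-rendered event strings are assumptions, not the actual pipeline.

```python
from typing import Dict, List


def stream_to_rows(events: List[str], actions: List[str]) -> List[Dict[str, str]]:
    """Expand one stream into one (prompt, completion) row per decision point.

    events:  rendered event strings, in stream order
    actions: the ground-truth action for each event, same length as events
    """
    if len(events) != len(actions):
        raise ValueError("every event needs exactly one labeled action")
    rows = []
    for i, action in enumerate(actions):
        # The prompt for decision i is the full history up to and including event i.
        prompt = "\n".join(events[: i + 1]) + "\n<PREDICT_THIS_ACTION>"
        rows.append({"prompt": prompt, "completion": action})
    return rows
```

A stream with N events yields N rows, each sharing a prefix with the one before it.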

Data

Scenario scaffolds generate the situations: typing sliced into 2–6 word chunks with realistic timing, plus revisions, pauses, and the occasional backchannel. There are scaffold families for lookup-weave, instruction marks, and quoted instructions (someone mentioning "underline my fillers" inside a quote should trigger nothing).
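A rough sketch of the slicing step, assuming each chunk is emitted as a cumulative snapshot (matching the full-snapshot events the runtime uses). The function name and the timing range are placeholders; the real scaffolds' timing model, revisions, and backchannels are not reproduced here.

```python
import random
from typing import List, Tuple


def slice_typing(text: str, rng: random.Random,
                 min_words: int = 2, max_words: int = 6) -> List[Tuple[int, str]]:
    """Slice text into cumulative snapshots that each add 2-6 words.

    Returns (delta_ms, snapshot) pairs. The final chunk may be shorter than
    min_words if fewer words remain.
    """
    words = text.split()
    snapshots = []
    end = 0
    while end < len(words):
        end = min(len(words), end + rng.randint(min_words, max_words))
        delta_ms = rng.randint(300, 2000)  # placeholder range, not the scaffold's timing model
        snapshots.append((delta_ms, " ".join(words[:end])))
    return snapshots


# Example: a seeded generator keeps a scenario reproducible.
for delta_ms, snap in slice_typing("so morocco played the netherlands last night and i think",
                                   random.Random(0)):
    print(f"t+{delta_ms}ms  {snap}")
```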

The scaffolds also emit heuristic placeholder actions, and the rule of the corpus is that these are never trained on. From the labeling policy: "heuristic actions are placeholders; do not train on heuristic actions; a teacher must replace every action." A frontier model relabels every decision point against the behavior spec: mark only complete units, only under an active instruction; delegate when the typing reaches an unresolved fact; integrate while it's live; skip once it's stale.

The corpus is small on purpose, around two thousand teacher-labeled decision points, mixed roughly 2:1 with general assistant data to preserve base behavior. Schema-valid was never enough: a row can pass every automated check and still be a weird thing to do while someone is typing, so every batch went through a manual read before it was allowed near training. That review caught things none of the checks did.

Training

Stage one is plain SFT on the labeled streams: a LoRA trained on Tinker over a Qwen3 30B-A3B base.

The SFT model over-integrates. It surfaces the fetched fact even after the moment has passed, because the teacher's labels do the same thing. Probing the teacher directly gave me the finding the rest of the recipe depends on: the teacher never generates the skip action, but it recognizes it reliably. Ask it to label a stale-fact situation and it integrates; show it integrate vs skip as a pair and it picks skip essentially every time. Generating a behavior and recognizing it are different capabilities, and imitation only transfers the first. The tempting action is also almost never wrong. The fetched score is correct; the word really is an animal. In a generated set of appropriateness pairs, a correctness-based reward preferred the intrusive action in every single case. The distinction the model needs is wanted vs unwanted, and only pairwise preference carries that.

So stage two is a short DPO pass on pairs where the rejected branch is correct but unwanted, starting from the SFT checkpoint, with a small amount of low-learning-rate SFT replay mixed in so stage one doesn't get forgotten. The whole pass is tens of steps and a few minutes of wall clock.
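For reference, here is the standard per-pair DPO loss in plain Python. In this setting the chosen branch would be the wanted action (for example skip) and the rejected branch the correct-but-unwanted one (for example integrate). Using the SFT checkpoint as the reference model and `beta=0.1` are assumptions, and the SFT replay term is not shown.

```python
import math


def dpo_loss(pol_chosen: float, pol_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    Inputs are summed log-probabilities of each completion under the policy
    and under the frozen reference model. beta=0.1 is a placeholder value.
    """
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)); guard against overflow.
    if margin < -30.0:
        return -margin
    return math.log1p(math.exp(-margin))


# The loss falls as the policy shifts probability toward the chosen branch.
print(dpo_loss(-1.0, -5.0, -3.0, -3.0) < dpo_loss(-5.0, -1.0, -3.0, -3.0))
```

Nothing in this loss asks whether the rejected branch is factually correct. It only compares the two branches, which is why it can express wanted vs unwanted when both are correct.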

Why not on-policy distillation?

I skipped Tinker's OPD recipe deliberately. OPD's implicit reward is the teacher's likelihood on the student's own tokens: it pulls the student toward whatever the teacher would have generated from each state the student visits. But the teacher here generates over-integration. Distill toward its per-token preferences and you transfer that eagerness with high fidelity. You would be teaching, densely and efficiently, the exact behavior the preference stage exists to remove. OPD transfers what the teacher does; I needed what the teacher approves of, and the two split apart at the appropriateness boundary.

There's also a shape mismatch. OPD works best when the output is long and every token carries signal, like reasoning chains or code. Here a tick's output is a short grammar-constrained action where most tokens are forced, and the learnable part is one discrete choice (idle vs mark vs integrate vs skip) under a heavy idle prior. Token-level KL mostly grades tokens the grammar already fixed; preference pairs put all of the gradient on the choice. None of this is a knock on OPD. For moving capability into a small model it's still the best tool I know, and it may come back for the delegate-discipline work below. It just has nothing to say about appropriateness, because there is no capability gap there: the SFT model is fully capable of skipping, it just doesn't prefer to.

Engineering

The runtime emulates a full-duplex model with a turn-based one, under one rule throughout: the model decides, the runtime times and executes.

Time slicing. The browser samples the textarea on a cadence (350ms after the first keystroke, 650ms while typing continues, 1.8s once you pause) and POSTs full snapshots rather than deltas. The server appends each snapshot as a timestamped event and re-renders the whole history into the stream every tick. Snapshots make revisions free (the newest snapshot is simply the current truth) and make the stream robust to dropped requests. Inference is serialized with a busy flag and a one-slot pending queue: if a tick is running when new snapshots arrive, they collapse into the slot. Under load the stream gets coarser, but it never blocks the textarea.
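The busy flag and one-slot queue can be sketched as a small gate. This is a single-threaded illustration with hypothetical names. It assumes the newest snapshot simply overwrites whatever is parked, which is safe only because each snapshot is the full textarea state. A real server would need a lock or an async equivalent.

```python
from typing import Optional


class TickGate:
    """Serialize inference: at most one tick running, at most one snapshot parked."""

    def __init__(self) -> None:
        self.busy = False
        self.pending: Optional[str] = None

    def submit(self, snapshot: str) -> Optional[str]:
        """Return the snapshot to run now, or None if it was parked in the slot."""
        if self.busy:
            self.pending = snapshot  # newer snapshots collapse into the single slot
            return None
        self.busy = True
        return snapshot

    def finish(self) -> Optional[str]:
        """Call when a tick completes; return the parked snapshot to run next, if any."""
        nxt, self.pending = self.pending, None
        self.busy = nxt is not None
        return nxt
```

Submitting three snapshots while one tick runs leaves only the last one in the slot, so the stream gets coarser under load and the caller never waits.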

Backchanneling. respond is suppressed while the user holds the floor mid-typing; marks, fact cards, and weaves render as annotations that don't take the turn. Around the policy there is a thin license layer: the in-stream instruction overrides the model's own labels (no marking animals if you asked for fillers), instructions that only appear inside quotes go straight to idle, and everything the layer blocks lands in a hidden audit trail instead of disappearing. The screen shows one quiet line per executed action; the audit trail shows everything the model tried. Every demo take was verified against that trail. The trail is also where I found out the model was often cleaner than my scaffolding: at natural typing speed the policy emits well-formed spans mid-sentence, and most of the junk in early takes came from my own recovery heuristics.
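A hedged sketch of what such a license layer could look like. The dict-shaped actions, the `active_kind` argument, and the choice to downgrade blocked actions to idle are all assumptions. The sketch covers only two of the rules described above (floor suppression and instruction mismatch), and it logs every proposal so the trail shows everything the model tried.

```python
from typing import Dict, List, Optional


def apply_license(action: Dict, active_kind: Optional[str],
                  user_typing: bool, audit: List[Dict]) -> Dict:
    """Return the action to execute and record the proposal in the audit trail.

    active_kind is the mark kind requested by the live in-stream instruction,
    or None when no instruction is active (including quoted-only mentions).
    """
    blocked = None
    if action["type"] == "respond" and user_typing:
        blocked = "user holds the floor"
    elif action["type"] == "mark" and action.get("kind") != active_kind:
        blocked = "mark kind does not match the active instruction"
    audit.append({"proposed": action, "blocked": blocked})
    return {"type": "idle"} if blocked else action


# Usage: the user asked for fillers, the model tries to mark an animal.
trail: List[Dict] = []
print(apply_license({"type": "mark", "kind": "animal", "span": "heron"}, "filler", True, trail))
print(trail[0]["blocked"])
```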

Speed. The demo checkpoint started as a hosted LoRA running at minutes per tick. I merged the adapter into the base, quantized to 4 bits for local inference, and added a prefix KV cache: consecutive ticks share almost the entire prompt, so each tick reuses the longest common token prefix and prefills only the delta. Tick latency went from ~15 seconds to ~1 second. This is a cheap imitation of what real interaction models get natively, where state is the accumulated KV and each new slice is a marginal update to it.
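The prefix reuse comes down to finding how many leading tokens two consecutive prompts share. A minimal sketch with hypothetical function names follows. How the cached KV entries are actually trimmed and extended depends on the inference runtime and is not shown.

```python
from typing import List, Tuple


def common_prefix_len(prev_tokens: List[int], new_tokens: List[int]) -> int:
    """Count the leading tokens shared by the previous and current prompt."""
    n = 0
    for a, b in zip(prev_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n


def plan_prefill(prev_tokens: List[int], new_tokens: List[int]) -> Tuple[int, List[int]]:
    """Return (cached positions to keep, tokens that still need a prefill pass)."""
    keep = common_prefix_len(prev_tokens, new_tokens)
    return keep, new_tokens[keep:]


# Example: a new event is inserted before a trailing marker token (9).
keep, delta = plan_prefill([1, 2, 3, 4, 9], [1, 2, 3, 4, 5, 6, 9])
print(keep, delta)  # 4 [5, 6, 9]
```

Because the prompt ends with a fixed marker and new events land before it, the shared prefix stops at the old marker position and the delta covers the new events plus the marker.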

What didn't work and what's next

Confabulation under latency. With a 10-second tool, one discarded take shows the model weaving in "the Netherlands won 2-1" before the lookup returned, with the wrong team and the wrong score. This is the whole case for tool-sourced weaves and for showing provenance in the UI.

Re-delegation. The demo checkpoint sometimes fires a second lookup while the first is still pending; you can see duplicate fact cards in one shipped take. Delegate discipline is a timing judgment, which puts it in the preference tier I haven't trained yet.

Skip is flaky on camera. It passes repeatably at the API level and misses on film more than it should. Skip is the thinnest behavior in the data by an order of magnitude, and it lives in the preference tier where data costs the most to make.

No user marks. Several testers instinctively tried to mark spans for the model. The event schema has no user action type; marks belong to the assistant. Fixing this is a schema and corpus change, not a training run.

The event-driven line is also real. Everything the system does is a reaction to something entering the queue; nothing originates at a moment. Recurring timing ("every five seconds") would need timer events as first-class stream citizens, a nudge action that doesn't bid for the floor, and a corpus that currently has zero timer scenarios. Even then the loop resolves in seconds, not milliseconds. TML gets time as a native dimension of the model; a tick-based text system only reads the clock.

Next is mostly preference data: delegate discipline, weave timing, and enough skip pairs that restraint films as reliably as it evals. The runtime also still carries context scaffolding (lenses, guard rails) that the trained policy doesn't need and the demos don't use, and it should either be removed or studied properly.

Citation

Please cite this work as:

Rajan Agarwal, "Tiny Interaction Models", rajan.sh, Jun 2026. https://www.rajan.sh/tiny-interaction-models

Or use the BibTeX citation:

@misc{agarwal2026tinyinteractionmodels,
  author = {Rajan Agarwal},
  title = {Tiny Interaction Models},
  year = {2026},
  howpublished = {rajan.sh},
  note = {https://www.rajan.sh/tiny-interaction-models},
}