Technical Deep Dive
May 2026 by Colossyan Research Team

NEO 1.5: Identity-Aware Talking-Head Performance

NEO 1.5 places speaker identity inside the attention pass for the first time. Audio, motion, and identity reason together in a single block we call Unified Joint Attention, instead of identity steering the network from the outside.

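[Figure: Unified Joint Attention. Three input streams (audio, motion, identity) pass through N layers of joint attention (audio × motion × identity) and emerge as audio′, motion′, and identity′. Only the motion latents go on to the renderer and then to video: audio and identity shape motion inside the block, then drop out of the read-out.]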
In NEO, only audio and motion shared the attention pass. NEO 1.5 adds identity as a third stream inside it.

What Actually Changed

NEO 1.5 keeps the two-stage pipeline, the diffusion foundation, and the performance data from NEO. The change is that speaker identity now sits inside the attention pass rather than around it.

In NEO, audio and motion were the two streams that reasoned together. Identity shaped the result from outside the attention pass, as a conditioning signal. NEO 1.5 extends that joint attention from two streams to three, so identity becomes a stream of its own. It co-evolves with audio and motion inside the same pass rather than steering them from the side.

The clearest way to see the difference is to watch a speaker hold a half-second pause. In NEO, the face goes still because nothing in the attention pass knows whose face it is. In NEO 1.5, the eyes refocus and the breath comes back, because identity sits in the same pass as audio and motion.

[Figure: NEO 1.5 pipeline. Audio, speaker identity (new), and the reference latent enter VOCAL 1.5 (Unified Joint Attention), which passes animated latents to the unchanged renderer to produce video. VOCAL 1.5 is the only stage that changed.]

Where Identity Lives

In NEO, identity entered as a side conditioner around the attention pass. Audio and motion reasoned together; identity influenced the result from outside. In NEO 1.5, identity enters the attention pass itself as a third stream alongside audio and motion.

[Figure: NEO versus NEO 1.5. In NEO, audio and motion share joint attention (audio × motion) while identity acts as a side conditioner: its influence is indirect and it never enters attention. In NEO 1.5, audio, motion, and identity share Unified Joint Attention (audio × motion × identity): all three streams attend together and identity influence is direct.]
In NEO, identity reaches the network through modulation around the attention pass. In NEO 1.5, identity reasons inside that pass.
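The contrast between the two routes can be made concrete with a toy example. The sketch below is illustrative only: it assumes the NEO-style side conditioner behaves like a scale-and-shift modulation (the source says only "modulation"), and every name in it (`side_condition`, `identity_weight`, the toy vectors) is hypothetical rather than part of NEO.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def side_condition(tokens, scale, shift):
    """Assumed NEO-style route: identity supplies one scale and shift
    that is applied identically to every token, outside attention."""
    return [[x * s + b for x, s, b in zip(tok, scale, shift)] for tok in tokens]

def identity_weight(query, keys, identity_index):
    """NEO 1.5-style route: the identity token is one of the keys, so each
    query decides for itself how much attention to give it."""
    norm = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * norm for k in keys])
    return weights[identity_index]

# Toy data: two motion tokens and one identity token (hypothetical values).
motion = [[2.0, 0.0], [0.0, 2.0]]
identity = [1.0, 0.0]

# Outside attention: both motion tokens receive the same modulation.
modulated = side_condition(motion, scale=[1.5, 1.5], shift=[0.1, 0.1])

# Inside attention: each motion token weighs identity differently.
keys = motion + [identity]
per_token = [identity_weight(q, keys, identity_index=2) for q in motion]
```

In the first route the identity signal is blind to which token it touches. In the second, the motion token that resembles the identity token attends to it more strongly than the one that does not, which is the sense in which identity influence becomes direct.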

Where Identity-Aware Attention Shows Up

Same model family, same audio input, same reference image. The four pillars below are the places Unified Joint Attention changes the output.

Speaker-Faithful Identity

The generated performance stays consistent with the person you started from, across long durations and difficult phonemes.

Geometry-Aware Mouth

Lip shapes fit the speaker's actual face. Wide vowels and fast consonants stay plausible for this speaker's jaw rather than a generic one.

Locked on Fast Consonants

The mid-syllable wobble that two-stream models produce on rapid plosives is visibly damped once identity reasons inside the attention pass.

Lifelike Pauses and Gaze

Silences read as the same person breathing and glancing. Identity carries the performance through the frames where audio drops out.

See It Side By Side

Same audio, same reference image, NEO on the left and NEO 1.5 on the right. The four cases below mirror the pillars above.

01 Speaker-Faithful Identity

A long take on the same reference image. The face stays consistent across difficult phonemes.

02 Geometry-Aware Mouth

Wide vowels and rapid consonants. Mouth shapes that fit the speaker's actual jaw.

03 Locked on Fast Consonants

Rapid plosives. The mid-syllable wobble two-stream models produce, side by side with NEO 1.5.

04 Lifelike Pauses and Gaze

A half-second pause mid-sentence. NEO holds a face. NEO 1.5 holds a person.

Inside VOCAL 1.5

Four stages compose the upgraded Voice-to-Latent model. Unified Joint Attention is the change; the rest preserves the diffusion foundation from NEO.

1. Audio Understanding

Extracts rich speech cues, including phonetic timing, rhythm, emphasis, pauses, and prosody, that guide the full facial performance and not just the lips. The encoder is the same multi-scale stack we shipped with NEO; the upgrade is what happens to its features once they enter the attention block.

2. Unified Joint Attention

The new attention core. Audio, motion, and speaker identity travel as peer streams and mix inside a single shared attention field. Every token in every stream attends to every other token, so the three streams exchange information within the same block.

Each block runs that joint pass, refines each stream on its own, and repeats through the network. Audio and identity shape the motion stream during the pass; only the motion stream is read out as the performance.
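The loop described above (joint pass, per-stream refinement, repeat, read out motion) can be sketched in a few lines. This is a minimal sketch under strong assumptions, not the VOCAL 1.5 implementation: tokens act as their own queries, keys, and values with no learned projections, the per-stream refinement is a placeholder `tanh`, and all function names are hypothetical.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def joint_attention(streams):
    """One shared attention field over the tokens of every stream."""
    names = list(streams)
    tokens = [tok for name in names for tok in streams[name]]
    dim = len(tokens[0])
    norm = 1.0 / math.sqrt(dim)
    mixed = []
    for query in tokens:
        weights = softmax([dot(query, key) * norm for key in tokens])
        mixed.append([sum(w * v[d] for w, v in zip(weights, tokens))
                      for d in range(dim)])
    # Split the mixed tokens back into their original streams.
    out, start = {}, 0
    for name in names:
        count = len(streams[name])
        out[name] = mixed[start:start + count]
        start += count
    return out

def refine(tokens):
    """Placeholder for the per-stream refinement step (assumption)."""
    return [[math.tanh(x) for x in tok] for tok in tokens]

def run_blocks(streams, n_layers):
    """Repeat joint pass + residual add + refinement; read out motion only."""
    for _ in range(n_layers):
        mixed = joint_attention(streams)
        updated = {}
        for name in streams:
            summed = [[a + b for a, b in zip(tok, mix)]
                      for tok, mix in zip(streams[name], mixed[name])]
            updated[name] = refine(summed)
        streams = updated
    return streams["motion"]

# Toy inputs (hypothetical values): 2 audio, 2 motion, 1 identity token.
base = {
    "audio": [[1.0, 0.0], [0.5, 0.5]],
    "motion": [[0.0, 1.0], [0.2, 0.1]],
    "identity": [[1.0, 1.0]],
}
motion_a = run_blocks(base, n_layers=2)
motion_b = run_blocks(dict(base, identity=[[-1.0, 2.0]]), n_layers=2)
```

With audio and the initial motion tokens held fixed, swapping the identity token changes the motion read-out, even though the identity stream itself is discarded at the end.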

3. Identity-Aware Generation

Uses speaker information as part of the animation process rather than as a final pass-through. The model treats identity as part of the motion problem itself, which is what keeps the performance consistent across difficult phoneme boundaries and through long silences. The model is solving for faithful motion over time, not just a static identity match.

4. Guided Diffusion

VOCAL 1.5 stays a diffusion-based generator. At inference time, it exposes controllable guidance across the signals that define a performance, so you can tune how strongly each one steers the result. Some clips want stronger audio responsiveness, others want stricter identity preservation or steadier reference behaviour.
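One common way to expose per-signal control in a diffusion model is multi-condition classifier-free guidance, where each signal gets its own weight. The source does not say which formulation VOCAL 1.5 uses, so the sketch below is an assumption for illustration only; the function name, the signal names, and the toy predictions are all hypothetical.

```python
def guided_prediction(uncond, conditional, weights):
    """Combine denoiser predictions with one guidance weight per signal.

    uncond:      prediction with every signal dropped
    conditional: dict of signal name -> prediction with only that signal kept
    weights:     dict of signal name -> guidance strength (0 disables a signal)
    """
    guided = list(uncond)
    for name, pred in conditional.items():
        w = weights.get(name, 0.0)
        # Push the result along the direction this signal adds on its own.
        guided = [g + w * (p - u) for g, p, u in zip(guided, pred, uncond)]
    return guided

# Toy two-dimensional predictions (hypothetical values).
uncond = [0.0, 0.0]
conditional = {
    "audio": [1.0, 0.0],
    "identity": [0.0, 2.0],
    "reference": [0.5, 0.5],
}

audio_led = guided_prediction(
    uncond, conditional, {"audio": 2.0, "identity": 1.0, "reference": 0.5})
identity_led = guided_prediction(
    uncond, conditional, {"audio": 1.0, "identity": 2.0, "reference": 0.5})
```

Raising one weight moves the combined prediction further along that signal's direction while leaving the others in place, which is the kind of trade-off between audio responsiveness and identity preservation described above.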

Building Blocks

α
Audio understanding

Phonetic timing, rhythm, emphasis, pauses, prosody.

β
Unified Joint Attention

Audio, motion, and identity reason together in one pass.

γ
Identity-aware generation

Speaker geometry shapes motion, not a post-step.

δ
Guided diffusion

Tune audio, identity, and reference influence at inference.

Depth and Focus

Putting three streams in one attention pass is the headline change. Two further refinements make that pass behave well at depth. Attention residuals act across the stack of layers, keeping early information reachable late. Differential attention works inside a single attention step, sharpening which tokens actually matter.

Attention Residuals: A Skip Across Depth

A standard deep transformer is a strictly sequential stack. Useful early features, like a clean audio onset or a stable identity signature, can get progressively overwritten through depth. Vanilla residuals add every layer's contribution with equal weight, so the model cannot decide which depth's representation matters most for a given input.

Attention residuals replace that uniform accumulation with a learned, input-dependent blend across depth. The simplest way to think about it: attention, but over layers instead of over tokens. At every block boundary the network blends all cached states with softmax weights rather than forwarding only the most recent one. The weights shift with the input, so the model can favour shallow or deep representations as needed.

In the unified design, that gives every stream a skip-across-depth highway. Motion can recover an early kinematic estimate if later layers over-smooth it. Audio can preserve sharp phonetic timing that deep mixing would blur. Identity can hold its speaker signature steady rather than drifting.
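The phrase "attention over layers instead of over tokens" can be sketched for a single token. This is a minimal sketch, assuming the learned query scores each cached state by a scaled dot product; the source does not give the exact scoring function, and the names `blend_across_depth`, `cached`, and `query` are hypothetical.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def blend_across_depth(cached_states, query):
    """Blend one token's cached layer states with softmax weights.

    cached_states: this token's hidden state at each cached depth
    query:         learned query vector (fixed here for illustration)
    """
    dim = len(query)
    norm = 1.0 / math.sqrt(dim)
    # The scores depend on the states themselves, so the weights shift
    # with the input even though the query is a fixed parameter.
    weights = softmax([dot(query, h) * norm for h in cached_states])
    blended = [sum(w * h[d] for w, h in zip(weights, cached_states))
               for d in range(dim)]
    return blended, weights

# Three cached depths for one token (hypothetical values).
cached = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [4.0, 0.0]

blended, weights = blend_across_depth(cached, query)
```

Here the query aligns with the shallowest state, so that state receives the largest weight and the token recovers mostly early information. A zero query would give every depth the same weight, which makes the blend a plain average of the cached states.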

[Figure: Attention residuals. A learned query q scores each cached layer state (h₀, h_B, h_2B, h_3B); softmax weights w, which shift with the input, blend them into h_new. Each token, in each stream, chooses how much to draw from shallow versus deep layers.]

Differential Attention: Cancelling Attention Noise

Softmax attention almost never assigns exactly zero weight to irrelevant tokens. Summed over a long sequence, that attention noise dilutes the signal. In a multimodal field it is worse, because motion, audio, and identity tokens all compete in one place, so motion queries can leak attention onto unrelated audio frames and the reverse.

Differential attention cancels that common-mode noise, by analogy with a differential amplifier that rejects noise shared between two signal lines. Each head is split into two sub-heads that compute two independent softmax maps over the same tokens, and the output uses their difference. If both maps place spurious weight on the same irrelevant tokens, subtracting them cancels that shared mode, while the genuine signal survives.
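The subtraction can be illustrated on three tokens, one relevant and two irrelevant. This is a simplified sketch of the difference of two softmax maps: it shares the keys between the two sub-heads, uses a fixed scalar `lam` for the subtraction weight, and omits any output normalisation a full implementation may apply. All names and values are hypothetical.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_map(query, keys):
    """Standard softmax attention weights for one query."""
    norm = 1.0 / math.sqrt(len(query))
    return softmax([dot(query, k) * norm for k in keys])

def differential_map(q1, q2, keys1, keys2, lam):
    """Difference of two softmax maps computed over the same tokens."""
    a1 = attention_map(q1, keys1)
    a2 = attention_map(q2, keys2)
    return [x - lam * y for x, y in zip(a1, a2)]

def apply_map(weights, values):
    """Weighted sum of value vectors under the given attention weights."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# Token 0 is the relevant one; tokens 1 and 2 are noise (hypothetical values).
keys = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
q1 = [3.0, 1.0]   # sub-head 1: signal plus some attraction to the noise
q2 = [0.0, 1.0]   # sub-head 2: mostly the same noise, weaker peak

plain = attention_map(q1, keys)
diff = differential_map(q1, q2, keys, keys, lam=0.4)
output = apply_map(diff, values)
```

Both sub-heads put weight on the two noise tokens, so the subtraction removes most of it, while the weight on token 0 stays dominant because only the first sub-head peaks there.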

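The variant is compute-neutral. Each differential head is built from two sub-heads, so a layer has half as many differential heads as it would have standard heads. Each differential head computes two maps, so the total number of attention maps, and with it the total cost, matches standard attention.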

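[Figure: Differential attention. Map A₁ carries the signal plus a noise floor; map λ · A₂ carries the same noise with a weaker peak. In the difference A₁ − λA₂ the noise cancels and the signal stays. λ is depth-scheduled: early layers subtract less, deeper layers subtract more aggressively for sharper focus.]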

Why Unify the Attention At All

Treating audio, motion, and identity as separate streams scores well on benchmark metrics but produces visible artefacts in real performances. Unified Joint Attention targets the three places where those artefacts show up.

Identity drift on hard phonemes

Fast consonants and wide vowels push the mouth into uncertain shapes. With identity in the same step as the phoneme, every shape stays plausible for this face.

Pauses that look dead

When audio leads, a silence becomes an empty hole in the motion. Identity fills it with the speaker's own gaze and breathing, so the pause still reads as a person.

Emphasis that fits the speaker

The same rising pitch should move two people differently. Letting emphasis meet identity before motion is set makes the gesture this speaker's own.

Guided Generation

VOCAL 1.5 stays a diffusion-based generator, and at inference time it still supports guided generation across the signals that define a performance. You tune how strongly each one steers the result.

Some clips want stronger audio responsiveness, others want stricter identity preservation or steadier reference behaviour. Try the balance below.

[Interactive sliders: Audio, Identity, Reference]

Illustrative only. The sliders show the kind of control guided generation exposes, not exact inference parameters.

NEO vs NEO 1.5

The table below summarises the five dimensions Unified Joint Attention moves. The pillars and architecture above explain why each one moves.

Dimension                       NEO             NEO 1.5
Identity preservation           Generic         Speaker-faithful
Mouth-shape fit to face         Approximate     Geometry-aware
Stability on fast consonants    Wobbly          Locked
Pause and gaze coherence        Inconsistent    Lifelike
Lip-sync accuracy               Strong          Maintained

The Foundation Stays

NEO 1.5 is an architectural change. The dataset, the diffusion foundation, and the reinforcement-learning recipe carry over from NEO, so the quality gain comes from how the attention pass is wired, not from more data. NEO 1.5 sits between NEO and our full-body successor, NEO 2.

Frequently Asked Questions

What is NEO 1.5?

NEO 1.5 is an upgrade to Colossyan's NEO talking-head model. The architectural change is concentrated in one place: VOCAL 1.5, the Voice-to-Latent stage, replaces NEO's two-stream attention with Unified Joint Attention. Audio, motion, and speaker identity travel as peer streams inside a single attention pass. The renderer, the data, and the rest of the pipeline are unchanged.

How is NEO 1.5 different from NEO?

In NEO, audio and motion reasoned together inside the attention pass; identity reached the network as a side conditioner around that pass. NEO 1.5 promotes identity into the attention pass itself, so all three signals attend to one another before any motion is committed. The change shows up at phoneme boundaries, during silences, and on emphasised words.

What is Unified Joint Attention?

An attention block where three streams (audio, motion, and identity) mix in a single shared field rather than as separately conditioned signals. Every token in every stream attends to every other. The block runs that joint pass, refines each stream on its own, and repeats through the network. Only the motion stream is read out as the final performance.

Does NEO 1.5 need more training data than NEO?

No. NEO 1.5 runs on the same dataset that grounds NEO. The gain comes from the architecture, not more data. Curated, high-quality human performance recordings, latent diffusion, and reinforcement learning still give the model its broad sense of how people speak, pause, emphasise, and feel.

What are attention residuals and differential attention?

Two refinements that make the joint attention pass behave well at depth. Attention residuals add a learned, input-dependent blend across layers, so the model can favour shallow or deep representations as needed for a given token. Differential attention splits each head into two sub-heads, computes two softmax maps over the same tokens, and uses their difference to cancel the noise floor that softmax always spreads over irrelevant tokens. The total attention cost matches standard attention.

Can I still control the generation at inference time?

Yes. VOCAL 1.5 stays a diffusion-based generator and supports guided generation across the signals that define a performance. You can tune how strongly audio responsiveness, identity preservation, and reference stability each steer the result. Some clips want stronger audio expressiveness; others want stricter identity preservation or steadier reference behaviour.

When will NEO 1.5 be available?

NEO 1.5 is a research preview today. We are running internal evaluations and customer pilots before general availability in the Colossyan editor and API. If you'd like to be considered for the early-access cohort, reach out via your account team or sign up for a trial.