Speaker-Faithful Identity
NEO 1.5 places speaker identity inside the attention pass for the first time. Audio, motion, and identity reason together in a single block we call Unified Joint Attention, instead of identity steering the network from the outside.
NEO 1.5 keeps the two-stage pipeline, the diffusion foundation, and the performance data from NEO. The change is that speaker identity now sits inside the attention pass rather than around it.
In NEO, audio and motion were the two streams that reasoned together. Identity shaped the result from outside the attention pass, as a conditioning signal. NEO 1.5 extends that joint attention from two streams to three, so identity becomes a stream of its own. It co-evolves with audio and motion inside the same pass rather than steering them from the side.
The clearest way to see the difference is to watch a speaker hold a half-second pause. In NEO, the face goes still because nothing in the attention pass knows whose face it is. In NEO 1.5, the eyes refocus and the breath comes back, because identity sits in the same pass as audio and motion.
In NEO, identity entered as a side conditioner around the attention pass. Audio and motion reasoned together; identity influenced the result from outside. In NEO 1.5, identity enters the attention pass itself as a third stream alongside audio and motion.
Same model family, same audio input, same reference image. The four pillars below are the places Unified Joint Attention changes the output.
The generated performance stays consistent with the person you started from, across long durations and difficult phonemes.
Lip shapes fit the speaker's actual face. Wide vowels and fast consonants stay plausible for this speaker's jaw rather than a generic one.
The mid-syllable wobble that two-stream models produce on rapid plosives is visibly damped once identity reasons inside the attention pass.
Silences read as the same person breathing and glancing. Identity carries the performance through the frames where audio drops out.
Same audio, same reference image, NEO on the left and NEO 1.5 on the right. The four cases below mirror the pillars above.
A long take on the same reference image. The face stays consistent across difficult phonemes.
Wide vowels and rapid consonants. Mouth shapes that fit the speaker's actual jaw.
Rapid plosives. The mid-syllable wobble two-stream models produce, side by side with NEO 1.5.
A half-second pause mid-sentence. NEO holds a face. NEO 1.5 holds a person.
Four stages make up VOCAL 1.5, the upgraded Voice-to-Latent model. Unified Joint Attention is the change; the rest preserves the diffusion foundation from NEO.
Extracts rich speech cues, including phonetic timing, rhythm, emphasis, pauses, and prosody, that guide the full facial performance and not just the lips. The encoder is the same multi-scale stack we shipped with NEO; the upgrade is what happens to its features once they enter the attention block.
The new attention core. Audio, motion, and speaker identity travel as peer streams and mix inside a single shared attention field: every token in every stream attends to every other token within the same block.
Each block runs that joint pass, refines each stream on its own, and repeats through the network. Audio and identity shape the motion stream during the pass; only the motion stream is read out as the performance.
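As a rough illustration of the joint pass described above, the sketch below concatenates three toy streams into one token sequence, runs plain scaled dot-product attention over all of them, splits the result back into streams, and reads out only motion. It is a minimal sketch under stated assumptions, not the VOCAL 1.5 implementation: the function names, the toy token values, the omission of learned projections, and the use of a simple residual add as the per-stream refinement are all hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def joint_attention(audio, motion, identity):
    """One joint pass: every token attends to every token in all three streams."""
    tokens = audio + motion + identity           # single shared attention field
    width = len(tokens[0])
    mixed, weights = [], []
    for query in tokens:
        # Assumption: learned query/key/value projections are omitted for brevity.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(width)
                  for key in tokens]
        row = softmax(scores)
        weights.append(row)
        mixed.append([sum(w * tok[d] for w, tok in zip(row, tokens))
                      for d in range(width)])
    a, m = len(audio), len(motion)
    return mixed[:a], mixed[a:a + m], mixed[a + m:], weights

def residual_add(stream, update):
    """Hypothetical stand-in for each stream's own refinement step."""
    return [[x + u for x, u in zip(tok, upd)] for tok, upd in zip(stream, update)]

# Toy inputs (made up): 2 audio tokens, 3 motion tokens, 1 identity token, width 4.
audio = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
motion = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.5, 0.5], [0.0, 0.0, 0.0, 1.0]]
identity = [[0.5, 0.5, 0.5, 0.5]]

for _ in range(2):                               # repeat the block through the network
    a_mix, m_mix, i_mix, weights = joint_attention(audio, motion, identity)
    audio = residual_add(audio, a_mix)
    motion = residual_add(motion, m_mix)
    identity = residual_add(identity, i_mix)

performance = motion                             # only the motion stream is read out
print(len(performance), len(performance[0]))
```

Each row of `weights` spans all six tokens, so a motion query always places some weight on audio and identity tokens in the same pass, which is the property the three-stream design relies on.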
Uses speaker information as part of the animation process rather than as a final pass-through. The model treats identity as part of the motion problem itself, which is what keeps the performance consistent across difficult phoneme boundaries and through long silences. The model is solving for faithful motion over time, not just a static identity match.
VOCAL 1.5 stays a diffusion-based generator. At inference time, it exposes controllable guidance across the signals that define a performance, so you can tune how strongly each one steers the result. Some clips want stronger audio responsiveness, others want stricter identity preservation or steadier reference behaviour.
Putting three streams in one attention pass is the headline change. Two further refinements make that pass behave well at depth. Attention residuals act across the stack of layers, keeping early information reachable late. Differential attention works inside a single attention step, sharpening which tokens actually matter.
A standard deep transformer is a strictly sequential stack. Useful early features, like a clean audio onset or a stable identity signature, can get progressively overwritten through depth. Vanilla residuals add every layer's contribution with equal weight, so the model cannot decide which depth's representation matters most for a given input.
Attention residuals replace that uniform accumulation with a learned, input-dependent blend across depth. The simplest way to think about it: attention, but over layers instead of over tokens. At every block boundary the network blends all cached states with softmax weights rather than forwarding only the most recent one. The weights shift with the input, so the model can favour shallow or deep representations as needed.
In the unified design, that gives every stream a skip-across-depth highway. Motion can recover an early kinematic estimate if later layers over-smooth it. Audio can preserve sharp phonetic timing that deep mixing would blur. Identity can hold its speaker signature steady rather than drifting.
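The idea of "attention over layers" can be sketched in a few lines. The example below scores each cached per-layer state against a query vector, turns the scores into softmax weights, and blends the states with those weights. This is only a sketch under assumptions: the source does not say how the depth weights are computed, so the single `depth_query` vector, the dot-product scoring, and the toy states are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def blend_over_depth(cached_states, depth_query):
    """Blend cached per-layer states with softmax weights over depth."""
    width = len(depth_query)
    # One score per cached layer state (assumed scaled dot-product scoring).
    scores = [sum(q * h for q, h in zip(depth_query, state)) / math.sqrt(width)
              for state in cached_states]
    weights = softmax(scores)
    blended = [sum(w * state[d] for w, state in zip(weights, cached_states))
               for d in range(width)]
    return blended, weights

# Hypothetical query vector, shared across inputs.
depth_query = [4.0, 0.0, 0.0, 0.0]

# Cached states (shallow, middle, deep) for two made-up inputs.
cached_a = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
cached_b = [[0.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]

blend_a, weights_a = blend_over_depth(cached_a, depth_query)
blend_b, weights_b = blend_over_depth(cached_b, depth_query)
print(weights_a)
print(weights_b)
```

With the same query, input A puts most of its weight on the shallow state and input B on the deep one, whereas a plain residual sum would weight every layer equally for both.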
Softmax attention almost never assigns exactly zero weight to irrelevant tokens. Summed over a long sequence, that attention noise dilutes the signal. In a multimodal field it is worse, because motion, audio, and identity tokens all compete in one place, so motion queries can leak attention onto unrelated audio frames and vice versa.
Differential attention cancels that common-mode noise, by analogy with a differential amplifier that rejects noise shared between two signal lines. Each head is split into two sub-heads that compute two independent softmax maps over the same tokens, and the output uses their difference. If both maps place spurious weight on the same irrelevant tokens, subtracting them cancels that shared mode, while the genuine signal survives.
The variant is compute-neutral: because each head is split into two halves, there are half as many differential heads, each computing two maps, so the total number of attention maps, and with it the cost, matches standard attention.
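A small worked example makes the cancellation concrete. The sketch below splits one head's queries and keys into two halves, computes a softmax map from each half over the same tokens, and applies the difference of the two maps to the values. It is a sketch under assumptions rather than the production layer: the function names and toy vectors are hypothetical, learned projections are omitted, and the two maps are subtracted with equal weight because the source only says the output "uses their difference".

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_map(queries, keys):
    """Softmax attention weights of each query over all keys."""
    width = len(keys[0])
    return [softmax([sum(q * k for q, k in zip(query, key)) / math.sqrt(width)
                     for key in keys])
            for query in queries]

def differential_attention(queries, keys, values):
    """Split the head in two, build two maps, and apply their difference."""
    half = len(queries[0]) // 2
    map_1 = attention_map([q[:half] for q in queries], [k[:half] for k in keys])
    map_2 = attention_map([q[half:] for q in queries], [k[half:] for k in keys])
    diff = [[a - b for a, b in zip(r1, r2)] for r1, r2 in zip(map_1, map_2)]
    out = [[sum(w * v[d] for w, v in zip(row, values))
            for d in range(len(values[0]))]
           for row in diff]
    return out, map_1, map_2, diff

# Toy keys (made up). First two entries feed sub-head 1, last two feed sub-head 2.
keys = [
    [3.0, 0.0, 0.0, 0.0],   # token 0: matched by the first sub-head only
    [0.0, 0.0, 3.0, 0.0],   # token 1: matched by the second sub-head only
    [0.0, 3.0, 0.0, 3.0],   # token 2: distractor that both sub-heads over-weight
]
queries = [[1.0, 1.0, 1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]   # the distractor carries a large value

out, map_1, map_2, diff = differential_attention(queries, keys, values)
print(map_1[0])
print(diff[0])
print(out[0])
```

In the first map alone, the distractor receives as much weight as the genuine match. Both sub-heads assign it the same weight, so it drops out of the difference and its large value no longer reaches the output. Token 1 ends up with a negative weight, which a difference of two maps allows.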
Treating audio, motion, and identity as separate streams scores well on benchmark metrics but produces visible artefacts in real performances. Unified Joint Attention targets the three places where those artefacts show up.
Fast consonants and wide vowels push the mouth into uncertain shapes. With identity in the same step as the phoneme, every shape stays plausible for this face.
When audio leads, a silence becomes an empty hole in the motion. Identity fills it with the speaker's own gaze and breathing, so the pause still reads as a person.
The same rising pitch should move two people differently. Letting emphasis meet identity before motion is set makes the gesture this speaker's own.
VOCAL 1.5 stays a diffusion-based generator, and at inference time it still supports guided generation across the signals that define a performance. You tune how strongly each one steers the result.
Some clips want stronger audio responsiveness, others want stricter identity preservation or steadier reference behaviour. Try the balance below.
Illustrative only. The sliders show the kind of control guided generation exposes, not exact inference parameters.
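In the same illustrative spirit, the sketch below shows one common way to expose a separate strength per signal in a diffusion generator: start from an unconditional prediction and add each signal's offset scaled by its own weight. This is an assumption, not the VOCAL 1.5 guidance rule, which the source does not specify. The function name, the signal names used as dictionary keys, and the toy vectors are all hypothetical.

```python
def guided_prediction(unconditional, conditional, scales):
    """Combine predictions as u + sum_i scale_i * (c_i - u), one term per signal."""
    guided = list(unconditional)
    for name, scale in scales.items():
        for d, (c, u) in enumerate(zip(conditional[name], unconditional)):
            guided[d] += scale * (c - u)
    return guided

# Toy two-dimensional predictions (made up for illustration).
unconditional = [0.0, 0.0]
conditional = {
    "audio": [1.0, 0.0],
    "identity": [0.0, 2.0],
    "reference": [0.5, 0.5],
}

# All scales at zero: the result falls back to the unconditional prediction.
neutral = guided_prediction(
    unconditional, conditional,
    {"audio": 0.0, "identity": 0.0, "reference": 0.0})

# Stronger audio responsiveness with moderate identity preservation.
audio_led = guided_prediction(
    unconditional, conditional,
    {"audio": 2.0, "identity": 1.0, "reference": 0.0})

print(neutral, audio_led)
```

Raising one scale pushes the result further along that signal's direction without changing the contribution of the others, which is the kind of per-signal balance the text describes.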
The table below summarises the five dimensions Unified Joint Attention moves. The pillars and architecture above explain why each one moves.
| Dimension | NEO | NEO 1.5 |
|---|---|---|
| Identity preservation | Generic | Speaker-faithful |
| Mouth-shape fit to face | Approximate | Geometry-aware |
| Stability on fast consonants | Wobbly | Locked |
| Pause and gaze coherence | Inconsistent | Lifelike |
| Lip-sync accuracy | Strong | Maintained |
NEO 1.5 is an architectural change. The dataset, the diffusion foundation, and the reinforcement-learning recipe all carry over from NEO, so the quality gain comes from how the attention pass is wired, not from more data. NEO 1.5 sits between NEO and our full-body successor, NEO 2.
NEO 1.5 is an upgrade to Colossyan's NEO talking-head model. The architectural change is concentrated in one place: VOCAL 1.5, the Voice-to-Latent stage, replaces NEO's two-stream attention with Unified Joint Attention. Audio, motion, and speaker identity travel as peer streams inside a single attention pass. The renderer, the data, and the rest of the pipeline are unchanged.
In NEO, audio and motion reasoned together inside the attention pass; identity reached the network as a side conditioner around that pass. NEO 1.5 promotes identity into the attention pass itself, so all three signals attend to one another before any motion is committed. The change shows up at phoneme boundaries, during silences, and on emphasised words.
An attention block where three streams (audio, motion, and identity) mix in a single shared field rather than as separately conditioned signals. Every token in every stream attends to every other. The block runs that joint pass, refines each stream on its own, and repeats through the network. Only the motion stream is read out as the final performance.
No. NEO 1.5 is trained on the same dataset that grounds NEO. The gain comes from the architecture, not from more data. Curated, high-quality human performance recordings, latent diffusion, and reinforcement learning still give the model its broad sense of how people speak, pause, emphasise, and feel.
Two refinements that make the joint attention pass behave well at depth. Attention residuals add a learned, input-dependent blend across layers, so the model can favour shallow or deep representations as needed for a given token. Differential attention splits each head into two sub-heads, computes two softmax maps over the same tokens, and uses their difference to cancel the noise floor that softmax always spreads over irrelevant tokens. The total attention cost matches standard attention.
Yes. VOCAL 1.5 stays a diffusion-based generator and supports guided generation across the signals that define a performance. You can tune how strongly audio responsiveness, identity preservation, and reference stability each steer the result. Some clips want stronger audio responsiveness; others want stricter identity preservation or steadier reference behaviour.
NEO 1.5 is a research preview today. We are running internal evaluations and customer pilots before general availability in the Colossyan editor and API. If you'd like to be considered for the early-access cohort, reach out via your account team or sign up for a trial.