Abstract
Dubbing video into other languages has historically required either re-recording with a native speaker or accepting the uncanny mismatch between audio and lip movements that comes with simple voice-over. Neither approach scales. The first destroys the original performance. The second looks wrong, and audiences notice immediately.
NEO 2 takes a different approach. Given a video of a person speaking and a target-language audio track, the system maps phonemes from the target language onto the speaker's existing facial movements. It modifies jaw position, lip shape, and tongue visibility frame by frame while preserving the speaker's identity, skin texture, and surrounding scene. The output is a photorealistic video in which the person appears to speak the target language natively.
The entire pipeline runs in a single forward pass. There is no iterative refinement, no manual correction, no post-processing. The model handles the phoneme-to-viseme mapping, facial deformation, and temporal smoothing in one step.
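To make that contract concrete, here is a minimal, self-contained sketch. Every name in it (VisemeParams, PHONEME_TO_VISEME, dub_parameters) is hypothetical and illustrative rather than the NEO 2 API; it only mirrors the idea of a single pass that looks up a target-language viseme for each frame and smooths the result over time.

```python
"""Toy illustration of a single-pass phoneme-to-viseme pipeline.
All names and values are assumptions for illustration, not NEO 2 internals."""
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class VisemeParams:
    jaw_open: float        # 0.0 = closed, 1.0 = fully open
    lip_rounding: float    # 0.0 = spread, 1.0 = rounded
    tongue_visible: float  # 0.0 = hidden, 1.0 = clearly visible


# Toy phoneme-to-viseme table (illustrative values only).
PHONEME_TO_VISEME = {
    "AA": VisemeParams(0.9, 0.2, 0.0),
    "UW": VisemeParams(0.4, 0.9, 0.0),
    "TH": VisemeParams(0.3, 0.1, 1.0),
    "M":  VisemeParams(0.0, 0.3, 0.0),
}
NEUTRAL = VisemeParams(0.1, 0.3, 0.0)


def temporal_smooth(params: Sequence[VisemeParams], window: int = 3) -> List[VisemeParams]:
    """Moving average over neighbouring frames so mouth shapes do not jitter."""
    out: List[VisemeParams] = []
    n = len(params)
    for i in range(n):
        lo, hi = max(0, i - window // 2), min(n, i + window // 2 + 1)
        chunk = params[lo:hi]
        out.append(VisemeParams(
            sum(p.jaw_open for p in chunk) / len(chunk),
            sum(p.lip_rounding for p in chunk) / len(chunk),
            sum(p.tongue_visible for p in chunk) / len(chunk),
        ))
    return out


def dub_parameters(phonemes_per_frame: Sequence[str]) -> List[VisemeParams]:
    """One pass: per-frame phoneme-to-viseme lookup, then temporal smoothing.
    A full system would also deform the face pixels toward each target shape;
    this sketch stops at the per-frame parameters."""
    targets = [PHONEME_TO_VISEME.get(p, NEUTRAL) for p in phonemes_per_frame]
    return temporal_smooth(targets)


if __name__ == "__main__":
    # Target-language phonemes aligned to each video frame (toy example).
    for params in dub_parameters(["M", "AA", "AA", "UW", "TH"]):
        print(params)
```

The point of the sketch is the shape of the interface: everything downstream of the aligned phonemes happens in one deterministic pass per frame, with smoothing as part of that pass rather than a separate post-processing stage.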


