We propose SALSA-V, a multimodal video-to-audio generation model capable of synthesizing highly
synchronized, high-fidelity long-form audio from silent video content. Our approach introduces a masked
diffusion objective, enabling audio-conditioned generation and the seamless synthesis of audio sequences
of unconstrained length. Additionally, by integrating a shortcut loss into our training process, we achieve
rapid generation of high-quality audio samples in as few as eight sampling steps, paving the way for
near-real-time applications without requiring dedicated fine-tuning or retraining. We demonstrate that
SALSA-V significantly outperforms existing state-of-the-art methods in audiovisual alignment and
synchronization with video content, as measured by both quantitative evaluation and a human listening study. Furthermore,
our use of random masking during training enables our model to match the spectral characteristics of
reference audio samples, broadening its applicability to professional audio synthesis tasks such as
Foley generation and sound design.
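As a minimal sketch of how such a combined objective can be instantiated (assuming a flow-matching parameterization in which $v_\theta(x_t, t, d)$ predicts a velocity at step size $d$, with the video condition and the clean unmasked audio frames supplied as additional conditioning; the symbols $m$, $d$, and the weight $\lambda$ are illustrative rather than taken from SALSA-V's exact formulation):
% Illustrative sketch only; notation and weighting are assumptions, not the paper's exact losses.
\begin{align}
  x_t &= (1 - t)\,x_0 + t\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \\
  \mathcal{L}_{\mathrm{mask}} &= \mathbb{E}\,\big\| m \odot \big( v_\theta(x_t, t, 0) - (\epsilon - x_0) \big) \big\|_2^2, \\
  \mathcal{L}_{\mathrm{shortcut}} &= \mathbb{E}\,\big\| v_\theta(x_t, t, 2d) - \mathrm{sg}\!\big( \tfrac{1}{2}\big( v_\theta(x_t, t, d) + v_\theta(x'_{t-d}, t - d, d) \big) \big) \big\|_2^2, \\
  \mathcal{L} &= \mathcal{L}_{\mathrm{mask}} + \lambda\, \mathcal{L}_{\mathrm{shortcut}},
\end{align}
where $x'_{t-d} = x_t - d\, v_\theta(x_t, t, d)$, $m$ is a random binary mask over audio frames, and $\mathrm{sg}(\cdot)$ denotes stop-gradient. In a sketch of this form, unmasked frames act as clean reference conditioning, so a single network can support reference-matched infilling, chunk-wise extension to unconstrained length, and fully masked (unconditional) generation, while the self-consistency term keeps large-step predictions aligned with composed small steps, which is what makes sampling in as few as eight steps viable.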