Meta has released SAM Audio, a prompt-driven audio separation model that targets a common editing bottleneck: isolating one sound from a real-world mix without building a custom model per sound class. Meta has released 3 main sizes, sam-audio-small, sam-audio-base, and sam-audio-large. The model is available to download and to try in the Segment Anything Playground.
Architecture
SAM Audio uses separate encoders for each conditioning signal: an audio encoder for the mixture, a text encoder for the natural-language description, a span encoder for time anchors, and a visual encoder that consumes a visual prompt derived from video plus an object mask. The encoded streams are concatenated into time-aligned features and processed by a diffusion transformer that applies self-attention over the time-aligned representation and cross-attention to the text features. A DACVAE decoder then reconstructs waveforms and emits 2 outputs, target audio and residual audio.
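To make that flow concrete, here is a minimal PyTorch sketch of the described data path. This is not Meta's implementation: the placeholder linear layers stand in for the real audio, text, span, and visual encoders and for the DACVAE decoder, and the feature dimensions and concat-then-project fusion step are assumptions.

```python
# Illustrative sketch of the conditioning flow described above, not Meta's code.
# Dimensions, the concat-then-project fusion, and all placeholder modules are assumptions.
import torch
import torch.nn as nn

class SamAudioFlowSketch(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_layers=4):
        super().__init__()
        # Stand-ins for the per-signal encoders (audio mixture, text, time span, visual prompt).
        self.audio_enc = nn.Linear(128, d_model)
        self.text_enc = nn.Linear(300, d_model)
        self.span_enc = nn.Linear(2, d_model)
        self.visual_enc = nn.Linear(256, d_model)
        # Fuse the time-aligned streams: concatenate along the feature axis, then project.
        self.fuse = nn.Linear(3 * d_model, d_model)
        # Diffusion-transformer-style blocks: self-attention over time plus
        # cross-attention to the text features.
        self.self_attn = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Placeholder for the DACVAE decoder; emits two stems per frame.
        self.decoder = nn.Linear(d_model, 2)

    def forward(self, mix, text, span, visual):
        # Encode each conditioning stream separately, then fuse the time-aligned ones.
        h = self.fuse(torch.cat(
            [self.audio_enc(mix), self.span_enc(span), self.visual_enc(visual)], dim=-1))
        txt = self.text_enc(text)
        for block in self.self_attn:
            h = block(h)                               # self-attention over the time axis
            h = h + self.cross_attn(h, txt, txt)[0]    # cross-attention to the text features
        out = self.decoder(h)
        return out[..., 0], out[..., 1]                # target stem, residual stem

# Shape check only: (batch, time, feature_dim) for the time-aligned streams.
target, residual = SamAudioFlowSketch()(
    torch.randn(1, 400, 128), torch.randn(1, 77, 300),
    torch.randn(1, 400, 2), torch.randn(1, 400, 256),
)
print(target.shape, residual.shape)  # torch.Size([1, 400]) torch.Size([1, 400])
```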
What SAM Audio does, and what ‘segment’ means here
SAM Audio takes an input recording that contains multiple overlapping sources, for example speech plus traffic plus music, and separates out a target source based on a prompt. In the public inference API, the model produces 2 outputs, result.target and result.residual. The research team describes target as the isolated sound, and residual as everything else.
That target-plus-residual interface maps directly to editor operations. If you want to remove a dog bark from a podcast track, you treat the bark as the target and subtract it by keeping only the residual. If you want to extract a guitar part from a concert clip, you keep the target waveform instead. Meta uses exactly these kinds of examples to explain what the model is meant to enable, as sketched below.
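As a minimal sketch of those two edits, assuming you already have the model's outputs in hand: the SeparationResult stand-in and the helper functions below are illustrative, and only the target and residual field names come from Meta's description of the inference API.

```python
# Hypothetical workflow for the two edits above. SeparationResult is a stand-in for
# whatever object the real API returns; only the field names `target` and `residual`
# come from Meta's description.
from dataclasses import dataclass
import numpy as np
import soundfile as sf

@dataclass
class SeparationResult:          # placeholder for the API's result object
    target: np.ndarray           # the isolated sound named by the prompt
    residual: np.ndarray         # everything else in the mix

def remove_sound(result: SeparationResult, path: str, sr: int) -> None:
    """Removal edit, e.g. a dog bark: drop the target, keep only the residual."""
    sf.write(path, result.residual, sr)

def extract_sound(result: SeparationResult, path: str, sr: int) -> None:
    """Extraction edit, e.g. a guitar part: keep only the isolated target."""
    sf.write(path, result.target, sr)

# Dummy arrays just to show the calls; in practice `result` comes from running SAM Audio.
sr = 16_000
dummy = SeparationResult(target=np.zeros(sr), residual=np.zeros(sr))
remove_sound(dummy, "podcast_no_bark.wav", sr)
extract_sound(dummy, "guitar_only.wav", sr)
```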
The 3 prompt types Meta is shipping
Meta positions SAM Audio as a single unified model that supports 3 prompt types, which can be used alone or combined. Matching the conditioning encoders described above, the prompts are:

- Text prompts, a natural-language description of the sound to isolate.
- Visual prompts, derived from video plus an object mask that points at the sounding object.
- Time-span prompts, anchors that mark when the target sound occurs in the recording.
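As an illustration of prompts used alone or combined, here is a small sketch of how such a request could be structured. The PromptSpec container and its field names are assumptions for illustration, not the actual sam-audio interface.

```python
# Hedged sketch of combining the three prompt types in one request. The structure and
# field names here are illustrative assumptions, not the real sam-audio API.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class PromptSpec:
    text: Optional[str] = None                    # natural-language description
    span: Optional[Tuple[float, float]] = None    # (start_s, end_s) time anchor
    visual_mask_path: Optional[str] = None        # object mask derived from the video

# Prompts can be used alone...
text_only = PromptSpec(text="a crying baby")

# ...or combined, e.g. a description plus a time anchor plus a masked object in the video.
combined = PromptSpec(
    text="the lead singer's voice",
    span=(12.0, 18.5),
    visual_mask_path="frames/singer_mask.png",
)
```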
Results
The Meta team positions SAM Audio as achieving cutting-edge performance across diverse, real-world scenarios and frames it as a unified alternative to single-purpose audio tools. The team publishes a subjective evaluation table across the categories General, SFX, Speech, Speaker, Music, Instr(wild), and Instr(pro), with General scores of 3.62 for sam-audio-small, 3.28 for sam-audio-base, and 3.50 for sam-audio-large, and Instr(pro) scores reaching 4.49 for sam-audio-large.
Key Takeaways

- SAM Audio is a prompt-driven audio separation model that isolates a target sound from a real-world mix and returns both a target and a residual track.
- It ships in 3 sizes, sam-audio-small, sam-audio-base, and sam-audio-large, available to download and to try in the Segment Anything Playground.
- A single unified model supports 3 prompt types, text, visual, and time span, which can be used alone or combined.
- Architecturally, per-modality encoders feed time-aligned features into a diffusion transformer, and a DACVAE decoder reconstructs the target and residual waveforms.
- In Meta's subjective evaluation, General scores range from 3.28 to 3.62 across sizes, with Instr(pro) reaching 4.49 for sam-audio-large.
Check out the Technical details and the GitHub Page.




