SAM Audio integration

SAM Audio is an AI-powered audio isolation API that extracts specific sounds from audio or video files based on natural language descriptions. Describe the sound you want to isolate (vocals, speech, instruments, or sound effects) and receive a clean WAV file containing only that sound. The API accepts audio files (WAV, MP3, FLAC, OGG, M4A) and video files (MP4, MOV, WEBM, AVI) as input.

Key capabilities

  • Text-guided isolation: Describe any sound to extract (e.g., “A person speaking”, “Piano playing”, “Dog barking”)
  • Multi-format input: Accepts audio (WAV, MP3, FLAC, OGG, M4A) or video (MP4, MOV, WEBM, AVI) files
  • Video localization: Optional bounding box (x1, y1, x2, y2) to focus on specific areas in video
  • Quality tuning: Adjust reranking_candidates (1-8) to balance quality vs. latency
  • Event detection: Enable predict_spans for better isolation of non-ambient sounds
  • WAV output: High-quality WAV audio file with the isolated sound
  • Async processing: Webhook notifications or polling for task completion

Use cases

  • Music production: Extract vocals from songs for remixes or karaoke tracks
  • Podcast editing: Isolate speech from background noise or music
  • Film post-production: Separate dialogue from ambient sounds for audio mixing
  • Sound design: Extract specific sound effects from video recordings
  • Transcription services: Clean up audio by isolating speech before transcription
  • Instrument isolation: Separate specific instruments from full band recordings

Isolate audio with SAM Audio

Submit an audio or video file with a text description of the sound to isolate. The service returns a task ID for async polling or webhook notification.
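The polling side of this flow can be sketched as a small loop. This is a minimal sketch under stated assumptions, not the service's client: the page does not document the status endpoint or the response shape, so `fetch_status` is a caller-supplied function (hypothetical) that returns a dict, and the `status` field with the values `COMPLETED` and `FAILED` is an assumed convention that should be replaced with whatever the real API returns.

```python
import time
from typing import Any, Callable, Dict

# Assumed terminal states; the real API may use different names.
TERMINAL_STATUSES = ("COMPLETED", "FAILED")


def wait_for_task(
    fetch_status: Callable[[str], Dict[str, Any]],
    task_id: str,
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll fetch_status(task_id) until the task reaches a terminal state."""
    deadline = time.monotonic() + timeout_s
    while True:
        task = fetch_status(task_id)
        if task.get("status") in TERMINAL_STATUSES:
            return task
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} did not finish in {timeout_s}s")
        # Wait before asking again so the service is not hammered.
        sleep(interval_s)
```

Passing `fetch_status` and `sleep` in as arguments keeps the loop independent of any particular HTTP client and makes it easy to exercise with a fake.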

Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| description | string | Yes | - | Text description of the sound to isolate (e.g., “A person speaking”, “Piano playing”) |
| audio | string | No* | - | URL or base64-encoded audio file (WAV, MP3, FLAC, OGG, M4A) |
| video | string | No* | - | URL or base64-encoded video file (MP4, MOV, WEBM, AVI) |
| x1 | integer | No | 0 | Bounding box left coordinate for video localization (0 = full frame) |
| y1 | integer | No | 0 | Bounding box top coordinate for video localization (0 = full frame) |
| x2 | integer | No | 0 | Bounding box right coordinate for video localization (0 = full frame) |
| y2 | integer | No | 0 | Bounding box bottom coordinate for video localization (0 = full frame) |
| sample_fps | integer | No | 2 | Frame sampling rate for video (1-5 FPS) |
| reranking_candidates | integer | No | 1 | Quality vs. latency trade-off (1-8, higher = better quality, slower) |
| predict_spans | boolean | No | false | Enable for better isolation of non-ambient, event-based sounds |
| webhook_url | string | No | - | URL for task completion notification |
*Either audio or video must be provided, but not both.
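The constraints in the table can be checked on the client before a request is sent. The sketch below assembles a request body using the documented parameter names and ranges. The function name `build_isolation_payload` and the `box` tuple are illustrative, not part of the API. Sending the bounding box and `sample_fps` only for video input is an assumption based on the parameter descriptions. The endpoint URL and authentication are not given on this page, so they are left out.

```python
from typing import Any, Dict, Optional, Tuple


def build_isolation_payload(
    description: str,
    audio: Optional[str] = None,
    video: Optional[str] = None,
    box: Tuple[int, int, int, int] = (0, 0, 0, 0),
    sample_fps: int = 2,
    reranking_candidates: int = 1,
    predict_spans: bool = False,
    webhook_url: Optional[str] = None,
) -> Dict[str, Any]:
    """Validate inputs against the documented ranges and return a request body."""
    if not description.strip():
        raise ValueError("description is required")
    # Exactly one of audio / video must be supplied.
    if (audio is None) == (video is None):
        raise ValueError("provide either audio or video, but not both")
    if not 1 <= reranking_candidates <= 8:
        raise ValueError("reranking_candidates must be between 1 and 8")

    payload: Dict[str, Any] = {
        "description": description,
        "reranking_candidates": reranking_candidates,
        "predict_spans": predict_spans,
    }
    if audio is not None:
        payload["audio"] = audio
    else:
        if not 1 <= sample_fps <= 5:
            raise ValueError("sample_fps must be between 1 and 5")
        x1, y1, x2, y2 = box
        # Assumption: video-only fields are sent only with video input.
        payload.update(video=video, x1=x1, y1=y1, x2=x2, y2=y2,
                       sample_fps=sample_fps)
    if webhook_url:
        payload["webhook_url"] = webhook_url
    return payload
```

For example, `build_isolation_payload("A person speaking", audio="https://example.com/talk.mp3")` yields a body with only the audio-related fields, while passing both `audio` and `video` raises a `ValueError`.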

Frequently Asked Questions

What is SAM Audio and how does it work?
SAM Audio is an AI-powered audio isolation API that uses text descriptions to identify and extract specific sounds from audio or video files. You submit a file with a description of the target sound (e.g., “A person speaking”), receive a task ID immediately, then poll for results or receive a webhook notification. The output is a WAV file containing only the isolated sound.

Which input formats are supported?
For audio input: WAV, MP3, FLAC, OGG, and M4A formats. For video input: MP4, MOV, WEBM, and AVI formats. Files can be provided as URLs or base64-encoded strings.

How should I write the description?
Be specific and descriptive. Good examples: “A person speaking”, “Piano playing in the background”, “Dog barking loudly”, “Acoustic guitar strumming”. Avoid vague descriptions like “music” or “noise”; instead, specify the type of music or sound you want to isolate.

What does reranking_candidates do?
The reranking_candidates parameter (1-8) controls the quality vs. speed trade-off. Higher values produce better isolation quality but take longer to process. Use 1 for the fastest results and 8 for the highest quality. The default is 1.

When should I enable predict_spans?
Enable predict_spans when isolating non-ambient, event-based sounds such as speech, individual notes, or sound effects. Keep it disabled (the default) for continuous ambient sounds such as background music or environmental noise.

How does video localization work?
For video input, you can specify a bounding box (x1, y1, x2, y2) to focus on sounds originating from a specific area of the frame. This is useful when you want to isolate audio from a particular person or object in the video. Set all values to 0 (the default) to process the full frame.

What is the output format?
SAM Audio outputs a high-quality WAV audio file containing only the isolated sound. This uncompressed format is well suited to further editing or processing in audio production workflows.
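When a file is supplied as a base64-encoded string rather than a URL, a helper along these lines can choose the `audio` or `video` field from the file extension and encode the bytes. It is a sketch under assumptions: the helper name `encode_media_field` is hypothetical, and the page does not say whether the API expects plain base64 or a data URI, so plain standard base64 is assumed here.

```python
import base64
from pathlib import Path
from typing import Dict

# Formats listed in the documentation.
AUDIO_EXTENSIONS = {".wav", ".mp3", ".flac", ".ogg", ".m4a"}
VIDEO_EXTENSIONS = {".mp4", ".mov", ".webm", ".avi"}


def encode_media_field(path: str) -> Dict[str, str]:
    """Return {"audio": <base64>} or {"video": <base64>} for a local file."""
    file_path = Path(path)
    suffix = file_path.suffix.lower()
    if suffix in AUDIO_EXTENSIONS:
        field = "audio"
    elif suffix in VIDEO_EXTENSIONS:
        field = "video"
    else:
        raise ValueError(f"unsupported file type: {suffix or '(none)'}")
    # Assumption: the API accepts plain standard base64 without a data-URI prefix.
    encoded = base64.b64encode(file_path.read_bytes()).decode("ascii")
    return {field: encoded}
```

The returned dict can be merged into the request body alongside `description`. For large files, a URL is likely the more practical option because base64 increases the payload size by roughly a third.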

Best practices

  • Description specificity: Use detailed descriptions for better isolation accuracy
  • Input quality: Higher quality input audio/video produces better isolation results
  • Quality tuning: Start with reranking_candidates=1 for testing, increase for production
  • Event sounds: Enable predict_spans for speech, music notes, or sound effects
  • Video focus: Use bounding boxes to isolate sounds from specific video regions
  • Production integration: Use webhooks instead of polling for scalable applications
  • Error handling: Implement retry logic with exponential backoff for 503 errors
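
The retry advice in the last item can be sketched as follows. This is one possible shape, not a prescribed implementation: `send` is a caller-supplied function (hypothetical) that performs the request and returns a `(status_code, body)` tuple, and the attempt count, base delay, cap, and jitter values are illustrative defaults rather than documented limits.

```python
import random
import time
from typing import Any, Callable, Tuple


def call_with_backoff(
    send: Callable[[], Tuple[int, Any]],
    max_attempts: int = 5,
    base_delay_s: float = 1.0,
    max_delay_s: float = 30.0,
    jitter: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, Any]:
    """Call send(), retrying on HTTP 503 with exponentially growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        # Return on any non-503 response, or when the retries are used up.
        if status != 503 or attempt == max_attempts - 1:
            break
        # Delay doubles each attempt (1s, 2s, 4s, ...) up to the cap.
        delay = min(max_delay_s, base_delay_s * 2 ** attempt)
        # A little random jitter keeps many clients from retrying in lockstep.
        sleep(delay + random.uniform(0, delay * jitter))
    return status, body
```

Only 503 responses are retried here, matching the guidance above. Other status codes are returned to the caller unchanged so they can be handled explicitly.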