SAM Audio integration
Powered by SAM Audio technology, this API isolates specific sounds from audio or video files using natural language descriptions.
Key capabilities
- Text-guided isolation: Describe any sound to extract (e.g., “A person speaking”, “Piano playing”, “Dog barking”)
- Multi-format input: Accepts audio (WAV, MP3, FLAC, OGG, M4A) or video (MP4, MOV, WEBM, AVI) files
- Video localization: Optional bounding box (
x1,y1,x2,y2) to focus on specific areas in video - Quality tuning: Adjust
reranking_candidates(1-8) to balance quality vs. latency - Event detection: Enable
predict_spansfor better isolation of non-ambient sounds - WAV output: High-quality WAV audio file with the isolated sound
- Async processing: Webhook notifications or polling for task completion
Use cases
- Music production: Extract vocals from songs for remixes or karaoke tracks
- Podcast editing: Isolate speech from background noise or music
- Film post-production: Separate dialogue from ambient sounds for audio mixing
- Sound design: Extract specific sound effects from video recordings
- Transcription services: Clean up audio by isolating speech before transcription
- Instrument isolation: Separate specific instruments from full band recordings
Isolate audio with SAM Audio
Submit an audio or video file with a text description of the sound to isolate. The service returns a task ID for async polling or webhook notification.Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
description | string | Yes | - | Text description of the sound to isolate (e.g., “A person speaking”, “Piano playing”) |
audio | string | No* | - | URL or base64-encoded audio file (WAV, MP3, FLAC, OGG, M4A) |
video | string | No* | - | URL or base64-encoded video file (MP4, MOV, WEBM, AVI) |
x1 | integer | No | 0 | Bounding box left coordinate for video localization (0 = full frame) |
y1 | integer | No | 0 | Bounding box top coordinate for video localization (0 = full frame) |
x2 | integer | No | 0 | Bounding box right coordinate for video localization (0 = full frame) |
y2 | integer | No | 0 | Bounding box bottom coordinate for video localization (0 = full frame) |
sample_fps | integer | No | 2 | Frame sampling rate for video (1-5 FPS) |
reranking_candidates | integer | No | 1 | Quality vs. latency trade-off (1-8, higher = better quality, slower) |
predict_spans | boolean | No | false | Enable for better isolation of non-ambient, event-based sounds |
webhook_url | string | No | - | URL for task completion notification |
audio or video must be provided, but not both.
Frequently Asked Questions
What is SAM Audio and how does it work?
What is SAM Audio and how does it work?
SAM Audio is an AI-powered audio isolation API that uses text descriptions to identify and extract specific sounds from audio or video files. You submit a file with a description of the target sound (e.g., “A person speaking”), receive a task ID immediately, then poll for results or receive a webhook notification. The output is a WAV file containing only the isolated sound.
What audio and video formats does SAM Audio support?
What audio and video formats does SAM Audio support?
For audio input: WAV, MP3, FLAC, OGG, and M4A formats. For video input: MP4, MOV, WEBM, and AVI formats. Files can be provided as URLs or base64-encoded strings.
How do I write effective sound descriptions?
How do I write effective sound descriptions?
Be specific and descriptive. Good examples: “A person speaking”, “Piano playing in the background”, “Dog barking loudly”, “Acoustic guitar strumming”. Avoid vague descriptions like “music” or “noise” - instead specify what type of music or sound you want to isolate.
What is the reranking_candidates parameter for?
What is the reranking_candidates parameter for?
The
reranking_candidates parameter (1-8) controls the quality vs. speed trade-off. Higher values produce better isolation quality but take longer to process. Use 1 for fastest results, 8 for highest quality. Default is 1.When should I enable predict_spans?
When should I enable predict_spans?
Enable
predict_spans when isolating non-ambient, event-based sounds like speech, individual notes, or sound effects. Keep it disabled (default) for continuous ambient sounds like background music or environmental noise.How does video localization work with bounding boxes?
How does video localization work with bounding boxes?
For video input, you can specify a bounding box (
x1, y1, x2, y2) to focus on sounds originating from a specific area of the frame. This is useful when you want to isolate audio from a particular person or object in the video. Set all values to 0 (default) to process the full frame.What is the output format?
What is the output format?
SAM Audio outputs a high-quality WAV audio file containing only the isolated sound. This uncompressed format is ideal for further editing or processing in audio production workflows.
Best practices
- Description specificity: Use detailed descriptions for better isolation accuracy
- Input quality: Higher quality input audio/video produces better isolation results
- Quality tuning: Start with
reranking_candidates=1for testing, increase for production - Event sounds: Enable
predict_spansfor speech, music notes, or sound effects - Video focus: Use bounding boxes to isolate sounds from specific video regions
- Production integration: Use webhooks instead of polling for scalable applications
- Error handling: Implement retry logic with exponential backoff for 503 errors
Related APIs
- Sound Effects: Generate sound effects from text descriptions
- Lip Sync: Synchronize lip movements with audio
- OmniHuman 1.5: Generate human animations driven by audio