
Alibaba WAN 2.7 integration

WAN 2.7 Reference-to-Video is an AI video generation API that creates MP4 videos featuring specific characters from reference images or videos. You provide up to 5 character references (images and/or videos combined), then describe a scene in the prompt using labels like “Image 1” or “Video 1” to place those characters. The model maintains the visual identity of referenced characters across the generated video, and each reference can optionally include a voice reference. Output is available at 720P (1280x720) or 1080P (1920x1080) resolution with durations from 2 to 10 seconds.

Key capabilities

  • Character references: Provide up to 5 combined character images and videos for identity preservation
  • Prompt-based character placement: Reference characters as “Image 1”, “Image 2”, “Video 1” in the prompt
  • Voice-guided generation: Optionally include reference_voice audio per character for voice-guided output
  • Resolution options: 720P (1280x720) and 1080P (1920x1080) output
  • 5 aspect ratios: 16:9, 9:16, 1:1, 4:3, 3:4 (or auto-detect from start_image_url)
  • Flexible durations: 2 to 10 seconds of video output
  • Start frame control: Optionally provide start_image_url to set the first frame and auto-detect aspect ratio
  • Async processing: Webhook notifications or polling for task completion

How character references work

  1. Provide character images via image_urls (JPEG/PNG/BMP/WEBP, 240-8000px, max 20MB each)
  2. Provide character videos via video_urls (MP4/MOV, max 100MB each)
  3. Combined total of images + videos must not exceed 5
  4. Reference characters in the prompt using position labels: “Image 1”, “Image 2”, “Video 1”, “Video 2”
  5. Optionally add reference_voice audio URL per character for voice-guided generation
Example prompt:
“Image 1 and Image 2 are walking together in a park while Video 1 plays guitar in the background.”
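
The ordering rules above can be sketched as a small helper that maps each reference, in the order provided, to its prompt label. The function name and error messages here are illustrative, not part of the API:

```python
def label_references(image_urls, video_urls):
    """Map references to prompt labels: images become "Image 1", "Image 2", ...
    and videos become "Video 1", "Video 2", ..., each numbered by position."""
    if not image_urls and not video_urls:
        raise ValueError("At least one image or video reference is required")
    if len(image_urls) + len(video_urls) > 5:
        raise ValueError("Combined images + videos must not exceed 5")
    labels = {}
    for i, ref in enumerate(image_urls, start=1):
        labels[f"Image {i}"] = ref
    for i, ref in enumerate(video_urls, start=1):
        labels[f"Video {i}"] = ref
    return labels
```

Note that images and videos are numbered independently, so a request with two images and one video yields the labels “Image 1”, “Image 2”, and “Video 1”.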

Use cases

  • Consistent character videos: Generate multiple videos with the same character across different scenes
  • Multi-character narratives: Create scenes with up to 5 characters interacting
  • Branded content: Maintain consistent mascot or spokesperson identity across video campaigns
  • Voice-synchronized video: Guide character motion using voice references for natural lip and gesture sync
  • Social media series: Create episodic content with recurring characters
  • Virtual presenters: Generate videos of a reference person in different settings

API operations

Generate videos by submitting character references and a prompt to the API. The service returns a task ID for async polling or webhook notification.

POST /v1/ai/reference-to-video/wan-2-7

Create a new reference-to-video generation task

GET /v1/ai/reference-to-video/wan-2-7

List all WAN 2.7 R2V tasks with status

GET /v1/ai/reference-to-video/wan-2-7/{task-id}

Get task status and results by ID
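
A minimal Python sketch of this submit-then-poll lifecycle. The base URL, Bearer-token auth header, and response field names ("task_id", "status", and the "pending"/"processing" states) are assumptions for illustration; check the full API reference for the exact shapes.

```python
import json
import time
import urllib.request

BASE = "https://api.example.com"  # assumed host, replace with the real one
ENDPOINT = "/v1/ai/reference-to-video/wan-2-7"

def create_task_request(payload, api_key):
    """Build the POST request that creates a new generation task."""
    return urllib.request.Request(
        BASE + ENDPOINT,
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )

def task_status_url(task_id):
    """URL for fetching a single task's status and results by ID."""
    return f"{BASE}{ENDPOINT}/{task_id}"

def poll(task_id, fetch, interval=5.0, timeout=600.0):
    """Call `fetch(url)` until the task leaves a pending state or times out.
    `fetch` performs the GET and returns the decoded JSON as a dict."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = fetch(task_status_url(task_id))
        if result.get("status") not in ("pending", "processing"):
            return result
        time.sleep(interval)
    raise TimeoutError(f"task {task_id} did not finish in {timeout}s")
```

For production use, prefer the webhook_url parameter over polling, as noted in the best practices below.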

Parameters

  • prompt (string, required): Scene description referencing characters as “Image 1”, “Video 1”, etc. Max 5000 characters
  • negative_prompt (string, optional): Elements to avoid (e.g., “blurry, watermark”). Max 500 characters
  • image_urls (array, conditional): Character reference images. Each item has url (required) and optional reference_voice
  • video_urls (array, conditional): Character reference videos. Each item has url (required) and optional reference_voice
  • start_image_url (string, optional): First-frame image. If provided, overrides aspect_ratio with the image's dimensions
  • aspect_ratio (string, optional, default "16:9"): Output ratio: "16:9", "9:16", "1:1", "4:3", "3:4"
  • resolution (string, optional, default "1080P"): Output resolution: "720P" or "1080P"
  • duration (integer, optional, default 5): Video length in seconds: 2 to 10
  • seed (integer, optional, default random): Seed for reproducibility (0 to 2147483647)
  • additional_settings.prompt_extend (boolean, optional, default true): Enable AI prompt expansion for richer output
  • webhook_url (string, optional): URL for async status notifications

At least one of image_urls or video_urls is required, and their combined count must not exceed 5.
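
The documented limits can be checked client-side before submission to fail fast on invalid requests. This sketch mirrors the constraints listed above; the helper itself and its error messages are not part of the API:

```python
ASPECT_RATIOS = {"16:9", "9:16", "1:1", "4:3", "3:4"}
RESOLUTIONS = {"720P", "1080P"}

def validate_payload(p):
    """Raise ValueError if payload dict `p` violates a documented constraint."""
    if not p.get("prompt"):
        raise ValueError("prompt is required")
    if len(p["prompt"]) > 5000:
        raise ValueError("prompt exceeds 5000 characters")
    if len(p.get("negative_prompt", "")) > 500:
        raise ValueError("negative_prompt exceeds 500 characters")
    refs = len(p.get("image_urls", [])) + len(p.get("video_urls", []))
    if refs == 0:
        raise ValueError("at least one image or video reference is required")
    if refs > 5:
        raise ValueError("combined references must not exceed 5")
    if p.get("aspect_ratio", "16:9") not in ASPECT_RATIOS:
        raise ValueError("invalid aspect_ratio")
    if p.get("resolution", "1080P") not in RESOLUTIONS:
        raise ValueError("invalid resolution")
    if not 2 <= p.get("duration", 5) <= 10:
        raise ValueError("duration must be 2 to 10 seconds")
    if not 0 <= p.get("seed", 0) <= 2147483647:
        raise ValueError("seed out of range")
    return p
```

Server-side validation still applies; this only catches obvious mistakes before a round trip.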

Frequently Asked Questions

What is WAN 2.7 Reference-to-Video?
WAN 2.7 Reference-to-Video is an AI video generation API developed by Alibaba. You provide character reference images or videos along with a text prompt that describes a scene. The model generates a video featuring those characters while preserving their visual identity. You receive a task ID immediately, then poll for results or receive a webhook notification.

How many character references can I provide?
You can provide up to 5 combined character references (images + videos). For example: 3 character images and 2 character videos, or 5 images and 0 videos. At least one image or video reference is required.

How do I reference characters in the prompt?
Use position labels based on the order you provide references. Image references are labeled “Image 1”, “Image 2”, etc. Video references are labeled “Video 1”, “Video 2”, etc. Example: “Image 1 and Video 1 are having a conversation at a cafe.”

How does voice-guided generation work?
Each character reference (image or video) can include an optional reference_voice URL pointing to an audio file. The model uses this voice to guide character motion and lip movement in the generated video, creating more natural character animation.

What file formats are supported?
Image references: JPEG, PNG, BMP, WEBP (240-8000px per side, max 20MB). Video references: MP4, MOV (max 100MB). All files must be at publicly accessible URLs.

What are the rate limits?
Rate limits depend on your subscription tier. See the Rate Limits page for current limits by plan.

How much does it cost?
See the Pricing page for current rates and subscription options.

Best practices

  • Character images: Use clear, well-lit images with the character prominently visible. Avoid busy backgrounds.
  • Character videos: Shorter reference videos with clear character visibility produce better identity preservation.
  • Prompt structure: Explicitly name each character by label (“Image 1 walks left while Image 2 sits down”) for predictable placement.
  • Voice references: Provide clean audio clips with minimal background noise for best voice-guided results.
  • Duration selection: Reference-to-Video supports 2-10 seconds. Start with shorter durations for iteration.
  • Negative prompts: Include terms such as “blurry, low quality, watermark, text, distortion, extra limbs” to suppress common artifacts.
  • Production integration: Use webhooks for scalable applications instead of polling.
  • Error handling: Implement retry with exponential backoff for 503 errors during high-demand periods.