Alibaba WAN 2.6 integration

WAN 2.6 is Alibaba’s latest video generation model, available as a versatile API that supports both image-to-video (i2v) and text-to-video (t2v) workflows. It generates cinematic videos with smooth motion, strong temporal consistency, and detailed visuals at 720p or 1080p resolution. Outputs run 5, 10, or 15 seconds, and advanced features like AI prompt expansion and multi-shot composition enable richer narratives.

Key capabilities

  • Dual input modes: Generate video from an image (i2v) or purely from text (t2v)
  • Resolution options: 720p (1280x720) and 1080p (1920x1080) in landscape or portrait
  • Flexible durations: 5, 10, or 15 second video outputs
  • Prompt expansion: AI optimizer expands short prompts into detailed scripts for richer output
  • Multi-shot sequences: Break prompts into multiple shots for narrative depth (requires prompt expansion)
  • Negative prompts: Exclude unwanted elements like watermarks, blur, or distortion
  • Reproducible results: Fixed seed support for consistent generation
  • Async processing: Webhook notifications or polling for task completion

Use cases

  • Marketing videos: Create product showcases and brand content from images or descriptions
  • Social media content: Generate short-form videos for TikTok, Instagram, and YouTube
  • Concept visualization: Transform static designs or ideas into motion
  • Storyboarding: Use multi-shot mode to prototype narrative sequences
  • Educational content: Illustrate concepts with AI-generated video explanations
  • Creative exploration: Experiment with text prompts to generate unique visual content

Generation modes

| Mode | Input | Best for |
| --- | --- | --- |
| Image-to-Video (i2v) | Image URL + prompt | Animating existing images, product videos, bringing artwork to life |
| Text-to-Video (t2v) | Text prompt only | Creating videos from scratch, concept exploration, narrative content |
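In practice, the two modes differ only in whether an `image` URL is supplied. A minimal sketch of the two payloads, using the field names from the Parameters table below (the surrounding request structure, endpoint, and authentication are assumptions, not documented on this page):

```python
# Example request payloads for the two generation modes.
# Field names follow the Parameters table; anything beyond them
# (endpoint, auth, response shape) is an illustrative assumption.

i2v_payload = {
    "prompt": "The product slowly rotates as soft studio light sweeps across it",
    "image": "https://example.com/product.jpg",  # keyframe image, i2v only
    "size": "1280*720",
    "duration": "5",
}

t2v_payload = {
    "prompt": "A drone shot gliding over a misty mountain lake at sunrise",
    "size": "1920*1080",
    "duration": "10",
}

# i2v adds the keyframe image; everything else is shared between modes.
print("image" in i2v_payload, "image" in t2v_payload)
```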

Resolution variants

| Variant | Resolution | Orientation | Use case |
| --- | --- | --- | --- |
| 720p Landscape | 1280x720 | Horizontal | Web videos, presentations |
| 720p Portrait | 720x1280 | Vertical | Social media stories, mobile |
| 1080p Landscape | 1920x1080 | Horizontal | High-quality marketing, YouTube |
| 1080p Portrait | 1080x1920 | Vertical | Premium social content, ads |

API Operations

Image-to-Video (i2v)

Generate videos from an input image with motion guidance via prompts.

Text-to-Video (t2v)

Generate videos purely from text descriptions without an input image.

Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| prompt | string | Yes | - | Scene description, motion, camera moves, style. Max 2000 characters |
| image | string | i2v only | - | URL of the keyframe image to animate (JPEG, PNG, WebP) |
| size | string | No | 1280*720 | Output size: 1280*720, 720*1280, 1920*1080, 1080*1920 |
| duration | string | No | 5 | Video length: 5, 10, or 15 seconds |
| negative_prompt | string | No | - | Elements to avoid (e.g., “blurry, watermark”). Max 1000 characters |
| enable_prompt_expansion | boolean | No | false | Enable AI to expand prompts into detailed scripts |
| shot_type | string | No | single | single for continuous shot, multi for scene transitions |
| seed | integer | No | -1 | Random seed for reproducibility (-1 for random) |
| webhook_url | string | No | - | URL for async status notifications |
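The constraints in the table (character limits, allowed sizes and durations, the multi-shot dependency on prompt expansion) can be checked client-side before submitting a request. A hypothetical validation helper, not part of the API itself:

```python
# Client-side validation of WAN 2.6 parameters per the table above.
# The API performs its own validation; this helper just fails fast.

VALID_SIZES = {"1280*720", "720*1280", "1920*1080", "1080*1920"}
VALID_DURATIONS = {"5", "10", "15"}

def validate_params(params: dict, mode: str = "t2v") -> list:
    """Return a list of problems; an empty list means the payload looks valid."""
    errors = []
    prompt = params.get("prompt", "")
    if not prompt:
        errors.append("prompt is required")
    elif len(prompt) > 2000:
        errors.append("prompt exceeds 2000 characters")
    if mode == "i2v" and not params.get("image"):
        errors.append("image URL is required for i2v")
    if params.get("size", "1280*720") not in VALID_SIZES:
        errors.append("size must be one of: " + ", ".join(sorted(VALID_SIZES)))
    if params.get("duration", "5") not in VALID_DURATIONS:
        errors.append("duration must be 5, 10, or 15")
    if len(params.get("negative_prompt", "")) > 1000:
        errors.append("negative_prompt exceeds 1000 characters")
    if params.get("shot_type", "single") == "multi" and not params.get("enable_prompt_expansion"):
        errors.append("shot_type 'multi' requires enable_prompt_expansion: true")
    return errors

print(validate_params({"prompt": "A cat leaps onto a windowsill", "shot_type": "multi"}))
```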

Frequently Asked Questions

What is WAN 2.6?
WAN 2.6 is Alibaba’s video generation model that creates AI videos from images (i2v) or text prompts (t2v). It produces smooth, cinematic videos up to 15 seconds at 720p or 1080p resolution with features like multi-shot sequences and prompt expansion for richer narratives.

What is the difference between image-to-video and text-to-video?
Image-to-video (i2v) animates an existing image you provide, giving you control over the visual starting point. Text-to-video (t2v) generates videos purely from text descriptions, creating visuals entirely from your prompt without an input image.

What video durations are supported?
WAN 2.6 supports 5, 10, and 15 second durations. Shorter durations (5s) generate faster, while longer durations (15s) allow for more developed scenes and narratives.

What resolutions are available?
WAN 2.6 offers 720p (1280x720) and 1080p (1920x1080) in both landscape and portrait orientations. Use 720p for faster generation and 1080p for higher quality output.

What is prompt expansion?
Prompt expansion uses AI to transform short, simple prompts into detailed scripts before generation. Enable it when you have a basic idea but want richer, more cinematic output. It is required for multi-shot mode.

What is multi-shot mode?
Multi-shot mode (shot_type: multi) breaks your prompt into multiple scene transitions, creating a narrative sequence instead of a single continuous shot. It requires enable_prompt_expansion: true.

How long does generation take?
Generation time varies by resolution, duration, and server load. Typical processing ranges from 30 seconds to several minutes. Use webhooks for production workflows instead of polling.

What are the rate limits?
Rate limits depend on your subscription tier. See the Rate Limits page for current limits by plan.

How much does it cost?
Pricing varies by resolution and duration. See the Pricing page for current rates and credit costs.

What image formats are supported for image-to-video?
Image-to-video accepts JPEG, PNG, and WebP images via publicly accessible URLs. Use high-quality images with clear subjects for best results.
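For async processing without a webhook, a client polls the task until it completes. The loop below is a sketch: the status values and the shape of the status response are illustrative assumptions (this page does not document them), and the fetch function is injected so the loop itself stays testable without network access.

```python
import time

# Poll a generation task until it finishes. The status values
# ("processing", "succeeded", "failed") and response fields are
# illustrative assumptions, not documented API behavior.

def poll_task(fetch_status, interval_s: float = 5.0, timeout_s: float = 600.0,
              sleep=time.sleep) -> dict:
    """fetch_status() returns a dict like {"status": ..., "video_url": ...}."""
    waited = 0.0
    while waited < timeout_s:
        result = fetch_status()
        if result["status"] in ("succeeded", "failed"):
            return result
        sleep(interval_s)  # back off between polls
        waited += interval_s
    raise TimeoutError("generation did not complete within the timeout")

# Usage with a stubbed status sequence (no real API calls):
responses = iter([
    {"status": "processing"},
    {"status": "succeeded", "video_url": "https://example.com/out.mp4"},
])
result = poll_task(lambda: next(responses), sleep=lambda s: None)
print(result["status"])
```

For production workloads, prefer setting `webhook_url` so the service notifies you on completion instead of tying up a polling loop.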

Best practices

  • Prompt writing: Be specific about scenes, camera movements (zoom, pan, tilt), lighting, and atmosphere for better results
  • Image selection (i2v): Use high-resolution images with clear subjects and balanced lighting; avoid compressed or noisy inputs
  • Negative prompts: Always include common artifacts to avoid: “blurry, low quality, watermark, text, distortion”
  • Duration selection: Start with 5 seconds for quick iterations, then increase to 10-15s for final outputs
  • Prompt expansion: Enable for short prompts or when you want the AI to add cinematic details
  • Multi-shot planning: For multi-shot mode, structure your prompt with clear scene descriptions
  • Production integration: Use webhooks for scalable applications instead of polling
  • Reproducibility: Save the seed value from successful generations to recreate similar results
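Two of these practices, always including common artifacts in the negative prompt and pinning the seed for reproducible reruns, can be captured in a small payload helper. This is a sketch, not an official client; the helper name and defaults are assumptions:

```python
# Apply two best practices from the list above: merge a default negative
# prompt covering common artifacts, and pin the seed so a successful
# generation can be recreated. Illustrative helper, not part of the API.

DEFAULT_NEGATIVE = "blurry, low quality, watermark, text, distortion"

def build_payload(prompt: str, seed: int = -1, negative_prompt: str = "", **extra) -> dict:
    terms = [DEFAULT_NEGATIVE]
    if negative_prompt:
        terms.append(negative_prompt)
    payload = {
        "prompt": prompt,
        "negative_prompt": ", ".join(terms),
        "seed": seed,  # reuse a saved seed to recreate similar results
    }
    payload.update(extra)  # e.g. size, duration, shot_type
    return payload

p = build_payload("A neon city street at night, slow dolly forward",
                  seed=42, negative_prompt="cartoon", duration="5")
print(p["seed"], "|", p["negative_prompt"])
```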