
🎬 Image to Video Generator with Ovi

Transform your static images into dynamic videos with synchronized audio using AI! Upload an image and describe the motion you want to see.

⚠️ You must Sign in with Hugging Face using the button below to use this app.

💡 Tips for best results:

Use clear, well-lit images with a single main subject
Write specific prompts describing the desired motion or action
Keep prompts concise and focused on movement and audio elements
Each run generates a 5-second video at 24 FPS with synchronized audio
Processing may take 30-60 seconds depending on server load

✨ Special Tokens for Enhanced Control:

Speech: <S>Your speech content here<E> - Text enclosed in these tags will be converted to speech
Audio Description: <AUDCAP>Audio description here<ENDAUDCAP> - Describes the audio or sound effects present in the video

📝 Example Prompt:

Dogs bark loudly at a man wearing a red shirt. The man says <S>Please stop barking at me!<E>. <AUDCAP>Dogs barking, angry man yelling in stern voice<ENDAUDCAP>.
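The special tokens can also be assembled programmatically. The following is a minimal sketch, not part of the app itself: the helper names `build_ovi_prompt` and `has_balanced_tags` are hypothetical, and the only thing taken from this page is the tag syntax (`<S>…<E>` and `<AUDCAP>…<ENDAUDCAP>`).

```python
from typing import List, Optional


def build_ovi_prompt(
    scene: str,
    speech: Optional[List[str]] = None,
    audio_caption: Optional[str] = None,
) -> str:
    """Join a scene description with tagged speech and an audio description.

    Hypothetical helper: it only formats text using the tags documented above.
    """
    parts = [scene.strip()]
    for line in speech or []:
        # Text between <S> and <E> is meant to be converted to speech.
        parts.append(f"<S>{line.strip()}<E>")
    if audio_caption:
        # <AUDCAP>...<ENDAUDCAP> describes the sounds in the video.
        parts.append(f"<AUDCAP>{audio_caption.strip()}<ENDAUDCAP>")
    return " ".join(parts)


def has_balanced_tags(prompt: str) -> bool:
    """Check that every opening tag has a matching closing tag (by count)."""
    return (
        prompt.count("<S>") == prompt.count("<E>")
        and prompt.count("<AUDCAP>") == prompt.count("<ENDAUDCAP>")
    )


prompt = build_ovi_prompt(
    "A man in a red shirt waves at the camera.",
    speech=["Hello there!"],
    audio_caption="Friendly male voice, light wind",
)
print(prompt)
```

A quick `has_balanced_tags(prompt)` check before submitting can catch a forgotten closing tag, which would otherwise leave the speech or audio span open-ended.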


About Ovi Model

Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation

Developed by Chetwin Low, Weimin Wang (Character AI) & Calder Katyal (Yale University)

🌟 Key Features:

🎬 Video+Audio Generation: Generates synchronized video and audio content simultaneously
📝 Flexible Input: Supports text-only or text+image conditioning
⏱️ 5-second Videos: Generates 5-second videos at 24 FPS
📐 Multiple Aspect Ratios: Supports several aspect ratios (9:16, 16:9, 1:1, etc.) at a pixel area equivalent to 720×720

Ovi is a Veo 3-like model that generates video and audio together from text or text+image inputs.
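To make the "720×720 area" idea concrete, here is a rough sketch of how a width and height could be derived for a given aspect ratio while keeping the pixel count near 720×720. The function name `dims_for_ratio` and the rounding to a multiple of 16 are assumptions made for illustration; the page does not say how the model actually picks its output dimensions.

```python
import math

TARGET_AREA = 720 * 720  # pixel budget stated in the feature list


def dims_for_ratio(ratio_w: int, ratio_h: int, multiple: int = 16) -> tuple:
    """Return (width, height) close to TARGET_AREA for the given aspect ratio.

    Rounding to `multiple` is an assumption, not documented model behavior.
    """
    # Solve width * height = TARGET_AREA with width / height = ratio_w / ratio_h.
    height = math.sqrt(TARGET_AREA * ratio_h / ratio_w)
    width = height * ratio_w / ratio_h

    def snap(value: float) -> int:
        # Round to the nearest multiple, never below one multiple.
        return max(multiple, round(value / multiple) * multiple)

    return snap(width), snap(height)


for ratio in [(1, 1), (16, 9), (9, 16)]:
    print(ratio, dims_for_ratio(*ratio))
```

Under these assumptions a 16:9 request lands at about 960×544, which has nearly the same number of pixels as 720×720.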

🚀 How it works

Sign in with your Hugging Face account
Upload your image - any photo or illustration
Describe the motion you want to see in the prompt
Generate and watch your image come to life!

⚠️ Notes

Video generation may take 30-60 seconds
Generates 5-second videos at 24 FPS with synchronized audio
Supports multiple aspect ratios (9:16, 16:9, 1:1, etc.) at a pixel area equivalent to 720×720
Requires a valid Hugging Face token with Inference API access
Best results with clear, high-quality images
The model works best with realistic subjects and natural motions
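As a quick sanity check on the numbers in these notes, the nominal frame count follows directly from the stated duration and frame rate. This is simple arithmetic under the assumption of exactly 5 seconds at 24 FPS; the real pipeline may differ by a frame or so, and the function name is hypothetical.

```python
DURATION_S = 5  # clip length stated in the notes
FPS = 24        # frame rate stated in the notes


def nominal_frame_count(duration_s: int = DURATION_S, fps: int = FPS) -> int:
    """Frames in a clip, assuming exactly duration_s seconds at fps frames per second."""
    return duration_s * fps


print(nominal_frame_count())
```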
