🎬 Image to Video Generator with Ovi

Transform your static images into dynamic videos with synchronized audio using AI! Upload an image and describe the motion you want to see.

Powered by Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation, served through the Hugging Face Inference API.

โš ๏ธ You must Sign in with Hugging Face using the button below to use this app.
๐Ÿ’ก Tips for best results:
  • Use clear, well-lit images with a single main subject
  • Write specific prompts describing the desired motion or action
  • Keep prompts concise and focused on movement and audio elements
  • Each run generates a 5-second video at 24 FPS with synchronized audio
  • Processing may take 30-60 seconds depending on server load
✨ Special Tokens for Enhanced Control:
  • Speech: <S>Your speech content here<E> - Text enclosed in these tags will be converted to speech
  • Audio Description: <AUDCAP>Audio description here<ENDAUDCAP> - Describes the audio or sound effects present in the video

📝 Example Prompt:
Dogs bark loudly at a man wearing a red shirt. The man says <S>Please stop barking at me!<E>. <AUDCAP>Dogs barking, angry man yelling in stern voice<ENDAUDCAP>.
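
The special tokens above can be assembled programmatically. The following is a minimal sketch, not part of the app itself: the helper name build_ovi_prompt and its parameters are hypothetical, and it only wraps text in the <S>...<E> and <AUDCAP>...<ENDAUDCAP> tags described above.

```python
def build_ovi_prompt(scene: str, speech: str = "", audio_caption: str = "") -> str:
    """Assemble a prompt string using the special tokens described above.

    Hypothetical helper for illustration; the tag names come from the
    app description, everything else is an assumption.
    """
    parts = [scene.strip()]
    if speech.strip():
        # Text between <S> and <E> is converted to speech.
        parts.append(f"<S>{speech.strip()}<E>")
    if audio_caption.strip():
        # Text between <AUDCAP> and <ENDAUDCAP> describes the soundtrack.
        parts.append(f"<AUDCAP>{audio_caption.strip()}<ENDAUDCAP>")
    return " ".join(parts)


if __name__ == "__main__":
    print(build_ovi_prompt(
        "Dogs bark loudly at a man wearing a red shirt. The man says",
        speech="Please stop barking at me!",
        audio_caption="Dogs barking, angry man yelling in stern voice",
    ))
```

Keeping the scene description, the spoken line, and the audio caption as separate arguments makes it harder to leave a tag unclosed.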

About Ovi Model

Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation

Developed by Chetwin Low, Weimin Wang (Character AI) & Calder Katyal (Yale University)

🌟 Key Features:

  • 🎬 Video+Audio Generation: Generates synchronized video and audio content simultaneously
  • 📝 Flexible Input: Supports text-only or text+image conditioning
  • ⏱️ 5-second Videos: Generates 5-second videos at 24 FPS
  • 📐 Multiple Aspect Ratios: Supports various aspect ratios (9:16, 16:9, 1:1, etc.) at a total pixel area of about 720×720
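
To make the "720×720 area" idea concrete, here is a rough sketch of how a width and height could be derived for a given aspect ratio while keeping the pixel count near 720×720. The function name dims_for_ratio and the rounding to a multiple of 16 are assumptions for illustration; the model's actual resolution rules may differ.

```python
import math


def dims_for_ratio(ratio_w: int, ratio_h: int,
                   area: int = 720 * 720, multiple: int = 16) -> tuple[int, int]:
    """Return (width, height) with the given ratio and roughly `area` pixels.

    Rounding to a multiple of 16 is an assumption made for this sketch,
    not a documented requirement of the model.
    """
    if ratio_w <= 0 or ratio_h <= 0:
        raise ValueError("ratio components must be positive")
    # Solve (ratio_w * s) * (ratio_h * s) = area for the scale factor s.
    scale = math.sqrt(area / (ratio_w * ratio_h))
    width = max(multiple, round(ratio_w * scale / multiple) * multiple)
    height = max(multiple, round(ratio_h * scale / multiple) * multiple)
    return width, height


if __name__ == "__main__":
    for w, h in [(1, 1), (16, 9), (9, 16)]:
        print(f"{w}:{h} ->", dims_for_ratio(w, h))
```

Under these assumptions, 1:1 gives 720×720, and 16:9 gives 960×544, since the exact 540 height is rounded to the nearest multiple of 16.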

Ovi is a Veo 3-like model that generates video and audio simultaneously from text or text+image inputs.


🚀 How it works

  1. Sign in with your Hugging Face account
  2. Upload your image - any photo or illustration
  3. Describe the motion you want to see in the prompt
  4. Generate and watch your image come to life!
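
Outside the web UI, the steps above might be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the app's actual code: it assumes a recent huggingface_hub release whose InferenceClient provides image_to_video, and the provider name "fal-ai", the model ID "chetwinlow1/Ovi", and the helper names make_client and animate_image are all assumptions.

```python
def make_client(token: str, provider: str = "fal-ai"):
    """Create a Hugging Face InferenceClient (imported lazily).

    The provider name is an assumption; use whichever provider serves the model.
    """
    from huggingface_hub import InferenceClient
    return InferenceClient(provider=provider, api_key=token)


def animate_image(client, image_path: str, prompt: str,
                  model: str = "chetwinlow1/Ovi",
                  out_path: str = "output.mp4") -> str:
    """Send an image and a motion prompt, then save the returned video bytes.

    `client` is expected to expose image_to_video(image, prompt=..., model=...)
    returning raw bytes, as recent InferenceClient versions are assumed to do.
    """
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    with open(image_path, "rb") as f:
        image_bytes = f.read()
    # Generation may take 30-60 seconds depending on server load.
    video_bytes = client.image_to_video(image_bytes, prompt=prompt, model=model)
    with open(out_path, "wb") as f:
        f.write(video_bytes)
    return out_path
```

A call would then look like animate_image(make_client(token), "photo.png", "The dog wags its tail"), where token is a Hugging Face token with Inference API access.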

โš ๏ธ Notes

  • Video generation may take 30-60 seconds
  • Generates 5-second videos at 24 FPS with synchronized audio
  • Supports multiple aspect ratios (9:16, 16:9, 1:1, etc.) at a total pixel area of about 720×720
  • Requires a valid Hugging Face token with Inference API access
  • Best results with clear, high-quality images
  • The model works best with realistic subjects and natural motions

🔗 Resources