How to Use ElevenLabs and OpenAI to Automate Faceless Short Videos for YouTube and TikTok

Are you looking to create engaging faceless short videos for platforms like YouTube or TikTok but don't want to deal with complex video editing? This article will walk you through how to automate the entire process using OpenAI, ElevenLabs, and MoviePy.

By the end of this tutorial, you'll know how to automatically generate visuals and voiceovers for short videos based on any script. Whether you’re creating educational content, storytelling, or meme videos, this workflow will save you tons of time.

Prerequisites

Before getting started, you’ll need:

  1. API keys for both OpenAI (for generating visuals) and ElevenLabs (for voiceovers).
  2. Basic Python knowledge.
  3. MoviePy and other required Python libraries installed (moviepy, openai, elevenlabs, etc.).

Step 1: Setting Up API Keys

import openai
from elevenlabs import ElevenLabs

# Set up your OpenAI and ElevenLabs API keys
openai.api_key = "your_openai_api_key"
elevenlabs_client = ElevenLabs(api_key="your_elevenlabs_api_key")

Start by getting API keys from OpenAI and ElevenLabs. Replace the placeholders in the code with your actual API keys.

Step 2: Preparing the Script

Your video starts with a story or script. You can replace the story_script variable with the text you want to turn into a video. Here’s an example script about Dogecoin:

story_script = """
Dogecoin began as a joke in 2013, inspired by the popular 'Doge' meme featuring a Shiba Inu dog. It unexpectedly gained a massive following thanks to its community's charitable initiatives, eventually evolving into a legitimate cryptocurrency with support from Elon Musk.
"""

The script will be split into sentences to match each visual and audio segment.
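
The exact splitting logic isn't critical. As a minimal sketch (the helper name split_into_sentences is just an illustration), you could split on sentence-ending punctuation with a regular expression:

import re

def split_into_sentences(text):
    # Split on ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if s]

sentences = split_into_sentences(story_script)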

Step 3: Generating Images with OpenAI’s DALL-E

For each sentence, we generate a corresponding image using OpenAI’s DALL-E model.

import base64
import os

def generate_image_from_text(sentence, context, idx):
    # Ask DALL-E 3 for a portrait-format image (1024x1792 suits vertical Shorts).
    prompt = f"Generate an image without any text that describes: {sentence}. Context: {context}"
    response = openai.images.generate(
        model="dall-e-3",
        prompt=prompt,
        size="1024x1792",
        response_format="b64_json"
    )

    # Decode the base64 payload and save it to disk.
    os.makedirs("images", exist_ok=True)
    image_filename = f"images/image_{idx}.jpg"
    with open(image_filename, "wb") as f:
        f.write(base64.b64decode(response.data[0].b64_json))
    return image_filename

This function sends each sentence to DALL-E and saves the generated image. The context argument helps keep the visuals consistent with the video's overall theme.

Step 4: Generating Voiceovers with ElevenLabs

Once we have the visuals, we need voiceovers. ElevenLabs converts each sentence into speech.

from elevenlabs import VoiceSettings

def generate_audio_from_text(sentence, idx):
    # Convert one sentence to speech with ElevenLabs.
    audio = elevenlabs_client.text_to_speech.convert(
        voice_id="pqHfZKP75CvOlQylNhV4",
        model_id="eleven_multilingual_v2",
        text=sentence,
        voice_settings=VoiceSettings(stability=0.2, similarity_boost=0.8)
    )
    os.makedirs("audio", exist_ok=True)
    audio_filename = f"audio/audio_{idx}.mp3"
    with open(audio_filename, "wb") as f:
        for chunk in audio:
            f.write(chunk)
    return audio_filename

This function generates an audio file for each sentence. You can swap in a different voice_id or adjust the voice settings to customize the narration style.
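
With both helpers defined, a simple driver loop can produce one image and one audio file per sentence. This is only a sketch: it assumes the sentences list from Step 2, and the list names image_paths and audio_paths are illustrative.

image_paths = []
audio_paths = []

for idx, sentence in enumerate(sentences):
    # Pass the full script as context so each image stays on theme.
    image_paths.append(generate_image_from_text(sentence, story_script, idx))
    audio_paths.append(generate_audio_from_text(sentence, idx))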

Step 5: Combining Audio and Video

Next, we pair each image with its corresponding voiceover using MoviePy:

from moviepy.editor import ImageClip, AudioFileClip

video_clips = []

# For each generated image/audio pair (image_path, audio_path):
audio_clip = AudioFileClip(audio_path)
image_clip = ImageClip(image_path, duration=audio_clip.duration)
image_clip = image_clip.set_audio(audio_clip)
video_clips.append(image_clip.set_fps(30))

Each image is displayed for the duration of its audio clip, ensuring synchronization.

Step 6: Applying Video Effects

To make the video more dynamic, we apply zoom and fade effects to each image. For example, the apply_zoom_in_center effect slowly zooms into the center of the image:

def apply_zoom_in_center(image_clip, duration):
    # Scale the frame up by 4% per second, zooming toward the center.
    return image_clip.resize(lambda t: 1 + 0.04 * t)

Other effects include zooming in from the upper part or zooming out. These effects are applied randomly to each clip to keep the video visually engaging.
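
The zoom-out helper and the random selection below are illustrative rather than the exact implementation; one way to sketch it, assuming the clips are collected in video_clips, is:

import random

def apply_zoom_out(image_clip, duration):
    # Start slightly zoomed in and scale back toward normal size over time.
    return image_clip.resize(lambda t: max(1.0, 1.2 - 0.04 * t))

effects = [apply_zoom_in_center, apply_zoom_out]

# Pick a random effect for each clip to keep the pacing varied.
video_clips = [random.choice(effects)(clip, clip.duration) for clip in video_clips]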

Step 7: Final Video Assembly

We combine all the clips into one seamless video and render it to a file:

from moviepy.editor import concatenate_videoclips

final_video = concatenate_videoclips(video_clips, method="compose")
final_video.write_videofile(output_video_path, codec="libx264", audio_codec="aac", fps=30)

Step 8: Adding Captions

Captions improve video accessibility and engagement. We use Captacity to automatically add captions based on the audio.

import captacity

captacity.add_captions(
    video_file=output_video_path,
    output_file="captioned_video.mp4",
    font_size=130,
    font_color="yellow",
    stroke_width=3
)

Step 9: Adding Background Music

To finish the video, background music is added. The volume is reduced so that it doesn't overpower the narration.

from moviepy.editor import AudioFileClip, CompositeAudioClip

# Trim the music to the video length and lower its volume under the narration.
background_music = AudioFileClip(music_filename).subclip(0, final_video.duration).volumex(0.2)
narration_audio = final_video.audio.volumex(1.5)
combined_audio = CompositeAudioClip([narration_audio, background_music])
final_video = final_video.set_audio(combined_audio)

By using OpenAI and ElevenLabs, we’ve automated the creation of faceless videos from text. You can now quickly generate YouTube Shorts or TikToks without needing a camera or microphone.

This method can easily be adapted for different video lengths or topics, making it ideal for content creators who want to focus on creativity and storytelling rather than spending hours on manual editing.

Happy creating!
