I used AI to code an app to generate AI videos. It can be yours.

Welcome to another week in my AI-assisted coding journey.

Inspired by a few projects I discovered on “There’s An AI For That” (spoiler: there’s now an AI for [almost] everything), I decided on Tuesday afternoon to start coding a Python app to generate AI videos from a single prompt.

[Image: message from a developer friend, after 24 hours]

🎉Two days later, it was ready!

On the frontend, it’s just a simple field, nothing fancy.

When triggered straight from Python, it’s just a simple command (20 = the number of sentences I ask the AI to generate, split into chunks of 2 sentences, then rendered as captions with a pretty cool “poetic” prompt).

generatevideotask('Tips and tricks to master AI-assisted coding', 20)
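The splitting into chunks of 2 sentences could be done with a naive sentence splitter like this (a hypothetical sketch; the author’s actual chunkify, described in the kickoff prompt further down, isn’t shown in the post):

import re

def chunkify(text, sentences_per_chunk=2):
    """Split a text into chunks of N sentences (naive regex splitter)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [" ".join(sentences[i:i + sentences_per_chunk])
            for i in range(0, len(sentences), sentences_per_chunk)]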

In the backend, it’s rather sophisticated, leveraging OpenAI’s API, Stable Diffusion, PyDub (for audio) and MoviePy (for video).

✍️We’re talking about roughly 1,000 lines of code.

This is a section of the JSON describing the overall structure of my most recent video:

[Image: JSON excerpt]

The end result is a vertical 9:16 video consisting of a series of Stable Diffusion images, properly paced captions featuring some highlighted words, a nice voiceover from OpenAI (10x cheaper than ElevenLabs), a selection of effects applied to the images (zoom, pan, …) and an MP3 backing track (from a selection of tracks generated with Suno.ai).
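Based on that description, the per-video structure presumably carries fields along these lines. A purely hypothetical sketch as a Python literal; the field names are mine, not the script’s actual schema:

video_structure = {
    "prompt": "Tips and tricks to master AI-assisted coding",
    "backing_track": "suno_track_03.mp3",  # hypothetical filename, from the Suno.ai selection
    "chunks": [
        {
            "text": "First two sentences of the generated script...",
            "image": "chunk_01.png",      # 768x1344 Stable Diffusion image
            "effect": "zoom_in",          # or zoom_out / pan_left / pan_right
            "voiceover": "chunk_01.mp3",  # OpenAI text-to-speech
            "captions": ["First two", "sentences", "of the generated script..."],
        },
        # ... one entry per 2-sentence chunk
    ],
}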

Here’s an example I published this morning on YouTube.

Time taken to generate video: 5.3702291051546736 minutes

That’s how magically effective AI-assisted coding can be in 2024.

The Coding Process

In this email, I wanted to share with you the key components of the process that enabled me to ship this app in just 2 days, from initial idea to final deployment.

I’ve not published the app, but I have a special offer for you, dear reader.

More about this later… 👇

It all started with a tweet I read on Monday about a developer who had shipped a web app to generate and schedule short videos for TikTok.

I wanted to try to emulate the output of his product.

I already knew that I could generate 9:16 images via Stable Diffusion’s API (each image costs roughly $0.0036 to generate). I also had solid experience in prompt engineering and a fairly good knowledge of PyDub and MoviePy, two Python packages used for audio and video manipulation.
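For reference, here’s a sketch of what such an image call can look like against Stability AI’s public v1 REST API. The endpoint, engine name and function name are my assumptions, not the author’s actual code:

import base64
import os

import requests

def imagify(description, out_path):
    """Generate one 768x1344 (9:16) AI image for a chunk description."""
    resp = requests.post(
        "https://api.stability.ai/v1/generation/"
        "stable-diffusion-xl-1024-v1-0/text-to-image",
        headers={
            "Authorization": f"Bearer {os.environ['STABILITY_API_KEY']}",
            "Accept": "application/json",
        },
        json={
            "text_prompts": [{"text": description}],
            "width": 768,
            "height": 1344,  # 9:16 portrait, as in the post
            "samples": 1,
        },
        timeout=120,
    )
    resp.raise_for_status()
    # The v1 API returns generated images base64-encoded in "artifacts"
    image_b64 = resp.json()["artifacts"][0]["base64"]
    with open(out_path, "wb") as f:
        f.write(base64.b64decode(image_b64))
    return out_path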

So on Tuesday afternoon I opened Visual Studio and asked ChatGPT (better at processing long kickoff prompts) to generate the boilerplate of the project.

Here’s the full unredacted prompt.

It just came to me in one shot.


I need the detailed code for a Python script.
The purpose of the script is to generate a custom video. Output = customvideo.mp4

The script is fed with a text (text = "text string...")

The text is sent to the function chunkify which segments the text into chunks of 2 sentences AND also returns captions for each chunk, which we'll use in the final part of the process (one chunk can consist of multiple captions).
 
Each chunk is then processed by the function topify which returns a brief description of the group of 2 sentences, sent to a function imagify which returns an AI image of 768:1344 pixels illustrating the group.

Then we have to generate the final video in mp4.

We'll send each chunk of text to a function called voiceover which will return a MP3.

The background of each chunk will be the corresponding AI image, applying a random animation effect taken out of 3 options: subtle zoom in / zoom out, subtle pan to left, subtle pan to right. The duration of the video section will depend on the duration of the MP3 generated for the text chunk for the corresponding section.

The captions corresponding to each chunk will be displayed in the middle of the corresponding video section, written in white in a nice bold font, with a black border, for contrasting purposes. You should also apply an appear / exit effect to those captions. 

The final video will be a concatenation of the background videos with the captions properly superimposed in sequence in the center of those background videos.

The soundtrack will be a concatenation of the MP3s rendered for the chunks of text. 

Make sure that the background videos generated from the animated AI images have the same duration as those MP3s. 

All clear? 

Give me the full Python code, not omitting or commenting a single line. 
I need the full code.
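As an aside: the voiceover function described in the prompt maps naturally onto OpenAI’s text-to-speech endpoint, which (per the intro) is what the script ended up using. A minimal sketch with the v1.x openai client; the model and voice choices are mine:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def voiceover(chunk_text, out_path):
    """Render a text chunk to an MP3 voiceover and return the file path."""
    response = client.audio.speech.create(
        model="tts-1",   # OpenAI's standard TTS model
        voice="alloy",   # one of the built-in voices
        input=chunk_text,
    )
    response.stream_to_file(out_path)
    return out_path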

To be honest with you, the structure evolved quite a bit over the next two days.

I introduced some refinements, especially around caption generation, and I isolated the various steps in dedicated .py files, which makes the project much easier to maintain.

Originally, MoviePy handled both the audio and the video manipulations. But due to a well-documented bug in MoviePy, which tends to introduce some nasty artifacts at the end of concatenated segments, I refactored the code to handle audio with PyDub and video with MoviePy, blending the two at the very end of the script.

    # Assumed module-level imports for this fragment:
    #   import uuid
    #   from moviepy.editor import concatenate_videoclips, AudioFileClip
    #   from pydub import AudioSegment

    # Concatenate the video clips using MoviePy
    final_video_clip = concatenate_videoclips(video_clips)

    # Concatenate all audio clips using PyDub
    final_audio_clip = AudioSegment.empty()  # start from an empty AudioSegment
    for audio_clip in audio_clips:
        final_audio_clip += audio_clip

    # Export the concatenated audio to a .wav with a unique filename,
    # then reload it as a MoviePy AudioFileClip
    unique_id_audio = uuid.uuid4()
    final_audio_clip.export(f"final_audio_{unique_id_audio}.wav", format="wav")
    final_audio_clip_moviepy = AudioFileClip(f"final_audio_{unique_id_audio}.wav")

    # Ensure the video duration does not exceed the audio duration
    if final_video_clip.duration > final_audio_clip_moviepy.duration:
        final_video_clip = final_video_clip.subclip(0, final_audio_clip_moviepy.duration)

    # Blend: attach the PyDub-built soundtrack to the MoviePy video
    final_clip = final_video_clip.set_audio(final_audio_clip_moviepy)
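As for the animated backgrounds themselves (the subtle zoom / pan options from the kickoff prompt), here’s a sketch of the zoom-in variant with MoviePy 1.x; the function name and ZOOM_RATE parameter are mine:

from moviepy.editor import ImageClip

ZOOM_RATE = 0.04  # total extra scale applied over the clip's duration (my choice)

def zoom_in_clip(image_path, duration):
    """Turn a still image into a clip that slowly zooms in."""
    clip = ImageClip(image_path).set_duration(duration)
    # resize() accepts a function of time t returning a scale factor
    return clip.resize(lambda t: 1 + ZOOM_RATE * t / duration)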

🤔The Challenges

Besides the nasty audio artifacts introduced by MoviePy, I also struggled for a while to find the right prompt to segment the sentence chunks into on-screen captions.

I found a nice angle by asking the AI to rewrite the chunks as if they were a poem, so that the text would nicely flow on the screen.

Here’s an excerpt from the (much longer) prompt:

GUIDELINES. READ CAREFULLY:
1. Divide each chunk into smaller, naturally flowing phrases that are easy to read sequentially, while preserving the original text.
2. Imagine those CAPTIONED SECTIONS as if you were rewriting the chunk text in a POEM FORMAT, putting emphasis where needed. 
3. Sometimes, a CAPTIONED SECTION will be one word, sometimes a few words. Imagine you are reading the text as a POEM and you want to emphasize certain words or phrases.
4. Ensure that the divisions do not disrupt the natural grammatical structure or the narrative continuity of the text.
5. Aim for each CAPTIONED SECTION to carry a piece of the sentence's meaning, contributing to the overall message as the reader progresses through them.

This prompt was itself the result of a discussion with my AI assistant.
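For illustration, here’s a minimal sketch of how such a prompt can be wired to the OpenAI chat API (v1.x client); the model choice and the one-caption-per-line output convention are my assumptions, not the author’s actual code:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CAPTION_GUIDELINES = "..."  # the (much longer) prompt excerpted above

def captionify(chunk_text):
    """Split a 2-sentence chunk into poem-style captioned sections."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": CAPTION_GUIDELINES},
            {"role": "user", "content": chunk_text},
        ],
    )
    # assume the model returns one captioned section per line
    return [line.strip()
            for line in response.choices[0].message.content.splitlines()
            if line.strip()]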

Another challenge was improving the on-screen pacing of the captions (it still needs some fine-tuning, but it’s now close to what it should be).

[Image: discussion with a (human) developer friend]

A discussion with a (human) friend prompted me to segment the captions into syllables, calculate each caption’s share of the whole chunk, and return the relative duration of each caption in a list.

Here’s an overview of the last part of the function.

def get_caption_durations(chunk_text, captions, audio_duration):
    # nsyl() returns the syllable count of a word (sketched below)
    total_syllables = sum(nsyl(word) for word in chunk_text.split())
    per_syllable_duration = audio_duration / total_syllables

    durations = []
    for caption in captions:
        caption_syllables = sum(nsyl(word) for word in caption.split())
        durations.append(caption_syllables * per_syllable_duration)

    # Hold the last caption 1.5 seconds longer
    durations[-1] += 1.5

    # Apply a factor of 0.9 to all durations except the last one
    durations[:-1] = [duration * 0.9 for duration in durations[:-1]]

    return durations
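The nsyl() syllable counter isn’t shown in the post. A minimal sketch of how it could work, assuming NLTK’s CMU Pronouncing Dictionary (a common choice), with a crude vowel-group fallback for out-of-dictionary words:

import re

from nltk.corpus import cmudict  # one-time setup: nltk.download('cmudict')

_cmu = cmudict.dict()

def nsyl(word):
    """Return the syllable count of a word."""
    w = word.lower().strip(".,!?;:'\"")
    if w in _cmu:
        # in CMU phoneme lists, vowels carry a stress digit (AH0, EY1, ...)
        return sum(ph[-1].isdigit() for ph in _cmu[w][0])
    # fallback: count groups of consecutive vowels
    return max(1, len(re.findall(r"[aeiouy]+", w)))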

Besides those 3 challenges (audio artifacts, caption definition and pacing), it took me a few hours of trial and error to refine the whole script.

I deployed it for a test on Railway. Processing a video in the cloud took roughly twice as long as on my modest local 2.7GHz machine, while eating up 5 times as much RAM as on my 16GB local machine (close to 3GB per run instead of 600MB for a 10-sentence video).

I decided to develop this as a locally operated script and to keep working on the project as a boilerplate for developers (while helping my friend ship a consumer-facing version).

I plan to update the script with new features: voice selection, font selection, and video sourcing (using, for instance, Pexels images & videos, or my very own Unsplash images).

My Special Offer 🎁

So here comes the special offer I mentioned earlier in my newsletter.

If you’d like to generate your own videos while learning how to use AI for coding, I’m offering you the opportunity to download my code, and I’ll share all future updates with you. I will document all the components, as well as record a detailed video tutorial.

You’ll be able to run this script on your local machine. You will declare your own Stable Diffusion and OpenAI keys in the .env file, as well as your AWS credentials if you want to use S3 for storage.
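To illustrate the kind of configuration involved (the variable names below are my guesses, not necessarily the script’s actual ones), the .env file would hold entries such as OPENAI_API_KEY, STABILITY_API_KEY and your AWS credentials, loaded with python-dotenv:

import os

from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into the environment

openai_key = os.getenv("OPENAI_API_KEY")        # hypothetical variable name
stability_key = os.getenv("STABILITY_API_KEY")  # hypothetical variable name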

The script is also deployment-ready, including the input widget displayed above, if you want to integrate it into another app project.

FYI, one video of 10 sentences (roughly 30 seconds) represents about $0.06 in API costs in total; at $0.0036 per image, the 5 Stable Diffusion images account for around $0.018 of that, the rest being mostly OpenAI calls. It takes about 3 minutes to render a 30-second video.

Most consumer-facing services charge $1+ per video generation for this type of output (though you do, of course, have to factor in cloud expenses, especially if you want to offer your users a reasonable waiting time instead of putting all jobs in a long rendering queue).

If you’re interested in the script, I’m pricing it at $179 (+ sales tax where it applies).

Simply reply to this email, mentioning your invoicing details, and I’ll raise a custom invoice via Paddle. The price includes the detailed documentation, all future updates, as well as my personal assistance to make sure it properly runs on your computer. If you’re the first customer, you will also get a free 30-minute consultation.

👨‍💻Mentoring Opportunity

If you’d like a private introduction to the art of AI-assisted coding and, more broadly, a detailed overview of today’s Gen AI capabilities, I’m offering one-on-one 2-hour mentoring sessions: “How To Talk To An AI Agent”.

Here’s the link for bookings.

Sessions are tailored to your specific business needs.

I can also assist you in the development of your own micro SaaS project.