Our project focuses on creating an automated video generation system using AI. It transforms text prompts into fully narrated videos by leveraging large language models for script generation, diffusion models for image creation, and text-to-speech systems for narration. The system processes inputs through multiple stages, from script generation to final video assembly, creating cohesive, engaging content automatically.
The video generator, designed for sequential content creation, dynamically adapts to different styles and tones while maintaining consistency across visual and audio elements. It can also add subtitles, either embedded in the video or as a separate SRT file.
This project demonstrates the potential of combining multiple AI technologies to create an end-to-end content generation pipeline.
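The multi-stage flow described above can be sketched in miniature. The function names and data shapes below are illustrative placeholders, not ForgeTube's actual API:

```python
# Illustrative sketch of the four pipeline stages; all names here are
# hypothetical placeholders, not ForgeTube's real functions.

def generate_script(prompt):
    """Stage 1: an LLM turns the text prompt into a structured script."""
    return [{"text": f"Scene about {prompt}", "visual": f"An image of {prompt}"}]

def generate_image(visual_prompt):
    """Stage 2: a diffusion model renders each visual prompt."""
    return f"image[{visual_prompt}]"

def synthesize_audio(text):
    """Stage 3: a TTS model narrates each segment's text."""
    return f"audio[{text}]"

def assemble_video(segments):
    """Stage 4: segments are stitched into the final video."""
    return {"frames": [s["image"] for s in segments],
            "tracks": [s["audio"] for s in segments]}

def run_pipeline(prompt):
    """End-to-end: script -> per-segment image and audio -> assembly."""
    script = generate_script(prompt)
    segments = [{"image": generate_image(s["visual"]),
                 "audio": synthesize_audio(s["text"])} for s in script]
    return assemble_video(segments)
```

The real stages call out to Gemini, SDXL, and Kokoro respectively; the point is only the hand-off of structured data between stages.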
- **Python 3.11**: Core programming language for the project.
- **Content Generation:**
  - **Gemini API**: Generates the script using the Gemini 2.0 Flash Thinking model and stores it in JSON format with audio and visual prompts and their respective parameters.
  - **Stable Diffusion XL Base 1.0**: Image generation using diffusion models, run either locally or hosted on Modal.
  - **Kokoro**: An open-weight TTS model that converts audio prompts into audio.
- **Video Processing:**
  - **MoviePy**: Adds text, intros, outros, transition effects, and subtitles, and handles audio processing, video processing, and final assembly, using FFmpeg under the hood.
- **ML Frameworks:**
  - **PyTorch**: Deep learning framework for model inference.
  - **Diffusers with SDXL Base 1.0**: Hugging Face's Diffusers library drives image generation with the SDXL Base 1.0 model using state-of-the-art diffusion techniques.
- **Development Tools:**
  - **Jupyter Notebooks**: For development and testing.
  - **Google Colab**: Free cloud GPU infrastructure for development and testing.
  - **Git**: For version control.
  - **Modal**: Low-cost, high-performance cloud GPU infrastructure.
- **Package Management:**
  - **UV**: Fast and efficient dependency management and project setup.
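For illustration, the kind of JSON script the Gemini stage produces might look like the following. This schema is a hypothetical example, not the project's exact format:

```python
import json

# Hypothetical example of a generated script in JSON form; the real
# schema used by ForgeTube may differ in field names and parameters.
script_json = """
{
  "topic": "The Water Cycle",
  "segments": [
    {
      "id": 1,
      "narration": "Water evaporates from oceans and lakes.",
      "visual_prompt": "sunlit ocean surface with rising mist, photorealistic",
      "voice": "default",
      "duration_seconds": 6
    }
  ]
}
"""

script = json.loads(script_json)
for seg in script["segments"]:
    print(seg["id"], seg["visual_prompt"])
```

Each segment carries both a narration line for the TTS stage and a visual prompt for the diffusion stage, which is what lets the later stages run independently.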
- Multi-Modal Content Generation: Seamlessly combines text, image, and audio generation
- Style Customization: Supports different content styles and tones
- Modular Architecture: Each component can be tested and improved independently
- Content Segmentation: Automatically breaks down content into manageable segments
- Custom Voice Options: Multiple TTS voices and emotional tones
- Format Flexibility: Supports different video durations and formats (.mp4 and .mkv)
- Performance Metrics: Tracks generation quality and consistency
- Error Handling: Robust error management across the pipeline
- Resource Optimization: Efficient resource usage during generation
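As a rough illustration of the content-segmentation feature, a naive splitter can break narration into sentences and pack them into bounded segments. This is only a sketch; the project's actual segmentation (which, per the dependencies, may rely on spaCy) will differ:

```python
import re

def segment_content(text, max_chars=120):
    """Naive content segmentation: split text into sentences, then pack
    sentences into segments no longer than max_chars each.
    Illustrative only; not ForgeTube's actual implementation."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    segments, current = [], ""
    for sentence in sentences:
        # Start a new segment when adding this sentence would overflow.
        if current and len(current) + len(sentence) + 1 > max_chars:
            segments.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        segments.append(current)
    return segments
```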
Clone the repo on your system:

```shell
git clone https://github.com/MLSAKIIT/ForgeTube.git
```
UV is a modern, high-performance Python package and project manager designed to streamline the development process. For more information, visit the UV Documentation.
Here's how you can use UV in this project:
- Install `uv`:

  ```shell
  pip install uv
  ```

- Download Python 3.11:

  ```shell
  uv python install 3.11
  ```

- Create a virtual environment:

  ```shell
  uv venv .venv
  ```

- Activate your virtual environment (PowerShell; on Linux/macOS use `source .venv/bin/activate`):

  ```shell
  .venv\Scripts\activate.ps1
  ```

- Install all dependencies:

  ```shell
  uv sync
  ```
For more information visit the Modal documentation.
Modal is a cloud function platform that lets you attach high-performance GPUs with a single line of code.
The nicest thing about all of this is that you don’t have to set up any infrastructure. Just:
- Create an account at modal.com.
- Run `pip install modal` to install the Modal Python package.
- Run `modal setup` to authenticate (if this doesn't work, try `python -m modal setup`).
To obtain a Gemini API key from Google AI Studio, follow these detailed steps:
Step 1: Sign In to Google AI Studio
Navigate to Google AI Studio. Once signed in, locate and click on the "Gemini API" tab. This can typically be found in the main navigation menu or directly on the dashboard. On the Gemini API page, look for a button labeled "Get API key in Google AI Studio" and click on it.
Step 2: Review and Accept Terms of Service
- Review Terms: A dialog box will appear presenting the Google APIs Terms of Service and the Gemini API Additional Terms of Service. It's essential to read and understand these terms before proceeding.
- Provide Consent: Check the box indicating your agreement to the terms. Optionally, you can also opt-in to receive updates and participate in research studies related to Google AI.
- Proceed: Click the "Continue" button to move forward.
Step 3: Create and Secure Your API Key
- Generate API Key: Click on the "Create API key" button. You'll be prompted to choose between creating a new project or selecting an existing one. Make your selection accordingly.
- Retrieve the Key: Once generated, your unique API key will be displayed. Ensure you copy and store it in a secure location.
Step 4: Add Your Key in `main.py` or `local_main.py`

```python
# 1. Generate the Script
gem_api = "Enter your Gemini API Key here"
serp_api = "Enter your Serp API key here"
```
> [!IMPORTANT]
> Always keep your API key confidential. Avoid sharing it publicly or embedding it directly into client-side code to prevent unauthorized access.
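One way to keep keys out of the source (a suggestion, not the project's current approach) is to read them from environment variables instead of editing the files:

```python
import os

# Suggested alternative to hardcoding keys in main.py: read them from
# environment variables and fail loudly when one is missing.
# The variable names below are illustrative, not ones ForgeTube defines.
def get_api_key(name):
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"Set the {name} environment variable first.")
    return key

# gem_api = get_api_key("GEMINI_API_KEY")
# serp_api = get_api_key("SERP_API_KEY")
```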
Serp is used for web-scraping Google search results on the video topic and gathering additional context to implement Retrieval Augmented Generation (RAG).
- Visit serpapi.com/ and create an account.
- Go to the dashboard and select "Api key" at the top left.
- Copy the API key and add it in `main.py` or `local_main.py`:

```python
# 1. Generate the Script
gem_api = "Enter your Gemini API Key here"
serp_api = "Enter your Serp API key here"
```
Run the following commands:

```shell
python -m pip install spacy  # If not installed for some reason
python -m spacy download en_core_web_sm
```
- Visit https://github.com/BtbN/FFmpeg-Builds/releases
- Download the build for your OS.
- On Windows, download the win64 version and extract the files.
- Make a directory at `C:\Program Files\FFmpeg`.
- Copy all the extracted files into that directory.
- Add `C:\Program Files\FFmpeg\bin` to your `PATH` environment variable.
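To confirm the binary is now discoverable, a quick check with Python's `shutil.which` (an illustrative helper, not part of the project) performs the same lookup the shell does:

```python
import shutil

def on_path(binary: str) -> bool:
    """Return True if `binary` can be found on the PATH.
    Illustrative helper for verifying the FFmpeg install."""
    return shutil.which(binary) is not None

print("ffmpeg found:", on_path("ffmpeg"))
```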
Use `main.py` to run image generation on Modal, or use `main_local.py` to run Stable Diffusion XL locally.
> [!IMPORTANT]
> 1. Make sure all the following paths are updated properly:
>
>    ```python
>    script_path = "resources/scripts/"
>    script_path += "script.json"  # Name of the script file
>    images_path = "resources/images/"
>    audio_path = "resources/audio/"
>    font_path = "resources/font/font.ttf"  # Not recommended to change
>    ```
> [!IMPORTANT]
> 2. Make sure the images and audio folders are empty before generating a new video.
> 3. The video file name is automatically grabbed from the video topic in the script. However, you may change the following variables to use custom names. If the file name is very long, the video file won't be generated, so do change it manually in such cases.
>
>    ```python
>    sub_output_file = "name of the subtitle file.srt"
>    video_file = "name of the video.mp4 or .mkv"
>    ```
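Because over-long file names prevent the video from being generated, a small helper (hypothetical, not part of ForgeTube) can derive a safe, truncated name from the topic:

```python
import re

def safe_video_name(topic, max_len=60, ext=".mp4"):
    """Turn a video topic into a filesystem-safe, length-limited file
    name. Hypothetical helper, not part of the ForgeTube codebase."""
    name = re.sub(r"[^\w\s-]", "", topic).strip()  # drop punctuation
    name = re.sub(r"\s+", "_", name)               # spaces -> underscores
    return name[:max_len].rstrip("_") + ext
```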
- `no module named pip found`: Try running the following:

  ```shell
  python -m pip install spacy pydub kokoro soundfile torch
  python -m spacy download en_core_web_sm
  ```

- Serp API not returning any search results: This is a known issue and is being investigated.
> [!IMPORTANT]
> Ensure you have sufficient GPU resources for image generation and that the proper model weights are downloaded. It is recommended to use an NVIDIA GPU with at least 24 GB of VRAM for running image generation locally, and a CPU with high single-core performance for video assembly.
> [!NOTE]
> Video generation times may vary based on content length, complexity, and the hardware used.
| CONTRIBUTORS | MENTORS | CONTENT WRITER |
| --- | --- | --- |
| Kartikeya Trivedi | Soham Roy | [Name] |
| Naman Singh | Yash Kumar Gupta | |
| Soham Mukherjee | | |
| Sumedha Gunturi | | |
| Souryabrata Goswami | | |
| Harshit Agarwal | | |
| Rahul Sutradhar | | |
| Ayush Mohanty | | |
| Shopno Banerjee | | |
| Shubham Gupta | | |
| Sarthak Singh | | |
| Nancy | | |
| Version | Date | Comments |
| --- | --- | --- |
| 1.0 | 23/02/2025 | Initial release |
- Pipeline foundations
- LLM Agent Handling
- Diffusion Agent Handling
- TTS Handling
- Video Assembly Engine
- Initial Deployment
- Advanced style transfer capabilities
- In-Context Generation for Diffusion Model
- Real-time generation monitoring
- Enhanced video transitions
- Better quality metrics
- Multi-language support
- Custom character consistency
- Animation effects
- Hugging Face Transformers - https://huggingface.co/transformers
- Hugging Face Diffusers - https://huggingface.co/diffusers
- FFmpeg - https://ffmpeg.org/
- UV - https://docs.astral.sh/uv/
- MoviePy - https://zulko.github.io/moviepy/getting_started/index.html
- The Illustrated Transformer - A visual, beginner-friendly introduction to transformer architecture.
- Attention Is All You Need - The seminal paper on transformer architecture.
- Gemini 2.0 Flash Thinking
- Introduction to Multi-Agent Systems - Fundamental concepts and principles.
- A Comprehensive Guide to Understanding LangChain Agents and Tools - Practical implementation guide.
- Kokoro
- Stable Diffusion XL Turbo 1.0 Base
- Stable Diffusion: A Comprehensive End-to-End Guide with Examples
- Stable Diffusion Explained
- Stable Diffusion Explained Step-by-Step with Visualization
- Understanding Stable Diffusion: The Magic Behind AI Image Generation
- Stable Diffusion Paper