Our project focuses on creating an automated video generation system using AI. It transforms text prompts into fully narrated videos by leveraging large language models for script generation, diffusion models for image creation, and text-to-speech systems for narration. The system processes inputs through multiple stages, from script generation to final video assembly, creating cohesive, engaging content automatically.
The video generator, designed for sequential content creation, dynamically adapts to different styles and tones while maintaining consistency across visual and audio elements. It can also add subtitles, either embedded in the video or as a separate SRT file.
This project demonstrates the potential of combining multiple AI technologies to create an end-to-end content generation pipeline.
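The multi-stage flow described above can be sketched in miniature. The function names and data shapes below are illustrative placeholders, not ForgeTube's actual API:

```python
# Illustrative sketch of the four pipeline stages; all names here are
# hypothetical placeholders, not ForgeTube's real functions.

def generate_script(prompt):
    """Stage 1: an LLM turns the text prompt into a structured script."""
    return [{"text": f"Scene about {prompt}", "visual": f"An image of {prompt}"}]

def generate_image(visual_prompt):
    """Stage 2: a diffusion model renders each visual prompt."""
    return f"image[{visual_prompt}]"

def synthesize_audio(text):
    """Stage 3: a TTS model narrates each segment's text."""
    return f"audio[{text}]"

def assemble_video(segments):
    """Stage 4: segments are stitched into the final video."""
    return {"frames": [s["image"] for s in segments],
            "tracks": [s["audio"] for s in segments]}

def run_pipeline(prompt):
    """End-to-end: script -> per-segment image and audio -> assembly."""
    script = generate_script(prompt)
    segments = [{"image": generate_image(s["visual"]),
                 "audio": synthesize_audio(s["text"])} for s in script]
    return assemble_video(segments)
```

The real stages call out to Gemini, SDXL, and Kokoro respectively; the point is only the hand-off of structured data between stages.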
- **Python 3.11**: Core programming language for the project.
- **Content Generation:**
  - **Gemini API**: Generates the script using the Gemini 2.0 Flash Thinking model and stores it in JSON format with audio and visual prompts and their respective parameters.
  - **Stable Diffusion XL Base 1.0**: Image generation using diffusion models, run either locally or hosted on Modal.
  - **Kokoro**: An open-weight TTS model that converts audio prompts into audio.
- **Video Processing:**
  - **MoviePy**: Adds text, intros, outros, transition effects, and subtitles, and handles audio processing, video processing, and final assembly, using FFmpeg under the hood.
- **ML Frameworks:**
  - **PyTorch**: Deep learning framework for model inference.
  - **Diffusers with SDXL Base 1.0**: Hugging Face's Diffusers library drives image generation with the SDXL Base 1.0 model using state-of-the-art diffusion techniques.
- **Development Tools:**
  - **Jupyter Notebooks**: For development and testing.
  - **Google Colab**: Free cloud GPU infrastructure for development and testing.
  - **Git**: For version control.
  - **Modal**: Low-cost, high-performance cloud GPU infrastructure.
- **Package Management:**
  - **UV**: Fast and efficient dependency management and project setup.
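For illustration, the kind of JSON script the Gemini stage produces might look like the following. This schema is a hypothetical example, not the project's exact format:

```python
import json

# Hypothetical example of a generated script in JSON form; the real
# schema used by ForgeTube may differ in field names and parameters.
script_json = """
{
  "topic": "The Water Cycle",
  "segments": [
    {
      "id": 1,
      "narration": "Water evaporates from oceans and lakes.",
      "visual_prompt": "sunlit ocean surface with rising mist, photorealistic",
      "voice": "default",
      "duration_seconds": 6
    }
  ]
}
"""

script = json.loads(script_json)
for seg in script["segments"]:
    print(seg["id"], seg["visual_prompt"])
```

Each segment carries both a narration line for the TTS stage and a visual prompt for the diffusion stage, which is what lets the later stages run independently.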
- Multi-Modal Content Generation: Seamlessly combines text, image, and audio generation
- Style Customization: Supports different content styles and tones
- Modular Architecture: Each component can be tested and improved independently
- Content Segmentation: Automatically breaks down content into manageable segments
- Custom Voice Options: Multiple TTS voices and emotional tones
- Format Flexibility: Supports different video durations and formats (.mp4 and .mkv)
- Performance Metrics: Tracks generation quality and consistency
- Error Handling: Robust error management across the pipeline
- Resource Optimization: Efficient resource usage during generation
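As a rough illustration of the content-segmentation feature, a naive splitter can break narration into sentences and pack them into bounded segments. This is only a sketch; the project's actual segmentation (which, per the dependencies, may rely on spaCy) will differ:

```python
import re

def segment_content(text, max_chars=120):
    """Naive content segmentation: split text into sentences, then pack
    sentences into segments no longer than max_chars each.
    Illustrative only; not ForgeTube's actual implementation."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    segments, current = [], ""
    for sentence in sentences:
        # Start a new segment when adding this sentence would overflow.
        if current and len(current) + len(sentence) + 1 > max_chars:
            segments.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        segments.append(current)
    return segments
```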
Clone the repo on your system:

```shell
git clone https://github.com/MLSAKIIT/ForgeTube.git
```
UV is a modern, high-performance Python package and project manager designed to streamline the development process. For more information, visit the UV Documentation.
Here's how you can use UV in this project:
- Install `uv`:

  ```shell
  pip install uv
  ```

- Download Python 3.11:

  ```shell
  uv python install 3.11
  ```

- Create a virtual environment:

  ```shell
  uv venv .venv
  ```

- Activate your virtual environment (PowerShell; on Linux/macOS use `source .venv/bin/activate`):

  ```shell
  .venv\Scripts\activate.ps1
  ```

- Install all dependencies:

  ```shell
  uv sync
  ```
For more information visit the Modal documentation.
Modal is a cloud function platform that lets you attach high-performance GPUs with a single line of code.
The nicest thing about all of this is that you don’t have to set up any infrastructure. Just:
- Create an account at modal.com.
- Run `pip install modal` to install the Modal Python package.
- Run `modal setup` to authenticate (if this doesn't work, try `python -m modal setup`).
To obtain a Gemini API key from Google AI Studio, follow these detailed steps:
Step 1: Sign In to Google AI Studio
Navigate to Google AI Studio. Once signed in, locate and click on the "Gemini API" tab. This can typically be found in the main navigation menu or directly on the dashboard. On the Gemini API page, look for a button labeled "Get API key in Google AI Studio" and click on it.
Step 2: Review and Accept Terms of Service
- Review Terms: A dialog box will appear presenting the Google APIs Terms of Service and the Gemini API Additional Terms of Service. It's essential to read and understand these terms before proceeding.
- Provide Consent: Check the box indicating your agreement to the terms. Optionally, you can also opt-in to receive updates and participate in research studies related to Google AI.
- Proceed: Click the "Continue" button to move forward.
Step 3: Create and Secure Your API Key
- Generate API Key: Click on the "Create API key" button. You'll be prompted to choose between creating a new project or selecting an existing one. Make your selection accordingly.
- Retrieve the Key: Once generated, your unique API key will be displayed. Ensure you copy and store it in a secure location.
Step 4: Add Your Key in `main.py` or `local_main.py`

```python
# 1. Generate the Script
gem_api = "Enter your Gemini API Key here"
serp_api = "Enter your Serp API key here"
```
> [!IMPORTANT]
> Always keep your API key confidential. Avoid sharing it publicly or embedding it directly into client-side code to prevent unauthorized access.
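One way to keep keys out of the source (a suggestion, not the project's current approach) is to read them from environment variables instead of editing the files:

```python
import os

# Suggested alternative to hardcoding keys in main.py: read them from
# environment variables and fail loudly when one is missing.
# The variable names below are illustrative, not ones ForgeTube defines.
def get_api_key(name):
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"Set the {name} environment variable first.")
    return key

# gem_api = get_api_key("GEMINI_API_KEY")
# serp_api = get_api_key("SERP_API_KEY")
```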
Serp is used for web-scraping Google search results on the video topic and gathering additional context to implement Retrieval Augmented Generation (RAG).
- Visit serpapi.com/ and create an account.
- Go to the dashboard and select "Api key" at the top left.
- Copy the API key and add it in `main.py` or `local_main.py`:

```python
# 1. Generate the Script
gem_api = "Enter your Gemini API Key here"
serp_api = "Enter your Serp API key here"
```
Run the following commands:

```shell
python -m pip install spacy  # If not installed for some reason
python -m spacy download en_core_web_sm
```
- Visit https://github.com/BtbN/FFmpeg-Builds/releases
- Download the build for your OS.
- On Windows, download the win64 version and extract the files.
- Make a directory at `C:\Program Files\FFmpeg`.
- Copy all the extracted files into that directory.
- Add `C:\Program Files\FFmpeg\bin` to your `PATH` environment variable.
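To confirm the binary is now discoverable, a quick check with Python's `shutil.which` (an illustrative helper, not part of the project) performs the same lookup the shell does:

```python
import shutil

def on_path(binary: str) -> bool:
    """Return True if `binary` can be found on the PATH.
    Illustrative helper for verifying the FFmpeg install."""
    return shutil.which(binary) is not None

print("ffmpeg found:", on_path("ffmpeg"))
```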
Use `main.py` to run image generation on Modal, or use `main_local.py` to run Stable Diffusion XL locally.
> [!IMPORTANT]
> 1. Make sure all the following paths are updated properly:
>
>    ```python
>    script_path = "resources/scripts/"
>    script_path += "script.json"  # Name of the script file
>    images_path = "resources/images/"
>    audio_path = "resources/audio/"
>    font_path = "resources/font/font.ttf"  # Not recommended to change
>    ```
> [!IMPORTANT]
> 2. Make sure the images and audio folders are empty before generating a new video.
> 3. The video file name is automatically grabbed from the video topic in the script. However, you may change the following variables to use custom names. If the file name is very long, the video file won't be generated, so do change it manually in such cases.
>
>    ```python
>    sub_output_file = "name of the subtitle file.srt"
>    video_file = "name of the video.mp4 or .mkv"
>    ```
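Because over-long file names prevent the video from being generated, a small helper (hypothetical, not part of ForgeTube) can derive a safe, truncated name from the topic:

```python
import re

def safe_video_name(topic, max_len=60, ext=".mp4"):
    """Turn a video topic into a filesystem-safe, length-limited file
    name. Hypothetical helper, not part of the ForgeTube codebase."""
    name = re.sub(r"[^\w\s-]", "", topic).strip()  # drop punctuation
    name = re.sub(r"\s+", "_", name)               # spaces -> underscores
    return name[:max_len].rstrip("_") + ext
```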
- `no module named pip found`: Try running the following:

  ```shell
  python -m pip install spacy pydub kokoro soundfile torch
  python -m spacy download en_core_web_sm
  ```

- Serp API not returning any search results: This is a known issue and is being investigated.
> [!IMPORTANT]
> Ensure you have sufficient GPU resources for image generation and that the proper model weights are downloaded. It is recommended to use an NVIDIA GPU with at least 24 GB of VRAM for running image generation locally, and a CPU with high single-core performance for video assembly.
> [!NOTE]
> Video generation times may vary based on content length, complexity, and the hardware used.
| CONTRIBUTORS | MENTORS | CONTENT WRITER |
| --- | --- | --- |
| Kartikeya Trivedi | Soham Roy | [Name] |
| Naman Singh | Yash Kumar Gupta | |
| Soham Mukherjee | | |
| Sumedha Gunturi | | |
| Souryabrata Goswami | | |
| Harshit Agarwal | | |
| Rahul Sutradhar | | |
| Ayush Mohanty | | |
| Shopno Banerjee | | |
| Shubham Gupta | | |
| Sarthak Singh | | |
| Nancy | | |
| Version | Date | Comments |
| --- | --- | --- |
| 1.0 | 23/02/2025 | Initial release |
- Pipeline foundations
- LLM Agent Handling
- Diffusion Agent Handling
- TTS Handling
- Video Assembly Engine
- Initial Deployment
- Advanced style transfer capabilities
- In-Context Generation for Diffusion Model
- Real-time generation monitoring
- Enhanced video transitions
- Better quality metrics
- Multi-language support
- Custom character consistency
- Animation effects
- Hugging Face Transformers - https://huggingface.co/transformers
- Hugging Face Diffusers - https://huggingface.co/diffusers
- FFmpeg - https://ffmpeg.org/
- UV - https://docs.astral.sh/uv/
- MoviePy - https://zulko.github.io/moviepy/getting_started/index.html
- The Illustrated Transformer - A visual, beginner-friendly introduction to transformer architecture.
- Attention Is All You Need - The seminal paper on transformer architecture.
- Gemini 2.0 Flash Thinking
- Introduction to Multi-Agent Systems - Fundamental concepts and principles.
- A Comprehensive Guide to Understanding LangChain Agents and Tools - Practical implementation guide.
- Kokoro
- Stable Diffusion XL Turbo 1.0 Base
- Stable Diffusion: A Comprehensive End-to-End Guide with Examples
- Stable Diffusion Explained
- Stable Diffusion Explained Step-by-Step with Visualization
- Understanding Stable Diffusion: The Magic Behind AI Image Generation
- Stable Diffusion Paper