MLSA Project Wing: ML

ForgeTube


🚧Our Project:

Our project focuses on creating an automated video generation system using AI. It transforms text prompts into fully narrated videos by leveraging large language models for script generation, diffusion models for image creation, and text-to-speech systems for narration. The system processes inputs through multiple stages, from script generation to final video assembly, producing cohesive, engaging content automatically.

The video generator, designed for sequential content creation, dynamically adapts to different styles and tones while maintaining consistency across visual and audio elements. It can also add subtitles, either embedded in the video or as a separate .srt file.

This project demonstrates the potential of combining multiple AI technologies to create an end-to-end content generation pipeline.

🖥️Project Stack:

Python 3.11: Core programming language for the project.

  • Content Generation:

    Gemini API: Generates the script using the Gemini 2.0 Flash Thinking model and stores it in JSON format with audio and visual prompts and their respective parameters.

    Stable Diffusion XL Base 1.0: For image generation using diffusion models, run either locally or hosted on Modal.

    Kokoro: An open-weight TTS model that converts the script's audio prompts into narration audio.

  • Video Processing:

    MoviePy: For adding text, intros, outros, transition effects, subtitles, audio processing, and final video assembly, using FFmpeg under the hood (see the assembly sketch after the Features list).

  • ML Frameworks:

    PyTorch: Deep learning framework for model inference.

    Diffusers with SDXL Base 1.0: Hugging Face's Diffusers library is used to generate images with the SDXL Base 1.0 model (see the sketch after this list).

  • Development Tools:

    Jupyter Notebooks: For development and testing.

    Google Colab: For free cloud GPU infrastructure for development and testing.

    Git: For version control.

    Modal: For low-cost, high-performance cloud GPU infrastructure.

  • Package Management:

    UV: For fast and efficient dependency management and project setup.
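For a sense of what the Diffusers bullet looks like in practice, here is a minimal sketch of SDXL image generation. It assumes a CUDA GPU and the public SDXL Base 1.0 checkpoint; the prompt and output path are illustrative, and this is not the repository's exact generation code.

# A minimal SDXL sketch with Hugging Face Diffusers; assumes a CUDA GPU.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")
# The prompt below is illustrative; ForgeTube feeds visual prompts from the JSON script.
image = pipe("a watercolor lighthouse at dawn", num_inference_steps=30).images[0]
image.save("frame_001.png")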

Features

  • Multi-Modal Content Generation: Seamlessly combines text, image, and audio generation
  • Style Customization: Supports different content styles and tones
  • Modular Architecture: Each component can be tested and improved independently
  • Content Segmentation: Automatically breaks down content into manageable segments
  • Custom Voice Options: Multiple TTS voices and emotional tones
  • Format Flexibility: Supports different video durations and formats (.mp4 and .mkv)
  • Performance Metrics: Tracks generation quality and consistency
  • Error Handling: Robust error management across the pipeline
  • Resource Optimization: Efficient resource usage during generation
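To make the assembly stage concrete, here is a minimal sketch of stitching per-segment images and narration into a video with MoviePy. It assumes MoviePy 1.x (the moviepy.editor namespace) and illustrative file names; it is not the repository's actual assembly code.

# A minimal assembly sketch with MoviePy 1.x; paths and segment count are illustrative.
from moviepy.editor import AudioFileClip, ImageClip, concatenate_videoclips

segments = []
for i in range(3):  # hypothetical: three script segments
    audio = AudioFileClip(f"resources/audio/segment_{i}.wav")
    # Show each image for exactly as long as its narration lasts.
    clip = ImageClip(f"resources/images/segment_{i}.png", duration=audio.duration)
    segments.append(clip.set_audio(audio))

final = concatenate_videoclips(segments, method="compose")
final.write_videofile("output.mp4", fps=24, codec="libx264")  # FFmpeg under the hood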

Steps for deployment:

Clone the repo on your system using: git clone https://github.com/MLSAKIIT/ForgeTube.git

1. Using UV for Python Package Management

For more information, visit the UV Documentation.

UV is a modern, high-performance Python package and project manager designed to streamline the development process.

Here’s how you can use UV in this project:

  1. Install UV:
pip install uv
  2. Download Python 3.11:
uv python install 3.11
  3. Create a virtual environment:
uv venv .venv
  4. Activate the virtual environment (Windows PowerShell; on Linux/macOS run source .venv/bin/activate instead):
.venv\scripts\activate.ps1
  5. Install all dependencies:
uv sync

2. Setting up Modal

For more information, visit the Modal documentation.

Modal is a cloud function platform that lets you attach high-performance GPUs with a single line of code.

The nicest thing about all of this is that you don’t have to set up any infrastructure. Just:

  1. Create an account at modal.com
  2. Run pip install modal to install the modal Python package
  3. Run modal setup to authenticate (if this doesn’t work, try python -m modal setup)
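To confirm the setup works, a small sketch like the following (not part of this repo; the app and function names are illustrative) attaches a GPU to a function:

# gpu_check.py -- a sketch for verifying Modal GPU access; not part of ForgeTube.
import modal

app = modal.App("gpu-check")  # hypothetical app name

@app.function(gpu="A10G")  # the single line that attaches a cloud GPU
def check_gpu():
    import subprocess
    return subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout

@app.local_entrypoint()
def main():
    print(check_gpu.remote())  # executes on Modal's cloud GPU, prints locally

Run it with modal run gpu_check.py.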

3. Get your Gemini API Key:

To obtain a Gemini API key from Google AI Studio, follow these detailed steps:

Step 1: Sign In to Google AI Studio

Navigate to Google AI Studio and sign in with your Google account. Once signed in, locate and click on the "Gemini API" tab; it can typically be found in the main navigation menu or directly on the dashboard. On the Gemini API page, look for a button labeled "Get API key in Google AI Studio" and click on it.

Step 2: Review and Accept Terms of Service

  1. Review Terms: A dialog box will appear presenting the Google APIs Terms of Service and the Gemini API Additional Terms of Service. It's essential to read and understand these terms before proceeding.
  2. Provide Consent: Check the box indicating your agreement to the terms. Optionally, you can also opt-in to receive updates and participate in research studies related to Google AI.
  3. Proceed: Click the "Continue" button to move forward.

Step 3: Create and Secure Your API Key

  1. Generate API Key: Click on the "Create API key" button. You'll be prompted to choose between creating a new project or selecting an existing one. Make your selection accordingly.
  2. Retrieve the Key: Once generated, your unique API key will be displayed. Ensure you copy and store it in a secure location.

Step 4: Add your key in main.py or local_main.py

# 1. Generate the Script
gem_api = "Enter your Gemini API Key here"
serp_api = "Enter your Serp API key here"

Important

Always keep your API key confidential. Avoid sharing it publicly or embedding it directly into client-side code to prevent unauthorized access.
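One common way to follow this advice, sketched below, is to read the keys from environment variables instead of hardcoding them; the GEMINI_API_KEY and SERP_API_KEY names are illustrative, not something the repo's scripts look for.

# A sketch of loading keys from environment variables (variable names are illustrative).
import os

gem_api = os.environ.get("GEMINI_API_KEY")
serp_api = os.environ.get("SERP_API_KEY")
if not gem_api or not serp_api:
    raise RuntimeError("Set GEMINI_API_KEY and SERP_API_KEY before running.")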

4. Setting up SerpApi

SerpApi is used to scrape Google search results on the video topic, gathering additional context to implement Retrieval Augmented Generation (RAG).

  1. Visit serpapi.com and create an account.
  2. Go to the dashboard and, at the top left, select "API key".
  3. Copy the API key and add it in main.py or local_main.py:
# 1. Generate the Script
gem_api = "Enter your Gemini API Key here"
serp_api = "Enter your Serp API key here"
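For illustration, here is a sketch of how search snippets might be fetched as RAG context using SerpApi's Python client (pip install google-search-results); the function name is illustrative and this is not the repository's exact retrieval code.

# A sketch of fetching Google search snippets as RAG context via SerpApi.
from serpapi import GoogleSearch

def fetch_search_context(topic: str, serp_api: str, n: int = 5) -> list[str]:
    results = GoogleSearch({"q": topic, "api_key": serp_api}).get_dict()
    # Keep the text snippets of the top organic results as extra LLM context.
    return [r.get("snippet", "") for r in results.get("organic_results", [])[:n]]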

5. Kokoro

Run the following commands:

python -m pip install spacy # If not already installed
python -m spacy download en_core_web_sm
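Once the spaCy model is installed, a minimal Kokoro sketch looks like the following (assumes pip install kokoro soundfile; the voice and language code are illustrative, not necessarily what ForgeTube uses):

# A minimal Kokoro TTS sketch; voice and lang_code are illustrative.
import soundfile as sf
from kokoro import KPipeline

pipeline = KPipeline(lang_code="a")  # "a" = American English
for i, (_, _, audio) in enumerate(pipeline("Welcome to ForgeTube.", voice="af_heart")):
    sf.write(f"segment_{i}.wav", audio, 24000)  # Kokoro outputs 24 kHz audio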

6. Download and setup FFmpeg

  1. Visit https://github.com/BtbN/FFmpeg-Builds/releases.
  2. Download the build for your OS.
  3. On Windows, download the win64 version and extract the files.
  4. Create a directory at C:\Program Files\FFmpeg.
  5. Copy the extracted files into that directory.
  6. Add C:\Program Files\FFmpeg\bin to your PATH environment variable.
  7. Verify the installation by running ffmpeg -version in a new terminal.

7. Start Generating:

Use main.py to run image generation on Modal, or use local_main.py to run Stable Diffusion XL locally.
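With the virtual environment activated, a typical invocation is simply:

python main.py

or, for the fully local pipeline:

python local_main.py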

Troubleshooting

Important

  1. Make sure all the following paths are set properly:
script_path = "resources/scripts/"
script_path += "script.json" # Name of the script file
images_path = "resources/images/"
audio_path = "resources/audio/"
font_path = "resources/font/font.ttf" # Not recommended to change

Important

2. Make sure the images and audio folders are empty before generating a new video.

  3. The video file name is grabbed automatically from the video topic in the script. You may set the following variables to use custom names; if the file names are very long, the video file won't be generated, so change them manually in such cases:
sub_output_file = "name of the subtitle file.srt"
video_file = "name of the video.mp4 or .mkv"
  4. "No module named ..." errors: try running the following:
python -m pip install spacy pydub kokoro soundfile torch
python -m spacy download en_core_web_sm
  5. SerpApi not returning any search results: this is a known issue and is being investigated.

Important

Ensure you have sufficient GPU resources for image generation and that the proper model weights are downloaded. An NVIDIA GPU with at least 24 GB of VRAM is recommended for running image generation locally, along with a CPU with strong single-core performance for video assembly.

Note

Video generation times may vary based on content length, complexity, and hardware used.

Contributors

CONTRIBUTORS           MENTORS             CONTENT WRITER
Kartikeya Trivedi      Soham Roy           [Name]
Naman Singh            Yash Kumar Gupta
Soham Mukherjee
Sumedha Gunturi
Souryabrata Goswami
Harshit Agarwal
Rahul Sutradhar
Ayush Mohanty
Shopno Banerjee
Shubham Gupta
Sarthak Singh
Nancy

Version

Version Date Comments
1.0 23/02/2025 Initial release

Future Roadmap

Part 1: Baseline

  • Pipeline foundations
  • LLM Agent Handling
  • Diffusion Agent Handling
  • TTS Handling
  • Video Assembly Engine
  • Initial Deployment

Part 2: Advanced

  • Advanced style transfer capabilities
  • In-Context Generation for Diffusion Model
  • Real-time generation monitoring
  • Enhanced video transitions
  • Better quality metrics
  • Multi-language support
  • Custom character consistency
  • Animation effects

Acknowledgements

Project References

1. Large Language Models (LLMs) & Transformers

2. Multi-Agent Systems

3. Image Generation & Processing

4. RAG
