
Advanced Visual-Only GUI Grounding Framework with Visual Segmentation Model and Large Language Model


Introduction  •  Tech Stack  •  Installation  •  Reference  •  Issue  •  License  •  I Putu Krisna Erlangga

📄 Introduction

The expanding demand for autonomous agents highlights the need for effective GUI grounding to enable accurate interaction with graphical interfaces. Traditional methods often depend on extensive fine-tuning, large datasets, and costly hardware, which restricts their accessibility. This research proposes a visual-only GUI grounding framework that integrates a visual segmentation model with large language models, eliminating the need for fine-tuning. The framework leverages the Segment Anything Model (SAM) to segment GUI images into regions, which are then captioned by small vision-language models such as BLIP. State-of-the-art large language models, such as GPT-4o, analyze these captions to align user queries with the relevant GUI elements. An evaluation on 350 GUI samples spanning web, mobile, and desktop environments demonstrates the framework's effectiveness. The proposed framework achieved an accuracy of 57.86%, outperforming SeeClick, a fine-tuned GUI model, which reached 56.91%. It also surpasses state-of-the-art LLMs (GPT-4o and GPT-4o-mini) by a margin of up to 30%. These results underscore the framework's ability to perform robust GUI grounding without fine-tuning, presenting a practical and efficient solution for diverse GUI-interaction applications.
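
As a rough illustration of how these stages fit together, here is a minimal sketch assuming SAM's automatic mask generator, a HuggingFace BLIP captioning model, and the OpenAI chat API; the function, prompt, and model names are illustrative and do not mirror the repository's actual code.

```python
# Minimal sketch of the visual-only grounding pipeline (illustrative, not the repo's code).
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator
from transformers import BlipProcessor, BlipForConditionalGeneration
from openai import OpenAI

def ground_query(image_path, query, sam_checkpoint):
    image = Image.open(image_path).convert("RGB")

    # 1. Segment the GUI screenshot into candidate regions with SAM.
    sam = sam_model_registry["vit_h"](checkpoint=sam_checkpoint)
    masks = SamAutomaticMaskGenerator(sam).generate(np.array(image))

    # 2. Caption each region crop with a small vision-language model (BLIP).
    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
    captions = []
    for m in masks:
        x, y, w, h = m["bbox"]  # SAM bounding boxes are XYWH
        inputs = processor(image.crop((x, y, x + w, y + h)), return_tensors="pt")
        captions.append(processor.decode(blip.generate(**inputs)[0], skip_special_tokens=True))

    # 3. Ask an LLM (e.g. GPT-4o) which captioned region best matches the user query.
    listing = "\n".join(f"{i}: {c} at {m['bbox']}" for i, (c, m) in enumerate(zip(captions, masks)))
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content":
                   f"Query: {query}\nRegions:\n{listing}\nReturn only the index of the best region."}],
    ).choices[0].message.content

    # Assumes the model returns a bare integer index.
    return masks[int(reply.strip())]["bbox"]
```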

💻 Tech Stack

Frameworks, libraries, and tools used in this project:

  • Python
  • PyTorch
  • Streamlit
  • HuggingFace
  • OpenAI

⚙️ Installation

Note: Tested with Python 3.10.4 and CUDA 11.8

  1. Clone this repository with git clone https://github.com/krsx/visual-gui-grounding.git, or click the Clone or Download button and then Download ZIP
  2. Install the required libraries: pip install -r requirements.txt
  3. Set up your .env file:
    OPENAI_API_KEY=enter_your_api_key_here
  4. Download a Segment Anything (SAM) model checkpoint from the official [weights](https://github.com/facebookresearch/segment-anything#model-checkpoints) (see the loading sketch after this list)
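
As a minimal sketch of how steps 3 and 4 come together at runtime, the snippet below loads the API key with python-dotenv and a SAM checkpoint with the segment-anything package; the package usage and checkpoint filename are assumptions, not taken from this repository's code.

```python
# Minimal sketch of loading the API key and SAM checkpoint (filename is illustrative).
import os
from dotenv import load_dotenv
from segment_anything import sam_model_registry

load_dotenv()  # reads OPENAI_API_KEY from .env into the environment
api_key = os.environ["OPENAI_API_KEY"]

# "vit_h" and the filename are assumptions; use whichever checkpoint you downloaded.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
```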

🔎 Usage Guide

We provide a Streamlit app as a convenient way to explore how the framework works. Run the app with:

streamlit run app.py



For framework evaluation, we provide three automated evaluation scripts. Follow these steps to use them (a sketch of the scoring metric follows the list):
  1. Download the ScreenSpot GUI datasets and annotations
  2. To run the framework evaluation:
    python eval.py --screenspot_imgs path/to/imgs --screenspot_test path/to/annotations --task all
  3. To run the SeeClick and Qwen-VL evaluations:
    python eval_seeclick.py --screenspot_imgs path/to/imgs --screenspot_test path/to/annotations --task all
    python eval_seeclick.py --screenspot_imgs path/to/imgs --screenspot_test path/to/annotations --task all --model qwen
  4. To run the GPT-4o and GPT-4o-mini evaluations:
    python eval_gpt.py --screenspot_imgs path/to/imgs --screenspot_test path/to/annotations --task all
    python eval_gpt.py --screenspot_imgs path/to/imgs --screenspot_test path/to/annotations --task all --model gpt-4o-mini
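
For context, ScreenSpot-style grounding accuracy is commonly computed by checking whether the predicted click point falls inside the ground-truth element's bounding box. The sketch below illustrates that metric; the field names are assumptions and it is not a copy of eval.py.

```python
# Illustrative scoring loop for ScreenSpot-style grounding accuracy.
# A prediction counts as correct when the predicted (x, y) click point lies
# inside the ground-truth bounding box. Field names are assumptions, not eval.py's.
def point_in_bbox(point, bbox):
    x, y = point
    left, top, width, height = bbox
    return left <= x <= left + width and top <= y <= top + height

def grounding_accuracy(predictions, annotations):
    correct = sum(point_in_bbox(p, a["bbox"]) for p, a in zip(predictions, annotations))
    return correct / len(annotations)
```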

📚 Reference

🚩 Issue

If you find a bug or an issue, please report it by opening a new issue in this repository.

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

📌 Authors

I Putu Krisna Erlangga

LinkedIn • GitHub • Portfolio
