Advanced Visual-Only GUI Grounding Framework with Visual Segmentation Model and Large Language Model
Introduction • Tech Stack • Installation • Reference • Issue • License • I Putu Krisna Erlangga
The expanding demand for autonomous agents highlights the necessity of effective GUI grounding to enable accurate interactions with graphical interfaces. Traditional methods often depend on extensive fine-tuning, large datasets, and high-cost hardware, restricting their accessibility. This research proposes a visual-only GUI grounding framework that integrates visual segmentation models with large language models, eliminating the need for fine-tuning. The framework leverages the Segment Anything Model (SAM) to segment GUI images into regions, which are then captioned using small vision-language models such as BLIP. State-of-the-art large language models, such as GPT-4o, analyze these captions to align user queries with relevant GUI elements. The evaluation, conducted on 350 GUI samples across web, mobile, and desktop environments, demonstrates the framework's effectiveness. The proposed framework achieved an accuracy of 57.86%, outperforming the 56.91% accuracy of SeeClick, a fine-tuned GUI model. Our framework also surpasses state-of-the-art LLMs (GPT-4o and GPT-4o-mini) by a margin of up to 30%. These results underscore the framework's capability to perform robust GUI grounding without fine-tuning, presenting a practical and efficient solution for diverse applications in GUI interaction.
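At a glance, the framework runs in three stages: SAM proposes candidate GUI regions, a captioning model such as BLIP describes each region, and an LLM such as GPT-4o selects the region whose caption best matches the user's instruction. The sketch below illustrates that flow in Python; the function name `ground_query`, the prompt wording, and the specific checkpoints are illustrative assumptions rather than the exact code in this repository.

```python
# Illustrative segment -> caption -> ground sketch (assumed names, prompt, and models;
# not the exact implementation in this repository).
import numpy as np
from PIL import Image
from openai import OpenAI
from transformers import pipeline
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # SAM weights from the installation step
mask_generator = SamAutomaticMaskGenerator(sam)
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
client = OpenAI()                                                     # reads OPENAI_API_KEY from the environment

def ground_query(image: np.ndarray, query: str) -> tuple[float, float]:
    """Return an (x, y) click point for the GUI element that best matches `query`."""
    regions = mask_generator.generate(image)                 # SAM region proposals
    captions = []
    for i, region in enumerate(regions):
        x, y, w, h = (int(v) for v in region["bbox"])        # XYWH bounding box of the segment
        crop = Image.fromarray(image[y:y + h, x:x + w])
        text = captioner(crop)[0]["generated_text"]          # BLIP caption for the region
        captions.append(f"{i}: {text}")
    prompt = ("These are captions of regions in a GUI screenshot:\n"
              + "\n".join(captions)
              + f"\nReply with only the index of the region that best matches: '{query}'")
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    best = regions[int(reply.strip())]                       # assumes the model replies with a bare index
    x, y, w, h = best["bbox"]
    return x + w / 2, y + h / 2                              # click at the center of the chosen region
```

Captioning regions first keeps the LLM input purely textual, which is what allows an off-the-shelf model to perform the grounding without any fine-tuning.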
Framework, Library, Database, Tools, etc
- Python
- PyTorch
- Streamlit
- HuggingFace
- OpenAI
Note: Tested with Python 3.10.4 and CUDA 11.8
- Clone this repository
git clone https://github.com/krsx/visual-gui-grounding.git
or click the `Clone or Download` button and then click `Download ZIP`
- Install the required libraries
pip install -r requirements.txt
- Set up your `.env` file:
OPENAI_API_KEY=enter_your_api_key_here
- Download the Segment Anything Model (SAM) [weights](https://github.com/facebookresearch/segment-anything#model-checkpoints) (the sketch after this list shows how the API key and checkpoint are typically used)
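For reference, here is a minimal sketch of how the `.env` key and the downloaded SAM checkpoint are typically consumed. It assumes `python-dotenv` is installed and the ViT-H checkpoint (`sam_vit_h_4b8939.pth`) was chosen; the actual loading code in this repository may differ.

```python
# Minimal setup sketch (assumes python-dotenv and the ViT-H checkpoint; adjust to your setup).
import os
from dotenv import load_dotenv
from openai import OpenAI
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

load_dotenv()                                         # reads OPENAI_API_KEY from .env
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to("cuda")                                        # tested with CUDA 11.8; use "cpu" without a GPU
mask_generator = SamAutomaticMaskGenerator(sam)
```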
We provide a Streamlit app as a convenient way to explore how the framework works. Run the app with:
streamlit run app.py
For framework evaluation, we provide 3 automated evaluation scripts. Follow the steps below to use them (a simplified sketch of the accuracy metric is shown after the steps):
- Download the ScreenSpot GUI dataset and annotations
- To execute the framework evaluation
python eval.py --screenspot_imgs path/to/imgs --screenspot_test path/to/annotations --task all
- To execute `SeeClick` and `Qwen-VL` evaluation
python eval_seeclick.py --screenspot_imgs path/to/imgs --screenspot_test path/to/annotations --task all
python eval_seeclick.py --screenspot_imgs path/to/imgs --screenspot_test path/to/annotations --task all --model qwen
- To execute `GPT-4o` and `GPT-4o-mini` evaluation
python eval_gpt.py --screenspot_imgs path/to/imgs --screenspot_test path/to/annotations --task all
python eval_gpt.py --screenspot_imgs path/to/imgs --screenspot_test path/to/annotations --task all --model gpt-4o-mini
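For context on what these scripts measure, ScreenSpot-style grounding accuracy counts a prediction as correct when the predicted click point falls inside the ground-truth bounding box of the target element. The sketch below illustrates that metric only; the annotation field names (`img_filename`, `bbox`, `instruction`) and the `ground_query` helper are assumptions for illustration, not the exact format used by the evaluation scripts.

```python
# Simplified sketch of ScreenSpot-style grounding accuracy: a prediction is correct
# when the predicted click point lands inside the target element's bounding box.
# Field names and the ground_query callable are assumptions for illustration.
import json
import numpy as np
from PIL import Image

def point_in_bbox(x: float, y: float, bbox) -> bool:
    """bbox is assumed to be (left, top, width, height) in pixels."""
    bx, by, bw, bh = bbox
    return bx <= x <= bx + bw and by <= y <= by + bh

def evaluate(annotation_file: str, image_dir: str, ground_query) -> float:
    with open(annotation_file) as f:
        samples = json.load(f)
    correct = 0
    for sample in samples:
        image = np.array(Image.open(f"{image_dir}/{sample['img_filename']}").convert("RGB"))
        pred_x, pred_y = ground_query(image, sample["instruction"])  # predicted click point
        correct += point_in_bbox(pred_x, pred_y, sample["bbox"])
    return correct / len(samples)                                    # accuracy, e.g. 0.5786 for the proposed framework
```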
If you find a bug or an issue, please report it by opening a new issue in this repository.
This project is licensed under the MIT License - see the LICENSE file for details