Advanced Visual-Only GUI Grounding Framework with Visual Segmentation Model and Large Language Model
Introduction • Tech Stack • Installation • Reference • Issue • License • I Putu Krisna Erlangga
The expanding demand for autonomous agents highlights the necessity of effective GUI grounding to enable accurate interactions with graphical interfaces. Traditional methods often depend on extensive fine-tuning, large datasets, and high-cost hardware, restricting their accessibility. This research proposes a visual-only GUI grounding framework that integrates visual segmentation models with large language models, eliminating the need for fine-tuning. The framework leverages the Segment Anything Model (SAM) to segment GUI images into regions, which are then captioned using small vision-language models such as BLIP. State-of-the-art large language models, such as GPT-4o, analyze these captions to align user queries with relevant GUI elements. The evaluation, conducted on 350 GUI samples across web, mobile, and desktop environments, demonstrates the framework's effectiveness. The proposed framework achieved an accuracy of 57.86%, outperforming the 56.91% accuracy of SeeClick, a fine-tuned GUI model. Our framework also surpasses state-of-the-art LLMs (GPT-4o and GPT-4o-mini) by a margin of up to 30%. These results underscore the framework's capability to perform robust GUI grounding without fine-tuning, presenting a practical and efficient solution for diverse applications in GUI interaction.
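At a glance, the framework runs in three stages: SAM proposes candidate GUI regions, a captioning model such as BLIP describes each region, and an LLM such as GPT-4o selects the region whose caption best matches the user's instruction. The sketch below illustrates that flow in Python; the function name `ground_query`, the prompt wording, and the specific checkpoints are illustrative assumptions rather than the exact code in this repository.

```python
# Illustrative segment -> caption -> ground sketch (assumed names, prompt, and models;
# not the exact implementation in this repository).
import numpy as np
from PIL import Image
from openai import OpenAI
from transformers import pipeline
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # SAM weights from the installation step
mask_generator = SamAutomaticMaskGenerator(sam)
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
client = OpenAI()                                                     # reads OPENAI_API_KEY from the environment

def ground_query(image: np.ndarray, query: str) -> tuple[float, float]:
    """Return an (x, y) click point for the GUI element that best matches `query`."""
    regions = mask_generator.generate(image)                 # SAM region proposals
    captions = []
    for i, region in enumerate(regions):
        x, y, w, h = (int(v) for v in region["bbox"])        # XYWH bounding box of the segment
        crop = Image.fromarray(image[y:y + h, x:x + w])
        text = captioner(crop)[0]["generated_text"]          # BLIP caption for the region
        captions.append(f"{i}: {text}")
    prompt = ("These are captions of regions in a GUI screenshot:\n"
              + "\n".join(captions)
              + f"\nReply with only the index of the region that best matches: '{query}'")
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    best = regions[int(reply.strip())]                       # assumes the model replies with a bare index
    x, y, w, h = best["bbox"]
    return x + w / 2, y + h / 2                              # click at the center of the chosen region
```

Captioning regions first keeps the LLM input purely textual, which is what allows an off-the-shelf model to perform the grounding without any fine-tuning.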
Framework, Library, Database, Tools, etc
- Python
- PyTorch
- Streamlit
- HuggingFace
- OpenAI
Note: Tested with Python 3.10.4 and CUDA 11.8
- Clone this repository
git clone https://github.com/krsx/visual-gui-grounding.git
or click the `Clone or Download` button and then click `Download ZIP`
- Install the required libraries
pip install -r requirements.txt
- Set up your `.env` file:
OPENAI_API_KEY=enter_your_api_key_here
- Download the Segment Anything Model (SAM) [weights](https://github.com/facebookresearch/segment-anything#model-checkpoints) (the sketch after this list shows how the API key and checkpoint are typically used)
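For reference, here is a minimal sketch of how the `.env` key and the downloaded SAM checkpoint are typically consumed. It assumes `python-dotenv` is installed and the ViT-H checkpoint (`sam_vit_h_4b8939.pth`) was chosen; the actual loading code in this repository may differ.

```python
# Minimal setup sketch (assumes python-dotenv and the ViT-H checkpoint; adjust to your setup).
import os
from dotenv import load_dotenv
from openai import OpenAI
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

load_dotenv()                                         # reads OPENAI_API_KEY from .env
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to("cuda")                                        # tested with CUDA 11.8; use "cpu" without a GPU
mask_generator = SamAutomaticMaskGenerator(sam)
```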
We provide a Streamlit app as a convenient way to explore how the framework works. Run the app with:
streamlit run app.py
For framework evaluation, we provide 3 automated evaluation scripts. Follow the steps below to use them (a simplified sketch of the accuracy metric is shown after the steps):
- Download the ScreenSpot GUI dataset and annotations
- To execute the framework evaluation
python eval.py --screenspot_imgs path/to/imgs --screenspot_test path/to/annotations --task all
- To execute `SeeClick` and `Qwen-VL` evaluation
python eval_seeclick.py --screenspot_imgs path/to/imgs --screenspot_test path/to/annotations --task all
python eval_seeclick.py --screenspot_imgs path/to/imgs --screenspot_test path/to/annotations --task all --model qwen
- To execute `GPT-4o` and `GPT-4o-mini` evaluation
python eval_gpt.py --screenspot_imgs path/to/imgs --screenspot_test path/to/annotations --task all
python eval_gpt.py --screenspot_imgs path/to/imgs --screenspot_test path/to/annotations --task all --model gpt-4o-mini
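For context on what these scripts measure, ScreenSpot-style grounding accuracy counts a prediction as correct when the predicted click point falls inside the ground-truth bounding box of the target element. The sketch below illustrates that metric only; the annotation field names (`img_filename`, `bbox`, `instruction`) and the `ground_query` helper are assumptions for illustration, not the exact format used by the evaluation scripts.

```python
# Simplified sketch of ScreenSpot-style grounding accuracy: a prediction is correct
# when the predicted click point lands inside the target element's bounding box.
# Field names and the ground_query callable are assumptions for illustration.
import json
import numpy as np
from PIL import Image

def point_in_bbox(x: float, y: float, bbox) -> bool:
    """bbox is assumed to be (left, top, width, height) in pixels."""
    bx, by, bw, bh = bbox
    return bx <= x <= bx + bw and by <= y <= by + bh

def evaluate(annotation_file: str, image_dir: str, ground_query) -> float:
    with open(annotation_file) as f:
        samples = json.load(f)
    correct = 0
    for sample in samples:
        image = np.array(Image.open(f"{image_dir}/{sample['img_filename']}").convert("RGB"))
        pred_x, pred_y = ground_query(image, sample["instruction"])  # predicted click point
        correct += point_in_bbox(pred_x, pred_y, sample["bbox"])
    return correct / len(samples)                                    # accuracy, e.g. 0.5786 for the proposed framework
```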
If you find a bug or an issue, please report it by opening a new issue in this repository.
This project is licensed under the MIT License - see the LICENSE file for details