OS-ATLAS: A Foundation Action Model For Generalist GUI Agents

Zhiyong Wu1*, Zhenyu Wu1,2*, Fangzhi Xu1*, Yian Wang2*, Qiushi Sun3, Chengyou Jia1, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang4, et al.
1Shanghai AI Lab, 2Shanghai Jiao Tong University, 3University of Hong Kong, 4MIT
*Equal contribution

💻 Demo1: Hide `__pycache__` in VSCode.

🌐 Demo2: Enlarge Font Size in Chrome.

📱 Demo3: Send SMS via Simple Messenger.

Overview


An overview of OS-Atlas's capabilities and its performance across grounding and agent benchmarks spanning mobile, desktop, and web platforms.

Abstract

Existing efforts in building GUI agents heavily rely on the availability of robust commercial Vision-Language Models (VLMs) such as GPT-4o and GeminiProVision. Practitioners are often reluctant to use open-source VLMs due to their significant performance lag compared to their closed-source counterparts, particularly in GUI grounding and Out-Of-Distribution (OOD) scenarios. To facilitate future research in this area, we developed OS-Atlas—a foundational GUI action model that excels at GUI grounding and OOD agentic tasks through innovations in both data and modeling. We have invested substantial engineering effort into developing a toolkit for synthesizing multi-platform GUI grounding data. Leveraging this toolkit, we are releasing the largest open-source cross-platform GUI grounding corpus to date, which contains over 13 million GUI elements. This dataset, combined with innovations in model training, provides a solid foundation for OS-Atlas to understand GUI screenshots and generalize to unseen interfaces. Through extensive evaluation across six benchmarks spanning three different platforms (mobile, desktop, and web), OS-Atlas demonstrates significant performance improvements over previous state-of-the-art models. Our evaluation also uncovers valuable insights into continuously improving and scaling the agentic capabilities of open-source VLMs. All our data, code, and models will be made publicly available.
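As a quick taste of what "GUI grounding" means in practice, here is a minimal sketch of querying the grounding model for an on-screen element, assuming the checkpoint is published on Hugging Face as `OS-Copilot/OS-Atlas-Base-7B` and exposes the standard transformers Qwen2-VL chat interface. The repo id, prompt wording, and output format are assumptions for illustration, not the paper's exact protocol.

```python
# Hedged sketch: ground a natural-language instruction to screen coordinates.
# Assumes a Qwen2-VL-based checkpoint (repo id below is an assumption).
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

MODEL_ID = "OS-Copilot/OS-Atlas-Base-7B"  # assumed Hugging Face repo name

model = Qwen2VLForConditionalGeneration.from_pretrained(MODEL_ID, device_map="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

screenshot = Image.open("screenshot.png")  # any GUI screenshot
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": 'Locate the element: "Enlarge font size" button.'},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[screenshot], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=64)
# The model is expected to answer with element coordinates (a box or click
# point); the exact output format depends on the released checkpoint.
print(processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:],
                             skip_special_tokens=True)[0])
```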

Grounding Data Collection

Overview of the toolkit and pipeline used to synthesize cross-platform (mobile, desktop, and web) GUI grounding data.
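Each synthesized sample pairs a screenshot with a referring expression and the coordinates of the target element. As a rough illustration only (the field names and layout below are assumptions, not the released schema), one grounding record might look like:

```python
# Illustrative only: a plausible shape for one synthesized grounding sample.
# Field names are assumptions, not the released corpus schema.
sample = {
    "platform": "web",                     # mobile / desktop / web
    "screenshot": "shots/000123.png",      # rendered screenshot
    "instruction": "the search button in the top toolbar",
    "bbox": [0.812, 0.031, 0.868, 0.074],  # normalized (x1, y1, x2, y2)
}

# Grounding training pairs the instruction with the element location;
# a click point can be derived from the box center.
x1, y1, x2, y2 = sample["bbox"]
print(f"click target ≈ ({(x1 + x2) / 2:.3f}, {(y1 + y2) / 2:.3f})")
```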

Training Pipeline


Overall training pipeline of OS-Atlas. We first perform large-scale pre-training on the 13M collected GUI grounding samples to build OS-Atlas-Base. We then conduct multitask fine-tuning on agent data, yielding OS-Atlas.
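A compact way to read the pipeline: stage one adapts a backbone VLM to GUI grounding, producing OS-Atlas-Base; stage two fine-tunes that checkpoint on agent trajectories. The sketch below only encodes this two-stage structure; the dataset labels, objectives, and wording are paraphrases of the figure, not the paper's training configuration.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    init_from: str  # checkpoint the stage starts from
    data: str
    objective: str

# Two-stage recipe mirroring the figure; identifiers are placeholders.
PIPELINE = [
    Stage("grounding pre-training",
          init_from="backbone VLM (e.g. InternVL-2 or Qwen2-VL)",
          data="13M cross-platform GUI grounding samples",
          objective="predict element coordinates from referring expressions"),
    Stage("multitask agent fine-tuning",
          init_from="OS-Atlas-Base",
          data="agent data across mobile, desktop, and web",
          objective="predict the next action from task, history, and screenshot"),
]

for stage in PIPELINE:
    print(f"{stage.name}: {stage.init_from} + {stage.data} -> {stage.objective}")
```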

Experiments: Grounding Tasks

ScreenSpot


Grounding accuracy on ScreenSpot. The best results are in bold.
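For context on the metric: ScreenSpot-style grounding evaluation conventionally scores a prediction as correct when the predicted click point falls inside the ground-truth element's bounding box. The snippet below implements that common convention (it is not quoted from the paper's evaluation code).

```python
def is_hit(pred_xy, gt_bbox):
    """Point-in-box grounding check: pred_xy = (x, y),
    gt_bbox = (x1, y1, x2, y2), all in the same (e.g. normalized) units."""
    x, y = pred_xy
    x1, y1, x2, y2 = gt_bbox
    return x1 <= x <= x2 and y1 <= y <= y2

def grounding_accuracy(preds, gts):
    """Fraction of predicted click points that land inside the target box."""
    hits = sum(is_hit(p, g) for p, g in zip(preds, gts))
    return hits / len(gts)

# Toy example: 2 of 3 predicted points land inside their target boxes.
preds = [(0.50, 0.50), (0.10, 0.10), (0.90, 0.20)]
gts = [(0.4, 0.4, 0.6, 0.6), (0.2, 0.2, 0.3, 0.3), (0.85, 0.15, 0.95, 0.25)]
print(grounding_accuracy(preds, gts))  # 0.666...
```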

OSWorld


Success rate on the OSWorld benchmark, broken down by app (domain).
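The per-app breakdown is simply the success rate computed within each domain, i.e. successful episodes divided by total episodes for that app. A hedged sketch of the aggregation (the records and domain names below are invented for illustration):

```python
from collections import defaultdict

# Invented example records: (app domain, episode succeeded?)
episodes = [
    ("Chrome", True), ("Chrome", False),
    ("VS Code", True), ("VS Code", True),
    ("GIMP", False),
]

totals, wins = defaultdict(int), defaultdict(int)
for domain, success in episodes:
    totals[domain] += 1
    wins[domain] += success  # bool adds as 0/1

for domain in totals:
    print(f"{domain}: {wins[domain] / totals[domain]:.0%} success rate")
```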

Experiments: Agent Tasks

Web & Desktop Platform


Results on web and desktop tasks. InternVL-2/Qwen2-VL and OS-Atlas-4B/7B differ only in initialization: the former start from the original checkpoints, while the latter are fine-tuned from OS-Atlas-Base.

Mobile Platform


Results on mobile tasks.

BibTeX

@article{wu2024atlas,
  title={OS-ATLAS: A Foundation Action Model for Generalist GUI Agents},
  author={Wu, Zhiyong and Wu, Zhenyu and Xu, Fangzhi and Wang, Yian and Sun, Qiushi and Jia, Chengyou and Cheng, Kanzhi and Ding, Zichen and Chen, Liheng and Liang, Paul Pu and others},
  journal={arXiv preprint arXiv:2410.23218},
  year={2024}
}