ScreenAgent: A Computer Control Agent Driven by Visual Language Large Model

Jul 6, 2024 | Educational

ScreenAgent Logo

View ScreenAgent Paper


Introduction

Welcome to the frontier of AI-driven desktop control! The ScreenAgent project has harnessed the power of Visual Language Model (VLM) agents, enabling them to not just observe but interact meaningfully with computer screens. Imagine having a digital assistant that can navigate through your computer just by understanding visual cues – that’s what ScreenAgent brings to the table!

Recent Developments

  • (2024-4-17) The ScreenAgent paper has been accepted for presentation at IJCAI 2024!
  • (2024-5-19) The ScreenAgent Web Client has been released, offering a simpler way to experience controlling a desktop with a large model.

How Does It Work?

Picture a skilled puppeteer orchestrating a performance, where every move has been meticulously planned, executed, and reflected upon to ensure a flawless show. Similarly, ScreenAgent operates through a three-phase process:

  • Planning: Breaking down user tasks into manageable subtasks.
  • Execution: Observing the screen and providing specific mouse and keyboard actions to accomplish these subtasks.
  • Reflection: Evaluating the outcomes, adjusting actions accordingly, and continuing until the overall task is completed.
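The three phases above can be sketched as a simple control loop. This is an illustrative outline only, with hypothetical function names (`plan`, `execute`, `reflect`); it is not the project's actual API:

```python
# A minimal, hypothetical sketch of ScreenAgent's plan-execute-reflect loop.
# The plan/execute/reflect callables are illustrative, not ScreenAgent's real interface.

def run_task(task, plan, execute, reflect, max_steps=10):
    """Drive a task through planning, execution, and reflection phases."""
    subtasks = plan(task)                # Planning: split the task into subtasks
    for _ in range(max_steps):
        if not subtasks:
            return True                  # All subtasks done: overall task complete
        current = subtasks[0]
        outcome = execute(current)       # Execution: issue mouse/keyboard actions
        done, retry = reflect(current, outcome)  # Reflection: evaluate the outcome
        if done:
            subtasks.pop(0)              # Subtask succeeded, move on
        elif not retry:
            return False                 # Unrecoverable failure, give up
    return False                         # Step budget exhausted
```

The key design point is that reflection gates progress: a subtask is only removed from the queue once its outcome is judged successful, which lets the agent retry or abort rather than blindly advancing.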

Setup Instructions

To set up ScreenAgent effectively, follow these steps:

Step 1: Prepare the Desktop for Control

  • Install a VNC Server like TightVNC, or run a Docker container with a GUI environment.
  • Run the following command to pull and start the container:
  • docker run -d --name ScreenAgent -e RESOLUTION=1024x768 -p 5900:5900 -p 8001:8001 -e VNC_PASSWORD=YOUR_VNC_PASSWORD -e CLIPBOARD_SERVER_SECRET_TOKEN=YOUR_SECRET_TOKEN -v /dev/shm:/dev/shm niuniushan/screenagent-env:latest
  • Set the VNC password and clipboard service token appropriately.
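Before moving on, it can help to confirm the container's VNC server is actually listening. The snippet below is a small sanity check I would suggest, assuming the default host and port from the docker command above (127.0.0.1:5900):

```python
# Quick sanity check that the VNC server inside the container is reachable.
# Host/port values are assumptions matching the docker command above.
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: once the container is up, port_open("127.0.0.1", 5900) should be True.
```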

Step 2: Prepare the Controller Code Environment

  • Run the following command to install the required dependencies:
  • pip install -r client/requirements.txt

Step 3: Set Up the Large Model Inferencer or API

Choose a suitable VLM and configure it in your client/config.yml file. Options include GPT-4V, LLaVA-1.5, and others. Be sure to fill in the API keys where required.
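As an illustration, a config.yml for the GPT-4V option might look like the fragment below. The key names here are a hypothetical sketch, not the canonical schema; check the client/config.yml shipped with the repository for the real field names:

```yaml
# Hypothetical sketch only - see client/config.yml in the repo for the real schema
remote_vnc_server:
  host: "127.0.0.1"          # VNC server address from Step 1
  port: 5900
  password: "YOUR_VNC_PASSWORD"
llm_api:
  model: "gpt-4-vision-preview"
  api_key: "YOUR_OPENAI_API_KEY"
```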

Running the Controller

After setting everything up, you can run the controller using:

cd client
python run_controller.py -c config.yml

Troubleshooting

If you experience any hiccups during setup or operation, consider the following checks:

  • If the screen freezes, hit the Re-connect button on the controller interface.
  • Ensure your VNC server IP and port number are properly configured in client/config.yml.
  • For clipboard issues, verify that the clipboard service is running correctly.
  • If you encounter an error related to Pyperclip, set the X server location in the DISPLAY environment variable and restart the clipboard service:
  • export DISPLAY=:0.0

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

Stay Informed with the Newest F(x) Insights and Blogs

Tech News and Blog Highlights, Straight to Your Inbox