Summary
Screenshots are expensive in tokens.
I need to rethink the approach of using vision models for Computer Use. LLMs can also interact with the Accessibility Tree instead of relying purely on vision, which should make the system more compatible and more token-efficient.
Not every OS exposes this kind of structured data, so the task is to explore where the Accessibility Tree can be used to make Computer Use more efficient.
One option is a tool that reads the tree and selects an element, with a fallback: when that fails, it consumes a screenshot instead (GetLatestScreenshot).
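A minimal sketch of that tree-first, screenshot-fallback idea, to make the proposal concrete. Everything here is an assumption for illustration: `AXNode`, `Observation`, `find_element`, and `select_element` are hypothetical names, `read_tree` stands in for whatever platform accessibility API is available (returning `None` where the OS exposes no tree), and `get_latest_screenshot` stands in for GetLatestScreenshot.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class AXNode:
    """Hypothetical, simplified accessibility-tree node."""
    role: str
    name: str = ""
    children: list["AXNode"] = field(default_factory=list)


@dataclass
class Observation:
    """What the tool hands back to the LLM."""
    kind: str                           # "element" or "screenshot"
    element: Optional[AXNode] = None
    screenshot: Optional[bytes] = None


def find_element(root: AXNode, role: str, name: str) -> Optional[AXNode]:
    """Depth-first search for the first node matching role and name."""
    stack = [root]
    while stack:
        node = stack.pop()
        if node.role == role and node.name == name:
            return node
        # Reverse so children are visited in document order.
        stack.extend(reversed(node.children))
    return None


def select_element(
    read_tree: Callable[[], Optional[AXNode]],
    get_latest_screenshot: Callable[[], bytes],
    role: str,
    name: str,
) -> Observation:
    """Try the accessibility tree first; pay for a screenshot only on failure."""
    root = read_tree()                  # None when the OS exposes no tree (assumption)
    if root is not None:
        match = find_element(root, role, name)
        if match is not None:
            return Observation("element", element=match)
    # Fallback: tree unavailable or element not found.
    return Observation("screenshot", screenshot=get_latest_screenshot())
```

With this shape, the screenshot callable is invoked only when the tree is missing or the lookup fails, so the token cost of vision is paid only on the fallback path. Matching on exact role and name is the simplest possible selector; a real tool would likely need fuzzier matching.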
Acceptance Criteria