Tarsier

Cost / License

  • Free
  • Open Source

Platforms

  • Self-Hosted
  • Python

Features

  1.  Ad-free
  2.  OCR
  3.  Python-based
  4.  AI-Powered

Tarsier information

  • Developed by: Reworkd AI
  • Licensing: Free and Open Source (MIT)
  • Alternatives: 2 alternatives listed
  • Supported Languages: English

AlternativeTo Categories

Development, Office & Productivity

GitHub repository

  •  1,744 Stars
  •  117 Forks
  •  17 Open Issues
View on GitHub
Tarsier was added to AlternativeTo by Paul.

What is Tarsier?

If you've tried using an LLM to automate web interactions, you've probably run into questions like:

  • How should you feed the webpage to an LLM? (e.g. HTML, Accessibility Tree, Screenshot)
  • How do you map LLM responses back to web elements?
  • How can you inform a text-only LLM about the page's visual structure?

At Reworkd, we iterated on all these problems across tens of thousands of real web tasks to build a powerful perception system for web agents... Tarsier! In the video below, we use Tarsier to provide webpage perception for a minimalistic GPT-4 LangChain web agent.

How does it work?

Tarsier visually tags interactable elements on a page via brackets and an ID, e.g. [23]. In doing this, we provide a mapping between elements and IDs for an LLM to take actions upon (e.g. CLICK [23]). We define interactable elements as buttons, links, or input fields that are visible on the page; Tarsier can also tag all textual elements if you pass tag_text_elements=True.
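As a rough sketch of what this looks like in code, the snippet below drives Tarsier from Playwright. The `Tarsier` class, `GoogleVisionOCRService`, and `page_to_text` names follow the project's README-style Python API; treat the exact signatures as assumptions rather than a definitive reference.

```python
import asyncio
import json

from playwright.async_api import async_playwright
from tarsier import GoogleVisionOCRService, Tarsier


async def main():
    # OCR backend for the text representation (assumes you supply a
    # Google Cloud service-account JSON file).
    with open("google_credentials.json") as f:
        ocr_service = GoogleVisionOCRService(json.load(f))
    tarsier = Tarsier(ocr_service)

    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto("https://news.ycombinator.com")

        # Tag interactable (and, here, textual) elements with [ID] markers
        # and get back the text representation plus an ID -> XPath mapping.
        page_text, tag_to_xpath = await tarsier.page_to_text(
            page, tag_text_elements=True
        )
        print(page_text)     # whitespace-structured page text for the LLM
        print(tag_to_xpath)  # e.g. {23: '//html/body/.../a[2]'}


asyncio.run(main())
```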

Furthermore, we've developed an OCR algorithm to convert a page screenshot into a whitespace-structured string (almost like ASCII art) that even an LLM without vision can understand. This is critical because current vision-language models still lack the fine-grained visual representations needed for web interaction tasks. On our internal benchmarks, unimodal GPT-4 + Tarsier-Text beats GPT-4V + Tarsier-Screenshot by 10-20%!
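Closing the loop, the ID-to-XPath mapping is what lets an agent turn a text-only LLM response back into a browser action. A minimal sketch, where `execute_action` is a hypothetical helper (not part of Tarsier) that parses a response like CLICK [23] and resolves the ID through the mapping:

```python
import re


async def execute_action(page, llm_response: str, tag_to_xpath: dict[int, str]) -> None:
    # Hypothetical helper: extract an action like "CLICK [23]" from the
    # LLM's response and click the element that tag ID maps to.
    match = re.search(r"CLICK \[(\d+)\]", llm_response)
    if match:
        xpath = tag_to_xpath[int(match.group(1))]
        await page.locator(f"xpath={xpath}").click()
```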

Official Links