
@harshvz/crawler

A flexible web crawler and scraping tool built on Playwright, supporting both BFS and DFS crawling strategies, with screenshot capture and structured output. Installable via npm and usable both as a CLI and programmatically.
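
After installing it (npm install @harshvz/crawler), the tool can be driven from code as well as from the terminal. The sketch below is illustrative only: the exported names and option fields are assumptions, since this page does not document the exact API.

```typescript
// Hypothetical programmatic use of @harshvz/crawler. `crawl`,
// `strategy`, `maxDepth`, and `screenshots` are assumed names, not
// confirmed exports or options of the package.
import { crawl } from "@harshvz/crawler";

await crawl({
  url: "https://example.com", // start URL
  strategy: "bfs",            // "bfs" or "dfs", per the description above
  maxDepth: 2,                // limit traversal depth (assumed option)
  screenshots: true,          // full-page screenshots, per the description
});
```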


Cost / License

Open Source (Apache-2.0), free.

Platforms

  • Mac
  • Windows
  • Linux

Features

  1.  Browser Automation
  2.  Crawler

Tags

  • graph-traversal
  • internal-links
  • npm-package
  • package
  • screenshot-automation
  • headless-browser
  • playwright
  • web-crawling
  • typescript
  • meta-data-extraction
  • bfs
  • dom-parsing
  • dfs
  • seo-analysis
  • cli-tool

@harshvz/crawler News & Activities

Highlights All activities

Recent activities

@harshvz/crawler information

  • Developed by

    HarshVz (India)
  • Licensing

    Open Source (Apache-2.0), free.
  • Written in

    TypeScript
  • Alternatives

    30 alternatives listed
  • Supported Languages

    • English

GitHub repository

  • 3 Stars
  • 0 Forks
  • 3 Open Issues


What is @harshvz/crawler?

@harshvz/crawler is a Playwright-based web crawler designed to turn websites into reusable knowledge artifacts.

Unlike traditional scrapers that focus on extracting isolated data fields, this tool focuses on capturing meaning-bearing content from real, JavaScript-rendered pages and preserving it in a form suitable for documentation, internal knowledge bases, and AI/LLM workflows.

The crawler navigates websites using BFS or DFS strategies, renders each page in a real browser, and extracts core semantic elements such as metadata, headings (H1–H6), paragraphs, and inline text. The extracted content is stored as Markdown files, alongside full-page screenshots, providing both textual knowledge and visual ground truth for every crawled page.
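
To make the per-page pipeline concrete, here is a minimal sketch written against Playwright's public API directly, not this package's internals: render a page in a real browser, extract the title, headings, and paragraphs into Markdown, and save a full-page screenshot.

```typescript
// Sketch of the per-page extraction step described above, using
// Playwright's public API; @harshvz/crawler's own code may differ.
import { chromium } from "playwright";
import { writeFile } from "node:fs/promises";

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto("https://example.com", { waitUntil: "networkidle" });

// Core semantic elements: metadata (title), headings H1-H6, paragraphs.
const title = await page.title();
const headings = await page.$$eval("h1,h2,h3,h4,h5,h6", (els) =>
  els.map((el) => "#".repeat(Number(el.tagName[1])) + " " + (el.textContent ?? "").trim())
);
const paragraphs = await page.$$eval("p", (els) =>
  els.map((el) => (el.textContent ?? "").trim()).filter((t) => t.length > 0)
);

// Persist textual knowledge (Markdown) and visual ground truth (screenshot).
await writeFile("page.md", ["# " + title, "", ...headings, "", ...paragraphs].join("\n"));
await page.screenshot({ path: "page.png", fullPage: true });

await browser.close();
```

A BFS crawl would additionally collect same-origin links from each page (for example via page.$$eval("a[href]", ...)) into a queue, while DFS would push them onto a stack before moving on.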

The project is intentionally opinionated and minimal:

  • It prioritizes content understanding over raw scraping speed.
  • It captures human-readable, context-preserving text.
  • It produces outputs that are immediately usable by humans and machines.

At its core, @harshvz/crawler is built as a knowledge ingestion layer: a foundation for turning websites into structured documentation, searchable knowledge bases, or LLM-ready corpora, while remaining fully local, open source, and developer-controlled.

As the project evolves, the focus is on making extraction more controllable and deterministic, allowing users to define what content is captured and how it is organized, without introducing black-box behavior or external dependencies.
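
As a thought experiment, user-controlled extraction could look something like the sketch below. Every name here is hypothetical; the project has not published such a configuration format, so this only illustrates the kind of control the paragraph above describes.

```typescript
// Purely illustrative configuration shape; none of these field names
// are confirmed by @harshvz/crawler.
interface ExtractionConfig {
  include: string[];              // CSS selectors to capture
  exclude: string[];              // selectors to drop (nav, footers, ...)
  organizeBy: "page" | "section"; // how captured Markdown is grouped on disk
}

const config: ExtractionConfig = {
  include: ["h1, h2, h3", "p", "li"],
  exclude: ["nav", "footer", ".cookie-banner"],
  organizeBy: "page",
};
```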
