

Heritrix
Open-source, extensible web crawler designed for large-scale, archival-quality web archiving. It preserves digital artifacts, supports modular plugins, distributed crawling, detailed monitoring, and scheduling, and exports data in standardized formats for preservation.
Cost / License
- Free
- Open Source
Platforms
- Mac
- Windows
- Linux
Features
- WARC Output
Tags
- Web Crawler
- web-data-crawling
- web-crawling
Heritrix information
What is Heritrix?
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
Heritrix (sometimes spelled heretrix, or misspelled or mispronounced as heratrix/heritix/heretix/heratix) is an archaic word for heiress (a woman who inherits). Since our crawler seeks to collect and preserve the digital artifacts of our culture for the benefit of future researchers and generations, this name seemed apt.





