ArchiveBox

Name: ArchiveBox
Rating: 4 (7 reviews)

87 likes

Self-hosted archiving platform storing web content as HTML, PDFs, screenshots, media, and WARC files; imports links from bookmarks, RSS, or files; supports browser history, complex sites, JSON indexing, Git repo archiving, regular scheduling, and offline browsing.

Cost / License

Free
Open Source (MIT)

Origin

United States

Platforms

Mac
Windows
Linux
Self-Hosted
Docker

ArchiveBox alternatives

4.0 Very Good (7 reviews)
87 likes
3 comments
67 alternatives
0 articles

Features

Properties

Privacy focused

Features

Network Tools
Website screenshots
Anticensorship


Comments and Reviews

Top Positive Comment

Francewhoa

★

May 17, 2020

Strength: • Free • Fast • No censorship • Software code is community owned. ArchiveBox (AB) is better than both Archive Today (AT) and Wayback Machine (WB). Why better? Because AB software has stronger security and stronger privacy. Why? Because AB code is open source, which means it is available for both public review and contributions. AT and WB, which are not open source, have weaker security and weaker privacy. • Attractive MIT license https://github.com/pirate/ArchiveBox/blob/master/LICENSE

Challenge: • The downside with ArchiveBox is that you need to either install it on your own server or hire someone to do so.

Note: • See screenshots at https://github.com/pirate/ArchiveBox#screenshots • Code repository at https://github.com/pirate/ArchiveBox

Dylan Hooton

★

Oct 4, 2023

This site is too hard to use and doesn't let me archive sites.

Abhishek kumar

★

Aug 15, 2020

Works fine on Manjaro Linux distro yet lacks a lot of useful features and is under development now.

[Edited by abhi884, August 15]


What is ArchiveBox?

Because modern websites are complicated and often rely on dynamic content, ArchiveBox archives the sites in several different formats beyond what public archiving services like Archive.org and Archive.is are capable of saving.

ArchiveBox imports a list of URLs from stdin, a remote URL, or a file, then adds the pages to a local archive folder. It uses wget to create a browsable HTML clone, youtube-dl to extract media, and a full instance of headless Chrome for PDF, screenshot, and DOM dumps, among other outputs.
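To make the multi-tool approach concrete, here is a minimal Python sketch that plans one command line per archive method for a single URL. It is an illustration, not ArchiveBox's actual code: the function name plan_commands, the chromium binary name, and the exact choice of flags are assumptions, and nothing is executed.

```python
from typing import Dict, List
from urllib.parse import urlparse


def plan_commands(url: str, out_dir: str) -> Dict[str, List[str]]:
    """Return one argv list per archive method for a URL (nothing is run).

    Hypothetical helper: the flag selection is illustrative and is not
    taken from ArchiveBox itself.
    """
    if urlparse(url).scheme not in ("http", "https"):
        raise ValueError(f"not an http(s) URL: {url!r}")

    # Assumption: the headless browser binary is called "chromium".
    chrome = ["chromium", "--headless", "--window-size=1440,900"]

    return {
        # Browsable HTML clone with page assets and rewritten links.
        "wget": ["wget", "--page-requisites", "--convert-links",
                 "--adjust-extension", f"--directory-prefix={out_dir}", url],
        # Media plus metadata and subtitles.
        "media": ["youtube-dl", "--write-info-json", "--write-sub",
                  "-o", f"{out_dir}/media/%(title)s.%(ext)s", url],
        # Rendered outputs from the headless browser.
        "pdf": chrome + [f"--print-to-pdf={out_dir}/output.pdf", url],
        "screenshot": chrome + [f"--screenshot={out_dir}/screenshot.png", url],
        "dom": chrome + ["--dump-dom", url],
    }
```

Each list could be handed to subprocess.run; keeping planning separate from execution makes it easy to skip a method whose tool is not installed.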

Using multiple methods and the market-dominant browser to execute JS ensures we can save even the most complex, finicky websites in at least a few high-quality, long-term data formats.

Can import links from:

Pocket, Pinboard, Instapaper
RSS, XML, JSON, or plain text lists
Browser history or bookmarks (Chrome, Firefox, Safari, IE, Opera, and more)
Shaarli, Delicious, Reddit Saved Posts, Wallabag, Unmark.it, and any other text with links in it!
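Importing from "any other text with links in it" can be approximated with a simple regular expression. The sketch below is an assumption about how such an import might work, not ArchiveBox's parser; the names URL_RE and extract_urls are hypothetical, and the pattern is deliberately loose.

```python
import re
from typing import List

# Loose pattern: an http(s) scheme followed by characters that are unlikely
# to be part of surrounding markup or punctuation.
URL_RE = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_urls(text: str) -> List[str]:
    """Return unique http(s) URLs in order of first appearance."""
    seen = set()
    urls = []
    for match in URL_RE.finditer(text):
        # Drop trailing sentence punctuation that the pattern may have swallowed.
        url = match.group(0).rstrip(".,;:!?")
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Because the same function handles plain text, HTML bookmark exports, and RSS or JSON dumps, it trades precision for breadth; a format-specific parser would be more reliable where the format is known.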

Can save these things for each site:

favicon.ico: favicon of the site
example.com/page-name.html: wget clone of the site, with .html appended if not present
output.pdf: printed PDF of the site using headless Chrome
screenshot.png: 1440x900 screenshot of the site using headless Chrome
output.html: DOM dump of the HTML after rendering using headless Chrome
archive.org.txt: a link to the saved site on archive.org
warc/: the HTML plus a gzipped WARC file <timestamp>.gz
media/: any mp4, mp3, subtitles, and metadata found using youtube-dl
git/: clone of any repository for GitHub, Bitbucket, or GitLab links
index.html & index.json: HTML and JSON index files containing metadata and details
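Since every saved page ends up as a folder of plain files, checking a snapshot for completeness needs nothing more than the filesystem. The following sketch assumes one folder per archived page containing the fixed-name outputs listed above; the wget clone is left out because its path depends on the site. The names EXPECTED_OUTPUTS and missing_outputs are hypothetical.

```python
from pathlib import Path
from typing import List

# Fixed-name outputs from the list above (the wget clone path varies per site).
EXPECTED_OUTPUTS = [
    "favicon.ico", "output.pdf", "screenshot.png", "output.html",
    "archive.org.txt", "warc", "media", "git", "index.html", "index.json",
]


def missing_outputs(snapshot_dir: Path) -> List[str]:
    """Return the expected output names that are absent from a snapshot folder."""
    return [name for name in EXPECTED_OUTPUTS
            if not (snapshot_dir / name).exists()]
```

A missing entry is not necessarily an error: a page with no media or no Git repository would simply have nothing to save for those methods.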

The archiving is additive, so you can schedule ./archive to run regularly and pull new links into the index. All saved content is static and indexed with JSON files, so it lives forever, is easily parseable, and requires no always-running backend.
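The additive behaviour can be pictured as a merge into a JSON index that only appends links it has not seen before. This is a minimal sketch under a stated assumption: the index is a flat JSON list of objects with a "url" key, which is an illustrative layout and not ArchiveBox's real schema. The function name merge_into_index is hypothetical.

```python
import json
from pathlib import Path
from typing import List


def merge_into_index(index_path: Path, new_urls: List[str]) -> List[str]:
    """Append URLs not yet in the JSON index and return the ones that were added.

    Assumed index layout: a JSON list of {"url": ...} objects.
    """
    entries = json.loads(index_path.read_text()) if index_path.exists() else []
    known = {entry["url"] for entry in entries}

    added = []
    for url in new_urls:
        if url not in known:
            known.add(url)
            entries.append({"url": url})
            added.append(url)

    # The index stays a static file, so no running service is needed to read it.
    index_path.write_text(json.dumps(entries, indent=2))
    return added
```

Running the merge on a schedule with the same input is harmless, since links already present are skipped and only new ones are returned for archiving.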