Why Diffbot?
We're focused exclusively on getting you better web data.
Some of the reasons hundreds of customers make (hundreds of) millions of calls every month:
#The Web's Best Content Extractor:
Diffbot works automatically—without rules or training. There's no better way to extract data from web pages. See how Diffbot stacks up to other content extraction methods:
Feature Comparison | Text-Extraction Quality Shootout
#Identify Pages Automatically:
Use the Analyze API to automatically find and extract all products, articles, discussions or images while crawling any site.
Analyze API
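As a rough illustration of what a call to the Analyze API might look like, here is a minimal sketch that only builds the request URL and makes no network call. The endpoint path and the `token` and `url` parameter names are assumptions made for this example, and `build_analyze_url` is a hypothetical helper rather than part of any official client, so check the API documentation before relying on any of them.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed endpoint for illustration only; confirm against the API documentation.
ANALYZE_ENDPOINT = "https://api.diffbot.com/v3/analyze"


def build_analyze_url(token: str, page_url: str,
                      endpoint: str = ANALYZE_ENDPOINT) -> str:
    """Build a GET URL asking the (assumed) Analyze endpoint to process page_url.

    urlencode percent-encodes the target page's URL so that its own query
    string is not mistaken for parameters of the API request.
    """
    query = urlencode({"token": token, "url": page_url})
    return f"{endpoint}?{query}"


# Usage: the token below is a placeholder, not a real credential.
request_url = build_analyze_url("YOUR_TOKEN", "https://example.com/item?id=1&ref=home")
decoded = parse_qs(urlsplit(request_url).query)
print(decoded["url"][0])
```

Encoding the page address as a single `url` parameter keeps the target page's own `?id=1&ref=home` intact when the request is decoded on the other side.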
#Detailed Product Data:
The Product API automatically returns complete product info, including all pricing data, product IDs, brand and full specifications tables.
Product API
#Clean Text and HTML:
Articles, discussion threads, product descriptions and image captions are returned in pure text and sanitized HTML.
Start testing today
#Structured Search:
Search structured content from any crawl on-the-fly using our Search API, returning only the matching results.
Plus...
¤ All APIs execute JavaScript, so content is rendered as it would be in a regular browser.
¤ Works on most non-English pages thanks to visual processing.
¤ Date normalization: Datestamps are normalized and presented in RFC 1123 (HTTP/1.1) standard format.
¤ Multipage articles are automatically joined together in a single API response.
¤ Entity extraction: automatic tagging identifies major topics and entities within article text.
¤ Fix any issues in real time with the API Toolkit.
¤ Bulk API allows the extraction of hundreds to hundreds of thousands of pages.
¤ Access Crawlbot and Bulk job data in full JSON or CSV formats.
¤ Optionally crawl using a diverse array of IP addresses.
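Because datestamps arrive in RFC 1123 format, they can be read with standard tooling. The sketch below uses Python's standard library to parse such a datestamp and to write one back out. The sample value is the well-known example date from the HTTP specification, not output from a Diffbot response, and the helper names `parse_rfc1123` and `to_rfc1123` are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def parse_rfc1123(stamp: str) -> datetime:
    """Parse an RFC 1123 datestamp into a timezone-aware datetime."""
    return parsedate_to_datetime(stamp)


def to_rfc1123(moment: datetime) -> str:
    """Format a datetime as an RFC 1123 datestamp in GMT.

    The value is converted to UTC first; a naive datetime would be
    interpreted as local time by astimezone().
    """
    return format_datetime(moment.astimezone(timezone.utc), usegmt=True)


# Illustrative datestamp (the example date used in the HTTP specification).
sample = "Sun, 06 Nov 1994 08:49:37 GMT"
parsed = parse_rfc1123(sample)
print(parsed.isoformat())
print(to_rfc1123(parsed))
```

Converting to a timezone-aware `datetime` right away makes it straightforward to sort, compare or re-serialize dates from different pages.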