This repository contains a list of Well Known Bots, including robots, crawlers,
validators, monitors, and spiders, in a single JSON file. Each bot has an
identifier and a RegExp pattern to match against an HTTP User-Agent header.
Additional metadata is available on each item.
Download the well-known-bots.json file directly.
It's impossible to create a system that can detect all bots. Well-behaved bots identify themselves in a consistent manner, usually with a User-Agent string that the patterns in this project match. These bots are straightforward to identify, but misbehaving bots pretend to be real clients and use various mechanisms to evade detection.
For more details, see Non-Technical Notes in the browser-fingerprinting project.
Each entry in the JSON represents a specific bot or crawler and includes the following fields:
- id: A unique identifier for the bot
- categories: An array of categories the bot belongs to (e.g., "search-engine", "advertising")
- pattern: A regular expression pattern used to identify the bot in user agent strings
- url: (optional) A URL with more information about the bot
- verification: A list of supported methods for verifying the bot's identity (if the bot is not verifiable it should be empty).
- instances: An array of example user agent strings for the bot
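As a minimal sketch of how these fields might be used, the Python below parses a list of entries and returns the id of the first entry whose pattern matches a User-Agent string. The entry in SAMPLE_JSON is hypothetical and only mirrors the field names above; it is not copied from well-known-bots.json. The function names compile_entries and find_bot are also illustrative. It assumes the patterns are compatible with Python's re module, which may not hold for every pattern in the real file.

```python
import json
import re

# Hypothetical entry that mirrors the documented fields (not real project data).
SAMPLE_JSON = r"""
[
  {
    "id": "example-bot",
    "categories": ["search-engine"],
    "pattern": "ExampleBot\\/",
    "url": "https://example.com/bot",
    "verification": [],
    "instances": [
      "Mozilla/5.0 (compatible; ExampleBot/1.0; +https://example.com/bot)"
    ]
  }
]
"""


def compile_entries(entries):
    """Compile each entry's pattern once so lookups do not recompile it."""
    return [(entry["id"], re.compile(entry["pattern"])) for entry in entries]


def find_bot(user_agent, compiled):
    """Return the id of the first entry whose pattern matches, else None."""
    for bot_id, regex in compiled:
        if regex.search(user_agent):
            return bot_id
    return None


entries = json.loads(SAMPLE_JSON)
compiled = compile_entries(entries)

# The instances field doubles as test data: every example should match.
for entry in entries:
    for instance in entry["instances"]:
        assert find_bot(instance, compiled) == entry["id"]
```

To use the real data, replace json.loads(SAMPLE_JSON) with json.load on the downloaded file. A match only shows that the client claims to be the bot; the verification field exists because the User-Agent header is easy to forge.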
Each verification entry contains the following fields:
- type: The method of verification (dns and cidr are supported)
If you specify dns verification then these fields are expected:
- masks: An array of mask patterns used for verification
If you specify cidr verification then these fields are expected:
- sources: An array of sources to pull cidr range data from (at least one is required)
The mask patterns use the following special characters:
- *: Represents 0 or 1 of any character
- @: Acts as a wildcard, matching any number of characters
All other characters in the mask require an exact match.
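The mask rules above can be sketched by translating a mask into a regular expression. This is an illustration under two assumptions that the text does not state: the mask must match the whole value rather than a substring, and the value being tested is a hostname (for example, one obtained from a reverse DNS lookup). The names mask_to_regex and mask_matches and the example.com hostnames are hypothetical.

```python
import re


def mask_to_regex(mask):
    """Translate a mask into a compiled regular expression."""
    parts = []
    for ch in mask:
        if ch == "*":
            parts.append(".?")   # 0 or 1 of any character
        elif ch == "@":
            parts.append(".*")   # any number of characters
        else:
            parts.append(re.escape(ch))  # everything else must match exactly
    return re.compile("".join(parts))


def mask_matches(mask, value):
    """Return True when the whole value matches the mask (assumed full match)."""
    return mask_to_regex(mask).fullmatch(value) is not None


print(mask_matches("@.example.com", "crawl-1.example.com"))       # True
print(mask_matches("crawl-*.example.com", "crawl-12.example.com"))  # False
```

Requiring a full match matters for safety: with substring matching, a hostname such as crawl-1.example.com.attacker.test would satisfy the first mask.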
Each cidr source requires the following fields:
- type: The type of source (currently only http-json is supported)
- url: The URL that hosts the IP ranges
- selector: A JsonPath selector that selects all of the IP ranges in the source
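A rough sketch of how an http-json source might be consumed: fetch the document, apply the selector, and test a client IP against the selected ranges. To stay within the standard library, the select function below handles only a small JsonPath subset ($, .name, and [*]); a real consumer would likely use a full JsonPath library. The sample document, its field names, the selector, and all function names are assumptions made for illustration.

```python
import ipaddress
import json
import re
import urllib.request


def select(document, selector):
    """Evaluate a tiny JsonPath subset: `$`, `.name`, and `[*]` only."""
    if not selector.startswith("$"):
        raise ValueError("selector must start with '$'")
    rest = selector[1:]
    tokens = re.findall(r"\.([A-Za-z_][A-Za-z0-9_]*)|(\[\*\])", rest)
    # Reject anything outside the supported subset instead of ignoring it.
    if "".join("." + name if name else wild for name, wild in tokens) != rest:
        raise ValueError("unsupported selector: " + selector)
    nodes = [document]
    for name, wild in tokens:
        next_nodes = []
        for node in nodes:
            if wild:
                if isinstance(node, list):
                    next_nodes.extend(node)
                elif isinstance(node, dict):
                    next_nodes.extend(node.values())
            elif isinstance(node, dict) and name in node:
                next_nodes.append(node[name])
        nodes = next_nodes
    return nodes


def ip_in_ranges(ip, ranges):
    """Return True when the IP address falls inside any CIDR range."""
    address = ipaddress.ip_address(ip)
    for cidr in ranges:
        network = ipaddress.ip_network(cidr, strict=False)
        if address.version == network.version and address in network:
            return True
    return False


def fetch_ranges(url, selector):
    """Download a JSON source and select its IP ranges (not called here)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return select(json.load(response), selector)


# Hypothetical source document using reserved documentation address blocks.
SAMPLE_SOURCE = json.loads("""
{"prefixes": [{"ipv4Prefix": "192.0.2.0/24"}, {"ipv6Prefix": "2001:db8::/32"}]}
""")

ranges = select(SAMPLE_SOURCE, "$.prefixes[*].ipv4Prefix")
print(ip_in_ranges("192.0.2.10", ranges))    # True
print(ip_in_ranges("198.51.100.1", ranges))  # False
```

In practice the fetched ranges would be cached and refreshed periodically rather than downloaded on every request.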
The project is a hard-fork of crawler-user-agents at commit
46831767324e10c69c9ac6e538c9847853a0feb9, which is distributed under the MIT
License.