Well Known Bots

This repository contains a list of Well Known Bots, including robots, crawlers, validators, monitors, and spiders, in a single JSON file. Each bot has a unique identifier and a RegExp pattern to match against the HTTP User-Agent header. Each entry also carries additional metadata.

Install

Direct download

Download the well-known-bots.json file directly.

Realities

It's impossible to create a system that can detect all bots. Well-behaving bots identify themselves in a consistent manner, usually via the User-Agent patterns this project provides. It is straightforward to identify these well-behaving bots, but misbehaving bots pretend to be real clients and use various mechanisms to evade detection.

For more details, see Non-Technical Notes in the browser-fingerprinting project.

Structure

Each entry in the JSON represents a specific bot or crawler and includes the following fields:

  • id: A unique identifier for the bot
  • categories: An array of categories the bot belongs to (e.g., "search-engine", "advertising")
  • pattern: A regular expression pattern used to identify the bot in user agent strings
  • url: (optional) A URL with more information about the bot
  • verification: A list of supported methods for verifying the bot's identity (an empty list if the bot cannot be verified)
  • instances: An array of example user agent strings for the bot
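
As a rough illustration of how these fields fit together, the sketch below matches a User-Agent string against a list of entries. The `example-bot` entry, its URL, and the `find_bot` helper are hypothetical and exist only for this example. In practice you would load the real entries with `json.load` from well-known-bots.json. The sketch also assumes the patterns are compatible with Python's `re` module, which may not hold for every entry.

```python
import re

# Hypothetical entry shaped like the fields described above.
# It is NOT copied from well-known-bots.json.
ENTRIES = [
    {
        "id": "example-bot",
        "categories": ["search-engine"],
        "pattern": "ExampleBot",
        "url": "https://example.com/bot",
        "verification": [],
        "instances": [
            "Mozilla/5.0 (compatible; ExampleBot/1.0; +https://example.com/bot)"
        ],
    }
]


def find_bot(user_agent, entries):
    """Return the first entry whose pattern matches the User-Agent, or None."""
    for entry in entries:
        # re.search finds the pattern anywhere in the header value.
        if re.search(entry["pattern"], user_agent):
            return entry
    return None


if __name__ == "__main__":
    ua = ENTRIES[0]["instances"][0]
    match = find_bot(ua, ENTRIES)
    print(match["id"] if match else "no known bot matched")
```

The instances array doubles as test data: each example string should match its own entry's pattern.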

Verification

Each verification entry contains the following fields:

  • type: The method of verification (dns and cidr are supported)

For dns verification, the following field is expected:

  • masks: An array of mask patterns used for verification

For cidr verification, the following field is expected:

  • sources: An array of sources to pull cidr range data from (at least one is required)

Verification mask patterns

The mask patterns use the following special characters:

  • *: Represents 0 or 1 of any character
  • @: Acts as a wildcard, matching any number of characters

All other characters in the mask require an exact match.
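
One way to read these rules is as a translation from a mask to a regular expression. The sketch below does that under two assumptions that the text above does not state: the mask must match the whole hostname, and the hostname comes from a reverse DNS lookup of the client IP. The helper names `mask_to_regex` and `matches_mask` and the example hostnames are hypothetical.

```python
import re


def mask_to_regex(mask):
    """Translate a verification mask into a compiled regular expression."""
    parts = []
    for ch in mask:
        if ch == "*":
            parts.append(".?")   # 0 or 1 of any character
        elif ch == "@":
            parts.append(".*")   # any number of characters
        else:
            parts.append(re.escape(ch))  # everything else matches exactly
    return re.compile("".join(parts))


def matches_mask(mask, hostname):
    """Return True if the whole hostname matches the mask (assumed semantics)."""
    return mask_to_regex(mask).fullmatch(hostname) is not None


if __name__ == "__main__":
    print(matches_mask("crawl-@.example.com", "crawl-66-249.example.com"))
    print(matches_mask("bot*.example.com", "bot12.example.com"))
```

Escaping the literal characters matters here: an unescaped `.` in a hostname mask would otherwise match any character.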

Cidr verification sources

Each cidr source requires the following fields:

  • type: The type of source (currently only http-json is supported)
  • url: The URL that hosts the IP ranges
  • selector: A JSONPath selector that selects all of the IP ranges in the source
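
Once the ranges have been pulled from a source, the check itself is a containment test. The sketch below assumes the CIDR strings have already been fetched from the source url and extracted with the selector. The fetch and the JSONPath evaluation are deliberately left out. The `ip_in_ranges` helper is hypothetical, and the ranges shown are reserved documentation addresses, not real bot ranges.

```python
import ipaddress


def ip_in_ranges(ip, cidr_ranges):
    """Return True if the IP address falls inside any of the CIDR ranges."""
    address = ipaddress.ip_address(ip)
    for cidr in cidr_ranges:
        # strict=False tolerates ranges written with host bits set.
        network = ipaddress.ip_network(cidr, strict=False)
        # Only compare addresses and networks of the same IP version.
        if address.version == network.version and address in network:
            return True
    return False


if __name__ == "__main__":
    # Documentation-only ranges, used here as placeholders.
    ranges = ["192.0.2.0/24", "2001:db8::/32"]
    print(ip_in_ranges("192.0.2.10", ranges))
    print(ip_in_ranges("198.51.100.1", ranges))
```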

License

This project is a hard fork of crawler-user-agents at commit 46831767324e10c69c9ac6e538c9847853a0feb9, which is distributed under the MIT License.