
Crawler Reference

Complete reference for crawler YAML configuration.

Top-Level Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `name` | string | required | Unique identifier for the crawler |
| `description` | string | None | Human-readable description |
| `delay` | int | 0 | Default delay between tasks (seconds) |
| `expire` | int | 1 | Days until cached data expires |
| `stealthy` | bool | false | Use random User-Agent headers |

Pipeline

The pipeline key defines the crawler stages:

```yaml
pipeline:
  stage_name:
    method: operation_name
    params:
      key: value
    handle:
      pass: next_stage
```
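
Each stage's `method` resolves to a Python callable, either a built-in operation name or a `module:function` path. As an illustration only, a custom operation might look like the sketch below; the two-argument `(context, data)` signature and the `filter_urls` name are assumptions, not taken from the library:

```python
# Hypothetical custom operation, reachable in YAML as
#   method: mymodule:filter_urls
# Assumes the operation receives (context, data) and routes
# results onward via context.emit().
def filter_urls(context, data):
    # Read a stage parameter with a fallback default.
    suffix = context.params.get("suffix", ".pdf")
    url = data.get("url", "")
    if url.endswith(suffix):
        # Emitted data goes to the stage mapped under handle: pass
        context.emit(data)
    else:
        context.emit_warning("skipped: %s" % url)
```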

Stage Configuration

| Option | Type | Description |
| --- | --- | --- |
| `method` | string | Operation name or `module:function` path |
| `params` | dict | Parameters passed to the operation |
| `handle` | dict | Handler routing (rule → stage name) |

Handlers

Handlers route data to subsequent stages based on the operation's output:

| Handler | Description |
| --- | --- |
| `pass` | Default success handler |
| `fetch` | URLs that need fetching |
| `store` | Data ready for storage |
| `fragment` | FTM entity fragments |

Custom handlers can be defined by operations.

Rules

Rules filter HTTP responses based on URL, content type, or document structure.

Rule Types

| Rule | Description | Example |
| --- | --- | --- |
| `domain` | Match URLs from a domain (including subdomains) | `example.com` |
| `pattern` | Match URL against a regex pattern | `.*\.pdf$` |
| `mime_type` | Match exact MIME type | `application/pdf` |
| `mime_group` | Match MIME type group | `documents` |
| `xpath` | Match if XPath finds elements | `//div[@class="article"]` |
| `match_all` | Always matches (default) | `{}` |

MIME Groups

| Group | Description |
| --- | --- |
| `web` | HTML, CSS, JavaScript |
| `images` | Image files |
| `media` | Audio and video |
| `documents` | PDF, Office documents, text |
| `archives` | ZIP, TAR, compressed files |
| `assets` | Fonts, icons, other assets |

Boolean Operators

Combine rules using and, or, and not:

```yaml
# Match PDFs from example.com
rules:
  and:
    - domain: example.com
    - mime_type: application/pdf

# Match documents but not images
rules:
  and:
    - mime_group: documents
    - not:
        mime_group: images

# Match either domain
rules:
  or:
    - domain: example.com
    - domain: example.org
```
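
The semantics of nested `and`/`or`/`not` rules can be modeled with a small recursive evaluator. This is a simplified sketch for intuition, not the crawler's actual implementation; the `matches` helper, the response-as-dict shape, and the handling of only `domain` and `mime_type` are assumptions:

```python
def matches(rule, response):
    """Evaluate a nested rule dict against a response-like dict (sketch)."""
    if "and" in rule:
        return all(matches(r, response) for r in rule["and"])
    if "or" in rule:
        return any(matches(r, response) for r in rule["or"])
    if "not" in rule:
        return not matches(rule["not"], response)
    if "domain" in rule:
        # Matches the domain itself and any subdomain.
        host = response.get("host", "")
        dom = rule["domain"]
        return host == dom or host.endswith("." + dom)
    if "mime_type" in rule:
        return response.get("content_type") == rule["mime_type"]
    # Empty rule behaves like match_all.
    return True
```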

Complex Example

```yaml
parse:
  method: parse
  params:
    rules:
      and:
        - domain: dataresearchcenter.org
        - not:
            or:
              - domain: vis.dataresearchcenter.org
              - domain: data.dataresearchcenter.org
              - mime_group: images
              - pattern: ".*/about.*"
    store:
      mime_group: documents
  handle:
    fetch: fetch
    store: store
```

Aggregator

Run postprocessing after the crawler completes:

```yaml
aggregator:
  method: module:function
  params:
    key: value
```

The aggregator function receives a context object with access to crawler state.
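
As an illustration, an aggregator is an ordinary function reachable via `module:function`. The single-`context` signature and the `report` name below are assumptions based on the description above, not the library's documented API:

```python
# Hypothetical aggregator, reachable in YAML as
#   method: mymodule:report
# Assumes it receives the context, with params and the
# structured logger available as attributes.
def report(context):
    label = context.params.get("label", "crawl")
    context.log.info("%s finished" % label)
```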

Context Object

Operations receive a Context object with these attributes:

Properties

| Property | Type | Description |
| --- | --- | --- |
| `context.crawler` | `Crawler` | The crawler instance |
| `context.stage` | `CrawlerStage` | Current stage |
| `context.run_id` | `str` | Unique run identifier |
| `context.params` | `dict` | Stage parameters |
| `context.log` | `Logger` | Structured logger (structlog) |
| `context.http` | `ContextHttp` | HTTP client |

Methods

| Method | Description |
| --- | --- |
| `emit(data, rule='pass', optional=False)` | Emit data to the next stage |
| `recurse(data, delay=None)` | Re-queue the current stage |
| `get(key, default=None)` | Get a param with env var expansion |
| `store_file(path)` | Store a file in the archive; returns the content hash |
| `store_data(data, encoding='utf-8')` | Store bytes in the archive |
| `check_tag(tag)` | Check if a tag exists |
| `get_tag(tag)` | Get a tag value |
| `set_tag(tag, value)` | Set a tag value |
| `skip_incremental(*criteria)` | Check/set incremental skip |
| `emit_warning(message, **kwargs)` | Log a warning |
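
The tag methods support a common incremental pattern: record that an item was seen, then skip it on later runs. A hedged sketch of that pattern, exercised here against a stub rather than the real `Context` class:

```python
# Hypothetical operation using tags as a seen-set.
# Assumes check_tag/set_tag behave as described in the table above.
def fetch_once(context, data):
    url = data.get("url")
    if context.check_tag(url):
        return  # already processed in an earlier run
    context.set_tag(url, True)
    context.emit(data)
```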

HTTP Client

The context.http client provides:

Methods

```python
context.http.get(url, **kwargs)
context.http.post(url, **kwargs)
context.http.rehash(data)  # Restore response from serialized data
context.http.save()        # Persist session state
context.http.reset()       # Clear session state
```

Request Parameters

| Parameter | Description |
| --- | --- |
| `headers` | Extra HTTP headers |
| `auth` | Tuple of `(username, password)` |
| `data` | Form data for POST |
| `json_data` | JSON body for POST |
| `params` | URL query parameters |
| `lazy` | Defer the actual request |
| `timeout` | Request timeout in seconds |

Response Properties

| Property | Description |
| --- | --- |
| `url` | Final URL after redirects |
| `status_code` | HTTP status code |
| `headers` | Response headers |
| `encoding` | Content encoding |
| `content_hash` | SHA1 hash of the body |
| `content_type` | Normalized MIME type |
| `file_name` | From the Content-Disposition header |
| `ok` | True if status < 400 |
| `raw` | Body as bytes |
| `text` | Body as string |
| `html` | Parsed lxml HTML tree |
| `xml` | Parsed lxml XML tree |
| `json` | Parsed JSON |
| `retrieved_at` | ISO timestamp |
| `last_modified` | From the Last-Modified header |

Response Methods

| Method | Description |
| --- | --- |
| `local_path()` | Context manager yielding a local file path |
| `serialize()` | Convert to a dict for passing between stages |
| `close()` | Close the connection |

Data Validation

Context validation helpers:

| Helper | Description |
| --- | --- |
| `is_not_empty(value)` | Check value is not empty |
| `is_numeric(value)` | Check value is numeric |
| `is_integer(value)` | Check value is an integer |
| `match_date(value)` | Check value is a date |
| `match_regexp(value, pattern)` | Check value matches a regex |
| `has_length(value, length)` | Check value has the given length |
| `must_contain(value, substring)` | Check value contains a substring |
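
For intuition, here are simplified stand-in implementations of a few of these helpers. They are not the library's code and the real helpers may differ in edge cases (locale-aware dates, empty-value handling):

```python
import re

def is_not_empty(value):
    # Treat None, "" and empty collections as empty.
    return value is not None and value != "" and value != []

def is_numeric(value):
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False

def match_regexp(value, pattern):
    return re.search(pattern, str(value)) is not None

def has_length(value, length):
    return len(value) == length
```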

Debugging

Debug Operation

```yaml
debug:
  method: inspect
```

Sampling Rate

Process only a subset of data during development:

```yaml
fetch:
  method: fetch
  params:
    sampling_rate: 0.1  # Process 10% of items
```
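
A `sampling_rate` of 0.1 can be understood as an independent per-item coin flip. The sketch below models that behavior; the actual mechanism inside the crawler may differ, and the `sample` helper is purely illustrative:

```python
import random

def sample(items, rate, seed=None):
    """Keep each item with probability `rate` (illustrative model)."""
    rng = random.Random(seed)
    return [item for item in items if rng.random() < rate]
```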

Interactive Debugger

```yaml
debug:
  method: ipdb  # Drops into ipdb debugger
```