Crawler Reference

Complete reference for crawler YAML configuration.

Top-Level Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| name | string | required | Unique identifier for the crawler |
| description | string | None | Human-readable description |
| delay | int | 0 | Default delay between tasks (seconds) |
| expire | int | 1 | Days until cached data expires |
| max_runtime | int | 0 | Maximum runtime in seconds (0 = unlimited). See Runtime Control |
| stealthy | bool | false | Use random User-Agent headers |

Runtime Control

Max Runtime

The max_runtime option limits how long a crawler can run. This is useful in CI environments with time limits (e.g., the 6-hour job limit on GitHub Actions).

name: my_crawler
max_runtime: 21600  # 6 hours in seconds

When using memorious run:

  • A timer starts when the crawler begins
  • When max_runtime is exceeded, SIGTERM is sent to stop the worker
  • Pending jobs are skipped (checked before each stage execution)
  • The crawler exits gracefully, flushing any pending entity data

It can also be set globally via the MEMORIOUS_MAX_RUNTIME environment variable.

Error Handling

When a stage raises an exception:

  • With --continue-on-error: The error is logged and execution continues with other jobs
  • Without --continue-on-error (default): The crawler stops immediately:
    1. SIGTERM is sent to terminate the worker process
    2. Pending jobs remain in the queue but are not processed

Clearing Previous Runs

By default, memorious run cancels any pending jobs from previous runs before starting. Control this with --clear-runs / --no-clear-runs:

# Default: cancel previous jobs before starting
memorious run crawler.yml

# Keep previous jobs in queue (resume interrupted crawl)
memorious run crawler.yml --no-clear-runs

Cancel vs Stop

The crawler has two termination methods:

  • cancel(): Removes pending jobs from the queue. Used by memorious cancel CLI command.
  • stop(): Sends SIGTERM to terminate the current worker process. Used internally on unhandled errors.

Pipeline

The pipeline key defines the crawler stages:

pipeline:
  stage_name:
    method: operation_name
    params:
      key: value
    handle:
      pass: next_stage

Stage Configuration

| Option | Type | Description |
| --- | --- | --- |
| method | string | Operation name or module:function path |
| params | dict | Parameters passed to the operation |
| handle | dict | Handler routing (rule → stage name) |

Handlers

Handlers route data to subsequent stages based on the operation's output:

| Handler | Description |
| --- | --- |
| pass | Default success handler |
| fetch | URLs that need fetching |
| store | Data ready for storage |
| fragment | FTM entity fragments |

Custom handlers can be defined by operations.
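
For illustration, an operation can emit under a custom rule name, and the stage's handle block routes that rule to another stage. A minimal sketch (module, function, rule, and stage names are illustrative, not built-ins):

# example_operations.py -- wired into a stage as
#   method: example_operations:split_listing
#   handle:
#     detail: fetch_detail    # custom rule -> stage
#     pass: store
# (module, function, rule and stage names are illustrative)

def split_listing(context, data):
    """Emit discovered URLs under a custom 'detail' rule."""
    result = context.http.get(data["url"])
    for link in result.html.findall(".//a"):
        url = link.get("href")
        if url is None:
            continue
        # Routed to whichever stage 'detail' maps to in the handle block.
        # (URL normalisation omitted for brevity.)
        context.emit(rule="detail", data={"url": url})
    # Everything else goes through the default 'pass' handler.
    context.emit(data=data)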

Rules

Rules filter HTTP responses based on URL, content type, or document structure.

Rule Types

| Rule | Description | Example |
| --- | --- | --- |
| domain | Match URLs from a domain (including subdomains) | example.com |
| pattern | Match URL against a regex pattern | .*\.pdf$ |
| mime_type | Match exact MIME type | application/pdf |
| mime_group | Match MIME type group | documents |
| xpath | Match if XPath finds elements | //div[@class="article"] |
| match_all | Always matches (default) | {} |

MIME Groups

| Group | Description |
| --- | --- |
| web | HTML, CSS, JavaScript |
| images | Image files |
| media | Audio and video |
| documents | PDF, Office documents, text |
| archives | ZIP, TAR, compressed files |
| assets | Fonts, icons, other assets |

Boolean Operators

Combine rules using and, or, and not:

# Match PDFs from example.com
rules:
  and:
    - domain: example.com
    - mime_type: application/pdf

# Match documents but not images
rules:
  and:
    - mime_group: documents
    - not:
        mime_group: images

# Match either domain
rules:
  or:
    - domain: example.com
    - domain: example.org

Complex Example

parse:
  method: parse
  params:
    rules:
      and:
        - domain: dataresearchcenter.org
        - not:
            or:
              - domain: vis.dataresearchcenter.org
              - domain: data.dataresearchcenter.org
              - mime_group: images
              - pattern: ".*/about.*"
    store:
      mime_group: documents
  handle:
    fetch: fetch
    store: store

Aggregator

Run postprocessing after the crawler completes:

aggregator:
  method: module:function
  params:
    key: value

The aggregator function receives a context object with access to crawler state.
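
A minimal sketch of an aggregator function (module, function, and parameter names are illustrative; this assumes the aggregator's params block is exposed as context.params, as for pipeline stages):

# example_aggregate.py -- referenced from the YAML above as
#   method: example_aggregate:report
# (module, function and parameter names are illustrative)

def report(context):
    """Runs once after the crawl; reads the aggregator's params and logs a summary."""
    label = context.params.get("label", "finished")
    context.log.info("Crawl complete", label=label)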

Context Object

Operations receive a Context object with these attributes:

Properties

| Property | Type | Description |
| --- | --- | --- |
| context.crawler | Crawler | The crawler instance |
| context.stage | CrawlerStage | Current stage |
| context.run_id | str | Unique run identifier |
| context.params | dict | Stage parameters |
| context.log | Logger | Structured logger (structlog) |
| context.http | ContextHttp | HTTP client |

Methods

| Method | Description |
| --- | --- |
| emit(data, rule='pass', optional=False) | Emit data to next stage |
| recurse(data, delay=None) | Re-queue current stage |
| get(key, default=None) | Get param with env var expansion |
| store_file(path) | Store file in archive, returns content hash |
| store_data(data, encoding='utf-8') | Store bytes in archive |
| check_tag(tag) | Check if tag exists |
| get_tag(tag) | Get tag value |
| set_tag(tag, value) | Set tag value |
| skip_incremental(*criteria) | Check/set incremental skip |
| emit_warning(message, **kwargs) | Log a warning |
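
A sketch of how these methods combine inside a custom operation (function, parameter, and tag choices are illustrative):

def fetch_report(context, data):
    """Illustrative operation combining several Context methods."""
    url = context.get("url", data.get("url"))   # stage param (with env expansion), falling back to incoming data

    # Skip work already done on a previous incremental run.
    if context.skip_incremental(url):
        context.log.info("Skipping, already seen", url=url)
        return

    result = context.http.get(url)
    if not result.ok:
        context.emit_warning("Fetch failed", url=url, status=result.status_code)
        return

    # Remember the content hash for later stages or later runs.
    context.set_tag(url, result.content_hash)

    # Merge the serialized response into the data and hand it to the
    # default 'pass' handler.
    data.update(result.serialize())
    context.emit(data=data)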

HTTP Client

The context.http client provides:

Methods

context.http.get(url, **kwargs)
context.http.post(url, **kwargs)
context.http.rehash(data)  # Restore response from serialized data
context.http.save()        # Persist session state
context.http.reset()       # Clear session state
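
A sketch of how the session helpers are typically combined across stages (stage names, parameter names, and credentials are placeholders):

def login(context, data):
    """Illustrative stage: authenticate once, then persist the session."""
    context.http.post(
        context.get("login_url"),                      # illustrative stage parameter
        data={"user": "alice", "password": "secret"},  # placeholder credentials
    )
    context.http.save()    # keep cookies/session state for later stages
    context.emit(data=data)

def parse_page(context, data):
    """Illustrative later stage: restore a response serialized upstream."""
    result = context.http.rehash(data)   # rebuild the response from its serialized form
    context.log.info("Reloaded response", url=result.url)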

Proxy Configuration

Proxies can be configured globally via MEMORIOUS_HTTP_PROXIES or per-stage:

pipeline:
  fetch:
    method: fetch
    params:
      # Either a single proxy...
      # http_proxies: http://proxy:8080

      # ...or a list of proxies (random selection per request)
      http_proxies:
        - http://proxy1:8080
        - http://proxy2:8080
        - socks5://proxy3:1080

Stage-level http_proxies overrides the global setting. When multiple proxies are provided, a random one is selected when the HTTP client is created.

Request Parameters

| Parameter | Description |
| --- | --- |
| headers | Extra HTTP headers |
| auth | Tuple of (username, password) |
| data | Form data for POST |
| json_data | JSON body for POST |
| params | URL query parameters |
| lazy | Defer the actual request |
| timeout | Request timeout in seconds |
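
These are passed as keyword arguments to the client calls. A sketch (URL, credentials, and query values are placeholders):

def search_api(context, data):
    """Illustrative operation passing request parameters as keyword arguments."""
    result = context.http.get(
        "https://example.com/api/search",       # placeholder URL
        headers={"Accept": "application/json"},
        auth=("user", "secret"),                # placeholder basic-auth credentials
        params={"q": "budget", "page": 1},      # appended to the URL as a query string
        timeout=30,
    )
    context.emit(data={"results": result.json})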

Response Properties

| Property | Description |
| --- | --- |
| url | Final URL after redirects |
| status_code | HTTP status code |
| headers | Response headers |
| encoding | Content encoding |
| content_hash | SHA1 hash of body |
| content_type | Normalized MIME type |
| file_name | From Content-Disposition |
| ok | True if status < 400 |
| raw | Body as bytes |
| text | Body as string |
| html | Parsed lxml HTML tree |
| xml | Parsed lxml XML tree |
| json | Parsed JSON |
| retrieved_at | ISO timestamp |
| last_modified | From Last-Modified header |

Response Methods

| Method | Description |
| --- | --- |
| local_path() | Context manager for local file path |
| serialize() | Convert to dict for passing between stages |
| close() | Close the connection |
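
A sketch combining the response helpers in a custom operation (function name and field choices are illustrative):

def download(context, data):
    """Illustrative operation using the response helpers."""
    result = context.http.get(data["url"])
    with result.local_path() as path:
        # A local copy of the body is available for the duration of the block;
        # hand it to anything that needs a real file on disk.
        context.log.info("Downloaded", url=result.url, file=str(path))
    # Serialize the response so the next stage can pick it up.
    data.update(result.serialize())
    context.emit(data=data)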

Data Validation

Context validation helpers:

| Helper | Description |
| --- | --- |
| is_not_empty(value) | Check value is not empty |
| is_numeric(value) | Check value is numeric |
| is_integer(value) | Check value is an integer |
| match_date(value) | Check value is a date |
| match_regexp(value, pattern) | Check value matches regex |
| has_length(value, length) | Check value has given length |
| must_contain(value, substring) | Check value contains string |
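
A sketch of using the helpers inside an operation, assuming they are reachable via the context's check attribute (the exact access path may differ between versions):

def validate_row(context, data):
    """Illustrative checks on scraped values; assumes the helpers are
    reachable as context.check (adjust to your version)."""
    check = context.check
    check.is_not_empty(data.get("title"))
    check.match_date(data.get("published_at"))
    check.match_regexp(data.get("case_id"), r"[A-Z]{2}-\d+")
    context.emit(data=data)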

Debugging

Debug Operation

debug:
  method: inspect

Sampling Rate

Process only a subset of data during development:

fetch:
  method: fetch
  params:
    sampling_rate: 0.1  # Process 10% of items

Interactive Debugger

debug:
  method: ipdb  # Drops into ipdb debugger