Crawlers

A crawler is a YAML configuration that defines a data processing pipeline. Each crawler is made up of stages that process data and pass it to subsequent stages.

Basic Structure

name: my_crawler
description: A simple web crawler
pipeline:
  init:
    method: seed
    params:
      url: https://example.com
    handle:
      pass: fetch

  fetch:
    method: fetch
    handle:
      pass: parse

  parse:
    method: parse
    handle:
      pass: store

  store:
    method: directory
    params:
      path: ./output

Crawler Options

Option       Type    Description
-----------  ------  --------------------------------------
name         string  Unique identifier for the crawler
description  string  Human-readable description
delay        int     Default delay between tasks (seconds)
expire       int     Days until cached data expires
stealthy     bool    Use random User-Agent headers
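
These options sit at the top level of the crawler file; for example (values illustrative):

name: my_crawler
description: A simple web crawler
delay: 2        # wait 2 seconds between tasks
expire: 30      # cached data expires after 30 days
stealthy: true  # rotate random User-Agent headers
pipeline:
  # ... stages ...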

Stages

Each stage has:

  • method: The operation to execute, either a built-in name or a module:function reference (see the sketch after this list)
  • params: Parameters passed to the operation
  • handle: Routing rules for the next stage
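
A minimal sketch of a stage that calls a custom operation, assuming a hypothetical module mymodule.operations with a clean function; the (context, data) signature and the context.emit call follow the custom-operation example shown under Incremental Crawling below:

clean:
  method: mymodule.operations:clean
  handle:
    pass: store

# mymodule/operations.py
def clean(context, data):
    # Normalize one field, then pass the record to the next stage.
    data["title"] = (data.get("title") or "").strip()
    context.emit(data=data)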

Handlers

Handlers define which stage runs next based on the operation's output:

parse:
  method: parse
  handle:
    fetch: fetch   # URLs to crawl go to fetch stage
    store: store   # Documents go to store stage
    pass: next     # Default handler

Common handler names:

  • pass - Default success handler
  • fetch - URLs that need fetching
  • store - Data ready for storage
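
Inside a custom operation, the handler is selected with the rule argument to context.emit; when omitted, it defaults to pass. A sketch (the route function and its field checks are illustrative):

def route(context, data):
    # Anything with a URL goes back through fetching;
    # everything else is ready for storage.
    if data.get("url"):
        context.emit(rule="fetch", data=data)
    else:
        context.emit(rule="store", data=data)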

Running Crawlers

# Run synchronously (waits for completion)
memorious run my_crawler.yml

# Queue for background workers
memorious start my_crawler.yml

# Run with custom Python modules
memorious run my_crawler.yml --src ./src

# Incremental mode (skip already-processed items)
memorious run my_crawler.yml --incremental

Common Patterns

Recursive Web Crawling

name: web_crawler
pipeline:
  init:
    method: seed
    params:
      urls:
        - https://example.com/page1
        - https://example.com/page2
    handle:
      pass: fetch

  fetch:
    method: fetch
    handle:
      pass: parse

  parse:
    method: parse
    params:
      rules:
        domain: example.com
      store:
        mime_group: documents
    handle:
      fetch: fetch   # Recursive: parse -> fetch -> parse
      store: store

  store:
    method: directory
    params:
      path: ./downloads

API Pagination

The sequence stage emits one task per number in the range, and the seed stage substitutes each value into the URL via the %(number)s placeholder:

name: api_crawler
pipeline:
  init:
    method: sequence
    params:
      start: 1
      stop: 100
    handle:
      pass: seed

  seed:
    method: seed
    params:
      url: https://api.example.com/items?page=%(number)s
    handle:
      pass: fetch

  fetch:
    method: fetch
    handle:
      pass: parse

  parse:
    method: parse_jq
    params:
      query: .items[]
    handle:
      pass: store

  store:
    method: directory
    params:
      path: ./output

Date Range Crawling

Here the dates stage walks the range one day at a time; each generated date is substituted into the seed URL via %(date)s:

name: date_crawler
pipeline:
  init:
    method: dates
    params:
      begin: "2024-01-01"
      end: "2024-12-31"
      days: 1
    handle:
      pass: seed

  seed:
    method: seed
    params:
      url: https://example.com/data/%(date)s
    handle:
      pass: fetch
  # ... rest of pipeline

Rules

Rules filter which URLs are processed or stored:

parse:
  method: parse
  params:
    # Only follow links matching these rules
    rules:
      and:
        - domain: example.com
        - not:
            pattern: ".*/login.*"

    # Only store documents
    store:
      mime_group: documents

Available Rules

Rule        Description
----------  ----------------------------------------------
domain      Match URLs from a given domain
pattern     Match the URL against a regular expression
mime_type   Exact MIME type match
mime_group  MIME type group (documents, images, web, etc.)
xpath       Match if an XPath query finds elements

Combine with and, or, not:

rules:
  and:
    - domain: example.com
    - not:
        or:
          - mime_group: images
          - pattern: ".*\\.css$"

Incremental Crawling

Skip already-processed items using tags:

name: incremental_crawler
expire: 7  # Remember items for 7 days
pipeline:
  fetch:
    method: fetch
    params:
      skip_incremental: true  # Skip if already fetched

Or in custom operations:

def my_operation(context, data):
    url = data.get("url")
    if context.check_tag(url):
        return  # Already processed

    result = process(data)  # placeholder: your actual processing logic
    context.set_tag(url, True)
    context.emit(data=result)

Postprocessing

Run a function after the crawler completes:

name: my_crawler
pipeline:
  # ... stages ...
aggregator:
  method: mymodule:export_results
  params:
    output_file: results.json
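
The referenced function runs once after the pipeline finishes. A minimal sketch, assuming the aggregator receives the crawl context and reads its params from it the way stage operations do (the exact signature is an assumption):

# mymodule.py
import json

def export_results(context):
    # Sketch only: write a small summary once the crawl completes.
    output_file = context.params.get("output_file", "results.json")
    summary = {"crawler": context.crawler.name}
    with open(output_file, "w") as fh:
        json.dump(summary, fh)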

Debugging

Use the inspect operation to log data:

debug:
  method: inspect  # Logs the data dict

Use sampling to test with a subset of data:

fetch:
  method: fetch
  params:
    sampling_rate: 0.1  # Only process 10% of items
