# Crawlers
A crawler is a YAML configuration that defines a data processing pipeline. Each crawler is made up of stages that process data and pass it to subsequent stages.
## Basic Structure

```yaml
name: my_crawler
description: A simple web crawler
pipeline:
  init:
    method: seed
    params:
      url: https://example.com
    handle:
      pass: fetch
  fetch:
    method: fetch
    handle:
      pass: parse
  parse:
    method: parse
    handle:
      pass: store
  store:
    method: directory
    params:
      path: ./output
```
## Crawler Options

| Option | Type | Description |
|---|---|---|
| `name` | string | Unique identifier for the crawler |
| `description` | string | Human-readable description |
| `delay` | int | Default delay between tasks (seconds) |
| `expire` | int | Days until cached data expires |
| `stealthy` | bool | Use random User-Agent headers |
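A minimal sketch of where these options live (values are illustrative): they sit at the top level of the crawler file, alongside `name`, `description`, and `pipeline`:

```yaml
name: my_crawler
description: A simple web crawler
delay: 2        # wait 2 seconds between tasks
expire: 30      # cached data expires after 30 days
stealthy: true  # use random User-Agent headers
pipeline:
  # ... stages ...
```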
## Stages

Each stage has:

- `method`: The operation to execute (a built-in name or a `module:function` path; see the sketch below)
- `params`: Parameters passed to the operation
- `handle`: Routing rules for the next stage
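For example, a stage can hand work to your own code through the `module:function` form. A hypothetical sketch, where the module path, function name, and `strip_tags` parameter are placeholders rather than built-ins:

```yaml
clean:
  method: myproject.scrapers:clean_html  # custom module:function (hypothetical)
  params:
    strip_tags: true                     # made available to the operation as params
  handle:
    pass: store
```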
## Handlers

Handlers define which stage runs next based on the operation's output:

```yaml
parse:
  method: parse
  handle:
    fetch: fetch   # URLs to crawl go to the fetch stage
    store: store   # Documents go to the store stage
    pass: next     # Default handler
```
Common handler names:

- `pass`: Default success handler
- `fetch`: URLs that need fetching
- `store`: Data ready for storage
## Running Crawlers

```bash
# Run synchronously (waits for completion)
memorious run my_crawler.yml

# Queue for background workers
memorious start my_crawler.yml

# Run with custom Python modules
memorious run my_crawler.yml --src ./src

# Incremental mode (skip already-processed items)
memorious run my_crawler.yml --incremental
```
## Common Patterns

### Recursive Web Crawling

```yaml
name: web_crawler
pipeline:
  init:
    method: seed
    params:
      urls:
        - https://example.com/page1
        - https://example.com/page2
    handle:
      pass: fetch
  fetch:
    method: fetch
    handle:
      pass: parse
  parse:
    method: parse
    params:
      rules:
        domain: example.com
      store:
        mime_group: documents
    handle:
      fetch: fetch   # Recursive: parse -> fetch -> parse
      store: store
  store:
    method: directory
    params:
      path: ./downloads
```
### API Pagination

```yaml
name: api_crawler
pipeline:
  init:
    method: sequence
    params:
      start: 1
      stop: 100
    handle:
      pass: seed
  seed:
    method: seed
    params:
      url: https://api.example.com/items?page=%(number)s
    handle:
      pass: fetch
  fetch:
    method: fetch
    handle:
      pass: parse
  parse:
    method: parse_jq
    params:
      query: .items[]
    handle:
      pass: store
  store:
    method: directory
    params:
      path: ./output
```
### Date Range Crawling

```yaml
name: date_crawler
pipeline:
  init:
    method: dates
    params:
      begin: "2024-01-01"
      end: "2024-12-31"
      days: 1
    handle:
      pass: seed
  seed:
    method: seed
    params:
      url: https://example.com/data/%(date)s
    handle:
      pass: fetch
  # ... rest of pipeline
```
## Rules

Rules filter which URLs are processed or stored:

```yaml
parse:
  method: parse
  params:
    # Only follow links matching these rules
    rules:
      and:
        - domain: example.com
        - not:
            pattern: ".*/login.*"
    # Only store documents
    store:
      mime_group: documents
```
### Available Rules

| Rule | Description |
|---|---|
| `domain` | Match URLs from a domain |
| `pattern` | Regex pattern for URLs |
| `mime_type` | Exact MIME type match |
| `mime_group` | MIME type group (documents, images, web, etc.) |
| `xpath` | Match if the XPath finds elements |
Rules can be combined with `and`, `or`, and `not`:
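As a sketch (the domain and pattern values are illustrative), the following matches documents from example.com as well as any URL ending in `.pdf`:

```yaml
rules:
  or:
    - and:
        - domain: example.com
        - mime_group: documents
    - pattern: '.*\.pdf$'
```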
## Incremental Crawling

Skip already-processed items using tags:

```yaml
name: incremental_crawler
expire: 7  # Remember items for 7 days
pipeline:
  fetch:
    method: fetch
    params:
      skip_incremental: true  # Skip if already fetched
```
Or in custom operations:

```python
def my_operation(context, data):
    url = data.get("url")
    if context.check_tag(url):
        return  # Already processed

    result = dict(data)  # ... process the item here ...
    context.set_tag(url, True)
    context.emit(data=result)
```
## Postprocessing

Run a function after the crawler completes:

```yaml
name: my_crawler
pipeline:
  # ... stages ...
aggregator:
  method: mymodule:export_results
  params:
    output_file: results.json
```
## Debugging

Use the `inspect` operation to log the data passing through the pipeline:
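A minimal sketch, assuming `inspect` simply logs the data it receives and passes it on unchanged; here it is spliced between `fetch` and `parse`:

```yaml
fetch:
  method: fetch
  handle:
    pass: debug
debug:
  method: inspect   # log the data payload, then continue
  handle:
    pass: parse
# parse and later stages unchanged
```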
Use sampling to test with a subset of the data before running the full crawl.
## Next Steps
- Operations - Available operations
- Crawler Reference - Complete configuration reference
- Operations Reference - API documentation