# Crawler Reference

Complete reference for crawler YAML configuration.
## Top-Level Options

| Option | Type | Default | Description |
|---|---|---|---|
| `name` | string | required | Unique identifier for the crawler |
| `description` | string | None | Human-readable description |
| `delay` | int | 0 | Default delay between tasks (seconds) |
| `expire` | int | 1 | Days until cached data expires |
| `stealthy` | bool | false | Use random User-Agent headers |
## Pipeline

The `pipeline` key defines the crawler stages.

### Stage Configuration

| Option | Type | Description |
|---|---|---|
| `method` | string | Operation name or `module:function` path |
| `params` | dict | Parameters passed to the operation |
| `handle` | dict | Handler routing (rule → stage name) |
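A hypothetical two-stage pipeline illustrating `method`, `params`, and `handle` (the stage and operation names here are assumptions, not prescribed by the framework):

```yaml
pipeline:
  init:
    method: init
    params:
      urls:
        - https://example.com
    handle:
      fetch: fetch
  fetch:
    method: fetch
    handle:
      pass: store
```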
## Handlers

Handlers route data to subsequent stages based on the operation's output:

| Handler | Description |
|---|---|
| `pass` | Default success handler |
| `fetch` | URLs that need fetching |
| `store` | Data ready for storage |
| `fragment` | FTM entity fragments |

Custom handlers can be defined by operations.
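To make the routing concrete, here is a minimal, self-contained sketch (not the library's actual implementation) of how an operation's `emit(rule)` call maps through a stage's `handle` table to the next stage:

```python
# Minimal stand-in illustrating handler routing; the real Context
# class is provided by the framework and is much richer than this.
class MiniContext:
    def __init__(self, handle, stages):
        self.handle = handle    # rule name -> next stage name
        self.stages = stages    # stage name -> callable
        self.emitted = []

    def emit(self, data, rule="pass"):
        """Route data to the stage configured for `rule` in `handle`."""
        target = self.handle.get(rule)
        if target is not None:
            self.emitted.append((target, data))
            self.stages[target](data)

stored = []
ctx = MiniContext(
    handle={"store": "store", "fetch": "fetch"},
    stages={"store": stored.append, "fetch": lambda d: None},
)
ctx.emit({"content_hash": "abc123"}, rule="store")
```

A rule without an entry in `handle` is simply dropped, which matches the idea that each stage only declares the handlers it cares about.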
## Rules

Rules filter HTTP responses based on URL, content type, or document structure.

### Rule Types

| Rule | Description | Example |
|---|---|---|
| `domain` | Match URLs from a domain (including subdomains) | `example.com` |
| `pattern` | Match URL against a regex pattern | `.*\.pdf$` |
| `mime_type` | Match exact MIME type | `application/pdf` |
| `mime_group` | Match MIME type group | `documents` |
| `xpath` | Match if XPath finds elements | `//div[@class="article"]` |
| `match_all` | Always matches (default) | `{}` |
### MIME Groups

| Group | Description |
|---|---|
| `web` | HTML, CSS, JavaScript |
| `images` | Image files |
| `media` | Audio and video |
| `documents` | PDF, Office documents, text |
| `archives` | ZIP, TAR, compressed files |
| `assets` | Fonts, icons, other assets |
### Boolean Operators

Combine rules using `and`, `or`, and `not`:

```yaml
# Match PDFs from example.com
rules:
  and:
    - domain: example.com
    - mime_type: application/pdf
```

```yaml
# Match documents but not images
rules:
  and:
    - mime_group: documents
    - not:
        mime_group: images
```

```yaml
# Match either domain
rules:
  or:
    - domain: example.com
    - domain: example.org
```
### Complex Example

```yaml
parse:
  method: parse
  params:
    rules:
      and:
        - domain: dataresearchcenter.org
        - not:
            or:
              - domain: vis.dataresearchcenter.org
              - domain: data.dataresearchcenter.org
              - mime_group: images
              - pattern: ".*/about.*"
    store:
      mime_group: documents
  handle:
    fetch: fetch
    store: store
```
## Aggregator

Run postprocessing after the crawler completes. The aggregator function receives a context object with access to crawler state.
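Assuming the aggregator is declared with a top-level key using the same `method` shape as pipeline stages (the key name and function path below are illustrative, not confirmed by this reference):

```yaml
aggregate:
  method: my_module:export_report  # hypothetical module:function path
```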
## Context Object

Operations receive a `Context` object with these attributes:

### Properties

| Property | Type | Description |
|---|---|---|
| `context.crawler` | `Crawler` | The crawler instance |
| `context.stage` | `CrawlerStage` | Current stage |
| `context.run_id` | `str` | Unique run identifier |
| `context.params` | `dict` | Stage parameters |
| `context.log` | `Logger` | Structured logger (structlog) |
| `context.http` | `ContextHttp` | HTTP client |
### Methods

| Method | Description |
|---|---|
| `emit(data, rule='pass', optional=False)` | Emit data to the next stage |
| `recurse(data, delay=None)` | Re-queue the current stage |
| `get(key, default=None)` | Get a param with env var expansion |
| `store_file(path)` | Store a file in the archive, returns its content hash |
| `store_data(data, encoding='utf-8')` | Store bytes in the archive |
| `check_tag(tag)` | Check if a tag exists |
| `get_tag(tag)` | Get a tag value |
| `set_tag(tag, value)` | Set a tag value |
| `skip_incremental(*criteria)` | Check/set incremental skip |
| `emit_warning(message, **kwargs)` | Log a warning |
## HTTP Client

The `context.http` client provides:

### Methods

```python
context.http.get(url, **kwargs)
context.http.post(url, **kwargs)
context.http.rehash(data)  # Restore a response from serialized data
context.http.save()        # Persist session state
context.http.reset()       # Clear session state
```
### Request Parameters

| Parameter | Description |
|---|---|
| `headers` | Extra HTTP headers |
| `auth` | Tuple of `(username, password)` |
| `data` | Form data for POST |
| `json_data` | JSON body for POST |
| `params` | URL query parameters |
| `lazy` | Defer the actual request |
| `timeout` | Request timeout in seconds |
### Response Properties

| Property | Description |
|---|---|
| `url` | Final URL after redirects |
| `status_code` | HTTP status code |
| `headers` | Response headers |
| `encoding` | Content encoding |
| `content_hash` | SHA1 hash of the body |
| `content_type` | Normalized MIME type |
| `file_name` | From the `Content-Disposition` header |
| `ok` | `True` if status < 400 |
| `raw` | Body as bytes |
| `text` | Body as string |
| `html` | Parsed lxml HTML tree |
| `xml` | Parsed lxml XML tree |
| `json` | Parsed JSON |
| `retrieved_at` | ISO timestamp |
| `last_modified` | From the `Last-Modified` header |
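Since `content_hash` is a SHA1 over the response body, an equivalent digest can be computed locally, e.g. to compare a downloaded file against a stored hash (assuming a hex digest, which is the conventional representation):

```python
import hashlib

def sha1_hash(body: bytes) -> str:
    """Hex SHA1 digest of a response body."""
    return hashlib.sha1(body).hexdigest()

digest = sha1_hash(b"hello")
```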
### Response Methods

| Method | Description |
|---|---|
| `local_path()` | Context manager yielding a local file path |
| `serialize()` | Convert to a dict for passing between stages |
| `close()` | Close the connection |
## Data Validation

Context validation helpers:

| Helper | Description |
|---|---|
| `is_not_empty(value)` | Check that the value is not empty |
| `is_numeric(value)` | Check that the value is numeric |
| `is_integer(value)` | Check that the value is an integer |
| `match_date(value)` | Check that the value is a date |
| `match_regexp(value, pattern)` | Check that the value matches a regex |
| `has_length(value, length)` | Check that the value has the given length |
| `must_contain(value, substring)` | Check that the value contains a substring |
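The helpers behave roughly like the stand-ins below (illustrative reimplementations for intuition, not the library's code):

```python
import re

def is_not_empty(value):
    # Rough stand-in: treat None, "", [], {} as empty.
    return value not in (None, "", [], {})

def match_regexp(value, pattern):
    return re.match(pattern, str(value)) is not None

def has_length(value, length):
    return len(value) == length
```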
## Debugging

### Debug Operation

### Sampling Rate

Process only a subset of data during development:
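One way to implement such sampling (a sketch of the idea, not the library's mechanism) is to keep a deterministic fraction of items by hashing a stable key, so repeated runs sample the same subset:

```python
import hashlib

def keep_sample(key: str, rate: float) -> bool:
    """Deterministically keep roughly `rate` of items, keyed by `key`."""
    digest = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    return (digest % 100) < rate * 100

# Filter a stream of URLs down to roughly half during development.
urls = ["https://example.com/%d" % i for i in range(10)]
sampled = [u for u in urls if keep_sample(u, 0.5)]
```

Because the decision depends only on the key, a given URL is either always kept or always dropped for a fixed rate.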