Operations Reference
API documentation for all built-in operations.
Initializers
Operations for starting crawler pipelines.
Initialize crawler with params and optional proxy configuration.
Merges stage params into the data dict and configures an HTTP proxy if MEMORIOUS_HTTP_PROXY is set and the crawler is not running in debug mode.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| context | Context | The crawler context. | required |
| data | dict[str, Any] | Initial data dict. | required |
Example
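A minimal configuration sketch, assuming the operation is registered as `init`; the extra param and stage names are illustrative:

```yaml
pipeline:
  start:
    method: init  # assumed registration name
    params:
      source: my_source  # merged into the data dict for later stages
    handle:
      pass: fetch
```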
Source code in memorious/operations/initializers.py
Initialize a crawler with seed URLs.
Emits data items for each URL provided in the configuration. URLs can contain format placeholders that are substituted with values from the incoming data dict.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| context | Context | The crawler context. | required |
| data | dict[str, Any] | Values available for URL formatting. | required |

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| url | | Single URL or list of URLs. | required |
| urls | | List of URLs (alternative to `url`). | required |
Example
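A minimal sketch, assuming the operation is registered as `seed`; URLs and stage names are illustrative:

```yaml
pipeline:
  start:
    method: seed  # assumed registration name
    params:
      urls:
        - https://example.com/news
        - https://example.com/archive
    handle:
      pass: fetch
```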
Source code in memorious/operations/initializers.py
Iterate through a set of items and emit each one.
Takes a list of items from configuration and emits a data item
for each, with the item value available as data["item"].
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| context | Context | The crawler context. | required |
| data | dict[str, Any] | Base data dict to include in each emission. | required |

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| items | | List of items to enumerate. | required |
Example
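A minimal sketch, assuming the operation is registered as `enumerate`; each emission carries the item value as data["item"]:

```yaml
pipeline:
  start:
    method: enumerate  # assumed registration name
    params:
      items:
        - alpha
        - beta
        - gamma
    handle:
      pass: fetch
```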
Source code in memorious/operations/initializers.py
Trigger multiple subsequent stages in parallel.
Emits to all configured handlers, useful for splitting a pipeline into multiple parallel branches.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| context | Context | The crawler context. | required |
| data | dict[str, Any] | Data to pass to all branches. | required |
Example
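A minimal sketch, assuming the operation is registered as `tee`; the handler rule names and target stages are illustrative:

```yaml
pipeline:
  split:
    method: tee  # assumed registration name
    handle:
      documents: fetch_documents
      listings: fetch_listings
```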
Source code in memorious/operations/initializers.py
Generate a sequence of numbers.
The memorious equivalent of Python's range(), accepting start, stop, and step parameters. Supports two modes:

- Immediate: generates all numbers in the range at once.
- Recursive: generates numbers one by one with optional delay.
The recursive mode is useful for very large sequences to avoid overwhelming the job queue.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| context | Context | The crawler context. | required |
| data | dict[str, Any] | May contain "number" to continue a recursive sequence. | required |

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| start | | Starting number (default: 1). | required |
| stop | | Stop number (exclusive). | required |
| step | | Step increment (default: 1, can be negative). | required |
| delay | | If set, use recursive mode with this delay in seconds. | required |
| tag | | If set, emit each number only once across crawler runs. | required |
Example
```yaml
pipeline:
  pages:
    method: sequence
    params:
      start: 1
      stop: 100
      step: 1
    handle:
      pass: fetch

  # Recursive mode for large sequences:
  large_sequence:
    method: sequence
    params:
      start: 1
      stop: 10000
      delay: 5  # 5 second delay between emissions
      tag: page_sequence  # Incremental: skip already processed
    handle:
      pass: fetch
```
Source code in memorious/operations/initializers.py
Generate a sequence of dates.
Generates dates by iterating backwards from an end date with a specified interval. Useful for scraping date-based archives.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| context | Context | The crawler context. | required |
| data | dict[str, Any] | May contain "current" to continue iteration. | required |

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| format | | Date format string (default: "%Y-%m-%d"). | required |
| end | | End date string, or uses current date if not specified. | required |
| begin | | Beginning date string. | required |
| days | | Number of days per step (default: 0). | required |
| weeks | | Number of weeks per step (default: 0). | required |
| steps | | Number of steps if begin not specified (default: 100). | required |
Example
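A minimal sketch, assuming the operation is registered as `dates`; the crawler steps backwards one week at a time from end to begin:

```yaml
pipeline:
  dates:
    method: dates  # assumed registration name
    params:
      format: "%Y-%m-%d"
      end: "2024-12-31"
      begin: "2024-01-01"
      weeks: 1
    handle:
      pass: fetch
```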
Note
Each emission includes both date (formatted string) and
date_iso (ISO format) for flexibility.
Source code in memorious/operations/initializers.py
Fetch
Operations for making HTTP requests.
Fetch a URL via HTTP GET request.
Performs an HTTP GET request on the URL specified in the data dict. Supports retry logic, URL rules filtering, incremental skipping, URL rewriting, pagination, and custom headers.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| context | Context | The crawler context. | required |
| data | dict[str, Any] | Must contain "url" key with the URL to fetch. | required |

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| rules | | URL/content filtering rules (default: match_all). | required |
| retry | | Number of retry attempts (default: 3). | required |
| emit_errors | | If True, emit data even on HTTP errors (default: False). | required |
| headers | | Extra HTTP headers to send. | required |
| base_url | | Base URL for resolving relative URLs. | required |
| rewrite | | URL rewriting configuration with "method" and "data" keys. Methods: "template" (Jinja2), "replace" (string replace). | required |
| pagination | | Pagination config with "param" key for page number. | required |
Example
```yaml
pipeline:
  # Simple fetch
  fetch:
    method: fetch
    params:
      rules:
        domain: example.com
      retry: 5
    handle:
      pass: parse

  # Fetch with URL rewriting and headers
  fetch_detail:
    method: fetch
    params:
      headers:
        Referer: https://example.com/search
      rewrite:
        method: template
        data: "https://example.com/doc/{{ doc_id }}"
    handle:
      pass: parse

  # Fetch with pagination
  fetch_list:
    method: fetch
    params:
      url: https://example.com/results
      pagination:
        param: page
    handle:
      pass: parse
```
Source code in memorious/operations/fetch.py
Configure HTTP session parameters for subsequent requests.
Sets up authentication, user agent, referer, and proxy settings that will be used for all subsequent HTTP requests in this crawler run.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| context | Context | The crawler context. | required |
| data | dict[str, Any] | Passed through to next stage. | required |

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| user | | Username for HTTP basic authentication. | required |
| password | | Password for HTTP basic authentication. | required |
| user_agent | | Custom User-Agent header. | required |
| url | | URL to set as Referer header. | required |
| proxy | | Proxy URL for HTTP/HTTPS requests. | required |
Example
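A minimal sketch, assuming the operation is registered as `session`; credentials, user agent, and proxy are placeholders:

```yaml
pipeline:
  session:
    method: session  # assumed registration name
    params:
      user: scraper
      password: secret
      user_agent: "Mozilla/5.0 (compatible; memorious)"
      proxy: http://localhost:8080
    handle:
      pass: fetch
```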
Source code in memorious/operations/fetch.py
Perform HTTP POST request with form data.
Sends a POST request with form-urlencoded data to the specified URL.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| context | Context | The crawler context. | required |
| data | dict[str, Any] | Current stage data. | required |

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| url | | Target URL (or use data["url"]). | required |
| data | dict[str, Any] | Dictionary of form fields to POST. | required |
| use_data | | Map of {post_field: data_key} to include from data dict. | required |
| headers | | Extra HTTP headers. | required |
Example
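A hypothetical sketch; the registration name (`post`) and the field/key names are assumptions:

```yaml
pipeline:
  search:
    method: post  # assumed registration name
    params:
      url: https://example.com/search
      data:
        query: budget
        year: 2024
      use_data:
        token: session_token  # POST field "token" <- data["session_token"]
      headers:
        Accept: text/html
    handle:
      pass: parse
```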
Source code in memorious/operations/fetch.py
Perform HTTP POST request with JSON body.
Sends a POST request with a JSON payload to the specified URL.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| context | Context | The crawler context. | required |
| data | dict[str, Any] | Current stage data. | required |

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| url | | Target URL (or use data["url"]). | required |
| data | dict[str, Any] | Dictionary to send as JSON body. | required |
| use_data | | Map of {json_field: data_key} to include from data dict. | required |
| headers | | Extra HTTP headers. | required |
Example
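A hypothetical sketch; the registration name (`post_json`) and the field/key names are assumptions:

```yaml
pipeline:
  api_query:
    method: post_json  # assumed registration name
    params:
      url: https://example.com/api/search
      data:
        page_size: 50
      use_data:
        query: search_term  # JSON field "query" <- data["search_term"]
    handle:
      pass: parse
```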
Source code in memorious/operations/fetch.py
Perform HTTP POST to an HTML form with its current values.
Extracts form fields from an HTML page and submits them with optional additional data.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| context | Context | The crawler context. | required |
| data | dict[str, Any] | Current stage data (must have cached HTML response). | required |

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| form | | XPath to locate the form element. | required |
| data | dict[str, Any] | Additional form fields to add/override. | required |
| use_data | | Map of {form_field: data_key} to include from data dict. | required |
| headers | | Extra HTTP headers. | required |
Example
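A hypothetical sketch; the registration name (`submit_form`), XPath, and field/key names are assumptions:

```yaml
pipeline:
  submit:
    method: submit_form  # assumed registration name
    params:
      form: './/form[@id="search"]'
      data:
        query: budget  # added to (or overriding) the extracted form values
      use_data:
        csrf: token  # form field "csrf" <- data["token"]
    handle:
      pass: parse
```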
Source code in memorious/operations/fetch.py
Parse
Operations for parsing responses.
Parse HTML response and extract URLs and metadata.
The main parsing operation: it extracts URLs from HTML documents for further crawling and extracts metadata based on XPath expressions.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| context | Context | The crawler context. | required |
| data | dict[str, Any] | Must contain cached HTTP response data. | required |

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| include_paths | | List of XPath expressions to search for URLs. | required |
| meta | | Dict mapping field names to XPath expressions. | required |
| meta_date | | Dict mapping date field names to XPath expressions. | required |
| store | | Rules dict to match responses for storage. | required |
| schema | | FTM schema name for entity extraction. | required |
| properties | | Dict mapping FTM properties to XPath expressions. | required |
Example
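A sketch of a typical configuration; the handler rule names (fetch, store) and the XPath expressions are assumptions:

```yaml
pipeline:
  parse:
    method: parse
    params:
      include_paths:
        - './/main'
        - './/div[@class="content"]'
      meta:
        title: './/h1/text()'
      meta_date:
        published_at: './/time/@datetime'
      store:
        domain: example.com
    handle:
      fetch: fetch
      store: store
```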
Source code in memorious/operations/parse.py
Parse HTML listing with multiple items.
Extracts metadata from a list of items on a page and handles pagination. Useful for search results, archives, and index pages.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| context | Context | The crawler context. | required |
| data | dict[str, Any] | Must contain cached HTTP response data. | required |

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| items | | XPath expression to select item elements. | required |
| meta | | Dict mapping field names to XPath expressions (per item). | required |
| pagination | | Pagination configuration. | required |
| emit | | If True, emit each item's data. | required |
| parse_html | | If True, extract URLs from items (default: True). | required |
Example
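A hypothetical sketch; the registration name (`parse_listing`), XPaths, and handler names are assumptions:

```yaml
pipeline:
  listing:
    method: parse_listing  # assumed registration name
    params:
      items: './/div[@class="result"]'
      meta:
        title: './/h2/a/text()'
        url: './/h2/a/@href'
      pagination:
        param: page
      emit: true
    handle:
      pass: fetch
```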
Source code in memorious/operations/parse.py
Parse JSON response using jq patterns.
Uses the jq query language to extract data from JSON responses. Emits one data item for each result from the jq query.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| context | Context | The crawler context. | required |
| data | dict[str, Any] | Must contain cached HTTP response data. | required |

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| pattern | | jq pattern string to extract data. | required |
Example
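A hypothetical sketch; the registration name (`parse_jq`) and the JSON field names in the jq pattern are assumptions:

```yaml
pipeline:
  results:
    method: parse_jq  # assumed registration name
    params:
      pattern: '.results[] | {url: .link, title: .title}'
    handle:
      pass: fetch
```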
Source code in memorious/operations/parse.py
Parse CSV file and emit rows.
Reads a CSV file and emits each row as a data item. Can also emit all rows together as a list.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| context | Context | The crawler context. | required |
| data | dict[str, Any] | Must contain cached HTTP response data. | required |

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| skiprows | | Number of rows to skip at the beginning. | required |
| delimiter | | CSV field delimiter (default: comma). | required |
Example
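A hypothetical sketch; the registration name (`parse_csv`) and stage names are assumptions:

```yaml
pipeline:
  rows:
    method: parse_csv  # assumed registration name
    params:
      skiprows: 1
      delimiter: ";"
    handle:
      pass: store
```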
Source code in memorious/operations/parse.py
Parse XML response and extract metadata.
Parses an XML document and extracts metadata using XPath expressions.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| context | Context | The crawler context. | required |
| data | dict[str, Any] | Must contain cached HTTP response data. | required |

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| meta | | Dict mapping field names to XPath expressions. | required |
| meta_date | | Dict mapping date field names to XPath expressions. | required |
Example
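A hypothetical sketch; the registration name (`parse_xml`) and the XPaths are assumptions:

```yaml
pipeline:
  feed:
    method: parse_xml  # assumed registration name
    params:
      meta:
        title: './/title/text()'
      meta_date:
        published_at: './/pubDate/text()'
    handle:
      pass: store
```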
Source code in memorious/operations/parse.py
Clean
Operations for cleaning data.
Clean and validate metadata in the data dict.
Performs various data transformations including dropping keys, setting defaults, rewriting values, validating required fields, and type casting.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| context | Context | The crawler context. | required |
| data | dict[str, Any] | Data dict to clean (modified in place). | required |

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| drop | | List of keys to remove from data. | required |
| defaults | | Dict of default values for missing keys. | required |
| values | | Dict for value rewriting (mapping or format string). | required |
| required | | List of required keys (raises MetaDataError if missing). | required |
| typing | | Type casting configuration with ignore list and date kwargs. | required |
Example
```yaml
pipeline:
  clean:
    method: clean
    params:
      drop:
        - page
        - formdata
        - session_id
      defaults:
        source: "web"
        language: "en"
      values:
        foreign_id: "{publisher[id]}-{reference}"
        status:
          draft: unpublished
          live: published
      required:
        - title
        - url
        - published_at
      typing:
        ignore:
          - reference
          - phone_number
        dateparserkwargs:
          dayfirst: true
    handle:
      pass: store
```
Source code in memorious/operations/clean.py
Clean HTML by removing specified elements.
Removes HTML elements matching the given XPath expressions and stores the cleaned HTML.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| context | Context | The crawler context. | required |
| data | dict[str, Any] | Must contain cached HTTP response data. | required |

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| remove_paths | | List of XPath expressions for elements to remove. | required |
Example
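A hypothetical sketch; the registration name (`clean_html`) and the XPaths are assumptions:

```yaml
pipeline:
  clean_html:
    method: clean_html  # assumed registration name
    params:
      remove_paths:
        - './/script'
        - './/nav'
        - './/div[@class="ads"]'
    handle:
      pass: store
```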
Source code in memorious/operations/clean.py
Extract
Operations for extracting archives.
Extract files from a compressed archive.
Supports ZIP, TAR (including gzip/bzip2), and 7z archives. Emits each extracted file as a separate data item.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| context | Context | The crawler context. | required |
| data | dict[str, Any] | Must contain cached HTTP response data. | required |

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| wildcards | | List of shell-style patterns to filter extracted files. | required |
Example
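A minimal sketch, assuming the operation is registered as `extract`; the wildcards and stage names are illustrative:

```yaml
pipeline:
  extract:
    method: extract  # assumed registration name
    params:
      wildcards:
        - "*.pdf"
        - "*.csv"
    handle:
      pass: store
```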
Source code in memorious/operations/extract.py
Regex
Operations for regex extraction.
Extract named regex groups from data values.
Uses regex named capture groups to extract structured data from string values. Supports both simple single-pattern extraction and advanced multi-pattern extraction with splitting.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| context | Context | The crawler context. | required |
| data | dict[str, Any] | Data dict to extract from (modified in place). | required |

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `<key>` | | Regex pattern with named groups, or config dict. | required |
| Config dict supports | | pattern/patterns: Single pattern or list of patterns. store_as: Key name for storing the result. split: Separator to split value before matching. | required |
Example
```yaml
pipeline:
  extract:
    method: regex_groups
    params:
      # Simple extraction: source key -> named groups added to data
      full_name: '(?P<first_name>\w+)\s(?P<last_name>\w+)'
      # From "John Doe" extracts: first_name="John", last_name="Doe"

      # Advanced extraction with splitting
      originators_raw:
        store_as: originators
        split: ","
        patterns:
          - '(?P<name>.*),\s*(?P<party>\w+)'
          - '(?P<name>.*)'
      # From "John Doe, SPD, Jane Smith" extracts:
      # originators = [
      #   {name: "John Doe", party: "SPD"},
      #   {name: "Jane Smith"}
      # ]

      # Metadata extraction
      meta_raw: >-
        .*Drucksache\s+(?P<reference>\d+/\d+)
        .*vom\s+(?P<published_at>\d{2}\.\d{2}\.\d{4}).*
    handle:
      pass: clean
```
Source code in memorious/operations/regex.py
Store
Operations for storing data.
Store with configurable backend and incremental marking.
A flexible store operation that delegates to other storage methods and marks incremental completion when the target stage is reached.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| context | Context | The crawler context. | required |
| data | dict[str, Any] | Must contain content_hash from a fetched response. | required |

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| operation | | Storage operation name (default: "directory"). Options: "directory", "lakehouse". | required |
Note
Incremental completion is marked automatically by the underlying storage operations (directory, lakehouse).
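For illustration, a minimal sketch that delegates storage to the lakehouse backend, assuming the operation is registered as `store`:

```yaml
pipeline:
  store:
    method: store  # assumed registration name
    params:
      operation: lakehouse  # or "directory" (the default)
```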
Source code in memorious/operations/store.py
Store collected files to a local directory.
Saves files to a directory structure organized by crawler name. Also stores metadata as a JSON sidecar file.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| context | Context | The crawler context. | required |
| data | dict[str, Any] | Must contain content_hash from a fetched response. | required |

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | | Custom storage path (default: {base_path}/store/{crawler_name}). | required |
| compute_path | | Configure how file paths are computed. method: the path computation method (default: "url_path"); one of "url_path" (use the URL path), "template" (Jinja2 template with data context), or "file_name" (use only the file name, flat structure). params: method-specific parameters; for url_path: include_domain (bool, include the domain as path prefix, default false) and strip_prefix (str, strip this prefix from the path); for template: template (str, Jinja2 template with data context). | required |
Example
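A minimal sketch, assuming the operation is registered as `directory`; the path and prefix values are placeholders:

```yaml
pipeline:
  store:
    method: directory  # assumed registration name
    params:
      path: /data/crawlers/example
      compute_path:
        method: url_path
        params:
          include_domain: false
          strip_prefix: /documents/
```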
Source code in memorious/operations/store.py
Store collected file in the ftm-lakehouse archive.
Stores files in a structured archive with metadata tracking, suitable for integration with Aleph and other FTM-based systems.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| context | Context | The crawler context. | required |
| data | dict[str, Any] | Must contain content_hash from a fetched response. | required |

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| uri | | Custom lakehouse URI (default: context.archive). | required |
| compute_path | | Configure how file keys are computed. method: the path computation method (default: "url_path"); one of "url_path" (use the URL path), "template" (Jinja2 template with data context), or "file_name" (use only the file name, flat structure). params: method-specific parameters; for url_path: include_domain (bool, include the domain as path prefix, default false) and strip_prefix (str, strip this prefix from the path); for template: template (str, Jinja2 template with data context). | required |
Example
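A hypothetical sketch; the registration name (`lakehouse`) and the template variables are assumptions:

```yaml
pipeline:
  store:
    method: lakehouse  # assumed registration name
    params:
      compute_path:
        method: template
        params:
          template: "{{ reference }}/{{ file_name }}"  # illustrative data keys
```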
Source code in memorious/operations/store.py
Remove a blob from the archive.
Deletes a file from the archive after processing is complete. Useful for cleaning up temporary files.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| context | Context | The crawler context. | required |
| data | dict[str, Any] | Must contain content_hash of file to delete. | required |
Source code in memorious/operations/store.py
Debug
Operations for debugging.
Log the current data dict for inspection.
Prints the data dictionary in a formatted way for debugging. Passes data through to the next stage unchanged.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| context | Context | The crawler context. | required |
| data | dict[str, Any] | Data to inspect. | required |
Source code in memorious/operations/debug.py
Drop into an interactive ipdb debugger session.
Pauses execution and opens an interactive Python debugger, allowing inspection of the context and data at runtime.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| context | Context | The crawler context (available as `context` in the debugger). | required |
| data | dict[str, Any] | Current stage data (available as `data` in the debugger). | required |
Note
Requires ipdb to be installed (pip install ipdb).
Only useful during local development, not in production.
Source code in memorious/operations/debug.py
FTP
Source code in memorious/operations/ftp.py
WebDAV
List files in a WebDAV directory.
Source code in memorious/operations/webdav.py
DocumentCloud
Source code in memorious/operations/documentcloud.py
Create a persistent tag to indicate that a document has been fully processed.
On subsequent runs, we can check and skip processing this document earlier in the pipeline.
Source code in memorious/operations/documentcloud.py
Aleph
Operations for Aleph integration.
Source code in memorious/operations/aleph.py
Source code in memorious/operations/aleph.py
Source code in memorious/operations/aleph.py
FTM Store
Operations for FollowTheMoney entity storage.
Store an entity or a list of entities to an ftm store.
Source code in memorious/operations/ftm.py
Write each entity from an ftm store to Aleph via the _bulk API.
Source code in memorious/operations/ftm.py
Helpers
Utility modules for operations.
Pagination
memorious.helpers.pagination
Pagination utilities for web crawlers.
This module provides helper functions for handling pagination in crawlers, including URL manipulation and next-page detection.
get_paginated_url(url, page, param='page')
Apply page number to URL query parameter.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| url | str | The base URL. | required |
| page | int | Page number to set. | required |
| param | str | Query parameter name for the page number. | 'page' |

Returns:

| Type | Description |
|---|---|
| str | URL with the page parameter set. |
Example
```python
>>> get_paginated_url("https://example.com/search", 2)
'https://example.com/search?page=2'
>>> get_paginated_url("https://example.com/search?q=test", 3, "p")
'https://example.com/search?q=test&p=3'
```
Source code in memorious/helpers/pagination.py
paginate(context, data, html)
Emit next page if pagination indicates more pages.
Examines pagination configuration and HTML content to determine if there are more pages, and emits the next page data.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| context | Context | The crawler context with pagination params. | required |
| data | dict[str, Any] | Current data dict (used to get current page). | required |
| html | HtmlElement | HTML element containing pagination info. | required |
Example YAML configuration:

```yaml
pipeline:
  parse:
    method: parse
    params:
      pagination:
        total: './/span[@class="total"]/text()'
        per_page: 20
        param: page
    handle:
      next_page: fetch
      store: store
```
Source code in memorious/helpers/pagination.py
Casting
memorious.helpers.casting
Type casting utilities for scraped data.
This module provides functions for automatically casting scraped string values to appropriate Python types (int, float, date, datetime).
cast_value(value, with_date=False, **datekwargs)
Cast a value to its appropriate type.
Attempts to convert strings to int, float, or date as appropriate.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| value | Any | The value to cast. | required |
| with_date | bool | If True, attempt to parse strings as dates. | False |
| `**datekwargs` | Any | Additional arguments for date parsing. | {} |

Returns:

| Type | Description |
|---|---|
| int \| float \| date \| datetime \| Any | The cast value (int, float, date, datetime, or original type). |
Example
```python
>>> cast_value("42")
42
>>> cast_value("3.14")
3.14
>>> cast_value("2024-01-15", with_date=True)
datetime.date(2024, 1, 15)
```
Source code in memorious/helpers/casting.py
cast_dict(data, ignore_keys=None, **kwargs)
Cast all values in a dictionary to appropriate types.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | dict[str, Any] | Dictionary to process. | required |
| ignore_keys | list[str] \| None | Keys to skip during casting. | None |
| `**kwargs` | Any | Additional arguments for date parsing. | {} |

Returns:

| Type | Description |
|---|---|
| dict[str, Any] | New dictionary with cast values. |
Example
```python
>>> cast_dict({"count": "42", "date": "2024-01-15"})
```
Source code in memorious/helpers/casting.py
ensure_date(value, raise_on_error=False, **parserkwargs)
Parse a value into a date object.
Tries multiple parsing strategies: datetime.date, dateutil.parse, and dateparser.parse.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| value | str \| date \| datetime \| None | The value to parse (string, date, datetime, or None). | required |
| raise_on_error | bool | If True, raise exception on parse failure. | False |
| `**parserkwargs` | Any | Additional arguments passed to date parsers. | {} |

Returns:

| Type | Description |
|---|---|
| date \| None | A date object, or None if parsing fails and raise_on_error is False. |

Raises:

| Type | Description |
|---|---|
| Exception | If parsing fails and raise_on_error is True. |
Example
```python
>>> ensure_date("2024-01-15")
datetime.date(2024, 1, 15)
>>> ensure_date("January 15, 2024")
datetime.date(2024, 1, 15)
```
Source code in memorious/helpers/casting.py
XPath
memorious.helpers.xpath
XPath extraction utilities for HTML/XML parsing.
This module provides helper functions for extracting values from HTML and XML documents using XPath expressions.
extract_xpath(html, path)
Extract value from HTML/XML element using XPath.
Handles common cases like single-element lists and text extraction.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| html | HtmlElement | The lxml HTML/XML element to query. | required |
| path | str | XPath expression to evaluate. | required |

Returns:

| Type | Description |
|---|---|
| Any | The extracted value. If the result is a single-element list, returns just that element. If the element has a text attribute, returns the stripped text. |
Example
```python
>>> extract_xpath(html, './/title/text()')
'Page Title'
>>> extract_xpath(html, './/a/@href')
'https://example.com'
```
Source code in memorious/helpers/xpath.py
Template
memorious.helpers.template
Jinja2 templating utilities for URL and string generation.
This module provides functions for rendering Jinja2 templates with data, useful for dynamic URL construction in crawlers.
render_template(template, data)
Render a Jinja2 template string with data.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| template | str | Jinja2 template string. | required |
| data | dict[str, Any] | Dictionary of values to substitute. | required |

Returns:

| Type | Description |
|---|---|
| str | The rendered string. |
Example
```python
>>> render_template("https://example.com/page/{{ page }}", {"page": 1})
'https://example.com/page/1'
```
Source code in memorious/helpers/template.py
Forms
memorious.helpers.forms
HTML form extraction utilities.
This module provides helper functions for extracting form data from HTML documents, useful for form submission in crawlers.
extract_form(html, xpath)
Extract form action URL and field values from an HTML form.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| html | HtmlElement | HTML element containing the form. | required |
| xpath | str | XPath expression to locate the form element. | required |

Returns:

| Type | Description |
|---|---|
| str \| None | Tuple of (action_url, form_data_dict). Returns (None, {}) if the form is not found. |
| dict[str, Any] | |
Example
```python
>>> action, data = extract_form(html, './/form[@id="login"]')
>>> action
'/login'
>>> data
```
Source code in memorious/helpers/forms.py
Regex
memorious.helpers.regex
Regex extraction utilities for data parsing.
This module provides helper functions for extracting data from strings using regular expressions.
regex_first(pattern, string)
Extract the first regex match from a string.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| pattern | str | Regular expression pattern. | required |
| string | str | String to search. | required |

Returns:

| Type | Description |
|---|---|
| str | The first match, stripped of whitespace. |

Raises:

| Type | Description |
|---|---|
| RegexError | If no match is found. |
Example
```python
>>> regex_first(r"\d+", "Page 42 of 100")
'42'
```
Source code in memorious/helpers/regex.py
YAML
memorious.helpers.yaml
YAML loader with !include constructor support.
This module provides a custom YAML loader that supports including external
files using the !include directive.
Example
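A sketch of the intended usage, with illustrative file names:

```yaml
# crawler.yml
name: my_crawler
pipeline: !include pipeline.yml
```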
IncludeLoader
Bases: SafeLoader
YAML Loader with !include constructor for file inclusion.
Source code in memorious/helpers/yaml.py
__init__(stream)
Initialize the loader with the root directory from the stream.
load_yaml(path)
Load YAML file with !include support.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| Path | Path to the YAML file. | required |

Returns:

| Type | Description |
|---|---|
| dict[str, Any] | Parsed YAML content as a dictionary. |
Example
```python
config = load_yaml("crawler.yml")
```
Source code in memorious/helpers/yaml.py
Registry
memorious.operations.register(name)
Decorator to register an operation.
Raises ValueError if an operation with the same name already exists.
Example
```python
@register("my_operation")
def my_operation(context: Context, data: dict) -> None: ...
```
Source code in memorious/operations/__init__.py
memorious.operations.resolve_operation(method_name, base_path=None)
Resolve an operation method by name.
Resolution order:

1. Local registry (built-in and decorated operations)
2. Module import (module:function syntax, e.g., "mypackage.ops:my_func")
3. File import (file:function syntax, e.g., "./src/ops.py:my_func")

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| method_name | str | Either a registered name (e.g., "fetch"), a module path (e.g., "mypackage.ops:my_func"), or a file path (e.g., "./src/ops.py:my_func"). | required |
| base_path | Path \| str \| None | Base directory for resolving relative file paths. Typically the directory containing the crawler config. | None |

Returns:

| Type | Description |
|---|---|
| OperationFunc | The operation function. |

Raises:

| Type | Description |
|---|---|
| ValueError | If the operation cannot be resolved. |