Memorious
A lightweight web scraping toolkit for Python.
Info
This is a hard fork of the original memorious project that was discontinued in 2023. Currently, this package can only be installed via git:
pip install "memorious @ git+https://github.com/dataresearchcenter/memorious.git"
See the development section for what has changed since the fork.
Features
- Modular pipelines - Compose crawlers from reusable stages
- Built-in operations - Fetch, parse, store, and more
- Incremental crawling - Skip already-processed items
- HTTP caching - Conditional requests with ETag support
- OpenAleph integration - Push data to OpenAleph instances
- FTM support - Extract and store FollowTheMoney entities
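The incremental-crawling and HTTP-caching features rest on two general ideas: remembering which items have already been handled, and revalidating a cached response by sending its ETag in an `If-None-Match` header. The following is a minimal sketch of those ideas in plain Python. It is not Memorious's own implementation; `CrawlState` and its methods are hypothetical names chosen for illustration, and the state is kept in memory only.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CrawlState:
    """Hypothetical in-memory state; a real crawler would persist this."""

    seen: set = field(default_factory=set)     # keys of processed items
    etags: dict = field(default_factory=dict)  # url -> last ETag received

    def should_process(self, key: str) -> bool:
        """Return True the first time a key is offered, False afterwards."""
        if key in self.seen:
            return False
        self.seen.add(key)
        return True

    def conditional_headers(self, url: str) -> dict:
        """Build headers asking the server to revalidate a cached copy."""
        etag = self.etags.get(url)
        return {"If-None-Match": etag} if etag else {}

    def record_response(self, url: str, status: int, etag: Optional[str]) -> bool:
        """Remember the ETag; return True if the body has to be read again."""
        if status == 304:  # Not Modified: the cached copy is still valid
            return False
        if etag:
            self.etags[url] = etag
        return True


state = CrawlState()
url = "https://example.com"

first_headers = state.conditional_headers(url)         # nothing cached yet
changed = state.record_response(url, 200, '"abc"')     # body is new
second_headers = state.conditional_headers(url)        # now carries the ETag
unchanged = state.record_response(url, 304, None)      # server says: reuse cache
```

On the second request the server can answer `304 Not Modified` without a body, which is what makes repeated crawls of unchanged pages cheap.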
Quick Example
name: my_crawler
pipeline:
  init:
    method: seed
    params:
      url: https://example.com
    handle:
      pass: fetch
  fetch:
    method: fetch
    handle:
      pass: store
  store:
    method: directory
    params:
      path: ./output
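To make the data flow of this example explicit: each stage names a method, and its `handle` mapping says which stage receives output emitted under a given rule (here `pass`). The sketch below mirrors the configuration as a plain Python dict and follows the `pass` rule from stage to stage. It only illustrates the stage ordering and is not how Memorious executes crawlers; `stage_order` is a hypothetical helper, and treating `init` as the entry stage is an assumption based on the example.

```python
# The example configuration, written out as a Python dict for illustration.
PIPELINE = {
    "init": {
        "method": "seed",
        "params": {"url": "https://example.com"},
        "handle": {"pass": "fetch"},
    },
    "fetch": {
        "method": "fetch",
        "handle": {"pass": "store"},
    },
    "store": {
        "method": "directory",
        "params": {"path": "./output"},
    },
}


def stage_order(pipeline, start="init", rule="pass"):
    """Follow one handler rule from `start` and list the stages visited."""
    order = []
    stage = start
    # Stop at a stage without a handler for this rule, or if a stage repeats.
    while stage is not None and stage not in order:
        order.append(stage)
        stage = pipeline[stage].get("handle", {}).get(rule)
    return order


print(" -> ".join(stage_order(PIPELINE)))  # prints "init -> fetch -> store"
```

Read this way, the crawler seeds one URL, fetches it, and writes the result to the `./output` directory; `store` has no `handle` entry, so the chain ends there.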
Documentation
- Quick Start - Get up and running in minutes
- Installation - Installation and setup
- Crawlers - How to configure crawlers
- Operations - Available operations
Reference
- CLI Reference - Command-line interface
- Crawler Reference - Configuration options
- Operations Reference - API documentation
- Settings Reference - Environment variables