
Memorious

Import memorious crawler results into a leakrfc dataset.

As long as the global cache is used (environment variable CACHE=1, the default), only new documents are synced.

leakrfc -d my_dataset memorious sync -i /memorious/data/store/my_dataset

File paths can be set via a key_func function or via the command line:

# use only the file names without their path:
leakrfc -d my_dataset memorious sync -i /memorious/data/store/my_dataset --name-only

# strip a prefix from the original relative file urls:
leakrfc -d my_dataset memorious sync -i /memorious/data/store/my_dataset --strip-prefix "assets/docs"
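For reference, the two options above correspond to simple path transformations. A plain-Python sketch of the behavior (an illustration, not leakrfc's internal code):

```python
from pathlib import PurePosixPath

def name_only(key: str) -> str:
    # --name-only: keep just the file name, drop any directories
    return PurePosixPath(key).name

def strip_prefix(key: str, prefix: str) -> str:
    # --strip-prefix: remove a leading path prefix, if present
    if key.startswith(prefix):
        return str(PurePosixPath(key).relative_to(prefix))
    return key

print(name_only("assets/docs/report.pdf"))                     # report.pdf
print(strip_prefix("assets/docs/report.pdf", "assets/docs"))   # report.pdf
```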

Alternatively, use a template that is rendered with the values from the original memorious "*.json" metadata file for the source file. Given a JSON file stored by memorious like this:

{
  "url": "https://pardok.parlament-berlin.de/starweb/adis/citat/VT/19/SchrAnfr/S19-11840.pdf",
  "page": 5228,
  "request_id": "GET:https#//pardok.parlament-berlin.de/starweb/adis/citat/VT/19/SchrAnfr/S19-11840.pdf",
  "status_code": 200,
  "content_hash": "123fd201b54d7a6c91e6e9852008c3ad6698ffbe",
  "headers": {},
  "retrieved_at": "2025-01-09T08:53:58.545052",
  "originator": "Senatsverwaltung für Inneres, Digitalisierung und Sport",
  "subject": "Öffentliche Verwaltung",
  "state": "Berlin",
  "category": "Beratungsvorgang",
  "doc_type": "Anfrage",
  "date": "2022-05-24",
  "doc_id": "BLN_V359641_D359643",
  "reference": "Drucksache 19/11840",
  "reference_id": "9/11840",
  "legislative_term": "9",
  "title": "Berlin - Antwort. Senatsverwaltung für Inneres, Digitalisierung und Sport - Drucksache 19/11840, 24.05.2022",
  "modified_at": "2022-05-31T12:51:25",
  "_file_name": "123fd201b54d7a6c91e6e9852008c3ad6698ffbe.data.pdf"
}

To import this file as "2022/05/Berlin/Beratungsvorgang/19-11840.pdf":

leakrfc -d my_dataset memorious sync -i /memorious/data/store/my_dataset --key-template "{{ date[:4] }}/{{ date[5:7] }}/{{ state }}/{{ category }}/{{ reference.split()[-1].replace('/','-') }}.{{ url.split('.')[-1] }}"
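The template is rendered per file with the metadata fields as variables (the `{{ ... }}` syntax suggests a Jinja-style engine; treat that as an assumption). Mirroring the expressions in plain Python shows how the key is assembled from the sample metadata above; the `19-11840` part is derived here by taking the last token of the `reference` field:

```python
# Sample fields from the memorious JSON shown above
meta = {
    "url": "https://pardok.parlament-berlin.de/starweb/adis/citat/VT/19/SchrAnfr/S19-11840.pdf",
    "date": "2022-05-24",
    "state": "Berlin",
    "category": "Beratungsvorgang",
    "reference": "Drucksache 19/11840",
}

key = "/".join([
    meta["date"][:4],                                 # "2022"
    meta["date"][5:7],                                # "05"
    meta["state"],                                    # "Berlin"
    meta["category"],                                 # "Beratungsvorgang"
    meta["reference"].split()[-1].replace("/", "-")   # "19-11840"
    + "." + meta["url"].split(".")[-1],               # file extension: "pdf"
])
print(key)  # 2022/05/Berlin/Beratungsvorgang/19-11840.pdf
```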

Reference

import_memorious(dataset, uri, key_func=None)

Convert a "memorious collection" (the output format of the store->directory stage) into a leakrfc dataset.

memorious store:

```
./data/store/test_dataset/
    ./<sha1>.data.pdf|doc|...  # actual file
    ./<sha1>.json              # metadata file
```

The memorious JSON metadata for each file is stored in the `extra` property of the leakrfc file metadata.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `dataset` | `DatasetArchive` | leakrfc Dataset instance | *required* |
| `uri` | `Uri` | local or remote location of the memorious store that supports file listing | *required* |
| `key_func` | `Callable \| None` | A function to generate file keys (their relative paths); by default the key is generated from the source url. | `None` |
Source code in leakrfc/sync/memorious.py
def import_memorious(
    dataset: DatasetArchive, uri: Uri, key_func: Callable | None = None
) -> MemoriousStatus:
    """
    Convert a "memorious collection" (the output format of the store->directory
    stage) into a leakrfc dataset

    memorious store:
        ```
        ./data/store/test_dataset/
            ./<sha1>.data.pdf|doc|...  # actual file
            ./<sha1>.json              # metadata file
        ```

    The memorious json metadata for each file will be stored in the leakrfc
    metadata at the `extra` property for each file.

    Args:
        dataset: leakrfc Dataset instance
        uri: local or remote location of the memorious store that supports file
            listing
        key_func: A function to generate file keys (their relative paths), per
            default it is generated from the source url.
    """

    worker = MemoriousWorker(uri, key_func, dataset=dataset)
    worker.log_info(f"Starting memorious import from `{worker.memorious.uri}` ...")
    return worker.run()