The aim of leakrfc is to provide a standardized, client-agnostic way to share document collections and their metadata across systems. Collections must be syncable with standard tools (e.g. rsync, s3, ...). Therefore, no specific storage backend is used for metadata: all required information is stored in plain files.

Because the original file tree is kept as is and all metadata is stored in the subfolder .leakrfc, tenants of an archive don't even need to use the library to consume files.

The RFC is reflected by the following layout structure for a Dataset:

    my_dataset/  # the subfolder name is the `foreign_id` of a dataset

        .leakrfc/       # metadata maintained by `leakrfc`
            index.json      # generated dataset metadata served for clients
            config.yml      # dataset configuration
            documents.csv   # document database (all metadata combined)
            size            # current sum of file size in bytes
            state/          # OPTIONAL processing state
            entities/       # OPTIONAL followthemoney entities for this dataset
            meta/                            # FILE METADATA STORAGE:
                .../File.doc/info.json       # - file metadata as json REQUIRED
                .../File.doc/txt             # - extracted plain text
                .../File.doc/converted.pdf   # - converted file, e.g. from .docx to .pdf for better web display
            export/         # OPTIONAL
                my_dataset.img.zst         # dump as image
                my_dataset.leakrfc         # dump as zipfile

        # actual source data
        Arbitrary Folder/
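
As a minimal sketch of how a client could locate the metadata for one file without using the library, the helper below builds the paths implied by the layout above. The function name `metadata_paths` and its return keys are hypothetical and not part of leakrfc; only the directory and file names are taken from the layout.

```python
from pathlib import PurePosixPath

# Metadata subfolder name as given in the layout description.
META_DIR = ".leakrfc"


def metadata_paths(dataset_root: str, key: str) -> dict[str, PurePosixPath]:
    """Return the expected metadata locations for one file.

    `key` is the file's path relative to the dataset root.
    This helper is illustrative only; it does not check that the files exist.
    """
    base = PurePosixPath(dataset_root) / META_DIR / "meta" / key
    return {
        "info": base / "info.json",           # required file metadata
        "text": base / "txt",                 # extracted plain text
        "converted": base / "converted.pdf",  # converted file for web display
    }


paths = metadata_paths("my_dataset", "Arbitrary Folder/File.doc")
print(paths["info"])
# my_dataset/.leakrfc/meta/Arbitrary Folder/File.doc/info.json
```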


Document metadata is stored in the .leakrfc/documents.csv file, one row per document:

106972554.pdf,54d6cd29c71713bd8b2b3c1267f13a4ab15e7c94,208679,application/pdf,2024-09-29 22:52:24.613038,2024-09-29 22:52:24.613038
500 pages.pdf,3e193f218e1dc0ae0f9f972da6d5da45b0abf59f,302661,application/pdf,2024-09-29 22:52:24.613038,2024-09-29 22:52:24.613038
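
Rows like the ones above can be read with Python's standard `csv` module. The sketch below assumes the file has no header row and that the columns are, in order, key, content hash, size, mimetype, created and updated timestamps. These column names are inferred from the sample rows, not taken from a specification, so adjust them if the actual file differs.

```python
import csv
import io
from datetime import datetime

# Assumed column names, inferred from the sample rows (not an official schema).
COLUMNS = ["key", "content_hash", "size", "mimetype", "created_at", "updated_at"]

SAMPLE = """\
106972554.pdf,54d6cd29c71713bd8b2b3c1267f13a4ab15e7c94,208679,application/pdf,2024-09-29 22:52:24.613038,2024-09-29 22:52:24.613038
500 pages.pdf,3e193f218e1dc0ae0f9f972da6d5da45b0abf59f,302661,application/pdf,2024-09-29 22:52:24.613038,2024-09-29 22:52:24.613038
"""


def read_documents(stream):
    """Yield one dict per CSV row, with size and timestamps converted."""
    for row in csv.DictReader(stream, fieldnames=COLUMNS):
        row["size"] = int(row["size"])
        row["created_at"] = datetime.fromisoformat(row["created_at"])
        row["updated_at"] = datetime.fromisoformat(row["updated_at"])
        yield row


docs = list(read_documents(io.StringIO(SAMPLE)))
print(sum(doc["size"] for doc in docs))  # 511340
```

For a real dataset, pass an open file handle for `.leakrfc/documents.csv` (opened with `newline=""`) instead of the in-memory sample.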

This CSV file aggregates the JSON metadata stored for each file path in the .leakrfc/meta/<path> folder.

For example: .leakrfc/meta/slides.ppt/info.json

{
  "created_at": "2024-09-29T22:52:24.673038",
  "updated_at": "2024-09-29T22:52:24.673038",
  "size": 148992,
  "name": "slides.ppt",
  "store": "/tests/fixtures/archive/test_dataset",
  "key": "slides.ppt",
  "dataset": "test_dataset",
  "content_hash": "08883b398cc03686df724621624f432d5d052bd4",
  "mimetype": "application/",
  "processed": null,
  "origin": "original",
  "source_file": null,
  "extra": {}
}

Arbitrary data can be stored in the extra key (or directly at the top level of the JSON object, as long as it doesn't interfere with the required metadata).
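
To illustrate how the per-file JSON metadata relates to the CSV rows, here is a rough sketch that walks a meta folder and writes one row per info.json. It is not the library's implementation: the function names are hypothetical, and the column order and timestamp format are assumptions inferred from the sample rows and the sample info.json shown above.

```python
import csv
import json
from datetime import datetime
from pathlib import Path


def info_to_row(info: dict) -> list[str]:
    """Map one info.json object to a CSV row (column order is an assumption)."""
    return [
        info["key"],
        info["content_hash"],
        str(info["size"]),
        info["mimetype"],
        # The JSON uses "T" as separator, the CSV sample uses a space.
        str(datetime.fromisoformat(info["created_at"])),
        str(datetime.fromisoformat(info["updated_at"])),
    ]


def aggregate(meta_dir: Path, out) -> int:
    """Write one CSV row per info.json below meta_dir; return the row count."""
    writer = csv.writer(out)
    count = 0
    for info_path in sorted(meta_dir.rglob("info.json")):
        if not info_path.is_file():
            continue  # skip folders that happen to be named info.json
        info = json.loads(info_path.read_text(encoding="utf-8"))
        writer.writerow(info_to_row(info))
        count += 1
    return count


example = {
    "key": "slides.ppt",
    "content_hash": "08883b398cc03686df724621624f432d5d052bd4",
    "size": 148992,
    "mimetype": "application/",  # value as printed in the example above
    "created_at": "2024-09-29T22:52:24.673038",
    "updated_at": "2024-09-29T22:52:24.673038",
}
print(info_to_row(example)[4])  # 2024-09-29 22:52:24.673038
```

Calling `aggregate(Path("my_dataset/.leakrfc/meta"), open_file)` would then rebuild the rows for a whole dataset; extra keys in the JSON are simply ignored by this sketch.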

Tracking updates

Each run of crawl, make or another adapter operation updates the documents metadata CSV and produces a diff between the previous and the current CSV file. The diff follows the unified diff format, so clients can consume the changes and, for example, process or import only recently added documents.

Example of a documents.csv.<timestamp>.diff file:

---     2025-01-05T01:40:10.639414
+++     2025-01-15T13:32:49.955291
@@ -62,0 +63 @@
+test-documents.rar,9a377a36f36bff1c5de651b8f3162ae93c0fa4b1,67945,application/vnd.rar,2024-12-21 14:56:33.412859,2024-12-21 14:56:33.412859
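
A client that only wants the newly added documents could extract them from such a diff roughly as follows. This is a minimal sketch using only the standard library; the function name `added_rows` is hypothetical, and it assumes a single-file unified diff like the example above, where added lines start with `+` after the first `@@` hunk header.

```python
import csv

SAMPLE_DIFF = """\
---     2025-01-05T01:40:10.639414
+++     2025-01-15T13:32:49.955291
@@ -62,0 +63 @@
+test-documents.rar,9a377a36f36bff1c5de651b8f3162ae93c0fa4b1,67945,application/vnd.rar,2024-12-21 14:56:33.412859,2024-12-21 14:56:33.412859
"""


def added_rows(diff_lines):
    """Return the CSV rows that a unified diff marks as added."""
    in_hunk = False
    added = []
    for line in diff_lines:
        line = line.rstrip("\n")
        if line.startswith("@@"):
            in_hunk = True  # header lines (---, +++) come before the first hunk
        elif in_hunk and line.startswith("+"):
            added.append(line[1:])  # drop the leading "+" marker
    return list(csv.reader(added))


rows = added_rows(SAMPLE_DIFF.splitlines())
print(rows[0][0])  # test-documents.rar
```

Removed documents could be collected the same way by checking for a leading `-` inside hunks.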