Quickstart

Install

Requires Python 3.11 or later.

pip install leakrfc

Build a dataset

leakrfc stores metadata for each file, and that metadata refers to the actual source file.

For example, take this public file listing archive: https://data.ddosecrets.com/Patriot%20Front/patriotfront/2021/Organizational%20Documents%20and%20Notes/

Crawl these documents into a dataset:

leakrfc -d ddos_patriotfront crawl "https://data.ddosecrets.com/Patriot%20Front/patriotfront/2021/Organizational%20Documents%20and%20Notes"
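The crawl URL above is percent-encoded (spaces appear as %20). If you start from a human-readable folder path, a minimal Python sketch for building such a URL with the standard library could look like this. The helper name `build_listing_url` is hypothetical and not part of leakrfc.

```python
from urllib.parse import quote

BASE_URL = "https://data.ddosecrets.com"


def build_listing_url(base_url: str, path: str) -> str:
    """Percent-encode a readable path and join it to the base URL.

    Hypothetical helper for illustration, not a leakrfc function.
    """
    # quote() leaves "/" untouched by default and encodes spaces as %20.
    return f"{base_url.rstrip('/')}/{quote(path.strip('/'))}"


url = build_listing_url(
    BASE_URL,
    "Patriot Front/patriotfront/2021/Organizational Documents and Notes",
)
print(url)
```

The printed URL can be passed, quoted, to the crawl command.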

The metadata and source files are now stored in the archive (./data by default).

Inspect files and archive

All metadata and other information lives in the ddos_patriotfront/.leakrfc subdirectory. Files are keyed and accessible by their (relative) path.
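To illustrate what "keyed by relative path" means, here is a small sketch that derives a key from a file's location inside the dataset directory. It assumes the default ./data layout used in this guide; `key_for` is a hypothetical helper, and leakrfc's own key handling may differ.

```python
from pathlib import PurePosixPath


def key_for(dataset_root: str, file_path: str) -> str:
    """Return the path of file_path relative to dataset_root (hypothetical helper)."""
    # relative_to() raises ValueError if the file is not inside the dataset root.
    return PurePosixPath(file_path).relative_to(PurePosixPath(dataset_root)).as_posix()


key = key_for("data/ddos_patriotfront", "data/ddos_patriotfront/Event.pdf")
print(key)
```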

Retrieve file metadata:

leakrfc -d ddos_patriotfront head Event.pdf

Retrieve the actual file blob:

leakrfc -d ddos_patriotfront get Event.pdf > Event.pdf

Show the metadata of all files in the dataset archive:

leakrfc -d ddos_patriotfront ls

Show only the file paths:

leakrfc -d ddos_patriotfront ls --keys

Show only the checksums (sha1 by default):

leakrfc -d ddos_patriotfront ls --checksums
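To cross-check a retrieved file against a listed checksum, you can compute its SHA-1 digest yourself. The following is a minimal sketch using only the Python standard library. The function name `sha1_of_file` is an assumption for illustration, and the demo runs on a throwaway file rather than on real dataset content.

```python
import hashlib
import tempfile
from pathlib import Path


def sha1_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Compute the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a temporary file; in practice, point this at a file you retrieved.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"abc")
    demo_checksum = sha1_of_file(sample)

print(demo_checksum)
```

If the digest of a retrieved file matches the value listed for it, the blob arrived intact.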

Tracking changes

The make command (re-)generates the dataset's metadata.

Delete a file:

rm ./data/ddos_patriotfront/Event.pdf

Now regenerate:

leakrfc -d ddos_patriotfront make

The output will indicate that 1 file was deleted.
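Conceptually, detecting a deletion comes down to comparing the set of keys recorded in the metadata with the set of files currently present. The sketch below shows that idea with plain sets. It is not leakrfc's actual implementation, and the file name Notes.txt is made up for the example.

```python
def diff_keys(known_keys, current_keys):
    """Compare previously recorded keys with the keys present now."""
    known, current = set(known_keys), set(current_keys)
    return {
        "added": sorted(current - known),    # present now, not recorded before
        "deleted": sorted(known - current),  # recorded before, missing now
    }


before = ["Event.pdf", "Notes.txt"]  # hypothetical keys recorded earlier
after = ["Notes.txt"]                # state after removing Event.pdf
changes = diff_keys(before, after)
print(changes)
```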

Configure storage

storage_config:
  uri: s3://my_bucket
  backend_kwargs:
    endpoint_url: https://s3.example.org
    aws_access_key_id: ${AWS_ACCESS_KEY_ID}
    aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
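The ${...} placeholders in this configuration read like references to environment variables. How leakrfc resolves them is not described here, so the following is only a sketch of one common approach, using `string.Template` from the standard library. The `expand_placeholders` helper and the example credential values are assumptions.

```python
from string import Template


def expand_placeholders(value: str, env: dict[str, str]) -> str:
    """Replace ${NAME} placeholders with values from env (hypothetical helper)."""
    # substitute() raises KeyError for a missing variable instead of
    # silently leaving the placeholder in place.
    return Template(value).substitute(env)


# Stand-in for os.environ so the example does not touch real credentials.
fake_env = {
    "AWS_ACCESS_KEY_ID": "example-id",
    "AWS_SECRET_ACCESS_KEY": "example-secret",
}

backend_kwargs = {
    "endpoint_url": "https://s3.example.org",
    "aws_access_key_id": "${AWS_ACCESS_KEY_ID}",
    "aws_secret_access_key": "${AWS_SECRET_ACCESS_KEY}",
}

resolved = {key: expand_placeholders(value, fake_env) for key, value in backend_kwargs.items()}
print(resolved["aws_access_key_id"])
```

Keeping secrets in the environment rather than in the file means the configuration can be committed or shared without exposing credentials.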

Dataset config.yml

Follows the specification in ftmq.model.Dataset:

name: my_dataset #  also known as "foreign_id"
title: An awesome leak
description: >
  Incidunt eum asperiores impedit. Nobis est dolorem et quam autem quo. Name
  labore sequi maxime qui non voluptatum ducimus voluptas. Exercitationem enim
  similique asperiores quod et quae maiores. Et accusantium accusantium error
  et alias aut omnis eos. Omnis porro sit eum et.
updated_at: 2024-09-25
index_url: https://static.example.org/my_dataset/index.json
# add more metadata

leakrfc: # see above