Make

This generates or updates a dataset archive. Run this command after files have been added to or deleted from the archive.

The process can also be used to turn any existing directory or remote location into a leakrfc dataset.

leakrfc -d my_dataset make [OPTIONS]

Reference

Make or update a leakrfc dataset and check integrity

make_dataset(dataset, check_integrity=True, cleanup=True, metadata_only=False)

Make or update a leakrfc dataset and optionally check its integrity.

By default, this iterates through all source files and creates (or updates) a metadata JSON file for each of them.

At the end, dataset statistics and documents.csv (and their diff) are created.
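The iteration described above can be pictured with a small standard-library sketch. It is not leakrfc's implementation: the helper names (`sha1_of`, `make_metadata`), the metadata layout (`<meta_dir>/<key>.json`), the metadata fields and the choice of SHA-1 are all assumptions made for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha1_of(path: Path) -> str:
    """Return the SHA-1 hex digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_metadata(source_dir: Path, meta_dir: Path) -> int:
    """Write one JSON metadata file per source file and return the file count.

    Hypothetical layout: <meta_dir>/<relative path>.json
    meta_dir must live outside source_dir so metadata is not re-indexed.
    """
    count = 0
    # sorted() materialises the listing before any metadata is written
    for path in sorted(source_dir.rglob("*")):
        if not path.is_file():
            continue
        key = path.relative_to(source_dir).as_posix()
        info = {
            "key": key,
            "size": path.stat().st_size,
            "checksum": sha1_of(path),
        }
        target = meta_dir / f"{key}.json"
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(json.dumps(info, indent=2))
        count += 1
    return count


# Small demo on a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    src, meta = Path(tmp, "src"), Path(tmp, "meta")
    (src / "docs").mkdir(parents=True)
    (src / "docs" / "a.txt").write_text("hello")
    print(make_metadata(src, meta))  # 1
```

Running the function a second time simply overwrites the metadata files, which mirrors the "creates (or updates)" behaviour in spirit.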

Parameters:

Name             Type            Description                                                                              Default
dataset          DatasetArchive  leakrfc Dataset instance                                                                 required
check_integrity  bool | None     Check checksum for each file (logs mismatches)                                           True
cleanup          bool | None     When checking integrity, fix mismatched metadata and delete unreferenced metadata files  True
metadata_only    bool | None     Only iterate through existing metadata files, don't look for new source files            False
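To make the interplay of check_integrity and cleanup more concrete, here is a hedged, in-memory sketch. Plain dictionaries stand in for the archive's files and metadata, the function name `check_integrity` is only borrowed from the parameter, and SHA-1 is an assumed checksum algorithm; leakrfc's actual worker may behave differently.

```python
import hashlib


def check_integrity(
    files: dict[str, bytes],
    metadata: dict[str, str],
    cleanup: bool = True,
) -> list[str]:
    """Compare stored checksums against file contents.

    files:    key -> file content (stand-in for the source files)
    metadata: key -> stored checksum (stand-in for metadata JSON files)

    Returns the keys whose metadata is mismatched or unreferenced.
    With cleanup=True, mismatches are fixed in place and unreferenced
    entries are deleted; with cleanup=False they are only reported.
    """
    problems = []
    # sorted() copies the keys, so deleting entries inside the loop is safe
    for key in sorted(metadata):
        if key not in files:
            problems.append(key)  # metadata without a source file
            if cleanup:
                del metadata[key]
            continue
        actual = hashlib.sha1(files[key]).hexdigest()
        if metadata[key] != actual:
            problems.append(key)  # stored checksum is stale
            if cleanup:
                metadata[key] = actual
    return problems


files = {"a.txt": b"hello", "b.txt": b"world"}
metadata = {
    "a.txt": hashlib.sha1(b"hello").hexdigest(),  # correct
    "b.txt": "0" * 40,  # stale checksum
    "gone.txt": "f" * 40,  # no matching file
}
print(check_integrity(files, metadata))  # ['b.txt', 'gone.txt']
```

After the call, `metadata` holds corrected checksums for the two existing files only, so a second pass reports nothing.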
Source code in leakrfc/make.py
def make_dataset(
    dataset: DatasetArchive,
    check_integrity: bool | None = True,
    cleanup: bool | None = True,
    metadata_only: bool | None = False,
) -> MakeStatus:
    """
    Make or update a leakrfc dataset and optionally check its integrity.

    Per default, this iterates through all the source files and creates (or
    updates) file metadata json files.

    At the end, dataset statistics and documents.csv (and their diff) are
    created.

    Args:
        dataset: leakrfc Dataset instance
        check_integrity: Check checksum for each file (logs mismatches)
        cleanup: When checking integrity, fix mismatched metadata and delete
            unreferenced metadata files
        metadata_only: Only iterate through existing metadata files, don't look
            for new source files

    """
    worker = MakeWorker(check_integrity, cleanup, metadata_only, dataset)
    return worker.run()