Crawl

Crawl a local or remote location of documents (that supports file listing) into a leakrfc dataset. This operation stores the file metadata and actual file blobs in the configured archive.

This will create a new dataset or update an existing one. Incremental crawls are cached via the global leakrfc cache.

Crawls can add files to a dataset, but they never delete anything: files that no longer exist at the source remain in the dataset.

Basic usage

Crawl a local directory

leakrfc -d my_dataset crawl /data/dump1/

Crawl an HTTP location

The location needs to support file listing.

In this example, archives (zip, tar.gz, ...) will be extracted during import.

leakrfc -d ddos_blueleaks crawl --extract https://data.ddosecrets.com/BlueLeaks/

Crawl from a cloud bucket

In this example, only PDF files are crawled:

leakrfc -d my_dataset crawl --include "*.pdf" s3://my_bucket/files

Under the hood, leakrfc uses anystore, which builds on fsspec and therefore supports a wide range of filesystem-like sources. Some of them require installing additional dependencies (for example, s3fs for s3:// locations).
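
If you are unsure whether a remote location can be listed at all, you can probe it directly with fsspec before starting a crawl. This is an illustrative sketch, not part of leakrfc; the bucket path is a placeholder and the s3fs package must be installed for s3:// URLs.

import fsspec

# Probe the listing that a crawl of s3://my_bucket/files would see.
# Assumes the s3fs package is installed; anonymous access is used here.
fs = fsspec.filesystem("s3", anon=True)
for path in fs.ls("my_bucket/files", detail=False):
    print(path)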

Extract

Source files can be extracted during import using patool. This has a few caveats:

  • When enabling --extract, the archives themselves won't be stored, only their extracted members, keeping the original (archived) directory structure.
  • This can lead to file conflicts if several archives share the same directory structure (file.pdf from archive2.zip would overwrite the one from archive1.zip):

archive1.zip
    subdir1/file.pdf

archive2.zip
    subdir1/file.pdf

  • To avoid this, use --extract-ensure-subdir to place the extracted members into a sub-directory named after their source archive (see the sketch after this list). The result would look like:

archive1.zip/subdir1/file.pdf
archive2.zip/subdir1/file.pdf

  • To keep the source archives as well, use --extract-keep-source.
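
The sketch below illustrates the idea behind --extract-ensure-subdir using patool directly. It is not leakrfc's actual implementation; the archive names and output directory are placeholders.

import os
import patoolib

archives = ["archive1.zip", "archive2.zip"]

for archive in archives:
    # Extract each archive into a sub-directory named after the archive
    # itself, so identical member paths can't overwrite each other.
    outdir = os.path.join("extracted", os.path.basename(archive))
    os.makedirs(outdir, exist_ok=True)
    patoolib.extract_archive(archive, outdir=outdir, verbosity=-1)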

Include / Exclude glob patterns

Only crawl a subdirectory:

--include "subdir/*"

Exclude .txt files from a subdirectory and all its children:

--exclude "subdir/**/*.txt"
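
Both options can be combined in a single crawl run. The following line simply joins the two patterns above and is meant as an illustration; adjust the patterns to your directory layout:

leakrfc -d my_dataset crawl --include "subdir/*" --exclude "subdir/**/*.txt" /data/dump1/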

Reference

Crawl document collections from publicly accessible archives (or local folders)

crawl(uri, dataset, skip_existing=True, extract=False, extract_keep_source=False, extract_ensure_subdir=False, write_documents_db=True, exclude=None, include=None, origin=ORIGIN_ORIGINAL, source_file=None)

Crawl a local or remote location of documents into a leakrfc dataset.

Parameters:

uri (Uri, required)
    local or remote location URI that supports file listing

dataset (DatasetArchive, required)
    leakrfc Dataset instance

skip_existing (bool | None, default: True)
    Don't re-crawl existing keys (doesn't check the checksum)

extract (bool | None, default: False)
    Extract archives using patool

extract_keep_source (bool | None, default: False)
    When extracting, still import the source archive

extract_ensure_subdir (bool | None, default: False)
    Create a sub-directory named after the source archive for extracted files, to avoid overwriting existing files when extracting multiple archives with the same directory structure

write_documents_db (bool | None, default: True)
    Create CSV-based document tables at the end of the crawl run

exclude (str | None, default: None)
    Exclude glob for file paths not to crawl

include (str | None, default: None)
    Include glob for file paths to crawl

origin (Origins | None, default: ORIGIN_ORIGINAL)
    Origin of files (used for sub-runs of crawl within a crawl job)

source_file (File | None, default: None)
    Source file (used for sub-runs of crawl within a crawl job)
Source code in leakrfc/crawl.py
def crawl(
    uri: Uri,
    dataset: DatasetArchive,
    skip_existing: bool | None = True,
    extract: bool | None = False,
    extract_keep_source: bool | None = False,
    extract_ensure_subdir: bool | None = False,
    write_documents_db: bool | None = True,
    exclude: str | None = None,
    include: str | None = None,
    origin: Origins | None = ORIGIN_ORIGINAL,
    source_file: File | None = None,
) -> CrawlStatus:
    """
    Crawl a local or remote location of documents into a leakrfc dataset.

    Args:
        uri: local or remote location uri that supports file listing
        dataset: leakrfc Dataset instance
        skip_existing: Don't re-crawl existing keys (doesn't check for checksum)
        extract: Extract archives using [`patool`](https://pypi.org/project/patool/)
        extract_keep_source: When extracting, still import the source archive
        extract_ensure_subdir: Make sub-directories for extracted files with the
            archive name to avoid overwriting existing files during extraction
            of multiple archives with the same directory structure
        write_documents_db: Create csv-based document tables at the end of crawl run
        exclude: Exclude glob for file paths not to crawl
        include: Include glob for file paths to crawl
        origin: Origin of files (used for sub runs of crawl within a crawl job)
        source_file: Source file (used for sub runs of crawl within a crawl job)
    """
    remote_store = get_store(uri=uri)
    # FIXME ensure long timeouts
    if remote_store.scheme.startswith("http"):
        backend_config = ensure_dict(remote_store.backend_config)
        backend_config["client_kwargs"] = {
            **ensure_dict(backend_config.get("client_kwargs")),
            "timeout": aiohttp.ClientTimeout(total=3600 * 24),
        }
        remote_store.backend_config = backend_config
    worker = CrawlWorker(
        remote_store,
        dataset=dataset,
        skip_existing=skip_existing,
        extract=extract,
        extract_keep_source=extract_keep_source,
        extract_ensure_subdir=extract_ensure_subdir,
        write_documents_db=write_documents_db,
        exclude=exclude,
        include=include,
        origin=origin,
        source_file=source_file,
    )
    return worker.run()
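
For programmatic use, crawl() can be called directly instead of going through the CLI. This is a minimal sketch, assuming crawl is importable from leakrfc/crawl.py as shown above; obtaining a DatasetArchive instance is not covered on this page, so the dataset variable below is only a placeholder.

from leakrfc.crawl import crawl

# `dataset` stands in for an already configured leakrfc DatasetArchive;
# see the dataset documentation for how to create or open one.
dataset = ...

status = crawl(
    "s3://my_bucket/files",
    dataset,
    include="*.pdf",     # only crawl PDF files
    skip_existing=True,  # don't re-crawl keys that already exist
)
print(status)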