Crawl

Crawl a local or remote location of documents (that supports file listing) into a leakrfc dataset. This operation stores the file metadata and actual file blobs in the configured archive.

This will create a new dataset or update an existing one. Incremental crawls are cached via the global leakrfc cache.

Crawls can add files to a dataset, but they never delete anything: files that no longer exist at the source remain in the dataset.

Basic usage

Crawl a local directory

leakrfc -d my_dataset crawl /data/dump1/

Crawl an HTTP location

The location needs to support file listing.

In this example, archives (zip, tar.gz, ...) will be extracted during import.

leakrfc -d ddos_blueleaks crawl --extract https://data.ddosecrets.com/BlueLeaks/

Crawl from a cloud bucket

In this example, only PDF files are crawled:

leakrfc -d my_dataset crawl --include "*.pdf" s3://my_bucket/files

Under the hood, leakrfc uses anystore, which builds on fsspec and therefore supports a wide range of filesystem-like sources. Some of them require installing additional dependencies (for example, s3fs for s3:// locations).
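
If you are unsure whether a remote location can be listed at all, you can probe it directly with fsspec before starting a crawl. This is an illustrative sketch, not part of leakrfc; the bucket path is a placeholder and the s3fs package must be installed for s3:// URLs.

import fsspec

# Probe the listing that a crawl of s3://my_bucket/files would see.
# Assumes the s3fs package is installed; anonymous access is used here.
fs = fsspec.filesystem("s3", anon=True)
for path in fs.ls("my_bucket/files", detail=False):
    print(path)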

Extract

Source files can be extracted during import using patool. This has a few caveats:

  • When enabling --extract, the archives themselves won't be stored, only their extracted members, keeping the original (archived) directory structure.
  • This can lead to file conflicts if several archives share the same directory structure (file.pdf from archive2.zip would overwrite the one from archive1.zip):

archive1.zip
    subdir1/file.pdf

archive2.zip
    subdir1/file.pdf

  • To avoid this, use --extract-ensure-subdir to place the extracted members into a sub-directory named after their source archive (see the sketch after this list). The result would look like:

archive1.zip/subdir1/file.pdf
archive2.zip/subdir1/file.pdf

  • To keep the source archives as well, use --extract-keep-source.
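
The sketch below illustrates the idea behind --extract-ensure-subdir using patool directly. It is not leakrfc's actual implementation; the archive names and output directory are placeholders.

import os
import patoolib

archives = ["archive1.zip", "archive2.zip"]

for archive in archives:
    # Extract each archive into a sub-directory named after the archive
    # itself, so identical member paths can't overwrite each other.
    outdir = os.path.join("extracted", os.path.basename(archive))
    os.makedirs(outdir, exist_ok=True)
    patoolib.extract_archive(archive, outdir=outdir, verbosity=-1)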

Include / Exclude glob patterns

Only crawl a subdirectory:

--include "subdir/*"

Exclude .txt files from a subdirectory and all its children:

--exclude "subdir/**/*.txt"
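
Both options can be combined in a single crawl run. The following line simply joins the two patterns above and is meant as an illustration; adjust the patterns to your directory layout:

leakrfc -d my_dataset crawl --include "subdir/*" --exclude "subdir/**/*.txt" /data/dump1/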

Reference

Crawl document collections from publicly accessible archives (or local folders)

crawl(uri, dataset, skip_existing=True, extract=False, extract_keep_source=False, extract_ensure_subdir=False, write_documents_db=True, exclude=None, include=None, origin=ORIGIN_ORIGINAL, source_file=None)

Crawl a local or remote location of documents into a leakrfc dataset.

Parameters:

uri (Uri, required)
    local or remote location URI that supports file listing

dataset (DatasetArchive, required)
    leakrfc Dataset instance

skip_existing (bool | None, default: True)
    Don't re-crawl existing keys (doesn't check the checksum)

extract (bool | None, default: False)
    Extract archives using patool

extract_keep_source (bool | None, default: False)
    When extracting, still import the source archive

extract_ensure_subdir (bool | None, default: False)
    Create a sub-directory named after the source archive for extracted files, to avoid overwriting existing files when extracting multiple archives with the same directory structure

write_documents_db (bool | None, default: True)
    Create CSV-based document tables at the end of the crawl run

exclude (str | None, default: None)
    Exclude glob for file paths not to crawl

include (str | None, default: None)
    Include glob for file paths to crawl

origin (Origins | None, default: ORIGIN_ORIGINAL)
    Origin of files (used for sub-runs of crawl within a crawl job)

source_file (File | None, default: None)
    Source file (used for sub-runs of crawl within a crawl job)
Source code in leakrfc/crawl.py
def crawl(
    uri: Uri,
    dataset: DatasetArchive,
    skip_existing: bool | None = True,
    extract: bool | None = False,
    extract_keep_source: bool | None = False,
    extract_ensure_subdir: bool | None = False,
    write_documents_db: bool | None = True,
    exclude: str | None = None,
    include: str | None = None,
    origin: Origins | None = ORIGIN_ORIGINAL,
    source_file: File | None = None,
) -> CrawlStatus:
    """
    Crawl a local or remote location of documents into a leakrfc dataset.

    Args:
        uri: local or remote location uri that supports file listing
        dataset: leakrfc Dataset instance
        skip_existing: Don't re-crawl existing keys (doesn't check for checksum)
        extract: Extract archives using [`patool`](https://pypi.org/project/patool/)
        extract_keep_source: When extracting, still import the source archive
        extract_ensure_subdir: Make sub-directories for extracted files with the
            archive name to avoid overwriting existing files during extraction
            of multiple archives with the same directory structure
        write_documents_db: Create csv-based document tables at the end of crawl run
        exclude: Exclude glob for file paths not to crawl
        include: Include glob for file paths to crawl
        origin: Origin of files (used for sub runs of crawl within a crawl job)
        source_file: Source file (used for sub runs of crawl within a crawl job)
    """
    remote_store = get_store(uri=uri)
    # FIXME ensure long timeouts
    if remote_store.scheme.startswith("http"):
        backend_config = ensure_dict(remote_store.backend_config)
        backend_config["client_kwargs"] = {
            **ensure_dict(backend_config.get("client_kwargs")),
            "timeout": aiohttp.ClientTimeout(total=3600 * 24),
        }
        remote_store.backend_config = backend_config
    worker = CrawlWorker(
        remote_store,
        dataset=dataset,
        skip_existing=skip_existing,
        extract=extract,
        extract_keep_source=extract_keep_source,
        extract_ensure_subdir=extract_ensure_subdir,
        write_documents_db=write_documents_db,
        exclude=exclude,
        include=include,
        origin=origin,
        source_file=source_file,
    )
    return worker.run()
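
For programmatic use, crawl() can be called directly instead of going through the CLI. This is a minimal sketch, assuming crawl is importable from leakrfc/crawl.py as shown above; obtaining a DatasetArchive instance is not covered on this page, so the dataset variable below is only a placeholder.

from leakrfc.crawl import crawl

# `dataset` stands in for an already configured leakrfc DatasetArchive;
# see the dataset documentation for how to create or open one.
dataset = ...

status = crawl(
    "s3://my_bucket/files",
    dataset,
    include="*.pdf",     # only crawl PDF files
    skip_existing=True,  # don't re-crawl keys that already exist
)
print(status)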