# Crawl
Crawl a local or remote location of documents (one that supports file listing) into a leakrfc dataset. This operation stores the file metadata and the actual file blobs in the configured archive.

Crawling creates a new dataset or updates an existing one. Incremental crawls are cached via the global leakrfc cache. A crawl can add files to a dataset but never deletes anything from it: files that no longer exist at the source remain in the dataset.
## Basic usage
### Crawl a local directory
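A minimal sketch of what the command could look like; the `-d my_dataset` option for selecting the target dataset and the local path are illustrative assumptions, not taken from this page:

```bash
# Crawl all files from a local folder into the dataset "my_dataset"
leakrfc -d my_dataset crawl ./Documents/files/
```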
### Crawl an HTTP location

The location needs to support file listing. In this example, archives (zip, tar.gz, ...) will be extracted during import.
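A hedged sketch of such a crawl; the URL is purely illustrative and the `-d my_dataset` option is assumed, while `--extract` is the flag described in the Extract section below:

```bash
# Crawl a remote HTTP location that exposes a file listing,
# extracting archives (zip, tar.gz, ...) on the fly
leakrfc -d my_dataset crawl --extract https://example.org/files/
```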
### Crawl from a cloud bucket

In this example, only pdf files are crawled:
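A sketch under the same assumptions as above (the dataset option and the bucket URI are illustrative):

```bash
# Crawl only pdf files from a cloud bucket
leakrfc -d my_dataset crawl --include "*.pdf" s3://my_bucket/files/
```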
Under the hood, leakrfc uses anystore, which builds on fsspec and therefore supports a wide range of filesystem-like sources. For some of them, installing additional dependencies might be required.
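For example, crawling from an S3-compatible bucket typically requires the s3fs backend used by fsspec (package names shown here as common fsspec examples, not as a leakrfc requirement list):

```bash
pip install s3fs    # for s3:// sources; gcsfs for gs://, adlfs for az://
```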
## Extract
Source files can be extracted during import using patool. This has a few caveats:
- When enabling `--extract`, archives themselves won't be stored, only their extracted members, keeping the original (archived) directory structure.
- This can lead to file conflicts if several archives share the same directory structure: `file.pdf` from `archive2.zip` would replace the previous one (first tree in the sketch below).
- To avoid this, use `--extract-ensure-subdir` to create a sub-directory named after the source archive and place the extracted members into it (second tree in the sketch below).
- If keeping the source archives in the dataset is desired, use `--extract-keep-source`.
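The following illustrative sketch (archive and file names are made up) shows both layouts: in the first tree, `file.pdf` extracted from `archive2.zip` overwrites the copy from `archive1.zip`; in the second, `--extract-ensure-subdir` keeps them apart:

```
# --extract only: members of both archives share one tree
subdir/
└── file.pdf          # whichever archive is extracted last wins

# --extract --extract-ensure-subdir: one sub-directory per source archive
archive1.zip/
└── subdir/
    └── file.pdf
archive2.zip/
└── subdir/
    └── file.pdf
```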
## Include / Exclude glob patterns
Only crawl a subdirectory: `--include "subdir/*"`

Exclude .txt files from a subdirectory and all its children: `--exclude "subdir/**/*.txt"`
## Reference
Crawl document collections from publicly accessible archives (or local folders).
### `crawl(uri, dataset, skip_existing=True, extract=False, extract_keep_source=False, extract_ensure_subdir=False, write_documents_db=True, exclude=None, include=None, origin=ORIGIN_ORIGINAL, source_file=None)`
Crawl a local or remote location of documents into a leakrfc dataset.
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `uri` | `Uri` | Local or remote location uri that supports file listing | *required* |
| `dataset` | `DatasetArchive` | leakrfc Dataset instance | *required* |
| `skip_existing` | `bool \| None` | Don't re-crawl existing keys (doesn't check the checksum) | `True` |
| `extract` | `bool \| None` | Extract archives using patool | `False` |
| `extract_keep_source` | `bool \| None` | When extracting, still import the source archive | `False` |
| `extract_ensure_subdir` | `bool \| None` | Create sub-directories named after the archive for extracted files, to avoid overwriting existing files when extracting multiple archives with the same directory structure | `False` |
| `write_documents_db` | `bool \| None` | Create csv-based document tables at the end of the crawl run | `True` |
| `exclude` | `str \| None` | Exclude glob for file paths not to crawl | `None` |
| `include` | `str \| None` | Include glob for file paths to crawl | `None` |
| `origin` | `Origins \| None` | Origin of files (used for sub-runs within a crawl job) | `ORIGIN_ORIGINAL` |
| `source_file` | `File \| None` | Source file (used for sub-runs within a crawl job) | `None` |