Archive
Simple archive implementation for storing scraped files based on anystore
archive_source(uri, *args, url_key_only=False, cache=True, stealthy=False, delay=None, raise_on_error=True, **kwargs)
Archive a remote file and return the archive key
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| url_key_only | bool \| None | Compute the cache key from the URL alone, as a fallback | False |
| cache | bool \| None | Use the cache; set to False to disable caching entirely (force re-fetch) | True |
| stealthy | bool \| None | Use a random HTTP user agent (for HTTP remote sources) | False |
| delay | int \| None | Set a delay before fetching | None |
| raise_on_error | bool \| None | Raise an exception on error instead of only logging it | True |
Returns:
| Type | Description |
|---|---|
| str | The archive lookup key. |
Source code in investigraph/archive.py
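The parameter semantics above can be modeled with a small, self-contained sketch. This is not the investigraph implementation: `archive_source_sketch`, the dict-based `store`, the injected `fetch` callable, the scheme-stripping key, the delay being in seconds, and the `None` return on a logged error are all assumptions made for illustration.

```python
from __future__ import annotations

import logging
import time
from typing import Callable

log = logging.getLogger(__name__)


def archive_source_sketch(
    uri: str,
    store: dict[str, bytes],
    fetch: Callable[[str], bytes],
    cache: bool = True,
    delay: int | None = None,
    raise_on_error: bool = True,
) -> str | None:
    """Hypothetical stand-in: fetch a remote file, store it, return its key."""
    # Assumption: the key is the URI with its scheme stripped.
    key = uri.split("://", 1)[-1]
    if cache and key in store:
        return key  # already archived, so the fetch is skipped
    if delay:
        time.sleep(delay)  # assumption: delay is given in seconds
    try:
        store[key] = fetch(uri)
    except Exception as exc:
        if raise_on_error:
            raise
        # Assumption: on a logged error, nothing is stored and None is returned.
        log.error("Could not archive %s: %s", uri, exc)
        return None
    return key
```

With `cache=True`, a second call for the same URI returns the key without calling `fetch` again; with `cache=False`, the file is fetched and stored again.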
get_archive(uri=None)
cached
Get the archive in which remote files are stored.
Set the archive via INVESTIGRAPH_ARCHIVE_URI (see Settings).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| uri | Uri \| None | Use this specific uri instead of the global setting. | None |
Returns:
| Type | Description |
|---|---|
| BaseStore | The archive store (see anystore) |
Source code in investigraph/archive.py
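The lookup order described above (an explicit uri overrides the global setting, and results are cached) might look roughly like the following sketch. `resolve_archive_uri` is a hypothetical helper that only resolves the URI string; it does not build a store, and the `./archive` fallback is an assumption, not a documented default.

```python
from __future__ import annotations

import os
from functools import cache


@cache
def resolve_archive_uri(uri: str | None = None) -> str:
    """Hypothetical helper: an explicit uri wins, else the global setting."""
    if uri is not None:
        return uri
    # Assumption: fall back to a local directory when nothing is configured.
    return os.environ.get("INVESTIGRAPH_ARCHIVE_URI", "./archive")
```

Because the function is cached per argument, changing the environment variable after the first call has no effect on later calls within the same process.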
make_archive_key(uri)
Make the key prefix based on a file uri.
Example
make_archive_key("https://example.org/files/data.pdf") returns "example.org/files/data.pdf"
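A minimal sketch that reproduces the documented example, assuming the key is simply the host followed by the path. `archive_key_sketch` is a hypothetical name; the real function may treat query strings, ports, or local paths differently.

```python
from urllib.parse import urlparse


def archive_key_sketch(uri: str) -> str:
    """Hypothetical key builder: host plus path, with the scheme dropped."""
    parsed = urlparse(uri)
    return f"{parsed.netloc}{parsed.path}"


key = archive_key_sketch("https://example.org/files/data.pdf")
```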
open(uri, url_key_only=False, cache=True, stealthy=False, delay=None, raise_on_error=True, mode=None, **kwargs)
Open a file from the archive as a file-like io handler. If it doesn't exist in the archive, it will be stored first.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| mode | str \| None | open mode (default | None |
| url_key_only | bool \| None | [only if the file doesn't exist in the archive yet] Compute the cache key from the URL alone, as a fallback | False |
| cache | bool \| None | [only if the file doesn't exist in the archive yet] Use the cache; set to False to disable caching entirely (force re-fetch) | True |
| stealthy | bool \| None | [only if the file doesn't exist in the archive yet] Use a random HTTP user agent (for HTTP remote sources) | False |
| delay | int \| None | [only if the file doesn't exist in the archive yet] Set a delay before fetching | None |
| raise_on_error | bool \| None | [only if the file doesn't exist in the archive yet] Raise an exception on error instead of only logging it | True |
Returns:
| Type | Description |
|---|---|
| ContextManager[IO[AnyStr]] | The open file handler |
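The store-on-miss behavior described for open can be illustrated with a self-contained sketch. It is not the library's code: `open_sketch`, the dict-based `store`, the injected `fetch` callable, and the scheme-stripping key are assumptions, and the sketch always returns a binary handle.

```python
from __future__ import annotations

import io
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def open_sketch(
    uri: str,
    store: dict[str, bytes],
    fetch: Callable[[str], bytes],
) -> Iterator[io.BytesIO]:
    """Hypothetical stand-in: archive the file on a miss, then yield a handle."""
    # Assumption: the key is the URI with its scheme stripped.
    key = uri.split("://", 1)[-1]
    if key not in store:
        store[key] = fetch(uri)  # not archived yet, so store it first
    handle = io.BytesIO(store[key])
    try:
        yield handle
    finally:
        handle.close()  # the handle is closed when the with-block exits
```

Used in a with-statement, the first call fetches and stores the content; later calls for the same URI read from the store without fetching.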