Seed

config.yml

Example: All .csv files in a local folder

seed:
  uri: ./data
  glob: "**/*.csv"
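
To get a feel for which files a recursive pattern such as `**/*.csv` selects, the sketch below applies the same pattern with Python's standard `pathlib` to a throwaway folder. It is only an approximation: investigraph resolves the seed through its own store layer, whose glob semantics may differ in detail, and the folder layout and the `list_csv_files` helper are made up for illustration.

```python
import tempfile
from pathlib import Path


def list_csv_files(base: Path) -> list[str]:
    """Return POSIX-style paths, relative to base, that match **/*.csv."""
    return sorted(p.relative_to(base).as_posix() for p in base.glob("**/*.csv"))


# Build a temporary folder that stands in for ./data (hypothetical layout).
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "2023").mkdir()
    (base / "a.csv").write_text("id\n1\n")
    (base / "2023" / "b.csv").write_text("id\n2\n")
    (base / "notes.txt").write_text("not a source\n")

    # Both the top-level and the nested .csv file match; notes.txt does not.
    matched = list_csv_files(base)

print(matched)
```

With `pathlib`, `**` also matches zero directories, so files directly inside the base folder are included alongside nested ones.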

Reference

Bases: Stage

Source code in investigraph/model/stage.py
class SeedStage(Stage):
    default_handler = settings.seeder

    uri: str | None = None
    """Base uri for sources"""

    prefix: str | None = None
    """Only include sources with given name prefix"""

    exclude_prefix: str | None = None
    """Exclude sources with given name prefix"""

    glob: str | list[str] | None = None
    """Only include sources that match this glob pattern(s)"""

    storage_options: dict[str, Any] | None = None
    """Pass through kwargs to `fsspec`"""

    source_options: dict[str, Any] | None = None
    """Pass through extra data to source object"""

exclude_prefix: str | None = None

Exclude sources with given name prefix

glob: str | list[str] | None = None

Only include sources that match this glob pattern(s)

prefix: str | None = None

Only include sources with given name prefix

source_options: dict[str, Any] | None = None

Pass through extra data to source object

storage_options: dict[str, Any] | None = None

Pass through kwargs to fsspec

uri: str | None = None

Base uri for sources

Bring your own code

seed:
  handler: ./seed.py:handle

Function signature

handle(ctx)

The default handler for the seed stage.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| ctx | DatasetContext | instance of the current DatasetContext | required |

Yields:

| Type | Description |
|------|-------------|
| Source | Generator of Source objects for further processing in extract stage. |
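
A custom `seed.py` that satisfies this signature might look roughly like the sketch below. So that it runs on its own, `Source` is a stand-in dataclass that models only the `uri` field; in a real project you would use investigraph's own `Source` class instead. The `URLS` list and the example.org addresses are hypothetical placeholders for your own discovery logic.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any, Generator


@dataclass
class Source:
    """Stand-in for investigraph's Source; only `uri` is modelled here."""

    uri: str


# Hypothetical list of remote files; replace with your own discovery logic.
URLS = [
    "https://example.org/data/2023.csv",
    "https://example.org/data/2024.csv",
]


def handle(ctx: Any) -> Generator[Source, None, None]:
    """Custom seed handler: yield one Source per known URL.

    `ctx` is the current dataset context; this sketch does not read from it.
    """
    for url in URLS:
        yield Source(uri=url)


# Exercise the handler with a dummy context object.
sources = list(handle(SimpleNamespace()))
print([s.uri for s in sources])
```

Because the handler is a generator, sources are produced lazily, so the extract stage can begin working before the full list has been enumerated.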

Source code in investigraph/logic/seed.py
def handle(ctx: DatasetContext) -> Generator[Source, None, None]:
    """
    The default handler for the seed stage.

    Args:
        ctx: instance of the current `DatasetContext`

    Yields:
        Generator of `Source` objects for further processing in extract stage.
    """
    if ctx.config.seed.uri is not None:
        store = get_store(ctx.config.seed.uri)
        globs = ensure_list(ctx.config.seed.glob) or [None]
        for glob in globs:
            for key in store.iterate_keys(
                glob=glob,
                prefix=ctx.config.seed.prefix,
                exclude_prefix=ctx.config.seed.exclude_prefix,
            ):
                yield Source(
                    uri=store.get_key(key),
                    **ensure_dict(ctx.config.seed.source_options),
                )
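
The control flow of the default handler can be tried out without a real store. The sketch below mirrors its loop over an in-memory key list. `KEYS`, `iterate_keys`, `seed`, and the local `ensure_list` are simplified stand-ins, not investigraph's implementations, and `fnmatchcase` is only an approximation of the store's glob matching (its `*` also matches `/`).

```python
from fnmatch import fnmatchcase
from typing import Iterator, List, Optional, Union

# Hypothetical keys that a store might contain.
KEYS = ["2023/a.csv", "2023/b.json", "2024/a.csv", "tmp/x.csv"]


def ensure_list(value: Union[str, List[str], None]) -> List[str]:
    """Stand-in helper: None becomes [], a single string becomes [string]."""
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


def iterate_keys(
    glob: Optional[str] = None,
    prefix: Optional[str] = None,
    exclude_prefix: Optional[str] = None,
) -> Iterator[str]:
    """Stand-in for store.iterate_keys: apply the three filters in turn."""
    for key in KEYS:
        if prefix is not None and not key.startswith(prefix):
            continue
        if exclude_prefix is not None and key.startswith(exclude_prefix):
            continue
        if glob is not None and not fnmatchcase(key, glob):
            continue
        yield key


def seed(glob=None, prefix=None, exclude_prefix=None) -> List[str]:
    """Mirror the handler's loop: one pass over the keys per glob pattern."""
    globs = ensure_list(glob) or [None]  # no glob means one unfiltered pass
    found: List[str] = []
    for pattern in globs:
        found.extend(iterate_keys(pattern, prefix, exclude_prefix))
    return found


no_glob = seed(prefix="2023/")
overlapping = seed(glob=["*.csv", "2023/*"], exclude_prefix="tmp/")
print(no_glob)
print(overlapping)
```

In this sketch a key that matches two patterns is returned once per pattern, because each glob gets its own pass and nothing deduplicates the results. Whether the real store or a later stage removes such duplicates is not shown in the source above, so it may be safer to keep glob patterns non-overlapping.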