config.yml
The main entry point for a specific dataset configuration.
Convention
A dataset pipeline configuration should be named config.yml within a dataset folder, e.g.: ./gdho/config.yml
Config files can be referenced via command line:
investigraph run -c ./path/to/config/file.yml
Tip
To avoid repeating the -c ./path/to/config.yml flag, set the config file globally via the environment variable INVESTIGRAPH_CONFIG.
Content
Dataset metadata
The dataset metadata follows the FTM dataset specification. Full overview
name (required)
Dataset identifier as a slug.
title
Human-readable title of the dataset.
prefix
Slug prefix for entity IDs. If not specified, uses name.
summary
Brief description of the dataset. Can be multi-line.
summary: |
  The Commission applies strict rules on transparency concerning its contacts
  and relations with interest representatives.
description
Detailed description of the dataset.
url
URL to the dataset homepage or source.
publisher
Publisher of the dataset. Required field: name.
publisher:
  name: European Commission Secretariat-General
  description: |
    The Secretariat-General is responsible for the overall coherence...
  url: https://commission.europa.eu
maintainer
Maintainer of the dataset (same structure as publisher).
license
License identifier (e.g., CC-BY-4.0, MIT).
category
Category of the dataset.
tags
List of tags for categorization.
coverage
Geographic or temporal coverage information.
resources
List of resources that hold entities from this dataset.
resources:
  - name: entities.ftm.json
    url: https://data.ftm.store/investigraph/gdho/entities.ftm.json
    mime_type: application/json+ftm
version
Dataset version.
git_repo
Git repository URL for the dataset.
updated_at
Last update timestamp (ISO 8601).
Seed stage
Optional stage that programmatically initializes Sources.
seed:
  handler: ./seed.py:handle  # custom handler (optional)
  uri: s3://bucket/prefix/  # base uri for sources
  prefix: myprefix  # only include sources with this prefix
  exclude_prefix: test  # exclude sources with this prefix
  glob: "*.csv"  # glob pattern(s) to match
  storage_options:  # fsspec storage options
    key: value
  source_options:  # extra data to pass to Source objects
    key: value
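To illustrate how prefix, exclude_prefix and glob could narrow down the files found under uri, here is a minimal Python sketch. The function select_sources, the sample file names, and the exact matching rules are assumptions made for illustration; this is not investigraph's actual seed implementation.

```python
from fnmatch import fnmatch


def select_sources(keys, prefix=None, exclude_prefix=None, glob=None):
    """Hypothetical helper: filter file keys found under the seed `uri`."""
    # Accept a single pattern or a list of patterns
    globs = [glob] if isinstance(glob, str) else list(glob or [])
    selected = []
    for key in keys:
        if prefix and not key.startswith(prefix):
            continue  # keep only keys with the wanted prefix
        if exclude_prefix and key.startswith(exclude_prefix):
            continue  # drop keys with the excluded prefix
        if globs and not any(fnmatch(key, pattern) for pattern in globs):
            continue  # keep only keys matching at least one glob pattern
        selected.append(key)
    return selected


# Made-up file names, relative to the base uri
keys = ["myprefix_2023.csv", "myprefix_notes.txt", "test_2023.csv", "other.csv"]
selected = select_sources(keys, prefix="myprefix", exclude_prefix="test", glob="*.csv")
print(selected)
```

With the values from the configuration above, only myprefix_2023.csv would remain in this sketch.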
Extract stage
Configuration for the extraction stage. Fetches sources and extracts records.
extract:
  handler: ./extract.py:handle  # custom handler (optional)
  archive: true  # download and archive remote sources (default: true)
  sources:
    - uri: https://example.com/data.csv
      name: source_name  # optional source identifier
      pandas:  # pandas/runpandarun configuration
        read:
          handler: read_csv  # pandas read handler
          options:
            encoding: utf-8
            skiprows: 1
      data:  # arbitrary extra data
        key: value
Extract stage options:
handler: Custom extraction handler function (default: investigraph.logic.extract:handle)
archive: Download and archive remote sources before processing (default: true)
sources: List of source configurations
pandas: Global pandas configuration applied to all sources (can be overridden per source)
Source options:
uri: Local or remote source URI (required)
name: Source identifier (defaults to slugified URI)
pandas: Source-specific pandas configuration
data: Arbitrary extra metadata
See extract stage documentation
Transform stage
Configuration for the transformation stage. Transforms records into FollowTheMoney entities.
transform:
  handler: ./transform.py:handle  # custom handler (optional)
  queries:
    - entities:
        entity_name:
          schema: Organization  # FTM schema
          keys:  # columns to generate entity ID from
            - id_column
          key_literal: prefix  # literal prefix for entity ID
          properties:
            name:
              column: org_name  # map single column to property
            country:
              columns:  # map multiple columns to property
                - country1
                - country2
              join: ";"  # join multiple values with separator
            website:
              literal: "https://example.com"  # literal value
            aliases:
              template: "{first} {last}"  # template with column interpolation
      filters:  # filter records
        column_name: value
      filters_not:  # negative filters
        column_name: value
Transform stage options:
handler: Custom transformation handler function (default: investigraph.logic.transform:map_ftm)
queries: List of mapping queries (uses FTM mapping syntax)
Query mapping options:
entities: Dictionary of entity mappings (entity name → EntityMapping)
filters: Include only records matching these column values
filters_not: Exclude records matching these column values
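The effect of filters and filters_not on incoming records can be sketched in a few lines of Python. The function matches and the sample records are hypothetical, and the sketch assumes plain equality on single values; the real FTM mapping logic may compare values differently.

```python
def matches(record, filters=None, filters_not=None):
    """Hypothetical check: keep a record only if it passes both filter sets."""
    # filters: every listed column must hold the given value
    for column, value in (filters or {}).items():
        if record.get(column) != value:
            return False
    # filters_not: no listed column may hold the given value
    for column, value in (filters_not or {}).items():
        if record.get(column) == value:
            return False
    return True


# Made-up records for illustration
records = [
    {"Name": "A", "Type": "NGO"},
    {"Name": "B", "Type": "UN agency"},
]
kept = [r["Name"] for r in records if matches(r, filters={"Type": "NGO"})]
dropped_ngos = [r["Name"] for r in records if matches(r, filters_not={"Type": "NGO"})]
print(kept, dropped_ngos)
```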
Entity mapping options:
schema: FollowTheMoney schema name (required)
keys: List of column names to generate entity ID from
key_literal: Literal prefix for entity ID
id_column: Use a specific column as entity ID
properties: Dictionary of property mappings (property name → PropertyMapping)
Property mapping options:
column: Map single column to property
columns: Map multiple columns to property
join: Separator for joining multiple column values
split: Separator for splitting column value into multiple values
entity: Reference to another entity in the mapping
format: Format string for value transformation
literal: Literal value for property
literals: List of literal values for property
template: Template string with {column_name} interpolation
required: Skip entity if this property has no value (default: false)
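As a rough illustration of how some of these options turn a record into property values, consider the following Python sketch. The function resolve_property and the sample record are hypothetical, only a subset of the options is covered (column, columns, join, split, literal, literals, template), and the behaviour is an assumption derived from the descriptions above rather than the actual FTM mapping code.

```python
def resolve_property(record, mapping):
    """Hypothetical resolver covering a subset of the property mapping options."""
    values = []
    # literal / literals: fixed values, independent of the record
    if "literal" in mapping:
        values.append(mapping["literal"])
    values.extend(mapping.get("literals", []))
    # template: interpolate {column_name} placeholders from the record
    if "template" in mapping:
        values.append(mapping["template"].format(**record))
    # column / columns: collect non-empty values from the record
    columns = list(mapping.get("columns", []))
    if "column" in mapping:
        columns.insert(0, mapping["column"])
    column_values = [record[c] for c in columns if record.get(c)]
    # split: turn one column value into several values
    if "split" in mapping:
        column_values = [
            part.strip()
            for value in column_values
            for part in value.split(mapping["split"])
            if part.strip()
        ]
    # join: merge all column values into a single value
    if "join" in mapping and column_values:
        values.append(mapping["join"].join(column_values))
    else:
        values.extend(column_values)
    return values


# Made-up record for illustration
record = {"first": "Jane", "last": "Doe", "country1": "de", "country2": "fr", "languages": "en, fr"}
print(resolve_property(record, {"columns": ["country1", "country2"], "join": ";"}))
print(resolve_property(record, {"template": "{first} {last}"}))
print(resolve_property(record, {"column": "languages", "split": ","}))
```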
See transform stage documentation
Load stage
Configuration for the load stage. Loads transformed entities into a statement store.
Load stage options:
handler: Custom load handler function (default: investigraph.logic.load:handle)
uri: Statement store URI (default: memory://)
Supported store URIs:
memory:// - In-memory store (default)
postgresql://user:pass@host/db - PostgreSQL store
sqlite:///path/to/db.sqlite - SQLite store
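The store type follows from the URI scheme. A small Python sketch, assuming only the three schemes listed above are accepted, shows how a store URI could be checked before a run; the helper store_backend is hypothetical and not part of investigraph.

```python
from urllib.parse import urlparse

# Schemes taken from the list of supported store URIs above
SUPPORTED_SCHEMES = {"memory", "postgresql", "sqlite"}


def store_backend(uri="memory://"):
    """Hypothetical helper: return the store scheme or raise for unknown ones."""
    scheme = urlparse(uri).scheme
    if scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"Unsupported store URI: {uri}")
    return scheme


print(store_backend())
print(store_backend("sqlite:///path/to/db.sqlite"))
print(store_backend("postgresql://user:pass@host/db"))
```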
Export stage
Configuration for the export stage. Exports dataset metadata and entities to files.
export:
  handler: ./export.py:handle  # custom handler (optional)
  index_uri: ./data/dataset/index.json  # dataset metadata output
  entities_uri: ./data/dataset/entities.ftm.json  # entities output
Export stage options:
handler: Custom export handler function (default: investigraph.logic.export:handle)
index_uri: URI for dataset metadata export (JSON file with statistics)
entities_uri: URI for entities export (FTM JSON lines format)
Both URIs support local paths and remote storage (S3, GCS, etc. via fsspec).
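Because the entities export is a JSON lines file, it can be inspected with the Python standard library alone. The sketch below counts entities per schema; the function count_schemas and the two sample lines are made up for illustration, assuming each line is one JSON object with id, schema and properties keys.

```python
import json
from collections import Counter


def count_schemas(lines):
    """Count entities per schema in FTM JSON lines (one JSON object per line)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        entity = json.loads(line)
        counts[entity["schema"]] += 1
    return counts


# Made-up sample lines; for a local export, pass an open file handle instead, e.g.
# with open("./data/dataset/entities.ftm.json") as fh: count_schemas(fh)
sample = [
    '{"id": "gdho-1", "schema": "Organization", "properties": {"name": ["Example Org"]}}',
    '{"id": "gdho-2", "schema": "Organization", "properties": {"name": ["Another Org"]}}',
]
counts = count_schemas(sample)
print(dict(counts))
```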
See export stage documentation
Custom handlers
Each stage can use a custom handler function. Specify the path to a Python file and the function name, separated by a colon, e.g. handler: ./transform.py:handle
The handler file path is relative to the config file location.
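To make the path resolution concrete, the following Python sketch loads a function from a reference such as ./transform.py:handle, resolved against the folder of the config file. The function load_handler is hypothetical and only handles file-based references; it illustrates the convention and is not investigraph's own loader.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_handler(reference, config_path):
    """Hypothetical loader for a `path/to/file.py:function` reference."""
    file_part, _, func_name = reference.rpartition(":")
    # Resolve the file relative to the folder that contains the config file
    module_path = (Path(config_path).parent / file_part).resolve()
    spec = importlib.util.spec_from_file_location(module_path.stem, str(module_path))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, func_name)


# Demo with a throwaway dataset folder; the handler body is a stand-in
# that returns a string instead of real entities.
with tempfile.TemporaryDirectory() as folder:
    config_path = Path(folder) / "config.yml"
    (Path(folder) / "transform.py").write_text(
        "def handle(ctx, record, ix):\n    return record['Name'].upper()\n"
    )
    handler = load_handler("./transform.py:handle", config_path)
    result = handler(None, {"Name": "gdho"}, 0)
print(result)
```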
Handler signatures:
# Seed handler
def handle(ctx: DatasetContext) -> Generator[Source, None, None]:
    ...

# Extract handler
def handle(ctx: SourceContext) -> RecordGenerator:
    ...

# Transform handler
def handle(ctx: SourceContext, record: dict, ix: int) -> StatementEntities:
    ...

# Load handler
def handle(ctx: DatasetContext, proxies: StatementEntities) -> int:
    ...

# Export handler
def handle(ctx: DatasetContext) -> Dataset:
    ...
A complete example
Taken from the tutorial:
name: gdho
title: Global Database of Humanitarian Organisations
prefix: gdho
summary: |
  GDHO is a global compendium of organisations that provide aid in humanitarian
  crises. The database includes basic organisational and operational
  information on these humanitarian providers, which include international
  non-governmental organisations (grouped by federation), national NGOs that
  deliver aid within their own borders, UN humanitarian agencies, and the
  International Red Cross and Red Crescent Movement.
resources:
  - name: entities.ftm.json
    url: https://data.ftm.store/investigraph/gdho/entities.ftm.json
    mime_type: application/json+ftm
publisher:
  name: Humanitarian Outcomes
  description: |
    Humanitarian Outcomes is a team of specialist consultants providing
    research and policy advice for humanitarian aid agencies and donor
    governments.
  url: https://www.humanitarianoutcomes.org
extract:
  sources:
    - uri: https://www.humanitarianoutcomes.org/gdho/search/results?format=csv
      pandas:
        read:
          options:
            encoding: latin
            skiprows: 1
transform:
  queries:
    - entities:
        org:
          schema: Organization
          key_literal: gdho
          keys:
            - Id
          properties:
            name:
              column: Name
            weakAlias:
              column: Abbreviated name
            legalForm:
              column: Type
            website:
              column: Website
            country:
              column: HQ location
            incorporationDate:
              column: Year founded
            dissolutionDate:
              column: Year closed
            sector:
              columns:
                - Sector
                - Religious or secular
                - Religion
export:
  index_uri: s3://data.ftm.store/investigraph/gdho/index.json
  entities_uri: s3://data.ftm.store/investigraph/gdho/entities.ftm.json