When not using blocks (as for local development), an arbitrary config file can be passed via the command line:
investigraph run -c ./path/to/config/file.yml
Dataset identifier, as a slug
Human-readable title of the dataset
European Commission - Meetings with interest representatives
name from above.
Slug prefix for entity IDs.
2-letter ISO code of the main country this dataset is related to. Also accepts
A description of the dataset; can span multiple lines.
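As a rough sketch, the dataset-level keys could be written as follows. The key names name, title, prefix and summary are the ones used in the complete example at the end of this page; the values ec_meetings and ec and the summary text are placeholders chosen for illustration only.

```yaml
# Dataset metadata (illustrative values, not an official config)
name: ec_meetings  # hypothetical slug identifier
title: European Commission - Meetings with interest representatives
prefix: ec  # hypothetical slug prefix for entity IDs
summary: |
  Placeholder description of the dataset.
  It can span several lines.
```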
A list of resources that hold entities from this dataset.
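A minimal sketch of a single resource entry, modelled on the complete example at the end of this page. The keys name, url and mime_type are taken from that example; the URL below is a placeholder on example.org and does not point to real data.

```yaml
resources:
  - name: entities.ftm.json
    # placeholder URL, replace with the location of the published entities
    url: https://example.org/my_dataset/entities.ftm.json
    mime_type: application/json+ftm
```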
Publisher of the dataset as an object. Required key:
publisher:
  name: European Commission Secretariat-General
  description: |
    The Secretariat-General is responsible for the overall coherence of the
    Commission’s work – both in shaping new policies, and in steering them
    through the other EU institutions. It supports the whole Commission.
  url: https://commission.europa.eu/about-european-commission/departments-and-executive-agencies/secretariat-general_en
Configuration for the extraction stage, for fetching sources and extracting records to transform in the next stage.
Reference to the Python function that handles this stage.
When using your own extractor, you can disable source fetching by investigraph and instead fetch (and extract) your sources within your own code:
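The following is only a sketch of what such a configuration might look like. It assumes that the handler is referenced as a path:function string and that a fetch key switches off fetching; neither spelling is confirmed on this page, and ./extract.py with its handle function is a hypothetical local module.

```yaml
extract:
  # assumed reference format: path to a local file, then the function name
  handler: ./extract.py:handle  # hypothetical module and function
  # assumed key: tells investigraph not to fetch the sources itself
  fetch: false
```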
Or, a python module (must be in
Can also be applied per source:
Configuration for the transformation stage, for defining a FollowTheMoney mapping or referencing custom transformation code. When a custom handler is defined, the query mapping is ignored.
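A query mapping might look roughly like the sketch below. It follows the structure of the complete example at the end of this page (queries, entities, schema, keys, properties); the entity name person and the source columns id, full_name and country are hypothetical.

```yaml
transform:
  queries:
    - entities:
        person:  # hypothetical entity name within this mapping
          schema: Person
          keys:
            - id  # hypothetical source column used to build the entity ID
          properties:
            name:
              column: full_name  # hypothetical source column
            nationality:
              column: country  # hypothetical source column
```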
The final stage that loads the transformed Entities into defined targets.
URI to output dataset metadata. Can be anything that
URI to output transformed entities. Can be anything that fsspec understands, plus a SQL endpoint (for use with followthemoney-store).
URI to output intermediate entity fragments. Can be anything that
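As an illustration, a load stage writing to local files could be sketched as below. The keys index_uri and entities_uri are the ones used in the complete example at the end of this page; the local paths are placeholders, and remote locations such as the s3:// URIs in that example would go in the same place.

```yaml
load:
  # placeholder local paths; any location fsspec can write to should work
  index_uri: ./data/my_dataset/index.json
  entities_uri: ./data/my_dataset/entities.ftm.json
```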
Specify if entities should be aggregated and how. By default, aggregation happens in memory.
Turn off aggregation completely:
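One plausible spelling, assuming the aggregate key accepts a boolean in place of a handler object (the exact value is not shown on this page, so treat it as an assumption):

```yaml
# assumed form: a plain boolean disables the aggregation step
aggregate: false
```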
If datasets are too large to fit into memory, aggregation can happen within a database (specified via the FTM_DATABASE_URI env var):
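This matches the aggregate block used in the complete example at the end of this page:

```yaml
aggregate:
  # aggregate in the database instead of in memory
  handler: db
```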
This can be a custom handler as well (as in the other stages), e.g.:
Specify the database connection when the db handler is used.
A complete example
Taken from the tutorial
name: gdho
title: Global Database of Humanitarian Organisations
prefix: gdho
summary: |
  GDHO is a global compendium of organisations that provide aid in
  humanitarian crises. The database includes basic organisational and
  operational information on these humanitarian providers, which include
  international non-governmental organisations (grouped by federation),
  national NGOs that deliver aid within their own borders, UN humanitarian
  agencies, and the International Red Cross and Red Crescent Movement.
resources:
  - name: entities.ftm.json
    url: https://data.ftm.store/investigraph/gdho/entities.ftm.json
    mime_type: application/json+ftm
publisher:
  name: Humanitarian Outcomes
  description: |
    Humanitarian Outcomes is a team of specialist consultants providing
    research and policy advice for humanitarian aid agencies and donor
    governments.
  url: https://www.humanitarianoutcomes.org
extract:
  sources:
    - uri: https://www.humanitarianoutcomes.org/gdho/search/results?format=csv
      pandas:
        read:
          options:
            encoding: latin
            skiprows: 1
transform:
  queries:
    - entities:
        org:
          schema: Organization
          keys:
            - Id
          properties:
            name:
              column: Name
            weakAlias:
              column: Abbreviated name
            legalForm:
              column: Type
            website:
              column: Website
            country:
              column: HQ location
            incorporationDate:
              column: Year founded
            dissolutionDate:
              column: Year closed
            sector:
              columns:
                - Sector
                - Religious or secular
                - Religion
load:
  index_uri: s3://firstname.lastname@example.org/investigraph/gdho/index.json
  entities_uri: s3://email@example.com/investigraph/gdho/entities.ftm.json
aggregate:
  handler: db