Tutorial
investigraph is a framework for extracting data from sources and transforming it into the FollowTheMoney format. This format is used for representing entities (people, companies, organizations) and their relationships.
This tutorial shows you how to build a simple dataset without writing Python code - just YAML configuration. You'll need Python 3.11 or higher.
1. Installation
Tip: It is highly recommended to install investigraph in a Python virtual environment.
After completion, verify that investigraph is installed:
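One way to run this check from Python itself is sketched below. It assumes the package was installed from PyPI under the distribution name investigraph (for example with pip install investigraph) and uses only the standard library. The helper names are illustrative and not part of investigraph.

```python
from __future__ import annotations

import sys
from importlib import metadata

MIN_PYTHON = (3, 11)  # minimum version stated in this tutorial


def python_ok(minimum: tuple[int, int] = MIN_PYTHON) -> bool:
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info[:2] >= minimum


def installed_version(distribution: str) -> str | None:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


print("Python OK:", python_ok())
# Assumption: the distribution is named "investigraph" on PyPI.
print("investigraph version:", installed_version("investigraph") or "not installed")
```

If the second line reports "not installed", the package is most likely missing from the active virtual environment.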
2. Create a dataset
We'll use The Global Database of Humanitarian Organisations as an example - a list of humanitarian organizations worldwide.
Setup
Every dataset needs a unique identifier (name). We'll use gdho for this dataset.
Create a directory for the dataset, then create datasets/gdho/config.yml inside it with basic metadata:
    name: gdho
    title: Global Database of Humanitarian Organisations
    publisher:
      name: Humanitarian Outcomes
      url: https://www.humanitarianoutcomes.org
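If you prefer to script this setup step, the following sketch creates the directory and writes the metadata shown above. The function name scaffold_dataset is made up for this example, and the demonstration runs in a temporary directory so that it does not touch your project.

```python
from __future__ import annotations

import tempfile
from pathlib import Path

# The same metadata as in the config above.
CONFIG_TEXT = """\
name: gdho
title: Global Database of Humanitarian Organisations
publisher:
  name: Humanitarian Outcomes
  url: https://www.humanitarianoutcomes.org
"""


def scaffold_dataset(base: Path, name: str, config_text: str) -> Path:
    """Create <base>/datasets/<name>/config.yml and return its path."""
    dataset_dir = base / "datasets" / name
    dataset_dir.mkdir(parents=True, exist_ok=True)
    config_path = dataset_dir / "config.yml"
    config_path.write_text(config_text, encoding="utf-8")
    return config_path


# Demonstration only: pass Path.cwd() instead to create the files in your project.
with tempfile.TemporaryDirectory() as tmp:
    path = scaffold_dataset(Path(tmp), "gdho", CONFIG_TEXT)
    print(path.relative_to(tmp))
```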
Add a data source
Next, specify where to fetch the data from.
The GDHO website has a "DOWNLOAD CSV" link that provides the data: https://www.humanitarianoutcomes.org/gdho/search/results?format=csv
Add this URL to the config:
    # metadata ...
    extract:
      sources:
        - uri: https://www.humanitarianoutcomes.org/gdho/search/results?format=csv
Test the extraction with:
    investigraph extract -c datasets/gdho/config.yml -l 10
This will likely fail with a UTF-8 decoding error: the CSV file is encoded as Latin-1 rather than UTF-8, and it has an empty row at the top. Fix both problems by passing options through to pandas (investigraph uses pandas.read_csv internally):
    # metadata ...
    extract:
      sources:
        - uri: https://www.humanitarianoutcomes.org/gdho/search/results?format=csv
          pandas:
            read:
              options:
                encoding: latin
                skiprows: 1
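To see why both options matter, here is a small standard-library sketch that mimics their effect on a made-up sample. The bytes below are not the real GDHO file, and the helper functions are illustrative rather than what pandas or investigraph do internally.

```python
from __future__ import annotations

import csv

# Hypothetical sample: an empty first row, then a header and one record,
# encoded as Latin-1 (each accented letter becomes a single byte).
raw = "\nId,Name\n1,Médecins Sans Frontières\n".encode("latin")


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the bytes are valid UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def read_rows(data: bytes, encoding: str, skiprows: int) -> list[dict[str, str]]:
    """Decode the bytes, drop the leading rows, and parse the rest as CSV."""
    lines = data.decode(encoding).splitlines()[skiprows:]
    return list(csv.DictReader(lines))


print("valid UTF-8:", decodes_as_utf8(raw))
rows = read_rows(raw, encoding="latin", skiprows=1)
print(rows)
```

Without skipping the first row, the empty line would be taken as the header, so the column names Id and Name would not be recognised.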
Run the extraction again; it should now succeed.
Transform to FollowTheMoney entities
The core step is transforming CSV data into FollowTheMoney entities. Define a mapping from CSV columns to entity properties.
For GDHO, we'll create Organization entities:
    # metadata ...
    # extract ...
    transform:
      queries:
        - entities:
            org:
              schema: Organization
              keys:
                - Id
              properties:
                name:
                  column: Name
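Conceptually, the mapping above turns each CSV row into one entity: the keys determine its identifier and each property is filled from a column. The sketch below illustrates that idea in plain Python. It is not investigraph's or FollowTheMoney's implementation, and the way the id is derived here (a SHA-1 hash of the key values) is an assumption made for illustration.

```python
from __future__ import annotations

import hashlib
import json

# The YAML mapping above, written as plain Python data.
MAPPING = {
    "schema": "Organization",
    "keys": ["Id"],
    "properties": {"name": {"column": "Name"}},
}


def map_record(record: dict[str, str], mapping: dict) -> dict:
    """Build an entity-like dict from one CSV record."""
    key_values = [record[key] for key in mapping["keys"]]
    # Assumption: a stable id is derived by hashing the key values.
    entity_id = hashlib.sha1("|".join(key_values).encode("utf-8")).hexdigest()
    properties: dict[str, list[str]] = {}
    for prop, spec in mapping["properties"].items():
        columns = spec.get("columns") or [spec["column"]]
        values = [record[col] for col in columns if record.get(col)]
        if values:
            properties[prop] = values
    return {"id": entity_id, "schema": mapping["schema"], "properties": properties}


# Hypothetical record, not taken from the real dataset.
entity = map_record({"Id": "1", "Name": "Example Aid Org"}, MAPPING)
print(json.dumps(entity, indent=2))
```

Because the id depends only on the key columns, the same source row yields the same entity id on every run.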
Test the transformation:
    investigraph extract -c datasets/gdho/config.yml -l 10 | investigraph transform -c datasets/gdho/config.yml
This outputs FollowTheMoney entities in JSON format. Add more fields by mapping additional CSV columns:
    # metadata ...
    # extract ...
    transform:
      queries:
        - entities:
            org:
              schema: Organization
              keys:
                - Id
              properties:
                name:
                  column: Name
                website:
                  column: Website
Complete configuration
Here's the full config with all fields mapped:
    name: gdho
    title: Global Database of Humanitarian Organisations
    prefix: gdho
    summary: |
      GDHO is a global compendium of organisations that provide aid in humanitarian
      crises. The database includes basic organisational and operational
      information on these humanitarian providers, which include international
      non-governmental organisations (grouped by federation), national NGOs that
      deliver aid within their own borders, UN humanitarian agencies, and the
      International Red Cross and Red Crescent Movement.
    publisher:
      name: Humanitarian Outcomes
      description: |
        Humanitarian Outcomes is a team of specialist consultants providing
        research and policy advice for humanitarian aid agencies and donor
        governments.
      url: https://www.humanitarianoutcomes.org
    extract:
      sources:
        - uri: https://www.humanitarianoutcomes.org/gdho/search/results?format=csv
          pandas:
            read:
              options:
                encoding: latin
                skiprows: 1
    transform:
      queries:
        - entities:
            org:
              schema: Organization
              key_literal: gdho
              keys:
                - Id
              properties:
                name:
                  column: Name
                weakAlias:
                  column: Abbreviated name
                legalForm:
                  column: Type
                website:
                  column: Website
                country:
                  column: HQ location
                incorporationDate:
                  column: Year founded
                dissolutionDate:
                  column: Year closed
                sector:
                  columns:
                    - Sector
                    - Religious or secular
                    - Religion
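The sector property above uses columns (plural) to draw on several CSV columns at once. As a rough illustration of the effect, assuming each non-empty cell becomes one value of the property and duplicates are dropped, the helper below (a made-up name, not an investigraph function) gathers the values from a hypothetical record.

```python
from __future__ import annotations


def collect_values(record: dict[str, str], columns: list[str]) -> list[str]:
    """Gather distinct, non-empty values from several columns, in column order."""
    values: list[str] = []
    for column in columns:
        value = (record.get(column) or "").strip()
        if value and value not in values:
            values.append(value)
    return values


# Hypothetical record; the column names follow the config above.
record = {"Sector": "Health", "Religious or secular": "Secular", "Religion": ""}
sector = collect_values(record, ["Sector", "Religious or secular", "Religion"])
print(sector)
```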
3. Run the pipeline
Execute the full pipeline (extract, transform, load):
This creates a data/gdho/ directory with the output files containing FollowTheMoney entities.
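To inspect the result, a short script can tally the entities by schema. The sketch below assumes the output is newline-delimited JSON with one entity per line, which is how FollowTheMoney tooling commonly exchanges entities. The file name is a placeholder and the sample entities are invented, so check your data/gdho/ directory for the actual file.

```python
from __future__ import annotations

import json
import tempfile
from collections import Counter
from pathlib import Path


def count_schemas(path: Path) -> Counter:
    """Count entities per schema in a newline-delimited JSON file."""
    counts: Counter = Counter()
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            entity = json.loads(line)
            counts[entity.get("schema", "unknown")] += 1
    return counts


# Invented sample entities, written to a temporary file for the demonstration.
sample = [
    {"id": "a", "schema": "Organization", "properties": {"name": ["Example Aid Org"]}},
    {"id": "b", "schema": "Organization", "properties": {"name": ["Another Org"]}},
]
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "entities.json"  # placeholder file name
    sample_path.write_text(
        "\n".join(json.dumps(e) for e in sample) + "\n", encoding="utf-8"
    )
    counts = count_schemas(sample_path)

print(counts)
```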
Advanced: Custom Python code
For complex transformations beyond YAML mappings, write custom Python code. Create datasets/gdho/transform.py:
    def handle(ctx, record, ix):
        proxy = ctx.make_entity("Organization")
        proxy.id = record.pop("Id")
        proxy.add("name", record.pop("Name"))
        # add more property data ...
        yield proxy
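The handler is a generator that is called once per record and yields the entities built from it. To try the logic without running a pipeline, you can call it with stand-in objects. In the sketch below, StubContext and StubEntity are hypothetical test doubles written for this example. They imitate only the two calls the handler uses and are not investigraph classes.

```python
from __future__ import annotations

from collections import defaultdict


class StubEntity:
    """Minimal stand-in for an entity proxy: an id plus multi-valued properties."""

    def __init__(self, schema: str) -> None:
        self.schema = schema
        self.id: str | None = None
        self.properties: dict[str, list[str]] = defaultdict(list)

    def add(self, prop: str, value: str | None) -> None:
        if value:  # skip empty values
            self.properties[prop].append(value)


class StubContext:
    """Minimal stand-in for the context object passed to the handler."""

    def make_entity(self, schema: str) -> StubEntity:
        return StubEntity(schema)


def handle(ctx, record, ix):
    # Same logic as the handler above.
    proxy = ctx.make_entity("Organization")
    proxy.id = record.pop("Id")
    proxy.add("name", record.pop("Name"))
    yield proxy


# Hypothetical records, not taken from the real dataset.
records = [{"Id": "1", "Name": "Example Aid Org"}, {"Id": "2", "Name": "Another Org"}]
entities = [
    entity
    for ix, record in enumerate(records)
    for entity in handle(StubContext(), record, ix)
]
for entity in entities:
    print(entity.id, entity.schema, dict(entity.properties))
```

Note that the handler uses record.pop, so each record dict is modified as it is processed.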
Update config.yml to use the Python handler:
Run the pipeline:
Next Steps
You can extract most data sources using only YAML configuration. For complex transformations, use custom Python code. See the full documentation for more details.