Tutorial

investigraph is a framework for extracting data from sources and transforming it into the FollowTheMoney format. This format is used for representing entities (people, companies, organizations) and their relationships.

This tutorial shows you how to build a simple dataset without writing any Python code, using only YAML configuration. You'll need Python 3.11 or higher.

1. Installation

Tip

Installing into a Python virtual environment is highly recommended.

pip install investigraph

After completion, verify that investigraph is installed:

investigraph --help

2. Create a dataset

We'll use The Global Database of Humanitarian Organisations (GDHO) as an example. It is a list of humanitarian organizations worldwide.

Setup

Every dataset needs a unique identifier (name). We'll use gdho for this dataset.

Create a directory and config file:

mkdir -p datasets/gdho

Create datasets/gdho/config.yml with basic metadata:

name: gdho
title: Global Database of Humanitarian Organisations
publisher:
  name: Humanitarian Outcomes
  url: https://www.humanitarianoutcomes.org

Add a data source

Next, specify where to fetch the data from:

# metadata ...
extract:
  sources:
    - uri: <url>

The GDHO website has a "DOWNLOAD CSV" link that provides the data: https://www.humanitarianoutcomes.org/gdho/search/results?format=csv

Add this URL to the config:

# metadata ...
extract:
  sources:
    - uri: https://www.humanitarianoutcomes.org/gdho/search/results?format=csv

Test the extraction with:

investigraph extract -c ./datasets/gdho/config.yml -l 10

This will likely fail with a UTF-8 decoding error: the CSV file is Latin-1 encoded (Python accepts the alias latin) and has an empty row at the top. Fix both problems by passing options through to pandas (investigraph uses pandas.read_csv internally):

# metadata ...
extract:
  sources:
    - uri: https://www.humanitarianoutcomes.org/gdho/search/results?format=csv
      pandas:
        read:
          options:
            encoding: latin
            skiprows: 1

Now extraction should work:

investigraph extract -c ./datasets/gdho/config.yml -l 10
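
To see why both options are needed, here is a minimal sketch using only the Python standard library. It does not call investigraph or pandas; the sample bytes and the `read_rows` helper are made up for illustration and only mimic a Latin-1 file with an empty first row.

```python
import csv

# Hypothetical sample: an empty first row, a header and one record,
# encoded as Latin-1. This is not the real GDHO download.
raw = "\r\nId,Name\r\n1,Médecins Sans Frontières\r\n".encode("latin-1")


def read_rows(data: bytes, encoding: str, skiprows: int = 0) -> list[dict]:
    """Decode the bytes, drop the first `skiprows` lines, parse the rest as CSV."""
    text = data.decode(encoding)
    lines = text.splitlines()[skiprows:]
    return list(csv.DictReader(lines))


# Decoding as UTF-8 fails because the accented characters are single
# Latin-1 bytes, which are not valid UTF-8 sequences.
try:
    read_rows(raw, "utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# With the right encoding and the empty row skipped, the header is found.
rows = read_rows(raw, "latin-1", skiprows=1)
print(utf8_ok)
print(rows)
```

The `encoding` and `skiprows` arguments here play the same role as the two keys under `options` in the config.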

Transform to FollowTheMoney entities

The core step is transforming CSV data into FollowTheMoney entities. Define a mapping from CSV columns to entity properties.

For GDHO, we'll create Organization entities:

# metadata ...
# extract ...
transform:
  queries:
    - entities:
        org:
          schema: Organization
          keys:
            - Id
          properties:
            name:
              column: Name

Test the transformation:

investigraph extract -c datasets/gdho/config.yml -l 10 | investigraph transform -c datasets/gdho/config.yml

This outputs FollowTheMoney entities in JSON format. Add more fields by mapping additional CSV columns:

# metadata ...
# extract ...
transform:
  queries:
    - entities:
        org:
          schema: Organization
          keys:
            - Id
          properties:
            name:
              column: Name
            website:
              column: Website

Complete configuration

Here's the full config with all fields mapped:

name: gdho
title: Global Database of Humanitarian Organisations
prefix: gdho
summary: |
  GDHO is a global compendium of organisations that provide aid in humanitarian
  crises. The database includes basic organisational and operational
  information on these humanitarian providers, which include international
  non-governmental organisations (grouped by federation), national NGOs that
  deliver aid within their own borders, UN humanitarian agencies, and the
  International Red Cross and Red Crescent Movement.
publisher:
  name: Humanitarian Outcomes
  description: |
    Humanitarian Outcomes is a team of specialist consultants providing
    research and policy advice for humanitarian aid agencies and donor
    governments.
  url: https://www.humanitarianoutcomes.org

extract:
  sources:
    - uri: https://www.humanitarianoutcomes.org/gdho/search/results?format=csv
      pandas:
        read:
          options:
            encoding: latin
            skiprows: 1

transform:
  queries:
    - entities:
        org:
          schema: Organization
          key_literal: gdho
          keys:
            - Id
          properties:
            name:
              column: Name
            weakAlias:
              column: Abbreviated name
            legalForm:
              column: Type
            website:
              column: Website
            country:
              column: HQ location
            incorporationDate:
              column: Year founded
            dissolutionDate:
              column: Year closed
            sector:
              columns:
                - Sector
                - Religious or secular
                - Religion

3. Run the pipeline

Execute the full pipeline (extract, transform, load):

investigraph run -c datasets/gdho/config.yml

This creates a data/gdho/ directory with the output files containing FollowTheMoney entities.
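
A quick sanity check on the result is to count entities per schema. The sketch below assumes the entities are serialized as one JSON object per line, which is the usual convention for FollowTheMoney streams; the exact output file name is not stated in this tutorial, so the function takes any iterable of lines. `summarize` and the sample data are hypothetical.

```python
import json
from collections import Counter
from typing import Iterable


def summarize(lines: Iterable[str]) -> Counter:
    """Count entities per schema in a stream of JSON lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        entity = json.loads(line)
        counts[entity["schema"]] += 1
    return counts


# Made-up sample lines in the shape of FollowTheMoney entities.
sample = [
    '{"id": "a1", "schema": "Organization", "properties": {"name": ["Example Aid"]}}',
    "",
    '{"id": "b2", "schema": "Organization", "properties": {"name": ["Sample Relief"]}}',
]
print(summarize(sample))
```

To use it on real output, pass an open file handle for whichever file under data/gdho/ holds the entities, or pipe the output of the transform command into a script that calls `summarize(sys.stdin)`.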

Advanced: Custom Python code

For complex transformations beyond YAML mappings, write custom Python code. Create datasets/gdho/transform.py:

def handle(ctx, record, ix):
    proxy = ctx.make_entity("Organization")
    proxy.id = record.pop("Id")
    proxy.add("name", record.pop("Name"))
    # add more property data ...
    yield proxy

Update config.yml to use the Python handler:

# metadata ...
# extract ...
transform:
  handler: ./transform.py:handle

Run the pipeline:

investigraph run -c ./datasets/gdho/config.yml
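
The handler can also be exercised in isolation before a full run. The sketch below repeats the `handle` function from above and calls it with small test doubles. `FakeContext` and `FakeProxy` are hypothetical stand-ins written for this example, not investigraph classes; they implement only the `make_entity`, `id` and `add` members that the handler uses.

```python
def handle(ctx, record, ix):
    # Same handler as in transform.py above.
    proxy = ctx.make_entity("Organization")
    proxy.id = record.pop("Id")
    proxy.add("name", record.pop("Name"))
    yield proxy


class FakeProxy:
    """Hypothetical test double that records what the handler sets."""

    def __init__(self, schema):
        self.schema = schema
        self.id = None
        self.properties = {}

    def add(self, prop, value):
        self.properties.setdefault(prop, []).append(value)


class FakeContext:
    """Hypothetical test double for the context passed to the handler."""

    def make_entity(self, schema):
        return FakeProxy(schema)


record = {"Id": "42", "Name": "Example Aid", "Website": "https://example.org"}
entities = list(handle(FakeContext(), record, 0))
print(entities[0].id, entities[0].properties)
print(record)  # columns the handler has not consumed yet
```

Since the handler uses `record.pop`, whatever remains in `record` afterwards shows which columns are still unmapped.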

Next Steps

You can extract most data sources using only YAML configuration. For complex transformations, use custom Python code. See the full documentation for more details.