Tutorial
investigraph tries to automate as much functionality as possible (scheduling and executing workflows, monitoring, configuration, ...) with the help of prefect.io.
The only thing you have to manage yourself is the dataset configuration, which, in the simplest scenario, is just a YAML
file that contains a bit of metadata and pipeline instructions.
The following tutorial is a simple setup on your local machine and only requires a recent python >= 3.10.
1. Installation
It is highly recommended to use a python virtual environment for installation.
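For example, to create and activate one (standard Python tooling, nothing investigraph-specific):
python3 -m venv .venv
source .venv/bin/activate
Then install the package: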
pip install investigraph
After completion, verify that investigraph is installed:
investigraph --help
2. Create a dataset definition
Let's start with a simple, publicly available dataset: the Global Database of Humanitarian Organisations (GDHO). It's just a list of, yes, humanitarian organisations.
Metadata
Every dataset needs a unique identifier, which serves as a sub-folder in our block; let's use gdho. We will always reference this dataset by this identifier.
Create a subfolder:
mkdir -p datasets/gdho
Create the configuration file with the editor of your choice. The path to the file (by hardcoded convention) will now be:
datasets/gdho/config.yml
Enter the identifier and a bit of metadata into the file:
name: gdho
title: Global Database of Humanitarian Organisations
publisher:
  name: Humanitarian Outcomes
  url: https://www.humanitarianoutcomes.org
That's enough metadata for now; a lot more is possible, which will be covered in the documentation.
Sources
The most interesting part, of course, is the pipeline definition! We have to provide at least one source:
We need to find out the remote url we want to fetch, so open the landing page of the GDHO dataset.
Clicking on "VIEW ALL DATA" opens this page. And, great! There is a direct link, "DOWNLOAD CSV", which actually returns data in csv format: https://www.humanitarianoutcomes.org/gdho/search/results?format=csv
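If you want to double-check from the command line, peek at the first lines of the response (assuming curl is available):
curl -s "https://www.humanitarianoutcomes.org/gdho/search/results?format=csv" | head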
We can just add this url to our source configuration:
# metadata ...
extract:
  sources:
    - uri: https://www.humanitarianoutcomes.org/gdho/search/results?format=csv
investigraph allows you to interactively inspect the building blocks of your datasets, so let's try:
investigraph inspect ./datasets/gdho/config.yml
This will show an error about no parsing function defined, but we will fix that later.
We can inspect the outcome from our remote source as well:
investigraph inspect ./datasets/gdho/config.yml --extract
Ooops! This shows us a python exception saying something about a utf-8 error. Yeah, we still have that in 2023.
When downloading this csv file manually and opening it in a spreadsheet application, you will notice that it is actually in latin encoding and has one empty row at the top. 🤦 (welcome to real world data)
Under the hood, investigraph uses pandas.read_csv, and there is a pandas option in the source configuration to pass instructions to pandas on how to read this csv. In this case, it would look like this (refer to the pandas documentation for all options):
# metadata ...
extract:
  sources:
    - uri: https://www.humanitarianoutcomes.org/gdho/search/results?format=csv
      pandas:
        read:
          options:
            encoding: latin
            skiprows: 1
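For illustration, the options above translate into a pandas call roughly like this (a sketch; investigraph handles the fetching and invocation internally):
import pandas as pd

url = "https://www.humanitarianoutcomes.org/gdho/search/results?format=csv"
# skip the empty first row and decode the latin-encoded file
df = pd.read_csv(url, encoding="latin", skiprows=1)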
Now investigraph is able to fetch and parse the csv source:
investigraph inspect ./datasets/gdho/config.yml --extract
Transform data
This is the core functionality of the whole thing: transforming extracted source data into the followthemoney model.
The easiest way is to define a mapping of csv columns to ftm properties, as described here.
The format in investigraph's config.yml aligns with the ftm mapping spec, so you could use any existing mapping here as well.
For the gdho dataset, we want to create Organization entities and map the name column to the name property.
Add this mapping spec to the config.yml (the csv column containing the name is called Name, so this is a no-brainer):
# metadata ...
# extract ...
transform:
  queries:
    - entities:
        org:
          schema: Organization
          keys:
            - Id
          properties:
            name:
              column: Name
You can inspect the transformation like this:
investigraph inspect datasets/gdho/config.yml --transform
Yay – it is returning some ftm entities
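Each emitted entity uses the followthemoney JSON format and looks roughly like this (illustrative, values shortened):
{"id": "...", "schema": "Organization", "properties": {"name": ["Example Organisation"]}}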
The source data contains a lot more metadata about the organizations. Refer to the ftm mapping documentation on how to map it. Let's add the organization's website to the properties key:
# metadata ...
# extract ...
transform:
  queries:
    - entities:
        org:
          schema: Organization
          keys:
            - Id
          properties:
            name:
              column: Name
            website:
              column: Website
Inspect again, and the entities now have the website property.
The complete config.yml
Adding a bit more metadata and property mappings:
name: gdho
title: Global Database of Humanitarian Organisations
prefix: gdho
summary: |
  GDHO is a global compendium of organisations that provide aid in humanitarian
  crises. The database includes basic organisational and operational
  information on these humanitarian providers, which include international
  non-governmental organisations (grouped by federation), national NGOs that
  deliver aid within their own borders, UN humanitarian agencies, and the
  International Red Cross and Red Crescent Movement.
publisher:
  name: Humanitarian Outcomes
  description: |
    Humanitarian Outcomes is a team of specialist consultants providing
    research and policy advice for humanitarian aid agencies and donor
    governments.
  url: https://www.humanitarianoutcomes.org
extract:
  sources:
    - uri: https://www.humanitarianoutcomes.org/gdho/search/results?format=csv
      pandas:
        read:
          options:
            encoding: latin
            skiprows: 1
transform:
  queries:
    - entities:
        org:
          schema: Organization
          key_literal: gdho
          keys:
            - Id
          properties:
            name:
              column: Name
            weakAlias:
              column: Abbreviated name
            legalForm:
              column: Type
            website:
              column: Website
            country:
              column: HQ location
            incorporationDate:
              column: Year founded
            dissolutionDate:
              column: Year closed
            sector:
              columns:
                - Sector
                - Religious or secular
                - Religion
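With the full mapping in place, inspect the transformation once more to verify everything still parses:
investigraph inspect datasets/gdho/config.yml --transform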
3. Run the pipeline
To actually run the pipeline within the investigraph framework (which is based on prefect.io), execute a flow run:
investigraph run -c datasets/gdho/config.yml
Voilà, you just transformed the whole gdho database into ftm entities! You may notice that this execution created a new subfolder in the current working directory named data/gdho, where you find the data output of this process.
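To see what was written into that folder (the exact file names depend on the load stage):
ls data/gdho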
The prefect ui
Start the local prefect server:
prefect server start
In another terminal window, start an agent:
prefect agent start -q default
View the dashboard at http://127.0.0.1:4200
There you will already see our recent flow run for the gdho dataset.
To be able to run flows from within the ui, we first need to create (and apply) a deployment:
prefect deployment build investigraph.pipeline:run -n investigraph-local -a
Now, you can see the local deployment in the Deployments tab in the flow view.
You can click on the deployment, and then click on "Run >" at the upper right corner of the deployment view.
In the options, insert gdho as the dataset and ./datasets/gdho/config.yml as the value for the config. Then click "Run" and watch the magic happen.
Optional: dataset configuration discovery
We use prefect blocks to store dataset configurations. Using blocks allows investigraph to discover dataset configuration (and even parsing code) anywhere on the cloud, but let's start locally for now.
Register your local datasets folder as a LocalFileSystem block in prefect:
investigraph add-block -b local-file-system/datasets -u ./datasets
From now on, you can reference this block storage by its name local-file-system/datasets, e.g. when running the pipeline:
investigraph run -d gdho -b local-file-system/datasets
Or reference this block when triggering a flow run via the prefect ui (no need to put in a config path anymore).
Of course, these blocks can be created via the prefect ui as well: http://127.0.0.1:4200/blocks
Github block
investigraph maintains an example github repository to use as a block to fetch the dataset configs remotely. Create a github block via the prefect ui or via command line:
investigraph add-block -b github/investigraph-datasets -u https://github.com/investigativedata/investigraph-datasets.git
Now, you can use this block when running flows (via the ui) or command line:
investigraph run -d gdho -b github/investigraph-datasets
Optional: use python code to transform data
Instead of writing the ftm mapping in the config.yml, which can be a bit limiting for advanced use cases, you can write arbitrary python code. The code needs to live anywhere relative to the config.yml, e.g. next to it in a file transform.py. In it, write your own transform (or extract, load) function.
To transform the records within python and achieve the same result for the gdho dataset, an example script would look like this:
def handle(ctx, record, ix):
    proxy = ctx.make_proxy("Organization")
    proxy.id = record.pop("Id")
    proxy.add("name", record.pop("Name"))
    # add more property data ...
    yield proxy
Now, tell the transform key in the config.yml to use this python file instead of the defined mapping:
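A minimal sketch, assuming the transform stage accepts a handler key pointing to the function (check the investigraph documentation for the exact key name):
# metadata ...
# extract ...
transform:
  handler: ./transform.py:handle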
After that, test the pipeline again:
investigraph inspect ./datasets/gdho/config.yml --transform
Conclusion
We have shown how to extract and transform a data source without the need to write any python code, just with yaml specifications. Head on to the documentation to dive deeper into investigraph.