investigraph tries to automate as much functionality (scheduling and executing workflows, monitoring, configuration, ...) as possible with the help of prefect.io.
The only thing you have to manage yourself is the dataset configuration, which, in the simplest scenario, is just a
YAML file that contains a bit of metadata and pipeline instructions.
The following tutorial walks through a simple setup on your local machine and only requires a recent Python (>= 3.10).
It is highly recommended to use a Python virtual environment for installation.
pip install investigraph
After completion, verify that investigraph is installed:
2. Create a dataset definition
Let's start with a simple, publicly available dataset: the Global Database of Humanitarian Organisations. It's just a list of, yes, humanitarian organisations.
Every dataset needs a unique identifier, which is also the name of its sub-folder in our block. Let's use
gdho. We will always reference this dataset by this identifier.
Create a subfolder:
mkdir -p datasets/gdho
Create the configuration file with the editor of your choice. By hardcoded convention, the path to the file is datasets/gdho/config.yml.
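The folder layout used in this tutorial can also be expressed in Python. The following is a minimal sketch, not part of investigraph: the helper names `config_path` and `create_dataset_folder` are hypothetical, and the sketch only assumes the `datasets/<identifier>/config.yml` layout used throughout this tutorial.

```python
from pathlib import Path
import tempfile


def config_path(base: Path, dataset: str) -> Path:
    """Return <base>/datasets/<dataset>/config.yml, the layout used in this tutorial."""
    return base / "datasets" / dataset / "config.yml"


def create_dataset_folder(base: Path, dataset: str) -> Path:
    """Create the dataset sub-folder (like `mkdir -p`) and an empty config.yml."""
    path = config_path(base, dataset)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.touch(exist_ok=True)
    return path


# Demonstrate in a throwaway directory so nothing in the working tree changes.
with tempfile.TemporaryDirectory() as tmp:
    created = create_dataset_folder(Path(tmp), "gdho")
    relative = created.relative_to(tmp).as_posix()
```

Here `relative` holds the path of the new file relative to the temporary base directory, which mirrors the layout created with mkdir above.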
Enter the identifier and a bit of metadata into the file:
That's enough metadata for now. Much more metadata is possible, which is covered in the documentation.
The most interesting part, of course, is the pipeline definition! We have to provide at least one source:
We need to find out the remote url we want to fetch. So, open the landing page of the GDHO dataset.
Clicking on "VIEW ALL DATA" opens this page. And, great! There is a direct "DOWNLOAD CSV" link which actually returns data in csv format: https://www.humanitarianoutcomes.org/gdho/search/results?format=csv
We can just add this url to our source configuration:
investigraph allows you to interactively inspect the building blocks of your datasets, so let's try:
investigraph inspect ./datasets/gdho/config.yml
This will show an error saying that no parsing function is defined, but we will fix that later.
We can inspect the outcome from our remote source as well:
investigraph inspect ./datasets/gdho/config.yml --extract
Oops! This shows a Python exception saying something about a
utf-8 error. Yeah, we still have that in 2023.
If you download this csv file manually and open it in a spreadsheet application, you will notice that it is in
latin encoding and has one empty row at the top. 🤦 (Welcome to real-world data.)
Under the hood, investigraph uses pandas.read_csv, and the source configuration has a
pandas option for passing instructions to pandas on how to read this csv. In this case, it would look like this (refer to the
pandas documentation for all options):
Now investigraph is able to fetch and parse the csv source:
investigraph inspect ./datasets/gdho/config.yml --extract
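To see why both adjustments (encoding and skipping the first row) are needed, here is a small standard-library sketch. It does not use investigraph or pandas, and the sample bytes are made up for illustration; they only imitate a latin-encoded file with an empty first row.

```python
import csv

# Hypothetical sample standing in for the downloaded file: an empty first row,
# a header row, and one record with non-ASCII characters, encoded as latin-1.
raw = "\nId,Name\n1,Société Exemple\n".encode("latin-1")

# Decoding as utf-8 fails, which mirrors the exception seen earlier.
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Decoding as latin-1 works; then drop the first (empty) row before parsing,
# so that the header row is the first line the csv reader sees.
text = raw.decode("latin-1")
lines = text.splitlines()[1:]
rows = list(csv.DictReader(lines))
```

With pandas, the same two adjustments correspond to the `encoding` and `skiprows` arguments of `pandas.read_csv`.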
This is the core functionality of the whole thing: transforming extracted source data into the followthemoney model.
The easiest way is to define a mapping of csv columns to ftm properties, as described in the ftm mapping documentation.
The format in investigraph's
config.yml aligns with the ftm mapping spec, so you could use any existing mapping here as well.
For the gdho dataset, we want to create Organization entities and map the name column to the name property.
Add this mapping spec to the
config.yml (the csv column with the name is called
Name, so this is a no-brainer):
You can inspect the transformation like this:
investigraph inspect datasets/gdho/config.yml --transform
Yay, it is returning some ftm entities.
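To make the idea of a column-to-property mapping more concrete, here is a simplified plain-Python sketch. It is not how followthemoney or investigraph implement mappings: `MAPPING`, `KEY_COLUMNS` and `map_record` are hypothetical names, the `Id` key column is an assumption about the source data, and the sha1-based id is just one plausible way to derive a stable identifier.

```python
import hashlib

# Hypothetical, simplified mapping: ftm property name -> csv column name.
MAPPING = {"name": "Name"}
# Assumption: an "Id" column uniquely identifies each row.
KEY_COLUMNS = ["Id"]


def map_record(record: dict, schema: str = "Organization") -> dict:
    """Turn one csv record into an entity-like dict with multi-valued properties."""
    key = "|".join(str(record.get(col, "")) for col in KEY_COLUMNS)
    entity_id = hashlib.sha1(key.encode("utf-8")).hexdigest()
    properties = {}
    for prop, column in MAPPING.items():
        value = (record.get(column) or "").strip()
        if value:  # skip empty cells instead of emitting empty values
            properties[prop] = [value]
    return {"id": entity_id, "schema": schema, "properties": properties}


entity = map_record({"Id": "1", "Name": " Example Relief Org "})
```

Each property holds a list of values, since a property on an ftm entity can carry more than one value.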
The source data contains a lot more metadata about the organizations. Refer to the ftm mapping documentation on how to map data. Let's add the organization's website to the mapping as well.
Inspect again, and the entities now have the website property.
The complete config.yml
Adding in a bit more metadata and property mappings:
title: Global Database of Humanitarian Organisations
GDHO is a global compendium of organisations that provide aid in humanitarian
crises. The database includes basic organisational and operational
information on these humanitarian providers, which include international
non-governmental organisations (grouped by federation), national NGOs that
deliver aid within their own borders, UN humanitarian agencies, and the
International Red Cross and Red Crescent Movement.
name: Humanitarian Outcomes
Humanitarian Outcomes is a team of specialist consultants providing
research and policy advice for humanitarian aid agencies and donor
- uri: https://www.humanitarianoutcomes.org/gdho/search/results?format=csv
column: Abbreviated name
column: HQ location
column: Year founded
column: Year closed
- Religious or secular
3. Run the pipeline
To actually run the pipeline within the investigraph framework (which is based on prefect.io), execute a flow run:
investigraph run -c datasets/gdho/config.yml
Voilà, you just transformed the whole gdho database into ftm entities! You may notice that this run created a new subfolder in the current working directory named
data/gdho, where you will find the data output of this process.
The prefect ui
Start the local prefect server:
prefect server start
In another terminal window, start an agent:
prefect agent start -q default
View the dashboard at http://127.0.0.1:4200
There you will already see our recent flow run for the gdho dataset.
To be able to run flows from within the ui, we first need to create (and apply) a deployment:
prefect deployment build investigraph.pipeline:run -n investigraph-local -a
Now, you can see the local deployment in the Deployments tab in the flow view.
You can click on the deployment, and then click on "Run >" at the upper right corner of the deployment view.
In the options, insert
gdho as the dataset and
./datasets/gdho/config.yml as the value for the config. Then click "Run" and watch the magic happen.
Optional: dataset configuration discovery
We use prefect blocks to store dataset configurations. Using blocks allows investigraph to discover dataset configurations (and even parsing code) anywhere in the cloud, but let's start locally for now.
Register your local datasets folder as a
LocalFileSystem block in prefect:
investigraph add-block -b local-file-system/datasets -u ./datasets
From now on, you can reference this block storage with its name
local-file-system/datasets, e.g. when running the pipeline:
investigraph run -d gdho -b local-file-system/datasets
Or reference this block when triggering a flow run via the prefect ui (there is then no need to enter a config path anymore).
Of course, these blocks can be created via the prefect ui as well: http://127.0.0.1:4200/blocks
investigraph maintains an example GitHub repository that can be used as a block to fetch dataset configs remotely. Create a GitHub block via the prefect ui or via the command line:
investigraph add-block -b github/investigraph-datasets -u https://github.com/investigativedata/investigraph-datasets.git
Now, you can use this block when running flows via the ui or the command line:
investigraph run -d gdho -b github/investigraph-datasets
Optional: use python code to transform data
Instead of writing the ftm mapping in the
config.yml, which can be a bit limiting for advanced use cases, you can write arbitrary Python code. The code can live anywhere relative to the
config.yml, e.g. next to it in a file
transform.py. In it, write your own transform (or extract, load) function.
To transform the records within python and achieve the same result for the
gdho dataset, an example script would look like this:
Now, tell the
transform key in the
config.yml to use this python file instead of the defined mapping:
Afterwards, test the pipeline again:
investigraph inspect ./datasets/gdho/config.yml --transform
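As a rough illustration of the kind of logic that is easier to write in Python than in a declarative mapping, consider the sketch below. It is deliberately independent of investigraph: the function name `transform_record`, its signature and the plain-dict output are hypothetical, so take the function signature investigraph actually expects from its documentation. The column names come from the gdho source, while the choice of target properties (`weakAlias`, `incorporationDate`) is an assumption.

```python
from typing import Any, Iterator


def transform_record(record: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Yield one Organization-like dict per record, skipping rows without a name."""
    name = (record.get("Name") or "").strip()
    if not name:
        return  # nothing to emit for unnamed rows
    properties: dict[str, list[str]] = {"name": [name]}

    alias = (record.get("Abbreviated name") or "").strip()
    if alias and alias != name:
        properties["weakAlias"] = [alias]

    # Custom logic that a plain column mapping cannot easily express:
    # only keep the founding year if it looks like a four-digit year.
    year = str(record.get("Year founded") or "").strip()
    if len(year) == 4 and year.isdigit():
        properties["incorporationDate"] = [year]

    yield {"schema": "Organization", "properties": properties}


entities = list(
    transform_record(
        {"Name": "Example Relief Org", "Abbreviated name": "ERO", "Year founded": "n/a"}
    )
)
skipped = list(transform_record({"Name": "  "}))
```

Writing the function as a generator allows one record to produce zero, one or several entities.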
We have shown how to extract a data source without writing any Python code, just with YAML specifications. Head over to the documentation to dive deeper into investigraph.