Overview

investigraph is a framework for building FollowTheMoney datasets.

Because investigraph is essentially an ETL process for FollowTheMoney data, each dataset pipeline roughly follows the three steps of such a process: extract, transform, load.
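The three steps can be pictured as chained generators. The following is a minimal sketch in plain Python, not investigraph's actual API: the function names (extract, transform, load, run_pipeline) and the simplified entity dictionaries are assumptions made for illustration only.

```python
from typing import Any, Dict, Iterable, Iterator

Record = Dict[str, Any]
Entity = Dict[str, Any]


def extract(sources: Iterable[Iterable[Record]]) -> Iterator[Record]:
    """Yield every record from every source (hypothetical helper)."""
    for source in sources:
        yield from source


def transform(records: Iterable[Record]) -> Iterator[Entity]:
    """Turn each record into a simplified, FollowTheMoney-like entity dict."""
    for record in records:
        yield {
            "id": f"person-{record['id']}",
            "schema": "Person",
            "properties": {"name": [record["name"]]},
        }


def load(entities: Iterable[Entity]) -> Dict[str, Entity]:
    """Collect entities into an in-memory store keyed by entity id."""
    return {entity["id"]: entity for entity in entities}


def run_pipeline(sources: Iterable[Iterable[Record]]) -> Dict[str, Entity]:
    """Chain the three steps: extract, then transform, then load."""
    return load(transform(extract(sources)))
```

Because each step consumes the iterator produced by the previous one, records flow through the sketch one at a time rather than being materialised between steps.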

The documentation in this section assumes you have already worked through the tutorial.

Most of a pipeline's runtime behaviour is configured per dataset, through arguments passed to an individual run of the pipeline, or both.

Configuration

The pipeline for each dataset is defined in a YAML file. Read more about config.yml.

Stages

Seed

This optional stage runs before the extract stage and programmatically configures the Sources that are passed to it. Sources can be derived from glob patterns against remote URIs or produced by a script.
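To illustrate the glob idea, here is a minimal sketch that expands a pattern into a list of source URIs. It is not investigraph's implementation: the function name seed_sources is hypothetical, and the listing of available URIs is passed in directly instead of being fetched from a remote location.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def seed_sources(pattern: str, available_uris: Iterable[str]) -> List[str]:
    """Return the URIs that match a glob pattern, in sorted order.

    fnmatchcase is used so matching is case-sensitive on every platform.
    """
    return sorted(uri for uri in available_uris if fnmatchcase(uri, pattern))
```

For example, given a listing that contains two CSV files and a README under the same prefix, a pattern ending in *.csv would select only the two CSV files as sources for the extract stage.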

This stage is configured via the optional seed key within the config.yml.

Seed stage documentation

Extract

In the first step of a pipeline, one or more data sources are fetched and data records are extracted from them, to be passed on to the transform stage.
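As a rough picture of what extracting records means, the sketch below reads CSV text and yields one dictionary per row. It assumes a CSV source with a header row; the function name extract_records is hypothetical and does not reflect investigraph's own extract handlers.

```python
import csv
import io
from typing import Dict, Iterator


def extract_records(csv_text: str) -> Iterator[Dict[str, str]]:
    """Yield one record (a column-name to value dict) per CSV data row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        yield dict(row)
```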

This stage is configured via the extract key within the config.yml.

Extract stage documentation

The Records that this stage creates are passed to the next stage, transform.

Transform

This stage transforms the records generated by the previous extract stage into FollowTheMoney entities. It can either apply a predefined mapping, which requires no coding skills, or execute a Python script.
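The mapping approach can be sketched as a small data structure that says which record column feeds which entity property. The mapping format below is invented for illustration and is not the syntax investigraph uses in config.yml; MAPPING and apply_mapping are hypothetical names, and the entity is a simplified dictionary.

```python
from typing import Any, Dict

# Hypothetical mapping: entity schema, the record column used as key,
# and which record column fills which entity property.
MAPPING: Dict[str, Any] = {
    "schema": "Organization",
    "key": "Id",
    "properties": {"name": "Name", "country": "Country"},
}


def apply_mapping(record: Dict[str, Any], mapping: Dict[str, Any]) -> Dict[str, Any]:
    """Build a simplified entity dict from one record using the mapping."""
    properties = {}
    for prop, column in mapping["properties"].items():
        value = record.get(column)
        if value:  # skip missing or empty values
            properties[prop] = [str(value)]
    return {
        "id": f"{mapping['schema'].lower()}-{record[mapping['key']]}",
        "schema": mapping["schema"],
        "properties": properties,
    }
```

Keeping the mapping as data rather than code is what makes this route usable without programming; a script is the fallback when the record structure needs logic that a mapping cannot express.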

This stage is configured via the transform key within the config.yml.

Transform stage documentation

The Entities that this stage creates are passed to the next stage, load.

Load

This stage writes the Entities created in the previous transform stage to a store for aggregation. The store can be a persistent backend such as SQL or Kvrocks, or an in-memory store, which is the default.
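A minimal sketch of an in-memory store follows, under the assumption that aggregation means merging entities that share the same id into one entity with the union of their property values. The MemoryStore class is hypothetical and is not the store implementation investigraph ships.

```python
from typing import Any, Dict, Iterator

Entity = Dict[str, Any]


class MemoryStore:
    """Hypothetical in-memory store that merges entities by id."""

    def __init__(self) -> None:
        self._entities: Dict[str, Entity] = {}

    def put(self, entity: Entity) -> None:
        """Add an entity, merging its properties into any stored one."""
        stored = self._entities.setdefault(
            entity["id"],
            {"id": entity["id"], "schema": entity["schema"], "properties": {}},
        )
        for prop, values in entity["properties"].items():
            merged = stored["properties"].setdefault(prop, [])
            for value in values:
                if value not in merged:  # keep each value only once
                    merged.append(value)

    def __iter__(self) -> Iterator[Entity]:
        return iter(self._entities.values())
```

Writing two partial entities with the same id therefore yields a single aggregated entity when the store is iterated.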

This stage is configured via the optional load key within the config.yml.

Load stage documentation

Export

This optional stage at the end of a dataset pipeline exports the Entities from the store to a JSON file. By default (when the previous load stage uses the in-memory store), the data is exported to a local output directory:

Entities: ./data/<dataset>/entities.ftm.json
Dataset index and stats: ./data/<dataset>/index.json
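The layout above can be reproduced with a short sketch. Several details here are assumptions rather than documented behaviour: that the entities file holds one JSON object per line, and the field names used in the index (name, entity_count, schemata). The function name export_dataset is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Any, Dict, Iterable


def export_dataset(dataset: str, entities: Iterable[Dict[str, Any]], base: Path) -> Path:
    """Write entities and a small index file to <base>/<dataset>/."""
    out_dir = base / dataset
    out_dir.mkdir(parents=True, exist_ok=True)

    schemata: Counter = Counter()
    # Assumption: one JSON-serialised entity per line.
    with open(out_dir / "entities.ftm.json", "w", encoding="utf-8") as fh:
        for entity in entities:
            fh.write(json.dumps(entity, sort_keys=True) + "\n")
            schemata[entity["schema"]] += 1

    # Assumption: illustrative index fields, not the real index.json schema.
    index = {
        "name": dataset,
        "entity_count": sum(schemata.values()),
        "schemata": dict(schemata),
    }
    with open(out_dir / "index.json", "w", encoding="utf-8") as fh:
        json.dump(index, fh, indent=2)
    return out_dir
```

Calling export_dataset with Path("./data") as the base directory would produce the two paths listed above for the given dataset name.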

This stage is configured via the optional export key within the config.yml.

Export stage documentation