Overview
investigraph is a framework for building FollowTheMoney datasets.
Because investigraph can be thought of as an ETL process for FollowTheMoney data, each dataset pipeline roughly follows the three steps of such a process: extract, transform, load.
The documentation in this section assumes you have already worked through the tutorial.
Most of a pipeline's runtime behaviour is configured per dataset, through arguments passed to an individual run of the pipeline, or both.
Configuration
Each dataset pipeline is defined in a YAML file, config.yml. Read more about config.yml.
Stages
Seed
This optional stage runs before the extract stage and programmatically configures the Sources that are passed to it. It can expand glob patterns over remote URIs or run a script.
This stage is configured via the optional seed key within the config.yml.
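To make the idea of seeding concrete, here is a minimal sketch of expanding a glob pattern into a list of sources. It is not investigraph's implementation: the listing, the `seed_sources` helper and the `{"uri": ...}` source shape are all hypothetical, and a real remote location would have to be listed over the network first.

```python
from fnmatch import fnmatchcase

# Hypothetical listing of files available at a remote location.
REMOTE_LISTING = [
    "https://example.org/data/companies_2022.csv",
    "https://example.org/data/companies_2023.csv",
    "https://example.org/data/readme.txt",
]


def seed_sources(pattern: str, listing: list[str]) -> list[dict[str, str]]:
    """Return one source description per URI that matches the glob pattern."""
    return [{"uri": uri} for uri in sorted(listing) if fnmatchcase(uri, pattern)]


# Only the two CSV files match; the readme is skipped.
sources = seed_sources("https://example.org/data/*.csv", REMOTE_LISTING)
```

Each resulting source would then be handed to the extract stage, one at a time.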
Extract
The first step of a pipeline fetches one or more data sources and extracts data records from them; these records are eventually passed to the transform stage.
This stage is configured via the extract key within the config.yml.
The Records that this stage creates are passed to the next stage, transform.
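As an illustration of what "extracting records" means, the following sketch turns a small CSV source into a stream of records using only the standard library. The `extract_records` helper and the sample data are hypothetical; investigraph's own extract logic is driven by the config and is not shown here.

```python
import csv
import io
from typing import Iterator

# Hypothetical source content; a real source would be fetched from a URI.
CSV_SOURCE = "name,country\nJane Doe,de\nJohn Smith,gb\n"


def extract_records(text: str) -> Iterator[dict[str, str]]:
    """Yield one record (a plain dict keyed by column name) per CSV row."""
    yield from csv.DictReader(io.StringIO(text))


records = list(extract_records(CSV_SOURCE))
```

Yielding records lazily keeps memory use flat for large sources, since each record can be transformed as soon as it is read.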
Transform
This stage transforms the records generated by the previous extract stage into FollowTheMoney entities. It can either apply a declarative mapping, which requires no coding skills, or execute a Python script.
This stage is configured via the transform key within the config.yml.
The Entities that this stage creates are passed to the next stage, load.
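The sketch below shows the kind of work a transform script does: it maps one record to an entity in the JSON shape FollowTheMoney uses (an `id`, a `schema`, and `properties` holding lists of values). It deliberately uses plain dicts rather than the followthemoney library, and the `make_id` and `transform_record` helpers, including the hashing scheme, are assumptions made for illustration only.

```python
import hashlib


def make_id(*parts: str) -> str:
    """Derive a stable id from key values (hypothetical scheme)."""
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()


def transform_record(record: dict[str, str]) -> dict:
    """Map one record to a Person entity in FollowTheMoney's JSON shape."""
    return {
        "id": make_id(record["name"], record["country"]),
        "schema": "Person",
        # Property values are always lists, even for a single value.
        "properties": {
            "name": [record["name"]],
            "country": [record["country"]],
        },
    }


entity = transform_record({"name": "Jane Doe", "country": "de"})
```

Deriving the id from the record's key values means the same input always yields the same entity id, which is what lets a later stage aggregate entities produced from different records.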
Load
This stage writes the Entities created in the previous transform stage to a store for aggregation. The store can be a persistent backend such as SQL or kvrocks, or an in-memory store, which is the default.
This stage is configured via the optional load key within the config.yml.
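Aggregation here means that entities sharing an id are combined into one. A toy in-memory store, sketched under the assumption that merging is a simple union of property values, could look like this. The `MemoryStore` class is hypothetical and much simpler than a real store, which would also reconcile differing schemas instead of keeping the last one seen.

```python
from collections import defaultdict


class MemoryStore:
    """Toy in-memory store that merges entities sharing the same id."""

    def __init__(self) -> None:
        self._schemas: dict[str, str] = {}
        self._props: dict[str, dict[str, set[str]]] = defaultdict(
            lambda: defaultdict(set)
        )

    def put(self, entity: dict) -> None:
        """Add an entity, merging its property values into any existing ones."""
        self._schemas[entity["id"]] = entity["schema"]  # last schema wins
        for prop, values in entity["properties"].items():
            self._props[entity["id"]][prop].update(values)

    def entities(self) -> list[dict]:
        """Return the aggregated entities, sorted by id for stable output."""
        return [
            {
                "id": id_,
                "schema": self._schemas[id_],
                "properties": {
                    prop: sorted(values)
                    for prop, values in sorted(self._props[id_].items())
                },
            }
            for id_ in sorted(self._schemas)
        ]


store = MemoryStore()
store.put({"id": "a1", "schema": "Person", "properties": {"name": ["Jane Doe"]}})
store.put(
    {"id": "a1", "schema": "Person", "properties": {"name": ["J. Doe"], "country": ["de"]}}
)
merged = store.entities()
```

After both calls the store holds a single entity `a1` carrying both names and the country.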
Export
This optional stage at the end of a dataset pipeline exports the Entities from the store to a JSON file. By default (when the in-memory store is used in the previous load stage), the data is exported to a local output directory:
- Entities: ./data/<dataset>/entities.ftm.json
- Dataset index and stats: ./data/<dataset>/index.json
This stage is configured via the optional export key within the config.yml.
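The output layout described above can be sketched as follows. This is an illustration rather than investigraph's exporter: the `export_dataset` helper is hypothetical, writing one JSON object per line to entities.ftm.json is an assumption about the file format, and the contents of index.json here (a name and an entity count) are a placeholder for the real index and stats. The example writes into a temporary directory instead of ./data.

```python
import json
import tempfile
from pathlib import Path


def export_dataset(base: Path, dataset: str, entities: list[dict]) -> Path:
    """Write <base>/<dataset>/entities.ftm.json and index.json; return the directory."""
    out_dir = base / dataset
    out_dir.mkdir(parents=True, exist_ok=True)

    # Assumption: one JSON-encoded entity per line.
    with open(out_dir / "entities.ftm.json", "w", encoding="utf-8") as fh:
        for entity in entities:
            fh.write(json.dumps(entity, sort_keys=True) + "\n")

    # Placeholder index; the real file carries richer metadata and stats.
    index = {"name": dataset, "entity_count": len(entities)}
    (out_dir / "index.json").write_text(json.dumps(index, indent=2), encoding="utf-8")
    return out_dir


entities = [{"id": "a1", "schema": "Person", "properties": {"name": ["Jane Doe"]}}]
with tempfile.TemporaryDirectory() as tmp:
    out_dir = export_dataset(Path(tmp) / "data", "my_dataset", entities)
    exported_files = sorted(p.name for p in out_dir.iterdir())
    first_line = (out_dir / "entities.ftm.json").read_text(encoding="utf-8").splitlines()[0]
```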