Investigraph

Research and implementation of an ETL process for a curated and up-to-date public and open-source data catalog of frequently used datasets in investigative journalism.

Head over to the tutorial

About

investigraph is an ETL framework that allows research teams to build their own data catalog as easily and reproducibly as possible. The investigraph framework provides logic for extracting, transforming and loading any data source into followthemoney entities.

For most common data source formats, this process is possible without programming knowledge, by means of a simple YAML specification interface. However, if a specific dataset cannot be parsed with the built-in logic, a developer can plug in custom Python scripts at specific places within the pipeline to handle even unusual edge cases in data processing.
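The idea of swapping a built-in stage for custom code can be sketched in plain Python. This is only an illustration of the plug-in pattern, not investigraph's actual API: the names `default_transform`, `custom_transform` and `run_pipeline`, as well as the dictionary layout of the entities, are hypothetical.

```python
from typing import Callable, Iterable, Iterator

Record = dict[str, str]
Entity = dict[str, object]
Transform = Callable[[Record], Iterator[Entity]]


def default_transform(record: Record) -> Iterator[Entity]:
    """Stand-in for built-in logic: map one record to one entity as-is."""
    yield {
        "schema": "Organization",
        "id": record["id"],
        "properties": {"name": [record["name"]]},
    }


def custom_transform(record: Record) -> Iterator[Entity]:
    """Hypothetical custom handler for an edge case:
    several names are packed into one column, separated by ';'."""
    names = [n.strip() for n in record["name"].split(";") if n.strip()]
    yield {
        "schema": "Organization",
        "id": record["id"],
        "properties": {"name": names},
    }


def run_pipeline(
    records: Iterable[Record], transform: Transform = default_transform
) -> list[Entity]:
    """Apply the chosen transform stage to every extracted record."""
    entities: list[Entity] = []
    for record in records:
        entities.extend(transform(record))
    return entities


records = [{"id": "1", "name": "ACME Ltd; ACME Limited"}]
builtin_result = run_pipeline(records)
custom_result = run_pipeline(records, transform=custom_transform)
```

With the default stage the name column is kept as a single value, while the custom stage splits it into two names. The rest of the pipeline stays unchanged, which is the point of plugging code in at one specific stage.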

Features

Cached data fetching based on HEAD requests and their response headers
Data extraction based on pandas (runpandarun)
Data patching via datapatch
Transforming data records into followthemoney entities via mappings
Loading result data into a wide range of targets, including cloud storage (via fsspec) or SQL databases (via followthemoney-store)
"Bring your own code" and plug it into the right stage if the built-in logic doesn't fit your use case
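As a rough illustration of cached fetching based on HEAD response headers, the following sketch derives a cache key from a URL and the validators a server typically returns. It is an assumption about how such caching can work in general, not a description of investigraph's internals; the function name `make_cache_key` and the choice of headers are hypothetical.

```python
import hashlib
from typing import Mapping, Optional

# Response headers that usually change when the remote resource changes.
CACHE_HEADERS = ("etag", "last-modified")


def make_cache_key(url: str, headers: Mapping[str, str]) -> Optional[str]:
    """Derive a cache key from a URL and the headers of a HEAD response.

    Returns None if no usable header is present, which signals that the
    source should be fetched again instead of being served from cache.
    """
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    parts = [
        f"{name}={normalized[name]}" for name in CACHE_HEADERS if name in normalized
    ]
    if not parts:
        return None
    raw = "|".join([url, *parts])
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


# In practice the headers would come from a HEAD request against the source.
url = "https://example.org/data.csv"
key = make_cache_key(
    url, {"ETag": '"abc"', "Last-Modified": "Mon, 01 Jan 2024 00:00:00 GMT"}
)
```

If the key for a source matches the key stored from the previous run, the download can be skipped. A changed `ETag` or `Last-Modified` value produces a new key and triggers a fresh fetch.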

Value for investigative research teams

Standardized process to convert different data sets into a uniform and thus comparable format
Control of this process for non-technical people
Creation of your own (internal) data catalog
Regular, automatic updates of the data
A growing community that makes more and more data sets accessible
Access to a public (open source) data catalog operated by investigativedata.io

GitHub repositories

investigraph-etl - ETL-style pipeline framework for followthemoney data based on prefect.io
investigraph-eu - Catalog of European datasets powered by investigraph
runpandarun - A simple interface written in Python for reproducible I/O workflows around tabular data via pandas
ftmq - An attempt at a followthemoney query DSL
investigraph-datasets - Example dataset configurations
investigraph-site - Landing page for investigraph (next.js app)
investigraph-api - Public API instance to use as a test playground
ftmstore-fastapi - Lightweight API that exposes an ftm store to a public endpoint

Supported by

Media Tech Lab Bayern batch #3