Investigraph
Research and implementation of an ETL process for a curated and up-to-date public and open-source data catalog of frequently used datasets in investigative journalism.
About
investigraph is an ETL framework that allows research teams to build their own data catalog as easily and reproducibly as possible. The investigraph framework provides logic for extracting, transforming, and loading any data source into followthemoney entities.
For the most common data source formats, this process requires no programming knowledge: it is driven by a simple YAML specification. However, if a specific dataset cannot be parsed with the built-in logic, a developer can plug custom Python scripts into specific stages of the pipeline to handle even unusual edge cases in data processing.
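The idea of swapping a custom script into one stage of an otherwise default pipeline can be sketched as follows. This is a minimal illustration in plain Python, not investigraph's actual API: the function names (`extract_csv`, `map_person`, `run_pipeline`, `map_person_custom`) and the dict-based entity shape are hypothetical and only mimic the extract, transform, and load stages described above.

```python
import csv
import io
from typing import Any, Callable, Dict, Iterable, Iterator, List

# Hypothetical types for illustration; real followthemoney entities are richer.
Record = Dict[str, str]
Entity = Dict[str, Any]


def extract_csv(text: str) -> Iterator[Record]:
    """Default extract stage: parse CSV text into one dict per row."""
    yield from csv.DictReader(io.StringIO(text))


def map_person(record: Record) -> Entity:
    """Default transform stage: map a row to an entity-like dict."""
    return {
        "schema": "Person",
        "id": record["id"],
        "properties": {"name": [record["name"].strip()]},
    }


def run_pipeline(
    text: str,
    extract: Callable[[str], Iterable[Record]] = extract_csv,
    transform: Callable[[Record], Entity] = map_person,
    load: Callable[[Iterable[Entity]], List[Entity]] = list,
) -> List[Entity]:
    """Run extract -> transform -> load; any stage can be replaced."""
    return load(transform(record) for record in extract(text))


def map_person_custom(record: Record) -> Entity:
    """Custom transform: reuse the default mapping, then fix name casing."""
    entity = map_person(record)
    entity["properties"]["name"] = [n.title() for n in entity["properties"]["name"]]
    return entity
```

A caller would keep the default stages and override only the one that does not fit, for example `run_pipeline(source_text, transform=map_person_custom)`.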
Features
- Cached data fetching based on HEAD requests and their response headers
- Data extraction based on pandas (runpandarun)
- Data patching via datapatch
- Transforming data records into followthemoney entities via mappings
- Loading result data into a wide range of targets, including cloud storage (via fsspec) or SQL databases (via followthemoney-store)
- "Bring your own code" and plug it into the right stage if the built-in logic doesn't fit your use case
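To illustrate the first feature, cached fetching based on HEAD requests, here is a minimal sketch using only the Python standard library. It is not investigraph's implementation: the choice of `ETag` and `Last-Modified` as the relevant response headers, the in-memory dict cache, and the names `head_headers`, `cache_key`, and `fetch_cached` are all assumptions made for this example.

```python
import hashlib
from typing import Callable, Mapping, MutableMapping, Optional
from urllib.request import Request, urlopen


def head_headers(url: str) -> Mapping[str, str]:
    """Issue a HEAD request and return the response headers as a dict."""
    with urlopen(Request(url, method="HEAD"), timeout=30) as response:
        return dict(response.headers.items())


def cache_key(url: str, headers: Mapping[str, str]) -> Optional[str]:
    """Derive a cache key from the URL and its validator headers.

    Returns None when the server sends neither ETag nor Last-Modified,
    in which case the resource is treated as uncacheable.
    """
    normalized = {name.lower(): value for name, value in headers.items()}
    validators = [normalized.get("etag"), normalized.get("last-modified")]
    if not any(validators):
        return None
    raw = "|".join([url] + [value or "" for value in validators])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def fetch_cached(
    url: str,
    cache: MutableMapping[str, bytes],
    download: Callable[[str], bytes],
    head: Callable[[str], Mapping[str, str]] = head_headers,
) -> bytes:
    """Download the URL only if its validator headers changed since last time."""
    key = cache_key(url, head(url))
    if key is not None and key in cache:
        return cache[key]
    body = download(url)
    if key is not None:
        cache[key] = body
    return body
```

Because the key includes the validator headers, a changed `ETag` or `Last-Modified` value produces a new key and triggers a fresh download, while an unchanged source is served from the cache after a cheap HEAD request.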
Value for investigative research teams
- Standardized process to convert different datasets into a uniform and thus comparable format
- Control over this process for non-technical people
- Creation of your own (internal) data catalog
- Regular, automatic updates of the data
- A growing community that makes more and more datasets accessible
- Access to a public (open source) data catalog operated by investigativedata.io
GitHub repositories
- investigraph-etl - ETL-style pipeline framework for followthemoney data based on prefect.io
- investigraph-eu - Catalog of European datasets powered by investigraph
- runpandarun - A simple interface written in Python for reproducible I/O workflows around tabular data via pandas
- ftmq - An attempt towards a followthemoney query DSL
- investigraph-datasets - Example dataset configurations
- investigraph-site - Landing page for investigraph (next.js app)
- investigraph-api - Public API instance to use as a test playground
- ftmstore-fastapi - Lightweight API that exposes an ftm store to a public endpoint