Investigraph

Research and implementation of an ETL process for a curated and up-to-date public and open-source data catalog of frequently used datasets in investigative journalism.

Head over to the tutorial

About

investigraph is an ETL framework that allows research teams to build their own data catalog as easily and reproducibly as possible. The investigraph framework provides logic for extracting, transforming and loading any data source into followthemoney entities.

For most common data source formats, this process is possible without programming knowledge, by means of a simple YAML specification interface. However, if a specific dataset cannot be parsed with the built-in logic, a developer can plug in custom Python scripts at specific places within the pipeline to handle even the most unusual edge cases in data processing.
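To illustrate the idea of plugging custom code into one pipeline stage, here is a minimal sketch in plain Python. It is not investigraph's actual API: the names `run_pipeline`, `default_transform` and `custom_transform`, and the dictionary shape used for entities, are hypothetical and chosen only for this example.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional

Record = Dict[str, str]
Entity = Dict[str, object]
TransformHandler = Callable[[Record], Iterator[Entity]]


def default_transform(record: Record) -> Iterator[Entity]:
    """Stand-in for built-in logic: turn each record into one entity."""
    yield {
        "schema": "Person",
        "id": record["id"],
        "properties": {"name": [record["name"]]},
    }


def run_pipeline(
    records: Iterable[Record],
    transform: Optional[TransformHandler] = None,
) -> List[Entity]:
    """Run the transform stage, using a custom handler if one is given."""
    handler = transform or default_transform
    entities: List[Entity] = []
    for record in records:
        entities.extend(handler(record))
    return entities


def custom_transform(record: Record) -> Iterator[Entity]:
    """Hypothetical custom handler: clean up names, skip empty ones."""
    name = " ".join(record.get("name", "").split())
    if name:
        yield {
            "schema": "Person",
            "id": record["id"],
            "properties": {"name": [name]},
        }


records = [
    {"id": "1", "name": "  Jane   Doe "},
    {"id": "2", "name": "   "},
]
entities = run_pipeline(records, transform=custom_transform)
```

With the custom handler, the second record is dropped and the first name is normalized. Omitting the `transform` argument falls back to the default stage.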

Features

  • Cached data fetching based on HEAD requests and their response headers
  • Data extraction based on pandas (runpandarun)
  • Data patching via datapatch
  • Transforming data records into followthemoney entities via mappings
  • Loading result data into a wide range of targets, including cloud storage (via fsspec) or SQL databases (via followthemoney-store)
  • "Bring your own code" and plug it into the right stage if the built-in logic doesn't fit your use case
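The first feature, caching based on HEAD response headers, can be sketched roughly as follows. This is an illustration under stated assumptions, not investigraph's implementation: the choice of headers (`ETag`, `Last-Modified`, `Content-Length`) and the helper names `cache_key` and `needs_fetch` are hypothetical. The headers are passed in as a plain mapping, so the sketch runs without network access.

```python
import hashlib
from typing import Dict, Mapping, Optional

# Headers that usually change when the remote resource changes.
# Which headers a real implementation inspects is an assumption here.
CACHE_HEADERS = ("etag", "last-modified", "content-length")


def cache_key(url: str, headers: Mapping[str, str]) -> Optional[str]:
    """Derive a cache key from a URL and the headers of a HEAD response.

    Returns None when no relevant header is present, in which case the
    resource should always be fetched again.
    """
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    parts = [
        f"{name}={normalized[name]}"
        for name in CACHE_HEADERS
        if name in normalized
    ]
    if not parts:
        return None
    raw = "|".join([url, *parts])
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


def needs_fetch(
    url: str, headers: Mapping[str, str], cache: Dict[str, bytes]
) -> bool:
    """Return True if the resource is not cached under its current key."""
    key = cache_key(url, headers)
    return key is None or key not in cache
```

A caller would issue a HEAD request, pass the response headers to `needs_fetch`, and only download the full body when it returns `True`, storing the result under `cache_key(url, headers)`.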

Value for investigative research teams

  • A standardized process for converting different datasets into a uniform and thus comparable format
  • Control of this process for non-technical people
  • Creation of your own (internal) data catalog
  • Regular, automatic updates of the data
  • A growing community that makes more and more datasets accessible
  • Access to a public (open source) data catalog operated by investigativedata.io

GitHub repositories

Supported by

Media Tech Lab Bayern batch #3