Docker

Investigraph can run in a containerized environment using Docker. This is useful for deployment, CI/CD pipelines, and running datasets in isolated environments.

Using the official Docker image

The official investigraph image is published to GitHub Container Registry.

Pull the latest image:

docker pull ghcr.io/dataresearchcenter/investigraph:latest

Run a dataset:

docker run -v $(pwd)/datasets:/datasets ghcr.io/dataresearchcenter/investigraph:latest \
  run -c /datasets/my_dataset/config.yml
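
For scripted runs, the same invocation can be assembled in Python. This is a minimal sketch and not part of investigraph: the helper name docker_run_command is hypothetical, and it only builds the argument list shown above so it can be inspected or handed to subprocess.run.

```python
import shlex
from pathlib import Path

IMAGE = "ghcr.io/dataresearchcenter/investigraph:latest"


def docker_run_command(datasets_dir: Path, dataset: str) -> list[str]:
    """Build the `docker run` argument list for one dataset (hypothetical helper)."""
    host_dir = datasets_dir.resolve()  # bind mounts need an absolute host path
    return [
        "docker", "run",
        "-v", f"{host_dir}:/datasets",
        IMAGE,
        "run", "-c", f"/datasets/{dataset}/config.yml",
    ]


if __name__ == "__main__":
    cmd = docker_run_command(Path("datasets"), "my_dataset")
    print(shlex.join(cmd))  # inspect the command before running it
    # To execute it (requires Docker):
    # import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when the host path contains spaces.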

Building a custom image

For a catalog of datasets, create a custom image based on investigraph. This is the pattern used by the dataresearchcenter/datasets repository.

Example Dockerfile

FROM ghcr.io/dataresearchcenter/investigraph:latest

# Copy your datasets
COPY datasets/ /datasets/

# Copy build files
COPY Makefile /datasets/
COPY build_catalog.py /datasets/
COPY setup.py /datasets/
COPY pyproject.toml /datasets/
COPY README.md /datasets/

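# Install additional dependencies if needed (system packages require root)
USER root
# Create the output directory and make it writable for the unprivileged user
RUN mkdir -p /data && chown 1000 /data
RUN pip install -e /datasets
RUN apt-get update \
    && apt-get install -y --no-install-recommends awscli \
    && rm -rf /var/lib/apt/lists/*
# Drop back to the unprivileged user
USER 1000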

# Configure environment
ENV INVESTIGRAPH_ARCHIVE_URI=s3://my-bucket/archive
ENV INVESTIGRAPH_CACHE_URI=memory://

WORKDIR /datasets

Dataset directory structure

Organize datasets in a consistent structure:

datasets/
  dataset1/
    config.yml
    transform.py  # optional
  dataset2/
    config.yml
    extract.py    # optional
  catalog.yml     # optional catalog metadata
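
With this layout, the datasets in a catalog can be discovered by globbing for config.yml one level below the root. The sketch below assumes exactly the structure shown above; find_dataset_configs is a hypothetical helper, not an investigraph function.

```python
from pathlib import Path


def find_dataset_configs(root: Path) -> dict[str, Path]:
    """Map each dataset directory name to its config.yml."""
    return {
        config.parent.name: config
        for config in sorted(root.glob("*/config.yml"))
    }


if __name__ == "__main__":
    for name, config in find_dataset_configs(Path("datasets")).items():
        print(f"{name}: {config}")
```

Files that sit directly in the root, such as catalog.yml, are ignored because the pattern only matches one directory level down.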

Environment variables

Configure investigraph behavior via environment variables:

Variable                    Description                             Default
INVESTIGRAPH_DATA_ROOT      Root directory for data output          /data
INVESTIGRAPH_STORE_URI      Statement store URI                     memory://
INVESTIGRAPH_ARCHIVE_URI    Archive storage for downloaded sources  file:///data/archive
INVESTIGRAPH_CACHE_URI      Runtime cache URI                       memory://
INVESTIGRAPH_EXTRACT_CACHE  Enable extraction caching               true
DEBUG                       Enable debug logging                    0
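
The table can be read as "the environment wins, otherwise the default applies". The following sketch encodes that rule and turns overrides into -e flags for docker run. It does not reproduce how investigraph itself parses these values; effective_settings and docker_env_flags are hypothetical helpers, and the defaults are copied from the table.

```python
import os

# Defaults copied from the table above.
DEFAULTS = {
    "INVESTIGRAPH_DATA_ROOT": "/data",
    "INVESTIGRAPH_STORE_URI": "memory://",
    "INVESTIGRAPH_ARCHIVE_URI": "file:///data/archive",
    "INVESTIGRAPH_CACHE_URI": "memory://",
    "INVESTIGRAPH_EXTRACT_CACHE": "true",
    "DEBUG": "0",
}


def effective_settings(environ=None) -> dict[str, str]:
    """Return the value each variable takes: the environment, else the default."""
    env = os.environ if environ is None else environ
    return {name: env.get(name, default) for name, default in DEFAULTS.items()}


def docker_env_flags(overrides: dict[str, str]) -> list[str]:
    """Turn overrides into `-e NAME=value` flags, rejecting unknown names."""
    flags: list[str] = []
    for name, value in overrides.items():
        if name not in DEFAULTS:
            raise KeyError(f"unknown variable: {name}")
        flags += ["-e", f"{name}={value}"]
    return flags
```

Rejecting unknown names catches typos such as INVESTIGRAPH_DATAROOT before they are silently ignored inside the container.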

Volume mounts

Mount directories to persist data:

docker run \
  -v $(pwd)/datasets:/datasets \
  -v $(pwd)/data:/data \
  -e INVESTIGRAPH_DATA_ROOT=/data \
  ghcr.io/dataresearchcenter/investigraph:latest \
  run -c /datasets/my_dataset/config.yml

Docker Compose

For local development with multiple datasets:

version: '3.8'

services:
  investigraph:
    image: ghcr.io/dataresearchcenter/investigraph:latest
    volumes:
      - ./datasets:/datasets
      - ./data:/data
    environment:
      - INVESTIGRAPH_DATA_ROOT=/data
      - INVESTIGRAPH_STORE_URI=postgresql://user:pass@postgres:5432/investigraph
    depends_on:
      - postgres

  postgres:
    image: postgres:15
    environment:
      - POSTGRES_DB=investigraph
      - POSTGRES_USER=user
      - POSTGRES_PASSWORD=pass
    volumes:
      - postgres_data:/var/lib/postgresql/data

volumes:
  postgres_data:

Run with (on legacy Compose v1 installations, use docker-compose instead of docker compose):

docker compose run --rm investigraph run -c /datasets/my_dataset/config.yml
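
In the Compose file above, the host in INVESTIGRAPH_STORE_URI is the Compose service name (postgres), and the user, password, and database must match the POSTGRES_* values of that service. A small standard-library check can catch mismatches before starting the stack. This is only a sketch; store_uri_matches is a hypothetical helper.

```python
from urllib.parse import urlsplit


def store_uri_matches(uri: str, service: str, postgres_env: dict[str, str]) -> bool:
    """Check that a store URI points at the given service with matching credentials."""
    parts = urlsplit(uri)
    return (
        parts.scheme == "postgresql"
        and parts.hostname == service
        and parts.username == postgres_env["POSTGRES_USER"]
        and parts.password == postgres_env["POSTGRES_PASSWORD"]
        and parts.path.lstrip("/") == postgres_env["POSTGRES_DB"]
    )


if __name__ == "__main__":
    env = {
        "POSTGRES_DB": "investigraph",
        "POSTGRES_USER": "user",
        "POSTGRES_PASSWORD": "pass",
    }
    uri = "postgresql://user:pass@postgres:5432/investigraph"
    print(store_uri_matches(uri, "postgres", env))
```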

Running specific stages

Extract stage only:

docker run -v $(pwd)/datasets:/datasets \
  ghcr.io/dataresearchcenter/investigraph:latest \
  extract -c /datasets/my_dataset/config.yml

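Chain stages with pipes. Because the image is invoked with investigraph subcommands directly, running a shell pipeline inside the container requires overriding the entrypoint:

docker run -v $(pwd)/datasets:/datasets \
  --entrypoint sh \
  ghcr.io/dataresearchcenter/investigraph:latest \
  -c "investigraph extract -c /datasets/my_dataset/config.yml | \
      investigraph transform -c /datasets/my_dataset/config.yml | \
      investigraph load -c /datasets/my_dataset/config.yml"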
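
The pipeline string passed to the shell can also be generated per dataset. A minimal sketch, assuming each stage takes the same -c config path as in the command above; stage_pipeline is a hypothetical helper.

```python
import shlex

STAGES = ("extract", "transform", "load")


def stage_pipeline(config_path: str) -> str:
    """Join the three stages with pipes, quoting the config path for the shell."""
    return " | ".join(
        shlex.join(["investigraph", stage, "-c", config_path]) for stage in STAGES
    )


if __name__ == "__main__":
    print(stage_pipeline("/datasets/my_dataset/config.yml"))
```

shlex.join quotes the path only when needed, so config paths containing spaces remain a single shell argument.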

Cloud storage integration

Use fsspec-compatible URIs for cloud storage:

docker run \
  -v $(pwd)/datasets:/datasets \
  -e AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
  -e AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
  -e INVESTIGRAPH_ARCHIVE_URI=s3://my-bucket/archive \
  ghcr.io/dataresearchcenter/investigraph:latest \
  run -c /datasets/my_dataset/config.yml \
    --entities-uri s3://my-bucket/output/entities.ftm.json \
    --index-uri s3://my-bucket/output/index.json

Multi-architecture support

The official image supports both AMD64 and ARM64 architectures:

# Explicitly specify platform
docker pull --platform linux/amd64 ghcr.io/dataresearchcenter/investigraph:latest
docker pull --platform linux/arm64 ghcr.io/dataresearchcenter/investigraph:latest
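
When a script needs to choose the --platform value itself, the host architecture reported by Python can be mapped to the two variants listed above. This is a sketch under the assumption that only linux/amd64 and linux/arm64 are published; docker_platform is a hypothetical helper.

```python
import platform

# Common `platform.machine()` values mapped to Docker platform strings.
_DOCKER_PLATFORMS = {
    "x86_64": "linux/amd64",
    "amd64": "linux/amd64",
    "aarch64": "linux/arm64",
    "arm64": "linux/arm64",
}


def docker_platform(machine=None) -> str:
    """Return the Docker platform string for a machine name (default: this host)."""
    name = (machine or platform.machine()).lower()
    try:
        return _DOCKER_PLATFORMS[name]
    except KeyError:
        raise ValueError(f"no image variant assumed for {name!r}") from None


if __name__ == "__main__":
    print(docker_platform())
```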

Security considerations

  • Run as a non-root user (UID 1000); host directories mounted for output must be writable by that UID
  • Limit container resources with the --memory and --cpus flags
  • Use a read-only root filesystem where possible: --read-only (mounted volumes stay writable; add --tmpfs /tmp if the pipeline needs scratch space)
  • Drop unnecessary capabilities: --cap-drop=ALL

Example secure run:

docker run \
  --user 1000 \
  --read-only \
  --tmpfs /tmp \
  --cap-drop=ALL \
  --memory=2g \
  --cpus=2 \
  -v $(pwd)/datasets:/datasets:ro \
  -v $(pwd)/data:/data \
  ghcr.io/dataresearchcenter/investigraph:latest \
  run -c /datasets/my_dataset/config.yml