Docker
Investigraph can run in a containerized environment using Docker. This is useful for deployment, CI/CD pipelines, and running datasets in isolated environments.
Using the official Docker image
The official investigraph image is published to GitHub Container Registry.
Pull the latest image:
docker pull ghcr.io/dataresearchcenter/investigraph:latest
Run a dataset:
docker run -v "$(pwd)/datasets:/datasets" ghcr.io/dataresearchcenter/investigraph:latest \
  run -c /datasets/my_dataset/config.yml
Building a custom image
For a catalog of datasets, create a custom image based on investigraph. This is the pattern used by the dataresearchcenter/datasets repository.
Example Dockerfile
FROM ghcr.io/dataresearchcenter/investigraph:latest
# Copy your datasets
COPY datasets/ /datasets/
# Copy build files
COPY Makefile /datasets/
COPY build_catalog.py /datasets/
COPY setup.py /datasets/
COPY pyproject.toml /datasets/
COPY README.md /datasets/
# Install additional dependencies if needed
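USER root
RUN mkdir -p /data
RUN pip install -e /datasets
# Install and clean up in one layer so the apt package lists do not bloat the image
RUN apt-get update \
    && apt-get install -y --no-install-recommends awscli \
    && rm -rf /var/lib/apt/lists/*
USER 1000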
# Configure environment
ENV INVESTIGRAPH_ARCHIVE_URI=s3://my-bucket/archive
ENV INVESTIGRAPH_CACHE_URI=memory://
WORKDIR /datasets
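Once the Dockerfile is in place, the custom image can be built and used like the official one. The following is a minimal sketch, assuming the Dockerfile sits in the repository root next to datasets/. The tag my-datasets and the function name build_and_run are hypothetical placeholders, not names used by the investigraph project.

```shell
# Hypothetical helper: build the custom image and run one bundled dataset.
# "my-datasets" is a placeholder tag, not a published image.
build_and_run() {
  dataset="$1"   # directory name under datasets/, e.g. dataset1
  docker build -t my-datasets . &&
    docker run --rm my-datasets run -c "/datasets/${dataset}/config.yml"
}

# Example call: build_and_run dataset1
```

Because the example Dockerfile points INVESTIGRAPH_ARCHIVE_URI at an S3 bucket, a real run would presumably also need AWS credentials passed into the container.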
Dataset directory structure
Organize datasets in a consistent structure:
datasets/
  dataset1/
    config.yml
    transform.py  # optional
  dataset2/
    config.yml
    extract.py    # optional
  catalog.yml     # optional catalog metadata
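With one config.yml per dataset directory, a whole catalog can be processed in a loop. The following is a sketch under that assumption; run_all_datasets is a hypothetical helper, and the root directory is assumed to be a path relative to the current directory.

```shell
# Hypothetical helper: run every dataset that has a config.yml.
# Usage: run_all_datasets [root]   (root defaults to ./datasets)
run_all_datasets() {
  root="${1:-datasets}"
  for config in "$root"/*/config.yml; do
    [ -f "$config" ] || continue   # skip when the glob matched nothing
    name="$(basename "$(dirname "$config")")"
    docker run --rm -v "$(pwd)/$root:/datasets" \
      ghcr.io/dataresearchcenter/investigraph:latest \
      run -c "/datasets/$name/config.yml" || return 1   # stop at the first failure
  done
}
```

Directories without a config.yml and top-level files such as catalog.yml are ignored by the glob.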
Environment variables
Configure investigraph behavior via environment variables:
| Variable | Description | Default |
|---|---|---|
| `INVESTIGRAPH_DATA_ROOT` | Root directory for data output | `/data` |
| `INVESTIGRAPH_STORE_URI` | Statement store URI | `memory://` |
| `INVESTIGRAPH_ARCHIVE_URI` | Archive storage for downloaded sources | `file:///data/archive` |
| `INVESTIGRAPH_CACHE_URI` | Runtime cache URI | `memory://` |
| `INVESTIGRAPH_EXTRACT_CACHE` | Enable extraction caching | `true` |
| `DEBUG` | Enable debug logging | `0` |
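When several of these variables are set at once, an env file passed with docker run's --env-file flag keeps the command short. The following is a sketch that writes the defaults listed above to a file. The file name investigraph.env and both function names are hypothetical, and DEBUG=1 is assumed to be the value that turns debug logging on.

```shell
# Hypothetical helper: write the documented variables to an env file.
# Values are the documented defaults, except DEBUG (assumed: 1 enables logging).
write_env_file() {
  cat > "${1:-investigraph.env}" <<'EOF'
INVESTIGRAPH_DATA_ROOT=/data
INVESTIGRAPH_STORE_URI=memory://
INVESTIGRAPH_ARCHIVE_URI=file:///data/archive
INVESTIGRAPH_CACHE_URI=memory://
INVESTIGRAPH_EXTRACT_CACHE=true
DEBUG=1
EOF
}

# Run a dataset with every variable from the env file applied.
run_with_env_file() {
  docker run --rm --env-file "${1:-investigraph.env}" \
    -v "$(pwd)/datasets:/datasets" \
    -v "$(pwd)/data:/data" \
    ghcr.io/dataresearchcenter/investigraph:latest \
    run -c /datasets/my_dataset/config.yml
}
```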
Volume mounts
Mount directories to persist data:
docker run \
  -v "$(pwd)/datasets:/datasets" \
  -v "$(pwd)/data:/data" \
  -e INVESTIGRAPH_DATA_ROOT=/data \
  ghcr.io/dataresearchcenter/investigraph:latest \
  run -c /datasets/my_dataset/config.yml
Docker Compose
For local development with multiple datasets:
version: '3.8'
services:
  investigraph:
    image: ghcr.io/dataresearchcenter/investigraph:latest
    volumes:
      - ./datasets:/datasets
      - ./data:/data
    environment:
      - INVESTIGRAPH_DATA_ROOT=/data
      - INVESTIGRAPH_STORE_URI=postgresql://user:pass@postgres:5432/investigraph
    depends_on:
      - postgres
  postgres:
    image: postgres:15
    environment:
      - POSTGRES_DB=investigraph
      - POSTGRES_USER=user
      - POSTGRES_PASSWORD=pass
    volumes:
      - postgres_data:/var/lib/postgresql/data
volumes:
  postgres_data:
Run with:
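One plausible invocation is sketched here, assuming the Compose file above is saved as docker-compose.yml in the current directory and that Docker Compose v2 is installed (with the older standalone binary, substitute docker-compose for docker compose). The function name run_dataset is hypothetical.

```shell
# Hypothetical helper: run one dataset in a one-off container of the
# "investigraph" service. Compose starts the postgres dependency first.
run_dataset() {
  dataset="$1"   # directory name under ./datasets, e.g. my_dataset
  docker compose run --rm investigraph run -c "/datasets/${dataset}/config.yml"
}

# Example call: run_dataset my_dataset
```

Note that depends_on only orders container startup. It does not wait for PostgreSQL to accept connections, so a first run may need a retry.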
Running specific stages
Extract stage only:
docker run -v "$(pwd)/datasets:/datasets" \
  ghcr.io/dataresearchcenter/investigraph:latest \
  extract -c /datasets/my_dataset/config.yml
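Chain stages with pipes. The other examples pass subcommands such as run and extract directly to the image, which suggests that investigraph is the image's entrypoint, so the pipeline overrides the entrypoint with a shell:
docker run --entrypoint sh -v "$(pwd)/datasets:/datasets" \
  ghcr.io/dataresearchcenter/investigraph:latest \
  -c "investigraph extract -c /datasets/my_dataset/config.yml | \
      investigraph transform -c /datasets/my_dataset/config.yml | \
      investigraph load -c /datasets/my_dataset/config.yml"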
Cloud storage integration
Use fsspec-compatible URIs for cloud storage:
# Credentials are passed through from the host environment (they must be exported),
# which keeps the secret values off the command line.
docker run \
  -e AWS_ACCESS_KEY_ID \
  -e AWS_SECRET_ACCESS_KEY \
  -e INVESTIGRAPH_ARCHIVE_URI=s3://my-bucket/archive \
  -v "$(pwd)/datasets:/datasets" \
  ghcr.io/dataresearchcenter/investigraph:latest \
  run -c /datasets/my_dataset/config.yml \
  --entities-uri s3://my-bucket/output/entities.ftm.json \
  --index-uri s3://my-bucket/output/index.json
Multi-architecture support
The official image supports both AMD64 and ARM64 architectures:
# Explicitly specify platform
docker pull --platform linux/amd64 ghcr.io/dataresearchcenter/investigraph:latest
docker pull --platform linux/arm64 ghcr.io/dataresearchcenter/investigraph:latest
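In scripts that run on mixed hardware, the platform string can be derived from the host instead of being hard-coded. The following is a small sketch. docker_platform is a hypothetical helper, and the mapping covers only the two architectures named above.

```shell
# Hypothetical helper: map a machine name (default: uname -m) to a Docker platform.
docker_platform() {
  machine="${1:-$(uname -m)}"
  case "$machine" in
    x86_64|amd64)  echo "linux/amd64" ;;
    aarch64|arm64) echo "linux/arm64" ;;
    *) echo "unsupported architecture: $machine" >&2; return 1 ;;
  esac
}

# Example call:
#   docker pull --platform "$(docker_platform)" ghcr.io/dataresearchcenter/investigraph:latest
```

Docker normally selects the matching architecture on its own, so an explicit platform is mainly useful for cross-architecture testing or for pinning builds.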
Security considerations
- Run as non-root user (UID 1000)
- Limit container resources with `--memory` and `--cpus` flags
- Use read-only filesystem where possible: `--read-only`
- Drop unnecessary capabilities: `--cap-drop=ALL`
Example secure run:
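The following is a minimal sketch combining the flags listed above. The memory and CPU limits, the tmpfs mount for /tmp, and the read-only datasets mount are assumptions chosen for illustration, not values recommended by the investigraph project. secure_run is a hypothetical helper name.

```shell
# Hypothetical helper: run one dataset with a hardened container configuration.
secure_run() {
  dataset="$1"   # directory name under ./datasets, e.g. my_dataset
  # --user: non-root UID; --memory/--cpus: resource limits (illustrative values)
  # --read-only + --tmpfs: immutable root filesystem with scratch space in /tmp
  # --cap-drop=ALL: drop every Linux capability
  docker run --rm \
    --user 1000 \
    --memory 2g \
    --cpus 2 \
    --read-only \
    --tmpfs /tmp \
    --cap-drop=ALL \
    -e INVESTIGRAPH_DATA_ROOT=/data \
    -v "$(pwd)/datasets:/datasets:ro" \
    -v "$(pwd)/data:/data" \
    ghcr.io/dataresearchcenter/investigraph:latest \
    run -c "/datasets/${dataset}/config.yml"
}

# Example call: secure_run my_dataset
```

With a read-only root filesystem, output has to go to a mounted volume, which is why /data stays writable here. If a dataset writes files next to its config, the :ro suffix on the datasets mount would need to be removed.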