Best Practices
This section provides recommendations for building robust and maintainable datasets with investigraph. These practices are adapted from the zavod framework (developed by OpenSanctions) and tailored for investigraph's architecture.
Overview
Building quality datasets requires attention to:
- Code organization - Structure your transform functions clearly
- Data handling - Track and validate all source fields
- Entity identifiers - Generate stable, deterministic IDs
- Caching strategies - Optimize extraction performance
- Data priorities - Focus on essential information first
- Logging - Use appropriate log levels for debugging and monitoring
Sections
- Dataset metadata - Writing excellent dataset documentation
- Transform patterns - Writing effective transform handlers
- Entity keys and IDs - Generating stable entity identifiers
- Utility functions - Using context helpers and common patterns
Code organization
Transform function structure
Organize transform functions with clear separation of concerns:
def handle(ctx, record, ix):
    """
    Main transform handler:
    1. Extract and clean data from record
    2. Create entities
    3. Emit entities
    """
    # Extract data
    org = make_organization(ctx, record)
    person = make_person(ctx, record)
    # Create relationships
    membership = make_membership(ctx, person, org, record)
    # Emit all entities
    yield org
    yield person
    yield membership

def make_organization(ctx, record):
    """Create organization entity from record"""
    org = ctx.make_entity("Organization")
    org.id = ctx.make_slug(record.pop("org_id"))
    org.add("name", record.pop("org_name"))
    return org
Use descriptive helper function names following the make_thing pattern:
- make_person() - Creates Person entities
- make_company() - Creates Company entities
- make_address() - Composes address information
Import order
Organize imports (enforced by ruff/isort):
# Standard library
import csv
from datetime import datetime
# Third-party packages
import pandas as pd
from normality import slugify
# Investigraph
from investigraph.model import Context
Constants and patterns
Define regular expressions and constants at module level:
import re
# Precompile patterns
ID_PATTERN = re.compile(r"^[A-Z]{2}\d{6}$")
DATE_PATTERN = re.compile(r"(\d{4})-(\d{2})-(\d{2})")
# Define mappings
COUNTRY_CODES = {
    "United Kingdom": "GB",
    "United States": "US",
}
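As a minimal sketch of how such module-level constants might be used, the helpers below apply the precompiled patterns and the mapping inside small functions. The names is_valid_id, find_date and country_code are hypothetical helpers for illustration and are not part of investigraph.

```python
import re

# Module-level constants, compiled once at import time
ID_PATTERN = re.compile(r"^[A-Z]{2}\d{6}$")
DATE_PATTERN = re.compile(r"(\d{4})-(\d{2})-(\d{2})")
COUNTRY_CODES = {
    "United Kingdom": "GB",
    "United States": "US",
}


def is_valid_id(value):
    """Hypothetical helper: check a registration ID against ID_PATTERN."""
    return bool(value) and ID_PATTERN.match(value.strip()) is not None


def find_date(text):
    """Hypothetical helper: return the first YYYY-MM-DD date in text, or None."""
    match = DATE_PATTERN.search(text or "")
    return match.group(0) if match else None


def country_code(name):
    """Hypothetical helper: map a country name to its code, or None if unknown."""
    return COUNTRY_CODES.get((name or "").strip())
```

Because the patterns live at module level, every call to a helper reuses the same compiled objects instead of recompiling them per record.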
Data priorities
When extracting data, prioritize based on importance:
Essential (minimum)
- Names - At minimum, extract the name of each entity
Essential (when available)
- Identifiers - Official registration numbers, IDs
- Dates - Birth dates, incorporation dates, sanction dates
- Jurisdictions - Countries, registration jurisdictions
entity.add("idNumber", record.get("registration_number"))
entity.add("incorporationDate", record.get("founded"))
entity.add("country", record.get("jurisdiction"))
Should include
- Relationships - Ownership, membership, family relations
- Temporal data - Start/end dates for positions, sanctions
- Contact information - When publicly available and relevant
Could include
- Source URLs - Links to original data
- Notes - Additional context
Logging
Use appropriate log levels:
- Debug - Detailed information for development
- Info - Progress tracking for large datasets
- Warning - Issues that need attention
- Error - Serious problems, for example:
try:
    date = parse_date(record["date"])
except ValueError as e:
    ctx.log.error("Date parsing failed", value=record["date"], error=str(e))
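The four levels can be seen together in a single processing loop. This is a sketch only: it uses the standard library logging module as a stand-in for ctx.log (an assumption, so messages use %-style formatting rather than keyword arguments), and parse_years and the "year" field are hypothetical.

```python
import logging

# Stand-in for ctx.log (assumption): a standard library logger
log = logging.getLogger("dataset_example")


def parse_years(records, progress_every=1000):
    """Hypothetical helper: collect integer years, logging by severity."""
    years = []
    for ix, record in enumerate(records):
        # Debug: per-record detail, mainly useful during development
        log.debug("Processing record %d: %r", ix, record)
        # Info: occasional progress markers for large datasets
        if ix > 0 and ix % progress_every == 0:
            log.info("Processed %d records", ix)
        value = record.get("year")
        if value is None:
            # Warning: data is missing and needs attention; skip the record
            log.warning("Record %d has no year", ix)
            continue
        try:
            years.append(int(value))
        except ValueError as e:
            # Error: the value is present but cannot be parsed
            log.error("Year parsing failed for %r: %s", value, e)
    return years
```

Keeping per-record messages at debug level means a normal run of a large dataset only shows progress markers and genuine problems.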
Caching strategies
Extract stage caching
Investigraph caches extracted sources by default. Control this behavior with an environment variable:
# Disable caching for frequently-updated sources
INVESTIGRAPH_EXTRACT_CACHE=0 investigraph run -c config.yml
Cache considerations:
- Index pages - Consider disabling cache for frequently-updated sources (sanction lists, regulatory filings)
- Detail pages - Keep default caching for large, slow-moving datasets (corporate registries)
- Paginated content - Be careful with pagination as cached pages may become stale when new items are added
Archive storage
Use archive storage for sources that rarely change:
extract:
  archive: true # downloads and stores sources locally
  sources:
    - uri: https://example.com/annual-report-2023.pdf
Disable archiving for dynamic APIs by setting archive: false in the extract section, so that each run fetches fresh data.
Testing and validation
Test with limited data
Use the -l flag to test with a subset of records:
# Extract and transform first 10 records
investigraph extract -c config.yml -l 10 | \
  investigraph transform -c config.yml
Review checklist
Before considering a dataset complete:
- All source fields are mapped or explicitly ignored
- Entity IDs are stable and deterministic
- Required properties are present (name, identifiers)
- Relationships are properly linked
- Dates are in ISO format (YYYY-MM-DD)
- Countries use ISO 2-letter codes
- No personally-identifying information in IDs
- Logging covers important events and warnings
- Configuration is documented (summary, publisher)
- Test run produces expected entity counts
- No errors or warnings in logs
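Some of the checklist items (required properties, ISO dates, 2-letter country codes) can be checked mechanically. The following is a minimal sketch under stated assumptions: plain dicts stand in for real entity objects, and check_entity and its field names are hypothetical.

```python
import re
from datetime import date

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")
ISO_COUNTRY = re.compile(r"[A-Z]{2}")


def check_entity(entity):
    """Hypothetical check: return a list of checklist problems for one entity.

    `entity` is a plain dict with "id", "name", "dates" and "countries" keys,
    used here as a stand-in for a real entity object (assumption).
    """
    problems = []
    if not entity.get("id"):
        problems.append("missing id")
    if not entity.get("name"):
        problems.append("missing name")
    for value in entity.get("dates", []):
        if not ISO_DATE.fullmatch(value):
            problems.append(f"date not in ISO format: {value}")
            continue
        try:
            date.fromisoformat(value)  # rejects impossible dates such as month 13
        except ValueError:
            problems.append(f"invalid calendar date: {value}")
    for value in entity.get("countries", []):
        # Shape check only: does not confirm the code is an assigned ISO code
        if not ISO_COUNTRY.fullmatch(value):
            problems.append(f"country is not a 2-letter code: {value}")
    return problems
```

An empty list means the entity passed these particular checks; the remaining checklist items still need a manual review.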
Common pitfalls
Mutable default arguments
Don't use mutable defaults:
# Wrong
def make_entity(ctx, data, aliases=[]):
    entity = ctx.make_entity("Person")
    aliases.append(data["name"])  # Modifies shared list!
    return entity

# Correct
def make_entity(ctx, data, aliases=None):
    if aliases is None:
        aliases = []
    entity = ctx.make_entity("Person")
    aliases.append(data["name"])
    return entity
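The effect is easiest to see without any entities involved. The sketch below uses two hypothetical functions, collect_wrong and collect_right, that differ only in how the default list is created.

```python
def collect_wrong(name, aliases=[]):
    """The default list is created once and shared between calls."""
    aliases.append(name)
    return aliases


def collect_right(name, aliases=None):
    """A fresh list is created on each call that omits `aliases`."""
    if aliases is None:
        aliases = []
    aliases.append(name)
    return aliases


first = collect_wrong("Alice")
second = collect_wrong("Bob")
# Both names refer to the same list object, so "Alice" leaks into the second call
assert first is second

third = collect_right("Alice")
fourth = collect_right("Bob")
# Each call received its own list
assert third is not fourth
```

In a transform handler this kind of leak would carry values from one record into the next, which is hard to spot in the output.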
Empty entity checks
Always check if entities are valid before yielding:
def handle(ctx, record, ix):
    entity = ctx.make_entity("Organization")
    entity.id = ctx.make_slug(record.get("id"))
    entity.add("name", record.get("name"))
    # Check for required data
    if not entity.id or not entity.has("name"):
        ctx.log.warning("Skipping invalid entity", record=ix)
        return
    yield entity
ID collisions
Ensure IDs are unique across the dataset:
# Wrong - might collide if multiple entity types share ID space
entity.id = ctx.make_slug(record["id"])
# Correct - include entity type in ID
entity.id = ctx.make_slug("person", record["id"])
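To illustrate the collision, the sketch below uses a simplified stand-in for ctx.make_slug. This make_slug function and its "example" prefix are assumptions for demonstration only and do not reproduce investigraph's actual implementation.

```python
import re


def make_slug(*parts, prefix="example"):
    """Stand-in for ctx.make_slug (assumption, not investigraph's code).

    Joins the non-empty parts into a lowercase, dash-separated ID.
    Returns None when no part carries any text.
    """
    cleaned = []
    for part in parts:
        raw = "" if part is None else str(part)
        text = re.sub(r"[^a-z0-9]+", "-", raw.lower()).strip("-")
        if text:
            cleaned.append(text)
    if not cleaned:
        return None
    return "-".join([prefix, *cleaned])


person_record = {"id": "1001"}
company_record = {"id": "1001"}

# Without a type component, two different entities receive the same ID
assert make_slug(person_record["id"]) == make_slug(company_record["id"])

# With a type component, the IDs stay distinct and remain deterministic
person_id = make_slug("person", person_record["id"])
company_id = make_slug("company", company_record["id"])
assert person_id != company_id
```

The same inputs always produce the same ID, so re-running the dataset does not create duplicates, while the type component keeps separate ID spaces apart.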
Further reading
- Transform patterns - Detailed transform examples
- Entity keys and IDs - ID generation strategies
- Utility functions - Context helpers and common patterns
- Context API reference - Available context methods
- FollowTheMoney - Entity model documentation