Juditha
A super-fast in-process lookup service for canonical names, backed by tantivy.
juditha exists to tame the noise that follows from Named Entity Recognition: given a huge list of known names (company registries, persons of interest, sanctions lists), it tells you whether a span produced by your NER pipeline corresponds to one of them, even when the casing, accents, token order, or spelling differs.
The implementation relies on a pre-populated names database and index. The data is either a set of FollowTheMoney entities or simply a list of names.
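To make the matching idea concrete, here is a minimal standard-library sketch of how casing, accents, and token order can be folded away before a lookup. It is not juditha's implementation or API: `normalize_name`, `known`, and `lookup` are hypothetical names chosen for illustration, and spelling variation (which juditha also handles) is out of scope for this sketch.

```python
import unicodedata
from typing import Optional


def normalize_name(name: str) -> str:
    """Hypothetical helper: fold case, strip accents, and sort tokens."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Casefold and sort the tokens so that word order no longer matters.
    tokens = sorted(stripped.casefold().split())
    return " ".join(tokens)


# Illustrative corpus: normalized key -> canonical name.
known = {normalize_name(n): n for n in ["Juditha Dommer", "Johann Pachelbel"]}


def lookup(span: str) -> Optional[str]:
    """Return the canonical name for an NER span, or None if it is unknown."""
    return known.get(normalize_name(span))
```

With this sketch, `lookup("dommer JUDITHA")` resolves to the canonical `"Juditha Dommer"`, while a span that matches no known name returns `None`.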
What you can do with it
- Validate and canonicalise NER spans against a known-name corpus (Quickstart, Usage).
- Load names from a flat list, FollowTheMoney entities, or a nomenklatura dataset / catalog (Load data).
- Extract every known-name mention from a full-text document, either via an Aho-Corasick automaton or via percolation (a reverse search against the names index).
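For readers unfamiliar with the Aho-Corasick approach mentioned in the last item, the following is a small pure-Python sketch of the technique: it builds a character-level automaton from a list of names and reports every occurrence in a text in a single pass. It only illustrates the idea and makes no claim about how juditha implements extraction; `build_automaton` and `find_mentions` are hypothetical names, and the sketch assumes names and text have already been normalized the same way.

```python
from collections import deque


def build_automaton(names):
    """Build a character-level Aho-Corasick automaton over the given names."""
    goto = [{}]  # goto[state][char] -> next state
    fail = [0]   # failure link for each state
    out = [[]]   # names recognised when reaching each state
    for name in names:
        state = 0
        for ch in name:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(name)
    # Breadth-first pass to compute failure links (depth-1 states fail to root).
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit names that end at the longest proper suffix state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_mentions(text, automaton):
    """Return (start_offset, name) for every occurrence of a name in text."""
    goto, fail, out = automaton
    state = 0
    hits = []
    for i, ch in enumerate(text):
        # Follow failure links until a transition on ch exists or we hit root.
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for name in out[state]:
            hits.append((i - len(name) + 1, name))
    return hits
```

Overlapping names are all reported: scanning a text that contains "johann pachelbel" with both "johann pachelbel" and "pachelbel" in the name list yields one hit for each.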
Where to go next
- Start with the Quickstart.
- Usage / CLI and the full CLI reference.
- Usage / Python.
The name
Juditha Dommer was the daughter of a coppersmith and raised seven children, while her husband Johann Pachelbel wrote a canon.
Versioning
To signal compatibility with followthemoney, juditha follows the same major version, which is currently 4.x.x.
License and copyright
juditha, (C) 2024 investigativedata.io. (C) 2025, 2026 Data and Research Center – DARC. Licensed under AGPLv3 or later. See NOTICE and LICENSE.