Named Entity Recognition
Info
Entity extraction builds on how ingest-file originally extracted mentioned entities.
Originally, ingest-file filtered the entities returned by spaCy with a custom schema prediction model trained on existing FollowTheMoney data. Based on that, Mention entities are created. These mentions are resolved into actual entities (e.g. Company, Person) when datasets are cross-referenced.
This creates a problem for "smaller" OpenAleph instances: if there is not enough data to cross-reference with, these Mention entities are never resolved. The same applies when the analysis is used standalone.
ftm-analyze addresses this problem: extracted names can be compared against juditha, and if they are known, the resolved entities are returned instead of mentions.
juditha allows a fast lookup (based on tantivy) against a set of known names (from FollowTheMoney data). The index can be populated by reference datasets such as company registries, sanctions lists, or PEPs.
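The resolve-or-fall-back behaviour described above can be illustrated with a minimal, self-contained sketch. It does not use juditha's actual API or its tantivy index: the `KNOWN_NAMES` table, the `lookup` and `resolve` helpers, the 0.9 threshold, and the plain dictionaries standing in for FollowTheMoney entities are all assumptions made for illustration, and fuzzy matching is approximated with Python's standard-library `difflib`.

```python
from difflib import SequenceMatcher

# Hypothetical reference names mapped to a FollowTheMoney schema.
# In practice these would come from datasets loaded into the juditha store.
KNOWN_NAMES = {
    "Angela Merkel": "Person",
    "Siemens AG": "Company",
}


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(name.lower().split())


def lookup(name: str, threshold: float = 0.9):
    """Return (canonical_name, schema) for the best fuzzy match, or None."""
    best, best_score = None, 0.0
    for known, schema in KNOWN_NAMES.items():
        score = SequenceMatcher(None, normalize(name), normalize(known)).ratio()
        if score > best_score:
            best, best_score = (known, schema), score
    return best if best_score >= threshold else None


def resolve(name: str) -> dict:
    """Emit a resolved entity for known names, otherwise fall back to a mention."""
    match = lookup(name)
    if match is None:
        # Unknown name: keep it as a mention to be resolved later, if ever.
        return {"schema": "Mention", "name": name}
    canonical, schema = match
    return {"schema": schema, "name": canonical}
```

With this sketch, a slightly misspelled name such as "Angela Merkl" still resolves to the known Person, while a name that is not in the reference table stays a Mention.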
Set up juditha
Configure the juditha store URI:
export JUDITHA_URI=/path/to/store.db
For example, to load all PEPs by OpenSanctions:
juditha load-dataset -i https://data.opensanctions.org/datasets/latest/peps/index.json
From now on, ftm-analyze turns known person names into actual Person entities (instead of mentions) if they appear in this PEPs list, including via fuzzy matching.