Named Entity Recognition

Info

Entity extraction builds on how ingest-file originally extracted mentioned entities. Read more

Originally, ingest-file filtered the entities returned by spaCy with a custom schema prediction model trained on existing FollowTheMoney data. Based on that filtering, Mention entities are created. These mentions are resolved into actual entities (e.g. Company, Person) when datasets are cross-referenced.

This creates a problem for "smaller" OpenAleph instances: if there is not enough data to cross-reference with, these Mention entities would never be resolved. The same applies when the analysis is used standalone.

ftm-analyze addresses this problem: extracted names can be compared against juditha, and if they are known, the resolved entities are returned instead of mentions.

juditha allows a fast lookup (based on tantivy) against a set of known names (from FollowTheMoney data). The index can be populated by reference datasets such as company registries, sanctions lists, or PEPs.
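
The resolve-or-mention decision described above can be sketched roughly as follows. This is only an illustration of the idea, not juditha's API or implementation: it swaps in Python's standard-library difflib for the tantivy-backed index, and KNOWN_NAMES, normalize, resolve_name, and the 0.85 threshold are all hypothetical names and values chosen for this example. The returned dictionaries are simplified stubs, not full FollowTheMoney entities.

```python
from difflib import SequenceMatcher

# Hypothetical reference names mapped to a schema. In practice these would
# come from FollowTheMoney datasets loaded into the juditha index.
KNOWN_NAMES = {
    "Jane Doe": "Person",
    "Acme Holdings Ltd": "Company",
}


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())


def resolve_name(name: str, threshold: float = 0.85) -> dict:
    """Return a resolved entity stub if the name is known, else a mention stub.

    The similarity score is difflib's ratio, used here as a stand-in for
    whatever fuzzy matching the real index performs.
    """
    best_schema, best_caption, best_score = None, None, 0.0
    for known, schema in KNOWN_NAMES.items():
        score = SequenceMatcher(None, normalize(name), normalize(known)).ratio()
        if score > best_score:
            best_schema, best_caption, best_score = schema, known, score
    if best_score >= threshold:
        # Known name: return the resolved entity with its canonical caption.
        return {"schema": best_schema, "name": best_caption}
    # Unknown name: keep it as a mention, to be resolved later if at all.
    return {"schema": "Mention", "name": name}
```

In this sketch, resolve_name("Jane Do") falls within the threshold and yields a Person stub with the canonical name "Jane Doe", while a name that is not close to any known entry, such as "John Smith", stays a Mention.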

Set up juditha

documentation

Configure the juditha store URI:

export JUDITHA_URI=/path/to/store.db

For example, to load all PEPs by OpenSanctions:

juditha load-dataset -i https://data.opensanctions.org/datasets/latest/peps/index.json
juditha build

With this index in place, ftm-analyze turns known person names into actual Person entities (instead of mentions) if they appear in this PEPs list, including fuzzy matches.