Analyze Pipeline
Detect languages
ftm-analyze uses the fastText text classification library with a pre-trained model to detect the language of the document if it is not specified explicitly.
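The fallback logic described above can be sketched as follows. This is a minimal illustration, not the ftm-analyze implementation: `pick_language`, `fake_predict`, and the `min_confidence` threshold are hypothetical names and values. The only fastText behaviour assumed is that `model.predict(text, k=1)` returns a pair of labels and scores, with labels prefixed by `__label__`, and that the input must not contain newlines.

```python
from typing import Callable, Optional, Sequence, Tuple

# Shape of a fastText `model.predict(text, k=1)` result: (labels, scores).
Prediction = Tuple[Sequence[str], Sequence[float]]

LABEL_PREFIX = "__label__"  # fastText prefixes every predicted label with this


def pick_language(
    text: str,
    predict: Callable[[str], Prediction],
    explicit: Optional[str] = None,
    min_confidence: float = 0.5,  # hypothetical threshold, not taken from ftm-analyze
) -> Optional[str]:
    """Return the explicit language if given, otherwise the detected one."""
    if explicit:
        return explicit
    # fastText predicts one line at a time, so collapse newlines first.
    labels, scores = predict(text.replace("\n", " "))
    if not labels or scores[0] < min_confidence:
        return None
    label = labels[0]
    if label.startswith(LABEL_PREFIX):
        label = label[len(LABEL_PREFIX):]
    return label


def fake_predict(text: str) -> Prediction:
    """Stand-in for a loaded fastText model, for demonstration only."""
    if "der" in text.split():
        return (("__label__de",), (0.93,))
    return (("__label__en",), (0.41,))


if __name__ == "__main__":
    print(pick_language("das ist der Bericht", fake_predict))
    print(pick_language("hello", fake_predict, explicit="fr"))
```

With a real model, `predict` would be something like `lambda t: model.predict(t, k=1)`, where `model` is loaded from a pre-trained language identification file.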
Named-entity recognition (NER)
ftm-analyze uses the spaCy natural-language processing (NLP) framework and a number of pre-trained models for different languages to extract names of people, organizations, and countries from the text previously extracted from the Word document.
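As a rough sketch of the post-processing this step implies, the snippet below groups recognized spans by category. It works on plain `(text, label)` pairs so that it runs without any model installed. The label mapping and the `group_mentions` helper are assumptions made for illustration and are not taken from ftm-analyze.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed mapping from spaCy entity labels to the categories named above.
# English models emit PERSON, some other models emit PER. GPE also covers
# cities and states, so a real pipeline needs an extra country lookup.
LABEL_TO_CATEGORY = {
    "PERSON": "people",
    "PER": "people",
    "ORG": "organizations",
    "GPE": "countries",
}


def group_mentions(spans: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (text, label) pairs by category, dropping duplicates and unknown labels."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for text, label in spans:
        category = LABEL_TO_CATEGORY.get(label)
        name = " ".join(text.split())  # normalise internal whitespace
        if category and name and name not in grouped[category]:
            grouped[category].append(name)
    return dict(grouped)


if __name__ == "__main__":
    # Hypothetical spans, shaped like the output of a spaCy pipeline.
    spans = [
        ("John Doe", "PERSON"),
        ("ACME  Corp", "ORG"),
        ("John Doe", "PERSON"),
        ("Germany", "GPE"),
        ("Tuesday", "DATE"),
    ]
    print(group_mentions(spans))
```

With spaCy installed and a model loaded as `nlp`, the input pairs can be produced with `[(ent.text, ent.label_) for ent in nlp(text).ents]`.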
Extract patterns
In addition to NLP techniques, ftm-analyze also uses simple regular expressions to extract phone numbers, IBAN bank account numbers, and email addresses from documents.
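To illustrate what such pattern extraction can look like, here is a small sketch using Python's `re` module. The expressions are deliberately simple and are not the ones ftm-analyze ships. The mod-97 IBAN checksum is an addition of this sketch to filter out false positives, not something claimed about ftm-analyze.

```python
import re

# Illustrative patterns only: real-world extraction needs broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+\d[\d\s().-]{6,}\d")  # international format only
IBAN_RE = re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,3})?\b")


def iban_is_valid(candidate: str) -> bool:
    """Mod-97 check: move the first four characters to the end, map A-Z to 10-35."""
    compact = candidate.replace(" ", "").upper()
    rearranged = compact[4:] + compact[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def extract_patterns(text: str) -> dict:
    """Return emails, phone numbers, and checksum-valid IBANs found in the text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
        "ibans": [m.replace(" ", "") for m in IBAN_RE.findall(text) if iban_is_valid(m)],
    }


if __name__ == "__main__":
    sample = (
        "Contact john.doe@example.org or +49 30 1234567. "
        "IBAN: DE89 3704 0044 0532 0130 00."
    )
    print(extract_patterns(sample))
```

Regular expressions alone accept many strings that merely look like an IBAN, which is why the sketch validates candidates before returning them.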
Write fragments
Info
Under the hood, ftm-analyze uses followthemoney-store to store entity data. followthemoney-store stores entity data as "fragments". Every fragment stores a subset of the properties. Read more about fragments
Any extracted entities or patterns are then stored in a separate entity fragment. Assuming that the Word document uploaded mentions a person named "John Doe", the entity fragment written to the FollowTheMoney Store might look like this:
id | origin | fragment | data |
---|---|---|---|
97e1f... | analyze | default |
ftm-analyze also creates separate entities for mentions of people and organizations. While this creates some redundancy, it allows OpenAleph to take them into account during cross-referencing. For example, another entity fragment will be written because "John Doe" was recognized as a name of a person:
id | origin | fragment | data |
---|---|---|---|
310a4... | analyze | default |
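The fragment model can be sketched in a few lines of Python. This is a simplified in-memory illustration, assuming that fragments sharing an entity ID are combined into one entity when read. The `Fragment` type, the `merge_fragments` helper, the IDs, and the property names are all hypothetical and do not reflect the actual followthemoney-store schema or API.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Fragment(NamedTuple):
    id: str        # entity ID
    origin: str    # pipeline stage that wrote the fragment, e.g. "analyze"
    fragment: str  # fragment key, e.g. "default"
    data: Dict[str, List[str]]  # property name -> values (hypothetical layout)


def merge_fragments(fragments: List[Fragment]) -> Dict[str, Dict[str, List[str]]]:
    """Combine all fragments that share an entity ID into one property map."""
    entities: Dict[str, Dict[str, List[str]]] = defaultdict(dict)
    for frag in fragments:
        props = entities[frag.id]
        for name, values in frag.data.items():
            merged = props.setdefault(name, [])
            for value in values:
                if value not in merged:  # keep each value once
                    merged.append(value)
    return dict(entities)


if __name__ == "__main__":
    # Hypothetical fragments: two for a document, one for a mentioned person.
    fragments = [
        Fragment("doc-1", "ingest", "default", {"fileName": ["report.docx"]}),
        Fragment("doc-1", "analyze", "default", {"namesMentioned": ["John Doe"]}),
        Fragment("mention-1", "analyze", "default", {"name": ["John Doe"]}),
    ]
    for entity_id, props in merge_fragments(fragments).items():
        print(entity_id, props)
```

Because each stage writes its own fragment, a later stage can add properties to an entity without reading or rewriting what earlier stages stored.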
Dispatch index task
At the end of the analyze task, ftm-analyze dispatches an index task. This pushes a task object to the index queue for OpenAleph that includes a payload with the IDs of any entities written in the previous step.
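The hand-off can be pictured with the following sketch, which uses an in-process `queue.Queue` as a stand-in for the real task queue. The function name `dispatch_index_task` and every field name in the task object (`operation`, `dataset`, `payload`, `entity_ids`) are assumptions for illustration and do not describe the actual OpenAleph message format.

```python
import json
import queue
from typing import Iterable


def dispatch_index_task(task_queue: queue.Queue, dataset: str, entity_ids: Iterable[str]) -> str:
    """Serialize a hypothetical index task and push it onto the queue."""
    task = {
        "operation": "index",  # hypothetical field names throughout
        "dataset": dataset,
        "payload": {"entity_ids": sorted(set(entity_ids))},  # deduplicated IDs
    }
    message = json.dumps(task)
    task_queue.put(message)
    return message


if __name__ == "__main__":
    index_queue: queue.Queue = queue.Queue()
    # Hypothetical IDs of entities written during the analyze step.
    dispatch_index_task(index_queue, "my_dataset", ["doc-1", "mention-1", "doc-1"])
    print(index_queue.get_nowait())
```

Passing only entity IDs keeps the task object small. Under this assumption, the indexing worker would load the full entity data from the store itself.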
Thanks to Till Prochaska, who initially wrote up the pipeline for the original Aleph documentation.