
Analyze Pipeline

Detect languages

ftm-analyze uses the fastText text classification library with a pre-trained model to detect the language of the document if it is not specified explicitly.
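The fallback logic described above can be sketched as follows. This is a minimal illustration, not the ftm-analyze implementation: the function name `detect_language`, the `min_confidence` threshold, and the `predict` callable are assumptions. The callable is expected to return the same `(labels, probabilities)` pair that fastText's `model.predict(text, k=1)` returns, so the sketch runs without a model file.

```python
from typing import Callable, Optional, Sequence, Tuple

# Shape of the result of fastText's `model.predict(text, k=1)`.
Predict = Callable[[str], Tuple[Sequence[str], Sequence[float]]]

LABEL_PREFIX = "__label__"  # fastText's default label prefix


def detect_language(
    text: str,
    predict: Predict,
    explicit: Optional[str] = None,
    min_confidence: float = 0.5,  # hypothetical threshold, not from ftm-analyze
) -> Optional[str]:
    """Return the explicit language if one is given, else the model's best guess."""
    if explicit:
        return explicit
    # fastText predicts one line at a time, so collapse newlines and extra spaces.
    cleaned = " ".join(text.split())
    if not cleaned:
        return None
    labels, scores = predict(cleaned)
    if not labels or scores[0] < min_confidence:
        return None
    label = labels[0]
    if label.startswith(LABEL_PREFIX):
        label = label[len(LABEL_PREFIX):]
    return label
```

With the real library, `predict` could be `lambda t: model.predict(t, k=1)`, where `model` comes from `fasttext.load_model(...)`. Which model file ftm-analyze ships is not stated here. Note that public fastText language-identification models typically emit two-letter codes, so a conversion step would be needed to arrive at three-letter codes such as `eng`.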

Named-entity recognition (NER)

ftm-analyze uses the spaCy natural-language processing (NLP) framework and a number of pre-trained models for different languages to extract names of people, organizations, and countries from the text previously extracted from the Word document.
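As a rough sketch of this step, the snippet below groups spaCy entities into the three categories named above. The category keys, the label mapping, and both function names are assumptions made for illustration. Entity labels differ between spaCy models (English models use `PERSON` and `GPE`, several others use `PER` and `LOC`), and `GPE` also covers cities and states, so a real pipeline would need an extra step to narrow these down to countries.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed mapping from spaCy entity labels to the categories in the text.
LABEL_TO_CATEGORY = {
    "PERSON": "people",
    "PER": "people",
    "ORG": "organizations",
    "GPE": "countries",  # simplification: GPE also includes cities and states
}


def group_entities(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (text, label) pairs by category, dropping duplicates and other labels."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for text, label in pairs:
        category = LABEL_TO_CATEGORY.get(label)
        name = " ".join(text.split())  # normalize whitespace
        if category and name and name not in grouped[category]:
            grouped[category].append(name)
    return dict(grouped)


def extract_entities(text: str, model_name: str = "en_core_web_sm") -> Dict[str, List[str]]:
    """Run a spaCy pipeline over the text and group its named entities."""
    import spacy  # imported lazily so group_entities() works without spaCy installed

    nlp = spacy.load(model_name)
    doc = nlp(text)
    return group_entities((ent.text, ent.label_) for ent in doc.ents)
```

Keeping the grouping separate from the model call makes it possible to test the mapping without downloading a model.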

Extract patterns

In addition to NLP techniques, ftm-analyze uses simple regular expressions to extract phone numbers, IBAN bank account numbers, and email addresses from documents.
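To illustrate the idea, here is a minimal sketch of regex-based pattern extraction. The patterns are deliberately simple and are not the ones ftm-analyze uses. The function names and result keys are assumptions, and the IBAN mod-97 checksum is an extra step added here to reduce false positives.

```python
import re
from typing import Dict, List

# Illustrative patterns only; real-world extraction needs more care.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# "+" followed by 8 to 15 digits, optionally separated by spaces or hyphens.
PHONE_RE = re.compile(r"\+\d(?:[ -]?\d){7,14}")
# Country code, two check digits, then groups of four with an optional short tail.
IBAN_RE = re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,3})?\b")


def iban_checksum_ok(iban: str) -> bool:
    """Check the ISO 13616 mod-97 checksum of an IBAN without spaces."""
    rearranged = iban[4:] + iban[:4]
    # int(ch, 36) maps "0".."9" to 0..9 and "A".."Z" to 10..35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def extract_patterns(text: str) -> Dict[str, List[str]]:
    """Return normalized emails, phone numbers, and checksum-valid IBANs."""
    emails = [m.lower() for m in EMAIL_RE.findall(text)]
    phones = [re.sub(r"[ -]", "", m) for m in PHONE_RE.findall(text)]
    ibans = [m.replace(" ", "") for m in IBAN_RE.findall(text)]
    return {
        "emails": emails,
        "phones": phones,
        "ibans": [iban for iban in ibans if iban_checksum_ok(iban)],
    }
```

For example, calling `extract_patterns` on a sentence that contains `john.doe@example.org` and `+49 30 1234567` yields the email unchanged and the phone number with its separators removed.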

Write fragments

Info

Under the hood, ftm-analyze uses followthemoney-store, which stores entity data as "fragments". Each fragment holds a subset of an entity's properties. Read more about fragments.

Any extracted entities or patterns are then stored in a separate entity fragment. Assuming that the uploaded Word document mentions a person named "John Doe", the entity fragment written to the FollowTheMoney Store might look like this:

id: 97e1f...
origin: analyze
fragment: default
data:
{
  "schema": "Pages",
  "properties": {
    "peopleMentioned": ["John Doe"],
    "detectedLanguage": ["eng"]
  }
}
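Because each fragment carries only a subset of the properties, a complete entity is obtained by combining all fragments that share an ID. The sketch below shows one way this could work, using plain dictionaries shaped like the rows above. It is not the followthemoney-store implementation: `merge_fragments` is a hypothetical helper, and it assumes all fragments of an entity use the same schema.

```python
from typing import Any, Dict, Iterable

Fragment = Dict[str, Any]


def merge_fragments(fragments: Iterable[Fragment]) -> Dict[str, Dict[str, Any]]:
    """Combine fragments with the same entity ID into one entity per ID."""
    merged: Dict[str, Dict[str, Any]] = {}
    for fragment in fragments:
        data = fragment["data"]
        entity = merged.setdefault(
            fragment["id"],
            # Simplification: take the schema of the first fragment seen.
            {"id": fragment["id"], "schema": data["schema"], "properties": {}},
        )
        for prop, values in data["properties"].items():
            target = entity["properties"].setdefault(prop, [])
            for value in values:
                if value not in target:  # property values behave like sets
                    target.append(value)
    return merged
```

In this model, a fragment written by an earlier pipeline stage and the `analyze` fragment for the same document would merge into a single entity that carries the properties of both.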

ftm-analyze also creates separate entities for mentions of people and organizations. While this creates some redundancy, it allows OpenAleph to take these mentions into account during cross-referencing. For example, another entity fragment will be written because "John Doe" was recognized as the name of a person:

id: 310a4...
origin: analyze
fragment: default
data:
{
  "schema": "Mention",
  "properties": {
    "name": ["John Doe"],
    "document": ["97e1f..."], // ID of the `Pages` entity
    "resolved": ["356aa..."],
    "detectedSchema": ["Person"]
  }
}
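The following sketch shows how such a `Mention` could be assembled and why it helps cross-referencing. How ftm-analyze actually derives the `id` and `resolved` values is not described here, so the SHA-1 hashing and the helper names `make_id` and `make_mention` are placeholders. The point is only that the same normalized name yields the same `resolved` value in every document.

```python
import hashlib
from typing import Any, Dict


def make_id(*parts: str) -> str:
    """Derive a stable hex ID from its parts (placeholder scheme)."""
    raw = "\x1f".join(parts).encode("utf-8")
    return hashlib.sha1(raw).hexdigest()


def make_mention(document_id: str, name: str, schema: str) -> Dict[str, Any]:
    """Build a Mention entity linking a detected name to its document."""
    key = " ".join(name.lower().split())  # simple name normalization
    return {
        "id": make_id("mention", document_id, schema, key),
        "schema": "Mention",
        "properties": {
            "name": [name],
            "document": [document_id],
            # Depends only on schema and name, so it matches across documents.
            "resolved": [make_id("resolved", schema, key)],
            "detectedSchema": [schema],
        },
    }
```

Two documents that both mention "John Doe" would then produce two distinct `Mention` entities with an identical `resolved` value, which gives cross-referencing something to match on.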

Dispatch index task

At the end of the analyze task, ftm-analyze dispatches an index task. This pushes a task object onto OpenAleph's index queue. The task's payload contains the IDs of all entities written in the previous step.
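A minimal sketch of this hand-off is shown below. The task fields (`operation`, `dataset`, `payload`, `entity_ids`) and the function name are hypothetical, and an in-process `queue.Queue` stands in for whatever queue backend OpenAleph actually uses.

```python
import json
import queue
from typing import Any, Dict, Iterable


def dispatch_index_task(
    index_queue: "queue.Queue[str]",
    dataset: str,
    entity_ids: Iterable[str],
) -> Dict[str, Any]:
    """Serialize an index task and push it onto the given queue."""
    task = {
        "operation": "index",  # hypothetical field names throughout
        "dataset": dataset,
        # De-duplicate and sort so the payload is deterministic.
        "payload": {"entity_ids": sorted(set(entity_ids))},
    }
    index_queue.put(json.dumps(task))
    return task
```

A consumer on the indexing side would pop the serialized task, decode it with `json.loads`, and index only the listed entities.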


Thanks to Till Prochaska, who initially wrote up the pipeline for the original Aleph documentation.