Extract
juditha extract scans a full text and returns every stored name that appears in it. The mechanism is an Aho-Corasick automaton, built at index time over every multi-token normalized name in the corpus.
Use this when:
- You need exact (post-normalization) matches with maximum speed.
- Your corpus fits comfortably in memory at build time.
- You don't need fuzzy / phonetic / variant tolerance during extraction (those still apply to juditha lookup).
For variant-tolerant extraction (or for corpora large enough that the automaton would blow up at build time) see Percolate.
CLI
echo "The European Parliament met today." > /tmp/doc.txt
juditha extract -i /tmp/doc.txt
# {"text":"European Parliament","start":4,"end":23,"schema":"PublicBody"}
-i reads input from a file / URL / stdin. -o writes the mention list to a file / URL / stdout, one JSON object per line.
# Pipe straight from a document store
curl -s https://example.org/report.txt | juditha extract -o mentions.json
Python
from juditha import get_store
store = get_store()
text = "The European Parliament met today, and the European Council convened later."
mentions = store.extract(text)
for m in mentions:
    print(m.text, m.start, m.end, m.schema_)
# European Parliament 4 23 PublicBody
# European Council 43 59 PublicBody
store.extract returns list[Mention]. Mention.text is the original surface form, i.e. the slice text[m.start:m.end]. Mention.schema_ is the FollowTheMoney schema label: the attribute is schema_ in Python, while the JSON output uses the key "schema" via a Pydantic alias.
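For consumers that read the -o output back into Python, the following is a minimal sketch of parsing the one-object-per-line format. MentionRecord and parse_mentions are hypothetical names used for illustration and are not part of juditha; only the four JSON keys are taken from the CLI example above.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class MentionRecord:
    # Hypothetical stand-in for juditha's Mention, not the real class.
    text: str
    start: int
    end: int
    schema_: str  # the JSON key is "schema"


def parse_mentions(lines: Iterable[str]) -> Iterator[MentionRecord]:
    # Each non-empty line is expected to hold exactly one JSON object.
    for line in lines:
        line = line.strip()
        if not line:
            continue
        raw = json.loads(line)
        yield MentionRecord(raw["text"], raw["start"], raw["end"], raw["schema"])


sample = ['{"text":"European Parliament","start":4,"end":23,"schema":"PublicBody"}']
records = list(parse_mentions(sample))
print(records[0].text, records[0].schema_)
# European Parliament PublicBody
```

The same function accepts an open file handle, since iterating a text file yields one line at a time.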
How it works
- At build time every name in every cluster is ICU-normalized (NFKC casefold + Latin transliteration via rigour) and tokenized. Names with fewer than 2 tokens or with a total normalized length under 8 characters are dropped (this filters out "EU", "ag", and similar noise).
- Each surviving pattern is wrapped with leading and trailing spaces (" european parliament ") so the automaton itself enforces token-boundary alignment.
- At extraction time the input text is tokenized with the same ICU normalization, joined back together with spaces, and run through the automaton in a single O(n) pass.
- Match positions in the normalized text are mapped back to original-text byte offsets via a precomputed char_to_token array, so Mention.start and Mention.end index into the original text.
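The steps above can be sketched in pure Python. This is an illustration under stated assumptions, not juditha's implementation: normalize is a simplified stand-in for the ICU/rigour normalization, the tokenizer is a plain \w+ regex, the automaton is a small hand-written Aho-Corasick, the offsets are Python string indices, and all function names are hypothetical.

```python
import re
import unicodedata
from collections import deque


def normalize(token: str) -> str:
    # Stand-in for the ICU/rigour normalization (assumption):
    # decompose, strip combining marks, casefold.
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()


def tokenize(text: str) -> list[tuple[str, int, int]]:
    # (normalized token, start, end), with offsets into the original text.
    return [(normalize(m.group()), m.start(), m.end())
            for m in re.finditer(r"\w+", text)]


def build_automaton(names: dict[str, str]):
    # Trie (goto), failure links (fail) and outputs (out), built over
    # space-padded patterns so matches must align with token boundaries.
    goto, fail, out = [{}], [0], [[]]
    for name, schema in names.items():
        state = 0
        for ch in f" {name} ":
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append((name, schema))
    queue = deque(goto[0].values())
    while queue:  # breadth-first, so shallower failure links exist first
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def extract(text: str, automaton) -> list[tuple[str, int, int, str]]:
    goto, fail, out = automaton
    tokens = tokenize(text)
    normalized = " " + " ".join(tok for tok, _, _ in tokens) + " "
    # char_to_token[i]: index of the token covering normalized[i], -1 for spaces.
    char_to_token = [-1]
    for index, (tok, _, _) in enumerate(tokens):
        char_to_token += [index] * len(tok) + [-1]
    mentions, state = [], 0
    for pos, ch in enumerate(normalized):  # single pass over the input
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for name, schema in out[state]:
            # pos is the trailing pad; the name spans the len(name) chars before it.
            first = char_to_token[pos - len(name)]
            last = char_to_token[pos - 1]
            start, end = tokens[first][1], tokens[last][2]
            mentions.append((text[start:end], start, end, schema))
    return mentions


automaton = build_automaton({"muller holding gmbh": "Company",
                             "european parliament": "PublicBody"})
print(extract("Müller Holding GmbH filed a report.", automaton))
# [('Müller Holding GmbH', 0, 19, 'Company')]
print(extract("A pan-European parliamentary group met.", automaton))
# []
```

The second call returns nothing because the padded pattern " european parliament " requires a space after "parliament", which "parliamentary" does not provide. Mapping through token indices rather than character positions keeps the offsets correct even when normalization changes a token's length.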
Persistence
The automaton is persisted alongside the tantivy index as automaton.txt (tab-delimited pattern<TAB>schema per line). The first extract call after a process start loads it from disk; subsequent calls reuse the in-memory automaton.
juditha build regenerates this file in the same pass that rebuilds the tantivy index.
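To make the file format and the load-once behaviour concrete, here is a minimal sketch of writing and reading a pattern<TAB>schema file with an in-process cache. The helper names save_patterns and load_patterns are hypothetical, and the sketch assumes patterns are stored without their padding spaces and that the automaton is rebuilt from the loaded patterns; juditha's actual loader may differ.

```python
import tempfile
from functools import lru_cache
from pathlib import Path


def save_patterns(path: Path, patterns: dict[str, str]) -> None:
    # One "pattern<TAB>schema" record per line.
    with path.open("w", encoding="utf-8") as fh:
        for pattern, schema in patterns.items():
            fh.write(f"{pattern}\t{schema}\n")


@lru_cache(maxsize=1)
def load_patterns(path: Path) -> dict[str, str]:
    # Cached: the file is read on the first call only.
    patterns = {}
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            pattern, sep, schema = line.rstrip("\n").partition("\t")
            if sep:  # skip blank or malformed lines
                patterns[pattern] = schema
    return patterns


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "automaton.txt"
    save_patterns(path, {"european parliament": "PublicBody"})
    first = load_patterns(path)
    second = load_patterns(path)  # served from the in-process cache
    print(first, first is second)
# {'european parliament': 'PublicBody'} True
```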
Limits
- Exact post-normalization match only. "Müller" and "muller" match (ICU folding), but "Jane Doe" and "Jane M. Doe" do not (the intervening token breaks the match).
- Single-token names ("EU", "Britta") are dropped to keep noise low. Percolate applies the same floor by design, so switching to it will not surface them.
- Build-time memory scales with corpus size. For multi-million-name corpora, juditha percolate is the more sustainable choice.
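The name floor behind the second limit can be expressed as a small predicate. This is a sketch only: keep_name is a hypothetical helper, and it assumes the 8-character minimum is measured on the space-joined normalized form, which the text above does not specify.

```python
def keep_name(tokens: list[str], min_tokens: int = 2, min_length: int = 8) -> bool:
    # Assumption: length is measured on the space-joined normalized tokens.
    return len(tokens) >= min_tokens and len(" ".join(tokens)) >= min_length


print(keep_name(["eu"]))                      # False: single token
print(keep_name(["britta"]))                  # False: single token
print(keep_name(["ab", "co"]))                # False: two tokens, but too short
print(keep_name(["european", "parliament"]))  # True
```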