Usage / Python
juditha is in-process. Import it, get a Store, call its methods. There is no HTTP API and no client / server split.
The minimal API
The package re-exports two helpers:
lookup is a memoised, top-level convenience for the most common case (one query, one best match). get_store returns the (cached) Store object for fine-grained access (search with filters, extraction, percolation, iterating the aggregator).
lookup()
from juditha import lookup
res = lookup(
"Jane Doe",
threshold=0.95, # optional, defaults to settings.fuzzy_threshold
uri=None, # optional, defaults to settings.uri
schemata=("Person",), # optional FollowTheMoney schema narrowing, must be a tuple
)
lookup is wrapped in lru_cache(100_000), so repeated queries (same args) are O(1). Return value is Result | None.
Result extends Doc with query, score, took (ms), caption (best display name via rigour's pick_name), and common_schema (FollowTheMoney schema reduction):
res.key # "doe jane" – the order-independent canonical key
res.names # {"Jane Doe"}
res.aliases # set of alternate surface forms
res.countries # ISO country codes derived from the FTM entity
res.schemata # FTM schemata that contributed to this cluster
res.score # similarity in [0, 1]
res.caption # human-readable display name
res.common_schema # e.g. "Person", "Organization", "LegalEntity"
get_store() and the Store class
from juditha import get_store
store = get_store() # uses settings.uri (env var or default)
store = get_store("/var/lib/juditha") # explicit path
get_store resolves the URI at call time and caches one Store per resolved URI. plyvel allows only one open handle per LevelDB path, so this cache is effectively a per-URI singleton.
The methods you will use most:
# Best-match search, same engine as juditha.lookup
result = store.search(query, threshold=None, limit=None, schemata=None)
# Aho-Corasick extraction over fulltext
mentions = store.extract("Some text mentioning Jane Doe.")
# Percolation: reverse search of the names index against the text
mentions = store.percolate("Some text mentioning Jane Doe.", slop=0)
extract and percolate both return list[Mention]. See Extract and Percolate for the differences.
Writing into the store
from juditha import get_store
from juditha import io
store = get_store()
# Either: stream FTM entities into the aggregator
io.load_proxies("entities.ftm.json", store)
# ...or push individual entities
store.aggregator.put(some_entity_proxy)
store.aggregator.flush()
# Then rebuild the searchable index + extractor
store.build()
store.build() deletes and recreates the tantivy index, then iterates the aggregator once feeding both tantivy and the Aho-Corasick extractor.
Shutting down
In a long-running worker you do not need to do anything explicit; the cached Store lives for the process lifetime.
In one-shot scripts or tests that switch URIs, call store.close() to flush pending writes, drain tantivy merges, and close the LevelDB handle:
Models
juditha.model exposes the data classes you get back from the API. All inherit from pydantic.BaseModel.
Doc(key, names, aliases, countries, schemata, score)– an aggregated cluster.Result(Doc, query, took, common_schema, caption)– a search hit.Mention(text, start, end, schema_)– a span extracted from a fulltext. The Python attribute isschema_(the JSON field isschema; see the Pydantic alias note).
Pydantic alias on Mention
Mention.schema_ carries the FTM-style schema label of the matched name. The Python attribute is schema_ because BaseModel.schema is reserved. The JSON surface uses "schema" via a Pydantic alias, so mention.model_dump_json() produces {"text": "...", "start": ..., "end": ..., "schema": "..."}. Both Mention(schema="Person") and Mention(schema_="Person") work on the constructor side.