Usage / Python

juditha runs in-process: import it, get a Store, and call its methods. There is no HTTP API and no client/server split.

The minimal API

The package re-exports two helpers:

from juditha import lookup, get_store

lookup is a memoised, top-level convenience function for the most common case (one query, one best match). get_store returns the cached Store object for fine-grained access: search with filters, extraction, percolation, and iterating the aggregator.

`lookup()`

from juditha import lookup

res = lookup(
    "Jane Doe",
    threshold=0.95,           # optional, defaults to settings.fuzzy_threshold
    uri=None,                 # optional, defaults to settings.uri
    schemata=("Person",),     # optional FollowTheMoney schema narrowing, must be a tuple
)

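lookup is wrapped in lru_cache(100_000), so a repeated call with identical arguments is answered from the cache instead of hitting the index again. lru_cache hashes its arguments, which is why schemata must be a tuple rather than a list. The return value is Result | None.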
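The caching behaviour can be reproduced with the standard library alone. The sketch below uses a hypothetical stand-in function, cached_lookup, instead of juditha.lookup. It only illustrates how lru_cache treats identical calls and unhashable arguments, not how juditha searches.

```python
from functools import lru_cache

calls = []  # records every invocation that was NOT served from the cache


@lru_cache(maxsize=100_000)
def cached_lookup(query, threshold=None, schemata=None):
    # Hypothetical stand-in for juditha.lookup; the real function queries the store.
    calls.append(query)
    return f"match for {query!r}"


first = cached_lookup("Jane Doe", schemata=("Person",))
second = cached_lookup("Jane Doe", schemata=("Person",))  # identical args: cache hit

try:
    cached_lookup("Jane Doe", schemata=["Person"])  # a list cannot be hashed
    list_accepted = True
except TypeError:
    list_accepted = False
```

The second call never reaches the function body, and the list variant fails before any lookup happens, so passing a tuple is required rather than a matter of style.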

Result extends Doc with query, score, took (ms), caption (best display name via rigour's pick_name), and common_schema (FollowTheMoney schema reduction):

res.key            # "doe jane" – the order-independent canonical key
res.names          # {"Jane Doe"}
res.aliases        # set of alternate surface forms
res.countries      # ISO country codes derived from the FTM entity
res.schemata       # FTM schemata that contributed to this cluster
res.score          # similarity in [0, 1]
res.caption        # human-readable display name
res.common_schema  # e.g. "Person", "Organization", "LegalEntity"
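To make "order-independent" concrete, here is a minimal sketch of how such a key could be derived. The function canonical_key and its rules (lowercase, strip commas, sort tokens) are assumptions for illustration only; juditha's actual normalisation is not described here and may well differ.

```python
def canonical_key(name: str) -> str:
    # Assumed normalisation: lowercase, treat commas as whitespace,
    # then sort the tokens so that word order no longer matters.
    tokens = name.lower().replace(",", " ").split()
    return " ".join(sorted(tokens))


key_a = canonical_key("Jane Doe")
key_b = canonical_key("Doe, Jane")
```

Under these assumptions both spellings collapse to the same key, which is what lets different surface forms land in one cluster.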

`get_store()` and the `Store` class

from juditha import get_store

store = get_store()                    # uses settings.uri (env var or default)
store = get_store("/var/lib/juditha")  # explicit path

get_store resolves the URI at call time and caches one Store per resolved URI. plyvel allows only one open handle per LevelDB path, so this cache is effectively a per-URI singleton.
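The per-URI singleton pattern can be sketched as follows. FakeStore, DEFAULT_URI, and get_fake_store are hypothetical names used only for this illustration; they are not juditha's implementation.

```python
from functools import lru_cache
from typing import Optional

DEFAULT_URI = "juditha.db"  # hypothetical default, standing in for settings.uri


class FakeStore:
    """Hypothetical stand-in for juditha's Store; it only remembers its URI."""

    def __init__(self, uri: str) -> None:
        self.uri = uri


@lru_cache(maxsize=None)
def _store_for(resolved_uri: str) -> FakeStore:
    # One instance per distinct resolved URI.
    return FakeStore(resolved_uri)


def get_fake_store(uri: Optional[str] = None) -> FakeStore:
    # Resolve the default at call time, then cache on the resolved value,
    # so get_fake_store() and get_fake_store(DEFAULT_URI) share one object.
    return _store_for(uri or DEFAULT_URI)


same = get_fake_store() is get_fake_store(DEFAULT_URI)
different = get_fake_store("/tmp/a") is not get_fake_store("/tmp/b")
```

Caching on the resolved value rather than on the raw argument is what keeps the implicit default and the explicit path from opening the same location twice.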

The methods you will use most:

# Best-match search, same engine as juditha.lookup
result = store.search(query, threshold=None, limit=None, schemata=None)

# Aho-Corasick extraction over fulltext
mentions = store.extract("Some text mentioning Jane Doe.")

# Percolation: reverse search of the names index against the text
mentions = store.percolate("Some text mentioning Jane Doe.", slop=0)

extract and percolate both return list[Mention]. See Extract and Percolate for the differences.
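A typical next step is to use the offsets on each mention to annotate the original text. The sketch below uses a hypothetical FakeMention dataclass with the same four fields, and it assumes that start and end are character offsets with an exclusive end; check that against the real Mention objects before relying on it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeMention:
    """Hypothetical stand-in mirroring Mention's fields."""

    text: str
    start: int
    end: int  # assumed to be exclusive
    schema_: str


def annotate(text: str, mentions: List[FakeMention]) -> str:
    # Walk the mentions left to right and wrap each span in [span|schema].
    parts = []
    cursor = 0
    for m in sorted(mentions, key=lambda m: m.start):
        if m.start < cursor:
            continue  # skip spans that overlap one already emitted
        parts.append(text[cursor:m.start])
        parts.append(f"[{text[m.start:m.end]}|{m.schema_}]")
        cursor = m.end
    parts.append(text[cursor:])
    return "".join(parts)


text = "Some text mentioning Jane Doe."
start = text.index("Jane Doe")
mentions = [FakeMention("Jane Doe", start, start + len("Jane Doe"), "Person")]
annotated = annotate(text, mentions)
```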

Writing into the store

from juditha import get_store, io

store = get_store()

# Either: stream FTM entities into the aggregator
io.load_proxies("entities.ftm.json", store)

# ...or push individual entities
store.aggregator.put(some_entity_proxy)
store.aggregator.flush()

# Then rebuild the searchable index + extractor
store.build()

store.build() deletes and recreates the tantivy index, then iterates the aggregator once, feeding both tantivy and the Aho-Corasick extractor.

Shutting down

In a long-running worker you do not need to do anything explicit; the cached Store lives for the process lifetime.

In one-shot scripts or tests that switch URIs, call store.close() to flush pending writes, drain tantivy merges, and close the LevelDB handle:

store = get_store("/tmp/jtest")
# ... do work ...
store.close()
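In tests it is easy to skip close() when an exception is raised midway. Since close() is the only method needed, the standard library's contextlib.closing can guarantee the call. The sketch uses a hypothetical FakeStore so that it runs without a database; with juditha you would pass the object returned by get_store instead.

```python
from contextlib import closing


class FakeStore:
    """Hypothetical stand-in exposing only close(), like the real Store."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


store = FakeStore()
try:
    with closing(store):
        raise RuntimeError("simulated failure during work")
except RuntimeError:
    pass  # close() has already run by the time we get here
```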

Models

juditha.model exposes the data classes you get back from the API. All inherit from pydantic.BaseModel.

from juditha.model import Doc, Result, Mention

Doc(key, names, aliases, countries, schemata, score) – an aggregated cluster.
Result(Doc, query, took, common_schema, caption) – a search hit.
Mention(text, start, end, schema_) – a span extracted from a text. The Python attribute is schema_ (the JSON field is schema; see the Pydantic alias note).

Pydantic alias on `Mention`

Mention.schema_ carries the FTM-style schema label of the matched name. The Python attribute is named schema_ because pydantic's BaseModel already defines a schema() method, which a field named schema would shadow. The JSON surface uses "schema" via a Pydantic alias, so mention.model_dump_json() produces {"text": "...", "start": ..., "end": ..., "schema": "..."}. On the constructor side, both Mention(schema="Person") and Mention(schema_="Person") work.