Settings

Settings management is handled via pydantic-settings, so either env vars with JUDITHA_ prefix or a local .env file works.

Storage and query

`JUDITHA_URI`


Field	`uri`
Type	`str`
Default	`juditha.db`
Read at	every `get_store()` call (resolved at call time)
Rebuild needed	n/a (it selects which store to open)

Path to the store directory. Holds the LevelDB aggregator (names.db/), the tantivy index (tantivy.db/), and the Aho-Corasick patterns file (automaton.txt). Supports any anystore-resolvable path: local filesystem, S3, ...

`JUDITHA_FUZZY_THRESHOLD`


Field	`fuzzy_threshold`
Type	`float`
Default	`0.97`
Read at	every `Store.search` / `lookup()` call
Rebuild needed	no

Jaro similarity threshold for lookup / Store.search. Hits below this score are dropped. Lower for noisier inputs (OCR, NER spans with edit-distance corruption); raise to demand near-exact matches.

`JUDITHA_LIMIT`


Field	`limit`
Type	`int`
Default	`10`
Read at	every `Store.search` / `lookup()` call
Rebuild needed	no

Maximum candidate hits tantivy returns per search before the Jaro / rapidfuzz rerank stage. Raise if the rerank often drops the only correct candidate because it ranked outside the top-10 by BM25.

`JUDITHA_MIN_LENGTH`


Field	`min_length`
Type	`int`
Default	`4`
Read at	every `Store.search` / `lookup()` call
Rebuild needed	no

Minimum length of the normalized query. Queries that normalize to fewer than this many characters return None without touching tantivy.

Index-time tuning

`JUDITHA_MIN_TOKEN_LENGTH`


Field	`min_token_length`
Type	`int`
Default	`4`
Read at	build time (`Store.build`, `AhoExtractor.add_doc`) and query time (percolator blocking)
Rebuild needed	yes (`juditha build`)

Per-token length floor shared by:

The percolator's blocking set ({f for f in text_forms if len(f) >= min_token_length})
The tokens field on the names index (populated at index time with the same floor)
The Aho-Corasick extractor's pattern-length filter, derived as min_token_length × MIN_TOKEN_COUNT (default 4 × 2 = 8 total normalized chars)

Lower this to admit shorter, stopword-ish tokens. Short-token names like Abu Bakr or XI Jinping then carry richer blocking signatures and survive the BM25 percolate_block_limit cap on large corpora. Cost: index size grows and the blocking-set fan-out increases at query time.

Index-time and query-time use of this setting are coupled: the percolator's blocking set at query time MUST match the floor used when populating the tokens field at index time, otherwise blocking misses real candidates. Always run juditha build after changing this.

Query-time tuning

`JUDITHA_PERCOLATE_BLOCK_LIMIT`


Field	`percolate_block_limit`
Type	`int`
Default	`10_000`
Read at	every `Store.percolate` / `percolate()` call
Rebuild needed	no

Maximum number of candidate Docs the percolator's blocking stage returns. Candidates are BM25-ranked by how many (and how rare) of the input text's tokens they match. Tantivy 0.25 does not expose a minimum_should_match knob, so the only way to ensure low-frequency candidates aren't crowded out at multi-million-cluster scale is to raise this cap.

Setting	When	Cost
`10_000` (default)	Corpora up to ~1 M clusters, latency-sensitive workloads	~1 s on 90 K-token input
`100_000`	Multi-million-cluster corpora where parity with Aho matters	~4 to 5 s on 90 K-token input
`1_000_000`	Investigative / batch jobs where recall trumps latency	linear in candidates surviving blocking

See Benchmark for the cost / recall curve at three corpus sizes.

Debug

`JUDITHA_DEBUG`


Field	`debug`
Type	`bool`
Default	`false`
Read at	CLI startup
Rebuild needed	no

When true, typer renders rich tracebacks on CLI errors. Useful for development.

Module-level constants (not settings)

Some thresholds live as module-level constants in juditha.percolator and juditha.extraction rather than on Settings, either because they're conceptually coupled to a class invariant or because changing them is invasive enough to warrant a code change:

Constant	Module	Value	Purpose
`MIN_TOKEN_COUNT`	`juditha.percolator`, `juditha.extraction`	`2`	Single-token names are always skipped (too noisy as both Aho patterns and phrase-query candidates). Same value, declared in two places to keep modules independent.
`NUM_CPU`	`juditha.store`	`multiprocessing.cpu_count()`	Sizes the tantivy writer's heap × num_threads at build time.