molforge.validation

validation

Cross-validation utilities for protein design.

This subpackage captures the pattern of scoring designs across one or more validators and combining the results — the natural follow-on to :mod:molforge.metrics for de novo design workflows.

Three core building blocks:

Criteria: declarative success conditions.

  • :class:Criterion is an atomic comparison (e.g. pLDDT > 80). Compose with &, |, ~.
  • :class:NamedCriterion is a criterion with a human-readable label.
  • :class:CriteriaSet is a named collection of criteria evaluated together (implicit AND across criteria; pass/fail is captured per criterion for diagnostics).

Verdicts: per-design results.

  • :class:Verdict holds the metric values, criterion results, and overall pass/fail for one design.
  • :func:rank_verdicts sorts verdicts by score, optionally keeping only those that passed.

Orchestration:

  • :func:cross_validate runs a list of designs through one or more validators (any callable that returns a metric dict), applies the criteria, and returns verdicts.
  • :func:consensus merges verdict lists across validators ("ESMFold AND AlphaFold both confirm").

Example::

from molforge.validation import (
    cross_validate, consensus, rank_verdicts,
    Criterion, CriteriaSet,
)

success = (
    CriteriaSet()
    .add("plddt_ok", Criterion.gt("esmfold.plddt", 80))
    .add("tm_ok",    Criterion.gt("esmfold.tm", 0.5))
    .add("rmsd_ok",  Criterion.lt("esmfold.rmsd", 2.0))
)

verdicts = cross_validate(
    designs=sequences,
    validators={"esmfold": esmfold_score_fn},
    criteria=success,
    score_metric="esmfold.plddt",
)

for v in rank_verdicts(verdicts, only_passed=True):
    print(v.design_id, v.values["esmfold.plddt"])

CriteriaSet dataclass

CriteriaSet(criteria: dict[str, Criterion] = dict())

A named collection of criteria evaluated together.

Each criterion is evaluated separately so per-criterion pass/fail is available in the resulting :class:Verdict. The overall verdict passes only if all named criteria pass (an implicit AND).

metric_names property

metric_names: frozenset[str]

Union of metric names across all criteria.

add

add(name: str, criterion: Criterion) -> CriteriaSet

Add a named criterion. Returns self for chaining.

evaluate

evaluate(values: Mapping[str, Any]) -> dict[str, bool]

Evaluate every criterion against values.

Returns a dict mapping criterion name to its pass/fail result.

passes

passes(values: Mapping[str, Any]) -> bool

True iff every criterion passes.
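The relationship between evaluate and passes can be sketched without the library. The snippet below is a minimal stand-in, assuming each criterion is just a predicate over the metric dict; the helper names evaluate_all and passes_all and the metric values are hypothetical and are not molforge's implementation.

```python
from typing import Any, Callable, Mapping

# Hypothetical stand-in: a "criterion" is a plain predicate over the metric dict.
Predicate = Callable[[Mapping[str, Any]], bool]

criteria: dict[str, Predicate] = {
    "plddt_ok": lambda v: v["esmfold.plddt"] > 80,
    "tm_ok": lambda v: v["esmfold.tm"] > 0.5,
    "rmsd_ok": lambda v: v["esmfold.rmsd"] < 2.0,
}


def evaluate_all(criteria: Mapping[str, Predicate], values: Mapping[str, Any]) -> dict[str, bool]:
    """Evaluate every criterion separately so each result stays inspectable."""
    return {name: check(values) for name, check in criteria.items()}


def passes_all(criteria: Mapping[str, Predicate], values: Mapping[str, Any]) -> bool:
    """Implicit AND: the set passes only if every named criterion passes."""
    return all(evaluate_all(criteria, values).values())


# Illustrative metric values for one design (made up for this sketch).
values = {"esmfold.plddt": 86.2, "esmfold.tm": 0.61, "esmfold.rmsd": 2.4}
results = evaluate_all(criteria, values)
failed = [name for name, ok in results.items() if not ok]
overall = passes_all(criteria, values)
```

Here results keeps the per-criterion detail that a single boolean would discard, which is what makes the "which check failed" diagnostics possible.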

Criterion

Criterion(
    evaluate: Callable[[Mapping[str, Any]], bool],
    describe: Callable[[], str],
    metric_names: frozenset[str],
)

Declarative success criterion: a metric name + comparison + threshold.

Construct via the factory classmethods :meth:gt, :meth:ge, :meth:lt, :meth:le, :meth:eq, :meth:ne rather than instantiating directly — they make the intent obvious in the calling code.

Compose with the standard logical operators::

a & b   # both must pass
a | b   # either must pass
~a      # invert (passes when ``a`` would fail)
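One way such composition could be implemented is by overloading Python's bitwise operators. The sketch below uses a hypothetical MiniCriterion class to show the idea, including how metric_names would be unioned across operands; it is an assumption about the mechanism, not molforge's actual code.

```python
from dataclasses import dataclass
from typing import Any, Callable, Mapping


@dataclass(frozen=True)
class MiniCriterion:
    """Hypothetical stand-in for Criterion; not molforge's implementation."""

    evaluate: Callable[[Mapping[str, Any]], bool]
    metric_names: frozenset

    @classmethod
    def gt(cls, metric: str, threshold: float) -> "MiniCriterion":
        return cls(lambda v: v[metric] > threshold, frozenset({metric}))

    @classmethod
    def lt(cls, metric: str, threshold: float) -> "MiniCriterion":
        return cls(lambda v: v[metric] < threshold, frozenset({metric}))

    def __and__(self, other: "MiniCriterion") -> "MiniCriterion":
        # a & b: both operands must pass; referenced metrics are unioned.
        return MiniCriterion(
            lambda v: self.evaluate(v) and other.evaluate(v),
            self.metric_names | other.metric_names,
        )

    def __or__(self, other: "MiniCriterion") -> "MiniCriterion":
        # a | b: either operand may pass.
        return MiniCriterion(
            lambda v: self.evaluate(v) or other.evaluate(v),
            self.metric_names | other.metric_names,
        )

    def __invert__(self) -> "MiniCriterion":
        # ~a: passes exactly when a would fail.
        return MiniCriterion(lambda v: not self.evaluate(v), self.metric_names)


good_fold = MiniCriterion.gt("plddt", 80)
tight = MiniCriterion.lt("rmsd", 2.0)
values = {"plddt": 85.0, "rmsd": 3.1}  # made-up values for illustration

both = (good_fold & tight).evaluate(values)
either = (good_fold | tight).evaluate(values)
not_tight = (~tight).evaluate(values)
```

Because & and | bind more tightly than comparison operators in Python, compound expressions are easiest to read with explicit parentheses.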

metric_names property

metric_names: frozenset[str]

Names of all metrics this criterion references.

gt classmethod

gt(metric: str, threshold: float) -> Criterion

metric > threshold.

ge classmethod

ge(metric: str, threshold: float) -> Criterion

metric >= threshold.

lt classmethod

lt(metric: str, threshold: float) -> Criterion

metric < threshold.

le classmethod

le(metric: str, threshold: float) -> Criterion

metric <= threshold.

eq classmethod

eq(metric: str, value: Any) -> Criterion

metric == value.

ne classmethod

ne(metric: str, value: Any) -> Criterion

metric != value.

evaluate

evaluate(values: Mapping[str, Any]) -> bool

Return True iff values satisfies this criterion.

Parameters:

Name Type Description Default
values Mapping[str, Any]

Dict mapping metric names to their measured values. Must contain every metric this criterion references (see :attr:metric_names).

required

Raises:

Type Description
KeyError

If a referenced metric is missing from values. Missing metrics are treated as a programming error, not a failed criterion; if you want "missing = fail", filter upstream or use None explicitly (None always fails an atomic comparison).
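The difference between a missing metric and an explicit None can be illustrated with a small stand-in. The gt helper below is hypothetical and only mirrors the behavior described above (KeyError on a missing key, failure on None); padding absent metrics with None is one possible way to get "missing = fail".

```python
from typing import Any, Callable, Mapping


def gt(metric: str, threshold: float) -> Callable[[Mapping[str, Any]], bool]:
    """Hypothetical stand-in for an atomic 'metric > threshold' check."""

    def check(values: Mapping[str, Any]) -> bool:
        value = values[metric]  # raises KeyError if the metric is absent
        if value is None:       # an explicit None always fails the comparison
            return False
        return value > threshold

    return check


plddt_ok = gt("esmfold.plddt", 80)
required = {"esmfold.plddt"}
measured = {"esmfold.tm": 0.7}  # made-up values; pLDDT was never computed

# A missing metric surfaces as an error rather than a silent failure.
try:
    plddt_ok(measured)
    outcome = "evaluated"
except KeyError:
    outcome = "missing metric"

# Opting into "missing = fail": pad absent metrics with None before evaluating.
padded = {name: measured.get(name) for name in required}
padded_result = plddt_ok(padded)
```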

NamedCriterion dataclass

NamedCriterion(
    name: str, criterion: Criterion, description: str = ""
)

A criterion paired with a human-readable name.

Useful when you want diagnostics that say "design passed fold_quality but failed solubility" rather than dumping the full criterion expression.

Verdict dataclass

Verdict(
    design_id: str,
    values: dict[str, Any] = dict(),
    criteria_results: dict[str, bool] = dict(),
    passed: bool = False,
    score: float = float("inf"),
    metadata: dict[str, Any] = dict(),
)

The result of validating a single design.

Attributes:

Name Type Description
design_id str

An identifier for the design (sequence string, file path, index, etc.). Used as the join key in :func:consensus across validators.

values dict[str, Any]

Dict of all metric values measured for this design. Keys are metric names; values are typically floats but can be anything criteria evaluate against.

criteria_results dict[str, bool]

Dict mapping criterion name to its pass/fail result.

passed bool

True iff every criterion passed (the overall verdict).

score float

A single scalar used for ranking. Smaller-is-better by convention (matches ProteinMPNN and most folding-confidence scores when negated); if you have a larger-is-better metric, store -value here.

metadata dict[str, Any]

Engine-specific extras (validator name, runtime, etc.).

failed_criteria property

failed_criteria: list[str]

Names of criteria that failed (empty if everything passed).

passed_criteria property

passed_criteria: list[str]

Names of criteria that passed.
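Both properties can be derived from criteria_results alone. The following is a minimal sketch using a hypothetical MiniVerdict dataclass with the same field names as above (metadata omitted for brevity); the real class may compute them differently.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class MiniVerdict:
    """Hypothetical stand-in mirroring the documented Verdict fields."""

    design_id: str
    values: dict[str, Any] = field(default_factory=dict)
    criteria_results: dict[str, bool] = field(default_factory=dict)
    passed: bool = False
    score: float = float("inf")

    @property
    def failed_criteria(self) -> list[str]:
        return [name for name, ok in self.criteria_results.items() if not ok]

    @property
    def passed_criteria(self) -> list[str]:
        return [name for name, ok in self.criteria_results.items() if ok]


# Made-up values for illustration.
v = MiniVerdict(
    design_id="design_007",
    values={"esmfold.plddt": 88.0, "esmfold.rmsd": 2.6},
    criteria_results={"fold_quality": True, "rmsd_ok": False},
    passed=False,
    score=-88.0,  # larger-is-better pLDDT stored negated, per the score convention
)
summary = f"{v.design_id}: passed {v.passed_criteria}, failed {v.failed_criteria}"
```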

consensus

consensus(
    verdict_lists: Mapping[str, Iterable[Verdict]],
    *,
    mode: str = "all",
    threshold: int | None = None,
) -> list[Verdict]

Combine verdicts from multiple validators into a single consensus list.

Each input list is the result of running the same set of designs through one validator. Designs are joined by design_id, so every validator must have produced a verdict for every design (use :func:cross_validate with multiple validators, or run each separately with consistent IDs).

Parameters:

Name Type Description Default
verdict_lists Mapping[str, Iterable[Verdict]]

Dict mapping validator name to its list of Verdicts. The dict keys are folded into the merged values so per-validator metrics stay distinguishable.

required
mode str

Consensus rule:

  • "all" (default): every validator must mark the design as passed.
  • "any": any validator suffices.
  • "majority": more than half of validators must pass.
  • "threshold": at least threshold validators must pass (provide threshold).

'all'
threshold int | None

Required when mode="threshold"; the minimum number of validators that must mark the design as passed.

None

Returns:

Type Description
list[Verdict]

A list of consensus :class:Verdict instances, one per design, with metrics from every validator merged into values and passed set per the chosen rule.

Raises:

Type Description
ValueError

If validator verdict lists disagree on design IDs, or for invalid mode/threshold combos.

Example::

esm_verdicts = cross_validate(seqs, validators={"esmfold": fn1}, criteria=c)
af_verdicts  = cross_validate(seqs, validators={"alphafold": fn2}, criteria=c)

# Only accept designs both folding models agree are good
joint = consensus(
    {"esm": esm_verdicts, "af": af_verdicts},
    mode="all",
)
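The four consensus modes reduce to a vote count over per-validator pass flags. The sketch below assumes that reading of the rules; consensus_passed is a hypothetical helper written for illustration, not a function exported by molforge.

```python
from typing import Mapping, Optional


def consensus_passed(
    votes: Mapping[str, bool],
    mode: str = "all",
    threshold: Optional[int] = None,
) -> bool:
    """Decide one design's consensus from per-validator pass/fail votes."""
    n_total = len(votes)
    n_pass = sum(1 for ok in votes.values() if ok)
    if mode == "all":
        return n_pass == n_total
    if mode == "any":
        return n_pass >= 1
    if mode == "majority":
        return n_pass * 2 > n_total  # strictly more than half
    if mode == "threshold":
        if threshold is None:
            raise ValueError('mode="threshold" requires a threshold')
        return n_pass >= threshold
    raise ValueError(f"unknown mode: {mode!r}")


# Made-up votes from three validators for a single design.
votes = {"esm": True, "af": False, "third": True}
by_mode = {
    "all": consensus_passed(votes, mode="all"),
    "any": consensus_passed(votes, mode="any"),
    "majority": consensus_passed(votes, mode="majority"),
    "threshold_2": consensus_passed(votes, mode="threshold", threshold=2),
}
```

Note that with exactly two validators and one passing vote, "majority" fails under this reading, because one is not more than half of two.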

cross_validate

cross_validate(
    designs: Iterable[Any],
    *,
    validators: Mapping[str, Validator],
    criteria: CriteriaSet,
    design_id: Callable[[Any], str] | None = None,
    score_metric: str | None = None,
    on_error: str = "raise",
) -> list[Verdict]

Run every design through every validator; collect verdicts.

Parameters:

Name Type Description Default
designs Iterable[Any]

Iterable of designs (typically a list of Protein / DesignedSequence / sequence strings).

required
validators Mapping[str, Validator]

Dict mapping validator name to a callable that consumes a design and returns a dict of metric values. Validator names are prepended to each metric key so values from different validators don't collide in the verdict dict — e.g. validators={"esmfold": fn} produces keys like "esmfold.plddt".

required
criteria CriteriaSet

The :class:CriteriaSet to apply. Should reference metric keys in their fully-qualified form ("esmfold.plddt" not just "plddt").

required
design_id Callable[[Any], str] | None

Optional function to extract a string ID from each design. Defaults to str(design) truncated to 60 chars.

None
score_metric str | None

Name of the metric (after qualification) to use as the sortable :attr:Verdict.score. If None, the score is the count of failed criteria (so passed designs sort to the front).

None
on_error str

How to handle exceptions raised by a validator:

  • "raise" (default): propagate the exception immediately. This is the default because a validator that throws is almost always a bug — a misconfigured engine, a missing dependency, a bad input — and silently swallowing it produces a full list of passed=False verdicts that looks like a real result. Failing loud surfaces the problem at once.
  • "record": catch the exception, record it under verdict.metadata["validator_errors"], mark that verdict passed=False, and carry on to the next design. Opt into this when you genuinely want a large batch to survive individual validator failures (e.g. an overnight screen where one bad design shouldn't abort the run).

Changed in version 0.2: The default flipped from "record" to "raise". Code that relied on the old fault-tolerant default must now pass on_error="record" explicitly.

'raise'

Returns:

Type Description
list[Verdict]

One :class:Verdict per design, in input order.

Example::

criteria = (
    CriteriaSet()
    .add("fold_quality", Criterion.gt("esmfold.plddt", 80))
    .add("backbone_match", Criterion.gt("esmfold.tm", 0.5))
    .add("rmsd_ok", Criterion.lt("esmfold.rmsd", 2.0))
)

def esmfold_validator(seq):
    predicted = esm_engine.predict(seq)
    return {
        "plddt": predicted.metadata["mean_confidence"],
        "tm":    tm_score(predicted, target_backbone),
        "rmsd":  rmsd(predicted, target_backbone, subset="ca"),
    }

verdicts = cross_validate(
    designs=sequences,
    validators={"esmfold": esmfold_validator},
    criteria=criteria,
    score_metric="esmfold.plddt",
)

for v in rank_verdicts(verdicts, only_passed=True):
    print(v.design_id, v.values["esmfold.plddt"])
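Two behaviors described in the parameters, key qualification and the on_error switch, can be sketched in a few lines. The run_validators helper and toy_validator below are hypothetical and only approximate the documented behavior; they are not how cross_validate is implemented.

```python
from typing import Any, Callable, Mapping


def run_validators(
    design: Any,
    validators: Mapping[str, Callable[[Any], Mapping[str, Any]]],
    on_error: str = "raise",
) -> tuple[dict[str, Any], dict[str, str]]:
    """Return (qualified metric values, recorded errors) for one design."""
    values: dict[str, Any] = {}
    errors: dict[str, str] = {}
    for name, fn in validators.items():
        try:
            metrics = fn(design)
        except Exception as exc:
            if on_error == "raise":
                raise  # fail loud: a throwing validator is usually a bug
            errors[name] = repr(exc)  # "record": keep going, remember why
            continue
        for key, value in metrics.items():
            values[f"{name}.{key}"] = value  # e.g. "esmfold.plddt"
    return values, errors


def toy_validator(seq: str) -> dict[str, float]:
    """Toy validator with made-up numbers; rejects empty sequences."""
    if not seq:
        raise ValueError("empty sequence")
    return {"plddt": 90.0, "rmsd": 1.5}


validators = {"esmfold": toy_validator}
ok_values, ok_errors = run_validators("MKVLA", validators)
bad_values, bad_errors = run_validators("", validators, on_error="record")
default_id = str("MKVLA" * 20)[:60]  # the documented default ID: str(design), 60 chars
```

In "record" mode the failing design simply ends up with no metrics from that validator, which is why the documentation pairs it with passed=False.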

rank_verdicts

rank_verdicts(
    verdicts: Iterable[Verdict],
    *,
    only_passed: bool = False,
    by: str | None = None,
) -> list[Verdict]

Sort verdicts for ranking.

Parameters:

Name Type Description Default
verdicts Iterable[Verdict]

Verdicts to sort.

required
only_passed bool

If True, drop verdicts that didn't pass before sorting. Useful for "show me the successful designs".

False
by str | None

Sort key. None (default) sorts by verdict.score ascending (lower = better). Otherwise sort by the named metric value in verdict.values, ascending.

None

Returns:

Type Description
list[Verdict]

A new list, sorted. Original order is preserved within equal-score groups (stable sort).
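The ranking behavior maps directly onto Python's built-in sorted, which is stable. The sketch below uses a hypothetical MiniVerdict and rank function with made-up scores, assuming larger-is-better metrics such as pLDDT have been stored negated in score.

```python
from dataclasses import dataclass, field
from typing import Any, Iterable, Optional


@dataclass
class MiniVerdict:
    """Hypothetical stand-in carrying only the fields ranking needs."""

    design_id: str
    passed: bool
    score: float
    values: dict[str, Any] = field(default_factory=dict)


def rank(
    verdicts: Iterable[MiniVerdict],
    only_passed: bool = False,
    by: Optional[str] = None,
) -> list[MiniVerdict]:
    """Filter (optionally), then sort ascending; sorted() keeps ties in input order."""
    items = [v for v in verdicts if v.passed or not only_passed]
    if by is None:
        return sorted(items, key=lambda v: v.score)
    return sorted(items, key=lambda v: v.values[by])


# Scores are negated pLDDT values, so the most confident design sorts first.
verdicts = [
    MiniVerdict("a", True, -85.0, {"esmfold.rmsd": 1.9}),
    MiniVerdict("b", False, -91.0, {"esmfold.rmsd": 2.7}),
    MiniVerdict("c", True, -85.0, {"esmfold.rmsd": 1.2}),
    MiniVerdict("d", True, -93.0, {"esmfold.rmsd": 1.6}),
]

by_score = [v.design_id for v in rank(verdicts)]
passed_only = [v.design_id for v in rank(verdicts, only_passed=True)]
by_rmsd = [v.design_id for v in rank(verdicts, by="esmfold.rmsd")]
```

Designs "a" and "c" share a score, so they keep their input order in the score-based rankings.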