Cross-engine validation: when one validator isn't enough

A single folding model can mispredict a structure while still reporting high confidence. The standard mitigation in de novo design is to run multiple folding models and accept only the designs that all of them confirm. This notebook shows the molforge mechanism for that pattern:

Wrap each validator (folding model + scoring) as a Python callable returning a metric dict.
Run cross_validate against each.
Combine results with consensus.

For reproducibility we use two synthetic validators that deterministically mimic ESMFold-like and AlphaFold-like output distributions. This keeps the notebook executable end-to-end without a GPU. In real use, the validator functions wrap the actual engines.

import molforge as mf
import numpy as np
from molforge.validation import (
    Criterion, CriteriaSet, cross_validate, consensus, rank_verdicts,
)
print(f"molforge {mf.__version__}")

molforge 0.0.1

1. The candidate designs

We're scoring a small batch of candidate sequences from some upstream design step (e.g. ProteinMPNN output). Here the sequence itself serves as the design ID; in real use the IDs would be the design indices from a generation run.

# Five candidate sequences with deliberately varied quality
candidates = [
    "MKEALEELRRRYGGG",   # A — strong, both models agree
    "MEEELKEAVRRAYGS",   # B — strong, both models agree
    "MAEELKAAVDRGYGG",   # C — borderline, one model disagrees
    "MKAALKEALDRAYGG",   # D — weak, both models reject
    "MQQEEILSAVEDPHK",   # E — weak, both models reject
]
print(f"Scoring {len(candidates)} candidate sequences")

Scoring 5 candidate sequences

2. The synthetic validators

In production these would wrap ESMFold().predict(...) and AlphaFold().predict(...) and compute metrics. For the demo, the two validators below return deterministic dicts keyed on sequence:

# Each validator returns: plddt, tm, rmsd, lddt
# Deliberately picked so:
#   - A, B pass cleanly in both
#   - C passes ESM but fails AF (one model overconfident)
#   - D fails both
#   - E fails both, much worse

ESM_TABLE = {
    "MKEALEELRRRYGGG": {"plddt": 87.2, "tm": 0.892, "rmsd": 0.95, "lddt": 0.821},
    "MEEELKEAVRRAYGS": {"plddt": 84.1, "tm": 0.851, "rmsd": 1.21, "lddt": 0.789},
    "MAEELKAAVDRGYGG": {"plddt": 82.4, "tm": 0.798, "rmsd": 1.43, "lddt": 0.732},
    "MKAALKEALDRAYGG": {"plddt": 71.8, "tm": 0.526, "rmsd": 1.97, "lddt": 0.684},
    "MQQEEILSAVEDPHK": {"plddt": 62.3, "tm": 0.382, "rmsd": 4.05, "lddt": 0.521},
}

AF_TABLE = {
    "MKEALEELRRRYGGG": {"plddt": 91.5, "tm": 0.901, "rmsd": 0.79, "lddt": 0.864},
    "MEEELKEAVRRAYGS": {"plddt": 88.7, "tm": 0.872, "rmsd": 1.04, "lddt": 0.832},
    "MAEELKAAVDRGYGG": {"plddt": 74.2, "tm": 0.412, "rmsd": 2.34, "lddt": 0.620},  # AF disagrees!
    "MKAALKEALDRAYGG": {"plddt": 69.5, "tm": 0.498, "rmsd": 2.12, "lddt": 0.655},
    "MQQEEILSAVEDPHK": {"plddt": 58.7, "tm": 0.341, "rmsd": 4.92, "lddt": 0.487},
}

def esmfold_validator(seq):
    return ESM_TABLE[seq]

def alphafold_validator(seq):
    return AF_TABLE[seq]

print("Validators ready.")

Validators ready.
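
Before handing validators to the orchestration layer, it can help to confirm that each one returns the same metric keys for every candidate. The sketch below is plain Python and does not touch molforge; `check_validator`, `REQUIRED_KEYS`, `STUB_TABLE`, and `stub_validator` are hypothetical names introduced only for this illustration.

```python
from typing import Callable, Dict, Iterable, List

# Assumption: the metric keys every validator in this notebook returns.
REQUIRED_KEYS = {"plddt", "tm", "rmsd", "lddt"}

Validator = Callable[[str], Dict[str, float]]


def check_validator(validator: Validator, sequences: Iterable[str]) -> List[str]:
    """Return human-readable problems; an empty list means no issue was found."""
    problems = []
    for seq in sequences:
        try:
            metrics = validator(seq)
        except KeyError:
            # Table-backed validators raise KeyError for unknown sequences.
            problems.append(f"{seq}: no entry for this sequence")
            continue
        missing = REQUIRED_KEYS - set(metrics)
        if missing:
            problems.append(f"{seq}: missing metrics {sorted(missing)}")
    return problems


# A stub table with one complete entry and one incomplete entry.
STUB_TABLE = {
    "MKEALEELRRRYGGG": {"plddt": 87.2, "tm": 0.892, "rmsd": 0.95, "lddt": 0.821},
    "MEEELKEAVRRAYGS": {"plddt": 84.1, "tm": 0.851},
}


def stub_validator(seq: str) -> Dict[str, float]:
    return STUB_TABLE[seq]


problems = check_validator(
    stub_validator,
    ["MKEALEELRRRYGGG", "MEEELKEAVRRAYGS", "MAEELKAAVDRGYGG"],
)
for problem in problems:
    print(problem)
```

Running the same check against esmfold_validator and alphafold_validator with the full candidates list should report no problems, since both tables cover all five sequences with all four metrics.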

3. Success criteria

A commonly used success bar for de novo design (in the style of Watson et al. 2023):

pLDDT > 80 (model is confident)
TM-score > 0.5 (same fold)
RMSD < 2.0 Å (close geometric match)

Expressed declaratively:

def make_criteria(prefix):
    """Build the standard success criteria, namespaced to a validator."""
    return (
        CriteriaSet()
        .add("plddt_ok", Criterion.gt(f"{prefix}.plddt", 80.0))
        .add("tm_ok",    Criterion.gt(f"{prefix}.tm", 0.5))
        .add("rmsd_ok",  Criterion.lt(f"{prefix}.rmsd", 2.0))
    )

# We'll need one CriteriaSet per validator (they reference namespaced metrics)
esm_criteria = make_criteria("esmfold")
af_criteria  = make_criteria("alphafold")
print(f"Each set: {list(esm_criteria.criteria.keys())}")

Each set: ['plddt_ok', 'tm_ok', 'rmsd_ok']
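
To make the thresholds concrete, here is the same success bar written as an ordinary function. This is a sketch in plain Python, not how molforge evaluates a CriteriaSet; `passes_success_bar` is a hypothetical name, and the sketch assumes a design passes only when every criterion passes, which matches the verdicts printed in this notebook.

```python
from typing import Dict


def passes_success_bar(metrics: Dict[str, float]) -> Dict[str, bool]:
    """Evaluate the three thresholds against one validator's metric dict."""
    return {
        "plddt_ok": metrics["plddt"] > 80.0,  # model is confident
        "tm_ok": metrics["tm"] > 0.5,         # same fold
        "rmsd_ok": metrics["rmsd"] < 2.0,     # close geometric match (angstroms)
    }


# ESMFold-like metrics for candidate D, copied from ESM_TABLE in section 2.
candidate_d = {"plddt": 71.8, "tm": 0.526, "rmsd": 1.97, "lddt": 0.684}

results = passes_success_bar(candidate_d)
overall = all(results.values())

print(results)
print("PASS" if overall else "FAIL")
```

Candidate D clears the TM-score and RMSD thresholds but misses the pLDDT threshold, so the design as a whole fails.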

4. Single-validator pass

Run each validator independently. This produces two list[Verdict] sequences, one per validator, with the same design_id ordering.

esm_verdicts = cross_validate(
    designs=candidates,
    validators={"esmfold": esmfold_validator},
    criteria=esm_criteria,
)
af_verdicts = cross_validate(
    designs=candidates,
    validators={"alphafold": alphafold_validator},
    criteria=af_criteria,
)

print("ESMFold's verdict:")
for v in esm_verdicts:
    print(f"  {v.design_id}  {'PASS' if v.passed else 'FAIL'}  "
          f"plddt={v.values['esmfold.plddt']:.1f}  "
          f"tm={v.values['esmfold.tm']:.3f}  "
          f"rmsd={v.values['esmfold.rmsd']:.2f}")
print()
print("AlphaFold's verdict:")
for v in af_verdicts:
    print(f"  {v.design_id}  {'PASS' if v.passed else 'FAIL'}  "
          f"plddt={v.values['alphafold.plddt']:.1f}  "
          f"tm={v.values['alphafold.tm']:.3f}  "
          f"rmsd={v.values['alphafold.rmsd']:.2f}")

ESMFold's verdict:
  MKEALEELRRRYGGG  PASS  plddt=87.2  tm=0.892  rmsd=0.95
  MEEELKEAVRRAYGS  PASS  plddt=84.1  tm=0.851  rmsd=1.21
  MAEELKAAVDRGYGG  PASS  plddt=82.4  tm=0.798  rmsd=1.43
  MKAALKEALDRAYGG  FAIL  plddt=71.8  tm=0.526  rmsd=1.97
  MQQEEILSAVEDPHK  FAIL  plddt=62.3  tm=0.382  rmsd=4.05

AlphaFold's verdict:
  MKEALEELRRRYGGG  PASS  plddt=91.5  tm=0.901  rmsd=0.79
  MEEELKEAVRRAYGS  PASS  plddt=88.7  tm=0.872  rmsd=1.04
  MAEELKAAVDRGYGG  FAIL  plddt=74.2  tm=0.412  rmsd=2.34
  MKAALKEALDRAYGG  FAIL  plddt=69.5  tm=0.498  rmsd=2.12
  MQQEEILSAVEDPHK  FAIL  plddt=58.7  tm=0.341  rmsd=4.92

Notice candidate C (MAEELKAAVDRGYGG): ESMFold says PASS, AlphaFold says FAIL. This is exactly the kind of disagreement the consensus pattern catches.
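
With five designs the disagreement is easy to spot by eye; with thousands it is worth listing programmatically. The sketch below uses plain dicts with the pass/fail values copied from the two printouts above rather than molforge Verdict objects, and `disagreements` is a hypothetical helper name.

```python
from typing import Dict, List

# Pass/fail per design, copied from the ESMFold and AlphaFold printouts.
esm_passed = {
    "MKEALEELRRRYGGG": True,
    "MEEELKEAVRRAYGS": True,
    "MAEELKAAVDRGYGG": True,
    "MKAALKEALDRAYGG": False,
    "MQQEEILSAVEDPHK": False,
}
af_passed = {
    "MKEALEELRRRYGGG": True,
    "MEEELKEAVRRAYGS": True,
    "MAEELKAAVDRGYGG": False,
    "MKAALKEALDRAYGG": False,
    "MQQEEILSAVEDPHK": False,
}


def disagreements(a: Dict[str, bool], b: Dict[str, bool]) -> List[str]:
    """Return design IDs (in a's order) where the two validators differ."""
    return [design for design in a if a[design] != b[design]]


split = disagreements(esm_passed, af_passed)
print(split)
```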

5. Consensus: combining the validators

consensus() merges verdict lists. The mode parameter decides the combining rule:

# Strict: both validators must pass
strict = consensus(
    {"esm": esm_verdicts, "af": af_verdicts},
    mode="all",
)
print("STRICT (both must pass):")
for v in strict:
    n = v.metadata["n_passed"]
    total = v.metadata["n_validators"]
    print(f"  {v.design_id}  {'PASS' if v.passed else 'FAIL'}  "
          f"({n}/{total} validators passed)")

STRICT (both must pass):
  MKEALEELRRRYGGG  PASS  (2/2 validators passed)
  MEEELKEAVRRAYGS  PASS  (2/2 validators passed)
  MAEELKAAVDRGYGG  FAIL  (1/2 validators passed)
  MKAALKEALDRAYGG  FAIL  (0/2 validators passed)
  MQQEEILSAVEDPHK  FAIL  (0/2 validators passed)

# Permissive: at least one validator must pass
permissive = consensus(
    {"esm": esm_verdicts, "af": af_verdicts},
    mode="any",
)
print("PERMISSIVE (at least one must pass):")
for v in permissive:
    n = v.metadata["n_passed"]
    total = v.metadata["n_validators"]
    print(f"  {v.design_id}  {'PASS' if v.passed else 'FAIL'}  "
          f"({n}/{total} validators passed)")

PERMISSIVE (at least one must pass):
  MKEALEELRRRYGGG  PASS  (2/2 validators passed)
  MEEELKEAVRRAYGS  PASS  (2/2 validators passed)
  MAEELKAAVDRGYGG  PASS  (1/2 validators passed)
  MKAALKEALDRAYGG  FAIL  (0/2 validators passed)
  MQQEEILSAVEDPHK  FAIL  (0/2 validators passed)
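
The two combining rules reduce to Python's all() and any() over the per-validator pass flags. The sketch below illustrates that logic only; it is not molforge's implementation of consensus(), and `combine` is a hypothetical name. The empty-list guard is a choice made for this sketch, since all([]) would otherwise return True.

```python
from typing import List


def combine(passes: List[bool], mode: str) -> bool:
    """Combine per-validator pass flags for one design."""
    if not passes:
        raise ValueError("need at least one validator result")
    if mode == "all":
        return all(passes)
    if mode == "any":
        return any(passes)
    raise ValueError(f"unknown mode: {mode!r}")


# (ESMFold, AlphaFold) pass flags for candidates A, C, and D from section 4.
per_design = {
    "MKEALEELRRRYGGG": [True, True],
    "MAEELKAAVDRGYGG": [True, False],
    "MKAALKEALDRAYGG": [False, False],
}

summary = {}
for design, passes in per_design.items():
    summary[design] = (combine(passes, "all"), combine(passes, "any"))
    print(f"{design}  all={summary[design][0]}  any={summary[design][1]}  "
          f"({sum(passes)}/{len(passes)} validators passed)")
```

Only the split design changes outcome between the two modes, which is why the choice of mode matters most for borderline candidates.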

6. Inspecting per-validator detail

The merged verdict carries the full metric dict from every validator (each metric stays namespaced to its source, e.g. esmfold.plddt) plus per-criterion pass/fail results, each prefixed with the key its verdict list was given in the consensus() call (esm or af here):

# Look at the borderline case in detail
borderline = next(v for v in strict if v.design_id == "MAEELKAAVDRGYGG")
print(f"Design: {borderline.design_id}")
print(f"Passed overall: {borderline.passed}")
print(f"Metric values:")
for k in sorted(borderline.values):
    print(f"  {k} = {borderline.values[k]}")
print(f"Per-validator criteria:")
for k in sorted(borderline.criteria_results):
    print(f"  {k}: {'PASS' if borderline.criteria_results[k] else 'FAIL'}")

Design: MAEELKAAVDRGYGG
Passed overall: False
Metric values:
  alphafold.lddt = 0.62
  alphafold.plddt = 74.2
  alphafold.rmsd = 2.34
  alphafold.tm = 0.412
  esmfold.lddt = 0.732
  esmfold.plddt = 82.4
  esmfold.rmsd = 1.43
  esmfold.tm = 0.798
Per-validator criteria:
  af.plddt_ok: FAIL
  af.rmsd_ok: FAIL
  af.tm_ok: FAIL
  esm.plddt_ok: PASS
  esm.rmsd_ok: PASS
  esm.tm_ok: PASS

You can see exactly why the design was rejected: AlphaFold disagreed with ESMFold on all three criteria. If you had only run ESMFold, you would have accepted this design.
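
The namespaced keys in the printout can be reproduced with a small dict comprehension. This is a sketch of the naming scheme only, written in plain Python; `namespace` is a hypothetical helper, and how molforge builds these keys internally is not shown in this notebook.

```python
from typing import Dict


def namespace(name: str, metrics: Dict[str, float]) -> Dict[str, float]:
    """Prefix every metric key with the validator name."""
    return {f"{name}.{key}": value for key, value in metrics.items()}


# Candidate C's metrics, copied from ESM_TABLE and AF_TABLE in section 2.
esm_metrics = {"plddt": 82.4, "tm": 0.798, "rmsd": 1.43, "lddt": 0.732}
af_metrics = {"plddt": 74.2, "tm": 0.412, "rmsd": 2.34, "lddt": 0.620}

# Prefixing keeps the two plddt values from overwriting each other on merge.
merged = {**namespace("esmfold", esm_metrics), **namespace("alphafold", af_metrics)}

for key in sorted(merged):
    print(f"  {key} = {merged[key]}")
```

Without the prefix, merging the two dicts would silently keep only the second validator's values for the shared keys.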

7. Ranking the survivors

Use rank_verdicts to sort and filter:

# Take the strict consensus, keep only passing designs, then sort by
# ESMFold's pLDDT (ascending, lowest first, as the printed output shows)
survivors = rank_verdicts(strict, only_passed=True, by="esmfold.plddt")
print("Surviving designs ranked by esmfold.plddt (lower-first):")
for v in survivors:
    print(f"  {v.design_id}  "
          f"esm.plddt={v.values['esmfold.plddt']:.1f}  "
          f"af.plddt={v.values['alphafold.plddt']:.1f}")

Surviving designs ranked by esmfold.plddt (lower-first):
  MEEELKEAVRRAYGS  esm.plddt=84.1  af.plddt=88.7
  MKEALEELRRRYGGG  esm.plddt=87.2  af.plddt=91.5
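
The output above is lowest-first. If you want the most confident design at the top, one option that relies only on the standard library is to re-sort the result yourself. The sketch below uses plain dicts standing in for Verdict objects and does not assume any rank_verdicts parameter beyond those shown above.

```python
# Plain-dict stand-ins for the two surviving verdicts printed above.
survivor_rows = [
    {"design_id": "MEEELKEAVRRAYGS",
     "values": {"esmfold.plddt": 84.1, "alphafold.plddt": 88.7}},
    {"design_id": "MKEALEELRRRYGGG",
     "values": {"esmfold.plddt": 87.2, "alphafold.plddt": 91.5}},
]


def by_metric(rows, metric, descending=False):
    """Sort rows by one namespaced metric; sorted() is stable for ties."""
    return sorted(rows, key=lambda row: row["values"][metric], reverse=descending)


ascending = by_metric(survivor_rows, "esmfold.plddt")
descending = by_metric(survivor_rows, "esmfold.plddt", descending=True)

print([row["design_id"] for row in ascending])
print([row["design_id"] for row in descending])
```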

8. The real-world pattern

In a production design pipeline, the only thing that changes vs. this notebook is the body of the validator functions. The orchestration layer doesn't care whether you're calling synthetic stubs or real GPU-bound engines:

from molforge.wrappers.folding import ESMFold, AlphaFold
from molforge.structure import rmsd
from molforge.metrics import tm_score, lddt

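# target_backbone (the designed backbone each prediction is compared against)
# is assumed to have been loaded earlier in the pipeline.
esm = ESMFold(device="cuda")
af  = AlphaFold(num_models=1)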

def esmfold_validator(seq):
    predicted = esm.predict(seq)
    return {
        "plddt": predicted.metadata["mean_confidence"],
        "tm":    tm_score(predicted, target_backbone),
        "rmsd":  rmsd(predicted, target_backbone, subset="ca"),
        "lddt":  lddt(predicted, target_backbone),
    }

def alphafold_validator(seq):
    predicted = af.predict(seq)
    return {
        "plddt": predicted.metadata["mean_confidence"],
        "tm":    tm_score(predicted, target_backbone),
        "rmsd":  rmsd(predicted, target_backbone, subset="ca"),
        "lddt":  lddt(predicted, target_backbone),
    }

# The rest is identical to this notebook.
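
The two validator bodies above differ only in which engine they call, so a small factory can remove the duplication. The sketch below is one possible way to do that, not part of molforge: `make_validator`, `fake_predict_a`, `fake_predict_b`, and `shared_score` are hypothetical names, and the fake predict functions stand in for the real engines so the example runs without a GPU.

```python
from typing import Any, Callable, Dict

Prediction = Dict[str, Any]  # stand-in for whatever an engine's predict returns


def make_validator(
    predict: Callable[[str], Prediction],
    score: Callable[[Prediction], Dict[str, float]],
) -> Callable[[str], Dict[str, float]]:
    """Compose an engine's predict step with a shared scoring step."""
    def validator(seq: str) -> Dict[str, float]:
        return score(predict(seq))
    return validator


# Fake engines: real code would pass the engine's own predict method instead.
def fake_predict_a(seq: str) -> Prediction:
    return {"mean_confidence": 87.2, "length": len(seq)}


def fake_predict_b(seq: str) -> Prediction:
    return {"mean_confidence": 91.5, "length": len(seq)}


# Shared scoring: real code would also compute tm, rmsd, and lddt here.
def shared_score(prediction: Prediction) -> Dict[str, float]:
    return {"plddt": prediction["mean_confidence"]}


validators = {
    "esmfold": make_validator(fake_predict_a, shared_score),
    "alphafold": make_validator(fake_predict_b, shared_score),
}

scores = {name: fn("MKEALEELRRRYGGG") for name, fn in validators.items()}
print(scores)
```

Keeping the scoring step in one place also guarantees that every validator returns the same metric keys, which the namespaced criteria depend on.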

A few practical notes from running real pipelines:

AlphaFold-AlphaFold consensus (multiple AlphaFold seeds / models) catches model-internal variance.
Cross-architecture consensus (ESMFold + AlphaFold) catches architecture-specific biases. A pattern often reported in design pipelines is that ESMFold tends to be overconfident on designs that AlphaFold flags as problematic.
Three-way consensus (ESMFold + AlphaFold + RoseTTAFold) with mode="majority" is the de facto gold standard when compute is cheap. molforge's consensus() supports this directly.
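
As a rough illustration of the majority rule with three validators, the sketch below counts pass flags in plain Python. It is not molforge's implementation: `majority` is a hypothetical name, the votes are made up for illustration, and treating an even split as a failure is an assumption of this sketch rather than documented molforge behavior.

```python
from typing import List


def majority(passes: List[bool]) -> bool:
    """True when strictly more than half of the validators passed."""
    if not passes:
        raise ValueError("need at least one validator result")
    return sum(passes) > len(passes) / 2


# Made-up votes in the order (ESMFold, AlphaFold, RoseTTAFold).
two_of_three = majority([True, True, False])
one_of_three = majority([True, False, False])
# With an even number of validators, a 1/1 split is not a majority here.
even_split = majority([True, False])

print(two_of_three, one_of_three, even_split)
```

An odd number of validators avoids the tie case entirely, which is one practical reason to add a third engine rather than a fourth.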

What's next

See de_novo_design.ipynb for the upstream stage of this pipeline: backbone and sequence generation with RFdiffusion + ProteinMPNN.
See 06_plugin_authoring.ipynb for how to plug your own validator into molforge's registry so it can be discovered by name.