Cross-engine validation: when one validator isn't enough
A single folding model can over-confidently mis-predict structures. The standard mitigation in de novo design is to run multiple folding models and only accept designs that all of them confirm. This notebook shows the molforge mechanism for that pattern:
- Wrap each validator (folding model + scoring) as a Python callable returning a metric dict.
- Run cross_validate against each.
- Combine results with consensus.
For reproducibility we use two synthetic validators that deterministically mimic ESMFold-like and AlphaFold-like output distributions. This keeps the notebook executable end-to-end without a GPU. In real use, the validator functions wrap the actual engines.
import molforge as mf
import numpy as np
from molforge.validation import (
Criterion, CriteriaSet, cross_validate, consensus, rank_verdicts,
)
print(f"molforge {mf.__version__}")
molforge 0.0.1
1. The candidate designs
We're scoring a small batch of candidate sequences from some upstream design step (e.g. ProteinMPNN output). In this notebook the sequence string itself serves as the design ID; in real use these would be the design indices from a generation run.
# Five candidate sequences with deliberately varied quality
candidates = [
"MKEALEELRRRYGGG", # A — strong, both models agree
"MEEELKEAVRRAYGS", # B — strong, both models agree
"MAEELKAAVDRGYGG", # C — borderline, one model disagrees
"MKAALKEALDRAYGG", # D — weak, both models reject
"MQQEEILSAVEDPHK", # E — weak, both models reject
]
print(f"Scoring {len(candidates)} candidate sequences")
Scoring 5 candidate sequences
2. The synthetic validators
In production these would wrap ESMFold().predict(...) and
AlphaFold().predict(...) and compute metrics. For the demo, the
two validators below return deterministic dicts keyed on sequence:
# Each validator returns: plddt, tm, rmsd, lddt
# Deliberately picked so:
# - A, B pass cleanly in both
# - C passes ESM but fails AF (one model overconfident)
# - D fails both
# - E fails both, much worse
ESM_TABLE = {
"MKEALEELRRRYGGG": {"plddt": 87.2, "tm": 0.892, "rmsd": 0.95, "lddt": 0.821},
"MEEELKEAVRRAYGS": {"plddt": 84.1, "tm": 0.851, "rmsd": 1.21, "lddt": 0.789},
"MAEELKAAVDRGYGG": {"plddt": 82.4, "tm": 0.798, "rmsd": 1.43, "lddt": 0.732},
"MKAALKEALDRAYGG": {"plddt": 71.8, "tm": 0.526, "rmsd": 1.97, "lddt": 0.684},
"MQQEEILSAVEDPHK": {"plddt": 62.3, "tm": 0.382, "rmsd": 4.05, "lddt": 0.521},
}
AF_TABLE = {
"MKEALEELRRRYGGG": {"plddt": 91.5, "tm": 0.901, "rmsd": 0.79, "lddt": 0.864},
"MEEELKEAVRRAYGS": {"plddt": 88.7, "tm": 0.872, "rmsd": 1.04, "lddt": 0.832},
"MAEELKAAVDRGYGG": {"plddt": 74.2, "tm": 0.412, "rmsd": 2.34, "lddt": 0.620}, # AF disagrees!
"MKAALKEALDRAYGG": {"plddt": 69.5, "tm": 0.498, "rmsd": 2.12, "lddt": 0.655},
"MQQEEILSAVEDPHK": {"plddt": 58.7, "tm": 0.341, "rmsd": 4.92, "lddt": 0.487},
}
def esmfold_validator(seq):
    return ESM_TABLE[seq]

def alphafold_validator(seq):
    return AF_TABLE[seq]

print("Validators ready.")
Validators ready.
3. Success criteria
The standard de novo design success bar (Watson et al. 2023):
- pLDDT > 80 (model is confident)
- TM-score > 0.5 (same fold)
- RMSD < 2.0 Å (close geometric match)
Expressed declaratively:
def make_criteria(prefix):
    """Build the standard success criteria, namespaced to a validator."""
    return (
        CriteriaSet()
        .add("plddt_ok", Criterion.gt(f"{prefix}.plddt", 80.0))
        .add("tm_ok", Criterion.gt(f"{prefix}.tm", 0.5))
        .add("rmsd_ok", Criterion.lt(f"{prefix}.rmsd", 2.0))
    )

# We'll need one CriteriaSet per validator (they reference namespaced metrics)
esm_criteria = make_criteria("esmfold")
af_criteria = make_criteria("alphafold")
print(f"Each set: {list(esm_criteria.criteria.keys())}")
Each set: ['plddt_ok', 'tm_ok', 'rmsd_ok']
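To make the thresholds concrete, here is a plain-Python sketch of what the three criteria check against a single metric dict. It does not use molforge: `THRESHOLDS` and `check_metrics` are hypothetical names introduced only for illustration, and the strict comparisons mirror the `gt`/`lt` rules listed above.

```python
import operator

# Hypothetical stand-in for the CriteriaSet above:
# criterion name -> (metric key, comparison, threshold).
THRESHOLDS = {
    "plddt_ok": ("plddt", operator.gt, 80.0),
    "tm_ok": ("tm", operator.gt, 0.5),
    "rmsd_ok": ("rmsd", operator.lt, 2.0),
}


def check_metrics(metrics):
    """Return per-criterion pass/fail for one validator's metric dict."""
    return {
        name: compare(metrics[key], threshold)
        for name, (key, compare, threshold) in THRESHOLDS.items()
    }


# Candidate D's ESMFold-like metrics from ESM_TABLE above.
results = check_metrics({"plddt": 71.8, "tm": 0.526, "rmsd": 1.97, "lddt": 0.684})
print(results)                            # {'plddt_ok': False, 'tm_ok': True, 'rmsd_ok': True}
print("passed:", all(results.values()))   # passed: False
```

Candidate D clears the TM-score and RMSD bars but misses on pLDDT, so a rule that requires every criterion to hold rejects it.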
4. Single-validator pass
Run each validator independently. This produces two list[Verdict]
sequences, one per validator, with the same design_id ordering.
esm_verdicts = cross_validate(
designs=candidates,
validators={"esmfold": esmfold_validator},
criteria=esm_criteria,
)
af_verdicts = cross_validate(
designs=candidates,
validators={"alphafold": alphafold_validator},
criteria=af_criteria,
)
print("ESMFold's verdict:")
for v in esm_verdicts:
    print(f" {v.design_id} {'PASS' if v.passed else 'FAIL'} "
          f"plddt={v.values['esmfold.plddt']:.1f} "
          f"tm={v.values['esmfold.tm']:.3f} "
          f"rmsd={v.values['esmfold.rmsd']:.2f}")
print()
print("AlphaFold's verdict:")
for v in af_verdicts:
    print(f" {v.design_id} {'PASS' if v.passed else 'FAIL'} "
          f"plddt={v.values['alphafold.plddt']:.1f} "
          f"tm={v.values['alphafold.tm']:.3f} "
          f"rmsd={v.values['alphafold.rmsd']:.2f}")
ESMFold's verdict:
 MKEALEELRRRYGGG PASS plddt=87.2 tm=0.892 rmsd=0.95
 MEEELKEAVRRAYGS PASS plddt=84.1 tm=0.851 rmsd=1.21
 MAEELKAAVDRGYGG PASS plddt=82.4 tm=0.798 rmsd=1.43
 MKAALKEALDRAYGG FAIL plddt=71.8 tm=0.526 rmsd=1.97
 MQQEEILSAVEDPHK FAIL plddt=62.3 tm=0.382 rmsd=4.05

AlphaFold's verdict:
 MKEALEELRRRYGGG PASS plddt=91.5 tm=0.901 rmsd=0.79
 MEEELKEAVRRAYGS PASS plddt=88.7 tm=0.872 rmsd=1.04
 MAEELKAAVDRGYGG FAIL plddt=74.2 tm=0.412 rmsd=2.34
 MKAALKEALDRAYGG FAIL plddt=69.5 tm=0.498 rmsd=2.12
 MQQEEILSAVEDPHK FAIL plddt=58.7 tm=0.341 rmsd=4.92
Notice candidate C (MAEELKAAVDRGYGG): ESMFold says
PASS, AlphaFold says FAIL. This is exactly the kind of disagreement
the consensus pattern catches.
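Spotting such disagreements by eye does not scale past a handful of designs. A minimal sketch of flagging them programmatically follows; `find_disagreements` is a hypothetical helper, not part of molforge, and the letters A to E are shorthand for the five sequences, with pass/fail flags copied from the printed verdicts above.

```python
def find_disagreements(passed_by_validator):
    """Return design IDs on which the validators do not all agree.

    passed_by_validator maps validator name -> {design_id: passed}.
    """
    design_ids = next(iter(passed_by_validator.values())).keys()
    return [
        design_id
        for design_id in design_ids
        # More than one distinct flag means at least two validators disagree.
        if len({flags[design_id] for flags in passed_by_validator.values()}) > 1
    ]


# Pass/fail flags copied from the two verdict lists printed above.
flags = {
    "esmfold": {"A": True, "B": True, "C": True, "D": False, "E": False},
    "alphafold": {"A": True, "B": True, "C": False, "D": False, "E": False},
}
print(find_disagreements(flags))  # ['C']
```

With real verdict lists, each inner dict could be built from the `design_id` and `passed` attributes used in the print loops above.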
5. Consensus: combining the validators
consensus() merges verdict lists. The mode parameter decides the
combining rule:
# Strict: both validators must pass
strict = consensus(
{"esm": esm_verdicts, "af": af_verdicts},
mode="all",
)
print("STRICT (both must pass):")
for v in strict:
    n = v.metadata["n_passed"]
    total = v.metadata["n_validators"]
    print(f" {v.design_id} {'PASS' if v.passed else 'FAIL'} "
          f"({n}/{total} validators passed)")
STRICT (both must pass):
 MKEALEELRRRYGGG PASS (2/2 validators passed)
 MEEELKEAVRRAYGS PASS (2/2 validators passed)
 MAEELKAAVDRGYGG FAIL (1/2 validators passed)
 MKAALKEALDRAYGG FAIL (0/2 validators passed)
 MQQEEILSAVEDPHK FAIL (0/2 validators passed)
# Permissive: at least one validator must pass
permissive = consensus(
{"esm": esm_verdicts, "af": af_verdicts},
mode="any",
)
print("PERMISSIVE (at least one must pass):")
for v in permissive:
    n = v.metadata["n_passed"]
    total = v.metadata["n_validators"]
    print(f" {v.design_id} {'PASS' if v.passed else 'FAIL'} "
          f"({n}/{total} validators passed)")
PERMISSIVE (at least one must pass):
 MKEALEELRRRYGGG PASS (2/2 validators passed)
 MEEELKEAVRRAYGS PASS (2/2 validators passed)
 MAEELKAAVDRGYGG PASS (1/2 validators passed)
 MKAALKEALDRAYGG FAIL (0/2 validators passed)
 MQQEEILSAVEDPHK FAIL (0/2 validators passed)
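The two combining rules reduce to Python's built-in `all` and `any` over the per-validator pass flags. The sketch below illustrates that logic only; `combine` is a hypothetical function and is not claimed to be how molforge implements `consensus()`.

```python
def combine(passes, mode):
    """Combine per-validator pass flags for one design."""
    if mode == "all":
        return all(passes)   # strict: every validator must pass
    if mode == "any":
        return any(passes)   # permissive: one passing validator is enough
    raise ValueError(f"unsupported mode: {mode!r}")


# Candidate C: ESMFold passed, AlphaFold failed.
candidate_c = [True, False]
print(combine(candidate_c, "all"))  # False
print(combine(candidate_c, "any"))  # True
```

Only designs on which the validators disagree change verdict between the two modes, which is why candidate C is the only row that differs between the two outputs above.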
6. Inspecting per-validator detail
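The merged verdict carries the full metric dict from every validator (each metric stays namespaced to its source validator, e.g. esmfold.plddt) plus per-criterion pass/fail results, each prefixed with the key its verdict list was given in the consensus() call (here esm and af):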
# Look at the borderline case in detail
borderline = next(v for v in strict if v.design_id == "MAEELKAAVDRGYGG")
print(f"Design: {borderline.design_id}")
print(f"Passed overall: {borderline.passed}")
print("Metric values:")
for k in sorted(borderline.values):
    print(f" {k} = {borderline.values[k]}")
print("Per-validator criteria:")
for k in sorted(borderline.criteria_results):
    print(f" {k}: {'PASS' if borderline.criteria_results[k] else 'FAIL'}")
Design: MAEELKAAVDRGYGG
Passed overall: False
Metric values:
 alphafold.lddt = 0.62
 alphafold.plddt = 74.2
 alphafold.rmsd = 2.34
 alphafold.tm = 0.412
 esmfold.lddt = 0.732
 esmfold.plddt = 82.4
 esmfold.rmsd = 1.43
 esmfold.tm = 0.798
Per-validator criteria:
 af.plddt_ok: FAIL
 af.rmsd_ok: FAIL
 af.tm_ok: FAIL
 esm.plddt_ok: PASS
 esm.rmsd_ok: PASS
 esm.tm_ok: PASS
You can see exactly why the design was rejected: AlphaFold disagreed with ESMFold on all three criteria. If you had only run ESMFold, you would have accepted this design.
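The namespacing that keeps each engine's metrics apart can be pictured as a simple dict flattening. This is an illustrative sketch under that assumption; `merge_namespaced` is a hypothetical helper, not a molforge function.

```python
def merge_namespaced(metrics_by_validator):
    """Flatten {validator: {metric: value}} into {"validator.metric": value}."""
    merged = {}
    for validator, metrics in metrics_by_validator.items():
        for name, value in metrics.items():
            # The prefix prevents same-named metrics from overwriting each other.
            merged[f"{validator}.{name}"] = value
    return merged


# Two of candidate C's metrics from each table above.
merged = merge_namespaced({
    "esmfold": {"plddt": 82.4, "tm": 0.798},
    "alphafold": {"plddt": 74.2, "tm": 0.412},
})
for key in sorted(merged):
    print(key, "=", merged[key])
```

Without the prefix, the two `plddt` values would collide and one engine's score would silently replace the other's.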
7. Ranking the survivors
Use rank_verdicts to sort and filter:
# Take the strict consensus, keep only passing designs, then sort by
# ESMFold's pLDDT (ascending, i.e. lowest first, as the output shows)
survivors = rank_verdicts(strict, only_passed=True, by="esmfold.plddt")
print("Surviving designs ranked by esmfold.plddt (lower-first):")
for v in survivors:
    print(f" {v.design_id} "
          f"esm.plddt={v.values['esmfold.plddt']:.1f} "
          f"af.plddt={v.values['alphafold.plddt']:.1f}")
Surviving designs ranked by esmfold.plddt (lower-first):
 MEEELKEAVRRAYGS esm.plddt=84.1 af.plddt=88.7
 MKEALEELRRRYGGG esm.plddt=87.2 af.plddt=91.5
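For a metric like pLDDT, where higher is better, you will usually want the best design first. The plain-Python sketch below shows the filter-then-sort logic with an explicit direction flag. `rank` and its `descending` parameter are hypothetical; the notebook does not show whether rank_verdicts exposes an equivalent option.

```python
def rank(rows, by, only_passed=True, descending=False):
    """Filter to passing rows (optionally), then sort by one metric."""
    kept = [row for row in rows if row["passed"] or not only_passed]
    return sorted(kept, key=lambda row: row["values"][by], reverse=descending)


# A, B, C stand for the first three candidates; values are ESMFold pLDDT from above.
rows = [
    {"id": "A", "passed": True, "values": {"esmfold.plddt": 87.2}},
    {"id": "B", "passed": True, "values": {"esmfold.plddt": 84.1}},
    {"id": "C", "passed": False, "values": {"esmfold.plddt": 82.4}},
]
ascending_ids = [row["id"] for row in rank(rows, by="esmfold.plddt")]
descending_ids = [row["id"] for row in rank(rows, by="esmfold.plddt", descending=True)]
print(ascending_ids)   # ['B', 'A']
print(descending_ids)  # ['A', 'B']
```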
8. The real-world pattern
In a production design pipeline, the only thing that changes vs. this notebook is the body of the validator functions. The orchestration layer doesn't care whether you're calling synthetic stubs or real GPU-bound engines:
from molforge.wrappers.folding import ESMFold, AlphaFold
from molforge.structure import rmsd
from molforge.metrics import tm_score, lddt

esm = ESMFold(device="cuda")
af = AlphaFold(num_models=1)

# NOTE: target_backbone is the design target structure. It is not defined
# in this snippet and must be loaded before the validators are called.

def esmfold_validator(seq):
    predicted = esm.predict(seq)
    return {
        "plddt": predicted.metadata["mean_confidence"],
        "tm": tm_score(predicted, target_backbone),
        "rmsd": rmsd(predicted, target_backbone, subset="ca"),
        "lddt": lddt(predicted, target_backbone),
    }

def alphafold_validator(seq):
    predicted = af.predict(seq)
    return {
        "plddt": predicted.metadata["mean_confidence"],
        "tm": tm_score(predicted, target_backbone),
        "rmsd": rmsd(predicted, target_backbone, subset="ca"),
        "lddt": lddt(predicted, target_backbone),
    }

# The rest is identical to this notebook.
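The two validators above differ only in the engine they call, so one option is to build them from a small factory. The sketch below runs without molforge or a GPU: `make_validator`, `StubEngine`, and the toy `identity` metric are all hypothetical stand-ins, and the only interface assumed of an engine is the `predict(seq)` method and `metadata["mean_confidence"]` field used above.

```python
from types import SimpleNamespace


def make_validator(engine, target, metric_fns):
    """Build a validator callable from any engine exposing predict(seq)."""
    def validator(seq):
        predicted = engine.predict(seq)
        metrics = {"plddt": predicted.metadata["mean_confidence"]}
        for name, fn in metric_fns.items():
            metrics[name] = fn(predicted, target)
        return metrics
    return validator


class StubEngine:
    """Hypothetical stand-in for a folding engine; returns a fixed confidence."""

    def __init__(self, confidence):
        self.confidence = confidence

    def predict(self, seq):
        return SimpleNamespace(sequence=seq, metadata={"mean_confidence": self.confidence})


def identity(predicted, target):
    """Toy metric (not a structural score): fraction of matching positions."""
    matches = sum(a == b for a, b in zip(predicted.sequence, target))
    return matches / len(target)


validator = make_validator(StubEngine(85.0), target="MKEA", metric_fns={"identity": identity})
print(validator("MKEL"))  # {'plddt': 85.0, 'identity': 0.75}
```

In a real pipeline the stub would be replaced by an actual engine and the toy metric by structural metrics such as the TM-score and RMSD calls shown above.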
A few practical notes from running real pipelines:
- AlphaFold-AlphaFold consensus (multiple AlphaFold seeds / models) catches model-internal variance.
- Cross-architecture consensus (ESMFold + AlphaFold) catches architecture-specific biases. The classic finding from the RFdiffusion paper is that ESMFold tends to be overconfident on designs that AlphaFold (with its full MSA pipeline) flags as problematic.
- Three-way consensus (ESMFold + AlphaFold + RoseTTAFold) with mode="majority" is the de facto gold standard when compute is cheap. molforge's consensus() supports this directly.
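As a rough illustration of the majority rule in the last note, the sketch below passes a design when strictly more than half of the validators pass. `majority_pass` is a hypothetical helper, and treating a tie as a failure is an assumption: the notebook does not say how molforge's mode="majority" resolves ties.

```python
def majority_pass(passes):
    """True when strictly more than half of the validators passed."""
    return sum(passes) > len(passes) / 2


# Hypothetical three-way vote for one design.
votes = {"esmfold": True, "alphafold": False, "rosettafold": True}
print(majority_pass(list(votes.values())))  # True  (2 of 3)
print(majority_pass([True, False]))         # False (a 1-1 tie is not a majority)
```

With an odd number of validators ties cannot occur, which is one practical reason to prefer a three-way setup over a two-way one for majority voting.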
What's next
- See de_novo_design.ipynb for the upstream of this pipeline: RFdiffusion + ProteinMPNN backbone / sequence generation.
- See 06_plugin_authoring.ipynb for how to plug your own validator into molforge's registry so it can be discovered by name.