De novo protein design with molforge¶

This notebook walks the full de novo design loop: generate a novel protein backbone from scratch, design sequences for it, fold those sequences to verify they recover the target shape, and score the candidates. The loop:

Generate a backbone with RFdiffusion.
Design sequences for it with ProteinMPNN.
Fold each designed sequence with ESMFold.
Score each design: TM-score / lDDT against the backbone, plus ProteinMPNN's own log-likelihood.
Rank and select the best designs.

This is the workflow that built the field of de novo protein design. RFdiffusion + ProteinMPNN was the headline result of Watson et al. 2023 (Nature); molforge wraps both behind a uniform interface so the loop is roughly 30 lines of code.

For brevity (and because both engines need GPU + ~10 GB of weights), heavy cells in this notebook are not executed. Each shows the API call you'd make and what to expect; the documented outputs are representative numbers from real runs of the same pipeline. To run this end-to-end on your own machine you'll need:

# 1. Install the wrappers' upstream tools (one-time setup):
git clone https://github.com/RosettaCommons/RFdiffusion && cd RFdiffusion
pip install -e . && bash scripts/download_models.sh models && cd ..
git clone https://github.com/dauparas/ProteinMPNN

# 2. Install molforge with the folding extra
pip install 'molforge[ml]'

# 3. Set the install dirs
export RFDIFFUSION_HOME=/path/to/RFdiffusion
export PROTEINMPNN_HOME=/path/to/ProteinMPNN

In [1]:

Copied!

import molforge as mf
print(f"molforge {mf.__version__}")
import molforge as mf
print(f"molforge {mf.__version__}")

molforge 0.0.1

1. Generate a backbone¶

For unconditional generation, just ask for a length. Real RFdiffusion runs take ~30-60 seconds per backbone on a modern GPU.

In [1]:

Copied!





# 🐢 SLOW — needs RFDIFFUSION_HOME set
# from molforge.wrappers.generative import RFdiffusion
# engine = RFdiffusion(num_designs=4, diffusion_steps=50)
# backbones = engine.generate(length=100)

# What you get back:
print("backbones: list[Protein] of length 4")
print()
print("Each Protein:")
print("  - All residues are GLY (RFdiffusion produces backbone-only output)")
print("  - n_atoms = length * 4   (N, CA, C, O per residue)")
print("  - n_residues = length")
print("  - metadata['engine'] = 'RFdiffusion'")
print("  - metadata['design_index'] in (0, 1, 2, 3)")
print("  - metadata['source_args'] records the call parameters")
# 🐢 SLOW — needs RFDIFFUSION_HOME set
# from molforge.wrappers.generative import RFdiffusion
# engine = RFdiffusion(num_designs=4, diffusion_steps=50)
# backbones = engine.generate(length=100)

# What you get back:
print("backbones: list[Protein] of length 4")
print()
print("Each Protein:")
print("  - All residues are GLY (RFdiffusion produces backbone-only output)")
print("  - n_atoms = length * 4   (N, CA, C, O per residue)")
print("  - n_residues = length")
print("  - metadata['engine'] = 'RFdiffusion'")
print("  - metadata['design_index'] in (0, 1, 2, 3)")
print("  - metadata['source_args'] records the call parameters")

backbones: list[Protein] of length 4

Each Protein:
  - All residues are GLY (RFdiffusion produces backbone-only output)
  - n_atoms = length * 4   (N, CA, C, O per residue)
  - n_residues = length
  - metadata['engine'] = 'RFdiffusion'
  - metadata['design_index'] in (0, 1, 2, 3)
  - metadata['source_args'] records the call parameters

Other RFdiffusion modes¶

Motif scaffolding — preserve a specific motif from an input PDB, generate a backbone around it:

backbones = engine.generate(
    target_pdb="my_motif.pdb",
    contigs=["10-40/A20-35/10-40"],
)
# Sample 10-40 new residues N-terminal of the motif, keep A20-35
# from the input PDB, then sample another 10-40 residues C-terminal.

Binder design — generate a complementary binder against a target:

backbones = engine.generate(
    target_pdb="target.pdb",
    contigs=["A1-150/0 60-100"],          # full target + 60-100 binder residues
    hotspot_residues=["A32", "A33", "A34"],   # binder must contact these
)

Symmetric design — n-fold symmetric oligomers:

engine = RFdiffusion(config_name="symmetry", num_designs=1)
backbones = engine.generate(length=360, symmetry="tetrahedral")

For the rest of this notebook we'll use the bundled helix fixture as a stand-in for a real RFdiffusion output, so the analysis cells run without RFdiffusion installed.

In [1]:

Copied!





from pathlib import Path
from molforge.io import read_pdb

# Stand-in: load a backbone fixture (15-residue helix)
backbone = read_pdb(Path("../../tests/fixtures/pdb/helix.pdb"))
print(backbone)
print(f"sequence (poly-A here, but RFdiffusion outputs poly-G):  {backbone.sequence}")
from pathlib import Path
from molforge.io import read_pdb

# Stand-in: load a backbone fixture (15-residue helix)
backbone = read_pdb(Path("../../tests/fixtures/pdb/helix.pdb"))
print(backbone)
print(f"sequence (poly-A here, but RFdiffusion outputs poly-G):  {backbone.sequence}")

Protein(name='helix', n_chains=1, n_residues=15, n_atoms=60)
sequence (poly-A here, but RFdiffusion outputs poly-G):  AAAAAAAAAAAAAAA

2. Design sequences for the backbone¶

ProteinMPNN samples sequences that should fold to the given backbone. Sample N sequences with temperature T; lower T = more conservative, higher T = more diverse.

In [1]:

Copied!





# 🐢 SLOW — needs PROTEINMPNN_HOME set
# from molforge.wrappers.generative import ProteinMPNN
# designer = ProteinMPNN(num_seqs=16, sampling_temp=0.1)
# designs = designer.generate(backbone)

# What you get back is a list[DesignedSequence], sorted by score
# (lower = better; ProteinMPNN's score is the negative log-likelihood
# averaged across designed positions).
#
# Representative output for a 15-residue backbone:
print("designs: list[DesignedSequence] of length 16, sorted by score:")
print()
print("rank  score   sequence")
print("----  -----   --------")
print("   0  0.781   MKEALEELRRRYGGG")
print("   1  0.842   MEEELKEAVRRAYGS")
print("   2  0.856   MAEELKAAVDRGYGG")
print("   3  0.911   MKAALKEALDRAYGG")
print("...")
print("  15  1.347   MQQEEILSAVEDPHK")
# 🐢 SLOW — needs PROTEINMPNN_HOME set
# from molforge.wrappers.generative import ProteinMPNN
# designer = ProteinMPNN(num_seqs=16, sampling_temp=0.1)
# designs = designer.generate(backbone)

# What you get back is a list[DesignedSequence], sorted by score
# (lower = better; ProteinMPNN's score is the negative log-likelihood
# averaged across designed positions).
#
# Representative output for a 15-residue backbone:
print("designs: list[DesignedSequence] of length 16, sorted by score:")
print()
print("rank  score   sequence")
print("----  -----   --------")
print("   0  0.781   MKEALEELRRRYGGG")
print("   1  0.842   MEEELKEAVRRAYGS")
print("   2  0.856   MAEELKAAVDRGYGG")
print("   3  0.911   MKAALKEALDRAYGG")
print("...")
print("  15  1.347   MQQEEILSAVEDPHK")

designs: list[DesignedSequence] of length 16, sorted by score:

rank  score   sequence
----  -----   --------
   0  0.781   MKEALEELRRRYGGG
   1  0.842   MEEELKEAVRRAYGS
   2  0.856   MAEELKAAVDRGYGG
   3  0.911   MKAALKEALDRAYGG
...
  15  1.347   MQQEEILSAVEDPHK

Sequence-design options¶

Fixed positions — preserve specific residues (a known catalytic triad, a binding-site motif, etc.):

designs = designer.generate(
    backbone,
    fixed_positions={"A": [1, 5, 8]},   # keep residues 1, 5, 8 unchanged
)

Indices are 1-based within the chain, not PDB residue numbers.

Multi-chain context — for binder design, design only the binder chain while using the target as fixed context:

designs = designer.generate(
    binder_with_target,
    chains_to_design="A",     # binder = chain A
    # chain B (target) is used as context but not redesigned
)

Sampling control — higher temperature for more diversity, soluble- only checkpoint for designs intended to be soluble:

designer = ProteinMPNN(
    num_seqs=64,
    sampling_temp=0.3,          # 0.1 = conservative, 0.3 = diverse
    use_soluble_model=True,     # prefer soluble residue distributions
    omit_aas="XC",              # don't sample cysteines (no disulfides)
)

3. Fold each design¶

ProteinMPNN gives us proposed sequences. The crucial validation step: do those sequences actually fold back to the input backbone? Use ESMFold (fast) or AlphaFold/ColabFold (highest accuracy) and compare the prediction to the input.

In [1]:

Copied!





# 🐢 SLOW — needs molforge[ml] for ESMFold
# from molforge.wrappers.folding import ESMFold
# folder = ESMFold(device="cuda")
#
# folded = []
# for design in designs[:5]:   # fold top 5 by ProteinMPNN score
#     predicted = folder.predict(design.sequence)
#     folded.append(predicted)

# Each `predicted` Protein has:
#   - metadata['confidence_per_residue']  (pLDDT, shape (L,))
#   - metadata['mean_confidence']          (mean pLDDT)
print("folded: list[Protein], 5 entries")
print()
print("Each predicted Protein has:")
print("  - metadata['engine'] = 'ESMFold'")
print("  - metadata['confidence_per_residue'] : (15,) float32 (pLDDT)")
print("  - metadata['mean_confidence']        : float (mean pLDDT)")
# 🐢 SLOW — needs molforge[ml] for ESMFold
# from molforge.wrappers.folding import ESMFold
# folder = ESMFold(device="cuda")
#
# folded = []
# for design in designs[:5]:   # fold top 5 by ProteinMPNN score
#     predicted = folder.predict(design.sequence)
#     folded.append(predicted)

# Each `predicted` Protein has:
#   - metadata['confidence_per_residue']  (pLDDT, shape (L,))
#   - metadata['mean_confidence']          (mean pLDDT)
print("folded: list[Protein], 5 entries")
print()
print("Each predicted Protein has:")
print("  - metadata['engine'] = 'ESMFold'")
print("  - metadata['confidence_per_residue'] : (15,) float32 (pLDDT)")
print("  - metadata['mean_confidence']        : float (mean pLDDT)")

folded: list[Protein], 5 entries

Each predicted Protein has:
  - metadata['engine'] = 'ESMFold'
  - metadata['confidence_per_residue'] : (15,) float32 (pLDDT)
  - metadata['mean_confidence']        : float (mean pLDDT)

4. Score the designs¶

The standard de novo design success criterion (Wang et al. 2022, Watson et al. 2023): a design is "successful" if the predicted structure has

Mean pLDDT > 80 (the model is confident in its prediction)
TM-score > 0.5 to the input backbone (same fold)
RMSD < 2 Å to the input backbone (close geometric match)

Compute these with molforge.metrics and molforge.structure:

In [1]:

Copied!





# Working with the bundled fixture (sequence-only mutation, so geometry
# is identical and metrics are trivially perfect — but the *pattern* is
# what matters):
from molforge.metrics import tm_score, lddt
from molforge.structure import rmsd

# In a real run:
# for design, predicted in zip(designs[:5], folded, strict=True):
#     tm  = tm_score(predicted, backbone)
#     ld  = lddt(predicted, backbone)
#     r   = rmsd(predicted, backbone, subset='ca')
#     plddt = predicted.metadata['mean_confidence']
#     successful = plddt > 80 and tm > 0.5 and r < 2.0
#     print(f"  pLDDT={plddt:.1f}, TM={tm:.3f}, lDDT={ld:.3f}, RMSD={r:.2f} Å  {'OK' if successful else 'X'}")

# Representative output:
print("rank  score   pLDDT   TM-score   lDDT    RMSD    success")
print("----  -----   -----   --------   -----   -----   -------")
print("   0  0.781    87.2     0.892    0.821   0.95     OK")
print("   1  0.842    84.1     0.851    0.789   1.21     OK")
print("   2  0.856    81.4     0.798    0.732   1.43     OK")
print("   3  0.911    79.8     0.726    0.684   1.67     X    (pLDDT < 80)")
print("   4  0.967    72.3     0.682    0.621   2.05     X    (pLDDT < 80, RMSD too high)")
# Working with the bundled fixture (sequence-only mutation, so geometry
# is identical and metrics are trivially perfect — but the *pattern* is
# what matters):
from molforge.metrics import tm_score, lddt
from molforge.structure import rmsd

# In a real run:
# for design, predicted in zip(designs[:5], folded, strict=True):
#     tm  = tm_score(predicted, backbone)
#     ld  = lddt(predicted, backbone)
#     r   = rmsd(predicted, backbone, subset='ca')
#     plddt = predicted.metadata['mean_confidence']
#     successful = plddt > 80 and tm > 0.5 and r < 2.0
#     print(f"  pLDDT={plddt:.1f}, TM={tm:.3f}, lDDT={ld:.3f}, RMSD={r:.2f} Å  {'OK' if successful else 'X'}")

# Representative output:
print("rank  score   pLDDT   TM-score   lDDT    RMSD    success")
print("----  -----   -----   --------   -----   -----   -------")
print("   0  0.781    87.2     0.892    0.821   0.95     OK")
print("   1  0.842    84.1     0.851    0.789   1.21     OK")
print("   2  0.856    81.4     0.798    0.732   1.43     OK")
print("   3  0.911    79.8     0.726    0.684   1.67     X    (pLDDT < 80)")
print("   4  0.967    72.3     0.682    0.621   2.05     X    (pLDDT < 80, RMSD too high)")

rank  score   pLDDT   TM-score   lDDT    RMSD    success
----  -----   -----   --------   -----   -----   -------
   0  0.781    87.2     0.892    0.821   0.95     OK
   1  0.842    84.1     0.851    0.789   1.21     OK
   2  0.856    81.4     0.798    0.732   1.43     OK
   3  0.911    79.8     0.726    0.684   1.67     X    (pLDDT < 80)
   4  0.967    72.3     0.682    0.621   2.05     X    (pLDDT < 80, RMSD too high)

A typical success rate: For unconditional monomer designs at length 100, around 20-40% of ProteinMPNN-designed sequences clear the pLDDT/TM/RMSD bar against ESMFold (lower than against AlphaFold). Sampling more sequences per backbone bumps this number; using AlphaFold instead of ESMFold for validation bumps it further. For critical applications, run multiple validators (ESMFold + AlphaFold + RoseTTAFold) and require agreement.

6. Cleaner: use `molforge.validation` for the bookkeeping¶

Section 4 used hand-rolled if plddt > 80 and tm > 0.5 and rmsd < 2.0 list-comprehensions to filter designs. That works for one validator but gets unwieldy fast — try expressing "passed by ESMFold AND AlphaFold, with TM > 0.5 in either model" as a comprehension.

molforge.validation captures this pattern declaratively:

In [1]:

Copied!





from molforge.validation import (
    Criterion, CriteriaSet, cross_validate, consensus, rank_verdicts,
)

# Declarative success criteria — composable with & | ~
success = (
    CriteriaSet()
    .add("plddt_ok", Criterion.gt("esmfold.plddt", 80.0))
    .add("tm_ok",    Criterion.gt("esmfold.tm", 0.5))
    .add("rmsd_ok",  Criterion.lt("esmfold.rmsd", 2.0))
)
print(f"Success requires: {success.criteria}")
print(f"Metrics referenced: {sorted(success.metric_names)}")
print()

# In a real run you'd define an actual validator. Here we use a stub
# that returns deterministic dict values per design, so the structure
# is visible without needing the real engines installed.
def esmfold_validator_stub(sequence):
    # representative values; real implementation calls the engine
    canned = {
        "MKEALEELRRRYGGG": {"plddt": 87.2, "tm": 0.892, "rmsd": 0.95},
        "MEEELKEAVRRAYGS": {"plddt": 84.1, "tm": 0.851, "rmsd": 1.21},
        "MAEELKAAVDRGYGG": {"plddt": 81.4, "tm": 0.798, "rmsd": 1.43},
        "MKAALKEALDRAYGG": {"plddt": 79.8, "tm": 0.726, "rmsd": 1.67},
        "MQQEEILSAVEDPHK": {"plddt": 72.3, "tm": 0.682, "rmsd": 2.05},
    }
    return canned[sequence]

sequences = list({
    "MKEALEELRRRYGGG", "MEEELKEAVRRAYGS", "MAEELKAAVDRGYGG",
    "MKAALKEALDRAYGG", "MQQEEILSAVEDPHK",
})
# Sort to make demo deterministic
sequences = sorted(sequences)

# `cross_validate` returns one Verdict per design. Note: by
# convention `score` is *lower-is-better*, matching ProteinMPNN's
# log-likelihood. For larger-is-better metrics like pLDDT, the
# ranking below shows lowest-pLDDT-first; for a top-first ranking,
# pass `score_metric=None` (counts failed criteria) or invert
# the metric upstream.
verdicts = cross_validate(
    designs=sequences,
    validators={"esmfold": esmfold_validator_stub},
    criteria=success,
    score_metric="esmfold.plddt",
)

# Rank, keeping only passing designs
for v in rank_verdicts(verdicts, only_passed=True):
    print(f"  {v}  (plddt={v.values['esmfold.plddt']:.1f}, tm={v.values['esmfold.tm']:.3f})")

print()
# What about the rejects? Diagnostics tell you which criterion they
# failed:
failed = [v for v in verdicts if not v.passed]
for v in failed:
    print(f"  {v.design_id}  failed: {v.failed_criteria}")
from molforge.validation import (
    Criterion, CriteriaSet, cross_validate, consensus, rank_verdicts,
)

# Declarative success criteria — composable with & | ~
success = (
    CriteriaSet()
    .add("plddt_ok", Criterion.gt("esmfold.plddt", 80.0))
    .add("tm_ok",    Criterion.gt("esmfold.tm", 0.5))
    .add("rmsd_ok",  Criterion.lt("esmfold.rmsd", 2.0))
)
print(f"Success requires: {success.criteria}")
print(f"Metrics referenced: {sorted(success.metric_names)}")
print()

# In a real run you'd define an actual validator. Here we use a stub
# that returns deterministic dict values per design, so the structure
# is visible without needing the real engines installed.
def esmfold_validator_stub(sequence):
    # representative values; real implementation calls the engine
    canned = {
        "MKEALEELRRRYGGG": {"plddt": 87.2, "tm": 0.892, "rmsd": 0.95},
        "MEEELKEAVRRAYGS": {"plddt": 84.1, "tm": 0.851, "rmsd": 1.21},
        "MAEELKAAVDRGYGG": {"plddt": 81.4, "tm": 0.798, "rmsd": 1.43},
        "MKAALKEALDRAYGG": {"plddt": 79.8, "tm": 0.726, "rmsd": 1.67},
        "MQQEEILSAVEDPHK": {"plddt": 72.3, "tm": 0.682, "rmsd": 2.05},
    }
    return canned[sequence]

sequences = list({
    "MKEALEELRRRYGGG", "MEEELKEAVRRAYGS", "MAEELKAAVDRGYGG",
    "MKAALKEALDRAYGG", "MQQEEILSAVEDPHK",
})
# Sort to make demo deterministic
sequences = sorted(sequences)

# `cross_validate` returns one Verdict per design. Note: by
# convention `score` is *lower-is-better*, matching ProteinMPNN's
# log-likelihood. For larger-is-better metrics like pLDDT, the
# ranking below shows lowest-pLDDT-first; for a top-first ranking,
# pass `score_metric=None` (counts failed criteria) or invert
# the metric upstream.
verdicts = cross_validate(
    designs=sequences,
    validators={"esmfold": esmfold_validator_stub},
    criteria=success,
    score_metric="esmfold.plddt",
)

# Rank, keeping only passing designs
for v in rank_verdicts(verdicts, only_passed=True):
    print(f"  {v}  (plddt={v.values['esmfold.plddt']:.1f}, tm={v.values['esmfold.tm']:.3f})")

print()
# What about the rejects? Diagnostics tell you which criterion they
# failed:
failed = [v for v in verdicts if not v.passed]
for v in failed:
    print(f"  {v.design_id}  failed: {v.failed_criteria}")

Success requires: {'plddt_ok': Criterion('esmfold.plddt > 80.0'), 'tm_ok': Criterion('esmfold.tm > 0.5'), 'rmsd_ok': Criterion('esmfold.rmsd < 2.0')}
Metrics referenced: ['esmfold.plddt', 'esmfold.rmsd', 'esmfold.tm']

  Verdict('MAEELKAAVDRGYGG', PASS, score=81.400)  (plddt=81.4, tm=0.798)
  Verdict('MEEELKEAVRRAYGS', PASS, score=84.100)  (plddt=84.1, tm=0.851)
  Verdict('MKEALEELRRRYGGG', PASS, score=87.200)  (plddt=87.2, tm=0.892)

  MKAALKEALDRAYGG  failed: ['plddt_ok']
  MQQEEILSAVEDPHK  failed: ['plddt_ok', 'rmsd_ok']

Cross-validator consensus¶

When you have multiple folding models, consensus() merges results under a chosen rule. The classic de novo design recipe is "ESMFold AND AlphaFold both have to confirm":

esm_verdicts = cross_validate(
    designs=sequences,
    validators={"esmfold": esmfold_score_fn},
    criteria=success,
)
af_verdicts = cross_validate(
    designs=sequences,
    validators={"alphafold": alphafold_score_fn},
    criteria=success,
)

# Strict: both models must pass
joint_strict = consensus(
    {"esm": esm_verdicts, "af": af_verdicts},
    mode="all",
)

# Permissive: either model passing is enough
joint_loose = consensus(
    {"esm": esm_verdicts, "af": af_verdicts},
    mode="any",
)

# With three or more models you can also use "majority" or "threshold":
ensemble = consensus(
    {"esm": esm, "af": af, "rosetta": rf},
    mode="majority",   # > half must pass
)

The merged Verdict carries metrics from every validator in values, per-validator criterion results in criteria_results (namespaced to avoid collisions), and a metadata["n_passed"] count for inspection.

5. Putting it all together — full design pipeline¶

The complete loop in roughly 30 lines:

from molforge.wrappers.generative import RFdiffusion, ProteinMPNN
from molforge.wrappers.folding import ESMFold
from molforge.metrics import tm_score, lddt
from molforge.structure import rmsd

# 1. Generate backbones
backbone_engine = RFdiffusion(num_designs=8)
backbones = backbone_engine.generate(length=100)

# 2. Design sequences for each backbone
seq_engine = ProteinMPNN(num_seqs=8, sampling_temp=0.1)
folder = ESMFold(device="cuda")

results = []
for backbone in backbones:
    for design in seq_engine.generate(backbone):
        predicted = folder.predict(design.sequence)
        results.append({
            "backbone_id": backbone.metadata["design_index"],
            "sequence":    design.sequence,
            "mpnn_score":  design.score,
            "plddt":       predicted.metadata["mean_confidence"],
            "tm":          tm_score(predicted, backbone),
            "lddt":        lddt(predicted, backbone),
            "rmsd":        rmsd(predicted, backbone, subset="ca"),
        })

# 3. Filter and rank
def is_successful(r):
    return r["plddt"] > 80 and r["tm"] > 0.5 and r["rmsd"] < 2.0

successful = [r for r in results if is_successful(r)]
successful.sort(key=lambda r: (-r["tm"], r["mpnn_score"]))
print(f"Found {len(successful)} / {len(results)} successful designs")
for r in successful[:5]:
    print(f"  {r['sequence']}  TM={r['tm']:.3f}  pLDDT={r['plddt']:.1f}")

This is the workflow that produced the headline results in the RFdiffusion paper. With ~100 GPU-hours you can run a full design campaign for a target of interest.

What's next¶

Validate with AlphaFold instead of (or in addition to) ESMFold — see alphafold wrapper docs.
Add a docking screen to filter designs that bind a target with acceptable affinity — see 04_docking.ipynb.
Run an MD refinement to relax each design before scoring — see 03_md_simulations.ipynb.
Featurize designs for downstream ML — see 05_ml_featurization.ipynb.

De novo protein design with molforge¶

1. Generate a backbone¶

Other RFdiffusion modes¶

2. Design sequences for the backbone¶

Sequence-design options¶

3. Fold each design¶

4. Score the designs¶

6. Cleaner: use molforge.validation for the bookkeeping¶

Cross-validator consensus¶

5. Putting it all together — full design pipeline¶

What's next¶

6. Cleaner: use `molforge.validation` for the bookkeeping¶