End-to-end protein design loop with molforge¶
This notebook walks through a real protein-engineering workflow using nothing but molforge and the engines its wrappers integrate with. The loop:
- Start with a protein sequence.
- Fold it with ESMFold to get a 3-D structure.
- Analyze the result: secondary structure (DSSP), confidence (pLDDT), geometry (Rg).
- Mutate a residue at the sequence level (
A123V). - Re-fold the mutant.
- Compare the two structures: per-residue RMSD, contact-map overlap, change in secondary structure.
For brevity (and to keep the notebook running on machines without a GPU), the heavy ESMFold inference calls are shown but not executed in the rendered version — their expected outputs are inlined. To run this end-to-end on your own machine:
pip install 'molforge[ml]'
jupyter lab notebooks/examples/end_to_end_design.ipynb
and re-execute the cells marked with # 🐢 SLOW.
This is the workflow molforge exists to make easy. Everything you see below would normally be three different Python environments and a folder full of conversion scripts.
import molforge as mf
print(f"molforge version: {mf.__version__}")
molforge version: 0.0.1
1. Start with a sequence¶
We'll use a 76-residue fragment of ubiquitin (PDB 1UBQ) — short enough to fold quickly, well-characterized enough that we can sanity-check the output.
# Human ubiquitin, residues 1-76
wild_type_seq = (
"MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG"
)
print(f"Length: {len(wild_type_seq)} residues")
# Quick sequence stats from molforge.sequence
from molforge.sequence import composition, gravy, molecular_weight, aromaticity
print(f"Molecular weight: {molecular_weight(wild_type_seq):.0f} Da")
print(f"GRAVY (hydrophobicity): {gravy(wild_type_seq):+.3f}")
print(f"Aromatic fraction: {aromaticity(wild_type_seq):.1%}")
Length: 76 residues Molecular weight: 8565 Da GRAVY (hydrophobicity): -0.489 Aromatic fraction: 3.9%
2. Fold the wild-type sequence¶
ESMFold is a single-sequence transformer folding model. It doesn't need a multiple-sequence alignment (unlike AlphaFold), which makes it fast — a sequence of this size folds in a few seconds on a modern GPU, a minute or two on CPU.
In the cell below we show the call you'd make. The actual output is loaded from a fixture so the notebook renders without requiring torch/transformers.
# 🐢 SLOW — uncomment to run for real (needs `pip install 'molforge[ml]'`)
# from molforge.wrappers.folding import ESMFold
# engine = ESMFold(device="cuda") # or "cpu"
# wild_type = engine.predict(wild_type_seq)
# For demonstration we load a pre-computed prediction.
# (In your own work, the `wild_type` from the line above replaces this.)
import numpy as np
from molforge.core import AtomArray, Protein
# (placeholder: in a real run, `wild_type` would be the ESMFold output
# Protein. For the rendered notebook we synthesize a minimal stand-in
# so the downstream cells have a structure to operate on.)
print("In a real run, wild_type would be a 76-residue Protein here.")
print("Continuing with placeholder for rendering...")
In a real run, wild_type would be a 76-residue Protein here. Continuing with placeholder for rendering...
For the rest of this notebook we'll use the bundled fixture
tests/fixtures/pdb/helix.pdb (an idealized 15-residue helix) as a
concrete Protein to demonstrate the API. Substitute your real
ESMFold output for wild_type when you run this on your own machine.
from pathlib import Path
from molforge.io import read_pdb
# Use the bundled helix fixture as a stand-in for an ESMFold output.
# In production, this is just `wild_type = engine.predict(wild_type_seq)`.
fixture_path = Path("../../tests/fixtures/pdb/helix.pdb")
wild_type = read_pdb(fixture_path)
print(wild_type)
print(f"Chain IDs: {[c.chain_id for c in wild_type.chains]}")
print(f"Sequence: {wild_type.sequence}")
Protein(name='helix', n_chains=1, n_residues=15, n_atoms=60) Chain IDs: ['A'] Sequence: AAAAAAAAAAAAAAA
3. Analyze the predicted structure¶
Three quick analyses: secondary structure (DSSP), bulk geometry, and confidence (pLDDT — only available for real ESMFold output).
from molforge.structure import dssp_3state, radius_of_gyration, centroid
ss = dssp_3state(wild_type)
print(f"3-state secondary structure:")
print(f" sequence: {wild_type.sequence}")
print(f" DSSP: {ss}")
print(f" helix: {ss.count('H')} / {len(ss)} residues ({ss.count('H')/len(ss):.0%})")
print(f" strand: {ss.count('E')} / {len(ss)} residues")
print()
print(f"Radius of gyration: {radius_of_gyration(wild_type):.2f} Å")
print(f"Centroid: {centroid(wild_type)}")
3-state secondary structure: sequence: AAAAAAAAAAAAAAA DSSP: CHHHHHHHHHHHHHC helix: 13 / 15 residues (87%) strand: 0 / 15 residues Radius of gyration: 6.95 Å Centroid: [7.1066 7.4708 6.7356]
When you fold the real ubiquitin sequence with ESMFold, you'll
also see per-residue confidence (pLDDT) attached to
protein.metadata:
plddt = wild_type.metadata["confidence_per_residue"] # (76,) float32
print(f"Mean pLDDT: {wild_type.metadata['mean_confidence']:.1f}")
print(f"Low-confidence residues (<70): {(plddt < 70).sum()}")
For ubiquitin, ESMFold typically returns mean pLDDT > 85 — high confidence across the whole fold.
4. Mutate the sequence¶
We'll mutate the canonical lysine-48 (the linkage point for K48 polyubiquitination) to arginine — a classic protein-engineering test case. Real users would write:
mutant_seq = mf.sequence.apply_mutation(wild_type_seq, "K48R")
For the fixture demo, we'll mutate one of the alanines instead.
from molforge.sequence import apply_mutation, mutate_protein
# On a string sequence (the wild-type from cell 2):
mutant_seq = apply_mutation(wild_type_seq, "K48R")
print(f"Wild-type position 48: {wild_type_seq[47]}")
print(f"Mutant position 48: {mutant_seq[47]}")
print(f"Sequences differ at: {[i for i, (a, b) in enumerate(zip(wild_type_seq, mutant_seq)) if a != b]}")
print()
# On the Protein directly (using the fixture):
mutant = mutate_protein(wild_type, "A7G", chain_id="A")
print(f"Mutated residue 7: {wild_type['A'][7].name} → {mutant['A'][7].name}")
print(f"Coords unchanged: {np.allclose(mutant.atom_array.coords, wild_type.atom_array.coords)}")
Wild-type position 48: K Mutant position 48: R Sequences differ at: [47] Mutated residue 7: ALA → GLY Coords unchanged: True
Note: mutate_protein is a sequence-only mutation — it
updates the residue label but leaves the atomic coordinates alone.
That's correct for the design-loop pattern: you change the sequence
identity, then re-fold (or repack side chains) to get accurate
geometry. The next cell shows the re-fold step.
5. Re-fold the mutant¶
# 🐢 SLOW — needs molforge[ml]
mutant_structure = engine.predict(mutant_seq)
For ESMFold, a single point mutation typically shifts the structure by less than 1 Å RMSD (mutations rarely break the global fold). But local geometry around the mutation site can change significantly, which is exactly what we want to detect in step 6.
6. Compare wild-type and mutant¶
The interesting comparisons:
- Per-residue RMSD after global superposition — pinpoints which loops moved.
- Contact-map overlap — detects topology changes.
- Secondary structure difference — flags any helix/strand changes.
from molforge.structure import rmsd, rmsd_per_residue, contact_map
# Pretend the mutant was re-folded (in practice the next two lines are:
# mutant_structure = engine.predict(mutant_seq)
# and `mutant_structure` becomes the second arg below.
# Here we'll just compare wild_type against the sequence-mutated copy
# to demonstrate the API — RMSD will be ~0 because coords are unchanged.)
global_rmsd = rmsd(mutant, wild_type, subset="ca")
per_residue_rmsd = rmsd_per_residue(mutant, wild_type, subset="ca")
print(f"Global CA-RMSD: {global_rmsd:.3f} Å")
print(f"Per-residue CA-RMSD (first 5):")
for i, val in enumerate(per_residue_rmsd[:5]):
print(f" residue {i+1}: {val:.3f} Å")
Global CA-RMSD: 0.000 Å Per-residue CA-RMSD (first 5): residue 1: 0.000 Å residue 2: 0.000 Å residue 3: 0.000 Å residue 4: 0.000 Å residue 5: 0.000 Å
# Contact-map comparison: how many contacts are shared vs. unique
# to each structure?
cmap_wt = contact_map(wild_type, cutoff=8.0, atom_choice="ca")
cmap_mut = contact_map(mutant, cutoff=8.0, atom_choice="ca")
shared = (cmap_wt & cmap_mut).sum()
only_wt = (cmap_wt & ~cmap_mut).sum()
only_mut = (~cmap_wt & cmap_mut).sum()
print(f"Shared contacts: {shared}")
print(f"Lost on mutation: {only_wt}")
print(f"Gained on mutation: {only_mut}")
Shared contacts: 100 Lost on mutation: 0 Gained on mutation: 0
# Secondary-structure diff
ss_wt = dssp_3state(wild_type)
ss_mut = dssp_3state(mutant)
diffs = [(i, a, b) for i, (a, b) in enumerate(zip(ss_wt, ss_mut)) if a != b]
print(f"WT SS: {ss_wt}")
print(f"Mut SS: {ss_mut}")
print(f"Residues that changed SS class: {len(diffs)}")
for i, a, b in diffs[:5]:
print(f" residue {i+1}: {a} → {b}")
WT SS: CHHHHHHHHHHHHHC Mut SS: CHHHHHHHHHHHHHC Residues that changed SS class: 0
What we built¶
In ~50 lines of code we walked an entire protein-design loop:
- Sequence in, with quick composition stats from
molforge.sequence. - Folded structure out (real engine + lazy import via
molforge.wrappers.folding.ESMFold). - Structural analysis:
dssp_3state,radius_of_gyration,centroidfrommolforge.structure. - Point mutation:
apply_mutationandmutate_proteinfrommolforge.sequence. - Re-folding (same engine call as step 2 — uniform API).
- Quantitative wild-type vs. mutant comparison:
rmsd,rmsd_per_residue,contact_map,dssp_3stateagain.
Every step used a single, consistent Protein data model. No file-
format conversions, no Python environment juggling, no glue code.
Next steps for your own workflow¶
- Dock a candidate against a target: see
notebooks/walkthroughs/04_docking.ipynbfor AutoDock Vina viamolforge.wrappers.docking.Vina. - Run a short MD trajectory to refine the predicted structure: the OpenMM wrapper is on the roadmap.
- Build a screening pipeline by combining
ESMFold().predict_many([...])(batched folding) withdssp_3stateandradius_of_gyrationfilters.
For any of these, the molforge data model and wrapper interfaces are the same. The library is the connective tissue; you bring the science.