Walkthrough 2 — Structural analysis¶
This walkthrough covers molforge.structure: the workhorses for
analyzing the geometric properties of protein structures and
comparing them.
Topics:
- Geometry primitives — centroid, radius of gyration, bounding box.
- Superposition and RMSD — Kabsch / Umeyama alignment, RMSD across atom subsets, per-residue RMSD.
- Contact and distance maps — residue-residue connectivity.
- DSSP secondary structure — 8-state and 3-state assignment, no external binary required.
- SASA — solvent-accessible surface area via Shrake-Rupley.
- Backbone dihedrals — φ, ψ, ω angles, Ramachandran data.
For featurizers built on top of these (one-hot, RBF distances, graph construction), see Walkthrough 5.
import molforge as mf
from pathlib import Path
print(f"molforge {mf.__version__}")
molforge 0.0.1
# Load a representative fixture
from molforge.io import read_pdb
protein = read_pdb(Path("../../tests/fixtures/pdb/mini_mixed.pdb"))
print(protein)
print(f"chain IDs: {[c.chain_id for c in protein.chains]}")
print(f"sequence: {protein.sequence}")
Protein(name='mini_mixed', n_chains=1, n_residues=15, n_atoms=60) chain IDs: ['A'] sequence: AAAAAAAAAAAAAAA
1. Geometry primitives¶
The bulk geometric properties of a structure: centroid (geometric or mass-weighted), radius of gyration (compactness metric), and the axis-aligned bounding box.
from molforge.structure import centroid, center_of_mass, radius_of_gyration, bounding_box
print(f"Geometric centroid: {centroid(protein)}")
print(f"Mass-weighted COM: {center_of_mass(protein)}")
print(f"Radius of gyration: {radius_of_gyration(protein):.3f} Å")
lo, hi = bounding_box(protein)
print(f"Bounding box (Å): {lo} -> {hi}")
Geometric centroid: [5.2214333 4.331450 1.3270833] Mass-weighted COM: [5.2355117 4.358220 1.3374702] Radius of gyration: 5.247 Å Bounding box (Å): [ 0. 0. -1.013] -> [16.033 11.262 5.092]
2. Superposition and RMSD¶
superpose finds the optimal rigid-body alignment between two
structures via the Kabsch / Umeyama SVD method. The returned
SuperpositionResult carries the rotation, translation, aligned
mobile coords, and the resulting RMSD.
import numpy as np
from copy import deepcopy
from molforge.structure import superpose, rmsd, rmsd_per_residue
# Translate the protein 10 Å in x and rotate slightly to simulate "the same
# protein in a different reference frame"
target = deepcopy(protein)
theta = np.radians(30)
rot = np.array([[np.cos(theta), -np.sin(theta), 0],
[np.sin(theta), np.cos(theta), 0],
[0, 0, 1]])
target.atom_array.coords = (rot @ protein.atom_array.coords.T).T + np.array([10.0, 0, 0], dtype=np.float32)
# Without alignment, RMSD is enormous
print(f"RMSD without alignment: {rmsd(target, protein, align=False):.3f} Å")
# With alignment, the rigid-body change is removed
print(f"RMSD with alignment: {rmsd(target, protein, align=True):.3f} Å")
RMSD without alignment: 8.009 Å RMSD with alignment: 0.000 Å
Atom subsets: rmsd accepts a subset parameter to control
which atoms are compared.
# CA-only (default, standard for fold comparison)
print(f"CA-only RMSD: {rmsd(target, protein, subset='ca'):.3f} Å")
# Backbone N-CA-C
print(f"Backbone (N-CA-C): {rmsd(target, protein, subset='backbone'):.3f} Å")
# Mainchain N-CA-C-O
print(f"Backbone + O: {rmsd(target, protein, subset='backbone_o'):.3f} Å")
# All heavy atoms
print(f"All heavy atoms: {rmsd(target, protein, subset='all_heavy'):.3f} Å")
CA-only RMSD: 0.000 Å Backbone (N-CA-C): 0.000 Å Backbone + O: 0.000 Å All heavy atoms: 0.000 Å
Per-residue RMSD localizes structural differences:
# Add some noise to specific residues so we can see localized RMSD
noisy = deepcopy(protein)
rng = np.random.default_rng(42)
# Perturb only residues 5-8 (the middle helix region)
arr = noisy.atom_array
mask = (arr.residue_id >= 5) & (arr.residue_id <= 8)
arr.coords[mask] += rng.normal(scale=0.5, size=arr.coords[mask].shape).astype(np.float32)
per_res = rmsd_per_residue(noisy, protein, subset="ca")
print(f"Per-residue CA-RMSD ({per_res.shape[0]} residues):")
for i, val in enumerate(per_res, start=1):
flag = " *" if val > 0.1 else ""
print(f" residue {i:>2}: {val:.3f} Å{flag}")
Per-residue CA-RMSD (15 residues): residue 1: 0.051 Å residue 2: 0.051 Å residue 3: 0.053 Å residue 4: 0.058 Å residue 5: 0.076 Å residue 6: 0.081 Å residue 7: 0.106 Å residue 8: 0.071 Å residue 9: 0.068 Å residue 10: 0.049 Å residue 11: 1.232 Å * residue 12: 0.633 Å * residue 13: 0.351 Å * residue 14: 0.572 Å * residue 15: 0.058 Å
The per-residue RMSD cleanly localizes which residues moved — the noise we injected into residues 5-8 shows up as a clear spike, while the rest of the chain stays near zero. This is exactly what you want for finding "what changed" between two conformations.
3. Contact and distance maps¶
contact_map returns a binary residue-residue connectivity matrix at
a chosen cutoff. distance_map returns the continuous distance matrix.
residue_contacts returns all-atom contacts as a sorted list of
((chain, resid), (chain, resid), distance) tuples — useful for
interface analysis.
from molforge.structure import contact_map, distance_map
# Continuous CA-CA distance matrix
dmap = distance_map(protein, atom_choice="ca")
print(f"distance_map shape: {dmap.shape}")
print(f"some distances: 0->5={dmap[0,5]:.2f} Å, 0->10={dmap[0,10]:.2f} Å")
print()
# Binary contact map at 8 Å (CASP standard)
cmap = contact_map(protein, cutoff=8.0, atom_choice="ca", exclude_neighbors=2)
print(f"contact_map shape: {cmap.shape}")
print(f"non-trivial contacts (excluding +/-2 neighbors): {cmap.sum() // 2}")
distance_map shape: (15, 15) some distances: 0->5=8.76 Å, 0->10=3.80 Å contact_map shape: (15, 15) non-trivial contacts (excluding +/-2 neighbors): 46
4. DSSP secondary structure¶
dssp_3state returns the per-residue secondary-structure classification
in the standard 3-letter alphabet: H for helix, E for strand,
C for coil. The 8-state DSSP alphabet is also available via dssp.
from molforge.structure import dssp, dssp_3state
ss3 = dssp_3state(protein)
print(f"Sequence: {protein.sequence}")
print(f"3-state SS: {ss3}")
print()
# Full 8-state output (H/G/I helices, E/B strands, T/S turns, '-' coil)
full = dssp(protein)
print(f"8-state SS: {''.join(full['codes_8'])}")
print(f"keys: {list(full.keys())}")
Sequence: AAAAAAAAAAAAAAA 3-state SS: CHHHEEEEEEEECCC 8-state SS: -HHHEEEEEEEE--- keys: ['codes_8', 'codes_3', 'residue_labels', 'hbond_energies']
5. SASA (solvent-accessible surface area)¶
sasa returns per-atom SASA via Shrake-Rupley. sasa_per_residue
sums per residue. total_sasa gives the scalar shortcut.
from molforge.structure import sasa, sasa_per_residue, total_sasa
# Cheap parameter for the demo — production should use 100+
per_atom = sasa(protein, n_sphere_points=64)
print(f"Per-atom SASA: shape {per_atom.shape}")
print(f"sample: first 5 atoms = {per_atom[:5]}")
per_res = sasa_per_residue(protein, n_sphere_points=64)
print(f"Per-residue SASA: {per_res.shape[0]} residues")
print(f"most exposed (top 3 by SASA):")
top_idx = np.argsort(per_res)[::-1][:3]
for i in top_idx:
print(f" residue {i+1}: {per_res[i]:.1f} Ų")
print(f"Total SASA: {total_sasa(protein, n_sphere_points=64):.1f} Ų")
Per-atom SASA: shape (60,) sample: first 5 atoms = [49.55 39.63 1.89 0. 8.54] Per-residue SASA: 15 residues most exposed (top 3 by SASA): residue 15: 120.4 Ų residue 7: 106.1 Ų residue 14: 94.8 Ų Total SASA: 955.2 Ų
6. Backbone dihedrals¶
phi_psi_omega returns all three backbone dihedrals per residue.
Chain termini and residues with missing backbones are NaN.
from molforge.structure import phi_psi_omega, ramachandran
phi, psi, omega = phi_psi_omega(protein)
print(f"phi shape: {phi.shape}")
print(f"first 5 (phi, psi, omega) triples:")
for i in range(5):
p, q, w = phi[i], psi[i], omega[i]
p_str = f"{p:>7.2f}" if not np.isnan(p) else " NaN"
q_str = f"{q:>7.2f}" if not np.isnan(q) else " NaN"
w_str = f"{w:>7.2f}" if not np.isnan(w) else " NaN"
print(f" residue {i+1}: phi={p_str}, psi={q_str}, omega={w_str}")
print(f"...")
print(f" residue 15: phi={phi[-1]:.2f}, psi={'NaN' if np.isnan(psi[-1]) else f'{psi[-1]:.2f}'}, omega={omega[-1]:.2f}")
phi shape: (15,) first 5 (phi, psi, omega) triples: residue 1: phi= NaN, psi= 44.99, omega= NaN residue 2: phi= 59.98, psi= 45.02, omega=-179.97 residue 3: phi= 60.00, psi= 45.04, omega= 179.99 residue 4: phi= 59.94, psi= 45.06, omega=-179.95 residue 5: phi= 59.97, psi= 44.99, omega=-179.99 ... residue 15: phi=119.97, psi=NaN, omega=-179.99
# For Ramachandran plots, use the convenience wrapper
rama = ramachandran(protein)
print(f"ramachandran shape: {rama.shape}")
print(f"first 3 (phi, psi) pairs (NaN at termini):")
for row in rama[:3]:
p, q = row
p_str = f"{p:>7.2f}" if not np.isnan(p) else " NaN"
q_str = f"{q:>7.2f}" if not np.isnan(q) else " NaN"
print(f" ({p_str}, {q_str})")
ramachandran shape: (15, 2) first 3 (phi, psi) pairs (NaN at termini): ( NaN, 44.99) ( 59.98, 45.02) ( 60.00, 45.04)
Putting it together — fold quality assessment¶
A common workflow: fold a sequence, then assess the predicted structure
on multiple axes. The example below uses a pre-loaded fixture (in
practice you'd use ESMFold().predict(sequence)).
# Quick "is this structure reasonable?" check
print(f"Structure quality summary")
print(f"=" * 40)
print(f"Sequence: {protein.sequence}")
print(f"Length: {protein.n_residues}")
print(f"Rg: {radius_of_gyration(protein):.2f} Å")
print(f"3-state SS: {dssp_3state(protein)}")
helix_pct = dssp_3state(protein).count('H') / protein.n_residues
strand_pct = dssp_3state(protein).count('E') / protein.n_residues
print(f" helix: {helix_pct:.0%}")
print(f" strand: {strand_pct:.0%}")
print(f"Total SASA: {total_sasa(protein, n_sphere_points=64):.0f} Ų")
print(f"All φ/ψ defined: {not np.isnan(phi_psi_omega(protein)[0][1:-1]).any()}")
Structure quality summary ======================================== Sequence: AAAAAAAAAAAAAAA Length: 15 Rg: 5.25 Å 3-state SS: CHHHEEEEEEEECCC helix: 20% strand: 53% Total SASA: 955 Ų All φ/ψ defined: True
All of this is reusable across any Protein instance — whether
loaded from disk via read_pdb, predicted from a sequence by
ESMFold().predict(...), or extracted as a single frame from an MD
Trajectory via traj.frame(i). The data model is the connective
tissue.
See also:
- Walkthrough 1 — sequence operations (alignment, mutations).
- Walkthrough 4 — docking with AutoDock Vina.
- Walkthrough 5 — ML featurization (graph construction, ESM-2 embeddings).