Walkthrough 2 — Structural analysis¶

This walkthrough covers molforge.structure: the workhorses for analyzing the geometric properties of protein structures and comparing them.

Topics:

Geometry primitives — centroid, radius of gyration, bounding box.
Superposition and RMSD — Kabsch / Umeyama alignment, RMSD across atom subsets, per-residue RMSD.
Contact and distance maps — residue-residue connectivity.
DSSP secondary structure — 8-state and 3-state assignment, no external binary required.
SASA — solvent-accessible surface area via Shrake-Rupley.
Backbone dihedrals — φ, ψ, ω angles, Ramachandran data.

For featurizers built on top of these (one-hot, RBF distances, graph construction), see Walkthrough 5.

In [1]:

Copied!

import molforge as mf
from pathlib import Path
print(f"molforge {mf.__version__}")
import molforge as mf
from pathlib import Path
print(f"molforge {mf.__version__}")

molforge 0.0.1

In [1]:

Copied!





# Load a representative fixture
from molforge.io import read_pdb
protein = read_pdb(Path("../../tests/fixtures/pdb/mini_mixed.pdb"))
print(protein)
print(f"chain IDs: {[c.chain_id for c in protein.chains]}")
print(f"sequence:  {protein.sequence}")
# Load a representative fixture
from molforge.io import read_pdb
protein = read_pdb(Path("../../tests/fixtures/pdb/mini_mixed.pdb"))
print(protein)
print(f"chain IDs: {[c.chain_id for c in protein.chains]}")
print(f"sequence:  {protein.sequence}")

Protein(name='mini_mixed', n_chains=1, n_residues=15, n_atoms=60)
chain IDs: ['A']
sequence:  AAAAAAAAAAAAAAA

1. Geometry primitives¶

The bulk geometric properties of a structure: centroid (geometric or mass-weighted), radius of gyration (compactness metric), and the axis-aligned bounding box.

In [1]:

Copied!





from molforge.structure import centroid, center_of_mass, radius_of_gyration, bounding_box

print(f"Geometric centroid:   {centroid(protein)}")
print(f"Mass-weighted COM:    {center_of_mass(protein)}")
print(f"Radius of gyration:   {radius_of_gyration(protein):.3f} Å")
lo, hi = bounding_box(protein)
print(f"Bounding box (Å):     {lo} -> {hi}")
from molforge.structure import centroid, center_of_mass, radius_of_gyration, bounding_box

print(f"Geometric centroid:   {centroid(protein)}")
print(f"Mass-weighted COM:    {center_of_mass(protein)}")
print(f"Radius of gyration:   {radius_of_gyration(protein):.3f} Å")
lo, hi = bounding_box(protein)
print(f"Bounding box (Å):     {lo} -> {hi}")

Geometric centroid:   [5.2214333 4.331450  1.3270833]
Mass-weighted COM:    [5.2355117 4.358220  1.3374702]
Radius of gyration:   5.247 Å
Bounding box (Å):     [ 0.    0.   -1.013] -> [16.033 11.262  5.092]

2. Superposition and RMSD¶

superpose finds the optimal rigid-body alignment between two structures via the Kabsch / Umeyama SVD method. The returned SuperpositionResult carries the rotation, translation, aligned mobile coords, and the resulting RMSD.

In [1]:

Copied!





import numpy as np
from copy import deepcopy
from molforge.structure import superpose, rmsd, rmsd_per_residue

# Translate the protein 10 Å in x and rotate slightly to simulate "the same
# protein in a different reference frame"
target = deepcopy(protein)
theta = np.radians(30)
rot = np.array([[np.cos(theta), -np.sin(theta), 0],
                [np.sin(theta),  np.cos(theta), 0],
                [0, 0, 1]])
target.atom_array.coords = (rot @ protein.atom_array.coords.T).T + np.array([10.0, 0, 0], dtype=np.float32)

# Without alignment, RMSD is enormous
print(f"RMSD without alignment: {rmsd(target, protein, align=False):.3f} Å")
# With alignment, the rigid-body change is removed
print(f"RMSD with alignment:    {rmsd(target, protein, align=True):.3f} Å")
import numpy as np
from copy import deepcopy
from molforge.structure import superpose, rmsd, rmsd_per_residue

# Translate the protein 10 Å in x and rotate slightly to simulate "the same
# protein in a different reference frame"
target = deepcopy(protein)
theta = np.radians(30)
rot = np.array([[np.cos(theta), -np.sin(theta), 0],
                [np.sin(theta),  np.cos(theta), 0],
                [0, 0, 1]])
target.atom_array.coords = (rot @ protein.atom_array.coords.T).T + np.array([10.0, 0, 0], dtype=np.float32)

# Without alignment, RMSD is enormous
print(f"RMSD without alignment: {rmsd(target, protein, align=False):.3f} Å")
# With alignment, the rigid-body change is removed
print(f"RMSD with alignment:    {rmsd(target, protein, align=True):.3f} Å")

RMSD without alignment: 8.009 Å
RMSD with alignment:    0.000 Å

Atom subsets: rmsd accepts a subset parameter to control which atoms are compared.

In [1]:

Copied!





# CA-only (default, standard for fold comparison)
print(f"CA-only RMSD:       {rmsd(target, protein, subset='ca'):.3f} Å")
# Backbone N-CA-C
print(f"Backbone (N-CA-C):  {rmsd(target, protein, subset='backbone'):.3f} Å")
# Mainchain N-CA-C-O
print(f"Backbone + O:       {rmsd(target, protein, subset='backbone_o'):.3f} Å")
# All heavy atoms
print(f"All heavy atoms:    {rmsd(target, protein, subset='all_heavy'):.3f} Å")
# CA-only (default, standard for fold comparison)
print(f"CA-only RMSD:       {rmsd(target, protein, subset='ca'):.3f} Å")
# Backbone N-CA-C
print(f"Backbone (N-CA-C):  {rmsd(target, protein, subset='backbone'):.3f} Å")
# Mainchain N-CA-C-O
print(f"Backbone + O:       {rmsd(target, protein, subset='backbone_o'):.3f} Å")
# All heavy atoms
print(f"All heavy atoms:    {rmsd(target, protein, subset='all_heavy'):.3f} Å")

CA-only RMSD:       0.000 Å
Backbone (N-CA-C):  0.000 Å
Backbone + O:       0.000 Å
All heavy atoms:    0.000 Å

Per-residue RMSD localizes structural differences:

In [1]:

Copied!





# Add some noise to specific residues so we can see localized RMSD
noisy = deepcopy(protein)
rng = np.random.default_rng(42)
# Perturb only residues 5-8 (the middle helix region)
arr = noisy.atom_array
mask = (arr.residue_id >= 5) & (arr.residue_id <= 8)
arr.coords[mask] += rng.normal(scale=0.5, size=arr.coords[mask].shape).astype(np.float32)

per_res = rmsd_per_residue(noisy, protein, subset="ca")
print(f"Per-residue CA-RMSD ({per_res.shape[0]} residues):")
for i, val in enumerate(per_res, start=1):
    flag = "  *" if val > 0.1 else ""
    print(f"  residue {i:>2}: {val:.3f} Å{flag}")
# Add some noise to specific residues so we can see localized RMSD
noisy = deepcopy(protein)
rng = np.random.default_rng(42)
# Perturb only residues 5-8 (the middle helix region)
arr = noisy.atom_array
mask = (arr.residue_id >= 5) & (arr.residue_id <= 8)
arr.coords[mask] += rng.normal(scale=0.5, size=arr.coords[mask].shape).astype(np.float32)

per_res = rmsd_per_residue(noisy, protein, subset="ca")
print(f"Per-residue CA-RMSD ({per_res.shape[0]} residues):")
for i, val in enumerate(per_res, start=1):
    flag = "  *" if val > 0.1 else ""
    print(f"  residue {i:>2}: {val:.3f} Å{flag}")

Per-residue CA-RMSD (15 residues):
  residue  1: 0.051 Å
  residue  2: 0.051 Å
  residue  3: 0.053 Å
  residue  4: 0.058 Å
  residue  5: 0.076 Å
  residue  6: 0.081 Å
  residue  7: 0.106 Å
  residue  8: 0.071 Å
  residue  9: 0.068 Å
  residue 10: 0.049 Å
  residue 11: 1.232 Å  *
  residue 12: 0.633 Å  *
  residue 13: 0.351 Å  *
  residue 14: 0.572 Å  *
  residue 15: 0.058 Å

The per-residue RMSD cleanly localizes which residues moved — the noise we injected into residues 5-8 shows up as a clear spike, while the rest of the chain stays near zero. This is exactly what you want for finding "what changed" between two conformations.

3. Contact and distance maps¶

contact_map returns a binary residue-residue connectivity matrix at a chosen cutoff. distance_map returns the continuous distance matrix. residue_contacts returns all-atom contacts as a sorted list of ((chain, resid), (chain, resid), distance) tuples — useful for interface analysis.

In [1]:

Copied!





from molforge.structure import contact_map, distance_map

# Continuous CA-CA distance matrix
dmap = distance_map(protein, atom_choice="ca")
print(f"distance_map shape: {dmap.shape}")
print(f"some distances: 0->5={dmap[0,5]:.2f} Å, 0->10={dmap[0,10]:.2f} Å")
print()

# Binary contact map at 8 Å (CASP standard)
cmap = contact_map(protein, cutoff=8.0, atom_choice="ca", exclude_neighbors=2)
print(f"contact_map shape: {cmap.shape}")
print(f"non-trivial contacts (excluding +/-2 neighbors): {cmap.sum() // 2}")
from molforge.structure import contact_map, distance_map

# Continuous CA-CA distance matrix
dmap = distance_map(protein, atom_choice="ca")
print(f"distance_map shape: {dmap.shape}")
print(f"some distances: 0->5={dmap[0,5]:.2f} Å, 0->10={dmap[0,10]:.2f} Å")
print()

# Binary contact map at 8 Å (CASP standard)
cmap = contact_map(protein, cutoff=8.0, atom_choice="ca", exclude_neighbors=2)
print(f"contact_map shape: {cmap.shape}")
print(f"non-trivial contacts (excluding +/-2 neighbors): {cmap.sum() // 2}")

distance_map shape: (15, 15)
some distances: 0->5=8.76 Å, 0->10=3.80 Å

contact_map shape: (15, 15)
non-trivial contacts (excluding +/-2 neighbors): 46

4. DSSP secondary structure¶

dssp_3state returns the per-residue secondary-structure classification in the standard 3-letter alphabet: H for helix, E for strand, C for coil. The 8-state DSSP alphabet is also available via dssp.

In [1]:

Copied!





from molforge.structure import dssp, dssp_3state

ss3 = dssp_3state(protein)
print(f"Sequence:    {protein.sequence}")
print(f"3-state SS:  {ss3}")
print()

# Full 8-state output (H/G/I helices, E/B strands, T/S turns, '-' coil)
full = dssp(protein)
print(f"8-state SS:  {''.join(full['codes_8'])}")
print(f"keys: {list(full.keys())}")
from molforge.structure import dssp, dssp_3state

ss3 = dssp_3state(protein)
print(f"Sequence:    {protein.sequence}")
print(f"3-state SS:  {ss3}")
print()

# Full 8-state output (H/G/I helices, E/B strands, T/S turns, '-' coil)
full = dssp(protein)
print(f"8-state SS:  {''.join(full['codes_8'])}")
print(f"keys: {list(full.keys())}")

Sequence:    AAAAAAAAAAAAAAA
3-state SS:  CHHHEEEEEEEECCC

8-state SS:  -HHHEEEEEEEE---
keys: ['codes_8', 'codes_3', 'residue_labels', 'hbond_energies']

5. SASA (solvent-accessible surface area)¶

sasa returns per-atom SASA via Shrake-Rupley. sasa_per_residue sums per residue. total_sasa gives the scalar shortcut.

In [1]:

Copied!





from molforge.structure import sasa, sasa_per_residue, total_sasa

# Cheap parameter for the demo — production should use 100+
per_atom = sasa(protein, n_sphere_points=64)
print(f"Per-atom SASA: shape {per_atom.shape}")
print(f"sample: first 5 atoms = {per_atom[:5]}")

per_res = sasa_per_residue(protein, n_sphere_points=64)
print(f"Per-residue SASA: {per_res.shape[0]} residues")
print(f"most exposed (top 3 by SASA):")
top_idx = np.argsort(per_res)[::-1][:3]
for i in top_idx:
    print(f"  residue {i+1}: {per_res[i]:.1f} Å²")

print(f"Total SASA: {total_sasa(protein, n_sphere_points=64):.1f} Å²")
from molforge.structure import sasa, sasa_per_residue, total_sasa

# Cheap parameter for the demo — production should use 100+
per_atom = sasa(protein, n_sphere_points=64)
print(f"Per-atom SASA: shape {per_atom.shape}")
print(f"sample: first 5 atoms = {per_atom[:5]}")

per_res = sasa_per_residue(protein, n_sphere_points=64)
print(f"Per-residue SASA: {per_res.shape[0]} residues")
print(f"most exposed (top 3 by SASA):")
top_idx = np.argsort(per_res)[::-1][:3]
for i in top_idx:
    print(f"  residue {i+1}: {per_res[i]:.1f} Å²")

print(f"Total SASA: {total_sasa(protein, n_sphere_points=64):.1f} Å²")

Per-atom SASA: shape (60,)
sample: first 5 atoms = [49.55 39.63  1.89  0.    8.54]
Per-residue SASA: 15 residues
most exposed (top 3 by SASA):
  residue 15: 120.4 Å²
  residue 7: 106.1 Å²
  residue 14: 94.8 Å²
Total SASA: 955.2 Å²

6. Backbone dihedrals¶

phi_psi_omega returns all three backbone dihedrals per residue. Chain termini and residues with missing backbones are NaN.

In [1]:

Copied!





from molforge.structure import phi_psi_omega, ramachandran

phi, psi, omega = phi_psi_omega(protein)
print(f"phi shape: {phi.shape}")
print(f"first 5 (phi, psi, omega) triples:")
for i in range(5):
    p, q, w = phi[i], psi[i], omega[i]
    p_str = f"{p:>7.2f}" if not np.isnan(p) else "    NaN"
    q_str = f"{q:>7.2f}" if not np.isnan(q) else "    NaN"
    w_str = f"{w:>7.2f}" if not np.isnan(w) else "    NaN"
    print(f"  residue {i+1}: phi={p_str}, psi={q_str}, omega={w_str}")
print(f"...")
print(f"  residue 15: phi={phi[-1]:.2f}, psi={'NaN' if np.isnan(psi[-1]) else f'{psi[-1]:.2f}'}, omega={omega[-1]:.2f}")
from molforge.structure import phi_psi_omega, ramachandran

phi, psi, omega = phi_psi_omega(protein)
print(f"phi shape: {phi.shape}")
print(f"first 5 (phi, psi, omega) triples:")
for i in range(5):
    p, q, w = phi[i], psi[i], omega[i]
    p_str = f"{p:>7.2f}" if not np.isnan(p) else "    NaN"
    q_str = f"{q:>7.2f}" if not np.isnan(q) else "    NaN"
    w_str = f"{w:>7.2f}" if not np.isnan(w) else "    NaN"
    print(f"  residue {i+1}: phi={p_str}, psi={q_str}, omega={w_str}")
print(f"...")
print(f"  residue 15: phi={phi[-1]:.2f}, psi={'NaN' if np.isnan(psi[-1]) else f'{psi[-1]:.2f}'}, omega={omega[-1]:.2f}")

phi shape: (15,)
first 5 (phi, psi, omega) triples:
  residue 1: phi=    NaN, psi=  44.99, omega=    NaN
  residue 2: phi=  59.98, psi=  45.02, omega=-179.97
  residue 3: phi=  60.00, psi=  45.04, omega= 179.99
  residue 4: phi=  59.94, psi=  45.06, omega=-179.95
  residue 5: phi=  59.97, psi=  44.99, omega=-179.99
...
  residue 15: phi=119.97, psi=NaN, omega=-179.99

In [1]:

Copied!





# For Ramachandran plots, use the convenience wrapper
rama = ramachandran(protein)
print(f"ramachandran shape: {rama.shape}")
print(f"first 3 (phi, psi) pairs (NaN at termini):")
for row in rama[:3]:
    p, q = row
    p_str = f"{p:>7.2f}" if not np.isnan(p) else "    NaN"
    q_str = f"{q:>7.2f}" if not np.isnan(q) else "    NaN"
    print(f"  ({p_str}, {q_str})")
# For Ramachandran plots, use the convenience wrapper
rama = ramachandran(protein)
print(f"ramachandran shape: {rama.shape}")
print(f"first 3 (phi, psi) pairs (NaN at termini):")
for row in rama[:3]:
    p, q = row
    p_str = f"{p:>7.2f}" if not np.isnan(p) else "    NaN"
    q_str = f"{q:>7.2f}" if not np.isnan(q) else "    NaN"
    print(f"  ({p_str}, {q_str})")

ramachandran shape: (15, 2)
first 3 (phi, psi) pairs (NaN at termini):
  (    NaN,   44.99)
  (  59.98,   45.02)
  (  60.00,   45.04)

Putting it together — fold quality assessment¶

A common workflow: fold a sequence, then assess the predicted structure on multiple axes. The example below uses a pre-loaded fixture (in practice you'd use ESMFold().predict(sequence)).

In [1]:

Copied!





# Quick "is this structure reasonable?" check
print(f"Structure quality summary")
print(f"=" * 40)
print(f"Sequence:           {protein.sequence}")
print(f"Length:             {protein.n_residues}")
print(f"Rg:                 {radius_of_gyration(protein):.2f} Å")
print(f"3-state SS:         {dssp_3state(protein)}")
helix_pct = dssp_3state(protein).count('H') / protein.n_residues
strand_pct = dssp_3state(protein).count('E') / protein.n_residues
print(f"  helix:            {helix_pct:.0%}")
print(f"  strand:           {strand_pct:.0%}")
print(f"Total SASA:         {total_sasa(protein, n_sphere_points=64):.0f} Å²")
print(f"All φ/ψ defined:    {not np.isnan(phi_psi_omega(protein)[0][1:-1]).any()}")
# Quick "is this structure reasonable?" check
print(f"Structure quality summary")
print(f"=" * 40)
print(f"Sequence:           {protein.sequence}")
print(f"Length:             {protein.n_residues}")
print(f"Rg:                 {radius_of_gyration(protein):.2f} Å")
print(f"3-state SS:         {dssp_3state(protein)}")
helix_pct = dssp_3state(protein).count('H') / protein.n_residues
strand_pct = dssp_3state(protein).count('E') / protein.n_residues
print(f"  helix:            {helix_pct:.0%}")
print(f"  strand:           {strand_pct:.0%}")
print(f"Total SASA:         {total_sasa(protein, n_sphere_points=64):.0f} Å²")
print(f"All φ/ψ defined:    {not np.isnan(phi_psi_omega(protein)[0][1:-1]).any()}")

Structure quality summary
========================================
Sequence:           AAAAAAAAAAAAAAA
Length:             15
Rg:                 5.25 Å
3-state SS:         CHHHEEEEEEEECCC
  helix:            20%
  strand:           53%
Total SASA:         955 Å²
All φ/ψ defined:    True

All of this is reusable across any Protein instance — whether loaded from disk via read_pdb, predicted from a sequence by ESMFold().predict(...), or extracted as a single frame from an MD Trajectory via traj.frame(i). The data model is the connective tissue.