Walkthrough 5 — ML featurization¶
molforge ships an ml subpackage that converts a Protein (or just a
sequence) into the kinds of tensors that ML models expect. The
sequence featurizers work on strings; the structure featurizers
operate on 3-D coordinates; the embeddings wrapper hits a real
protein language model.
This walkthrough covers the four layers in order:
- Sequence featurizers — one-hot, BLOSUM, positional encoding.
- Structure featurizers — distance maps, RBF-binned distances, pair orientations, local environment, combined node features.
- Graph construction — turn a Protein into a graph suitable for PyTorch Geometric / DGL.
- Pre-trained embeddings — ESM-2 via the
ESM2Embedderwrapper.
Everything returns float32 NumPy arrays. Convert to your preferred
tensor library with torch.from_numpy(arr) or equivalent.
import molforge as mf
import numpy as np
print(f"molforge {mf.__version__}")
molforge 0.0.1
1. Sequence featurizers¶
These work on a plain string. The shapes are always (L, D) where
L is sequence length and D is feature dim.
from molforge.ml import one_hot, blosum_embed, positional_encoding, compose_features
seq = "MKTVRQERLKSIVRILERSK" # 20 residues
print(f"Sequence: {seq} (length {len(seq)})")
print()
# One-hot: 21 dims (20 standard AAs + unknown bucket)
oh = one_hot(seq)
print(f"one_hot: {oh.shape}, dtype={oh.dtype}")
# BLOSUM62 row per residue: 20 dims, captures substitution profile
blosum = blosum_embed(seq, matrix="BLOSUM62")
print(f"blosum_embed: {blosum.shape}, dtype={blosum.dtype}")
# Sinusoidal positional encoding (Vaswani et al.)
pe = positional_encoding(len(seq), dim=32)
print(f"positional_encoding: {pe.shape}, dtype={pe.dtype}")
Sequence: MKTVRQERLKSIVRILERSK (length 20) one_hot: (20, 21), dtype=float32 blosum_embed: (20, 20), dtype=float32 positional_encoding: (20, 32), dtype=float32
# Concatenate them into a single per-residue feature tensor
combined = compose_features(oh, blosum, pe)
print(f"combined: {combined.shape}")
# 21 + 20 + 32 = 73 dims per residue
print(f"per-residue feature dim: {combined.shape[-1]}")
combined: (20, 73) per-residue feature dim: 73
2. Structure featurizers¶
These need a Protein with real 3-D coordinates. We'll use the
bundled helix fixture for demonstration; substitute your real ESMFold
output (engine.predict(seq)) for genuine use.
from pathlib import Path
from molforge.io import read_pdb
protein = read_pdb(Path("../../tests/fixtures/pdb/helix.pdb"))
print(protein)
print(f"residues: {protein.n_residues}")
Protein(name='helix', n_chains=1, n_residues=15, n_atoms=60) residues: 15
from molforge.ml import (
pair_distances,
pair_distance_features,
pair_orientations,
local_environment,
)
# Raw CA-CA distance map
d = pair_distances(protein, atom_choice="ca")
print(f"pair_distances: {d.shape}")
print(f"sample distances [0,1:4]: {d[0, 1:4]}")
print()
# Gaussian-RBF binned distances (the standard input for distance-based GNNs)
rbf = pair_distance_features(protein, n_bins=16, d_min=2.0, d_max=22.0)
print(f"pair_distance_features: {rbf.shape}")
print(f"sample RBF [0, 1] (first 4 bins): {rbf[0, 1, :4]}")
print()
# Pair orientations: CA-CA unit vectors, distances, and local-frame cosines
orient = pair_orientations(protein)
print(f"orientations keys: {list(orient.keys())}")
print(f"direction shape: {orient['direction'].shape}")
print(f"cosine sample (residue 0 vs all):")
print(f" {orient['cosine'][0, :5]}")
pair_distances: (15, 15) sample distances [0,1:4]: [3.8039281 5.464399 5.1869736] pair_distance_features: (15, 15, 16) sample RBF [0, 1] (first 4 bins): [0.40042388 0.93961453 0.81112 0.25758818] orientations keys: ['direction', 'distance', 'cosine'] direction shape: (15, 15, 3) cosine sample (residue 0 vs all): [0. 0.99999994 0.71829283 0.29170692 0.568028 ]
# Local environment: counts of atoms by element within a radius
env = local_environment(protein, radius=10.0)
print(f"local_environment: {env.shape}")
print(f"columns: C, N, O, S, other")
print(f"first 3 residues:")
print(env[:3])
local_environment: (15, 5) columns: C, N, O, S, other first 3 residues: [[12. 7. 5. 0. 0.] [14. 8. 6. 0. 0.] [16. 9. 7. 0. 0.]]
Combined per-residue node features¶
per_residue_features packages one-hot identity + local environment + DSSP into a single tensor — the standard input for protein GNNs.
from molforge.ml import per_residue_features
nodes = per_residue_features(protein)
print(f"per_residue_features: {nodes.shape}")
# 21 (one-hot) + 5 (environment) + 3 (DSSP) = 29
print(f"breakdown: 21 one-hot + 5 environment + 3 DSSP = 29")
print(f"first residue features (first 25 dims):")
print(nodes[0, :25])
per_residue_features: (15, 29) breakdown: 21 one-hot + 5 environment + 3 DSSP = 29 first residue features (first 25 dims): [ 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 12. 7. 5. 0.]
3. Graph construction¶
to_graph builds a ProteinGraph with the standard (node_features, edge_index, edge_features) triple. Drop-in for PyTorch Geometric:
import torch_geometric as pyg
g = to_graph(protein)
pyg_data = pyg.data.Data(
x=torch.from_numpy(g.node_features),
edge_index=torch.from_numpy(g.edge_index),
edge_attr=torch.from_numpy(g.edge_features),
)
For DGL or other frameworks the conversion is similarly trivial — see their respective docs for the exact incantation.
from molforge.ml import to_graph
graph = to_graph(protein, cutoff=10.0, edge_distance_bins=16)
print(f"nodes: {graph.n_nodes}")
print(f"edges: {graph.n_edges}")
print(f"node_dim: {graph.node_dim}")
print(f"edge_dim: {graph.edge_dim}")
print()
print(f"edge_index shape: {graph.edge_index.shape}")
print(f"first 8 edges (src -> tgt):")
for s, t in zip(graph.edge_index[0, :8], graph.edge_index[1, :8]):
print(f" {int(s)} -> {int(t)}")
nodes: 15 edges: 120 node_dim: 29 edge_dim: 16 edge_index shape: (2, 120) first 8 edges (src -> tgt): 0 -> 1 0 -> 2 0 -> 3 0 -> 4 0 -> 5 1 -> 0 1 -> 2 1 -> 3
# Edges are bidirectional (the (i, j) edge is paired with (j, i))
edges = set(
(int(s), int(t))
for s, t in zip(graph.edge_index[0], graph.edge_index[1])
)
all_paired = all((t, s) in edges for s, t in edges)
print(f"all edges have reverse pair: {all_paired}")
# Tightening the cutoff reduces edges
tight = to_graph(protein, cutoff=5.0)
loose = to_graph(protein, cutoff=15.0)
print(f"5 Å cutoff: {tight.n_edges} edges")
print(f"10 Å cutoff: {graph.n_edges} edges")
print(f"15 Å cutoff: {loose.n_edges} edges")
all edges have reverse pair: True 5 Å cutoff: 28 edges 10 Å cutoff: 120 edges 15 Å cutoff: 180 edges
4. Protein language model embeddings (ESM-2)¶
ESM2Embedder wraps Meta's ESM-2 — the de-facto pre-trained protein
language model. Train a head on its embeddings and you have a strong
baseline for almost any protein-prediction task.
The heavy bits (torch, transformers, the model weights) are lazy-
imported on first call. Construction is free; nothing is loaded until
you actually call embed().
# 🐢 SLOW — needs `pip install 'molforge[ml]'` and ~4 GB GPU memory
# for the 650M-param default model
from molforge.ml import ESM2Embedder
embedder = ESM2Embedder(
model_name="facebook/esm2_t33_650M_UR50D", # 1280-dim, default
device="cuda",
layer=-1, # final layer (most discriminative for downstream tasks)
)
# Per-residue: returns (L, D)
emb = embedder.embed("MKTVRQERLKSIVRILERSK")
# emb.shape # (20, 1280)
# Per-sequence (pooled to a single vector):
pooled = embedder.embed_pooled("MKTVRQERLKSIVRILERSK", pooling="mean")
# pooled.shape # (1280,)
Three model sizes available by tweaking model_name:
| Model | Params | Embedding dim | Notes |
|---|---|---|---|
esm2_t6_8M_UR50D |
8M | 320 | Fast preview; CPU-friendly |
esm2_t30_150M_UR50D |
150M | 640 | Good speed/accuracy tradeoff |
esm2_t33_650M_UR50D |
650M | 1280 | Default; standard in most papers |
esm2_t36_3B_UR50D |
3B | 2560 | Highest single-GPU accuracy |
esm2_t48_15B_UR50D |
15B | 5120 | Research scale |
Putting it all together¶
A common protein-ML pattern: combine sequence one-hot, structural features, and language-model embeddings into a single per-residue feature tensor that feeds your downstream model.
seq_feat = compose_features(
one_hot(seq),
blosum_embed(seq),
positional_encoding(len(seq), dim=64),
) # (L, 105)
struct_feat = per_residue_features(protein) # (L, 29)
lm_emb = embedder.embed(seq) # (L, 1280)
# Concatenate everything per-residue
combined = np.concatenate([seq_feat, struct_feat, lm_emb], axis=-1)
# combined.shape # (L, 1414)
Or for a GNN setup, build the graph once and feed it to your model:
g = to_graph(protein, cutoff=10.0)
# g.node_features is your `x`
# g.edge_index is `edge_index`
# g.edge_features is `edge_attr`
The point of molforge.ml is to free you from re-implementing these
featurizers for every project. The data model and engines are all the
same; the only thing that changes is what you do with the features.