Walkthrough 5 — ML featurization¶

molforge ships an ml subpackage that converts a Protein (or just a sequence) into the kinds of tensors that ML models expect. The sequence featurizers work on strings; the structure featurizers operate on 3-D coordinates; the embeddings wrapper hits a real protein language model.

This walkthrough covers the four layers in order:

Sequence featurizers — one-hot, BLOSUM, positional encoding.
Structure featurizers — distance maps, RBF-binned distances, pair orientations, local environment, combined node features.
Graph construction — turn a Protein into a graph suitable for PyTorch Geometric / DGL.
Pre-trained embeddings — ESM-2 via the ESM2Embedder wrapper.

Everything returns float32 NumPy arrays. Convert to your preferred tensor library with torch.from_numpy(arr) or equivalent.

In [1]:

Copied!

import molforge as mf
import numpy as np
print(f"molforge {mf.__version__}")
import molforge as mf
import numpy as np
print(f"molforge {mf.__version__}")

molforge 0.0.1

1. Sequence featurizers¶

These work on a plain string. The shapes are always (L, D) where L is sequence length and D is feature dim.

In [1]:

Copied!





from molforge.ml import one_hot, blosum_embed, positional_encoding, compose_features

seq = "MKTVRQERLKSIVRILERSK"  # 20 residues
print(f"Sequence: {seq}  (length {len(seq)})")
print()

# One-hot: 21 dims (20 standard AAs + unknown bucket)
oh = one_hot(seq)
print(f"one_hot:             {oh.shape}, dtype={oh.dtype}")

# BLOSUM62 row per residue: 20 dims, captures substitution profile
blosum = blosum_embed(seq, matrix="BLOSUM62")
print(f"blosum_embed:        {blosum.shape}, dtype={blosum.dtype}")

# Sinusoidal positional encoding (Vaswani et al.)
pe = positional_encoding(len(seq), dim=32)
print(f"positional_encoding: {pe.shape}, dtype={pe.dtype}")
from molforge.ml import one_hot, blosum_embed, positional_encoding, compose_features

seq = "MKTVRQERLKSIVRILERSK"  # 20 residues
print(f"Sequence: {seq}  (length {len(seq)})")
print()

# One-hot: 21 dims (20 standard AAs + unknown bucket)
oh = one_hot(seq)
print(f"one_hot:             {oh.shape}, dtype={oh.dtype}")

# BLOSUM62 row per residue: 20 dims, captures substitution profile
blosum = blosum_embed(seq, matrix="BLOSUM62")
print(f"blosum_embed:        {blosum.shape}, dtype={blosum.dtype}")

# Sinusoidal positional encoding (Vaswani et al.)
pe = positional_encoding(len(seq), dim=32)
print(f"positional_encoding: {pe.shape}, dtype={pe.dtype}")

Sequence: MKTVRQERLKSIVRILERSK  (length 20)

one_hot:             (20, 21), dtype=float32
blosum_embed:        (20, 20), dtype=float32
positional_encoding: (20, 32), dtype=float32

In [1]:

Copied!





# Concatenate them into a single per-residue feature tensor
combined = compose_features(oh, blosum, pe)
print(f"combined: {combined.shape}")
# 21 + 20 + 32 = 73 dims per residue
print(f"per-residue feature dim: {combined.shape[-1]}")
# Concatenate them into a single per-residue feature tensor
combined = compose_features(oh, blosum, pe)
print(f"combined: {combined.shape}")
# 21 + 20 + 32 = 73 dims per residue
print(f"per-residue feature dim: {combined.shape[-1]}")

combined: (20, 73)
per-residue feature dim: 73

2. Structure featurizers¶

These need a Protein with real 3-D coordinates. We'll use the bundled helix fixture for demonstration; substitute your real ESMFold output (engine.predict(seq)) for genuine use.

In [1]:

Copied!





from pathlib import Path
from molforge.io import read_pdb

protein = read_pdb(Path("../../tests/fixtures/pdb/helix.pdb"))
print(protein)
print(f"residues: {protein.n_residues}")
from pathlib import Path
from molforge.io import read_pdb

protein = read_pdb(Path("../../tests/fixtures/pdb/helix.pdb"))
print(protein)
print(f"residues: {protein.n_residues}")

Protein(name='helix', n_chains=1, n_residues=15, n_atoms=60)
residues: 15

In [1]:

Copied!





from molforge.ml import (
    pair_distances,
    pair_distance_features,
    pair_orientations,
    local_environment,
)

# Raw CA-CA distance map
d = pair_distances(protein, atom_choice="ca")
print(f"pair_distances:        {d.shape}")
print(f"sample distances [0,1:4]: {d[0, 1:4]}")
print()

# Gaussian-RBF binned distances (the standard input for distance-based GNNs)
rbf = pair_distance_features(protein, n_bins=16, d_min=2.0, d_max=22.0)
print(f"pair_distance_features: {rbf.shape}")
print(f"sample RBF [0, 1] (first 4 bins): {rbf[0, 1, :4]}")
print()

# Pair orientations: CA-CA unit vectors, distances, and local-frame cosines
orient = pair_orientations(protein)
print(f"orientations keys: {list(orient.keys())}")
print(f"direction shape:   {orient['direction'].shape}")
print(f"cosine sample (residue 0 vs all):")
print(f"  {orient['cosine'][0, :5]}")
from molforge.ml import (
    pair_distances,
    pair_distance_features,
    pair_orientations,
    local_environment,
)

# Raw CA-CA distance map
d = pair_distances(protein, atom_choice="ca")
print(f"pair_distances:        {d.shape}")
print(f"sample distances [0,1:4]: {d[0, 1:4]}")
print()

# Gaussian-RBF binned distances (the standard input for distance-based GNNs)
rbf = pair_distance_features(protein, n_bins=16, d_min=2.0, d_max=22.0)
print(f"pair_distance_features: {rbf.shape}")
print(f"sample RBF [0, 1] (first 4 bins): {rbf[0, 1, :4]}")
print()

# Pair orientations: CA-CA unit vectors, distances, and local-frame cosines
orient = pair_orientations(protein)
print(f"orientations keys: {list(orient.keys())}")
print(f"direction shape:   {orient['direction'].shape}")
print(f"cosine sample (residue 0 vs all):")
print(f"  {orient['cosine'][0, :5]}")

pair_distances:        (15, 15)
sample distances [0,1:4]: [3.8039281 5.464399  5.1869736]

pair_distance_features: (15, 15, 16)
sample RBF [0, 1] (first 4 bins): [0.40042388 0.93961453 0.81112    0.25758818]

orientations keys: ['direction', 'distance', 'cosine']
direction shape:   (15, 15, 3)
cosine sample (residue 0 vs all):
  [0.         0.99999994 0.71829283 0.29170692 0.568028  ]

In [1]:

Copied!





# Local environment: counts of atoms by element within a radius
env = local_environment(protein, radius=10.0)
print(f"local_environment: {env.shape}")
print(f"columns: C, N, O, S, other")
print(f"first 3 residues:")
print(env[:3])
# Local environment: counts of atoms by element within a radius
env = local_environment(protein, radius=10.0)
print(f"local_environment: {env.shape}")
print(f"columns: C, N, O, S, other")
print(f"first 3 residues:")
print(env[:3])

local_environment: (15, 5)
columns: C, N, O, S, other
first 3 residues:
[[12.  7.  5.  0.  0.]
 [14.  8.  6.  0.  0.]
 [16.  9.  7.  0.  0.]]

Combined per-residue node features¶

per_residue_features packages one-hot identity + local environment + DSSP into a single tensor — the standard input for protein GNNs.

In [1]:

Copied!





from molforge.ml import per_residue_features

nodes = per_residue_features(protein)
print(f"per_residue_features: {nodes.shape}")
# 21 (one-hot) + 5 (environment) + 3 (DSSP) = 29
print(f"breakdown: 21 one-hot + 5 environment + 3 DSSP = 29")
print(f"first residue features (first 25 dims):")
print(nodes[0, :25])
from molforge.ml import per_residue_features

nodes = per_residue_features(protein)
print(f"per_residue_features: {nodes.shape}")
# 21 (one-hot) + 5 (environment) + 3 (DSSP) = 29
print(f"breakdown: 21 one-hot + 5 environment + 3 DSSP = 29")
print(f"first residue features (first 25 dims):")
print(nodes[0, :25])

per_residue_features: (15, 29)
breakdown: 21 one-hot + 5 environment + 3 DSSP = 29
first residue features (first 25 dims):
[ 1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0. 12.  7.  5.  0.]

3. Graph construction¶

to_graph builds a ProteinGraph with the standard (node_features, edge_index, edge_features) triple. Drop-in for PyTorch Geometric:

import torch_geometric as pyg
g = to_graph(protein)
pyg_data = pyg.data.Data(
    x=torch.from_numpy(g.node_features),
    edge_index=torch.from_numpy(g.edge_index),
    edge_attr=torch.from_numpy(g.edge_features),
)

For DGL or other frameworks the conversion is similarly trivial — see their respective docs for the exact incantation.

In [1]:

Copied!





from molforge.ml import to_graph

graph = to_graph(protein, cutoff=10.0, edge_distance_bins=16)
print(f"nodes:      {graph.n_nodes}")
print(f"edges:      {graph.n_edges}")
print(f"node_dim:   {graph.node_dim}")
print(f"edge_dim:   {graph.edge_dim}")
print()
print(f"edge_index shape: {graph.edge_index.shape}")
print(f"first 8 edges (src -> tgt):")
for s, t in zip(graph.edge_index[0, :8], graph.edge_index[1, :8]):
    print(f"  {int(s)} -> {int(t)}")
from molforge.ml import to_graph

graph = to_graph(protein, cutoff=10.0, edge_distance_bins=16)
print(f"nodes:      {graph.n_nodes}")
print(f"edges:      {graph.n_edges}")
print(f"node_dim:   {graph.node_dim}")
print(f"edge_dim:   {graph.edge_dim}")
print()
print(f"edge_index shape: {graph.edge_index.shape}")
print(f"first 8 edges (src -> tgt):")
for s, t in zip(graph.edge_index[0, :8], graph.edge_index[1, :8]):
    print(f"  {int(s)} -> {int(t)}")

nodes:      15
edges:      120
node_dim:   29
edge_dim:   16

edge_index shape: (2, 120)
first 8 edges (src -> tgt):
  0 -> 1
  0 -> 2
  0 -> 3
  0 -> 4
  0 -> 5
  1 -> 0
  1 -> 2
  1 -> 3

In [1]:

Copied!





# Edges are bidirectional (the (i, j) edge is paired with (j, i))
edges = set(
    (int(s), int(t))
    for s, t in zip(graph.edge_index[0], graph.edge_index[1])
)
all_paired = all((t, s) in edges for s, t in edges)
print(f"all edges have reverse pair: {all_paired}")

# Tightening the cutoff reduces edges
tight = to_graph(protein, cutoff=5.0)
loose = to_graph(protein, cutoff=15.0)
print(f"5  Å cutoff: {tight.n_edges} edges")
print(f"10 Å cutoff: {graph.n_edges} edges")
print(f"15 Å cutoff: {loose.n_edges} edges")
# Edges are bidirectional (the (i, j) edge is paired with (j, i))
edges = set(
    (int(s), int(t))
    for s, t in zip(graph.edge_index[0], graph.edge_index[1])
)
all_paired = all((t, s) in edges for s, t in edges)
print(f"all edges have reverse pair: {all_paired}")

# Tightening the cutoff reduces edges
tight = to_graph(protein, cutoff=5.0)
loose = to_graph(protein, cutoff=15.0)
print(f"5  Å cutoff: {tight.n_edges} edges")
print(f"10 Å cutoff: {graph.n_edges} edges")
print(f"15 Å cutoff: {loose.n_edges} edges")

all edges have reverse pair: True
5  Å cutoff: 28 edges
10 Å cutoff: 120 edges
15 Å cutoff: 180 edges

4. Protein language model embeddings (ESM-2)¶

ESM2Embedder wraps Meta's ESM-2 — the de-facto pre-trained protein language model. Train a head on its embeddings and you have a strong baseline for almost any protein-prediction task.

The heavy bits (torch, transformers, the model weights) are lazy- imported on first call. Construction is free; nothing is loaded until you actually call embed().

# 🐢 SLOW — needs `pip install 'molforge[ml]'` and ~4 GB GPU memory
# for the 650M-param default model
from molforge.ml import ESM2Embedder

embedder = ESM2Embedder(
    model_name="facebook/esm2_t33_650M_UR50D",   # 1280-dim, default
    device="cuda",
    layer=-1,        # final layer (most discriminative for downstream tasks)
)

# Per-residue: returns (L, D)
emb = embedder.embed("MKTVRQERLKSIVRILERSK")
# emb.shape  # (20, 1280)

# Per-sequence (pooled to a single vector):
pooled = embedder.embed_pooled("MKTVRQERLKSIVRILERSK", pooling="mean")
# pooled.shape  # (1280,)

Three model sizes available by tweaking model_name:

Model	Params	Embedding dim	Notes
`esm2_t6_8M_UR50D`	8M	320	Fast preview; CPU-friendly
`esm2_t30_150M_UR50D`	150M	640	Good speed/accuracy tradeoff
`esm2_t33_650M_UR50D`	650M	1280	Default; standard in most papers
`esm2_t36_3B_UR50D`	3B	2560	Highest single-GPU accuracy
`esm2_t48_15B_UR50D`	15B	5120	Research scale

Putting it all together¶

A common protein-ML pattern: combine sequence one-hot, structural features, and language-model embeddings into a single per-residue feature tensor that feeds your downstream model.

seq_feat   = compose_features(
    one_hot(seq),
    blosum_embed(seq),
    positional_encoding(len(seq), dim=64),
)                                              # (L, 105)
struct_feat = per_residue_features(protein)    # (L, 29)
lm_emb      = embedder.embed(seq)              # (L, 1280)

# Concatenate everything per-residue
combined = np.concatenate([seq_feat, struct_feat, lm_emb], axis=-1)
# combined.shape  # (L, 1414)

Or for a GNN setup, build the graph once and feed it to your model:

g = to_graph(protein, cutoff=10.0)
# g.node_features is your `x`
# g.edge_index is `edge_index`
# g.edge_features is `edge_attr`

The point of molforge.ml is to free you from re-implementing these featurizers for every project. The data model and engines are all the same; the only thing that changes is what you do with the features.