
molforge.ml


ML utilities: featurizers, embeddings, and graph representations.

Turn a molforge.core.Protein into ML-ready arrays. Three featurizer layers plus graph construction:

Sequence featurizers (no structure needed):
- one_hot: 21-dim one-hot encoding.
- blosum_embed: BLOSUM62 / PAM250 rows as embeddings.
- positional_encoding: sinusoidal (Vaswani-style) positional encoding.
- compose_features: concatenate featurizers.

Structure featurizers (need 3D coordinates):
- pair_distances: float32 distance map.
- pair_distance_features: Gaussian-RBF binned distances (the standard featurization for distance-based GNNs).
- pair_orientations: backbone orientation features (CA-CA vectors, distances, and local-frame cosines).
- local_environment: per-residue atomic environment counts by element.
- per_residue_features: combined node-feature vector (one-hot + environment + DSSP).

Protein language model embeddings (heavy, lazy):
- ESM2Embedder: wraps ESM-2 via HuggingFace transformers.

Graph construction:
- to_graph: build a ProteinGraph ready for PyTorch Geometric / DGL.

The featurizers return plain NumPy float32 arrays. Convert to your preferred tensor library downstream (torch.from_numpy(arr) etc.).

EmbeddingNotInstalledError

Bases: ImportError

Raised when the heavy embedding dependencies aren't installed.

ESM2Embedder

ESM2Embedder(
    *,
    model_name: str = "facebook/esm2_t33_650M_UR50D",
    device: str | None = None,
    layer: int = -1,
    dtype: str = "float32",
)

Per-residue and per-sequence embeddings via ESM-2.

Parameters:

Name Type Description Default
model_name str

HuggingFace model identifier (default "facebook/esm2_t33_650M_UR50D").

'facebook/esm2_t33_650M_UR50D'
device str | None

"cuda", "cpu", "mps", or None for auto-detect.

None
layer int

which transformer layer to extract embeddings from. Defaults to the model's last layer. Final-layer embeddings are most discriminative for most downstream tasks; mid-layer embeddings often work better for structure prediction.

-1
dtype str

"float32" (default) or "float16" for faster GPU inference.

'float32'
Example

>>> from molforge.ml import ESM2Embedder
>>> embedder = ESM2Embedder(device="cuda")
>>> emb = embedder.embed("MKTVRQERLKSIVRILERSK")
>>> emb.shape
(20, 1280)

embed

embed(sequence: str) -> NDArray[np.float32]

Per-residue embeddings for a single sequence.

Parameters:

Name Type Description Default
sequence str

one-letter amino-acid sequence.

required

Returns:

Type Description
NDArray[float32]

(L, D) float32 array, where L is sequence length and D is the model's embedding dimensionality (e.g. 1280 for the 650M model). The CLS / EOS tokens are stripped before returning so the leading axis aligns with residue index.

embed_many

embed_many(
    sequences: Sequence[str],
) -> list[NDArray[np.float32]]

Per-residue embeddings for a batch of sequences.

Sequences of different lengths can't be stacked into a single array without padding, so this returns a list of (L_i, D) arrays. For length-padded batched inference, call the underlying self._model directly, tokenizing the batch with self._tokenizer and padding=True.
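If a single padded array is needed downstream, the list returned by embed_many can be stacked after the fact. The following is a minimal NumPy sketch, not part of molforge: the helper name pad_embeddings and the zero-padding plus boolean-mask layout are assumptions made for illustration.

```python
import numpy as np


def pad_embeddings(embs):
    """Stack variable-length (L_i, D) arrays into (B, L_max, D) plus a mask.

    Hypothetical helper, not a molforge function. Padded positions are
    filled with zeros and marked False in the mask.
    """
    if not embs:
        raise ValueError("need at least one embedding")
    dim = embs[0].shape[1]
    max_len = max(e.shape[0] for e in embs)
    batch = np.zeros((len(embs), max_len, dim), dtype=np.float32)
    mask = np.zeros((len(embs), max_len), dtype=bool)
    for i, e in enumerate(embs):
        batch[i, : e.shape[0]] = e   # copy the real residues
        mask[i, : e.shape[0]] = True  # mark them as valid
    return batch, mask
```

The mask lets downstream code ignore padded rows, for example when averaging over residues.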

embed_pooled

embed_pooled(
    sequence: str, *, pooling: str = "mean"
) -> NDArray[np.float32]

Per-sequence embedding (a single fixed-size vector per protein).

Parameters:

Name Type Description Default
sequence str

one-letter amino-acid sequence.

required
pooling str

"mean" (default), "max", or "cls". "cls" returns the CLS token's embedding without averaging.

'mean'

Returns:

Type Description
NDArray[float32]

(D,) float32 array.
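For intuition, "mean" and "max" pooling reduce an (L, D) per-residue array to a (D,) vector along the residue axis. The sketch below shows that reduction in plain NumPy; the function name pool_residues is hypothetical, and "cls" is left out because it needs the CLS token, which embed strips before returning.

```python
import numpy as np


def pool_residues(emb, pooling="mean"):
    """Reduce an (L, D) per-residue array to a (D,) vector.

    Illustrative only; not the molforge implementation.
    """
    if pooling == "mean":
        return emb.mean(axis=0)  # average over residues
    if pooling == "max":
        return emb.max(axis=0)   # element-wise maximum over residues
    raise ValueError(f"unsupported pooling: {pooling!r}")
```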

ProteinGraph dataclass

ProteinGraph(
    node_features: NDArray[float32],
    edge_index: NDArray[int64],
    edge_features: NDArray[float32],
    residue_labels: list[tuple[str, int]],
)

A residue-graph representation of a protein structure.

Attributes:

Name Type Description
node_features NDArray[float32]

(n_res, F_node) float32 array.

edge_index NDArray[int64]

(2, n_edges) int64 array of (source, target) residue index pairs. Follows the PyTorch Geometric convention: edge_index[0] is the source row, edge_index[1] is the target row.

edge_features NDArray[float32]

(n_edges, F_edge) float32 array.

residue_labels list[tuple[str, int]]

[(chain_id, residue_id)] for each node.

to_graph

to_graph(
    protein: Protein,
    *,
    cutoff: float = 10.0,
    self_loops: bool = False,
    include_dssp: bool = True,
    include_environment: bool = True,
    edge_distance_bins: int = 16,
) -> ProteinGraph

Build a residue-graph representation of protein.

Nodes: each protein residue with a CA atom.
Edges: undirected (both directions present) connections between residues whose CA atoms are within cutoff Å of each other.

Parameters:

Name Type Description Default
protein Protein

input structure.

required
cutoff float

edge cutoff distance in Å. 10 Å is the typical default for protein GNNs (captures both direct and ~1-residue-away contacts).

10.0
self_loops bool

include i->i edges. Some GNN architectures want these; most don't.

False
include_dssp bool

passed to per_residue_features.

True
include_environment bool

passed to per_residue_features.

True
edge_distance_bins int

number of RBF basis functions for the edge-distance feature; pass 0 to use raw distance.

16

Returns:

Type Description
ProteinGraph

A ProteinGraph.

Example

>>> from molforge.ml import to_graph
>>> g = to_graph(my_protein, cutoff=8.0)
>>> g.node_features.shape
(76, 29)
>>> g.edge_index.shape
(2, 421)
>>> g.edge_features.shape
(421, 16)
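The edge rule described above (connect residues whose CA atoms lie within cutoff Å, storing both directions) can be sketched in NumPy as follows. This is not the molforge implementation: radius_edges is a hypothetical name, the input is assumed to be an (n_res, 3) array of CA coordinates, and treating "within" as an inclusive comparison is an assumption.

```python
import numpy as np


def radius_edges(ca_coords, cutoff=10.0, self_loops=False):
    """Return a (2, n_edges) int64 edge index and raw edge distances.

    Sketch only: assumes `ca_coords` is an (n_res, 3) array and that
    "within cutoff" means distance <= cutoff.
    """
    ca_coords = np.asarray(ca_coords, dtype=np.float64)
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)    # (n_res, n_res)
    adjacency = dist <= cutoff              # symmetric, so both directions appear
    if not self_loops:
        np.fill_diagonal(adjacency, False)  # drop i -> i edges
    src, dst = np.nonzero(adjacency)
    edge_index = np.stack([src, dst]).astype(np.int64)
    return edge_index, dist[src, dst].astype(np.float32)
```

Because the adjacency matrix is symmetric, every edge i -> j comes with its reverse j -> i, which matches the "both directions present" convention.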

blosum_embed

blosum_embed(
    sequence: str, *, matrix: str = "BLOSUM62"
) -> NDArray[np.float32]

Use rows of a substitution matrix as per-residue embeddings.

This is a classic trick from the pre-deep-learning era: every residue gets a 20-dim embedding that captures its substitution profile against the standard amino acids. Surprisingly strong as a baseline for many downstream tasks.

Parameters:

Name Type Description Default
sequence str

one-letter amino-acid sequence.

required
matrix str

substitution matrix name ("BLOSUM62" or "PAM250").

'BLOSUM62'

Returns:

Type Description
NDArray[float32]

(L, 20) float32 array. Non-standard residues get the "X" row (which exists in both BLOSUM62 and PAM250).

compose_features

compose_features(
    *features: NDArray[float32],
) -> NDArray[np.float32]

Concatenate multiple per-residue feature arrays along the last axis.

All inputs must have the same leading dimension (sequence length).

Parameters:

Name Type Description Default
*features NDArray[float32]

any number of (L, D_i) arrays.

()

Returns:

Type Description
NDArray[float32]

(L, sum(D_i)) float32 array.

Raises:

Type Description
ValueError

if shapes are incompatible.
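The contract above (same leading dimension, concatenation along the last axis, ValueError otherwise) can be sketched in a few lines of NumPy. The name concat_features is hypothetical, and raising on an empty argument list is an assumption; the documented function may behave differently in that case.

```python
import numpy as np


def concat_features(*features):
    """Concatenate (L, D_i) arrays into one (L, sum(D_i)) float32 array.

    Sketch only; not the molforge implementation.
    """
    if not features:
        raise ValueError("need at least one feature array")  # assumed behavior
    lengths = {f.shape[0] for f in features}
    if len(lengths) != 1:
        raise ValueError(f"mismatched sequence lengths: {sorted(lengths)}")
    return np.concatenate(features, axis=-1).astype(np.float32)
```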

one_hot

one_hot(
    sequence: str, *, include_unk: bool = True
) -> NDArray[np.float32]

One-hot encode a protein sequence.

Parameters:

Name Type Description Default
sequence str

one-letter amino-acid sequence.

required
include_unk bool

if True (default), use a 21-dimensional encoding with column 20 set to 1 for any non-standard residue. If False, use 20 dims and silently drop the bit for non-standard residues (their rows will be all zero).

True

Returns:

Type Description
NDArray[float32]

(L, 21) or (L, 20) float32 array.
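To make the include_unk behavior concrete, here is a minimal NumPy sketch of the encoding. It is not the molforge implementation: one_hot_sketch is a hypothetical name, and the alphabetical column order of the 20 standard residues is an assumption.

```python
import numpy as np

# Assumed column order for the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot_sketch(sequence, include_unk=True):
    """One-hot encode a sequence into (L, 21) or (L, 20) float32."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    width = 21 if include_unk else 20
    out = np.zeros((len(sequence), width), dtype=np.float32)
    for pos, aa in enumerate(sequence):
        col = index.get(aa)
        if col is not None:
            out[pos, col] = 1.0   # standard residue
        elif include_unk:
            out[pos, 20] = 1.0    # non-standard residue goes to column 20
        # with include_unk=False the row stays all zero
    return out
```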

positional_encoding

positional_encoding(
    length: int, dim: int = 64, *, base: float = 10000.0
) -> NDArray[np.float32]

Sinusoidal absolute positional encoding (Vaswani et al. 2017).

Identical formulation to the original Transformer paper:

PE[pos, 2i] = sin(pos / base^(2i/dim))
PE[pos, 2i+1] = cos(pos / base^(2i/dim))

Parameters:

Name Type Description Default
length int

sequence length.

required
dim int

embedding dimensionality (must be even).

64
base float

wavelength base. The Vaswani default is 10000.

10000.0

Returns:

Type Description
NDArray[float32]

(length, dim) float32 array.
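The sinusoidal formula translates directly into NumPy. The sketch below is an illustration under the stated formula, not the molforge source; the name sinusoidal_encoding is hypothetical.

```python
import numpy as np


def sinusoidal_encoding(length, dim=64, base=10000.0):
    """Return a (length, dim) float32 sinusoidal positional encoding."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    pos = np.arange(length, dtype=np.float64)[:, None]  # (length, 1)
    i = np.arange(dim // 2, dtype=np.float64)[None, :]  # (1, dim / 2)
    angles = pos / base ** (2.0 * i / dim)              # pos / base^(2i/dim)
    pe = np.empty((length, dim), dtype=np.float32)
    pe[:, 0::2] = np.sin(angles)  # even columns
    pe[:, 1::2] = np.cos(angles)  # odd columns
    return pe
```

Position 0 always encodes to alternating 0 and 1, since sin(0) = 0 and cos(0) = 1.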

local_environment

local_environment(
    protein: Protein, *, radius: float = 10.0
) -> NDArray[np.float32]

Per-residue local atomic environment counts.

For each protein residue, count the atoms of each chemical element within radius Å of that residue's CA. This is a simple but effective featurization that captures local packing.

Parameters:

Name Type Description Default
protein Protein

input structure.

required
radius float

cutoff radius in Å (default 10).

10.0

Returns:

Type Description
NDArray[float32]

(n_res, 5) float32 array. Columns are counts of C, N, O, S, and "other" elements respectively.
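The counting scheme can be sketched with plain arrays in place of a Protein object. Everything about the inputs here is an assumption for illustration: element_counts is a hypothetical name, coordinates arrive as NumPy arrays, elements as a list of symbols, all atoms (including the residue's own) are counted, and the radius comparison is inclusive.

```python
import numpy as np

ELEMENTS = ("C", "N", "O", "S")  # column 4 collects everything else


def element_counts(ca_coords, atom_coords, atom_elements, radius=10.0):
    """Count atoms per element within `radius` of each CA.

    Sketch only: `ca_coords` is (n_res, 3), `atom_coords` is (n_atoms, 3),
    and `atom_elements` is a length-n_atoms sequence of element symbols.
    """
    ca_coords = np.asarray(ca_coords, dtype=np.float64)
    atom_coords = np.asarray(atom_coords, dtype=np.float64)
    diff = ca_coords[:, None, :] - atom_coords[None, :, :]
    within = np.linalg.norm(diff, axis=-1) <= radius  # (n_res, n_atoms)
    col = np.array(
        [ELEMENTS.index(e) if e in ELEMENTS else 4 for e in atom_elements]
    )
    counts = np.zeros((len(ca_coords), 5), dtype=np.float32)
    for k in range(5):
        counts[:, k] = within[:, col == k].sum(axis=1)
    return counts
```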

pair_distance_features

pair_distance_features(
    protein: Protein,
    *,
    n_bins: int = 16,
    d_min: float = 2.0,
    d_max: float = 22.0,
    atom_choice: Literal["ca", "cb", "heavy", "all"] = "ca",
) -> NDArray[np.float32]

Gaussian-radial-basis-function (RBF) encoding of pair distances.

This is the standard featurization used in modern protein GNNs (Equivariant GNNs, GearNet, etc.). Each pair distance is expanded into n_bins Gaussian basis functions evenly spaced between d_min and d_max, with sigma chosen so the bases overlap.

Parameters:

Name Type Description Default
protein Protein

input structure.

required
n_bins int

number of RBF basis functions. 16 is the typical default.

16
d_min float

lower end of the distance range covered by the basis (Å).

2.0
d_max float

upper end of the distance range covered by the basis (Å).

22.0
atom_choice Literal['ca', 'cb', 'heavy', 'all']

anchor atom per residue.

'ca'

Returns:

Type Description
NDArray[float32]

(n_res, n_res, n_bins) float32 array. The [i, j, k] entry is exp(-(d_ij - centers[k])^2 / (2 sigma^2)).
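The RBF expansion can be sketched on a precomputed distance matrix. This is an illustration of the formula above, not the molforge source: rbf_expand is a hypothetical name, and setting sigma to the range divided by n_bins is an assumption, since the documentation only says sigma is chosen so the bases overlap.

```python
import numpy as np


def rbf_expand(dist, n_bins=16, d_min=2.0, d_max=22.0):
    """Expand distances of shape (...,) into (..., n_bins) Gaussian RBFs."""
    dist = np.asarray(dist, dtype=np.float64)
    centers = np.linspace(d_min, d_max, n_bins)  # evenly spaced centers
    sigma = (d_max - d_min) / n_bins             # assumed width, not from the docs
    z = (dist[..., None] - centers) / sigma
    return np.exp(-0.5 * z**2).astype(np.float32)  # exp(-(d - c)^2 / (2 sigma^2))
```

A distance that coincides with a center produces exactly 1.0 in that bin and decays smoothly in the neighboring bins.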

pair_distances

pair_distances(
    protein: Protein,
    *,
    atom_choice: Literal["ca", "cb", "heavy", "all"] = "ca",
) -> NDArray[np.float32]

Compute the residue-residue distance matrix.

Convenience wrapper over molforge.structure.distance_map that returns float32 for downstream tensor conversion.

Parameters:

Name Type Description Default
protein Protein

input structure.

required
atom_choice Literal['ca', 'cb', 'heavy', 'all']

which atom defines residue position ("ca", "cb", "heavy", "all").

'ca'

Returns:

Type Description
NDArray[float32]

(n_res, n_res) float32 array of distances in Å.

pair_orientations

pair_orientations(
    protein: Protein,
) -> dict[str, NDArray[np.float32]]

Backbone orientation features between every pair of residues.

Computes for each residue pair (i, j):
- direction: unit vector from CA(i) to CA(j) in i's local frame
- distance: ||CA(j) - CA(i)||
- cosine: cosine of the angle between the CA(i)-CA(j) vector and residue i's local "forward" direction (CA(i+1) - CA(i-1)). Captures local orientation context.

These are useful as edge features in equivariant-style protein GNNs.

Returns:

Type Description
dict[str, NDArray[float32]]

Dict with keys "direction" ((n, n, 3)), "distance" ((n, n)), and "cosine" ((n, n)).
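The distance and cosine terms can be sketched from CA coordinates alone. This is not the molforge implementation: ca_distance_and_cosine is a hypothetical name, the one-sided differences used for the first and last residue are an assumption (the documented forward direction CA(i+1) - CA(i-1) is undefined there), and the direction term is omitted because the construction of the local frame is not specified here.

```python
import numpy as np


def ca_distance_and_cosine(ca):
    """Return (distance, cosine), each (n, n) float32, from (n, 3) CA coords.

    Sketch only; requires n >= 2.
    """
    ca = np.asarray(ca, dtype=np.float64)
    forward = np.zeros_like(ca)
    forward[1:-1] = ca[2:] - ca[:-2]  # CA(i+1) - CA(i-1) for interior residues
    forward[0] = ca[1] - ca[0]        # assumed one-sided difference at the N-terminus
    forward[-1] = ca[-1] - ca[-2]     # assumed one-sided difference at the C-terminus

    vec = ca[None, :, :] - ca[:, None, :]  # vec[i, j] = CA(j) - CA(i)
    dist = np.linalg.norm(vec, axis=-1)
    dot = np.einsum("ijk,ik->ij", vec, forward)
    denom = dist * np.linalg.norm(forward, axis=-1)[:, None]
    # Leave the cosine at 0 where it is undefined (i == j or a zero-length vector).
    cosine = np.divide(dot, denom, out=np.zeros_like(dot), where=denom > 0)
    return dist.astype(np.float32), cosine.astype(np.float32)
```

On a straight chain, residues ahead of i get cosine +1 and residues behind it get -1, which is the orientation signal the feature is meant to carry.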

per_residue_features

per_residue_features(
    protein: Protein,
    *,
    include_sequence: bool = True,
    include_environment: bool = True,
    include_dssp: bool = True,
) -> NDArray[np.float32]

Combined per-residue feature vectors suitable as GNN node features.

Stacks (along the feature dimension):
- One-hot residue identity (21 dims, if include_sequence)
- Local environment element counts (5 dims, if include_environment)
- DSSP 3-state one-hot (3 dims, if include_dssp)

Parameters:

Name Type Description Default
protein Protein

input structure.

required
include_sequence bool

include the one-hot residue identity block (21 dims).

True
include_environment bool

include the local-environment block (5 dims).

True
include_dssp bool

include the DSSP 3-state one-hot block (3 dims).

True

Returns:

Type Description
NDArray[float32]

(n_res, D) float32 array, where D depends on which blocks are included.