molforge.ml¶

ml ¶

ML utilities: featurizers, embeddings, and graph representations.

Turn a `molforge.core.Protein` into ML-ready tensors. Three featurizer layers, plus graph construction:

Sequence featurizers (no structure needed):

- `one_hot`: 21-dim one-hot encoding.
- `blosum_embed`: BLOSUM62 / PAM250 rows as embeddings.
- `positional_encoding`: sinusoidal (Vaswani-style) positional encoding.
- `compose_features`: concatenate featurizers.

Structure featurizers (need 3D coordinates):

- `pair_distances`: float32 distance map.
- `pair_distance_features`: Gaussian-RBF binned distances (the standard featurization for distance-based GNNs).
- `pair_orientations`: backbone orientation features (CA-CA vectors, distances, and local-frame cosines).
- `local_environment`: per-residue atomic environment counts by element.
- `per_residue_features`: combined node-feature vector (one-hot + environment + DSSP).

Protein language model embeddings (heavy, lazy):

- `ESM2Embedder`: wraps ESM-2 via HuggingFace transformers.

Graph construction:

- `to_graph`: build a `ProteinGraph` ready for PyTorch Geometric / DGL.

The featurizers return plain NumPy float32 arrays. Convert to your preferred tensor library downstream (torch.from_numpy(arr) etc.).

EmbeddingNotInstalledError ¶

Bases: ImportError

Raised when the heavy embedding dependencies aren't installed.

ESM2Embedder ¶

ESM2Embedder(
    *,
    model_name: str = "facebook/esm2_t33_650M_UR50D",
    device: str | None = None,
    layer: int = -1,
    dtype: str = "float32",
)

Per-residue and per-sequence embeddings via ESM-2.

Parameters:

Name	Type	Description	Default
`model_name`	`str`	HuggingFace model identifier (default `"facebook/esm2_t33_650M_UR50D"`).	`'facebook/esm2_t33_650M_UR50D'`
`device`	`str \| None`	`"cuda"`, `"cpu"`, `"mps"`, or `None` for auto-detect.	`None`
`layer`	`int`	which transformer layer to extract embeddings from. Defaults to the model's last layer. Final-layer embeddings are most discriminative for most downstream tasks; mid-layer embeddings often work better for structure prediction.	`-1`
`dtype`	`str`	`"float32"` (default) or `"float16"` for faster GPU inference.	`'float32'`

Example

>>> from molforge.ml import ESM2Embedder
>>> embedder = ESM2Embedder(device="cuda")
>>> emb = embedder.embed("MKTVRQERLKSIVRILERSK")
>>> emb.shape
(20, 1280)

embed ¶

embed(sequence: str) -> NDArray[np.float32]

Per-residue embeddings for a single sequence.

Parameters:

Name	Type	Description	Default
`sequence`	`str`	one-letter amino-acid sequence.	required

Returns:

Type	Description
`NDArray[float32]`	`(L, D)` float32 array, where `L` is the sequence length and `D` is the model's embedding dimensionality (e.g. 1280 for the 650M model). The CLS / EOS tokens are stripped before returning so the leading axis aligns with residue index.

embed_many ¶

embed_many(
    sequences: Sequence[str],
) -> list[NDArray[np.float32]]

Per-residue embeddings for a batch of sequences.

Sequences of different lengths can't be stacked into a single tensor without padding, so this returns a list of `(L_i, D)` arrays. For length-padded batched inference, call the underlying `self._tokenizer` with `padding=True` and pass the result to `self._model` directly.

embed_pooled ¶

embed_pooled(
    sequence: str, *, pooling: str = "mean"
) -> NDArray[np.float32]

Per-sequence embedding (a single fixed-size vector per protein).

Parameters:

Name	Type	Description	Default
`sequence`	`str`	one-letter amino-acid sequence.	required
`pooling`	`str`	`"mean"` (default), `"max"`, or `"cls"`. `"cls"` returns the CLS token's embedding without averaging.	`'mean'`

Returns:

Type	Description
`NDArray[float32]`	`(D,)` float32 array.
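
The three pooling modes can be illustrated on a plain array. This is a minimal sketch, not the library's implementation: `pool_embeddings` is a hypothetical helper, and it assumes the raw token embeddings are laid out as `(L + 2, D)` with CLS in the first row and EOS in the last.

```python
import numpy as np

def pool_embeddings(token_emb: np.ndarray, pooling: str = "mean") -> np.ndarray:
    """Pool a (L + 2, D) token-embedding array into a single (D,) vector.

    Assumes row 0 is the CLS token and the last row is EOS (assumed layout).
    """
    residues = token_emb[1:-1]  # drop CLS and EOS so only residue rows remain
    if pooling == "mean":
        pooled = residues.mean(axis=0)
    elif pooling == "max":
        pooled = residues.max(axis=0)
    elif pooling == "cls":
        pooled = token_emb[0]  # CLS row taken as-is, no averaging
    else:
        raise ValueError(f"unknown pooling: {pooling!r}")
    return pooled.astype(np.float32)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(22, 8)).astype(np.float32)  # 20 residues + CLS + EOS
vec = pool_embeddings(tokens, "mean")
```

Whichever mode is chosen, the result has shape `(D,)`, so proteins of different lengths map to vectors of the same size.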

ProteinGraph `dataclass` ¶

ProteinGraph(
    node_features: NDArray[float32],
    edge_index: NDArray[int64],
    edge_features: NDArray[float32],
    residue_labels: list[tuple[str, int]],
)

A residue-graph representation of a protein structure.

Attributes:

Name	Type	Description
`node_features`	`NDArray[float32]`	`(n_res, F_node)` float32 array.
`edge_index`	`NDArray[int64]`	`(2, n_edges)` int64 array of (source, target) residue index pairs. Follows the PyTorch Geometric convention: `edge_index[0]` is the source row, `edge_index[1]` is the target.
`edge_features`	`NDArray[float32]`	`(n_edges, F_edge)` float32 array.
`residue_labels`	`list[tuple[str, int]]`	`[(chain_id, residue_id)]` for each node.

to_graph ¶

to_graph(
    protein: Protein,
    *,
    cutoff: float = 10.0,
    self_loops: bool = False,
    include_dssp: bool = True,
    include_environment: bool = True,
    edge_distance_bins: int = 16,
) -> ProteinGraph

Build a residue-graph representation of `protein`.

- Nodes: each protein residue with a CA atom.
- Edges: undirected (both directions present) connections between residues whose CA atoms are within `cutoff` Å of each other.

Parameters:

Name	Type	Description	Default
`protein`	`Protein`	input structure.	required
`cutoff`	`float`	edge cutoff distance in Å. 10 Å is the typical default for protein GNNs (captures both direct and ~1-residue-away contacts).	`10.0`
`self_loops`	`bool`	include i->i edges. Some GNN architectures want these; most don't.	`False`
`include_dssp`	`bool`	passed to `per_residue_features`.	`True`
`include_environment`	`bool`	passed to `per_residue_features`.	`True`
`edge_distance_bins`	`int`	number of RBF basis functions for the edge-distance feature; pass 0 to use raw distance.	`16`

Returns:

Type	Description
`ProteinGraph`	A `ProteinGraph` built from `protein`.

Example

>>> from molforge.ml import to_graph
>>> g = to_graph(my_protein, cutoff=8.0)
>>> g.node_features.shape
(76, 29)
>>> g.edge_index.shape
(2, 421)
>>> g.edge_features.shape
(421, 16)
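
The edge rule can be sketched with NumPy alone. This is an illustration under stated assumptions, not the library's code: `radius_edges` is a hypothetical helper, it takes a bare `(n_res, 3)` array of CA coordinates rather than a `Protein`, and it treats "within `cutoff`" as `<=`.

```python
import numpy as np

def radius_edges(ca: np.ndarray, cutoff: float = 10.0, self_loops: bool = False):
    """Return a (2, n_edges) int64 edge index and (n_edges,) float32 distances."""
    diff = ca[:, None, :] - ca[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)      # (n_res, n_res) CA-CA distances
    mask = dist <= cutoff                     # assumption: cutoff is inclusive
    if not self_loops:
        np.fill_diagonal(mask, False)         # remove i->i edges
    src, dst = np.nonzero(mask)               # both (i, j) and (j, i) are kept
    edge_index = np.stack([src, dst]).astype(np.int64)
    return edge_index, dist[src, dst].astype(np.float32)

# Three residues spaced 3.8 Å apart on a line, plus one far away.
ca = np.array([[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [7.6, 0.0, 0.0], [30.0, 0.0, 0.0]])
edge_index, edge_dist = radius_edges(ca, cutoff=5.0)
```

Here only neighbouring residues are connected, and each contact appears once per direction, following the `(source, target)` row layout described for `ProteinGraph.edge_index`.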

blosum_embed ¶

blosum_embed(
    sequence: str, *, matrix: str = "BLOSUM62"
) -> NDArray[np.float32]

Use rows of a substitution matrix as per-residue embeddings.

This is a classic trick from the pre-deep-learning era: every residue gets a 20-dim embedding that captures its substitution profile against the standard amino acids. Surprisingly strong as a baseline for many downstream tasks.

Parameters:

Name	Type	Description	Default
`sequence`	`str`	one-letter amino-acid sequence.	required
`matrix`	`str`	substitution matrix name (`"BLOSUM62"` or `"PAM250"`).	`'BLOSUM62'`

Returns:

Type	Description
`NDArray[float32]`	`(L, 20)` float32 array. Non-standard residues get the `"X"` row (which exists in both BLOSUM62 and PAM250).

compose_features ¶

compose_features(
    *features: NDArray[float32],
) -> NDArray[np.float32]

Concatenate multiple per-residue feature arrays along the last axis.

All inputs must have the same leading dimension (sequence length).

Parameters:

Name	Type	Description	Default
`*features`	`NDArray[float32]`	any number of `(L, D_i)` arrays.	`()`

Returns:

Type	Description
`NDArray[float32]`	`(L, sum(D_i))` float32 array.

Raises:

Type	Description
`ValueError`	if shapes are incompatible.
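
A minimal sketch of the concatenation and its shape check, assuming 2-D `(L, D_i)` inputs. `concat_features` is a hypothetical stand-in; in particular, raising on an empty argument list is this sketch's choice and may not match the library.

```python
import numpy as np

def concat_features(*features: np.ndarray) -> np.ndarray:
    """Concatenate (L, D_i) arrays along the last axis into (L, sum(D_i))."""
    if not features:
        raise ValueError("need at least one feature array")  # sketch's choice
    if any(f.ndim != 2 for f in features):
        raise ValueError("every feature array must be 2-D: (L, D_i)")
    lengths = {f.shape[0] for f in features}
    if len(lengths) != 1:
        raise ValueError(f"mismatched sequence lengths: {sorted(lengths)}")
    return np.concatenate(features, axis=-1).astype(np.float32)

a = np.ones((5, 21), dtype=np.float32)   # e.g. a one-hot block
b = np.zeros((5, 64), dtype=np.float32)  # e.g. a positional-encoding block
combined = concat_features(a, b)
```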

one_hot ¶

one_hot(
    sequence: str, *, include_unk: bool = True
) -> NDArray[np.float32]

One-hot encode a protein sequence.

Parameters:

Name	Type	Description	Default
`sequence`	`str`	one-letter amino-acid sequence.	required
`include_unk`	`bool`	if True (default), use a 21-dimensional encoding with column 20 set to 1 for any non-standard residue. If False, use 20 dims and silently drop the bit for non-standard residues (their rows will be all zero).	`True`

Returns:

Type	Description
`NDArray[float32]`	`(L, 21)` or `(L, 20)` float32 array.
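
The effect of `include_unk` can be seen in a small standalone sketch. The column order in `AMINO_ACIDS` is an assumption made for illustration (the library's ordering is not stated here), and `one_hot_sketch` is a hypothetical name.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed column order; the library's may differ

def one_hot_sketch(sequence: str, include_unk: bool = True) -> np.ndarray:
    """One-hot encode a sequence as (L, 21) with an unknown column, or (L, 20) without."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    dim = 21 if include_unk else 20
    out = np.zeros((len(sequence), dim), dtype=np.float32)
    for pos, aa in enumerate(sequence):
        col = index.get(aa)
        if col is not None:
            out[pos, col] = 1.0
        elif include_unk:
            out[pos, 20] = 1.0  # non-standard residue goes to column 20
        # with include_unk=False the row is left all zero
    return out

enc = one_hot_sketch("MKXA")
enc_no_unk = one_hot_sketch("MKXA", include_unk=False)
```

With `include_unk=False`, the row for `X` carries no signal at all, so downstream code cannot distinguish it from padding unless it tracks sequence length separately.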

positional_encoding ¶

positional_encoding(
    length: int, dim: int = 64, *, base: float = 10000.0
) -> NDArray[np.float32]

Sinusoidal absolute positional encoding (Vaswani et al. 2017).

Identical formulation to the original Transformer paper:

    PE[pos, 2i]   = sin(pos / base^(2i/dim))
    PE[pos, 2i+1] = cos(pos / base^(2i/dim))

Parameters:

Name	Type	Description	Default
`length`	`int`	sequence length.	required
`dim`	`int`	embedding dimensionality (must be even).	`64`
`base`	`float`	wavelength base. The Vaswani default is 10000.	`10000.0`

Returns:

Type	Description
`NDArray[float32]`	`(length, dim)` float32 array.
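
The formula above translates almost line for line into NumPy. The following is a sketch of that formulation, with `sinusoidal_encoding` as a hypothetical name rather than the library function itself.

```python
import numpy as np

def sinusoidal_encoding(length: int, dim: int = 64, base: float = 10000.0) -> np.ndarray:
    """Return a (length, dim) float32 array of sinusoidal positional encodings."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    pos = np.arange(length, dtype=np.float64)[:, None]       # (length, 1)
    two_i = np.arange(0, dim, 2, dtype=np.float64)[None, :]  # (1, dim / 2), holds 2i
    angles = pos / base ** (two_i / dim)                     # pos / base^(2i/dim)
    pe = np.zeros((length, dim), dtype=np.float32)
    pe[:, 0::2] = np.sin(angles)  # even columns: sine
    pe[:, 1::2] = np.cos(angles)  # odd columns: cosine
    return pe

pe = sinusoidal_encoding(20, dim=64)
```

Position 0 encodes as alternating 0 and 1, since `sin(0) = 0` and `cos(0) = 1` for every frequency.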

local_environment ¶

local_environment(
    protein: Protein, *, radius: float = 10.0
) -> NDArray[np.float32]

Per-residue local atomic environment counts.

For each protein residue, count the atoms of each chemical element within radius Å of that residue's CA. This is a simple but effective featurization that captures local packing.

Parameters:

Name	Type	Description	Default
`protein`	`Protein`	input structure.	required
`radius`	`float`	cutoff radius in Å (default 10).	`10.0`

Returns:

Type	Description
`NDArray[float32]`	`(n_res, 5)` float32 array. Columns are counts of C, N, O, S, and "other" elements respectively.
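
The counting scheme can be sketched on raw coordinate arrays. This is an illustration only: `element_counts` is a hypothetical helper that takes CA coordinates, atom coordinates, and element symbols as separate inputs, and it assumes every atom within the radius is counted, including the residue's own atoms.

```python
import numpy as np

ELEMENT_COLUMNS = {"C": 0, "N": 1, "O": 2, "S": 3}  # anything else goes to column 4

def element_counts(ca, atom_xyz, atom_elements, radius: float = 10.0) -> np.ndarray:
    """Count atoms per element within `radius` of each CA; returns (n_res, 5) float32."""
    cols = np.array([ELEMENT_COLUMNS.get(e, 4) for e in atom_elements])
    # (n_res, n_atoms) distances from every CA to every atom
    dist = np.linalg.norm(ca[:, None, :] - atom_xyz[None, :, :], axis=-1)
    within = dist <= radius
    counts = np.zeros((len(ca), 5), dtype=np.float32)
    for col in range(5):
        counts[:, col] = within[:, cols == col].sum(axis=1)
    return counts

ca = np.array([[0.0, 0.0, 0.0], [20.0, 0.0, 0.0]])
atom_xyz = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0],
                     [19.0, 0.0, 0.0], [21.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
atom_elements = ["C", "N", "O", "S", "FE", "C"]
counts = element_counts(ca, atom_xyz, atom_elements, radius=4.0)
```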

pair_distance_features ¶

pair_distance_features(
    protein: Protein,
    *,
    n_bins: int = 16,
    d_min: float = 2.0,
    d_max: float = 22.0,
    atom_choice: Literal["ca", "cb", "heavy", "all"] = "ca",
) -> NDArray[np.float32]

Gaussian-radial-basis-function (RBF) encoding of pair distances.

This is the standard featurization used in modern protein GNNs (Equivariant GNNs, GearNet, etc.). Each pair distance is expanded into n_bins Gaussian basis functions evenly spaced between d_min and d_max, with sigma chosen so the bases overlap.

Parameters:

Name	Type	Description	Default
`protein`	`Protein`	input structure.	required
`n_bins`	`int`	number of RBF basis functions. 16 is the typical default.	`16`
`d_min`	`float`	lower end of the distance range covered by the basis (Å).	`2.0`
`d_max`	`float`	upper end of the distance range covered by the basis (Å).	`22.0`
`atom_choice`	`Literal['ca', 'cb', 'heavy', 'all']`	anchor atom per residue.	`'ca'`

Returns:

Type	Description
`NDArray[float32]`	`(n_res, n_res, n_bins)` float32 array. The `[i, j, k]` entry is `exp(-(d_ij - centers[k])^2 / (2 sigma^2))`.
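
A sketch of the RBF expansion applied to a precomputed distance matrix. The choice `sigma = (d_max - d_min) / n_bins` is an assumption picked so that neighbouring bases overlap; the library's actual width is not stated here, and `rbf_expand` is a hypothetical name.

```python
import numpy as np

def rbf_expand(dist: np.ndarray, n_bins: int = 16,
               d_min: float = 2.0, d_max: float = 22.0) -> np.ndarray:
    """Expand distances (...,) into (..., n_bins) Gaussian RBF features."""
    centers = np.linspace(d_min, d_max, n_bins)  # evenly spaced basis centers
    sigma = (d_max - d_min) / n_bins             # assumed width, not confirmed
    z = (dist[..., None] - centers) / sigma
    return np.exp(-0.5 * z ** 2).astype(np.float32)  # exp(-(d - c)^2 / (2 sigma^2))

ca = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [22.0, 0.0, 0.0]])
dist = np.linalg.norm(ca[:, None, :] - ca[None, :, :], axis=-1)
features = rbf_expand(dist)
```

Each pair activates the basis functions whose centers lie near its distance, which gives a model a smooth, bounded input instead of a raw value in Å.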

pair_distances ¶

pair_distances(
    protein: Protein,
    *,
    atom_choice: Literal["ca", "cb", "heavy", "all"] = "ca",
) -> NDArray[np.float32]

Compute the residue-residue distance matrix.

Convenience wrapper over `molforge.structure.distance_map` that returns float32 for downstream tensor conversion.

Parameters:

Name	Type	Description	Default
`protein`	`Protein`	input structure.	required
`atom_choice`	`Literal['ca', 'cb', 'heavy', 'all']`	which atom defines residue position (`"ca"`, `"cb"`, `"heavy"`, `"all"`).	`'ca'`

Returns:

Type	Description
`NDArray[float32]`	`(n_res, n_res)` float32 array of distances in Å.

pair_orientations ¶

pair_orientations(
    protein: Protein,
) -> dict[str, NDArray[np.float32]]

Backbone orientation features between every pair of residues.

Computes for each residue pair (i, j):

- direction: unit vector from CA(i) to CA(j) in i's local frame.
- distance: ||CA(j) - CA(i)||.
- cosine: cosine of the angle between the CA(i)-CA(j) vector and residue i's local "forward" direction (CA(i+1) - CA(i-1)). Captures local orientation context.

These are useful as edge features in equivariant-style protein GNNs.

Returns:

Type	Description
`dict[str, NDArray[float32]]`	Dict with keys `"direction"` (`(n, n, 3)`), `"distance"` (`(n, n)`), and `"cosine"` (`(n, n)`).
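
The distance and cosine terms can be sketched from CA coordinates alone. This is an illustration under assumptions: `forward_cosines` is a hypothetical helper, terminal residues fall back to one-sided differences for the forward direction, and the local-frame rotation needed for `"direction"` is omitted because its construction is not described here.

```python
import numpy as np

def forward_cosines(ca: np.ndarray, eps: float = 1e-8):
    """Return (distance, cosine), each (n, n) float32, from (n, 3) CA coordinates."""
    ca = np.asarray(ca, dtype=np.float64)
    n = len(ca)
    # Forward direction CA(i+1) - CA(i-1); termini use one-sided differences (assumption).
    nxt = ca[np.minimum(np.arange(n) + 1, n - 1)]
    prv = ca[np.maximum(np.arange(n) - 1, 0)]
    forward = nxt - prv
    forward = forward / (np.linalg.norm(forward, axis=-1, keepdims=True) + eps)
    vec = ca[None, :, :] - ca[:, None, :]          # vec[i, j] = CA(j) - CA(i)
    distance = np.linalg.norm(vec, axis=-1)
    unit = vec / (distance[..., None] + eps)       # zero vector on the diagonal
    cosine = np.einsum("ijk,ik->ij", unit, forward)
    return distance.astype(np.float32), cosine.astype(np.float32)

# A straight chain along x: every residue "faces" the +x direction.
ca = np.array([[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [7.6, 0.0, 0.0]])
distance, cosine = forward_cosines(ca)
```

Unlike the distance matrix, the cosine matrix is not symmetric: residue `j` can lie ahead of `i` while `i` lies behind `j`.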

per_residue_features ¶

per_residue_features(
    protein: Protein,
    *,
    include_sequence: bool = True,
    include_environment: bool = True,
    include_dssp: bool = True,
) -> NDArray[np.float32]

Combined per-residue feature vectors suitable as GNN node features.

Stacks (along the feature dimension):

- One-hot residue identity (21 dims, if `include_sequence`).
- Local environment element counts (5 dims, if `include_environment`).
- DSSP 3-state one-hot (3 dims, if `include_dssp`).

Parameters:

Name	Type	Description	Default
`protein`	`Protein`	input structure.	required
`include_sequence`	`bool`	include the one-hot residue identity block (21 dims).	`True`
`include_environment`	`bool`	include the local-environment block (5 dims).	`True`
`include_dssp`	`bool`	include the DSSP 3-state one-hot block (3 dims).	`True`

Returns:

Type	Description
`NDArray[float32]`	`(n_res, D)` float32 array, where `D` depends on which blocks are included.
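
The value of `D` follows directly from the block sizes listed above. A small sketch of that arithmetic, with `node_feature_dim` as a hypothetical helper that is not part of the library:

```python
def node_feature_dim(include_sequence: bool = True,
                     include_environment: bool = True,
                     include_dssp: bool = True) -> int:
    """Feature width D: 21 (one-hot) + 5 (environment) + 3 (DSSP), per enabled block."""
    return 21 * include_sequence + 5 * include_environment + 3 * include_dssp

d_all = node_feature_dim()                        # all three blocks enabled
d_no_dssp = node_feature_dim(include_dssp=False)  # one-hot + environment only
```

With all blocks enabled the width is 29, which matches the `(76, 29)` node-feature shape in the `to_graph` example.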
