molforge.ml¶
ml ¶
ML utilities: featurizers, embeddings, and graph representations.
Turn a :class:molforge.core.Protein into ML-ready tensors. Three featurizer layers, plus graph construction:
Sequence featurizers (no structure needed):
- :func:one_hot — 21-dim one-hot encoding.
- :func:blosum_embed — BLOSUM62 / PAM250 rows as embeddings.
- :func:positional_encoding — sinusoidal (Vaswani-style) positional encoding.
- :func:compose_features — concatenate featurizers.
Structure featurizers (need 3D coordinates):
- :func:pair_distances — float32 distance map.
- :func:pair_distance_features — Gaussian-RBF binned distances
(the standard featurization for distance-based GNNs).
- :func:pair_orientations — backbone orientation features
(CA-CA vectors, distances, and local-frame cosines).
- :func:local_environment — per-residue atomic environment counts
by element.
- :func:per_residue_features — combined node-feature vector
(one-hot + environment + DSSP).
Protein language model embeddings (heavy, lazy):
- :class:ESM2Embedder — wraps ESM-2 via HuggingFace transformers.
Graph construction:
- :func:to_graph — build a :class:ProteinGraph ready for
PyTorch Geometric / DGL.
The featurizers return plain NumPy float32 arrays. Convert to your
preferred tensor library downstream (torch.from_numpy(arr) etc.).
EmbeddingNotInstalledError ¶
Bases: ImportError
Raised when the heavy embedding dependencies aren't installed.
ESM2Embedder ¶
ESM2Embedder(
*,
model_name: str = "facebook/esm2_t33_650M_UR50D",
device: str | None = None,
layer: int = -1,
dtype: str = "float32",
)
Per-residue and per-sequence embeddings via ESM-2.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_name | str | HuggingFace model identifier. | 'facebook/esm2_t33_650M_UR50D' |
| device | str \| None | | None |
| layer | int | which transformer layer to extract embeddings from. Defaults to the model's last layer. Final-layer embeddings are most discriminative for most downstream tasks; mid-layer embeddings often work better for structure prediction. | -1 |
| dtype | str | | 'float32' |
Example:

    >>> from molforge.ml import ESM2Embedder
    >>> embedder = ESM2Embedder(device="cuda")
    >>> emb = embedder.embed("MKTVRQERLKSIVRILERSK")
    >>> emb.shape
    (20, 1280)
embed ¶
Per-residue embeddings for a single sequence.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sequence | str | one-letter amino-acid sequence. | required |
Returns:
| Type | Description |
|---|---|
| NDArray[float32] | Array of shape (L, D), where L is the sequence length and D is the embedding dimension (e.g. 1280 for the 650M model). The CLS / EOS tokens are stripped before returning so the leading axis aligns with residue index. |
embed_many ¶
Per-residue embeddings for a batch of sequences.
Sequences of different lengths can't be stacked into a single
tensor without padding, so this method returns a list of (L_i, D) arrays.
For length-padded batched inference, call the underlying
self._model directly on inputs tokenized by self._tokenizer with
padding=True.
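When one padded array is more convenient than a list, the ragged (L_i, D) outputs can also be padded after the fact. The following is a minimal NumPy sketch, assuming the input is a list shaped like the return value of embed_many; `pad_embeddings` is a hypothetical helper, not part of molforge.ml, and the data is synthetic.

```python
import numpy as np

def pad_embeddings(embeddings):
    """Zero-pad a list of (L_i, D) arrays into (B, L_max, D) plus a boolean mask."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    dim = embeddings[0].shape[1]
    max_len = max(e.shape[0] for e in embeddings)
    batch = np.zeros((len(embeddings), max_len, dim), dtype=np.float32)
    mask = np.zeros((len(embeddings), max_len), dtype=bool)
    for i, e in enumerate(embeddings):
        batch[i, : e.shape[0]] = e
        mask[i, : e.shape[0]] = True  # True marks real residues, False marks padding
    return batch, mask

# Stand-in data: two "proteins" of length 3 and 5 with D = 4.
fake = [np.ones((3, 4), dtype=np.float32), np.ones((5, 4), dtype=np.float32)]
batch, mask = pad_embeddings(fake)
```

The mask lets downstream code ignore padded positions, for example when averaging over residues.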
embed_pooled ¶
Per-sequence embedding (a single fixed-size vector per protein).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sequence | str | one-letter amino-acid sequence. | required |
| pooling | str | | 'mean' |
Returns:
| Type | Description |
|---|---|
| NDArray[float32] | |
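With the default pooling='mean', the pooled vector is presumably the average of the per-residue embeddings over the residue axis. The sketch below shows that reduction on a stand-in (L, D) array; it is an assumption about what 'mean' denotes, and the values are random numbers rather than ESM-2 output.

```python
import numpy as np

# Stand-in for a per-residue embedding of shape (L, D) = (20, 1280).
rng = np.random.default_rng(0)
per_residue = rng.standard_normal((20, 1280)).astype(np.float32)

# Mean pooling collapses the residue axis, leaving one D-dimensional vector.
pooled = per_residue.mean(axis=0)
```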
ProteinGraph dataclass ¶
ProteinGraph(
node_features: NDArray[float32],
edge_index: NDArray[int64],
edge_features: NDArray[float32],
residue_labels: list[tuple[str, int]],
)
A residue-graph representation of a protein structure.
Attributes:
| Name | Type | Description |
|---|---|---|
| node_features | NDArray[float32] | |
| edge_index | NDArray[int64] | |
| edge_features | NDArray[float32] | |
| residue_labels | list[tuple[str, int]] | |
to_graph ¶
to_graph(
protein: Protein,
*,
cutoff: float = 10.0,
self_loops: bool = False,
include_dssp: bool = True,
include_environment: bool = True,
edge_distance_bins: int = 16,
) -> ProteinGraph
Build a residue-graph representation of protein.
Nodes: each protein residue with a CA atom.
Edges: undirected (both directions present) connections between
residues whose CA atoms are within cutoff Å of each other.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| protein | Protein | input structure. | required |
| cutoff | float | edge cutoff distance in Å. 10 Å is the typical default for protein GNNs (captures both direct and ~1-residue-away contacts). | 10.0 |
| self_loops | bool | include i->i edges. Some GNN architectures want these; most don't. | False |
| include_dssp | bool | passed to :func: | True |
| include_environment | bool | passed to :func: | True |
| edge_distance_bins | int | number of RBF basis functions for the edge-distance feature; pass 0 to use raw distance. | 16 |
Returns:
| Type | Description |
|---|---|
| ProteinGraph | A :class:ProteinGraph. |
Example:

    >>> from molforge.ml import to_graph
    >>> g = to_graph(my_protein, cutoff=8.0)
    >>> g.node_features.shape
    (76, 29)
    >>> g.edge_index.shape
    (2, 421)
    >>> g.edge_features.shape
    (421, 16)
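The edge rule described above (CA atoms within cutoff Å, both directions present, optional self-loops) can be sketched in plain NumPy. This is an illustration of the idea, not molforge's implementation; `radius_edges` is a hypothetical helper, and treating "within" as `<=` is an assumption.

```python
import numpy as np

def radius_edges(ca_coords, cutoff=10.0, self_loops=False):
    """Return a (2, E) int64 edge index and the matching (E,) CA-CA distances."""
    coords = np.asarray(ca_coords, dtype=np.float32)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)        # (N, N) CA-CA distances
    adjacency = dist <= cutoff
    if not self_loops:
        np.fill_diagonal(adjacency, False)      # drop i -> i edges
    src, dst = np.nonzero(adjacency)            # symmetric, so both directions appear
    return np.stack([src, dst]).astype(np.int64), dist[src, dst]

# Three residues on a line, 6 Å apart: 0-1 and 1-2 are in range, 0-2 (12 Å) is not.
coords = [[0.0, 0.0, 0.0], [6.0, 0.0, 0.0], [12.0, 0.0, 0.0]]
edge_index, edge_dist = radius_edges(coords, cutoff=10.0)
```

The (2, E) layout matches the edge_index convention used by PyTorch Geometric, and the per-edge distances are what an RBF expansion would consume to produce edge features.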
blosum_embed ¶
Use rows of a substitution matrix as per-residue embeddings.
This is a classic trick from the pre-deep-learning era: every residue gets a 20-dim embedding that captures its substitution profile against the standard amino acids. Surprisingly strong as a baseline for many downstream tasks.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sequence | str | one-letter amino-acid sequence. | required |
| matrix | str | substitution matrix name. | 'BLOSUM62' |
Returns:
| Type | Description |
|---|---|
| NDArray[float32] | |
compose_features ¶
Concatenate multiple per-residue feature arrays along the last axis.
All inputs must have the same leading dimension (sequence length).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *features | NDArray[float32] | any number of per-residue feature arrays. | () |
Returns:
| Type | Description |
|---|---|
| NDArray[float32] | |
Raises:
| Type | Description |
|---|---|
| ValueError | if shapes are incompatible. |
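The concatenation and its shape check can be sketched as follows. `concat_features` is a hypothetical stand-in written for illustration, so its error messages and its handling of an empty argument list are assumptions rather than molforge behavior.

```python
import numpy as np

def concat_features(*features):
    """Concatenate (L, D_i) arrays along the last axis after checking that L matches."""
    if not features:
        raise ValueError("need at least one feature array")
    lengths = {f.shape[0] for f in features}
    if len(lengths) != 1:
        raise ValueError(f"mismatched sequence lengths: {sorted(lengths)}")
    return np.concatenate(features, axis=-1).astype(np.float32)

a = np.zeros((10, 21), dtype=np.float32)   # e.g. a one-hot block
b = np.zeros((10, 64), dtype=np.float32)   # e.g. a positional-encoding block
combined = concat_features(a, b)
```

The feature widths add up (21 + 64 = 85 here) while the leading sequence-length axis is preserved.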
one_hot ¶
One-hot encode a protein sequence.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sequence | str | one-letter amino-acid sequence. | required |
| include_unk | bool | if True (default), use a 21-dimensional encoding with column 20 set to 1 for any non-standard residue. If False, use 20 dims and silently drop the bit for non-standard residues (their rows will be all zero). | True |
Returns:
| Type | Description |
|---|---|
| NDArray[float32] | |
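The include_unk behavior is easiest to see on a short sequence containing a non-standard letter. A minimal sketch, assuming the 20 standard residues occupy columns 0 to 19 in alphabetical one-letter order (the actual column order used by one_hot is not stated here); `one_hot_sketch` is a hypothetical name.

```python
import numpy as np

# Assumed column order: the 20 standard amino acids, alphabetical by one-letter code.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}

def one_hot_sketch(sequence, include_unk=True):
    """(L, 21) with an unknown column, or (L, 20) with all-zero rows for unknowns."""
    width = 21 if include_unk else 20
    out = np.zeros((len(sequence), width), dtype=np.float32)
    for pos, aa in enumerate(sequence.upper()):
        col = INDEX.get(aa)
        if col is not None:
            out[pos, col] = 1.0
        elif include_unk:
            out[pos, 20] = 1.0   # non-standard residue goes to column 20
    return out

with_unk = one_hot_sketch("ACX")                        # X is flagged in column 20
without_unk = one_hot_sketch("ACX", include_unk=False)  # X becomes an all-zero row
```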
positional_encoding ¶
Sinusoidal absolute positional encoding (Vaswani et al. 2017).
Identical formulation to the original Transformer paper:
PE[pos, 2i] = sin(pos / base^(2i/dim))
PE[pos, 2i+1] = cos(pos / base^(2i/dim))
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| length | int | sequence length. | required |
| dim | int | embedding dimensionality (must be even). | 64 |
| base | float | wavelength base. The Vaswani default is 10000. | 10000.0 |
Returns:
| Type | Description |
|---|---|
| NDArray[float32] | |
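The two formulas above translate directly into NumPy. The sketch below is an independent implementation for illustration; `sinusoidal_encoding` is a hypothetical name, and whether molforge raises on an odd dim in the same way is an assumption.

```python
import numpy as np

def sinusoidal_encoding(length, dim=64, base=10000.0):
    """Return a (length, dim) float32 array of interleaved sin/cos positional features."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    pos = np.arange(length, dtype=np.float64)[:, None]        # (length, 1)
    two_i = np.arange(0, dim, 2, dtype=np.float64)[None, :]   # the 2i values: 0, 2, 4, ...
    angles = pos / np.power(base, two_i / dim)                # (length, dim / 2)
    pe = np.empty((length, dim), dtype=np.float32)
    pe[:, 0::2] = np.sin(angles)   # even columns: PE[pos, 2i]
    pe[:, 1::2] = np.cos(angles)   # odd columns:  PE[pos, 2i+1]
    return pe

pe = sinusoidal_encoding(length=50, dim=64)
```

Position 0 encodes as alternating 0 and 1 (sin 0 and cos 0), which makes a quick sanity check.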
local_environment ¶
Per-residue local atomic environment counts.
For each protein residue, count the atoms of each chemical element
within radius Å of that residue's CA. This is a simple but
effective featurization that captures local packing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| protein | Protein | input structure. | required |
| radius | float | cutoff radius in Å (default 10). | 10.0 |
Returns:
| Type | Description |
|---|---|
| NDArray[float32] | Per-residue counts by element; the last columns count S and "other" elements respectively. |
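The counting step can be sketched on raw coordinates. Everything specific here is an assumption made for illustration: the column layout (C, N, O, S, other), counting every atom in the structure including the residue's own, and the hypothetical helper name `environment_counts`.

```python
import numpy as np

# Hypothetical column layout; the real one may differ.
ELEMENTS = ("C", "N", "O", "S")

def environment_counts(ca_coords, atom_coords, atom_elements, radius=10.0):
    """Count atoms per element class within `radius` Å of each CA; last column is 'other'."""
    ca = np.asarray(ca_coords, dtype=np.float32)
    atoms = np.asarray(atom_coords, dtype=np.float32)
    # Map each atom to a column index; anything not in ELEMENTS goes to the last column.
    cols = np.array([ELEMENTS.index(e) if e in ELEMENTS else len(ELEMENTS)
                     for e in atom_elements])
    dist = np.linalg.norm(ca[:, None, :] - atoms[None, :, :], axis=-1)  # (L, n_atoms)
    within = dist <= radius
    counts = np.zeros((len(ca), len(ELEMENTS) + 1), dtype=np.float32)
    for col in range(counts.shape[1]):
        counts[:, col] = within[:, cols == col].sum(axis=1)
    return counts

# One CA at the origin; the oxygen at 20 Å falls outside the 10 Å sphere.
counts = environment_counts(
    ca_coords=[[0.0, 0.0, 0.0]],
    atom_coords=[[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [20.0, 0.0, 0.0], [0.0, 3.0, 0.0]],
    atom_elements=["C", "N", "O", "FE"],
)
```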
pair_distance_features ¶
pair_distance_features(
protein: Protein,
*,
n_bins: int = 16,
d_min: float = 2.0,
d_max: float = 22.0,
atom_choice: Literal["ca", "cb", "heavy", "all"] = "ca",
) -> NDArray[np.float32]
Gaussian-radial-basis-function (RBF) encoding of pair distances.
This is the standard featurization used in modern protein GNNs
(Equivariant GNNs, GearNet, etc.). Each pair distance is expanded
into n_bins Gaussian basis functions evenly spaced between
d_min and d_max, with sigma chosen so the bases overlap.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| protein | Protein | input structure. | required |
| n_bins | int | number of RBF basis functions. 16 is the typical default. | 16 |
| d_min | float | lower end of the distance range covered by the basis (Å). | 2.0 |
| d_max | float | upper end of the distance range covered by the basis (Å). | 22.0 |
| atom_choice | Literal['ca', 'cb', 'heavy', 'all'] | anchor atom per residue. | 'ca' |
Returns:
| Type | Description |
|---|---|
| NDArray[float32] | |
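The RBF expansion described above can be sketched in a few lines. The width is an assumption: the text only says sigma is chosen so that the bases overlap, so this sketch uses roughly one center spacing, (d_max - d_min) / n_bins, and the Gaussian form exp(-((d - mu) / sigma)^2). `rbf_expand` is a hypothetical name.

```python
import numpy as np

def rbf_expand(distances, n_bins=16, d_min=2.0, d_max=22.0):
    """Expand distances of shape (...) into Gaussian RBF features of shape (..., n_bins)."""
    d = np.asarray(distances, dtype=np.float32)
    centers = np.linspace(d_min, d_max, n_bins, dtype=np.float32)
    sigma = (d_max - d_min) / n_bins          # assumed width: roughly one center spacing
    z = (d[..., None] - centers) / sigma
    return np.exp(-(z ** 2)).astype(np.float32)

# A 2 x 2 distance map with one 3.8 Å pair (a typical adjacent CA-CA distance).
dist = np.array([[0.0, 3.8], [3.8, 0.0]], dtype=np.float32)
feats = rbf_expand(dist)
```

Each distance activates the basis functions whose centers are nearest to it, giving a smooth, bounded alternative to the raw distance value.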
pair_distances ¶
pair_distances(
protein: Protein,
*,
atom_choice: Literal["ca", "cb", "heavy", "all"] = "ca",
) -> NDArray[np.float32]
Compute the residue-residue distance matrix.
Convenience wrapper over :func:molforge.structure.distance_map
that returns float32 for downstream tensor conversion.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| protein | Protein | input structure. | required |
| atom_choice | Literal['ca', 'cb', 'heavy', 'all'] | which atom defines residue position. | 'ca' |
Returns:
| Type | Description |
|---|---|
| NDArray[float32] | |
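For the single-anchor-atom case, the distance matrix reduces to a broadcasted norm over coordinates. This sketch works on a plain (L, 3) coordinate array rather than a Protein, and `distance_matrix` is a hypothetical name; how molforge handles the 'heavy' and 'all' choices is not covered here.

```python
import numpy as np

def distance_matrix(coords):
    """Return the (L, L) float32 matrix of Euclidean distances between rows of `coords`."""
    xyz = np.asarray(coords, dtype=np.float32)
    diff = xyz[:, None, :] - xyz[None, :, :]   # (L, L, 3) pairwise displacement vectors
    return np.linalg.norm(diff, axis=-1)

ca = [[0.0, 0.0, 0.0], [3.0, 4.0, 0.0], [0.0, 0.0, 12.0]]
d = distance_matrix(ca)
```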
pair_orientations ¶
Backbone orientation features between every pair of residues.
Computes for each residue pair (i, j):
- direction: unit vector from CA(i) to CA(j) in i's local frame
- distance: ||CA(j) - CA(i)||
- cosine: cosine of the angle between the CA(i)-CA(j) vector
and residue i's local "forward" direction (CA(i+1) - CA(i-1)).
Captures local orientation context.
These are useful as edge features in equivariant-style protein GNNs.
Returns:
| Type | Description |
|---|---|
| dict[str, NDArray[float32]] | Dict with keys for the features listed above. |
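The cosine feature described above can be sketched on a CA trace. The handling of chain termini (reusing the terminal residue as its own missing neighbor) and the small `eps` guard against zero-length vectors are assumptions of this sketch, and `forward_cosines` is a hypothetical name.

```python
import numpy as np

def forward_cosines(ca_coords, eps=1e-8):
    """cos[i, j]: cosine between CA(i)->CA(j) and the forward axis CA(i+1) - CA(i-1)."""
    ca = np.asarray(ca_coords, dtype=np.float32)
    # Forward direction per residue; termini reuse themselves as the missing neighbor.
    nxt = np.vstack([ca[1:], ca[-1:]])
    prv = np.vstack([ca[:1], ca[:-1]])
    forward = nxt - prv
    forward /= np.linalg.norm(forward, axis=-1, keepdims=True) + eps
    pair = ca[None, :, :] - ca[:, None, :]                  # pair[i, j] = CA(j) - CA(i)
    pair_unit = pair / (np.linalg.norm(pair, axis=-1, keepdims=True) + eps)
    return np.einsum("ijk,ik->ij", pair_unit, forward).astype(np.float32)

# Four residues on a straight line along x: residues ahead give +1, residues behind give -1.
line = [[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [7.6, 0.0, 0.0], [11.4, 0.0, 0.0]]
cos = forward_cosines(line)
```

Unlike the distance, this feature is not symmetric in (i, j), which is why it carries orientation information.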
per_residue_features ¶
per_residue_features(
protein: Protein,
*,
include_sequence: bool = True,
include_environment: bool = True,
include_dssp: bool = True,
) -> NDArray[np.float32]
Combined per-residue feature vectors suitable as GNN node features.
Stacks (along the feature dimension):
- One-hot residue identity (21 dims, if include_sequence)
- Local environment element counts (5 dims, if include_environment)
- DSSP 3-state one-hot (3 dims, if include_dssp)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| protein | Protein | input structure. | required |
| include_sequence | bool | include the one-hot residue identity block (21 dims). | True |
| include_environment | bool | include the local-environment block (5 dims). | True |
| include_dssp | bool | include the DSSP 3-state one-hot block (3 dims). | True |
Returns:
| Type | Description |
|---|---|
| NDArray[float32] | Per-residue feature array; the feature dimension depends on which blocks are included. |