molforge.ensembles¶

ensembles ¶

Ensemble analysis: weighted statistics over collections of structures.

An ensemble is what you have when one input gives you many outputs:

A docking run returns many plausible poses.
An MD trajectory returns many time-snapshots of one system.
A generative model (RFdiffusion, ProteinMPNN sampling) returns many candidates.

The questions you typically ask of an ensemble are the same regardless of which kind it is:

How should I weight the members? Boltzmann-weight by score, uniform-weight, or use external weights.
How diverse is the ensemble? Pairwise-distance statistics tell you whether the sampling actually explored conformational space.
Which members cluster together? Reduces N noisy samples to k representative groups.
What does the population look like in space? Density maps, contact frequencies.
Can I draw a single representative member? Medoid pick or weighted average.

This module focuses on pose ensembles in v1 (i.e. docking output — the same ligand in N different binding-mode candidates against one receptor) because that's the immediately useful case. The design deliberately admits MD trajectories and structural ensembles as future extensions: boltzmann_weights takes raw scores so it works for any kind of score, and the clustering / density functions take generic coordinate arrays alongside the convenience overloads that take Pose objects.

What's here:

Weighting: - :func:boltzmann_weights — softmax weights from scores. - :func:resample — weighted bootstrap.

Geometry: - :func:pairwise_rmsd — N×N RMSD matrix for ligand heavy atoms. - :func:pose_diversity — summary statistics over pairwise RMSDs.

Clustering: - :func:pose_clusters — hierarchical RMSD clustering with medoids.

Spatial: - :func:binding_site_density — 3D histogram of ligand atom positions, optionally Boltzmann-weighted.

Consensus: - :func:consensus_pose — pick a representative (medoid or weighted-mean) pose.

Limitations of v1:

Pose RMSD is order-sensitive. Atom orderings are assumed to be consistent across poses (which is true for poses returned from a single docking run by Vina, DiffDock, etc.). For symmetric ligands the resulting RMSD is an upper bound on the true symmetry-aware RMSD; for the common case of asymmetric drug-like ligands this is fine. A symmetry-aware RMSD using RDKit graph matching would be a natural future addition.
Receptor is fixed. Ensemble functions treat the receptor as a single conformation; poses differ only in ligand placement. Mixed-receptor ensembles (e.g. one ligand across an MD ensemble of receptor frames) need a separate, slightly different surface.

pose_clusters ¶

pose_clusters(
    poses: Sequence[Pose],
    *,
    cutoff: float = 2.0,
    heavy_atoms_only: bool = True,
) -> PoseClusteringResult

Hierarchical (average-linkage) clustering of poses by RMSD.

Two poses end up in the same cluster if their average linkage to each other (= mean pairwise RMSD across the current cluster membership during agglomeration) stays below cutoff.

Parameters:

Name	Type	Description	Default
`poses`	`Sequence[Pose]`	Sequence of poses to cluster. Must share an atom ordering — see :func:`pairwise_rmsd`.	required
`cutoff`	`float`	RMSD threshold in Å (default 2.0, a common docking- community value for "same binding mode"). Larger ⇒ fewer, looser clusters; smaller ⇒ more, tighter clusters.	`2.0`
`heavy_atoms_only`	`bool`	see :func:`pairwise_rmsd`.	`True`

Returns:

Name	Type	Description
`A`	`PoseClusteringResult`	class:`PoseClusteringResult` with cluster labels, the
	`PoseClusteringResult`	class:`PoseCluster` objects (sorted largest-first), and the
	`PoseClusteringResult`	RMSD matrix used internally.

Raises:

Type	Description
`ValueError`	If `cutoff <= 0`, `poses` is empty, or poses don't share an atom ordering.

Example

from molforge.ensembles import pose_clusters result = pose_clusters(docking.poses, cutoff=2.0) print(f"{result.n_clusters} distinct binding modes")

Get the medoid of the biggest cluster:¶

biggest = result.clusters[0] representative = docking.poses[biggest.medoid]

consensus_pose ¶

consensus_pose(
    poses: Sequence[Pose],
    *,
    weights: NDArray[floating] | None = None,
    method: str = "medoid",
    heavy_atoms_only: bool = True,
) -> Pose

Pick or build a single pose representing the ensemble.

Parameters:

Name	Type	Description	Default
`poses`	`Sequence[Pose]`	Sequence of poses, all with the same atom ordering.	required
`weights`	`NDArray[floating] \| None`	`(n_poses,)` array of weights. Default `None` = uniform. Typically from :func:`boltzmann_weights`. For `method="medoid"`, weights bias which pose is chosen as representative (a high-weight outlier won't be picked unless it's also geometrically central). For `method="mean"`, weights are used directly in the weighted average of coordinates.	`None`
`method`	`str`	One of: `"medoid"` (default): pick an actual pose from the ensemble that minimizes its summed weighted RMSD to the rest. Returns the original Pose object unchanged. `"mean"`: synthesize a new pose with coordinates that are the weighted average across the ensemble. The receptor and metadata are copied from the first pose; the score is set to the weighted average of input scores. Bond geometry is not guaranteed valid.	`'medoid'`
`heavy_atoms_only`	`bool`	see :func:`pairwise_rmsd`. Only affects the medoid choice (mean-averaging always uses all atoms).	`True`

Returns:

Type	Description
`Pose`	A single :class:`Pose`. For `"medoid"` this is one of the
`Pose`	input poses, returned by reference. For `"mean"` it's a
`Pose`	new deep-copied :class:`Pose` with averaged coordinates.

Raises:

Type	Description
`ValueError`	If `poses` is empty, `method` is unrecognized, or `weights` has the wrong length / doesn't sum to ~1.

Example

from molforge.ensembles import boltzmann_weights, consensus_pose w = boltzmann_weights(docking.poses)

Pick the most "central" high-affinity pose.¶

representative = consensus_pose(docking.poses, weights=w)

binding_site_density ¶

binding_site_density(
    poses: Sequence[Pose],
    *,
    spacing: float = 1.0,
    padding: float = 4.0,
    weights: NDArray[floating] | None = None,
    heavy_atoms_only: bool = True,
    origin: NDArray[floating] | None = None,
    shape: tuple[int, int, int] | None = None,
) -> DensityGrid

Compute a 3D spatial density of ligand atom positions across the ensemble.

Each ligand atom in each pose contributes weights[i] (default uniform = 1/n_poses) to the grid cell containing it. Atoms landing outside the grid are silently discarded; the returned total_weight tells you how much weight made it in.

By default the grid is auto-sized: it's the axis-aligned bounding box of all heavy ligand atoms across all poses, padded by padding Å on each side, with the requested spacing. For differential / comparative analysis (where you want two ensembles on the same grid), pass explicit origin and shape.

Parameters:

Name	Type	Description	Default
`poses`	`Sequence[Pose]`	Sequence of poses. Ligand atoms across all poses determine the bounding box by default.	required
`spacing`	`float`	Cubic grid spacing in Å. Default `1.0` (resolves to something a bit finer than typical ligand atom-atom distances but coarser than a true density map).	`1.0`
`padding`	`float`	Bounding-box padding in Å on each side, applied only when `origin` is not given. Default `4.0`.	`4.0`
`weights`	`NDArray[floating] \| None`	`(n_poses,)` array of weights. Default `None` = uniform `1/n_poses` (so total weight ≤ 1.0). Typically comes from :func:`boltzmann_weights`.	`None`
`heavy_atoms_only`	`bool`	see :func:`pairwise_rmsd`.	`True`
`origin`	`NDArray[floating] \| None`	`(3,)` Cartesian coordinate of the grid's `(0, 0, 0)` corner. If provided, `shape` must also be provided, and the auto-bounding-box logic is bypassed.	`None`
`shape`	`tuple[int, int, int] \| None`	`(nx, ny, nz)` explicit grid shape; required if and only if `origin` is provided.	`None`

Returns:

Name	Type	Description
`A`	`DensityGrid`	class:`DensityGrid`.

Raises:

Type	Description
`ValueError`	If `spacing <= 0`, `poses` is empty, `weights` has the wrong length, or only one of `origin` / `shape` is provided.

Example

from molforge.ensembles import boltzmann_weights, binding_site_density weights = boltzmann_weights(docking.poses) grid = binding_site_density(docking.poses, spacing=0.5, weights=weights) hot_spot = np.unravel_index(grid.density.argmax(), grid.shape) grid.coordinate_of(hot_spot) # the most-occupied point in Å array([12.3, 45.6, 78.9], dtype=float32)

pairwise_rmsd ¶

pairwise_rmsd(
    poses: Sequence[Pose], *, heavy_atoms_only: bool = True
) -> NDArray[np.float32]

Compute the N×N pairwise heavy-atom RMSD matrix over a pose ensemble.

Parameters:

Name	Type	Description	Default
`poses`	`Sequence[Pose]`	Sequence of :class:`molforge.docking.Pose` objects, all of which must have the same number of atoms in the same order. (Poses returned by a single docking run normally do.)	required
`heavy_atoms_only`	`bool`	If `True` (default), hydrogens are stripped before computing distances. Most docking engines run on heavy atoms anyway, so for Vina output this is a no-op; for poses from other sources it's a sensible default.	`True`

Returns:

Type	Description
`NDArray[float32]`	A symmetric `(n, n)` float32 matrix with zero diagonal. Entry
`NDArray[float32]`	`[i, j]` is the RMSD between pose `i` and pose `j` in
`NDArray[float32]`	the input units (Å for biology).

Raises:

Type	Description
`ValueError`	If poses don't share an atom ordering (different heavy-atom counts) or if any pose has no atoms.

Example

from molforge.ensembles import pairwise_rmsd rmsd_matrix = pairwise_rmsd(result.poses) rmsd_matrix[0, 1] # RMSD between poses 0 and 1 2.34

pose_diversity ¶

pose_diversity(
    poses: Sequence[Pose], *, heavy_atoms_only: bool = True
) -> dict[str, float]

Summary statistics over the pairwise RMSD distribution of an ensemble.

Use as a quick diagnostic for "did the docking actually explore?" A run where every pose is within 0.5 Å of every other pose probably converged on one binding mode; a run with mean pairwise RMSD > 3 Å found multiple modes.

Parameters:

Name	Type	Description	Default
`poses`	`Sequence[Pose]`	Sequence of poses, as for :func:`pairwise_rmsd`.	required
`heavy_atoms_only`	`bool`	see :func:`pairwise_rmsd`.	`True`

Returns:

Type	Description
`dict[str, float]`	A dict with keys:
`dict[str, float]`	`"min"`: minimum off-diagonal pairwise RMSD.
`dict[str, float]`	`"max"`: maximum pairwise RMSD.
`dict[str, float]`	`"mean"`: mean of the upper triangle (excluding diagonal).
`dict[str, float]`	`"median"`: median of the upper triangle.
`dict[str, float]`	`"std"`: standard deviation of the upper triangle.
`dict[str, float]`	`"n_poses"`: number of poses in the ensemble.
`dict[str, float]`	For an ensemble of size 1, all distance statistics are `0.0`.

Raises:

Type	Description
`ValueError`	If poses don't share an atom ordering or any pose has no atoms.

boltzmann_weights ¶

boltzmann_weights(
    scores: Sequence[float]
    | NDArray[floating]
    | Sequence[Pose],
    *,
    temperature: float = KT_298K_KCAL_PER_MOL,
    lower_is_better: bool = True,
) -> NDArray[np.float64]

Compute Boltzmann weights from a vector of scores.

The returned weights satisfy sum(w) == 1.0 and are computed in a numerically stable way (subtracting the min/max score before exponentiating).

Parameters:

Name	Type	Description	Default
`scores`	`Sequence[float] \| NDArray[floating] \| Sequence[Pose]`	Either a sequence of scalar scores, or a sequence of :class:`molforge.docking.Pose` objects (in which case the `score` attribute is used). NumPy arrays work directly.	required
`temperature`	`float`	The thermal energy `kT` in the same units as `scores`. Defaults to `kT` at 298 K in kcal/mol (`0.593`), appropriate for Vina-style docking scores. Larger values → softer (more uniform) weights; smaller values → winner-takes-all. Must be strictly positive.	`KT_298K_KCAL_PER_MOL`
`lower_is_better`	`bool`	If `True` (default), scores are treated as energies — lower (more negative) means better, gets higher weight. If `False`, larger means better (suits ML confidence scores). The flag flips the sign internally.	`True`

Returns:

Type	Description
`NDArray[float64]`	A `(n,)` float64 array of weights summing to 1.

Raises:

Type	Description
`ValueError`	If `temperature` is non-positive, `scores` is empty, or any score is non-finite.

Example

from molforge.ensembles import boltzmann_weights

Three docking scores in kcal/mol (Vina convention).¶

weights = boltzmann_weights([-9.5, -8.2, -7.1]) weights.sum() 1.0 weights[0] > weights[2] # best score gets highest weight True

resample ¶

resample(
    poses: Sequence[Pose],
    n_samples: int,
    *,
    weights: NDArray[floating] | None = None,
    rng: Generator | None = None,
    replace: bool = True,
) -> list[Pose]

Draw n_samples poses with replacement, optionally weighted.

Useful for downstream uncertainty estimation (bootstrap any pose- derived metric over the resampled population) or for converting a Boltzmann-weighted ensemble into an unweighted population suitable for tools that don't accept weights.

Parameters:

Name	Type	Description	Default
`poses`	`Sequence[Pose]`	The source ensemble. Order is preserved across calls with the same `rng` seed.	required
`n_samples`	`int`	How many draws to return. Must be ≥ 1.	required
`weights`	`NDArray[floating] \| None`	A `(len(poses),)` array of weights summing to 1 (e.g. from :func:`boltzmann_weights`). If `None`, uniform weights are used.	`None`
`rng`	`Generator \| None`	A NumPy `Generator`. If `None`, a fresh one is created with default seeding (non-reproducible). Pass an explicit `np.random.default_rng(seed)` for reproducibility.	`None`
`replace`	`bool`	Whether to sample with replacement. Default `True` (which is what Boltzmann resampling means). Setting `False` requires `n_samples <= len(poses)`.	`True`

Returns:

Type	Description
`list[Pose]`	A list of `n_samples` :class:`Pose` objects drawn from
`list[Pose]`	`poses` according to `weights`. Poses may be repeated when
`list[Pose]`	sampling with replacement; objects are not copied (each
`list[Pose]`	returned reference points at the original pose).

Raises:

Type	Description
`ValueError`	If `poses` is empty, `n_samples < 1`, or `weights` has the wrong length / doesn't sum to ~1.