Skip to content

molforge.ensembles

ensembles

Ensemble analysis: weighted statistics over collections of structures.

An ensemble is what you have when one input gives you many outputs:

  • A docking run returns many plausible poses.
  • An MD trajectory returns many time-snapshots of one system.
  • A generative model (RFdiffusion, ProteinMPNN sampling) returns many candidates.

The questions you typically ask of an ensemble are the same regardless of which kind it is:

  1. How should I weight the members? Boltzmann-weight by score, uniform-weight, or use external weights.
  2. How diverse is the ensemble? Pairwise-distance statistics tell you whether the sampling actually explored conformational space.
  3. Which members cluster together? Reduces N noisy samples to k representative groups.
  4. What does the population look like in space? Density maps, contact frequencies.
  5. Can I draw a single representative member? Medoid pick or weighted average.

This module focuses on pose ensembles in v1 (i.e. docking output — the same ligand in N different binding-mode candidates against one receptor) because that's the immediately useful case. The design deliberately admits MD trajectories and structural ensembles as future extensions: boltzmann_weights takes raw scores so it works for any kind of score, and the clustering / density functions take generic coordinate arrays alongside the convenience overloads that take Pose objects.

What's here:

Weighting: - :func:boltzmann_weights — softmax weights from scores. - :func:resample — weighted bootstrap.

Geometry: - :func:pairwise_rmsd — N×N RMSD matrix for ligand heavy atoms. - :func:pose_diversity — summary statistics over pairwise RMSDs.

Clustering: - :func:pose_clusters — hierarchical RMSD clustering with medoids.

Spatial: - :func:binding_site_density — 3D histogram of ligand atom positions, optionally Boltzmann-weighted.

Consensus: - :func:consensus_pose — pick a representative (medoid or weighted-mean) pose.

Limitations of v1:

  • Pose RMSD is order-sensitive. Atom orderings are assumed to be consistent across poses (which is true for poses returned from a single docking run by Vina, DiffDock, etc.). For symmetric ligands the resulting RMSD is an upper bound on the true symmetry-aware RMSD; for the common case of asymmetric drug-like ligands this is fine. A symmetry-aware RMSD using RDKit graph matching would be a natural future addition.
  • Receptor is fixed. Ensemble functions treat the receptor as a single conformation; poses differ only in ligand placement. Mixed-receptor ensembles (e.g. one ligand across an MD ensemble of receptor frames) need a separate, slightly different surface.

pose_clusters

pose_clusters(
    poses: Sequence[Pose],
    *,
    cutoff: float = 2.0,
    heavy_atoms_only: bool = True,
) -> PoseClusteringResult

Hierarchical (average-linkage) clustering of poses by RMSD.

Two poses end up in the same cluster if their average linkage to each other (= mean pairwise RMSD across the current cluster membership during agglomeration) stays below cutoff.

Parameters:

Name Type Description Default
poses Sequence[Pose]

Sequence of poses to cluster. Must share an atom ordering — see :func:pairwise_rmsd.

required
cutoff float

RMSD threshold in Å (default 2.0, a common docking- community value for "same binding mode"). Larger ⇒ fewer, looser clusters; smaller ⇒ more, tighter clusters.

2.0
heavy_atoms_only bool

see :func:pairwise_rmsd.

True

Returns:

Name Type Description
A PoseClusteringResult

class:PoseClusteringResult with cluster labels, the

PoseClusteringResult

class:PoseCluster objects (sorted largest-first), and the

PoseClusteringResult

RMSD matrix used internally.

Raises:

Type Description
ValueError

If cutoff <= 0, poses is empty, or poses don't share an atom ordering.

Example

from molforge.ensembles import pose_clusters result = pose_clusters(docking.poses, cutoff=2.0) print(f"{result.n_clusters} distinct binding modes")

Get the medoid of the biggest cluster:

biggest = result.clusters[0] representative = docking.poses[biggest.medoid]

consensus_pose

consensus_pose(
    poses: Sequence[Pose],
    *,
    weights: NDArray[floating] | None = None,
    method: str = "medoid",
    heavy_atoms_only: bool = True,
) -> Pose

Pick or build a single pose representing the ensemble.

Parameters:

Name Type Description Default
poses Sequence[Pose]

Sequence of poses, all with the same atom ordering.

required
weights NDArray[floating] | None

(n_poses,) array of weights. Default None = uniform. Typically from :func:boltzmann_weights. For method="medoid", weights bias which pose is chosen as representative (a high-weight outlier won't be picked unless it's also geometrically central). For method="mean", weights are used directly in the weighted average of coordinates.

None
method str

One of:

  • "medoid" (default): pick an actual pose from the ensemble that minimizes its summed weighted RMSD to the rest. Returns the original Pose object unchanged.
  • "mean": synthesize a new pose with coordinates that are the weighted average across the ensemble. The receptor and metadata are copied from the first pose; the score is set to the weighted average of input scores. Bond geometry is not guaranteed valid.
'medoid'
heavy_atoms_only bool

see :func:pairwise_rmsd. Only affects the medoid choice (mean-averaging always uses all atoms).

True

Returns:

Type Description
Pose

A single :class:Pose. For "medoid" this is one of the

Pose

input poses, returned by reference. For "mean" it's a

Pose

new deep-copied :class:Pose with averaged coordinates.

Raises:

Type Description
ValueError

If poses is empty, method is unrecognized, or weights has the wrong length / doesn't sum to ~1.

Example

from molforge.ensembles import boltzmann_weights, consensus_pose w = boltzmann_weights(docking.poses)

Pick the most "central" high-affinity pose.

representative = consensus_pose(docking.poses, weights=w)

binding_site_density

binding_site_density(
    poses: Sequence[Pose],
    *,
    spacing: float = 1.0,
    padding: float = 4.0,
    weights: NDArray[floating] | None = None,
    heavy_atoms_only: bool = True,
    origin: NDArray[floating] | None = None,
    shape: tuple[int, int, int] | None = None,
) -> DensityGrid

Compute a 3D spatial density of ligand atom positions across the ensemble.

Each ligand atom in each pose contributes weights[i] (default uniform = 1/n_poses) to the grid cell containing it. Atoms landing outside the grid are silently discarded; the returned total_weight tells you how much weight made it in.

By default the grid is auto-sized: it's the axis-aligned bounding box of all heavy ligand atoms across all poses, padded by padding Å on each side, with the requested spacing. For differential / comparative analysis (where you want two ensembles on the same grid), pass explicit origin and shape.

Parameters:

Name Type Description Default
poses Sequence[Pose]

Sequence of poses. Ligand atoms across all poses determine the bounding box by default.

required
spacing float

Cubic grid spacing in Å. Default 1.0 (resolves to something a bit finer than typical ligand atom-atom distances but coarser than a true density map).

1.0
padding float

Bounding-box padding in Å on each side, applied only when origin is not given. Default 4.0.

4.0
weights NDArray[floating] | None

(n_poses,) array of weights. Default None = uniform 1/n_poses (so total weight ≤ 1.0). Typically comes from :func:boltzmann_weights.

None
heavy_atoms_only bool

see :func:pairwise_rmsd.

True
origin NDArray[floating] | None

(3,) Cartesian coordinate of the grid's (0, 0, 0) corner. If provided, shape must also be provided, and the auto-bounding-box logic is bypassed.

None
shape tuple[int, int, int] | None

(nx, ny, nz) explicit grid shape; required if and only if origin is provided.

None

Returns:

Name Type Description
A DensityGrid

class:DensityGrid.

Raises:

Type Description
ValueError

If spacing <= 0, poses is empty, weights has the wrong length, or only one of origin / shape is provided.

Example

from molforge.ensembles import boltzmann_weights, binding_site_density weights = boltzmann_weights(docking.poses) grid = binding_site_density(docking.poses, spacing=0.5, weights=weights) hot_spot = np.unravel_index(grid.density.argmax(), grid.shape) grid.coordinate_of(hot_spot) # the most-occupied point in Å array([12.3, 45.6, 78.9], dtype=float32)

pairwise_rmsd

pairwise_rmsd(
    poses: Sequence[Pose], *, heavy_atoms_only: bool = True
) -> NDArray[np.float32]

Compute the N×N pairwise heavy-atom RMSD matrix over a pose ensemble.

Parameters:

Name Type Description Default
poses Sequence[Pose]

Sequence of :class:molforge.docking.Pose objects, all of which must have the same number of atoms in the same order. (Poses returned by a single docking run normally do.)

required
heavy_atoms_only bool

If True (default), hydrogens are stripped before computing distances. Most docking engines run on heavy atoms anyway, so for Vina output this is a no-op; for poses from other sources it's a sensible default.

True

Returns:

Type Description
NDArray[float32]

A symmetric (n, n) float32 matrix with zero diagonal. Entry

NDArray[float32]

[i, j] is the RMSD between pose i and pose j in

NDArray[float32]

the input units (Å for biology).

Raises:

Type Description
ValueError

If poses don't share an atom ordering (different heavy-atom counts) or if any pose has no atoms.

Example

from molforge.ensembles import pairwise_rmsd rmsd_matrix = pairwise_rmsd(result.poses) rmsd_matrix[0, 1] # RMSD between poses 0 and 1 2.34

pose_diversity

pose_diversity(
    poses: Sequence[Pose], *, heavy_atoms_only: bool = True
) -> dict[str, float]

Summary statistics over the pairwise RMSD distribution of an ensemble.

Use as a quick diagnostic for "did the docking actually explore?" A run where every pose is within 0.5 Å of every other pose probably converged on one binding mode; a run with mean pairwise RMSD > 3 Å found multiple modes.

Parameters:

Name Type Description Default
poses Sequence[Pose]

Sequence of poses, as for :func:pairwise_rmsd.

required
heavy_atoms_only bool

see :func:pairwise_rmsd.

True

Returns:

Type Description
dict[str, float]

A dict with keys:

dict[str, float]
  • "min": minimum off-diagonal pairwise RMSD.
dict[str, float]
  • "max": maximum pairwise RMSD.
dict[str, float]
  • "mean": mean of the upper triangle (excluding diagonal).
dict[str, float]
  • "median": median of the upper triangle.
dict[str, float]
  • "std": standard deviation of the upper triangle.
dict[str, float]
  • "n_poses": number of poses in the ensemble.
dict[str, float]

For an ensemble of size 1, all distance statistics are 0.0.

Raises:

Type Description
ValueError

If poses don't share an atom ordering or any pose has no atoms.

boltzmann_weights

boltzmann_weights(
    scores: Sequence[float]
    | NDArray[floating]
    | Sequence[Pose],
    *,
    temperature: float = KT_298K_KCAL_PER_MOL,
    lower_is_better: bool = True,
) -> NDArray[np.float64]

Compute Boltzmann weights from a vector of scores.

The returned weights satisfy sum(w) == 1.0 and are computed in a numerically stable way (subtracting the min/max score before exponentiating).

Parameters:

Name Type Description Default
scores Sequence[float] | NDArray[floating] | Sequence[Pose]

Either a sequence of scalar scores, or a sequence of :class:molforge.docking.Pose objects (in which case the score attribute is used). NumPy arrays work directly.

required
temperature float

The thermal energy kT in the same units as scores. Defaults to kT at 298 K in kcal/mol (0.593), appropriate for Vina-style docking scores. Larger values → softer (more uniform) weights; smaller values → winner-takes-all. Must be strictly positive.

KT_298K_KCAL_PER_MOL
lower_is_better bool

If True (default), scores are treated as energies — lower (more negative) means better, gets higher weight. If False, larger means better (suits ML confidence scores). The flag flips the sign internally.

True

Returns:

Type Description
NDArray[float64]

A (n,) float64 array of weights summing to 1.

Raises:

Type Description
ValueError

If temperature is non-positive, scores is empty, or any score is non-finite.

Example

from molforge.ensembles import boltzmann_weights

Three docking scores in kcal/mol (Vina convention).

weights = boltzmann_weights([-9.5, -8.2, -7.1]) weights.sum() 1.0 weights[0] > weights[2] # best score gets highest weight True

resample

resample(
    poses: Sequence[Pose],
    n_samples: int,
    *,
    weights: NDArray[floating] | None = None,
    rng: Generator | None = None,
    replace: bool = True,
) -> list[Pose]

Draw n_samples poses with replacement, optionally weighted.

Useful for downstream uncertainty estimation (bootstrap any pose- derived metric over the resampled population) or for converting a Boltzmann-weighted ensemble into an unweighted population suitable for tools that don't accept weights.

Parameters:

Name Type Description Default
poses Sequence[Pose]

The source ensemble. Order is preserved across calls with the same rng seed.

required
n_samples int

How many draws to return. Must be ≥ 1.

required
weights NDArray[floating] | None

A (len(poses),) array of weights summing to 1 (e.g. from :func:boltzmann_weights). If None, uniform weights are used.

None
rng Generator | None

A NumPy Generator. If None, a fresh one is created with default seeding (non-reproducible). Pass an explicit np.random.default_rng(seed) for reproducibility.

None
replace bool

Whether to sample with replacement. Default True (which is what Boltzmann resampling means). Setting False requires n_samples <= len(poses).

True

Returns:

Type Description
list[Pose]

A list of n_samples :class:Pose objects drawn from

list[Pose]

poses according to weights. Poses may be repeated when

list[Pose]

sampling with replacement; objects are not copied (each

list[Pose]

returned reference points at the original pose).

Raises:

Type Description
ValueError

If poses is empty, n_samples < 1, or weights has the wrong length / doesn't sum to ~1.