Skip to content

Ensembles

molforge.ensembles provides weighted statistics over collections of related structures — the most common case being docked poses: one ligand against one receptor, returned by a docking engine as N plausible binding modes ranked by score.

The seven public functions, organized by concern:

Concern Function Returns
Weighting boltzmann_weights (n,) weights summing to 1
Weighting resample Bootstrap of pose objects
Geometry pairwise_rmsd (n, n) RMSD matrix
Geometry pose_diversity Dict of summary stats
Clustering pose_clusters PoseClusteringResult
Spatial binding_site_density DensityGrid
Consensus consensus_pose A single Pose

The canonical pipeline

from molforge.ensembles import (
    boltzmann_weights, pose_diversity, pose_clusters,
    binding_site_density, consensus_pose,
)

# 1. Got a DockingResult from somewhere.
from molforge.wrappers.docking import Vina
result = Vina().dock(receptor, ligand)
poses = result.poses

# 2. Weight by score (lower = better, Vina convention).
weights = boltzmann_weights(poses)

# 3. Did the docking actually explore? Quick diagnostic.
stats = pose_diversity(poses)
print(f"mean pairwise RMSD: {stats['mean']:.2f} Å")

# 4. How many distinct binding modes?
clusters = pose_clusters(poses, cutoff=2.0)
print(f"{clusters.n_clusters} distinct modes; biggest has {clusters.clusters[0].size}")

# 5. Where does the ligand spend its mass in space?
grid = binding_site_density(poses, weights=weights, spacing=0.5)

# 6. One representative pose for downstream use.
representative = consensus_pose(poses, weights=weights, method="medoid")

Boltzmann weights

The fundamental operation: turn a vector of scores into a probability distribution. For Vina-style scores in kcal/mol with the default temperature kT = 0.593 kcal/mol (room temperature):

weights = boltzmann_weights([-9.5, -8.2, -7.1])
# array([0.83, 0.10, 0.07])  — approximately

The temperature parameter controls softness:

boltzmann_weights(scores, temperature=10.0)   # softer, more uniform
boltzmann_weights(scores, temperature=0.1)    # sharper, winner-takes-all

For ML-derived scores where larger is better (DiffDock confidence, EquiDock scores), pass lower_is_better=False.

The function accepts either a numeric sequence, a NumPy array, or a sequence of Pose objects (in which case the score attribute is read automatically).

Pose clustering

Hierarchical average-linkage RMSD clustering. Two poses end up in the same cluster if their average-linkage RMSD stays below cutoff (default 2.0 Å, a common docking-community value for "same binding mode"):

result = pose_clusters(poses, cutoff=2.0)
result.n_clusters             # int
result.labels                 # (n_poses,) int array
result.clusters               # list of PoseCluster, biggest first

biggest = result.clusters[0]
biggest.size                  # 7
biggest.medoid                # 3 (pose index)
biggest.mean_intra_rmsd       # 0.85

The clusterer is pure NumPy with no scipy dependency. The algorithm is O(n³) which is fine for ensemble sizes from a single docking run (typically n ≲ 20).

Density maps

binding_site_density accumulates ligand heavy-atom positions across the ensemble into a 3D spatial grid. Each pose contributes weight[i] per atom; the default spacing is 1.0 Å and the box is auto-sized to cover all poses with 4 Å padding:

grid = binding_site_density(poses, weights=weights, spacing=0.5)
grid.density.shape           # (nx, ny, nz)
grid.origin                  # (3,) — Cartesian corner in Å
grid.spacing                 # float

# Find the highest-occupied cell:
import numpy as np
hot = np.unravel_index(grid.density.argmax(), grid.density.shape)
hot_xyz = grid.coordinate_of(hot)

For comparative analysis (e.g. ensemble A vs ensemble B on the same grid), pass explicit origin and shape.

Consensus poses

Two strategies:

  • method="medoid" (default) — pick an actual pose from the ensemble that minimizes its weighted summed RMSD to the rest. The output is a real pose, guaranteed chemically valid.
  • method="mean" — synthesize a new pose by averaging coordinates across the ensemble. Bond geometries are not preserved, so this is only meaningful when the input ensemble has already been tight-clustered to similar conformations.
medoid = consensus_pose(poses, weights=weights)               # safe default
synth  = consensus_pose(cluster_members, method="mean")       # for tight clusters

What v1 doesn't do

A few known limitations worth flagging:

  • Pose RMSD is order-sensitive. Atom orderings must be consistent across poses, which is true for poses from a single docking run. For symmetric ligands (benzene, p-phenyls) the reported RMSD is an upper bound on the symmetry-aware RMSD. A future enhancement could use RDKit graph matching when the [docking] extra is installed.
  • Receptor is treated as fixed. Ensemble functions assume one receptor conformation; poses differ only in ligand placement. Mixed-receptor ensembles (e.g. one ligand across MD frames of the receptor) need a different API.
  • No scipy dependency. Clustering is hand-rolled in NumPy. For very large ensembles (n > 200, e.g. MD-derived) scipy's optimized linkage would be faster; that's a natural future extension.

Reference