molforge.prep

prep

Structure-preparation utilities — turn raw PDBs into MD-ready systems.

A molforge user with a PDB file from AlphaFold, RoseTTAFold, the RCSB, or a cryo-EM deposition almost always needs the same kind of clean-up before running molecular dynamics:

- Drop crystallographic clutter: buffer salts, glycerol, cryoprotectants, sometimes the ligand. (Or keep them, depending on what you're simulating.)
- Fix missing heavy atoms: X-ray structures often have partial side chains where the electron density was weak; AlphaFold output doesn't suffer from this, but output from other structure-prediction tools can.
- Cap free termini: terminal residues with bare N/C ends aren't standard amino acids as far as a force field is concerned. Capping with ACE / NME makes them tractable.
- Add hydrogens at the right pH: most PDBs are heavy-atom-only; force fields need every hydrogen explicit, with the right protonation state for each side chain at the system's pH.

This subpackage provides:

- :func:`remove_heterogens` drops non-standard residues (waters, ions, ligands, crystallization additives), except those on a configurable allow-list.
- :func:`fix_missing_atoms` rebuilds missing heavy atoms with PDBFixer's rotamer library.
- :func:`add_caps` terminates free amine / carboxyl ends with ACE / NME caps.
- :func:`add_hydrogens` adds hydrogens at a given pH using OpenMM's Modeller.
- :func:`prepare_for_md` is the convenience entry point that chains the above with sensible defaults for an "AlphaFold-PDB-to-MD" workflow.

Heavy dependencies (OpenMM, PDBFixer) are loaded lazily inside the functions that need them, so importing :mod:`molforge.prep` itself requires neither. When the dependencies are absent, those functions raise a :class:`molforge.md.MDEngineNotInstalledError` with install instructions.

Install once with::

    pip install 'molforge[prep]'

For composable use, call the individual functions in whatever order fits your case. For the common case, :func:`prepare_for_md` chains them in the recommended order with sensible defaults.
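The lazy-loading behavior described above can be sketched in a few lines. This is a minimal illustration of the pattern, not molforge's actual implementation: the `require` helper is hypothetical, and the error class here is a local stand-in whose `ImportError` base class is an assumption.

```python
import importlib


class MDEngineNotInstalledError(ImportError):
    """Local stand-in for molforge.md.MDEngineNotInstalledError (base class assumed)."""


def require(module_name: str, extra: str = "prep"):
    """Import a heavy dependency on first use, or raise with an install hint.

    Hypothetical helper: the name and signature are illustrative only.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise MDEngineNotInstalledError(
            f"{module_name!r} is required for this step. "
            f"Install it with: pip install 'molforge[{extra}]'"
        ) from exc


def needs_heavy_dep(module_name: str) -> str:
    # The import runs here, inside the function body, so importing the
    # module that defines this function never touches the dependency.
    mod = require(module_name)
    return mod.__name__
```

Because the import is deferred to call time, a user who only ever calls the dependency-free functions never sees the error.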

remove_heterogens

remove_heterogens(
    protein: Protein,
    *,
    keep_water: bool = False,
    keep_ions: bool = False,
    keep_ligands: bool = False,
    keep: frozenset[str] | set[str] | None = None,
) -> Protein

Return a new :class:`Protein` with non-standard atoms removed.

By default, only canonical amino acids and nucleotides are kept. Waters, ions, ligands, and other heterogens are dropped. The keep_* flags add categories back; keep is an explicit residue-name allow-list for anything else (a specific cofactor you want to preserve, for example).

Parameters:

Name	Type	Description	Default
`protein`	`Protein`	The input structure.	required
`keep_water`	`bool`	Keep water residues (HOH, WAT, SOL, ...). MD workflows usually solvate explicitly, so the default is `False` — even crystallographic waters get replaced.	`False`
`keep_ions`	`bool`	Keep monatomic ions (Na+, Cl-, Mg2+, Zn2+, ...). Defaults to `False`; some structural ions (e.g. zinc in a zinc finger) you'll want to keep — pass `True` or use the explicit `keep` allow-list.	`False`
`keep_ligands`	`bool`	Keep atoms marked as `entity_type == "ligand"`. Useful when you want to MD a protein-ligand complex.	`False`
`keep`	`frozenset[str] \| set[str] \| None`	Explicit residue-name allow-list. Any residue whose name (case-insensitive) is in this set is kept regardless of the toggles above.	`None`

Returns:

Type	Description
`Protein`	A new :class:`Protein` containing only the atoms that passed the filter. The original protein is not modified.

Example

    import molforge as mf
    from molforge.prep import remove_heterogens

    protein = mf.load("4hhb.pdb")
    clean = remove_heterogens(protein, keep_ions=True)
    # Hemoglobin's HEM cofactors would also be dropped here.
    # If we want them: keep={"HEM"}.
    clean = remove_heterogens(protein, keep={"HEM"})
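The filtering rules described for this function (toggles per category, plus a case-insensitive allow-list that overrides them) can be sketched without molforge at all. Everything below is a simplified stand-in: the `Residue` class, the name sets, and the `keeps` / `filter_names` helpers are hypothetical, and the real name lists inside molforge may differ.

```python
from dataclasses import dataclass

# Illustrative name sets, truncated for brevity (assumptions, not molforge's lists).
STANDARD = frozenset({"ALA", "GLY", "HIS", "SER"})
WATER = frozenset({"HOH", "WAT", "SOL"})
IONS = frozenset({"NA", "CL", "MG", "ZN"})


@dataclass(frozen=True)
class Residue:
    name: str
    entity_type: str = "polymer"  # "ligand" marks small molecules


def keeps(res, *, keep_water=False, keep_ions=False, keep_ligands=False, keep=None):
    """Decide whether one residue survives the filter."""
    name = res.name.upper()
    # The explicit allow-list wins regardless of the category toggles.
    if name in {n.upper() for n in (keep or ())}:
        return True
    if name in STANDARD:
        return True
    if name in WATER:
        return keep_water
    if name in IONS:
        return keep_ions
    if res.entity_type == "ligand":
        return keep_ligands
    return False  # any other heterogen is dropped


def filter_names(residues, **flags):
    """Return the names of the residues that pass, leaving the input untouched."""
    return [r.name for r in residues if keeps(r, **flags)]


demo = [Residue("HIS"), Residue("HOH", "water"), Residue("ZN", "ion"), Residue("HEM", "ligand")]
```

With this toy input, `filter_names(demo)` keeps only the histidine, while `filter_names(demo, keep={"hem"})` also keeps the heme, mirroring the hemoglobin example.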

add_caps

add_caps(
    protein: Protein,
    *,
    n_cap: str = "ACE",
    c_cap: str = "NME",
) -> Protein

Cap free termini with neutral blocking groups.

Adds ACE (acetyl) at every chain's N-terminus and NME (N-methyl amide) at every C-terminus, masking the charged free-amine / free-carboxyl that force fields don't have templates for. The capping atoms are placed by PDBFixer's rotamer routine.

Parameters:

Name	Type	Description	Default
`protein`	`Protein`	The input structure.	required
`n_cap`	`str`	Residue name for the N-terminal cap. Default `"ACE"` (acetyl). Pass `None` (or an empty string) to skip N-terminal capping.	`'ACE'`
`c_cap`	`str`	Residue name for the C-terminal cap. Default `"NME"` (N-methyl amide). Pass `None` to skip.	`'NME'`

Returns:

Type	Description
`Protein`	A new :class:`Protein` with capping residues added. The input is not modified.

Raises:

Type	Description
`MDEngineNotInstalledError`	If PDBFixer or OpenMM is not installed.

Notes

Capping is per chain. A multi-chain protein gets one cap of each kind per chain. Chains already terminated by something non-standard (a cyclic peptide, a disulfide-bonded terminus) will get capped anyway — pre-clean those by hand if it matters.

Example

    from molforge.prep import add_caps

    p_capped = add_caps(my_protein)
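The per-chain behavior in the notes above can be illustrated at the level of residue names alone. This sketch ignores coordinates entirely (the real function also has to place the cap atoms), and `cap_chains` with its dict-of-lists input is a hypothetical representation, not molforge's data model.

```python
def cap_chains(chains, n_cap="ACE", c_cap="NME"):
    """Return a new mapping with one cap of each kind added per chain.

    `chains` maps a chain ID to its list of residue names, N- to C-terminus.
    A falsy cap name (None or "") skips that end.
    """
    capped = {}
    for chain_id, residues in chains.items():
        new_residues = list(residues)  # copy, so the input is not modified
        if n_cap:
            new_residues.insert(0, n_cap)
        if c_cap:
            new_residues.append(c_cap)
        capped[chain_id] = new_residues
    return capped


two_chains = {"A": ["MET", "ALA"], "B": ["GLY"]}
capped = cap_chains(two_chains)
```

Each chain is handled independently, so a two-chain input ends up with two ACE and two NME residues.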

fix_missing_atoms

fix_missing_atoms(
    protein: Protein,
    *,
    fix_missing_residues: bool = False,
    replace_nonstandard: bool = True,
) -> Protein

Rebuild missing heavy atoms with PDBFixer's rotamer library.

Parameters:

Name	Type	Description	Default
`protein`	`Protein`	The input structure. Typically an X-ray or cryo-EM PDB with side-chain atoms missing in flexible regions.	required
`fix_missing_residues`	`bool`	If `True`, also fill in entire missing residues (e.g. a disordered loop). Off by default because de-novo loop modelling is risky — the rebuilt geometry is often unphysical. Turn on only when you know the missing stretch is short and well-constrained.	`False`
`replace_nonstandard`	`bool`	If `True` (default), non-standard residues like selenomethionine (MSE) are replaced with their canonical counterparts (MET). Most force fields don't have templates for non-standard residues, so this is usually what you want.	`True`

Returns:

Type	Description
`Protein`	A new :class:`Protein` with the fixes applied. The input is not modified.

Raises:

Type	Description
`MDEngineNotInstalledError`	If PDBFixer or OpenMM is not installed.

Example

    import molforge as mf
    from molforge.prep import fix_missing_atoms

    p = mf.fetch("1abc")
    p_fixed = fix_missing_atoms(p)
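To make "missing heavy atoms" concrete, here is a small sketch of the detection half of the problem: compare the atoms present in a residue against a heavy-atom template, after optionally mapping a non-standard residue to its canonical counterpart. It does not rebuild coordinates, and the tables and the `missing_heavy_atoms` helper are hypothetical, covering only four residue types.

```python
# Heavy-atom names for a few standard residues (truncated, illustrative table).
HEAVY_ATOM_TEMPLATES = {
    "GLY": {"N", "CA", "C", "O"},
    "ALA": {"N", "CA", "C", "O", "CB"},
    "SER": {"N", "CA", "C", "O", "CB", "OG"},
    "MET": {"N", "CA", "C", "O", "CB", "CG", "SD", "CE"},
}
# Selenomethionine maps to methionine; its selenium takes the place of sulfur.
NONSTANDARD_TO_STANDARD = {"MSE": "MET"}
ATOM_RENAMES = {("MSE", "SE"): "SD"}


def missing_heavy_atoms(resname, present, replace_nonstandard=True):
    """Return the sorted template atom names absent from `present`."""
    name = resname.upper()
    atoms = set(present)
    if replace_nonstandard and name in NONSTANDARD_TO_STANDARD:
        # Rename atoms first, then switch to the canonical residue name.
        atoms = {ATOM_RENAMES.get((name, a), a) for a in atoms}
        name = NONSTANDARD_TO_STANDARD[name]
    template = HEAVY_ATOM_TEMPLATES[name]  # KeyError if no template exists
    return sorted(template - atoms)


truncated_ser = missing_heavy_atoms("SER", ["N", "CA", "C", "O"])
complete_mse = missing_heavy_atoms("MSE", ["N", "CA", "C", "O", "CB", "CG", "SE", "CE"])
```

The serine with only backbone atoms reports its side chain as missing, while a complete selenomethionine reports nothing once it has been mapped to methionine.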

prepare_for_md

prepare_for_md(
    protein: Protein,
    *,
    pH: float = 7.4,
    keep_water: bool = False,
    keep_ions: bool = False,
    keep_ligands: bool = False,
    keep: frozenset[str] | set[str] | None = None,
    fix_missing_residues: bool = False,
    replace_nonstandard: bool = True,
    add_caps_to_termini: bool = True,
    add_explicit_hydrogens: bool = True,
    force_field: str = "amber14",
) -> Protein

Convert a raw PDB into an MD-ready :class:`Protein`.

Chains four pre-MD steps in the order they should run:

1. :func:`~molforge.prep.remove_heterogens` drops crystallographic clutter (waters, buffer salts, ligands) unless the caller asks to keep them.
2. :func:`~molforge.prep.fix_missing_atoms` rebuilds missing heavy atoms (and, optionally, missing residues).
3. :func:`~molforge.prep.add_caps` caps free termini with ACE / NME so the force field can template them.
4. :func:`~molforge.prep.add_hydrogens` adds explicit hydrogens at the requested pH.

Parameters:

Name	Type	Description	Default
`protein`	`Protein`	The input structure (typically heavy-atom-only).	required
`pH`	`float`	pH at which to assign protonation states for step 4 (default 7.4 — physiological).	`7.4`
`keep_water`	`bool`	Forwarded to :func:`remove_heterogens`. Default `False` — MD setups solvate explicitly.	`False`
`keep_ions`	`bool`	Forwarded to :func:`remove_heterogens`. Default `False` — turn on for structures with bound ions you need to preserve (e.g. zinc fingers, metalloenzymes).	`False`
`keep_ligands`	`bool`	Forwarded to :func:`remove_heterogens`. Default `False` — turn on for protein-ligand MD.	`False`
`keep`	`frozenset[str] \| set[str] \| None`	Forwarded to :func:`remove_heterogens`. Explicit residue-name allow-list for cofactors and other entities you want to preserve.	`None`
`fix_missing_residues`	`bool`	Forwarded to :func:`fix_missing_atoms`. Default `False` — de-novo loop modelling is risky.	`False`
`replace_nonstandard`	`bool`	Forwarded to :func:`fix_missing_atoms`. Default `True` — replace MSE → MET, etc.	`True`
`add_caps_to_termini`	`bool`	If `True` (default), add ACE/NME caps. Set `False` if you've already capped the structure or you want charged termini.	`True`
`add_explicit_hydrogens`	`bool`	If `True` (default), add hydrogens. Set `False` if you've already protonated the structure.	`True`
`force_field`	`str`	Force-field name passed to :func:`add_hydrogens`. Default `"amber14"`.	`'amber14'`

Returns:

Type	Description
`Protein`	A new :class:`Protein` ready to be passed to an MD engine's `prepare` method. The input is not modified.

Raises:

Type	Description
`MDEngineNotInstalledError`	If OpenMM / PDBFixer is required (any step beyond `remove_heterogens`) and not installed.

Example

    import molforge as mf
    from molforge.prep import prepare_for_md
    from molforge.wrappers.md import OpenMM

    raw = mf.load("alphafold_output.pdb")
    system = prepare_for_md(raw, pH=7.4)
    sim = OpenMM().prepare(system)
    sim = OpenMM().minimize(sim)
    traj = OpenMM().run(sim, n_steps=50_000, save_every=500)

Notes

Order matters. Heterogen removal first means we don't waste cycles fixing or hydrogenating atoms we're about to throw away. Capping before protonation means the cap residues get their own hydrogens placed correctly. Missing-atom completion before capping means residues whose terminal atoms were absent in the input aren't capped on top of incomplete backbones.
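The ordering and the two skip flags can be sketched with stand-in steps that only record their own names. This is a toy model of the control flow, assuming the flags simply gate steps 3 and 4; `prepare_sketch` and `_step` are hypothetical and do no structural work.

```python
def _step(name):
    """Build a stand-in step that appends its name to a log of applied steps."""
    def apply(log):
        return log + [name]  # returns a new list; the input log is untouched
    return apply


def prepare_sketch(log, *, add_caps_to_termini=True, add_explicit_hydrogens=True):
    """Apply the four stand-in steps in the documented order."""
    log = _step("remove_heterogens")(log)   # 1. drop clutter first
    log = _step("fix_missing_atoms")(log)   # 2. complete residues before capping
    if add_caps_to_termini:
        log = _step("add_caps")(log)        # 3. cap before protonation
    if add_explicit_hydrogens:
        log = _step("add_hydrogens")(log)   # 4. hydrogens last, caps included
    return log


full_run = prepare_sketch([])
already_capped = prepare_sketch([], add_caps_to_termini=False)
```

Skipping a step with a flag leaves the relative order of the remaining steps unchanged.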

add_hydrogens

add_hydrogens(
    protein: Protein,
    *,
    pH: float = 7.4,
    force_field: str = "amber14",
) -> Protein

Add hydrogens to a :class:`Protein` at the given pH.

Parameters:

Name	Type	Description	Default
`protein`	`Protein`	The input structure. Typically heavy-atom-only — standard PDB / AlphaFold / docking-engine output.	required
`pH`	`float`	The pH at which to assign protonation states. Default 7.4 (physiological). Histidine is the residue this matters most for: at pH 7.4 most His are neutral, as either HID (H on δ) or HIE (H on ε), and only rarely HIP (charged). Modeller picks per the side-chain environment.	`7.4`
`force_field`	`str`	An OpenMM force-field name (see :data:`_FORCE_FIELD_FILES`) or any XML filename OpenMM can find. Determines the residue templates Modeller consults — the default (`amber14`) covers all standard amino acids.	`'amber14'`

Returns:

Type	Description
`Protein`	A new :class:`Protein` with hydrogens added. The input is not modified. Calling on a structure that already has explicit hydrogens is a no-op (returns an equivalent :class:`Protein`).

Raises:

Type	Description
`MDEngineNotInstalledError`	If OpenMM is not installed.

Example

    import molforge as mf
    from molforge.prep import add_hydrogens

    p = mf.load("alphafold_output.pdb")  # heavy atoms only
    p_h = add_hydrogens(p, pH=7.4)
    p_h.atom_array.n_atoms > p.atom_array.n_atoms  # True
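As a rough guide to why pH matters for histidine, the Henderson-Hasselbalch relation gives the fraction of a titratable group that is protonated at a given pH. The sketch below is a textbook estimate, not a description of how Modeller decides: the model pKa of 6.5 for the histidine side chain is an assumption, and real pKa values shift with the local environment.

```python
def protonated_fraction(pH: float, pKa: float) -> float:
    """Fraction of a titratable group in its protonated form (Henderson-Hasselbalch)."""
    return 1.0 / (1.0 + 10.0 ** (pH - pKa))


HIS_MODEL_PKA = 6.5  # assumed model value for the histidine side chain

# Charged (HIP-like) fraction at physiological pH and at the pKa itself.
frac_at_7_4 = protonated_fraction(7.4, HIS_MODEL_PKA)
frac_at_pka = protonated_fraction(HIS_MODEL_PKA, HIS_MODEL_PKA)
```

Under this assumption only about 11% of histidines would be doubly protonated at pH 7.4, compared with exactly half at pH 6.5, which is why the charged form is the exception at physiological pH.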