molforge.prep

Structure-preparation utilities — turn raw PDBs into MD-ready systems.

A molforge user with a PDB file from AlphaFold, RoseTTAFold, the RCSB, or a cryo-EM deposition almost always needs the same kind of clean-up before running molecular dynamics:

  1. Drop crystallographic clutter — buffer salts, glycerol, cryoprotectants, sometimes the ligand. (Or keep them, depending on what you're simulating.)
  2. Fix missing heavy atoms — X-ray structures often have partial side chains where the electron density was weak; AlphaFold output doesn't suffer from this but other structure-prediction tools can.
  3. Cap free termini — terminal residues with bare N/C ends aren't standard amino acids as far as a force field is concerned. Capping with ACE / NME makes them tractable.
  4. Add hydrogens at the right pH — most PDBs are heavy-atom-only; force fields need every hydrogen explicit, with the right protonation state for the side chain at the system's pH.

This subpackage provides:

  • :func:remove_heterogens drops non-standard residues (waters, ions, ligands, crystallization additives), subject to a configurable allow-list.
  • :func:fix_missing_atoms rebuilds missing heavy atoms with PDBFixer's rotamer library.
  • :func:add_caps terminates free amine / carboxyl ends with ACE / NME caps.
  • :func:add_hydrogens adds hydrogens at a given pH using OpenMM's Modeller.
  • :func:prepare_for_md is the convenience entry point that chains the above with sensible defaults for an "AlphaFold-PDB-to-MD" workflow.

Heavy dependencies (OpenMM, PDBFixer) are loaded lazily inside the functions that need them, so importing :mod:molforge.prep itself requires neither. When the dependencies are absent, those functions raise a clean :class:molforge.md.MDEngineNotInstalledError with install instructions.

Install once with:

pip install 'molforge[prep]'
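The lazy-loading behaviour described above can be sketched roughly as follows. This is an illustration of the general pattern, not molforge's actual implementation: the helper require_optional and the local MDEngineNotInstalledError class are hypothetical stand-ins, and the install hint simply reuses the molforge[prep] extra named above.

```python
import importlib


class MDEngineNotInstalledError(ImportError):
    """Hypothetical stand-in for molforge.md.MDEngineNotInstalledError."""


def require_optional(module_name: str, extra: str = "prep"):
    """Import a heavy dependency on first use, or fail with install instructions."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise MDEngineNotInstalledError(
            f"{module_name!r} is required for this function. "
            f"Install it with: pip install 'molforge[{extra}]'"
        ) from exc


# A function body would call the helper only when it actually needs the
# dependency, so importing the enclosing module stays cheap.
json_module = require_optional("json")  # stdlib module, always importable
```

Because the import sits inside the helper, a missing package surfaces only when a function that needs it is called, and the error message tells the user what to install.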

For composable use, call the individual functions in whatever order fits your case. For the common case, :func:prepare_for_md does the right thing.

remove_heterogens

remove_heterogens(
    protein: Protein,
    *,
    keep_water: bool = False,
    keep_ions: bool = False,
    keep_ligands: bool = False,
    keep: frozenset[str] | set[str] | None = None,
) -> Protein

Return a new :class:Protein with non-standard atoms removed.

By default, only canonical amino acids and nucleotides are kept. Waters, ions, ligands, and other heterogens are dropped. The keep_* flags add categories back; keep is an explicit residue-name allow-list for anything else (a specific cofactor you want to preserve, for example).

Parameters:

  • protein (Protein, required): The input structure.
  • keep_water (bool, default False): Keep water residues (HOH, WAT, SOL, ...). MD workflows usually solvate explicitly, so the default is False and even crystallographic waters get replaced.
  • keep_ions (bool, default False): Keep monatomic ions (Na+, Cl-, Mg2+, Zn2+, ...). Some structural ions (e.g. zinc in a zinc finger) you'll want to keep; pass True or use the explicit keep allow-list.
  • keep_ligands (bool, default False): Keep atoms marked as entity_type == "ligand". Useful when you want to run MD on a protein-ligand complex.
  • keep (frozenset[str] | set[str] | None, default None): Explicit residue-name allow-list. Any residue whose name (case-insensitive) is in this set is kept regardless of the toggles above.

Returns:

  • Protein: A new :class:Protein containing only the atoms that passed the filter. The original protein is not modified.

Example

import molforge as mf
from molforge.prep import remove_heterogens

protein = mf.load("4hhb.pdb")
clean = remove_heterogens(protein, keep_ions=True)

# Hemoglobin's HEM cofactors would also be dropped here.
# If we want them: keep={"HEM"}.
clean = remove_heterogens(protein, keep={"HEM"})
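The interaction between the keep_* toggles and the keep allow-list can be sketched as a per-residue decision. This is a simplified model, not the library's code: the function keep_residue and the name sets below are hypothetical, the sets are deliberately incomplete, and every unrecognised residue is treated as a ligand, whereas the real function consults entity_type.

```python
# Illustrative name sets; the lists molforge actually uses may differ.
STANDARD = frozenset({
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
    "A", "C", "G", "U", "DA", "DC", "DG", "DT",
})
WATERS = frozenset({"HOH", "WAT", "SOL"})
IONS = frozenset({"NA", "CL", "MG", "ZN", "K", "CA"})


def keep_residue(name, *, keep_water=False, keep_ions=False,
                 keep_ligands=False, keep=None):
    """Return True if a residue with this name survives the filter."""
    name = name.upper()
    allow = {n.upper() for n in (keep or ())}
    if name in allow:       # the explicit allow-list wins over the toggles
        return True
    if name in STANDARD:    # canonical amino acids and nucleotides always stay
        return True
    if name in WATERS:
        return keep_water
    if name in IONS:
        return keep_ions
    return keep_ligands     # simplification: everything else counts as a ligand


residues = ["ALA", "HOH", "ZN", "HEM", "GLY"]
kept = [r for r in residues if keep_residue(r, keep={"hem"})]
```

Here kept retains ALA, HEM, and GLY: the allow-list matches case-insensitively, while the water and the zinc ion are dropped because their toggles are off.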

add_caps

add_caps(
    protein: Protein,
    *,
    n_cap: str = "ACE",
    c_cap: str = "NME",
) -> Protein

Cap free termini with neutral blocking groups.

Adds ACE (acetyl) at every chain's N-terminus and NME (N-methyl amide) at every C-terminus, masking the charged free-amine / free-carboxyl that force fields don't have templates for. The capping atoms are placed by PDBFixer's rotamer routine.

Parameters:

  • protein (Protein, required): The input structure.
  • n_cap (str, default 'ACE'): Residue name for the N-terminal cap. Default "ACE" (acetyl). Pass None (or an empty string) to skip N-terminal capping.
  • c_cap (str, default 'NME'): Residue name for the C-terminal cap. Default "NME" (N-methyl amide). Pass None to skip.

Returns:

  • Protein: A new :class:Protein with capping residues added. The input is not modified.

Raises:

  • MDEngineNotInstalledError: If PDBFixer or OpenMM is not installed.

Notes

Capping is per chain. A multi-chain protein gets one cap of each kind per chain. Chains already terminated by something non-standard (a cyclic peptide, a disulfide-bonded terminus) will get capped anyway — pre-clean those by hand if it matters.

Example

from molforge.prep import add_caps

p_capped = add_caps(my_protein)
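The per-chain behaviour from the notes can be illustrated with residue names alone. The sketch below is a toy model under stated assumptions: cap_chain is a hypothetical helper that only tracks the residue sequence of each chain, and it places no atoms, which in the real function is the part that needs PDBFixer.

```python
def cap_chain(residues, n_cap="ACE", c_cap="NME"):
    """Return a new residue-name list with caps added; the input is untouched."""
    capped = list(residues)
    if n_cap:                # None or "" skips the N-terminal cap
        capped.insert(0, n_cap)
    if c_cap:                # None or "" skips the C-terminal cap
        capped.append(c_cap)
    return capped


# Two chains: each one gets its own pair of caps.
chains = {"A": ["MET", "ALA", "GLY"], "B": ["SER", "LYS"]}
capped = {chain_id: cap_chain(res) for chain_id, res in chains.items()}
only_c_cap = cap_chain(["GLY", "ALA"], n_cap=None)
```

Chain A becomes ACE-MET-ALA-GLY-NME and chain B becomes ACE-SER-LYS-NME, while the chains dictionary itself is left unchanged, mirroring the "input is not modified" contract.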

fix_missing_atoms

fix_missing_atoms(
    protein: Protein,
    *,
    fix_missing_residues: bool = False,
    replace_nonstandard: bool = True,
) -> Protein

Rebuild missing heavy atoms with PDBFixer's rotamer library.

Parameters:

  • protein (Protein, required): The input structure. Typically an X-ray or cryo-EM PDB with side-chain atoms missing in flexible regions.
  • fix_missing_residues (bool, default False): If True, also fill in entire missing residues (e.g. a disordered loop). Off by default because de-novo loop modelling is risky: the rebuilt geometry is often unphysical. Turn on only when you know the missing stretch is short and well-constrained.
  • replace_nonstandard (bool, default True): If True (default), non-standard residues like selenomethionine (MSE) are replaced with their canonical counterparts (MET). Most force fields don't have templates for non-standard residues, so this is usually what you want.

Returns:

  • Protein: A new :class:Protein with the fixes applied. The input is not modified.

Raises:

  • MDEngineNotInstalledError: If PDBFixer or OpenMM is not installed.

Example

import molforge as mf
from molforge.prep import fix_missing_atoms

p = mf.fetch("1abc")
p_fixed = fix_missing_atoms(p)
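The effect of replace_nonstandard on residue names can be sketched as a lookup from modified residue to parent residue. This is only a naming-level illustration: canonical_names and the CANONICAL table are hypothetical, the table lists a few well-known cases beyond the MSE to MET example above, and the real function also has to rebuild or rename atoms.

```python
# A few common modified residues and their canonical parents (illustrative subset).
CANONICAL = {
    "MSE": "MET",  # selenomethionine
    "SEP": "SER",  # phosphoserine
    "TPO": "THR",  # phosphothreonine
    "PTR": "TYR",  # phosphotyrosine
}


def canonical_names(residues, replace_nonstandard=True):
    """Map non-standard residue names to canonical ones; return a new list."""
    if not replace_nonstandard:
        return list(residues)
    return [CANONICAL.get(name, name) for name in residues]


sequence = ["MSE", "ALA", "SEP", "GLY"]
fixed = canonical_names(sequence)
untouched = canonical_names(sequence, replace_nonstandard=False)
```

With the flag on, the sequence becomes MET, ALA, SER, GLY; with it off, the names pass through unchanged and the force field would need its own templates for MSE and SEP.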

prepare_for_md

prepare_for_md(
    protein: Protein,
    *,
    pH: float = 7.4,
    keep_water: bool = False,
    keep_ions: bool = False,
    keep_ligands: bool = False,
    keep: frozenset[str] | set[str] | None = None,
    fix_missing_residues: bool = False,
    replace_nonstandard: bool = True,
    add_caps_to_termini: bool = True,
    add_explicit_hydrogens: bool = True,
    force_field: str = "amber14",
) -> Protein

Convert a raw PDB into an MD-ready :class:Protein.

Chains four pre-MD steps in the order they should run:

  1. :func:~molforge.prep.remove_heterogens — drop crystallographic clutter (waters, buffer salts, ligands) unless the caller asks to keep them.
  2. :func:~molforge.prep.fix_missing_atoms — rebuild missing heavy atoms (and, optionally, missing residues).
  3. :func:~molforge.prep.add_caps — cap free termini with ACE / NME so the force field can template them.
  4. :func:~molforge.prep.add_hydrogens — add explicit hydrogens at the requested pH.

Parameters:

  • protein (Protein, required): The input structure (typically heavy-atom-only).
  • pH (float, default 7.4): pH at which to assign protonation states for step 4. The default, 7.4, is physiological.
  • keep_water (bool, default False): Forwarded to :func:remove_heterogens. Off by default because MD setups solvate explicitly.
  • keep_ions (bool, default False): Forwarded to :func:remove_heterogens. Turn on for structures with bound ions you need to preserve (e.g. zinc fingers, metalloenzymes).
  • keep_ligands (bool, default False): Forwarded to :func:remove_heterogens. Turn on for protein-ligand MD.
  • keep (frozenset[str] | set[str] | None, default None): Forwarded to :func:remove_heterogens. Explicit residue-name allow-list for cofactors and other entities you want to preserve.
  • fix_missing_residues (bool, default False): Forwarded to :func:fix_missing_atoms. Off by default because de-novo loop modelling is risky.
  • replace_nonstandard (bool, default True): Forwarded to :func:fix_missing_atoms. On by default: replace MSE → MET, etc.
  • add_caps_to_termini (bool, default True): If True (default), add ACE/NME caps. Set False if you've already capped the structure or you want charged termini.
  • add_explicit_hydrogens (bool, default True): If True (default), add hydrogens. Set False if you've already protonated the structure.
  • force_field (str, default 'amber14'): Force-field name passed to :func:add_hydrogens. Default "amber14".

Returns:

  • Protein: A new :class:Protein ready to be passed to an MD engine's prepare method. The input is not modified.

Raises:

  • MDEngineNotInstalledError: If OpenMM / PDBFixer is required (any step beyond remove_heterogens) and not installed.

Example

import molforge as mf
from molforge.prep import prepare_for_md
from molforge.wrappers.md import OpenMM

raw = mf.load("alphafold_output.pdb")
system = prepare_for_md(raw, pH=7.4)
sim = OpenMM().prepare(system)
sim = OpenMM().minimize(sim)
traj = OpenMM().run(sim, n_steps=50_000, save_every=500)

Notes

Order matters. Heterogen removal first means we don't waste cycles fixing or hydrogenating atoms we're about to throw away. Capping before protonation means the cap residues get their own hydrogens placed correctly. Missing-atom completion before capping means residues whose terminal atoms were absent in the input aren't capped on top of incomplete backbones.
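The ordering rule above, combined with the two opt-out flags, can be sketched as a small planning function. The helper planned_steps is hypothetical and only returns step names; it is meant to show which steps run and in what order, not how prepare_for_md is written internally.

```python
def planned_steps(*, add_caps_to_termini=True, add_explicit_hydrogens=True):
    """List the preparation steps in execution order for the given flags."""
    # Heterogen removal and heavy-atom repair always run, in that order.
    steps = ["remove_heterogens", "fix_missing_atoms"]
    if add_caps_to_termini:
        # Caps go on after backbones are complete, before protonation.
        steps.append("add_caps")
    if add_explicit_hydrogens:
        # Hydrogens come last so cap residues are protonated too.
        steps.append("add_hydrogens")
    return steps


default_plan = planned_steps()
precapped_plan = planned_steps(add_caps_to_termini=False)
```

Turning a flag off removes that step without reordering the remaining ones, so a structure that is already capped still gets its hydrogens added after heavy-atom repair.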

add_hydrogens

add_hydrogens(
    protein: Protein,
    *,
    pH: float = 7.4,
    force_field: str = "amber14",
) -> Protein

Add hydrogens to a :class:Protein at the given pH.

Parameters:

  • protein (Protein, required): The input structure. Typically heavy-atom-only, as in standard PDB / AlphaFold / docking-engine output.
  • pH (float, default 7.4): The pH at which to assign protonation states. Default 7.4 (physiological). Histidine is the residue this matters most for: its side-chain pKa is close to neutral, so at pH 7.4 most His are neutral (HID with H on δ, or HIE with H on ε) and only rarely HIP (charged). Modeller picks per the side-chain environment.
  • force_field (str, default 'amber14'): An OpenMM force-field name (see :data:_FORCE_FIELD_FILES) or any XML filename OpenMM can find. Determines the residue templates Modeller consults; the default (amber14) covers all standard amino acids.

Returns:

  • Protein: A new :class:Protein with hydrogens added. The input is not modified. Calling on a structure that already has explicit hydrogens is a no-op (returns an equivalent :class:Protein).

Raises:

  • MDEngineNotInstalledError: If OpenMM is not installed.

Example

import molforge as mf
from molforge.prep import add_hydrogens

p = mf.load("alphafold_output.pdb")  # heavy atoms only
p_h = add_hydrogens(p, pH=7.4)
p_h.atom_array.n_atoms > p.atom_array.n_atoms  # True
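Why the pH argument matters mostly for histidine can be seen from the Henderson-Hasselbalch relation. The sketch below is a back-of-the-envelope estimate, not what Modeller computes: fraction_protonated is a hypothetical helper, and the pKa of 6.0 is an assumed model value for an isolated His side chain, which real protein environments can shift substantially.

```python
def fraction_protonated(pH: float, pKa: float) -> float:
    """Henderson-Hasselbalch: fraction of a titratable group in its protonated form."""
    return 1.0 / (1.0 + 10.0 ** (pH - pKa))


HIS_PKA = 6.0  # assumed value for an isolated His side chain; environment-dependent

hip_fraction = fraction_protonated(7.4, HIS_PKA)
print(f"{hip_fraction:.3f}")  # prints 0.038
```

Under these assumptions only about 4% of histidines would be doubly protonated (HIP) at pH 7.4, which is consistent with HIP being the rare case; lowering the pH toward 6 pushes that fraction toward one half.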