molforge.prep
Structure-preparation utilities — turn raw PDBs into MD-ready systems.
A molforge user with a PDB file from AlphaFold, RoseTTAFold, the RCSB, or a cryo-EM deposition almost always needs the same kind of clean-up before running molecular dynamics:
- Drop crystallographic clutter — buffer salts, glycerol, cryoprotectants, sometimes the ligand. (Or keep them, depending on what you're simulating.)
- Fix missing heavy atoms — X-ray structures often have partial side chains where the electron density was weak; AlphaFold output doesn't suffer from this but other structure-prediction tools can.
- Cap free termini — terminal residues with bare N/C ends aren't standard amino acids as far as a force field is concerned. Capping with ACE / NME makes them tractable.
- Add hydrogens at the right pH — most PDBs are heavy-atom-only; force fields need every hydrogen explicit, with the right protonation state for the side chain at the system's pH.
This subpackage provides:
- :func:`remove_heterogens`: drop non-standard residues (waters, ions, ligands, crystallization additives), subject to a configurable allow-list.
- :func:`fix_missing_atoms`: rebuild missing heavy atoms with PDBFixer's rotamer library.
- :func:`add_caps`: terminate free amine / carboxyl ends with ACE / NME caps.
- :func:`add_hydrogens`: add hydrogens at a given pH using OpenMM's Modeller.
- :func:`prepare_for_md`: the convenience entry point that chains the above with sensible defaults for an "AlphaFold-PDB-to-MD" workflow.
Heavy dependencies (OpenMM, PDBFixer) are loaded lazily inside the functions that need them, so importing :mod:`molforge.prep` itself requires neither. When a dependency is absent, those functions raise a clean :class:`molforge.md.MDEngineNotInstalledError` with install instructions.

Install once with::

    pip install 'molforge[prep]'

For composable use, call the individual functions in whatever order fits your case. For the common case, :func:`prepare_for_md` chains them with sensible defaults.
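
The lazy-loading behaviour described above can be sketched as follows. This is a minimal illustration, not molforge's actual implementation: the `require` helper is hypothetical, and the exception class is a local stand-in whose `ImportError` base class is an assumption.

```python
import importlib
from types import ModuleType


class MDEngineNotInstalledError(ImportError):
    """Local stand-in for molforge.md.MDEngineNotInstalledError (base class assumed)."""


def require(module_name: str, extra: str = "prep") -> ModuleType:
    """Import a heavy dependency on first use, or fail with an install hint.

    Hypothetical helper: the name and signature are illustrative only.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise MDEngineNotInstalledError(
            f"{module_name!r} is required for this step. "
            f"Install it with: pip install 'molforge[{extra}]'"
        ) from exc


def step_that_needs_openmm() -> ModuleType:
    # The import happens inside the function body, so merely importing the
    # module that defines this function never touches OpenMM.
    return require("openmm.app")
```

Because the import is deferred to call time, a user who only filters heterogens never pays the import cost and never sees the error.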
remove_heterogens

    remove_heterogens(
        protein: Protein,
        *,
        keep_water: bool = False,
        keep_ions: bool = False,
        keep_ligands: bool = False,
        keep: frozenset[str] | set[str] | None = None,
    ) -> Protein

Return a new :class:`Protein` with non-standard residues removed.
By default, only canonical amino acids and nucleotides are kept. Waters, ions, ligands, and other heterogens are dropped. The `keep_*` flags add whole categories back; `keep` is an explicit residue-name allow-list for anything else (a specific cofactor you want to preserve, for example).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `protein` | `Protein` | The input structure. | required |
| `keep_water` | `bool` | Keep water residues (HOH, WAT, SOL, ...). MD workflows usually solvate explicitly, so the default is `False`. | `False` |
| `keep_ions` | `bool` | Keep monatomic ions (Na+, Cl-, Mg2+, Zn2+, ...). Defaults to `False`. | `False` |
| `keep_ligands` | `bool` | Keep atoms marked as … | `False` |
| `keep` | `frozenset[str] \| set[str] \| None` | Explicit residue-name allow-list. Any residue whose name (case-insensitive) is in this set is kept regardless of the toggles above. | `None` |

Returns:
| Type | Description |
|---|---|
| `Protein` | A new :class:`Protein` … the filter. The original protein is not modified. |
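
To make the interaction between the `keep_*` toggles and the `keep` allow-list concrete, here is a small self-contained sketch of the decision for a single residue name. It is not molforge's code: the name sets are illustrative, the `keeps_residue` helper is hypothetical, and treating "everything else" as a ligand is an assumption.

```python
from __future__ import annotations

from typing import Iterable

# Illustrative name sets only; molforge's real categories may differ.
STANDARD = frozenset({
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
    "DA", "DC", "DG", "DT", "A", "C", "G", "U",
})
WATER = frozenset({"HOH", "WAT", "SOL"})
IONS = frozenset({"NA", "CL", "MG", "ZN", "K"})


def keeps_residue(
    name: str,
    *,
    keep_water: bool = False,
    keep_ions: bool = False,
    keep_ligands: bool = False,
    keep: Iterable[str] | None = None,
) -> bool:
    """Decide whether a residue with this name survives the filter."""
    name = name.strip().upper()
    allow = {n.strip().upper() for n in (keep or ())}
    # The allow-list wins regardless of the category toggles.
    if name in allow or name in STANDARD:
        return True
    if name in WATER:
        return keep_water
    if name in IONS:
        return keep_ions
    # Assumption: anything left over is treated as a ligand.
    return keep_ligands


residues = ["MET", "HOH", "HEM", "NA", "GOL"]
kept = [r for r in residues if keeps_residue(r, keep={"hem"})]
```

With the defaults plus `keep={"hem"}`, only the methionine and the heme survive; the water, sodium ion, and glycerol are dropped.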
add_caps
Cap free termini with neutral blocking groups.
Adds ACE (acetyl) at every chain's N-terminus and NME (N-methyl amide) at every C-terminus, masking the charged free-amine / free-carboxyl that force fields don't have templates for. The capping atoms are placed by PDBFixer's rotamer routine.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `protein` | `Protein` | The input structure. | required |
| `n_cap` | `str` | Residue name for the N-terminal cap. Default `'ACE'`. | `'ACE'` |
| `c_cap` | `str` | Residue name for the C-terminal cap. Default `'NME'`. | `'NME'` |

Returns:
| Type | Description |
|---|---|
| `Protein` | A new :class:`Protein` … is not modified. |

Raises:
| Type | Description |
|---|---|
| `MDEngineNotInstalledError` | If PDBFixer or OpenMM is not installed. |
Notes
Capping is per chain. A multi-chain protein gets one cap of each kind per chain. Chains already terminated by something non-standard (a cyclic peptide, a disulfide-bonded terminus) will get capped anyway — pre-clean those by hand if it matters.
Example

    from molforge.prep import add_caps

    p_capped = add_caps(my_protein)

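
The per-chain behaviour described in the notes can be illustrated on plain residue-name lists. This is only a bookkeeping sketch under stated assumptions: `cap_chains` is a hypothetical helper, chains are modelled as lists of residue names, and no atoms are actually placed.

```python
from __future__ import annotations


def cap_chains(
    chains: dict[str, list[str]],
    n_cap: str = "ACE",
    c_cap: str = "NME",
) -> dict[str, list[str]]:
    """Return a new mapping with one N-cap and one C-cap added per chain.

    The input mapping and its lists are left untouched.
    """
    return {
        chain_id: [n_cap, *residues, c_cap]
        for chain_id, residues in chains.items()
    }


original = {"A": ["MET", "ALA", "GLY"], "B": ["SER", "LYS"]}
capped = cap_chains(original)
```

Each chain gains exactly two residues, so a two-chain input ends up with two ACE and two NME caps in total.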
fix_missing_atoms

    fix_missing_atoms(
        protein: Protein,
        *,
        fix_missing_residues: bool = False,
        replace_nonstandard: bool = True,
    ) -> Protein

Rebuild missing heavy atoms with PDBFixer's rotamer library.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `protein` | `Protein` | The input structure. Typically an X-ray or cryo-EM PDB with side-chain atoms missing in flexible regions. | required |
| `fix_missing_residues` | `bool` | If … | `False` |
| `replace_nonstandard` | `bool` | If … | `True` |

Returns:
| Type | Description |
|---|---|
| `Protein` | A new :class:`Protein` … not modified. |

Raises:
| Type | Description |
|---|---|
| `MDEngineNotInstalledError` | If PDBFixer or OpenMM is not installed. |
Example

    import molforge as mf
    from molforge.prep import fix_missing_atoms

    p = mf.fetch("1abc")
    p_fixed = fix_missing_atoms(p)

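
To show what "missing heavy atoms" means in practice, the sketch below compares the atoms present in a residue against a template of expected heavy-atom names. It is a conceptual illustration only, not how PDBFixer or molforge detects or rebuilds atoms; the `missing_heavy_atoms` helper and the four-residue template table are hypothetical.

```python
from __future__ import annotations

from typing import Iterable

# Expected heavy-atom names (standard PDB naming) for a few residues.
HEAVY_ATOM_TEMPLATES: dict[str, frozenset[str]] = {
    "GLY": frozenset({"N", "CA", "C", "O"}),
    "ALA": frozenset({"N", "CA", "C", "O", "CB"}),
    "SER": frozenset({"N", "CA", "C", "O", "CB", "OG"}),
    "LYS": frozenset({"N", "CA", "C", "O", "CB", "CG", "CD", "CE", "NZ"}),
}


def missing_heavy_atoms(res_name: str, present: Iterable[str]) -> list[str]:
    """Return the template atoms absent from `present`, sorted by name."""
    template = HEAVY_ATOM_TEMPLATES[res_name.upper()]
    return sorted(template - set(present))


# A lysine whose side chain was only resolved up to CG.
truncated_lys = ["N", "CA", "C", "O", "CB", "CG"]
gaps = missing_heavy_atoms("LYS", truncated_lys)
```

A residue with an empty result is complete; a non-empty result is the kind of partial side chain that weak electron density typically leaves behind.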
prepare_for_md

    prepare_for_md(
        protein: Protein,
        *,
        pH: float = 7.4,
        keep_water: bool = False,
        keep_ions: bool = False,
        keep_ligands: bool = False,
        keep: frozenset[str] | set[str] | None = None,
        fix_missing_residues: bool = False,
        replace_nonstandard: bool = True,
        add_caps_to_termini: bool = True,
        add_explicit_hydrogens: bool = True,
        force_field: str = "amber14",
    ) -> Protein

Convert a raw PDB into an MD-ready :class:`Protein`.
Chains four pre-MD steps in the order they should run:
1. :func:`~molforge.prep.remove_heterogens`: drop crystallographic clutter (waters, buffer salts, ligands) unless the caller asks to keep them.
2. :func:`~molforge.prep.fix_missing_atoms`: rebuild missing heavy atoms (and, optionally, missing residues).
3. :func:`~molforge.prep.add_caps`: cap free termini with ACE / NME so the force field can template them.
4. :func:`~molforge.prep.add_hydrogens`: add explicit hydrogens at the requested pH.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `protein` | `Protein` | The input structure (typically heavy-atom-only). | required |
| `pH` | `float` | pH at which to assign protonation states in step 4 (default 7.4, physiological). | `7.4` |
| `keep_water` | `bool` | Forwarded to :func:`~molforge.prep.remove_heterogens`. | `False` |
| `keep_ions` | `bool` | Forwarded to :func:`~molforge.prep.remove_heterogens`. | `False` |
| `keep_ligands` | `bool` | Forwarded to :func:`~molforge.prep.remove_heterogens`. | `False` |
| `keep` | `frozenset[str] \| set[str] \| None` | Forwarded to :func:`~molforge.prep.remove_heterogens`. | `None` |
| `fix_missing_residues` | `bool` | Forwarded to :func:`~molforge.prep.fix_missing_atoms`. | `False` |
| `replace_nonstandard` | `bool` | Forwarded to :func:`~molforge.prep.fix_missing_atoms`. | `True` |
| `add_caps_to_termini` | `bool` | If … | `True` |
| `add_explicit_hydrogens` | `bool` | If … | `True` |
| `force_field` | `str` | Force-field name passed to :func:`~molforge.prep.add_hydrogens`. | `'amber14'` |

Returns:
| Type | Description |
|---|---|
| `Protein` | A new :class:`Protein` … |

Raises:
| Type | Description |
|---|---|
| `MDEngineNotInstalledError` | If OpenMM / PDBFixer is required (any step beyond … |
Example

    import molforge as mf
    from molforge.prep import prepare_for_md
    from molforge.wrappers.md import OpenMM

    raw = mf.load("alphafold_output.pdb")
    system = prepare_for_md(raw, pH=7.4)
    sim = OpenMM().prepare(system)
    sim = OpenMM().minimize(sim)
    traj = OpenMM().run(sim, n_steps=50_000, save_every=500)

Notes
Order matters. Heterogen removal first means we don't waste cycles fixing or hydrogenating atoms we're about to throw away. Capping before protonation means the cap residues get their own hydrogens placed correctly. Missing-atom completion before capping means residues whose terminal atoms were absent in the input aren't capped on top of incomplete backbones.
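
The ordering argument above can be made concrete with a toy pipeline that records which steps ran and in what sequence. The steps here are stand-ins that only append their name to a tuple; `prepare_sketch` and `make_step` are hypothetical helpers, and the assumption that the first two steps always run follows from the signature having no flag to disable them.

```python
from typing import Callable, Tuple

Structure = Tuple[str, ...]  # stand-in for a Protein; records the steps applied


def make_step(name: str) -> Callable[[Structure], Structure]:
    """Build a stand-in step that returns a new tuple with its name appended."""
    def step(structure: Structure) -> Structure:
        return structure + (name,)
    return step


def prepare_sketch(
    structure: Structure,
    *,
    add_caps_to_termini: bool = True,
    add_explicit_hydrogens: bool = True,
) -> Structure:
    """Apply the four steps in the documented order, skipping disabled ones."""
    plan = [
        ("remove_heterogens", True),   # first: do not fix atoms about to be dropped
        ("fix_missing_atoms", True),   # before capping: complete the backbone
        ("add_caps", add_caps_to_termini),
        ("add_hydrogens", add_explicit_hydrogens),  # last: caps get hydrogens too
    ]
    for name, enabled in plan:
        if enabled:
            structure = make_step(name)(structure)
    return structure


history = prepare_sketch(("raw",))
no_caps = prepare_sketch(("raw",), add_caps_to_termini=False)
```

Swapping the real molforge functions in for the stand-ins would give a hand-rolled equivalent of the convenience entry point, which is useful when one step needs custom arguments.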
add_hydrogens
Add hydrogens to a :class:`Protein` at the given pH.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `protein` | `Protein` | The input structure. Typically heavy-atom-only: standard PDB, AlphaFold, or docking-engine output. | required |
| `pH` | `float` | The pH at which to assign protonation states. Default 7.4 (physiological). Histidine is the residue this matters most for: at pH 7.4 it is usually neutral, as either HID (H on Nδ) or HIE (H on Nε), and only rarely HIP (doubly protonated, charged). Modeller chooses between the neutral tautomers based on the side-chain environment. | `7.4` |
| `force_field` | `str` | An OpenMM force-field name (see :data: … | `'amber14'` |

Returns:
| Type | Description |
|---|---|
| `Protein` | A new :class:`Protein` … modified. Calling on a structure that already has explicit hydrogens is a no-op (returns an equivalent :class:`Protein`). |

Raises:
| Type | Description |
|---|---|
| `MDEngineNotInstalledError` | If OpenMM is not installed. |
Example

    import molforge as mf
    from molforge.prep import add_hydrogens

    p = mf.load("alphafold_output.pdb")  # heavy atoms only
    p_h = add_hydrogens(p, pH=7.4)
    p_h.atom_array.n_atoms > p.atom_array.n_atoms  # True

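
As a rough guide to why the charged histidine form is uncommon at physiological pH, the Henderson-Hasselbalch relation gives the protonated fraction of a titratable group. The sketch below assumes a model side-chain pKa of 6.0 for histidine, which is an approximation (real pKa values shift with the local environment), and it is not the rule Modeller itself applies.

```python
def fraction_protonated(pH: float, pKa: float) -> float:
    """Henderson-Hasselbalch: fraction of the group in its protonated form."""
    return 1.0 / (1.0 + 10.0 ** (pH - pKa))


HIS_MODEL_PKA = 6.0  # assumed model value for the imidazole side chain

at_physiological = fraction_protonated(7.4, HIS_MODEL_PKA)
at_acidic = fraction_protonated(5.0, HIS_MODEL_PKA)
```

Under this assumption roughly 4% of histidine side chains are protonated at pH 7.4, rising to about 91% at pH 5.0, which is why the choice of `pH` can change the net charge of the prepared system.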