molforge.core¶
core ¶
Core data model: hierarchical and linear views of protein structure.
The :class:AtomArray is the canonical representation — a flat,
NumPy-backed array of all atoms. The hierarchical classes
(:class:Protein, :class:Chain, :class:Residue, :class:Atom) are
lightweight views that read and write through to the array.
Typical usage:
>>> from molforge.core import Protein, AtomArray
>>> protein = Protein(atom_array=AtomArray(0), name="example")
>>> protein.n_atoms
0
Atom ¶
View of a single atom in an :class:AtomArray.
Attributes are read/written through to the underlying array, so
mutating an Atom mutates the source-of-truth representation.
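The write-through behaviour described above can be sketched with plain Python lists. `TinyAtomArray` and `TinyAtomView` are hypothetical stand-ins invented for this illustration, not molforge classes; the point is only that a view stores an index into the backing array rather than a copy of the data.

```python
from dataclasses import dataclass, field


@dataclass
class TinyAtomArray:
    """Hypothetical stand-in for AtomArray: one list per per-atom field."""

    element: list = field(default_factory=list)


class TinyAtomView:
    """Hypothetical stand-in for Atom: holds only (array, index)."""

    def __init__(self, array: TinyAtomArray, index: int) -> None:
        self._array = array
        self._index = index

    @property
    def element(self) -> str:
        # Read through to the backing array.
        return self._array.element[self._index]

    @element.setter
    def element(self, value: str) -> None:
        # Write through: the view keeps no copy of its own.
        self._array.element[self._index] = value


backing = TinyAtomArray(element=["C", "N", "O"])
view = TinyAtomView(backing, 1)
view.element = "S"
print(backing.element)  # ['C', 'S', 'O']
```

Because the view holds no state of its own, two views of the same index always agree with each other.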
AtomArray ¶
Flat, NumPy-backed array of atoms.
This is the canonical representation; hierarchical views read from
these arrays. All per-atom fields have shape (N,) except
coords which has shape (N, 3).
Example
>>> aa = AtomArray.empty(3)
>>> aa.element[:] = ["C", "N", "O"]
>>> aa.coords[0] = [1.0, 2.0, 3.0]
>>> len(aa)
3
Create an empty array of n atoms, all fields at default values.
chain_starts
property
¶
Indices of the first atom of each chain, in order.
A chain boundary is any change in chain_id or model_id.
residue_starts
property
¶
Indices of the first atom of each residue, in order.
A residue boundary is any change in
(chain_id, residue_id, insertion_code, model_id).
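The boundary rules for chain_starts and residue_starts can be illustrated without the library. The sketch below is an assumption about the logic, not molforge's implementation: `boundary_starts` is a hypothetical helper that records every index where the key tuple differs from the previous atom's.

```python
from typing import Hashable, List, Sequence


def boundary_starts(keys: Sequence[Hashable]) -> List[int]:
    """Return the index of the first item in each run of equal keys."""
    starts = []
    for i, key in enumerate(keys):
        if i == 0 or key != keys[i - 1]:
            starts.append(i)
    return starts


# Hypothetical per-atom fields for five atoms.
chain_id = ["A", "A", "A", "A", "B"]
residue_id = [1, 1, 2, 2, 2]
insertion_code = ["", "", "", "", ""]
model_id = [0, 0, 0, 0, 0]

# Residue boundary: any change in (chain_id, residue_id, insertion_code, model_id).
residue_keys = list(zip(chain_id, residue_id, insertion_code, model_id))
# Chain boundary: any change in (chain_id, model_id).
chain_keys = list(zip(chain_id, model_id))

residue_starts = boundary_starts(residue_keys)
chain_starts = boundary_starts(chain_keys)
print(residue_starts)  # [0, 2, 4]
print(chain_starts)    # [0, 4]
```

Note that the last atom starts a new residue even though its residue_id repeats, because its chain_id changed.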
empty
classmethod
¶
Alias for AtomArray(n) — more readable at call sites.
from_dict
classmethod
¶
Construct from a dict of equal-length arrays.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | dict[str, NDArray[Any]] | Mapping field-name -> array. Must include | required |
Raises:
| Type | Description |
|---|---|
| KeyError | If |
| ValueError | If array lengths disagree. |
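The two failure modes listed above can be sketched as a standalone check. This is an illustration under assumptions: `check_fields` is a hypothetical helper, the required field names are not given in this text and are therefore passed in by the caller, and the KeyError condition is presumed to be a missing required field.

```python
from typing import Any, Dict, Sequence


def check_fields(data: Dict[str, Sequence[Any]], required: Sequence[str]) -> int:
    """Validate a field dict and return the common length (0 if empty)."""
    missing = [name for name in required if name not in data]
    if missing:
        raise KeyError(f"missing required field(s): {missing}")
    lengths = {name: len(values) for name, values in data.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"array lengths disagree: {lengths}")
    return next(iter(lengths.values()), 0)


fields = {"coords": [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]], "element": ["C", "N"]}
n_atoms = check_fields(fields, required=["coords"])
print(n_atoms)  # 2
```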
append ¶
Return a new array with other concatenated after this one.
select ¶
Return a new AtomArray containing only atoms where mask is True.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| mask | BoolArray | Boolean array of length | required |
where ¶
Build a boolean mask from equality filters on any field.
Example
>>> mask = aa.where(chain_id="A", atom_name="CA")
>>> ca_atoms = aa.select(mask)
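To make the where/select pairing concrete, here is a library-free sketch over a dict of lists. The functions below are hypothetical stand-ins, and combining multiple filters with logical AND is an assumption inferred from the example, not something this reference states.

```python
from typing import Any, Dict, List


def where(fields: Dict[str, List[Any]], **filters: Any) -> List[bool]:
    """Build a mask that is True where every equality filter matches."""
    n = len(next(iter(fields.values())))
    mask = [True] * n
    for name, wanted in filters.items():
        column = fields[name]  # KeyError for an unknown field name
        mask = [keep and value == wanted for keep, value in zip(mask, column)]
    return mask


def select(fields: Dict[str, List[Any]], mask: List[bool]) -> Dict[str, List[Any]]:
    """Return new columns holding only the entries where mask is True."""
    return {
        name: [value for value, keep in zip(column, mask) if keep]
        for name, column in fields.items()
    }


atoms = {
    "chain_id": ["A", "A", "B", "B"],
    "atom_name": ["N", "CA", "CA", "C"],
}
mask = where(atoms, chain_id="A", atom_name="CA")
ca_atoms = select(atoms, mask)
print(mask)      # [False, True, False, False]
print(ca_atoms)  # {'chain_id': ['A'], 'atom_name': ['CA']}
```

Keeping mask construction separate from selection lets a caller combine or invert masks before filtering.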
iter_residue_slices ¶
Yield a slice for each residue's atoms (in array order).
iter_chain_slices ¶
Yield a slice for each chain's atoms (in array order).
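One plausible way to derive such slices is from a list of start indices plus the total atom count. The generator below is a hypothetical sketch of that idea, not the library's code; `iter_slices` and its arguments are names chosen for illustration.

```python
from typing import Iterator, List


def iter_slices(starts: List[int], n_atoms: int) -> Iterator[slice]:
    """Yield slice(start, next_start) for each group, in array order."""
    for i, start in enumerate(starts):
        stop = starts[i + 1] if i + 1 < len(starts) else n_atoms
        yield slice(start, stop)


atom_names = ["N", "CA", "N", "CA", "O"]
residue_starts = [0, 2, 4]  # hypothetical start indices for three residues
groups = [atom_names[s] for s in iter_slices(residue_starts, len(atom_names))]
print(groups)  # [['N', 'CA'], ['N', 'CA'], ['O']]
```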
Chain ¶
View over a chain's atoms inside an :class:AtomArray.
sequence
property
¶
One-letter sequence for this chain (standard AAs + non-canonical mappings).
Non-amino-acid residues (ligands, water, ions) are skipped.
Unknown residues become "X".
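The three rules above (map known residues, skip non-amino-acid residues, emit "X" for unknown ones) can be sketched as follows. The lookup tables are deliberately tiny and hypothetical, and deciding "non-amino-acid" by residue name is an assumption; the library may use a different criterion.

```python
from typing import Iterable

# Hypothetical, abbreviated tables for illustration only.
THREE_TO_ONE = {"ALA": "A", "GLY": "G", "SER": "S", "MSE": "M"}  # MSE: selenomethionine
NON_POLYMER = {"HOH", "NA", "ZN"}  # water and ions


def chain_sequence(residue_names: Iterable[str]) -> str:
    """Build a one-letter sequence from three-letter residue names."""
    letters = []
    for name in residue_names:
        if name in NON_POLYMER:
            continue  # ligands, water, ions are skipped
        letters.append(THREE_TO_ONE.get(name, "X"))  # unknown residues become "X"
    return "".join(letters)


print(chain_sequence(["ALA", "MSE", "HOH", "XYZ", "GLY"]))  # AMXG
```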
ProteinMetadata ¶
Bases: TypedDict
Typed view of the documented :attr:Protein.metadata keys.
Every key is optional (total=False). This is a typing aid only —
Protein.metadata remains a plain dict[str, Any] at runtime,
and keys outside this set are still permitted (without stability
guarantees). Annotate a local variable as ProteinMetadata to get
editor / mypy support for the documented vocabulary.
Protein ¶
Protein(
atom_array: AtomArray | None = None,
*,
name: str = "",
metadata: dict[str, Any] | None = None,
)
A protein (or protein complex) structure.
Protein owns a single :class:AtomArray (atom_array) which is
the canonical data store. Hierarchical accessors (chains,
residues, etc.) read from it.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| atom_array | AtomArray \| None | The flat array of atoms backing this protein. If omitted, an empty array is used. | None |
| name | str | Optional identifier (e.g. PDB ID). | '' |
| metadata | dict[str, Any] \| None | Free-form key/value metadata (resolution, header, engine confidence, ...). The dict accepts any keys, but the names molforge's own parsers and engine wrappers use form a stable, documented vocabulary; see :mod: | None |
sequence
property
¶
Concatenated one-letter sequence across all protein/nucleic chains.
Chains are joined with "/" to make boundaries visible.
Non-polymer chains (ligand, water, ion) are skipped.
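A small sketch of the joining rule, assuming each chain is represented as an (entity_type, sequence) pair. The helper name and the entity-type strings "protein" and "nucleic" are assumptions made for this illustration.

```python
from typing import Iterable, Tuple

POLYMER_TYPES = {"protein", "nucleic"}  # assumed entity-type labels


def complex_sequence(chains: Iterable[Tuple[str, str]]) -> str:
    """Join polymer chain sequences with '/', skipping non-polymer chains."""
    return "/".join(seq for entity_type, seq in chains if entity_type in POLYMER_TYPES)


chains = [("protein", "MKT"), ("ligand", ""), ("protein", "GSH"), ("water", "")]
print(complex_sequence(chains))  # MKT/GSH
```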
get_chain ¶
Look up a chain by (chain_id, model_id).
select ¶
Return a new Protein containing only atoms matching the filters.
Filters are forwarded to :meth:AtomArray.where.
Example
>>> # Keep only chain A, protein atoms
>>> sub = protein.select(chain_id="A", entity_type="protein")
protein_only ¶
Return a new Protein containing only polymer protein atoms.
Drops ligands, waters, ions, and nucleic acids.
Residue ¶
is_standard_amino_acid ¶
Return True for the 20 canonical amino acids.
three_to_one ¶
Convert a 3-letter residue name to one-letter code.
Falls back to unknown for residues outside the canonical and known
non-canonical tables. Nucleotides are handled too.
Metadata vocabulary¶
The documented key vocabulary for Protein.metadata — string
constants and the ProteinMetadata TypedDict.
metadata_keys ¶
Documented key vocabulary for :attr:molforge.core.Protein.metadata.
Protein.metadata is a free-form dict[str, Any] by design — it
carries whatever a parser or engine wants to attach, including
open-ended things like PDB REMARK records. Keeping it a plain dict
means no breaking change for code that writes arbitrary keys.
But "free-form" shouldn't mean "undocumented". The keys below are the ones molforge's own parsers and engine wrappers produce, and they form the contract: downstream code can rely on these names and value types being stable across the 1.x series. Keys outside this list are still permitted but carry no stability guarantee.
Two things to use here:
- String constants (PDB_ID, MEAN_CONFIDENCE, ...). Prefer these over bare string literals when reading or writing metadata, so a typo is a NameError at import time rather than a silently missing key at runtime.
- :class:ProteinMetadata, a TypedDict (total=False, every key optional) that documents the value type of each key. It's a typing aid only: Protein.metadata is still a plain dict at runtime, but annotating a local as ProteinMetadata gives editors and mypy the key/type information.
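Both points can be demonstrated with the standard library alone. In the sketch below, `MetadataSketch` is a hypothetical two-key stand-in for ProteinMetadata, and the string values assigned to the constants are assumptions, since the real values are not shown in this reference.

```python
from typing import Any, Dict, Final, TypedDict

# Assumed string values, for illustration only.
RESOLUTION: Final = "resolution"
ENGINE: Final = "engine"


class MetadataSketch(TypedDict, total=False):
    """Hypothetical stand-in for ProteinMetadata; every key is optional."""

    resolution: float
    engine: str


# total=False: a dict holding only some of the keys still type-checks.
meta: MetadataSketch = {"engine": "ESMFold"}
meta[RESOLUTION] = 1.8
print(type(meta) is dict)  # True: a TypedDict is a plain dict at runtime

plain: Dict[str, Any] = dict(meta)
print(plain.get("resolutoin"))  # None: a misspelt string literal fails silently

try:
    plain[RESOLUTOIN]  # misspelt constant name  # noqa: F821
except NameError as exc:
    print(f"caught: {exc}")
```

The misspelt literal returns None without complaint, while the misspelt constant name raises as soon as the line runs, which is the argument for preferring constants.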
Key groups:
- Structural-IO header keys: set by :func:molforge.io.read_pdb and :func:molforge.io.read_cif from file header records.
- Uniform folding-engine keys: set by every folding-engine wrapper (ESMFold, AlphaFold, Boltz, RoseTTAFold) so downstream code can read prediction confidence without knowing which engine ran. :func:molforge.io.load_alphafold also populates these.
- Engine-specific folding keys: set by some folding wrappers but not all; presence depends on the engine.
TITLE
module-attribute
¶
Free-text structure title from the PDB TITLE / mmCIF _struct.title (str).
CLASSIFICATION
module-attribute
¶
PDB HEADER classification field, e.g. "HYDROLASE" (str).
DEPOSITION_DATE
module-attribute
¶
Deposition date string as it appears in the PDB HEADER (str).
EXPERIMENTAL_METHOD
module-attribute
¶
Experimental method, e.g. "X-RAY DIFFRACTION" (str).
RESOLUTION
module-attribute
¶
Resolution in Angstrom (float). Absent for non-diffraction structures.
ENGINE
module-attribute
¶
Name of the folding engine that produced the structure (str), e.g.
"ESMFold", "AlphaFold", "Boltz", "RoseTTAFold".
SOURCE_SEQUENCE
module-attribute
¶
The one-letter input sequence the engine folded (str).
CONFIDENCE_PER_RESIDUE
module-attribute
¶
(L,) float32 array of per-residue pLDDT-style confidence (0-100).
CONFIDENCE_PER_ATOM
module-attribute
¶
(N_atoms,) float32 array of per-atom confidence (0-100).
MEAN_CONFIDENCE
module-attribute
¶
Scalar mean per-residue confidence (float, 0-100).
SOURCE
module-attribute
¶
Provenance tag (str). Set to "alphafold" by
:func:molforge.io.load_alphafold.
MODEL_NAME
module-attribute
¶
Engine-internal model identifier (str). Set by ESMFold.
MODEL_TYPE
module-attribute
¶
Model-type identifier (str). Set by AlphaFold, e.g. "monomer".
MODEL_VERSION
module-attribute
¶
Model-version identifier (str). Set by Boltz, e.g. "boltz2".
JOB_NAME
module-attribute
¶
Job name used for engine output files (str). Set by RoseTTAFold.
USE_MSA_SERVER
module-attribute
¶
Whether an MSA server was used (bool). Set by Boltz.
PTM
module-attribute
¶
Predicted TM-score for the whole structure (float). Set by Boltz.
IPTM
module-attribute
¶
Interface predicted TM-score (float). Set by Boltz; meaningful for complexes.
CONFIDENCE_SCORE
module-attribute
¶
Composite confidence score (float). Set by Boltz.
PAE
module-attribute
¶
(L, L) predicted aligned error matrix (float array). Set by RoseTTAFold.
PDE
module-attribute
¶
(L, L) predicted distance error matrix (float array). Set by RoseTTAFold.
PAE_INTER
module-attribute
¶
Scalar mean inter-chain PAE (float). RoseTTAFold's headline interface metric; values below ~10 indicate a high-quality interface.
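For intuition, the plain reading of "mean inter-chain PAE" is the average of the PAE matrix entries whose row and column residues belong to different chains. The sketch below implements that reading; it is an assumption and may differ from how RoseTTAFold computes the value it reports. `mean_inter_chain_pae` is a hypothetical helper.

```python
from typing import List, Sequence


def mean_inter_chain_pae(pae: Sequence[Sequence[float]], chain_ids: List[str]) -> float:
    """Average pae[i][j] over all residue pairs (i, j) in different chains."""
    total = 0.0
    count = 0
    for i, row in enumerate(pae):
        for j, value in enumerate(row):
            if chain_ids[i] != chain_ids[j]:
                total += value
                count += 1
    if count == 0:
        raise ValueError("no inter-chain residue pairs; need at least two chains")
    return total / count


# Hypothetical 3-residue complex: residues 0 and 1 in chain A, residue 2 in chain B.
pae = [
    [0.0, 2.0, 12.0],
    [2.0, 0.0, 8.0],
    [10.0, 6.0, 0.0],
]
print(mean_inter_chain_pae(pae, ["A", "A", "B"]))  # 9.0
```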
PAE_PROT
module-attribute
¶
Scalar mean PAE over protein residues only (float). Set by RoseTTAFold.
MEAN_PAE
module-attribute
¶
Scalar mean of the full PAE matrix (float). Set by RoseTTAFold.
MEAN_PLDDT
module-attribute
¶
Scalar mean pLDDT (float). Set by RoseTTAFold and
:func:molforge.io.load_alphafold. Equivalent to :data:MEAN_CONFIDENCE;
the latter is the cross-engine-uniform name and should be preferred.
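Downstream code that must handle both names can read the preferred key first and fall back to the legacy one. The string values below are assumptions made for this sketch (the real constants' values are not shown here), and `mean_confidence` is a hypothetical helper.

```python
from typing import Any, Dict, Optional

# Assumed string values, for illustration only.
MEAN_CONFIDENCE = "mean_confidence"
MEAN_PLDDT = "mean_plddt"


def mean_confidence(metadata: Dict[str, Any]) -> Optional[float]:
    """Prefer the cross-engine key; fall back to the legacy pLDDT key."""
    for key in (MEAN_CONFIDENCE, MEAN_PLDDT):
        if key in metadata:
            return float(metadata[key])
    return None


print(mean_confidence({"mean_plddt": 87.5}))                           # 87.5
print(mean_confidence({"mean_confidence": 91.0, "mean_plddt": 87.5}))  # 91.0
print(mean_confidence({}))                                             # None
```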
PLDDT
module-attribute
¶
(N_atoms,) float32 per-atom pLDDT. Legacy key set by
:func:molforge.io.load_alphafold; :data:CONFIDENCE_PER_ATOM is the
cross-engine-uniform name and should be preferred.
PLDDT_PER_RESIDUE
module-attribute
¶
(L,) float32 per-residue pLDDT. Legacy key set by
:func:molforge.io.load_alphafold; :data:CONFIDENCE_PER_RESIDUE is the
cross-engine-uniform name and should be preferred.