molforge.core

core

Core data model: hierarchical and linear views of protein structure.

AtomArray is the canonical representation: a flat, NumPy-backed array of all atoms. The hierarchical classes (Protein, Chain, Residue, Atom) are lightweight views that read and write through to the array.

Typical usage:

>>> from molforge.core import Protein, AtomArray
>>> protein = Protein(atom_array=AtomArray(0), name="example")
>>> protein.n_atoms
0

Atom

Atom(
    array: AtomArray,
    index: int,
    *,
    parent: Residue | None = None,
)

View of a single atom in an AtomArray.

Attributes are read/written through to the underlying array, so mutating an Atom mutates the source-of-truth representation.
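
The write-through behaviour can be illustrated without molforge itself. The sketch below is a minimal stand-in, not the library's implementation: TinyAtomArray and TinyAtom are hypothetical names, and a standard-library array/memoryview pair plays the role of the NumPy buffer. NumPy basic indexing returns views in a comparable way.

```python
from array import array


class TinyAtomArray:
    """Hypothetical stand-in for AtomArray: flat float32 storage for n atoms."""

    def __init__(self, n: int) -> None:
        self.n = n
        self._buf = array("f", [0.0] * (3 * n))  # x, y, z for each atom
        self._view = memoryview(self._buf)       # shares memory with _buf

    def coord(self, index: int) -> memoryview:
        # Slicing a memoryview yields another view, not a copy.
        return self._view[3 * index : 3 * index + 3]


class TinyAtom:
    """Hypothetical stand-in for Atom: holds only an array reference and an index."""

    def __init__(self, arr: TinyAtomArray, index: int) -> None:
        self._arr = arr
        self.index = index

    @property
    def coord(self) -> memoryview:
        return self._arr.coord(self.index)


arr = TinyAtomArray(2)
atom = TinyAtom(arr, 1)
atom.coord[0] = 1.5            # write through the view
same_atom = TinyAtom(arr, 1)   # a second view of the same atom sees the change
```

Because the view object stores no coordinates of its own, any number of views of the same index stay consistent with the underlying storage.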

index property

index: int

The atom's index into the parent AtomArray.

parent property

parent: Residue | None

The containing residue, if known.

coord property writable

coord: NDArray[float32]

The atom's 3-D coordinate as a (3,) float32 view (mutable).

is_backbone property

is_backbone: bool

True if this is a standard protein backbone atom (N, CA, C, O, OXT).

is_hetero property

is_hetero: bool

True if this atom comes from a HETATM record.

AtomArray

AtomArray(n: int = 0)

Flat, NumPy-backed array of atoms.

This is the canonical representation; hierarchical views read from these arrays. All per-atom fields have shape (N,) except coords which has shape (N, 3).

Example

>>> aa = AtomArray.empty(3)
>>> aa.element[:] = ["C", "N", "O"]
>>> aa.coords[0] = [1.0, 2.0, 3.0]
>>> len(aa)
3

Create an empty array of n atoms, all fields at default values.

chain_starts property

chain_starts: IntArray

Indices of the first atom of each chain, in order.

A chain boundary is any change in chain_id or model_id.

residue_starts property

residue_starts: IntArray

Indices of the first atom of each residue, in order.

A residue boundary is any change in (chain_id, residue_id, insertion_code, model_id).
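
The boundary rule can be sketched in plain Python. This is an illustration of the stated definition only, not the library's code; the function name and the sample keys are made up for the example.

```python
def residue_starts(keys):
    """Return indices where the residue key differs from the previous atom's key."""
    starts = []
    previous = None
    for i, key in enumerate(keys):
        if i == 0 or key != previous:
            starts.append(i)
        previous = key
    return starts


# One (chain_id, residue_id, insertion_code, model_id) tuple per atom.
atom_keys = [
    ("A", 1, "", 0), ("A", 1, "", 0),    # residue A:1
    ("A", 2, "", 0),                     # residue A:2
    ("A", 2, "A", 0), ("A", 2, "A", 0),  # residue A:2A (insertion code differs)
    ("B", 1, "", 0),                     # residue B:1 (new chain)
]
starts = residue_starts(atom_keys)
```

Here starts is [0, 2, 3, 5]. The same comparison with a (chain_id, model_id) key would give chain boundaries.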

empty classmethod

empty(n: int) -> AtomArray

Alias for AtomArray(n) — more readable at call sites.

from_dict classmethod

from_dict(data: dict[str, NDArray[Any]]) -> AtomArray

Construct from a dict of equal-length arrays.

Parameters:

Name: data
Type: dict[str, NDArray[Any]]
Description: Mapping of field name to array. Must include coords; missing fields are filled with their schema defaults.
Default: required

Raises:

KeyError: If coords is missing.

ValueError: If array lengths disagree.
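
The validation order described above might look like the following sketch. It uses plain lists instead of NumPy arrays, and both the function name and the DEFAULTS table are hypothetical; the real schema defaults are not listed on this page.

```python
DEFAULTS = {"element": "", "chain_id": ""}  # hypothetical schema defaults


def from_dict_sketch(data: dict) -> dict:
    """Validate and complete a field-name -> sequence mapping."""
    if "coords" not in data:
        raise KeyError("coords")
    n = len(data["coords"])
    for name, values in data.items():
        if len(values) != n:
            raise ValueError(f"{name!r} has length {len(values)}, expected {n}")
    completed = dict(data)
    for name, default in DEFAULTS.items():
        # Only fields the caller left out are filled with the default.
        completed.setdefault(name, [default] * n)
    return completed


fields = from_dict_sketch({
    "coords": [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
    "element": ["C", "N"],
})
```

The missing chain_id field is filled with two default entries, while the supplied element field is kept unchanged.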

append

append(other: AtomArray) -> AtomArray

Return a new array with other concatenated after this one.

select

select(mask: BoolArray) -> AtomArray

Return a new AtomArray containing only atoms where mask is True.

Parameters:

Name: mask
Type: BoolArray
Description: Boolean array of length len(self).
Default: required

where

where(**filters: object) -> BoolArray

Build a boolean mask from equality filters on any field.

Example

>>> mask = aa.where(chain_id="A", atom_name="CA")
>>> ca_atoms = aa.select(mask)
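
A plain-Python sketch of the where/select pairing follows. It assumes that multiple filters are combined with logical AND, which the example above suggests but this page does not state outright. The helper names are hypothetical and lists stand in for NumPy arrays.

```python
def where_sketch(fields: dict, **filters: object) -> list:
    """Return one bool per atom: True where every filter matches."""
    n = len(next(iter(fields.values())))
    mask = [True] * n
    for name, wanted in filters.items():
        column = fields[name]  # raises KeyError for an unknown field name
        mask = [keep and value == wanted for keep, value in zip(mask, column)]
    return mask


def select_sketch(fields: dict, mask: list) -> dict:
    """Keep only the rows where mask is True, for every field."""
    return {
        name: [value for value, keep in zip(column, mask) if keep]
        for name, column in fields.items()
    }


fields = {
    "chain_id": ["A", "A", "B", "B"],
    "atom_name": ["N", "CA", "N", "CA"],
}
mask = where_sketch(fields, chain_id="A", atom_name="CA")
ca_atoms = select_sketch(fields, mask)
```

Splitting mask construction from selection lets a caller combine or invert masks before selecting.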

iter_residue_slices

iter_residue_slices() -> Iterable[slice]

Yield a slice for each residue's atoms (in array order).

iter_chain_slices

iter_chain_slices() -> Iterable[slice]

Yield a slice for each chain's atoms (in array order).
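
Given start indices such as those from residue_starts or chain_starts, the slices follow by pairing each start with the next one. The sketch below is an assumption about how this could be done, with a hypothetical function name; it is not taken from the library source.

```python
from typing import Iterator


def iter_slices(starts: list[int], n_atoms: int) -> Iterator[slice]:
    """Yield one slice per group, given sorted start indices and the atom count."""
    ends = starts[1:] + [n_atoms]  # each group ends where the next one begins
    for start, end in zip(starts, ends):
        yield slice(start, end)


slices = list(iter_slices([0, 2, 3, 5], n_atoms=6))
atom_names = ["N", "CA", "N", "N", "CA", "N"]
groups = [atom_names[s] for s in slices]
```

Each slice can index any per-atom field, so one pass over the slices is enough to group names, elements, and coordinates alike.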

Chain

Chain(
    array: AtomArray,
    start: int,
    end: int,
    *,
    parent: Protein | None = None,
)

View over a chain's atoms inside an AtomArray.

residues property

residues: list[Residue]

All residues in this chain, in N-to-C order.

sequence property

sequence: str

One-letter sequence for this chain (standard amino acids plus mapped non-canonical residues).

Non-amino-acid residues (ligands, water, ions) are skipped. Unknown residues become "X".
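
The skip-versus-"X" distinction is easy to get wrong, so here is a small sketch of the rule. The lookup table is deliberately partial, the MSE to M mapping is an assumed example of a non-canonical mapping, and the function name is hypothetical.

```python
# Partial table for illustration only; MSE -> M is an assumed non-canonical mapping.
THREE_TO_ONE = {"ALA": "A", "GLY": "G", "SER": "S", "MSE": "M"}
NON_POLYMER = {"ligand", "water", "ion"}


def chain_sequence(residues: list[tuple[str, str]]) -> str:
    """Build a one-letter sequence from (residue_name, entity_type) pairs."""
    letters = []
    for name, entity_type in residues:
        if entity_type in NON_POLYMER:
            continue  # ligands, water, and ions are skipped entirely
        letters.append(THREE_TO_ONE.get(name, "X"))  # unknown residue -> "X"
    return "".join(letters)


seq = chain_sequence([
    ("ALA", "protein"), ("MSE", "protein"), ("ZZZ", "protein"),
    ("HOH", "water"), ("GLY", "protein"),
])
```

The water contributes nothing to the result, while the unrecognised ZZZ residue still occupies a position as "X".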

coords property

coords: NDArray[float32]

All atom coordinates for this chain, shape (n_atoms, 3).

ProteinMetadata

Bases: TypedDict

Typed view of the documented Protein.metadata keys.

Every key is optional (total=False). This is a typing aid only — Protein.metadata remains a plain dict[str, Any] at runtime, and keys outside this set are still permitted (without stability guarantees). Annotate a local variable as ProteinMetadata to get editor / mypy support for the documented vocabulary.
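
To show what "typing aid only" means in practice, the sketch below defines a hypothetical three-key subset (MetadataSketch) instead of importing the real class. The key names and value types are taken from the metadata vocabulary on this page; the sample values are invented.

```python
from typing import Any, TypedDict, cast


class MetadataSketch(TypedDict, total=False):
    """Hypothetical subset of ProteinMetadata; every key is optional."""

    pdb_id: str
    resolution: float
    mean_confidence: float


# A plain dict, including one key outside the documented vocabulary.
raw: dict[str, Any] = {"pdb_id": "1ABC", "resolution": 1.8, "my_custom_key": 42}

# cast() is a no-op at runtime: it only tells editors and mypy which keys
# and value types to expect. No validation or conversion takes place.
meta = cast(MetadataSketch, raw)

resolution = meta.get("resolution")
confidence = meta.get("mean_confidence")  # key absent, so this is None
```

Since every key is optional, reading with .get() or an explicit membership test is the safe pattern.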

Protein

Protein(
    atom_array: AtomArray | None = None,
    *,
    name: str = "",
    metadata: dict[str, Any] | None = None,
)

A protein (or protein complex) structure.

Protein owns a single AtomArray (atom_array), which is the canonical data store. Hierarchical accessors (chains, residues, etc.) read from it.

Parameters:

Name: atom_array
Type: AtomArray | None
Description: The flat array of atoms backing this protein. If omitted, an empty array is used.
Default: None

Name: name
Type: str
Description: Optional identifier (e.g. PDB ID).
Default: ''

Name: metadata
Type: dict[str, Any] | None
Description: Free-form key/value metadata (resolution, header, engine confidence, ...). The dict accepts any keys, but the names molforge's own parsers and engine wrappers use form a stable, documented vocabulary; see molforge.core.metadata_keys and the ProteinMetadata TypedDict. Keys outside that vocabulary are permitted but carry no cross-version stability guarantee.
Default: None

chains property

chains: list[Chain]

All chains in this protein, in array order.

coords property

coords: NDArray[float32]

All atom coordinates, shape (n_atoms, 3).

sequence property

sequence: str

Concatenated one-letter sequence across all protein/nucleic chains.

Chains are joined with "/" to make boundaries visible. Non-polymer chains (ligand, water, ion) are skipped.
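
The joining rule can be sketched as follows. The function name is hypothetical, and treating entity type as a per-chain label is a simplifying assumption, since on this page entity_type is documented as a residue property.

```python
def join_chain_sequences(chains: list[tuple[str, str]]) -> str:
    """Join polymer chain sequences with "/"; chains holds (entity_type, sequence)."""
    polymer = {"protein", "dna", "rna"}
    return "/".join(seq for entity_type, seq in chains if entity_type in polymer)


joined = join_chain_sequences([
    ("protein", "MKT"),
    ("water", ""),
    ("protein", "GSH"),
    ("dna", "ACGT"),
])
```

The result is "MKT/GSH/ACGT": the water chain leaves no empty segment behind, so every "/" marks a boundary between two polymer chains.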

get_chain

get_chain(chain_id: str, model_id: int = 0) -> Chain

Look up a chain by (chain_id, model_id).

sequences

sequences() -> dict[str, str]

Per-chain one-letter sequences keyed by chain_id.

select

select(**filters: object) -> Protein

Return a new Protein containing only atoms matching the filters.

Filters are forwarded to AtomArray.where.

Example

>>> # Keep only chain A, protein atoms
>>> sub = protein.select(chain_id="A", entity_type="protein")

protein_only

protein_only() -> Protein

Return a new Protein containing only polymer protein atoms.

Drops ligands, waters, ions, and nucleic acids.

remove_water

remove_water() -> Protein

Return a new Protein with all water atoms removed.

Residue

Residue(
    array: AtomArray,
    start: int,
    end: int,
    *,
    parent: Chain | None = None,
)

View over a residue's atoms inside an AtomArray.

name property

name: str

Three-letter residue code (e.g. "ALA").

seq_id property

seq_id: int

Author-assigned residue sequence number.

entity_type property

entity_type: str

Entity type of this residue, e.g. "protein", "dna", "rna", "ligand", "water", "ion".

atoms property

atoms: list[Atom]

All atoms in this residue, as Atom views.

coords property

coords: NDArray[float32]

All atom coordinates for this residue, shape (n_atoms, 3).

slice property

slice: slice

Underlying array slice covering this residue's atoms.

one_letter property

one_letter: str

Single-letter amino-acid (or nucleotide) code; "X" for unknown.

is_hetero property

is_hetero: bool

True if any atom in this residue is a HETATM record.

is_ion

is_ion(resname: str) -> bool

Return True for common monatomic / small ion residue names.

is_standard_amino_acid

is_standard_amino_acid(resname: str) -> bool

Return True for the 20 canonical amino acids.

is_water

is_water(resname: str) -> bool

Return True for common water residue names.

three_to_one

three_to_one(resname: str, *, unknown: str = 'X') -> str

Convert a 3-letter residue name to one-letter code.

Falls back to unknown for residues outside the canonical and known non-canonical tables. Nucleotides are handled too.

Metadata vocabulary

The documented key vocabulary for Protein.metadata — string constants and the ProteinMetadata TypedDict.

metadata_keys

Documented key vocabulary for molforge.core.Protein.metadata.

Protein.metadata is a free-form dict[str, Any] by design — it carries whatever a parser or engine wants to attach, including open-ended things like PDB REMARK records. Keeping it a plain dict means no breaking change for code that writes arbitrary keys.

But "free-form" shouldn't mean "undocumented". The keys below are the ones molforge's own parsers and engine wrappers produce, and they form the contract: downstream code can rely on these names and value types being stable across the 1.x series. Keys outside this list are still permitted but carry no stability guarantee.

Two things to use here:

  • String constants (PDB_ID, MEAN_CONFIDENCE, ...). Prefer these over bare string literals when reading or writing metadata, so that a typo in a name fails loudly (an ImportError, AttributeError, or NameError, depending on how the constant is referenced) rather than producing a silently missing key at runtime.
  • ProteinMetadata: a TypedDict (total=False, every key optional) that documents the value type of each key. It's a typing aid only: Protein.metadata is still a plain dict at runtime, but annotating a local as ProteinMetadata gives editors and mypy the key/type information.
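
The difference between a misspelled literal and a misspelled constant can be seen in a few lines. The two constants below are redefined locally with the values documented in this module so that the sketch runs on its own; the metadata values are invented.

```python
PDB_ID = "pdb_id"                    # values as documented in this module
MEAN_CONFIDENCE = "mean_confidence"

metadata = {PDB_ID: "1ABC", MEAN_CONFIDENCE: 87.5}

# A misspelled string literal is a valid key that simply is not present.
silent_miss = metadata.get("mean_confidnce")

# A misspelled constant name cannot be resolved and raises instead.
try:
    metadata.get(MEAN_CONFIDNCE)  # noqa: F821 (deliberate typo)
    typo_detected = False
except NameError:
    typo_detected = True
```

The literal typo yields None without any warning, while the constant typo raises as soon as the line runs, and linters typically flag it even earlier.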

Key groups:

  • Structural-IO header keys — set by molforge.io.read_pdb and molforge.io.read_cif from file header records.
  • Uniform folding-engine keys — set by every folding-engine wrapper (ESMFold, AlphaFold, Boltz, RoseTTAFold) so downstream code can read prediction confidence without knowing which engine ran. molforge.io.load_alphafold also populates these.
  • Engine-specific folding keys — set by some folding wrappers but not all; presence depends on the engine.

PDB_ID module-attribute

PDB_ID = 'pdb_id'

4-character PDB accession code, e.g. "1ABC" (str).

TITLE module-attribute

TITLE = 'title'

Free-text structure title from the PDB TITLE / mmCIF _struct.title (str).

CLASSIFICATION module-attribute

CLASSIFICATION = 'classification'

PDB HEADER classification field, e.g. "HYDROLASE" (str).

DEPOSITION_DATE module-attribute

DEPOSITION_DATE = 'deposition_date'

Deposition date string as it appears in the PDB HEADER (str).

EXPERIMENTAL_METHOD module-attribute

EXPERIMENTAL_METHOD = 'experimental_method'

Experimental method, e.g. "X-RAY DIFFRACTION" (str).

RESOLUTION module-attribute

RESOLUTION = 'resolution'

Resolution in Angstrom (float). Absent for non-diffraction structures.

ENGINE module-attribute

ENGINE = 'engine'

Name of the folding engine that produced the structure (str), e.g. "ESMFold", "AlphaFold", "Boltz", "RoseTTAFold".

SOURCE_SEQUENCE module-attribute

SOURCE_SEQUENCE = 'source_sequence'

The one-letter input sequence the engine folded (str).

CONFIDENCE_PER_RESIDUE module-attribute

CONFIDENCE_PER_RESIDUE = 'confidence_per_residue'

(L,) float32 array of per-residue pLDDT-style confidence (0-100).

CONFIDENCE_PER_ATOM module-attribute

CONFIDENCE_PER_ATOM = 'confidence_per_atom'

(N_atoms,) float32 array of per-atom confidence (0-100).

MEAN_CONFIDENCE module-attribute

MEAN_CONFIDENCE = 'mean_confidence'

Scalar mean per-residue confidence (float, 0-100).

SOURCE module-attribute

SOURCE = 'source'

Provenance tag (str). Set to "alphafold" by molforge.io.load_alphafold.

MODEL_NAME module-attribute

MODEL_NAME = 'model_name'

Engine-internal model identifier (str). Set by ESMFold.

MODEL_TYPE module-attribute

MODEL_TYPE = 'model_type'

Model-type identifier (str). Set by AlphaFold, e.g. "monomer".

MODEL_VERSION module-attribute

MODEL_VERSION = 'model_version'

Model-version identifier (str). Set by Boltz, e.g. "boltz2".

JOB_NAME module-attribute

JOB_NAME = 'job_name'

Job name used for engine output files (str). Set by RoseTTAFold.

USE_MSA_SERVER module-attribute

USE_MSA_SERVER = 'use_msa_server'

Whether an MSA server was used (bool). Set by Boltz.

PTM module-attribute

PTM = 'ptm'

Predicted TM-score for the whole structure (float). Set by Boltz.

IPTM module-attribute

IPTM = 'iptm'

Interface predicted TM-score (float). Set by Boltz; meaningful for complexes.

CONFIDENCE_SCORE module-attribute

CONFIDENCE_SCORE = 'confidence_score'

Composite confidence score (float). Set by Boltz.

PAE module-attribute

PAE = 'pae'

(L, L) predicted aligned error matrix (float array). Set by RoseTTAFold.

PDE module-attribute

PDE = 'pde'

(L, L) predicted distance error matrix (float array). Set by RoseTTAFold.

PAE_INTER module-attribute

PAE_INTER = 'pae_inter'

Scalar mean inter-chain PAE (float). RoseTTAFold's headline interface metric; values below ~10 indicate a high-quality interface.
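
One plausible reading of "mean inter-chain PAE" is the average of all PAE entries whose two residues belong to different chains. The sketch below implements that reading with nested lists; it is an assumption for illustration, not the documented RoseTTAFold or molforge computation, and the matrix values are invented.

```python
def mean_inter_chain_pae(pae, chain_ids):
    """Average pae[i][j] over residue pairs that lie in different chains."""
    total, count = 0.0, 0
    for i, row in enumerate(pae):
        for j, value in enumerate(row):
            if chain_ids[i] != chain_ids[j]:
                total += value
                count += 1
    if count == 0:
        raise ValueError("need residues from at least two chains")
    return total / count


pae = [
    [0.0, 2.0, 8.0],
    [2.0, 0.0, 12.0],
    [6.0, 10.0, 0.0],
]
chain_ids = ["A", "A", "B"]  # one chain label per residue
pae_inter = mean_inter_chain_pae(pae, chain_ids)
```

The four cross-chain entries (8, 12, 6, 10) average to 9.0, which would fall under the ~10 threshold mentioned above.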

PAE_PROT module-attribute

PAE_PROT = 'pae_prot'

Scalar mean PAE over protein residues only (float). Set by RoseTTAFold.

MEAN_PAE module-attribute

MEAN_PAE = 'mean_pae'

Scalar mean of the full PAE matrix (float). Set by RoseTTAFold.

MEAN_PLDDT module-attribute

MEAN_PLDDT = 'mean_plddt'

Scalar mean pLDDT (float). Set by RoseTTAFold and molforge.io.load_alphafold. Equivalent to MEAN_CONFIDENCE; the latter is the cross-engine-uniform name and should be preferred.
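
Code that receives metadata from mixed sources may want to prefer the uniform key and fall back to the equivalent one. The helper below is a hypothetical, defensive sketch; the constants are redefined locally with their documented values so that it runs on its own.

```python
from typing import Any, Optional

MEAN_CONFIDENCE = "mean_confidence"  # cross-engine-uniform key
MEAN_PLDDT = "mean_plddt"            # equivalent key set by some sources


def mean_confidence(metadata: dict[str, Any]) -> Optional[float]:
    """Return the mean confidence, preferring the uniform key."""
    for key in (MEAN_CONFIDENCE, MEAN_PLDDT):
        if key in metadata:
            return float(metadata[key])
    return None  # neither key present, e.g. an experimental structure


from_uniform = mean_confidence({MEAN_CONFIDENCE: 90, MEAN_PLDDT: 71.5})
from_legacy = mean_confidence({MEAN_PLDDT: 71.5})
missing = mean_confidence({"pdb_id": "1ABC"})
```

Checking membership instead of truthiness keeps a legitimate value of 0.0 from being mistaken for a missing key.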

PLDDT module-attribute

PLDDT = 'plddt'

(N_atoms,) float32 per-atom pLDDT. Legacy key set by molforge.io.load_alphafold; CONFIDENCE_PER_ATOM is the cross-engine-uniform name and should be preferred.

PLDDT_PER_RESIDUE module-attribute

PLDDT_PER_RESIDUE = 'plddt_per_residue'

(L,) float32 per-residue pLDDT. Legacy key set by molforge.io.load_alphafold; CONFIDENCE_PER_RESIDUE is the cross-engine-uniform name and should be preferred.
