molforge.core¶

core ¶

Core data model: hierarchical and linear views of protein structure.

The :class:AtomArray is the canonical representation — a flat, NumPy-backed array of all atoms. The hierarchical classes (:class:Protein, :class:Chain, :class:Residue, :class:Atom) are lightweight views that read and write through to the array.

Typical usage:

>>> from molforge.core import Protein, AtomArray
>>> protein = Protein(atom_array=AtomArray(0), name="example")
>>> protein.n_atoms
0

Atom ¶

Atom(
    array: AtomArray,
    index: int,
    *,
    parent: Residue | None = None,
)

View of a single atom in an :class:AtomArray.

Attributes are read/written through to the underlying array, so mutating an Atom mutates the source-of-truth representation.
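The write-through behaviour can be illustrated with a minimal sketch. `TinyAtomArray` and `TinyAtomView` are hypothetical stand-ins written for this example, not molforge classes, and they use plain lists rather than NumPy arrays so the snippet runs without dependencies.

```python
from __future__ import annotations


class TinyAtomArray:
    """Hypothetical stand-in for a flat atom store (plain lists, not NumPy)."""

    def __init__(self, n: int) -> None:
        self.coords = [[0.0, 0.0, 0.0] for _ in range(n)]


class TinyAtomView:
    """Hypothetical view of one atom; it stores only the array and an index."""

    def __init__(self, array: TinyAtomArray, index: int) -> None:
        self._array = array
        self._index = index

    @property
    def coord(self) -> list[float]:
        # Read through to the backing store on every access.
        return self._array.coords[self._index]

    @coord.setter
    def coord(self, value: list[float]) -> None:
        # Write through: the view keeps no copy of its own.
        self._array.coords[self._index] = [float(v) for v in value]


store = TinyAtomArray(2)
atom = TinyAtomView(store, 1)
atom.coord = [1, 2, 3]  # mutates store.coords[1], not a private copy
```

Because the view holds no data of its own, any number of views over the same index stay consistent with each other.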

index `property` ¶

index: int

The atom's index into the parent :class:AtomArray.

parent `property` ¶

parent: Residue | None

The containing residue, if known.

coord `property` `writable` ¶

coord: NDArray[float32]

The atom's 3-D coordinate as a (3,) float32 view (mutable).

is_backbone `property` ¶

is_backbone: bool

True if this is a standard protein backbone atom (N, CA, C, O, OXT).

is_hetero `property` ¶

is_hetero: bool

True if this atom comes from a HETATM record.

AtomArray ¶

AtomArray(n: int = 0)

Flat, NumPy-backed array of atoms.

This is the canonical representation; hierarchical views read from these arrays. All per-atom fields have shape (N,) except coords which has shape (N, 3).

Example

>>> aa = AtomArray.empty(3)
>>> aa.element[:] = ["C", "N", "O"]
>>> aa.coords[0] = [1.0, 2.0, 3.0]
>>> len(aa)
3

Create an empty array of n atoms, all fields at default values.

chain_starts `property` ¶

chain_starts: IntArray

Indices of the first atom of each chain, in order.

A chain boundary is any change in chain_id or model_id.

residue_starts `property` ¶

residue_starts: IntArray

Indices of the first atom of each residue, in order.

A residue boundary is any change in (chain_id, residue_id, insertion_code, model_id).
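The boundary rules for chains and residues can be sketched as a single helper that marks every index where the tuple of key fields differs from the previous atom. `boundary_starts` is a hypothetical name, and plain lists stand in for the per-atom arrays; the real implementation may well be vectorised.

```python
from __future__ import annotations


def boundary_starts(*fields: list) -> list[int]:
    """Return indices where the tuple of field values differs from the previous atom."""
    keys = list(zip(*fields))
    starts = []
    for i, key in enumerate(keys):
        if i == 0 or key != keys[i - 1]:
            starts.append(i)
    return starts


chain_id = ["A", "A", "A", "A", "B", "B"]
residue_id = [1, 1, 2, 2, 1, 1]
insertion_code = ["", "", "", "", "", ""]
model_id = [0, 0, 0, 0, 0, 0]

# Residue boundary: any change in (chain_id, residue_id, insertion_code, model_id).
residue_starts = boundary_starts(chain_id, residue_id, insertion_code, model_id)  # [0, 2, 4]
# Chain boundary: any change in chain_id or model_id.
chain_starts = boundary_starts(chain_id, model_id)  # [0, 4]
```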

empty `classmethod` ¶

empty(n: int) -> AtomArray

Alias for AtomArray(n) — more readable at call sites.

from_dict `classmethod` ¶

from_dict(data: dict[str, NDArray[Any]]) -> AtomArray

Construct from a dict of equal-length arrays.

Parameters:

Name	Type	Description	Default
`data`	`dict[str, NDArray[Any]]`	Mapping field-name -> array. Must include `coords`; missing fields are filled with their schema defaults.	required

Raises:

Type	Description
`KeyError`	If `coords` is missing.
`ValueError`	If array lengths disagree.
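The validation contract described above (required `coords`, equal lengths, defaults for missing fields) might look like the following sketch. `build_fields` and the `DEFAULTS` table are hypothetical and use plain lists; the actual schema defaults are not listed in this section.

```python
from __future__ import annotations

# Hypothetical schema defaults, chosen only for this example.
DEFAULTS = {"chain_id": "", "atom_name": "", "element": ""}


def build_fields(data: dict[str, list]) -> dict[str, list]:
    """Validate a field mapping and fill in missing fields with defaults."""
    if "coords" not in data:
        raise KeyError("coords")
    n = len(data["coords"])
    for name, values in data.items():
        if len(values) != n:
            raise ValueError(f"field {name!r} has length {len(values)}, expected {n}")
    filled = dict(data)
    for name, default in DEFAULTS.items():
        # Only fields absent from the input receive the default column.
        filled.setdefault(name, [default] * n)
    return filled


fields = build_fields(
    {"coords": [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]], "element": ["N", "C"]}
)
```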

append ¶

append(other: AtomArray) -> AtomArray

Return a new array with other concatenated after this one.

select ¶

select(mask: BoolArray) -> AtomArray

Return a new AtomArray containing only atoms where mask is True.

Parameters:

Name	Type	Description	Default
`mask`	`BoolArray`	Boolean array of length `len(self)`.	required

where ¶

where(**filters: object) -> BoolArray

Build a boolean mask from equality filters on any field.

Example

>>> mask = aa.where(chain_id="A", atom_name="CA")
>>> ca_atoms = aa.select(mask)
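To make the mask-then-select pattern concrete, here is a dependency-free sketch. It assumes that multiple filters are combined with logical AND, which this section does not state explicitly; the free functions `where` and `select` below are hypothetical stand-ins that operate on a dict of plain lists.

```python
from __future__ import annotations


def where(fields: dict[str, list], **filters: object) -> list[bool]:
    """AND together one equality test per filter (assumed semantics)."""
    n = len(next(iter(fields.values())))
    mask = [True] * n
    for name, wanted in filters.items():
        column = fields[name]
        mask = [keep and value == wanted for keep, value in zip(mask, column)]
    return mask


def select(fields: dict[str, list], mask: list[bool]) -> dict[str, list]:
    """Copy every field, keeping only the rows where mask is True."""
    return {
        name: [value for value, keep in zip(column, mask) if keep]
        for name, column in fields.items()
    }


atoms = {
    "chain_id": ["A", "A", "B", "B"],
    "atom_name": ["N", "CA", "CA", "C"],
}
mask = where(atoms, chain_id="A", atom_name="CA")  # [False, True, False, False]
ca_atoms = select(atoms, mask)
```

Separating mask construction from selection lets the same mask be reused, inverted, or combined with other masks before any data is copied.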

iter_residue_slices ¶

iter_residue_slices() -> Iterable[slice]

Yield a slice for each residue's atoms (in array order).

iter_chain_slices ¶

iter_chain_slices() -> Iterable[slice]

Yield a slice for each chain's atoms (in array order).
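One plausible way to derive these slices is to pair each start index with the next one, closing the last group at the array length. The helper name `slices_from_starts` is hypothetical; this is a sketch of the idea, not the library's code.

```python
from __future__ import annotations

from typing import Iterator


def slices_from_starts(starts: list[int], n_atoms: int) -> Iterator[slice]:
    """Yield slice(start, next_start) per group; the last group ends at n_atoms."""
    ends = starts[1:] + [n_atoms]
    for start, end in zip(starts, ends):
        yield slice(start, end)


atom_names = ["N", "CA", "C", "N", "CA", "O"]
residue_starts = [0, 3]
groups = [atom_names[s] for s in slices_from_starts(residue_starts, len(atom_names))]
# groups == [["N", "CA", "C"], ["N", "CA", "O"]]
```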

Chain ¶

Chain(
    array: AtomArray,
    start: int,
    end: int,
    *,
    parent: Protein | None = None,
)

View over a chain's atoms inside an :class:AtomArray.

residues `property` ¶

residues: list[Residue]

All residues in this chain, in N-to-C order.

sequence `property` ¶

sequence: str

One-letter sequence for this chain (standard AAs + non-canonical mappings).

Non-amino-acid residues (ligands, water, ions) are skipped. Unknown residues become "X".

coords `property` ¶

coords: NDArray[float32]

All atom coordinates for this chain, shape (n_atoms, 3).

ProteinMetadata ¶

Bases: TypedDict

Typed view of the documented :attr:Protein.metadata keys.

Every key is optional (total=False). This is a typing aid only — Protein.metadata remains a plain dict[str, Any] at runtime, and keys outside this set are still permitted (without stability guarantees). Annotate a local variable as ProteinMetadata to get editor / mypy support for the documented vocabulary.

Protein ¶

Protein(
    atom_array: AtomArray | None = None,
    *,
    name: str = "",
    metadata: dict[str, Any] | None = None,
)

A protein (or protein complex) structure.

Protein owns a single :class:AtomArray (atom_array) which is the canonical data store. Hierarchical accessors (chains, residues, etc.) read from it.

Parameters:

Name	Type	Description	Default
`atom_array`	`AtomArray \| None`	The flat array of atoms backing this protein. If omitted, an empty array is used.	`None`
`name`	`str`	Optional identifier (e.g. PDB ID).	`''`
`metadata`	`dict[str, Any] \| None`	Free-form key/value metadata (resolution, header, engine confidence, ...). The dict accepts any keys, but the names molforge's own parsers and engine wrappers use form a stable, documented vocabulary — see :mod:`molforge.core.metadata_keys` and the :class:`~molforge.core.metadata_keys.ProteinMetadata` TypedDict. Keys outside that vocabulary are permitted but carry no cross-version stability guarantee.	`None`

chains `property` ¶

chains: list[Chain]

All chains in this protein, in array order.

coords `property` ¶

coords: NDArray[float32]

All atom coordinates, shape (n_atoms, 3).

sequence `property` ¶

sequence: str

Concatenated one-letter sequence across all protein/nucleic chains.

Chains are joined with "/" to make boundaries visible. Non-polymer chains (ligand, water, ion) are skipped.
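The joining rule can be sketched as follows. The lookup table is deliberately partial, the `NON_POLYMER` set is hypothetical, and both helper names are invented for this example; molforge's own tables are not reproduced here.

```python
from __future__ import annotations

# Partial lookup table for illustration; a real table covers far more residues.
THREE_TO_ONE = {"ALA": "A", "GLY": "G", "SER": "S", "MSE": "M"}
# Hypothetical set of non-polymer residue names to skip.
NON_POLYMER = {"HOH", "NA", "CL"}


def chain_sequence(resnames: list[str]) -> str:
    """Skip non-polymer residues and map unknown names to "X"."""
    return "".join(
        THREE_TO_ONE.get(name, "X") for name in resnames if name not in NON_POLYMER
    )


def joined_sequence(chains: list[list[str]]) -> str:
    """Join per-chain sequences with "/", dropping chains that contribute nothing."""
    per_chain = [chain_sequence(resnames) for resnames in chains]
    return "/".join(seq for seq in per_chain if seq)


chains = [["ALA", "GLY", "HOH"], ["HOH", "HOH"], ["SER", "UNK", "MSE"]]
sequence = joined_sequence(chains)  # "AG/SXM"
```

The water-only middle chain contributes no letters, so it leaves no empty segment between the separators.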

get_chain ¶

get_chain(chain_id: str, model_id: int = 0) -> Chain

Look up a chain by (chain_id, model_id).

sequences ¶

sequences() -> dict[str, str]

Per-chain one-letter sequences keyed by chain_id.

select ¶

select(**filters: object) -> Protein

Return a new Protein containing only atoms matching the filters.

Filters are forwarded to :meth:AtomArray.where.

Example

>>> # Keep only chain A, protein atoms
>>> sub = protein.select(chain_id="A", entity_type="protein")

protein_only ¶

protein_only() -> Protein

Return a new Protein containing only polymer protein atoms.

Drops ligands, waters, ions, and nucleic acids.

remove_water ¶

remove_water() -> Protein

Return a new Protein with all water atoms removed.

Residue ¶

Residue(
    array: AtomArray,
    start: int,
    end: int,
    *,
    parent: Chain | None = None,
)

View over a residue's atoms inside an :class:AtomArray.

name `property` ¶

name: str

Three-letter residue code (e.g. "ALA").

seq_id `property` ¶

seq_id: int

Author-assigned residue sequence number.

entity_type `property` ¶

entity_type: str

Entity type of this residue, e.g. "protein", "dna", "rna", "ligand", "water", "ion".

atoms `property` ¶

atoms: list[Atom]

All atoms in this residue, as :class:Atom views.

coords `property` ¶

coords: NDArray[float32]

All atom coordinates for this residue, shape (n_atoms, 3).

slice `property` ¶

slice: slice

Underlying array slice covering this residue's atoms.

one_letter `property` ¶

one_letter: str

Single-letter amino-acid (or nucleotide) code; "X" for unknown.

is_hetero `property` ¶

is_hetero: bool

True if any atom in this residue is a HETATM record.

is_ion ¶

is_ion(resname: str) -> bool

Return True for common monatomic / small ion residue names.

is_standard_amino_acid ¶

is_standard_amino_acid(resname: str) -> bool

Return True for the 20 canonical amino acids.

is_water ¶

is_water(resname: str) -> bool

Return True for common water residue names.

three_to_one ¶

three_to_one(resname: str, *, unknown: str = 'X') -> str

Convert a 3-letter residue name to one-letter code.

Falls back to unknown for residues outside the canonical and known non-canonical tables. Nucleotides are handled too.

Metadata vocabulary¶

The documented key vocabulary for Protein.metadata — string constants and the ProteinMetadata TypedDict.

metadata_keys ¶

Documented key vocabulary for :attr:molforge.core.Protein.metadata.

Protein.metadata is a free-form dict[str, Any] by design — it carries whatever a parser or engine wants to attach, including open-ended things like PDB REMARK records. Keeping it a plain dict means no breaking change for code that writes arbitrary keys.

But "free-form" shouldn't mean "undocumented". The keys below are the ones molforge's own parsers and engine wrappers produce, and they form the contract: downstream code can rely on these names and value types being stable across the 1.x series. Keys outside this list are still permitted but carry no stability guarantee.

Two things to use here:

String constants (PDB_ID, MEAN_CONFIDENCE, ...). Prefer these over bare string literals when reading or writing metadata, so that a typo fails loudly (an ImportError, AttributeError, or NameError, depending on how the constant is referenced) rather than producing a silently missing key at runtime.
:class:ProteinMetadata — a TypedDict (total=False, every key optional) that documents the value type of each key. It's a typing aid only: Protein.metadata is still a plain dict at runtime, but annotating a local as ProteinMetadata gives editors and mypy the key/type information.
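The two aids can be combined as in the sketch below. To stay runnable without molforge installed, it redefines three constants with the string values documented on this page and declares `MetadataSubset`, a hypothetical three-key TypedDict standing in for ProteinMetadata; in real code both would be imported from `molforge.core.metadata_keys`.

```python
from __future__ import annotations

from typing import Any, TypedDict

# Local copies of constants whose values are documented in this section.
PDB_ID = "pdb_id"
RESOLUTION = "resolution"
MEAN_CONFIDENCE = "mean_confidence"


class MetadataSubset(TypedDict, total=False):
    """Hypothetical subset of the documented vocabulary; every key optional."""

    pdb_id: str
    resolution: float
    mean_confidence: float


# At runtime the metadata is still a plain dict.
metadata: dict[str, Any] = {}
metadata[PDB_ID] = "1ABC"
metadata[RESOLUTION] = 1.8

# The annotation gives editors and type checkers the key/value types.
typed: MetadataSubset = {"pdb_id": "1ABC", "resolution": 1.8}

resolution = metadata.get(RESOLUTION)
confidence = metadata.get(MEAN_CONFIDENCE)  # key absent, so None
```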

Key groups:

Structural-IO header keys — set by :func:molforge.io.read_pdb and :func:molforge.io.read_cif from file header records.
Uniform folding-engine keys — set by every folding-engine wrapper (ESMFold, AlphaFold, Boltz, RoseTTAFold) so downstream code can read prediction confidence without knowing which engine ran. :func:molforge.io.load_alphafold also populates these.
Engine-specific folding keys — set by some folding wrappers but not all; presence depends on the engine.

PDB_ID `module-attribute` ¶

PDB_ID = 'pdb_id'

4-character PDB accession code, e.g. "1ABC" (str).

TITLE `module-attribute` ¶

TITLE = 'title'

Free-text structure title from the PDB TITLE / mmCIF _struct.title (str).

CLASSIFICATION `module-attribute` ¶

CLASSIFICATION = 'classification'

PDB HEADER classification field, e.g. "HYDROLASE" (str).

DEPOSITION_DATE `module-attribute` ¶

DEPOSITION_DATE = 'deposition_date'

Deposition date string as it appears in the PDB HEADER (str).

EXPERIMENTAL_METHOD `module-attribute` ¶

EXPERIMENTAL_METHOD = 'experimental_method'

Experimental method, e.g. "X-RAY DIFFRACTION" (str).

RESOLUTION `module-attribute` ¶

RESOLUTION = 'resolution'

Resolution in Angstrom (float). Absent for non-diffraction structures.

ENGINE `module-attribute` ¶

ENGINE = 'engine'

Name of the folding engine that produced the structure (str), e.g. "ESMFold", "AlphaFold", "Boltz", "RoseTTAFold".

SOURCE_SEQUENCE `module-attribute` ¶

SOURCE_SEQUENCE = 'source_sequence'

The one-letter input sequence the engine folded (str).

CONFIDENCE_PER_RESIDUE `module-attribute` ¶

CONFIDENCE_PER_RESIDUE = 'confidence_per_residue'

(L,) float32 array of per-residue pLDDT-style confidence (0-100).

CONFIDENCE_PER_ATOM `module-attribute` ¶

CONFIDENCE_PER_ATOM = 'confidence_per_atom'

(N_atoms,) float32 array of per-atom confidence (0-100).

MEAN_CONFIDENCE `module-attribute` ¶

MEAN_CONFIDENCE = 'mean_confidence'

Scalar mean per-residue confidence (float, 0-100).

SOURCE `module-attribute` ¶

SOURCE = 'source'

Provenance tag (str). Set to "alphafold" by :func:molforge.io.load_alphafold.

MODEL_NAME `module-attribute` ¶

MODEL_NAME = 'model_name'

Engine-internal model identifier (str). Set by ESMFold.

MODEL_TYPE `module-attribute` ¶

MODEL_TYPE = 'model_type'

Model-type identifier (str). Set by AlphaFold, e.g. "monomer".

MODEL_VERSION `module-attribute` ¶

MODEL_VERSION = 'model_version'

Model-version identifier (str). Set by Boltz, e.g. "boltz2".

JOB_NAME `module-attribute` ¶

JOB_NAME = 'job_name'

Job name used for engine output files (str). Set by RoseTTAFold.

USE_MSA_SERVER `module-attribute` ¶

USE_MSA_SERVER = 'use_msa_server'

Whether an MSA server was used (bool). Set by Boltz.

PTM `module-attribute` ¶

PTM = 'ptm'

Predicted TM-score for the whole structure (float). Set by Boltz.

IPTM `module-attribute` ¶

IPTM = 'iptm'

Interface predicted TM-score (float). Set by Boltz; meaningful for complexes.

CONFIDENCE_SCORE `module-attribute` ¶

CONFIDENCE_SCORE = 'confidence_score'

Composite confidence score (float). Set by Boltz.

PAE `module-attribute` ¶

PAE = 'pae'

(L, L) predicted aligned error matrix (float array). Set by RoseTTAFold.

PDE `module-attribute` ¶

PDE = 'pde'

(L, L) predicted distance error matrix (float array). Set by RoseTTAFold.

PAE_INTER `module-attribute` ¶

PAE_INTER = 'pae_inter'

Scalar mean inter-chain PAE (float). RoseTTAFold's headline interface metric; values below ~10 indicate a high-quality interface.
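As an illustration of what a mean inter-chain PAE measures, the sketch below averages the entries of a PAE matrix whose row and column residues lie in different chains. This is an assumed definition for explanatory purposes; how RoseTTAFold or the molforge wrapper actually computes the value is not specified here, and `mean_inter_chain_pae` is a hypothetical helper.

```python
from __future__ import annotations

from typing import Optional


def mean_inter_chain_pae(
    pae: list[list[float]], chain_of_residue: list[str]
) -> Optional[float]:
    """Average PAE over residue pairs (i, j) that belong to different chains."""
    n = len(chain_of_residue)
    values = [
        pae[i][j]
        for i in range(n)
        for j in range(n)
        if chain_of_residue[i] != chain_of_residue[j]
    ]
    if not values:
        return None  # single-chain structure: no interface to score
    return sum(values) / len(values)


pae = [
    [0.5, 2.0, 6.0],
    [2.0, 0.5, 8.0],
    [5.0, 9.0, 0.5],
]
chains = ["A", "A", "B"]
score = mean_inter_chain_pae(pae, chains)  # (6 + 8 + 5 + 9) / 4 == 7.0
```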

PAE_PROT `module-attribute` ¶

PAE_PROT = 'pae_prot'

Scalar mean PAE over protein residues only (float). Set by RoseTTAFold.

MEAN_PAE `module-attribute` ¶

MEAN_PAE = 'mean_pae'

Scalar mean of the full PAE matrix (float). Set by RoseTTAFold.

MEAN_PLDDT `module-attribute` ¶

MEAN_PLDDT = 'mean_plddt'

Scalar mean pLDDT (float). Set by RoseTTAFold and :func:molforge.io.load_alphafold. Equivalent to :data:MEAN_CONFIDENCE; the latter is the cross-engine-uniform name and should be preferred.

PLDDT `module-attribute` ¶

PLDDT = 'plddt'

(N_atoms,) float32 per-atom pLDDT. Legacy key set by :func:molforge.io.load_alphafold; :data:CONFIDENCE_PER_ATOM is the cross-engine-uniform name and should be preferred.

PLDDT_PER_RESIDUE `module-attribute` ¶

PLDDT_PER_RESIDUE = 'plddt_per_residue'

(L,) float32 per-residue pLDDT. Legacy key set by :func:molforge.io.load_alphafold; :data:CONFIDENCE_PER_RESIDUE is the cross-engine-uniform name and should be preferred.
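Since the uniform and engine-specific names can coexist, a reader may want to prefer the uniform key and fall back to the other one only when needed. The helper below is a hypothetical sketch of that lookup order for the scalar mean; the two constants are redefined locally with their documented values so the snippet runs on its own.

```python
from __future__ import annotations

from typing import Any, Optional

MEAN_CONFIDENCE = "mean_confidence"  # cross-engine-uniform name
MEAN_PLDDT = "mean_plddt"  # engine-specific equivalent


def mean_confidence(metadata: dict[str, Any]) -> Optional[float]:
    """Prefer the uniform key, fall back to the engine-specific one, else None."""
    for key in (MEAN_CONFIDENCE, MEAN_PLDDT):
        if key in metadata:
            return float(metadata[key])
    return None


from_uniform = mean_confidence({"mean_confidence": 90, "mean_plddt": 87.5})  # 90.0
from_fallback = mean_confidence({"mean_plddt": 87.5})  # 87.5
missing = mean_confidence({})  # None
```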

ProteinMetadata ¶

Bases: TypedDict

Typed view of the documented :attr:Protein.metadata keys.

Every key is optional (total=False). This is a typing aid only — Protein.metadata remains a plain dict[str, Any] at runtime, and keys outside this set are still permitted (without stability guarantees). Annotate a local variable as ProteinMetadata to get editor / mypy support for the documented vocabulary.
