Data model

The data model is the heart of molforge. Every wrapper, parser, and analysis function reads from and writes to the same set of types, so chaining tools doesn't require conversion code.

There are two views of the same data:

  • A hierarchical view (Protein → Chain → Residue → Atom) for code that reasons about biology.
  • A linear view (AtomArray): a struct-of-arrays NumPy container for vectorized analysis and ML.

Both views read from the same backing arrays. There is no copy, no synchronization layer, and no "convert to the other view" step.

AtomArray — the canonical store

AtomArray is a flat, NumPy-backed container holding one row per atom. Columns include:

Field          Dtype            Notes
coords         float32 (N, 3)   Cartesian coordinates in Å.
atom_name      <U4              Standard PDB atom name, e.g. "CA".
element        <U2              One- or two-letter symbol.
residue_name   <U3              Three-letter residue code.
residue_id     int32            Author residue number.
chain_id       <U2              Chain identifier.
entity_type    <U10             "protein", "ligand", "ion", "water".
b_factor       float32          Temperature factor.
occupancy      float32          Crystallographic occupancy.
altloc         <U1              Alternate-location indicator.

(See molforge.core.AtomArray for the full schema.) Because every column is a NumPy array, masking and slicing are vectorized. A boolean mask is evaluated in a single O(N) pass with no Python-level loop (only basic slices are O(1)):

ca = protein.atom_array.coords[protein.atom_array.atom_name == "CA"]
heavy = protein.atom_array.coords[protein.atom_array.element != "H"]
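
To make the masking behaviour concrete without depending on molforge itself, the sketch below rebuilds three of the columns as plain NumPy arrays, using the names and dtypes from the schema table. The four-atom data set is invented for illustration; only the NumPy calls are real API. It also shows a NumPy detail worth keeping in mind: boolean indexing returns a copy of the selected rows, while a basic slice is a view.

```python
import numpy as np

# Hypothetical stand-in for three AtomArray columns (dtypes from the schema table).
coords = np.array(
    [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [2.0, 1.2, 0.0], [1.5, -0.9, 0.5]],
    dtype=np.float32,
)
atom_name = np.array(["N", "CA", "C", "HA"], dtype="<U4")
element = np.array(["N", "C", "C", "H"], dtype="<U2")

# Comparing a column with a string yields a boolean mask of shape (N,).
ca_mask = atom_name == "CA"
heavy_mask = element != "H"

# Boolean indexing selects rows without a Python loop; NumPy returns a copy.
ca_coords = coords[ca_mask]
heavy_coords = coords[heavy_mask]

# A basic slice, by contrast, is a view onto the same memory.
first_two = coords[:2]

print(ca_coords.shape, heavy_coords.shape)        # (1, 3) (3, 3)
print(np.shares_memory(coords, heavy_coords))     # False
print(np.shares_memory(coords, first_two))        # True
```

In practice this means a masked selection is safe to modify without touching the source array, whereas writing into a slice changes the original.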

Hierarchical accessors

The hierarchical view is a thin layer of lightweight handles over the same AtomArray. Each Chain, Residue, and Atom holds a (start, end) index pair into the parent array and owns no data of its own:

protein["A"]                       # Chain (lookup by id)
protein.chains[0]                  # Chain (lookup by position)
protein["A"].residues[42]          # Residue
protein["A"].residues[42].atoms["CA"]   # Atom

This means mutating a residue's coordinates mutates the underlying AtomArray, and analyses that operate on the linear view see the change immediately. It also means Chain/Residue/Atom objects are cheap to create — they're essentially typed pointers.
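
The index-pair idea can be sketched in a few lines of plain NumPy. The ResidueView class below is a hypothetical illustration, not molforge's actual implementation: it stores only a reference to the parent coordinate array plus a (start, end) range, and hands out a NumPy view on demand.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class ResidueView:
    """Hypothetical handle: a (start, end) range into a shared coords array."""

    coords: np.ndarray  # the parent (N, 3) array, shared and never copied
    start: int
    end: int

    @property
    def xyz(self) -> np.ndarray:
        # Basic slicing returns a NumPy view, so writes reach the parent array.
        return self.coords[self.start:self.end]


# Parent array: 5 atoms, all at the origin.
coords = np.zeros((5, 3), dtype=np.float32)
residue = ResidueView(coords, start=1, end=4)  # covers atoms 1, 2 and 3

# Translate the residue by 1 Å along x, in place, through the handle.
residue.xyz[:, 0] += 1.0

print(coords[:, 0])  # [0. 1. 1. 1. 0.]
```

Because the handle holds two integers and a reference, creating one costs almost nothing, which is the property the paragraph above relies on.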

Metadata

Protein.metadata is a free-form dict[str, Any] for things that don't fit cleanly into the structural schema: resolution, experimental method, PDB header lines, prediction confidence (e.g. "confidence_per_residue" set by load_alphafold).

API status

metadata is intentionally untyped today. A typed ProteinMetadata dataclass is under consideration for a future release — see the API audit issue. Treat keys you set as conventions, not contracts.
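
Since keys are conventions rather than contracts, it is reasonable to read metadata defensively. The sketch below uses a plain dict as a stand-in for Protein.metadata. The helper get_resolution and the key names "resolution" and "method" are assumptions made for illustration; molforge may use different keys.

```python
from typing import Any, Optional


def get_resolution(metadata: dict[str, Any]) -> Optional[float]:
    """Return the resolution in Å if present and numeric, else None.

    The key name "resolution" is an assumed convention, not a documented contract.
    """
    value = metadata.get("resolution")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return float(value)
    return None


# Stand-in for Protein.metadata; the values are invented for illustration.
metadata: dict[str, Any] = {"resolution": 1.8, "method": "X-RAY DIFFRACTION"}

print(get_resolution(metadata))                  # 1.8
print(get_resolution({}))                        # None
print(get_resolution({"resolution": "n/a"}))     # None
```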

Entity types

The entity_type column on AtomArray distinguishes protein atoms from ligands, ions, and waters. PDB and mmCIF parsers set this automatically; you can also filter manually:

arr = protein.atom_array
protein_atoms = arr[arr.entity_type == "protein"]
ligands       = arr[arr.entity_type == "ligand"]
ions          = arr[arr.entity_type == "ion"]

This is what makes heterogeneous content first-class — antibody glycans, drug-target ligands, structural waters, and metal ions all coexist in one Protein without special-casing.
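
As a small illustration of working with the entity_type column, the sketch below counts atoms per entity type and builds a combined mask for all non-protein content. It uses a hand-written NumPy array in place of a real AtomArray column; the six-atom data set is invented, and only the NumPy functions are real API.

```python
import numpy as np

# Hypothetical entity_type column for a six-atom structure (dtype from the schema).
entity_type = np.array(
    ["protein", "protein", "protein", "ligand", "water", "ion"], dtype="<U10"
)

# Count atoms per entity type in one vectorized pass.
labels, counts = np.unique(entity_type, return_counts=True)
per_type = {str(k): int(v) for k, v in zip(labels, counts)}

# Combine several types with np.isin, here everything that is not protein.
hetero_mask = np.isin(entity_type, ["ligand", "ion", "water"])

print(per_type)                 # {'ion': 1, 'ligand': 1, 'protein': 3, 'water': 1}
print(int(hetero_mask.sum()))   # 3
```

The same hetero_mask could be used to index any other column of equal length, such as coords or residue_name.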

Reference