Data model¶

The data model is the heart of molforge. Every wrapper, parser, and analysis function reads from and writes to the same set of types, so chaining tools doesn't require conversion code.

There are two views of the same data:

A hierarchical view — Protein → Chain → Residue → Atom — for code that reasons about biology.
A linear view — AtomArray — a struct-of-arrays NumPy container for vectorized analysis and ML.

Both views read from the same backing arrays. There is no copy, no synchronization layer, and no "convert to the other view" step.

`AtomArray` — the canonical store¶

AtomArray is a flat, NumPy-backed container holding one row per atom. Columns include:

Field	Dtype	Notes
`coords`	`float32` `(N, 3)`	Cartesian coordinates in Å.
`atom_name`	`<U4`	Standard PDB atom name, e.g. `"CA"`.
`element`	`<U2`	One- or two-letter symbol.
`residue_name`	`<U3`	Three-letter residue code.
`residue_id`	`int32`	Author residue number.
`chain_id`	`<U2`	Chain identifier.
`entity_type`	`<U10`	`"protein"`, `"ligand"`, `"ion"`, `"water"`.
`b_factor`	`float32`	Temperature factor.
`occupancy`	`float32`	Crystallographic occupancy.
`altloc`	`<U1`	Alternate-location indicator.

(See molforge.core.AtomArray for the full schema.) Because everything is a NumPy array, masking and slicing are O(1) and vectorized:

ca = protein.atom_array.coords[protein.atom_array.atom_name == "CA"]
heavy = protein.atom_array.coords[protein.atom_array.element != "H"]

Hierarchical accessors¶

The hierarchical view is a thin layer of views over the same AtomArray. Each Chain, Residue, and Atom holds a (start, end) index pair into the parent array — they don't own data:

protein["A"]                       # Chain (lookup by id)
protein.chains[0]                  # Chain (lookup by position)
protein["A"].residues[42]          # Residue
protein["A"].residues[42].atoms["CA"]   # Atom

This means mutating a residue's coordinates mutates the underlying AtomArray, and analyses that operate on the linear view see the change immediately. It also means Chain/Residue/Atom objects are cheap to create — they're essentially typed pointers.

Metadata¶

Protein.metadata is a free-form dict[str, Any] for things that don't fit cleanly into the structural schema: resolution, experimental method, PDB header lines, prediction confidence (e.g. "confidence_per_residue" set by load_alphafold).

API status

metadata is intentionally untyped today. A typed ProteinMetadata dataclass is under consideration for a future release — see the API audit issue. Treat keys you set as conventions, not contracts.

Entity types¶

The entity_type column on AtomArray distinguishes protein atoms from ligands, ions, and waters. PDB and mmCIF parsers set this automatically; you can also filter manually:

arr = protein.atom_array
protein_atoms = arr[arr.entity_type == "protein"]
ligands       = arr[arr.entity_type == "ligand"]
ions          = arr[arr.entity_type == "ion"]

This is what makes heterogeneous content first-class — antibody glycans, drug-target ligands, structural waters, and metal ions all coexist in one Protein without special-casing.

Reference¶

molforge.core — the full API for the data model.