molforge.io¶
File I/O for molforge.
This subpackage provides parsers and writers for the file formats you'll
encounter across structural-biology workflows. The top-level entry
points are :func:load, :func:save, and :func:fetch, which dispatch
to the appropriate handler based on the file extension.
Supported formats:
- PDB (.pdb, .ent): full read/write, the universal default.
- mmCIF / PDBx (.cif, .mmcif): full read/write; recommended for structures with >99,999 atoms (PDB's hard limit).
- FASTA (.fasta, .fa, .faa, .fna): sequence read/write.
- SDF (.sdf, .mol): small-molecule exchange; full read/write of V2000 (coordinates, elements, title, property block). V3000 is not yet supported.
- MOL2 (.mol2): Tripos small-molecule exchange; full read/write of the ATOM section (coordinates, elements via Tripos type prefix, atom names, partial charges, substructure info).
- PDBQT (.pdbqt): AutoDock / Vina format; full read/write of ATOM records with per-atom partial charges and AutoDock atom types, reusing the PDB reader for the leading columns. ROOT / BRANCH / TORSDOF rotatable-bond markers are read-tolerated; round-tripping preserves coordinates, charges, and types.
- PQR (.pqr): APBS / PDB2PQR format with explicit per-atom charges and radii. The leading PDB-compatible columns are parsed as fixed-position; the charge and radius are whitespace-split from the trailing fields (PQR is not strictly fixed-column past the coordinates). Radii are attached to protein.metadata["radii"].
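The PQR strategy described in the list above (fixed columns up to the coordinates, whitespace splitting after them) can be sketched as follows. This is a minimal illustration and not molforge's actual reader: the function name parse_pqr_atom_line and the returned dict layout are hypothetical, and the column positions are those of the standard PDB ATOM record.

```python
def parse_pqr_atom_line(line: str) -> dict:
    """Parse one ATOM/HETATM line of a PQR file (illustrative sketch only)."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    # Leading fields sit at the standard PDB column positions (0-based slices).
    atom = {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "resname": line[17:20].strip(),
        "chain": line[21:22].strip(),
        "resseq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }
    # Past the coordinates PQR is not reliably fixed-column, so split on whitespace.
    trailing = line[54:].split()
    if len(trailing) < 2:
        raise ValueError("missing charge/radius fields")
    atom["charge"] = float(trailing[0])
    atom["radius"] = float(trailing[1])
    return atom


# Build a column-aligned example line instead of typing the spaces by hand.
example_line = (
    "ATOM  " + f"{1:5d}" + " " + f"{'N':<4s}" + " " + "MET" + " " + "A"
    + f"{1:4d}" + "    "
    + f"{27.340:8.3f}{24.430:8.3f}{2.614:8.3f}"
    + "  -0.3000   1.8240"
)
atom = parse_pqr_atom_line(example_line)
```

Splitting only the tail keeps the sketch tolerant of writers that use different widths for the charge and radius fields, while atom names and residue numbers still come from their fixed positions.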
Convenience helpers:
- :func:fetch pulls a structure by ID from RCSB or the AlphaFold DB.
- :func:load_alphafold loads an AlphaFold prediction, exposing pLDDT as a first-class field rather than leaving it buried in the B-factor column.
Example
    import molforge as mf

    protein = mf.load("1ubq.pdb")
    mf.save(protein, "1ubq_clean.pdb")
FastaRecord dataclass ¶
A single record from a FASTA file.
Attributes:
| Name | Type | Description |
|---|---|---|
| id | str | The first whitespace-delimited token after … |
| description | str | The rest of the header line, if any. |
| sequence | str | The concatenated, whitespace-stripped sequence. |
| metadata | dict[str, str] | Free-form metadata (e.g. for downstream tools). |
CIFParseError ¶
Bases: ValueError
Raised when an mmCIF file cannot be parsed.
CIFWriteError ¶
Bases: ValueError
Raised when an in-memory structure cannot be serialized to mmCIF.
PDBParseError ¶
Bases: ValueError
Raised when a PDB file cannot be parsed.
PDBWriteError ¶
Bases: ValueError
Raised when an in-memory structure cannot be serialized to PDB.
fetch ¶
fetch(
pdb_id: str,
*,
source: str = "rcsb",
format: str = "pdb",
timeout: float = 30.0,
) -> Protein
Fetch a structure by ID from a remote source.
Downloads the structure over HTTPS and parses it into a
:class:~molforge.core.Protein. Uses only the standard library
(:mod:urllib), so it adds no dependency.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| pdb_id | str | 4-character PDB ID (for … | required |
| source | str | | 'rcsb' |
| format | str | | 'pdb' |
| timeout | float | Network timeout in seconds for the download. | 30.0 |
Returns:
| Type | Description |
|---|---|
| Protein | A :class:Protein … file. |
Raises:
| Type | Description |
|---|---|
| ValueError | If … |
| OSError | If the download fails: network error, timeout, or a non-existent ID (which the server returns as HTTP 404). The underlying :class: … |
Example
    from molforge.io import fetch

    protein = fetch("1ABC")                    # RCSB, PDB format
    af = fetch("P00520", source="alphafold")   # AlphaFold DB
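To illustrate the standard-library download path and the documented error behaviour, here is a rough sketch. It is not molforge's implementation: build_url and download_text are hypothetical helpers, the URL templates are assumptions (the AlphaFold file name embeds a model version that changes between database releases), and "cif" as a second format value is also assumed.

```python
import urllib.error
import urllib.request

# Assumed URL templates, for illustration only.
_URL_TEMPLATES = {
    "rcsb": "https://files.rcsb.org/download/{id}.{ext}",
    "alphafold": "https://alphafold.ebi.ac.uk/files/AF-{id}-F1-model_v4.{ext}",
}


def build_url(entry_id: str, source: str = "rcsb", fmt: str = "pdb") -> str:
    """Return a download URL, validating the arguments before any network I/O."""
    if source not in _URL_TEMPLATES:
        raise ValueError(f"unknown source: {source!r}")
    if fmt not in ("pdb", "cif"):
        raise ValueError(f"unknown format: {fmt!r}")
    if source == "rcsb" and not (len(entry_id) == 4 and entry_id.isalnum()):
        raise ValueError(f"not a 4-character PDB ID: {entry_id!r}")
    return _URL_TEMPLATES[source].format(id=entry_id.upper(), ext=fmt)


def download_text(url: str, timeout: float = 30.0) -> str:
    """Download a URL and return its body as text.

    urllib raises HTTPError (for example on a 404) and URLError; both are
    subclasses of OSError, so a caller can catch OSError for any failure.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Validating before downloading means a malformed ID surfaces as ValueError without a network round trip, while anything that goes wrong on the wire can be handled with a single `except OSError`.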
load ¶
Load a structure or sequence file.
Format is inferred from the extension unless format is given.
Additional kwargs are forwarded to the underlying reader.
Returns:
| Type | Description |
|---|---|
| object | A :class: … :class: … |
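Extension-based format inference with an explicit override could look roughly like the sketch below. The extension table is taken from the "Supported formats" list; the function name infer_format and the format labels on the right-hand side are hypothetical, and molforge's own dispatch may differ.

```python
from pathlib import Path

# Extensions come from the "Supported formats" list; the labels are made up.
_EXTENSION_TO_FORMAT = {
    ".pdb": "pdb", ".ent": "pdb",
    ".cif": "cif", ".mmcif": "cif",
    ".fasta": "fasta", ".fa": "fasta", ".faa": "fasta", ".fna": "fasta",
    ".sdf": "sdf", ".mol": "sdf",
    ".mol2": "mol2",
    ".pdbqt": "pdbqt",
    ".pqr": "pqr",
}


def infer_format(path, format=None):
    """Return a format label: the explicit one if given, else from the suffix."""
    if format is not None:
        return format.lower()
    suffix = Path(path).suffix.lower()
    try:
        return _EXTENSION_TO_FORMAT[suffix]
    except KeyError:
        raise ValueError(f"cannot infer format from extension {suffix!r}") from None
```

An explicit format argument is what makes files with unusual names loadable, for example a PDB file saved with a .txt suffix.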
save ¶
save(
obj: object,
path: str | PathLike[str],
*,
format: str | None = None,
**kwargs: object,
) -> None
Save a structure or list of FASTA records to disk.
Format is inferred from the extension unless format is given.
read_fasta ¶
Read a FASTA file from disk.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| PathLike[str] | Path to a … | required |
Returns:
| Type | Description |
|---|---|
| list[FastaRecord] | A list of :class: … |
read_fasta_string ¶
Parse FASTA-formatted text, yielding :class:FastaRecord objects.
Memory-efficient: yields records one at a time rather than building a list up front.
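The one-record-at-a-time behaviour can be sketched with a generator. This is an illustration under assumptions, not the library's parser: Record is a hypothetical stand-in for FastaRecord (without the metadata field), and iter_fasta is a made-up name.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class Record:  # hypothetical stand-in for FastaRecord
    id: str
    description: str
    sequence: str


def _make_record(header: str, chunks: list) -> Record:
    # id is the first whitespace-delimited token; the rest is the description.
    parts = header.split(maxsplit=1)
    rec_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return Record(rec_id, description, "".join(chunks))


def iter_fasta(text: str) -> Iterator[Record]:
    """Yield records lazily; a record is emitted when the next header starts."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Remove any internal whitespace from sequence lines.
            chunks.append("".join(line.split()))
        # Sequence text before the first header is ignored in this sketch.
    if header is not None:
        yield _make_record(header, chunks)


records = list(iter_fasta(">sp1 first protein\nMKT\nAAG\n\n>sp2\nGG TT\n"))
```

Because the function is a generator, a caller that only needs the first few records can stop iterating early without building the full list.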
write_fasta ¶
write_fasta(
records: Iterable[FastaRecord | tuple[str, str]],
path: str | PathLike[str],
*,
line_width: int = 80,
) -> None
Write FASTA records to disk.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| records | Iterable[FastaRecord \| tuple[str, str]] | Iterable of :class: … | required |
| path | str \| PathLike[str] | Destination path; … | required |
| line_width | int | Maximum sequence characters per line. Set to … | 80 |
write_fasta_string ¶
write_fasta_string(
records: Iterable[FastaRecord | tuple[str, str]],
*,
line_width: int = 80,
) -> str
Serialize records as FASTA-formatted text.
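As an illustration of how line_width shapes the output, here is a small sketch that wraps sequences at a fixed width. It assumes records are plain (header, sequence) tuples and that line_width is a positive integer; format_fasta is a hypothetical name, and any special value of line_width that the real writer supports is not modelled here.

```python
from typing import Iterable, Tuple


def format_fasta(records: Iterable[Tuple[str, str]], line_width: int = 80) -> str:
    """Serialize (header, sequence) pairs, wrapping sequences at line_width."""
    if line_width < 1:
        raise ValueError("line_width must be a positive integer")
    lines = []
    for header, sequence in records:
        lines.append(f">{header}")
        # Slice the sequence into consecutive chunks of at most line_width.
        for start in range(0, len(sequence), line_width):
            lines.append(sequence[start:start + line_width])
    return ("\n".join(lines) + "\n") if lines else ""


text = format_fasta([("a desc", "ABCDEFGHIJ")], line_width=4)
```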
read_cif ¶
read_cif(
path: str | PathLike[str],
*,
include_hydrogens: bool = True,
altloc: str = "highest_occupancy",
) -> Protein
Read an mmCIF / PDBx file from disk.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| PathLike[str] | Path to a … | required |
| include_hydrogens | bool | If False, drop hydrogen atoms during parsing. | True |
| altloc | str | Altloc-resolution strategy (same as :func: … | 'highest_occupancy' |
Returns:
| Type | Description |
|---|---|
| Protein | A :class:Protein populated with … and … |
Raises:
| Type | Description |
|---|---|
| CIFParseError | If the file is malformed or has no … |
| FileNotFoundError | If the path doesn't exist. |
read_cif_string ¶
read_cif_string(
text: str,
*,
include_hydrogens: bool = True,
altloc: str = "highest_occupancy",
) -> Protein
Parse mmCIF-formatted text into a :class:Protein.
See :func:read_cif for argument semantics.
write_cif ¶
Write a :class:Protein to an mmCIF file.
write_cif_string ¶
Serialize a :class:Protein as mmCIF text.
Produces a compact CIF with a data_<id> header, the structure's
metadata (where present), and a complete _atom_site loop. Round-trips
cleanly through :func:read_cif_string.
read_pdb ¶
read_pdb(
path: str | PathLike[str],
*,
model: int | None = None,
include_hydrogens: bool = True,
altloc: str = "highest_occupancy",
) -> Protein
Read a PDB file from disk.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| PathLike[str] | Path to a … | required |
| model | int \| None | Which model to load from a multi-model file. | None |
| include_hydrogens | bool | If … | True |
| altloc | str | Strategy for resolving alternate location indicators. | 'highest_occupancy' |
Returns:
| Type | Description |
|---|---|
| Protein | A :class:Protein … and EXPDTA records found. |
Raises:
| Type | Description |
|---|---|
| PDBParseError | If the file is malformed. |
| FileNotFoundError | If the path doesn't exist. |
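One plausible reading of the default altloc="highest_occupancy" strategy is sketched below: for each atom position, keep the alternate location with the highest occupancy. This is an assumption about the semantics, not the library's code. AtomRecord and resolve_altlocs are hypothetical, insertion codes are ignored, and choosing per atom (rather than per residue) is a simplification that could mix conformers.

```python
from typing import NamedTuple


class AtomRecord(NamedTuple):  # hypothetical minimal atom record
    chain: str
    resseq: int
    name: str
    altloc: str       # "" when the atom has no alternate location
    occupancy: float


def resolve_altlocs(atoms):
    """Keep one atom per (chain, residue number, atom name): the most occupied."""
    best = {}  # dicts preserve insertion order, so file order is retained
    for atom in atoms:
        key = (atom.chain, atom.resseq, atom.name)
        current = best.get(key)
        # Strict ">" keeps the first-seen atom when occupancies tie.
        if current is None or atom.occupancy > current.occupancy:
            best[key] = atom
    return list(best.values())


atoms = [
    AtomRecord("A", 10, "CA", "A", 0.4),
    AtomRecord("A", 10, "CA", "B", 0.6),
    AtomRecord("A", 11, "CA", "", 1.0),
]
resolved = resolve_altlocs(atoms)
```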
read_pdb_string ¶
read_pdb_string(
text: str,
*,
model: int | None = None,
include_hydrogens: bool = True,
altloc: str = "highest_occupancy",
) -> Protein
Parse a PDB-formatted string into a :class:Protein.
See :func:read_pdb for argument semantics.
write_pdb ¶
Write a :class:Protein to a PDB file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| protein | Protein | The structure to serialize. | required |
| path | str \| PathLike[str] | Destination path. | required |
| write_end | bool | Emit a final … | True |
Raises:
| Type | Description |
|---|---|
| PDBWriteError | If the structure exceeds PDB's hard limits (>99,999 atoms or >9,999 residues per chain). |
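The limits come from the fixed-width columns of the PDB format: the atom serial number has five columns and the residue sequence number has four. A pre-flight check along these lines might look as follows. This is a sketch only; LimitError stands in for PDBWriteError, and check_pdb_limits is a hypothetical helper.

```python
MAX_ATOMS = 99_999               # atom serial field is 5 columns wide
MAX_RESIDUES_PER_CHAIN = 9_999   # residue number field is 4 columns wide


class LimitError(ValueError):  # hypothetical stand-in for PDBWriteError
    pass


def check_pdb_limits(n_atoms: int, residues_per_chain: dict) -> None:
    """Raise LimitError if the counts cannot be represented in PDB columns."""
    if n_atoms > MAX_ATOMS:
        raise LimitError(f"{n_atoms} atoms exceeds the PDB limit of {MAX_ATOMS}")
    for chain, count in residues_per_chain.items():
        if count > MAX_RESIDUES_PER_CHAIN:
            raise LimitError(
                f"chain {chain!r} has {count} residues; "
                f"the PDB limit is {MAX_RESIDUES_PER_CHAIN}"
            )
```

When a check like this fails, writing mmCIF instead avoids the problem, since that format has no fixed-width fields.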
write_pdb_string ¶
Serialize a :class:Protein into a PDB-formatted string.
is_alphafold_pdb ¶
Detect whether a PDB file/string is an AlphaFold prediction.
Heuristic: looks for ALPHAFOLD, PREDICTED MODEL, or ESMFOLD
in the first 100 lines of HEADER / TITLE / REMARK records.
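A rough sketch of such a heuristic, operating on text only, is shown below. The function name looks_like_prediction is hypothetical, and two details are assumptions: that the 100-line window counts all lines of the file, and that matching is case-insensitive.

```python
_MARKERS = ("ALPHAFOLD", "PREDICTED MODEL", "ESMFOLD")
_RECORDS = ("HEADER", "TITLE", "REMARK")


def looks_like_prediction(text: str, max_lines: int = 100) -> bool:
    """Return True if an early HEADER/TITLE/REMARK line contains a marker."""
    for line in text.splitlines()[:max_lines]:
        # Only header-style records are inspected; coordinate lines are skipped.
        if line.startswith(_RECORDS):
            upper = line.upper()
            if any(marker in upper for marker in _MARKERS):
                return True
    return False
```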
load_alphafold ¶
Load an AlphaFold prediction, exposing pLDDT as metadata.
The protein is read via :func:molforge.io.read_pdb, then its
metadata is populated with confidence information under two
sets of keys:
- Uniform folding-engine keys (preferred): the same keys every molforge folding-engine wrapper sets, so downstream code can read confidence without caring which engine ran: confidence_per_atom, confidence_per_residue, mean_confidence, and engine (="AlphaFold").
- Legacy AlphaFold-specific keys (retained for backward compatibility): plddt, plddt_per_residue, mean_plddt, source (="alphafold").
The two sets carry the same values; new code should prefer the
uniform keys. See :mod:molforge.core.metadata_keys for the
documented vocabulary.
The B-factor column is left intact for compatibility with downstream tools that still expect to find pLDDT there.
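The sketch below shows how the two key sets could be derived from per-atom B-factors. It is illustrative only: confidence_metadata is a hypothetical function, the input is assumed to be (residue_key, b_factor) pairs in file order, and both the pairing of plddt with confidence_per_atom and the choice to average over residues for the mean are assumptions rather than documented behaviour.

```python
from statistics import fmean


def confidence_metadata(atoms):
    """Build both metadata key sets from (residue_key, b_factor) pairs."""
    atoms = list(atoms)
    if not atoms:
        raise ValueError("no atoms given")
    per_atom = [b for _, b in atoms]
    # Group B-factors by residue, preserving first-seen residue order.
    grouped = {}
    for key, b in atoms:
        grouped.setdefault(key, []).append(b)
    per_residue = [fmean(values) for values in grouped.values()]
    mean = fmean(per_residue)  # assumption: mean taken over residues
    return {
        # Uniform folding-engine keys
        "confidence_per_atom": per_atom,
        "confidence_per_residue": per_residue,
        "mean_confidence": mean,
        "engine": "AlphaFold",
        # Legacy AlphaFold-specific keys carrying the same values
        "plddt": per_atom,
        "plddt_per_residue": per_residue,
        "mean_plddt": mean,
        "source": "alphafold",
    }


meta = confidence_metadata([(("A", 1), 90.0), (("A", 1), 90.0), (("A", 2), 70.0)])
```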