molforge.io¶

io ¶

File I/O for molforge.

This subpackage provides parsers and writers for the file formats you'll encounter across structural-biology workflows. The top-level entry points are :func:load, :func:save, and :func:fetch; :func:load and :func:save dispatch to the appropriate handler based on the file extension, while :func:fetch takes the format as an argument.

Supported formats:

PDB (.pdb, .ent) — full read/write, the universal default.
mmCIF / PDBx (.cif, .mmcif) — full read/write; recommended for structures with >99,999 atoms (PDB's hard limit).
FASTA (.fasta, .fa, .faa, .fna) — sequence read/write.
SDF (.sdf, .mol) — small-molecule exchange; full read/write of V2000 (coordinates, elements, title, property block). V3000 is not yet supported.
MOL2 (.mol2) — Tripos small-molecule exchange; full read/write of the ATOM section (coordinates, elements via Tripos type prefix, atom names, partial charges, substructure info).
PDBQT (.pdbqt) — AutoDock / Vina format; full read/write of ATOM records with per-atom partial charges and AutoDock atom types, reusing the PDB reader for the leading columns. ROOT / BRANCH / TORSDOF rotatable-bond markers are read-tolerated; round-tripping preserves coordinates, charges, and types.
PQR (.pqr) — APBS / PDB2PQR with explicit per-atom charges and radii. The leading PDB-compatible columns are parsed as fixed-position; the charge and radius are whitespace-split from the trailing fields (PQR is not strictly fixed-column past the coordinates). Radii are attached to protein.metadata["radii"].
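The PQR entry above describes a two-stage parse: fixed columns up to the coordinates, then whitespace splitting for charge and radius. The following is a minimal sketch of that idea, not molforge's reader. The function name `parse_pqr_atom_line` and the returned dict layout are hypothetical; the column positions are the standard PDB ATOM columns.

```python
def parse_pqr_atom_line(line):
    """Split one PQR ATOM/HETATM line into coordinates, charge, and radius."""
    # Fixed columns shared with PDB (1-based columns 31-38, 39-46, 47-54).
    x = float(line[30:38])
    y = float(line[38:46])
    z = float(line[46:54])
    # Everything after the coordinates is whitespace-delimited, not columnar.
    charge_text, radius_text = line[54:].split()[:2]
    return {
        "name": line[12:16].strip(),
        "resname": line[17:20].strip(),
        "chain": line[21].strip(),
        "resseq": int(line[22:26]),
        "xyz": (x, y, z),
        "charge": float(charge_text),
        "radius": float(radius_text),
    }


# Build a sample line from explicit field widths so the columns are exact.
sample = (
    f"{'ATOM':<6}{1:>5} {'N':<4} {'MET':>3} {'A'}{1:>4}    "
    f"{27.340:8.3f}{24.430:8.3f}{2.614:8.3f}  {-0.3:.4f} {1.85:.4f}"
)
atom = parse_pqr_atom_line(sample)
```

Because the trailing fields are split on whitespace, the same function accepts files whose charge and radius columns are padded differently.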

Convenience helpers:

:func:fetch — pull a structure by PDB ID from RCSB or AlphaFold.
:func:load_alphafold — load an AlphaFold prediction, exposing pLDDT as a first-class field rather than buried in B-factor.

Example

import molforge as mf
protein = mf.load("1ubq.pdb")
mf.save(protein, "1ubq_clean.pdb")

FastaRecord `dataclass` ¶

FastaRecord(
    id: str,
    sequence: str,
    description: str = "",
    metadata: dict[str, str] = dict(),
)

A single record from a FASTA file.

Attributes:

Name	Type	Description
`id`	`str`	The first whitespace-delimited token after `>` on the header line.
`description`	`str`	The rest of the header line, if any.
`sequence`	`str`	The concatenated, whitespace-stripped sequence.
`metadata`	`dict[str, str]`	Free-form metadata (e.g. for downstream tools).

header: str

The full header line including ID and description.
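To make the field layout concrete, here is a hypothetical stand-in for the record type. `FastaRecordSketch` is not the library's class, and the way `header` joins the ID and description (single space, no leading `>`) is an assumption. The rendered signature shows `dict()` as the `metadata` default; in a hand-written dataclass, a mutable default has to be expressed with `default_factory`.

```python
from dataclasses import dataclass, field


@dataclass
class FastaRecordSketch:
    """Hypothetical stand-in mirroring the documented FastaRecord fields."""

    id: str
    sequence: str
    description: str = ""
    # A mutable default needs default_factory so instances don't share a dict.
    metadata: dict[str, str] = field(default_factory=dict)

    @property
    def header(self) -> str:
        # Assumption: ID and description joined by one space, no leading ">".
        return f"{self.id} {self.description}".strip()
```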

CIFParseError ¶

Bases: ValueError

Raised when an mmCIF file cannot be parsed.

CIFWriteError ¶

Bases: ValueError

Raised when an in-memory structure cannot be serialized to mmCIF.

PDBParseError ¶

Bases: ValueError

Raised when a PDB file cannot be parsed.

PDBWriteError ¶

Bases: ValueError

Raised when an in-memory structure cannot be serialized to PDB.

fetch ¶

fetch(
    pdb_id: str,
    *,
    source: str = "rcsb",
    format: str = "pdb",
    timeout: float = 30.0,
) -> Protein

Fetch a structure by ID from a remote source.

Downloads the structure over HTTPS and parses it into a :class:~molforge.core.Protein. Uses only the standard library (:mod:urllib), so it adds no dependency.

Parameters:

Name	Type	Description	Default
`pdb_id`	`str`	4-character PDB ID (for `source="rcsb"`) or UniProt accession (for `source="alphafold"`). Case-insensitive for RCSB.	required
`source`	`str`	`"rcsb"` for the RCSB Protein Data Bank, or `"alphafold"` for the AlphaFold Protein Structure Database.	`'rcsb'`
`format`	`str`	`"pdb"` or `"cif"`. These are also the only two formats AlphaFold DB serves, and both are supported.	`'pdb'`
`timeout`	`float`	Network timeout in seconds for the download.	`30.0`

Returns:

Type	Description
`Protein`	A :class:`~molforge.core.Protein` parsed from the downloaded file.

Raises:

Type	Description
`ValueError`	If `source` or `format` is unrecognized, or `pdb_id` is empty.
`OSError`	If the download fails — network error, timeout, or a non-existent ID (which the server returns as HTTP 404). The underlying :class:`urllib.error.URLError` / :class:`~urllib.error.HTTPError` is chained as the cause.

Example

from molforge.io import fetch
protein = fetch("1ABC")  # RCSB, PDB format
af = fetch("P00520", source="alphafold")  # AlphaFold DB
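The validation and error-chaining behaviour described under Raises can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's code: `build_url` and `download_text` are hypothetical helpers, and the AlphaFold DB URL pattern (including the `F1` fragment and `v4` version) is assumed and may not match what molforge requests.

```python
import urllib.error
import urllib.request

# RCSB's public download host. The AlphaFold DB pattern (including the
# "F1" fragment and "v4" version) is an assumption and may change.
URL_TEMPLATES = {
    "rcsb": "https://files.rcsb.org/download/{id}.{ext}",
    "alphafold": "https://alphafold.ebi.ac.uk/files/AF-{id}-F1-model_v4.{ext}",
}


def build_url(entry_id, source="rcsb", format="pdb"):
    """Validate the arguments and return the download URL."""
    if not entry_id:
        raise ValueError("entry_id must not be empty")
    if source not in URL_TEMPLATES:
        raise ValueError(f"unrecognized source: {source!r}")
    if format not in ("pdb", "cif"):
        raise ValueError(f"unrecognized format: {format!r}")
    return URL_TEMPLATES[source].format(id=entry_id.upper(), ext=format)


def download_text(url, timeout=30.0):
    """Download `url`, re-raising any urllib failure as a chained OSError."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except (urllib.error.URLError, TimeoutError) as exc:
        # HTTPError (e.g. a 404 for an unknown ID) is a URLError subclass.
        raise OSError(f"download failed for {url}") from exc
```

Callers can then catch a single `OSError` and still inspect `exc.__cause__` when they need the HTTP status code.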

load ¶

load(
    path: str | PathLike[str],
    *,
    format: str | None = None,
    **kwargs: object,
) -> object

Load a structure or sequence file.

Format is inferred from the extension unless format is given. Additional kwargs are forwarded to the underlying reader.

Returns:

Type	Description
`object`	A :class:`molforge.core.Protein` for structure formats, or a list of :class:`molforge.io.FastaRecord` for FASTA.
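Extension-based inference with an explicit override can be sketched as a lookup table. The extensions come from the Supported formats list at the top of this page; the format labels, the `infer_format` name, and the decision to look through a trailing `.gz` are assumptions made for illustration.

```python
from pathlib import Path

# Extensions from the "Supported formats" list; the labels on the right
# are illustrative, not necessarily the strings molforge accepts.
EXTENSION_TO_FORMAT = {
    ".pdb": "pdb", ".ent": "pdb",
    ".cif": "cif", ".mmcif": "cif",
    ".fasta": "fasta", ".fa": "fasta", ".faa": "fasta", ".fna": "fasta",
    ".sdf": "sdf", ".mol": "sdf",
    ".mol2": "mol2",
    ".pdbqt": "pdbqt",
    ".pqr": "pqr",
}


def infer_format(path, format=None):
    """Return a format label for `path`; an explicit `format` wins."""
    if format is not None:
        return format
    suffixes = [s.lower() for s in Path(path).suffixes]
    # Assumption: look through a trailing ".gz" so "1ubq.pdb.gz" is "pdb".
    if suffixes and suffixes[-1] == ".gz":
        suffixes.pop()
    if not suffixes or suffixes[-1] not in EXTENSION_TO_FORMAT:
        raise ValueError(f"cannot infer format from {str(path)!r}")
    return EXTENSION_TO_FORMAT[suffixes[-1]]
```

Passing `format` explicitly is the escape hatch for files with missing or misleading extensions.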

save ¶

save(
    obj: object,
    path: str | PathLike[str],
    *,
    format: str | None = None,
    **kwargs: object,
) -> None

Save a structure or list of FASTA records to disk.

Format is inferred from the extension unless format is given.

read_fasta ¶

read_fasta(path: str | PathLike[str]) -> list[FastaRecord]

Read a FASTA file from disk.

Parameters:

Name	Type	Description	Default
`path`	`str \| PathLike[str]`	Path to a `.fasta` / `.fa` / `.faa` / `.fna` file. `.gz` suffix triggers gzip decompression.	required

Returns:

Type	Description
`list[FastaRecord]`	A list of :class:`FastaRecord` objects, in file order.
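Suffix-triggered gzip handling, which several readers and writers on this page share, can be sketched as a small opener. `open_text` is a hypothetical helper, and UTF-8 is an assumed encoding.

```python
import gzip
from pathlib import Path


def open_text(path, mode="rt"):
    """Open `path` as text, transparently handling a trailing .gz suffix."""
    path = Path(path)
    if path.suffix.lower() == ".gz":
        # gzip.open accepts text modes ("rt"/"wt") together with an encoding.
        return gzip.open(path, mode, encoding="utf-8")
    return open(path, mode, encoding="utf-8")
```

The same helper covers writing when called with `mode="wt"`.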

read_fasta_string ¶

read_fasta_string(text: str) -> Iterator[FastaRecord]

Parse FASTA-formatted text, yielding :class:FastaRecord objects.

Memory-efficient: yields records one at a time rather than building a list up front.
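A generator-based parser along these lines might look like the following sketch. `iter_fasta` and `ParsedRecord` are hypothetical names, and ignoring sequence lines that appear before the first header is an assumption. Note that `str.splitlines` still materializes the list of lines; only the records are produced lazily.

```python
from typing import Iterator, NamedTuple


class ParsedRecord(NamedTuple):
    id: str
    description: str
    sequence: str


def _build(header: str, chunks: list) -> ParsedRecord:
    # ID is the first whitespace-delimited token; the rest is the description.
    parts = header.split(None, 1)
    record_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return ParsedRecord(record_id, description, "".join(chunks))


def iter_fasta(text: str) -> Iterator[ParsedRecord]:
    """Yield one record at a time from FASTA-formatted text."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _build(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Drop internal whitespace so the sequence is one clean string.
            chunks.append("".join(line.split()))
    if header is not None:
        yield _build(header, chunks)
```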

write_fasta ¶

write_fasta(
    records: Iterable[FastaRecord | tuple[str, str]],
    path: str | PathLike[str],
    *,
    line_width: int = 80,
) -> None

Write FASTA records to disk.

Parameters:

Name	Type	Description	Default
`records`	`Iterable[FastaRecord \| tuple[str, str]]`	Iterable of :class:`FastaRecord` or `(id, sequence)` tuples.	required
`path`	`str \| PathLike[str]`	Destination path; `.gz` triggers gzip.	required
`line_width`	`int`	Maximum sequence characters per line. Set to `0` to emit each sequence on a single line.	`80`
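The `line_width` behaviour, including the `0` special case, can be illustrated with a short serializer. `format_fasta` is a hypothetical helper that accepts only `(id, sequence)` tuples; it is a sketch of the wrapping rule, not the library's writer.

```python
def format_fasta(records, line_width=80):
    """Serialize (id, sequence) pairs as FASTA text."""
    lines = []
    for record_id, sequence in records:
        lines.append(f">{record_id}")
        if line_width <= 0:
            # line_width=0 disables wrapping: one line per sequence.
            lines.append(sequence)
        else:
            lines.extend(
                sequence[i:i + line_width]
                for i in range(0, len(sequence), line_width)
            )
    return "".join(line + "\n" for line in lines)
```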

write_fasta_string ¶

write_fasta_string(
    records: Iterable[FastaRecord | tuple[str, str]],
    *,
    line_width: int = 80,
) -> str

Serialize records as FASTA-formatted text.

read_cif ¶

read_cif(
    path: str | PathLike[str],
    *,
    include_hydrogens: bool = True,
    altloc: str = "highest_occupancy",
) -> Protein

Read an mmCIF / PDBx file from disk.

Parameters:

Name	Type	Description	Default
`path`	`str \| PathLike[str]`	Path to a `.cif` or `.mmcif` file. `.gz` extension triggers gzip decompression.	required
`include_hydrogens`	`bool`	If `False`, drop hydrogen atoms during parsing.	`True`
`altloc`	`str`	Altloc-resolution strategy (same as :func:`read_pdb`): `"highest_occupancy"`, `"first"`, `"all"`, or a single alternate-location identifier (e.g. `"A"`).	`'highest_occupancy'`

Returns:

Type	Description
`Protein`	A :class:`Protein` with the parsed structure. `metadata` is populated with `pdb_id`, `title`, `experimental_method`, and `resolution` where available.

Raises:

Type	Description
`CIFParseError`	If the file is malformed or has no `_atom_site` loop.
`FileNotFoundError`	If the path doesn't exist.

read_cif_string ¶

read_cif_string(
    text: str,
    *,
    include_hydrogens: bool = True,
    altloc: str = "highest_occupancy",
) -> Protein

Parse mmCIF-formatted text into a :class:Protein.

See :func:read_cif for argument semantics.

write_cif ¶

write_cif(
    protein: Protein, path: str | PathLike[str]
) -> None

Write a :class:Protein to an mmCIF file.

write_cif_string ¶

write_cif_string(protein: Protein) -> str

Serialize a :class:Protein as mmCIF text.

Produces a compact CIF with a data_<id> header, the structure's metadata (where present), and a complete _atom_site loop. Round-trips cleanly through :func:read_cif_string.
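For orientation, a compact output of this shape could be produced roughly as follows. The `_atom_site` item names are standard PDBx/mmCIF items, but which columns molforge actually emits is not stated here, so the column selection, the `atoms_to_cif` name, and the dict-based atom layout are all assumptions. The sketch also skips CIF value quoting, which real atom names such as `O5'` would need.

```python
ATOM_SITE_COLUMNS = (
    "group_PDB", "id", "type_symbol", "label_atom_id", "label_comp_id",
    "label_asym_id", "label_seq_id", "Cartn_x", "Cartn_y", "Cartn_z",
)


def atoms_to_cif(entry_id, atoms):
    """Render atom dicts as a data_ block with a single _atom_site loop."""
    lines = [f"data_{entry_id}", "#", "loop_"]
    lines.extend(f"_atom_site.{column}" for column in ATOM_SITE_COLUMNS)
    for serial, atom in enumerate(atoms, start=1):
        x, y, z = atom["xyz"]
        # One whitespace-separated row per atom, in column order.
        lines.append(
            f"ATOM {serial} {atom['element']} {atom['name']} "
            f"{atom['resname']} {atom['chain']} {atom['resseq']} "
            f"{x:.3f} {y:.3f} {z:.3f}"
        )
    lines.append("#")
    return "\n".join(lines) + "\n"
```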

read_pdb ¶

read_pdb(
    path: str | PathLike[str],
    *,
    model: int | None = None,
    include_hydrogens: bool = True,
    altloc: str = "highest_occupancy",
) -> Protein

Read a PDB file from disk.

Parameters:

Name	Type	Description	Default
`path`	`str \| PathLike[str]`	Path to a `.pdb` file (may be gzipped if extension is `.gz`).	required
`model`	`int \| None`	Which model to load from a multi-model file. `None` (default) loads all models; an integer selects a single model by zero-based index (`0` is the first model).	`None`
`include_hydrogens`	`bool`	If `False`, drop hydrogen atoms during parsing.	`True`
`altloc`	`str`	Strategy for resolving alternate location indicators. `"highest_occupancy"` (default): keep the altloc with the highest occupancy per atom name. `"first"`: keep the first altloc encountered, drop the rest. `"all"`: keep all altlocs (atoms will share residue_id but differ on altloc field). A single-character string (e.g. `"A"`): keep only that altloc and the default (blank).	`'highest_occupancy'`
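The four `altloc` strategies can be sketched on a simplified atom representation. This is an illustration of the selection rules as described in the table, not the reader's implementation: `resolve_altlocs` is a hypothetical name, atoms are reduced to `(name, altloc, occupancy)` tuples for a single residue, and keeping the first conformer on an occupancy tie is an assumption.

```python
def resolve_altlocs(atoms, altloc="highest_occupancy"):
    """Filter one residue's atoms according to an altloc strategy.

    Each atom is a (name, altloc, occupancy) tuple; a blank altloc is "".
    """
    if altloc == "all":
        return list(atoms)
    if altloc in ("highest_occupancy", "first"):
        best = {}  # atom name -> chosen atom; dicts keep insertion order
        for atom in atoms:
            name, _, occupancy = atom
            current = best.get(name)
            if current is None:
                best[name] = atom
            elif altloc == "highest_occupancy" and occupancy > current[2]:
                # Strict ">" means ties keep the first conformer (assumption).
                best[name] = atom
        return list(best.values())
    if len(altloc) == 1:
        # A single identifier keeps that conformer plus blank-altloc atoms.
        return [a for a in atoms if a[1] in (altloc, "")]
    raise ValueError(f"unknown altloc strategy: {altloc!r}")
```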

Returns:

Type	Description
`Protein`	A :class:`Protein` holding the parsed structure. The protein's `metadata` dict is populated with any HEADER, TITLE, RESOLUTION, and EXPDTA records found.

Raises:

Type	Description
`PDBParseError`	If the file is malformed.
`FileNotFoundError`	If the path doesn't exist.

read_pdb_string ¶

read_pdb_string(
    text: str,
    *,
    model: int | None = None,
    include_hydrogens: bool = True,
    altloc: str = "highest_occupancy",
) -> Protein

Parse a PDB-formatted string into a :class:Protein.

See :func:read_pdb for argument semantics.

write_pdb ¶

write_pdb(
    protein: Protein,
    path: str | PathLike[str],
    *,
    write_end: bool = True,
) -> None

Write a :class:Protein to a PDB file.

Parameters:

Name	Type	Description	Default
`protein`	`Protein`	The structure to serialize.	required
`path`	`str \| PathLike[str]`	Destination path. A `.gz` suffix triggers gzip compression.	required
`write_end`	`bool`	Emit a final `END` record.	`True`

Raises:

Type	Description
`PDBWriteError`	If the structure exceeds PDB's hard limits (>99,999 atoms or >9,999 residues per chain).
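These limits follow from the fixed-width ATOM record: the serial number has five columns and the residue number has four. A pre-flight check could be sketched as below; `check_pdb_limits` and `PDBLimitError` are hypothetical stand-ins, and atoms are reduced to `(chain, residue_number)` pairs.

```python
from collections import Counter

MAX_ATOMS = 99_999  # five-column serial field
MAX_RESIDUES_PER_CHAIN = 9_999  # four-column residue-number field


class PDBLimitError(ValueError):
    """Hypothetical stand-in for molforge's PDBWriteError."""


def check_pdb_limits(atoms):
    """Raise if (chain, residue_number) pairs, one per atom, exceed the limits."""
    atoms = list(atoms)
    if len(atoms) > MAX_ATOMS:
        raise PDBLimitError(f"{len(atoms)} atoms exceeds {MAX_ATOMS}")
    # Deduplicate to residues first, then count residues in each chain.
    residues_per_chain = Counter(chain for chain, _ in set(atoms))
    for chain, count in residues_per_chain.items():
        if count > MAX_RESIDUES_PER_CHAIN:
            raise PDBLimitError(
                f"chain {chain!r} has {count} residues, "
                f"exceeding {MAX_RESIDUES_PER_CHAIN}"
            )
```

A caller that hits this error can fall back to mmCIF, which has no such limits.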

write_pdb_string ¶

write_pdb_string(
    protein: Protein, *, write_end: bool = True
) -> str

Serialize a :class:Protein into a PDB-formatted string.

is_alphafold_pdb ¶

is_alphafold_pdb(text_or_path: str | PathLike[str]) -> bool

Detect whether a PDB file/string is an AlphaFold prediction.

Heuristic: looks for ALPHAFOLD, PREDICTED MODEL, or ESMFOLD in the first 100 lines of HEADER / TITLE / REMARK records.
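One reading of this heuristic is sketched below: scan the first 100 lines of the text and test only HEADER, TITLE, and REMARK records for the marker strings. The function name `looks_like_prediction`, the case-insensitive match, and the text-only input (the real function also accepts a path) are assumptions.

```python
MARKERS = ("ALPHAFOLD", "PREDICTED MODEL", "ESMFOLD")
RECORD_TYPES = ("HEADER", "TITLE", "REMARK")


def looks_like_prediction(text, max_lines=100):
    """Return True if an early HEADER/TITLE/REMARK line names a predictor."""
    for line in text.splitlines()[:max_lines]:
        # str.startswith accepts a tuple of prefixes.
        if line.startswith(RECORD_TYPES):
            upper = line.upper()
            if any(marker in upper for marker in MARKERS):
                return True
    return False
```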

load_alphafold ¶

load_alphafold(path: str | PathLike[str]) -> Protein

Load an AlphaFold prediction, exposing pLDDT as metadata.

The protein is read via :func:molforge.io.read_pdb, then its metadata is populated with confidence information under two sets of keys:

Uniform folding-engine keys (preferred) — the same keys every molforge folding-engine wrapper sets, so downstream code can read confidence without caring which engine ran: confidence_per_atom, confidence_per_residue, mean_confidence, and engine (= "AlphaFold").
Legacy AlphaFold-specific keys (retained for backward compatibility): plddt, plddt_per_residue, mean_plddt, source (= "alphafold").

The two sets carry the same values; new code should prefer the uniform keys. See :mod:molforge.core.metadata_keys for the documented vocabulary.

The B-factor column is left intact for compatibility with downstream tools that still expect to find pLDDT there.
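The two key sets can be illustrated by deriving them from per-atom B-factor values. The key names are the ones listed above; everything else is an assumption made for this sketch, including the `confidence_metadata` name, the `(residue_key, b_factor)` input shape, the use of plain lists, averaging a residue's atoms to get its confidence, and taking the overall mean over residues rather than atoms.

```python
from collections import defaultdict
from statistics import fmean


def confidence_metadata(atoms):
    """Build the documented metadata keys from (residue_key, b_factor) pairs."""
    per_atom = [b_factor for _, b_factor in atoms]
    grouped = defaultdict(list)
    for residue_key, b_factor in atoms:
        grouped[residue_key].append(b_factor)
    # Assumption: a residue's confidence is the mean over its atoms.
    per_residue = [fmean(values) for values in grouped.values()]
    # Assumption: the overall mean is taken over residues, not atoms.
    mean = fmean(per_residue)
    return {
        # Uniform folding-engine keys (preferred).
        "confidence_per_atom": per_atom,
        "confidence_per_residue": per_residue,
        "mean_confidence": mean,
        "engine": "AlphaFold",
        # Legacy AlphaFold-specific keys carrying the same values.
        "plddt": per_atom,
        "plddt_per_residue": per_residue,
        "mean_plddt": mean,
        "source": "alphafold",
    }
```

Downstream code that reads only the uniform keys keeps working if the structure later comes from a different folding engine.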
