molforge.io

io

File I/O for molforge.

This subpackage provides parsers and writers for the file formats common in structural-biology workflows. The top-level entry points are :func:load, :func:save, and :func:fetch. :func:load and :func:save dispatch to the appropriate handler based on the file extension; :func:fetch downloads a structure by ID from a remote source.

Supported formats:

  • PDB (.pdb, .ent) — full read/write, the universal default.
  • mmCIF / PDBx (.cif, .mmcif) — full read/write; recommended for structures with >99,999 atoms (PDB's hard limit).
  • FASTA (.fasta, .fa, .faa, .fna) — sequence read/write.
  • SDF (.sdf, .mol) — small-molecule exchange; full read/write of V2000 (coordinates, elements, title, property block). V3000 is not yet supported.
  • MOL2 (.mol2) — Tripos small-molecule exchange; full read/write of the ATOM section (coordinates, elements via Tripos type prefix, atom names, partial charges, substructure info).
  • PDBQT (.pdbqt) — AutoDock / Vina format; full read/write of ATOM records with per-atom partial charges and AutoDock atom types, reusing the PDB reader for the leading columns. ROOT / BRANCH / TORSDOF rotatable-bond markers are read-tolerated; round-tripping preserves coordinates, charges, and types.
  • PQR (.pqr) — APBS / PDB2PQR with explicit per-atom charges and radii. The leading PDB-compatible columns are parsed as fixed-position; the charge and radius are whitespace-split from the trailing fields (PQR is not strictly fixed-column past the coordinates). Radii are attached to protein.metadata["radii"].
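
The mixed parsing strategy described for PQR (fixed columns up to the coordinates, whitespace-split fields after them) can be sketched as follows. This is a minimal illustration rather than molforge's actual reader: the function name `parse_pqr_atom_line` and the returned dict layout are hypothetical, and the slices assume the standard PDB column positions for the leading fields.

```python
def parse_pqr_atom_line(line: str) -> dict:
    """Parse one ATOM/HETATM line of a PQR file (illustrative sketch only)."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")

    # Leading fields: fixed PDB column positions (0-based slices).
    name = line[12:16].strip()
    res_name = line[17:20].strip()
    x = float(line[30:38])
    y = float(line[38:46])
    z = float(line[46:54])

    # Trailing fields: split on whitespace, because PQR writers do not
    # agree on column alignment after the coordinates.
    trailing = line[54:].split()
    if len(trailing) < 2:
        raise ValueError("missing charge and/or radius field")
    charge, radius = float(trailing[0]), float(trailing[1])

    return {
        "name": name,
        "res_name": res_name,
        "xyz": (x, y, z),
        "charge": charge,
        "radius": radius,
    }
```

A reader built this way could collect the `radius` values into a list and store it under a metadata key such as `"radii"`, which is where the documentation above says molforge attaches them.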

Convenience helpers:

  • :func:fetch: download a structure by PDB ID from RCSB, or by UniProt accession from the AlphaFold Protein Structure Database.
  • :func:load_alphafold: load an AlphaFold prediction, exposing pLDDT through metadata rather than leaving it buried in the B-factor column.
Example

import molforge as mf

protein = mf.load("1ubq.pdb")
mf.save(protein, "1ubq_clean.pdb")

FastaRecord dataclass

FastaRecord(
    id: str,
    sequence: str,
    description: str = "",
    metadata: dict[str, str] = dict(),
)

A single record from a FASTA file.

Attributes:

Name Type Description
id str

The first whitespace-delimited token after > on the header line.

description str

The rest of the header line, if any.

sequence str

The concatenated, whitespace-stripped sequence.

metadata dict[str, str]

Free-form metadata (e.g. for downstream tools).

header property

header: str

The full header line including ID and description.
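
For orientation, a record with these attributes could be modelled roughly as below. This is a stand-in sketch, not the library's class: `FastaRecordSketch` is a hypothetical name, the `dict()` default in the signature is assumed to be a per-instance default factory, and the exact shape of `header` (ID and description joined by one space, without the leading `>`) is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class FastaRecordSketch:
    """Hypothetical stand-in mirroring the documented FastaRecord fields."""

    id: str
    sequence: str
    description: str = ""
    # default_factory gives every record its own dict instead of a shared one.
    metadata: dict[str, str] = field(default_factory=dict)

    @property
    def header(self) -> str:
        # Assumption: ID and description joined by a single space, no ">".
        return f"{self.id} {self.description}".strip()
```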

CIFParseError

Bases: ValueError

Raised when an mmCIF file cannot be parsed.

CIFWriteError

Bases: ValueError

Raised when an in-memory structure cannot be serialized to mmCIF.

PDBParseError

Bases: ValueError

Raised when a PDB file cannot be parsed.

PDBWriteError

Bases: ValueError

Raised when an in-memory structure cannot be serialized to PDB.

fetch

fetch(
    pdb_id: str,
    *,
    source: str = "rcsb",
    format: str = "pdb",
    timeout: float = 30.0,
) -> Protein

Fetch a structure by ID from a remote source.

Downloads the structure over HTTPS and parses it into a :class:~molforge.core.Protein. Uses only the standard library (:mod:urllib), so it adds no dependency.

Parameters:

Name Type Description Default
pdb_id str

4-character PDB ID (for source="rcsb") or UniProt accession (for source="alphafold"). Case-insensitive for RCSB.

required
source str

"rcsb" for the RCSB Protein Data Bank, or "alphafold" for the AlphaFold Protein Structure Database.

'rcsb'
format str

"pdb" or "cif". AlphaFold DB only serves "pdb" and "cif"; both are supported.

'pdb'
timeout float

Network timeout in seconds for the download.

30.0

Returns:

Type Description
Protein

A :class:~molforge.core.Protein parsed from the downloaded file.

Raises:

Type Description
ValueError

If source or format is unrecognized, or pdb_id is empty.

OSError

If the download fails — network error, timeout, or a non-existent ID (which the server returns as HTTP 404). The underlying :class:urllib.error.URLError / :class:~urllib.error.HTTPError is chained as the cause.

Example

from molforge.io import fetch

protein = fetch("1ABC")                    # RCSB, PDB format
af = fetch("P00520", source="alphafold")   # AlphaFold DB
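
The validation and error-chaining behaviour described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not molforge's implementation: `build_url` and `download_text` are hypothetical helpers, and the URL templates are assumptions (the version suffix in AlphaFold DB file names in particular changes between database releases).

```python
import urllib.error
import urllib.request

# Assumed URL templates; check the current service documentation before use.
_URL_TEMPLATES = {
    "rcsb": "https://files.rcsb.org/download/{id}.{ext}",
    "alphafold": "https://alphafold.ebi.ac.uk/files/AF-{id}-F1-model_v4.{ext}",
}


def build_url(pdb_id: str, source: str = "rcsb", format: str = "pdb") -> str:
    """Validate the arguments and return the download URL."""
    if not pdb_id:
        raise ValueError("pdb_id must not be empty")
    if source not in _URL_TEMPLATES:
        raise ValueError(f"unrecognized source: {source!r}")
    if format not in ("pdb", "cif"):
        raise ValueError(f"unrecognized format: {format!r}")
    # RCSB IDs are case-insensitive, so normalise them to upper case.
    ident = pdb_id.upper() if source == "rcsb" else pdb_id
    return _URL_TEMPLATES[source].format(id=ident, ext=format)


def download_text(url: str, timeout: float = 30.0) -> str:
    """Download a URL, re-raising urllib failures as a chained OSError."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except urllib.error.URLError as exc:  # HTTPError is a subclass of URLError
        raise OSError(f"download failed: {url}") from exc
```

Using `raise ... from exc` keeps the original urllib exception available as `__cause__`, which matches the chaining described in the Raises table.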

load

load(
    path: str | PathLike[str],
    *,
    format: str | None = None,
    **kwargs: object,
) -> object

Load a structure or sequence file.

Format is inferred from the extension unless format is given. Additional kwargs are forwarded to the underlying reader.

Returns:

Type Description
object

A :class:molforge.core.Protein for structure formats, or a list of :class:molforge.io.FastaRecord for FASTA.
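
Extension-based format inference with an explicit override, as described for load, might look like the sketch below. The extension table is taken from the supported-formats list; the helper name `infer_format`, the format labels other than "pdb" and "cif", and the handling of a trailing `.gz` are assumptions made for illustration.

```python
from pathlib import Path

# Extensions from the supported-formats list; the labels are illustrative.
_EXT_TO_FORMAT = {
    ".pdb": "pdb", ".ent": "pdb",
    ".cif": "cif", ".mmcif": "cif",
    ".fasta": "fasta", ".fa": "fasta", ".faa": "fasta", ".fna": "fasta",
    ".sdf": "sdf", ".mol": "sdf",
    ".mol2": "mol2",
    ".pdbqt": "pdbqt",
    ".pqr": "pqr",
}


def infer_format(path, format=None) -> str:
    """Return the explicit format if given, else infer it from the extension."""
    if format is not None:
        return format
    suffixes = [s.lower() for s in Path(path).suffixes]
    # Assumption: a trailing .gz only signals compression, so look past it.
    if suffixes and suffixes[-1] == ".gz":
        suffixes.pop()
    if not suffixes or suffixes[-1] not in _EXT_TO_FORMAT:
        raise ValueError(f"cannot infer format from {str(path)!r}")
    return _EXT_TO_FORMAT[suffixes[-1]]
```

A dispatcher would then look up the reader or writer registered for the returned label and forward the remaining keyword arguments to it.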

save

save(
    obj: object,
    path: str | PathLike[str],
    *,
    format: str | None = None,
    **kwargs: object,
) -> None

Save a structure or list of FASTA records to disk.

Format is inferred from the extension unless format is given.

read_fasta

read_fasta(path: str | PathLike[str]) -> list[FastaRecord]

Read a FASTA file from disk.

Parameters:

Name Type Description Default
path str | PathLike[str]

Path to a .fasta / .fa / .faa / .fna file. .gz suffix triggers gzip decompression.

required

Returns:

Type Description
list[FastaRecord]

A list of :class:FastaRecord objects, in file order.
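
Several readers and writers on this page switch to gzip when the path ends in `.gz`. One way to implement that switch with the standard library is sketched here; `open_text` is a hypothetical helper, not part of the documented API.

```python
import gzip
from pathlib import Path


def open_text(path, mode: str = "r"):
    """Open a file in text mode, using gzip when the suffix is .gz.

    `mode` is "r" or "w"; the text flag is added for the gzip case.
    """
    path = Path(path)
    if path.suffix.lower() == ".gz":
        return gzip.open(path, mode + "t", encoding="utf-8")
    return open(path, mode, encoding="utf-8")
```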

read_fasta_string

read_fasta_string(text: str) -> Iterator[FastaRecord]

Parse FASTA-formatted text, yielding :class:FastaRecord objects.

Memory-efficient: yields records one at a time rather than building a list up front.
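
A generator-based parser of this kind can be sketched as below. It is an illustration only: `iter_fasta` is a hypothetical name, it yields plain `(id, description, sequence)` tuples instead of FastaRecord objects, and skipping any sequence text that appears before the first header is an assumption.

```python
from typing import Iterator


def _finish(header: str, chunks: list[str]) -> tuple[str, str, str]:
    # The ID is the first whitespace-delimited token; the rest is description.
    parts = header.split(None, 1)
    rec_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return rec_id, description, "".join(chunks)


def iter_fasta(text: str) -> Iterator[tuple[str, str, str]]:
    """Yield (id, description, sequence) one record at a time."""
    header = None
    chunks: list[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _finish(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Remove internal whitespace before concatenating sequence lines.
            chunks.append("".join(line.split()))
    if header is not None:
        yield _finish(header, chunks)
```

Because each record is yielded as soon as the next header is seen, a caller can stop early without the remaining records ever being assembled.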

write_fasta

write_fasta(
    records: Iterable[FastaRecord | tuple[str, str]],
    path: str | PathLike[str],
    *,
    line_width: int = 80,
) -> None

Write FASTA records to disk.

Parameters:

Name Type Description Default
records Iterable[FastaRecord | tuple[str, str]]

Iterable of :class:FastaRecord or (id, sequence) tuples.

required
path str | PathLike[str]

Destination path; .gz triggers gzip.

required
line_width int

Maximum sequence characters per line. Set to 0 to emit each sequence on a single line.

80

write_fasta_string

write_fasta_string(
    records: Iterable[FastaRecord | tuple[str, str]],
    *,
    line_width: int = 80,
) -> str

Serialize records as FASTA-formatted text.
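
The effect of line_width on the output can be illustrated with a small formatter. `format_record` is a hypothetical helper written for this example; treating negative widths the same as 0 is an assumption, since only 0 is documented.

```python
def format_record(rec_id: str, sequence: str, line_width: int = 80) -> str:
    """Format one record as FASTA text, wrapping the sequence."""
    lines = [f">{rec_id}"]
    if line_width <= 0:
        # 0 means "no wrapping": the whole sequence goes on one line.
        lines.append(sequence)
    else:
        lines.extend(
            sequence[i:i + line_width]
            for i in range(0, len(sequence), line_width)
        )
    return "\n".join(lines) + "\n"
```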

read_cif

read_cif(
    path: str | PathLike[str],
    *,
    include_hydrogens: bool = True,
    altloc: str = "highest_occupancy",
) -> Protein

Read an mmCIF / PDBx file from disk.

Parameters:

Name Type Description Default
path str | PathLike[str]

Path to a .cif or .mmcif file. .gz extension triggers gzip decompression.

required
include_hydrogens bool

If False, drop hydrogen atoms during parsing.

True
altloc str

Altloc-resolution strategy (same as :func:read_pdb): "highest_occupancy", "first", "all", or a single alternate-location identifier (e.g. "A").

'highest_occupancy'

Returns:

Type Description
Protein

A :class:Protein with the parsed structure. metadata is populated with pdb_id, title, experimental_method, and resolution where available.

Raises:

Type Description
CIFParseError

If the file is malformed or has no _atom_site loop.

FileNotFoundError

If the path doesn't exist.

read_cif_string

read_cif_string(
    text: str,
    *,
    include_hydrogens: bool = True,
    altloc: str = "highest_occupancy",
) -> Protein

Parse mmCIF-formatted text into a :class:Protein.

See :func:read_cif for argument semantics.

write_cif

write_cif(
    protein: Protein, path: str | PathLike[str]
) -> None

Write a :class:Protein to an mmCIF file.

write_cif_string

write_cif_string(protein: Protein) -> str

Serialize a :class:Protein as mmCIF text.

Produces a compact CIF with a data_<id> header, the structure's metadata (where present), and a complete _atom_site loop. Round-trips cleanly through :func:read_cif_string.

read_pdb

read_pdb(
    path: str | PathLike[str],
    *,
    model: int | None = None,
    include_hydrogens: bool = True,
    altloc: str = "highest_occupancy",
) -> Protein

Read a PDB file from disk.

Parameters:

Name Type Description Default
path str | PathLike[str]

Path to a .pdb file (may be gzipped if extension is .gz).

required
model int | None

Which model to load from a multi-model file. None (the default) loads all models; an integer selects a single model by zero-based index, so 0 is the first model.

None
include_hydrogens bool

If False, drop hydrogen atoms during parsing.

True
altloc str

Strategy for resolving alternate location indicators.

  • "highest_occupancy" (default): keep the altloc with the highest occupancy per atom name.
  • "first": keep the first altloc encountered, drop the rest.
  • "all": keep all altlocs (atoms will share residue_id but differ on altloc field).
  • A single-character string (e.g. "A"): keep only atoms with that altloc, plus atoms whose altloc field is blank.
'highest_occupancy'

Returns:

Type Description
Protein

A :class:Protein holding the parsed structure. The protein's metadata dict is populated with any HEADER, TITLE, RESOLUTION, and EXPDTA records found.

Raises:

Type Description
PDBParseError

If the file is malformed.

FileNotFoundError

If the path doesn't exist.
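
The four altloc strategies listed for read_pdb can be sketched on a simplified atom record. This is not the library's code: `AtomRow` and `resolve_altlocs` are hypothetical, atoms are keyed by (residue_id, name) with chains ignored, and breaking occupancy ties in favour of the first atom encountered is an assumption.

```python
from typing import NamedTuple


class AtomRow(NamedTuple):
    """Hypothetical minimal atom record used only for this sketch."""

    residue_id: int
    name: str
    altloc: str  # "" when the altloc column is blank
    occupancy: float


def resolve_altlocs(atoms, altloc: str = "highest_occupancy"):
    """Return the atoms kept under the given altloc strategy."""
    atoms = list(atoms)
    if altloc == "all":
        return atoms
    if altloc not in ("highest_occupancy", "first"):
        # Single identifier: keep that altloc plus atoms with a blank altloc.
        return [a for a in atoms if a.altloc in ("", altloc)]

    chosen = {}
    for atom in atoms:
        key = (atom.residue_id, atom.name)
        current = chosen.get(key)
        if current is None:
            chosen[key] = atom
        elif altloc == "highest_occupancy" and atom.occupancy > current.occupancy:
            chosen[key] = atom
    # dict preserves insertion order, so file order of first sightings is kept.
    return list(chosen.values())
```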

read_pdb_string

read_pdb_string(
    text: str,
    *,
    model: int | None = None,
    include_hydrogens: bool = True,
    altloc: str = "highest_occupancy",
) -> Protein

Parse a PDB-formatted string into a :class:Protein.

See :func:read_pdb for argument semantics.

write_pdb

write_pdb(
    protein: Protein,
    path: str | PathLike[str],
    *,
    write_end: bool = True,
) -> None

Write a :class:Protein to a PDB file.

Parameters:

Name Type Description Default
protein Protein

The structure to serialize.

required
path str | PathLike[str]

Destination path. A .gz suffix triggers gzip compression.

required
write_end bool

Whether to emit a final END record.

True

Raises:

Type Description
PDBWriteError

If the structure exceeds PDB's hard limits (>99,999 atoms or >9,999 residues per chain).
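
A pre-flight check for these limits could look like the following sketch. The names `check_pdb_limits` and `PDBWriteErrorSketch` are hypothetical, and the input is simplified to one `(chain_id, residue_number)` pair per atom, so insertion codes are ignored.

```python
from collections import Counter

MAX_ATOMS = 99_999
MAX_RESIDUES_PER_CHAIN = 9_999


class PDBWriteErrorSketch(ValueError):
    """Stand-in for the documented PDBWriteError."""


def check_pdb_limits(atoms) -> None:
    """Raise if the structure cannot fit PDB's fixed-width fields.

    `atoms` is an iterable of (chain_id, residue_number) pairs, one per atom.
    """
    atoms = list(atoms)
    if len(atoms) > MAX_ATOMS:
        raise PDBWriteErrorSketch(
            f"{len(atoms)} atoms exceeds the PDB limit of {MAX_ATOMS}"
        )
    # Count distinct residues per chain by de-duplicating the pairs first.
    residues_per_chain = Counter(chain for chain, _ in set(atoms))
    for chain, count in residues_per_chain.items():
        if count > MAX_RESIDUES_PER_CHAIN:
            raise PDBWriteErrorSketch(
                f"chain {chain!r} has {count} residues, "
                f"limit is {MAX_RESIDUES_PER_CHAIN}"
            )
```

When a check like this fails, writing mmCIF instead is the route the format overview recommends for large structures.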

write_pdb_string

write_pdb_string(
    protein: Protein, *, write_end: bool = True
) -> str

Serialize a :class:Protein into a PDB-formatted string.

is_alphafold_pdb

is_alphafold_pdb(text_or_path: str | PathLike[str]) -> bool

Detect whether a PDB file/string is an AlphaFold prediction.

Heuristic: looks for ALPHAFOLD, PREDICTED MODEL, or ESMFOLD in the first 100 lines of HEADER / TITLE / REMARK records.
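
The heuristic can be sketched for the string case as follows. `looks_like_prediction` is a hypothetical name, and two readings are assumed: that "first 100 lines" means the first 100 lines of the file, filtered to HEADER / TITLE / REMARK records, and that the marker match is case-insensitive.

```python
_MARKERS = ("ALPHAFOLD", "PREDICTED MODEL", "ESMFOLD")
_RECORDS = ("HEADER", "TITLE", "REMARK")


def looks_like_prediction(text: str, max_lines: int = 100) -> bool:
    """Return True if an early HEADER/TITLE/REMARK line names a predictor."""
    for line in text.splitlines()[:max_lines]:
        if not line.startswith(_RECORDS):
            continue
        upper = line.upper()
        if any(marker in upper for marker in _MARKERS):
            return True
    return False
```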

load_alphafold

load_alphafold(path: str | PathLike[str]) -> Protein

Load an AlphaFold prediction, exposing pLDDT as metadata.

The protein is read via :func:molforge.io.read_pdb, then its metadata is populated with confidence information under two sets of keys:

  • Uniform folding-engine keys (preferred) — the same keys every molforge folding-engine wrapper sets, so downstream code can read confidence without caring which engine ran: confidence_per_atom, confidence_per_residue, mean_confidence, and engine (= "AlphaFold").
  • Legacy AlphaFold-specific keys (retained for backward compatibility): plddt, plddt_per_residue, mean_plddt, source (= "alphafold").

The two sets carry the same values; new code should prefer the uniform keys. See :mod:molforge.core.metadata_keys for the documented vocabulary.

The B-factor column is left intact for compatibility with downstream tools that still expect to find pLDDT there.
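
The two parallel key sets can be illustrated by deriving them from per-atom B-factor values. This is a sketch under assumptions, not the library's code: `confidence_metadata` is a hypothetical helper, per-residue confidence is taken as the mean over the residue's atoms, and mean_confidence is taken as the mean over residues.

```python
from statistics import fmean


def confidence_metadata(residue_ids, b_factors) -> dict:
    """Build uniform and legacy confidence keys from per-atom pLDDT values."""
    per_atom = list(b_factors)

    # Group per-atom values by residue, keeping first-seen residue order.
    grouped: dict = {}
    for rid, value in zip(residue_ids, per_atom):
        grouped.setdefault(rid, []).append(value)
    per_residue = [fmean(values) for values in grouped.values()]
    mean = fmean(per_residue) if per_residue else 0.0

    meta = {
        # Uniform folding-engine keys (preferred).
        "confidence_per_atom": per_atom,
        "confidence_per_residue": per_residue,
        "mean_confidence": mean,
        "engine": "AlphaFold",
        # Legacy AlphaFold-specific keys carrying the same values.
        "plddt": per_atom,
        "plddt_per_residue": per_residue,
        "mean_plddt": mean,
        "source": "alphafold",
    }
    return meta
```

Downstream code written against the uniform keys would read `meta["confidence_per_residue"]` and `meta["engine"]` and never touch the B-factor column directly.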