molforge.sequence¶

sequence ¶

Sequence-level operations: alignment, mutations, composition, properties.

What's here: - :func:align / :func:needleman_wunsch / :func:smith_waterman — pairwise alignment with affine gaps and BLOSUM62/PAM250 matrices. - :func:identity — convenience for "what's the sequence identity" - :class:Mutation, :func:apply_mutation, :func:apply_mutations, :func:mutate_protein — point mutations on sequences and Proteins. - :func:composition, :func:molecular_weight, :func:gravy, :func:aromaticity, :func:length — sequence properties.

For sequence I/O see :mod:molforge.io (FASTA).

Alignment `dataclass` ¶

Alignment(
    aligned_a: str,
    aligned_b: str,
    score: float,
    identity: float,
    coverage_a: float,
    coverage_b: float,
    start_a: int,
    end_a: int,
    start_b: int,
    end_b: int,
)

Result of a pairwise sequence alignment.

Attributes:

Name	Type	Description
`aligned_a`	`str`	Top sequence with gaps inserted (`-` for gap).
`aligned_b`	`str`	Bottom sequence with gaps inserted.
`score`	`float`	Total alignment score.
`identity`	`float`	Fraction of aligned positions where residues match (excluding gap-vs-gap and gap-vs-residue positions).
`coverage_a`	`float`	Fraction of sequence A covered by the alignment.
`coverage_b`	`float`	Fraction of sequence B covered by the alignment.
`start_a`	`int`	Start index in original sequence A (inclusive).
`end_a`	`int`	End index in original sequence A (exclusive).
`start_b`	`int`	Start index in original sequence B (inclusive).
`end_b`	`int`	End index in original sequence B (exclusive).

length `property` ¶

length: int

Alignment length including gaps.

format ¶

format(width: int = 60) -> str

Format the alignment as a human-readable block.

Parameters:

Name	Type	Description	Default
`width`	`int`	Characters per row.	`60`

Returns:

Type	Description
`str`	Multi-line string showing aligned_a / matches / aligned_b
`str`	in stacked blocks of `width` columns.

Mutation `dataclass` ¶

Mutation(
    wild_type: str,
    position: int,
    mutant: str,
    chain_id: str | None = None,
)

A single point mutation.

Attributes:

Name	Type	Description
`wild_type`	`str`	One-letter code of the original residue.
`position`	`int`	1-indexed residue number (matches PDB `seq_id`).
`mutant`	`str`	One-letter code of the new residue.
`chain_id`	`str \| None`	Optional chain identifier (some workflows need it).

parse `classmethod` ¶

parse(spec: str) -> Mutation

Parse a mutation string like "A123V".

Supports chain prefix syntax "A:K42N" for chain-aware mutations.

Raises:

Type	Description
`ValueError`	If the spec doesn't match the expected format.

align ¶

align(
    a: str,
    b: str,
    *,
    mode: str = "global",
    matrix: str | None = "BLOSUM62",
    match: int = 2,
    mismatch: int = -1,
    gap_open: int = -10,
    gap_extend: int = -1,
) -> Alignment

Pairwise alignment entry point.

Parameters:

Name	Type	Description	Default
`mode`	`str`	`"global"` (Needleman-Wunsch) or `"local"` (Smith-Waterman).	`'global'`

See :func:needleman_wunsch / :func:smith_waterman for the rest.

identity ¶

identity(a: str, b: str, *, mode: str = 'global') -> float

Convenience: align and return only the identity score.

needleman_wunsch ¶

needleman_wunsch(
    a: str,
    b: str,
    *,
    matrix: str | None = "BLOSUM62",
    match: int = 2,
    mismatch: int = -1,
    gap_open: int = -10,
    gap_extend: int = -1,
) -> Alignment

Global pairwise alignment (Needleman-Wunsch with affine gaps).

Parameters:

Name	Type	Description	Default
`a`	`str`	First sequence to align.	required
`b`	`str`	Second sequence to align.	required
`matrix`	`str \| None`	Substitution matrix name (e.g. `"BLOSUM62"`, `"PAM250"`) or `None` to use `match` / `mismatch` instead.	`'BLOSUM62'`
`match`	`int`	Per-position match score when `matrix=None`.	`2`
`mismatch`	`int`	Per-position mismatch score when `matrix=None`.	`-1`
`gap_open`	`int`	Penalty for opening a new gap (added at gap start).	`-10`
`gap_extend`	`int`	Penalty for each additional gap position.	`-1`

Returns:

Name	Type	Description
`An`	`Alignment`	class:`Alignment` covering the full length of both sequences.

Notes

Implements affine gap penalties via the standard three-matrix formulation (M for match/mismatch, X for gap in A, Y for gap in B). Penalties are added to the score, so they should be negative.

smith_waterman ¶

smith_waterman(
    a: str,
    b: str,
    *,
    matrix: str | None = "BLOSUM62",
    match: int = 2,
    mismatch: int = -1,
    gap_open: int = -10,
    gap_extend: int = -1,
) -> Alignment

Local pairwise alignment (Smith-Waterman with affine gaps).

Finds the highest-scoring local subsequence pair.

Parameters:

Name	Type	Description	Default
`a`	`str`	First sequence to align.	required
`b`	`str`	Second sequence to align.	required
`matrix`	`str \| None`	see :func:`needleman_wunsch`.	`'BLOSUM62'`
`match`	`int`	see :func:`needleman_wunsch`.	`2`
`mismatch`	`int`	see :func:`needleman_wunsch`.	`-1`
`gap_open`	`int`	see :func:`needleman_wunsch`.	`-10`
`gap_extend`	`int`	see :func:`needleman_wunsch`.	`-1`

Returns:

Name	Type	Description
`An`	`Alignment`	class:`Alignment` covering only the best-scoring local region;
	`Alignment`	`start_` / `end_` give the bounds in the original sequences.

aromaticity ¶

aromaticity(sequence: str) -> float

Fraction of aromatic residues (F, W, Y).

A quick proxy for things like protein UV absorption.

gravy ¶

gravy(sequence: str) -> float

Grand average of hydropathy (GRAVY) — Kyte-Doolittle 1982.

Higher = more hydrophobic. Soluble globular proteins typically have GRAVY between -2 and +2.

length ¶

length(sequence: str) -> int

Number of standard amino acids in a sequence.

molecular_weight ¶

molecular_weight(sequence: str) -> float

Approximate molecular weight in Da.

Sums per-residue monoisotopic masses plus one water for the terminal H/OH. Non-standard residues are ignored.

available_matrices ¶

available_matrices() -> list[str]

List the named substitution matrices that ship with molforge.

get_matrix ¶

get_matrix(
    name: str,
) -> tuple[NDArray[np.int_], dict[str, int]]

Return (matrix, index_map) for a named substitution matrix.

Raises:

Type	Description
`KeyError`	If `name` is not a known matrix.

apply_mutation ¶

apply_mutation(
    sequence: str, mutation: Mutation | str
) -> str

Apply a single point mutation to a sequence string.

Parameters:

Name	Type	Description	Default
`sequence`	`str`	One-letter amino-acid sequence.	required
`mutation`	`Mutation \| str`	A :class:`Mutation` object or mutation string (e.g. `"A123V"`).	required

Returns:

Type	Description
`str`	The mutated sequence.

Raises:

Type	Description
`ValueError`	If the position is out of range, or the wild-type residue doesn't match what's actually at that position.

apply_mutations ¶

apply_mutations(
    sequence: str, mutations: str | Iterable[Mutation | str]
) -> str

Apply multiple mutations to a sequence.

Accepts a slash-delimited string or an iterable of mutations.

mutate_protein ¶

mutate_protein(
    protein: Protein,
    mutation: Mutation | str,
    *,
    chain_id: str | None = None,
) -> Protein

Apply a sequence-level mutation to a :class:Protein.

Updates the residue_name field of the affected residue. Side chain atoms are not rebuilt — for full structural mutation, route through Rosetta, OpenMM, or a side-chain repacker. This function is for sequence-bookkeeping and downstream tools that re-fold or repack on their own.

Parameters:

Name	Type	Description	Default
`protein`	`Protein`	The structure to mutate (a copy is returned; original is unmodified).	required
`mutation`	`Mutation \| str`	A :class:`Mutation` or mutation string.	required
`chain_id`	`str \| None`	Which chain to mutate. If omitted, uses the chain embedded in the mutation string (`"A:K42N"`) or the first chain in the protein.	`None`

Returns:

Type	Description
`Protein`	A new :class:`Protein` with the residue name updated.

Raises:

Type	Description
`ValueError`	If the position doesn't exist, or the wild-type doesn't match the residue actually present.

parse_mutations ¶

parse_mutations(spec: str) -> list[Mutation]

Parse a slash- (or comma- or whitespace-) delimited mutation string.

Example

parse_mutations("A123V/T56K") [Mutation('A', 123, 'V'), Mutation('T', 56, 'K')]

molforge.sequence¶

sequence ¶

Alignment dataclass ¶

length property ¶

format ¶

Mutation dataclass ¶

parse classmethod ¶

align ¶

identity ¶

needleman_wunsch ¶

smith_waterman ¶

aromaticity ¶

gravy ¶

length ¶

molecular_weight ¶

available_matrices ¶

get_matrix ¶

apply_mutation ¶

apply_mutations ¶

mutate_protein ¶

parse_mutations ¶

Alignment `dataclass` ¶

length `property` ¶

Mutation `dataclass` ¶

parse `classmethod` ¶