Skip to content

molforge.sequence

sequence

Sequence-level operations: alignment, mutations, composition, properties.

What's here: - :func:align / :func:needleman_wunsch / :func:smith_waterman — pairwise alignment with affine gaps and BLOSUM62/PAM250 matrices. - :func:identity — convenience for "what's the sequence identity" - :class:Mutation, :func:apply_mutation, :func:apply_mutations, :func:mutate_protein — point mutations on sequences and Proteins. - :func:composition, :func:molecular_weight, :func:gravy, :func:aromaticity, :func:length — sequence properties.

For sequence I/O see :mod:molforge.io (FASTA).

Alignment dataclass

Alignment(
    aligned_a: str,
    aligned_b: str,
    score: float,
    identity: float,
    coverage_a: float,
    coverage_b: float,
    start_a: int,
    end_a: int,
    start_b: int,
    end_b: int,
)

Result of a pairwise sequence alignment.

Attributes:

Name Type Description
aligned_a str

Top sequence with gaps inserted (- for gap).

aligned_b str

Bottom sequence with gaps inserted.

score float

Total alignment score.

identity float

Fraction of aligned positions where residues match (excluding gap-vs-gap and gap-vs-residue positions).

coverage_a float

Fraction of sequence A covered by the alignment.

coverage_b float

Fraction of sequence B covered by the alignment.

start_a int

Start index in original sequence A (inclusive).

end_a int

End index in original sequence A (exclusive).

start_b int

Start index in original sequence B (inclusive).

end_b int

End index in original sequence B (exclusive).

length property

length: int

Alignment length including gaps.

format

format(width: int = 60) -> str

Format the alignment as a human-readable block.

Parameters:

Name Type Description Default
width int

Characters per row.

60

Returns:

Type Description
str

Multi-line string showing aligned_a / matches / aligned_b

str

in stacked blocks of width columns.

Mutation dataclass

Mutation(
    wild_type: str,
    position: int,
    mutant: str,
    chain_id: str | None = None,
)

A single point mutation.

Attributes:

Name Type Description
wild_type str

One-letter code of the original residue.

position int

1-indexed residue number (matches PDB seq_id).

mutant str

One-letter code of the new residue.

chain_id str | None

Optional chain identifier (some workflows need it).

parse classmethod

parse(spec: str) -> Mutation

Parse a mutation string like "A123V".

Supports chain prefix syntax "A:K42N" for chain-aware mutations.

Raises:

Type Description
ValueError

If the spec doesn't match the expected format.

align

align(
    a: str,
    b: str,
    *,
    mode: str = "global",
    matrix: str | None = "BLOSUM62",
    match: int = 2,
    mismatch: int = -1,
    gap_open: int = -10,
    gap_extend: int = -1,
) -> Alignment

Pairwise alignment entry point.

Parameters:

Name Type Description Default
mode str

"global" (Needleman-Wunsch) or "local" (Smith-Waterman).

'global'

See :func:needleman_wunsch / :func:smith_waterman for the rest.

identity

identity(a: str, b: str, *, mode: str = 'global') -> float

Convenience: align and return only the identity score.

needleman_wunsch

needleman_wunsch(
    a: str,
    b: str,
    *,
    matrix: str | None = "BLOSUM62",
    match: int = 2,
    mismatch: int = -1,
    gap_open: int = -10,
    gap_extend: int = -1,
) -> Alignment

Global pairwise alignment (Needleman-Wunsch with affine gaps).

Parameters:

Name Type Description Default
a str

First sequence to align.

required
b str

Second sequence to align.

required
matrix str | None

Substitution matrix name (e.g. "BLOSUM62", "PAM250") or None to use match / mismatch instead.

'BLOSUM62'
match int

Per-position match score when matrix=None.

2
mismatch int

Per-position mismatch score when matrix=None.

-1
gap_open int

Penalty for opening a new gap (added at gap start).

-10
gap_extend int

Penalty for each additional gap position.

-1

Returns:

Name Type Description
An Alignment

class:Alignment covering the full length of both sequences.

Notes

Implements affine gap penalties via the standard three-matrix formulation (M for match/mismatch, X for gap in A, Y for gap in B). Penalties are added to the score, so they should be negative.

smith_waterman

smith_waterman(
    a: str,
    b: str,
    *,
    matrix: str | None = "BLOSUM62",
    match: int = 2,
    mismatch: int = -1,
    gap_open: int = -10,
    gap_extend: int = -1,
) -> Alignment

Local pairwise alignment (Smith-Waterman with affine gaps).

Finds the highest-scoring local subsequence pair.

Parameters:

Name Type Description Default
a str

First sequence to align.

required
b str

Second sequence to align.

required
matrix str | None

see :func:needleman_wunsch.

'BLOSUM62'
match int

see :func:needleman_wunsch.

2
mismatch int

see :func:needleman_wunsch.

-1
gap_open int

see :func:needleman_wunsch.

-10
gap_extend int

see :func:needleman_wunsch.

-1

Returns:

Name Type Description
An Alignment

class:Alignment covering only the best-scoring local region;

Alignment

start_* / end_* give the bounds in the original sequences.

aromaticity

aromaticity(sequence: str) -> float

Fraction of aromatic residues (F, W, Y).

A quick proxy for things like protein UV absorption.

gravy

gravy(sequence: str) -> float

Grand average of hydropathy (GRAVY) — Kyte-Doolittle 1982.

Higher = more hydrophobic. Soluble globular proteins typically have GRAVY between -2 and +2.

length

length(sequence: str) -> int

Number of standard amino acids in a sequence.

molecular_weight

molecular_weight(sequence: str) -> float

Approximate molecular weight in Da.

Sums per-residue monoisotopic masses plus one water for the terminal H/OH. Non-standard residues are ignored.

available_matrices

available_matrices() -> list[str]

List the named substitution matrices that ship with molforge.

get_matrix

get_matrix(
    name: str,
) -> tuple[NDArray[np.int_], dict[str, int]]

Return (matrix, index_map) for a named substitution matrix.

Raises:

Type Description
KeyError

If name is not a known matrix.

apply_mutation

apply_mutation(
    sequence: str, mutation: Mutation | str
) -> str

Apply a single point mutation to a sequence string.

Parameters:

Name Type Description Default
sequence str

One-letter amino-acid sequence.

required
mutation Mutation | str

A :class:Mutation object or mutation string (e.g. "A123V").

required

Returns:

Type Description
str

The mutated sequence.

Raises:

Type Description
ValueError

If the position is out of range, or the wild-type residue doesn't match what's actually at that position.

apply_mutations

apply_mutations(
    sequence: str, mutations: str | Iterable[Mutation | str]
) -> str

Apply multiple mutations to a sequence.

Accepts a slash-delimited string or an iterable of mutations.

mutate_protein

mutate_protein(
    protein: Protein,
    mutation: Mutation | str,
    *,
    chain_id: str | None = None,
) -> Protein

Apply a sequence-level mutation to a :class:Protein.

Updates the residue_name field of the affected residue. Side chain atoms are not rebuilt — for full structural mutation, route through Rosetta, OpenMM, or a side-chain repacker. This function is for sequence-bookkeeping and downstream tools that re-fold or repack on their own.

Parameters:

Name Type Description Default
protein Protein

The structure to mutate (a copy is returned; original is unmodified).

required
mutation Mutation | str

A :class:Mutation or mutation string.

required
chain_id str | None

Which chain to mutate. If omitted, uses the chain embedded in the mutation string ("A:K42N") or the first chain in the protein.

None

Returns:

Type Description
Protein

A new :class:Protein with the residue name updated.

Raises:

Type Description
ValueError

If the position doesn't exist, or the wild-type doesn't match the residue actually present.

parse_mutations

parse_mutations(spec: str) -> list[Mutation]

Parse a slash- (or comma- or whitespace-) delimited mutation string.

Example

parse_mutations("A123V/T56K") [Mutation('A', 123, 'V'), Mutation('T', 56, 'K')]