molforge.sequence¶
sequence ¶
Sequence-level operations: alignment, mutations, composition, properties.
What's here:
- :func:align / :func:needleman_wunsch / :func:smith_waterman
— pairwise alignment with affine gaps and BLOSUM62/PAM250 matrices.
- :func:identity — convenience for "what's the sequence identity"
- :class:Mutation, :func:apply_mutation, :func:apply_mutations,
:func:mutate_protein — point mutations on sequences and Proteins.
- :func:composition, :func:molecular_weight, :func:gravy,
:func:aromaticity, :func:length — sequence properties.
For sequence I/O see :mod:molforge.io (FASTA).
Alignment
dataclass
¶
Alignment(
aligned_a: str,
aligned_b: str,
score: float,
identity: float,
coverage_a: float,
coverage_b: float,
start_a: int,
end_a: int,
start_b: int,
end_b: int,
)
Result of a pairwise sequence alignment.
Attributes:
| Name | Type | Description |
|---|---|---|
aligned_a |
str
|
Top sequence with gaps inserted ( |
aligned_b |
str
|
Bottom sequence with gaps inserted. |
score |
float
|
Total alignment score. |
identity |
float
|
Fraction of aligned positions where residues match (excluding gap-vs-gap and gap-vs-residue positions). |
coverage_a |
float
|
Fraction of sequence A covered by the alignment. |
coverage_b |
float
|
Fraction of sequence B covered by the alignment. |
start_a |
int
|
Start index in original sequence A (inclusive). |
end_a |
int
|
End index in original sequence A (exclusive). |
start_b |
int
|
Start index in original sequence B (inclusive). |
end_b |
int
|
End index in original sequence B (exclusive). |
format ¶
Format the alignment as a human-readable block.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
width
|
int
|
Characters per row. |
60
|
Returns:
| Type | Description |
|---|---|
str
|
Multi-line string showing aligned_a / matches / aligned_b |
str
|
in stacked blocks of |
Mutation
dataclass
¶
A single point mutation.
Attributes:
| Name | Type | Description |
|---|---|---|
wild_type |
str
|
One-letter code of the original residue. |
position |
int
|
1-indexed residue number (matches PDB |
mutant |
str
|
One-letter code of the new residue. |
chain_id |
str | None
|
Optional chain identifier (some workflows need it). |
parse
classmethod
¶
Parse a mutation string like "A123V".
Supports chain prefix syntax "A:K42N" for chain-aware
mutations.
Raises:
| Type | Description |
|---|---|
ValueError
|
If the spec doesn't match the expected format. |
align ¶
align(
a: str,
b: str,
*,
mode: str = "global",
matrix: str | None = "BLOSUM62",
match: int = 2,
mismatch: int = -1,
gap_open: int = -10,
gap_extend: int = -1,
) -> Alignment
Pairwise alignment entry point.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
mode
|
str
|
|
'global'
|
See :func:needleman_wunsch / :func:smith_waterman for the rest.
identity ¶
Convenience: align and return only the identity score.
needleman_wunsch ¶
needleman_wunsch(
a: str,
b: str,
*,
matrix: str | None = "BLOSUM62",
match: int = 2,
mismatch: int = -1,
gap_open: int = -10,
gap_extend: int = -1,
) -> Alignment
Global pairwise alignment (Needleman-Wunsch with affine gaps).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
a
|
str
|
First sequence to align. |
required |
b
|
str
|
Second sequence to align. |
required |
matrix
|
str | None
|
Substitution matrix name (e.g. |
'BLOSUM62'
|
match
|
int
|
Per-position match score when |
2
|
mismatch
|
int
|
Per-position mismatch score when |
-1
|
gap_open
|
int
|
Penalty for opening a new gap (added at gap start). |
-10
|
gap_extend
|
int
|
Penalty for each additional gap position. |
-1
|
Returns:
| Name | Type | Description |
|---|---|---|
An |
Alignment
|
class: |
Notes
Implements affine gap penalties via the standard three-matrix formulation (M for match/mismatch, X for gap in A, Y for gap in B). Penalties are added to the score, so they should be negative.
smith_waterman ¶
smith_waterman(
a: str,
b: str,
*,
matrix: str | None = "BLOSUM62",
match: int = 2,
mismatch: int = -1,
gap_open: int = -10,
gap_extend: int = -1,
) -> Alignment
Local pairwise alignment (Smith-Waterman with affine gaps).
Finds the highest-scoring local subsequence pair.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
a
|
str
|
First sequence to align. |
required |
b
|
str
|
Second sequence to align. |
required |
matrix
|
str | None
|
see :func: |
'BLOSUM62'
|
match
|
int
|
see :func: |
2
|
mismatch
|
int
|
see :func: |
-1
|
gap_open
|
int
|
see :func: |
-10
|
gap_extend
|
int
|
see :func: |
-1
|
Returns:
| Name | Type | Description |
|---|---|---|
An |
Alignment
|
class: |
Alignment
|
|
aromaticity ¶
Fraction of aromatic residues (F, W, Y).
A quick proxy for things like protein UV absorption.
gravy ¶
Grand average of hydropathy (GRAVY) — Kyte-Doolittle 1982.
Higher = more hydrophobic. Soluble globular proteins typically have GRAVY between -2 and +2.
molecular_weight ¶
Approximate molecular weight in Da.
Sums per-residue monoisotopic masses plus one water for the terminal H/OH. Non-standard residues are ignored.
available_matrices ¶
List the named substitution matrices that ship with molforge.
get_matrix ¶
Return (matrix, index_map) for a named substitution matrix.
Raises:
| Type | Description |
|---|---|
KeyError
|
If |
apply_mutation ¶
Apply a single point mutation to a sequence string.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
sequence
|
str
|
One-letter amino-acid sequence. |
required |
mutation
|
Mutation | str
|
A :class: |
required |
Returns:
| Type | Description |
|---|---|
str
|
The mutated sequence. |
Raises:
| Type | Description |
|---|---|
ValueError
|
If the position is out of range, or the wild-type residue doesn't match what's actually at that position. |
apply_mutations ¶
Apply multiple mutations to a sequence.
Accepts a slash-delimited string or an iterable of mutations.
mutate_protein ¶
mutate_protein(
protein: Protein,
mutation: Mutation | str,
*,
chain_id: str | None = None,
) -> Protein
Apply a sequence-level mutation to a :class:Protein.
Updates the residue_name field of the affected residue. Side
chain atoms are not rebuilt — for full structural mutation, route
through Rosetta, OpenMM, or a side-chain repacker. This function is
for sequence-bookkeeping and downstream tools that re-fold or repack
on their own.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
protein
|
Protein
|
The structure to mutate (a copy is returned; original is unmodified). |
required |
mutation
|
Mutation | str
|
A :class: |
required |
chain_id
|
str | None
|
Which chain to mutate. If omitted, uses the chain
embedded in the mutation string ( |
None
|
Returns:
| Type | Description |
|---|---|
Protein
|
A new :class: |
Raises:
| Type | Description |
|---|---|
ValueError
|
If the position doesn't exist, or the wild-type doesn't match the residue actually present. |
parse_mutations ¶
Parse a slash- (or comma- or whitespace-) delimited mutation string.
Example
parse_mutations("A123V/T56K") [Mutation('A', 123, 'V'), Mutation('T', 56, 'K')]