Composition
The previous chapter argued that heterogeneity should be described by a regime chosen from a small taxonomy – Heterogeneity Regime 1: Discrete Ensemble, Heterogeneity Regime 2: Trajectory, Heterogeneity Regime 3: Continuous Landscape, with the factor graph regime as a sidebar – and that every descriptor attaches to a scope: a level of the physical Hierarchy or a grouping declared in Representation. This chapter is about the consequence: real systems almost never have just one heterogeneity descriptor. A crystallographic structure can simultaneously carry an atom-scoped B-factor, a residue-scoped altloc, and a chain-scoped TLS group; a cryo-EM reconstruction can carry an assembly-scoped ratcheting mode, a subunit-scoped rigid-body motion, and several entity-instance-scoped compositional states. The format has to support these descriptors coexisting, composing predictably, and being stored in different regimes and different materialization modes without any of them pinning the others to a particular choice.
The composition principle
The total configuration of the system in some sampled state \(s\) is the reference structure plus contributions from every scope along the Hierarchy path of each atom. In the notation of the previous chapter:
\[ x_i(s) = x_i^{\mathrm{ref}} + \sum_{\ell \in \mathrm{path}(i)} \Delta_i^{\ell}(s_\ell) \]
- \(i\) indexes a specific atom.
- \(x_i^{\mathrm{ref}}\) is the atom’s reference position, a single point in \(\mathbb{R}^3\).
- \(\mathrm{path}(i)\) is the list of Hierarchy ancestors of atom \(i\) – for example \([\text{atom }i, \text{residue }45, \text{chain }A, \text{assembly}]\) for a backbone atom.
- \(s\) is the total state tuple, with one entry per heterogeneity descriptor declared on the structure; \(s_\ell\) picks out the entry for the descriptor at level \(\ell\).
- \(\Delta_i^\ell(s_\ell)\) is the displacement vector in \(\mathbb{R}^3\) contributed to atom \(i\) by level \(\ell\) being in state \(s_\ell\). For a chain-scoped TLS group it is “apply the parameterized rigid-body transform to \(x_i^{\mathrm{ref}}\), then subtract \(x_i^{\mathrm{ref}}\)”; for a residue-scoped altloc in state \(B\) it is the stored delta for atom \(i\) in altloc \(B\); for an atom-scoped Gaussian jitter it is a sample from the per-atom displacement distribution.
The formula is agnostic to whether any particular \(s_\ell\) is discrete or continuous. A residue-scoped altloc variable takes values in \(\{A, B\}\); a chain-scoped rigid-body variable takes values in the rigid-motion group \(\mathrm{SE}(3)\), with three rotational and three translational degrees of freedom; an atom-scoped isotropic ADP variable is an implicit Gaussian. All three produce displacement vectors in \(\mathbb{R}^3\) that add linearly, so the same composition rule handles mixed-regime structures without special cases.
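The composition rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a format API: the names `compose`, `Vec3`, `DeltaFn`, and `ALTLOC_DELTAS` are hypothetical, all numeric values are made up, and the chain-level contribution is simplified to a pure translation rather than a full TLS transform.

```python
from typing import Callable, Dict, List, Tuple

Vec3 = Tuple[float, float, float]
# A delta function maps (reference position, state at this level) to a displacement.
DeltaFn = Callable[[Vec3, object], Vec3]


def compose(x_ref: Vec3, path: List[str],
            deltas: Dict[str, DeltaFn], state: Dict[str, object]) -> Vec3:
    """x_i(s) = x_i^ref + sum over levels in path(i) of Delta_i^level(s_level)."""
    x = list(x_ref)
    for level in path:
        if level not in deltas:      # no descriptor declared at this level
            continue
        d = deltas[level](x_ref, state[level])
        for k in range(3):
            x[k] += d[k]
    return (x[0], x[1], x[2])


# Hypothetical descriptors for one atom (all values illustrative).
ALTLOC_DELTAS = {"A": (0.0, 0.0, 0.0), "B": (0.0, 0.25, 0.0)}
deltas = {
    "chain": lambda x_ref, t: t,                    # continuous: a translation vector
    "residue": lambda x_ref, a: ALTLOC_DELTAS[a],   # discrete: stored delta per altloc
    "atom": lambda x_ref, u: u,                     # continuous: an already-sampled jitter
}
state = {"chain": (0.5, 0.0, 0.0), "residue": "B", "atom": (0.0, 0.0, -0.125)}
path = ["atom", "residue", "chain", "assembly"]

x = compose((1.0, 2.0, 3.0), path, deltas, state)
print(x)  # (1.5, 2.25, 2.875)
```

The loop never inspects whether a state is discrete or continuous; each level only has to return a vector in \(\mathbb{R}^3\). A level with no declared descriptor, here the assembly, contributes nothing.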
Visually, the contributions stack along the Hierarchy path of the atom:
\[ \underbrace{x_i(s)}_{\text{rendered}} \;=\; \underbrace{x_i^{\mathrm{ref}}}_{\text{reference}} \;+\; \underbrace{\Delta_i^{\text{assembly}}(s_{\text{asm}})}_{\text{e.g. ratcheting mode, 8D}} \;+\; \underbrace{\Delta_i^{\text{chain}}(s_{\text{ch}})}_{\text{e.g. TLS rigid-body}} \;+\; \underbrace{\Delta_i^{\text{residue}}(s_{\text{res}})}_{\text{e.g. altloc A/B}} \;+\; \underbrace{\Delta_i^{\text{atom}}(s_{\text{atm}})}_{\text{e.g. ADP jitter}} \]
Each \(\Delta_i^{\ell}\) is indexed by the state variable at its own scope and contributes a vector in \(\mathbb{R}^3\). The underbraces spell out what a typical deposition actually stores at each scope: a Regime 3 continuous mode at the assembly, a Regime 3 rigid-body parameterization at the chain, a Regime 1 discrete choice at the residue, and a Regime 3 Gaussian at the atom. Four scopes, two regimes, one formula.
There are two direct consequences. First, the regime chosen at one scope does not constrain the regime at any other scope: a chain-level continuous descriptor and a residue-level discrete descriptor coexist without friction. Second, the materialization mode is also per-scope: a chain-level descriptor can be stored as parameters (Materialization Mode C: Generative / External Reference) while a residue-level descriptor is stored as sparse deltas (Materialization Mode B: Delta Encoding) and a few named system-wide snapshots are stored as full enumerations (Materialization Mode A: Full Enumeration). Composition happens at render time, not at store time.
Assumption: independence or hierarchical nesting
We assume heterogeneity descriptors at different scopes are either independent or hierarchically nested, with explicit cross-scope factor potentials allowed as a rarely used escape hatch. The two common cases come first, followed by the escape hatch:
Independence. The joint distribution factors: \(p(s) = \prod_\ell p(s_\ell)\). Storage is linear in the sum of per-scope state spaces, not in their product. The chain-level TLS parameters, residue-level altlocs, and atom-level jitter of a typical crystal are usually assumed independent under this model, which is exactly how crystallographic refinement treats them.
Hierarchical nesting. The state space of a child-scope descriptor is conditional on the parent-scope descriptor. The canonical case is a compositional-ligand \(\supset\) conformational-loop nesting: when the chain has ligand bound (compositional state \(X\)), the loop can take conformations \(\{A, B\}\); when the ligand is absent (compositional state \(Y\)), the loop takes conformation \(\{C\}\) only. The child descriptor carries a pointer to the parent state that activates it, and invalid combinations are rejected at render time. This is a restricted Cartesian product – a DAG of (scope, state) nodes with edges encoding which parent-child combinations are legal.
Explicit cross-scope factor. A factor \(\psi(s_\ell, s_m)\) that weights combinations of states across two scopes without strictly forbidding them. This is the factor graph formalism from the architecture chapter, now potentially acting between levels rather than within a single level. We allow it but expect it to be rare: in practice, biological heterogeneity is structured and is better encoded via nesting than via generic coupling tables.
The independence-or-nesting assumption is what makes storage scale with the sum, not the product, of scope state spaces. Without it, the format has to fall back on storing full Cartesian state tables, which is Heterogeneity Regime 1: Discrete Ensemble at the whole-system scope and defeats the point of scoping descriptors.
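The storage argument and the nesting rule can both be made concrete with a short sketch. The numbers in the first part are illustrative (thirty two-state descriptors); the second part encodes the ligand/loop nesting described above. The names `ALLOWED`, `legal_combinations`, and `check` are hypothetical, not part of any defined API.

```python
import math

# Independence: thirty two-state residue descriptors (illustrative numbers).
state_counts = [2] * 30
stored_states = sum(state_counts)        # what the scoped format stores
joint_states = math.prod(state_counts)   # what a full Cartesian table would enumerate
print(stored_states, joint_states)  # 60 1073741824

# Nesting: each parent state lists the child states it activates.
ALLOWED = {"X": ["A", "B"],   # ligand bound: loop may be A or B
           "Y": ["C"]}        # ligand absent: loop is C only


def legal_combinations(allowed):
    """Enumerate the (parent, child) pairs permitted by the nesting."""
    return [(p, c) for p, children in allowed.items() for c in children]


def check(parent, child, allowed=ALLOWED):
    """Reject an invalid parent/child combination, as a renderer would."""
    if child not in allowed.get(parent, []):
        raise ValueError(
            f"child state {child!r} is not activated by parent state {parent!r}")


print(legal_combinations(ALLOWED))  # [('X', 'A'), ('X', 'B'), ('Y', 'C')]
```

The nested case yields three legal pairs instead of the six in the unrestricted product of two parent states and three child states, and the lookup table is exactly the set of DAG edges.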
Worked examples
Five scenarios at increasing complexity. Each lists every heterogeneity descriptor, its scope, its regime, and its materialization mode, and notes the coupling pattern. The last two are the interesting ones: they mix discrete and continuous regimes at many scopes of the Hierarchy in a single deposition, and Example 4 exercises all three heterogeneity regimes at once.
Example 1: Crystallographic kinase with inhibitor, 1.6 Å
A typical high-resolution crystal structure. Two domains that rock relative to each other across the unit cell, sidechain rotamer flexibility in a subset of surface residues, and residual thermal motion everywhere.
| Scope | Descriptor | Heterogeneity regime | Materialization mode | Stored |
|---|---|---|---|---|
| chain, N-terminal domain | rigid-body TLS | Regime 3: Continuous Landscape (degenerate) | Mode C: Generative | 20 TLS params |
| chain, C-terminal domain | rigid-body TLS | Regime 3: Continuous Landscape (degenerate) | Mode C: Generative | 20 TLS params |
| residue (~30 residues) | altloc (2 states each) | Regime 1: Discrete Ensemble | Mode B: Delta Encoding | sparse deltas per residue |
| atom (all atoms) | isotropic jitter | Regime 3: Continuous Landscape (degenerate) | Mode C: Generative | 1 scalar per atom |
Coupling: assumed independent across descriptors, which is how crystallographic refinement produces them. A render is obtained by sampling a rigid-body displacement for each domain from its stored TLS parameters, independently sampling an altloc for each of the \(\sim 30\) variable residues, and sampling isotropic jitter per atom. Total storage is \(O(N_{\mathrm{atom}})\), not \(O(N_{\mathrm{altloc\ states}} \cdot N_{\mathrm{atom}})\).
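A hedged sketch of the residue-scope and atom-scope part of such a render follows. It assumes the per-atom scalar is an isotropic B-factor in Å\(^2\), converted with the standard relation \(B = 8\pi^2 \langle u^2 \rangle\), and that each altloc carries an occupancy used as its sampling weight. The residue numbers, atom labels, occupancies, and B-factors are hypothetical, and the TLS draw is omitted for brevity.

```python
import math
import random


def sigma_from_b(b_iso: float) -> float:
    """Per-axis standard deviation (Å) from an isotropic B-factor: B = 8*pi^2*<u^2>."""
    return math.sqrt(b_iso / (8.0 * math.pi ** 2))


def sample_state(altloc_occupancy, b_factors, rng):
    """Draw one independent sample of the residue-scope and atom-scope state."""
    altlocs = {}
    for residue, occ in altloc_occupancy.items():
        labels = list(occ)
        weights = [occ[label] for label in labels]
        altlocs[residue] = rng.choices(labels, weights=weights)[0]
    jitter = {}
    for atom, b in b_factors.items():
        s = sigma_from_b(b)
        jitter[atom] = tuple(rng.gauss(0.0, s) for _ in range(3))
    return altlocs, jitter


# Hypothetical inputs: two variable residues and two atoms.
occupancy = {45: {"A": 0.6, "B": 0.4}, 112: {"A": 0.5, "B": 0.5}}
b_factors = {"CA_45": 15.0, "CB_45": 22.0}
altlocs, jitter = sample_state(occupancy, b_factors, random.Random(0))
```

Each residue and each atom is sampled without reference to any other, which is the independence assumption in executable form: the cost of a draw grows with the number of descriptors, not with the number of joint states.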
This is the most common layout and would cover a large fraction of existing PDB entries if mapped directly. Today it is deposited as a single-model multiconformer entry with an altloc column, TLS blocks, and a B-factor column – three descriptors encoded in three unrelated syntactic conventions. The architecture proposed here makes them three instances of the same abstraction.
Example 2: Protein with ambiguous ligand pose, training data for a docking model
The same binding site is observed with three distinct ligand poses across a batch of crystals, each pose associated with a different loop conformation of a nearby flexible region. This is an ML-training-data use case: the model consumer wants every valid combination as a training example, not an averaged blur.
| Scope | Descriptor | Heterogeneity regime | Materialization mode | Stored |
|---|---|---|---|---|
| entity instance (ligand) | compositional pose (3 states) | Regime 1: Discrete Ensemble | Mode A: Full Enumeration | full ligand coords \(\times 3\) |
| residue range 120–130 | loop conformation | Regime 1: Discrete Ensemble | Mode A: Full Enumeration | full loop coords, conditional on pose |
| atom (all atoms) | isotropic jitter | Regime 3: Continuous Landscape (degenerate) | Mode C: Generative | 1 scalar per atom |
Coupling: the loop-conformation descriptor is nested under the ligand-pose descriptor. Pose 1 activates loop state open, pose 2 activates half-open, pose 3 activates closed. A render first chooses a ligand pose, then chooses the unique loop state parented by that pose, then samples atom-level jitter.
The nesting collapses what would naively be a \(3 \times 3 = 9\)-state joint into 3 legal combinations, with storage linear in the size of the three loop conformations rather than quadratic. More importantly, an ML consumer iterating over states gets exactly the biologically meaningful combinations and no unphysical cross-products. The compositional/conformational hierarchy is visible in the data and legible at training time.
Example 3: 70S ribosome cryo-EM reconstruction
Large assembly with simultaneously active heterogeneity at several scopes: global ratcheting between the small and large subunits, local rRNA helix breathing, variable tRNA occupancy at three sites, and domain-level rearrangements within each subunit.
| Scope | Descriptor | Heterogeneity regime | Materialization mode | Stored |
|---|---|---|---|---|
| assembly (subunit pair) | ratcheting + head swivel | Regime 3: Continuous Landscape (low-dim) | Mode C: Generative | 2-axis basis + \(N_{\mathrm{sample}}\) coefficient vectors |
| subunit (small) | domain rigid-body motion | Regime 3: Continuous Landscape (low-dim) | Mode C: Generative | \(k\) modes + coefficients |
| subunit (large) | domain rigid-body motion | Regime 3: Continuous Landscape (low-dim) | Mode C: Generative | \(k\) modes + coefficients |
| secondary structure (rRNA helices, ~20 helices) | TLS breathing | Regime 3: Continuous Landscape (degenerate) | Mode C: Generative | 20 TLS params per helix |
| entity instance (tRNA at A, P, E sites) | occupancy (absent / present-tRNA-X / present-tRNA-Y) | Regime 1: Discrete Ensemble | Mode A: Full Enumeration | full tRNA coords when present |
| entity instance (tRNA, conditional on present) | pose within site | Regime 1: Discrete Ensemble | Mode A: Full Enumeration | handful of full poses per tRNA |
| protein-protein interface (several) | mini-TLS | Regime 3: Continuous Landscape (degenerate) | Mode C: Generative | 20 params per interface |
| atom (all atoms) | isotropic jitter | Regime 3: Continuous Landscape (degenerate) | Mode C: Generative | 1 scalar per atom |
Coupling: the tRNA pose descriptor at each site is nested under the occupancy descriptor for that site (“you have a pose only if there is a tRNA there”). The domain motions within each subunit may or may not be coupled to the global ratchet depending on how the data was analyzed; the format supports either by nesting the subunit-level descriptor under the assembly-level one, or by leaving them independent.
This layout demonstrates the payoff of the architecture at scale. A 70S ribosome has well over a hundred thousand atoms and a rich, multi-scale heterogeneity that is the entire point of the biological system. Flattening it into a single coords[state, atom, 3] tensor throws away the structure; flattening it into a Heterogeneity Regime 1: Discrete Ensemble of explicit full-assembly conformers produces combinatorially many states that are not what the microscope measured. Storing each kind of variation at its natural scope, in its natural regime, with the materialization that fits, preserves both the compression and the semantic structure.
Example 4: GPCR with MD trajectory and ligand-binding substates
A G-protein-coupled receptor is studied by a combination of crystallography and molecular dynamics. A reference crystal structure defines the Hierarchy; a 2 µs all-atom MD trajectory captures the local dynamics around the orthosteric binding site; and three discrete ligand-bound substates are fit to density in the MD-sampled conformational basin. This exercises all three heterogeneity regimes in a single deposition.
| Scope | Descriptor | Heterogeneity regime | Materialization mode | Stored |
|---|---|---|---|---|
| assembly | full MD trajectory | Regime 2: Trajectory | Mode C: External Reference | pointer to H5MD file, 2 µs at 10 ps stride = \(2 \times 10^5\) frames |
| assembly | tICA-derived 3-state MSM | Regime 1: Discrete Ensemble | Mode A: Full Enumeration | 3 representative structures + \(3 \times 3\) rate matrix |
| entity instance (ligand) | binding substate (3 poses) | Regime 1: Discrete Ensemble | Mode A: Full Enumeration | 3 ligand conformations |
| residue range (TM6 kink, residues 298–312) | backbone twist | Regime 3: Continuous Landscape (1D) | Mode C: Generative | 1 mode vector + \(N_{\mathrm{sample}}\) scalar coefficients |
| atom (all atoms) | anisotropic ADP | Regime 3: Continuous Landscape (degenerate) | Mode C: Generative | 6 params per atom |
Coupling: the MSM descriptor is marked as derived from the trajectory descriptor (not nested – a provenance link, not a state-space restriction). The ligand substate descriptor is nested under the MSM: substate 1 appears only in metastable state A, substate 2 in state B, substate 3 in either B or C. The TM6 twist and the ADPs are independent of everything else.
A consumer reading this deposition has several legitimate views: a trajectory-only view (just the Regime 2 payload) for training an MD surrogate, an ensemble view (the 3 MSM states \(\times\) the nested ligand substates) for docking studies, and a single static view (pick one MSM state, one substate, one twist coefficient, fold in atom-scope ADPs as B-factors) for PyMOL. None of these views requires converting the others.
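Two pieces of arithmetic in this example are easy to check in code: the frame count of the Regime 2 payload, and the size of the ensemble view once the nesting is applied. The sketch below is illustrative; the constant and function names are hypothetical, and the metastable-state labels A, B, C follow the coupling note above.

```python
# Regime 2 payload: frame count from trajectory length and stride (both in ps).
TRAJ_LENGTH_PS = 2_000_000   # 2 µs
STRIDE_PS = 10
n_frames = TRAJ_LENGTH_PS // STRIDE_PS
print(n_frames)  # 200000

# Nesting of ligand substates under MSM states, as stated in the coupling note.
SUBSTATE_ALLOWED_IN = {1: {"A"}, 2: {"B"}, 3: {"B", "C"}}
MSM_STATES = ["A", "B", "C"]


def ensemble_view():
    """Legal (MSM state, ligand substate) pairs for the ensemble view."""
    return [(m, s) for m in MSM_STATES
            for s in sorted(SUBSTATE_ALLOWED_IN)
            if m in SUBSTATE_ALLOWED_IN[s]]


pairs = ensemble_view()
print(pairs)  # [('A', 1), ('B', 2), ('B', 3), ('C', 3)]
```

The ensemble view therefore contains four legal combinations rather than the nine of the unrestricted \(3 \times 3\) product. Unlike Example 2, the nesting here is not one-to-one: substate 3 has two legal parents.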
Example 5: Cryo-ET subtomogram-averaged ribosome with multi-scale heterogeneity
A single-cell cryo-electron tomography reconstruction of bacterial ribosomes in situ: a few thousand particle instances embedded in the tomogram, each with its own rigid-body pose in the cell, each contributing to a subtomogram average that resolves the conformational landscape of translation. This is a case where the Hierarchy has an extra instance-scope axis above the assembly (the particles), and heterogeneity shows up at nearly every level.
| Scope | Descriptor | Heterogeneity regime | Materialization mode | Stored |
|---|---|---|---|---|
| particle (instance in tomogram) | rigid-body pose | Regime 3: Continuous Landscape (6-DOF) | Mode C: Generative | quaternion + translation per particle, \(\sim 4000\) particles |
| particle (instance) | functional-state class | Regime 1: Discrete Ensemble | particle \(\to\) class index | class label per particle (4 classes) |
| assembly (averaged) | ratcheting latent | Regime 3: Continuous Landscape (8D, cryoDRGN) | Mode C: Generative | latent coords per particle + decoder checkpoint |
| entity instance (A-site tRNA) | occupancy + identity | Regime 1: Discrete Ensemble | Mode A: Full Enumeration | 5 possibilities (absent, initiator, three elongators) |
| entity instance (P-site tRNA) | occupancy + identity | Regime 1: Discrete Ensemble | Mode A: Full Enumeration | 5 possibilities |
| secondary structure (intersubunit bridges) | hinge angle | Regime 3: Continuous Landscape (1D) | Mode C: Generative | 1 mode + coefficients, derived from cryoDRGN latent |
| atom (all atoms of the averaged model) | isotropic jitter | Regime 3: Continuous Landscape (degenerate) | Mode C: Generative | 1 scalar per atom |
Coupling: the functional-state class (a discrete descriptor per particle) is correlated with but not nested under the continuous cryoDRGN latent – class labels were assigned by clustering in latent space, so the class descriptor is a named-landmark overlay on the continuous landscape, sharing provenance with the Regime 3 descriptor. The A-site and P-site tRNA descriptors are jointly constrained by a cross-scope factor (\(\psi(s_A, s_P)\): certain combinations like “initiator in A-site plus elongator in P-site” are disallowed by biology) – this is the rare case where the factor graph formalism genuinely earns its keep. The intersubunit hinge angle is marked as derived from the ratcheting latent; a consumer that reads the latent can regenerate hinges, while a consumer that only wants rigid-body positions can use the hinge descriptor directly.
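A minimal sketch of the A-site/P-site factor, assuming it is stored as a weight function over pairs of states. The state labels and the function `psi` are hypothetical, and the only rule encoded is the single example named above (initiator in the A-site with an elongator in the P-site); a real deposition would presumably carry more rules, or graded weights rather than a hard zero.

```python
from itertools import product

# Hypothetical state labels for the 5 possibilities at each site.
ELONGATORS = ("elongator_1", "elongator_2", "elongator_3")
SITE_STATES = ("absent", "initiator") + ELONGATORS


def psi(s_a: str, s_p: str) -> float:
    """Weight for a joint (A-site, P-site) state; 0.0 marks a disallowed pair."""
    if s_a == "initiator" and s_p in ELONGATORS:
        return 0.0
    return 1.0


allowed = [(a, p) for a, p in product(SITE_STATES, repeat=2) if psi(a, p) > 0.0]
print(len(SITE_STATES) ** 2, len(allowed))  # 25 22
```

Because the rule couples two sibling descriptors rather than a parent and a child, it cannot be written as a nesting pointer, which is why a factor is the appropriate tool here.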
The Hierarchy here has one level above the assembly – the particle instance – because each particle has its own 6-DOF pose and its own class assignment. Per-particle heterogeneity (pose, class) lives at that scope; per-averaged-structure heterogeneity (latent, hinge, ADP) lives at the assembly and below. The format treats these uniformly: a descriptor is just a (scope, regime, mode) triple, and adding a new scope level above the assembly does not break anything below.
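The particle-scope pose contributes a displacement in the same way as any other rigid-body descriptor: apply the transform to the reference position, then subtract the reference position. The sketch below assumes the quaternion is stored as a unit quaternion in \((w, x, y, z)\) order and that rotation is applied before translation; both conventions, and the function names, are assumptions rather than anything the format specifies.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def rotate(q, v):
    """Rotate vector v by the unit quaternion q = (w, x, y, z)."""
    w, u = q[0], q[1:]
    t = tuple(2.0 * c for c in cross(u, v))
    ut = cross(u, t)
    return tuple(v[k] + w * t[k] + ut[k] for k in range(3))


def particle_delta(x_ref, q, translation):
    """Displacement from a rigid-body pose: transform x_ref, then subtract x_ref."""
    r = rotate(q, x_ref)
    return tuple(r[k] + translation[k] - x_ref[k] for k in range(3))


# Hypothetical pose: 90 degrees about z, then a shift of 10 along x.
half = math.radians(90.0) / 2.0
q = (math.cos(half), 0.0, 0.0, math.sin(half))
d = particle_delta((1.0, 0.0, 0.0), q, (10.0, 0.0, 0.0))
# d is (9.0, 1.0, 0.0) up to floating-point rounding
```

The result is an ordinary displacement vector, so the particle scope slots into the composition sum as one more term above the assembly.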
What composition buys
Three things, in decreasing order of concreteness:
Storage linearity. Independence and nesting both keep total storage to the sum of per-scope costs rather than the product. An ensemble with heterogeneity at five different scopes and a few states each (say seven) stores a few dozen per-scope states; it does not become a Cartesian product of \(7^5 \approx 17{,}000\) conformers.
Regime independence. Each scope picks the regime that fits its physics. A structure is not pushed into Regime 1: Discrete Ensemble just because one of its scopes is discrete, nor into Regime 3: Continuous Landscape just because another is continuous. The reverse pressure is also absent: there is no incentive to collapse everything into a single uniform regime for consistency.
Semantic legibility. Consumers – whether a visualization tool, an ML dataloader, or a refinement program – can ask “what varies at chain scope,” “what varies at residue scope,” “what varies at atom scope” and get back descriptors typed by scope. A PyMOL-like consumer can ignore everything above residue scope if it wants a single-model render. A training pipeline can iterate over the nested compositional/conformational combinations without flattening to 50,000 explicit states. A refinement engine can update the chain-scope TLS parameters without touching the residue-scope altlocs.
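The scope-typed query described here can be sketched with a small record type. The `Descriptor` class, its field encodings, and the `varies_at` helper are hypothetical; the four entries are the descriptors of Example 1.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Descriptor:
    name: str
    scope: str    # e.g. "chain", "residue", "atom"
    regime: int   # 1 = Discrete Ensemble, 2 = Trajectory, 3 = Continuous Landscape
    mode: str     # "A" = Full Enumeration, "B" = Delta Encoding, "C" = Generative


# The four descriptors of Example 1.
DESCRIPTORS = [
    Descriptor("TLS, N-terminal domain", "chain", 3, "C"),
    Descriptor("TLS, C-terminal domain", "chain", 3, "C"),
    Descriptor("altloc", "residue", 1, "B"),
    Descriptor("isotropic jitter", "atom", 3, "C"),
]


def varies_at(scope: str, descriptors: List[Descriptor]) -> List[Descriptor]:
    """Answer 'what varies at this scope?'."""
    return [d for d in descriptors if d.scope == scope]


print([d.name for d in varies_at("chain", DESCRIPTORS)])
# ['TLS, N-terminal domain', 'TLS, C-terminal domain']
```

A refinement engine in this picture would update only the records returned for `"chain"` and leave the residue-scope and atom-scope records untouched.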
The next chapter walks through the annotations and computed-pass machinery that sits on top of this layered core, and how established tools (Zarr, DataFusion, TileDB, HDF5) map onto the storage side of the four layers plus the composition rule above.