Composition

The previous chapter argued that heterogeneity should be described by a regime chosen from a small taxonomy – Heterogeneity Regime 1: Discrete Ensemble, Heterogeneity Regime 2: Trajectory, Heterogeneity Regime 3: Continuous Landscape, with the factor graph regime as a sidebar – and that every descriptor attaches to a scope: a level of the physical Hierarchy or a grouping declared in Representation. This chapter is about the consequence: real systems almost never have just one heterogeneity descriptor. A crystallographic structure can simultaneously carry an atom-scoped B-factor, a residue-scoped altloc, and a chain-scoped TLS group; a cryo-EM reconstruction can carry an assembly-scoped ratcheting mode, a subunit-scoped rigid-body motion, and several entity-instance-scoped compositional states. The format has to support these descriptors coexisting, composing predictably, and being stored in different regimes and different materialization modes without any of them pinning the others to a particular choice.

The composition principle

The total configuration of the system in some sampled state \(s\) is the reference structure plus contributions from every scope along the Hierarchy path of each atom. In the notation of the previous chapter:

\[ x_i(s) = x_i^{\mathrm{ref}} + \sum_{\ell \in \mathrm{path}(i)} \Delta_i^{\ell}(s_\ell) \]

\(i\) indexes a specific atom.
\(x_i^{\mathrm{ref}}\) is the atom’s reference position, a single point in \(\mathbb{R}^3\).
\(\mathrm{path}(i)\) is the list of Hierarchy ancestors of atom \(i\) – for example \([\text{atom }i, \text{residue }45, \text{chain }A, \text{assembly}]\) for a backbone atom.
\(s\) is the total state tuple, with one entry per heterogeneity descriptor declared on the structure; \(s_\ell\) picks out the entry for the descriptor at level \(\ell\).
\(\Delta_i^\ell(s_\ell)\) is the displacement vector in \(\mathbb{R}^3\) contributed to atom \(i\) by level \(\ell\) being in state \(s_\ell\). For a chain-scoped TLS group it is “apply the parameterized rigid-body transform to \(x_i^{\mathrm{ref}}\), then subtract \(x_i^{\mathrm{ref}}\)”; for a residue-scoped altloc in state \(B\) it is the stored delta for atom \(i\) in altloc \(B\); for an atom-scoped Gaussian jitter it is a sample from the per-atom displacement distribution.

The formula is agnostic to whether any particular \(s_\ell\) is discrete or continuous. A residue-scoped altloc variable takes values in \(\{A, B\}\); a chain-scoped rigid-body variable takes values in the six-dimensional group of rotations and translations \(SE(3)\); an atom-scoped isotropic ADP variable is an implicit Gaussian. All three produce displacement vectors that add linearly, so the same composition rule handles mixed-regime structures without special cases.
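The composition rule can be sketched in a few lines of Python. This is a minimal illustration, not the format's API: the names `compose`, `ALTLOC_DELTAS`, and `contributions`, the dictionary layout, the numeric values, and the simplification of the chain-scoped rigid-body transform to a pure translation are all assumptions made for the example.

```python
from typing import Callable, Dict, Sequence, Tuple

Vec = Tuple[float, float, float]


def add(a: Vec, b: Vec) -> Vec:
    """Component-wise sum of two 3-vectors."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def compose(x_ref: Vec,
            path: Sequence[str],
            state: Dict[str, object],
            contributions: Dict[str, Callable[[object], Vec]]) -> Vec:
    """x_i(s) = x_ref + sum over levels in path(i) of Delta_i^level(s_level)."""
    x = x_ref
    for level in path:
        delta_fn = contributions.get(level)
        if delta_fn is not None:  # a level with no descriptor contributes nothing
            x = add(x, delta_fn(state[level]))
    return x


# Hypothetical per-atom contributions (illustrative values, in angstroms).
ALTLOC_DELTAS = {"A": (0.0, 0.0, 0.0), "B": (0.4, -0.2, 0.1)}
contributions = {
    # Continuous: the state is a translation (rigid-body rotation omitted for brevity).
    "chain": lambda t: t,
    # Discrete: the state is an altloc label that selects a stored delta.
    "residue": lambda label: ALTLOC_DELTAS[label],
    # Continuous: the state is an already-sampled jitter vector.
    "atom": lambda jitter: jitter,
}

path = ["atom", "residue", "chain", "assembly"]  # no assembly descriptor in this toy case
state = {"atom": (0.0, 0.0, 0.05), "residue": "B", "chain": (1.0, 0.0, 0.0)}
x = compose((10.0, 20.0, 30.0), path, state, contributions)
```

The discrete and continuous descriptors go through the same loop. The only difference is what each `state[level]` entry holds: a label for the altloc, a vector for the translation and the jitter.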

Visually, the contributions stack along the Hierarchy path of the atom:

\[ \underbrace{x_i(s)}_{\text{rendered}} \;=\; \underbrace{x_i^{\mathrm{ref}}}_{\text{reference}} \;+\; \underbrace{\Delta_i^{\text{assembly}}(s_{\text{asm}})}_{\text{e.g. ratcheting mode, 8D}} \;+\; \underbrace{\Delta_i^{\text{chain}}(s_{\text{ch}})}_{\text{e.g. TLS rigid-body}} \;+\; \underbrace{\Delta_i^{\text{residue}}(s_{\text{res}})}_{\text{e.g. altloc A/B}} \;+\; \underbrace{\Delta_i^{\text{atom}}(s_{\text{atm}})}_{\text{e.g. ADP jitter}} \]

Each \(\Delta_i^{\ell}\) is indexed by the state variable at its own scope and contributes a vector in \(\mathbb{R}^3\). The underbraces spell out what a typical deposition actually stores at each scope: a Regime 3 continuous mode at the assembly, a Regime 3 rigid-body parameterization at the chain, a Regime 1 discrete choice at the residue, and a Regime 3 Gaussian at the atom. Four scopes, two regimes, one formula.

There are two direct consequences. First, the regime chosen at one scope does not constrain the regime at any other scope: a chain-level continuous descriptor and a residue-level discrete descriptor coexist without friction. Second, the materialization mode is also per-scope: a chain-level descriptor can be stored as parameters (Materialization Mode C: Generative / External Reference) while a residue-level descriptor is stored as sparse deltas (Materialization Mode B: Delta Encoding) and a few named system-wide snapshots are stored as full enumerations (Materialization Mode A: Full Enumeration). Composition happens at render time, not at store time.

Assumption: independence or hierarchical nesting

We assume heterogeneity descriptors at different scopes are either independent or hierarchically nested, with explicit cross-scope factor potentials allowed as a rarely-used escape hatch. The two common cases come first, followed by the escape hatch:

Independence. The joint distribution factors: \(p(s) = \prod_\ell p(s_\ell)\). Storage is linear in the sum of per-scope state spaces, not in their product. The chain-level TLS parameters, residue-level altlocs, and atom-level jitter of a typical crystal are usually assumed independent under this model, which is exactly how crystallographic refinement treats them.

Hierarchical nesting. The state space of a child-scope descriptor is conditional on the parent-scope descriptor. The canonical case is a compositional-ligand \(\supset\) conformational-loop nesting: when the chain has ligand bound (compositional state \(X\)), the loop can take conformations \(\{A, B\}\); when the ligand is absent (compositional state \(Y\)), the loop takes conformation \(\{C\}\) only. The child descriptor carries a pointer to the parent state that activates it, and invalid combinations are rejected at render time. This is a restricted Cartesian product – a DAG of (scope, state) nodes with edges encoding which parent-child combinations are legal.

Explicit cross-scope factor. A factor \(\psi(s_\ell, s_m)\) that weights combinations of states across two scopes without strictly forbidding them. This is the factor graph formalism from the architecture chapter, now potentially acting between levels rather than within a single level. We allow it but expect it to be rare: in practice, biological heterogeneity is structured and is better encoded via nesting than via generic coupling tables.

The independence-or-nesting assumption is what makes storage scale with the sum, not the product, of scope state spaces. Without it, the format has to fall back on storing full Cartesian state tables, which is Heterogeneity Regime 1: Discrete Ensemble at the whole-system scope and defeats the point of scoping descriptors.
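The sum-versus-product argument is easy to check numerically. The sketch below is illustrative only: the function names and the choice of 30 two-state descriptors are assumptions, and it counts stored states rather than bytes.

```python
import math
from typing import Sequence


def stored_if_factored(sizes: Sequence[int]) -> int:
    """Independent descriptors: store each per-scope state space once."""
    return sum(sizes)


def stored_if_enumerated(sizes: Sequence[int]) -> int:
    """Full Cartesian table: one entry per joint state."""
    return math.prod(sizes)


# Illustrative case: 30 independent two-state descriptors.
sizes = [2] * 30
factored = stored_if_factored(sizes)      # 60
enumerated = stored_if_enumerated(sizes)  # 1073741824, i.e. 2**30
```

Under nesting the count is the number of legal (parent, child) pairs, which lies between these two extremes but stays far closer to the sum whenever each parent state activates only a few child states.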

Worked examples

Five scenarios at increasing complexity. Each lists every heterogeneity descriptor, its scope, its regime, and its materialization mode, and notes the coupling pattern. The last two are the interesting ones: Example 4 mixes all three heterogeneity regimes across the Hierarchy in a single deposition, and Example 5 adds a scope level above the assembly.

Example 1: Crystallographic kinase with inhibitor, 1.6 Å

A typical high-resolution crystal structure. Two domains that rock relative to each other across the unit cell, sidechain rotamer flexibility in a subset of surface residues, and residual thermal motion everywhere.

Scope	Descriptor	Heterogeneity regime	Materialization mode	Stored
chain, N-terminal domain	rigid-body TLS	Regime 3: Continuous Landscape (degenerate)	Mode C: Generative	20 TLS params
chain, C-terminal domain	rigid-body TLS	Regime 3: Continuous Landscape (degenerate)	Mode C: Generative	20 TLS params
residue (~30 residues)	altloc (2 states each)	Regime 1: Discrete Ensemble	Mode B: Delta Encoding	sparse deltas per residue
atom (all atoms)	isotropic jitter	Regime 3: Continuous Landscape (degenerate)	Mode C: Generative	1 scalar per atom

Coupling: assumed independent across descriptors, which is how crystallographic refinement produces them. A render is obtained by sampling a rigid-body displacement from the TLS distribution of each domain, independently sampling altloc for each of the \(\sim 30\) variable residues, and sampling isotropic jitter per atom. Total storage is \(O(N_{\mathrm{atom}})\), not \(O(N_{\mathrm{altloc\ states}} \cdot N_{\mathrm{atom}})\), where \(N_{\mathrm{altloc\ states}} \approx 2^{30}\) counts the joint altloc combinations an explicit enumeration would have to store.
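The atom-scoped jitter step can be sketched as follows, assuming the one scalar stored per atom is a conventional isotropic B-factor in Å², related to the mean-square displacement along one axis by \(B = 8\pi^2 \langle u^2 \rangle\). The function names and the example value of 20 Å² are illustrative, not part of the format.

```python
import math
import random
from typing import Tuple


def sigma_from_b(b_iso: float) -> float:
    """Per-axis standard deviation (angstroms) from B = 8 * pi^2 * <u^2>."""
    return math.sqrt(b_iso / (8.0 * math.pi ** 2))


def sample_jitter(b_iso: float, rng: random.Random) -> Tuple[float, float, float]:
    """Draw one isotropic Gaussian displacement for a single atom."""
    s = sigma_from_b(b_iso)
    return (rng.gauss(0.0, s), rng.gauss(0.0, s), rng.gauss(0.0, s))


rng = random.Random(0)             # seeded only to make the sketch repeatable
sigma = sigma_from_b(20.0)         # roughly 0.50 angstroms for B = 20 angstroms^2
delta_atom = sample_jitter(20.0, rng)
```

Because the displacement is regenerated from the scalar at render time, nothing beyond the B-factor itself has to be stored for this descriptor.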

This is the most common layout and would cover a large fraction of existing PDB entries if mapped directly. Today it is deposited as a single-model multiconformer entry with an altloc column, TLS blocks, and a B-factor column – three descriptors encoded in three unrelated syntactic conventions. The architecture proposed here makes them three instances of the same abstraction.

Example 2: Protein with ambiguous ligand pose, training data for a docking model

The same binding site is observed with three distinct ligand poses across a batch of crystals, each pose associated with a different loop conformation of a nearby flexible region. This is an ML-training-data use case: the model consumer wants every valid combination as a training example, not an averaged blur.

Scope	Descriptor	Heterogeneity regime	Materialization mode	Stored
entity instance (ligand)	compositional pose (3 states)	Regime 1: Discrete Ensemble	Mode A: Full Enumeration	full ligand coords \(\times 3\)
residue range 120–130	loop conformation	Regime 1: Discrete Ensemble	Mode A: Full Enumeration	full loop coords, conditional on pose
atom (all atoms)	isotropic jitter	Regime 3: Continuous Landscape (degenerate)	Mode C: Generative	1 scalar per atom

Coupling: the loop-conformation descriptor is nested under the ligand-pose descriptor. Pose 1 activates loop state open, pose 2 activates half-open, pose 3 activates closed. A render first chooses a ligand pose, then chooses the unique loop state parented by that pose, then samples atom-level jitter.

The nesting collapses what would naively be a \(3 \times 3 = 9\)-state joint into 3 legal combinations, with storage linear in the size of the three loop conformations rather than quadratic. More importantly, an ML consumer iterating over states gets exactly the biologically meaningful combinations and no unphysical cross-products. The compositional/conformational hierarchy is visible in the data and legible at training time.
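A minimal sketch of the nested enumeration, under assumed names: `LOOP_BY_POSE` is a hypothetical parent-to-child table using the pose and loop labels from this example, and the uniform choice of pose in `render_choice` is an assumption, since the example gives no weights.

```python
import random
from itertools import product
from typing import Dict, List, Tuple

# Hypothetical nesting table: each ligand pose lists the loop states it activates.
LOOP_BY_POSE: Dict[str, List[str]] = {
    "pose_1": ["open"],
    "pose_2": ["half-open"],
    "pose_3": ["closed"],
}

# Naive joint: every pose paired with every loop state.
naive = list(product(LOOP_BY_POSE, ["open", "half-open", "closed"]))

# Nested joint: only the pairs connected by a parent -> child edge.
legal = [(pose, loop) for pose, loops in LOOP_BY_POSE.items() for loop in loops]


def render_choice(nesting: Dict[str, List[str]], rng: random.Random) -> Tuple[str, str]:
    """Pick a parent state first, then a child state that the parent activates."""
    pose = rng.choice(sorted(nesting))
    loop = rng.choice(nesting[pose])
    return pose, loop


choice = render_choice(LOOP_BY_POSE, random.Random(0))
```

A dataloader that iterates over `legal` visits the 3 meaningful combinations and never constructs the 6 unphysical ones present in `naive`.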

Example 3: 70S ribosome cryo-EM reconstruction

Large assembly with simultaneously active heterogeneity at several scopes: global ratcheting between the small and large subunits, local rRNA helix breathing, variable tRNA occupancy at three sites, and domain-level rearrangements within each subunit.

Scope	Descriptor	Heterogeneity regime	Materialization mode	Stored
assembly (subunit pair)	ratcheting + head swivel	Regime 3: Continuous Landscape (low-dim)	Mode C: Generative	2-axis basis + \(N_{\mathrm{sample}}\) coefficient vectors
subunit (small)	domain rigid-body motion	Regime 3: Continuous Landscape (low-dim)	Mode C: Generative	\(k\) modes + coefficients
subunit (large)	domain rigid-body motion	Regime 3: Continuous Landscape (low-dim)	Mode C: Generative	\(k\) modes + coefficients
secondary structure (rRNA helices, ~20 helices)	TLS breathing	Regime 3: Continuous Landscape (degenerate)	Mode C: Generative	20 TLS params per helix
entity instance (tRNA at A, P, E sites)	occupancy (absent / present-tRNA-X / present-tRNA-Y)	Regime 1: Discrete Ensemble	Mode A: Full Enumeration	full tRNA coords when present
entity instance (tRNA, conditional on present)	pose within site	Regime 1: Discrete Ensemble	Mode A: Full Enumeration	handful of full poses per tRNA
protein-protein interface (several)	mini-TLS	Regime 3: Continuous Landscape (degenerate)	Mode C: Generative	20 params per interface
atom (all atoms)	isotropic jitter	Regime 3: Continuous Landscape (degenerate)	Mode C: Generative	1 scalar per atom

Coupling: the tRNA pose descriptor at each site is nested under the occupancy descriptor for that site (“you have a pose only if there is a tRNA there”). The domain motions within each subunit may or may not be coupled to the global ratchet depending on how the data was analyzed; the format supports either by nesting the subunit-level descriptor under the assembly-level one, or by leaving them independent.

This layout demonstrates the payoff of the architecture at scale. A 70S ribosome has well over a hundred thousand atoms and a rich, multi-scale heterogeneity that is the entire point of the biological system. Flattening it into a single-model coords[state, atom, 3] tensor throws away the structure; flattening it into a Heterogeneity Regime 1: Discrete Ensemble of explicit full-assembly conformers produces combinatorially many states that are not what the microscope measured. Storing each kind of variation at its natural scope, in its natural regime, with the materialization that fits, preserves both the compression and the semantic structure.

Example 4: GPCR with MD trajectory and ligand-binding substates

A G-protein-coupled receptor is studied by a combination of crystallography and molecular dynamics. A reference crystal structure defines the Hierarchy; a 2 µs all-atom MD trajectory captures the local dynamics around the orthosteric binding site; and three discrete ligand-bound substates are fit to density in the MD-sampled conformational basin. This exercises all three heterogeneity regimes in a single deposition.

Scope	Descriptor	Heterogeneity regime	Materialization mode	Stored
assembly	full MD trajectory	Regime 2: Trajectory	Mode C: External Reference	pointer to H5MD file, 2 µs at 10 ps stride = \(2 \times 10^5\) frames
assembly	tICA-derived 3-state MSM	Regime 1: Discrete Ensemble	Mode A: Full Enumeration	3 representative structures + \(3 \times 3\) rate matrix
entity instance (ligand)	binding substate (3 poses)	Regime 1: Discrete Ensemble	Mode A: Full Enumeration	3 ligand conformations
residue range (TM6 kink, residues 298–312)	backbone twist	Regime 3: Continuous Landscape (1D)	Mode C: Generative	1 mode vector + \(N_{\mathrm{sample}}\) scalar coefficients
atom (all atoms)	anisotropic ADP	Regime 3: Continuous Landscape (degenerate)	Mode C: Generative	6 params per atom

Coupling: the MSM descriptor is marked as derived from the trajectory descriptor (not nested – a provenance link, not a state-space restriction). The ligand substate descriptor is nested under the MSM: substate 1 appears only in metastable state A, substate 2 in state B, substate 3 in either B or C. The TM6 twist and the ADPs are independent of everything else.

A consumer reading this deposition has several legitimate views: a trajectory-only view (just the Regime 2 payload) for training an MD surrogate, an ensemble view (the 3 MSM states \(\times\) the nested ligand substates) for docking studies, and a single static view (pick one MSM state, one substate, one twist coefficient, fold in atom-scope ADPs as B-factors) for PyMOL. None of these views requires converting the others.
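Folding an anisotropic ADP into a single B-factor for the static view can be sketched with the standard equivalent-isotropic relation \(B_{\mathrm{eq}} = 8\pi^2 \cdot \mathrm{tr}(U)/3\). The sketch assumes the six stored parameters are the Cartesian components \(U_{11}, U_{22}, U_{33}, U_{12}, U_{13}, U_{23}\) in Å²; that layout and the function name are assumptions, not something the deposition table specifies.

```python
import math


def b_equivalent(u11: float, u22: float, u33: float,
                 u12: float = 0.0, u13: float = 0.0, u23: float = 0.0) -> float:
    """Equivalent isotropic B-factor from a Cartesian ADP tensor.

    Only the diagonal enters the trace; the off-diagonal terms are accepted
    so that all six stored parameters can be passed through unchanged.
    """
    return 8.0 * math.pi ** 2 * (u11 + u22 + u33) / 3.0


# Illustrative tensor, values in angstroms^2.
b_eq = b_equivalent(0.01, 0.02, 0.03, u12=0.002)
```

The conversion is lossy by design: the static view discards the directionality of the ADP, while the atom-scope descriptor in the deposition keeps all six parameters.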

Example 5: Cryo-ET subtomogram-averaged ribosome with multi-scale heterogeneity

A single-cell cryo-electron tomography reconstruction of bacterial ribosomes in situ: a few thousand particle instances embedded in the tomogram, each with its own rigid-body pose in the cell, each contributing to a subtomogram average that resolves the conformational landscape of translation. This is a case where the Hierarchy has an extra instance-scope axis above the assembly (the particles), and heterogeneity shows up at nearly every level.

Scope	Descriptor	Heterogeneity regime	Materialization mode	Stored
particle (instance in tomogram)	rigid-body pose	Regime 3: Continuous Landscape (6-DOF)	Mode C: Generative	quaternion + translation per particle, \(\sim 4000\) particles
particle (instance)	functional-state class	Regime 1: Discrete Ensemble	particle \(\to\) class index	class label per particle (4 classes)
assembly (averaged)	ratcheting latent	Regime 3: Continuous Landscape (8D, cryoDRGN)	Mode C: Generative	latent coords per particle + decoder checkpoint
entity instance (A-site tRNA)	occupancy + identity	Regime 1: Discrete Ensemble	Mode A: Full Enumeration	5 possibilities (absent, initiator, three elongators)
entity instance (P-site tRNA)	occupancy + identity	Regime 1: Discrete Ensemble	Mode A: Full Enumeration	5 possibilities
secondary structure (intersubunit bridges)	hinge angle	Regime 3: Continuous Landscape (1D)	Mode C: Generative	1 mode + coefficients, derived from cryoDRGN latent
atom (all atoms of the averaged model)	isotropic jitter	Regime 3: Continuous Landscape (degenerate)	Mode C: Generative	1 scalar per atom

Coupling: the functional-state class (a discrete descriptor per particle) is correlated with but not nested under the continuous cryoDRGN latent – class labels were assigned by clustering in latent space, so the class descriptor is a named-landmark overlay on the continuous landscape, sharing provenance with the Regime 3 descriptor. The A-site and P-site tRNA descriptors are jointly constrained by a cross-scope factor (\(\psi(s_A, s_P)\): certain combinations like “initiator in A-site plus elongator in P-site” are disallowed by biology) – this is the rare case where the factor graph formalism genuinely earns its keep. The intersubunit hinge angle is marked as derived from the ratcheting latent; a consumer that reads the latent can regenerate hinges, while a consumer that only wants rigid-body positions can use the hinge descriptor directly.
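The A-site/P-site factor can be sketched as a pairwise weight applied to two otherwise independent marginals. Everything specific here is an assumption for illustration: the state labels, the uniform marginals, and a factor that zeroes only the one combination named above (initiator in the A-site with an elongator in the P-site).

```python
from itertools import product
from typing import Callable, Dict, Tuple

STATES = ["absent", "initiator", "elongator_1", "elongator_2", "elongator_3"]


def psi(a_site: str, p_site: str) -> float:
    """Hypothetical pairwise factor: 0.0 forbids a pair, 1.0 leaves it unweighted."""
    if a_site == "initiator" and p_site.startswith("elongator"):
        return 0.0
    return 1.0


def joint(p_a: Dict[str, float], p_p: Dict[str, float],
          factor: Callable[[str, str], float]) -> Dict[Tuple[str, str], float]:
    """Normalized joint p(a, p) proportional to p_a(a) * p_p(p) * factor(a, p)."""
    weights = {(a, p): p_a[a] * p_p[p] * factor(a, p) for a, p in product(p_a, p_p)}
    z = sum(weights.values())
    return {pair: w / z for pair, w in weights.items() if w > 0.0}


uniform = {s: 1.0 / len(STATES) for s in STATES}
allowed = joint(uniform, uniform, psi)  # 22 of the 25 pairs survive
```

Storage stays small under this scheme: two five-entry marginals plus the factor, rather than an enumerated table of joint tRNA states with coordinates attached to each.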

The Hierarchy here has one level above the assembly – the particle instance – because each particle has its own 6-DOF pose and its own class assignment. Per-particle heterogeneity (pose, class) lives at that scope; per-averaged-structure heterogeneity (latent, hinge, ADP) lives at the assembly and below. The format treats these uniformly: a descriptor is just a (scope, regime, mode) triple, and adding a new scope level above the assembly does not break anything below.
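Placing the averaged model into the tomogram with a particle-scope pose can be sketched as below. The quaternion component order `(w, x, y, z)`, the unit-norm requirement, and the rotate-then-translate convention are assumptions; the deposition table states only that a quaternion and a translation are stored per particle.

```python
import math
from typing import Tuple

Vec = Tuple[float, float, float]
Quat = Tuple[float, float, float, float]  # assumed order (w, x, y, z), unit norm


def cross(a: Vec, b: Vec) -> Vec:
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def rotate(q: Quat, v: Vec) -> Vec:
    """Rotate v by unit quaternion q via v' = v + 2w(u x v) + 2 u x (u x v)."""
    w, u = q[0], (q[1], q[2], q[3])
    t = cross(u, v)
    tt = cross(u, t)
    return (v[0] + 2.0 * w * t[0] + 2.0 * tt[0],
            v[1] + 2.0 * w * t[1] + 2.0 * tt[1],
            v[2] + 2.0 * w * t[2] + 2.0 * tt[2])


def apply_pose(q: Quat, translation: Vec, v: Vec) -> Vec:
    """Rotate first, then translate (assumed convention)."""
    r = rotate(q, v)
    return (r[0] + translation[0], r[1] + translation[1], r[2] + translation[2])


# A 90 degree rotation about z followed by a shift of 5 along x.
half = math.radians(90.0) / 2.0
q_z90 = (math.cos(half), 0.0, 0.0, math.sin(half))
placed = apply_pose(q_z90, (5.0, 0.0, 0.0), (1.0, 0.0, 0.0))
# placed is approximately (5.0, 1.0, 0.0)
```

Because the pose acts on the output of the assembly-and-below composition, the descriptors beneath it never need to know that a particle level exists.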

What composition buys

Three things, in decreasing order of concreteness:

Storage linearity. Independence and nesting both keep total storage to the sum of per-scope costs rather than the product. For example, an ensemble with independent heterogeneity at five different scopes and seven states each stores \(5 \times 7 = 35\) states, not a Cartesian product of \(7^5 = 16{,}807\) conformers.

Regime independence. Each scope picks the regime that fits its physics. A structure is not pushed into Regime 1: Discrete Ensemble just because one of its scopes is discrete, nor into Regime 3: Continuous Landscape just because another is continuous. The reverse pressure is also absent: there is no incentive to collapse everything into a single uniform regime for consistency.

Semantic legibility. Consumers – whether a visualization tool, an ML dataloader, or a refinement program – can ask “what varies at chain scope,” “what varies at residue scope,” “what varies at atom scope” and get back descriptors typed by scope. A PyMOL-like consumer can ignore everything above residue scope if it wants a single-model render. A training pipeline can iterate over the nested compositional/conformational combinations without flattening to 50,000 explicit states. A refinement engine can update the chain-scope TLS parameters without touching the residue-scope altlocs.
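A scope-typed query of this kind can be sketched with a small record type. The `Descriptor` class, its field names, and the `varies_at` helper are hypothetical; the four entries restate the Example 1 table.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Descriptor:
    """One heterogeneity descriptor: a (scope, regime, mode) triple plus a label."""
    name: str
    scope: str   # e.g. "chain", "residue", "atom"
    regime: int  # 1 = Discrete Ensemble, 2 = Trajectory, 3 = Continuous Landscape
    mode: str    # "A" = Full Enumeration, "B" = Delta Encoding, "C" = Generative


EXAMPLE_1: List[Descriptor] = [
    Descriptor("rigid-body TLS, N-terminal domain", "chain", 3, "C"),
    Descriptor("rigid-body TLS, C-terminal domain", "chain", 3, "C"),
    Descriptor("altloc", "residue", 1, "B"),
    Descriptor("isotropic jitter", "atom", 3, "C"),
]


def varies_at(descriptors: List[Descriptor], scope: str) -> List[Descriptor]:
    """Answer 'what varies at this scope' without touching other scopes."""
    return [d for d in descriptors if d.scope == scope]


chain_level = varies_at(EXAMPLE_1, "chain")
residue_level = varies_at(EXAMPLE_1, "residue")
```

A refinement engine would call `varies_at(EXAMPLE_1, "chain")` and update the two TLS descriptors it gets back; the altloc descriptor is never loaded.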

The next chapter walks through the annotations and computed-pass machinery that sits on top of this layered core, and how established tools (Zarr, DataFusion, TileDB, HDF5) map onto the storage side of the four layers plus the composition rule above.