Glossary: conformational ensembles, statistical mechanics, mmCIF encoding

Published May 22, 2026

A working glossary of terms that come up when reading the Wankowicz–Bonomi 2026 Nature Methods perspective and adjacent literature on macromolecular conformational ensembles. Organized thematically rather than alphabetically because most of these concepts build on each other.

The Boltzmann framework

Microstate. One specific atomic configuration of a molecule — every atom in some position, every dihedral angle at some value. A microstate is the most fine-grained description a structural biologist talks about. A protein in solution is not in one microstate; it is constantly hopping between many of them. The list of all accessible microstates is the state space of the molecule.

Energy of a microstate, \(E_i\). Each microstate has an associated energy that depends on the geometry — bond stretches, angle distortions, electrostatic interactions, hydrogen bonds, solvent contacts. Low-energy microstates are “stable”; high-energy ones are strained.

Boltzmann factor, \(e^{-\beta E_i}\). A microstate’s relative statistical weight at temperature \(T\), where \(\beta \equiv 1/(k_B T)\) and \(k_B\) is Boltzmann’s constant. A state at energy 0 has weight 1; a state at energy \(k_B T\) has weight \(\approx 0.37\); a state at \(3 k_B T\) has weight \(\approx 0.05\). Boltzmann factors are not yet probabilities — they don’t sum to 1 — they’re the raw weights before normalization.

Partition function, \(Z\). The sum of Boltzmann factors over all microstates: \[Z \;=\; \sum_i e^{-\beta E_i}.\] Its operational role is to be the normalizing constant that turns weights into probabilities. The name is a (somewhat awkward) English translation of the German Zustandssumme, “sum over states.” What is “partitioned” is the total probability mass of 1, which gets split across all microstates in proportion to their Boltzmann factors. \(Z\) is also the central object of statistical mechanics — it determines the system’s free energy via \(F = -k_B T \ln Z\), and every macroscopic thermodynamic quantity (average energy, entropy, heat capacity, equilibrium constants) is derivable from \(Z\) or its derivatives. A key consequence: every microstate contributes a term to \(Z\), so dropping a state from the sum miscomputes the populations of every state, not just the one you missed.

Boltzmann-weighted ensemble. The collection of all accessible microstates, each present with probability \(p_i = e^{-\beta E_i} / Z\). This is the statistically correct description of a molecule in thermal equilibrium with its environment. It is the underlying object that every structural-biology measurement is some lossy projection of.
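
The definitions above can be checked numerically. The following is a minimal sketch, assuming a hypothetical three-state system with energies expressed in units of \(k_B T\); the function name and the energies are illustrative only. It computes \(p_i = e^{-\beta E_i}/Z\) and shows that dropping one state from the sum changes the population of every remaining state.

```python
import math

def boltzmann_probabilities(energies_kT):
    """Return p_i = exp(-E_i) / Z for energies given in units of k_B T."""
    # Subtracting the minimum energy rescales every weight and Z by the same
    # factor, so the probabilities are unchanged and exp() cannot overflow.
    e_min = min(energies_kT)
    weights = [math.exp(-(e - e_min)) for e in energies_kT]
    z = sum(weights)  # partition function relative to the lowest state
    return [w / z for w in weights]

# Hypothetical three-state system: energies 0, 1 and 3 k_B T.
energies = [0.0, 1.0, 3.0]
full = boltzmann_probabilities(energies)

# Same system with the 3 k_B T state left out of the sum.
truncated = boltzmann_probabilities(energies[:2])
```

With all three states the lowest state holds about 70.5% of the population; with the highest state omitted it appears to hold about 73.1%, even though its energy has not changed.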

Ensemble average, \(\langle O \rangle\). Any observable quantity \(O\) measured experimentally is in general an average over the ensemble: \[\langle O \rangle \;=\; \sum_i p_i \, O_i \;=\; \frac{1}{Z} \sum_i O_i \, e^{-\beta E_i}.\] This is the equation that fundamentally connects structural ensembles to experimental data. Different techniques measure different observables \(O_i\), so each technique is sensitive to a different functional of the ensemble.

Free energy landscape. A surface (in some abstract conformational coordinate) whose height at each point is the free energy of the conformational state at that point, \(F_i = -k_B T \ln p_i + \text{const}\), where \(p_i\) is the total population of the microstates that map to that point. Wells correspond to populated states, barriers to transition states. Figure 1 of the paper is exactly this picture: the protein “lives” in the wells, with the populations dictated by Boltzmann weighting.
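
As a worked example of the relation \(F_i = -k_B T \ln p_i + \text{const}\), the sketch below converts a population ratio into a free-energy difference and back. The 90/10 populations and the temperature of 298.15 K are assumed for illustration; the constant cancels in the difference.

```python
import math

K_B_KCAL = 0.0019872041  # Boltzmann constant in kcal/(mol*K)

def free_energy_difference(p_a, p_b, temperature_k=298.15):
    """Return F_b - F_a in kcal/mol from the two state populations."""
    return -K_B_KCAL * temperature_k * math.log(p_b / p_a)

def population_ratio(delta_f_kcal, temperature_k=298.15):
    """Return p_b / p_a for a given free-energy difference F_b - F_a."""
    return math.exp(-delta_f_kcal / (K_B_KCAL * temperature_k))

# A minor state at 10% population next to a major state at 90%.
delta_f = free_energy_difference(0.9, 0.1)
```

Here `delta_f` comes out at roughly 1.3 kcal/mol, which illustrates how small the free-energy gaps between experimentally visible alternative conformations typically are.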

Probability and statistics

Joint distribution, \(P(X, Y)\). The probability of two (or more) random variables taking specific values together. For two binary variables this is a 2×2 table summing to 1. The joint contains the complete information about how the two variables relate.

Marginal distribution, \(P(X)\). The distribution of one variable on its own, obtained by summing the joint over the other variable: \(P(X = x) = \sum_y P(X = x, Y = y)\). The name comes from old probability tables where row-sums and column-sums went in the margins. Marginalizing throws away correlation information. Many different joints share the same marginals, so you cannot reconstruct a joint from marginals alone — this is a recurring theme when discussing how experimental data (which often delivers marginals) constrains conformational ensembles (which are joint distributions over many coupled degrees of freedom).

Conditional probability, \(P(Y \mid X)\). “Given that \(X\) took some value, what’s the distribution of \(Y\)?” Definition: \(P(Y = y \mid X = x) = P(X = x, Y = y) / P(X = x)\). Conditional probabilities are where the correlation information that marginals lose is actually visible. For independent variables, \(P(Y \mid X) = P(Y)\) — knowing \(X\) tells you nothing about \(Y\). For perfectly correlated variables, \(P(Y \mid X)\) is deterministic.

Independence. Two variables are independent if \(P(X, Y) = P(X) \cdot P(Y)\). Equivalently: the joint is fully determined by the marginals, no extra information to find. In structural biology, treating residues’ altlocs as independent is the implicit assumption made by the mmCIF format, and it’s almost always wrong.
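
The relationship between joint, marginal, and conditional distributions can be made concrete with two binary variables. The sketch below is illustrative only: it assumes two hypothetical residues that each have altlocs A and B at 50% occupancy, and builds one joint in which the altlocs are perfectly coupled and one in which they are independent. Both produce identical marginals.

```python
def marginals(joint):
    """joint maps (x, y) -> probability; returns (P(X), P(Y)) as dicts."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return px, py

def conditional_y_given_x(joint, x):
    """P(Y | X = x), computed as joint / marginal."""
    px, _ = marginals(joint)
    return {y: p / px[x] for (xx, y), p in joint.items() if xx == x}

def is_independent(joint, tol=1e-12):
    """True when every cell equals the product of its marginals."""
    px, py = marginals(joint)
    return all(abs(p - px[x] * py[y]) <= tol for (x, y), p in joint.items())

# Residue 1 altloc, residue 2 altloc -> joint population (hypothetical).
coupled = {("A", "A"): 0.5, ("A", "B"): 0.0, ("B", "A"): 0.0, ("B", "B"): 0.5}
uncoupled = {("A", "A"): 0.25, ("A", "B"): 0.25, ("B", "A"): 0.25, ("B", "B"): 0.25}

same_marginals = marginals(coupled) == marginals(uncoupled)
```

The conditional distribution is where the two cases differ: `conditional_y_given_x(coupled, "A")` puts all the probability on A, while the same call on `uncoupled` returns 50/50.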

Covariance and correlation coefficient. Numerical summaries of how two variables co-vary. Covariance \(\mathrm{Cov}(X, Y) = \langle XY \rangle - \langle X \rangle \langle Y \rangle\) is positive when \(X\) and \(Y\) rise together, negative when they trade off, and zero when they are independent. The converse fails: zero covariance rules out only linear dependence, so two variables can be uncorrelated and still dependent. The correlation coefficient \(\rho = \mathrm{Cov}(X, Y) / (\sigma_X \sigma_Y)\) normalizes covariance to the range \([-1, +1]\). Both are scalar summaries: they reduce the joint to a single number and lose detail, but a nonzero value is enough to establish dependence.
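
A short numerical sketch of these two summaries, using an assumed toy distribution in which \(X\) is uniform on \(\{-1, 0, 1\}\). One derived variable is a linear function of \(X\); the other is \(X^2\), which is completely determined by \(X\) yet has zero covariance with it. All names are illustrative.

```python
import math

def expectation(values, probs):
    return sum(v * p for v, p in zip(values, probs))

def covariance(xs, ys, probs):
    """Cov(X, Y) = <XY> - <X><Y> for paired outcomes with given probabilities."""
    exy = expectation([x * y for x, y in zip(xs, ys)], probs)
    return exy - expectation(xs, probs) * expectation(ys, probs)

def correlation(xs, ys, probs):
    """Pearson correlation: covariance divided by both standard deviations."""
    sx = math.sqrt(covariance(xs, xs, probs))
    sy = math.sqrt(covariance(ys, ys, probs))
    return covariance(xs, ys, probs) / (sx * sy)

xs = [-1.0, 0.0, 1.0]
probs = [1.0 / 3.0] * 3

ys_linear = [2.0 * x + 1.0 for x in xs]  # linear in X
ys_square = [x * x for x in xs]          # deterministic but nonlinear in X

rho_linear = correlation(xs, ys_linear, probs)
rho_square = correlation(xs, ys_square, probs)
```

`rho_linear` is 1 up to rounding, and `rho_square` is 0 even though knowing \(X\) fixes \(X^2\) exactly, so a vanishing correlation coefficient does not by itself show independence.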

Experimental techniques (acronyms spelled out)

Nuclear Magnetic Resonance (NMR). A spectroscopy that probes nuclei (most often \(^1\text{H}\), \(^{13}\text{C}\), \(^{15}\text{N}\)) by exposing them to strong magnetic fields and radiofrequency pulses. Sensitive to the local chemical environment of each nucleus and to through-space and through-bond interactions between nuclei. Inherently an ensemble- and time-averaging technique on the millisecond–nanosecond range.

Nuclear Overhauser Effect (NOE). An NMR observable where the spin of one nucleus affects another through space (via dipole-dipole coupling). Crucially, the signal scales as \(1/r^6\) where \(r\) is the distance between the two nuclei. The sixth-power dependence means a small fraction of molecules at short distance contributes vastly more signal than the bulk at longer distance — so NOE-derived “distances” are heavily skewed toward rare, short-distance conformers, which is a major issue for ensemble fitting.
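
To make the size of this bias concrete, the sketch below computes the apparent distance \(\langle r^{-6} \rangle^{-1/6}\) for an assumed two-state ensemble in which 90% of molecules hold the two nuclei 10 Å apart and 10% hold them 4 Å apart. The populations and distances are hypothetical, and real NOE analysis involves further effects (spin diffusion, motional averaging regimes) that are ignored here.

```python
def noe_effective_distance(distances, populations):
    """Apparent distance implied by the ensemble-averaged r^-6 signal."""
    mean_r6 = sum(p * r ** -6 for r, p in zip(distances, populations))
    return mean_r6 ** (-1.0 / 6.0)

def mean_distance(distances, populations):
    """Plain population-weighted average distance, for comparison."""
    return sum(p * r for r, p in zip(distances, populations))

distances = [10.0, 4.0]    # angstroms, hypothetical two-state ensemble
populations = [0.9, 0.1]

r_eff = noe_effective_distance(distances, populations)
r_mean = mean_distance(distances, populations)
```

The average distance is 9.4 Å, but the NOE-implied distance is about 5.8 Å: a single structure built to satisfy that restraint would match neither of the two states actually present.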

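Förster Resonance Energy Transfer (FRET). Optical analog of the NOE: non-radiative energy transfer between a donor and an acceptor fluorescent dye attached to two sites on the protein. The transfer rate has the same \(1/r^6\) dependence, and the measured efficiency is \(E = 1/(1 + (r/R_0)^6)\), where \(R_0\) is the Förster radius of the dye pair. Because \(E\) saturates at 1 for \(r \ll R_0\), the signal is most sensitive to distances near \(R_0\). Distance range ≈ 1–10 nm.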

Double Electron–Electron Resonance (DEER), also called Pulsed Electron–Electron Double Resonance (PELDOR). A pulsed electron paramagnetic resonance (EPR) experiment, the electron-spin counterpart of NMR, that uses unpaired electron spins on attached spin labels (small molecules with a stable radical, usually a nitroxide). Measures distances between two labels in the range ≈ 2–8 nm. Less biased toward short distances than NOE/FRET (the coupling falls off as \(1/r^3\) rather than \(1/r^6\)).

Small-Angle X-ray Scattering (SAXS). Solution-state X-ray scattering at low angles, giving a low-resolution overall shape/size. Ensemble-averaged over all conformations in the sample.

Hydrogen–Deuterium Exchange (HDX). Place the protein in heavy water; protons in exposed/floppy regions exchange with deuterium quickly, protected/structured ones slowly. Usually read out by mass spectrometry. A coarse but powerful probe of which parts of the protein are structured.

Atomic Force Microscopy (AFM). Physical-probe imaging where a tiny cantilever tip drags across a surface or pulls on a single molecule. Used for single-molecule mechanics.

Cryogenic Electron Microscopy (cryo-EM). Imaging frozen-hydrated samples by transmission electron microscopy. Single-particle cryo-EM reconstructs 3D maps from thousands of individual particle images, each frozen in some microstate — so the dataset is, in principle, a sample of the ensemble.

X-ray crystallography. Diffraction from a crystal of the molecule. The observed electron density is a time- and space-average over all unit cells in the crystal, so it’s an ensemble average under crystal-packing constraints (which may distort the in-solution ensemble).

Multi-temperature crystallography. Collecting crystallographic data at multiple temperatures to disentangle thermal motion from static disorder, and to see which conformations are “frozen out” at low temperature versus populated at room temperature.

Diffuse scattering. The non-Bragg part of an X-ray diffraction pattern (the “noise” between the sharp spots) that arises from correlated motions in the crystal. Historically discarded; recently being mined for ensemble information.

Molecular Dynamics (MD). Simulation of molecular motion by numerically integrating Newton’s equations under an empirical force field (a parameterized energy function). Produces a trajectory of microstates that, in principle, samples the Boltzmann ensemble — in practice limited by force-field accuracy and by the (short) timescales reachable.

Maximum entropy reweighting. A class of methods that adjust an initial ensemble’s weights to fit experimental data while perturbing the distribution as little as possible (in the sense of minimizing the relative entropy, or Kullback–Leibler divergence, from the initial weights). Used to make MD trajectories or candidate ensembles consistent with NMR (including NOE), SAXS, or cryo-EM data.
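
For a single ensemble-averaged observable with no experimental error, the minimally perturbed weights have the closed form \(w_i \propto w_i^0 \, e^{-\lambda O_i}\), and the one remaining unknown \(\lambda\) is fixed by requiring the reweighted average to match the data. The sketch below is a simplified illustration under those assumptions, using bisection to find \(\lambda\); production tools handle many observables and experimental uncertainty, which this sketch does not. The four-frame ensemble and target value are hypothetical.

```python
import math

def reweight(prior_weights, observables, lam):
    """Weights proportional to w0_i * exp(-lam * O_i), normalized to sum to 1."""
    logs = [math.log(w) - lam * o for w, o in zip(prior_weights, observables)]
    shift = max(logs)  # subtract the maximum so exp() cannot overflow
    unnorm = [math.exp(v - shift) for v in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def average(weights, observables):
    return sum(w * o for w, o in zip(weights, observables))

def maxent_fit(prior_weights, observables, target, lo=-50.0, hi=50.0, iters=200):
    """Find lam so that the reweighted average equals the target."""
    # The reweighted average decreases monotonically as lam grows,
    # so simple bisection is enough.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if average(reweight(prior_weights, observables, mid), observables) > target:
            lo = mid  # average still too high: lam must increase
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return lam, reweight(prior_weights, observables, lam)

# Hypothetical four-frame ensemble with uniform prior weights (average 2.5).
prior = [0.25, 0.25, 0.25, 0.25]
observable = [1.0, 2.0, 3.0, 4.0]
lam, posterior_weights = maxent_fit(prior, observable, target=2.0)
fitted_average = average(posterior_weights, observable)
```

The target must lie strictly between the smallest and largest per-frame value of the observable; otherwise no reweighting of the existing frames can reproduce it.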

Cross-cutting acronyms

Root Mean Square Deviation (RMSD). Standard structural-similarity metric. Align two structures, compute the average squared per-atom displacement, take the square root. Bad at comparing distributions because it collapses everything to a single scalar.

Protein Data Bank (PDB). Global repository of macromolecular structural models. The “single static structure” is the unit of PDB content.

Protein Data Bank Exchange / macromolecular Crystallographic Information File (PDBx/mmCIF). The modern file format used by the PDB. Supersedes the legacy fixed-width PDB format. A text format encoding atomic coordinates, occupancies, B-factors, alternative location indicators, and a large amount of metadata.

Critical Assessment of protein Structure Prediction (CASP). Biennial blind-prediction competition that has historically driven progress in static structure prediction (and culminated in AlphaFold’s CASP14 result). The ensemble field lacks a CASP-equivalent — one of the gaps Wankowicz–Bonomi argue must be filled.

Probability Density Function (PDF). In the paper’s context, a continuous probability distribution over conformations (not the file format). Used in metrics like Jensen–Shannon divergence to compare ensembles.

mmCIF ensemble-encoding terms

Alternative location indicator (altloc). A field in PDBx/mmCIF that allows each atom to be listed in multiple positions with associated occupancies (fractional populations summing to 1 for each atom). The standard way to record per-residue structural heterogeneity from crystallography. Critically, altloc occupancies are per-residue marginals: the format has no mechanism to record correlations between altlocs on different residues, so the joint distribution over global conformations cannot be expressed.

Occupancy. The fractional weight of a particular altloc for an atom, between 0 and 1. Treated as a marginal probability for that atom’s position.

B-factor (atomic displacement parameter, ADP). A scalar per atom that captures the magnitude of positional fluctuation. Conventionally interpreted as proportional to the mean squared displacement of the atom from its modeled position, \(B = 8\pi^2 \langle u^2 \rangle\). In practice, B-factors mix several physically distinct contributions: true thermal motion, lattice disorder, refinement uncertainty, radiation damage, conformational heterogeneity not modeled by altlocs. Because of this mixing, B-factors are not a clean window onto the thermodynamic probability density.
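
For a sense of scale, the conventional isotropic relation \(B = 8\pi^2 \langle u^2 \rangle\) can be inverted to get a root-mean-square displacement, where \(\langle u^2 \rangle\) is the mean-square displacement along one direction. The sketch below assumes that convention and, per the caveat above, the result is an upper-level summary of everything mixed into the B-factor, not a pure thermal amplitude.

```python
import math

def b_to_u_iso(b_factor):
    """Convert an isotropic B-factor (A^2) to mean-square displacement (A^2)."""
    return b_factor / (8.0 * math.pi ** 2)

def rms_displacement(b_factor):
    """Root-mean-square displacement (A) implied by an isotropic B-factor."""
    return math.sqrt(b_to_u_iso(b_factor))

rms_b20 = rms_displacement(20.0)
rms_b80 = rms_displacement(80.0)
```

A B-factor of 20 Å² corresponds to an RMS displacement of about 0.5 Å, and quadrupling B to 80 Å² only doubles that to about 1.0 Å.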

Anisotropic Displacement Parameter (anisotropic ADP). A more detailed version of the B-factor that captures directional anisotropy of atomic motion via a symmetric 3×3 tensor per atom (six independent parameters). Still has the same physical-mixing problem.

Multi-model encoding (NMR-style multi-model files). Putting multiple complete coordinate sets in a single mmCIF file (each is a “model”). Common practice for NMR ensembles. The format has no field for the relative population of each model, so the Boltzmann weights \(p_i\), which are the entire point of an ensemble, are absent.

Maximum parsimony principle. An encoding strategy that aims to explain the observed data using the fewest model parameters. Multiconformer modeling tools (like qFit) follow this principle: produce a small list of discrete altloc conformers that explain the electron density. Interpretable but tends to miss small-amplitude anharmonic motion.

Maximum entropy principle. An encoding strategy that aims to produce the broadest (highest-entropy) distribution consistent with the data. Tends to capture small-amplitude motions well but produces large, hard-to-interpret ensembles.

Compiler / IR vocabulary

Forward model / forward operator (a.k.a. measurement operator, projection function, imaging operator). A function that takes a candidate state of the system and predicts what you would observe if that state were true. The name marks the direction: “forward” goes from cause to effect; “inverse” goes from observation back to cause. The pattern is pan-disciplinary. In weather forecasting the forward model integrates the atmosphere from today’s state to tomorrow’s; the inverse is data assimilation from satellite observations. In CT imaging the forward model is the Radon transform from tissue density to detector counts; the inverse is reconstruction. In astronomy the forward model maps galaxy parameters to telescope images; the inverse is parameter estimation. In computer graphics the forward model is rendering; the inverse is “inverse rendering” (NeRF, structure-from-motion). In structural biology each experimental technique has its own forward operator on the ensemble: NOE has \(f_\text{NOE}(\{(x_i, w_i)\}) = \langle 1/r_{ab}^6 \rangle\); SAXS has \(f_\text{SAXS}(\text{ensemble}, q) = \langle |F(q)|^2 \rangle\); X-ray crystallography has \(f_\text{xray}(\text{ensemble}) = \langle \rho(x) \rangle\) convolved with a point-spread function and corrupted by noise; cryo-EM has \(f_\text{cryoEM}(\text{ensemble}) = \mathcal{B}(\Gamma(\text{ensemble}))\) where \(\Gamma\) places Gaussians per atom and \(\mathcal{B}\) convolves with the microscope’s contrast transfer function, or CTF (literally equation 11 in Levy et al. 2025). The unifying property is computability: forward problems are usually a single function evaluation, while inverse problems are hard because the mapping is many-to-one and noisy.
Treating forward operators as first-class IR objects (with scope, projector function, noise model, and value) is the design move that makes cross-technique integration tractable: given an ensemble and the operator, anyone can compute the predicted observable for likelihood-based inference, gradient-based fitting, or model comparison.
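
One way to picture a forward operator as a first-class record is sketched below. Everything here is hypothetical: the `ForwardOperator` class, its field names, and the toy ensemble (each state reduced to a weight and a single inter-atomic distance) are illustrative assumptions and not the format's actual schema. The second operator, a plain mean distance, is included only to show two records sharing one ensemble.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# Toy ensemble: each state is (weight, distance between atoms a and b in A).
Ensemble = Sequence[Tuple[float, float]]

@dataclass(frozen=True)
class ForwardOperator:
    name: str
    scope: str                               # which atoms the record refers to
    projector: Callable[[Ensemble], float]   # ensemble -> predicted observable
    sigma: float                             # Gaussian noise model width
    observed: float                          # measured value

    def predict(self, ensemble: Ensemble) -> float:
        return self.projector(ensemble)

    def log_likelihood(self, ensemble: Ensemble) -> float:
        """Gaussian log-density of the observed value given the prediction."""
        residual = self.observed - self.predict(ensemble)
        return (-0.5 * (residual / self.sigma) ** 2
                - math.log(self.sigma * math.sqrt(2.0 * math.pi)))

def noe_projector(ensemble: Ensemble) -> float:
    return sum(w * r ** -6 for w, r in ensemble)

def mean_distance_projector(ensemble: Ensemble) -> float:
    return sum(w * r for w, r in ensemble)

true_ensemble = [(0.9, 10.0), (0.1, 4.0)]
single_state = [(1.0, 10.0)]

# Synthetic "observations" generated from the true ensemble, for illustration.
operators = [
    ForwardOperator("noe", "pair(a,b)", noe_projector,
                    sigma=1e-6, observed=noe_projector(true_ensemble)),
    ForwardOperator("mean_distance", "pair(a,b)", mean_distance_projector,
                    sigma=0.5, observed=mean_distance_projector(true_ensemble)),
]

def total_log_likelihood(ensemble: Ensemble) -> float:
    # Independent measurements contribute additively in log space.
    return sum(op.log_likelihood(ensemble) for op in operators)
```

Because each record carries its own projector and noise model, a consumer can score any candidate ensemble against every measurement with the same loop, which is the integration property described above.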

Projection in relational algebra (\(\pi\)). The relational-algebra operator that picks a subset of columns from a table, discarding the rest. In SQL, SELECT DISTINCT name, city FROM people is the projection \(\pi_{\text{name}, \text{city}}(\text{People})\) (DISTINCT is needed because relational projection returns a set, whereas a plain SELECT keeps duplicate rows). Projection is information-lossy: many different source tables can produce the same projection, which is why “what was the original table?” is an inverse problem. By analogy, an experimental observable is a projection of the joint conformational ensemble onto an observable subspace, and recovering the ensemble from its observables is the structural-biology inverse problem.

Bayes’ rule and its named pieces. The single formula that organizes everything in posterior-based inference: \(P(x \mid y) = P(y \mid x) P(x) / P(y)\). The prior \(P(x)\) is what you believed about the hypothesis \(x\) before seeing data — in our setting, “what protein structures look like in general” (encoded by geometry restraints or a pretrained diffusion model). The likelihood \(P(y \mid x)\) is the probability of observing \(y\) assuming \(x\) is true — produced by the forward model plus a noise model: if \(y = f(x) + \epsilon\) with Gaussian \(\epsilon\), then \(P(y \mid x) \propto \exp(-\tfrac{1}{2\sigma^2}\lVert y - f(x) \rVert^2)\). Critically, for fixed \(y\) the likelihood is a function of \(x\) but not a normalized probability distribution over \(x\) — it doesn’t integrate to 1. The posterior \(P(x \mid y)\) is a normalized distribution over \(x\), and it’s usually what you want. The evidence \(P(y) = \int P(y \mid x) P(x) \, dx\) is the normalizer; in high-dimensional structural problems it is intractable to compute, which is why almost all Bayesian methods work by sampling rather than by direct evaluation.

Log-likelihood. The logarithm of the likelihood, used pervasively for numerical reasons. Likelihoods for high-dimensional models are typically vanishingly small numbers (think \(10^{-1000}\)) that underflow floating-point arithmetic; taking the log brings them into a representable range, turns products into sums (so independent measurements give additive contributions), and gives cleaner gradients. Maximizing the log-posterior is equivalent to maximizing the posterior because log is monotonic — that is why MAP estimation is phrased in log space and why multi-modal data fitting is written as \(\sum_i \log p(y_i \mid x) + \log p(x)\).
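
The underflow problem is easy to reproduce. The sketch below assumes 100 independent measurements that each have likelihood \(10^{-5}\) (hypothetical numbers); the true joint likelihood is \(10^{-500}\), which is smaller than the smallest positive double-precision number.

```python
import math

# 100 independent measurements, each with likelihood 1e-5 (illustrative).
likelihoods = [1e-5] * 100

# Multiplying directly underflows to exactly zero in double precision.
naive_product = math.prod(likelihoods)

# Summing logs keeps the same information in a representable range.
log_likelihood = sum(math.log(p) for p in likelihoods)
```

`naive_product` is `0.0`, so any comparison or gradient based on it is lost, while `log_likelihood` is a perfectly ordinary number near \(-1151.3\).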

Sampling vs computing a posterior. The reason structural-biology and ML papers say “sample the posterior” rather than “compute it”: the posterior is known only up to the intractable evidence \(P(y)\), so you can’t generally write down its normalized density and integrate against it. What you can do is draw concrete realizations \(x^{(1)}, \ldots, x^{(K)}\) distributed (at least approximately) according to the posterior, using methods like Markov Chain Monte Carlo (MCMC, of which Hamiltonian Monte Carlo is one variant), Sequential Monte Carlo (SMC), or diffusion-based posterior sampling; variational inference instead fits a tractable approximating distribution and samples from that. From samples you can estimate ensemble averages, modes, credible intervals, and anything else you’d want from the distribution itself. “Sample the posterior” is precise jargon for “produce concrete instances from a distribution we can’t otherwise compute against.”
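
A minimal random-walk Metropolis sampler shows how samples can be drawn using only an unnormalized log-density, so the evidence never has to be computed. This is a sketch under simple assumptions: the target is a one-dimensional density proportional to a Gaussian with mean 2 and unit variance, chosen so the result can be checked, and the step size, chain length, and burn-in are arbitrary illustrative choices.

```python
import math
import random

def log_unnormalized_posterior(x):
    # Proportional to a Gaussian with mean 2 and variance 1; the
    # normalizing constant is deliberately left out.
    return -0.5 * (x - 2.0) ** 2

def metropolis(log_target, x0, n_steps, step_size, rng):
    """Random-walk Metropolis: needs only differences of log-densities."""
    samples = []
    x, logp = x0, log_target(x0)
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step_size)
        logp_new = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(x)).
        if rng.random() < math.exp(min(0.0, logp_new - logp)):
            x, logp = proposal, logp_new
        samples.append(x)  # a rejected move repeats the current state
    return samples

rng = random.Random(0)
chain = metropolis(log_unnormalized_posterior, 0.0, 20000, 2.0, rng)
kept = chain[1000:]  # discard burn-in

sample_mean = sum(kept) / len(kept)
sample_var = sum((s - sample_mean) ** 2 for s in kept) / len(kept)
```

The sample mean and variance approximate the posterior mean and variance, and any other ensemble average can be estimated from `kept` in the same way.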

Maximum a Posteriori (MAP) estimation. Instead of drawing many samples from the posterior, find the single most probable \(x\): \(x^\ast = \arg\max_x \log p(x \mid y) = \arg\max_x \big[\log p(y \mid x) + \log p(x)\big]\). This is mode-finding on the posterior, and it differs from maximum-likelihood estimation only in that the prior term is included. Used by ADP-3D when the data is informative enough that the posterior is tightly peaked and a point estimate is what you want. Compared to full posterior sampling: faster, gives one structure instead of an ensemble, throws away uncertainty.
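
The effect of the prior term can be seen in a one-dimensional toy problem. The sketch below assumes a Gaussian prior on \(x\) with standard deviation \(\tau = 2\), a single observation \(y = x + \epsilon\) with Gaussian noise \(\sigma = 1\), and a brute-force grid search in place of a real optimizer. For this setup the MAP estimate has the closed form \(y\,\tau^2/(\tau^2 + \sigma^2)\), which the grid search should reproduce.

```python
def log_prior(x, tau=2.0):
    """Gaussian prior centred on zero, up to an additive constant."""
    return -0.5 * (x / tau) ** 2

def log_likelihood(x, y, sigma=1.0):
    """Gaussian noise model for y = x + eps, up to an additive constant."""
    return -0.5 * ((y - x) / sigma) ** 2

def map_estimate(y, grid):
    # Maximize log-likelihood + log-prior; the evidence is not needed.
    return max(grid, key=lambda x: log_likelihood(x, y) + log_prior(x))

def mle_estimate(y, grid):
    # Maximum likelihood: the same search without the prior term.
    return max(grid, key=lambda x: log_likelihood(x, y))

grid = [i / 1000.0 for i in range(-5000, 5001)]  # -5.0 to 5.0 in steps of 0.001
y_observed = 3.0

x_map = map_estimate(y_observed, grid)
x_mle = mle_estimate(y_observed, grid)
```

The maximum-likelihood answer sits on the observation at 3.0, while the MAP answer is pulled toward the prior mean and lands at 2.4.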

Diffusion model. A generative model that learns the data distribution by reversing a noising process. The forward (noising) process gradually adds Gaussian noise to data over many time steps until the signal is destroyed; the model is trained to undo this — given a noisy sample, predict the clean version (or equivalently, predict the noise to be subtracted). Sampling is iterative denoising: start from pure noise, run the trained model many times to progressively denoise toward a sample from the data distribution. Diffusion models are extremely effective as priors for structural data because they learn the manifold of valid structures implicitly. Protein-structure diffusion models include Chroma, RFdiffusion, AlphaFlow, AlphaFold3, Boltz, Chai.

Tweedie’s formula. A result from empirical Bayes that gives a closed-form expression for the posterior mean of the clean signal given a noisy observation: \(\hat{x}_0 := \mathbb{E}[x_0 \mid x_t]\). For Gaussian-noised diffusion processes this works out to \(\hat{x}_0 = (x_t + (1 - \bar{\alpha}_t)\nabla_{x_t} \log p_t(x_t))/\sqrt{\bar{\alpha}_t}\), which the diffusion model gives you for free at every denoising step. This is the trick that makes Diffusion Posterior Sampling (Chung et al. 2023) work: the likelihood \(p(y \mid x_t)\) at noisy intermediate states \(x_t\) is hard to evaluate directly, but \(p(y \mid \hat{x}_0)\) at the denoised estimate is tractable, and the gradient \(\nabla_{x_t} \log p(y \mid \hat{x}_0)\) flows through the denoiser via backprop.
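
Tweedie's formula can be verified in a case where everything is available in closed form. The sketch below assumes a one-dimensional data distribution \(x_0 \sim \mathcal{N}(\mu, s^2)\) and the noising rule \(x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon\), so the score of \(p_t\) is analytic and stands in for what a trained network would estimate. The parameter values are arbitrary.

```python
import math

def score_gaussian_marginal(x_t, alpha_bar, mu, s2):
    """Score of p_t when x_0 ~ N(mu, s2): p_t is Gaussian, so this is exact."""
    mean_t = math.sqrt(alpha_bar) * mu
    var_t = alpha_bar * s2 + (1.0 - alpha_bar)
    return -(x_t - mean_t) / var_t

def tweedie_x0(x_t, alpha_bar, score):
    """Posterior-mean estimate of x_0 from the score, as in the formula above."""
    return (x_t + (1.0 - alpha_bar) * score) / math.sqrt(alpha_bar)

def exact_posterior_mean(x_t, alpha_bar, mu, s2):
    """E[x_0 | x_t] from standard Gaussian conditioning, for comparison."""
    a = math.sqrt(alpha_bar)
    var_t = alpha_bar * s2 + (1.0 - alpha_bar)
    return mu + (a * s2 / var_t) * (x_t - a * mu)

mu, s2, alpha_bar, x_t = 1.5, 0.8, 0.3, -0.4  # illustrative values
score = score_gaussian_marginal(x_t, alpha_bar, mu, s2)
x0_tweedie = tweedie_x0(x_t, alpha_bar, score)
x0_exact = exact_posterior_mean(x_t, alpha_bar, mu, s2)
```

The two estimates agree to rounding error. In a real diffusion model the score comes from the network, so the quality of \(\hat{x}_0\) is limited by how well the score was learned.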

Diffusion Posterior Sampling (DPS). A method (Chung et al. 2023) for using a pretrained diffusion model as a prior in inverse problems. At each denoising step, in addition to the standard denoising update, take a gradient step on \(\log p(y \mid \hat{x}_0)\) where \(\hat{x}_0\) is the Tweedie-formula posterior mean estimate. Handles both Gaussian and Poisson noise and both linear and nonlinear forward operators. Requires the forward operator to be differentiable.

Feynman-Kac (FK) Steering. A method (Singhal et al. 2025) for steering diffusion models with arbitrary reward functions, including non-differentiable ones and discrete-state diffusion. Replaces DPS’s gradient step with a Sequential Monte Carlo procedure: run \(K\) parallel denoising trajectories called particles, score them at intermediate steps with potentials (defined using intermediate rewards), and resample so that high-scoring particles are duplicated and low-scoring ones discarded. Targets the tilted distribution \(p_\theta(x) \exp(\lambda r(x))\), which equals the posterior up to normalization when \(r = \log p(y \mid x)\), \(\lambda = 1\), and \(p_\theta\) plays the role of the prior. Strictly more general than DPS (handles non-differentiable rewards, discrete diffusion, arbitrary scoring functions), at the cost of running multiple parallel trajectories.
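
The resampling idea can be illustrated without a diffusion model. The sketch below is a deliberate simplification and not the published algorithm: it applies a single weight-and-resample step to finished samples rather than resampling at intermediate denoising steps, and it uses draws from a standard normal as a stand-in for \(p_\theta\). The reward is the Gaussian log-likelihood of an assumed observation \(y = 1\) with \(\sigma = 0.5\), so the tilted distribution is a Gaussian with mean 0.8 and variance 0.2 that the result can be checked against.

```python
import math
import random

def reward(x):
    # log p(y | x) up to a constant, for y = 1 and sigma = 0.5 (assumed values).
    return -0.5 * ((1.0 - x) / 0.5) ** 2

def resample(particles, log_potentials, rng):
    """Duplicate high-potential particles and drop low-potential ones."""
    shift = max(log_potentials)  # stabilizes exp() without changing ratios
    weights = [math.exp(lp - shift) for lp in log_potentials]
    return rng.choices(particles, weights=weights, k=len(particles))

rng = random.Random(0)

# Stand-in for samples from the base model p_theta: a standard normal.
particles = [rng.gauss(0.0, 1.0) for _ in range(20000)]

lam = 1.0  # with lam = 1 and r = log-likelihood, the target is the posterior
steered = resample(particles, [lam * reward(x) for x in particles], rng)

steered_mean = sum(steered) / len(steered)
steered_var = sum((x - steered_mean) ** 2 for x in steered) / len(steered)
```

Note that `reward` is only ever evaluated, never differentiated, which is the property that lets this family of methods use non-differentiable scoring functions.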

Plug-and-play (PnP) framework. A class of methods (originally Venkatakrishnan et al. 2013, later applied to diffusion priors as DiffPIR / ADP-3D) for solving inverse problems using an off-the-shelf denoiser as a prior. The objective \(f(x) + g(x)\) is split via Half-Quadratic Splitting (HQS) into a likelihood term and a denoising term; the denoiser (which can be any pretrained model, including a diffusion model) acts as the proximal operator for the prior. ADP-3D (Levy et al. 2025) applies this to protein structure determination with Chroma or RFdiffusion as the prior and explicit cryo-EM / distance / substructure forward operators as the likelihood.

Posterior sampling vs MAP, for ensembles. Full posterior sampling produces a sample-based representation of the posterior: many candidate structures distributed according to \(p(x \mid y)\). In this glossary’s terminology that is naturally a Boltzmann-weighted ensemble (R1 in the heterogeneity taxonomy), with samples acting as the discrete states. MAP gives you a single structure, the mode of the posterior, which is the modern equivalent of the classical “single deposited structure.” The same forward operators serve both; the difference is whether the downstream consumer wants uncertainty or a point estimate.

Noise model. A probability distribution \(p(y_\text{observed} \mid y_\text{predicted})\) describing how the observed value relates to what the forward operator predicts. Common forms: Gaussian (continuous symmetric errors, log-likelihood \(-\lVert y_\text{obs} - y_\text{pred} \rVert^2 / (2\sigma^2)\) up to an additive constant); Poisson (photon-counting); log-normal (multiplicative errors, used for NOE and SAXS); Student-\(t\) (heavy-tailed, robust to outliers); heteroskedastic Gaussian (variance depends on signal magnitude or measurement position, e.g. per-Miller-index \(\sigma\) for X-ray, per-voxel \(\sigma\) for cryo-EM). Stored as a separate registry reference from the forward operator so the same operator can be used with different noise models.

Observation tensor. The actual experimental data attached to a forward-operator record, stored in whatever shape is native to the technique: a table of \((h, k, l, F_\text{obs}, \sigma_F)\) for X-ray; a 3D voxel array for cryo-EM maps; a vector of intensities indexed by atom pairs for NOE; a 1D scattering curve for SAXS. The format does not try to unify observation shapes across techniques — the unifying object is the atomic ensemble, and observations sit beside it in technique-specific shapes connected by the forward operator.

Computation-graph serialization formats. TorchScript (torch.jit.save, .pt files): zip archive containing graph IR, parameter tensors, and metadata; loadable with torch.jit.load without the original Python source, subject to PyTorch version compatibility; carries autograd metadata. ONNX: protobuf-based representation with a versioned operator spec (the “opset”); cross-framework via ONNX Runtime; the most portable. MLIR: text or binary representation of typed operations grouped into dialects; most extensible, used in compiler/accelerator stacks. For embedding novel forward operators in a deposition, all three are viable as content-hashed blobs referenced from the operator record; the format-level interface contract (input/output shapes, dtypes, differentiability) is declared separately from the blob body so consumers that can’t execute the blob can still read the contract.

Mixed-Integer Quadratic Program (MIQP). An optimization problem with a quadratic objective function (e.g. a squared residual \(\lVert y - Aw \rVert^2\)) and a mix of continuous and integer (typically binary 0/1) decision variables. With the integer variables fixed, the continuous part is a standard quadratic program (QP), convex when the objective is a squared residual like the one above. The integer part (usually binary indicator variables that enforce “at most \(N\) items are selected” or “either this or that, not both”) makes the problem combinatorial and in general NP-hard, though small instances (small \(N\), sparse structure) are routinely solvable in seconds. Solvers use branch-and-bound or branch-and-cut: solve continuous QP relaxations at nodes of a search tree, prune branches by bounds. Common solvers: Gurobi and CPLEX (commercial) and SCIP (open-source). CVXPY is a Python frontend that compiles a high-level problem description into the solver’s native form. In qFit, the MIQP picks at most ~4 conformers (binary indicators) with continuous occupancies, minimizing the squared electron-density residual.
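
The structure of the selection problem can be shown with exhaustive search on a tiny instance. The sketch below is not how qFit or any MIQP solver works internally: it swaps branch-and-bound for brute-force enumeration of conformer subsets and restricts occupancies to a coarse grid, which is only feasible because the example is so small. The candidate densities, grid step, and cardinality limit are all hypothetical.

```python
from itertools import combinations, product

def residual(observed, candidates, chosen, occupancies):
    """Squared difference between observed density and the weighted sum."""
    total = 0.0
    for j, obs in enumerate(observed):
        pred = sum(w * candidates[k][j] for k, w in zip(chosen, occupancies))
        total += (obs - pred) ** 2
    return total

def select_conformers(observed, candidates, max_conformers, steps=10):
    """Best subset of at most max_conformers with gridded occupancies."""
    best = (float("inf"), (), ())
    levels = [i / steps for i in range(1, steps + 1)]  # nonzero occupancies
    for size in range(1, max_conformers + 1):
        # The subset choice plays the role of the binary indicator variables.
        for chosen in combinations(range(len(candidates)), size):
            for occ in product(levels, repeat=size):
                if sum(occ) > 1.0 + 1e-9:  # occupancies may not exceed 1 in total
                    continue
                r = residual(observed, candidates, chosen, occ)
                if r < best[0]:
                    best = (r, chosen, occ)
    return best

# Three candidate conformers, each a density profile on four grid points.
candidates = [
    [1.0, 0.5, 0.0, 0.0],
    [0.0, 1.0, 0.5, 0.0],
    [0.0, 0.0, 0.5, 1.0],
]
# Synthetic "observed" density: 70% of conformer 0 plus 30% of conformer 2.
observed = [0.7 * a + 0.3 * c for a, c in zip(candidates[0], candidates[2])]

best_residual, best_subset, best_occupancies = select_conformers(
    observed, candidates, max_conformers=2
)
```

The search recovers conformers 0 and 2 at occupancies 0.7 and 0.3. The number of subsets grows combinatorially with the number of candidates, which is why real instances are handed to a dedicated solver.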

Modeling tools

qFit. A multiconformer modeling tool (ExcitedStates lab, Fraser/Keedy/van den Bedem/Wankowicz) that fits a small ensemble of discrete conformers to local electron density rather than a single conformer. Three algorithmic stages: (1) sample — generate thousands of candidate conformers per residue by systematically rotating \(\chi\) angles, perturbing backbones, and enumerating ligand conformers; (2) score — compute each candidate’s predicted electron density and compare to the observed map; (3) select — solve a mixed-integer quadratic program (via CVXPY) that picks at most ~4 conformers with occupancies \(w_k\) minimizing \(\lVert \rho_\text{obs} - \sum_k w_k \rho_k^\text{pred} \rVert^2\) subject to \(w_k \ge 0\), \(\sum_k w_k \le 1\). Output is a standard mmCIF/PDB file with altlocs and occupancies, refined with Phenix as a final step. Works for X-ray crystallography and cryo-EM (via qfit_protein) and for ligands (via qfit_ligand). Lives squarely in the maximum-parsimony school of ensemble representation: discrete, interpretable conformers with explicit weights. Operates per-residue, so its output is fundamentally a marginal representation — there is no joint distribution across residues, and the mmCIF format it writes to cannot encode one. Key papers: van den Bedem et al. 2009 (original idea), Keedy et al. 2015 PLoS Comput Biol (backbone alternatives), Riley et al. 2021 Protein Sci (qFit 3.0 codebase, ligands, cryo-EM), Wankowicz et al. 2024 eLife (automated multiconformer modeling for X-ray and cryo-EM). Repo: https://github.com/ExcitedStates/qfit-3.0.

Composite omit map. A specially-computed electron density map where systematic regions of the model are excluded (“omitted”) during map calculation and the map is reassembled from the omitted patches. The result is largely free of model bias — i.e., the map doesn’t just reflect what was in the input model. qFit takes a composite omit map as input (generated by Phenix’s phenix.composite_omit_map) so that the conformers it identifies are supported by the data, not echoes of the starting structure.

Mixed-integer quadratic program (MIQP). An optimization problem with a quadratic objective and both continuous and integer (typically binary) decision variables; the integer variables make the problem non-convex even when the objective itself is convex. qFit’s selection step is an MIQP: the continuous variables are the occupancies \(w_k\), the binary variables enforce “at most \(N\) nonzero \(w_k\),” and the objective is the squared map-fit residual. Solved with CVXPY.

Duino format terms

Concepts specific to the Duino format. Linked from first use in the chapters.

Core layer / ensemble layers. The core layer describes the “what”: the primary physical components and their named groupings, invariant across every conformational state the deposition encodes. The ensemble layers describe how those primitives vary – which states are representable and how they are stored. A deposition with one structure and one with ten thousand samples share the same core; the ensemble layers are what differ.

Scope. The set of nodes in the core that a given heterogeneity descriptor applies to. Formally, a scope handle is either an inline reference into Hierarchy (one atom, one residue, one chain, the whole assembly; or a span like a residue range – a singleton-or-span grouping) or a reference to a previously declared Groupings entry (the atoms of a named TLS partition, the residues of a CATH domain). A descriptor’s scope determines (i) which atoms its displacement contribution \(\Delta_i^\ell(\cdot)\) affects, and (ii) where the descriptor sits in the scope DAG when it is composed with other descriptors.

Heterogeneity regime. A label for the kind of variation a descriptor describes, independent of how it is stored. Three regimes cover the space: R1 (discrete ensemble), R2 (trajectory), R3 (continuous landscape). Each has a different natural representation; the boundaries between them are meaningful design seams.

Materialization mode. A storage strategy for a heterogeneity descriptor: Mode A stores every state explicitly, Mode B stores a base plus sparse per-state deltas, Mode C stores an operator (parametric Gaussian, basis, neural decoder, or external reference) plus its inputs. The mode is independent of the regime; the same descriptor can be re-materialized between modes without changing its meaning.

Reference structure. \(x^{\mathrm{ref}}\) is the single anchor: a per-atom Cartesian position, one set, in \(\mathbb{R}^3\). It is what the deposition would render as if every heterogeneity descriptor were inactive. Every descriptor’s contribution is expressed as a displacement against it.

Discrete-nesting stack. The composition rule for Regime 1 descriptors at multiple scopes. Each child descriptor declares which parent state activates which child state set; only legal joint states exist at render time. Storage scales with the count of legal joint states, not the Cartesian product of per-scope state spaces.

Continuous-additive stack. The composition rule for Regime 3 Gaussian descriptors at multiple scopes. Each level contributes a per-atom \(3 \times 3\) covariance; the total atomic displacement covariance is the sum of those contributions, under the assumption that levels are uncorrelated. ECHT [@pearce2021echt] is the published instance.

Sample axes. A descriptor’s sample axis is the index along which the deposition stores its multiple samples (frames, models, particles, latent draws). Descriptors are aligned when they share a named sample axis (sample \(i\) of one corresponds to sample \(i\) of another), broadcast when they have no sample axis at all (the descriptor is parametric and renders by drawing on demand), and mixed when a deposition combines an aligned core with one or more broadcast satellites.

Operator interface. The contract every Mode C subcase satisfies: a function with the shape (state_input, reference) -> displacement, plus declared input shape, output shape, atom-id ordering, and reference frame. Parametric Gaussians, basis descriptors, neural decoders, and external references all fit this surface; they differ in what state_input is, what backend the call lands on, and whether the output is a draw or a distribution.
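One way to write the contract down, sketched with typing.Protocol and a basis descriptor as the concrete subcase. The attribute names (input_shape, output_shape, atom_ids, frame) and the BasisOperator class are hypothetical spellings of the declared items, not a published API.

```python
from typing import Protocol, Tuple
import numpy as np

class Operator(Protocol):
    input_shape: Tuple[int, ...]
    output_shape: Tuple[int, ...]
    atom_ids: Tuple[str, ...]
    frame: str

    def __call__(self, state_input: np.ndarray, reference: np.ndarray) -> np.ndarray:
        ...

class BasisOperator:
    """Basis descriptor: displacement = sum_k coeff_k * mode_k (a draw, not a distribution)."""

    def __init__(self, modes, atom_ids, frame="reference"):
        self.modes = np.asarray(modes, dtype=float)  # shape (K, N, 3)
        self.input_shape = (self.modes.shape[0],)
        self.output_shape = self.modes.shape[1:]
        self.atom_ids = tuple(atom_ids)
        self.frame = frame

    def __call__(self, state_input, reference):
        coeffs = np.asarray(state_input, dtype=float)
        if coeffs.shape != self.input_shape:
            raise ValueError(f"expected state_input of shape {self.input_shape}")
        if np.shape(reference) != self.output_shape:
            raise ValueError(f"expected reference of shape {self.output_shape}")
        # Contract the K coefficients against the K basis modes.
        return np.tensordot(coeffs, self.modes, axes=1)

# Hypothetical two-atom, two-mode example.
modes = np.zeros((2, 2, 3))
modes[0, 0] = [1.0, 0.0, 0.0]  # mode 0 moves atom 0 along x
modes[1, 1] = [0.0, 1.0, 0.0]  # mode 1 moves atom 1 along y
op: Operator = BasisOperator(modes, atom_ids=("A/1/CA", "A/2/CA"))

reference = np.array([[0.0, 0.0, 0.0], [3.8, 0.0, 0.0]])
displacement = op(np.array([0.5, 2.0]), reference)
rendered = reference + displacement
```

A parametric Gaussian or a neural decoder would keep the same call shape and change only what state_input means and where the computation runs.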

Selector. A query expression that names a molecular subset across the core layers (atoms, residues, chains, domains, named groupings) and across conformational states. Selectors are the address space of the format: every annotation, every cached pass output, every overlay attaches to a selector rather than to raw atom indices, so the binding survives renumbering.

Annotation overlay. A separate, additive layer of (selector, body, provenance) triples attached to a structure without modifying its core. Modeled after copick’s overlay filesystem: the base data is read-only, overlays are writable and namespaced by their producer, and multiple annotators coexist without stepping on each other or on the underlying structure.
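A toy in-memory version of the overlay idea, assuming a dict-like base and plain-string selectors. The class names, the selector syntax, and the producer names are hypothetical, and nothing here reproduces copick's actual API.

```python
from dataclasses import dataclass
from types import MappingProxyType

@dataclass(frozen=True)
class Annotation:
    selector: str
    body: dict
    provenance: dict

class OverlayStore:
    def __init__(self, base):
        # The base is exposed through a read-only mapping view.
        self.base = MappingProxyType(dict(base))
        self._overlays = {}  # producer namespace -> list of annotations

    def write(self, producer, annotation):
        self._overlays.setdefault(producer, []).append(annotation)

    def read(self, selector):
        """Return (producer, annotation) pairs attached to a selector."""
        return [
            (producer, ann)
            for producer, anns in self._overlays.items()
            for ann in anns
            if ann.selector == selector
        ]

store = OverlayStore({"n_atoms": 3})
sel = "chain A and resid 42"  # hypothetical selector syntax
store.write("curator_1", Annotation(sel, {"label": "catalytic"}, {"by": "human"}))
store.write("model_x", Annotation(sel, {"pLDDT": 91.2}, {"by": "model"}))

hits = store.read(sel)

# Writes to the base are rejected; overlays are the only writable layer.
try:
    store.base["n_atoms"] = 4
    base_is_read_only = False
except TypeError:
    base_is_read_only = True
```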

Model-side artifact. A first-class deposition object whose origin is a model rather than an experiment: stored weights, latent coordinates, decoder checkpoints, predicted ensembles, training metadata. All of these route through the same (selector, body, provenance) overlay machinery as human-curated annotations; the difference is what the provenance has to carry to make the artifact trustable: weights hash, training data version, evaluation metrics.

Differentiable read. An access pattern that delivers structural data directly into a tensor (numpy / torch / jax) without an intermediate parsed object. Format-level support means a Zarr or Arrow-backed array maps to a tensor in one zero-copy step; format-level failure means every consumer copies into its own representation between disk and tensor.
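The contrast can be shown with NumPy alone, using an in-memory byte string as a stand-in for a stored chunk. This is only an analogy for the access pattern: whether a real Zarr or Arrow read is zero-copy depends on compression and memory layout, which the sketch does not model.

```python
import numpy as np

n_atoms = 4

# Stand-in for an uncompressed, little-endian float32 chunk on disk.
raw = np.arange(n_atoms * 3, dtype="<f4").tobytes()

# Zero-copy path: the array is a view over the existing bytes.
zero_copy = np.frombuffer(raw, dtype="<f4").reshape(n_atoms, 3)

# Parse-then-copy path: values pass through Python objects into new memory.
parsed_copy = np.array(
    [[float(v) for v in row] for row in zero_copy], dtype="<f4"
)

same_values = np.array_equal(zero_copy, parsed_copy)
view_owns_memory = zero_copy.flags.owndata      # False: borrows from raw
copy_owns_memory = parsed_copy.flags.owndata    # True: separate allocation
```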

Cached pass. The stored output of a deterministic function over the core (a neighbour graph, a backbone-dihedral table, a SASA array). Lives in an annotation overlay namespace keyed by (producing pass id, parameters, version) rather than in the IR itself. Two consumers with different parameters produce two overlays; neither invalidates the core or the other.
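A sketch of the keying scheme, assuming the key is a hash of (pass id, parameters, version) and the overlay namespace is a plain dict. The pass name, the helper functions, and the three-atom core are all illustrative.

```python
import hashlib
import json
import numpy as np

def pass_key(pass_id, params, version):
    """Stable key from (producing pass id, parameters, version)."""
    payload = json.dumps(
        {"pass": pass_id, "params": params, "version": version}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

def neighbour_pairs(coords, cutoff):
    """Deterministic pass: atom pairs (i < j) closer than cutoff."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    i, j = np.nonzero(np.triu(d < cutoff, k=1))
    return list(zip(i.tolist(), j.tolist()))

# Hypothetical core: three atoms on a line at x = 0, 1.5, 4.0.
core = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [4.0, 0.0, 0.0]])
core_before = core.copy()

overlays = {}
keys = {}
for cutoff in (2.0, 3.0):
    key = pass_key("neighbour_graph", {"cutoff": cutoff}, "1.0")
    keys[cutoff] = key
    overlays[key] = neighbour_pairs(core, cutoff)
```

The two cutoffs land under two different keys, so neither result overwrites the other and the core array is never written to.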

Track A — ensemble representation. The deposition itself becomes a Boltzmann-weighted ensemble rather than a single conformation. The format provides explicit machinery for storing \(\{(x_i, w_i)\}\)-shaped objects: discrete states with weights (R1), trajectories (R2), continuous latent landscapes with stored decoders (R3), with composition rules that handle nested heterogeneity descriptors across scales. This is the Heterogeneity and Materialization work in chapter 2.
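A minimal R1-style example of a \(\{(x_i, w_i)\}\) object, assuming state energies are given in units of \(k_B T\) so that \(w_i = e^{-E_i / k_B T} / Z\). The energies and coordinates are made up for illustration.

```python
import numpy as np

# Hypothetical energies of three discrete states, in units of k_B T.
energies_kT = np.array([0.0, 1.0, 3.0])

boltzmann = np.exp(-energies_kT)   # unnormalized Boltzmann factors
Z = boltzmann.sum()                # partition function over the stored states
weights = boltzmann / Z            # w_i, summing to 1

# x_i: three states of a two-atom fragment, shape (states, atoms, 3).
states = np.array([
    [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [4.0, 0.0, 0.0]],
])

# An ensemble-averaged observable: the mean inter-atom distance.
distances = np.linalg.norm(states[:, 1] - states[:, 0], axis=-1)
mean_distance = float(weights @ distances)
```

Because \(Z\) here runs only over the stored states, adding or dropping a state changes every weight, which is the normalization point made in the partition-function entry.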

Track B — forward-operator infrastructure. The deposition standardizes the bridge between ensemble and experimental data. Each measurement modality contributes a uniform record: (operator_ref, scope, parameters, observation, noise_model_ref, provenance). The operator is a registry-referenced (or, for novel measurements, embedded-graph-referenced) function that computes a predicted observation from the ensemble. The noise model is a separately-referenced distribution describing measurement uncertainty. Together they produce the likelihood term \(\log p(y \mid \text{ensemble})\) that downstream inference engines consume.
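A sketch of how one such record could be evaluated, assuming dict-based registries and a Gaussian noise model. The registry keys, the split of parameters into "operator" and "noise" sub-dicts, and the mean-distance operator are hypothetical choices made for the example.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MeasurementRecord:
    operator_ref: str
    scope: str
    parameters: dict
    observation: np.ndarray
    noise_model_ref: str
    provenance: dict

def mean_distance(states, weights, i, j):
    """Forward operator: ensemble-averaged distance between atoms i and j."""
    d = np.linalg.norm(states[:, i] - states[:, j], axis=-1)
    return np.array([np.sum(weights * d)])

def gaussian_logpdf(y, y_pred, sigma):
    """Noise model: independent Gaussian errors with standard deviation sigma."""
    z = (y - y_pred) / sigma
    return float(np.sum(-0.5 * z**2 - np.log(sigma * np.sqrt(2.0 * np.pi))))

# Hypothetical registries mapping references to callables.
OPERATORS = {"demo/mean_distance": mean_distance}
NOISE_MODELS = {"demo/gaussian": gaussian_logpdf}

def log_likelihood(record, states, weights):
    """log p(y | ensemble) for one measurement record."""
    op = OPERATORS[record.operator_ref]
    noise = NOISE_MODELS[record.noise_model_ref]
    y_pred = op(states, weights, **record.parameters["operator"])
    return noise(record.observation, y_pred, **record.parameters["noise"])

# Two states of a two-atom fragment with distances 1.0 and 3.0.
states = np.array([
    [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [3.0, 0.0, 0.0]],
])
weights = np.array([0.5, 0.5])

record = MeasurementRecord(
    operator_ref="demo/mean_distance",
    scope="atoms 0,1",
    parameters={"operator": {"i": 0, "j": 1}, "noise": {"sigma": 0.5}},
    observation=np.array([2.0]),
    noise_model_ref="demo/gaussian",
    provenance={"source": "toy example"},
)
ll = log_likelihood(record, states, weights)
```

Keeping the operator and the noise model behind separate references lets either be swapped without touching the stored observation.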

Things still to add

  • FLEXR (electron-density-map sampling), Stachowski & Fischer 2023.
  • cryoENsemble (Bayesian reweighting of MD ensembles against cryo-EM maps), Włodarski et al. 2024.
  • AlphaFlow / AlphaFold2 ensemble manipulations.
  • Boltzmann generators (Noé et al., normalizing flows for sampling Boltzmann distributions).
  • Diffuse scattering ensemble methods.
  • Three-dimensional variability analysis (3DVA) in cryo-EM.