Annotations and computed passes

Format and storage foundations

Structural biology data is fundamentally multidimensional: atoms have properties (element, charge, name), atoms belong to residues and chains (hierarchy), atoms have coordinates that vary across conformational states (ensemble dimension), and various scalar/vector/tensor fields can be associated with atoms, residues, or the entire system at any level of this hierarchy. A text file of atom records serializes this but sacrifices random access, compression, and dimensional structure.

Technologies from adjacent fields that offer better primitives:

Zarr v3 – chunked N-dimensional arrays on any storage backend. Simple (just arrays + groups + JSON metadata), cloud-native, implementations in Python/Rust/JS/C++. github.com/zarr-developers/zarr-python, github.com/zarrs/zarrs (Rust).

OME-Zarr (NGFF) – bioimaging conventions on Zarr: multiscale pyramids (same data at multiple resolution levels, enabling zoom-dependent loading), axes labels, coordinate transforms, labels/ROIs. Used by CZ CryoET Data Portal. Community-driven RFC process. ngff.openmicroscopy.org, github.com/ome/ome-zarr-py.

copick – storage-agnostic cryo-ET dataset API built on Zarr/OME-Zarr. Adds overlay filesystem (read-only data + writable annotations), typed annotation objects (picks, segmentations, meshes) with provenance, plugin CLI. github.com/copick/copick.

TileDB – array database with native sparse arrays, built-in query engine, time-travel/versioning. TileDB-SOMA data model (for single-cell genomics) solved a structurally analogous problem: (cells x genes) with metadata on both axes, at scale. github.com/TileDB-Inc/TileDB, github.com/single-cell-data/TileDB-SOMA.

Apache Arrow / DataFusion – columnar in-memory format + query engine (Rust). Predicate pushdown (filters pushed to storage layer), zero-copy cross-language interop. A zarr-datafusion crate already exists: github.com/jayendra13/zarr-datafusion. datafusion.apache.org.

Multiscale pyramids for molecular data. OME-Zarr stores the same image at progressively lower resolutions (full -> 2x downsampled -> 4x -> …) for efficient visualization. The analog for molecular structures: all-atom -> backbone-only -> domain centroids -> subunit centroids, with explicit mapping operators between levels. This is the coarse-graining hierarchy that IHM already represents in a limited way, and it maps directly onto the representation layer described in Architecture § Representation.
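As a concrete illustration of the mapping-operator idea, here is a minimal numpy sketch (all names hypothetical) of a two-level pyramid: all-atom coordinates plus residue centroids, linked by an explicit atom-to-residue index array in place of OME-Zarr's downsampling transforms:

```python
import numpy as np

# Hypothetical two-level "molecular pyramid": level 0 is all-atom coordinates,
# level 1 is residue centroids, and the mapping operator between levels is an
# explicit atom -> residue index array stored alongside the coordinate arrays.
rng = np.random.default_rng(0)
n_atoms = 12
coords = rng.normal(size=(n_atoms, 3))       # level 0: all-atom
atom_to_residue = np.repeat([0, 1, 2], 4)    # mapping operator: 4 atoms/residue

n_res = atom_to_residue.max() + 1
centroids = np.zeros((n_res, 3))             # level 1: residue centroids
np.add.at(centroids, atom_to_residue, coords)
centroids /= np.bincount(atom_to_residue)[:, None]
```

In a Zarr layout each level would be its own array in a group, with the mapping operator stored as a third array, so a viewer can load only the level appropriate to the current zoom.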

Tradeoffs (needs benchmarking on real structural biology workloads):

Concern                       | Zarr + DataFusion           | TileDB                 | HDF5 (status quo for MD)
Simplicity                    | High (just arrays)          | Medium (database engine) | Medium
Cloud-native                  | Yes (chunked objects on S3) | Yes                    | Poor (no chunk-level access over HTTP)
Sparse arrays                 | No (fake with compression)  | Native                 | No
Query engine                  | External (DataFusion)       | Built-in               | None
Versioning                    | Manual                      | Built-in (time-travel) | None
Bioimaging precedent          | OME-Zarr, copick, CZ Portal | cellxgene Census       | H5MD for MD trajectories
Random access to single state | O(chunk)                    | O(1) with index        | O(N) scan

Representing conformational heterogeneity

The problem in one sentence: we need a data model that can represent everything from a two-rotamer sidechain flip to a continuous distribution over ribosome conformations, with correlations between components, thermodynamic annotations, and multi-scale descriptions – without enumerating a full coordinate set for each distinguishable configuration of large systems.

The architecture section proposes three heterogeneity regimes with different natural representations, plus a factor graph regime folded into a note. This section surveys the existing tools and formalisms against that taxonomy.

Existing tools by regime.

Heterogeneity Regime 1 (discrete ensemble): mmCIF alt_id is a degenerate case – two conformers per residue, no correlations, no thermodynamics. IHM multi-state adds named discrete states with ordering and population, but states remain independent and uncorrelated. Wankowicz & Fraser (2024) push this as far as it goes within mmCIF – their proposed categories handle correlated multi-conformer models and ensemble validation within the text-CIF framework. Understanding exactly where their proposal hits the wall is essential for calibrating what requires a new format vs. what can be encoded in extended mmCIF.

Heterogeneity Regime 2 (trajectory): H5MD (github.com/h5md/h5md) handles this well for MD. MDTraj and MDAnalysis both read and write it. The main gap is the absence of cloud-native access: HDF5 doesn’t support chunk-level HTTP range requests, so trajectory analysis in the cloud currently requires downloading the full file. A Zarr-based trajectory convention that mirrors H5MD semantics but runs on object storage is the obvious gap to fill, and it is a format convention question, not a data model question.

Heterogeneity Regime 3 (continuous landscape): cryoDRGN (github.com/ml-struct-bio/cryodrgn) learns a continuous heterogeneity manifold from cryo-EM particle images and decodes conformations at arbitrary latent coordinates. There is no standard serialization for the latent-plus-decoder representation – each cryoDRGN run produces its own checkpoint format. The right abstraction for storage is: an \((N_{\mathrm{particle}}, d)\) latent coordinate array plus a decoder reference, with named landmarks as optional cluster assignments. Normal mode analysis produces a simpler version of the same thing: a \((k, N_{\mathrm{atom}}, 3)\) mode basis plus scalar mode coefficients. Both are Materialization Mode C. Markov State Models (PyEMMA, deeptime) sit at the border between Heterogeneity Regime 2 and Heterogeneity Regime 3: they discretize a continuous trajectory into metastable states and produce a transition matrix over those states, which can be stored as a sparse \((N_{\mathrm{state}}, N_{\mathrm{state}})\) matrix with equilibrium populations – a Heterogeneity Regime 1 coordinate set plus a kinetic annotation in the shape of a Heterogeneity Regime 3 descriptor.
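A minimal sketch of what the latent-plus-decoder storage abstraction could look like, assuming illustrative field names (none of this is an existing standard, and the decoder reference is just a URI to an external checkpoint):

```python
import numpy as np
from dataclasses import dataclass, field

# Hypothetical schema for the Materialization Mode C abstraction described
# above: an (N_particle, d) latent coordinate array, a reference to a decoder
# artifact, and optional named landmarks stored as particle index arrays.
@dataclass
class LatentEnsemble:
    latent: np.ndarray               # shape (N_particle, d)
    decoder_ref: str                 # e.g. URI of a cryoDRGN checkpoint
    landmarks: dict = field(default_factory=dict)  # name -> particle indices

ens = LatentEnsemble(
    latent=np.zeros((1000, 8)),
    decoder_ref="s3://bucket/run42/decoder.pt",
    landmarks={"open": np.array([0, 5, 9])},
)
```

The NMA case fits the same shape with the mode basis playing the role of the decoder.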

The factor graph regime: this is where the existing literature on probabilistic graphical models (pgmpy, Stan, PyMC) is relevant. The key constraint that makes it tractable is sparsity: each factor couples only a small number of variables, and most variables are locally independent. The classic mmCIF limitation – “alt A of residue 50 co-occurs with alt A of residue 80” – is a two-variable factor between two discrete rotamer variables, a \(2 \times 2\) table. That is entirely representable. What is not representable this way is a collective motion involving hundreds of residues: that belongs in Heterogeneity Regime 3 by any reasonable criterion, and forcing it into the factor graph regime produces a factor whose table is exponential in the number of participating residues.
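The two-variable factor from the example above can be written out directly; the probabilities here are illustrative:

```python
import numpy as np

# A 2x2 factor coupling the rotamer state of residue 50 to that of residue 80.
# Rows index residue 50 (alt A, alt B); columns index residue 80 (alt A, alt B).
factor = np.array([[0.45, 0.05],
                   [0.05, 0.45]])

marginal_50 = factor.sum(axis=1)   # marginal distribution over residue 50
# The coupling is visible in the table: alt A of residue 50 co-occurs with
# alt A of residue 80 nine times more often than with alt B.
```

A factor over k discrete variables with m states each needs an m^k table, which is why the regime boundary matters: at k = 2 this is four numbers, at k = 100 it is astronomically large.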

The factor graph as a narrow tool, not a general framework. The factor graph formalism is mathematically clean and expressive within its regime. The design discipline required to use it without blowing up is: factor graphs are for the narrow case, normal modes and latent decoders are for Heterogeneity Regime 3, and the format must make the boundary between them explicit rather than leaving it to the depositor’s judgment. The signal that a system is crossing from the factor graph regime into Heterogeneity Regime 3 is when someone wants to introduce a factor over more than roughly five to ten variables – at that point, a continuous parameterization almost always exists and is more compact.

Correlation and coupling. There is a third representation worth noting for cases that sit between regimes: a sparse pairwise correlation array or graphical model specifying which residue pairs are conformationally coupled, without committing to a full factor table for each pair. This is useful as a compact annotation – a summary of an MD ensemble or an NMR-derived coupling network – without requiring that the full joint distribution be stored.

Where we are. The existing literature covers Heterogeneity Regime 1 well (qFit, multi-conformer crystallography, IHM discrete states), Heterogeneity Regime 2 adequately (H5MD, MDTraj), and Heterogeneity Regime 3 experimentally but without standard serialization (cryoDRGN, manifold embedding methods). The factor graph regime has the mathematical machinery but no format-level convention. The gaps that need new work: a cloud-native trajectory convention (Heterogeneity Regime 2 on Zarr), a standard serialization for latent-plus-decoder representations (Heterogeneity Regime 3), and a compact factor graph encoding with explicit regime boundary conventions.

Query language and annotations

Query language. Need: select arbitrary molecular subsets (atoms, residues, chains, domains) at any level of hierarchy, across conformational states, composably. MolQL (github.com/molstar/molstar, embedded in Mol*) is the most expressive existing molecular selection language. VMD/MDAnalysis selections are simpler but widely used. Extension needed for: state selection, cross-state predicates, annotation predicates.

If storage is columnar/array-based, molecular selections can compile to predicates that a query engine (DataFusion) pushes down to the storage layer. “chain A and resid 50-80 in state 3” becomes: filter on chain_id column, range filter on res_seq column, index into state dimension – only relevant chunks are read. Spatial predicates (“within 5A of ligand”) need a stored spatial index (R-tree) and compile to a UDF call.
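A sketch of the compilation idea, with plain numpy arrays standing in for columnar storage and illustrative column names; each clause becomes a vectorized predicate on one column, and the state clause indexes the ensemble dimension:

```python
import numpy as np

# Columnar stand-ins: one value per atom, plus an ensemble coordinate array.
chain_id = np.array(["A", "A", "B", "A"])
res_seq  = np.array([49, 60, 60, 81])
coords   = np.zeros((5, 4, 3))           # (n_states, n_atoms, 3)

# "chain A and resid 50-80 in state 3" compiled to predicates:
mask = (chain_id == "A") & (res_seq >= 50) & (res_seq <= 80)
selected = coords[3, mask]               # only state 3, only matching atoms
```

With chunked storage, a query engine would evaluate the column predicates first and read only the coordinate chunks containing matching atoms.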

Annotations. Core model: (selector, body, provenance) triple. The selector is a query expression. The body is arbitrary typed data (text, scalar, vector, array, structured object, reference). The provenance records author, method, software, date, confidence.
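A minimal sketch of the triple as a data structure, with hypothetical field names:

```python
from dataclasses import dataclass
from typing import Any

# Illustrative shape of the (selector, body, provenance) core model.
@dataclass
class Provenance:
    author: str
    method: str
    software: str
    date: str
    confidence: float = 1.0

@dataclass
class Annotation:
    selector: str        # a query expression over the structure
    body: Any            # arbitrary typed payload
    provenance: Provenance

note = Annotation(
    selector="chain A and resid 50",
    body={"text": "density supports two rotamers"},
    provenance=Provenance("curator", "manual", "coot", "2024-06-01", 0.9),
)
```

Because the selector is an expression rather than a list of atom indices, the annotation survives re-refinement of the model as long as the selection still resolves.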

Oli Clarke’s Coot residue annotation tool (bsky.app/olibclarke/post/3micblpuoxc2u) demonstrates the impulse: save notes associated with the active residue, persist in mmCIF. The generalization: associate any data with any molecular selection, in a separate layer (copick overlay pattern) so multiple annotators coexist and annotations are decoupled from the model version.

This subsumes: UniProt features, CATH domains, validation metrics, qFit conformers, ML confidence scores, manual curation notes, force-field parameters – all are (selector, body, provenance) triples with different body types and different provenance.

Key design question: how to make selectors persistent and portable across structures (use canonical identifiers like UniProt residue numbers rather than PDB-specific numbering).

ML-native structural operations

The fundamental pattern in all current structure ML models: coordinates + chemistry -> graph -> message passing -> prediction. The specifics vary but converge on shared primitives. This section first catalogs those primitives and the redundant work they require under current formats, then maps the heterogeneity regime taxonomy onto the training data requirements for the next generation of distribution-predicting models.

Graph construction. A molecular graph G = (V, E) where V are atoms or residues with feature vectors, E are edges (spatial proximity, covalent bonds, sequence connectivity) with features. Edge construction is the bottleneck:

  • Spatial: all pairs within cutoff r (typically 8-12A). Requires neighbor search, O(N log N) with k-d tree or O(N^2) brute force. Rebuilt for every structure at every training step.
  • Covalent: bond graph from topology. Currently inferred from coordinates + element types because mmCIF doesn’t reliably carry bonds.
  • Sequence: (i, i+1) backbone edges.
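The spatial case can be sketched in a few lines of numpy; this is the O(N^2) brute-force variant, with a k-d tree or a stored spatial index replacing the full distance matrix in practice:

```python
import numpy as np

# Brute-force spatial edge construction: all directed atom pairs within a
# cutoff. O(N^2) in both time and memory; fine for a sketch, not for training.
rng = np.random.default_rng(1)
coords = rng.uniform(0, 20, size=(50, 3))   # 50 atoms in a 20 A box
cutoff = 8.0

d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
i, j = np.nonzero((d < cutoff) & (d > 0))   # d > 0 excludes self-edges
edges = np.stack([i, j], axis=1)            # directed edge list, shape (E, 2)
```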

The Hierarchy layer proposed in Architecture § Hierarchy directly addresses the covalent edge problem: an explicit bond graph in the core eliminates the inference step entirely. Spatial edges remain dynamic (cutoff and metric are model choices) but a stored spatial index – R-tree or k-d tree serialized alongside the coordinate array – would reduce neighbor construction from O(N log N) per structure to O(k) lookups per atom, where k is the number of neighbors within the cutoff.

Equivariant features. E(3)-equivariant models (NequIP, MACE, Allegro, eSCN) decompose features into irreducible representations of \(\mathrm{SO}(3)\). The displacement vector between atoms \(i\) and \(j\) is expanded in spherical harmonics \(Y_l^m(\mathbf{r}_{ij}/|\mathbf{r}_{ij}|)\) via e3nn (github.com/e3nn/e3nn). This computation is identical for every structure with the same topology and the same cutoff. If the format stored a spatial index enabling \(O(k)\) neighbor lookup and precomputed spherical harmonic expansions at stored edges, feature construction would become a read operation rather than a compute operation. Whether this is actually worth storing depends on the ratio of training time to storage cost.
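For intuition, the \(l = 1\) real spherical harmonics are just the components of the unit displacement vector up to a constant, so the lowest-order equivariant edge feature reduces to the normalized \(\mathbf{r}_{ij}\) (component ordering conventions vary between libraries):

```python
import numpy as np

# The degree-1 real spherical harmonics of a direction are its Cartesian
# components times sqrt(3 / 4*pi), so the l=1 edge feature is the unit
# displacement vector up to a constant and a component permutation.
r_ij = np.array([3.0, 0.0, 4.0])
r_hat = r_ij / np.linalg.norm(r_ij)   # (0.6, 0.0, 0.8)
c = np.sqrt(3 / (4 * np.pi))
Y1 = c * r_hat                        # real Y_1^m values, up to ordering
```

Higher-l harmonics are polynomials in these components, which is what makes precomputing them at stored edges a pure read in principle.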

Pair representations. AF2’s pair representation is an \((N_{\mathrm{res}}, N_{\mathrm{res}}, 128)\) tensor updated through the evoformer. Initial pair features – residue-residue distance bins, relative sequence position encoding, template distance and angle features – depend only on the structure, not on model weights, and could in principle be precomputed. At 128 float32 features and \(N_{\mathrm{res}} = 1000\) this is roughly 500 MB per structure uncompressed. Feasible if compressed and chunk-read, but probably only worth precomputing for large training campaigns where the same structure is seen many times. The format should make this storable without requiring it.
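The size estimate works out as follows:

```python
# Back-of-envelope for the pair representation size quoted above,
# assuming float32 storage.
n_res, n_feat, bytes_per = 1000, 128, 4
size_bytes = n_res * n_res * n_feat * bytes_per
size_mb = size_bytes / 1e6   # 512 MB uncompressed, i.e. roughly 500 MB
```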

Tokenization. AF3, Boltz-1, and Chai-1 tokenize the molecular system: each standard residue is one token, ligand atoms or functional groups become tokens, nucleotide bases are tokens. The token graph has heterogeneous node types and multiple edge types. The tokenization is model-specific but the underlying information it requires – atom types, bond connectivity, residue membership, entity type – is the same across models and is exactly what the Hierarchy layer stores. A format that carries this information explicitly enables tokenization as a deterministic read-and-map operation rather than an inference step.
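A sketch of tokenization as a deterministic read-and-map over hierarchy data, with an illustrative grouping rule (one token per standard residue, one token per ligand atom); the column names and rule are assumptions, not any specific model's scheme:

```python
import numpy as np

# Hierarchy columns as they would be read from the format: per-atom entity
# type and an atom -> residue index array.
entity_type = np.array(["protein"] * 6 + ["ligand"] * 3)
residue_idx = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

tokens = []
for res in np.unique(residue_idx):
    atoms = np.nonzero(residue_idx == res)[0]
    if entity_type[atoms[0]] == "protein":
        tokens.append(("residue", int(res)))            # one token per residue
    else:
        tokens.extend(("atom", int(a)) for a in atoms)  # one token per atom
```

With bonds and entity types stored explicitly, each model's tokenizer becomes a mapping like this rather than a chemistry-inference pipeline.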

Training data for distribution-predicting models. Current models predict single structures. The next generation will predict distributions over structures, conditioned on sequence and experimental context. The training data shape this requires is not (sequence, structure) pairs but (sequence, ensemble) pairs where the ensemble has associated populations or free energies.

Mapping this onto the heterogeneity regime taxonomy:

  • Heterogeneity Regime 1 ensembles (NMR bundles, qFit outputs, IHM discrete state models) are already close to the right shape – the format work needed is standardizing population and free energy annotations.
  • Heterogeneity Regime 2 trajectories are the largest existing source of ensemble data. The format work needed is efficient access to subsampled or stride-selected frames without reading the full trajectory, which is a chunking and indexing problem.
  • Heterogeneity Regime 3 representations (cryoDRGN, NMA) provide the most information-dense training signal because the latent coordinates are continuous and the decoder provides a differentiable generator. A model trained against these could in principle learn to predict latent coordinates directly rather than just Cartesian coordinates. The format work needed is standardizing the latent-plus-decoder serialization.
  • Factor graph ensembles are probably not a primary training data source for structure prediction models, but they are relevant for models that explicitly predict correlated sidechain or loop conformations.
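The chunking-and-indexing point for Regime 2 can be illustrated with a strided read; here an in-memory numpy array stands in for a chunked Zarr or HDF5 array, where the same slice would touch only the chunks containing selected frames:

```python
import numpy as np

# Stride-selected frame access against a trajectory array chunked along the
# frame axis. With object storage, traj[::stride] reads only the chunks that
# intersect the selection instead of the whole file.
n_frames, n_atoms = 10_000, 500
traj = np.zeros((n_frames, n_atoms, 3), dtype=np.float32)

stride = 100
frames = traj[::stride]   # every 100th frame: shape (100, n_atoms, 3)
```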

The format implication: ensemble training data packaging is not a separate problem from the heterogeneity representation. If ensembles have a native representation in the format with standard population annotations, packaging training data for distribution-predicting models is a read operation. Under current formats it requires custom pipeline code for every model and every data source.

References

Wankowicz, Stephanie A., and James S. Fraser. 2024. “Comprehensive Encoding of Conformational and Compositional Protein Structural Ensembles Through the mmCIF Data Structure.” IUCrJ. https://doi.org/10.1107/S2052252524005098.