Architecture

Before surveying specific technologies or existing formalisms, it is worth stating what shape we think the right data model has. This is a design claim, not a survey – a stake in the ground that the explorations in the next sections should be read against.

The central problem with forcing heterogeneous molecular data into a single abstraction – whether a flat atom table, a coords[state, atom, 3] tensor, or a single DAG – is that the data actually contains several mathematically distinct kinds of structure, each with different algebra and different storage requirements. Conflating them produces a format that handles each badly.

We split the architecture into two halves. The core layers describe what is true of every state of the system: its physical parts, and the named groupings scientists impose on those parts. The ensemble layers describe how states differ from one another and how those differences are stored. The core is stable under heterogeneity; the ensemble layers are the ones that earn the format its keep for modern use cases.

Within those halves there are four numbered layers:

  • Hierarchy (§2.1.1) – the physical nesting of atoms, residues, chain instances, entities, and the assembly.
  • Representation (§2.1.2) – named projection mappings from Hierarchy nodes to groupings.
  • Heterogeneity (§2.2.1) – the abstract description of how the system varies across states.
  • Materialization (§2.2.2) – the storage strategy that turns a heterogeneity description into coordinates.

Three distinct DAG structures show up, one in each of three layers, and conflating them is how existing formats end up tangled. It is worth naming them once up front so we can refer to each without ambiguity later:

  • the projection DAGs of Representation (§2.1.2), which map Hierarchy nodes onto coarser or curatorial groupings;
  • the scope DAG of Heterogeneity (§2.2.1), which encodes how descriptors at different scopes nest and couple;
  • the operator graph of Materialization (§2.2.2), whose stored operators turn latents and bases into coordinates.

The three structures live in different layers, have different semantics, and serve different consumers. Separating them is most of the design work.

2.1 Core layers

The core describes what is invariant across every conformational state the system can take: the atoms that exist, what they are called, what contains them, and what named groupings have been defined over them. A deposition with one structure and a deposition with an ensemble of ten thousand samples share the same core; the ensemble layers sit on top.

2.1.1 Hierarchy

The Hierarchy layer is the physical nesting of the system: which atom is in which residue, which residue is in which chain instance, which chain instance is in which assembly. It answers “what is this system made of?” It is irreducibly a tree (or forest for multi-molecule assemblies): each atom belongs to exactly one residue, each residue to exactly one chain instance, each chain instance to one entity template, each entity to the assembly. Single parentage is not a limitation – it is the correct algebra for physical containment. This is the one place in the architecture where we insist the data structure is a tree, because nesting by physical parthood genuinely is one. Any non-physical hierarchy someone might want to impose on the same atoms – CATH domain within chain, rigid-body group within domain, secondary-structure element within chain – is a curatorial choice and lives in Representation (§2.1.2), not here.

In array-native form this is parent-index arrays:

atom_residue_id[N_atom]
residue_chain_id[N_res]
chain_entity_id[N_chain]
instance_assembly_id[N_inst]

These give stable integer IDs that every other layer references. An annotation, a graph edge, a heterogeneity variable – all attach to an atom index or residue index from this layer. Hierarchy should therefore be as minimal as possible. What belongs here: atom element, atom name, residue comp_id, chain type, entity type. What does not belong here: backbone dihedrals, contact distances, domain assignments, validation scores, force-field parameters, TLS groupings, CATH domains. Those are derivable, curatorial, or non-unique; they live in Representation (§2.1.2) or as annotation overlays. The reason this discipline matters is stated in §2.3.

One open question: bond connectivity. Bond orders for standard residues can be inferred from comp_id lookups against the CCD, but for ligands this inference is unreliable – mmCIF doesn’t carry bond orders inline, and the CCD is a separate dependency. The case for putting an explicit bond graph in the hierarchy core (rather than treating it as a derived pass) is that bond connectivity is not computable from coordinates and element types alone for arbitrary small molecules, making it genuinely irreducible information for heterogeneous systems. This is revisited in ML-native structural operations.

A single-chain protein of 300 residues has a hierarchy tree spanning the four parent links: one assembly, one entity (the polypeptide), one chain instance, 300 residues, and ~2400 atoms. Parent-index arrays: atom_residue_id has length 2400 and points into a residue table of length 300; residue_chain_id has length 300 and points to chain 0; chain_entity_id[0] = 0; instance_assembly_id[0] = 0. Every query that needs “all atoms of residue 45” is an index lookup.
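The parent-index representation for this monomer can be sketched directly; a minimal illustration, assuming a uniform 8 atoms per residue for simplicity (array names follow the text):

```python
import numpy as np

N_res = 300
N_atom = 2400

# Each atom points to its unique residue parent; every residue to chain 0.
atom_residue_id = np.repeat(np.arange(N_res), N_atom // N_res)  # length 2400
residue_chain_id = np.zeros(N_res, dtype=int)                   # all -> chain 0
chain_entity_id = np.array([0])                                 # chain 0 -> entity 0
instance_assembly_id = np.array([0])                            # instance 0 -> assembly 0

# "All atoms of residue 45" is a single index lookup:
atoms_of_45 = np.flatnonzero(atom_residue_id == 45)
```

Every other layer attaches to these integer indices, which is why the arrays must stay minimal and stable.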

A bacterial 70S ribosome carries three polymer entities (16S rRNA, 23S rRNA, 5S rRNA) and ~50 protein entities, instantiated as 55+ chain instances in the assembly, plus bound tRNAs, mRNA, ions, and waters. Every chain instance has exactly one entity parent and exactly one assembly parent; every atom has exactly one residue parent; every residue has exactly one chain-instance parent. The tree is far broader than a monomer’s, but the algebra is identical – no chain instance has two entities, no atom has two residues. All the interesting multi-parent relations (e.g., “this atom is part of the A site, but only when a tRNA is bound there”) are Representation (§2.1.2) or Heterogeneity (§2.2.1), not Hierarchy.

2.1.2 Representation

A molecular system has many valid abstractions at different levels of granularity, and these are not all compatible with a single tree. A residue simultaneously belongs to a chain (that’s Hierarchy), to a CATH domain, to an active-site set, to a rigid-body partition used in refinement, to a secondary-structure element, and possibly to a pharmacophore grouping used in docking. Forcing these into one parent hierarchy creates category confusion; a residue cannot have five Hierarchy parents.

The Representation layer is therefore not a tree but a collection of named projection DAGs – mappings from hierarchy nodes to groupings. Each mapping is a sparse bipartite graph, possibly weighted:

mapping/backbone_only:
  child_atom_id[E], parent_bead_id[E], weight[E]

mapping/domain_partition_cath_v4.3:
  child_residue_id[E], parent_domain_id[E], weight[E]

mapping/rigid_body_partition_refinement:
  child_residue_id[E], parent_body_id[E], weight[E]

mapping/tls_groups_as_refined:
  child_atom_id[E], parent_tls_id[E], weight[E]

Mappings are named and versioned – there is no universal “parent” relation in this layer, only separate namespaces. Two groups using different domain partition schemes produce different mappings; neither invalidates the other or touches Hierarchy. This is the molecular analog of OME-Zarr multiscales: the same system described at multiple resolutions with explicit, queryable operators between levels.

This is the layer where DAG structure genuinely lives in the core. Every representation mapping is a bipartite DAG from some Hierarchy level to a target grouping; multiple mappings coexist as named siblings. A coarse-graining operator and an anatomical-domain partition sit next to each other, both referencing the same atom IDs but projecting to different label sets. The weights on edges handle non-crisp assignments: a boundary residue between two CATH domains can contribute with weight 0.5 to each; a coarse-grained bead built from four atoms stores the weights used to compute its position.

Three kinds of representation mapping recur and are worth distinguishing:

  • Coarse-graining mappings project fine Hierarchy levels onto coarser labels (atom \(\to\) residue is already Hierarchy; atom \(\to\) backbone-bead or residue \(\to\) Martini-bead is Representation). These are what IHM’s multi-scale model attempts, and the layer where CG beads should live.
  • Grouping mappings project Hierarchy levels onto curatorial labels (CATH domains, secondary-structure elements, active-site membership). These are many-to-one or many-to-many and never unique; different sources produce different mappings of the same system.
  • Partition mappings project Hierarchy atoms or residues onto refinement-defined groups (TLS bodies, rigid-body partitions, symmetry-equivalent copies). These are often produced by a specific piece of software and are best stored with enough provenance to reproduce.

Every mapping in Representation references Hierarchy IDs as its left side and declares its own label set on the right. None of them alters Hierarchy. The storage discipline is the same as for a sparse matrix: edge arrays plus per-edge data.

CATH classifies the residues of a protein into hierarchical structural domains. For a 400-residue two-domain kinase, the mapping might be: child_residue_id = [0..220, 221..399], parent_domain_id = [cath_3.30.200.20, cath_1.10.510.10], weight = [1.0, 1.0]. A second mapping from the same deposition might point to SCOP instead, with different domain boundaries. Both coexist, both reference the same residue IDs, neither is the single truth.

A crystallographer assigns the N-terminal and C-terminal domains to separate TLS groups during refinement. The mapping is atoms \(\to\) group: child_atom_id covers every atom in the structure, parent_tls_id takes two values (tls_N, tls_C), and weights are 1.0 because each atom belongs to exactly one group. The 20-parameter (T, L, S) tensors themselves are not stored in this mapping – they live in the Heterogeneity layer as a Regime 3 (degenerate continuous) descriptor scoped to the TLS groups defined here. The mapping is the hierarchy-to-group projection; the heterogeneity descriptor is the parameters.

A Martini representation projects heavy atoms onto roughly 4-to-1 coarse-grained beads. For a lysine residue, the four backbone atoms (N, CA, C, O) might project onto a backbone bead with weights that average their positions; one sidechain bead covers (CB, CG, CD) and a second covers (CE, NZ), each with its own weights. Stored as: child_atom_id, parent_bead_id, weight edge arrays of length \(E\) (the number of atom-to-bead contributions). The coordinates of the beads are not stored here – they are computed as a weighted average when needed, using these weights. This keeps the representation mapping frozen while coordinates vary across heterogeneity states (§2.2.1). This is the projection DAG in action: any mapping that can be expressed as “combine source node coordinates with these weights to get target node coordinates” lives here.
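Materializing bead positions from the edge arrays is a weighted scatter-add; a minimal sketch with illustrative weights (a uniform average per bead, not a real Martini parameterization):

```python
import numpy as np

coords = np.random.rand(6, 3)                  # 6 atoms, one heterogeneity state

# Representation mapping edge arrays: 4 atoms -> bead 0, 2 atoms -> bead 1.
child_atom_id  = np.array([0, 1, 2, 3, 4, 5])
parent_bead_id = np.array([0, 0, 0, 0, 1, 1])
weight         = np.array([0.25, 0.25, 0.25, 0.25, 0.5, 0.5])

n_beads = parent_bead_id.max() + 1
bead_xyz = np.zeros((n_beads, 3))
# Scatter-add each weighted atom position into its parent bead.
np.add.at(bead_xyz, parent_bead_id, weight[:, None] * coords[child_atom_id])
```

Because the mapping is frozen, re-running this against a different coordinate state yields that state's bead positions with no change to the Representation layer.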

2.2 Ensemble layers

The ensemble layers describe how the system varies across the states the data captures, and how those variations are physically laid out in storage. Heterogeneity is the abstract description of variation; Materialization is the storage strategy. They are separated because the same abstract description can be stored several ways depending on ensemble size, access pattern, and whether the data came from a method that produces samples, parameters, or a trajectory.

The bridge between the core and the ensemble layers is scope.

Scope. The set of nodes in the core that a given heterogeneity descriptor applies to. Formally, a scope handle is a reference into Hierarchy or Representation – a specific Hierarchy node (one atom, one residue, one chain, the whole assembly), a span of Hierarchy nodes (a residue range), or a label from a Representation mapping (the atoms in a named TLS group, the residues in a CATH domain). A descriptor’s scope determines (i) which atoms its displacement contribution \(\Delta_i^\ell(\cdot)\) affects, and (ii) where the descriptor sits in the scope DAG when it is composed with other descriptors.

Every heterogeneity descriptor attaches to exactly one scope. A B-factor is atom-scoped (atoms are Hierarchy nodes); a TLS descriptor is scoped to a group declared in Representation; a cryoDRGN latent is scoped to the whole assembly. The scope is what makes a descriptor legible to composition (chapter 3): it is the mechanism by which “what varies at chain level” and “what varies at residue level” can coexist in the same deposition without stepping on each other.
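A scope handle can be sketched as a small tagged reference; the field and class names here are illustrative assumptions, not a committed schema:

```python
from dataclasses import dataclass
from typing import Literal, Optional, Tuple

@dataclass(frozen=True)
class Scope:
    """A reference into Hierarchy or Representation that a descriptor attaches to."""
    kind: Literal["hierarchy_node", "hierarchy_span", "representation_label"]
    level: str                                # e.g. "atom", "residue", "assembly"
    node_id: Optional[int] = None             # single Hierarchy node
    span: Optional[Tuple[int, int]] = None    # inclusive residue range
    mapping: Optional[str] = None             # e.g. "tls_groups_as_refined"
    label: Optional[str] = None               # e.g. "tls_N"

# The three examples from the text:
b_factor_scope = Scope("hierarchy_node", "atom", node_id=1402)
tls_scope = Scope("representation_label", "atom",
                  mapping="tls_groups_as_refined", label="tls_N")
latent_scope = Scope("hierarchy_node", "assembly", node_id=0)
```

The point of the tagged form is that composition (chapter 3) can dispatch on `kind` without knowing what the descriptor itself contains.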

2.2.1 Heterogeneity

The Heterogeneity layer describes how the system varies – both the geometric variation that matters for function and the thermodynamic variation that carries free-energy content single structures discard (Wankowicz and Fraser 2025; Lane 2023). This is the hardest layer to design because the right abstraction depends entirely on what kind of variation is being described, and different experimental and computational methods produce fundamentally different kinds of variation. Forcing them all into one formalism – a list of discrete states, a factor graph over per-residue variables, a continuous latent space – will produce something that handles each case awkwardly.

We distinguish three heterogeneity regimes, because each has a different natural representation and the transitions between them are meaningful design boundaries. A fourth formalism – the factor graph regime, collapsed into a note below – covers a narrow case that is rarely worth implementing as a first-class deposition feature. Orthogonal to the regime axis is scope: every heterogeneity descriptor attaches to a level of the Hierarchy tree (§2.1.1) or a grouping in Representation (§2.1.2). A B-factor is atom-scoped; a TLS group or ECHT hierarchical disorder component (Ploscariu et al. 2021) is scoped to a Representation grouping; the compositional/conformational nesting proposed by Wankowicz and Fraser (2024) is entity-instance scope containing residue-range scope. Both axes are features of the format.

When multiple heterogeneity descriptors coexist, their couplings (where they exist at all) form the scope DAG: nodes are (scope, state) pairs, and edges encode whether a child-scope state is legal only under some parent-scope state (hierarchical nesting) or whether a factor potential spans scopes (the rare cross-scope factor case). Chapter 3 walks through how descriptors at different scopes compose through this DAG to produce full coordinates.

Heterogeneity Regime 1: Discrete Ensemble. A finite, enumerable set of conformers – NMR bundles, qFit sidechain ensembles, multi-conformer crystallographic models, short MD-derived state libraries. The right representation is explicit: a state index plus coordinates or coordinate deltas per state, typically scoped to a residue or residue range. The number of states sits in the tens to low hundreds. No cleverness required.

The incumbent encoding in mmCIF is the altloc field paired with refined occupancy. What is broken is semantic, not structural: the same altloc letter overloads conformational variation (different geometry, same chemistry) and compositional variation (different chemistry entirely – ligand absent vs. present, two bound fragments in the same site) (Rosenberg, Marx, and Bronstein 2024; Wankowicz and Fraser 2024), and most analysis software silently discards alternates beyond A. Treating discrete states as a distinct regime rather than absorbing them into a continuous disorder parameter is empirically grounded: controlled simulations show refined B-factors can underestimate true positional heterogeneity by up to sixfold for mobile atoms (Kuzmanic, Pannu, and Zagrovic 2014), so the information carried by discrete conformers is not something B-factors ever recover. Binding events propagate rotamer rearrangements well beyond the binding site (Wankowicz et al. 2022), which is precisely the kind of coupling this regime is meant to capture.

Heterogeneity Regime 2: Trajectory. An ordered sequence of frames sharing a fixed topology – MD, coarse-grained simulation, time-resolved crystallography, a cryo-EM particle series treated as a time series. The Hierarchy, Representation, and per-atom properties (atom names, residue assignments, bond graph) are the same across every frame; what varies is the coordinate array. The right representation is what H5MD already does: a dense \((N_{\mathrm{frame}}, N_{\mathrm{atom}}, 3)\) array chunked along the frame axis, with per-frame metadata (time, energy, box vectors). The format question here is not about heterogeneity representation but about chunking strategy, compression, and efficient access patterns – and whether the coordinate array is inline or pointed to as an external artifact via Materialization Mode C (§2.2.2), which is usually the right call because trajectories dwarf everything else in the deposition.

The frame axis does not need to be interpreted as a state space, and for long trajectories it shouldn’t be. But derived state decompositions can coexist alongside the raw stream: a Markov state model reduces a trajectory to a small set of metastable states with a transition rate matrix (Bozovic et al. 2020), which is effectively a Heterogeneity Regime 1 ensemble with provenance pointing back to the Heterogeneity Regime 2 source. Both shapes are valid, and the format should support carrying them side by side rather than forcing a commitment to one.

You don’t store a few reference structures; you store one, plus deltas or raw coordinates for every frame.

“Trajectory” is broader than MD. Any ordered sequence of frames shares this shape: a cryo-EM time-resolved dataset, a photo-activated structural series, an electron tomography tilt series reduced to atomic models. The format doesn’t need to know whether the ordering is time, reaction coordinate, or experimental parameter – it just needs to carry the index.

Heterogeneity Regime 3: Continuous Landscape. The heterogeneity is not a finite state list but a distribution over a low-dimensional manifold – cryoDRGN latent spaces, normal mode deformations, rigid-body motions of domains, ribosome ratcheting. Here the critical insight is that a hundred residues moving together as a collective motion is not a hundred heterogeneity variables. It is one low-dimensional variable plus a stored mapping from that variable to the displacement field over all affected atoms:

mode_basis[k, N_atom, 3]      # k basis vectors
mode_coeff[N_sample, k]        # sample coordinates in latent space

or in the neural decoder case:

latent_z[N_sample, d]
decoder_ref                    # TorchScript blob or checkpoint reference

Named discrete states in this regime are best understood as landmarks or clusters in the continuous space – selected by a human or a clustering algorithm after the fact – rather than fundamental objects in the data model. The format stores both: the continuous representation (latent coordinates per sample) and any named landmarks, with provenance recording how the landmarks were defined.
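In the linear-basis case, reconstructing coordinates for every sample is a single tensor contraction; a sketch with illustrative sizes (array names follow the text, `ref_coords` is an assumed reference structure):

```python
import numpy as np

k, N_atom, N_sample = 3, 10, 5
ref_coords = np.random.rand(N_atom, 3)
mode_basis = np.random.rand(k, N_atom, 3)   # k basis displacement fields
mode_coeff = np.random.rand(N_sample, k)    # per-sample latent coordinates

# One contraction materializes all samples: shape (N_sample, N_atom, 3).
coords = ref_coords + np.einsum("sk,kad->sad", mode_coeff, mode_basis)
```

The neural-decoder case has the same shape of computation – latent in, displacement field out – with the einsum replaced by a forward pass through the stored decoder.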

One reality check: this regime is already the output of a growing class of structure-prediction and reconstruction methods, but the field has nowhere to put it. Diffusion-based structure priors (Chung et al. 2024; Levy et al. 2025) and cryo-EM-guided samplers (Raghu et al. 2025) produce either pretrained generative models usable as decoders or posterior sample sets with informative uncertainty – outputs whose natural representation is continuous, and whose current fate is being collapsed back into a Materialization Mode A sample of explicit structures because there is no format slot for “a decoder reference” or “samples from a posterior linked to this structure.” Lane (2023)’s framing of the single-structure frontier is the field-level complaint this regime is meant to answer; Materialization Mode C, below, is the materialization half of the fix.

A fourth formalism covers a narrow case: genuinely independent local discrete variables with sparse coupling – a ligand that is either absent or in one of a few binding poses (Flowers et al. 2025), two distal sidechains whose rotamers are experimentally established to be correlated. A factor graph is the formally right abstraction: the joint distribution \(P(z) \propto \prod_f \psi_f(z_f)\), with each factor \(\psi_f\) coupling a small subset of variables. The classic “alt A of residue 50 co-occurs with alt A of residue 80” mmCIF limitation is exactly a two-variable \(2 \times 2\) factor.

In practice, depositions with genuinely factor-graph structure almost always have a small joint state space that can be flattened into an explicit Heterogeneity Regime 1 enumeration with joint populations – which is in spirit what Wankowicz and Fraser (2024)’s hierarchical compositional/conformational nesting does, using the Hierarchy scope to carry the coupling. The factor-graph view is worth keeping as an analytical frame (a factor coupling more than roughly five to ten variables is a reliable signal the problem has actually crossed into Heterogeneity Regime 3, where a continuous parameterization will be more compact), but it should not be implemented as a first-class deposition feature. Heterogeneity Regime 1 flattening plus hierarchical scope covers the real cases.

A solution NMR structure of a 100-residue protein is deposited as a bundle of 20 models. Each model is a full set of ~800 atom coordinates. The heterogeneity descriptor is a single whole-assembly-scoped Regime 1 variable with cardinality 20; the state axis is a discrete label. No correlations to encode, no couplings to other scopes. Materialization is Mode A (full enumeration) because composition is identical across states and the ensemble is small. This is the simplest real case of the architecture: one heterogeneity descriptor, one scope, one regime, one mode.
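The NMR bundle above reduces to one small descriptor; a sketch of what it might carry, with keys and nesting as illustrative assumptions rather than a committed schema:

```python
import numpy as np

N_state, N_atom = 20, 800
descriptor = {
    "regime": 1,                              # discrete ensemble
    "scope": ("assembly", 0),                 # whole-assembly scope
    "cardinality": N_state,
    "materialization": {
        "mode": "A",                          # full enumeration
        "coords": np.random.rand(N_state, N_atom, 3),
        # No refined weights in a typical NMR bundle: uniform population.
        "population": np.full(N_state, 1.0 / N_state),
    },
}
```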

A 1 µs explicit-solvent MD trajectory of a PDZ domain produces 500,000 frames stored as a Regime 2 \((500000, \sim 1000, 3)\) array with per-frame time and energy metadata. Downstream analysis clusters the trajectory into 5 metastable states using tICA + k-means, and fits a transition rate matrix between them. Both representations coexist in the deposition: the raw trajectory (Regime 2, whole-assembly scope, chunked along the frame axis), and a Regime 1 descriptor with five states plus a \(5 \times 5\) rate-matrix annotation, with provenance linking the reduced descriptor back to the source trajectory. A consumer interested in training data reads the trajectory; a consumer interested in kinetics reads the MSM. Neither is forced to convert between them.
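The Regime 2 → Regime 1 reduction above can be sketched at its simplest: count lag-1 transitions between per-frame state labels and row-normalize. This is a minimal stand-in for the tICA + k-means + rate-matrix pipeline named in the text, not a faithful MSM estimator:

```python
import numpy as np

# Per-frame metastable-state labels (output of some clustering; illustrative).
labels = np.array([0, 0, 1, 1, 2, 2, 1, 0])

n = labels.max() + 1
counts = np.zeros((n, n))
np.add.at(counts, (labels[:-1], labels[1:]), 1)   # lag-1 transition counts
T = counts / counts.sum(axis=1, keepdims=True)    # row-stochastic transition matrix
```

The resulting matrix is the \(n \times n\) annotation on the Regime 1 descriptor; provenance records the lag time and clustering used.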

A cryo-EM reconstruction of an elongating ribosome uses cryoDRGN to learn an 8-dimensional continuous latent space over ~100,000 particles. The heterogeneity descriptor is an assembly-scoped Regime 3 variable: the stored data is the \((100000, 8)\) latent-coordinate array and a decoder reference (a checkpoint file), plus a handful of named landmarks (classical, hybrid, post-translocation) picked out as cluster centers. Materialization is Mode C: coordinates at any latent point are computed on demand by running the decoder. Sub-scope descriptors can coexist – e.g., a separate Regime 1 descriptor on the tRNA entity instance capturing presence/absence at the A, P, and E sites, nested under the assembly-scope latent if the latent distribution correlates with occupancy.

2.2.2 Materialization

Given a heterogeneity description, how are actual coordinates produced? Materialization is a separate concern from the heterogeneity model because the same abstract description can be realized by different storage strategies depending on ensemble size, composition variability, and access pattern.

Three modes cover the space. They are mutually substitutable for a given heterogeneity descriptor – switching from one to another is a storage choice that does not change what the descriptor means.

Materialization Mode A: Full Enumeration. coords[N_state, N_atom, 3] with per-state metadata (population, free energy, latent vector). Appropriate for Heterogeneity Regime 1 and for factor-graph ensembles where the joint assignment space is small. Simple consumers – visualization tools, PDB depositions – only need to handle this mode.

Materialization Mode B: Delta Encoding. A base conformation plus sparse per-state or per-variable-assignment coordinate deltas. Appropriate when most states differ only locally – qFit-like sidechain ensembles where the backbone is unchanged, loop substates where the rest of the chain is unaffected. Far more compact than Materialization Mode A for large sparse ensembles.

base_coords[N_atom, 3]

delta_table:
  owner_type     # state | variable_value
  owner_id
  atom_id[K]
  delta_xyz[K, 3]
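Materializing one state from Mode B storage is then a sparse scatter over the base conformation; a minimal sketch with illustrative atom IDs and deltas:

```python
import numpy as np

base_coords = np.random.rand(5000, 3)

# Delta rows owned by one state: only the atoms that actually move.
atom_id   = np.array([812, 813, 814])
delta_xyz = np.array([[0.3, -0.1, 0.0],
                      [0.2,  0.4, 0.1],
                      [0.0,  0.1, 0.5]])

state_coords = base_coords.copy()
state_coords[atom_id] += delta_xyz   # every other atom stays at base
```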

Materialization Mode C: Generative / External Reference. Coordinates are produced by applying a stored operator to a compact representation. Subcases: rigid-body transforms (quaternion + translation per instance per state), normal mode coefficients (linear sum over a stored basis), and learned decoders (latent vector fed to a stored TorchScript or checkpoint). The descriptor stores the operator plus its inputs; coordinates are computed on demand. This is where the operator graph from the architecture intro lives: a decoder or basis is a stored morphism from latents to displacements, and it can itself be versioned, referenced, and shared across depositions.

Mode C also covers the obvious-but-underformalized case of referencing an external artifact. An MD trajectory stored as H5MD or Zarr, a cryoDRGN decoder checkpoint, a normal-mode basis file – any of these can be the heavy payload, with the structure file carrying a stable pointer (artifact identifier, producing method, version, content hash) and the coupling back to atom IDs. The structure does not need to swallow the trajectory; it needs to know where the trajectory lives and how to align its indices.

The three operator subcases – linear basis, neural decoder, trajectory lookup – differ in what is stored but share the same interface: given an input and a reference, produce displacements. Swapping a linear basis for a neural decoder does not change the descriptor’s identity, only its compression.
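That shared interface can be sketched as a protocol; the class and protocol names here are illustrative, and only the linear-basis subcase is implemented:

```python
from typing import Protocol
import numpy as np

class DisplacementOperator(Protocol):
    """Input + stored reference -> displacement field over the scoped atoms."""
    def __call__(self, z: np.ndarray) -> np.ndarray:  # returns (N_atom, 3)
        ...

class LinearBasis:
    """The linear subcase: displacements are a weighted sum over a stored basis."""
    def __init__(self, basis: np.ndarray):            # shape (k, N_atom, 3)
        self.basis = basis
    def __call__(self, z: np.ndarray) -> np.ndarray:
        return np.einsum("k,kad->ad", z, self.basis)

op: DisplacementOperator = LinearBasis(np.random.rand(4, 100, 3))
disp = op(np.array([1.0, 0.0, 0.0, 0.0]))             # shape (100, 3)
```

A neural-decoder or trajectory-lookup implementation would satisfy the same protocol, which is exactly the sense in which swapping them changes compression but not identity.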

The key design principle is that materialization mode is a storage choice, not a property of the physics. The format should not force Heterogeneity Regime 3 systems into Materialization Mode A just because Materialization Mode A is simpler to implement.

A 1.3 Å crystal structure processed by qFit has ~50 residues with two or three modeled sidechain conformations each, while the backbone and most sidechains are unchanged across states. Materialization Mode A would store ~N_state full copies of the ~5000-atom structure (\(5000 \cdot N_{\mathrm{state}}\) coordinates), most of which are identical. Mode B stores one reference structure plus a delta table listing only the atoms that differ per state: \(\sim 50 \cdot 3 \cdot 10 \approx 1500\) delta entries total, two orders of magnitude smaller. Both encode the same heterogeneity descriptor (a Regime 1 ensemble scoped per affected residue); only the storage differs.

A microsecond MD trajectory of a 200-residue domain is analyzed by principal component analysis into 10 dominant modes accounting for ~80% of the positional variance. The deposition stores two things under Mode C: the mode basis \((10, N_{\mathrm{atom}}, 3)\) array with per-mode eigenvalues, sufficient to reconstruct any PCA projection of the trajectory; and an external reference to the raw H5MD file (artifact URI, content hash, chunking metadata), so a consumer that wants original frames can still reach them. The mode basis is the operator; the PCA coefficients are its inputs; coordinates are generated on demand. Neither Mode A nor Mode B would compress a microsecond trajectory to the \((10 \cdot N_{\mathrm{atom}} \cdot 3)\)-number basis without losing the non-principal variance, so the mode basis plus the external reference together cover both the compact representation and the full fidelity one.

2.3 What is not in the core

The LLVM discipline – elaborated in Modularity examples – states that the IR should contain only what cannot be derived from anything more primitive, and that derived information should live in passes rather than in the core. This is worth stating explicitly for the structural biology case because the pressure to put derived information in the core is constant and has demonstrably destabilized mmCIF through category accretion.

Backbone dihedrals are a function of Cartesian coordinates. Contact maps are a function of coordinates and a cutoff. Spatial neighbor graphs are a function of coordinates, a cutoff, and a choice of distance metric. Spherical harmonic edge features are a function of neighbor graphs and a choice of \(l_{\max}\). AF2-style pair representations are a function of sequence and structure. Validation scores are a function of coordinates and a reference database. Domain assignments are a function of coordinates and a domain classification algorithm. All of these are passes over Hierarchy + Representation + Heterogeneity + Materialization. None of them belongs in the core.
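The first claim in the list is worth making concrete: a backbone dihedral is a pure function of four Cartesian points, so it lives in a pass, never in the core. A standard sketch of that function:

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points."""
    b0, b1, b2 = p0 - p1, p2 - p1, p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    v = b0 - np.dot(b0, b1) * b1    # component of b0 orthogonal to the axis
    w = b2 - np.dot(b2, b1) * b1
    return np.degrees(np.arctan2(np.dot(np.cross(b1, v), w), np.dot(v, w)))
```

Given the Hierarchy arrays and a coordinate state, a pass selects the N, CA, C atoms per residue and maps this function over them; nothing about it needs a slot in the format.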

The practical payoff: a format whose core is topology, coordinates, and heterogeneity variables has no reason to grow when a new experimental method produces a new derived quantity. The new quantity becomes an annotation overlay (see Query language and annotations) or a computed pass. The core is stable because it is genuinely irreducible, not because it was protected by committee inertia.

References

Bozovic, Olga, Claudio Zanobini, Adnan Gulzar, Brankica Jankovic, David Buhrke, Matthias Post, Steffen Wolf, Gerhard Stock, and Peter Hamm. 2020. “Real-Time Observation of Ligand-Induced Allosteric Transitions in a PDZ Domain.” Proceedings of the National Academy of Sciences. https://doi.org/10.1073/pnas.2012999117.
Chung, Hyungjin, Jeongsol Kim, Michael T. McCann, Marc L. Klasky, and Jong Chul Ye. 2024. “Diffusion Posterior Sampling for General Noisy Inverse Problems.” arXiv. https://doi.org/10.48550/arXiv.2209.14687.
Flowers, Jessica, Nathaniel Echols, Galen J. Correy, Priyadarshini Jaishankar, Takaya Togo, Adam R. Renslo, Henry van den Bedem, James S. Fraser, and Stephanie A. Wankowicz. 2025. “Expanding Automated Multiconformer Ligand Modeling to Macrocycles and Fragments.” eLife. https://doi.org/10.7554/eLife.103797.
Kuzmanic, Antonija, Navraj S. Pannu, and Bojan Zagrovic. 2014. “X-Ray Refinement Significantly Underestimates the Level of Microscopic Heterogeneity in Biomolecular Crystals.” Nature Communications. https://doi.org/10.1038/ncomms4220.
Lane, Thomas J. 2023. “Protein Structure Prediction Has Reached the Single-Structure Frontier.” Nature Methods. https://doi.org/10.1038/s41592-022-01760-4.
Levy, Axel, Eric R. Chan, Sara Fridovich-Keil, Frédéric Poitevin, Ellen D. Zhong, and Gordon Wetzstein. 2025. “Solving Inverse Problems in Protein Space Using Diffusion-Based Priors.” arXiv. https://doi.org/10.48550/arXiv.2406.04239.
Ploscariu, Nicoleta, Tom Burnley, Piet Gros, and Nicholas M. Pearce. 2021. “Improving Sampling of Crystallographic Disorder in Ensemble Refinement.” Acta Crystallographica Section D: Structural Biology. https://doi.org/10.1107/S2059798321010044.
Raghu, Rishwanth, Axel Levy, Gordon Wetzstein, and Ellen D. Zhong. 2025. “Multiscale Guidance of Protein Structure Prediction with Heterogeneous Cryo-EM Data.” arXiv. https://doi.org/10.48550/arXiv.2506.04490.
Rosenberg, Aviv A., Ailie Marx, and Alexander M. Bronstein. 2024. “A Dataset of Alternately Located Segments in Protein Crystal Structures.” Scientific Data. https://doi.org/10.1038/s41597-024-03595-4.
Wankowicz, Stephanie A., and James S. Fraser. 2024. “Comprehensive Encoding of Conformational and Compositional Protein Structural Ensembles Through the mmCIF Data Structure.” IUCrJ. https://doi.org/10.1107/S2052252524005098.
———. 2025. “Advances in Uncovering the Mechanisms of Macromolecular Conformational Entropy.” Nature Chemical Biology. https://doi.org/10.1038/s41589-025-01879-3.
Wankowicz, Stephanie A., Saulo H. de Oliveira, Daniel W. Hogan, Henry van den Bedem, and James S. Fraser. 2022. “Ligand Binding Remodels Protein Side-Chain Conformational Heterogeneity.” eLife. https://doi.org/10.7554/eLife.74114.