Structural Bio Format

Some notes on mmCIF and what an ML-native successor might look like

2026-04-15

Early draft, still coalescing. The Core — Layout, Forward operators, Evaluation model — is where the real content lives; the Primer and the Secondary chapters are rougher. Refs are still patchy and there’s plenty of koolaid.

I’m trying to sketch here a kernel of idealized infrastructure for structural bio in 2026 and beyond: one that is not a meshwork of disparate databases and resources, is not bursting at the seams from the influx of new datatypes and data volume, is not constrained in its evolution by obsolete conventions, and accommodates the new computational fabric well.

This is not a proposal, a concrete plan, or an attack on anyone. There are motifs from the database, compiler and ML literature here, so for now it’s not exactly obvious where the “format” ends and the auxiliary software (from the parser to the backend to storage to the exchange platform) begins.

The map below orients the Core; every box links to its chapter.

Contents

The Core is Duino itself — read these in order:

  • Layout — Track A, the ensemble representation. Four layers: Hierarchy and Groupings (the invariant core), Heterogeneity (the three regimes R1/R2/R3) and Materialization (the three storage modes A/B/C).
  • Forward-operator infrastructure — Track B. The operator records that turn an ensemble into an experimental likelihood, and the registry/storage discipline around them.
  • Evaluation model — how scope-local descriptors compose into rendered coordinates: the discrete-nesting and continuous-additive stacks, independence vs nesting vs provenance, sample-axis classification, Mode C operator interfaces, cross-backend integration.

Prerequisite:

  • Primer — Boltzmann ensembles, marginals vs joints, qFit, forward models, and the Bayesian posterior-sampling picture the Core takes as given. Start here if the conformational-ensemble framing is new.

Secondary — equally part of the mission, but supporting the Core rather than constituting it: