
Structural Bio Format
Some notes on mmCIF and what an ML-native successor might look like
Early draft, still coalescing. The Core (Layout, Forward operators, Evaluation model) is where the real content lives; the Primer and the Secondary chapters are rougher. Refs are still patchy and there’s plenty of Kool-Aid.
I’m trying to sketch here a kernel of idealized infrastructure for structural bio in 2026 and beyond: one that is not a meshwork of disparate databases and resources, is not bursting at the seams from the influx of new data types and data volume, is not constrained in its evolution by obsolete conventions, and accommodates the new computational fabric well.
This is not a proposal, a concrete plan, or an attack on anyone. It borrows motifs from the database, compiler, and ML literature, so for now it is not obvious where the “format” ends and the auxiliary software (parser, backend, storage, exchange platform) begins.
The map below orients the Core; every box links to its chapter.
Contents
The Core is Duino itself — read these in order:
- Layout — Track A, the ensemble representation. Four layers: Hierarchy and Groupings (the invariant core), Heterogeneity (the three regimes R1/R2/R3) and Materialization (the three storage modes A/B/C).
- Forward-operator infrastructure — Track B. The operator records that turn an ensemble into an experimental likelihood, and the registry/storage discipline around them.
- Evaluation model — how scope-local descriptors compose into rendered coordinates: the discrete-nesting and continuous-additive stacks, independence vs nesting vs provenance, sample-axis classification, Mode C operator interfaces, cross-backend integration.
Prerequisite:
- Primer — Boltzmann ensembles, marginals vs joints, qFit, forward models, and the Bayesian posterior-sampling picture the Core takes as given. Start here if the conformational-ensemble framing is new.
Secondary — equally part of the mission, but supporting the Core rather than constituting it:
- Problems with mmCIF and the current ecosystem – single-structure assumption, parsing fragility, missing experimental linkage, no ML-native access.
- Annotations and the artifact engine – query language, the (selector, body, provenance) annotation overlay, ML-native featurization passes, training-data shapes.
- Technology survey – Zarr/OME-Zarr/copick/TileDB/Arrow/DataFusion as storage primitives, the three heterogeneity regimes mapped onto existing tools (H5MD, cryoDRGN, qFit, IHM).
- Open questions
- ML-native data: access patterns and precomputed features – random access, neighbor lookup, equivariant feature caches, tokenization, dataloader chunking.
- Appendix: “black-box” data and model-side artifacts – weights, decoders, latents, model outputs, training metadata co-located with the structure.
- Appendix: modularity examples in software ecosystems – LLVM, DataFusion, AnnData, ECS, ONNX, OpenTelemetry, CF Conventions.