
Structural Bio Format
Some notes on mmCIF and what an ML-native successor might look like
This is a very early draft that’s still coalescing. The only worthwhile chapters are 2 and 3 for now. Figures are shit, refs are missing, lots of assumptions and Kool-Aid.
I’m trying to sketch here a kernel of idealized infrastructure for structural bio in 2026 and beyond: one that is not a meshwork of disparate databases and resources, is not bursting at the seams from the influx of new datatypes and data volume, is not constrained in its evolution by obsolete conventions, and accommodates the new computational fabric well.
This is not a proposal or a concrete plan or an attack on anyone. There are motifs from the database, compiler and ML literatures here, so for now it’s not exactly obvious where the “format” ends and auxiliary software (from the parser to the backend to storage to the exchange platform) begins.
TODO: add a survey of operations across pdbtbx, ccp4, coot, isolde, chimerax
Contents
- Problems with mmCIF and the current ecosystem – some scattered complaints about the mmCIF format, in no particular order of importance: single-structure assumption, parsing fragility, missing experimental linkage, no ML-native access, pain points.
- Layout – a stab at an alternative, “from the ground up” format architecture. Four layers are described: two for the core data structure and two for representing various heterogeneities.
- Evaluation model – realistic mixing scenarios, the two-stack composition principle (discrete-nesting and continuous-additive), independence vs nesting vs provenance, sample-axis classification (aligned/broadcast/mixed), Mode C operator interfaces, cross-backend integration, open design questions.
- Annotations and the artifact engine – the engine half: query language, the (selector, body, provenance) annotation overlay, ML-native featurization passes (graphs, equivariant features, pair representations, tokenization), training data shapes for distribution-predicting models.
- Technology survey – the survey half: Zarr/OME-Zarr/copick/TileDB/Arrow/DataFusion as storage primitives, with the three heterogeneity regimes mapped onto existing tools (H5MD, cryoDRGN, qFit, IHM).
- Open questions
- ML-native data: access patterns and precomputed features – random access, neighbor lookup, equivariant feature caches, tokenization, dataloader chunking; the discipline for caching derived data without polluting the core.
- Appendix: “black-box” data and model-side artifacts – weights, decoders, latents, model outputs, training metadata co-located with the structure.
- Appendix: modularity examples in software ecosystems – LLVM, DataFusion, AnnData, ECS, ONNX, OpenTelemetry, CF Conventions.
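To make the (selector, body, provenance) annotation overlay from the contents above a little more concrete, here is a minimal hypothetical sketch. All names here (`Annotation`, the selector syntax, the provenance keys) are illustrative assumptions, not anything the draft has pinned down:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass(frozen=True)
class Annotation:
    # Hypothetical shape of the (selector, body, provenance) triple:
    #   selector   – a query-language expression picking out a substructure
    #   body       – arbitrary payload attached to that selection
    #   provenance – which tool/model produced it, and from what inputs
    selector: str
    body: dict[str, Any]
    provenance: dict[str, str] = field(default_factory=dict)

# Example: a confidence-style annotation over a residue range, produced
# by a downstream model rather than present in the deposited structure.
ann = Annotation(
    selector="chain A and resi 10-20",
    body={"confidence": 0.93},
    provenance={"tool": "example-model", "version": "0.1"},
)
print(ann.selector)  # → chain A and resi 10-20
```

The point of keeping annotations as a separate overlay, rather than columns on the core tables, is that they can be added, versioned, and dropped without touching (or re-validating) the core data.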