Toward a Modern Foundation for Structural Biology Data

Notes on mmCIF, its limits, and what an ML-native successor might look like

structural-biology
data-formats
mmcif
Published

April 15, 2026

Modified

April 19, 2026

Working draft – exploratory, incomplete. A scaffold for collecting references, use cases, and tradeoffs.

This is a multi-part post. Use the sidebar to jump between sections.

Contents

  1. Problems with mmCIF and the current ecosystem – single-structure assumption, parsing fragility, missing experimental linkage, no ML-native access, pain points.
  2. Architecture – the four layers (hierarchy, representation, heterogeneity, materialization) split into core and ensemble halves, and what stays out of the core.
  3. Composition – how heterogeneity at multiple scopes composes; scope-local regimes and mixed materialization modes; worked examples (crystallography, protein+ligand, ribosome).
  4. Annotations and computed passes – Zarr/DataFusion/TileDB storage, heterogeneity regimes mapped to existing tools, query language, ML-native operations.
  5. Open questions
  6. Appendix: “black-box” data
  7. Appendix: modularity examples in software ecosystems – LLVM, DataFusion, AnnData, ECS, ONNX, OpenTelemetry, CF Conventions.