Toward a Modern Foundation for Structural Biology Data
Notes on mmCIF, its limits, and what an ML-native successor might look like
structural-biology
data-formats
mmcif
Working draft – exploratory, incomplete. A scaffold for collecting references, use cases, and tradeoffs.
This is a multi-part post. Use the sidebar to jump between sections.
Contents
- Problems with mmCIF and the current ecosystem – single-structure assumption, parsing fragility, missing experimental linkage, no ML-native access, pain points.
- Architecture – the four layers (hierarchy, representation, heterogeneity, materialization) split into core and ensemble halves, and what stays out of the core.
- Composition – how heterogeneity at multiple scopes composes; scope-local regimes and mixed materialization modes; worked examples (crystallography, protein+ligand, ribosome).
- Annotations and computed passes – Zarr/DataFusion/TileDB storage, heterogeneity regimes mapped to existing tools, query language, ML-native operations.
- Open questions
- Appendix: “black-box” data
- Appendix: modularity examples in software ecosystems – LLVM, DataFusion, AnnData, ECS, ONNX, OpenTelemetry, CF Conventions.