Parsing fragility. No reference implementation. Every parser is different. Links below.
No random access. Cannot read chain B without scanning chain A. For 200k+ atom structures this is a real bottleneck. Not yet a serious problem for an individual structural or computational biologist in practice, but already a tightening bottleneck for ML training pipelines.
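A toy illustration of the cost (the mini-file and parser below are invented for illustration, not a real mmCIF reader): because text mmCIF carries no index, pulling out chain B means scanning every line, chain A included.

```python
import io

# Minimal mmCIF-like atom_site loop. Chain A records precede chain B,
# so reaching chain B requires reading every chain A line first.
CIF = """\
loop_
_atom_site.group_PDB
_atom_site.auth_asym_id
_atom_site.Cartn_x
ATOM A 1.0
ATOM A 2.0
ATOM B 3.0
ATOM B 4.0
"""

def atoms_for_chain(text, chain):
    """Linear scan: no index exists, so cost is O(total lines)
    even when only one chain is wanted."""
    lines_scanned = 0
    hits = []
    for line in io.StringIO(text):
        lines_scanned += 1
        parts = line.split()
        if len(parts) == 3 and parts[0] == "ATOM" and parts[1] == chain:
            hits.append(float(parts[2]))
    return hits, lines_scanned

xs, scanned = atoms_for_chain(CIF, "B")
# Two chain-B atoms recovered, but all 8 lines were scanned to get them.
```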
Residue numbering. Author vs. label vs. UniProt numbering, insertion codes, no canonical mapping. See bioinformatics.stackexchange.com/questions/14210, Proteopedia, and the SAbR renumbering tool: github.com/delalamo/SAbR.
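To make the mismatch concrete, here is a toy three-way mapping; the residue triples below are invented for illustration (real mappings typically come from resources such as SIFTS), but the shape of the problem is accurate: one residue, three numbering schemes, plus insertion codes.

```python
# One residue can carry an author number + insertion code, a label_seq_id,
# and a UniProt position. The rows below are made up for illustration.
rows = [
    # (auth_seq_id, ins_code, label_seq_id, uniprot_pos)
    (100, "",  1, 24),
    (100, "A", 2, 25),   # insertion code: same author number, distinct residue
    (100, "B", 3, 26),
    (101, "",  4, 27),
]

auth_to_label = {(a, i): l for a, i, l, _ in rows}
label_to_uniprot = {l: u for _, _, l, u in rows}

def uniprot_of(auth_seq_id, ins_code=""):
    """Two hops: author numbering -> label numbering -> UniProt.
    Without an explicit table like this, no canonical mapping exists."""
    return label_to_uniprot[auth_to_label[(auth_seq_id, ins_code)]]
```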
Ligand representation. Three-to-five-character component IDs, bond orders kept in a separate dictionary, no inline SMILES; insufficient for conformational sampling (ligands frequently need SMILES as a separate input).
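A sketch of the join this forces on downstream code. The component ID "XYZ", its bond list, and the dictionary shape are all invented stand-ins, not real CCD content; the point is only that chemistry lives in a second file that must be fetched and joined with the coordinates.

```python
# Coordinates come from atom_site; bond orders come from a separate
# chemical-component lookup keyed by comp_id. Both fragments are made up.
coords = {"C1": (0.0, 0.0, 0.0), "O1": (1.2, 0.0, 0.0)}
ccd = {
    "XYZ": {"bonds": [("C1", "O1", "DOUB")], "smiles": None},  # no inline SMILES
}

def bonds_with_orders(comp_id, coords, dictionary):
    """Coordinates alone give geometry; recovering bond orders
    requires the external dictionary entry."""
    entry = dictionary.get(comp_id)
    if entry is None:
        raise KeyError(f"component {comp_id!r} not in dictionary")
    return [(a, b, order) for a, b, order in entry["bonds"]
            if a in coords and b in coords]
```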
Future scale. Current PDB deposition is one-structure-per-experiment. When cryo-ET subtomogram averaging routinely produces per-particle structures, or when high-throughput crystallography campaigns produce thousands of related structures, the deposition model will need to change. This is not yet a data storage problem for atomic models but an access, association, and transformation problem.
mmCIF models one experiment -> one structure. Insufficient for conformational heterogeneity and multi-state data.
The alt_id mechanism is a limited tool for conformational heterogeneity: individual atoms carry a single-character label (A, B, C…) and an occupancy (the alternate locations of a given atom summing to 1.0). This means each residue independently declares its own conformers; there is no way to state “alt A of residue 50 co-occurs with alt A of residue 80”; and occupancies are per-atom scalars with no associated uncertainty, free energy, or kinetic information. It also cannot capture continuous motion, and it does not scale to larger systems.
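A sketch of roughly all the consistency the format can express (atom records invented for illustration): you can check that alternates of the same atom sum to 1.0, but a correlation between alt A at residue 50 and alt A at residue 80 has nowhere to live.

```python
from collections import defaultdict

# Toy atom records: (residue, atom_name, alt_id, occupancy). Invented values.
atoms = [
    (50, "CA", "A", 0.6), (50, "CA", "B", 0.4),
    (80, "CA", "A", 0.7), (80, "CA", "B", 0.3),
]

def occupancy_sums(atoms):
    """Occupancies are per-atom scalars: the only checkable invariant is
    that alternates of the same atom sum to 1.0. Whether (50, alt A)
    co-occurs with (80, alt A) is simply not expressible."""
    sums = defaultdict(float)
    for res, name, alt, occ in atoms:
        sums[(res, name)] += occ
    return dict(sums)
```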
The CIF dictionary itself lacks conditionals and has no automatic schema validation. entity_poly_seq can’t be mandatory for files containing polymers because the dictionary language cannot express “if a polymer entity exists, require this category.” Documentation is scattered across IUCr, CCP4, and wwPDB sites with inconsistent coverage.
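Because the dictionary language cannot state this conditional, every validator hard-codes it. A minimal sketch of what such a hand-written rule looks like, assuming a parsed mmCIF block represented as a plain dict of categories (the function and data shape are hypothetical, not any real validator's API):

```python
# The rule "if a polymer entity exists, entity_poly_seq is required"
# cannot be written in the dictionary, so it lives in validator code.

def check_polymer_rule(file_categories):
    """Return violations for one hard-coded conditional rule.
    `file_categories` maps category names to lists of row dicts."""
    entities = file_categories.get("entity", [])
    has_polymer = any(e.get("type") == "polymer" for e in entities)
    if has_polymer and "entity_poly_seq" not in file_categories:
        return ["polymer entity present but entity_poly_seq missing"]
    return []
```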
Atomic models are largely untethered from the experimental data that produced them. Maps in EMDB, raw images in EMPIAR, models in PDB – three archives, three accession systems, cross-referenced but not structurally integrated. A tighter connection between atomic coordinates and the density/potential from which they were derived would enable validation workflows, re-refinement, and heterogeneity analysis that currently require manual assembly of data from multiple sources.
Every GNN-based structure model constructs a molecular graph from parsed coordinates. This graph construction is reimplemented in every codebase (Graphein, AF2 data pipeline, OpenFold, ProteinMPNN, ESM-IF) and is a major preprocessing bottleneck. Specific problems:
BinaryCIF (David Sehnal / Mol* team, github.com/molstar/BinaryCIF) addresses the performance problem: binary-encoded mmCIF with column-wise compression, significantly smaller files and faster parsing than text mmCIF. Used by the PDB for Mol* visualization. But it’s a serialization optimization, not a data model change – it encodes the same categories and the same single-structure assumption.
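A stripped-down sketch of why column-wise encoding wins (BinaryCIF composes several encodings, including delta and run-length; the two below are simplified illustrations, not the actual codec):

```python
def delta_encode(values):
    """Store the first value plus successive differences. Residue/atom id
    columns are nearly monotonic, so deltas are small and compress well."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def run_length_encode(values):
    """Collapse runs of repeats; a chain-id column like A A A B B
    becomes [("A", 3), ("B", 2)]."""
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1] = (v, out[-1][1] + 1)
        else:
            out.append((v, 1))
    return out
```

Row-oriented text mmCIF interleaves these columns, which is exactly what defeats such encodings; note this changes nothing about the data model itself.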
The IHMCIF dictionary (the IHM working group extension, github.com/ihmwg/IHMCIF). This adds: discrete multi-state models, ordered ensembles (states connected by time or another ordering), multi-scale representations (atomic + coarse-grained beads in the same model), and spatial restraint descriptions. But: states remain discrete and independent, there is no thermodynamic annotation (nor any other flexible heterogeneity annotation), no per-state uncertainty, no support for continuous distributions, and the text-file serialization doesn’t scale to large ensembles.
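A schematic of the expressible ceiling (the class and field names are invented for illustration, not the actual dictionary items): a multi-state record reduces to a list of discrete states with populations, and there is no slot for uncertainties, free energies, exchange rates, or a continuous conformational distribution.

```python
from dataclasses import dataclass, field

@dataclass
class State:
    """One discrete state: member models plus a scalar population."""
    model_ids: list
    population: float  # roughly what IHMCIF-style multi-state records carry

@dataclass
class Ensemble:
    """A bag of independent discrete states. Nothing here can express
    per-state uncertainty, thermodynamics, kinetics, or continuity."""
    states: list = field(default_factory=list)
```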