Problems with mmCIF and the current ecosystem

Some pain points, in no particular order:

  • Parsing fragility. There is no reference implementation, so every parser behaves differently. Links below.

  • No random access. Chain B cannot be read without scanning past chain A. For structures of 200k+ atoms this is a real bottleneck: not yet a serious problem in practice for a structural or computational biologist, but an increasingly tight one for ML pipelines.

  • Residue numbering. Author vs. label vs. UniProt numbering, insertion codes, and no canonical mapping between them. See bioinformatics.stackexchange.com/questions/14210, Proteopedia, and the SAbR renumbering tool: github.com/delalamo/SAbR.

  • Ligand representation. 3-to-5-character Component IDs, bond orders kept in a separate dictionary, and no inline SMILES. This is insufficient for conformational sampling (ligands frequently need SMILES as a separate input).

  • Future scale. Current PDB deposition is one-structure-per-experiment. When cryo-ET subtomogram averaging routinely produces per-particle structures, or when high-throughput crystallography campaigns produce thousands of related structures, the deposition model will need to change. This is not yet a data storage problem for atomic models but an access, association, and transformation problem.
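The random-access point in the list above can be illustrated with a sidecar byte-offset index. This is a minimal sketch, assuming a simplified whitespace-delimited atom table in which the chain ID sits in a known column and each chain's rows are contiguous; real mmCIF needs a proper tokenizer for quoting and loop_ headers. The helper names `build_chain_index` and `read_chain` are hypothetical and do not belong to any library.

```python
import io

def build_chain_index(stream, chain_col):
    """Scan the table once and record the (start, end) byte span of each chain.

    Assumes the rows of one chain are contiguous.
    """
    index = {}
    offset = 0
    for line in iter(stream.readline, b""):
        fields = line.split()
        if fields:
            chain = fields[chain_col].decode("ascii")
            start, _ = index.get(chain, (offset, offset))
            index[chain] = (start, offset + len(line))
        offset += len(line)
    return index

def read_chain(stream, index, chain):
    """Seek straight to one chain's rows without reading the others."""
    start, end = index[chain]
    stream.seek(start)
    return stream.read(end - start).decode("ascii").splitlines()

# Toy table (made up): atom name, chain ID, residue number
data = b"N A 1\nCA A 1\nN B 1\nCA B 1\n"
stream = io.BytesIO(data)
index = build_chain_index(stream, chain_col=1)
chain_b = read_chain(stream, index, "B")
print(chain_b)
```

The index still costs one full scan to build, but it can be computed once and stored next to the file, which is the kind of access pattern the text format does not provide on its own.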

The single-structure assumption has to evolve

mmCIF assumes one experiment -> one structure. That is insufficient.

The alt_id mechanism (_atom_site.label_alt_id) is an acceptable tool for discrete conformational heterogeneity: individual atoms carry a single-character label (A, B, C…) and an occupancy, with occupancies summing to 1.0 within a residue. This means that each residue independently declares its own conformers; there is no way to state “alt A of residue 50 co-occurs with alt A of residue 80”; and occupancies are per-atom scalars with no associated uncertainty, free energy, or kinetic information. There is also no way to capture continuous motion, and the mechanism does not yet scale to larger systems.
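To make the co-occurrence gap concrete, here is a small sketch on made-up data (the residue numbers and occupancies are illustrative only). Because each residue declares its alternates independently, a reader can do no better than enumerate every combination; the file does not say which of them were actually populated together.

```python
from itertools import product

# Hypothetical altloc occupancies: {residue number: {alt label: occupancy}}
altlocs = {
    50: {"A": 0.6, "B": 0.4},
    80: {"A": 0.7, "B": 0.3},
}

def implied_states(altlocs):
    """Enumerate every combination of per-residue alternates.

    The weight is the product of occupancies. That is only valid if the
    residues are statistically independent, an assumption the format
    can neither confirm nor rule out.
    """
    residues = sorted(altlocs)
    states = []
    for labels in product(*(sorted(altlocs[r]) for r in residues)):
        weight = 1.0
        for r, label in zip(residues, labels):
            weight *= altlocs[r][label]
        states.append((dict(zip(residues, labels)), weight))
    return states

states = implied_states(altlocs)
for assignment, weight in states:
    print(assignment, round(weight, 2))
```

With n residues of two alternates each, this yields 2^n candidate states, and nothing in the file marks any of them as real or impossible.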

Text format, parsing fragility

The CIF dictionary itself lacks conditionals and has no automatic schema validation. entity_poly_seq can’t be mandatory for files containing polymers because the dictionary language cannot express “if a polymer entity exists, require this category.” Documentation is scattered across IUCr, CCP4, and wwPDB sites with inconsistent coverage.
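The conditional rule quoted above is easy to state in ordinary code, which is what downstream tools end up doing. A minimal sketch, assuming a parsed data block is available as a plain dict of category name to row dicts; this is a simplified stand-in, not any particular parser's API, and `check_polymer_requires_poly_seq` is a hypothetical name.

```python
def check_polymer_requires_poly_seq(categories):
    """Return error messages for the rule
    'if a polymer entity exists, entity_poly_seq must be present'.

    `categories` maps category name -> list of row dicts.
    """
    errors = []
    has_polymer = any(
        row.get("type") == "polymer" for row in categories.get("entity", [])
    )
    if has_polymer and not categories.get("entity_poly_seq"):
        errors.append("polymer entity present but entity_poly_seq is missing")
    return errors

# Made-up blocks for illustration
missing = {"entity": [{"id": "1", "type": "polymer"}]}
complete = {
    "entity": [{"id": "1", "type": "polymer"}],
    "entity_poly_seq": [{"entity_id": "1", "num": "1", "mon_id": "MET"}],
}
water_only = {"entity": [{"id": "1", "type": "water"}]}

print(check_polymer_requires_poly_seq(missing))
```

Because the rule lives in code rather than in the dictionary, each tool has to reimplement it, and nothing guarantees that two tools agree on the rule.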

Round-trip and edge-case damage

The legacy PDB <-> mmCIF round-trip silently mangles information. Chain IDs longer than one character get truncated. The 99,999-atom limit forces hex hacks. Residue numbers above 9999 break. Insertion codes vanish in some pipelines. Encoding and line-ending bugs cause half-loaded structures whose symptoms surface downstream in the viewer, not at parse time. gemmi [25] is the strictest modern mmCIF library and flags edge cases other parsers swallow.
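A pre-flight check can report these losses before a conversion instead of after. The sketch below tests only the three fixed-width limits named above (one-character chain ID, five-digit atom serial, four-digit residue number); the `Atom` record and the function name `legacy_pdb_problems` are hypothetical and simplified.

```python
from dataclasses import dataclass

@dataclass
class Atom:
    serial: int
    chain_id: str
    res_seq: int
    ins_code: str = ""

def legacy_pdb_problems(atoms):
    """Return the sorted set of reasons the atoms cannot be written
    losslessly to the fixed-width legacy PDB format."""
    problems = set()
    for a in atoms:
        if len(a.chain_id) > 1:
            problems.add("chain ID longer than one character")
        if a.serial > 99_999:
            problems.add("atom serial above 99,999")
        if a.res_seq > 9_999:
            problems.add("residue number above 9,999")
    return sorted(problems)

# Made-up atoms for illustration
small = [Atom(1, "A", 1), Atom(2, "A", 1)]
large = [Atom(1, "A", 1), Atom(100_000, "AA", 10_000)]

print(legacy_pdb_problems(large))
```

Failing loudly at this point is the behaviour the paragraph argues for: the alternative is truncation that only shows up later in a viewer.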

Content-level errors that survive deposition

Even files that parse cleanly carry content-level errors that wwPDB validation does not catch. PDB-REDO [29] systematically re-refines the X-ray archive and finds measurable improvements often enough to be a standing critique of the deposited model. Recurring categories:

  • Missing or extra atoms per residue; atom names that do not match the CCD; modified residues mis-typed.
  • Geometric outliers: bond length, angle, clash, cis/trans peptide, chirality. MolProbity [27] for protein geometry; Mogul for ligand geometry.
  • Connectivity gaps: missing LINK / _struct_conn records for disulfides, covalent ligands, glycosidic bonds, modified-residue linkages.
  • Composition errors: wrong CCD ligand identifier; protonation/tautomer guesses baked in silently; Mg2+ modelled as water at low resolution. CheckMyMetal and CheckMyBlob are dedicated checks.
  • Numerical inconsistencies: altloc occupancies not summing to 1.0; negative B-factors; coordinates outside the unit cell.

Each error class has a tool that catches it; the tool’s output does not travel with the deposition.
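The numerical category is the simplest to check mechanically. A minimal sketch on made-up atom records, assuming each atom is a dict with the keys shown in the docstring; the function name `numerical_issues` and the tolerance are illustrative choices. It flags candidates for review only, since a flagged sum is not automatically a modelling error.

```python
from collections import defaultdict

def numerical_issues(atoms, tol=0.01):
    """atoms: iterable of dicts with keys
    chain, resnum, name, altloc, occupancy, b_iso."""
    issues = []
    occ_sum = defaultdict(float)
    for a in atoms:
        if a["b_iso"] < 0:
            issues.append(("negative B-factor", a["chain"], a["resnum"], a["name"]))
        if a["altloc"]:
            # Sum occupancies over the alternates of the same atom.
            occ_sum[(a["chain"], a["resnum"], a["name"])] += a["occupancy"]
    for (chain, resnum, name), total in sorted(occ_sum.items()):
        if abs(total - 1.0) > tol:
            issues.append(("altloc occupancies sum to %.2f" % total, chain, resnum, name))
    return issues

# Made-up records for illustration
sample = [
    {"chain": "A", "resnum": 1, "name": "CA", "altloc": "A", "occupancy": 0.6, "b_iso": 20.0},
    {"chain": "A", "resnum": 1, "name": "CA", "altloc": "B", "occupancy": 0.3, "b_iso": 22.0},
    {"chain": "A", "resnum": 2, "name": "N", "altloc": "", "occupancy": 1.0, "b_iso": -1.0},
]
issues = numerical_issues(sample)
print(issues)
```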

Validation tooling is fragmented

Coverage is broad, but outputs do not aggregate.

  • MolProbity [27] – Ramachandran, rotamer, clash, geometry Z-scores.
  • Coot [26] – interactive density-and-geometry validation during refinement.
  • Phenix [28] – phenix.molprobity, phenix.validate_geometry, phenix.real_space_correlation.
  • WHAT_CHECK / WHATIF – broad stereochemistry and packing checks, the original of the genre.
  • Mogul – ligand bond/angle distributions from the CSD.
  • gemmi [25] – parser-level diagnostics.
  • PDB-REDO [29] – systematic re-refinement of the X-ray archive.
  • wwPDB Validation Report – the deposition-time bundle; PDF plus machine-readable XML.
  • EDS – experimental maps for visual re-inspection.
  • ChimeraX [30] – bundled validation panels and a Python API for custom checks.

Each tool runs separately and produces its own report. End users have to assemble the overall validation picture themselves, the same kind of integration burden that SIFTS / CATH / UniProt impose in the annotation layer. ML-based validators (AF2 pLDDT repurposed as a suspicion signal, learned rotamer libraries, ML-assisted ligand identification) are research-grade; the obstacle is less the modelling than the absence of an archive linking errors back to the structures they appeared in.
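What aggregation could look like is mostly a matter of agreeing on a common record. The sketch below is one hypothetical shape for it: the `Finding` fields, the tool labels, and the example findings are all made up for illustration, and no existing tool emits this structure.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    tool: str     # which validator reported it
    chain: str
    resnum: int
    kind: str     # e.g. "rotamer outlier", "clash"

def aggregate(findings):
    """Group findings from different tools by (chain, residue number)."""
    by_residue = defaultdict(list)
    for f in findings:
        by_residue[(f.chain, f.resnum)].append(f)
    return dict(by_residue)

def flagged_by_multiple(by_residue):
    """Residues reported by at least two distinct tools."""
    return sorted(
        key for key, fs in by_residue.items() if len({f.tool for f in fs}) >= 2
    )

# Made-up findings from two placeholder tools
findings = [
    Finding("tool_a", "A", 50, "rotamer outlier"),
    Finding("tool_b", "A", 50, "clash"),
    Finding("tool_a", "B", 7, "clash"),
]
by_residue = aggregate(findings)
print(flagged_by_multiple(by_residue))
```

Keying on chain and residue number runs straight into the residue-numbering problem described earlier, which is part of why this integration is left to the end user today.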

Implicit information and missing connections

Atomic models are largely untethered from the experimental data that produced them. Maps in EMDB, raw images in EMPIAR, models in PDB – three archives, three accession systems, cross-referenced but not structurally integrated. A tighter connection between atomic coordinates and the density/potential from which they were derived would enable validation workflows, re-refinement, and heterogeneity analysis that currently require manual assembly of data from multiple sources.

No ML-native access patterns

Every GNN-based structure model constructs a molecular graph from parsed coordinates. This graph construction is reimplemented in every codebase (Graphein, AF2 data pipeline, OpenFold, ProteinMPNN, ESM-IF) and is a major preprocessing bottleneck. Specific problems:

  • No random access to subsets of atoms or residues without reading the full file
  • No stored spatial index for neighbor lookup (every model rebuilds k-d trees / radius graphs)
  • No standard for internal coordinates (phi/psi/omega/chi angles) alongside Cartesian – every model recomputes these
  • Pair representations ((N_res, N_res, d) tensors) are the most expensive object in AF2-family models and are never stored/shared
  • No convention for packaging ensembles as training data for ensemble-aware models
  • Tokenization for heterogeneous systems (protein + ligand + nucleic acid + ions) is done ad hoc in AF3/Boltz/Chai with no shared vocabulary
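As an illustration of the neighbor-lookup work that each codebase repeats, here is a minimal radius-graph construction using a uniform grid (cell list) in pure Python. It is a sketch of the general technique, not the approach of any of the projects named above; the function name `radius_graph` and the toy coordinates are assumptions.

```python
from collections import defaultdict
from itertools import product
import math

def radius_graph(coords, cutoff):
    """Return sorted (i, j) pairs with i < j and distance <= cutoff.

    Points are binned into cubic cells of edge length `cutoff`, so each
    point is compared only against its own and the 26 adjacent cells.
    """
    cells = defaultdict(list)
    for i, (x, y, z) in enumerate(coords):
        key = (math.floor(x / cutoff), math.floor(y / cutoff), math.floor(z / cutoff))
        cells[key].append(i)
    edges = set()
    for (cx, cy, cz), members in cells.items():
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            # .get avoids creating empty cells while iterating
            for j in cells.get((cx + dx, cy + dy, cz + dz), ()):
                for i in members:
                    if i < j and math.dist(coords[i], coords[j]) <= cutoff:
                        edges.add((i, j))
    return sorted(edges)

# Toy coordinates (made up), cutoff in the same units
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (10.0, 0.0, 0.0), (10.0, 1.0, 0.0)]
edges = radius_graph(coords, cutoff=2.0)
print(edges)
```

The grid built in the first loop is exactly the kind of spatial index that could be stored alongside the coordinates instead of being rebuilt on every load.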

BinaryCIF (github.com/molstar/BinaryCIF) addresses the performance problem: binary-encoded mmCIF with column-wise compression, significantly smaller files and faster parsing than text mmCIF. Used by the PDB for Mol* visualization. But it’s a serialization optimization, not a data model change – it encodes the same categories and the same single-structure assumption.
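The idea behind column-wise compression can be shown with two of the simplest encodings, delta and run-length, applied to an integer column. This is a simplified illustration of the principle only, not the BinaryCIF wire format, which also specifies further encodings and binary framing; the function names are hypothetical.

```python
def delta_encode(values):
    """Store the first value, then successive differences."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def run_length_encode(values):
    """Collapse runs of equal values into [value, count] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def decode(runs):
    """Invert run-length, then delta, encoding."""
    out = []
    total = 0
    for value, count in runs:
        for _ in range(count):
            total += value
            out.append(total)
    return out

# A column of consecutive atom serial numbers compresses to a single run.
serials = list(range(1, 9))
encoded = run_length_encode(delta_encode(serials))
print(encoded)
```

Encodings like these exploit regularity within one column, which is why they shrink files without touching the categories or the data model.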

The IHMCIF dictionary [41] (the IHM working group extension) goes further on the data model. It adds discrete multi-state models, ordered ensembles (states connected by time or other ordering), multi-scale representations (atomic + coarse-grained beads in the same model), and spatial restraint descriptions. But states remain discrete and independent, there is no thermodynamic annotation (or any other flexible heterogeneity annotation), no per-state uncertainty, and no support for continuous distributions, and the text-file serialization doesn’t scale to large ensembles.