Subject: re: common PDB file errors and detection tools

Steve,

Your question hits a recurring sore spot. There isn’t a single canonical list of common PDB errors, but the validation tooling that exists is reasonably comprehensive – the problem is that it’s scattered across tools whose outputs don’t aggregate. A few pointers grouped by error type:

File-syntax breakage (your whitespace/duplicate cases). gemmi (Marcin Wojdyr) is one of the stricter modern mmCIF parsers and surfaces issues that other parsers silently swallow. It is often the first place to look when a file half-loads.
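To make the duplicate case concrete, here is a minimal sketch in plain Python (standard library only, not gemmi) that flags atoms appearing more than once in legacy fixed-column PDB text. The function names are my own, and it assumes well-formed ATOM/HETATM column positions, so treat it as an illustration rather than a validator:

```python
from collections import Counter

ATOM_RECORDS = ("ATOM  ", "HETATM")


def atom_key(line):
    """Identity of an atom taken from fixed PDB columns, whitespace preserved."""
    line = line.ljust(80)  # pad short lines so the slices below are safe
    return (
        line[21],     # chain ID
        line[22:27],  # residue number plus insertion code
        line[17:20],  # residue name
        line[12:16],  # atom name, unstripped: " CA " is C-alpha, "CA  " is calcium
        line[16],     # alternate location indicator
    )


def find_duplicate_atoms(pdb_text):
    """Return the keys of atoms that occur more than once within one model."""
    counts = Counter()
    model = 0  # stays 0 for files without MODEL records
    for line in pdb_text.splitlines():
        if line.startswith("MODEL"):
            model += 1
        elif line[:6] in ATOM_RECORDS:
            counts[(model,) + atom_key(line)] += 1
    return [key for key, n in counts.items() if n > 1]
```

Keeping the atom-name field unstripped is deliberate: stripping whitespace is exactly how an alpha carbon and a calcium ion end up looking like the same atom.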

Geometric errors (bonds, angles, clashes, Ramachandran, rotamers). MolProbity is the Richardson lab’s geometry battery and the most widely cited. Phenix bundles it. Mogul (CCDC) checks ligand geometry against the Cambridge Structural Database (CSD) of small-molecule crystal structures.

Composition and connectivity (missing LINK records, ligand identity, Mg vs water). CheckMyMetal handles metal-site sanity; CheckMyBlob identifies ligands from electron-density blobs. The wwPDB Validation Report bundles many other checks at deposition time.

Systematic re-validation (your “files that survived deposition but are still wrong” case). PDB-REDO re-refines deposited X-ray structures and publishes the deltas – effectively a standing corpus of “errors caught after the fact”.

Comprehensive but old-school. WHAT IF / WHAT_CHECK (Gert Vriend) is the original of the genre and still useful for stereochemistry and packing checks.

Inside ChimeraX. Built-in validation panels plus a Python API. You can wrap any of the above as a tool or write residue-level checks directly. The check command and the clashes / contacts tools are the obvious starting points; for atom-count-per-residue against the CCD, a short Python script using chimerax.atomic is straightforward.
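As a rough idea of what the atom-count check looks like, here is a library-independent sketch. It does not call chimerax.atomic; the input is a hypothetical list of (label, residue name, atoms) tuples that you would fill from whatever structure objects you have. The hand-typed table of heavy-atom counts for the 20 standard amino acids is a stand-in for a real CCD lookup, and it excludes hydrogens and the terminal OXT:

```python
# Heavy-atom counts for in-chain standard residues (no hydrogens, no OXT).
# Hand-typed stand-in for a CCD lookup; an assumption of this sketch.
EXPECTED_HEAVY_ATOMS = {
    "GLY": 4, "ALA": 5, "SER": 6, "CYS": 6, "VAL": 7,
    "THR": 7, "PRO": 7, "LEU": 8, "ILE": 8, "ASN": 8,
    "ASP": 8, "MET": 8, "GLN": 9, "GLU": 9, "LYS": 9,
    "HIS": 10, "PHE": 11, "ARG": 11, "TYR": 12, "TRP": 14,
}


def flag_atom_count_mismatches(residues):
    """Compare observed heavy-atom counts per residue with the table.

    residues: iterable of (label, residue_name, atoms), where atoms is a
    list of (atom_name, element) pairs. Returns a list of
    (label, residue_name, expected, observed) for every mismatch.
    """
    problems = []
    for label, name, atoms in residues:
        expected = EXPECTED_HEAVY_ATOMS.get(name)
        if expected is None:
            continue  # not in the table; a real check would consult the CCD
        # A set of names collapses alternate conformers of the same atom.
        heavy = {
            atom_name
            for atom_name, element in atoms
            if element not in ("H", "D") and atom_name != "OXT"
        }
        if len(heavy) != expected:
            problems.append((label, name, expected, len(heavy)))
    return problems
```

A residue that comes back with fewer atoms than expected is usually a truncated side chain; more than expected tends to mean a misnamed or duplicated atom.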

For the AI-agent angle: the bigger obstacle is corpus, not models. Existing tools produce one-shot reports that don’t ride with the deposition, so there is no archive linking errors back to the structures they appeared in – which is exactly the labelled data a learned validator would need. Some recent work uses AlphaFold2 (AF2) pLDDT as a suspicion signal, and learned rotamer libraries are getting better, but both are still research-grade.

I have a draft blog post on adjacent format-level issues – happy to share once it’s less of a sketch.

Cheers, Artem