Rewriting the file “on the fly” is hard due to the lack of modularity and other issues:
- trajectories for a single chain
- adding a few waters etc.
Background | The Unfurling Landscape of Structural Biology
What’s the problem that needs solving?
Data organization at the confluence of CryoET, protein and RNA engineering, and LLMs’ ability to manipulate types/ontologies.
- gargantuan-scale images at angstrom-resolution
- [structural] “digital twins”
- molecular force fields
- near-perfect polymer folding
- ligand/binding affinity prediction
Interfaces between
- MD
- EM/Crystallography
- atomic/crystallographic data encoding
- sequence
What our part in it might be:
The implicit proposition here is that by having a common substrate (what is it? a format? a framework? a type system? a library? an application?) the friction is reduced.
Does any kind of study benefit from this improved substrate?
Yes, I think compositional and conformational heterogeneity studies would be impossible without a framework under which to track the artifacts. By that, I mean studies of the type “motion of molecule X in the presence of Y” or “conformational change of Z in the presence of W”, e.g. for the spliceosome.
- “modularity at biological hierarchy boundaries”
- Who is going to use it?
- Who is going to pay you for it?
- What is the job here that won’t need doing in 5-10 years?
- What is the job that will need doing in 5 years but doesn’t exist now?
entity_poly_seq can’t be mandatory, since you can produce an mmCIF file without any polymeric molecular entity. You could write an mmCIF file with a single ion in it, no protein, no nucleic acid, and it would still be a valid mmCIF file, yet that file can’t have entity_poly_seq because… no polymer ;) I guess once you have a linear polymer in an mmCIF file, entity_poly_seq should be in there too. That can’t be reflected by the mmCIF dictionaries since they don’t know conditionals.
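If the dictionaries can’t express the condition, a check like this has to live in code. A minimal sketch of what that could look like, assuming the data block has already been parsed into a plain dict of category name to rows (that `Block` shape and the function name `check_entity_poly_seq` are made up for illustration, not any library’s API). The item names `entity.id`, `entity.type` and `entity_poly_seq.entity_id` are the standard mmCIF ones.

```python
from typing import Dict, List

# Hypothetical in-memory form of an mmCIF data block (an assumption of this sketch):
# category name -> list of rows, each row a dict of item name -> value.
Block = Dict[str, List[Dict[str, str]]]


def check_entity_poly_seq(block: Block) -> List[str]:
    """Return human-readable violations of the conditional rule (empty if none)."""
    polymer_ids = {
        row["id"] for row in block.get("entity", []) if row.get("type") == "polymer"
    }
    sequenced_ids = {row["entity_id"] for row in block.get("entity_poly_seq", [])}

    problems = []
    # The category is only required when a polymer entity exists.
    for entity_id in sorted(polymer_ids - sequenced_ids):
        problems.append(f"polymer entity {entity_id} has no entity_poly_seq rows")
    # Reverse direction: sequence rows must point at a polymer entity.
    for entity_id in sorted(sequenced_ids - polymer_ids):
        problems.append(f"entity_poly_seq refers to non-polymer entity {entity_id}")
    return problems


# A file with a single ion: fine without entity_poly_seq.
ion_only = {"entity": [{"id": "1", "type": "non-polymer"}]}

# A file with a polymer but no sequence rows: flagged.
missing_seq = {
    "entity": [{"id": "1", "type": "polymer"}, {"id": "2", "type": "water"}]
}

print(check_entity_poly_seq(ion_only))     # []
print(check_entity_poly_seq(missing_seq))  # ['polymer entity 1 has no entity_poly_seq rows']
```

The point is only that the rule is a two-line set difference once the categories are typed objects rather than loosely coupled text tables.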
Semi-useful Links
https://github.com/chanzuckerberg/cryoet-data-portal-backend/blob/main/schema/v1.1.0/metadata-docs/TiltSeries.md
https://github.com/chanzuckerberg/cryoet-data-portal-backend/tree/main/schema/v1.1.0
https://ngff.openmicroscopy.org/about/index.html
https://datascience.codata.org/articles/10.5334/dsj-2016-003
Format Specs
https://www.iucr.org/resources/cif
https://datascience.codata.org/articles/10.5334/dsj-2016-003
Formal specs
https://www.iucr.org/resources/cif/cif2
https://www.iucr.org/resources/cif/spec/version1.1
https://legacy.ccp4.ac.uk/newsletters/newsletter37/13_harvest.html
Implementations and random docs
https://www.ccp4.ac.uk/html/cciflib.html
http://comcifs.github.io/cif_api/index.html
History of
https://www.ccp4.ac.uk/html/mmcifformat.html
- Implementation of Data Harvesting in the CCP4 Suite: https://legacy.ccp4.ac.uk/newsletters/newsletter37/13_harvest.html
- “HARVESTING”: https://www.ccp4.ac.uk/html/harvesting.html
Complaining
Format spec is ugly, says a Biopython developer: https://stackoverflow.com/a/11686524/10697358
People routinely write their own parsers to reconcile these, and curse the design choices and the lack of documentation. Examples:
- The model-angelo developers (a GNN for model building that Chenwei looked at) fixing this on their own: https://github.com/3dem/model-angelo/issues/51
- DeepMind implementing their own parser: https://huggingface.co/spaces/simonduerr/ProteinMPNN/blob/f969e9cfb6f11ba299c7108aadff124e9cf38b1f/alphafold/alphafold/data/mmcif_parsing.py
- The gemmi author trying to connect his work to ChimeraX and stumbling over this: https://mail.cgl.ucsf.edu/mailman/archives/list/chimerax-users@cgl.ucsf.edu/thread/XOO3G5MUOHLQBHSVF2NENWYNG3SOUF2O/
- https://bioinformatics.stackexchange.com/questions/14210/pdb-residue-numbering
Old tune, but the eminence of AF/Rosetta/etc., recognized with the Nobel (not saying it’s right or wrong), contrasted with the quality of the “dataset” they spring from, which is increasingly just what the PDB is, is pretty staggering to me. They are stuck in the 2000s. I’d bet an eye that if not for the institutional inertia and the OneDep deposition workflow, a single company like Meta or HuggingFace or one of Amazon’s 100 bio divisions could come up with a better solution within a year. By better I mean support for binary, direct integration with source EM maps, a sane modern format, biological hierarchy integration and much, MUCH richer data points: in terms of landmarks and shapes like we are doing, or something more chemically minded à la Wilson/Polikanov, or something dynamism-minded à la cryoDRGN. It is crazy crazy crazy that people use $5-10M machines from the cutting edge of science, thousands of dollars’ worth of cloud compute and I don’t know how many biologist human-hours to arrive at a crappy plaintext file almost entirely untethered from the fabric from which it came.
I think every day about what CZII is going to do the moment they are consistently getting 3-4 Å subtomograms and need to “deposit” 1000 ribosomes at once for some horizontal study of a mouse brain cancer cell or something. Whatever it is, it’s not the PDB.
- just shitty documentation, no automatic schema, ad-hoc updates, no explanation for datatypes
https://github.com/google-deepmind/alphafold/issues/252
https://github.com/wwpdb-dictionaries/mmcif_pdbx/issues
https://proteopedia.org/wiki/index.php/Unusual_sequence_numbering
- 5afi.y is a tRNA sequence but is tagged with “F” at the end just because it carries a phenylalanine. How crazy is that. No respect for the standardization of the RNA alphabet.
https://docs.rosettacommons.org/docs/latest/development_documentation/tutorials/robust
Stephanie Wankowicz:
https://diffuse.science/posts/encoding/ –> https://pmc.ncbi.nlm.nih.gov/articles/PMC11220883/pdf/m-11-00494.pdf https://x.com/stephanie_mul/status/1955319804539150665
To go over:
- https://pubmed.ncbi.nlm.nih.gov/40586518/
- https://pmc.ncbi.nlm.nih.gov/articles/PMC12208665/pdf/elife-103797.pdf
- https://arxiv.org/pdf/2505.01919
Translation, Libration, Screw (TLS): https://onlinelibrary.wiley.com/doi/abs/10.1107/S0567740868001718
Diego del Alamo:
To give one example, if you receive a model of an antigen against which you want to design antibodies, and some of the surface loops look weird, you'd like to know if they are justified by experimental data or were modeled w/ Rosetta, AF3, etc, ...
.. because you don't want to design against some computational artifact. If you got the model from a human, that's fine, you can go ask. But structbio agents are ephemeral and can't be consulted a week later. So ideally that info gets stuffed in the PDB/CIF header. ...
... But then that raises the question of how much metadata you can pack in there. Are you putting in MSAs? Etc. Anyway my guess is that this will soon be scaled up such that human supervision becomes unwieldy, so an agent should be able to answer these kinds of questions
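One way to make the question in the quote above answerable by an agent (or anyone, a week later) is to carry per-region provenance with the model instead of free text in a header. A rough sketch, assuming nothing about how it would be serialized; the class `RegionProvenance`, its fields, and the example annotations are all hypothetical, for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RegionProvenance:
    """Hypothetical record: where the coordinates of a residue range came from."""
    chain: str
    start: int       # first residue number, inclusive
    end: int         # last residue number, inclusive
    source: str      # free-form label, e.g. "experimental", "Rosetta", "AF3"
    note: str = ""


def provenance_of(
    records: List[RegionProvenance], chain: str, resnum: int
) -> Optional[RegionProvenance]:
    """Return the first record covering the residue, or None if unannotated."""
    for rec in records:
        if rec.chain == chain and rec.start <= resnum <= rec.end:
            return rec
    return None


# Made-up annotations for a made-up antigen model.
records = [
    RegionProvenance("A", 1, 120, "experimental"),
    RegionProvenance("A", 121, 135, "AF3", note="surface loop, modeled"),
    RegionProvenance("A", 136, 300, "experimental"),
]

hit = provenance_of(records, "A", 128)
print(hit.source if hit else "unannotated")  # AF3
```

How much more than this (MSAs, model versions, etc.) belongs in the same record is exactly the open question raised in the quote.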
Gabriele Corso:
“Is anyone developing open-source MCP frameworks for llms to parse and understand biomolecular structures accurately? If so, please reach out to me, i would love to discuss, help and support!”
ML features:
- Native graph representations with k-nearest neighbor graphs, edge type annotations, and node features for GNNs like GearNet and DeepRank-GNN
- Multi-resolution tokenization (AlphaFold 3 uses flexible schemes: standard residues as single tokens, modified residues as atoms, ligands as individual atoms)
- Pre-computed MSA embeddings as feature channels
- Confidence metrics (pLDDT, PAE) as first-class data alongside coordinates
- Versioned model parameters linking predictions to the networks that generated them
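To make the first and fourth items concrete: a minimal sketch of a k-nearest-neighbor graph over residues, with pLDDT stored on the node next to the coordinate rather than in a side file. Everything here is an assumption for illustration: the `Node` layout, the brute-force O(n²) search (a real implementation would use a spatial index), the toy coordinates, and the pLDDT cutoff of 70.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

Coord = Tuple[float, float, float]


@dataclass
class Node:
    label: str     # residue identifier, e.g. "A:1"
    xyz: Coord     # one representative coordinate per residue, in angstroms
    plddt: float   # confidence kept alongside the coordinate


def knn_edges(nodes: List[Node], k: int) -> Dict[int, List[int]]:
    """Map each node index to the indices of its k nearest neighbours (brute force)."""
    edges: Dict[int, List[int]] = {}
    for i, a in enumerate(nodes):
        others = [j for j in range(len(nodes)) if j != i]
        # Sort the other nodes by Euclidean distance to node i.
        others.sort(key=lambda j: math.dist(a.xyz, nodes[j].xyz))
        edges[i] = others[:k]
    return edges


# Toy data, not taken from any real structure.
nodes = [
    Node("A:1", (0.0, 0.0, 0.0), 92.1),
    Node("A:2", (3.8, 0.0, 0.0), 88.4),
    Node("A:3", (7.0, 1.0, 0.0), 61.0),
    Node("A:4", (7.0, 4.5, 0.0), 45.3),
]

edges = knn_edges(nodes, k=2)
confident = [n.label for n in nodes if n.plddt >= 70.0]

print(edges)      # {0: [1, 2], 1: [2, 0], 2: [1, 3], 3: [2, 1]}
print(confident)  # ['A:1', 'A:2']
```

Edge type annotations (covalent, spatial, sequential) would be an extra field per edge; they are left out here to keep the sketch short.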
? Can this also be used as a “paging cache” for the QKV store of transformers etc., or for MSA embeddings, in a way that would facilitate modern models’ operation?
atomworks
https://rosettacommons.github.io/atomworks/latest/tutorial/index.html