Rewriting the file “on the fly” is hard due to lack of modularity and other:
- trajectories for a single chain
- adding a few waters etc.
Background | The Unfurling Landscape of Structural Biology
What’s the problem that needs solving?
Data organization at the confluence of CryoET, protein and rna engineering and LLMs’ ability to manipulate types/ontologies.
- gargantuan-scale images at angstrom-resolution
- digital twins
- molecular force fields
- near-perfect polymer folding
- ligand/binding affinity prediction
Interfaces between
- MD
- EM/Crystallography
- atomic/crystallographic data encoding
- sequence
What our part in it might be:
The implicit proposition here is that by having a common substrate (what is it? a format? a framework? a type system? a library? an application?) the friction is reduced.
Does any kind of study benefit from this improved substrate?
Yes, i think compositional and conformational heterogeneity studies would be impossible without a framework under which to track the artifacts. By that, i mean studies of type “motion of molecule X in the presence of Y” or “conformational change of Z in the presence of W”, spliceosome .
- “modularity at biological hierarchy boundaries”
- Who is going to use it?
- Who is going to pay you for it?
- What is the job here that won’t need doing in 5-10 years?
- What is the job that will need doing doing in 5 years but doesn’t exist now?
entity_poly_seq can’t be mandatory since you can produce a mmCIF file without any polymeric molecular entity. You could write a mmCIF file with a single ion in it, no protein, no nucleic acid and it still would be a valid mmCif file while that file can’t have entity_poly_seq because… no polymer ;) I guess once you have a linear polymer in a mmCIF file, entity_poly_seq should be in there, too.__ That can’t be reflected by mmCIF dictionaries since they don’t know conditionals.___
Semi-useful Links
https://github.com/chanzuckerberg/cryoet-data-portal-backend/blob/main/schema/v1.1.0/metadata-docs/TiltSeries.md https://github.com/chanzuckerberg/cryoet-data-portal-backend/tree/main/schema/v1.1.0 https://ngff.openmicroscopy.org/about/index.html https://datascience.codata.org/articles/10.5334/dsj-2016-003
Format Specs
https://www.iucr.org/resources/cif https://datascience.codata.org/articles/10.5334/dsj-2016-003
Formal specs
https://www.iucr.org/resources/cif/cif2 https://www.iucr.org/resources/cif/spec/version1.1 https://legacy.ccp4.ac.uk/newsletters/newsletter37/13_harvest.html
Implementations and random docs
https://www.ccp4.ac.uk/html/cciflib.html http://comcifs.github.io/cif_api/index.html
History of
https://www.ccp4.ac.uk/html/mmcifformat.html
- Implementation of Data Harvesting in the CCP4 Suite : https://legacy.ccp4.ac.uk/newsletters/newsletter37/13_harvest.html
“HARVESTING” :https://www.ccp4.ac.uk/html/harvesting.html
Complaining
format spec is ugly, says biopython developer: https://stackoverflow.com/a/11686524/10697358 People routinely write their own parsers to reconcile these and curse the design choices and lack of documentation. Ex. here are guys from model-angelo ( GNN for model building that chenwei looked at) fixing this on their onw : https://github.com/3dem/model-angelo/issues/51 Here is Deep Mind implementing their own parser https://huggingface.co/spaces/simonduerr/ProteinMPNN/blob/f969e9cfb6f11ba299c7108aadff124e9cf38b1f/alphafold/alphafold/data/mmcif_parsing.py Here the gemmi guy trying to connect his work to chimerax and stumbling over this : https://mail.cgl.ucsf.edu/mailman/archives/list/chimerax-users@cgl.ucsf.edu/thread/XOO3G5MUOHLQBHSVF2NENWYNG3SOUF2O/ https://bioinformatics.stackexchange.com/questions/14210/pdb-residue-numbering
Old tune, but the eminence of AF/Rosetta/etc. recognized with the nobel (not saying it’s right or wrong) contrasted to the quality of the “dataset” they spring from, which is just what the PDB is increasingly is pretty staggering to me. They are stuck in 2000s. I’d bet an eye that if not for the institutional inertia and the OneDeposition workflow, a single company like Meta or HuggingFace or one of amazon’s 100 bio divisions could come up with a better solution in a span of year. By better i mean support for binary, direct integration with source em-maps, sane modern format, biological hierarchy integration and much MUCH richer data points: in terms of landmarks and shapes like we are doing or something more chemically minded a la Wilson/Polikanov or something dynamism-minded a la cryodrgn. It is crazy crazy crazy that people use 5-10mil$ machines from the cutting edge of science, thousands of dollars worth of cloud computer and i dont know how many biologist human hours to arrive at a crappy plaintext file almost entirely untethered to the fabric from which it came.
I think every day about what CZII is going to do the moment they are consistely getting 3-4A subtomograms and need to “deposit” 1000 ribosomes at once for some horizontal study of a mouse brain cancer cell or something. Whatever it is it’s not PDB.
- just shitty documentation, no automatic schema, ad-hoc updates, no explanation for datatypes
https://github.com/google-deepmind/alphafold/issues/252
https://github.com/wwpdb-dictionaries/mmcif_pdbx/issues
Ex. here are guys from model-angelo ( GNN for model building that chenwei looked at) fixing this on their onw : https://github.com/3dem/model-angelo/issues/51
Here is Deep Mind implementing their own parser https://huggingface.co/spaces/simonduerr/ProteinMPNN/blob/f969e9cfb6f11ba299c7108aadff124e9cf38b1f/alphafold/alphafold/data/mmcif_parsing.py
Here the gemmi guy trying to connect his work to chimerax and stumbling over this : https://mail.cgl.ucsf.edu/mailman/archives/list/chimerax-users@cgl.ucsf.edu/thread/XOO3G5MUOHLQBHSVF2NENWYNG3SOUF2O/
https://bioinformatics.stackexchange.com/questions/14210/pdb-residue-numbering
https://proteopedia.org/wiki/index.php/Unusual_sequence_numbering ““”
- 5afi.y is a trna sequence but is tagged with “F” at the end just because it carreis a phenylalanine. How crazy is that. No respect for the standardization of RNA alphabet.
https://docs.rosettacommons.org/docs/latest/development_documentation/tutorials/robust