Labeled dataset of X-ray protein ligand images in 3D point cloud and validated deep learning models

Background & Summary

Ligands are small molecules that bind to proteins, generally modifying their function. These molecules represent the active principles of known medicines (active pharmaceutical ingredient, API), or drug prototypes (e.g. natural products, fragments and other synthetic small molecules) in drug discovery pipelines. Currently, the 3D structure of protein-ligand complexes is mostly obtained by X-ray protein crystallography[1](https://www.nature.com/articles/s41597-025-06002-8#ref-CR1 “Papageorgiou, A. C., Poudel, N. & Mattsson, J. Protein Structure Analysis and Validation with X-Ray Crystallography. in 377–404. https://doi.org/10.1007/978-1-0716-0775-6_25

(2021).“). Experimental X-ray protein crystallographic data are freely available at the data centers of the global Protein Data Bank (PDB, https://www.wwpdb.org/)[2](https://www.nature.com/articles/s41597-025-06002-8#ref-CR2 “Berman, H., Henrick, K. & Nakamura, H. Announcing the worldwide Protein Data Bank. Nat. Struct. Mol. Biol. 10, 980–980, https://doi.org/10.1038/nsb1203-980

(2003).“),[3](https://www.nature.com/articles/s41597-025-06002-8#ref-CR3 “Berman, H. M. The Protein Data Bank. Nucleic Acids Res. 28, 235–242, https://doi.org/10.1093/nar/28.1.235

(2000).“), a worldwide archive of macromolecular structure data.

The electron density map is the primary result of an X-ray protein crystallography experiment[4](https://www.nature.com/articles/s41597-025-06002-8#ref-CR4 “Wlodawer, A., Minor, W., Dauter, Z. & Jaskolski, M. Protein crystallography for non-crystallographers, or how to get the best (but not more) from published macromolecular structures. FEBS J. 275, 1–21, https://doi.org/10.1111/j.1742-4658.2007.06178.x

(2008).“). It is a continuous function ρ(x,y,z) of intensity values in real space, measured in electrons per cubic angstrom (e Å⁻³). It represents the electron cloud around each atom of the protein and of its ligands in 3D space, which allows the atomic 3D structure of the protein-ligand complex to be deciphered[4](https://www.nature.com/articles/s41597-025-06002-8#ref-CR4 “Wlodawer, A., Minor, W., Dauter, Z. & Jaskolski, M. Protein crystallography for non-crystallographers, or how to get the best (but not more) from published macromolecular structures. FEBS J. 275, 1–21, https://doi.org/10.1111/j.1742-4658.2007.06178.x

(2008).“),[5](https://www.nature.com/articles/s41597-025-06002-8#ref-CR5 “Kleywegt, G. J. & Alwyn Jones, T. Model building and refinement practice. in Methods Enzymol. 208–230, https://doi.org/10.1016/S0076-6879(97)77013-7

(1997).“). The presence of ligands in X-ray protein structures can be detected in the calculated difference electron density map, the Fo-Fc map, which highlights the presence of additional molecules binding to the protein, such as the ligands[4](https://www.nature.com/articles/s41597-025-06002-8#ref-CR4 “Wlodawer, A., Minor, W., Dauter, Z. & Jaskolski, M. Protein crystallography for non-crystallographers, or how to get the best (but not more) from published macromolecular structures. FEBS J. 275, 1–21, https://doi.org/10.1111/j.1742-4658.2007.06178.x

(2008).“),[6](#ref-CR6 “Aguda, A. H. et al. Affinity Crystallography: A New Approach to Extracting High-Affinity Enzyme Inhibitors from Natural Extracts. J. Nat. Prod. 79, 1962–1970, https://doi.org/10.1021/acs.jnatprod.6b00215

(2016).“),[7](#ref-CR7 “Mooij, W. T. M. et al. Automated Protein–Ligand Crystallography for Structure-Based Drug Design. ChemMedChem 1, 827–838, https://doi.org/10.1002/cmdc.200600074

(2006).“),[8](#ref-CR8 “Shumilin, I. A. et al. Identification of Unknown Protein Function Using Metabolite Cocktail Screening. Structure 20, 1715–1725, https://doi.org/10.1016/j.str.2012.07.016

(2012).“),[9](https://www.nature.com/articles/s41597-025-06002-8#ref-CR9 “Meneghello, R. et al. High-throughput protein crystallography to empower natural product-based drug discovery. Acta Crystallogr. Sect. F Struct. Biol. Commun. 81, https://doi.org/10.1107/S2053230X25001542

(2025).“). The 3D image of a ligand appears as a high-intensity region of the Fo-Fc map, called a blob or density cluster. A ligand blob is displayed by applying a contour to the Fo-Fc map, usually on the sigma (σ) scale, which keeps only the points above a cutoff value (e.g. 3σ) and highlights the structural features of the ligand[4](https://www.nature.com/articles/s41597-025-06002-8#ref-CR4 “Wlodawer, A., Minor, W., Dauter, Z. & Jaskolski, M. Protein crystallography for non-crystallographers, or how to get the best (but not more) from published macromolecular structures. FEBS J. 275, 1–21, https://doi.org/10.1111/j.1742-4658.2007.06178.x

(2008).“),[7](#ref-CR7 “Mooij, W. T. M. et al. Automated Protein–Ligand Crystallography for Structure-Based Drug Design. ChemMedChem 1, 827–838, https://doi.org/10.1002/cmdc.200600074

(2006).“),[8](#ref-CR8 “Shumilin, I. A. et al. Identification of Unknown Protein Function Using Metabolite Cocktail Screening. Structure 20, 1715–1725, https://doi.org/10.1016/j.str.2012.07.016

(2012).“). The interpretation of the chemical structure of a ligand in the Fo-Fc map is a central task for understanding the functionality of the protein and guiding structure-based drug design (SBDD) in modern drug discovery pipelines.
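The sigma-scale contouring described above can be illustrated with a minimal sketch. It assumes that a map is given as a flat list of `(x, y, z, density)` points and that "nσ" means n standard deviations above the mean of those density values; the function name `sigma_contour` and the toy data are hypothetical and are not part of the LigPCDS code.

```python
from statistics import fmean, pstdev

def sigma_contour(points, cutoff=3.0):
    """Keep the points whose density is at least `cutoff` sigma above the mean.

    points: list of (x, y, z, density) tuples (assumed layout, for illustration).
    """
    values = [p[3] for p in points]
    mean = fmean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # a flat map has no blob at any contour level
    return [p for p in points if (p[3] - mean) / sigma >= cutoff]

# Toy map: 98 background points and 2 strong "blob" points.
toy_map = [(float(i), 0.0, 0.0, 0.0) for i in range(98)]
toy_map += [(98.0, 0.0, 0.0, 1.0), (99.0, 0.0, 0.0, 1.0)]
blob = sigma_contour(toy_map, cutoff=3.0)
print(len(blob))
```

Crystallographic software normally computes σ over the whole map or asymmetric unit, so real contour levels will differ from this toy calculation.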

Existing solutions for known ligand building are based on up to 200 known and common molecules from PDB. These solutions use mathematical and topological descriptors of Fo-Fc maps and suggest a list of molecules that best explain and fit into a blob[10](#ref-CR10 “Carolan, C. G. & Lamzin, V. S. Automated identification of crystallographic ligands using sparse-density representations. Acta Crystallogr. Sect. D Biol. Crystallogr. 70, 1844–1853, https://doi.org/10.1107/S1399004714008578

(2014).“),[11](#ref-CR11 “Beshnova, D. A., Pereira, J. & Lamzin, V. S. Estimation of the protein–ligand interaction energy for model building and validation. Acta Crystallogr. Sect. D Struct. Biol. 73, 195–202, https://doi.org/10.1107/S2059798317003400

(2017).“),[12](#ref-CR12 “Kowiel, M. et al. Automatic recognition of ligands in electron density by machine learning. Bioinformatics 35, 452–461, https://doi.org/10.1093/bioinformatics/bty626

(2019).“),[13](#ref-CR13 “Terwilliger, T. C., Adams, P. D., Moriarty, N. W. & Cohn, J. D. Ligand identification using electron-density map correlations. Acta Crystallogr. Sect. D Biol. Crystallogr. 63, 101–107, https://doi.org/10.1107/S0907444906046233

(2007).“),[14](https://www.nature.com/articles/s41597-025-06002-8#ref-CR14 “Zwart, P. H., Langer, G. G. & Lamzin, V. S. Modelling bound ligands in protein crystal structures. Acta Crystallogr. Sect. D Biol. Crystallogr. 60, 2230–2239, https://doi.org/10.1107/S0907444904012995

(2004).“). When such approaches are used to identify known ligands, the accuracies range from 32% to 72.5% for the best prediction[12](https://www.nature.com/articles/s41597-025-06002-8#ref-CR12 “Kowiel, M. et al. Automatic recognition of ligands in electron density by machine learning. Bioinformatics 35, 452–461, https://doi.org/10.1093/bioinformatics/bty626

(2019).“). This indicates that the ligand building problem still lacks accurate solutions, even for known ligand building, and that there is potential for improvement. A very recent development was reported by Karolczak and coworkers[15](https://www.nature.com/articles/s41597-025-06002-8#ref-CR15 “Karolczak, J. et al. Ligand Identification in CryoEM and X-ray Maps Using Deep Learning. bioRxiv https://doi.org/10.1101/2024.08.27.610022

(2024).“) using deep learning and point clouds, with average accuracies reaching 67.2% and 93.6% in the top-10 known ligand suggestions. However, their approach still relies on the full set of known protein ligand structures for both training and searching. When the ligand is unknown, as in the case of novel natural products[6](https://www.nature.com/articles/s41597-025-06002-8#ref-CR6 “Aguda, A. H. et al. Affinity Crystallography: A New Approach to Extracting High-Affinity Enzyme Inhibitors from Natural Extracts. J. Nat. Prod. 79, 1962–1970, https://doi.org/10.1021/acs.jnatprod.6b00215

(2016).“),[9](https://www.nature.com/articles/s41597-025-06002-8#ref-CR9 “Meneghello, R. et al. High-throughput protein crystallography to empower natural product-based drug discovery. Acta Crystallogr. Sect. F Struct. Biol. Commun. 81, https://doi.org/10.1107/S2053230X25001542

(2025).“),[16](https://www.nature.com/articles/s41597-025-06002-8#ref-CR16 “Bazzano, C. F. et al. NP 3 MS Workflow: An Open-Source Software System to Empower Natural Product-Based Drug Discovery Using Untargeted Metabolomics. Anal. Chem. 96, 7460–7469, https://doi.org/10.1021/acs.analchem.3c05829

(2024).“) or other molecules with as yet unknown chemical structures in X-ray protein databases, the current automated solutions are not accurate and cannot provide reliable support for the crystallographer’s interpretation of the ligand blob and of the chemical structure of the new ligand.

PDB[2](https://www.nature.com/articles/s41597-025-06002-8#ref-CR2 “Berman, H., Henrick, K. & Nakamura, H. Announcing the worldwide Protein Data Bank. Nat. Struct. Mol. Biol. 10, 980–980, https://doi.org/10.1038/nsb1203-980

(2003).“),[3](https://www.nature.com/articles/s41597-025-06002-8#ref-CR3 “Berman, H. M. The Protein Data Bank. Nucleic Acids Res. 28, 235–242, https://doi.org/10.1093/nar/28.1.235

(2000).“) is a leading global resource for experimental data and is growing rapidly, with around 10,000 deposits per year[17](https://www.nature.com/articles/s41597-025-06002-8#ref-CR17 “Vollmar, M. & Evans, G. Machine learning applications in macromolecular X-ray crystallography. Crystallogr. Rev. 27, 54–101, https://doi.org/10.1080/0889311X.2021.1982914

(2021).“). However, mining PDB data is difficult, mainly due to human errors in ligand interpretation and local low-quality blobs[5](https://www.nature.com/articles/s41597-025-06002-8#ref-CR5 “Kleywegt, G. J. & Alwyn Jones, T. Model building and refinement practice. in Methods Enzymol. 208–230, https://doi.org/10.1016/S0076-6879(97)77013-7

(1997).“),[11](https://www.nature.com/articles/s41597-025-06002-8#ref-CR11 “Beshnova, D. A., Pereira, J. & Lamzin, V. S. Estimation of the protein–ligand interaction energy for model building and validation. Acta Crystallogr. Sect. D Struct. Biol. 73, 195–202, https://doi.org/10.1107/S2059798317003400

(2017).“),[18](https://www.nature.com/articles/s41597-025-06002-8#ref-CR18 “Pozharski, E., Weichenberger, C. X. & Rupp, B. Techniques, tools and best practices for ligand electron-density analysis and results from their application to deposited crystal structures. Acta Crystallogr. Sect. D Biol. Crystallogr. 69, 150–167, https://doi.org/10.1107/S0907444912044423

(2013).“),[19](https://www.nature.com/articles/s41597-025-06002-8#ref-CR19 “Dauter, Z., Wlodawer, A., Minor, W., Jaskolski, M. & Rupp, B. Avoidable errors in deposited macromolecular structures: an impediment to efficient data mining. IUCrJ 1, 179–193, https://doi.org/10.1107/S2052252514005442

(2014).“). In addition, retrieving and manipulating ligand data in 3D grid-like format requires specific knowledge and crystallographic packages capable of reading crystallographic data. This may explain the lack of deep learning (DL)[17](https://www.nature.com/articles/s41597-025-06002-8#ref-CR17 “Vollmar, M. & Evans, G. Machine learning applications in macromolecular X-ray crystallography. Crystallogr. Rev. 27, 54–101, https://doi.org/10.1080/0889311X.2021.1982914

(2021).“),[20](https://www.nature.com/articles/s41597-025-06002-8#ref-CR20 “LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444, https://doi.org/10.1038/nature14539

(2015).“) approaches for predicting ligands from their blobs. DL on 3D point clouds has shown remarkable results in other fields[21](#ref-CR21 “Choy, C., Gwak, J. & Savarese, S. 4D Spatio-Temporal ConvNets: Minkowski Convolutional Neural Networks. in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 3070–3079, https://doi.org/10.1109/CVPR.2019.00319

(IEEE, 2019).“),[22](#ref-CR22 “Guo, Y. et al. Deep Learning for 3D Point Clouds: A Survey. arXiv https://doi.org/10.48550/arXiv.1912.12033

(2019).“),[23](#ref-CR23 “Singh, R. D., Mittal, A. & Bhatia, R. K. 3D convolutional neural network for object recognition: a review. Multimed. Tools Appl. 78, 15951–15995, https://doi.org/10.1007/s11042-018-6912-6

(2019).“),[24](https://www.nature.com/articles/s41597-025-06002-8#ref-CR24 “Ahmed, E. et al. A survey on Deep Learning Advances on Different 3D Data Representations. arXiv https://doi.org/10.48550/arXiv.1808.01462

(2018).“), and has started to be used for ligand interpretation in X-ray protein crystallography[15](https://www.nature.com/articles/s41597-025-06002-8#ref-CR15 “Karolczak, J. et al. Ligand Identification in CryoEM and X-ray Maps Using Deep Learning. bioRxiv https://doi.org/10.1101/2024.08.27.610022

(2024).“). However, no chemical labeling of ligand blobs is currently available, and none has been validated (i.e., shown to be learnable by a supervised machine learning (ML) model) for reconstituting the chemical structures of novel protein ligands.

To fill these gaps, we have created and validated the first chemically labeled dataset of experimental 3D images of protein ligands in 3D point cloud representations, named LigPCDS, with 244,226 ligand entries from PDB. The workflow for obtaining LigPCDS and its validation through successfully trained DL models is presented in Fig. 1.

Fig. 1

Workflow used to obtain LigPCDS, the training of the deep learning models and the validated labeling approaches. (a) LigPCDS creation schema. In step 1, a list of PDB entries with resolutions ranging from 1.5 to 2.2 Å was retrieved from RCSB (.pdb and .mtz), and their free, organic ligands were downloaded, filtered and validated (.sdf). This resulted in the list of valid ligands with 244,226 entries. In step 2, Dimple v2.6.1 was used to refine the PDB entries and calculate their Fo-Fc maps. Next, for each ligand, a grid size was defined to cover its entire blob. Each ligand’s grid was interpolated from its Fo-Fc map to a 3D point cloud and processed to create the final 3D representations of the ligands. In step 3, vocabularies of chemical classes were created and used for labeling the structure of the valid ligands atom-wise. They were based on the chemical atoms themselves and on cyclic substructures of the ligands. Finally, in step 4 the labels of the structure of the ligands were extrapolated pointwise, using an atomic sphere model, for labeling the final 3D representations of the ligands, resulting in LigPCDS. (b) General schema used to train and obtain the validated DL models. A stratified training dataset was created from LigPCDS with n = 78,902 ligand entries, separated into k = 13 similar groups (step 5). The LigPCDS entries of this dataset were used to train DL models in semantic segmentation tasks using the Minkowski Engine[47](https://www.nature.com/articles/s41597-025-06002-8#ref-CR47 “Choy, C., Gwak, J. & Savarese, S. 4D Spatio-Temporal ConvNets: Minkowski Convolutional Neural Networks. arXiv https://doi.org/10.48550/arXiv.1904.08755 (2019).“) architecture and networks based on the 3D U-Net[52]. Cycles of training, evaluation and changes continued until DL models with good performance were obtained and validated (step 6). (c) Four of the proposed labeling approaches were validated and are illustrated with ligand FUL from PDB (entry 4Z4T). The average cross-validation performance of the best DL model trained with each vocabulary is presented by the mIoU and mF1 metrics, with the corresponding SEM and confidence interval (CI). k = 1 was used in the tests except for the model trained with the vocabulary of “Generic Atoms and Cycles C347CA56”, which used the average k-fold value and k = 13. Image “Machine Learning” is by Srinivas Agra and image “intelligence” is by Gacem Tachfin from the Noun Project (CC BY 3.0).

For LigPCDS construction, a list of valid ligands from the Research Collaboratory for Structural Bioinformatics Protein Data Bank[3](https://www.nature.com/articles/s41597-025-06002-8#ref-CR3 “Berman, H. M. The Protein Data Bank. Nucleic Acids Res. 28, 235–242, https://doi.org/10.1093/nar/28.1.235

(2000).“) (RCSB PDB, the US data center at https://www.rcsb.org/) was filtered and downloaded with experimental data. The entries were refined with Dimple v2.6.1 (https://ccp4.github.io/dimple/)[25](https://www.nature.com/articles/s41597-025-06002-8#ref-CR25 “Winn, M. D. et al. Overview of the CCP4 suite and current developments. Acta Crystallogr. Sect. D Biol. Crystallogr. 67, 235–242, https://doi.org/10.1107/S0907444910045749

(2011).“) in a standardized procedure, without any added ligand (no heteroatoms), intended to normalize data quality and make the ligand blob evident in the Fo-Fc maps. The 3D images of the ligands were derived from their Fo-Fc maps with Gemmi[26](https://www.nature.com/articles/s41597-025-06002-8#ref-CR26 “Wojdyr, M. GEMMI: A library for structural biology. J. Open Source Softw. 7, 4200, https://doi.org/10.21105/joss.04200

(2022).“) v0.5.8, based on the atomic positions of the ligand entries. Gemmi v0.5.8 was further used to create their representations in 3D point clouds, with an adequate scale, background removal, mask and contours. Finally, the ligand 3D point clouds were labeled pointwise using an atomic sphere model and purpose-designed chemical vocabularies. Different labeling approaches were proposed as vocabularies based on the atoms themselves and their cyclic structural arrangements, representing building blocks from which the entire chemical structure of the ligand can be constructed (Fig. 1a).
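The pointwise labeling with an atomic sphere model can be pictured with the following minimal sketch. It assumes that each point receives the label of the nearest atom whose sphere contains it, and the background class otherwise. The single radius of 1.5 Å, the label names and the function `label_points` are illustrative assumptions; the radii actually used for LigPCDS are not given in this section.

```python
import math

BACKGROUND = "background"

def label_points(points, atoms, radius=1.5):
    """Assign each point the label of the nearest atom within `radius` (Å).

    points: list of (x, y, z) coordinates of the point cloud.
    atoms:  list of ((x, y, z), label) pairs; labels come from a vocabulary.
    Points outside every atomic sphere are labeled as background.
    """
    labels = []
    for point in points:
        best_label, best_dist = BACKGROUND, radius
        for center, label in atoms:
            dist = math.dist(point, center)
            if dist <= best_dist:
                best_label, best_dist = label, dist
        labels.append(best_label)
    return labels

# Two atoms labeled with a hypothetical "atoms and cycles" style vocabulary.
atoms = [((0.0, 0.0, 0.0), "cycle"), ((3.0, 0.0, 0.0), "atom")]
points = [(0.5, 0.0, 0.0), (2.8, 0.0, 0.0), (10.0, 0.0, 0.0)]
print(label_points(points, atoms))
```

Swapping the label strings is enough to mimic a different vocabulary, which is the sense in which the same point cloud can carry several labelings.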

For validation of the labeling approach, a stratified training dataset (n = 78,902) from LigPCDS was used to train DL models for the semantic segmentation of the ligand’s 3D representation (Fig. 1b). Four vocabularies led to DL models with good performance (Fig. 1c): (i) the “Vocabulary of the Ligand Region”, composed of generic atoms of any type; (ii) the “Vocabulary of Generic Atoms and Cycles”, composed of generic atoms outside cyclic arrangements and generic atoms in cyclic structures (called here cycles); (iii) the “Vocabulary of Generic Atoms and Cycles C347CA56”, composed of generic atoms outside cyclic arrangements, generic atoms in non-aromatic cyclic structures of size 3 to 7 and in aromatic cyclic structures of sizes 5 and 6; and (iv) the “Vocabulary of Atom Symbols with Groups”, composed of the atom symbols with groupings. All vocabularies also contain the background class (regions in the images with no ligand atom), which is an important category for separating the background noise from the ligand itself. The mean accuracy of the validated models in their cross-validation ranged from 49.7% (SEM = 0.4, CI = [−19.4, 20.2]) to 77.4% (SEM = 0.2, CI = [−11.7, 12.1]) in terms of the mean Intersection over Union (mIoU) metric[27], and from 62.4% (SEM = 0.4, CI = [−18.8, 19.7]) to 87.0% (SEM = 0.2, CI = [−8.4, 8.8]) in mean F1-score (mF1)[28](https://www.nature.com/articles/s41597-025-06002-8#ref-CR28 “Dice, L. R. Measures of the Amount of Ecologic Association Between Species. Ecology 26, 297–302, https://doi.org/10.2307/1932409

(1945).“). The accuracy of the validated models reinforces the reliability of the methods used to construct LigPCDS and suggests its future use by other machine learning tasks.
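For readers unfamiliar with the two metrics, the sketch below computes per-class IoU and F1 from pointwise labels and averages them over classes. The unweighted averaging over the classes present in one example is an assumption made for illustration; this section does not specify how the authors aggregate over classes, entries and folds. Function names and labels are hypothetical.

```python
def iou_and_f1(true_labels, pred_labels, cls):
    """Return (IoU, F1) of one class from two equally long label sequences."""
    pairs = list(zip(true_labels, pred_labels))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    if tp + fp + fn == 0:
        return float("nan"), float("nan")  # class absent from both sequences
    return tp / (tp + fp + fn), 2 * tp / (2 * tp + fp + fn)

def mean_iou_and_f1(true_labels, pred_labels):
    """Unweighted mean of the per-class IoU and F1 (mIoU, mF1)."""
    classes = sorted(set(true_labels) | set(pred_labels))
    scores = [iou_and_f1(true_labels, pred_labels, c) for c in classes]
    miou = sum(s[0] for s in scores) / len(scores)
    mf1 = sum(s[1] for s in scores) / len(scores)
    return miou, mf1

truth = ["bg", "bg", "atom", "atom", "cycle", "cycle"]
preds = ["bg", "atom", "atom", "atom", "cycle", "bg"]
miou, mf1 = mean_iou_and_f1(truth, preds)
print(f"mIoU = {miou:.3f}, mF1 = {mf1:.3f}")
```

For a single class the two scores are linked by F1 = 2·IoU / (1 + IoU), which is why F1 is never lower than IoU for the same prediction.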

The robustness, size and labeling approaches of LigPCDS, together with the validated DL models, expand the possibilities for interpreting unknown protein ligands and open avenues for other DL applications based on protein ligands (e.g. in basic biology, natural product and drug discovery). As a first application using the validated DL models from LigPCDS, we have developed the NP³ Blob Label (https://github.com/danielatrivella/np3_ligand/tree/master/np3_blob_label), an open source application designed to assist unknown ligand building in high performance drug discovery pipelines, including those focused on novel natural products (to be published). LigPCDS may also be used to address the problem of known ligand building, by using the ligand codes (unique structures) as labels for training DL classification tasks.

Methods

The LigPCDS dataset creation followed six major steps (Fig. 1), which are summarized below and explained in detail in the next subsections.

1. Creation of a list of valid ligands from RCSB PDB.

2. Creation of the representations of the ligand 3D image in 3D point clouds.

3. Creation of chemical vocabularies and ligand structure labeling.

4. Labeling ligand 3D point clouds.

5. Creation of a stratified training dataset from LigPCDS.

6. Training, optimization and validation of DL models.

The validation steps (steps 5 and 6 in Fig. 1b,c) of LigPCDS methodology are presented in the Technical Validation section.

Hardware

LigPCDS creation and DL model training were run on a computer with the following configuration: AMD Ryzen 9 3950X CPU (16 cores, 32 threads), 128 GB RAM and 2x GeForce RTX 2080 SUPER GPUs with 8 GB of dedicated RAM each (hardware A). The exceptions were specific DL model analyses, which are pointed out in the text and used hardware B, a cluster with the following configuration: AMD EPYC 7742 CPU with 64 cores and 80 threads, 384 GB RAM and 4 NVIDIA HGX A100 GPUs with 40 GB of dedicated RAM each.

List of valid ligands

To obtain a list of ligands (step 1, Fig. 1), the advanced search tool of the RCSB PDB (https://www.rcsb.org/) was initially used, in December 2019, to retrieve all entries with resolution between 1.5 Å and 2.2 Å. The chosen resolution range matches the most frequent resolution values found in the PDB (Supplementary Figure 1) and those typically obtained in structural biology and drug discovery projects. The following additional selection criteria were applied to the retrieved RCSB PDB files: presence of free (non-covalent) ligands, availability of experimental data (entries with electron density maps also deposited), data originating from X-ray experiments with proteins, and deposition at PDB after January 2008 (when more stringent validation metrics were in place in PDB). For the free ligands, we selected organic molecules formed by atoms of carbon, oxygen, nitrogen, phosphorus, sulfur, iodine, fluorine, chlorine, bromine or selenium; hydrogen atoms were omitted because they are poorly detected by X-ray crystallography in the chosen resolution range. At this stage, restricting the resolution range reduces the data variation caused by large differences in resolution during LigPCDS construction, while keeping ligand information that is still difficult to predict. Other ranges have not been tested so far and may be used in the future.

A total of 39,353 PDB entries were selected using the above criteria, containing 13,189 unique ligand codes (unique ligand structures). The .pdb and .mtz files of these RCSB PDB entries were downloaded automatically. The coordinate lines representing the ligands present in the protein chains of these PDB entries were isolated from the retrieved files and saved into individual .pdb files. This procedure resulted in a total of 293,822 available ligand entries from 39,169 PDB entries, containing 13,074 unique ligand codes.

The Structure Data Format (SDF) file of each ligand entry was also downloaded from RCSB PDB. An SDF file is a chemical file format for molecular data based on the MOL-file format, which can store single or multiple molecules and describes all their atoms in 3D coordinates. Each ligand’s SDF file was used to build and validate the representative molecular graph (chemical structure) of the ligand. The free ligand entries with validated SDF files were used to propose chemical vocabularies for labeling the structure of protein ligands using a building block-like approach. This structure validation resulted in a total of 259,606 ligand entries from 39,052 PDB entries, containing 12,972 unique ligand codes.

To validate the experimental data of each PDB entry, a standardized procedure was proposed to refine the datasets downloaded from RCSB PDB (.mtz and .pdb files), without the ligand atomic entries, aiming to improve the blob imaging and to remove any failed PDB entry (described in the next subsection). In addition, the ligand entries with validated SDF files were also used to extract the ligand’s 3D representations from their correctly refined Fo-Fc maps (described in the next subsections). The ligand entries that raised an error in any step were removed from the list of valid ligands.

The final list of valid ligands contains 244,226 ligand entries from 36,202 PDB deposits. These ligands are non-covalent protein ligands composed of C, O, N, P, S, Se, F, Cl, Br and/or I atoms, of which 12,239 are unique ligand codes (unique structures) with frequencies ranging from 1 to 33,063 occurrences (20 ± 526). Single atoms or ions (e.g. Cl-) correspond to 8.6% of the ligand entries (n = 21,003), while the other 91.4% are valid molecular structures (n = 223,223). The median size of the valid ligands is 6 non-hydrogen atoms and the mean size is 11 non-hydrogen atoms, with sizes ranging from 1 to 140 non-hydrogen atoms. These statistics indicate a strong class imbalance in the list of valid ligands, which is related to the diversity of non-covalent ligands deposited in PDB. They also highlight the diversity of potential protein ligands with importance in biology and drug discovery. Many such ligands are still to be discovered and will have to be interpreted in the future, as novel X-ray protein structures in complex with ligands are obtained.

The RCSB PDB downloads were automated with Python v3.8 scripts, and the validation of the ligand entries used the RDKit package v2019.09.3 (https://www.rdkit.org). During validation, 16.9% of the ligand entries and 8% of the PDB entries were excluded: 11.6% of the ligand entries due to invalid SDF files (including minor download errors), 4.0% due to refinement errors and 1.3% due to errors in the creation and labeling of the ligand’s 3D representation. This indicates that part of the ligand entries are of poor quality, further highlighting the difficulty of applying data mining techniques directly to PDB data[19](https://www.nature.com/articles/s41597-025-06002-8#ref-CR19 “Dauter, Z., Wlodawer, A., Minor, W., Jaskolski, M. & Rupp, B. Avoidable errors in deposited macromolecular structures: an impediment to efficient data mining. IUCrJ 1, 179–193, https://doi.org/10.1107/S2052252514005442

(2014).“).
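As a quick arithmetic check, the exclusion percentages can be recomputed from the entry counts reported in this section. The sketch assumes that the denominators are the 293,822 isolated ligand entries and the 39,353 initially selected PDB entries; the dictionary keys and the helper name are hypothetical.

```python
# Ligand-entry counts after each stage, as reported in the text.
ligand_counts = {
    "isolated": 293_822,   # ligand entries isolated from the downloaded files
    "valid_sdf": 259_606,  # after SDF structure validation
    "refined": 247_878,    # after Dimple refinement
    "final": 244_226,      # after 3D representation creation and labeling
}

def percent_lost(before, after, total):
    """Percentage of `total` lost when going from `before` to `after` entries."""
    return 100.0 * (before - after) / total

total = ligand_counts["isolated"]
lost_sdf = percent_lost(ligand_counts["isolated"], ligand_counts["valid_sdf"], total)
lost_refine = percent_lost(ligand_counts["valid_sdf"], ligand_counts["refined"], total)
lost_repr = percent_lost(ligand_counts["refined"], ligand_counts["final"], total)
lost_total = percent_lost(ligand_counts["isolated"], ligand_counts["final"], total)
lost_pdb = percent_lost(39_353, 36_202, 39_353)  # PDB entries: selected vs final

for name, value in [("SDF", lost_sdf), ("refinement", lost_refine),
                    ("3D representation", lost_repr), ("all ligand entries", lost_total),
                    ("PDB entries", lost_pdb)]:
    print(f"{name}: {value:.2f}%")
```

Under these assumptions the recomputed values agree with the reported 11.6%, 4.0%, 1.3%, 16.9% and 8% to within about 0.1 percentage point.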

Ligand 3D representation in point cloud

Next in LigPCDS creation, the 3D representations of the ligands present in the list of valid ligands were designed and created. Considering the variability and flexibility in the size and conformation of ligands, the ease and speed of manipulating point clouds[29](https://www.nature.com/articles/s41597-025-06002-8#ref-CR29 “Zhou, Q.-Y., Park, J. & Koltun, V. Open3D: A Modern Library for 3D Data Processing. arXiv https://doi.org/10.48550/arXiv.1801.09847

(2018).“), and the availability of many good performance deep learning architectures for 3D point clouds[30](https://www.nature.com/articles/s41597-025-06002-8#ref-CR30 “Guo, Y. et al. Deep Learning for 3D Point Clouds: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 43, 4338–4364, https://doi.org/10.1109/TPAMI.2020.3005434

(2021).“), we have chosen point clouds as the format to represent the 3D images of ligands in LigPCDS.

The point clouds were initially extracted from the Fo-Fc maps using a ligand grid. For this, a 3D grid box was drawn around the ligand, and the electron density intensity values at each x,y,z coordinate of the grid were computed and stored in the color channels of the point cloud. Then, contours and scales were applied to extract the 3D representations of the ligand images, without background and noise. Nine types of 3D representations (at different contours and scales) were generated for each ligand and are available in LigPCDS. The representation type to use depends on the user's application and should be chosen on a case-by-case basis. For our deep learning model of ligand chemical structure prediction, the qRankMask_5 representation showed the best results.

The detailed schema used in LigPCDS for creating the 3D representations of ligands in 3D point cloud format (step 2, Fig. 1) is shown in Fig. 2. A step-by-step explanation of this process is given below.

Fig. 2

Schema for creating the labeled representations of ligands in 3D point cloud format for LigPCDS. The ligand FUL of PDB (entry 4Z4T) was used to exemplify the creation of the ligand’s 3D point cloud starting from the grid up to the final 3D representations. (1) The ligand’s grid representation is sized and interpolated from its Fo-Fc map in all its x,y,z positions, using the Gemmi package[26](https://www.nature.com/articles/s41597-025-06002-8#ref-CR26 “Wojdyr, M. GEMMI: A library for structural biology. J. Open Source Softw. 7, 4200, https://doi.org/10.21105/joss.04200

(2022).“). The ligand’s grid is stored in point cloud format (.xyzrgb) with the density value of each point saved in its RGB channels (feature as colors). (2) The density values of the ligand’s grid 3D point cloud are transformed and normalized using the quantile rank scale[33](https://www.nature.com/articles/s41597-025-06002-8#ref-CR33 “Urzhumtsev, A., Afonine, P. V., Lunin, V. Y., Terwilliger, T. C. & Adams, P. D. Metrics for comparison of crystallographic maps. Acta Crystallogr. Sect. D Biol. Crystallogr. 70, 2593–2606, https://doi.org/10.1107/S1399004714016289

(2014).“). (3) The points of the ligand’s grid within a contour of 0.95 (value > 0.95) are selected and only the points near the ligand’s atomic positions and closely connected (with a distance between points smaller than grid space * 1.42 + 0.15) are retained, the rest is removed as noise. This creates the fine ligand blob representation. (4) The ligand’s mask point cloud is created from this fine ligand blob by applying a 1.1 Å radius expansion from its borders and is named “qRankMask”. (5) The final representations of the ligands are created by applying different contours in the ligand’s mask representation and extracting the selected 3D point cloud. The final representations are named as “qRank” followed by the contour value, e.g. “qRank0.95”. Additionally, a representation equal to the ligand’s mask and with all values below 0.5 set to zero is created and named “qRankMask_5”. This schema corresponds to the procedures used to complete step 2 in the LigPCDS creation workflow (Fig. 1a). (6) Finally, the labels of the ligand’s structure are used for pointwise labeling the final 3D representations of the ligands, which corresponds to step 4 of the LigPCDS workflow (Fig. 1a).
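Procedures 2, 3 and 5 of the schema can be sketched on a plain list of density values. The sketch assumes that the quantile rank of a value is the fraction of grid values less than or equal to it, which is one common reading of the quantile rank scale and may differ in detail from the implementation used for LigPCDS. The function names and the toy densities are hypothetical, and the connectivity filtering and mask expansion steps are not shown.

```python
import bisect

def quantile_rank(values):
    """Map each density value to the fraction of values <= it (range (0, 1])."""
    ordered = sorted(values)
    n = len(ordered)
    return [bisect.bisect_right(ordered, v) / n for v in values]

def contour_mask(ranks, cutoff=0.95):
    """True for points strictly above the contour, as in 'value > 0.95'."""
    return [r > cutoff for r in ranks]

def zero_below(ranks, floor=0.5):
    """Set values below `floor` to zero, as described for 'qRankMask_5'."""
    return [r if r >= floor else 0.0 for r in ranks]

densities = [0.3, -0.1, 2.5, 0.0, 0.7, 1.9, -0.4, 0.2, 0.05, 1.1]
ranks = quantile_rank(densities)
selected = contour_mask(ranks, cutoff=0.95)
masked = zero_below(ranks, floor=0.5)
print(ranks)
print(sum(selected), masked.count(0.0))
```

Because the rank scale depends only on the ordering of the values, the same contour value selects the same fraction of points in every map, whatever its absolute density scale.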

Refinement of the Fo-Fc maps (experimental data preparation)

Before extracting the 3D representations of the ligand’s blob in 3D point clouds, each PDB entry in the list of valid ligands was first refined using the Dimple software v2.6.1 (https://ccp4.github.io/dimple/), a macromolecular crystallographic pipeline for refinement incorporated into the CCP4 program suite[25](https://www.nature.com/articles/s41597-025-06002-8#ref-CR25 “Winn, M. D. et al. Overview of the CCP4 suite and current developments. Acta Crystallogr. Sect. D Biol. Crystallogr. 67, 235–242, https://doi.org/10.1107/S0907444910045749

(2011).“). A standardized Dimple refinement was performed for each PDB entry using its downloaded .mtz and .pdb files, with the option of removing heteroatoms (which removes all ligands from the .pdb file) and with two refinement cycles (longer refinement). The other Dimple parameters were left at their default values. Dimple refinement was carried out with two primary objectives. The first was to highlight the presence of any ligand blob in the crystal structure: with the “remove heteroatom” parameter active, the unmodeled electron density related to the ligands (high values in the Fo-Fc maps) could be revealed, and any bias related to incorrect ligand structure modelling in the PDB deposit would be removed. The second was to improve the overall Fo-Fc map and the local quality of the ligand blob, further normalizing the model refinement standards across the different crystal structures present in the list of valid ligands. The PDB entries that presented errors in the refinement were excluded. The list of valid ligands at this point contained 36,325 successfully refined PDB entries, with 247,878 ligand entries listed, of which 12,250 were unique ligands.

Extraction of the ligand grid representation in 3D point cloud (procedure 1, Fig. 2)

A ligand grid was then created to extract the 3D image of each ligand blob (found in the refined Fo-Fc map) into the 3D point cloud format. The ligand grid is a bounding box defined on the boundary of the ligand’s atomic positions, plus a gap, designed to cover the complete shape of the ligand blob. This procedure used the original SDF coordinates of the ligand to locate the center of its molecular structure in the refined Fo-Fc map, and to retrieve the ligand’s atomic 3D coordinates, thus computing the bounding box on the boundary of its atomic positions. Through experimental inspection, this box was expanded with an additional gap equal to 4.2 Å in its boundaries (equal to the diameter of the largest theoretical radius[31](https://www.nature.com/articles/s41597-025-06002-8#ref-CR31 “Batsanov, S. S. Van der Waals Radii of Elements. Inorg. Mater. 37, 871–885, https://doi.org/10.1023/A:1011625728803

(2001).“) - Supplementary Table 1), and a second 120% expansion of its size was then performed. The resulting dimensions defined the size of the ligand grid in the Fo-Fc map, centered on the ligand bounding box.
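The box construction described above can be sketched as follows. This is a minimal illustration, not the LigPCDS code: the function name `ligand_grid_box` is hypothetical, and it assumes that the 4.2 Å gap is added on both sides of each axis and that the 120% expansion scales the resulting edge lengths about the box center.

```python
def ligand_grid_box(atom_coords, gap=4.2, scale=1.2):
    """Return (center, size) of a ligand grid box, per axis, in angstroms."""
    mins = [min(c[i] for c in atom_coords) for i in range(3)]
    maxs = [max(c[i] for c in atom_coords) for i in range(3)]
    # Center of the bounding box around the atomic positions.
    center = tuple((lo + hi) / 2 for lo, hi in zip(mins, maxs))
    # Assumption: gap added on both sides, then edges scaled to 120%.
    size = tuple((hi - lo + 2 * gap) * scale for lo, hi in zip(mins, maxs))
    return center, size


# Toy ligand with three atoms (coordinates in angstroms).
atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 1.0)]
center, size = ligand_grid_box(atoms)
```

For this toy input the box is centered at (0.75, 0.75, 0.5) and each edge is the atomic extent plus 8.4 Å, multiplied by 1.2.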

The Gemmi package[26](https://www.nature.com/articles/s41597-025-06002-8#ref-CR26 “Wojdyr, M. GEMMI: A library for structural biology. J. Open Source Softw. 7, 4200, https://doi.org/10.21105/joss.04200

(2022).“) v0.5.8 was then used to interpolate the values of the Fo-Fc map at all x,y,z positions of the ligand grid. The resulting 3D grid was stored in a point cloud format, named the ligand grid representation. The difference electron density value of each point was chosen as the feature for the ligand 3D representation, and this interpolated value was stored in the color channels of the 3D point clouds of the ligand grid representation. A point spacing of 0.5 Å was tested and chosen for the ligand grid. This value is smaller than the length of a chemical bond (a sigma C-C bond measures around 1.54 Å) and allows more detail to be retained in the final 3D representations.

The Gemmi v0.5.8 Python package[26](https://www.nature.com/articles/s41597-025-06002-8#ref-CR26 “Wojdyr, M. GEMMI: A library for structural biology. J. Open Source Softw. 7, 4200, https://doi.org/10.21105/joss.04200

(2022).“) for structural biology provides a framework of functions to manipulate electron density maps in indexable 3D grids, behaving like standard numerical vectors. Gemmi v0.5.8 allows extracting 3D grids from specific regions of an electron density map with different spacing between the points. It uses an implementation of the trilinear interpolation of the 8 closest points[32](https://www.nature.com/articles/s41597-025-06002-8#ref-CR32 “Afonine, P. V. et al. Real-space refinement in PHENIX for cryo-EM and crystallography. Acta Crystallogr. Sect. D Struct. Biol. 74, 531–544, https://doi.org/10.1107/S2059798318006551

(2018).“) of a given position of a map to compute its electron density value.
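To make the interpolation step concrete, the sketch below implements plain trilinear interpolation over the 8 surrounding grid points. It is an illustrative re-implementation, not Gemmi's code: the function `trilinear` is hypothetical, positions are given in fractional grid-index units, and indices are wrapped periodically on the assumption that the map covers a full unit cell.

```python
import math


def trilinear(grid, x, y, z):
    """Interpolate grid[i][j][k] at the fractional index position (x, y, z)."""
    nx, ny, nz = len(grid), len(grid[0]), len(grid[0][0])
    i, j, k = math.floor(x), math.floor(y), math.floor(z)
    fx, fy, fz = x - i, y - j, z - k
    value = 0.0
    for di in (0, 1):
        for dj in (0, 1):
            for dk in (0, 1):
                # Weight of each corner is the product of 1D linear weights.
                weight = ((fx if di else 1 - fx)
                          * (fy if dj else 1 - fy)
                          * (fz if dk else 1 - fz))
                # Periodic wrap-around of indices (assumed unit-cell map).
                value += weight * grid[(i + di) % nx][(j + dj) % ny][(k + dk) % nz]
    return value


# 2 x 2 x 2 toy grid whose value is a linear function of the indices.
grid = [[[i + 2 * j + 4 * k for k in range(2)] for j in range(2)] for i in range(2)]
mid_value = trilinear(grid, 0.5, 0.5, 0.5)
```

Because the toy grid is linear in its indices, the value at the cell center is simply the mean of the eight corner values.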

Transformation and scale of the ligand grid representation (procedure 2, Fig. 2)

The quantile rank scale was then used to transform and scale the ligand grids so that they can be compared correctly. This approach is equivalent to histogram equalization[33](https://www.nature.com/articles/s41597-025-06002-8#ref-CR33 “Urzhumtsev, A., Afonine, P. V., Lunin, V. Y., Terwilliger, T. C. & Adams, P. D. Metrics for comparison of crystallographic maps. Acta Crystallogr. Sect. D Biol. Crystallogr. 70, 2593–2606, https://doi.org/10.1107/S1399004714016289

(2014).“),[34](https://www.nature.com/articles/s41597-025-06002-8#ref-CR34 “Hawkes, P. W. Digital image processing. Nature 285, 174–175, https://doi.org/10.1038/285174b0

(1980).“) in image processing. This scale normalizes the values in the range from 0 to 1. The quantile rank scale is used in other crystallography applications[33](https://www.nature.com/articles/s41597-025-06002-8#ref-CR33 “Urzhumtsev, A., Afonine, P. V., Lunin, V. Y., Terwilliger, T. C. & Adams, P. D. Metrics for comparison of crystallographic maps. Acta Crystallogr. Sect. D Biol. Crystallogr. 70, 2593–2606, https://doi.org/10.1107/S1399004714016289

(2014).“), and replaces the density value ρ(x,y,z) of each point by its position in the quantile distribution of the points in the region being considered. This scale does not change the shape of the electron density: all points that have the same density value ρ receive the same value under this function. Furthermore, unlike the sigma scale, which must be applied globally across the entire electron density map, the quantile rank scale can be applied locally within a box to compare the same region. The sigma and quantile rank scales are comparable, with 1σ, 2σ or 3σ contours corresponding to quantile positions that vary approximately between 0.85, 0.95 and 0.98[33](https://www.nature.com/articles/s41597-025-06002-8#ref-CR33 “Urzhumtsev, A., Afonine, P. V., Lunin, V. Y., Terwilliger, T. C. & Adams, P. D. Metrics for comparison of crystallographic maps. Acta Crystallogr. Sect. D Biol. Crystallogr. 70, 2593–2606, https://doi.org/10.1107/S1399004714016289

(2014).“). Using the quantile rank scale speeds up calculations for data extraction, improves comparison, and excludes noise from distant regions of the electron density map, since the resolution of X-ray protein crystallographic data varies locally[35](https://www.nature.com/articles/s41597-025-06002-8#ref-CR35 “Lamb, A. L., Kappock, T. J. & Silvaggi, N. R. You are lost without a map: Navigating the sea of protein structures. Biochim. Biophys. Acta - Proteins Proteomics 1854, 258–268, https://doi.org/10.1016/j.bbapap.2014.12.021

(2015).“).

A fast implementation of the quantile rank scale function was created for this project: it first sorts the density values inside the ligand grid representation and then replaces the value of each point by its position in the ranked quantile distribution of the 3D grid. Tied values receive the leftmost (first occurring) position. The scaled ligand grid representations of 247,424 ligand entries, 12,245 of them unique ligands, were successfully created at this step.
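The sort-then-rank idea can be sketched in a few lines. This is not the project's implementation: the function `quantile_rank` is hypothetical, and dividing the rank by n − 1 to reach the 0 to 1 range is an assumption, since the exact normalization is not stated here. `bisect_left` returns the leftmost position in the sorted list, which matches the tie rule described above.

```python
from bisect import bisect_left


def quantile_rank(values):
    """Replace each value by its leftmost rank in the sorted list, scaled to [0, 1]."""
    ordered = sorted(values)
    # Assumption: normalize by n - 1 so the largest unique value maps to 1.
    denom = max(len(ordered) - 1, 1)
    return [bisect_left(ordered, v) / denom for v in values]


# Toy density values with one tie (0.2 appears twice).
densities = [0.2, -0.1, 0.7, 0.2, 1.3]
ranks = quantile_rank(densities)
```

Both occurrences of 0.2 receive the same rank, so points with equal density keep equal values after scaling.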

Extraction of the fine ligand blob 3D representation (procedure 3, Fig. 2)

The next step consisted of removing noise from the scaled ligand grid. For this, the scaled ligand grid representation was filtered to retain only the points within a contour of 0.95 (value > 0.95). Then, only the points near the ligand atomic positions and closely connected (with a distance between points smaller than the grid space × 1.42 + 0.15) were retained. A neighborhood search removed the noisy points that remained after the 0.95 contour filter; in other words, the points that were not closely connected to the ligand atomic positions were removed here. This created the fine ligand blob 3D representation, with a strong signal level and without noise. Python’s Open3D package[29](https://www.nature.com/articles/s41597-025-06002-8#ref-CR29 “Zhou, Q.-Y., Park, J. & Koltun, V. Open3D: A Modern Library for 3D Data Processing. arXiv https://doi.org/10.48550/arXiv.1801.09847

(2018).“) v0.12 was used to create the 3D point clouds of the ligand grid, mask and final representations (described in the next sections). This package implements KD-trees using the FLANN library[36], which gives quick access to the nearest neighbors of each point in the point clouds and made the neighborhood search efficient.
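The filtering and connectivity steps can be illustrated with a brute-force sketch. It is not the LigPCDS implementation, which relies on Open3D KD-trees: the function `fine_blob` and its `seed_radius` parameter are hypothetical, and the rule for deciding which points count as "near" an atom (here, the same distance threshold as the point-to-point link) is an assumption. With 0.5 Å spacing the link threshold is 0.5 × 1.42 + 0.15 = 0.86 Å, which connects points along grid edges (0.5 Å) and face diagonals (about 0.71 Å) but not body diagonals (about 0.87 Å).

```python
import math
from collections import deque


def fine_blob(points, values, atoms, spacing=0.5, contour=0.95, seed_radius=None):
    """Keep high-density points that are connected to the ligand atoms."""
    link = spacing * 1.42 + 0.15
    if seed_radius is None:
        seed_radius = link  # assumption: "near an atom" uses the link distance
    # Step 1: contour filter (value > contour).
    kept = [p for p, v in zip(points, values) if v > contour]
    # Step 2: seed the search with points close to any atomic position.
    seeds = [n for n, p in enumerate(kept)
             if any(math.dist(p, a) < seed_radius for a in atoms)]
    visited = set(seeds)
    queue = deque(seeds)
    # Step 3: breadth-first growth through closely connected points.
    while queue:
        cur = queue.popleft()
        for n, p in enumerate(kept):
            if n not in visited and math.dist(kept[cur], p) < link:
                visited.add(n)
                queue.append(n)
    return [kept[n] for n in sorted(visited)]


points = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0),
          (3.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
values = [0.99, 0.97, 0.96, 0.99, 0.50]
blob = fine_blob(points, values, atoms=[(0.0, 0.0, 0.0)])
```

In this toy case the point at x = 3.0 passes the contour filter but is dropped as isolated noise, and the point at x = 1.5 is dropped by the contour filter itself. The pairwise loop is quadratic in the number of points; a KD-tree radius search would replace it in practice.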

Creation of the ligand mask representation (procedure 4, Fig. 2)

The fine ligand blob 3D representation at 0.95 contour was then used as a reference for the blob location and shape. This 3D representation was expanded from its boundary points with a radius equal to 1.1 Å in the scaled ligand grid. The resulting 3D point cloud was stored as the final ligand mask representation and was named qRankMask. By doing this expansion on the “fine ligand blob 3D representation”, instead of directly on the scaled ligand grid representation at 0.95 contour (no filters), we could prevent distant noisy points from being included in the qRankMask and, subsequently, in the final representations of the ligands.
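A simplified sketch of the mask expansion is given below. The function `expand_mask` is hypothetical, and it makes two assumptions: the expansion is taken around every blob point rather than only the boundary points, and points at exactly 1.1 Å are included.

```python
import math


def expand_mask(grid_points, blob_points, radius=1.1):
    """Select grid points lying within `radius` of any fine-blob point."""
    return [g for g in grid_points
            if any(math.dist(g, b) <= radius for b in blob_points)]


# One row of grid points at 0.5 angstrom spacing and a single blob point.
grid_points = [(0.5 * n, 0.0, 0.0) for n in range(7)]
mask = expand_mask(grid_points, blob_points=[(1.0, 0.0, 0.0)])
```

Here the mask covers the grid points from x = 0.0 to x = 2.0, that is, two grid steps on either side of the blob point.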

Creation of the final representations of the ligands in 3D point cloud (procedure 5, Fig. 2)

Finally, the 3D representations of the list of valid ligands in 3D point cloud were created. Nine types of 3D representations were generated per ligand entry by exploring different contour levels, and together they compose LigPCDS. The representation types were named qRank0.5, qRank0.7, qRank0.75, qRank0.8, qRank0.85, qRank0.9, qRank0.95, qRankMask, and qRankMask_5. The finely sliced 3D point clouds were obtained by applying, to the ligand mask representation (qRankMask), contours at 0.5, 0.7, 0.75, 0.8, 0.85, 0.9 and 0.95 on the quantile rank scale; the contour used is given by the suffix of the representation name. These point clouds have as a single feature the scaled density value of the qRankMask normalized again from 0 to 1, where each contour value is the new 0 in the final representation. For qRankMask_5 a different approach was used, aiming to combine the types qRank0.5 and qRankMask, which gave better results in model training: values below 0.5 were set to 0 in the qRankMask, and all the normalized values of contour 0.5 were directly used as the feature. In other words, weak points (below 0.5) were clipped.
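One possible reading of the contouring and renormalization is sketched below. The functions `qrank_contour` and `qrank_mask_5` are hypothetical, and the linear rescaling (v − c) / (1 − c), the strict comparison v > c, and the clipping formula for qRankMask_5 are assumptions drawn from the description above rather than the released code.

```python
def qrank_contour(points, values, contour):
    """Keep points above the contour; rescale so the contour becomes 0 and 1 stays 1."""
    return [(p, (v - contour) / (1.0 - contour))
            for p, v in zip(points, values) if v > contour]


def qrank_mask_5(points, values, contour=0.5):
    """Keep every mask point; values below the contour are clipped to 0."""
    return [(p, max(v - contour, 0.0) / (1.0 - contour))
            for p, v in zip(points, values)]


# Three mask points with their quantile rank values.
points = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0)]
values = [0.25, 0.75, 1.0]
rep_05 = qrank_contour(points, values, 0.5)
rep_mask_5 = qrank_mask_5(points, values)
```

The contoured representation drops the weak point entirely, whereas the qRankMask_5 variant keeps it with a feature value of 0, which preserves the mask shape.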

The ligand mask representations (qRankMask and qRankMask_5) and the representations with a quantile rank contour ≤ 0.8 (qRank0.5, qRank0.7, qRank0.75, qRank0.8) gave better results when training the validated deep learning models, with a very small difference between their accuracies. The representation qRankMask_5 was chosen as the best result for the validated segmentation models; it maintains the ligand mask shape with good accuracy. Depending on the usage goals of this dataset, different representation types may give the best results.

A total of 244,283 ligand entries, 12,239 being unique ligands, had their final 3D representations successfully created. The first and fourth columns of Fig. 3 show the final 3D point clouds of two different ligands in four different representation types and the ligand mask. This figure illustrates the impact of the contour value on the final 3D point cloud of the ligands.

Fig. 3

Example of a ligand’s 3D point cloud labeling for five different representation types. Two ligands are used for illustration: 4ZV (PDB entry 5cc6, resolution 2.1 Å) and FUL (PDB entry 4z4t, resolution 1.8 Å). Their blobs from their Fo-Fc maps are shown in the top of the panel with a contour of 3σ (image created with Coot). The LigPCDS visualization script was used to draw the ligands’ 3D point clouds. For ligand FUL, it is possible to see the pattern of a ring in the qRank0.95 representation; it results from the cyclic substructure of size six, present in its structure. In ligand 4ZV this pattern is not clear, possibly due to the mobility of this molecule - which is indicated by the presence of noise around its image (blob) and its representations (bottom left and top right of the ligand region – black points labeled as background). Furthermore, the qRank0.95 representation of ligand 4ZV is partially fragmented, with missing points, while for ligand FUL all points with labels are completely covered. There is more visual correspondence between the ligand’s image in the 3σ Fo-Fc maps and the qRank0.95 point cloud.

The mean time to create the ligand grid representation in 3D point cloud was 0.33 seconds per ligand. The mean time to create all representation types was 0.39 seconds per ligand (mean time for a spacing of the points equal to 0.5 Å). Other ways to create the 3D representations of ligands in 3D point clouds may also be tested in the future. This work provides one of the possible frameworks of functions to create 3D representations of protein ligands in 3D point clouds (imaging approach), which were successfully tested to be used in ML approaches.

Chemical vocabularies and ligand structure labeling

Chemical vocabularies were designed (step 3, Fig. 1) to compose the building blocks to label the created 3D representations of ligands in 3D point clouds from LigPCDS. The set of uniquely used labels is referred to as vocabulary and the unique labels are referred to as classes.

Data labeling can be very difficult depending on the amount of data and on the availability of validated references[37](https://www.nature.com/articles/s41597-025-06002-8#ref-CR37 “Fredriksson, T., Mattos, D. I., Bosch, J. & Olsson, H. H. Data Labeling: An Empirical Investigation into Industrial Challenges and Mitigation Strategies. in Morisio, M., Torchiano, M., Jedlitschka, A. (eds) Product-Focused Software Process Improvement. PROFES 2020. Lecture Notes in Computer Science(), vol 12562. Springer, Cham. 202–216, https://doi.org/10.1007/978-3-030-64148-1_13

(2020).“). The labeling in LigPCDS was designed to first label the ligand’s structure atom-wise with building blocks (classes) and then to extrapolate it to the ligand 3D representations (the ligand chemical structure – next subsection). The implemented structure labeling approach was inspired by ML solutions that model chemical structures of small molecules for drug design[38](https://www.nature.com/articles/s41597-025-06002-8#ref-CR38 “Jin, W., Barzilay, R. & Jaakkola, T. Jun
