An important aspect of drug discovery and design is QSAR, or quantitative structure-activity relationship, studies. QSAR attempts to create mathematical models that relate the biological activity and the chemical and physical properties of compounds to their chemical structures. QSAR is commonly used for the prediction of pharmacokinetic properties, such as absorption, distribution, metabolism and excretion (ADME), as well as toxicity. QSAR has been applied for decades as an approach for lead optimization in drug discovery research, and as labs become more digitized, applications for machine learning (ML) will undoubtedly multiply.
To get a better gauge on the state of ML methods, such as QSAR, and the advantages of deep learning ML methods, IBO chatted with Matt Repasky, Ph.D., vice president of Life Science Products, and leader of scientific and technical support groups at Schrödinger, a company that develops chemical simulation software used in pharmaceutical, biotechnology and materials research.
Expanding Applications for ML
Applications of ML techniques in drug discovery and design, which have traditionally been important tools through QSAR/QSPR (quantitative structure–property relationships), have been evolving and, as Dr. Repasky stated, it is currently “a very exciting time” in the field. “Largely due to new deep learning methods, applications have significantly broadened to include novel molecule creation, retrosynthetic analysis that can be tied to automated synthesis machinery, automated chemical perception from patents, papers and images, and combined physics-based and ML methods to rapidly and accurately predict assay endpoints of interest for very large chemical spaces,” he explained. “While there is significant hype regarding the potential impact of deep learning ML methods, i.e. AI, its benefits to the areas noted is pretty clear. In addition to more capabilities, automation and recognition of best-practices approaches have enabled more timely updates to QSAR models and creation/curation of models using broader ranges of data sources than in the past.” Together, Dr. Repasky noted, this broadens the use of QSAR models throughout the drug discovery process.
QSAR models can be used in pharmacological studies, providing an in silico methodology where a user is able to test or rank compounds for targeted properties. But that raises an important question: can ML models such as QSAR decrease the need for wet lab testing of compounds and the use of lab equipment and instruments? “The use of ML models in the drug discovery process, both in hit identification and lead optimization, is designed to efficiently recognize molecules that meet the property requirements of the project,” said Dr. Repasky. “In a world with perfectly performing models for every property required to be satisfied, computationally a large number of compounds could be examined, and only one compound would need to be synthesized and assayed to have a drug. We clearly do not live in that world yet.”
According to Dr. Repasky, in order to reach this goal, there is a need for much more experimental data to train ML models that are broadly applicable for all properties that are of interest, as well as ML techniques that are credible and accurate given large training sets. “So we believe the need for wet lab testing is unlikely to be reduced due to the need for data to inform ML models; however, the number of compounds that must be synthesized as part of hit identification and lead optimization projects that are informed by existing ML models is likely to decrease, due to accuracy improvements in ML modeling.”
At Schrödinger, ML modeling is used not only for greater efficiency, but also to reduce costs. “[Schrödinger employs] an active learning-based approach combining deep learning ML models with physics-based free-energy calculations that has reduced the number of compounds that need to be synthesized and assayed significantly [1],” Dr. Repasky explained. “This results in large financial savings within the standard design-make-test-analyze process widely employed in lead optimization, while maximizing the chemical space explored.”
Challenges with QSAR and ML
QSAR is a valuable tool in the drug design and discovery process, but it is not without its limitations. From false correlations, to issues with experimental design, to cross-correlations between various physicochemical parameters, QSAR and other ML applications can be complex to analyze and interpret. As Dr. Repasky indicated, the most pressing limitations of the application of ML to drug discovery are data quantity and quality, and interpolation. “Having consistent data for training/validation where experimental measurements are repeatable and are from an assay that aligns with physical understanding of the system is essential,” noted Dr. Repasky. “Particularly when dealing with biological assays, measurement error can be large relative to the differences in end point values being compared. End points, such as the half maximal effective concentration, EC50, are sometimes used in drug discovery projects to gauge efficacy, though EC50 may or may not directly relate to protein-ligand binding, which is often being chemically optimized.”
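The point about measurement error can be made concrete with a small calculation. The sketch below is illustrative only: the two pEC50 values and the assay standard deviation are hypothetical, and it assumes independent, normally distributed errors of equal size for both compounds. Under those assumptions, it checks whether an observed difference between two compounds is larger than the assay noise could plausibly explain.

```python
import math

def z_score_of_difference(value_a, value_b, assay_sd):
    """Difference between two measurements in units of its own standard error.

    Assumes independent, normally distributed errors with the same
    standard deviation for both measurements.
    """
    sd_of_difference = math.sqrt(2) * assay_sd
    return abs(value_a - value_b) / sd_of_difference

# Hypothetical pEC50 values (-log10 of EC50 in mol/L) and assay noise.
pec50_a, pec50_b = 6.8, 7.1
assay_sd = 0.3  # assumed repeat-measurement standard deviation, in log units

z = z_score_of_difference(pec50_a, pec50_b, assay_sd)
distinguishable = z > 1.96  # roughly a 95% two-sided threshold
print(f"z = {z:.2f}, distinguishable: {distinguishable}")
```

With these made-up numbers the twofold difference in EC50 falls well inside the noise, so a model trained to reproduce it would largely be fitting measurement error.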
The need for sufficient data to minimize overfitting is vital, Dr. Repasky stated, not only for traditional deep learning ML approaches, but also newer ones. “Pharmaceutically relevant chemical space is incredibly vast, estimated to be on the order of 10^60 compounds; thus, the creation of global models with applicability domains sufficient to cover large swaths of chemical space requires very large amounts of data,” he continued. “Such datasets are generally not available in the biotech or pharmaceutical industries leading to the widespread use of local models, for which applicability domains should be clearly communicated for effective use.”
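One common way to communicate an applicability domain is a nearest-neighbor similarity check against the training set. The following is a minimal sketch of that idea, not a description of any Schrödinger product: the fingerprints are hypothetical sets of bit indices, and the 0.5 Tanimoto threshold is an arbitrary illustrative choice.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints stored as sets of 'on' bit indices."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

def in_applicability_domain(query_fp, training_fps, threshold=0.5):
    """Flag a query as in-domain if it resembles at least one training compound.

    The 0.5 threshold is an arbitrary illustrative choice.
    """
    nearest = max(tanimoto(query_fp, fp) for fp in training_fps)
    return nearest >= threshold, nearest

# Hypothetical fingerprints: each set holds the indices of the bits that are set.
training_fps = [{1, 2, 3, 4}, {2, 3, 4, 5}, {1, 3, 5, 7}]
close_analog = {1, 2, 3, 5}
novel_scaffold = {10, 11, 12, 3}

for name, fp in [("close analog", close_analog), ("novel scaffold", novel_scaffold)]:
    inside, similarity = in_applicability_domain(fp, training_fps)
    print(f"{name}: nearest similarity {similarity:.2f}, in domain: {inside}")
```

A local model's predictions for compounds flagged as out of domain would then be reported with a warning rather than trusted at face value.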
Dr. Repasky explained that ML models’ reliance on interpolation is a further limitation. He gave the example of ligand-based QSAR models (ranging from 2D to 6D), which are, in general, unable to forecast the occurrence of “magic methyls,” where the addition of a methyl group to a ligand increases binding affinity by orders of magnitude. Physically, Dr. Repasky explained, the increase in binding affinity takes place because the methyl group displaces an unstable, high-energy water molecule in the protein environment, even though the methyl group itself does not form a particularly strong interaction with the protein. “This effect cannot be predicted by an interpolation-based ML model unless it has already been observed in another ligand, because displacement of the water leads to the physical effect and not interactions of the methyl group,” he continued. “For end points that are one-body problems, such as logP, this isn’t an issue, though even properties like solubility can exhibit this effect, due to the physical dependence of logS on both the crystal state, where it forms many-body interactions with copies of the same ligand, and the ligand in solution.”
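The interpolation limit is easy to see with a deliberately simple model. The sketch below uses k-nearest-neighbor regression as a stand-in for an interpolation-based ligand model; it is not one of the QSAR methods discussed above, and the descriptors and affinities are hypothetical. Because the prediction is an average of training labels, it can never exceed the best affinity already in the training set, however large the true jump.

```python
import math

def knn_predict(query, training, k=2):
    """Predict by averaging the labels of the k nearest training points."""
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    return sum(label for _, label in nearest) / len(nearest)

# Hypothetical descriptors (e.g. scaled size and lipophilicity) with hypothetical
# binding affinities (pKd) for a congeneric series, none of which shows the effect.
training = [
    ((0.10, 0.20), 6.0),
    ((0.20, 0.25), 6.3),
    ((0.30, 0.30), 6.5),
    ((0.40, 0.35), 6.8),
]
methylated_analog = (0.45, 0.40)  # one extra methyl: a small step in descriptor space
measured_pkd = 8.9                # hypothetical "magic methyl" jump of about two log units

predicted = knn_predict(methylated_analog, training)
print(f"predicted pKd {predicted:.2f} vs measured {measured_pkd:.2f}")
```

The model sees a small change in the ligand and predicts a small change in affinity; the water displacement that drives the real effect is simply not represented in its inputs.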
Advantages of Deep Learning ML
There are numerous innovations within the field of QSAR, such as 3D-QSAR and multidimensional QSAR (e.g., HQSAR, G-QSAR, MIA-QSAR, multi-target QSAR), but, as Dr. Repasky indicated, because there are insufficient studies and literature available to accurately gauge the effectiveness and impact of these techniques on drug discovery and design, expectations for them are “modest at best.” For example, Dr. Repasky outlined how 3D-QSAR techniques depend greatly on ligand alignments, which themselves rely on at least one protein-ligand complex being available for the chemical series, limiting the ease and accessibility of the technique; 4D–6D methods require a large amount of data to create models that are accurate enough for real-world usage in pharmaceutical projects. “For example, group-based QSAR requires the definition of fragmentation rules for the structures under study which are tedious to create and difficult to automate,” he said.
With the growing prevalence of AI capabilities in the informatics space, deep learning ML methods can both expand their range of applications and improve prevailing ones, including QSAR/QSPR, as Dr. Repasky indicated. “Deep learning methods are showing consistent advantages over traditional ML methods for smaller datasets typical in drug discovery projects, as has been observed at Merck [2],” he said. “The use of deep learning ML enables training against very large, heterogeneous datasets with data deconvolution of a scale not possible with non-deep learning based ML methods. Given the paucity of such large datasets in drug discovery, approaches such as active learning are often employed, whereby physics-based calculations are used to train an ML model in an iterative fashion to achieve a desired accuracy and applicability domain.”
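The iterative loop Dr. Repasky describes can be sketched in a few lines. This is a toy illustration under stated assumptions, not Schrödinger's implementation: `oracle` is a hypothetical stand-in for an expensive physics-based calculation, the surrogate is a simple nearest-neighbor average rather than a deep learning model, and candidates are reduced to a single made-up descriptor value.

```python
import random

def oracle(x):
    """Stand-in for an expensive physics-based calculation (hypothetical score)."""
    return -(x - 1.3) ** 2

def knn_predict(x, labeled, k=3):
    """Cheap surrogate model: average the scores of the k nearest labeled points."""
    nearest = sorted(labeled, key=lambda item: abs(item[0] - x))[:k]
    return sum(score for _, score in nearest) / len(nearest)

def active_learning(pool, n_init=5, batch_size=5, n_rounds=4, seed=0):
    """Alternate between fitting the surrogate and labeling its top-ranked picks."""
    rng = random.Random(seed)
    unlabeled = list(pool)
    rng.shuffle(unlabeled)
    # Label a small random starting set with the expensive oracle.
    labeled = [(x, oracle(x)) for x in unlabeled[:n_init]]
    unlabeled = unlabeled[n_init:]
    for _ in range(n_rounds):
        # Rank the remaining candidates by predicted score (greedy acquisition).
        unlabeled.sort(key=lambda x: knn_predict(x, labeled), reverse=True)
        batch, unlabeled = unlabeled[:batch_size], unlabeled[batch_size:]
        # Spend oracle calls only on the selected batch, then grow the training set.
        labeled.extend((x, oracle(x)) for x in batch)
    return labeled

pool = [i / 100 for i in range(200)]  # hypothetical one-descriptor candidates
labeled = active_learning(pool)
best_x, best_score = max(labeled, key=lambda item: item[1])
print(f"oracle calls: {len(labeled)} of {len(pool)}; best x = {best_x:.2f}")
```

The design point is that the expensive calculation is run on only a fraction of the pool, with the cheap model deciding where each new batch of calculations is spent.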
As Dr. Repasky stated, the combination of physics-based modeling and deep learning ML has demonstrably contributed to significant time and cost savings in lead optimization. “Deep learning ML methods have been used to generate synthetically tractable novel molecules that exhibit desired characteristics facilitating de novo design applications,” he said. “[They] have also demonstrated utility in extracting chemical structures from papers, patents and images, sharply reducing the time necessary to curate that data manually.”
Schrödinger is also focused on making QSAR easier to use and more automated in order to broaden its adoption. “From our perspective, democratization of QSAR through greater automation enabling the creation and deployment of QSAR models, with less expert human time and the use of deep learning methods to improve accuracy and expand the range of applications for ML modeling, are the most significant factors expanding QSAR application in drug discovery at this time.”
1. Konze, K., Bos, P., Dahlgren, M., Leswing, K., Tubert-Brohman, I., Bortolato, A., Robbason, B., Abel, R., Bhat, S.; ChemRxiv, 2019, doi.org/10.26434/chemrxiv.7841270.v2
2. Ma, J., Sheridan, R., Liaw, A., Dahl, G., Svetnik, V.; J. Chem. Inf. Model., 2015, 55 (2), pp 263–274.
   Feinberg, E., Sheridan, R., Joshi, E., Pande, V., Cheng, A.; arXiv.org/arXiv:1903.11789