Noé Sturm, Andreas Mayr, Thanh Le Van, Vladimir Chupakhin, Hugo Ceulemans, Joerg Wegner, Jose-Felipe Golib-Dzib, Nina Jeliazkova, Yves Vandriessche, Stanislav Böhm, Vojtech Cima, Jan Martinovic, Nigel Greene, Tom Vander Aa, Thomas J Ashby, Sepp Hochreiter, Ola Engkvist, Günter Klambauer, and Hongming Chen
In this work, we investigated how well machine learning models trained on publicly available data transfer to private pharmaceutical industry data. Public and private datasets may differ substantially because of differences in measurement techniques, sample quantity, and assay specialization. We focused on machine learning models for drug target prediction: we trained them on public data, applied them to industrial data, and evaluated their performance. For training, we used ExCAPE-DB, a benchmark dataset for drug discovery. This dataset is extracted from the public databases ChEMBL and PubChem and contains bioactivity data for small molecules, including protein-ligand activity. We considered several established machine learning methods: feed-forward fully connected deep neural networks (DNNs), gradient boosting as an ensemble-based approach, and a Bayesian matrix factorization approach. The performance of these methods was evaluated on industrial datasets from the AstraZeneca and Janssen databases. The main performance measure was the area under the receiver operating characteristic (ROC) curve, which reflects the model’s ability to rank active compounds higher than inactive compounds. Our results indicate that machine learning models trained on public data show only a small decrease in performance when applied to industry data.
Journal of Cheminformatics, 12, 1-13, 2020-04-19.
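The abstract describes the area under the ROC curve as a measure of how well a model ranks active compounds above inactive ones. A minimal sketch of that ranking interpretation follows. It is an illustration only, not the evaluation code used in the paper; the function name `roc_auc` and the toy labels and scores are assumptions made for this example.

```python
def roc_auc(labels, scores):
    """Return the ROC AUC via its pairwise-ranking interpretation.

    The value is the fraction of (active, inactive) compound pairs in which
    the active compound receives the higher score; ties count as half.
    `labels` uses 1 for active and 0 for inactive (an assumed encoding).
    """
    active = [s for y, s in zip(labels, scores) if y == 1]
    inactive = [s for y, s in zip(labels, scores) if y == 0]
    if not active or not inactive:
        raise ValueError("AUC needs at least one active and one inactive compound")

    wins = 0.0
    for a in active:
        for i in inactive:
            if a > i:
                wins += 1.0
            elif a == i:
                wins += 0.5
    return wins / (len(active) * len(inactive))


# Hypothetical toy data: two actives and two inactives for one target.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.4, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)
```

Here three of the four active/inactive pairs are ordered correctly, so `toy_auc` is 0.75. A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking. The quadratic loop is chosen for clarity; a rank-based computation would be preferable for large assays.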