The LSC Benchmark Dataset: Technical Appendix and Partial Reanalysis
Andreas Mayr, Günter Klambauer, Thomas Unterthiner, and Sepp Hochreiter
In a previously published article, ‘Large-scale comparison of machine learning methods for drug target prediction on ChEMBL’, we compared several machine learning methods for drug target prediction.
In public databases, chemical compounds are screened by diverse assays to measure bioactivities. For most compounds in a database, however, only a few bioactivities have been measured. Machine learning techniques can complement these data by virtually screening compounds for bioactivities. We developed the large-scale comparison (LSC) benchmark dataset for evaluating machine learning methods that predict bioactivities of compounds represented by features, graphs, or strings. Using a large dataset from the ChEMBL database, we compared standard deep feed-forward neural networks to Support Vector Machines, Random Forests, graph convolutional networks, and a recurrent neural network.
In this work, we present an overview of the dataset and discuss the common challenges of large-scale studies, including data preprocessing and clustering, and strategies for avoiding potential biases. We provide the preprocessed parts of the database, a description of our pipeline, and the tools developed for this study.
* AM, GK, and TU contributed equally to this work.