Hackathon: Bioinformatics: Drug Target

Submission Dates:
09/17/2021 to 10/03/2021
Citation Author(s):
Submitted by:
Sorush Omidvar
Last updated:
Sun, 10/03/2021 - 18:04
Data Format:
Creative Commons Attribution


Challenge Description

The traditional drug discovery process is expensive and time-consuming. Accelerating this process is a significant challenge. An ML/AI-based approach could help by enabling high-throughput virtual pre-screening of large numbers of drug candidates, identifying highly potent candidates for experimental testing and further validation.


The goal of this challenge is to build effective ML/AI-based surrogate models that can accurately predict the docking scores of candidate drug molecules on SARS-CoV-2 protein targets.



Please register at https://easychair.org/account/signin?l=pStWPAd56eImr92BxqJrMt#


The datasets to be used in this Challenge have been generated and curated by researchers at the Brookhaven National Laboratory (BNL) in collaboration with other DOE (Department of Energy) National Laboratories who have joined forces to combat the COVID-19 pandemic. The provided datasets contain the docking scores of molecules (i.e., drug candidates) on SARS-CoV-2 protein targets. Drug candidates are represented as SMILES strings and were selected from known drug databases such as ENAMINE [1], ZINC [2] and DrugBank [3]. All SMILES strings have been canonicalized. The COVID-19 protein targets were provided by researchers at Argonne National Laboratory (ANL). The docking scores were obtained from AutoDock 4.2 [4] and then collected and organized into a CSV file, where rows represent molecules and columns represent different docking targets.
The whole dataset includes docking scores of 300,457 molecules on 18 different COVID-19 related protein docking targets. Part of this data will be provided for training and initial validation. The rest will be held out for evaluating the performance of the models trained by the participants.

The training/validation dataset will include the SMILES strings of 270,000 molecules and their docking scores against the different targets. The test set will include only the SMILES strings of the remaining 30,457 molecules, without their docking scores. Participants will need to train their own model to predict the docking scores of these molecules on the different targets. The dataset can be obtained from the GitHub repo [5] created for the Challenge: https://github.com/BC3D/BC3D_2021
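
As a rough illustration of working with the layout described above (rows are molecules, columns are docking targets), the following Python sketch parses such a CSV and holds out part of the labeled data for initial validation. The column names (`SMILES`, `target_A`, `target_B`), the toy values, and the 20% validation fraction are assumptions made for this example only; the actual headers should be checked against the files in the Challenge repo.

```python
import csv
import io
import random

# Toy stand-in for the training CSV. The column names and values are
# made up for illustration and are NOT taken from the real dataset.
SAMPLE_CSV = """SMILES,target_A,target_B
CCO,-3.1,-2.8
c1ccccc1,-4.0,-3.5
CC(=O)O,-2.9,-3.0
CCN,-3.3,-2.7
CCCC,-2.5,-2.4
"""


def load_scores(handle, smiles_column="SMILES"):
    """Return (smiles, targets, scores), where scores[i][j] is the
    docking score of molecule i on target j."""
    reader = csv.DictReader(handle)
    # Every column other than the SMILES column is treated as a target.
    targets = [name for name in reader.fieldnames if name != smiles_column]
    smiles, scores = [], []
    for row in reader:
        smiles.append(row[smiles_column])
        scores.append([float(row[t]) for t in targets])
    return smiles, targets, scores


def split_indices(n, validation_fraction=0.1, seed=0):
    """Shuffle the indices 0..n-1 and split them into (train, validation)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_val = max(1, int(round(n * validation_fraction)))
    return indices[n_val:], indices[:n_val]


smiles, targets, scores = load_scores(io.StringIO(SAMPLE_CSV))
train_idx, val_idx = split_indices(len(smiles), validation_fraction=0.2)
```

For the real files, the `io.StringIO` object would be replaced by an opened file handle. A fixed seed keeps the split reproducible, which makes validation results comparable between model runs.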

Evaluation criteria:

Participants should submit the docking scores that their trained model predicts for the molecules in the test set. Predictions should be submitted in CSV format, the same as the training/validation dataset provided to the participants. The predicted scores will be compared with the ground truth docking scores, and model accuracy will be assessed by the mean absolute error (MAE) averaged over all targets.
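
To make the metric concrete, here is a minimal Python sketch of one way to compute a per-target MAE and then average it over targets. It is not the organizers' scoring script: the function names, the dictionary-based input layout, and the example values are assumptions for illustration, and details such as the handling of missing scores are not specified in this description.

```python
def mean_absolute_error(y_true, y_pred):
    """Mean of |true - predicted| over the molecules of a single target."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def averaged_mae(truth, predicted):
    """Average the per-target MAE over all targets.

    Both arguments map a target name to a list of docking scores,
    with the molecules in the same order in both dictionaries.
    Returns (average over targets, per-target MAE dictionary).
    """
    if truth.keys() != predicted.keys():
        raise ValueError("truth and predicted must cover the same targets")
    per_target = {
        name: mean_absolute_error(truth[name], predicted[name])
        for name in truth
    }
    return sum(per_target.values()) / len(per_target), per_target


# Hypothetical scores for two targets and two molecules.
truth = {"target_A": [-3.0, -4.0], "target_B": [-2.0, -5.0]}
predicted = {"target_A": [-2.5, -4.5], "target_B": [-2.0, -4.0]}

average, per_target = averaged_mae(truth, predicted)
print(average)  # 0.5
```

Computing the same quantity on a held-out validation split gives a local estimate of how a model might score, under the assumption that the official evaluation weights every target equally.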


References & Resources:

[1] https://enamine.net/compound-libraries/diversity-libraries

[2] https://zinc.docking.org/

[3] https://go.drugbank.com/releases/latest

[4] The original docking score data used for creation of the training/testing/validation datasets have been obtained from AutoDock. Further information regarding AutoDock can be found at the following URL: http://autodock.scripps.edu/

[5] https://github.com/BC3D/BC3D_2021