Here is something I've been meaning to do for a long time: measure the performance of a docking code on a standard benchmarking dataset. The choice was DUDE, specifically the Dud38 subset. From Dud38, I decided to look at both the "scaffolds" and the "final" sets of actives and decoys. Here, only partial results for the "scaffolds" set are shown.

The code tested is AutoDock Vina, version 1.1.2, with the default settings (including the default exhaustiveness). This is the only publicly available docking code that I know of. RDKit was used to prepare the molecules from SMILES strings.
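For reference, a minimal ligand-preparation sketch along those lines (the function name, file names, and embedding parameters are my own choices, not necessarily what was used here): parse the SMILES string with RDKit, add hydrogens, generate a 3D conformer, and relax it with a force field. Vina itself takes PDBQT input, so a separate conversion step (e.g. with Open Babel or MGLTools) would follow.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

def prepare_ligand(smiles: str, out_pdb: str) -> None:
    """Build a 3D structure from a SMILES string and write it as PDB."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles}")
    mol = Chem.AddHs(mol)                      # explicit hydrogens for 3D embedding
    AllChem.EmbedMolecule(mol, randomSeed=42)  # generate one 3D conformer
    AllChem.MMFFOptimizeMolecule(mol)          # quick force-field relaxation
    Chem.MolToPDBFile(mol, out_pdb)            # convert to PDBQT afterwards for Vina

prepare_ligand("CC(=O)Oc1ccccc1C(=O)O", "aspirin.pdb")  # aspirin as an example
```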

The performance is measured over 38 protein targets (hence the name Dud38). Experimentally known active compounds, as well as fake "decoy" molecules, are docked and scored against each target. The scores are then converted into a ranking of the molecules. The performance metric is the AUC, the area under the ROC curve, commonly used in signal processing. An AUC of 1.0 indicates a perfect classifier (all actives rank above all decoys), while an AUC of 0.5 indicates a random classifier.
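As a concrete illustration (the numbers and variable names are hypothetical, not from the actual pipeline), the AUC for one target can be computed directly from the Vina scores and the active/decoy labels. One detail to get right: Vina reports binding energies where more negative is better, so the sign is flipped before ranking.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# 1 = active, 0 = decoy; Vina scores in kcal/mol (lower = better binding)
labels = np.array([1, 1, 1, 0, 0, 0, 0, 0])
vina_scores = np.array([-9.1, -8.4, -7.0, -7.5, -6.2, -5.9, -6.8, -5.1])

# Negate so that higher values mean "more likely active"
auc = roc_auc_score(labels, -vina_scores)
print(f"AUC = {auc:.2f}")  # 1.0 = perfect ranking, 0.5 = random
```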

Here is the main result:

[Figure: per-target AUC boxplots (left) and the aggregate AUC distribution over the 38 targets (right)]
The left-hand plot shows the AUC per target as a boxplot, with 10 bootstrap samples used to estimate the error on each AUC. The right-hand plot aggregates the information over the 38 targets: the mean AUC is 0.6, with a standard deviation of around 0.1.
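The bootstrap error bars could be produced along these lines (a sketch under my own assumptions about the resampling scheme, which the figure does not spell out): resample the docked molecules with replacement, recompute the AUC on each resample, and report the spread. Here `scores` is assumed to already be oriented so that higher means "more likely active".

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc(labels, scores, n_boot=10, seed=0):
    """Estimate the AUC and its error by resampling molecules with replacement."""
    rng = np.random.default_rng(seed)
    labels, scores = np.asarray(labels), np.asarray(scores)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(labels), size=len(labels))
        # A resample may happen to contain only one class; skip it if so
        if labels[idx].min() == labels[idx].max():
            continue
        aucs.append(roc_auc_score(labels[idx], scores[idx]))
    return np.mean(aucs), np.std(aucs)
```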