Benchmarking GROMACS involves a lot of fiddly steps, and it gets a little messy. One question I have always wondered about: how long do you have to run to get a reliable performance number in ns/day? A single MD step (1e0) is certainly too short, but 1e7 steps seems excessive.
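For reference, here is the arithmetic behind an ns/day figure, plus a toy model of why very short runs can mislead. This is only a sketch: the 2 fs time step, the 0.01 s per step, and the 5 s fixed cost per run are made-up illustrative numbers, and the "fixed overhead plus constant cost per step" model is my assumption, not something measured here.

```python
SECONDS_PER_DAY = 86_400


def ns_per_day(n_steps: int, dt_fs: float, wall_seconds: float) -> float:
    """Simulated nanoseconds per wall-clock day."""
    simulated_ns = n_steps * dt_fs * 1e-6  # 1 fs = 1e-6 ns
    return simulated_ns * SECONDS_PER_DAY / wall_seconds


def apparent_ns_per_day(n_steps: int, dt_fs: float,
                        seconds_per_step: float,
                        overhead_seconds: float) -> float:
    """Toy model (assumption): wall time = fixed overhead + n_steps * cost per step."""
    wall_seconds = overhead_seconds + n_steps * seconds_per_step
    return ns_per_day(n_steps, dt_fs, wall_seconds)


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to show the trend.
    for n in (1, 100, 10_000, 1_000_000):
        rate = apparent_ns_per_day(n, dt_fs=2.0,
                                   seconds_per_step=0.01,
                                   overhead_seconds=5.0)
        print(f"{n:>9} steps -> {rate:.3f} ns/day")
```

In this toy model the apparent rate climbs toward the steady-state value as the fixed cost gets amortized over more steps, which is the kind of behavior the question above is about.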

The second question: could one get away with benchmarking just a box of water? Currently people use systems like DHFR or APOA1 (a protein in water) to assess performance, but those choices are arbitrary. A box of water has a dramatically simpler topology than a protein, but maybe it doesn't matter?

One thing this does not assess is the absolute performance of GROMACS: the ns/day numbers below are low, but that's because an old workstation is used, without GPU acceleration.

The answers are pretty simple: for my test system, 1e4–1e5 steps are required to 'converge' the performance estimate. Also, a water box of very similar dimensions (and number of particles) but without the protein runs a little faster than the protein system. The effect is not huge, 20-30%, so the water box could easily be used as a meaningful benchmark, but maybe I'm missing something.
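A convergence scan like this can be scripted roughly as follows. This is a minimal sketch, not the exact procedure used here: it assumes runs are launched with `gmx mdrun -nsteps N` and that the ns/day value is read from the `Performance:` line at the end of `md.log`. The file names, helper names, and the sample log text are hypothetical.

```python
import re
import statistics

# Matches the final summary line of an mdrun log: "Performance: <ns/day> <hour/ns>"
PERF_RE = re.compile(r"^Performance:\s+([0-9.]+)\s+([0-9.]+)", re.MULTILINE)


def mdrun_command(tpr: str, n_steps: int, deffnm: str) -> list[str]:
    """Build a benchmark command; -nsteps overrides the step count stored in the .tpr."""
    return ["gmx", "mdrun", "-s", tpr, "-nsteps", str(n_steps), "-deffnm", deffnm]


def parse_ns_per_day(log_text: str) -> float:
    """Extract ns/day from the text of an mdrun log file."""
    match = PERF_RE.search(log_text)
    if match is None:
        raise ValueError("no 'Performance:' line found")
    return float(match.group(1))


def relative_spread(values: list[float]) -> float:
    """Sample standard deviation divided by the mean, for repeated runs."""
    return statistics.stdev(values) / statistics.mean(values)


# Mock-up of a log tail with made-up numbers, for illustration only.
SAMPLE_LOG = """\
                 (ns/day)    (hour/ns)
Performance:       17.280        1.389
"""

if __name__ == "__main__":
    print(" ".join(mdrun_command("topol.tpr", 10_000, "bench_1e4")))
    print(parse_ns_per_day(SAMPLE_LOG))
```

One way to use it: repeat each step count a few times, pass each log to `parse_ns_per_day`, and call the estimate 'converged' once `relative_spread` stops shrinking as the step count grows. What threshold counts as converged is a judgment call.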

The simplicity of the water box benchmark (where the size of the box can be varied) is pretty cool. In my mind, it totally offsets the intellectually infertile discussion about "oh which protein big or small to choose to show that my favorite code is better".
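To make the "size can be varied" part concrete, here is a rough sketch of how one could pick a cubic water box with about the same atom count as a given protein system. It assumes a 3-site water model and a number density of roughly 33.4 molecules per nm^3 (liquid water near room temperature); the 24,000-atom target and the helper names are hypothetical. The resulting command uses `gmx solvate` with the `spc216.gro` water configuration that ships with GROMACS.

```python
WATER_MOLECULES_PER_NM3 = 33.4  # approximate, liquid water near room temperature


def cubic_box_edge_nm(n_atoms: int, atoms_per_water: int = 3) -> float:
    """Edge length (nm) of a cubic water box holding about n_atoms atoms."""
    n_molecules = n_atoms / atoms_per_water
    volume_nm3 = n_molecules / WATER_MOLECULES_PER_NM3
    return volume_nm3 ** (1.0 / 3.0)


def solvate_command(edge_nm: float, out: str = "water.gro") -> list[str]:
    """Build a gmx solvate command that fills a cubic box with water."""
    edge = f"{edge_nm:.3f}"
    return ["gmx", "solvate", "-cs", "spc216.gro",
            "-box", edge, edge, edge, "-o", out]


if __name__ == "__main__":
    target_atoms = 24_000  # hypothetical size of the protein-in-water system
    edge = cubic_box_edge_nm(target_atoms)
    print(" ".join(solvate_command(edge)))
```

The atom count that `gmx solvate` actually produces will differ a bit from the target, so it is worth checking the output file and nudging the edge length if a close match matters.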