Friday, May 04, 2012

Parallel benchmarking in Planner

For Drools Planner 5.5.0.Beta1, the Benchmarker will support multi-threading benchmarking, thanks to a pull request of Lukáš Petrovický 2 weekends ago.

Basically, if you a have quad core computer, your benchmarks now take half the time, with little loss of accuracy. If you have an 8 core computer, it only needs one forth of the time now.

The benchmarker also handles benchmark failures gracefully now. There's nothing fun about waiting 8 hours for a benchmarker, just to find out there are no results because one of them failed. So even if one benchmark failed, the benchmarker now reports the other non-failing benchmarks.

How to use

Parallel benchmarking needs to be explicitly turned on. By default, it's off. Here's how you turn it on:
<plannerBenchmark>
  ...
  <parallelBenchmarkCount>AUTO</parallelBenchmarkCount>
  ...
</plannerBenchmark>
AUTO will automatically pick the best value based on the number of available processors on your machine. On a quad core, it will use 2 cores. This way, there are 2 cores left for the OS, JVM garbage collection, ... so the benchmarks aren't disrupted and their results are reasonably accurate.

Alternatively, you can also specify the number of processors to use (statically or through a formula based on the availableProcessorCount).

Results

So how accurate are the benchmarks if we parallelize them to do more in less time?

Here are the results of the nurse rostering medium benchmark config, which runs 3 solver configurations on 4 datasets for 700 seconds each, compared between:
  1. Not parallel: parallelBenchmarkCount 1. Took over 2 hours.
  2. Parallel: parallelBenchmarkCount 2. Took only over 1 hour: twice as fast.
The hardware is a linux computer with 4 Intel Xeon 3Ghz CPU's.

Summary

Notice how the tabuSearch results (red and blue) are nearly identical between the parallel and non-parallel benchmark.
The simulated annealing results differ greatly (because small difference in the timeGradient can greatly impact the path of simulated annealing), but they don't consistently favor either benchmarker configurations.

With parallelBenchmarkCount 1 (on quad core)
With parallelBenchmarkCount 2 (on quad core)

Dataset medium_hint01

This proves that the Tabu Search implementations act pretty much the same in the 2 benchmarker configs.
The Simulated Annealing implementations act differently, due to a single or multiple timeGradient differences sending it a totally different direction. Note that running the same simulated annealing benchmark without parallelization twice, could even suffer from that timeGradient effect.
With parallelBenchmarkCount 1 (on quad core)
With parallelBenchmarkCount 2 (on quad core)

Dataset medium02

Again we see that Tabu Search acts pretty much the same way. And again Simulated Annealing acts differently, but comes to a similar result with and without parallelization.
With parallelBenchmarkCount 1 (on quad core)

With parallelBenchmarkCount 2 (on quad core)

Conclusion

Should we turn on parallel benchmarking? Yes. Even with simulated annealing we get reliable, representative results in half the time (on a quad core machine).

These changes lay the ground work for multi-machine benchmarking in the future (for example in the cloud), because the new SingleBenchmark class is a single benchmark task that can be offloaded to other machines.