Job replication

The output of a job instance may be incorrect because:

Some hosts have consistent or sporadic hardware problems, typically causing errors in floating-point computation.
Some volunteers may maliciously return wrong results; they may even reverse-engineer your application, deciphering and defeating any internal validation mechanism it might contain.

BOINC offers several mechanisms for validating results. However, there is no "one size fits all" solution. The choice depends on your requirements, and on the nature of your applications (you can use different mechanisms for different applications).

No replication

The first option is to not use replication. Each job gets done once. The validator examines single results, possibly parsing their output file and looking for some sign of correctness (e.g. conservation of energy in a simulation).

This approach is useful if you have some way (application-specific) of detecting wrong results with high probability.
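As an illustration of such an application-specific check, here is a minimal sketch of a conservation-of-energy test. It is not part of BOINC's validator API: the function name, the input format (a list of total-energy samples already parsed from the output file), and the tolerance are all assumptions made for this example.

```python
import math

def energy_is_conserved(energies, rel_tol=1e-6):
    """Return True if every energy sample stays close to the first sample.

    `energies` is a hypothetical list of floats parsed from a result's
    output file; `rel_tol` is an assumed tolerance chosen by the project.
    """
    if not energies:
        return False  # an empty output is treated as invalid
    e0 = energies[0]
    # math.isclose returns False for NaN, so corrupted samples are rejected.
    return all(math.isclose(e, e0, rel_tol=rel_tol) for e in energies)
```

A check like this only helps if wrong results are unlikely to pass it by chance, so the tolerance has to be tuned per application.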

Replication

BOINC supports replication: each job gets done on at least 2 different hosts, and a result is considered valid if 2 hosts return results that agree.
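The agreement rule can be sketched as follows. This is an illustrative model only, not BOINC's validator code: `find_canonical`, the `quorum` parameter, and the `agree` callback are hypothetical names, and the results are assumed to come from different hosts.

```python
def find_canonical(results, quorum=2, agree=lambda a, b: a == b):
    """Return the first result that at least `quorum` results agree on.

    `results` holds one result per host. `agree` is the comparison
    function (exact equality by default). Returns None if no quorum
    has been reached yet.
    """
    for i, candidate in enumerate(results):
        # Count the candidate itself plus every other result that matches it.
        matches = 1 + sum(
            1
            for j, other in enumerate(results)
            if j != i and agree(candidate, other)
        )
        if matches >= quorum:
            return candidate
    return None
```

For example, `find_canonical(["a", "b", "a"])` returns `"a"`, while `find_canonical(["a", "b"])` returns `None`, meaning another instance would be needed.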

One problem with replication is that there are discrepancies in the way different computers do floating-point math. This makes it hard to determine when two results "agree"; two different results may be equally correct.

There are several different ways of dealing with this problem.

Eliminate discrepancies

It may be possible to eliminate numerical discrepancies entirely. To do so you'll need to select an appropriate compiler, compiler options, and math libraries, and to make sure that your checkpoint files are full precision. This lets you do bitwise comparison of results. However, it is difficult and generally reduces the performance of your application.

Some notes on how to do this for Fortran programs are given in the paper "Massive Tracking on Heterogeneous Platforms" and in an earlier text document, both by Eric McIntosh of CERN.
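Once discrepancies have been eliminated, bitwise comparison itself is simple. The sketch below compares the raw bytes of two outputs through their SHA-256 digests; the function names are hypothetical and are not taken from BOINC.

```python
import hashlib

def output_digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of a result's raw output bytes."""
    return hashlib.sha256(data).hexdigest()

def bitwise_equal(a: bytes, b: bytes) -> bool:
    """Two results agree only if their outputs are byte-for-byte identical."""
    return output_digest(a) == output_digest(b)
```

Comparing digests rather than whole files is a design choice for this example: a digest can be stored and compared cheaply when a later result arrives.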

Fuzzy comparison

If your application is numerically stable (i.e., small discrepancies lead to small differences in the result) you can write a "fuzzy comparison function" for the validator that considers two results as equivalent if they agree within some tolerance.

However, applications involving physical simulation are typically not stable.
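For a numerically stable application, a fuzzy comparison function might look like the sketch below. It assumes each result has already been parsed into a list of floats; the function name and the tolerance values are assumptions for illustration, not BOINC defaults.

```python
import math

def results_equivalent(a, b, rel_tol=1e-5, abs_tol=1e-12):
    """Return True if two results agree element-wise within a tolerance.

    `a` and `b` are hypothetical lists of floats parsed from two output
    files. `abs_tol` handles values that are close to zero, where a
    purely relative tolerance would be too strict.
    """
    if len(a) != len(b):
        return False  # results of different shape can never agree
    return all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
        for x, y in zip(a, b)
    )
```

With these tolerances, `results_equivalent([1.0, 2.0], [1.000001, 2.000001])` is true, while `results_equivalent([1.0], [1.1])` is false.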

Homogeneous replication

With this variant of replication, once an instance of a job has been sent to a host, additional instances are sent only to hosts that are "numerically equivalent" (i.e. that will return bit-identical results).
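The dispatch rule can be modeled roughly as below. This is a simplified sketch, not BOINC's scheduler: it assumes that hosts with the same operating system and CPU vendor are numerically equivalent, which is a placeholder definition, and all class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Host:
    os_name: str
    cpu_vendor: str

def host_class(host: Host) -> Tuple[str, str]:
    """Assumed equivalence class: same OS and same CPU vendor."""
    return (host.os_name, host.cpu_vendor)

@dataclass
class Job:
    # Set when the first instance is sent; None means "not yet committed".
    committed_class: Optional[Tuple[str, str]] = None

def try_assign(job: Job, host: Host) -> bool:
    """Return True if an instance of `job` may be sent to `host`."""
    hc = host_class(host)
    if job.committed_class is None:
        job.committed_class = hc  # the first host fixes the class
        return True
    return job.committed_class == hc
```

After the first call commits the job to a class, hosts from any other class are refused, so every returned result can be compared bitwise.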

Details

Homogeneous app version

This mechanism ensures that instances of a given job are run using the same app version; e.g. it won't use CPU and GPU versions for a given job.

Details

Adaptive replication

This is a refinement of the replication policy that reduces the computational overhead of replication. It randomly decides whether to replicate jobs, based on the measured error rate of hosts. If the first instance of a job is sent to a host with a low error rate, then with high probability no further instances will be sent.

Adaptive replication is independent of the comparison policy; you can use it with bitwise comparison, fuzzy comparison, or homogeneous replication.
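The random decision described above can be sketched as follows. The exact policy and thresholds here are assumptions made for illustration and do not reproduce BOINC's implementation: hosts at or above an assumed error-rate threshold are always replicated, and trusted hosts are still spot-checked with a small probability.

```python
import random

def needs_replication(host_error_rate, max_error_rate=0.05,
                      min_check_prob=0.01, rng=random.random):
    """Decide whether a job first sent to this host should be replicated.

    `host_error_rate` is the host's measured fraction of invalid results.
    `max_error_rate` and `min_check_prob` are hypothetical tuning values.
    `rng` returns a float in [0, 1) and can be replaced for testing.
    """
    if host_error_rate >= max_error_rate:
        return True  # unreliable host: always replicate
    # The lower the error rate, the less often the job is replicated,
    # but never less often than the minimum spot-check probability.
    p = max(min_check_prob, host_error_rate / max_error_rate)
    return rng() < p
```

Keeping a nonzero spot-check probability means that a host which starts returning wrong results is eventually noticed even after it has earned a low error rate.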

Details
