Homogeneous Redundancy
BOINC provides a feature called homogeneous redundancy (HR) to do replication-based validation of numerically unstable applications. HR divides hosts into 'numerical equivalence classes': two hosts are in the same class if they return identical results for your applications. The BOINC scheduler will send results for a given workunit only to hosts in the same class; this lets you use strict equality to compare replicated results.
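To illustrate what "strict equality" means here, the following is a minimal sketch of a byte-for-byte comparison of replicated outputs. It is not BOINC's validator API; the function names and the `quorum` parameter are hypothetical and chosen only for this example.

```python
import hashlib


def result_digest(data):
    """Return a SHA-256 digest of a result file's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def results_agree(outputs, quorum):
    """Return True if at least `quorum` outputs are byte-for-byte identical.

    `outputs` is a list of bytes objects, one per returned replica.
    No tolerance is applied: any differing byte counts as a different result.
    """
    counts = {}
    for data in outputs:
        digest = result_digest(data)
        counts[digest] = counts.get(digest, 0) + 1
    return any(n >= quorum for n in counts.values())
```

With HR enabled, replicas come from hosts in the same class, so a comparison of this kind is expected to succeed for correct results without any fuzzy matching.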
You can enable HR for a project by including the line
<homogeneous_redundancy>N</homogeneous_redundancy>
in the project configuration file, where N is the "HR type" to use (see below).
Alternatively, you can enable HR for a single application
by setting the homogeneous_redundancy field
in its database record to the HR type for use with that application.
An "HR type" is a host classification. Currently the following HR types are defined:
- 0: No homogeneous redundancy (all hosts are numerically equivalent).
- 1: A fine-grained classification with ~80 classes.
- 2: A coarse-grained classification with ~15 classes.
Types 1 and 2 divide hosts by OS (Linux, Windows, Mac, FreeBSD, Android). Type 2 subdivides by CPU architecture (Intel, PPC, ARM). Type 1 subdivides by a finer CPU classification that distinguishes Celeron, Pentium, AMD Athlon, AMD Opteron, etc.
NOTE: this is out of date; it doesn't reflect current CPU models.
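As a rough illustration of a coarse classification by OS and CPU architecture, here is a minimal sketch. It is not the code in sched/hr.cpp: the class numbering, the string labels, and the choice to map unrecognized hosts to class 0 are all assumptions made for this example.

```python
# Hypothetical labels taken from the OS and CPU lists mentioned above.
OS_FAMILIES = ("linux", "windows", "mac", "freebsd", "android")
CPU_ARCHS = ("intel", "ppc", "arm")


def coarse_hr_class(os_name, cpu_arch):
    """Map an (OS, CPU architecture) pair to a small integer class.

    Returns a value in 1..15 for recognized hosts, or 0 if either the
    OS or the architecture is not recognized (an assumption of this sketch).
    """
    try:
        os_idx = OS_FAMILIES.index(os_name.lower())
        cpu_idx = CPU_ARCHS.index(cpu_arch.lower())
    except ValueError:
        return 0
    return 1 + os_idx * len(CPU_ARCHS) + cpu_idx
```

Five OS families times three architectures gives 15 distinct classes in this sketch; a finer classification would replace the architecture list with a list of CPU models.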
The proper classification depends on your application, and how it's compiled (compiler, compiler options, math libraries) on the various platforms. For example, WCG reports that the following gcc options (on Linux) cause their apps to produce identical results on all processor types:
-mieee-fp -O3 -fno-rtti -ffor-scope -DNDEBUG
This allows them to use HR type 2.
There are two ways to find what HR type is needed for a given application. The bottom-up approach is to use a fine classification, and (by manually examining result files) identify classes that can be merged. The top-down approach is to use a coarse classification (e.g., 0) and (by analyzing the hosts involved in validation failures) identify host types that must go in separate classes.
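The bottom-up approach can be partly automated. The sketch below assumes you have already collected, for a set of test workunits, a digest of the output produced by a host from each fine-grained class; the input layout and the function name are hypothetical. It merges two groups of classes only if they agreed on at least one workunit and no pair of their members ever disagreed.

```python
from itertools import combinations


def mergeable_groups(observations):
    """Group fine-grained classes whose results were always identical.

    `observations` maps workunit name -> {class name -> result digest}.
    Returns a list of sets of class names.
    """
    agree, conflict = set(), set()
    for digests in observations.values():
        # Pairs are built from sorted names, so each pair is (smaller, larger).
        for a, b in combinations(sorted(digests), 2):
            (agree if digests[a] == digests[b] else conflict).add((a, b))

    classes = sorted({c for d in observations.values() for c in d})
    groups = [{c} for c in classes]
    for a, b in sorted(agree - conflict):
        ga = next(g for g in groups if a in g)
        gb = next(g for g in groups if b in g)
        if ga is gb:
            continue
        # Be conservative: skip the merge if any cross pair ever disagreed.
        if any((min(x, y), max(x, y)) in conflict for x in ga for y in gb):
            continue
        ga |= gb
        groups = [g for g in groups if g is not gb]
    return groups
```

Classes that end up in the same group are candidates for merging; agreement on a handful of workunits is evidence, not proof, so the grouping should still be checked against a larger sample.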
You can modify the pre-defined HR types, or add your own, by editing the file sched/hr.cpp.
When HR is used, once an instance of a job has been sent to a host, the job is "committed" to the HR class of that host. This can potentially lead to a situation where the scheduler's job cache contains only jobs committed to a particular HR class, and hosts of other HR classes won't get jobs. You can use the show_shmem command to check whether this is happening.
For most projects this doesn't occur. If it does, BOINC provides a mechanism that allocates slots in the job cache to different HR classes, in proportion to the aggregate processing rate of hosts in each class. To enable this, put
<hr_allocate_slots/>
in your config.xml file.
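To make "in proportion to the aggregate processing rate" concrete, here is a minimal sketch of a proportional split of cache slots. It is not the feeder's actual algorithm: the function name and inputs are hypothetical, and the largest-remainder rounding is a choice made for this example.

```python
def allocate_slots(rates, total_slots):
    """Split `total_slots` among HR classes in proportion to their rates.

    `rates` maps HR class -> aggregate processing rate of its hosts.
    Returns a dict mapping HR class -> number of slots.
    """
    total_rate = sum(rates.values())
    if total_rate <= 0 or total_slots <= 0:
        return {c: 0 for c in rates}

    # Exact (fractional) share for each class, then round down.
    exact = {c: total_slots * r / total_rate for c, r in rates.items()}
    slots = {c: int(x) for c, x in exact.items()}

    # Hand the slots lost to rounding to the largest fractional parts.
    leftover = total_slots - sum(slots.values())
    by_remainder = sorted(rates, key=lambda c: exact[c] - slots[c], reverse=True)
    for c in by_remainder[:leftover]:
        slots[c] += 1
    return slots
```

For example, with rates of 600, 300 and 100 for three classes and 100 slots, this sketch assigns 60, 30 and 10 slots respectively.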
If you use this mechanism, you must periodically run a program called census that computes the shares for each HR class. To do so, add the following config.xml entry:
<task>
<cmd>census</cmd>
<period>1 day</period>
</task>
The BOINC distribution includes a file sched/sample_hr_info.txt containing host-distribution data from a large project. You can use this, for example, while your project is starting up and doesn't yet have many hosts. Copy it to your project's root directory as hr_info.txt.
If you send the feeder a SIGUSR1 signal, it will write a summary of shared-memory contents, and allocations among HR classes, to its log file. This may be useful in debugging problems related to HR.
Normally a job's HR class is determined on the fly, by the host that is issued the first instance of the job.
You can also specify the HR class of jobs when they're created. If you do this, put
<hr_class_static/>
in your config.xml file. This suppresses a mechanism that clears the HR class of a job if an instance fails and there are no other instances in progress or finished.
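The rule described above can be written as a small predicate. This is only a sketch of the stated condition, not BOINC's implementation; the state names and the function name are hypothetical.

```python
def should_clear_hr_class(instance_states, hr_class_static):
    """Decide whether a job's HR class would be cleared.

    `instance_states` is a list of strings describing the job's instances,
    using the hypothetical labels 'failed', 'in_progress', 'finished', 'unsent'.
    `hr_class_static` mirrors the presence of <hr_class_static/> in config.xml.
    """
    if hr_class_static:
        # Statically assigned classes are never cleared.
        return False
    has_failure = "failed" in instance_states
    has_live = any(s in ("in_progress", "finished") for s in instance_states)
    return has_failure and not has_live
```

Clearing the class in this situation lets the job be committed afresh to whichever class next receives an instance, which is why it has to be suppressed when classes are assigned at job creation.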
Don't change an application's HR type while there are jobs in progress; the meaning of HR classes will change, and jobs will be "stranded" in non-existent HR classes.
Instead, you can either:
- Wait until there are no jobs in progress, either by waiting for them to finish or by canceling workunits using the administrative web interface.
- Create a new application.