We also contribute to SETI@Home with several other machines--- a few Sun sparcs, a macintosh, a few PCs, and some HP workstations. Not one of these is nearly as fast as a single processor on the Oynx2, so the contributions do not even add up to 5% of our total contribution.
| People Involved | |
|---|---|
| Lars Nyland | http://www.cs.unc.edu/~nyland/ |
| Jan Prins | http://www.cs.unc.edu/~prins/ |
| Others | Who are you? |
On agressive processor implementations, the restart rate is 1. Most processors do not achieve this for all instructions, especially the complicated floating-point instructions like divide and square root. The latency is usually some constant, like 3, 5 or 12, representing common stages of instructions (decode, read, compute, store, branch).
Cache is usually N-way associative, which means that there are N different places that can be used to store each real memory location. In fully-associative cache, any memory address can be located anywhere (this is the ideal case). Most caches are 2 or 3-way associative. The speed gain is substantial when desired memory values are cached. If a value is in L1 cache, it typically can be used by the CPU just 1 clock tick after it has been requested. With prefetching, the value can be in the CPU when it is needed with no delay. If it happens to be in L2, the delay is something like 6--10 ticks. If it is in neither, the delay to get it from main memory is anywhere from 25 ticks to several hundred, depending on the memory architecture.
On the Pentium III, an FADD (FP-add) can be started every clock cycle, and has a latency of 3 cycles. Just as you would hope. The FMULL instruction has a latency of 5 cycles, but can only be issued every other clock cycle. And finally, the FDIV instruction, in single-precision, has a latency and reissue rate of 18 cycles (32 for double, and 38 for extended). This instruction is not pipelined (thus the large reissue rate), which means that no other floating-point operation can be in the pipeline. The Pentium III has only one FPU.
On the MIPS R12000, there are 2 FPUs, 2 ALUs, and a load/store unit, allowing up to 5 instructions to be issued per clock cycle. It can speculatively execute up to 4 branches deep, and has instruction reordering for better CPU utilization. It also has 64 integer and 64 FP registers that are dynamically assigned from the non-busy pool. There is a branch prediction unit with a branch prediction table (assoc. mem) that holds 2048 branch-target addresses.
There is an FP-add unit, an FP-multiply unit and an FP-divide/square root unit (I see 3; how do they count 2?). The add and multiply units can execute instructions with a 2-cycle latency at a 1-cycle reissue rate. The FDIV instruction is not pipelined, and has a latency of 12 cycles with a repeat rate of 14 (why the extra 2?).
Finally, the cache size differs drastically between the Pentium and the MIPS R12K. The Pentium II dropped its L2 cache to 512K from the Pentium Pro's larger value of 1M, although the newer PIIIs can have 512K, 1M, or 2M of L2 cache. The MIPS R12K processors we use have an L2 cache size of 8M.The difference this makes is that all of the data can reside in cache on the MIPS CPU, but on the Pentium, it must be occasionally fetched from main memory.
If the instruction stream is optimally pipelined (all operands available when the instruction is ready to be executed), and the reissue rate is 1, then the latency is completely hidden--- this is rarely the case. The extra processing hardware, larger cache memory, shorter latency and faster reissue capabilities of the mips processor allow it to execute about 3-4 times as many fp adds and multiplies per cycle compared to a Pentium III. But given the slower clock rate, the speed improvement is closer to a factor of 2 in actual execution speed. Thus, our PentiumIIs (dual Xeons at 400 mhz) complete one WU in about 9 hours, while a single MIPS R12000 (at 300 mhz) does the same work in 4 hours. Our PentiumIIIs running at 600 mhz improve only marginally over the 400 mhz Xeon PIIs, finishing one WU in 8.5 hours.
All the SETI@Home contributors started contributing at different times, but as time passes, we notice a stabilization in the top-100 statistics. This is because the slope of our line representing our contributions over time has minimized the effect of the date when we started.
Each contributor can roughly compute their accomplishment on a particular date with an equation such as:
In our case, m might be about 800 WU/week, while b might be as large as 10000.
We all started at different times and the differing b values were the primary indicator of our relative ranks. Now, however, the slope m dominates, and all participants slowly diverge while maintaining their relative status. Every so often, a new group or user will start up and have a large value of b due to their late entry, but may also have a large value of m. As they contribute at their rate of m, they will eventually pass all other contributors with lower values of m (given enough time). They will always be parallel to other contributors with equal values of m, and will never catch contributors with larger values of m. Thus the rankings of steadily contributing members will eventually stabilize. The picture below is meant to show a graphical version of the words above. You can see that once the variations in the starting times work themselves out, all the contributors will maintain their same ranking (and only grow further and further apart).
I wrote a Perl script that pokes around from time to time, looking for an idle processor. During the day (9-5), it thinks there is only 1 cpu available, so if the load average is above 0.00, there are no SETI@Home jobs. In the evening, the number of cpus available goes up to 16 or 24, and from midnight to 8am, it tries to use all 32. But it doesn't take all of them, just the ones it thinks are unused. Here's how it figures out how many Seti@Home processes to run.
If N is the target number of cpus, and LA is the current load average (1 minute), then the number of SETI@Home jobs, S, to run is:
The script expects a certain file structure, which looks a little different than you might expect. In our Seti@Home directory, we have sub-directories from xy00 to xyPP, where PP = #processors-1 on the machine, and 'xy' is the first 2 letters of the machine name running the Seti@Home executable. The name of the Seti@Home executable in the directory xyPP is setiPP, so we can match the output of 'ps' with a particular 'state.txt' file. The script nice_seti sits above all of these in our Seti@Home directory. On our machine, named evans, our seti directory looks like
% ls -l total 456 8 drwxrwxr-x 2 nyland 4096 Nov 26 10:44 ev00/ 8 drwxrwxr-x 2 nyland 4096 Nov 29 21:13 ev01/ 8 drwxrwxr-x 2 nyland 4096 Nov 29 19:20 ev02/ 8 drwxrwxr-x 2 nyland 4096 Nov 29 19:01 ev03/ 8 drwxrwxr-x 2 nyland 4096 Nov 29 19:33 ev04/ 8 drwxrwxr-x 2 nyland 4096 Nov 29 19:04 ev05/ 8 drwxrwxr-x 2 nyland 4096 Nov 29 17:04 ev06/ 8 drwxrwxr-x 2 nyland 4096 Nov 29 19:29 ev07/ ... 16 -rwxrwxr-x 1 nyland 4370 Nov 29 10:10 nice_seti* 128 -rw-r--r-- 1 nyland 13392 Nov 29 22:03 seti.outThe script is based on the idea that for a given hour during the day (0..23), there is a ideal number of processes to run, and the script knows the difference between weekdays and weekend days. The ideal value is reduced by the current load average (rounded up, or ceiling). The basic trip through the loop consists of the following steps:
There is some disagreement about the impact of jobs run with the 'npri -w' command. The manual and some anecdotal information indicate that these jobs do not appear to have any impact on jobs with a finite 'nice' value. Performance problems on our SGI have been blamed the existence of these jobs, but a thorough investigation has never been done. Thus, to keep interactive users happy, the ideal number of processors is typically 1 during the day to keep all blame away from our Seti@Home processes.
The script also depends on the Irix/Unix commands 'hinv' to get the number of processors, 'date' to get the day-name and hour, 'uptime' to get the 1, 5, and 15-minute load averages, and 'uname' to get the name of the system.
There is also a fudge-factor so that jobs are not killed for incidental work created by deamons such as nfsd, telnetd, or ftpd. We assume that a steady load average of 0.15 (or k+0.15) or less probably indicates "incidental" work that will go away the next time we ask. Thus we don't really kill any jobs until we exceed the target value by more than 1.15, not 1. This keeps small variations in the load average from constantly killing and restarting our Seti@Home jobs. Real work, such as complations, interactive graphics, or large rendering jobs immediately bump the load average by more than 0.15, so these jobs cause almost instant killing of Seti@Home jobs.
One more note: The Perl script nice_seti seems to need the latest, greatest version of Perl. We use the one in /usr/freeware/bin. I think it is not the version of Perl, but the version of some of the support files, such as those that handle process id information.
- Revised: Mon Nov 29 21:30:12 1999 by nyland@cs.unc.edu